# Functions

So far, in our adventures into data types, data structures, and file handling we've used Python's built-in functions as well as some added functionality from importing other packages.  For example, you may remember using these functions:

- **len()**
  - Return the length (number of items) in the list.  Returns an integer.
- **max()**
  - Return the largest item in the list.  For numeric items, this is the largest number.  For text items, this is the last item when the items are sorted alphabetically.
- **min()**
  - Return the smallest item in the list.  For numeric items, this is the smallest number.  For text items, this is the first item when the items are sorted alphabetically.
- **sorted()**
  - Return a sorted version of the list.  This **does not** happen in place and the list itself is unaffected.  Numeric values are sorted lowest to highest and text values are sorted alphabetically.
- **sum()**
  - Return the sum of all items in a list.  Only valid if all items are numeric.
  
- **range(start, stop)**
  - Returns a range object containing the integers from start up to, but not including, stop (a half-open interval!!).  Wrap it in list() to get an actual list.
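
As a quick refresher, here is a minimal sketch that calls each of the built-ins listed above. The list `values` is made up purely for illustration.

```python
# A made-up list of numbers to try the built-ins on
values = [4, 1, 7, 3]

print(len(values))           # number of items
print(max(values))           # largest item
print(min(values))           # smallest item
print(sorted(values))        # a new, sorted list
print(values)                # the original list is unchanged
print(sum(values))           # sum of all items

# range(2, 6) covers 2, 3, 4, 5 (the stop value is excluded)
print(list(range(2, 6)))
```

Note that range() has to be wrapped in list() before its contents can be printed as a list.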
  
A more complete list of Python's built-in functions can be found here: https://docs.python.org/3/library/functions.html

It may be the case, however, that none of these functions meets your specific need.  For example, there is no Python function called **analyzeRNAseqdata()**.  At least, not until you write it.  That's what we will cover here: how to write (and use) your own custom-built functions.

## Defining our own functions

Defining functions is done with the following syntax:

    def functionname(arguments):
        maybe do something
        maybe do something else
        maybe return something
        
        
There's a reason each line here starts with "maybe."  It's because lines that perform actions aren't necessarily required in a function.  Similarly, returning the output of a function (done with the last line) isn't even required.  You don't even need arguments (these are the items that the function will use and operate on)!

Let's start with perhaps one of the simplest possible functions.

In [2]:
#Our first function

#It has no arguments, so the area between the parentheses is blank.
def sayhello():
    print('Hello world!')
    
#OK now we've defined what the function is, so let's call it.
sayhello()

Hello world!
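
Since returning something is optional, it can help to see what a function without a return statement hands back. The sketch below redefines sayhello() so it is self-contained and stores the result of the call in a variable (`result` is just an illustrative name).

```python
def sayhello():
    print('Hello world!')

# The function still prints when it is called...
result = sayhello()

# ...but because it has no return statement, the value it hands back is None
print(result)
print(result is None)
```

The call prints its greeting as before, and the stored value is Python's special "nothing" value, None.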


## Slightly more complicated functions

OK nice, we've defined our function called sayhello.  Every time we call it by invoking sayhello(), it will print the string 'Hello world!'.  Pretty neat, and also pretty cheerful, but not super useful.  Let's write a function now that takes in an argument and does something with it.  Arguments, you'll recall, are items (they can be just about anything...) that the function receives and, often, performs actions upon.

In [3]:
#Slightly more complicated

def greeter(name):
    print('Well hello, {0}.'.format(name))
    
#Function is defined.  Let's call it. Remember, this function requires one argument.  It must be provided.
greeter('Srinivas')

Well hello, Srinivas.


## Returning the output of a function

Most of the time, functions will return their output instead of simply printing something in the middle.  This is done using the **return** keyword at the end of the function.

In [4]:
#A function to add two numeric quantities

def plus(a, b):
    return a + b

#Simple enough, now call it
mysum = plus(2, 3)
print(mysum)

5


In [8]:
#Often it's good to write code that will catch your errors

#We've learned that, in Python, you can "add" strings to combine them with the '+' operator.
#However, adding strings and integers doesn't make sense.
#So maybe we want to limit this function to only allowing addition in the mathematical sense.
#That means only numeric things (integers and floats) can be added here.

def plus(a, b):
    #Check the type of a and b
    #Notice the parentheses in the if statement
    if (type(a) == int or type(a) == float) and (type(b) == int or type(b) == float):
        return a + b
    else:
        return 'Both a and b must be numeric!'
    
mysum = plus(3, 4)
print(mysum)
mysum = plus('three', 4)
print(mysum)

7
Both a and b must be numeric!
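
The type check above can also be written with the built-in isinstance(), which accepts a tuple of allowed types. This is only a sketch of one possible variation, not a replacement for the version above; the name `plus_isinstance` is made up for this example.

```python
def plus_isinstance(a, b):
    # isinstance() is True if the value is any of the types in the tuple
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return a + b
    else:
        return 'Both a and b must be numeric!'

print(plus_isinstance(3, 4.5))
print(plus_isinstance('three', 4))
```

One difference to be aware of: isinstance() also accepts subclasses, so booleans (which Python treats as a kind of int) pass this check, while they fail the `type(a) == int` comparison.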


## Documenting functions

We've learned how to document code.  A line can be "commented out" by starting it with '#'.  Groups of lines can also be skipped over by the Python interpreter and left as plain English (or whatever it is you write) by surrounding them with triple quotes (''').  Strictly speaking, this creates a string that is never used, which is why it works like a comment.  Generally it is a **GREAT** idea to document your code.  You will find that it is hard to remember exactly what you meant each line to do.  Imagine, then, trying to figure out what *someone else's* code is doing when you are debugging it or expanding on it.  I believe it was Aristotle who said "Hell is other people's code."

**ALWAYS** insert comments in your code wherever you can.  For functions, inserting a small vignette about the purpose of the function and what it returns will save you time in the future.

In [1]:
def plus(a, b):
    '''
    This function takes in two values, a and b, and evaluates if they are both either
    integers or floats. If so, it adds them.  If not, it returns a message stating that it needs
    integers or floats.
    
    Returns:
    int or float if a and b are ints and/or floats
    string if either a or b is not an int or float
    '''
    
    #Check the type of a and b
    #Notice the parentheses in the if statement
    if (type(a) == int or type(a) == float) and (type(b) == int or type(b) == float):
        return a + b
    else:
        return 'Both a and b must be numeric!'
    
mysum = plus(3, 4)
print(mysum)
mysum = plus('three', 4)
print(mysum)

7
Both a and b must be numeric!
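
A triple-quoted string placed as the first thing inside a function is stored by Python as the function's "docstring", so it can be read back later without opening the source code. The sketch below uses a shortened version of the docstring above to keep things compact.

```python
def plus(a, b):
    '''
    Add a and b if both are ints or floats.

    Returns:
    int or float if a and b are ints and/or floats
    string if either a or b is not an int or float
    '''
    if (type(a) == int or type(a) == float) and (type(b) == int or type(b) == float):
        return a + b
    else:
        return 'Both a and b must be numeric!'

# The docstring is stored on the function itself
print(plus.__doc__)
```

Calling help(plus) in an interactive session displays the same text, along with the function's name and arguments.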


## Scope of variables

Up until now, we have generally assumed that a variable defined somewhere in our code is recognized throughout the script.  Once we start using functions, this is not true.  This has to do with the *scope* of the variable.

Generally, variables that are defined *within* a function are only valid within that function.  They cannot be accessed outside of the function and are therefore *local* variables.  Those that are defined outside of functions are called *global* variables and can be accessed by any function within the script.

In [9]:
initialx = 5

#Make a function
def double(x):
    x = x*2
    return x

#initialx is a global variable
print(initialx)

#run the function on initialx
double(initialx)

#has initialx changed? No, it was only changed *inside* the function
print(initialx)

#Storing the output of the function shows us its effect
doubledinitialx = double(initialx)
print(doubledinitialx)

#Can we access the variable x that is inside the function?  No.  We are outside of the function now.
print(x)

5
5
10


NameError: name 'x' is not defined
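
The other direction of the scope rule can be sketched the same way: a function can read a global variable, but assigning to that name inside a function creates a new local variable instead of changing the global one. The names `scale`, `rescale`, and `shadow` are made up for this example.

```python
scale = 3  # a global variable

def rescale(x):
    # scale is not defined inside this function, so Python uses the global one
    return x * scale

def shadow():
    # This assignment creates a *local* variable that happens to share the name
    scale = 100
    return scale

print(rescale(5))   # uses the global scale
print(shadow())     # uses the local scale
print(scale)        # the global scale was never changed
```

The three printed values are 15, 100, and 3.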

## The zip function

Now that we've learned how to make our own functions, let's talk about four functions that ship with Python and are often useful in genomic-style analyses.  The first is the <font color = 'red'>zip()</font> function.

This function takes two (or more) iterables and combines them, item by item, into an *iterator*.  We don't have time to talk about what an iterator is or its advantages.  For our purposes, just know that iterators can be converted to lists or dictionaries by wrapping them with list() or dict(), respectively.

How is this useful?  Well, what if you had two related lists?  Say, a list of all codons and a string of all amino acids.  Combining these into a dictionary that shows their relationship could be very useful.  


In [12]:
codons = ['UUU', 'UUC', 'UUA', 'UUG', 'UCU', 'UCC', 'UCA', 'UCG', 'UAU', 'UAC', 'UAA', 'UAG', 'UGU', 'UGC', 'UGA', 'UGG', 
          'CUU', 'CUC', 'CUA', 'CUG', 'CCU', 'CCC', 'CCA', 'CCG', 'CAU', 'CAC', 'CAA', 'CAG', 'CGU', 'CGC', 'CGA', 'CGG', 
          'AUU', 'AUC', 'AUA', 'AUG', 'ACU', 'ACC', 'ACA', 'ACG', 'AAU', 'AAC', 'AAA', 'AAG', 'AGU', 'AGC', 'AGA', 'AGG', 
          'GUU', 'GUC', 'GUA', 'GUG', 'GCU', 'GCC', 'GCA', 'GCG', 'GAU', 'GAC', 'GAA', 'GAG', 'GGU', 'GGC', 'GGA', 'GGG']

aminoacids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'

#Both lists and strings are iterables. The order of the amino acids is the same as the order of the codons.
#We can combine them into a single dictionary that shows their relationships using zip()

codontable = zip(codons, aminoacids)
#codontable is now an iterator
#Let's turn it into a dictionary
codontable = dict(codontable)

print(codontable)

{'UUU': 'F', 'UUC': 'F', 'UUA': 'L', 'UUG': 'L', 'UCU': 'S', 'UCC': 'S', 'UCA': 'S', 'UCG': 'S', 'UAU': 'Y', 'UAC': 'Y', 'UAA': '*', 'UAG': '*', 'UGU': 'C', 'UGC': 'C', 'UGA': '*', 'UGG': 'W', 'CUU': 'L', 'CUC': 'L', 'CUA': 'L', 'CUG': 'L', 'CCU': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P', 'CAU': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q', 'CGU': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R', 'AUU': 'I', 'AUC': 'I', 'AUA': 'I', 'AUG': 'M', 'ACU': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T', 'AAU': 'N', 'AAC': 'N', 'AAA': 'K', 'AAG': 'K', 'AGU': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R', 'GUU': 'V', 'GUC': 'V', 'GUA': 'V', 'GUG': 'V', 'GCU': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A', 'GAU': 'D', 'GAC': 'D', 'GAA': 'E', 'GAG': 'E', 'GGU': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G'}
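
zip() is not limited to building dictionaries. As a minimal sketch, the example below zips three made-up lists (the gene names and expression values are invented for illustration) into a list of tuples, and shows what happens when the iterables have different lengths.

```python
# Made-up example data: three lists in matching order
genes = ['GAPDH', 'ACTB', 'TP53']
before = [10.0, 8.5, 2.0]
after = [10.5, 8.0, 6.0]

# Zipping three iterables gives tuples with three items each
rows = list(zip(genes, before, after))
print(rows)

# If the iterables differ in length, zip() stops at the end of the shortest one
short = list(zip('ABC', [1, 2]))
print(short)
```

The second result contains only two pairs, because the list [1, 2] runs out before the string 'ABC' does.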


## The map function

The map function applies a function to every item of an iterable.  Yes, you could use a **for** loop for the same effect. 

In [16]:
#Get the lengths of each of these oligos in this list

oligos = ['CTGTACGATCGA', 'CTAGCTAG', 'TACGTAGCTAATTAACGACTG']

#This will call the function len() on each item of the iterable and return an iterator of the results.
oligolengths = map(len, oligos)
#Turn it into a list
oligolengths = list(oligolengths)
print(oligolengths)


#This is equivalent to the following loop
oligolengths = []
for oligo in oligos:
    oligolengths.append(len(oligo))

print(oligolengths)


[12, 8, 21]
[12, 8, 21]
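
map() works just as well with functions you define yourself. The sketch below assumes a small helper called `gccontent` (a name made up for this example) and applies it to the same list of oligos.

```python
def gccontent(seq):
    '''
    Return the fraction of bases in seq that are G or C.
    '''
    return (seq.count('G') + seq.count('C')) / len(seq)

oligos = ['CTGTACGATCGA', 'CTAGCTAG', 'TACGTAGCTAATTAACGACTG']

# Apply our own function to every oligo, then turn the iterator into a list
gcs = list(map(gccontent, oligos))
print(gcs)
```

Notice that the function is passed to map() by name, without parentheses. Writing gccontent() would call it immediately instead of handing it over.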


## Combinations

Let's say you're a virologist, so you care about the immune response.  You did an RNAseq experiment before/after treating cells with some virus.  You saw that 100 genes were upregulated upon treatment.  15 of those were immune related.  Is that more than you would expect by chance?

One way to do this would be to take a list of all genes (or, more correctly, all of those that were expressed in your sample) and get all possible combinations of 100 genes taken from that list.  Then you ask how many of those possible combinations have at least 15 immune-related genes.

OK, so how do we get all possible combinations of 100 genes?  You could maybe imagine writing some for loop to come up with this.  However, this might be a little painful.  Thankfully, Python comes with a function made to deal with such a case.  This function is called <font color = 'red'>combinations()</font> and it lives in the <font color = 'red'>itertools</font> module.

Generally speaking, combinations are a choice of *n* items from a list (or any iterable) of *k* items.  The combinations() function takes two required arguments: an iterable (k) and an integer number of items to draw from it (n). Typing

    combinations(k, n)
    
returns a "combinations object" that contains all possible groups of length n in the iterable k. It is easily turned into a list for inspection.

    list(combinations(k, n))
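
Before moving to a bigger case, a tiny sketch makes the output easier to inspect. It draws pairs from a made-up list of four bases and checks the count against math.comb(), which computes the number of combinations directly (this assumes Python 3.8 or newer, where math.comb() was added).

```python
from itertools import combinations
import math

bases = ['A', 'C', 'G', 'T']

# All possible pairs of bases; order within a pair does not matter,
# so ('A', 'C') appears but ('C', 'A') does not
pairs = list(combinations(bases, 2))
print(pairs)
print(len(pairs))

# The same count, computed without building the list
print(math.comb(4, 2))
```

Both counts are 6. For large inputs, counting with math.comb() is far cheaper than building the full list.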

In [8]:
#How many different groups of guests could sit together at one table of a dinner party (perhaps a...final...dinner)?

guests = ['Peter', 'Andrew', 'James', 'John', 'Philip', 'Thaddeus', 
          'Bartholomew', 'OtherJames', 'Matthew', 'Simon', 'Judas']

#4 guests per table

#We have to import combinations from the itertools package
from itertools import combinations as comb

tables = comb(guests, 4)
#tables is now a "combinations object" but we can turn it into a list
tables = list(tables)
print(tables)
#Each table in tables is a 'tuple'.
#We didn't cover what these are, but they are essentially lists that are *immutable* (they can't be changed).
print(len(tables))

[('Peter', 'Andrew', 'James', 'John'), ('Peter', 'Andrew', 'James', 'Philip'), ('Peter', 'Andrew', 'James', 'Thaddeus'), ('Peter', 'Andrew', 'James', 'Bartholomew'), ('Peter', 'Andrew', 'James', 'OtherJames'), ('Peter', 'Andrew', 'James', 'Matthew'), ('Peter', 'Andrew', 'James', 'Simon'), ('Peter', 'Andrew', 'James', 'Judas'), ('Peter', 'Andrew', 'John', 'Philip'), ('Peter', 'Andrew', 'John', 'Thaddeus'), ('Peter', 'Andrew', 'John', 'Bartholomew'), ('Peter', 'Andrew', 'John', 'OtherJames'), ('Peter', 'Andrew', 'John', 'Matthew'), ('Peter', 'Andrew', 'John', 'Simon'), ('Peter', 'Andrew', 'John', 'Judas'), ('Peter', 'Andrew', 'Philip', 'Thaddeus'), ('Peter', 'Andrew', 'Philip', 'Bartholomew'), ('Peter', 'Andrew', 'Philip', 'OtherJames'), ('Peter', 'Andrew', 'Philip', 'Matthew'), ('Peter', 'Andrew', 'Philip', 'Simon'), ('Peter', 'Andrew', 'Philip', 'Judas'), ('Peter', 'Andrew', 'Thaddeus', 'Bartholomew'), ('Peter', 'Andrew', 'Thaddeus', 'OtherJames'), ('Peter', 'Andrew', 'Thaddeus', 'Matt

## Permutations

Permutations of an iterable are shuffles of its order.

The itertools function <font color = 'red'>permutations()</font> will return a "permutations object" that, just like the combinations object, can be easily turned into a list of all possible permutations of an iterable.

In [12]:
from itertools import permutations as perm

#How many different permutations are there of the sequence ATCGAT?
seq = 'ATCGAT'
perms = perm(seq) #this is a permutations object
#Turn it into a list
perms = list(perms)
print(perms)
print(len(perms))

[('A', 'T', 'C', 'G', 'A', 'T'), ('A', 'T', 'C', 'G', 'T', 'A'), ('A', 'T', 'C', 'A', 'G', 'T'), ('A', 'T', 'C', 'A', 'T', 'G'), ('A', 'T', 'C', 'T', 'G', 'A'), ('A', 'T', 'C', 'T', 'A', 'G'), ('A', 'T', 'G', 'C', 'A', 'T'), ('A', 'T', 'G', 'C', 'T', 'A'), ('A', 'T', 'G', 'A', 'C', 'T'), ('A', 'T', 'G', 'A', 'T', 'C'), ('A', 'T', 'G', 'T', 'C', 'A'), ('A', 'T', 'G', 'T', 'A', 'C'), ('A', 'T', 'A', 'C', 'G', 'T'), ('A', 'T', 'A', 'C', 'T', 'G'), ('A', 'T', 'A', 'G', 'C', 'T'), ('A', 'T', 'A', 'G', 'T', 'C'), ('A', 'T', 'A', 'T', 'C', 'G'), ('A', 'T', 'A', 'T', 'G', 'C'), ('A', 'T', 'T', 'C', 'G', 'A'), ('A', 'T', 'T', 'C', 'A', 'G'), ('A', 'T', 'T', 'G', 'C', 'A'), ('A', 'T', 'T', 'G', 'A', 'C'), ('A', 'T', 'T', 'A', 'C', 'G'), ('A', 'T', 'T', 'A', 'G', 'C'), ('A', 'C', 'T', 'G', 'A', 'T'), ('A', 'C', 'T', 'G', 'T', 'A'), ('A', 'C', 'T', 'A', 'G', 'T'), ('A', 'C', 'T', 'A', 'T', 'G'), ('A', 'C', 'T', 'T', 'G', 'A'), ('A', 'C', 'T', 'T', 'A', 'G'), ('A', 'C', 'G', 'T', 'A', 'T'), ('A', '
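
One thing worth knowing is that permutations() treats items as different based on their *position*, not their value. Because 'ATCGAT' contains two As and two Ts, some of the resulting tuples spell the same sequence. The sketch below (with made-up variable names) joins each tuple back into a string and uses a set to keep only the distinct sequences.

```python
from itertools import permutations

seq = 'ATCGAT'
allperms = list(permutations(seq))

# Join each tuple of letters back into a string; a set drops the duplicates
distinct = set(''.join(p) for p in allperms)

print(len(allperms))   # every ordering of the six positions
print(len(distinct))   # orderings that are actually different sequences
```

This prints 720 and 180: swapping the two As or the two Ts gives the same sequence, so each distinct sequence shows up four times in the full list.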

# <font color = 'red'> Exercises </font>

There are three exercises below, each with an empty code block after them. Fill the code blocks with code to answer each exercise.

## Exercise 1

Write a function that takes in a DNA sequence, "transcribes" it to RNA, and then translates it into a peptide sequence.  Use this function to translate the sequence 'ATTAACAGCGAACGCACCTGCCTGGAAGTGGAACGCATGGAAAGCAGCGCGGGCGAACATGAACGCGAA'.

Here's some pseudocode to help you out.

Make a dictionary that relates codons and amino acids by zipping the two lists provided in the code block below.
Give your function that dictionary and a sequence.
Have your function "transcribe" the sequence by turning Ts into Us.
Make a list of all codons in the '0' open reading frame.
Using your dictionary, find the amino acid corresponding to each codon.

In [None]:
codons = ['UUU', 'UUC', 'UUA', 'UUG', 'UCU', 'UCC', 'UCA', 'UCG', 'UAU', 'UAC', 'UAA', 'UAG', 'UGU', 'UGC', 'UGA', 'UGG', 
          'CUU', 'CUC', 'CUA', 'CUG', 'CCU', 'CCC', 'CCA', 'CCG', 'CAU', 'CAC', 'CAA', 'CAG', 'CGU', 'CGC', 'CGA', 'CGG', 
          'AUU', 'AUC', 'AUA', 'AUG', 'ACU', 'ACC', 'ACA', 'ACG', 'AAU', 'AAC', 'AAA', 'AAG', 'AGU', 'AGC', 'AGA', 'AGG', 
          'GUU', 'GUC', 'GUA', 'GUG', 'GCU', 'GCC', 'GCA', 'GCG', 'GAU', 'GAC', 'GAA', 'GAG', 'GGU', 'GGC', 'GGA', 'GGG']

aminoacids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'

## Exercise 2

You have a bunch of currency of various denominations in your pocket.  Specifically, you have three $20 bills, five $10 bills, two $5 bills, and five $1 bills.  Using these bills, how many different ways can you make $100?

Here's some pseudocode to help out.

Make a list of all the bills you have.
Using the combinations function, get all possible combinations of these bills.
Sum the denominations in each combination and keep the combinations whose sum is 100.

## Exercise 3

In your favorite protein there is a stretch of six amino acids that contains two proline residues.  Curiously, in this particular 6mer, the two prolines are right next to each other.  You wonder how rare this is.  You have gathered all 6mer sequences in the entire proteome that contain two prolines and found that in only 1% of them are the prolines next to each other.  For a 6mer amino acid sequence that contains two prolines, if amino acids were purely randomly distributed, what is the expected frequency of 6mers that contain consecutive prolines?  Are consecutive prolines depleted from naturally occurring protein sequences?

Here's some pseudocode.

6mer sequences that contain two prolines can be represented as 'XXPPXX'. The identity of non-proline residues doesn't matter, just so long as we represent them as being non-proline.

Get all permutations of the 6mer 'XXPPXX'.  Count the number of those that contain consecutive prolines.  Divide that by the total number of permutations to get a frequency.

240 720 0.3333333333333333
