## Last time we covered
* ### Numpy indexing and vectorizing code
* ### Matplotlib: figure and axis objects, scatter plots
* ### Data input and output - with built-in python functions, and with numpy

## Today we will cover:
* ### HW1 ORF problem review
* ### Writing functions in python
* ### Getting started with pandas

## HW1 ORF problem review

In [None]:
import numpy as np

### First generate the random sequence. The np.random.choice function is useful here:

In [None]:
N = 500 # define sequence length
bases = np.array(['A','T','G','C'])
rand_seq = np.random.choice(bases,N) #this makes a numpy array with 500 random bases
print(''.join(rand_seq)) #print it nicely. 

### Let's first make a numpy array with all the codons. Here we will do this in a straightforward way with loops. 

In [None]:
allCodons = [] #empty list to store the codons

#step through the sequence storing one codon at a time as a string. 
for ii in range(len(rand_seq)-2):
    allCodons.append(''.join(rand_seq[ii:(ii+3)]))
    
allCodons = np.array(allCodons) #convert to numpy array

### Here is a vectorized way to make the same thing

In [None]:
#first make the codon list using np.char.add (elementwise string concatenation) 
# and slicing the rand_seq variable. 
allCodons = np.char.add(np.char.add(rand_seq[:-2],rand_seq[1:-1]),rand_seq[2:]) #this does element wise concatenation
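
### As a quick check, the sketch below builds the codons both ways on a short sequence and confirms they match. The fixed seed and the names demo_seq, codons_loop and codons_vec are assumptions made for this illustration only.

```python
import numpy as np

# Fixed seed so the check is reproducible (an assumption; the cells above are unseeded)
rng = np.random.default_rng(0)
bases = np.array(['A', 'T', 'G', 'C'])
demo_seq = rng.choice(bases, 30)  # short hypothetical sequence

# Loop version: one 3-base window per start position
codons_loop = np.array([''.join(demo_seq[ii:ii + 3]) for ii in range(len(demo_seq) - 2)])

# Vectorized version: concatenate three shifted slices element by element
codons_vec = np.char.add(np.char.add(demo_seq[:-2], demo_seq[1:-1]), demo_seq[2:])

print(len(codons_vec))                          # 28 overlapping codons for 30 bases
print(np.array_equal(codons_loop, codons_vec))  # True
```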

### Now let's find the stop and start codons:

In [None]:
starts = []
stops = []
for n, cod in enumerate(allCodons):
    if cod == 'ATG':
        starts.append(n)
    elif cod == 'TAA' or cod == 'TGA' or cod == 'TAG':
        stops.append(n)
starts = np.array(starts)
stops = np.array(stops)

### Or vectorized code using numpy indexing:

In [None]:
#get the stops and starts
starts = np.nonzero(allCodons == 'ATG')[0]
stops = np.nonzero(np.isin(allCodons,['TAA','TGA','TAG']))[0] #np.isin replaces the deprecated np.in1d
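
### To see what the vectorized version returns, here is a small sketch on a hand-written codon array. The array demo_codons is made up for illustration, and the membership test uses np.isin.

```python
import numpy as np

# Hypothetical codon array, small enough to check by eye
demo_codons = np.array(['ATG', 'CCC', 'TAA', 'ATG', 'TGA', 'GGG'])

# Boolean mask of start codons, converted to positions with np.nonzero
demo_starts = np.nonzero(demo_codons == 'ATG')[0]

# np.isin tests each element for membership in the list of stop codons
is_stop = np.isin(demo_codons, ['TAA', 'TGA', 'TAG'])
demo_stops = np.nonzero(is_stop)[0]

print(demo_starts)  # [0 3]
print(demo_stops)   # [2 4]
```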

### For each start codon, let's find the first stop codon that is a multiple of 3 base pairs away

In [None]:
firstStop = np.zeros(len(starts)) #an entry stays 0 if no in-frame stop is found for that start
for n,sta in enumerate(starts):
    for sto in stops:
        if sto > sta and (sto-sta)%3 == 0: #stop must be after start and a multiple of 3 away
            firstStop[n] = sto
            break    #once we find the first stop codon meeting this criterion, we can move on to the next start

### This has defined all the open reading frames. Now let's find the longest one and print the details

In [None]:
ORFLengths = firstStop - starts #negative when no in-frame stop was found

#check for an empty starts array first: np.argmax raises an error on an empty array
if len(starts) > 0 and ORFLengths.max() > 0:
    indMax = np.argmax(ORFLengths)
    longestLength = ORFLengths[indMax]
    print('The longest open reading frame is of length ' + str(int(longestLength))
          + '. Starts at ' + str(int(starts[indMax]))
          + '. Stops at ' + str(int(firstStop[indMax])))
else:
    print('No ORF found')

## Writing functions in python:

* ### You have used many functions written in python, for example range, len, np.sqrt etc. 
* ### Writing your own functions allows you to avoid writing the same block of code many times
* ### Each function gets its own workspace avoiding variable name conflicts
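
### As an illustration of the reuse point, the ORF steps from the review above could be wrapped in a single function. This is only a sketch: the name longestORF, its return format and the test sequences are assumptions, not part of the homework solution.

```python
import numpy as np

def longestORF(seq):
    """Return (start, stop, length) of the longest ORF in the string seq, or None."""
    seqArr = np.array(list(seq))
    # All overlapping codons, built by elementwise string concatenation
    codons = np.char.add(np.char.add(seqArr[:-2], seqArr[1:-1]), seqArr[2:])
    starts = np.nonzero(codons == 'ATG')[0]
    stops = np.nonzero(np.isin(codons, ['TAA', 'TGA', 'TAG']))[0]
    best = None
    for sta in starts:
        # Stops that come after this start and are a multiple of 3 away
        inFrame = stops[(stops > sta) & ((stops - sta) % 3 == 0)]
        if len(inFrame) > 0:
            length = int(inFrame[0] - sta)
            if best is None or length > best[2]:
                best = (int(sta), int(inFrame[0]), length)
    return best

# The same block of code now runs on any sequence with one call
print(longestORF('CCATGAAATTTTAGCC'))  # (2, 11, 9)
print(longestORF('CCCCCC'))            # None
```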

### Let's start with a simple example:


In [None]:
def divideTwoNumbers(num1,num2):
    s = num1/num2
    return s

In [None]:
divideTwoNumbers(3,4)

### num1 and num2 are called arguments. These are positional arguments because the function assigns their values from their position in the call, so the order matters:

In [None]:
divideTwoNumbers(4,3)

### but we could also set them explicitly using their names in any order

In [None]:
divideTwoNumbers(num2 = 4, num1 = 3)

### Note that the variables num1 and num2 only have meaning inside the function. They are set while the function is running but are not included in the workspace once the function is finished running:

In [None]:
num2

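### Variables that are defined outside the function and not passed to it cannot be reassigned inside the function by default. Here the assignment to num3 makes it local, so the call raises an UnboundLocalError:

In [None]:
def divideNumbers(num1,num2):
    num3 = num3 + 1
    s = num1/(num2*num3)
    return s

In [None]:
num3 = 5
divideNumbers(3,4) #pass values directly: num1 and num2 do not exist outside the function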

### One way around this is to define num3 as global but this is not recommended in general:

In [None]:
def divideNumbers_wGlobal(num1,num2):
    global num3 
    num3 = num3 + 1
    s = num1/(num2*num3)
    return s

In [None]:
divideNumbers_wGlobal(3,4)

### A quirk of python is that if a function only reads a variable that is neither assigned inside it nor passed as an argument, python looks it up in the global workspace. (This didn't work in the divideNumbers function because the statement num3 = ... makes num3 local to that function.)

In [None]:
def testfunc():
    print(num3)

In [None]:
testfunc()
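
### The two scoping behaviours can be put side by side in one short sketch. The names counter, readOnly and assigns are hypothetical and exist only for this illustration.

```python
counter = 5  # hypothetical variable defined outside any function

def readOnly():
    # counter is only read here, so python finds the global one
    return counter + 1

def assigns():
    # Assigning to counter makes it local to this function, so the
    # right-hand side reads a local variable that has no value yet
    counter = counter + 1
    return counter

print(readOnly())  # 6

try:
    assigns()
except UnboundLocalError as err:
    print("UnboundLocalError:", err)

print(counter)  # 5, the global was never modified
```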

### A much better solution is to simply include it as another argument:

In [None]:
def divideThreeNumbers(num1,num2,num3):
    s = num1/(num2*num3)
    return s

### You can also assign default values to arguments. Then you don't have to supply them:

In [None]:
def divideNumbersWithDefaults(num1 = 2, num2 = 3):
    s = num1/num2
    return s

In [None]:
divideNumbersWithDefaults()

In [None]:
divideNumbersWithDefaults(3)
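
### A single positional value fills num1, the first argument. To override only num2 and keep the default for num1, name it in the call. A minimal sketch, repeating the function so the block runs on its own:

```python
def divideNumbersWithDefaults(num1=2, num2=3):
    s = num1 / num2
    return s

print(divideNumbersWithDefaults(3))       # num1 = 3, num2 keeps its default 3 -> 1.0
print(divideNumbersWithDefaults(num2=4))  # num1 keeps its default 2, num2 = 4 -> 0.5
```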

### You can also mix arguments with and without defaults, but those with defaults, which are optional, must come after those without:

In [None]:
#This won't work: it gives a SyntaxError because an argument without a default follows one with a default
def divideNumbersOneDefault(num1 = 2, num2):
    s = num1/num2
    return s

In [None]:
def divideNumbersOneDefault(num1, num2 = 2):
    s = num1/num2
    return s

In [None]:
divideNumbersOneDefault(6)

In [None]:
divideNumbersOneDefault(12,17)

### You can have an arbitrary number of non-keyword arguments using the \*arg notation. Once inside the function arg will be a tuple containing all of these arguments

In [None]:
def multiplyNumbers(*arg):
    prod = 1
    for x in arg:
        prod = prod*x
    return prod

In [None]:
multiplyNumbers(2,3,4)
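
### If the numbers are already in a list or tuple, a \* in the call unpacks them into separate arguments. A short sketch, repeating the function so the block runs on its own (the list name values is made up here):

```python
def multiplyNumbers(*arg):
    prod = 1
    for x in arg:
        prod = prod * x
    return prod

values = [2, 3, 4]               # hypothetical list of inputs
print(multiplyNumbers(*values))  # same as multiplyNumbers(2, 3, 4) -> 24
print(multiplyNumbers())         # arg is an empty tuple, so the result is 1
```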

### You can include \*args after the named arguments and it will contain all the remaining positional arguments

In [None]:
def addOrMultiply(doAddition,*args):
    if doAddition:
        out = 0
        for x in args:
            out = out + x
    else:
        out = 1
        for x in args:
            out = out*x
    return out

In [None]:
addOrMultiply(True,2,3,4)

In [None]:
addOrMultiply(False,2,3,4)

### We can combine positional, keyword and variable arguments. In this case, the first positional argument goes into doAddition and the others into args. divideBy must be specified using its keyword (if we don't use the keyword, the last number will go into args):

In [None]:
def addOrMutipleAndDivide(doAddition,*args,divideBy = 1):
    if doAddition:
        out = 0
        for x in args:
            out = out + x
    else:
        out = 1
        for x in args:
            out = out*x
    return out/divideBy

In [None]:
addOrMutipleAndDivide(True, 2, 3, 4) # divideBy keeps its default of 1; 2, 3 and 4 all go into args

In [None]:
addOrMutipleAndDivide(True,2,3,4,divideBy=3)
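
### The keyword-only behaviour is easier to see in a stripped-down sketch. The function sumAndDivide is hypothetical and is used only to show where each value ends up.

```python
def sumAndDivide(*args, divideBy=1):
    # Everything passed by position lands in args;
    # divideBy can only be set by keyword
    return sum(args) / divideBy

print(sumAndDivide(2, 3, 4))              # (2+3+4)/1 -> 9.0
print(sumAndDivide(2, 3, 4, divideBy=3))  # (2+3+4)/3 -> 3.0
print(sumAndDivide(2, 3, 4, 3))           # the last 3 goes into args -> 12.0
```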

### We can also have variable numbers of arguments with keywords as in:

In [None]:
def variableKeywords(**kwargs):
    for k,v in kwargs.items():
        print(f"{k}, {v}") #f-string formatting also works when v is not a string

In [None]:
variableKeywords(foo = "str1", foo2 = "str2", foo3 = "str3")

In [None]:
def divideNumbers(a,b,**kwargs):
    s = a/b
    #kwargs.get returns the default (False) when the keyword was not supplied
    if kwargs.get("invert", False):
        s = 1/s
    if kwargs.get("takelog", False):
        s = np.log(s)
    return s

In [None]:
divideNumbers(9,3)

In [None]:
divideNumbers(9,3, invert = True)

In [None]:
divideNumbers(9,3, takelog = True)

In [None]:
divideNumbers(9,3, takelog = True, invert = True)

In [None]:
divideNumbers(9,3, takelog = True, invert = True, extraArg = "yes") #extraArg is accepted into kwargs but silently ignored
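
### Because \*\*kwargs accepts any keyword, a misspelled option passes unnoticed. One alternative, sketched below under the assumption that the set of options is known in advance, is to declare them as keyword-only arguments with defaults. The name divideNumbersStrict is hypothetical, and math.log stands in for np.log so the block needs no extra packages.

```python
import math

def divideNumbersStrict(a, b, *, invert=False, takelog=False):
    # The bare * means invert and takelog can only be passed by keyword
    s = a / b
    if invert:
        s = 1 / s
    if takelog:
        s = math.log(s)
    return s

print(divideNumbersStrict(9, 3))
print(divideNumbersStrict(9, 3, invert=True))

# An unknown keyword is now rejected instead of being ignored
try:
    divideNumbersStrict(9, 3, extraArg="yes")
except TypeError as err:
    print("TypeError:", err)
```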

## Starting with pandas:

### The pandas library is very widely used in data analysis. If you need to install it, do "conda install pandas" from your conda terminal. 
### pandas is imported with this standard convention:

In [None]:
import pandas as pd

### Pandas has two basic datatypes, the Series and the DataFrame, which are for one-dimensional and two-dimensional (tabular) data. Today we will talk about Series. 

### Imagine we have gene expression data for 3 genes in two different conditions but the formatting is not consistent:

In [None]:
data1 = pd.Series([23, 99, 1], index = ["Gene1","Gene2","Gene3"])

In [None]:
data1

### Notice that the series is like a numpy array, it has a dtype, but it also has an index. 

### You can name the index:

In [None]:
data1.index.name = "Gene name"

In [None]:
data1

### You can then access data by index:

In [None]:
data1["Gene2"]

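### You can also access by position as in a numpy array. Use iloc for this, since plain integer indexing on a Series with a non-integer index is deprecated in recent pandas versions:

In [None]:
data1.iloc[1]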
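
### The two kinds of access have explicit forms: loc selects by index label and iloc selects by integer position. A minimal sketch on a copy of the same data (the name demo is made up here):

```python
import pandas as pd

demo = pd.Series([23, 99, 1], index=["Gene1", "Gene2", "Gene3"])

print(demo.loc["Gene2"])        # by label -> 99
print(demo.iloc[1])             # by position -> 99
print(demo.iloc[0:2].tolist())  # position slices work as in numpy -> [23, 99]
```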

In [None]:
data2 = pd.Series([77, 27, 3], index = ["Gene3","Gene1","Gene2"])

In [None]:
data2

### Let's say we want to take the combined expression from these two conditions:

In [None]:
data1+data2

### Notice what has happened - the values were added by index even though they were in a different order. Numpy arrays have no index, so they are added by position, which gives the wrong answer here:

In [None]:
data1.to_numpy()+data2.to_numpy()
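
### If plain numpy arrays are needed, one option is to put both Series in the same order first with reindex. The sketch below rebuilds the two Series under the hypothetical names d1 and d2 so that it runs on its own.

```python
import pandas as pd

d1 = pd.Series([23, 99, 1], index=["Gene1", "Gene2", "Gene3"])
d2 = pd.Series([77, 27, 3], index=["Gene3", "Gene1", "Gene2"])

aligned = d1 + d2                            # pandas matches values by label
byPosition = d1.to_numpy() + d2.to_numpy()   # numpy adds by position only

# Reorder d2 to follow d1's index before dropping down to numpy
reordered = d1.to_numpy() + d2.reindex(d1.index).to_numpy()

print(aligned.tolist())     # [50, 102, 78]
print(byPosition.tolist())  # [100, 126, 4]
print(reordered.tolist())   # [50, 102, 78]
```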

### Numpy-like indexing and filtering also works:

In [None]:
data1 > 50

In [None]:
data1[data1 > 50]

### You can make a Series from a list or tuple without specifying an index and it will have a default index:

In [None]:
pd.Series([1, 2, 3, 4, 5])

### If you make a series from a dictionary, the keys will become the index in the order you specified them:

In [None]:
pd.Series({"gene1":12,"gene2":16,"gene3":98})

### If you want a different index ordering or to add extra values, you can give it an index directly. Note how missing values were handled:

In [None]:
pd.Series({"gene1":12,"gene2":16,"gene3":98}, index = ["gene1","gene2","gene3","gene4"])

### You can add values to a series directly:

In [None]:
data1["gene5"] = 66

In [None]:
data1

### Pandas will add Series with different indexes, but any label present in only one of them gets a missing value (NaN) in the result:

In [None]:
dataAdd = data1+data2

In [None]:
dataAdd

### See the missing values with the isna method:

In [None]:
dataAdd.isna()
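
### Once the missing values are located, two common ways to handle them are to drop them, or to avoid creating them by adding with a fill value. A sketch under the assumption that a gene absent from one condition should count as 0 there (d1, d2, total and filled are hypothetical names):

```python
import pandas as pd

d1 = pd.Series([23, 99, 1, 66], index=["Gene1", "Gene2", "Gene3", "gene5"])
d2 = pd.Series([77, 27, 3], index=["Gene3", "Gene1", "Gene2"])

total = d1 + d2
print(total.isna().sum())   # 1 label (gene5) has no partner in d2

# Option 1: drop the labels with missing values
print(total.dropna())

# Option 2: treat a missing partner as 0 while adding
filled = d1.add(d2, fill_value=0)
print(filled)
```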