## Last time we covered
* ### Numpy indexing and vectorizing code
* ### Matplotlib: figure and axis objects, scatter plots
* ### Data input and output - with built-in python functions, and with numpy

## Today we will cover:
* ### HW1 ORF problem review
* ### Writing functions in python
* ### Getting started with pandas

## HW1 ORF problem review

In [2]:
import numpy as np

### First generate the random sequence. The np.random.choice function is useful here:

In [3]:
N = 500 # define sequence length
bases = np.array(['A','T','G','C'])
rand_seq = np.random.choice(bases,N) #this makes a numpy array with 500 random bases
print(''.join(rand_seq)) #print it nicely. 

GGTAATACGTAACCAATCTCTGCTCTTTCGGAATTCCATAAGGACCGAACATGTCCAATCAGCACAGGCCAGATATCCAACAATAGTTTGAAAATCATCGTATTGGTCAATGTCATCCGGTGGGCCAAATCGCTGGGGTTCTCTAACCCTAAGAGGGACTGTTGTAACCTACTGACTAGTTTTTGTCACTTAGCTTGGCATGCCCACTAAGACTGACAAGTCGACCAATCCTGGCATGAACAGACGGGTAACCTCCCATTCCAAGTAGGGATGGACACTTCTCTCAATAGAGGAATATTACCATTCTGCTGAGCGGCCTCAAAGGATAGATTACCTAGTGCCTAAAGTGTGGGTGAGAGTGGGTCCGTCGTTTAGCGCGTCCTCAAGACCTCGCATGTCGACTTGATGTGGTAAGGCGGTGTACTGTTTCACGTCTAGCCAATCCTCCGTATATGTACTTAGAAAGAACAAGTGCATTACCGTGGTAGGTGAAGGCCGCC


### Let's first make a numpy array with all the codons. Here we will do this in a straightforward way with a loop.

In [4]:
allCodons = [] #empty list to store the codons

#step through the sequence storing one codon at a time as a string. 
for ii in range(len(rand_seq)-2):
    allCodons.append(''.join(rand_seq[ii:(ii+3)]))
    
allCodons = np.array(allCodons) #convert to numpy array

In [6]:
allCodons == 'ATG'

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False,  True, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False,  True, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,

### Now let's find the stop and start codons:

In [7]:
starts = []
stops = []
for n, cod in enumerate(allCodons):
    if cod == 'ATG':
        starts.append(n)
    elif cod == 'TAA' or cod == 'TGA' or cod == 'TAG':
        stops.append(n)
starts = np.array(starts)
stops = np.array(stops)

In [9]:
stops

array([  2,   9,  38,  83,  88, 143, 149, 164, 172, 176, 190, 207, 213,
       236, 248, 265, 287, 309, 326, 335, 342, 353, 372, 403, 411, 435,
       459, 485, 489])
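
The same start and stop positions can also be found without an explicit loop, using the vectorized numpy tools from last time. The sketch below uses a short fixed sequence (an assumption, chosen so the result is reproducible) instead of the random one above. `np.flatnonzero` returns the indices where a boolean array is True, and `np.isin` tests each element for membership in a list of values.

```python
import numpy as np

# Short fixed example sequence (hypothetical, not the random one from the lecture)
seq = np.array(list("CCATGAAATAGGTGA"))

# Build every overlapping codon, as in the loop above
codons = np.array([''.join(seq[ii:ii+3]) for ii in range(len(seq) - 2)])

# Indices of start codons and of any of the three stop codons
starts_vec = np.flatnonzero(codons == 'ATG')
stops_vec = np.flatnonzero(np.isin(codons, ['TAA', 'TGA', 'TAG']))

print(starts_vec)  # [2]
print(stops_vec)   # [ 3  8 12]
```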

### For each start codon, let's find the first stop codon that is a multiple of 3 base pairs away

In [10]:
firstStop = np.zeros(len(starts))
for n,sta in enumerate(starts):
    for sto in stops:
        if sto > sta and (sto-sta)%3 == 0: #stop must be after start and a multiple of 3 away
            firstStop[n] = sto
            break    # once we find the first stop codon meeting this criterion, we can move on to the next start codon

In [12]:
starts

array([ 50, 109, 199, 235, 270, 394, 405, 452])

In [11]:
firstStop

array([ 83., 172., 265., 265., 309., 403., 411., 485.])

### This has defined all the open reading frames. Now let's find the longest one and print the details

In [13]:
ORFLengths = firstStop - starts
indMax = np.argmax(ORFLengths)
longestLength = ORFLengths[indMax]

if longestLength > 0:
    print('The longest open reading frame is of length ' + str(int(longestLength))\
          + '. Starts at ' + str(int(starts[indMax])) \
          + '. Stops at ' + str(int(firstStop[indMax])))
else:
    print('No ORF found')

The longest open reading frame is of length 66. Starts at 199. Stops at 265
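
The steps above can also be collected into a single reusable function, using the `def` syntax introduced in the next section. This is only a sketch under a few assumptions: the name `longest_orf` and its return format are made up for illustration, the sequence is passed as a plain string, and the length is measured as stop index minus start index, as in the cells above.

```python
def longest_orf(seq):
    """Return (start, stop, length) of the longest ORF in seq, or None if there is none."""
    codons = [seq[ii:ii+3] for ii in range(len(seq) - 2)]
    best = None
    for start, cod in enumerate(codons):
        if cod != 'ATG':
            continue
        # step through the sequence in-frame until the first stop codon
        for stop in range(start + 3, len(codons), 3):
            if codons[stop] in ('TAA', 'TGA', 'TAG'):
                if best is None or stop - start > best[2]:
                    best = (start, stop, stop - start)
                break
    return best

print(longest_orf("CCATGAAATAGGTGA"))  # (2, 8, 6)
print(longest_orf("CCCCCC"))           # None
```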


## Writing functions in python:

* ### You have used many functions written in python, for example range, len, np.sqrt etc. 
* ### Writing your own functions allows you to avoid writing the same block of code many times
* ### Each function gets its own workspace, which avoids variable name conflicts

### Let's start with a simple example:


In [14]:
def divideTwoNumbers(num1,num2):
    s = num1/num2
    return s

In [15]:
divideTwoNumbers(3,4)

0.75

### num1 and num2 are called arguments. These are positional arguments because the function assigns their values from their position in the call. The order matters:

In [16]:
divideTwoNumbers(4,3)

1.3333333333333333

### but we could also set them explicitly using their names in any order

In [17]:
divideTwoNumbers(num2 = 4, num1 = 3)

0.75

### Note that the variables num1 and num2 only have meaning inside the function. They are set while the function is running but are not included in the workspace once the function is finished running:

In [18]:
num2

NameError: name 'num2' is not defined

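### Variables that are defined outside the function and not passed to it cannot be reassigned from inside the function. Here, assigning to num3 inside the function makes python treat it as a local variable, so reading it on the same line fails: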

In [20]:
def divideNumbers(num1,num2):
    num3 = num3 + 1
    s = num1/(num2*num3)
    return s

In [21]:
num1 = 2
num2 = 4
num3 = 5
divideNumbers(num1,num2)

UnboundLocalError: cannot access local variable 'num3' where it is not associated with a value

### One way around this is to declare num3 as global, but this is not recommended in general:

In [22]:
def divideNumbers_wGlobal(num1,num2):
    global num3 
    num3 = num3 + 1
    s = num1/(num2*num3)
    return s

In [24]:
divideNumbers_wGlobal(3,4)

0.10714285714285714

### A quirk of python is that if you read a variable inside a function without assigning it or passing it as an argument, python looks it up as a global. (This didn't work in the divideNumbers function because the statement num3 = ... makes python treat num3 as local.)

In [25]:
def testfunc():
    print(num3)

In [26]:
testfunc()

7
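
A small sketch of the two cases side by side may help. The names `counter`, `read_only` and `shadow` are made up for illustration: reading a global inside a function works, while assigning to the same name creates a separate local variable and leaves the global untouched.

```python
counter = 5  # defined outside any function

def read_only():
    # no assignment to counter here, so python looks it up in the global workspace
    return counter + 1

def shadow():
    # this assignment creates a new local variable; the global counter is untouched
    counter = 100
    return counter

print(read_only())  # 6
print(shadow())     # 100
print(counter)      # 5
```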


In [27]:
### A much better solution is to simply include it as another argument:
def divideThreeNumbers(num1,num2,num3):
    s = num1/(num2*num3)
    return s

### You can also assign default values to arguments; then you don't have to supply them:

In [28]:
def divideNumbersWithDefaults(num1 = 2, num2 = 3):
    s = num1/num2
    return s

In [29]:
divideNumbersWithDefaults()

0.6666666666666666

In [30]:
divideNumbersWithDefaults(3)

1.0

### You can mix arguments with and without defaults, but those with defaults, which are optional, must come after those without:

In [31]:
#This won't work, gives an error
def divideNumbersOneDefault(num1 = 2, num2):
    s = num1/num2
    return s

SyntaxError: non-default argument follows default argument (1377247247.py, line 2)

In [32]:
def divideNumbersOneDefault(num1, num2 = 2):
    s = num1/num2
    return s

In [33]:
divideNumbersOneDefault(6)

3.0

In [34]:
divideNumbersOneDefault(12,17)

0.7058823529411765

### You can have an arbitrary number of non-keyword arguments using the \*arg notation. Once inside the function arg will be a tuple containing all of these arguments

In [35]:
def multiplyNumbers(*arg):
    print(arg)
    prod = 1
    for x in arg:
        prod = prod*x
    return prod

In [37]:
multiplyNumbers(2,3,4,7)

(2, 3, 4, 7)


168
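
The star notation also works in the other direction: if the values are already in a list or tuple, a `*` in the call unpacks them into separate positional arguments. The function below is a hypothetical copy of multiplyNumbers without the print statement, so the sketch runs on its own.

```python
def multiply_all(*args):
    # hypothetical copy of multiplyNumbers, without the print
    prod = 1
    for x in args:
        prod = prod * x
    return prod

values = [2, 3, 4, 7]
print(multiply_all(*values))  # same as multiply_all(2, 3, 4, 7), gives 168
```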

### You can include *args after the named positional arguments and it will collect all the remaining positional arguments:

In [39]:
def addOrMultiply(doAddition,*args):
    if doAddition:
        out = 0
        for x in args:
            out = out + x
    else:
        out = 1
        for x in args:
            out = out*x
    return out

In [41]:
addOrMultiply(True,2,3,4,7)

16

In [42]:
addOrMultiply(False,2,3,4)

24

### We can combine positional, keyword and variable arguments as shown here. In this case, the first positional argument goes into doAddition, the others into args, and divideBy must be specified at the end using its keyword (if we don't use the keyword, the last number goes into args):

In [None]:
def addOrMutipleAndDivide(doAddition,*args,divideBy = 1):
    if doAddition:
        out = 0
        for x in args:
            out = out + x
    else:
        out = 1
        for x in args:
            out = out*x
    return out/divideBy

In [None]:
addOrMutipleAndDivide(True, 2, 3, 4) # divideBy is not set here: 2, 3 and 4 all go into args, so divideBy keeps its default of 1

In [None]:
addOrMutipleAndDivide(True,2,3,4,divideBy=3)
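
To see the keyword-only behavior in isolation, here is a minimal sketch with a made-up helper, `sum_and_divide`. Any argument declared after `*args` can only be set by name, so an extra positional value is simply collected into args.

```python
def sum_and_divide(*args, divideBy=1):
    # hypothetical helper: every positional value lands in args;
    # divideBy can only be set by keyword
    return sum(args) / divideBy

print(sum_and_divide(2, 3, 4))              # 9.0
print(sum_and_divide(2, 3, 4, 2))           # 11.0, the final 2 went into args
print(sum_and_divide(2, 3, 4, divideBy=2))  # 4.5
```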

### We can also have variable numbers of arguments with keywords as in:

In [43]:
def variableKeywords(**kwargs):
    for k,v in kwargs.items():
        print(k + ", " + v)

In [44]:
variableKeywords(foo = "str1", foo2 = "str2", foo3 = "str3")

foo, str1
foo2, str2
foo3, str3


In [45]:
def divideNumbers(a,b,**kwargs):
    s = a/b
    if "invert" in kwargs.keys() and kwargs["invert"] == True:
        s = 1/s
    if "takelog" in kwargs.keys() and kwargs["takelog"] == True:
        s = np.log(s)
    return s

In [46]:
divideNumbers(9,3)

3.0

In [47]:
divideNumbers(9,3, invert = True)

0.3333333333333333

In [48]:
divideNumbers(9,3, takelog = True)

1.0986122886681098

In [49]:
divideNumbers(9,3, takelog = True, invert = True)

-1.0986122886681098

In [50]:
divideNumbers(9,3, takelog = True, invert = True, extraArg = "yes")

-1.0986122886681098
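
Note that in the last call the unknown keyword extraArg was silently ignored. That is convenient, but it also means a misspelled option goes unnoticed. When the set of options is known in advance, one alternative is to declare them as ordinary keyword arguments with defaults. The sketch below uses a hypothetical function `divide_with_options` to show that python then raises a TypeError for an unknown keyword.

```python
import numpy as np

def divide_with_options(a, b, invert=False, takelog=False):
    # hypothetical variant of divideNumbers with explicit keyword arguments
    s = a / b
    if invert:
        s = 1 / s
    if takelog:
        s = np.log(s)
    return s

print(divide_with_options(9, 3, invert=True))

try:
    divide_with_options(9, 3, inverted=True)  # misspelled keyword
except TypeError as err:
    print("TypeError:", err)
```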

## Starting with pandas:

### The pandas library is very widely used in data analysis. If you need to install it, do "conda install pandas" from your conda terminal. 
### pandas is imported with this standard convention:

In [52]:
import pandas as pd

### Pandas has two basic datatypes, the Series and the DataFrame, which are for one-dimensional and two-dimensional (tabular) data. Today we will talk about Series.

### Imagine we have gene expression data for 3 genes in two different conditions but the formatting is not consistent:

In [53]:
data1 = pd.Series([23, 99, 1], index = ["Gene1","Gene2","Gene3"])

In [54]:
data1

Gene1    23
Gene2    99
Gene3     1
dtype: int64

### Notice that the Series is like a numpy array (it has a dtype), but it also has an index.

### You can name the index:

In [55]:
data1.index.name = "Gene name"

In [56]:
data1

Gene name
Gene1    23
Gene2    99
Gene3     1
dtype: int64

### You can then access data by index:

In [57]:
data1["Gene2"]

99

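### You can also access by position, as in a numpy array. Use .iloc for this, since plain integer indexing on a Series with a labeled index is deprecated in recent pandas versions:

In [58]:
data1.iloc[1]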

99
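
Label-based and position-based access can be made explicit with the `.loc` and `.iloc` accessors, which avoids any ambiguity about whether a key is a label or a position. The sketch below rebuilds the same three-gene Series under the assumed name `expr` so it runs on its own.

```python
import pandas as pd

expr = pd.Series([23, 99, 1], index=["Gene1", "Gene2", "Gene3"])

print(expr.loc["Gene2"])             # by label
print(expr.iloc[1])                  # by position
print(expr.loc[["Gene1", "Gene3"]])  # several labels at once
print(expr.iloc[0:2])                # position slice, end excluded as in numpy
```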

In [59]:
data2 = pd.Series([77, 27, 3], index = ["Gene3","Gene1","Gene2"])

In [60]:
data2

Gene3    77
Gene1    27
Gene2     3
dtype: int64

### Let's say we want to take the combined expression from these two conditions:

In [61]:
data1+data2

Gene1     50
Gene2    102
Gene3     78
dtype: int64

### Notice what has happened: the values were added by index label even though they were in a different order. Numpy has no notion of labels, so adding the raw arrays adds element by element in stored order and gives the wrong answer:

In [62]:
data1.to_numpy()+data2.to_numpy()

array([100, 126,   4])
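
If you do need raw numpy arrays, one option is to put both Series in the same label order first. The sketch below uses the `reindex` method for this; the names `cond1`, `cond2` and `cond2_aligned` are assumptions standing in for data1 and data2.

```python
import pandas as pd

cond1 = pd.Series([23, 99, 1], index=["Gene1", "Gene2", "Gene3"])
cond2 = pd.Series([77, 27, 3], index=["Gene3", "Gene1", "Gene2"])

# Reorder cond2 to follow cond1's index before dropping to raw arrays
cond2_aligned = cond2.reindex(cond1.index)
total = cond1.to_numpy() + cond2_aligned.to_numpy()

print(total)  # [ 50 102  78]
```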

### Numpy-like indexing and filtering also works:

In [63]:
data1 > 50

Gene name
Gene1    False
Gene2     True
Gene3    False
dtype: bool

In [64]:
data1[data1 > 50]

Gene name
Gene2    99
dtype: int64
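
Conditions can be combined as with numpy boolean arrays, using `&` (and), `|` (or) and `~` (not), with parentheses around each comparison. A short sketch on the same three-gene data, rebuilt here under the assumed name `expr`:

```python
import pandas as pd

expr = pd.Series([23, 99, 1], index=["Gene1", "Gene2", "Gene3"])

# genes with expression strictly between 10 and 50
mid_range = expr[(expr > 10) & (expr < 50)]
# genes that are NOT above 50
not_high = expr[~(expr > 50)]

print(mid_range)
print(not_high)
```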

In [65]:
### You can make a Series from a list or tuple without specifying an index and it will have a default index:

In [66]:
pd.Series([1, 2, 3, 4, 5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [67]:
### If you make a series from a dictionary, the keys will become the index in the order you specified them:

In [68]:
pd.Series({"gene1":12,"gene2":16,"gene3":98})

gene1    12
gene2    16
gene3    98
dtype: int64

### If you want a different index ordering or to add extra entries, you can give it an index directly. Note that the missing value for gene4 becomes NaN and the dtype changes to float64:

In [70]:
pd.Series({"gene1":12,"gene2":16,"gene3":98}, index = ["gene1","gene2","gene3","gene4"])

gene1    12.0
gene2    16.0
gene3    98.0
gene4     NaN
dtype: float64

### You can add values to a series directly:

In [71]:
data1["gene5"] = 66

In [72]:
data1

Gene name
Gene1    23
Gene2    99
Gene3     1
gene5    66
dtype: int64

### Pandas will add Series with different indexes, but labels present in only one Series give NaN in the result:

In [73]:
dataAdd = data1+data2

In [74]:
dataAdd

Gene1     50.0
Gene2    102.0
Gene3     78.0
gene5      NaN
dtype: float64

### See the missing values with the isna method:

In [75]:
dataAdd.isna()

Gene1    False
Gene2    False
Gene3    False
gene5     True
dtype: bool
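
Once the missing values are located, there are a few common ways to deal with them. The sketch below rebuilds the two Series under the assumed names `a` and `b` and shows `dropna`, `fillna`, and the `add` method with `fill_value`. Whether a missing measurement should be treated as 0 depends on the data, so take this as an illustration of the mechanics rather than a recommendation.

```python
import pandas as pd

a = pd.Series([23, 99, 1, 66], index=["Gene1", "Gene2", "Gene3", "gene5"])
b = pd.Series([77, 27, 3], index=["Gene3", "Gene1", "Gene2"])

total = a + b                    # gene5 is NaN because it is missing from b

print(total.dropna())            # drop the entries with no value
print(total.fillna(0))           # replace NaN with 0 after adding
print(a.add(b, fill_value=0))    # treat a missing entry as 0 before adding
```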