# Day 3 - Introduction to Python II

## List comprehension
Remember that a list is written as items within square brackets `[]`.

In Python you can use a 'list comprehension' to build a new list by iterating over another list (or any other iterable).
It is a more compact form of a `for` loop: the syntax (how it is written) differs, and the expression itself evaluates to a new `list`.


In [None]:
# Example for loop
solutions = []
for x in range(10):
    solutions.append(5*x)

print(solutions)

In [None]:
# List comprehension version
solutions = [5*x for x in range(10)] # More concise, and usually a little faster

print(solutions)

List comprehension statements can also include an `if` conditional statement, just like `for` loops.


In [None]:
[num for num in solutions] #returns itself

In [None]:
[num for num in solutions if num%2 == 0] #returns only values that are even

The `range` function seen above generates an object of type `range` that can be used as an iterable in a variety of contexts. It takes up to three arguments, which are the lower bound (inclusive), the upper bound (exclusive, just like slicing), and a step size. This concept works only with integers - you can't define a `range` between two floats!

In [None]:
print(type(range(0,10))) #if you don't provide a step size, it defaults to 1

print(range(0,10)) #a range object

print(list(range(0,10))) #can be expanded by casting to a list
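
A brief sketch of how the three arguments of `range` behave, including a negative step and the integer-only restriction described above. The specific numbers are arbitrary and chosen only for illustration.

```python
# The third argument is the step size
by_fives = list(range(0, 20, 5))
print(by_fives)      # [0, 5, 10, 15]

# A negative step counts down; the stop value (0) is still excluded
countdown = list(range(10, 0, -2))
print(countdown)     # [10, 8, 6, 4, 2]

# Floats are rejected with a TypeError
try:
    range(0.0, 1.0)
    float_ok = True
except TypeError as err:
    float_ok = False
    print(f"TypeError: {err}")
```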

## Exercises

In [None]:
5//2+(5%2)/2 #integer division and modulo

In [None]:
# examples of range(x,y,z)
seq = [x for x in range(20)] #If you only provide range() with one integer, it defaults to starting at 0

# use range to find every third number between 1 and 30


# use list comprehension to find the odd elements within that set



### List comprehension examples using variables other than numbers


In [None]:
Folks = [
("Luke", "BME", "student"),
("Sheridan", "Neuroscience", "student"),
("Loyal", "Neuroscience", "professor"),
("Winston", "BME", "professor")
]

[x[1] for x in Folks] ## access the second element in each tuple

## Use list comprehension and f-string formatting to write a descriptive sentence about each person using the info above and print it to the output cell.


## Functions
Many times when writing programs, you will find yourself doing the same task(s) over and over again. Functions allow you to create reusable pieces of your programs, so you can run a specified block of code anywhere in your program, as many times as you need, with whatever inputs you need. As an example, let's figure out the GC% of a given nucleic acid string. To do this, we need the length of the oligonucleotide sequence, as well as the number of 'G' and 'C' bases. Let's start by working out a way to do this for one DNA sequence.

In [None]:
dna = 'ATTAGCGTATTCGAGCTATCGATCTAGCGAGCTAGCTATCAGCGACGTACG'

dnaLength = len(dna)
print(f'Length of dna: {dnaLength}')

nG = dna.count('G')
print(f'Number of Gs: {nG}')

nC = dna.count('C')
print(f'Number of Cs: {nC}')

nGC = nG + nC

GC_content = (nGC/dnaLength)*100
print(GC_content)

While that's not terribly tedious, what if we needed to do this for 100,000 different DNA sequences (not an uncommon task or scale in genomics)?

We will definitely need a quicker way to perform these actions.

What we need to do is 'functionalize' this process so we can effectively turn it into a reusable command.

So let's create a function to do all of this for us.

We start with the `def` statement, which gives the function its name, followed by a pair of parentheses `()` where we can name 'parameters' (the variables that receive the values passed to the function) that we will use within our function, and then a colon. This is followed by the indented code block to execute when we 'call' the function. At the end of the function, we specify what value we want to return with a `return` statement.

In [None]:
def getGC(seq):
    dnaLength = len(seq)
    nG = seq.count('G')
    nC = seq.count('C')
    nGC = nG+nC
    GC_content = (nGC/dnaLength)*100
    return GC_content

Once we've created our function, we can call it as many times as we want, passing a different value to the `seq` parameter each time.

*Note the terminology difference: the names given in the function definition are called parameters, but the values you supply in the function call are called arguments.*

In [None]:
dna = 'ATTAGCGTATTCGAGCTATCGATCTAGCGAGCTAGCTATCAGCGACGTACG'

gc = getGC(dna)
print(gc)

Now try to use your function within a `for` loop to iterate over a `list` of different DNA sequences.

In [None]:
myDNASet = [
            'ACTGATGCTAGCTGACTGATCTAGCTGA',
            'TGCATTTTCGAGCTATCGAGCATTCTACGTACT',
            'CACTATCTACGGATCGGAGCGGATTCGTAGCTATGC',
            'GTATCGGATCTAGCGGCGGCATTATCG'
           ] # this is a list of strings, each containing a DNA sequence.

## your code that loops through the myDNASet list



When you create variables inside a function definition, they are 'local' to that function. You cannot access them from outside the function, and assigning to them does not overwrite variables of the same name that exist outside it. This is called the 'scope' of a variable. In Python, functions (and modules) define scopes; indented blocks such as `if` statements and `for` loops do not, so a variable assigned inside a loop is still available after the loop ends. Code inside a function can still read variables defined outside it. There are ways to change these rules, but for now we just need to recognize that different scopes exist.
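
A minimal sketch of local scope in practice. All names here (`gc_label`, `relabel`, `make_temp`, `inside_only`, `last_seen`) are made up for illustration. It also shows that a `for` loop, unlike a function, does not introduce a new scope in Python.

```python
gc_label = "global value"   # defined at the top level of the notebook/script

def relabel():
    gc_label = "local value"   # a new local variable; the global one is untouched
    return gc_label

print(relabel())    # local value
print(gc_label)     # global value

def make_temp():
    inside_only = 42           # exists only while the function runs
    return inside_only

make_temp()
try:
    print(inside_only)
    leaked = True
except NameError:
    leaked = False
    print("inside_only is not visible outside make_temp()")

# A for loop does not create its own scope
for i in range(3):
    last_seen = i
print(last_seen)    # 2
```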

When defining a function, you may often want to set default values for your parameters, which also makes those parameters optional. A default is set by writing `parameter=value` in the definition. You can also use keyword arguments to name parameters explicitly when you call a function, like so:

In [None]:
def checkDNA(seq, query='AGC'): #checks to see if the DNA sequence has a specific substring
    return query in seq

print(checkDNA(dna))

print(checkDNA(dna,query='TGCA'))

print(checkDNA(dna, 'TGCA')) # the keywords can be omitted, with the same result

print(checkDNA(query = "TGCA", seq = dna)) # with keywords, the positions of the arguments don't matter

    

By default, unnamed arguments passed to a function are interpreted positionally. Above, notice that we can omit the `seq=` and `query=` keywords in the function call.

However, you can instead be explicit with the keywords, in which case the order of the arguments no longer matters. An important rule is that positional arguments must come before keyword arguments. For example, the following code raises a `SyntaxError`:

In [None]:
print(checkDNA(seq = dna,"TGCA"))

## Group Exercise:

Now let's try to create a new function that takes a DNA sequence as an argument and calculates the melting temperature (the temperature at which two DNA strands will separate from each other).

This property is a function of the DNA sequence itself. For sequences shorter than 14 nucleotides, the formula is: $T_m(^{\circ}\mathrm{C}) = (A+T) * 2 + (G+C) * 4$. For sequences longer than 13 nucleotides, there is a different equation: $T_m(^{\circ}\mathrm{C}) = 64.9 +\frac{41*(G+C-16.4)}{(A+T+G+C)}$.

In each equation, `A`, `T`, `G`, and `C` correspond to the number of each nucleotide in the DNA string. Please define a single function that can handle a DNA sequence of any length.
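
As a hand-worked check for your function, here are two hypothetical cases chosen only because the counts are easy. An 8-nucleotide sequence with two of each base uses the short formula:

$$T_m = (2+2) * 2 + (2+2) * 4 = 8 + 16 = 24\,^{\circ}\mathrm{C}$$

A 20-nucleotide sequence with five of each base uses the long formula:

$$T_m = 64.9 + \frac{41*(5+5-16.4)}{20} = 64.9 - 13.12 = 51.78\,^{\circ}\mathrm{C}$$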

In [None]:
# Define and use your function. Test your function by looping through the sequences in the myDNASet list
myDNASet = [
            'ACTGATGCTAGCTGACTGATCTAGCTGA',
            'TGCATTTTCGAGCTATCGAGCATTCTACGTACT',
            'CACTATCTACGGATCGGAGCGGATTCGTAGCTATGC',
            'GTATCGGATCTAGCGGCGGCATTATCG'
           ]



## Importing modules
One of the more useful features of Python, as with many other languages, is that you can add features, tools, data, etc. to extend its functionality using _modules_. Python has a very large and diverse user base that has already developed many thousands of modules to perform broadly useful tasks, or very specialized tasks for different needs. You can also develop your own modules to split projects/workflows into manageable pieces for easier maintenance and reusability.

A module is nothing more than Python code: a file containing Python definitions and statements. Within a module, you can define classes (objects), functions, and variables. The file name is the module name with the suffix `.py` appended. Any existing Python file can be referenced as a module, and the elements within it can then be used. For example, if you have a Python file named `sequencing.py`, you can import it as a module with the name `sequencing`.

To use a module in your code, you must first load it. To do this, we use the `import` statement:

In [None]:
import math

This tells the Python interpreter to load the `math` module (found on the Python path) into the current session, making its functions and other elements available. `math` is a 'standard module' that is included with Python. To use a specific function or variable defined in `math`, you can call it using the dot (`.`) operator along with the module name:

In [None]:
a = 25
math.sqrt(a)

In [None]:
math.cos(a)

In [None]:
math.factorial(a)

It is important to be aware of modules as they can often save you from having to re-write or re-invent code that others have already solved. It is arguably the basis for the relevance of python as a programming language in the sciences as well!

There are a few syntactically different ways to import a module. You've already seen the direct import method above, but you can also import only the functions you need from a module by using the `from ... import` syntax:

In [None]:
from math import log10, log
log10(a)

Here we've imported just two of the functions in the `math` module. Notice that this time, you don't have to use `math.` before calling these functions. You can access them directly without using the module name.

You can rename the module for brevity or any other reason by using `as` with your import statement.

In [None]:
import datetime as dt

today = dt.date.today()
print(f"Today is {today}")

tomorrow = today + dt.timedelta(days=1)
print(f"Tomorrow is {tomorrow}")

Here, we imported the `datetime` module as `dt`. Notice that the `datetime` module (now available as `dt`) defines a class called `date`. We then used the `date.today()` method (a function attached to a specific class) to get the current local date, which is printed with some accompanying text using the `print()` function. Adding a `dt.timedelta` of one day gives tomorrow's date.

The `import` statement performs two operations: 1) a _search_ for a given package/module and 2) the binding of that module (and its functions/variables/classes) to a name in the current Python session. When a module is imported, the interpreter first searches for a built-in module with that name. If it doesn't find one, it then looks for a file named after the module with the `.py` suffix in the list of directories given by the `sys.path` variable:

In [None]:
import sys
sys.path

## import the 'os' module and print your current working directory with the getcwd() function
import os
print(os.getcwd())
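
To make the `sys.path` search concrete, here is a small sketch that writes a throwaway module file and then imports it. The module name `sequencing_demo`, its `reverse` function, and the temporary directory are all hypothetical and exist only for this illustration.

```python
import importlib
import os
import sys
import tempfile

# Write a tiny module file into a temporary directory
module_dir = tempfile.mkdtemp()
module_path = os.path.join(module_dir, "sequencing_demo.py")
handle = open(module_path, "w")
handle.write("def reverse(seq):\n    return seq[::-1]\n")
handle.close()

# The import search only finds the file once its directory is listed in sys.path
sys.path.append(module_dir)
importlib.invalidate_caches()   # make sure the freshly written file is noticed

import sequencing_demo
print(sequencing_demo.reverse("ATGC"))   # CGTA
```

In everyday use you rarely need to edit `sys.path` by hand: a `.py` file sitting in the same directory as your notebook or script is usually found automatically.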

### Exercise
1. Create a function to calculate the volume of a sphere using the radius as an argument. (hint: use a module to find $\pi$)

## Working with files

### Reading/opening files
Reading and writing to files in Python is achieved through the `open()` function. When you 'open' a file, you create a file object that communicates with the file on disk. We need to specify how we would like to interact with the file: do we want to read the file (`'r'`) or write to the file (`'w'`)? This is the 'mode' that we specify when we call `open()`.

In [None]:
file = open('files/Foxp1.gbk',mode = 'r')

# iterate through all lines of a file
for line in file:
    print(line)

file.close() # release the file once we are done with it

In [None]:
# Read only the first n lines of a file
nLines = 5
i=0

# Need to re-open the file since we've already iterated through it completely
file = open('files/Foxp1.gbk',mode = 'r')

while i < nLines:
    print(next(file))
    i += 1

file.close()
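
The open/read/close pattern above can also be written with a `with` block, which closes the file automatically even if an error occurs partway through. This is a sketch using a throwaway file (the name `demo.txt` and its contents are hypothetical) so that it does not depend on `files/Foxp1.gbk` being present.

```python
import os
import tempfile

# Create a small throwaway file to read from
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, mode="w") as out:
    out.write("line 1\nline 2\nline 3\n")

# The file is closed automatically when the with block ends
with open(demo_path, mode="r") as file:
    first_two = [next(file).strip() for _ in range(2)]

print(first_two)       # ['line 1', 'line 2']
print(file.closed)     # True
```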

### Fetching files/data from the internet

In [None]:
## Let's deconstruct what's happening here.

import urllib.request # the 'request' submodule must be imported explicitly
import gzip

url = "https://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/chromosomes/chrIII.fa.gz"
urllib.request.urlretrieve(url, "chrIII.fa.gz")

chrIII = gzip.open('chrIII.fa.gz',mode = 'rb') # note the additional use of 'b' here meaning we are going to 'read' a 'binary' (zipped) file.

print(next(chrIII))

numLines = 10

head = [next(chrIII).decode().strip() for x in range(numLines)] # Using list comprehension to return a list of values from some operation

print(head)

# .decode() to convert byte-string (b'') to a string ('')

# .strip() to remove the newline character ('\n') from the end of each line

chrIII.close()

### Writing output to a file
When we write to a file, we need to indicate whether we are going to overwrite anything in an existing file (`mode='w'`) or append the content to the end of the existing file (`mode='a'`). When we are done with a file, we should always close it with `close()`.

In [None]:
genes = ['Gapdh','Mef2c','Pax6','Cxcl1','Msi1']

file = open('myOutput.txt','w')
file.write("Here is my awesome file content!\n")
file.write("It's probably the most important information I'll need for my thesis!\n")
for gene in genes:
    file.write(f'{gene}\n')
file.close()

file = open('myOutput.txt','a') # Try changing the mode from 'a' to 'w'
file.write("I forgot to add this as well\n")
file.close()

file = open('myOutput.txt','r')

for line in file:
    print(line.strip())
file.close()


In [None]:
file = open('myOutput.txt','a') # Try changing the mode from 'a' to 'w'
file.write("I forgot to add this as well\n")
file.close()

In [None]:
file = open('myOutput.txt','r')

for line in file:
    print(line)
file.close()
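
To see the difference between the `'w'` and `'a'` modes side by side, here is a small sketch. The helper `write_then_read` and the file name `notes.txt` are hypothetical and are used only for this demonstration.

```python
import os
import tempfile

notes_path = os.path.join(tempfile.mkdtemp(), "notes.txt")

def write_then_read(path, text, mode):
    """Write text using the given mode, then return the file's full contents."""
    handle = open(path, mode)
    handle.write(text)
    handle.close()
    handle = open(path, "r")
    contents = handle.read()
    handle.close()
    return contents

after_first = write_then_read(notes_path, "first\n", "w")
print(repr(after_first))       # 'first\n'

after_append = write_then_read(notes_path, "second\n", "a")
print(repr(after_append))      # 'first\nsecond\n'

after_overwrite = write_then_read(notes_path, "third\n", "w")
print(repr(after_overwrite))   # 'third\n'
```

Opening with `'w'` discards whatever the file already held, so the earlier lines are gone after the last call.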


## Intro to Python Scripting


Now we're going to tie all of this together into a Python script that is a self-contained program. So far we've used Python interactively with notebooks, executing code cell by cell. Scripts are just text files with a set of instructions that are run when the script is called. By convention, these files have the suffix `.py`.




Let's make a real, functional script that can translate a DNA sequence into its corresponding protein.

In [None]:
#This dictionary represents the 'genetic code' of DNA codons and their corresponding amino acids
gencode = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}

# First let's write a function to return the amino acid associated with a given codon.
def lookup_AA(codon):
    # code
    return(aa)

# Let's dissect this function here to see what's going on and then we can make a reusable script out of this
def translate_dna(dna):
    # code
    return(protein)


In [None]:
translate_dna('ATGATGATGATGATGATGATGATGATGATG')
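
One possible way to fill in the two function skeletons above is sketched here; it is not the only approach. The choices to return `'X'` for an unrecognized codon, to ignore a trailing partial codon, and to keep translating past stop codons (shown as `'_'`) are assumptions made for this sketch.

```python
# Same codon table as in the cell above
gencode = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}

def lookup_AA(codon):
    # Unrecognized codons become 'X' (a choice made for this sketch)
    return gencode.get(codon.upper(), 'X')

def translate_dna(dna):
    # Step through the sequence three bases at a time,
    # ignoring any trailing bases that do not form a full codon
    usable = len(dna) - len(dna) % 3
    codons = [dna[i:i+3] for i in range(0, usable, 3)]
    return ''.join(lookup_AA(codon) for codon in codons)

print(translate_dna('ATGGCCTAA'))     # MA_
print(translate_dna('ATGAAATTTGG'))   # MKF
```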

### Passing arguments to scripts via the command line

Similar to how we provided arguments to functions, we can also flexibly provide python scripts with arguments. The syntax for this is:

**python my_script.py arg1 arg2 arg3 ...**

Now we just need to know how to access these arguments from within the script. Using the `sys` module we imported above, the arguments can be accessed through the list `sys.argv`. The first element (`sys.argv[0]`) is the name of the script that was called. The remaining elements (`sys.argv[1], sys.argv[2], ..., sys.argv[n]`) are the command line arguments, represented as strings.

With this knowledge (and everything you've learned so far!) we should be able to make our script interactive.
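
As a minimal sketch of the idea, the script below reads a DNA sequence from the command line and reports its GC content. The file name `gc_script.py` and the function name `main` are hypothetical. It is meant to be saved as a file and run as `python gc_script.py ATTAGCGT`.

```python
import sys

def main(argv):
    # argv[0] is the script name; everything after it is a user-supplied string
    if len(argv) < 2 or not argv[1]:
        print(f"Usage: python {argv[0]} DNA_SEQUENCE")
        return None
    seq = argv[1].upper()
    gc = (seq.count('G') + seq.count('C')) / len(seq) * 100
    print(f"{seq}: {gc:.1f}% GC")
    return gc

# Runs only when the file is executed as a script, not when it is imported
if __name__ == "__main__":
    main(sys.argv)
```

Passing the argument list into `main` as a parameter, rather than reading `sys.argv` inside it, makes the function easy to try from a notebook, for example `main(['gc_script.py', 'ATGC'])`.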

## Review
- Questions?
- Homework