# Introduction to Python

## Stephen Weston and Robert Bjornson  
## Yale Center for Research Computing  
## Jan 2017

## What is the Yale Center for Research Computing?


- Independent center under the Provost's office
- Created to support your research computing needs
- Focus is on high performance computing and storage
- ~15 staff, including applications specialists and system engineers
- Available to consult with and educate users
- Manage compute clusters and support users
- Located at 160 St. Ronan St., at the corner of Edwards and St. Ronan
- http://research.computing.yale.edu



## Why Python?
- Free, portable, easy to learn
- Wildly popular, huge and growing community
- Intuitive, natural syntax
- Ideal for rapid prototyping but also for large applications
- Very efficient to write, reasonably efficient to run as is
- Can be very efficient (numpy, cython, ...)
- Huge number of packages (modules)


## You can use Python to...
- Convert or filter files
- Automate repetitive tasks
- Compute statistics
- Build processing pipelines
- Build simple web applications
- Perform large numerical computations
- ...

You can use Python instead of bash, Java, or C

Python can be run interactively or as a program



## Different ways to run Python

1. Create a file using an editor, then:

   ```$ python myscript.py```

1. Run interpreter interactively

   ```$ python ```

1. Use a python environment, e.g. Anaconda


## Basic Python Types

In [2]:
radius=2
pi=3.14
diam=radius*2
area=pi*(radius**2)
title="fun with strings"
pi="cherry"
longnum=31415926535897932384626433832795028841971693993751058\
2097494459230781640628620899862803482534211706798214808651
delicious=True


- variables do not need to be declared or typed
- integers and floating points can be used together
- the same variable can hold different types
- lines can be broken using \
- python supports arbitrary length integer numbers


In [6]:
area**2



157.7536
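
The points above can be checked with the built-in `type()` function, which reports what a variable currently holds. This is a minimal sketch; the names `x`, `mixed`, and `big` are illustrative and do not come from the cells above.

```python
# The same name can be rebound to values of different types.
x = 2
print(type(x))        # <class 'int'>
x = 3.14
print(type(x))        # <class 'float'>
x = "cherry"
print(type(x))        # <class 'str'>

# Mixing an int and a float gives a float.
mixed = 2 * 3.5
print(mixed)          # 7.0

# Integers have arbitrary precision, so this does not overflow.
big = 2 ** 100
print(big)            # 1267650600228229401496703205376
```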

## Other Python Types: _lists_

Lists are like arrays in other languages.  



In [2]:
l=[1,2,3,4,5,6,7,8,9,10]
l[5]

6

In [11]:
l[0]


1

In [None]:
l[5:]

In [None]:
l[5:-3]

In [3]:
>>> l[2]=3.14
>>> l[3]="pi"
>>> l

[1, 2, 3.14, 'pi', 5, 6, 7, 8, 9, 10]

In [None]:
>>> len(l)

In [None]:
l=list(range(1,10))
print (l)
print (l[4:6])
l[-6:-3]

## Lists are more flexible than arrays, e.g.:
- Insert or append new elements
- remove elements
- nest lists
- combine values of different types into lists


In [None]:
>>> l=[1,2,3,4,5,6,7,8,9]
>>> l+[11,12,13]

In [None]:
>>> l[3:6]=['four to six']
>>> l
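
The operations listed above (insert, append, remove, nest, mix types) might look like this in practice. This is a small sketch using standard list methods; the values are made up for illustration.

```python
l = [1, 2, 3]
l.append(4)            # add to the end           -> [1, 2, 3, 4]
l.insert(0, 0)         # insert at position 0     -> [0, 1, 2, 3, 4]
l.remove(2)            # remove the first 2 found -> [0, 1, 3, 4]
last = l.pop()         # remove and return the last element (4)
l.append([10, 20])     # a list nested inside a list
l.append("mixed")      # a string alongside the numbers
print(l)               # [0, 1, 3, [10, 20], 'mixed']
print(l[3][1])         # 20
```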


## Other Python Types: _tuples_

Tuples are like lists, but cannot be modified.

In [None]:
t=(1,2,3,4,5,6,7,8,9)
t

In [None]:
t[4:6]

In [None]:
t[5]=99
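
The assignment above fails with a `TypeError`. A short sketch of what happens, and of one way to get a "changed" tuple by building a new one (the name `t2` is illustrative):

```python
t = (1, 2, 3, 4, 5, 6, 7, 8, 9)
try:
    t[5] = 99
except TypeError as err:
    print("cannot modify a tuple:", err)

# Build a new tuple from slices of the old one instead.
t2 = t[:5] + (99,) + t[6:]
print(t2)   # (1, 2, 3, 4, 5, 99, 7, 8, 9)
print(t)    # the original is unchanged
```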

## Other Python Types: _strings_
Strings are fully featured types in python.

- strings are defined with ' or "
- strings cannot be modified
- strings can be concatenated and sliced much like lists
- strings are objects with lots of useful methods


In [5]:
s="Donald Duck"
s

'Donald Duck'

In [None]:
s[7:]

In [None]:
s[0]='Cl'

In [7]:
s.upper()

'DONALD DUCK'
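
Because strings cannot be modified, string methods always return a new string. Here is a brief sketch with a few commonly used methods; the replacement values are made up for illustration.

```python
s = "Donald Duck"

# Assigning to s[0] fails, so build a new string instead.
renamed = "R" + s[1:]
print(renamed)                        # Ronald Duck

print(s.split())                      # ['Donald', 'Duck']
print(s.replace("Donald", "Daisy"))   # Daisy Duck
print(s.startswith("Don"))            # True
print(len(s))                         # 11
print(s)                              # Donald Duck (unchanged)
```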

## Other Python Types: _dictionaries_

Dicts are what python calls "hash tables"

- dicts associate keys with values, which can be of (almost) any type
- dicts have length, but are not ordered by key (insertion order is only guaranteed from Python 3.7 on)
- looking up values in dicts is very fast, even if the dict is BIG.



In [10]:
>>> coins={'penny':1, 'nickle':5, 'dime':10, 'quarter':25}
>>> coins['penny']

1

In [None]:
>>> coins.keys()

In [None]:
>>> coins.values()

In [None]:
>>> coins['half']=50
>>> coins
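
A few other everyday dict operations, sketched with the same `coins` data (spelling of the keys kept as in the cells above). Testing membership with `in` and using `get()` with a default both avoid a `KeyError` for missing keys.

```python
coins = {'penny': 1, 'nickle': 5, 'dime': 10, 'quarter': 25}

print('dime' in coins)          # True
print('dollar' in coins)        # False
print(coins.get('dollar', 0))   # 0 (default, because the key is missing)

del coins['penny']              # remove a key and its value
print(len(coins))               # 3

total = sum(coins.values())     # 5 + 10 + 25
print(total)                    # 40
```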

## Control Flow Statements: _if_

- if statements allow you to do a test, and do something based on the result
- _else_ is optional





In [None]:
import random
v=random.randint(0,100)
if v < 50:
    print ("small", v)
else:
    print ("big", v)
print ("after else")
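
When there are more than two cases, `elif` chains several tests together. The sketch below uses a hypothetical helper `describe()` and arbitrary thresholds, chosen only for illustration.

```python
def describe(v):
    """Classify v using arbitrary example thresholds."""
    if v < 33:
        return "small"
    elif v < 66:
        return "medium"
    else:
        return "big"

for v in [5, 50, 95]:
    print(v, describe(v))
```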

## Control Flow Statements: _while_

- While statements execute one or more statements repeatedly until the
test is false

In [None]:
>>> import random
>>> count=0
>>> while count<100:
...    count=count+random.randint(0,10)
...    print (count)
...


## Control Flow Statements: _for_

For statements take some sort of iterable object and loop once for
every value.

In [None]:
>>> for fruit in ['apple', 'orange', 'banana']:
...    print (fruit)

In [None]:
>>> for i in range(7):
...    print (i)

## Using ```for``` loops and ```dicts```
If you loop over a dict, you'll get just keys.  Use items() for keys and values.

In [13]:
>>> for denom in coins:  
...    print (denom)
      


penny
quarter
dime
nickle


In [16]:
>>> for denom, value in coins.items():  
...    print (denom, value)

penny 1
quarter 25
dime 10
nickle 5
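
If a predictable order matters, one option is to sort the items before looping. This is a sketch that sorts by value; the name `ordered` is illustrative.

```python
coins = {'penny': 1, 'nickle': 5, 'dime': 10, 'quarter': 25}

# items() gives (key, value) pairs; sort them by the value (second element).
ordered = sorted(coins.items(), key=lambda item: item[1])
for denom, value in ordered:
    print(denom, value)
```

This prints the coins from smallest to largest value, regardless of how the dict stores them.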


## Control Flow Statements: altering loops
While and For loops can skip steps (continue) or terminate early (break).

In [17]:
>>> for i in range(10):
...    if i%2 != 0: continue
...    print (i)


0
2
4
6
8


In [18]:
>>> for i in range(10):
...    if i>5: break
...    print (i)


0
1
2
3
4
5


## Note on code blocks

In the previous example:
```
>>> for i in range(10):
...    if i>5: break
...    print (i)
```

How did we know that ``` print (i)``` was part of the loop?

Many programming languages use \{ \} or Begin End to delineate blocks of
code to treat as a single unit.

Python uses white space (blanks).  To define a block of code, indent the block.

By convention and for readability, indent a consistent number,
usually 3 or 4 spaces.  Many editors will do this for you.


In [28]:
>>> for i in range(10):
...    if i>5: break
...    print (i)


0
1
2
3
4
5
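
To see that indentation alone decides what belongs to a block, compare two loops that differ only in the indentation of one line. This is a minimal sketch with made-up counter names.

```python
# Both statements are indented, so both belong to the loop body.
total_a = 0
runs_a = 0
for i in range(3):
    total_a += i
    runs_a += 1        # runs 3 times

# Only the first statement is indented; the second runs once, after the loop.
total_b = 0
runs_b = 0
for i in range(3):
    total_b += i
runs_b += 1            # runs 1 time

print(runs_a, runs_b)  # 3 1
```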


## Functions
Functions allow you to write code once and use it many times.

Functions also hide details so code is more understandable.


In [24]:
>>> def area(w, h):
...    return w*h

>>> area(6, 4) 

24
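
Functions can also carry a docstring and default argument values. The variant below is an illustrative extension of `area`, not part of the original example: it treats a missing `h` as a request for a square.

```python
def area(w, h=None):
    """Return the area of a w-by-h rectangle (a square if h is omitted)."""
    if h is None:
        h = w
    return w * h

print(area(6, 4))   # 24
print(area(5))      # 25
```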

## Summary of basic elements of Python
- 4 basic types: int, float, boolean, string
- 3 complex types: list, dict, tuple
- 4 control constructs: if, while, for, def



## Example 1: File ReFormatter

Task: given a file of hundreds or thousands of lines:

```
FCID,Lane,Sample_ID,SampleRef,index,Description,Control,Recipe,...
160212,1,A1,human,TAAGGCGA-TAGATCGC,None,N,Eland-rna,Mei,Jon_mix10
160212,1,A2,human,CGTACTAG-CTCTCTAT,None,N,Eland-rna,Mei,Jon_mix10
160212,1,A3,human,AGGCAGAA-TATCCTCT,None,N,Eland-rna,Mei,Jon_mix10
160212,1,A4,human,TCCTGAGC-AGAGTAGA,None,N,Eland-rna,Mei,Jon_mix10
...
```

Remove the last 3 letters from the 5th column:

```
FCID,Lane,Sample_ID,SampleRef,index,Description,Control,Recipe,...
160212,1,A1,human,TAAGGCGA-TAGAT,None,N,Eland-rna,Mei,Jon_mix10
160212,1,A2,human,CGTACTAG-CTCTC,None,N,Eland-rna,Mei,Jon_mix10
160212,1,A3,human,AGGCAGAA-TATCC,None,N,Eland-rna,Mei,Jon_mix10
160212,1,A4,human,TCCTGAGC-AGAGT,None,N,Eland-rna,Mei,Jon_mix10
...
```

In this example, we'll show:
- reading lines of a file
- parsing and modifying the lines
- writing them back out
- creating a script to do the above and running it

## In pseudocode 
```
 open the input file
 read the first header line, and print it out
 for each remaining line in the file
   read the line
   find the value in the 5th column
   truncate it by removing the last three letters
   put the line back together
   print it out
```



## In Python
```
import sys
fp=open(sys.argv[1])
print (fp.readline().strip())
for l in fp:
   flds=l.strip().split(',')
   flds[4]=flds[4][:-3]
   print (','.join(flds))
```

## Step 1: open the input file

```
import sys
fp=open(sys.argv[1])
```

sys is a standard module with a number of useful functions and variables.

sys.argv is the command line, as a list of strings.

sys.argv[0] is the name of the script, sys.argv[1] is the first
argument, etc.

open() takes a filename and returns a file object, often called a "file pointer".

We'll use that to read from the file.
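
A script that indexes `sys.argv[1]` directly fails with an `IndexError` when no argument is given. One way to guard against that is sketched below; the helper `input_path()` is hypothetical and exists only to make the check easy to try out.

```python
import sys

def input_path(argv):
    """Return the first command-line argument, or None if it is missing."""
    if len(argv) < 2:
        return None
    return argv[1]

print(sys.argv)                                    # the real command line
print(input_path(["fixfile.py", "badfile.txt"]))   # badfile.txt
print(input_path(["fixfile.py"]))                  # None
```

In a real script you would call `input_path(sys.argv)` and print a usage message if it returns `None`.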


## Step 2: read the first header line, and print it out
```
print (fp.readline().strip())
```

We call readline() on the file pointer to get a single line from the file
(the header line).

strip() removes whitespace from both ends of the line, including the newline at the end.

Then we print it.


## Step 3: for each remaining line in the file, read the line

```
for l in fp:
  ...
```


A file pointer is an example of an iterator.

Instead of explicitly calling readline() for each line, we can just loop on the file
pointer, getting one line each time.

Since we already read the header, we
won't get that line.

## Step 4: find the value in the 5th column, and remove last 3 letters

```
    flds=l.strip().split(',')
    flds[4]=flds[4][:-3]
```

Like before, we strip the return from the line.

We split it into
individual elements where we find commas.

The 5th field is referenced by
flds[4], since python starts indexing with 0.  [:-3] takes all characters
of the string until the last 3.
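
Applied to the first data line of the sample file, the two statements behave as follows. This is a worked sketch; the trailing `\n` stands in for the newline that a line read from a file would carry.

```python
line = "160212,1,A1,human,TAAGGCGA-TAGATCGC,None,N,Eland-rna,Mei,Jon_mix10\n"

flds = line.strip().split(',')
print(len(flds))      # 10
print(flds[4])        # TAAGGCGA-TAGATCGC

flds[4] = flds[4][:-3]
print(flds[4])        # TAAGGCGA-TAGAT
```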



## Step 5: put the line back together, and print it

```
    print (','.join(flds))
```

Join takes a list of strings, and combines them into one string using the
string provided. Then we just print that string.
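
A small sketch of `join` on its own: the string it is called on is the separator. The sample values are illustrative. Note that `join` expects strings, so numbers need converting first.

```python
flds = ['160212', '1', 'A1']
joined = ','.join(flds)
print(joined)                 # 160212,1,A1
print(' | '.join(flds))       # 160212 | 1 | A1

# join() only accepts strings, so convert other types first.
nums = [1, 2, 3]
joined_nums = ','.join(str(n) for n in nums)
print(joined_nums)            # 1,2,3
```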


## Reviewing
```
import sys
fp=open(sys.argv[1])
print (fp.readline().strip())
for l in fp:
   flds=l.strip().split(',')
   flds[4]=flds[4][:-3]
   print (','.join(flds))
```

We would invoke it like this:
```
$ python fixfile.py badfile.txt

$ python fixfile.py badfile.txt > fixedfile.txt
```
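
The same logic can also be wrapped in a function so it can be tried without a file on disk. This is a sketch rather than a replacement for the script above: `fix_lines()` is a hypothetical name, and `io.StringIO` stands in for an opened file with a shortened, made-up header.

```python
import io

def fix_lines(lines):
    """Yield the header unchanged, then each row with column 5 shortened by 3 characters."""
    lines = iter(lines)
    header = next(lines, None)
    if header is None:          # empty input: nothing to do
        return
    yield header.strip()
    for line in lines:
        flds = line.strip().split(',')
        flds[4] = flds[4][:-3]
        yield ','.join(flds)

# io.StringIO behaves like an open text file.
sample = io.StringIO(
    "FCID,Lane,Sample_ID,SampleRef,index,Description\n"
    "160212,1,A1,human,TAAGGCGA-TAGATCGC,None\n"
)
for out in fix_lines(sample):
    print(out)
```

In a script, you could pass `open(sys.argv[1])` in place of `sample`.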

## Example 2: directory walk

Imagine you have a directory tree with many subdirectories.

In those directories are files named *.fastq.  You want to:

- find them
- compress them to .fastq.gz using a program (gzip)
- delete them if the conversion was successful

In this example, we'll demonstrate:

- traversing an entire directory tree
- executing a program on files in that tree
- testing for successful program execution



## In pseudocode
```
for each directory
   get a list of files in that directory
   for each file in that directory
     if that file's name ends with .fastq
       create a new file name with .gz added
       create a command to do the compression
       run that command and check for success
       if success
         delete the original
       else
         stop
```
The conversion command is: 
```gzip -c file.fastq > file.fastq.gz```


## Step 1: directory traversal

We need a way to traverse all the files and directories.
```os.walk(dir)``` starts at dir and visits every subdirectory below it.
At each directory it yields three values: the directory's path, a list of its subdirectories, and a list of its files.

For example, imagine we have the following dirs and files:

```
d1
d1/d2
d1/d2/f2.txt
d1/f1.txt
```



In [27]:
>>> import os
>>> for d , dirs, files in os.walk('d1'):
...    print (d, dirs, files)

d1 ['d2'] ['f1.txt']
d1/d2 [] ['f2.txt']
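
Building on the walk above, the files of interest can be picked out with `str.endswith()` and turned into full paths with `os.path.join()`. This is a sketch: `find_fastq()` is a hypothetical helper, and the temporary directory tree is created only so the example runs on its own.

```python
import os
import tempfile

def find_fastq(root):
    """Return sorted paths of all files under root whose names end with .fastq."""
    found = []
    for d, dirs, files in os.walk(root):
        for name in files:
            if name.endswith('.fastq'):
                found.append(os.path.join(d, name))
    return sorted(found)

# Build a small throwaway tree to try it on.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'd1', 'd2'))
    for rel in ['d1/a.fastq', 'd1/f1.txt', 'd1/d2/b.fastq']:
        open(os.path.join(root, rel), 'w').close()
    hits = [os.path.relpath(p, root) for p in find_fastq(root)]
    print(hits)
```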


## Step 2: Invoking other programs from python

The subprocess module has a variety of ways to do this. A simple one:

```
import subprocess

ret=subprocess.call(cmd, shell=True)

```

ret is 0 on success, non-zero error code on failure.



In [31]:
import subprocess
ret=subprocess.call('gzip -c myfile.fastq > myfile.fastq.gz', shell=True)
ret 

0
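
The return code is what lets the script decide whether it is safe to delete the original. The sketch below shows that check under a few assumptions: `run_ok()` and `compress_and_remove()` are hypothetical helpers, the command is passed as a list instead of using `shell=True`, and the Python interpreter itself stands in for an external program so the example runs without gzip.

```python
import os
import subprocess
import sys

def run_ok(cmd):
    """Run cmd (a list: program, then arguments); True if it exited with code 0."""
    return subprocess.call(cmd) == 0

def compress_and_remove(path):
    """Compress path to path + '.gz' and delete the original only on success.

    Assumes a gzip executable is on the PATH; not called in this sketch.
    """
    with open(path + '.gz', 'wb') as out:
        ok = subprocess.call(['gzip', '-c', path], stdout=out) == 0
    if ok:
        os.remove(path)
    return ok

# Stand-in commands that simply exit with a chosen code.
print(run_ok([sys.executable, '-c', 'import sys; sys.exit(0)']))   # True
print(run_ok([sys.executable, '-c', 'import sys; sys.exit(3)']))   # False
```

Note that on failure `compress_and_remove()` leaves a partial `.gz` file behind; a fuller script might delete it before stopping.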


In [None]:
# Lists
l=[1,2,3,4,"a", '"b"', "c"]
l

In [None]:
# Tuples
t=(1,2,3,4,5,6)
t

In [None]:
# Strings
s="Donald"
s

In [None]:
# Hash Example
coins={"penny":1, "nickle":5, "dime":10, "quarter":25}
coins

In [None]:
import random
random.randint(88, 100)

In [None]:
# Example of while statement
import random
count=0
while count<100:
    count=count+random.randint(0,10)
    print (count, end=' ')
    count=count-4
print ("here")
print ("\nall done")

In [None]:
# Example of for statement
for fruit in ['apple', 'orange','banana']:
    print (fruit, end=' ')
print ()
for i in range(5):
    print (i, end=' ')

In [None]:
list(range(1,5))

In [None]:
# Example of looping over dictionary
for denom, val in coins.items():
    print (denom, val)

In [None]:
# Example of function definition
def area(w, h):
    return w*h
    
print (area(4,4))

In [None]:
s="160212,1,A1,human,TAAGGCGA-TAGATCGC,None,N,Eland-rna,Mei,Jon_mix10"
s.split(',')[4][:-3]


In [None]:
# File Formatter example
import sys
fp=open('badfile.txt')
print (fp.readline().strip())
for l in fp:
   flds=l.strip().split(',')
   flds[4]=flds[4][:-3]
   print (','.join(flds))

In [None]:
# OS walk example
import os
for d, dirs, files in os.walk('d1'):
    print (d, dirs, files)

In [None]:
# Interval trees
from intervaltree import IntervalTree
it = IntervalTree()
it[4:7]='I1'
it[5:10]='I2'
it[1:11]='I3'

print (it[8])

In [None]:
import sys
from intervaltree import IntervalTree

print ("initializing")
genefinder={}
for line in open('knownGene.txt'):
    genename, chrm, strand, start, end = line.split()[0:5]
    if chrm not in genefinder:
        genefinder[chrm]=IntervalTree()
    genefinder[chrm][int(start):int(end)]=genename

print ("reading sequences")
for line in open('sample_hits.sam'):
    tag, flag, chrm, pos, mapq, cigar, rnext, \
        pnext, tlen, seq, qual = line.split()[0:11]
    genes=genefinder[chrm][int(pos):int(pos)+len(seq)]
    if genes:
        print (tag)
        for gene in genes:
            print ('\t', gene.data)


In [None]:
genefinder['chr22'][16242753]


In [None]:
d={0: 8633, 1: 951, 2: 1166, 3: 2085, 4: 1916, 5: 8518, 6: 10255, 7: 10697, 8: 55921, 9: 25955, 10: 44636, 11: 55644, 12: 56152, 13: 51422, 14: 36350, 15: 19657, 16: 11452, 17: 5670, 18: 4922, 19: 2292, 20: 1652, 21: 1411, 22: 650, 23: 744, 24: 459, 25: 226, 26: 322, 27: 109, 28: 26, 29: 37, 30: 45, 31: 10, 32: 8, 33: 3, 34: 4}
d
bins=sorted(d.keys())
vals=[d[k] for k in bins]
bins, vals
import pylib