# Research Programming in the Life Sciences
## Advanced Data Structures and Scientific Computing 

- David L. Bernick, PhD
- Biomolecular Engineering
- Baskin School of Engineering
- UCSC

# Homework
 
## Reading
 - Functions (and methods) - Model Ch 2. pp 24-29
 - Modules - Model Ch 2.  pp 34-41, 44
 - Namespaces - Model Ch 2. pp 21-22, 27, 34-37
 - Scipy documentation http://docs.scipy.org/doc/scipy/reference/stats.html
 
## Lab
 - Lab 5 is posted
 - Project Abstracts due on Monday
 - submit in “assignments” section of Canvas

# Overview
 - Quiz results
 - High throughput data
 - Advanced Data structures
 - Statistics and Scientific computing

# Quiz 4

 - Q3: functions must have at least one parameter?  
 - Q4: the same name used in different namespaces is never assigned to the same object
 - Q10: What is Cusp used for?

In [None]:
import matplotlib.pyplot as plt
x = list(range(4, 11))
y = [1,3,3,12,20,18,34]
plt.bar(x,y,0.2)
plt.xlabel('Scores', fontsize=18)
plt.title('Quiz 4 results',fontsize=18)
plt.figtext(.3,0.7,'mean=86%', fontsize=18, ha='center')
plt.show()

# High Throughput Data
 - Microarrays
 - Sequencers
     - Genome sequencing
     - RNA-seq
     - ChIP-seq
     - Metagenomics
 - SNP chips

# Microarrays: Overview
Multiplexed assay that quantifies the amounts of thousands of biomolecules present in a sample
 - measures multiple molecules at same time
 - functional view of biological activity within sample
 - cells, tissues (blood, skin, muscle), organs
 
Techniques developed for examining:
 - DNA
 - RNA 
 - proteins 
 - lipids
 

# Microarrays: Steps
 - Obtain microarray chip containing probe molecules
 - Extract mRNA from cells in sample
 - Convert mRNA to cDNA
 - Apply cDNA mixture to chip and allow to hybridize
 - Scan chip with multicolor laser
 - Quantify fluorescent intensity of colors

http://www.bio.davidson.edu/Courses/genomics/chip/chip.html


# Introduction to DNA Microarrays
![MicroArrays](Lecture10arrays.png)

# Motivation
 - Suppose you are studying a strain of heat resistant bacteria, and want to find the genes that are responsible for the heat resistance
 - You want to find the genes the bacteria express at higher levels when exposed to heat
 - You have two samples of bacteria, one exposed to heat and one not
 - You use a green (no heat) and red (heat) spotted array to measure the expression levels of the genes in the bacteria’s genome

# Hybridizing Microarrays
![Hybridizing](Lecture10Hybridizing.png)

# One Chip = 1 Array of Data
## Heat exposed vs. not heat exposed cells
![Spots to Data](Lecture10spot2data.png)

# Hybridizing Microarrays
![Heat and Growth](Lecture10heatMulti.png)

# DNA Microarrays
Arrange expression data across multiple experiments in a heat map.
![heat map](Lecture10heatMap.png)

# Current Generation of μArrays
Whole Genome Direct Hybridization (Illumina)
![bead array](Lecture10beadArray.png)

# Microarray Data File
 - The data from a microarray run will be stored in a tab-delimited file 
     - in Python, '\t' is a tab
![data file](Lecture10uArrayData.png)

# Loading Data into a 2D Array
![data matrix](Lecture10uDataMatrix.png)

In [None]:
# define input filename
fileName = 'arrayDataV1.txt'
# create empty list, represents rows
inputArray = []
# open file and go through each line, i.e. rows
with open(fileName) as fh:
    for line in fh:
        # split line into columns and append to array
        arrayCols = line.rstrip().split('\t')
        inputArray.append(arrayCols)

# Loading Data Challenge
 - Load data from a file that has a header line and row labels:
     - skip (or store) the header line
     - account for the row labels
     - make sure the data values are floats
![data matrix](Lecture10uDataMatrix.png)    

In [None]:
# define input filename
fileName = 'arrayDataV1.txt'
# create empty list, represents rows
inputArray = []
# open file and go through each line, i.e. rows
with open(fileName) as fh:
    for line in fh:
        # split line into columns and append to array
        arrayCols = line.rstrip().split('\t')
        inputArray.append(arrayCols)
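
One possible approach to the challenge, as a minimal sketch. It assumes the first line is a header, the first column holds row labels, and every remaining field parses as a float. The function name `loadArray` and the in-memory sample data are hypothetical, used so the example runs without `arrayDataV1.txt`.

```python
import io

def loadArray(fh, delimiter='\t'):
    """Return (header, rowLabels, data) from a tab-delimited file handle."""
    # first line holds the column names
    header = fh.readline().rstrip('\n').split(delimiter)
    rowLabels = []
    data = []
    for line in fh:
        if not line.strip():
            continue                      # skip blank lines
        cols = line.rstrip('\n').split(delimiter)
        rowLabels.append(cols[0])         # first column is the row label
        data.append([float(v) for v in cols[1:]])
    return header, rowLabels, data

# hypothetical sample standing in for the real data file
sample = io.StringIO('gene\texp1\texp2\ngeneA\t1.5\t2.0\ngeneB\t0.25\t4\n')
header, rowLabels, data = loadArray(sample)
print(header)
print(rowLabels)
print(data)
```

With a real file, pass the handle from `with open(fileName) as fh:` instead of the `StringIO` buffer.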

# 454 and SOLiD Sequencing
![454SOLiD](Lecture10454SOLiD.png)

# Illumina sequencing
![Illumina](Lecture10Illumina.png)
 - HiSeq 2500
 - 2 x 150bp reads
 - 3B paired reads
 - 11 day runtime
 - 90% bases > Q30


180M read-pairs ≈ $2500

# Sequencing Data

In [None]:
@DJB775P1:392:D1R59ACXX:3:1101:1122:2040 1:N:0:
NACATGGGCGACGAGCATCCGATCGACGAGTCAGCCATCGAAGCCGCAGCCGAACCAATCGATGGCGAGGCCCTCGCGNNTCTCTCGNNNNNCNNNNNNN
+
#1=DDDDFHGHHFHJJJIJJJJJJGIJJJICHIJIJHFHHFFFDDDBDBDDBB<<BDD>C@DDABDDDDDDDDBDDD#######################
@DJB775P1:392:D1R59ACXX:3:1101:1190:2041 1:N:0:
NAAAAACATGTAGCAGTTCGGCTCTGCTTGTGCAGACGCTTGCTACCTGCGAGTTCTCACTCCGGATTCAGTCTCCCGNNCTCAAAGNNACCGCCCCTTN
+
#1=BDFFFHHFHHJIJJIJJHJJIIJJJJJJIIJJJJJJGJJJIIJJEFHJIJADGGGHHGFHFFDCDEDDEDDDCDD##,5<?BDC##++8?@DDDBD#
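
Each record above spans four lines: an `@` header, the bases, a `+` separator, and one quality character per base. The following is a minimal sketch of a reader, assuming strictly four-line records and Phred+33 quality encoding (an assumption that fits the characters shown, but is not stated in the lecture). `readFastq` is a hypothetical name.

```python
import io

def readFastq(fh):
    """Yield (name, sequence, qualities) for each 4-line FASTQ record."""
    while True:
        header = fh.readline().rstrip()
        if not header:
            break                              # end of file (or blank line)
        seq = fh.readline().rstrip()
        fh.readline()                          # '+' separator line, ignored
        qualLine = fh.readline().rstrip()
        # Phred+33 (assumed): quality = ASCII code - 33
        quals = [ord(c) - 33 for c in qualLine]
        yield header[1:], seq, quals

# short made-up record, for illustration only
sample = io.StringIO('@read1\nNACG\n+\n#1=D\n')
for name, seq, quals in readFastq(sample):
    print(name, seq, quals)
```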

# RNA-seq
![RNA sequencing](Lecture10RNAseq.pdf)
 - Overview
     - fragment RNA
     - synthesize cDNA
     - add adapters
     - amplify

# RNA-seq analysis
 - each sample
     - map reads to features
     - count
     - normalize
 - replicates
     - mean and variance of normalized counts
     - t-test for significance
     - Bonferroni or FDR correction
     
http://docs.scipy.org/doc/
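
Two of the steps above, normalization and multiple-testing correction, are simple enough to sketch in plain Python. The example assumes counts-per-million as the normalization (the lecture does not say which method is used) and applies the Bonferroni correction by multiplying each p-value by the number of tests. The function names, counts and p-values are made up.

```python
def countsPerMillion(counts):
    """Scale raw read counts so that they sum to one million."""
    total = sum(counts)
    return [c * 1e6 / total for c in counts]

def bonferroni(pValues):
    """Multiply each p-value by the number of tests, capped at 1.0."""
    m = len(pValues)
    return [min(1.0, p * m) for p in pValues]

rawCounts = [150, 50, 300, 500]            # made-up counts for four features
cpm = countsPerMillion(rawCounts)
print(cpm)

pValues = [0.001, 0.02, 0.04, 0.5]         # made-up p-values, one per feature
adjusted = bonferroni(pValues)
print(adjusted)
```

Bonferroni is the most conservative choice; an FDR procedure would reject more features at the same threshold.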

# Advanced Data Structures
 - Lists of lists
 - Arrays (numpy)
 - Dictionaries of lists, tuples
 - Dictionaries of dictionaries

# Arrays in Numpy

In [None]:
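import numpy
# numpy.array(object)  - build an array from a list (or list of lists)
# numpy.loadtxt(fname) - read a text file of numbers into an array
data = numpy.array([[1.0, 2.0], [3.0, 4.0]])

# signature of loadtxt (most commonly used parameters):
# numpy.loadtxt(fname, dtype=float, comments='#', delimiter=None,
#               converters=None, skiprows=0, usecols=None, unpack=False)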
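
A small sketch of `numpy.loadtxt` on tab-delimited text with a header line and a label column. It assumes NumPy is installed; the sample data are made up and read from an in-memory buffer rather than a file.

```python
import io
import numpy

# made-up data: one header line, then a label column and two numeric columns
sample = io.StringIO('gene\texp1\texp2\ngeneA\t1.5\t2.0\ngeneB\t0.25\t4.0\n')

# skip the header line and read only the numeric columns (1 and 2)
data = numpy.loadtxt(sample, delimiter='\t', skiprows=1, usecols=(1, 2))

print(data.shape)          # (rows, columns)
print(data[1, 0])          # row 1, column 0
print(data.mean(axis=0))   # column means
```

Skipping the label column with `usecols` keeps the array purely numeric; the labels would need to be read separately.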

# Lists
We have seen lists and tuples

## Example:

In [None]:
myTuple = ( 1, 2, 3 )  # tuple (immutable)
myList = [ 1, 2, 3 ]   # list (mutable)
# list modification
myList.append (4)
print (myList)         # [1, 2, 3, 4]
myList [2] = 17
print (myList)         # [1, 2, 17, 4]

# Lists
 - Lists are generalized containers of objects
 - We can make a list of lists 

In [None]:
theScoreList = [  ['Henry', 82, 91, 88 ],
                  ['Danilo', 81, 92, 87 ],
                  ['Paloma', 81, 92, 87 ]  ]
theScoreList.append ( [ 'David', 47, 53, 21 ] )

we can index the list using:

In [None]:
print (theScoreList [3][0])

# List Comprehension
This is a Python shorthand for generating a list

In [None]:
seq = 'ACGTACGTACGTACGT'
codonList = [ seq[p:p+3] for p in range(0, len(seq), 3) ]
print (codonList)
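
The comprehension above ends with a partial codon (`'T'`) because the sequence length is not a multiple of three. Here is a minimal sketch that keeps only complete codons and adds a starting offset for the reading frame; `frame` and `byFrame` are names introduced here for illustration.

```python
seq = 'ACGTACGTACGTACGT'
frame = 0    # offset of the first codon: 0, 1, or 2

# stopping at len(seq) - 2 guarantees every slice has three bases
codonList = [seq[p:p+3] for p in range(frame, len(seq) - 2, 3)]
print(codonList)

# one list per frame, built with a nested comprehension
byFrame = [[seq[p:p+3] for p in range(f, len(seq) - 2, 3)] for f in range(3)]
print(byFrame[1])
```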

# Dictionary of lists

In [5]:
theScores = { 'Henry' : [82, 91, 88 ],
              'Danilo' :   [89, 90, 89 ],
              'Paloma' :   [81, 92, 87 ]  }
theScores ['David'] = [ 47, 53, 21 ] 

and index with:

In [None]:
print (theScores ['David'])
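
Because each value is a list, the usual list operations apply after the lookup. A short sketch that computes each student's average and picks the top scorer, using the scores above (`averages` and `best` are names introduced here):

```python
theScores = {'Henry':  [82, 91, 88],
             'Danilo': [89, 90, 89],
             'Paloma': [81, 92, 87]}
theScores['David'] = [47, 53, 21]

# dictionary comprehension: name -> mean of that student's scores
averages = {name: sum(scores) / len(scores)
            for name, scores in theScores.items()}

for name in sorted(averages):
    print(name, round(averages[name], 1))

# max() with a key function returns the name with the largest average
best = max(averages, key=averages.get)
print(best)
```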

# List of lists
List of codons, organized by frame

Note:  this is a list of lists

In [2]:
frame = 1
codonList = [ [], [], [] ]                   # initialize
codonList [frame] = [ 23, 35, 46 ] 
codonList [frame].append (57)
codonList [frame] = []
# Sorting lists using fields
records =   [  ['Paloma', 82, 91, 88 ],
               ['Henry', 89, 100, 87 ] ,
               ['Danilo', 81, 92, 87 ] ,
               ['David', 47, 53, 21] ]

records.sort (key= lambda entry: sum(entry[1:4]), reverse=True)
for record in records:
    print (record)


['Henry', 89, 100, 87]
['Paloma', 82, 91, 88]
['Danilo', 81, 92, 87]
['David', 47, 53, 21]
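
`list.sort` reorders the list in place. A small sketch of the alternative, `sorted`, which returns a new list and leaves the original untouched; here the key is a single field (the second score) rather than the sum. `bySecondScore` is a name introduced for this example.

```python
from operator import itemgetter

records = [['Paloma', 82, 91, 88],
           ['Henry', 89, 100, 87],
           ['Danilo', 81, 92, 87],
           ['David', 47, 53, 21]]

# itemgetter(2) is equivalent to lambda entry: entry[2]
bySecondScore = sorted(records, key=itemgetter(2), reverse=True)
for record in bySecondScore:
    print(record[0], record[2])

print(records[0][0])   # the original order is unchanged
```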


# Dictionary of dictionaries
3 level dictionary

In [1]:
codonTable = {'G':
              {'G':
                 {'G': 'Gly', 'A': 'Gly', 'C': 'Gly', 'U': 'Gly'},
              'A':
                 {'G': 'Glu', 'A': 'Glu', 'C': 'Asp', 'U': 'Asp'},
              'C':
                 {'G': 'Ala', 'A': 'Ala', 'C': 'Ala', 'U': 'Ala'},
              'U':
                 {'G': 'Val', 'A': 'Val', 'C': 'Val', 'U': 'Val'}
              },
            'A':
              {'G':
                 {'G': 'Arg', 'A': 'Arg', 'C': 'Ser', 'U': 'Ser'},
              'A':
                 {'G': 'Lys', 'A': 'Lys', 'C': 'Asn', 'U': 'Asn'},
              'C':
                 {'G': 'Thr', 'A': 'Thr', 'C': 'Thr', 'U': 'Thr'},
              'U':
                 {'G': 'Met', 'A': 'Ile', 'C': 'Ile', 'U': 'Ile'}
               },
            'C':
              {'G':
                 {'G': 'Arg', 'A': 'Arg', 'C': 'Arg', 'U': 'Arg'},
              'A':
                 {'G': 'Gln', 'A': 'Gln', 'C': 'His', 'U': 'His'},
              'C':
                 {'G': 'Pro', 'A': 'Pro', 'C': 'Pro', 'U': 'Pro'},
              'U':
                 {'G': 'Leu', 'A': 'Leu', 'C': 'Leu', 'U': 'Leu'}},
            'U':
              {'G':
                 {'G': 'Trp', 'A': '---', 'C': 'Cys', 'U': 'Cys'},
              'A':
                 {'G': '---', 'A': '---', 'C': 'Tyr', 'U': 'Tyr'},
              'C':
                 {'G': 'Ser', 'A': 'Ser', 'C': 'Ser', 'U': 'Ser'},
              'U':
                 {'G': 'Leu', 'A': 'Leu', 'C': 'Phe', 'U': 'Phe'}}
            }
print (codonTable['G']['G']['G'])

Gly
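
A three-level table like this can translate a sequence one codon at a time. The sketch below uses a deliberately tiny, hypothetical subset of the table so that it stays self-contained; `miniTable` and `translate` are names introduced here, and the full `codonTable` above could be passed in place of `miniTable`.

```python
# tiny subset of the table above, same three-level layout
miniTable = {'A': {'U': {'G': 'Met'}},
             'G': {'G': {'U': 'Gly', 'C': 'Gly'}},
             'U': {'A': {'A': '---'}}}

def translate(rna, table):
    """Translate an RNA string codon by codon; unknown codons give '???'."""
    peptide = []
    for p in range(0, len(rna) - 2, 3):
        first, second, third = rna[p:p+3]
        try:
            peptide.append(table[first][second][third])
        except KeyError:
            peptide.append('???')          # codon not present in the table
    return peptide

print(translate('AUGGGUGGCUAA', miniTable))
print(translate('AUGCCC', miniTable))
```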


# Statistics Outline
 - Loose Definition:
     - Set of mathematical operations/calculations used to describe and interpret data
 - Where/How Used:
     - Design of experiments and surveys
     - Modeling or predicting trends/outcomes using data
 - Types:
     - Descriptive – summarize data
     - Inferential – model or predict based on data

# Describing Central Tendency
 - Arithmetic Mean $$\mu = \frac{1}{n}\sum_{i=0}^{n-1}x_{i}$$
 

In [None]:
mean = sum(x) / len(x)

 - Geometric Mean $$\mu = \left(\prod_{i=0}^{n-1}x_{i}\right)^\frac{1}{n}$$
 

In [None]:
def prod(L):
    p = 1
    for i in L:
        p *= i
    return p
mean = prod(x) ** (1/len(x))

 - Median
     - Central value in a dataset
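
For the geometric mean, a running product can overflow or underflow on long lists. A common alternative, sketched here under the assumption that every value is positive, averages the logarithms instead. `geometricMean` is a hypothetical name; `math.prod` (Python 3.8+) is shown for comparison.

```python
import math

def geometricMean(values):
    """Geometric mean via the mean of logs; all values must be > 0."""
    logSum = sum(math.log(v) for v in values)
    return math.exp(logSum / len(values))

x = [1, 2, 4, 8]
print(geometricMean(x))                       # log-based version
print(math.prod(x) ** (1 / len(x)))           # direct product, for comparison
```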

# Median
The value of the middle number

To identify the median, you need to have a sorted list of numbers

Does the following code work in general?

In [None]:
x = [5, 4, 3, 2, 1]
x.sort()
mid = len(x)//2
median = x[mid]

In [None]:
x = [5, 4, 3, 2, 1]
x.sort()
mid = len(x)//2
if len(x) % 2 == 0:
    median = (x[mid] + x[mid-1]) / 2.
else:
    median = x[mid]
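
One way to answer the question above is to compare against `statistics.median` from the standard library on both an odd-length and an even-length list. `median` below is a hypothetical wrapper around the same logic, using integer division so that the index is an int.

```python
import statistics

def median(values):
    """Median of a list; averages the two middle values when n is even."""
    ordered = sorted(values)          # sorted() leaves the input unchanged
    mid = len(ordered) // 2
    if len(ordered) % 2 == 0:
        return (ordered[mid] + ordered[mid - 1]) / 2
    return ordered[mid]

for data in ([5, 4, 3, 2, 1], [6, 5, 4, 3, 2, 1]):
    print(median(data), statistics.median(data))
```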

# Describing Spread of Data
 - Variance
$$ \sigma^{2} = \frac{1}{n-1}\sum_{i=0}^{n-1}(x_{i}-\mu)^{2} $$
 - Standard Deviation
$$ \sigma = \sqrt{\sigma^{2}} $$
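
The two formulas translate directly into Python. This sketch uses the $n-1$ denominator from the variance formula above and checks the result against `statistics.variance` and `statistics.stdev`, which use the same denominator. The function name `variance` and the data values are made up.

```python
import math
import statistics

def variance(values):
    """Sample variance with the n-1 denominator."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / (len(values) - 1)

x = [2, 4, 4, 4, 5, 5, 7, 9]            # made-up data
var = variance(x)
std = math.sqrt(var)
print(var, statistics.variance(x))
print(std, statistics.stdev(x))
```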

# T-statistic
$$ t = \frac{\mu-\mu_{0}}{\sigma/\sqrt{n}} $$
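
A minimal sketch of the one-sample t statistic, assuming $\sigma$ is the sample standard deviation (the $n-1$ form) and $\mu_{0}$ is the hypothesized mean. The function name `tStatistic` and the data are made up. For a p-value, `scipy.stats.ttest_1samp` performs the same test.

```python
import math
import statistics

def tStatistic(values, mu0):
    """One-sample t statistic: (mean - mu0) / (stdev / sqrt(n))."""
    n = len(values)
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)     # sample standard deviation (n-1)
    return (mu - mu0) / (sigma / math.sqrt(n))

x = [5.1, 4.9, 5.6, 5.8, 6.0]            # made-up measurements
t = tStatistic(x, 5.0)
print(t)
```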

# Test for Mean Difference

# SciComp Modules in Python
 - Scipy - scientific python
 - Numpy - numeric python

In [None]:
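import numpy
from scipy import stats
# mean, median and std come from numpy
# (the scipy.mean / scipy.median / scipy.std aliases are deprecated)
a = numpy.array([1.0, 2.0, 3.0, 4.0])   # example data
b = numpy.array([2.0, 3.0, 4.0, 6.0])
numpy.mean(a)
numpy.median(a)
numpy.std(a, ddof=1)    # ddof=1 gives the n-1 (sample) form
stats.ttest_ind(a, b)   # two-sample t-test
stats.chisquare([16, 18, 16, 14, 12, 12])
...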