# Manipulating Files Cont.

Today we'll be working with a series of simulated datasets! The `Scripts/` directory shows how I generated these files.


# Outline:
## Review reading files

 - How to ignore lines we don't want using `continue`
   - FASTA
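
A minimal sketch of the `continue` pattern. To keep it self-contained, the FASTA content is a made-up list of strings (`fasta_lines` is a name invented here) rather than a real file:

```python
# Hypothetical FASTA content; in class these lines would come from an opened file
fasta_lines = [
    '>seq1\n',
    'ATGCGT\n',
    '>seq2\n',
    'TTGACA\n',
]

sequences = []
for line in fasta_lines:
    line = line.rstrip()
    if line.startswith('>'):
        # Header line: skip the rest of the loop body and go to the next line
        continue
    # Only sequence lines reach this point
    sequences.append(line)

print(sequences)
```
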


## Reading different common filetypes with `.split(delim)`
 - BED files (tab-separated)
   - Are binding sites in `regions.bed` the same size as non-binding sites?
   - Do binding sites tend to open, close, or stay the same?
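
As a quick sketch of what `.split(delim)` does to one tab-separated line: the line below is made up, and it assumes the five-column layout (chromosome, start, end, behavior, bound) used in the script later in these notes.

```python
# One made-up line in a five-column, tab-separated layout
line = 'chr2L\t100\t250\topening\tbound\n'

# rstrip() removes the trailing newline; split('\t') cuts the line at each tab
fields = line.rstrip().split('\t')
print(fields)

# Every field is still a string, so convert coordinates before doing arithmetic
width = int(fields[2]) - int(fields[1])
print(width)
```
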
 

## Using dictionaries to key values

# Break the script below into steps to practice reading a file
## Exercise:
 - Are bound peaks larger or smaller than unbound peaks?
 - Do bound and unbound peaks tend to open, close, or stay static?

### Questions:
1. Do we need to ignore any lines?
1. What's a good way to make sense of each column? **Use a dictionary**
1. How can we split up peaks into two groups?

In [150]:
bed = open('Data/chip-seq.bed', 'r')
entries = bed.readlines()
bed.close()

# Column names assigned to 'col' dictionary to make indexing more human-readable
col = {'chr':0, 'start':1, 'end':2, 'behavior':3, 'bound':4}

bedList = []
for line in entries:
    peak = line.rstrip().split('\t') 
    bedList.append(peak)

boundPeaks = []
unboundPeaks = []
for peak in bedList:
    if peak[col['bound']] == 'bound':
        boundPeaks.append(peak)
    elif peak[col['bound']] == 'unbound':
        unboundPeaks.append(peak)
    else:
        print(peak)
        print('ERROR! No peak Binding data') 
        

boundWidth = []
for bound in boundPeaks:
    start = int(bound[col['start']])
    end = int(bound[col['end']])
    width = end - start
    boundWidth.append(width)
    
unboundWidth = []
for unbound in unboundPeaks:
    start = int(unbound[col['start']])
    end = int(unbound[col['end']])
    width = end - start
    unboundWidth.append(width)
    
meanBound = sum(boundWidth)/len(boundWidth)
meanUnbound = sum(unboundWidth)/len(unboundWidth)

print('Average size of bound peaks: {0} bp'.format(round(meanBound)))
print('Average size of unbound peaks: {0} bp'.format(round(meanUnbound)))

# What types of behavior do bound peaks have?
peakOpen = 0 
peakClose = 0 
peakStatic = 0 
for peak in boundPeaks:
    behavior = peak[col['behavior']]
    if behavior == 'opening':
        peakOpen += 1
    elif behavior == 'closing':
        peakClose += 1
    elif behavior == 'static':
        peakStatic += 1
    else:
        print(peak)
        print('ERROR! unknown behavior')

print('\nBound Peaks:')
print('Opening: {0}'.format(peakOpen))              
print('Closing: {0}'.format(peakClose))              
print('Static: {0}'.format(peakStatic))              
        
    
peakOpen = 0 
peakClose = 0 
peakStatic = 0 
for peak in unboundPeaks:
    behavior = peak[col['behavior']]
    if behavior == 'opening':
        peakOpen += 1
    elif behavior == 'closing':
        peakClose += 1
    elif behavior == 'static':
        peakStatic += 1
    else:
        print(peak)
        print('ERROR! unknown behavior')

print('\nUnbound Peaks:')
print('Opening: {0}'.format(peakOpen))              
print('Closing: {0}'.format(peakClose))              
print('Static: {0}'.format(peakStatic))              

Average size of bound peaks: 60 bp
Average size of unbound peaks: 767 bp

Bound Peaks:
Opening: 107
Closing: 23
Static: 16

Unbound Peaks:
Opening: 88
Closing: 55
Static: 211


# Functions

In the script above, we wind up repeating a lot of code.
We can fix this by writing **functions**: small, named blocks of code that can be run on many different inputs.


### Simple example of how functions work
# NEED EXAMPLE
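
One possible minimal example (the function name `getWidth` and the numbers are made up for illustration): a function is defined once with `def`, takes its inputs as arguments, and hands a result back with `return`.

```python
def getWidth(start, end):
    # Return the width of a region given its start and end coordinates
    return end - start

# The same function body runs on whatever inputs we pass in
firstWidth = getWidth(100, 250)
secondWidth = getWidth(5, 65)

print(firstWidth)
print(secondWidth)
```

Nothing inside the function runs until it is called, and each call gets its own `start` and `end`.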

## Let's write a function to simplify this part of the script

In [149]:
boundWidth = []
for bound in boundPeaks:
    start = int(bound[col['start']])
    end = int(bound[col['end']])
    width = end - start
    boundWidth.append(width)
    
unboundWidth = []
for unbound in unboundPeaks:
    start = int(unbound[col['start']])
    end = int(unbound[col['end']])
    width = end - start
    unboundWidth.append(width)
    
meanBound = sum(boundWidth)/len(boundWidth)
meanUnbound = sum(unboundWidth)/len(unboundWidth)

print('Average size of bound peaks: {0} bp'.format(round(meanBound)))
print('Average size of unbound peaks: {0} bp'.format(round(meanUnbound)))

Average size of bound peaks: 60 bp
Average size of unbound peaks: 767 bp


### Some things to think about:
1. What output do we want?
 - What steps do we need to take to get the output?
1. What are the parts that are reused?
 - How can we generalize them?

In [148]:
# Scratch space for making script in class

boundWidth = []
for bound in boundPeaks:
    start = int(bound[col['start']])
    end = int(bound[col['end']])
    width = end - start
    boundWidth.append(width)
    
unboundWidth = []
for unbound in unboundPeaks:
    start = int(unbound[col['start']])
    end = int(unbound[col['end']])
    width = end - start
    unboundWidth.append(width)
    
meanBound = sum(boundWidth)/len(boundPeaks)
meanUnbound = sum(unboundWidth)/len(unboundPeaks)

print('Average size of bound peaks: {0} bp'.format(round(meanBound)))
print('Average size of unbound peaks: {0} bp'.format(round(meanUnbound)))

Average size of bound peaks: 60 bp
Average size of unbound peaks: 767 bp


...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

In [147]:
def getMeanWidths(peaks, colNames):
    widths = []
    for peak in peaks:
        start = int(peak[colNames['start']])
        end = int(peak[colNames['end']])
        width = end - start
        widths.append(width)
        
    meanWidth = sum(widths) / len(widths)
    return meanWidth
        
    
meanBound = getMeanWidths(boundPeaks, col) 
meanUnbound = getMeanWidths(unboundPeaks, col)

print('Average size of bound peaks: {0} bp'.format(round(meanBound)))
print('Average size of unbound peaks: {0} bp'.format(round(meanUnbound)))

Average size of bound peaks: 60 bp
Average size of unbound peaks: 767 bp


## Exercise:
### Write a function that will simplify this part of the script

In [140]:
peakOpen = 0 
peakClose = 0 
peakStatic = 0 
for peak in boundPeaks:
    behavior = peak[col['behavior']]
    if behavior == 'opening':
        peakOpen += 1
    elif behavior == 'closing':
        peakClose += 1
    elif behavior == 'static':
        peakStatic += 1
    else:
        print(peak)
        print('ERROR! unknown behavior')

print('\nBound Peaks:')
print('Opening: {0}'.format(peakOpen))              
print('Closing: {0}'.format(peakClose))              
print('Static: {0}'.format(peakStatic))              
        
    
peakOpen = 0 
peakClose = 0 
peakStatic = 0 
for peak in unboundPeaks:
    behavior = peak[col['behavior']]
    if behavior == 'opening':
        peakOpen += 1
    elif behavior == 'closing':
        peakClose += 1
    elif behavior == 'static':
        peakStatic += 1
    else:
        print(peak)
        print('ERROR! unknown behavior')

print('\nUnbound Peaks:')
print('Opening: {0}'.format(peakOpen))              
print('Closing: {0}'.format(peakClose))              
print('Static: {0}'.format(peakStatic))              


Bound Peaks:
Opening: 107
Closing: 23
Static: 16

Unbound Peaks:
Opening: 88
Closing: 55
Static: 211


In [33]:
def countPeakBehavior(peakList, colNames):
    # colNames is passed in (as in getMeanWidths) so the function
    # does not depend on the global variable 'col'
    peakOpen = 0
    peakClose = 0
    peakStatic = 0

    for peak in peakList:
        behavior = peak[colNames['behavior']]
        if behavior == 'opening':
            peakOpen += 1
        elif behavior == 'closing':
            peakClose += 1
        elif behavior == 'static':
            peakStatic += 1
        else:
            print(peak)
            print('ERROR! unknown behavior')

    output = {'opening':peakOpen, 'closing':peakClose, 'static':peakStatic}
    return output

boundBehaviorCounts = countPeakBehavior(boundPeaks, col)
unboundBehaviorCounts = countPeakBehavior(unboundPeaks, col)

names = ['Bound', 'Unbound']
counts = [boundBehaviorCounts, unboundBehaviorCounts]

for name, count in zip(names, counts):
    print('\n{0} Peaks:'.format(name))
    print('Opening: {0}'.format(count['opening']))
    print('Closing: {0}'.format(count['closing']))
    print('Static: {0}'.format(count['static']))



Bound Peaks:
Opening: 107
Closing: 23
Static: 16

Unbound Peaks:
Opening: 88
Closing: 55
Static: 211


# Dictionaries are awesome

In [34]:
L1 = [1,2,3]
L2 = [4,5,6]
L3 = [7,8,9]
d = {'one':L1, 'two':L2, 'three':L3}

print(d['one'])

[1, 2, 3]
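
A dictionary can also hold the groups themselves. As a sketch (the peaks here are made up, and `examplePeaks` and `peaksByState` are names invented for this illustration), the bound and unbound lists from earlier could live in one dictionary keyed by binding state:

```python
# Column names, as in the script earlier in these notes
col = {'chr':0, 'start':1, 'end':2, 'behavior':3, 'bound':4}

# Made-up peaks in that column order
examplePeaks = [
    ['chr2L', '100', '250', 'opening', 'bound'],
    ['chr2L', '400', '900', 'static', 'unbound'],
    ['chr3R', '50', '120', 'closing', 'bound'],
]

# One dictionary holds both groups, keyed by the value in the 'bound' column
peaksByState = {'bound': [], 'unbound': []}
for peak in examplePeaks:
    state = peak[col['bound']]
    peaksByState[state].append(peak)

for state in peaksByState:
    print('{0}: {1} peak(s)'.format(state, len(peaksByState[state])))
```
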


# Make PWM 
## Exercise? Homework question?
The script below makes a position weight matrix (PWM) from equal-length sequences in a FASTA file:

<http://weblogo.berkeley.edu/logo.cgi>

In [3]:
sequences = []
with open('Data/PWM.fa', 'r') as infile:
    for line in infile:
        entry = line.rstrip().split()[0]
        if entry[0] == '>':
            # Skip header lines
            continue
        else:
            # Save lines containing DNA sequence
            sequences.append(entry)
            
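# min() and max() on strings compare alphabetically, not by length,
# so compare the lengths themselves
lengths = [len(seq) for seq in sequences]
if min(lengths) != max(lengths):
    # Stop here: the counting below assumes every sequence has the same length
    raise ValueError('unequal sequence length!')
sequence_length = lengths[0]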
       
# Create lists which will track the number of times each base occurs at a given position
A = []            
T = []            
G = []            
C = []            

for i in range(0, sequence_length):
    A.append(0)
    T.append(0)
    G.append(0)
    C.append(0)
 
# Count number of times each basepair appears at each position
for seq in sequences:
    sequence = seq.split()[0]
    
    for position in range(0, len(sequence)):
        base = sequence[position]  
        
        if base == "A":
            A[position] += 1 
        elif base == "T":
            T[position] += 1 
        elif base == "G":
            G[position] += 1 
        elif base == "C":
            C[position] += 1 
        else:
            print("Unknown base at position: {0}".format(position))
            
baseOrder = ['A', 'T', 'G', 'C']
bpCount = [A, T, G, C]

# Calculate frequency of each nucleotide at a given position.
# Building new lists keeps the raw counts in A, T, G, C intact;
# bpCount[:] would copy only the outer list, not the count lists inside it.
pwm = []
for counts in bpCount:
    freqs = []
    for count in counts:
        freqs.append(count / len(sequences))
    pwm.append(freqs)

# Simple way to print output:
for base, bp in zip(baseOrder, pwm):
    print('{0}: {1}'.format(base, bp))

A: [0.7, 0.04, 0.09, 0.74, 0.05, 0.13]
T: [0.3, 0.11, 0.79, 0.13, 0.13, 0.87]
G: [0.0, 0.76, 0.05, 0.07, 0.14, 0.0]
C: [0.0, 0.09, 0.07, 0.06, 0.68, 0.0]
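
The same counting can be written with a dictionary instead of four separate lists. This is only a sketch on made-up sequences (`exampleSeqs`, `counts`, and `freqs` are names invented here), not a replacement for the script above:

```python
# Made-up, equal-length sequences standing in for the contents of a FASTA file
exampleSeqs = ['ATGC', 'ATGA', 'TTGC', 'ATCC']
seqLength = len(exampleSeqs[0])

# One list of per-position counts for each base, all stored in a single dictionary
counts = {}
for base in 'ATGC':
    counts[base] = [0] * seqLength

# Count the number of times each base appears at each position
for seq in exampleSeqs:
    for position, base in enumerate(seq):
        if base in counts:
            counts[base][position] += 1
        else:
            print('Unknown base at position: {0}'.format(position))

# Divide each count by the number of sequences to get a frequency
freqs = {}
for base in counts:
    freqs[base] = [count / len(exampleSeqs) for count in counts[base]]

for base in 'ATGC':
    print('{0}: {1}'.format(base, freqs[base]))
```

Looking up `counts[base]` replaces the `if`/`elif` chain, so supporting another symbol only means adding one more key.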


# Outside of Class
# Regular Expressions

# ?? 