In [1]:
%matplotlib inline

In [2]:
import pandas as pd
import os
from functools import reduce
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import pearsonr
from scipy import stats
from scipy.stats import ks_2samp
from scipy.stats import entropy

## Datafiles
There are 96 separate datafiles, each representing the processed output of an RNAseq experiment looking at mRNA levels in wild-type (48 samples) or SNF2 deletion (48 samples) yeast cells grown under standard conditions.

## Reading individual files with Pandas

The individual files are in the directory "data/rawdata/" and are tab-delimited text files. These are basically raw spreadsheets stored as text, with individual lines separated by a **'\n'**, which indicates 'new line', and columns within each line separated by a **'\t'** or 'tab' character. (We will also deal with files where the columns are separated by commas.)

Like any good program for working with spreadsheet data, Pandas knows how to read this kind of file if we tell it what to look for. To do this we use its [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function, specifying first the filepath, and then a few details about this file.  

**sep="\t"** tells Pandas that the data separator is a tab character (as opposed to other common separators like comma or space)  
**header=None** tells Pandas that the first line of the file is not a header (by default it treats the first line as the column names)  
**names = ..** because there is no header, we tell Pandas what to call the columns
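
To get a feel for these three options without touching the real files, here is a small sketch that reads a made-up three-line "file" held in memory. The `io.StringIO` wrapper and the variable names `raw_text`, `toy` and `wrong` are just for illustration; the text layout is assumed to match the two-column, tab-separated format described above.

```python
import io
import pandas as pd

# Hypothetical three-line file in the same two-column, tab-separated layout.
raw_text = "15S_rRNA\t29\n21S_rRNA\t221\nHRA1\t0\n"

# sep, header and names are used exactly as described above.
toy = pd.read_csv(io.StringIO(raw_text), sep="\t", header=None,
                  names=["Gene", "Expression"])
print(toy)

# Without header=None, pandas would consume the first data row as column names.
wrong = pd.read_csv(io.StringIO(raw_text), sep="\t")
print(list(wrong.columns))
print(len(wrong), "rows instead of", len(toy))
```

The second read silently loses a gene, which is why it is worth checking the row count after loading a headerless file.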

In [3]:
pd.read_csv("data/rawdata/WT_rep16_MID92_allLanes_tophat2.0.5.bam.gbgout",sep="\t",header=None,names=["Gene","Expression"])

Unnamed: 0,Gene,Expression
0,15S_rRNA,29
1,21S_rRNA,221
2,HRA1,0
3,ICR1,70
4,LSR1,314
...,...,...
7126,no_feature,414503
7127,ambiguous,1387186
7128,too_low_aQual,0
7129,not_aligned,0


If we just run that command, Pandas loads in the file and displays it. But we want to store the data, so we assign the data to a variable. 

In [4]:
ge = pd.read_csv("data/rawdata/WT_rep16_MID92_allLanes_tophat2.0.5.bam.gbgout",sep="\t",header=None,names=["Gene","Expression"])

Now we can look at the file in different ways. The **head(n)** command shows us the first n rows.

In [5]:
ge.head(10)

Unnamed: 0,Gene,Expression
0,15S_rRNA,29
1,21S_rRNA,221
2,HRA1,0
3,ICR1,70
4,LSR1,314
5,NME1,33
6,PWR1,0
7,Q0010,0
8,Q0017,0
9,Q0032,0


Similarly the **tail(n)** command shows us the last n rows.

In [6]:
ge.tail(10)

Unnamed: 0,Gene,Expression
7121,tY(GUA)J2,0
7122,tY(GUA)M1,0
7123,tY(GUA)M2,0
7124,tY(GUA)O,0
7125,tY(GUA)Q,0
7126,no_feature,414503
7127,ambiguous,1387186
7128,too_low_aQual,0
7129,not_aligned,0
7130,alignment_not_unique,0


You can already see one (typical) problem, which is that the file is not just data. It also has some summary rows that we don't care about, so we have to filter the rows in some manner. Here we use a bit of knowledge about gene nomenclature in yeast.

Most of the genes have this naming format:

Systematic names for nuclear-encoded ORFs begin with the letter 'Y' (for 'Yeast'); the second letter denotes the chromosome number ('A' is chr I, 'B' is chr II, etc.); the third letter is either 'L' or 'R' for left or right chromosome arm; next is a three digit number indicating the order of the ORFs on that arm of a chromosome starting from the centromere, irrespective of strand; finally, there is an additional letter indicating the strand, either 'W' for Watson (the strand with 5' end at the left telomere) or 'C' for Crick (the complement strand, 5' end is at the right telomere).

You can read about the yeast gene naming system here - http://seq.yeastgenome.org/help/community/nomenclature-conventions.
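
The naming rule above can be written down as a regular expression. This is only a sketch under some assumptions: chromosome letters are taken to run from 'A' to 'P' (16 nuclear chromosomes), and an optional suffix such as "-A" is allowed because names like YPR204C-A appear in the data. The function name `parse_orf` is made up for this example.

```python
import re

# Assumed pattern: Y + chromosome letter A-P + arm L/R + three digits + strand W/C,
# optionally followed by a suffix such as "-A" (seen in names like YPR204C-A).
ORF_PATTERN = re.compile(r"^Y([A-P])([LR])(\d{3})([WC])(-[A-Z])?$")

def parse_orf(name):
    """Return a dict describing a systematic ORF name, or None if it does not match."""
    m = ORF_PATTERN.match(name)
    if m is None:
        return None
    chrom_letter, arm, number, strand, suffix = m.groups()
    return {
        "chromosome": ord(chrom_letter) - ord("A") + 1,  # 'A' -> 1, 'B' -> 2, ...
        "arm": arm,
        "position": int(number),
        "strand": "Watson" if strand == "W" else "Crick",
        "suffix": suffix,
    }

for name in ["YAL001C", "YPR204C-A", "15S_rRNA", "no_feature"]:
    print(name, parse_orf(name))
```

A pattern like this is stricter than simply checking the first letter, so it may be a useful cross-check on any simpler filter.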

We can look for rows that match this style by using the built in filtering system in Pandas. This specific command asks Pandas to return all rows where the Gene column starts with the character "Y"

In [8]:
ge[ge.Gene.str.startswith("Y")]

Unnamed: 0,Gene,Expression
61,YAL001C,232
62,YAL002W,245
63,YAL003W,8073
64,YAL004W,0
65,YAL005C,7471
...,...,...
6741,YPR201W,45
6742,YPR202W,22
6743,YPR203W,18
6744,YPR204C-A,0


That accounts for 6685 of the 7130 rows. But what about the rest? We can start by reversing that filter to ask for rows that do not start with "Y". The **~** symbol in front of the filter does the reversing.

In [9]:
ge[~ge.Gene.str.startswith("Y")]

Unnamed: 0,Gene,Expression
0,15S_rRNA,29
1,21S_rRNA,221
2,HRA1,0
3,ICR1,70
4,LSR1,314
...,...,...
7126,no_feature,414503
7127,ambiguous,1387186
7128,too_low_aQual,0
7129,not_aligned,0
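
To see what the filter and its reverse are doing, it can help to look at the intermediate True/False values on a tiny table. The five rows below are a toy example (the snR10 count is made up), not the real data.

```python
import pandas as pd

toy = pd.DataFrame({
    "Gene": ["15S_rRNA", "YAL001C", "YAL002W", "snR10", "no_feature"],
    "Expression": [29, 232, 245, 7, 414503],
})

# str.startswith returns one True/False value per row (a boolean mask).
mask = toy.Gene.str.startswith("Y")
print(mask.tolist())

# Indexing with the mask keeps the True rows; ~mask flips every value.
print(toy[mask].Gene.tolist())
print(toy[~mask].Gene.tolist())

# Summing a mask counts the True values, i.e. the number of matching rows.
print(mask.sum(), "of", len(toy), "rows start with 'Y'")
```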


This is going to look a bit messy, but we can ask Pandas to return all of the Gene names for these remaining rows, just so we can get a sense of what we're dealing with.

In [10]:
ge[~ge.Gene.str.startswith("Y")].Gene.values

array(['15S_rRNA', '21S_rRNA', 'HRA1', 'ICR1', 'LSR1', 'NME1', 'PWR1',
       'Q0010', 'Q0017', 'Q0032', 'Q0045', 'Q0050', 'Q0055', 'Q0060',
       'Q0065', 'Q0070', 'Q0075', 'Q0080', 'Q0085', 'Q0092', 'Q0105',
       'Q0110', 'Q0115', 'Q0120', 'Q0130', 'Q0140', 'Q0142', 'Q0143',
       'Q0144', 'Q0160', 'Q0182', 'Q0250', 'Q0255', 'Q0275', 'Q0297',
       'RDN18-1', 'RDN18-2', 'RDN25-1', 'RDN25-2', 'RDN37-1', 'RDN37-2',
       'RDN5-1', 'RDN5-2', 'RDN5-3', 'RDN5-4', 'RDN5-5', 'RDN5-6',
       'RDN58-1', 'RDN58-2', 'RNA170', 'RPM1', 'RPR1', 'RUF20', 'RUF21',
       'RUF22', 'RUF23', 'RUF5-1', 'RUF5-2', 'SCR1', 'SRG1', 'TLC1',
       'snR10', 'snR11', 'snR128', 'snR13', 'snR14', 'snR161', 'snR17a',
       'snR17b', 'snR18', 'snR189', 'snR19', 'snR190', 'snR191', 'snR24',
       'snR3', 'snR30', 'snR31', 'snR32', 'snR33', 'snR34', 'snR35',
       'snR36', 'snR37', 'snR38', 'snR39', 'snR39B', 'snR4', 'snR40',
       'snR41', 'snR42', 'snR43', 'snR44', 'snR45', 'snR46', 'snR47',
       'snR

There are several types of things in here.  
  
**ribosomal RNAs** '15S_rRNA', '21S_rRNA'  
**non-coding RNAs** 'HRA1', 'ICR1', 'LSR1', 'NME1', 'PWR1', etc..    
**mitochondrial open-reading frames (some of them dubious)** 'Q0010', 'Q0017', 'Q0032', 'Q0045', etc...  
**snoRNAs** 'snR10', 'snR11', 'snR128', 'snR13', 'snR14', 'snR161', 'snR17a', etc...  
**tRNAs** 'tA(AGC)D', 'tA(AGC)F','tA(AGC)G', 'tA(AGC)H', 'tA(AGC)J', 'tA(AGC)K1', etc...  
**summary rows - not genes** 'no_feature', 'ambiguous', 'too_low_aQual','not_aligned', 'alignment_not_unique'

For many subsequent analyses, we'll stick to protein-coding genes. But for now, let's just get rid of the superfluous stuff.

In [8]:
rows_to_drop = ['no_feature', 'ambiguous', 'too_low_aQual','not_aligned', 'alignment_not_unique']

In [9]:
ge = ge[~ge.Gene.isin(rows_to_drop)]

In [10]:
ge.head(10)

Unnamed: 0,Gene,Expression
0,15S_rRNA,29
1,21S_rRNA,221
2,HRA1,0
3,ICR1,70
4,LSR1,314
5,NME1,33
6,PWR1,0
7,Q0010,0
8,Q0017,0
9,Q0032,0
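
The same **isin** filter can be tried on a toy table. The sketch below also adds a small optional check (not part of the original workflow) that reports which names in the drop list never appeared in the table, which can catch typos in the list.

```python
import pandas as pd

toy = pd.DataFrame({
    "Gene": ["YAL001C", "tY(GUA)Q", "no_feature", "ambiguous"],
    "Expression": [232, 0, 414503, 1387186],
})
rows_to_drop = ["no_feature", "ambiguous", "too_low_aQual",
                "not_aligned", "alignment_not_unique"]

# isin gives True where Gene is one of the listed names; ~ keeps everything else.
filtered = toy[~toy.Gene.isin(rows_to_drop)]
print(filtered)

# Optional check: which names in the drop list never appeared in the table?
never_seen = sorted(set(rows_to_drop) - set(toy.Gene))
print(never_seen)
```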


So, we've got 96 files like this. For what we want to do, we want to combine these all into one giant table. Fortunately, pandas has a relatively easy way to merge dataframes.

To illustrate this, first let's load two:

In [14]:
ge1 = pd.read_csv("data/rawdata/WT_rep16_MID92_allLanes_tophat2.0.5.bam.gbgout",sep="\t",header=None,names=["Gene","Expression"])
ge1 = ge1[~ge1.Gene.isin(rows_to_drop)]
ge2 = pd.read_csv("data/rawdata/Snf2_rep03_MID28_allLanes_tophat2.0.5.bam.gbgout",sep="\t",header=None,names=["Gene","Expression"])
ge2 = ge2[~ge2.Gene.isin(rows_to_drop)]

In [15]:
ge1.head(10)

Unnamed: 0,Gene,Expression
0,15S_rRNA,29
1,21S_rRNA,221
2,HRA1,0
3,ICR1,70
4,LSR1,314
5,NME1,33
6,PWR1,0
7,Q0010,0
8,Q0017,0
9,Q0032,0


In [16]:
ge2.head(10)

Unnamed: 0,Gene,Expression
0,15S_rRNA,5
1,21S_rRNA,23
2,HRA1,3
3,ICR1,211
4,LSR1,159
5,NME1,17
6,PWR1,0
7,Q0010,0
8,Q0017,0
9,Q0032,0


The syntax for merging two dataframes df1 and df2 is pretty easy:  

**df1.merge(df2,on="Gene",how="outer")**

In addition to the two dataframes, you have to specify which column of the dataframes you want to merge on. When pandas finds the same value in this column in both dataframes, it combines the two rows into one. The **how** argument controls which values are kept: every value found in either dataframe (how="outer"), only values found in both dataframes (how="inner"), only values in the first dataframe (how="left"), or only values in the second (how="right").

In [17]:
ge1.merge(ge2,on="Gene",how='outer')

Unnamed: 0,Gene,Expression_x,Expression_y
0,15S_rRNA,29,5
1,21S_rRNA,221,23
2,HRA1,0,3
3,ICR1,70,211
4,LSR1,314,159
...,...,...,...
7121,tY(GUA)J2,0,0
7122,tY(GUA)M1,0,1
7123,tY(GUA)M2,0,0
7124,tY(GUA)O,0,0
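
When both dataframes list the same genes, the choice of **how** makes no visible difference. A toy example with partly different gene sets shows what each option does. The values in `right` are made up, and the `suffixes` argument is used here only to replace the default `_x` / `_y` column endings with more readable ones.

```python
import pandas as pd

left = pd.DataFrame({"Gene": ["YAL001C", "YAL002W", "YAL003W"],
                     "Expression": [232, 245, 8073]})
right = pd.DataFrame({"Gene": ["YAL002W", "YAL003W", "YAL004W"],
                      "Expression": [200, 7900, 0]})

# Compare which genes survive under each join type.
for how in ["outer", "inner", "left", "right"]:
    merged = left.merge(right, on="Gene", how=how, suffixes=("_wt", "_snf2"))
    print(how, sorted(merged.Gene))

# In an outer merge, a gene missing from one side gets NaN in that side's column.
outer = left.merge(right, on="Gene", how="outer", suffixes=("_wt", "_snf2"))
print(outer)
```

Note that a column containing NaN is stored as floating point, so counts that were integers in the input may be displayed with a decimal point after an outer merge.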


The nice thing about merge is that it doesn't care what order the data are in. We can show this by sorting ge1 and ge2 in different ways and merging them again.

In [19]:
ge1 = ge1.sort_values('Expression')
ge2 = ge2.sort_values('Gene')

In [21]:
ge1.merge(ge2,on="Gene",how='outer').sort_values('Gene')

Unnamed: 0,Gene,Expression_x,Expression_y
1549,15S_rRNA,29,5
3614,21S_rRNA,221,23
216,HRA1,0,3
1985,ICR1,70,211
4389,LSR1,314,159
...,...,...,...
186,tY(GUA)J2,0,0
172,tY(GUA)M1,0,1
217,tY(GUA)M2,0,0
144,tY(GUA)O,0,0


How do we do this for 96 files?  

First, let's get a list of all the files. To do this we use the **os.listdir** function, which returns the names of everything in a directory.

In [25]:
os.listdir('data/rawdata')

['Snf2_rep39_MID19_allLanes_tophat2.0.5.bam.gbgout',
 'WT_rep18_MID48_allLanes_tophat2.0.5.bam.gbgout',
 'WT_rep17_MID66_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep31_MID77_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep43_MID73_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep35_MID60_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep28_MID74_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep08_MID78_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep30_MID16_allLanes_tophat2.0.5.bam.gbgout',
 'WT_rep08_MID20_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep40_MID12_allLanes_tophat2.0.5.bam.gbgout',
 'WT_rep40_MID18_allLanes_tophat2.0.5.bam.gbgout',
 'WT_rep28_MID64_allLanes_tophat2.0.5.bam.gbgout',
 'WT_rep06_MID30_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep14_MID33_allLanes_tophat2.0.5.bam.gbgout',
 'WT_rep03_MID51_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep01_MID96_allLanes_tophat2.0.5.bam.gbgout',
 'WT_rep31_MID43_allLanes_tophat2.0.5.bam.gbgout',
 'WT_rep42_MID10_allLanes_tophat2.0.5.bam.gbgout',
 'WT_rep30_

It would be nicer if this were sorted, so we wrap the call in **sorted**.

In [28]:
sorted(os.listdir('data/rawdata'))

['Snf2_rep01_MID96_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep02_MID21_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep03_MID28_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep04_MID11_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep05_MID82_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep06_MID80_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep07_MID26_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep08_MID78_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep09_MID04_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep10_MID38_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep11_MID01_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep12_MID02_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep13_MID59_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep14_MID33_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep15_MID75_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep16_MID72_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep17_MID89_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep18_MID40_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep19_MID84_allLanes_tophat2.0.5.bam.gb
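
Here is a self-contained sketch of the listing step that runs without the real data directory. It creates a few empty stand-in files in a temporary directory (the file names are hypothetical apart from the ".gbgout" ending), then lists, sorts and filters them.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Create a few empty stand-in files, plus one file we do not want.
    for name in ["WT_rep02_x.gbgout", "Snf2_rep01_x.gbgout",
                 "WT_rep01_x.gbgout", "notes.txt"]:
        open(os.path.join(tmpdir, name), "w").close()

    # os.listdir makes no promise about order, so sort before relying on it.
    listing = sorted(os.listdir(tmpdir))
    # Keep only the expression files.
    data_files = [f for f in listing if f.endswith(".gbgout")]

print(data_files)
```

Filtering on the file ending is a cheap guard against stray files (notes, hidden system files) ending up in the list.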

The file names contain information about the experiment, which we want to extract. We'll do this using the 'split' command, which takes a string and splits it up based on some character in the string. Here we care about the first two parts of the file name: whether it's WT or mutant, and which replicate it is.  

Let's illustrate what we're going to do with an example.

In [29]:
filename = 'WT_rep25_MID61_allLanes_tophat2.0.5.bam.gbgout'

We want to split the string using the '_' character.

In [30]:
filename.split('_')

['WT', 'rep25', 'MID61', 'allLanes', 'tophat2.0.5.bam.gbgout']

In [31]:
filesplit = filename.split('_')
exp_type = filesplit[0]
rep = filesplit[1]

print ('Exp type: ' + exp_type)
print ('Rep: ' + rep)

Exp type: WT
Rep: rep25
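
Since we will need to do this for every file, the same steps can be wrapped in a small function. The name `parse_sample_name` is made up for this sketch, and it assumes every file name starts with the experiment type and the replicate separated by '_'.

```python
def parse_sample_name(filename):
    """Split a file name like 'WT_rep25_MID61_...' into (exp_type, rep, exp_id)."""
    parts = filename.split("_")
    exp_type, rep = parts[0], parts[1]
    # The combined ID is what we will later use as a column name.
    return exp_type, rep, exp_type + "_" + rep

print(parse_sample_name("WT_rep25_MID61_allLanes_tophat2.0.5.bam.gbgout"))
print(parse_sample_name("Snf2_rep03_MID28_allLanes_tophat2.0.5.bam.gbgout"))
```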


In [46]:
# get a list of files

files = [f for f in sorted(os.listdir('data/rawdata')) if f.endswith(".gbgout")]

# start with first file

file = files[0]
filesplit = file.split('_')
exp_type = filesplit[0]
rep = filesplit[1]

exp_id = exp_type + "_" + rep

filepath = os.path.join('data/rawdata/', file)

# rather than using Gene, Expression we change the columns to Gene, Experiment ID
expression_all = pd.read_csv(filepath, sep="\t", header = None, names=['Gene',exp_id])

# now loop over remainder of files and merge to data frame

for file in files[1:]:
    print (file)
    filesplit = file.split('_')
    exp_type = filesplit[0]
    rep = filesplit[1]
    exp_id = exp_type + "_" + rep
    filepath = os.path.join('data/rawdata/', file)
    
    expression_all = expression_all.merge(pd.read_csv(filepath, sep="\t", header = None, names=['Gene',exp_id]), on="Gene", how="outer")


Snf2_rep02_MID21_allLanes_tophat2.0.5.bam.gbgout
Snf2_rep03_MID28_allLanes_tophat2.0.5.bam.gbgout
Snf2_rep04_MID11_allLanes_tophat2.0.5.bam.gbgout
Snf2_rep05_MID82_allLanes_tophat2.0.5.bam.gbgout
Snf2_rep06_MID80_allLanes_tophat2.0.5.bam.gbgout
Snf2_rep07_MID26_allLanes_tophat2.0.5.bam.gbgout
Snf2_rep08_MID78_allLanes_tophat2.0.5.bam.gbgout
Snf2_rep09_MID04_allLanes_tophat2.0.5.bam.gbgout
Snf2_rep10_MID38_allLanes_tophat2.0.5.bam.gbgout
Snf2_rep11_MID01_allLanes_tophat2.0.5.bam.gbgout
Snf2_rep12_MID02_allLanes_tophat2.0.5.bam.gbgout
Snf2_rep13_MID59_allLanes_tophat2.0.5.bam.gbgout
Snf2_rep14_MID33_allLanes_tophat2.0.5.bam.gbgout
Snf2_rep15_MID75_allLanes_tophat2.0.5.bam.gbgout
Snf2_rep16_MID72_allLanes_tophat2.0.5.bam.gbgout
Snf2_rep17_MID89_allLanes_tophat2.0.5.bam.gbgout
Snf2_rep18_MID40_allLanes_tophat2.0.5.bam.gbgout
Snf2_rep19_MID84_allLanes_tophat2.0.5.bam.gbgout
Snf2_rep20_MID15_allLanes_tophat2.0.5.bam.gbgout
Snf2_rep21_MID31_allLanes_tophat2.0.5.bam.gbgout
Snf2_rep22_MID53_all
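
The loop above handles the first file separately and then merges the rest one at a time. An alternative sketch uses `reduce` from `functools` (already imported at the top of this notebook) to fold a list of dataframes into one. The three "files" below are in-memory stand-ins with made-up counts, so the example runs without the data directory.

```python
from functools import reduce
import io
import pandas as pd

# Hypothetical stand-ins for three files: sample ID -> tab-separated text.
fake_files = {
    "Snf2_rep01": "15S_rRNA\t4\nHRA1\t5\n",
    "Snf2_rep02": "15S_rRNA\t2\nHRA1\t1\n",
    "WT_rep01": "HRA1\t2\n15S_rRNA\t7\n",
}

# One dataframe per sample, with the sample ID as the name of the count column.
frames = [
    pd.read_csv(io.StringIO(text), sep="\t", header=None, names=["Gene", exp_id])
    for exp_id, text in fake_files.items()
]

# reduce applies the merge pairwise: ((frame0 + frame1) + frame2) + ...
combined = reduce(lambda a, b: a.merge(b, on="Gene", how="outer"), frames)
print(combined)
```

Both approaches do the same sequence of merges; the `reduce` version just avoids treating the first file as a special case.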

In [47]:
expression_all

Unnamed: 0,Gene,Snf2_rep01,Snf2_rep02,Snf2_rep03,Snf2_rep04,Snf2_rep05,Snf2_rep06,Snf2_rep07,Snf2_rep08,Snf2_rep09,...,WT_rep39,WT_rep40,WT_rep41,WT_rep42,WT_rep43,WT_rep44,WT_rep45,WT_rep46,WT_rep47,WT_rep48
0,15S_rRNA,4,2,5,5,46,3,5,4,2,...,0,49,9,4,11,12,1,22,12,4
1,21S_rRNA,31,18,23,44,356,62,35,33,13,...,10,274,49,30,72,58,21,159,107,70
2,HRA1,5,1,3,1,2,1,1,4,4,...,5,3,6,5,2,2,2,5,2,1
3,ICR1,205,196,211,252,127,146,275,160,190,...,85,177,137,118,113,81,142,94,187,106
4,LSR1,210,103,159,260,298,522,303,96,132,...,66,385,232,149,114,81,109,132,243,128
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7126,no_feature,888995,641333,544060,955003,616339,764571,1290592,598769,827934,...,403527,1035268,759722,578053,615809,446796,683906,462390,901045,639100
7127,ambiguous,730793,665470,709428,821933,1171511,529946,925353,669715,717502,...,583744,2002465,865662,646805,570375,1011281,899586,596335,1321941,803015
7128,too_low_aQual,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7129,not_aligned,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The files are incredibly simple: two columns, separated by tabs. The first column is the gene name, the second column is the normalized count for that gene in that sample (read the paper if you're interested in how they did the counting and normalization).

    YBL013W     39
    YBL014C     127
    YBL015W     732
    YBL016W     309
    YBL017C     1613
    YBL018C     174
    YBL019W     117
    YBL020W     258
    YBL021C     248
    YBL022C     1168
    YBL023C     331
    YBL024W     451
    YBL025W     64
    YBL026W     206
    YBL027W     9723
    YBL028C     77
    YBL029C-A   157



In [48]:
expression_all = expression_all[~expression_all.Gene.isin(rows_to_drop)]

In [49]:
expression_all

Unnamed: 0,Gene,Snf2_rep01,Snf2_rep02,Snf2_rep03,Snf2_rep04,Snf2_rep05,Snf2_rep06,Snf2_rep07,Snf2_rep08,Snf2_rep09,...,WT_rep39,WT_rep40,WT_rep41,WT_rep42,WT_rep43,WT_rep44,WT_rep45,WT_rep46,WT_rep47,WT_rep48
0,15S_rRNA,4,2,5,5,46,3,5,4,2,...,0,49,9,4,11,12,1,22,12,4
1,21S_rRNA,31,18,23,44,356,62,35,33,13,...,10,274,49,30,72,58,21,159,107,70
2,HRA1,5,1,3,1,2,1,1,4,4,...,5,3,6,5,2,2,2,5,2,1
3,ICR1,205,196,211,252,127,146,275,160,190,...,85,177,137,118,113,81,142,94,187,106
4,LSR1,210,103,159,260,298,522,303,96,132,...,66,385,232,149,114,81,109,132,243,128
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7121,tY(GUA)J2,1,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
7122,tY(GUA)M1,0,0,1,0,2,0,0,0,0,...,0,1,1,0,1,0,0,0,1,0
7123,tY(GUA)M2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7124,tY(GUA)O,0,1,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
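
Before moving on, it may be worth a quick sanity check on a combined table like this one. The sketch below uses two toy dataframes (made-up counts) to show two checks that could be run on `expression_all`: counting missing values per sample, and looking for duplicated gene names.

```python
import pandas as pd

a = pd.DataFrame({"Gene": ["YAL001C", "YAL002W"], "WT_rep01": [232, 245]})
b = pd.DataFrame({"Gene": ["YAL001C"], "WT_rep02": [240]})
merged = a.merge(b, on="Gene", how="outer")

# Any gene missing from one input shows up as NaN in that sample's column.
missing_per_sample = merged.drop(columns="Gene").isna().sum()
print(missing_per_sample)

# Duplicate gene names would also be worth catching before further analysis.
has_duplicates = merged.Gene.duplicated().any()
print(has_duplicates)
```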


In [51]:
# Save

expression_all.to_csv("Barton_combined.txt", sep='\t')
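
One thing to be aware of when loading this file again: by default **to_csv** also writes the row index as an unnamed first column. The sketch below shows this on a toy table written to an in-memory buffer (a stand-in for the real file), and reads it back with `index_col=0` so the index is restored rather than appearing as an extra "Unnamed: 0" column. Passing `index=False` to **to_csv** would be another way to avoid the extra column.

```python
import io
import pandas as pd

table = pd.DataFrame({"Gene": ["15S_rRNA", "HRA1"],
                      "Snf2_rep01": [4, 5],
                      "WT_rep01": [7, 2]})

# Default to_csv writes the row index as an unnamed first column.
buffer = io.StringIO()
table.to_csv(buffer, sep="\t")
header_line = buffer.getvalue().splitlines()[0]
print(repr(header_line))

# Reading back: index_col=0 turns that first column back into the index.
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t", index_col=0)
print(list(restored.columns))
```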