In [1]:
%matplotlib inline

In [2]:
import pandas as pd
import os
from functools import reduce
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import pearsonr
from scipy import stats
from scipy.stats import ks_2samp
from scipy.stats import entropy

## Datafiles
There are 96 separate datafiles, each representing the processed output of an RNAseq experiment looking at mRNA levels in wild-type (48 samples) or SNF2 deletion (48 samples) yeast cells grown under standard conditions.

#### Reading individual files with Pandas

The individual files are in the directory "data/rawdata/" and are tab-delimited text files. These are basically raw spreadsheets stored as text, with individual lines separated by a **'\n'**, which indicates a 'new line', and columns within each line separated by a **'\t'** or 'tab' character. (We will also deal with files where the columns are separated by commas.)

Like any good program for working with spreadsheet data, Pandas knows how to read this kind of file if we tell it what to look for. To do this we use its [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function, specifying first the filepath, and then a few details about this file.  

**sep="\t"** tells Pandas that the data separator is a tab character (as opposed to other common separators like comma or space)  
**header=None** tells Pandas that the first line of the file is data, not a header (by default it expects the first line to hold the column names)  
**names = ..** because there is no header, we tell Pandas what to call the columns
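
To see what these three arguments do, here is a minimal sketch that reads a tiny in-memory "file" instead of a real one. The two-line text and the variable names `good` and `bad` are made up for illustration; only `read_csv` and its `sep`, `header` and `names` arguments come from the description above.

```python
import io
import pandas as pd

# Hypothetical two-line file contents: tab-separated, no header row
text = "YAL001C\t232\nYAL002W\t245\n"

# header=None plus names=...: every line is treated as data
good = pd.read_csv(io.StringIO(text), sep="\t", header=None,
                   names=["Gene", "Expression"])

# Without header=None, the first data line is consumed as column names
bad = pd.read_csv(io.StringIO(text), sep="\t")

print(good.shape)         # both lines kept as rows
print(list(bad.columns))  # the first gene has become a header
```

Forgetting **header=None** does not raise an error; it silently drops the first gene from the data, which is why it is worth checking the row count after loading.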

In [8]:
pd.read_csv("data/rawdata/WT_rep16_MID92_allLanes_tophat2.0.5.bam.gbgout",sep="\t",header=None,names=["Gene","Expression"])

Unnamed: 0,Gene,Expression
0,15S_rRNA,29
1,21S_rRNA,221
2,HRA1,0
3,ICR1,70
4,LSR1,314
...,...,...
7126,no_feature,414503
7127,ambiguous,1387186
7128,too_low_aQual,0
7129,not_aligned,0


If we just run that command, Pandas loads in the file and displays it. But we want to store the data, so we assign the data to a variable. 

In [24]:
ge = pd.read_csv("data/rawdata/WT_rep16_MID92_allLanes_tophat2.0.5.bam.gbgout",sep="\t",header=None,names=["Gene","Expression"])

Now we can look at the file in different ways. The **head(n)** command shows us the first n rows.

In [11]:
ge.head(10)

Unnamed: 0,Gene,Expression
0,15S_rRNA,29
1,21S_rRNA,221
2,HRA1,0
3,ICR1,70
4,LSR1,314
5,NME1,33
6,PWR1,0
7,Q0010,0
8,Q0017,0
9,Q0032,0


Similarly the **tail(n)** command shows us the last n rows.

In [13]:
ge.tail(10)

Unnamed: 0,Gene,Expression
7121,tY(GUA)J2,0
7122,tY(GUA)M1,0
7123,tY(GUA)M2,0
7124,tY(GUA)O,0
7125,tY(GUA)Q,0
7126,no_feature,414503
7127,ambiguous,1387186
7128,too_low_aQual,0
7129,not_aligned,0
7130,alignment_not_unique,0


You can already see one (typical) problem, which is that the file is not just data. It also has some summary statistics that we don't care about. So we have to filter the rows in some manner. Here we use a bit of knowledge about gene nomenclature in yeast. 

Most of the genes have this naming format:

Systematic names for nuclear-encoded ORFs begin with the letter 'Y' (for 'Yeast'); the second letter denotes the chromosome number ('A' is chr I, 'B' is chr II, etc.); the third letter is either 'L' or 'R' for left or right chromosome arm; next is a three digit number indicating the order of the ORFs on that arm of a chromosome starting from the centromere, irrespective of strand; finally, there is an additional letter indicating the strand, either 'W' for Watson (the strand with 5' end at the left telomere) or 'C' for Crick (the complement strand, 5' end is at the right telomere).

You can read about the yeast gene naming system here - http://seq.yeastgenome.org/help/community/nomenclature-conventions.

We can look for rows that match this style by using the built-in filtering system in Pandas. This specific command asks Pandas to return all rows where the Gene column starts with the character "Y".

In [17]:
ge[ge.Gene.str.startswith("Y")]

Unnamed: 0,Gene,Expression
61,YAL001C,232
62,YAL002W,245
63,YAL003W,8073
64,YAL004W,0
65,YAL005C,7471
...,...,...
6741,YPR201W,45
6742,YPR202W,22
6743,YPR203W,18
6744,YPR204C-A,0


That accounts for 6685 of the 7131 rows. But what about the rest? We can start by reversing that filter to ask for rows that do not start with "Y". The **~** symbol does this by negating the filter.

In [18]:
ge[~ge.Gene.str.startswith("Y")]

Unnamed: 0,Gene,Expression
0,15S_rRNA,29
1,21S_rRNA,221
2,HRA1,0
3,ICR1,70
4,LSR1,314
...,...,...
7126,no_feature,414503
7127,ambiguous,1387186
7128,too_low_aQual,0
7129,not_aligned,0
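
The filter and its negation are both just boolean Series, which a small sketch can make concrete. The four-row frame `toy` is an assumption for illustration (its values are copied from rows shown above); the point is that the "Y" rows and the non-"Y" rows always add up to the full table.

```python
import pandas as pd

# Toy frame with four rows taken from the outputs above
toy = pd.DataFrame({
    "Gene": ["15S_rRNA", "YAL001C", "YAL002W", "no_feature"],
    "Expression": [29, 232, 245, 414503],
})

mask = toy.Gene.str.startswith("Y")  # one True/False per row

print(mask.sum())                # rows whose name starts with "Y"
print((~mask).sum())             # rows whose name does not
print(toy[~mask].Gene.tolist())  # names of the non-"Y" rows
```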


This is going to look a bit messy, but we can ask Pandas to return all of the Gene names for these remaining rows, just so we can get a sense of what we're dealing with.

In [20]:
ge[~ge.Gene.str.startswith("Y")].Gene.values

array(['15S_rRNA', '21S_rRNA', 'HRA1', 'ICR1', 'LSR1', 'NME1', 'PWR1',
       'Q0010', 'Q0017', 'Q0032', 'Q0045', 'Q0050', 'Q0055', 'Q0060',
       'Q0065', 'Q0070', 'Q0075', 'Q0080', 'Q0085', 'Q0092', 'Q0105',
       'Q0110', 'Q0115', 'Q0120', 'Q0130', 'Q0140', 'Q0142', 'Q0143',
       'Q0144', 'Q0160', 'Q0182', 'Q0250', 'Q0255', 'Q0275', 'Q0297',
       'RDN18-1', 'RDN18-2', 'RDN25-1', 'RDN25-2', 'RDN37-1', 'RDN37-2',
       'RDN5-1', 'RDN5-2', 'RDN5-3', 'RDN5-4', 'RDN5-5', 'RDN5-6',
       'RDN58-1', 'RDN58-2', 'RNA170', 'RPM1', 'RPR1', 'RUF20', 'RUF21',
       'RUF22', 'RUF23', 'RUF5-1', 'RUF5-2', 'SCR1', 'SRG1', 'TLC1',
       'snR10', 'snR11', 'snR128', 'snR13', 'snR14', 'snR161', 'snR17a',
       'snR17b', 'snR18', 'snR189', 'snR19', 'snR190', 'snR191', 'snR24',
       'snR3', 'snR30', 'snR31', 'snR32', 'snR33', 'snR34', 'snR35',
       'snR36', 'snR37', 'snR38', 'snR39', 'snR39B', 'snR4', 'snR40',
       'snR41', 'snR42', 'snR43', 'snR44', 'snR45', 'snR46', 'snR47',
       'snR

There are several types of things in here.  
  
**ribosomal RNAs** '15S_rRNA', '21S_rRNA'  
**non-coding RNAs** 'HRA1', 'ICR1', 'LSR1', 'NME1', 'PWR1', etc...  
**mitochondrial-genome open reading frames (several of them dubious)** 'Q0010', 'Q0017', 'Q0032', 'Q0045', etc...  
**small nuclear and nucleolar RNAs (snRNAs/snoRNAs)** 'snR10', 'snR11', 'snR128', 'snR13', 'snR14', 'snR161', 'snR17a', etc...  
**tRNAs** 'tA(AGC)D', 'tA(AGC)F', 'tA(AGC)G', 'tA(AGC)H', 'tA(AGC)J', 'tA(AGC)K1', etc...  
**summary rows - not genes** 'no_feature', 'ambiguous', 'too_low_aQual', 'not_aligned', 'alignment_not_unique'

For now we will leave these out, but you are free to come back later and incorporate them into the analyses.
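
If you do come back to these rows, one rough way to sort them into groups is by name prefix. The helper `classify_feature` and its labels are hypothetical, and the prefix rules are assumptions based only on the example names listed above, so treat this as a starting sketch rather than a proper annotation.

```python
from collections import Counter

# Summary rows named in the list above
SUMMARY_ROWS = {"no_feature", "ambiguous", "too_low_aQual",
                "not_aligned", "alignment_not_unique"}

def classify_feature(name):
    """Rough, prefix-based label for a row name (hypothetical helper)."""
    if name in SUMMARY_ROWS:
        return "summary"
    if name.startswith("Y"):
        return "Y-prefixed ORF"
    if name.startswith("Q"):
        return "Q-prefixed ORF"
    if name.startswith("snR"):
        return "snR RNA"
    if name.startswith("t") and "(" in name:
        return "tRNA"
    if name.endswith("_rRNA"):
        return "rRNA"
    return "other"

examples = ["YAL001C", "Q0010", "snR10", "tA(AGC)D",
            "15S_rRNA", "HRA1", "no_feature"]
print(Counter(classify_feature(n) for n in examples))
```

The summary rows are checked first because 'too_low_aQual' would otherwise be tested against the tRNA rule, which also looks for a leading "t".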

In [31]:
ge = ge[ge.Gene.str.startswith("Y")]

In [32]:
ge.head(10)

Unnamed: 0,Gene,Expression
61,YAL001C,232
62,YAL002W,245
63,YAL003W,8073
64,YAL004W,0
65,YAL005C,7471
66,YAL007C,550
67,YAL008W,228
68,YAL009W,155
69,YAL010C,176
70,YAL011W,203
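
The `startswith("Y")` filter is permissive: it would keep any name beginning with "Y", whether or not it follows the systematic format. A stricter check can be sketched with a regular expression. The pattern below is an assumption built from the naming rules quoted earlier (chromosome letters A to P for the 16 nuclear chromosomes, plus an optional suffix such as "-A"), and the toy names are for illustration only.

```python
import pandas as pd

# Toy names: two systematic names, three other rows, one made-up "Y" name
names = pd.Series(["YAL001C", "YPR204C-A", "Q0010", "snR10",
                   "no_feature", "Yfoo"])

# Y + chromosome letter + arm (L/R) + three digits + strand (W/C),
# optionally followed by a suffix like "-A"
pattern = r"^Y[A-P][LR]\d{3}[WC](?:-[A-Z])?$"

is_systematic = names.str.match(pattern)
print(names[is_systematic].tolist())
```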


So, we've got 96 files like this. We want to combine them into a single table.

In [33]:
ge1 = pd.read_csv("data/rawdata/WT_rep16_MID92_allLanes_tophat2.0.5.bam.gbgout",sep="\t",header=None,names=["Gene","Expression"])
ge2 = pd.read_csv("data/rawdata/Snf2_rep17_MID89_allLanes_tophat2.0.5.bam.gbgout",sep="\t",header=None,names=["Gene","Expression"])

The files are incredibly simple: two columns, separated by tabs. The first column is the gene name, the second column the normalized counts for that gene in that sample (read the paper if you're interested in how they did the counting and normalization).

    YBL013W    39
    YBL014C    127
    YBL015W    732
    YBL016W    309
    YBL017C    1613
    YBL018C    174
    YBL019W    117
    YBL020W    258
    YBL021C    248
    YBL022C    1168
    YBL023C    331
    YBL024W    451
    YBL025W    64
    YBL026W    206
    YBL027W    9723
    YBL028C    77
    YBL029C-A  157



In [8]:
df.head(10)

Unnamed: 0,0,1
0,15S_rRNA,29
1,21S_rRNA,221
2,HRA1,0
3,ICR1,70
4,LSR1,314
5,NME1,33
6,PWR1,0
7,Q0010,0
8,Q0017,0
9,Q0032,0


In [8]:
df = pd.read_csv("rawdata/WT_rep16_MID92_allLanes_tophat2.0.5.bam.gbgout",sep="\t",header=None,names=["Gene","Expression"])

In [10]:
df.sort_values('Expression')

Unnamed: 0,Gene,Expression
7130,alignment_not_unique,0
444,YBR121C-A,0
6717,YPR177C,0
6709,YPR170W-A,0
3166,YIL082W,0
...,...,...
440,YBR118W,109711
3012,YHR174W,120322
2639,YGR192C,173478
7126,no_feature,414503
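
`sort_values` sorts from smallest to largest by default, so the most highly expressed genes end up at the bottom. A short sketch of the two usual ways to get them at the top instead: `ascending=False` and `nlargest`. The frame `toy` is an assumption for illustration, with counts copied from the output above.

```python
import pandas as pd

# Toy frame; counts copied from rows in the sorted output above
toy = pd.DataFrame({
    "Gene": ["YBR118W", "YHR174W", "YGR192C", "YIL082W"],
    "Expression": [109711, 120322, 173478, 0],
})

# Largest values first
top = toy.sort_values("Expression", ascending=False).head(2)
print(top["Gene"].tolist())

# nlargest answers the same question in one call
print(toy.nlargest(2, "Expression")["Gene"].tolist())
```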


In [11]:
df2 = pd.read_csv("rawdata/Snf2_rep17_MID89_allLanes_tophat2.0.5.bam.gbgout",sep="\t",header=None,names=["Gene","Expression"])

In [12]:
df2.head(10)

Unnamed: 0,Gene,Expression
0,15S_rRNA,50
1,21S_rRNA,391
2,HRA1,2
3,ICR1,203
4,LSR1,439
5,NME1,52
6,PWR1,0
7,Q0010,0
8,Q0017,0
9,Q0032,0


In [13]:
df.merge(df2,on="Gene",how='outer')

Unnamed: 0,Gene,Expression_x,Expression_y
0,15S_rRNA,29,50
1,21S_rRNA,221,391
2,HRA1,0,2
3,ICR1,70,203
4,LSR1,314,439
...,...,...,...
7126,no_feature,414503,904605
7127,ambiguous,1387186,1687147
7128,too_low_aQual,0,0
7129,not_aligned,0,0
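
Because both frames have a column called "Expression", the merge renames them to "Expression_x" and "Expression_y". A sketch of one way to get more readable names, using the `suffixes` argument of `merge`. The frames `wt` and `snf2` and their counts are made up for illustration, and `validate="one_to_one"` is an optional extra check that each gene appears only once in each frame.

```python
import pandas as pd

# Two toy per-sample frames with hypothetical counts
wt = pd.DataFrame({"Gene": ["YAL001C", "YAL002W"], "Expression": [232, 245]})
snf2 = pd.DataFrame({"Gene": ["YAL001C", "YAL002W"], "Expression": [301, 198]})

merged = wt.merge(snf2, on="Gene", how="outer",
                  suffixes=("_WT", "_Snf2"),
                  validate="one_to_one")  # raises if a gene is duplicated

print(merged.columns.tolist())
```

Naming the count column after the sample when the file is read, as is done further down, avoids the clash altogether.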


In [15]:
sorted(os.listdir("rawdata"))

['Snf2_rep01_MID96_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep02_MID21_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep03_MID28_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep04_MID11_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep05_MID82_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep06_MID80_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep07_MID26_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep08_MID78_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep09_MID04_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep10_MID38_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep11_MID01_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep12_MID02_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep13_MID59_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep14_MID33_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep15_MID75_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep16_MID72_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep17_MID89_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep18_MID40_allLanes_tophat2.0.5.bam.gbgout',
 'Snf2_rep19_MID84_allLanes_tophat2.0.5.bam.gb

In [16]:
s = "WT_rep40_MID18_allLanes_tophat2.0.5.bam.gbgout"

In [20]:
filesplit = s.split('_')

In [37]:
filesplit[0:2]

['WT', 'rep40']

In [38]:
"_".join(filesplit[0:2])

'WT_rep40'
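
The split-and-join steps above can be wrapped in a small function so the same rule is applied to every file. The name `sample_name` is hypothetical; the logic is the same as in the cells above.

```python
def sample_name(filename):
    """Return 'condition_replicate' from a file name (hypothetical helper)."""
    # Keep the first two underscore-separated pieces, e.g. 'WT' and 'rep40'
    return "_".join(filename.split("_")[0:2])

print(sample_name("WT_rep40_MID18_allLanes_tophat2.0.5.bam.gbgout"))
print(sample_name("Snf2_rep17_MID89_allLanes_tophat2.0.5.bam.gbgout"))
```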

In [42]:
# 
# Input data are in individual tab-delimited files, one for each experiment
# Here I load each one into its own Pandas dataframe and create a list of frames
#

frames = []

for file in sorted(os.listdir("rawdata")):
    if file.endswith(".gbgout"):
        fs = file.split('_')
        name = "_".join(fs[0:2])
        filepath = os.path.join("rawdata/", file)
        frames.append(pd.read_csv(filepath, sep="\t", header = None, names=['Gene',name]))

In [60]:
numbers = [1, 2, 3, 5, 7, 11, 13, 17]

for x in numbers:
    print(x)

1
2
3
5
7
11
13
17


In [61]:
df = frames[0]
for rdf in frames[1:]:
    df = df.merge(rdf,on="Gene",how='outer')



In [62]:
df

Unnamed: 0,Gene,Snf2_rep01,Snf2_rep02,Snf2_rep03,Snf2_rep04,Snf2_rep05,Snf2_rep06,Snf2_rep07,Snf2_rep08,Snf2_rep09,...,WT_rep39,WT_rep40,WT_rep41,WT_rep42,WT_rep43,WT_rep44,WT_rep45,WT_rep46,WT_rep47,WT_rep48
0,15S_rRNA,4,2,5,5,46,3,5,4,2,...,0,49,9,4,11,12,1,22,12,4
1,21S_rRNA,31,18,23,44,356,62,35,33,13,...,10,274,49,30,72,58,21,159,107,70
2,HRA1,5,1,3,1,2,1,1,4,4,...,5,3,6,5,2,2,2,5,2,1
3,ICR1,205,196,211,252,127,146,275,160,190,...,85,177,137,118,113,81,142,94,187,106
4,LSR1,210,103,159,260,298,522,303,96,132,...,66,385,232,149,114,81,109,132,243,128
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7126,no_feature,888995,641333,544060,955003,616339,764571,1290592,598769,827934,...,403527,1035268,759722,578053,615809,446796,683906,462390,901045,639100
7127,ambiguous,730793,665470,709428,821933,1171511,529946,925353,669715,717502,...,583744,2002465,865662,646805,570375,1011281,899586,596335,1321941,803015
7128,too_low_aQual,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7129,not_aligned,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [47]:
frames[-1]

Unnamed: 0,Gene,WT_rep48
0,15S_rRNA,4
1,21S_rRNA,70
2,HRA1,1
3,ICR1,106
4,LSR1,128
...,...,...
7126,no_feature,639100
7127,ambiguous,803015
7128,too_low_aQual,0
7129,not_aligned,0


In [15]:
#
# This combines all of the individual frames into one dataframe, combining on the "Gene" column
#

df = reduce(lambda left, right: pd.merge(left, right, on=['Gene'], how='outer'), frames)
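
The `reduce` call performs the same pairwise merges as the explicit loop used earlier. The sketch below checks that on three toy frames (hypothetical counts) and also shows one alternative, offered as an assumption rather than the approach used in this notebook: index every frame on "Gene" and concatenate column-wise with `pd.concat`.

```python
from functools import reduce
import pandas as pd

# Three toy per-sample frames with hypothetical counts
frames = [
    pd.DataFrame({"Gene": ["YAL001C", "YAL002W"], "WT_rep01": [232, 245]}),
    pd.DataFrame({"Gene": ["YAL001C", "YAL002W"], "WT_rep02": [210, 260]}),
    pd.DataFrame({"Gene": ["YAL001C", "YAL002W"], "Snf2_rep01": [301, 198]}),
]

# Explicit loop: merge each frame into the running result
looped = frames[0]
for frame in frames[1:]:
    looped = looped.merge(frame, on="Gene", how="outer")

# reduce: the same pairwise merges written as one expression
reduced = reduce(
    lambda left, right: pd.merge(left, right, on="Gene", how="outer"),
    frames,
)

# Alternative: index on Gene, then line the frames up column-wise
concatenated = pd.concat(
    [f.set_index("Gene") for f in frames], axis=1
).reset_index()

print(looped.equals(reduced))
print(concatenated.columns.tolist())
```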

In [16]:
#
# Sort columns by name
#

df = df.reindex(sorted(df.columns),axis=1)

In [17]:
df

Unnamed: 0,Gene,Snf2_rep01,Snf2_rep02,Snf2_rep03,Snf2_rep04,Snf2_rep05,Snf2_rep06,Snf2_rep07,Snf2_rep08,Snf2_rep09,...,WT_rep39,WT_rep40,WT_rep41,WT_rep42,WT_rep43,WT_rep44,WT_rep45,WT_rep46,WT_rep47,WT_rep48
0,15S_rRNA,4,2,5,5,46,3,5,4,2,...,0,49,9,4,11,12,1,22,12,4
1,21S_rRNA,31,18,23,44,356,62,35,33,13,...,10,274,49,30,72,58,21,159,107,70
2,HRA1,5,1,3,1,2,1,1,4,4,...,5,3,6,5,2,2,2,5,2,1
3,ICR1,205,196,211,252,127,146,275,160,190,...,85,177,137,118,113,81,142,94,187,106
4,LSR1,210,103,159,260,298,522,303,96,132,...,66,385,232,149,114,81,109,132,243,128
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7126,no_feature,888995,641333,544060,955003,616339,764571,1290592,598769,827934,...,403527,1035268,759722,578053,615809,446796,683906,462390,901045,639100
7127,ambiguous,730793,665470,709428,821933,1171511,529946,925353,669715,717502,...,583744,2002465,865662,646805,570375,1011281,899586,596335,1321941,803015
7128,too_low_aQual,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7129,not_aligned,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [65]:
df['Gene'].str.startswith('Y')

0       False
1       False
2       False
3       False
4       False
        ...  
7126    False
7127    False
7128    False
7129    False
7130    False
Name: Gene, Length: 7131, dtype: bool

In [68]:
df[df['Snf2_rep01'] > 1000]

Unnamed: 0,Gene,Snf2_rep01,Snf2_rep02,Snf2_rep03,Snf2_rep04,Snf2_rep05,Snf2_rep06,Snf2_rep07,Snf2_rep08,Snf2_rep09,...,WT_rep39,WT_rep40,WT_rep41,WT_rep42,WT_rep43,WT_rep44,WT_rep45,WT_rep46,WT_rep47,WT_rep48
60,TLC1,2314,1587,1926,2159,1537,920,3054,1476,1935,...,872,1802,1564,1144,1264,698,1577,978,1823,1297
63,YAL003W,7296,6129,6464,7278,5613,3046,7687,6714,8382,...,9291,12055,13542,8441,7337,8639,14315,6618,14745,9878
65,YAL005C,9851,10226,12006,10714,8575,2791,11764,8237,12213,...,10177,21176,13270,12337,14303,14268,14993,9607,16075,14797
66,YAL007C,1519,1276,1364,1611,1116,591,1951,1002,1665,...,753,1399,1095,818,755,753,1137,695,1346,910
71,YAL012W,7277,7124,7545,8022,5875,7223,8985,6530,8068,...,4421,6788,5605,4121,3639,5998,7428,2768,7148,4273
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6738,YPR198W,1009,879,833,1116,749,761,1355,649,1117,...,416,850,568,448,569,399,663,349,691,569
6745,YPR204W,1849,1544,1611,1834,1138,2344,2367,1619,1481,...,1010,1419,1005,778,596,1448,1602,597,1393,1010
6824,snR86,1328,897,1063,1265,935,1498,1757,823,869,...,424,1125,776,593,881,437,796,653,1045,819
7126,no_feature,888995,641333,544060,955003,616339,764571,1290592,598769,827934,...,403527,1035268,759722,578053,615809,446796,683906,462390,901045,639100
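
Note that the numeric filter alone still lets through rows like 'TLC1' and 'no_feature'. Two boolean filters can be combined with **&** so that both conditions must hold. A minimal sketch on a toy frame (the count for 'YAL004W' is made up; the other values are copied from the output above):

```python
import pandas as pd

toy = pd.DataFrame({
    "Gene": ["TLC1", "YAL003W", "YAL004W", "no_feature"],
    "Snf2_rep01": [2314, 7296, 0, 888995],
})

is_y_gene = toy["Gene"].str.startswith("Y")
is_high = toy["Snf2_rep01"] > 1000

# & keeps only rows where both masks are True
both = toy[is_y_gene & is_high]
print(both["Gene"].tolist())
```

When the comparisons are written inline rather than stored in variables, each one needs its own parentheses, as in `toy[(toy["Snf2_rep01"] > 1000) & toy["Gene"].str.startswith("Y")]`.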


In [70]:
#
# Keep only genes whose names start with "Y" (systematic names of nuclear-encoded ORFs)
#

df = df[df['Gene'].str.startswith('Y')]

In [71]:
#
# Index dataframe on 'Gene'
#

df = df.set_index('Gene')

In [72]:
# Save

df.to_csv("Barton_combined_Ygenes.txt", sep='\t')

In [73]:
df.to_pickle("Barton_combined_Ygenes.pkl")
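
To load either file back later, the tab-separated text needs to be told which column is the index, while the pickle restores the dataframe as it was. A minimal round-trip sketch on a toy frame, written to a temporary directory so it does not touch the real output files (the file names and counts here are hypothetical):

```python
import os
import tempfile
import pandas as pd

# Toy frame indexed on Gene, mirroring the structure saved above
toy = pd.DataFrame(
    {"Gene": ["YAL001C", "YAL002W"], "WT_rep01": [232, 245]}
).set_index("Gene")

with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "toy.txt")
    pkl_path = os.path.join(tmp, "toy.pkl")
    toy.to_csv(txt_path, sep="\t")
    toy.to_pickle(pkl_path)

    # index_col="Gene" restores the gene names as the index
    from_txt = pd.read_csv(txt_path, sep="\t", index_col="Gene")
    from_pkl = pd.read_pickle(pkl_path)

print(from_txt.loc["YAL001C", "WT_rep01"])
print(from_pkl.equals(toy))
```

The text file is readable by any program, while the pickle is specific to Python and should only be loaded from sources you trust.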