# Exploring DREAM3 dataset

This notebook explores the data from the third annual DREAM challenge (DREAM3, 2008),
a gene expression prediction challenge.

The original challenge was to predict a block
of missing data. Here we use the same dataset
for a different purpose: to discover a gene regulatory
network for yeast.

Yeast (*Saccharomyces cerevisiae*) has about 6,000 genes. For comparison:

*E. coli* has about 5,000 genes,
but it is a prokaryote. Yeast is a eukaryote, like plants and humans.

Humans have ~ 21,000 genes.

Fruit flies (*Drosophila*) have ~ 14,000 genes.

GAT1, GCN4 and LEU3 are TFs (transcription factors, i.e., proteins that bind to specific DNA segments and thereby regulate the expression of other genes).

The dataset covers 4 yeast strains:
1. wild type (wt) 
2. GAT1 deletion strain (gat1$\Delta$)
3. GCN4 deletion strain (gcn4$\Delta$) 
4. LEU3 deletion strain (leu3$\Delta$)

Time course (8 time points):
T = 0, 10, 20, 30, 45, 60, 90 and 120 minutes

T is the time elapsed since the addition of 3-aminotriazole (3AT), which is an
inhibitor of an enzyme in the histidine biosynthesis pathway.
No 3AT is present at T=0.

Data: expression levels for these 4 strains of yeast, with some values missing.

Missing data: for the gat1$\Delta$ strain, a block of data of shape 50 by 8 (50 genes and 8 time points).

The dataset contains 9,335 rows even though yeast has only ~ 6,000 genes. The reason for
the 9,335 rows is explained in misc/Affymetric-microarray.md
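
The gap between rows and genes can be examined directly by counting probes with and without a gene name. The sketch below is only a minimal illustration on a toy frame built from the first six probes shown in the `df.head` output of this notebook; the variable names are made up here, and it does not replace the explanation in the file mentioned above.

```python
import pandas as pd

# Toy frame copied from the first six rows of the expression table;
# None marks probes whose geneName is empty in the dataset.
probes = pd.DataFrame({
    "ProbeID": ["10000_at", "10001_at", "10002_i_at",
                "10003_f_at", "10004_at", "10005_at"],
    "geneName": [None, "MID2", None, "RPS25B", None, "NUP2"],
})

# Probes that carry no gene name at all.
n_unnamed = int(probes["geneName"].isna().sum())

# Distinct gene names (nunique ignores missing values by default).
n_named = int(probes["geneName"].nunique())

print(f"{len(probes)} probes, {n_unnamed} unnamed, {n_named} distinct genes")
```

Running the same two counts on the full table would show how many of the 9,335 rows are unnamed probes and how many distinct gene names remain.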

The dataset contains 35 columns: (4 strains) $\times$ (8 time points) + 3 = 35

The 3 extra columns are:
1. probeID
2. geneName
3. L0, the expression level (arbitrary units) measured for probeID in the parental strain at t=0
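
The column arithmetic above can be checked by generating the expected headers. This is a sketch, assuming the header strings follow the pattern visible in the table outputs of this notebook (for example `wt t=0` and `leu3-deletion t=120`); the position of the gat1 block within the file is an assumption, since the display truncates the middle columns.

```python
# Strain labels and time points as they appear in the column headers.
strains = ["wt", "gat1-deletion", "gcn4-deletion", "leu3-deletion"]
times = [0, 10, 20, 30, 45, 60, 90, 120]

# The 3 extra columns, using the header names shown by df.head().
meta_cols = [
    "ProbeID",
    "geneName",
    "absolute expression, parental strain t=0 (arbitrary units)",
]

# One expression column per (strain, time) pair.
expr_cols = [f"{strain} t={t}" for strain in strains for t in times]
expected_cols = meta_cols + expr_cols

print(len(expr_cols), len(expected_cols))
```

Comparing `expected_cols` with `list(df.columns)` after loading the file is a quick way to catch a renamed or missing column.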

The other 32 columns contain log_2(L/L0), where L is the expression level for the given strain and time.
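
Since the stored values are log ratios, the absolute level can be recovered as L = L0 * 2^x. The sketch below uses the MID2 row from the table in this notebook (L0 = 1.48, wt t=10 value -0.466758); the helper name `to_absolute` is hypothetical. It also shows why a probe measured at exactly L0 has the value 0, which is consistent with the `wt t=0` column being 0 in the rows displayed.

```python
import math

def to_absolute(l0, log2_ratio):
    """Invert x = log2(L / L0) to recover L = L0 * 2**x."""
    return l0 * 2.0 ** log2_ratio

# Values for gene MID2 taken from the expression table.
l0_mid2 = 1.48
x_wt_t10 = -0.466758

level = to_absolute(l0_mid2, x_wt_t10)

# Converting back must reproduce the stored log ratio.
roundtrip = math.log2(level / l0_mid2)

print(level, roundtrip)
```

A negative log ratio means the level dropped below L0, and a ratio of 0 means the level equals L0.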

Slots for missing data have been filled with the string "PREDICT".
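
Because of the "PREDICT" strings, pandas reads the affected columns as text rather than numbers. One way to deal with this, shown here as a sketch on a toy frame, is to record where the placeholders are and then coerce the columns to floats with `pd.to_numeric(..., errors="coerce")`, which turns any non-numeric entry into NaN. The MID2 values are from this notebook; placing ALD5 and ARG1 in the missing block is for illustration only.

```python
import pandas as pd

# Toy frame: one fully observed gene and two genes with placeholders.
toy = pd.DataFrame({
    "geneName": ["MID2", "ALD5", "ARG1"],
    "gat1-deletion t=0": ["0.066427", "PREDICT", "PREDICT"],
    "gat1-deletion t=10": ["0.013043", "PREDICT", "PREDICT"],
})

# Expression columns are the ones whose header contains " t=".
value_cols = [c for c in toy.columns if " t=" in c]

# Boolean mask of the placeholder cells, kept for later reference.
is_missing = toy[value_cols].eq("PREDICT")

# Convert to floats; "PREDICT" becomes NaN.
numeric = toy[value_cols].apply(pd.to_numeric, errors="coerce")

print(int(is_missing.sum().sum()), "placeholder cells")
print(numeric.dtypes)
```

Keeping the mask separately makes it possible to tell the withheld block apart from any values that might be NaN for other reasons.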


Refs.
-----

https://dreamchallenges.org/dream-3-gene-expression-prediction/

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0008944

https://github.com/s-nakhawa/DREAM3-Gene-Expression-Prediction-Challenge

https://en.wikipedia.org/wiki/DREAM_Challenges

In [1]:
# Move up to the project root and put it on sys.path, so that relative
# data paths and project imports resolve from the project folder down.
import os
import sys
os.chdir('../')
sys.path.insert(0,os.getcwd())
print(os.getcwd())

C:\Users\rrtuc\Desktop\backed-up\python-projects\gene_causal_mapper


In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('data/DREAM3_GeneExpressionChallenge_ExpressionData_UPDATED.txt', sep='\t')
print(df.shape)
df.head(15)

(9335, 35)


Unnamed: 0,ProbeID,geneName,"absolute expression, parental strain t=0 (arbitrary units)",wt t=0,wt t=10,wt t=20,wt t=30,wt t=45,wt t=60,wt t=90,...,gcn4-deletion t=90,gcn4-deletion t=120,leu3-deletion t=0,leu3-deletion t=10,leu3-deletion t=20,leu3-deletion t=30,leu3-deletion t=45,leu3-deletion t=60,leu3-deletion t=90,leu3-deletion t=120
0,10000_at,,0.194,0,0.080088,-0.02148,-0.207893,0.070967,-0.066261,-0.063503,...,0.081614,-0.079975,-0.185232,-0.205393,-0.071763,-0.060739,0.0,0.042457,-0.146655,-0.246104
1,10001_at,MID2,1.48,0,-0.466758,-0.37629,-0.619178,-0.199123,-0.107018,-0.14274,...,0.302226,-0.147958,-0.149259,-0.382944,0.182786,-0.007196,0.02034,-0.052416,0.074001,0.187707
2,10002_i_at,,13.96,0,-0.312665,-0.146655,-0.640621,-0.097611,-0.200379,-0.304511,...,-0.478195,-0.6107,-0.385155,-0.590722,-0.319618,-0.109695,-0.468844,-0.442545,-0.284514,-0.356144
3,10003_f_at,RPS25B,132.6,0,-0.182692,-0.105678,-0.447844,-0.119024,-0.167358,-0.244887,...,-0.302173,-0.473008,-0.284514,-0.492622,-0.358396,-0.404903,-0.499782,-0.343692,-0.263034,-0.37629
4,10004_at,,0.148,0,0.164884,0.182786,0.186065,0.103147,0.129734,0.107803,...,0.112475,0.349235,0.036526,0.013043,-0.005759,0.057392,0.052895,0.163268,-0.02432,-0.064883
5,10005_at,NUP2,3.83,0,-0.238787,-0.140124,-0.80488,-0.153157,-0.164786,-0.358396,...,-0.128293,-0.599318,0.005782,-0.059355,0.239566,-0.062122,0.054392,-0.15056,0.128156,0.101598
6,10006_at,SGD1,1.125,0,0.54372,-0.178874,-0.718964,-0.052416,-0.209141,-0.178874,...,-0.303342,-0.321928,-0.439357,-0.246104,-0.147958,-0.112367,-0.24123,-0.24245,-0.197865,-0.331132
7,10007_at,VRP1,2.045,0,0.011588,-0.132248,-0.120352,-0.166073,-0.257011,-0.442545,...,-0.473008,-0.590722,-0.316146,0.078564,0.22095,-0.543991,0.018878,-0.489543,-0.266637,-0.097611
8,10008_at,,0.346,0,0.163268,0.205896,0.537424,0.186065,0.177882,0.080088,...,0.058894,0.246395,-0.14274,-0.166073,-0.097611,-0.00863,0.036526,0.030619,-0.16092,-0.053806
9,10009_at,,0.926,0,0.438307,0.16004,0.272297,0.352916,0.312939,-0.191563,...,-0.018634,-0.230203,0.189351,-0.07861,-0.073135,-0.073135,-0.053806,0.305788,0.110916,-0.052416
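
For analysis it is often easier to work with one row per (gene, strain, time) instead of one column per condition. The following is a sketch of that reshaping, assuming the `<strain> t=<minutes>` header pattern seen above; it uses two genes and four columns copied from the output, and the names `tidy`, `condition` and `log2_ratio` are made up for the example.

```python
import pandas as pd

# Small wide-format excerpt (values copied from the table above).
wide = pd.DataFrame({
    "geneName": ["MID2", "NUP2"],
    "wt t=0": [0.0, 0.0],
    "wt t=10": [-0.466758, -0.238787],
    "leu3-deletion t=0": [-0.149259, 0.005782],
    "leu3-deletion t=10": [-0.382944, -0.059355],
})

# Stack the condition columns into rows.
tidy = wide.melt(id_vars="geneName", var_name="condition",
                 value_name="log2_ratio")

# Split "leu3-deletion t=10" into strain and time in minutes.
parts = tidy["condition"].str.extract(r"^(?P<strain>.+) t=(?P<minutes>\d+)$")
tidy["strain"] = parts["strain"]
tidy["minutes"] = parts["minutes"].astype(int)
tidy = tidy.drop(columns="condition")

print(tidy)
```

With the time stored as an integer column, the unequal spacing of the time points (10, 15 and 30 minute gaps) can be handled explicitly in later steps.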


In [4]:
df = pd.read_csv('data/DREAM3GoldStandard_ExpressionChallenge.txt', sep='\t')
print(df.shape)
df.head(15)

(50, 10)


Unnamed: 0,probeID,gene_name,rank_time0,rank_time10,rank_time20,rank_time30,rank_time45,rank_time60,rank_time90,rank_time120
0,5646_at,ALD5,32,24,40,33,32,32,38,34
1,8593_at,ARG1,9,19,16,27,46,44,26,28
2,11114_at,ARG3,3,10,26,42,47,41,22,33
3,4537_at,ARG4,19,17,22,40,44,46,31,35
4,4568_at,ARN1,20,27,24,14,12,11,10,16
5,7291_at,BAP2,37,28,36,25,31,26,37,30
6,4321_at,BAT1,26,29,28,21,17,25,29,26
7,4899_at,CLB6,40,47,48,48,48,19,23,18
8,3190_f_at,COS8,46,46,47,44,43,38,44,44
9,10105_at,CRR1,47,43,43,46,23,22,9,6
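
The gold standard stores ranks rather than expression values, so a natural way to compare a prediction against it is a rank correlation. The sketch below computes Spearman's coefficient as the Pearson correlation of ranks. It is an illustration only: the predicted scores are invented, the helper `spearman` is hypothetical, and neither the official scoring of the challenge nor the direction of the ranking (whether rank 1 is the highest or the lowest value) is stated in this notebook.

```python
import pandas as pd

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    return pd.Series(a).rank().corr(pd.Series(b).rank())

# rank_time0 for ALD5, ARG1, ARG3, ARG4, ARN1 from the table above.
gold = [32, 9, 3, 19, 20]

# Hypothetical predicted scores for the same five genes (invented values).
predicted = [0.9, 0.2, 0.1, 0.5, 0.4]

rho = spearman(gold, predicted)
print(round(rho, 3))
```

Only the ordering of the predicted scores matters here, so any monotone rescaling of `predicted` gives the same coefficient.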


In [5]:
df = pd.read_csv('data/wild.tsv', sep='\t')
print(df.shape)
df.head(15)

(5112, 9)


Unnamed: 0,geneName,wt t=0,wt t=10,wt t=20,wt t=30,wt t=45,wt t=60,wt t=90,wt t=120
0,MID2,0,-0.466758,-0.37629,-0.619178,-0.199123,-0.107018,-0.14274,-0.210389
1,RPS25B,0,-0.182692,-0.105678,-0.447844,-0.119024,-0.167358,-0.244887,-0.194087
2,NUP2,0,-0.238787,-0.140124,-0.80488,-0.153157,-0.164786,-0.358396,-0.182692
3,SGD1,0,0.54372,-0.178874,-0.718964,-0.052416,-0.209141,-0.178874,-0.115033
4,VRP1,0,0.011588,-0.132248,-0.120352,-0.166073,-0.257011,-0.442545,-0.546956
5,ATP14,0,-0.364012,-0.834307,-1.027154,-0.292782,-0.253384,-0.360645,-0.393965
6,YHC1,0,0.187707,-0.124328,-0.606916,-0.182692,-0.149259,-0.192825,-0.020058
7,ECM38,0,-0.72334,-0.52306,-0.387363,-0.361768,-0.668119,-0.737254,-0.716332
8,EXG1,0,-0.763836,-0.545968,-0.97306,-0.909197,-0.801573,-0.757877,-0.741575
9,MET17,0,-0.037031,0.222633,0.246395,0.172994,-0.064883,-0.04684,0.007232


In [6]:
df = pd.read_csv('data/gat1_d.tsv', sep='\t')
print(df.shape)
df.head(15)

(5062, 9)


Unnamed: 0,geneName,gat1-deletion t=0,gat1-deletion t=10,gat1-deletion t=20,gat1-deletion t=30,gat1-deletion t=45,gat1-deletion t=60,gat1-deletion t=90,gat1-deletion t=120
0,MID2,0.066427,0.013043,0.150401,0.224317,0.37707,0.237864,-0.035624,0.08467
1,RPS25B,-0.132248,-0.24732,-0.664483,-0.225275,-0.124328,-0.166073,-0.642471,-0.129613
2,NUP2,-0.060739,0.081614,0.452057,0.051399,0.194295,0.125006,-0.320773,-0.09356
3,SGD1,-0.075875,0.107803,0.140826,0.289827,0.541618,0.067939,-0.631337,-0.22033
4,VRP1,-0.339137,-0.380729,0.011588,-0.298658,-0.16221,0.002888,-0.386259,-0.543991
5,ATP14,0.078564,-0.264236,-0.632268,-0.341417,-0.00863,-0.180148,-0.829444,-0.553852
6,YHC1,0.02327,0.21927,-0.064883,-0.159629,-0.034216,-0.15056,-0.385155,-0.002883
7,ECM38,-0.42975,-0.901494,-0.950842,-0.896078,-0.695994,-0.292782,-0.56267,-0.666302
8,EXG1,0.351074,0.0,0.057392,-0.094912,0.08467,0.222633,-0.446786,-0.037031
9,MET17,-0.151859,-0.308011,-0.098958,-0.007196,0.064917,0.02034,-0.129613,0.004335
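
This file has 5,062 rows while the other three strain files have 5,112, a difference of 50, which matches the size of the missing block. That suggests, though this notebook does not confirm it, that the 50 genes to be predicted were left out of this file. A sketch of how to check this by comparing gene names, using toy lists in place of the real `geneName` columns (the membership of ALD5 and ARG1 is assumed for illustration):

```python
import pandas as pd

# Toy stand-ins for the geneName columns of wild.tsv and gat1_d.tsv.
wild_genes = pd.Series(["MID2", "RPS25B", "ALD5", "ARG1"], name="geneName")
gat1_genes = pd.Series(["MID2", "RPS25B"], name="geneName")

# Genes present in the wild-type file but absent from the gat1 file.
only_in_wild = sorted(set(wild_genes) - set(gat1_genes))

# Row counts reported by the notebook for the two real files.
row_gap = 5112 - 5062

print(only_in_wild, row_gap)
```

On the real files, the resulting list could then be compared with the `gene_name` column of the gold standard.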


In [7]:
df = pd.read_csv('data/gcn4_d.tsv', sep='\t')
print(df.shape)
df.head(15)

(5112, 9)


Unnamed: 0,geneName,gcn4-deletion t=0,gcn4-deletion t=10,gcn4-deletion t=20,gcn4-deletion t=30,gcn4-deletion t=45,gcn4-deletion t=60,gcn4-deletion t=90,gcn4-deletion t=120
0,MID2,-0.146655,0.207561,0.367732,0.260152,0.312939,0.394032,0.302226,-0.147958
1,RPS25B,-0.326537,-0.298658,-0.214125,-0.074505,-0.423309,-0.534062,-0.302173,-0.473008
2,NUP2,-0.438293,0.021804,-0.228973,-0.301002,-0.284514,-0.123004,-0.128293,-0.599318
3,SGD1,-0.233888,0.698998,0.158429,-0.14274,-0.237564,-0.191563,-0.303342,-0.321928
4,VRP1,-0.195348,-0.202888,-0.186501,-0.273814,0.024737,-0.0229,-0.473008,-0.590722
5,ATP14,0.58208,-0.038436,0.107803,0.477944,0.650635,0.617056,0.067939,-0.313826
6,YHC1,-0.304511,0.039488,0.08467,0.089267,-0.098958,-0.225275,0.038006,-0.00863
7,ECM38,0.281036,-0.403813,-0.305679,-0.334568,-0.104337,0.08467,-0.325386,-0.804054
8,EXG1,0.253257,-0.141433,-0.090853,0.066427,-0.066261,0.086201,0.114035,-0.695994
9,MET17,-0.240009,-0.084064,-0.132248,-0.088142,-0.235114,-0.175045,-0.296311,0.318326


In [8]:
df = pd.read_csv('data/leu3_d.tsv', sep='\t')
print(df.shape)
df.head(15)

(5112, 9)


Unnamed: 0,geneName,leu3-deletion t=0,leu3-deletion t=10,leu3-deletion t=20,leu3-deletion t=30,leu3-deletion t=45,leu3-deletion t=60,leu3-deletion t=90,leu3-deletion t=120
0,MID2,-0.149259,-0.382944,0.182786,-0.007196,0.02034,-0.052416,0.074001,0.187707
1,RPS25B,-0.284514,-0.492622,-0.358396,-0.404903,-0.499782,-0.343692,-0.263034,-0.37629
2,NUP2,0.005782,-0.059355,0.239566,-0.062122,0.054392,-0.15056,0.128156,0.101598
3,SGD1,-0.439357,-0.246104,-0.147958,-0.112367,-0.24123,-0.24245,-0.197865,-0.331132
4,VRP1,-0.316146,0.078564,0.22095,-0.543991,0.018878,-0.489543,-0.266637,-0.097611
5,ATP14,-0.172488,-0.411426,-0.49057,-0.40163,-0.421156,-0.253384,-0.448901,-0.480265
6,YHC1,-0.001442,0.107803,-0.237564,0.090803,-0.176323,-0.005759,0.080088,0.186065
7,ECM38,0.526992,-0.140124,0.058894,-0.457332,-0.067639,-0.489543,-0.273814,-0.22033
8,EXG1,0.120294,-0.16092,0.018878,-0.292782,-0.028569,-0.136191,-0.146655,0.131313
9,MET17,-0.201634,-0.227741,-0.130931,-0.202888,-0.337996,-0.147958,-0.329985,-0.264236
