# Datasets: The Lee Yeast Connectivity Data

## Open Data Science Initiative

### 29th May 2014 Neil D. Lawrence

This data set comes from an early publication on [Chromatin immunoprecipitation](http://en.wikipedia.org/wiki/Chromatin_immunoprecipitation) experiments that determined which transcription factors bind to which genes in yeast, [Lee et al. (2002)](http://www.sciencemag.org/content/298/5594/799).

In [1]:
import pods
import pylab as plt
%matplotlib inline

In [2]:
data = pods.datasets.lee_yeast_ChIP()

Acquiring resource: lee_yeast_ChIP

Details of data: 
Binding location analysis for 106 regulators in yeast. The data consists of p-values for binding of regulators to genes derived from ChIP-chip experiments.

Please cite:
Tong Ihn Lee, Nicola J. Rinaldi, Francois Robert, Duncan T. Odom, Ziv Bar-Joseph, Georg K. Gerber, Nancy M. Hannett, Christopher T. Harbison, Craig M. Thompson, Itamar Simon, Julia Zeitlinger, Ezra G. Jennings, Heather L. Murray, D. Benjamin Gordon, Bing Ren, John J. Wyrick, Jean-Bosco Tagne, Thomas L. Volkert, Ernest Fraenkel, David K. Gifford, Richard A. Young 'Transcriptional Regulatory Networks in Saccharomyces cerevisiae' Science 298 (5594) pg 799--804. DOI: 10.1126/science.1075090

After downloading the data will take up 1674161 bytes of space.

Data will be stored in /Users/neill/ods_data_cache/lee_yeast_ChIP.

Do you wish to proceed with the download? [yes/no]
yes
binding_by_gene.tsv
Downloading  http://jura.wi.mit.edu/young_public/regulatory_network/binding

The data consists of $p$-values for the hypothesized relationships between the transcription factors and the genes. There are 113 transcription factors represented in `data['transcription_factors']`.

In [3]:
print(data['transcription_factors'])


['ABF1', 'ACE2', 'ADR1', 'ARG80', 'ARG81', 'ARO80', 'ASH1', 'AZF1', 'BAS1', 'CAD1', 'CBF1', 'CHA4', 'CIN5', 'CRZ1', 'CUP9', 'DAL81', 'DAL82', 'DIG1', 'DOT6', 'ECM22', 'FHL1', 'FKH1', 'FKH2', 'FZF1', 'GAL4', 'GAT1', 'GAT3', 'GCN4', 'GCR1', 'GCR2', 'GLN3', 'GRF10(Pho2)', 'GTS1', 'HAA1', 'HAL9', 'HAP2', 'HAP3', 'HAP4', 'HAP5', 'HIR1', 'HIR2', 'HMS1', 'HSF1', 'IME4', 'INO2', 'INO4', 'IXR1', 'LEU3', 'MAC1', 'MAL13', 'MAL33', 'MATa1', 'MBP1', 'MCM1', 'MET31', 'MET4', 'MIG1', 'MOT3', 'MSN1', 'MSN2', 'MSN4', 'MSS11', 'MTH1', 'NDD1', 'NRG1', 'PDR1', 'PHD1', 'PHO4', 'PUT3', 'RAP1', 'RCS1', 'REB1', 'RFX1', 'RGM1', 'RGT1', 'RIM101', 'RLM1', 'RME1', 'ROX1', 'RPH1', 'RTG1', 'RTG3', 'RTS2', 'SFL1', 'SFP1', 'SIG1', 'SIP4', 'SKN7', 'SKO1', 'SMP1', 'SOK2', 'SRD1', 'STB1', 'STE12', 'STP1', 'STP2', 'SUM1', 'SWI4', 'SWI5', 'SWI6', 'THI2', 'UGA3', 'USV1', 'YAP1', 'YAP3', 'YAP5', 'YAP6', 'YAP7', 'YBR267W', 'YFL044C', 'YJL206C', 'ZAP1', 'ZMS1']


The 6270 gene names and their annotations are given in `data['annotations']`.

A `pandas` data frame containing all the $p$-values for binding between genes and transcription factors is available in `data['Y']`.
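Since the frame holds one $p$-value per gene and transcription factor pair, a binary connectivity matrix can be obtained by thresholding it. The following is a minimal sketch rather than the authors' procedure: it uses a small made-up frame `Y` in place of `data['Y']`, and the gene names, the values and the threshold of 0.001 are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in for data['Y']: rows are genes, columns are
# transcription factors, entries are p-values (all values invented).
Y = pd.DataFrame(
    {"ABF1": [0.0004, 0.52, 0.91],
     "ACE2": [0.30, 0.00002, 0.75]},
    index=["geneA", "geneB", "geneC"],
)

threshold = 0.001  # assumed cut-off; choose it to suit the analysis

# True wherever the p-value falls below the cut-off, i.e. where the
# factor is taken to bind the gene.
connectivity = Y < threshold

# Number of genes each factor is taken to bind at this threshold.
bound_per_factor = connectivity.sum(axis=0)
print(bound_per_factor)
```

With the real data, replacing `Y` by `data['Y']` should give the same kind of boolean frame, provided every column is numeric.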

In [4]:
data['Y'].describe()

| | ABF1 | ACE2 | ADR1 | ARG80 | ARG81 | ARO80 | ASH1 | AZF1 | BAS1 | CAD1 | ... | YAP1 | YAP3 | YAP5 | YAP6 | YAP7 | YBR267W | YFL044C | YJL206C | ZAP1 | ZMS1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6270.0 | 6270.0 | 6270.0 | 6270.0 | 6270.0 | 6270.0 | 6270.0 | 6270.0 | 6270.0 | 6270.0 | ... | 6270.0 | 6270.0 | 6270.0 | 6270.0 | 6270.0 | 6270.0 | 6270.0 | 6270.0 | 6270.0 | 6270.0 |
| mean | 0.520917 | 0.4953694 | 0.467962 | 0.5647799 | 0.5676383 | 0.5386131 | 0.46732 | 0.524318 | 0.5387386 | 0.5160878 | ... | 0.5186252 | 0.529313 | 0.559987 | 0.551269 | 0.526091 | 0.556157 | 0.539022 | 0.5175713 | 0.5671431 | 0.560589 |
| std | 0.270981 | 0.2823651 | 0.2792 | 0.2773574 | 0.2715823 | 0.2787993 | 0.29646 | 0.293448 | 0.274398 | 0.2852743 | ... | 0.2882458 | 0.305721 | 0.268339 | 0.2570737 | 0.29869 | 0.308842 | 0.289868 | 0.3058519 | 0.2819108 | 0.314172 |
| min | 3.7e-05 | 4.6e-11 | 1e-06 | 8.1e-09 | 1.7e-15 | 6.4e-15 | 0.00033 | 0.00024 | 1.1e-16 | 1.2e-10 | ... | 4.7e-08 | 0.0013 | 0.0 | 4.1e-15 | 0.0023 | 0.012 | 8e-05 | 4.3e-13 | 8.9e-16 | 0.00021 |
| 25% | 0.38 | 0.26 | 0.24 | 0.36 | 0.37 | 0.33 | 0.18 | 0.28 | 0.32 | 0.3 | ... | 0.28 | 0.25 | 0.38 | 0.39 | 0.26 | 0.27 | 0.31 | 0.26 | 0.35 | 0.3 |
| 50% | 0.57 | 0.49 | 0.44 | 0.58 | 0.58 | 0.54 | 0.47 | 0.53 | 0.55 | 0.47 | ... | 0.53 | 0.52 | 0.55 | 0.59 | 0.51 | 0.56 | 0.54 | 0.5 | 0.57 | 0.55 |
| 75% | 0.72 | 0.73 | 0.68 | 0.81 | 0.8 | 0.78 | 0.7 | 0.78 | 0.77 | 0.76 | ... | 0.76 | 0.8 | 0.77 | 0.75 | 0.79 | 0.84 | 0.7975 | 0.78 | 0.81 | 0.86 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
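The `min` row of the summary gives the smallest $p$-value for each factor but not the gene it belongs to. A short sketch of how that gene could be recovered with `idxmin` follows. The frame `Y` is again a made-up stand-in for `data['Y']`, with hypothetical gene names and invented values.

```python
import pandas as pd

# Hypothetical stand-in for data['Y'] (genes as rows, factors as columns).
Y = pd.DataFrame(
    {"ABF1": [0.0004, 0.52, 0.91],
     "ACE2": [0.30, 0.00002, 0.75]},
    index=["geneA", "geneB", "geneC"],
)

# For each factor (column), the gene with the smallest p-value and that value.
strongest = pd.DataFrame({
    "gene": Y.idxmin(axis=0),
    "p_value": Y.min(axis=0),
})

# Sort so the most confident factor-gene pair comes first.
print(strongest.sort_values("p_value"))
```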
