# STATS 604: Gene expression in the Brain
## Team Members: Ben Agyare, Yumeng Wang, Jake Trauger, Yash Patel

## Code Setup
We start by loading the data and pulling out the tables used in the rest of the analysis.

In [10]:
import pickle
with open("data/brain.pkl", "rb") as f:
    dataset = pickle.load(f)

In [27]:
expression = dataset["expression"]
genes = dataset["genes"]
samples = dataset["samples"]

## Normalization
We need to normalize the expression values, possibly using the bacterial spike-ins. According to the following two sources:
- https://support.bioconductor.org/p/49150/
- https://bioinformatics.mdanderson.org/MicroarrayCourse/Lectures/ma07b.pdf
two common normalization techniques are spike-in normalization and quantile normalization. The former assumes little beyond the spike-in amounts being experimentally consistent across arrays. Since that consistency could itself vary across labs, we should verify it before applying spike-in normalization. Quantile normalization assumes that most genes are *not* differentially expressed (that only a handful are).
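To make the quantile normalization idea concrete, here is a minimal sketch on toy data. It assumes the same layout as `expression` (one row per sample, one column per probe); the helper name `quantile_normalize` and the toy values are hypothetical, and ties are broken by position rather than averaged, which a production implementation may handle differently.

```python
import numpy as np
import pandas as pd

def quantile_normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Force every row (sample) of df onto one shared reference distribution."""
    values = df.to_numpy(dtype=float)
    # Sort within each sample: column k holds each sample's k-th smallest value.
    sorted_vals = np.sort(values, axis=1)
    # Reference distribution: average of the k-th smallest value across samples.
    reference = sorted_vals.mean(axis=0)
    # 0-based rank of each probe within its own sample (ties broken by position).
    ranks = values.argsort(axis=1).argsort(axis=1)
    # Replace each value by the reference value at its rank.
    return pd.DataFrame(reference[ranks], index=df.index, columns=df.columns)

# Toy data only: 3 samples x 4 probes.
toy = pd.DataFrame(
    [[5.0, 2.0, 3.0, 4.0],
     [4.0, 1.0, 4.5, 2.0],
     [3.0, 4.0, 6.0, 8.0]],
    index=["s1", "s2", "s3"],
    columns=["g1", "g2", "g3", "g4"],
)
normalized = quantile_normalize(toy)
print(normalized)
```

After this step every sample has exactly the same set of values, only in a different probe order. This is why the method leans on the assumption that most genes are not differentially expressed.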

Let's start with the simpler technique, spike-in normalization, by identifying the control probes.

In [42]:
# Probes with no chromosome annotation
nan_genes = genes[genes["chrom"].isna()].index.values
# Affymetrix control probes (including the spike-ins) carry the "AFFX" prefix
controls = [gene for gene in nan_genes if gene.startswith("AFFX")]
print(controls)

['AFFX-BioB-3_at', 'AFFX-BioB-3_st', 'AFFX-BioB-5_at', 'AFFX-BioB-5_st', 'AFFX-BioB-M_at', 'AFFX-BioB-M_st', 'AFFX-BioC-3_at', 'AFFX-BioC-3_st', 'AFFX-BioC-5_at', 'AFFX-BioC-5_st', 'AFFX-BioDn-3_st', 'AFFX-BioDn-5_at', 'AFFX-BioDn-5_st', 'AFFX-CreX-3_at', 'AFFX-CreX-3_st', 'AFFX-CreX-5_at', 'AFFX-CreX-5_st', 'AFFX-DapX-3_at', 'AFFX-DapX-5_at', 'AFFX-DapX-M_at', 'AFFX-HUMRGE/M10098_3_at', 'AFFX-HUMRGE/M10098_5_at', 'AFFX-HUMRGE/M10098_M_at', 'AFFX-LysX-3_at', 'AFFX-LysX-5_at', 'AFFX-LysX-M_at', 'AFFX-M27830_3_at', 'AFFX-M27830_5_at', 'AFFX-M27830_M_at', 'AFFX-MurFAS_at', 'AFFX-MurIL10_at', 'AFFX-MurIL2_at', 'AFFX-MurIL4_at', 'AFFX-PheX-3_at', 'AFFX-PheX-5_at', 'AFFX-PheX-M_at', 'AFFX-ThrX-3_at', 'AFFX-ThrX-5_at', 'AFFX-ThrX-M_at', 'AFFX-TrpnX-3_at', 'AFFX-TrpnX-5_at', 'AFFX-TrpnX-M_at', 'AFFX-YEL002c/WBP1_at', 'AFFX-YEL018w/_at', 'AFFX-YEL021w/URA3_at', 'AFFX-YEL024w/RIP1_at', 'AFFX-hum_alu_at']
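With a control list like this in hand, one possible way to check spike-in consistency across labs and then apply a simple spike-in normalization is sketched below. It runs on toy data, not on the real `expression` table. The helper names `spike_in_summary` and `spike_in_normalize` are hypothetical, the values are assumed to be on a log scale (so normalization is an additive shift), and which AFFX probes should actually be treated as spike-ins is left as an open assumption.

```python
import pandas as pd

def spike_in_summary(expression, samples, controls):
    """Mean control-probe intensity per sample, summarized by lab."""
    per_sample = expression[controls].mean(axis=1)
    # Group the per-sample means by the lab that processed each sample.
    return per_sample.groupby(samples["lab"]).agg(["mean", "std"])

def spike_in_normalize(expression, controls):
    """Shift each sample so its mean control intensity equals the overall mean."""
    per_sample = expression[controls].mean(axis=1)
    offset = per_sample - per_sample.mean()
    # Subtract one offset per row (sample); assumes log-scale intensities.
    return expression.sub(offset, axis=0)

# Toy data only: 4 samples, 1 ordinary probe, 2 control probes.
toy_expression = pd.DataFrame(
    {"gene_a": [8.0, 10.0, 6.0, 9.0],
     "ctrl_1": [7.0, 9.0, 5.0, 7.0],
     "ctrl_2": [7.0, 9.0, 5.0, 7.0]},
    index=["a", "b", "c", "d"],
)
toy_samples = pd.DataFrame(
    {"lab": ["Davis", "Irvine", "Davis", "Irvine"]},
    index=["a", "b", "c", "d"],
)
toy_controls = ["ctrl_1", "ctrl_2"]

summary = spike_in_summary(toy_expression, toy_samples, toy_controls)
toy_normalized = spike_in_normalize(toy_expression, toy_controls)
print(summary)
print(toy_normalized)
```

Large differences between labs in the summary would suggest that the spike-in amounts were not handled consistently, in which case a per-sample shift like this one could remove real signal or introduce bias.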


In [32]:
expression

Unnamed: 0,1000_at,1001_at,1002_f_at,1003_s_at,1004_at,1005_at,1006_at,1007_s_at,1008_f_at,1009_at,...,AFFX-ThrX-5_at,AFFX-ThrX-M_at,AFFX-TrpnX-3_at,AFFX-TrpnX-5_at,AFFX-TrpnX-M_at,AFFX-YEL002c/WBP1_at,AFFX-YEL018w/_at,AFFX-YEL021w/URA3_at,AFFX-YEL024w/RIP1_at,AFFX-hum_alu_at
01_a_D_f_2.CEL,9.521934,7.453767,7.045636,7.743690,7.728274,8.081243,6.927978,9.294152,8.888816,8.290944,...,7.065597,6.803698,6.631766,6.981474,7.003319,6.999630,7.005495,6.864895,7.030665,14.394582
01_a_I_f_2.CEL,10.930684,9.455482,9.233291,9.777128,9.612399,9.845444,9.149639,10.580062,11.505571,9.846817,...,9.366604,9.067484,8.945584,9.118070,9.137492,9.067202,9.129742,9.256544,9.171725,15.494106
01_a_M_f_1.CEL,6.852731,5.298974,5.033266,6.040661,5.890083,5.810144,5.098923,6.240855,7.280948,6.124910,...,5.374457,5.165619,5.032617,5.382904,5.509690,5.113946,5.273243,5.321357,5.321636,12.856782
01_c_D_f_1.CEL,7.285181,6.258114,6.119443,6.631768,6.744592,6.589478,6.156638,7.417750,8.663882,8.203275,...,6.332162,6.278864,6.108778,6.341791,6.339638,6.167335,6.170734,6.734936,6.275118,13.683484
01_c_I_f_2.CEL,11.224543,9.800931,9.407753,10.113212,9.871853,10.495533,9.449701,11.084619,11.969619,10.479006,...,9.443910,9.242560,9.272848,9.435175,9.448927,9.395768,9.306998,9.863770,9.404475,15.494106
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10_c_I_f_2.CEL,10.834992,9.375171,9.100567,9.766162,9.441591,10.245868,9.065520,10.590598,11.039345,10.225186,...,9.197899,8.932622,8.950766,9.120934,9.018108,9.007320,9.010432,9.093493,8.986246,15.493637
10_c_M_f_1.CEL,6.990507,5.671513,5.398236,6.182388,6.088754,6.074909,5.479757,6.328909,7.675669,6.726671,...,5.563038,5.453313,5.228861,5.498717,5.515519,5.379649,5.460068,5.455070,5.488273,12.665336
10_d_D_f_2.CEL,9.143998,6.862450,6.357731,7.134995,7.046711,7.730850,6.183967,9.414389,9.065153,8.863422,...,6.258799,6.027364,5.907490,6.171226,6.130500,6.080228,5.966166,6.135691,6.023620,14.244014
10_d_I_f_2.CEL,10.963519,9.338467,9.047548,9.696127,9.541697,9.792356,9.073002,10.775091,11.373533,10.350302,...,9.126488,8.891157,8.935299,9.080840,9.166819,9.042159,9.048415,9.007950,9.115568,15.493855


In [41]:
genes.loc["1155_at"]

sym      NaN
chrom    NaN
Name: 1155_at, dtype: object

In [34]:
samples

rownames,patient,sex,region,lab,chip.version
01_a_D_f_2.CEL,patient_01,female,A.C. cortex,Davis,v2
01_a_I_f_2.CEL,patient_01,female,A.C. cortex,Irvine,v2
01_a_M_f_1.CEL,patient_01,female,A.C. cortex,Michigan,v1
01_c_D_f_1.CEL,patient_01,female,cerebellum,Davis,v1
01_c_I_f_2.CEL,patient_01,female,cerebellum,Irvine,v2
...,...,...,...,...,...
10_c_I_f_2.CEL,patient_10,female,cerebellum,Irvine,v2
10_c_M_f_1.CEL,patient_10,female,cerebellum,Michigan,v1
10_d_D_f_2.CEL,patient_10,female,D.L.P.F. cortex,Davis,v2
10_d_I_f_2.CEL,patient_10,female,D.L.P.F. cortex,Irvine,v2
