In [4]:
import numpy as np
import pandas as pd
import pickle

Data are currently stored on L0's /fdata drive in the ohsu_data directory.

In [2]:
DATADIR = '/fdata/ohsu_data/'

## metadata

The 'final_frame.p' file contains a metadata table with one row per sample, where a sample is one patient paired with one drug. Each row identifies the patient and the drug, and reports several closely related measures of the same outcome, cell death. For the drug name I use the 'new drug' column, which has been cleaned up to resolve ambiguities in the original labels. As the outcome, I typically use 'IC50'.
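As a rough sketch of how those two columns might be used, the snippet below builds a small toy table (the values are invented, and `toy_meta` is a hypothetical stand-in for `meta_data`) and pulls out the cleaned drug name and the outcome. Averaging replicates is just one possible choice, not necessarily what was done for this dataset.

```python
import pandas as pd

# Toy stand-in for meta_data. Column names follow the description above;
# every value here is invented for illustration.
toy_meta = pd.DataFrame(
    {
        "patient_id": [1011, 1011, 1011, 1012],
        "drug": ["Quizartinib (AC220)", "Quizartinib (AC220)", "Vargetef", "Vargetef"],
        "new drug": ["Quizartinib", "Quizartinib", "Nintedanib", "Nintedanib"],
        "replicant": [1, 2, 1, 1],
        "IC50": [0.5, 1.5, 2.0, 4.0],
    }
)

# Keep the patient, the cleaned drug label, and the outcome.
labels = toy_meta[["patient_id", "new drug", "IC50"]]

# Select every sample for one drug by its cleaned name.
quizartinib = labels[labels["new drug"] == "Quizartinib"]

# Assumption: collapse replicates by averaging IC50 per (patient, drug) pair.
per_pair_ic50 = labels.groupby(["patient_id", "new drug"])["IC50"].mean()
```

Because 'new drug' contains a space, it has to be accessed with bracket notation (`df["new drug"]`) rather than attribute access.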

In [5]:
with open(DATADIR + 'final_frame.p', 'rb') as f:
    meta_data = pickle.load(f)

In [6]:
meta_data

Unnamed: 0,patient_id,diagnosis,specific_diagnosis,lab_id,specimen_type,Panel_version,drug,replicant,IC10,IC25,...,Area_under_the_curve,Median_IC10,Median_IC25,Median_IC50,Median_IC75,Median_IC90,Median_area_under_the_curve,Model_Curve,new drug,seq_id
18923,1011,ACUTE MYELOID LEUKAEMIA (AML) AND RELATED PREC...,AML with mutated NPM1,13-00098,Bone Marrow Aspirate,MarcTest 384 123v4,Quizartinib (AC220),1,0.055029,0.308608,...,613.315238,0.016933,0.059876,0.340180,10.000000,10.000000,413.389659,2.24157311x^3 + 5.9876393x^2 - 15.01468363x + ...,Quizartinib,13-00098
18924,1011,ACUTE MYELOID LEUKAEMIA (AML) AND RELATED PREC...,AML with mutated NPM1,13-00098,Bone Marrow Aspirate,MarcTest 384 123v4,Quizartinib (AC220),2,0.009248,0.022329,...,657.542105,0.016933,0.059876,0.340180,10.000000,10.000000,413.389659,7.88481776x^3 + 36.65764011x^2 + 15.31003749x ...,Quizartinib,13-00098
18925,1011,ACUTE MYELOID LEUKAEMIA (AML) AND RELATED PREC...,AML with mutated NPM1,13-00098,Bone Marrow Aspirate,MarcTest 384 123v4,Quizartinib (AC220),3,0.007648,0.013755,...,424.350312,0.016933,0.059876,0.340180,10.000000,10.000000,413.389659,-8.5916776x^3 - 11.8196341x^2 - 3.75424343x + ...,Quizartinib,13-00098
18926,1011,ACUTE MYELOID LEUKAEMIA (AML) AND RELATED PREC...,AML with mutated NPM1,13-00098,Bone Marrow Aspirate,MarcTest 384 123v4,Roscovitine (CYC-202),1,4.885586,10.000000,...,917.676340,1.939227,7.984526,10.000000,10.000000,10.000000,843.363489,1.47109788x^3 - 2.67144602x^2 - 15.28992707x +...,Seliciclib,13-00098
18927,1011,ACUTE MYELOID LEUKAEMIA (AML) AND RELATED PREC...,AML with mutated NPM1,13-00098,Bone Marrow Aspirate,MarcTest 384 123v4,Tivozanib (AV-951),1,0.008470,4.819194,...,736.707350,0.009585,0.044267,1.462338,10.000000,10.000000,450.082158,-12.46228649x^3 - 26.77590505x^2 - 0.95445888x...,Tivozanib,13-00098
18928,1011,ACUTE MYELOID LEUKAEMIA (AML) AND RELATED PREC...,AML with mutated NPM1,13-00098,Bone Marrow Aspirate,MarcTest 384 123v4,BMS-345541,1,0.015437,0.308159,...,599.717792,0.688068,2.315533,5.260637,9.046746,10.000000,557.583397,-4.19860119x^3 - 7.55645437x^2 - 10.3254099x +...,BMS-345541,13-00098
18929,1011,ACUTE MYELOID LEUKAEMIA (AML) AND RELATED PREC...,AML with mutated NPM1,13-00098,Bone Marrow Aspirate,MarcTest 384 123v4,TG100-115,1,0.020214,7.948076,...,803.910607,0.459872,3.792677,10.000000,10.000000,10.000000,743.009723,-4.17787792x^3 - 8.61529358x^2 - 3.6153064x + ...,TG100-115,13-00098
18930,1011,ACUTE MYELOID LEUKAEMIA (AML) AND RELATED PREC...,AML with mutated NPM1,13-00098,Bone Marrow Aspirate,MarcTest 384 123v4,ABT-737,1,0.004682,0.801178,...,827.853609,0.030441,0.141805,1.181524,10.000000,10.000000,448.971722,18.78019266x^3 + 36.21899181x^2 - 25.16002439x...,ABT-737,13-00098
18931,1011,ACUTE MYELOID LEUKAEMIA (AML) AND RELATED PREC...,AML with mutated NPM1,13-00098,Bone Marrow Aspirate,MarcTest 384 123v4,SB-202190,1,0.024877,0.062050,...,598.950736,0.012540,1.326725,10.000000,10.000000,10.000000,763.559802,14.42165243x^3 + 43.17969746x^2 - 2.45465294x ...,SB-202190,13-00098
18932,1011,ACUTE MYELOID LEUKAEMIA (AML) AND RELATED PREC...,AML with mutated NPM1,13-00098,Bone Marrow Aspirate,MarcTest 384 123v4,Vargetef,1,0.027093,0.133565,...,510.669946,0.030376,0.141904,1.490927,10.000000,10.000000,442.992256,1.92247807x^3 + 5.54182684x^2 - 16.9453122x + ...,Nintedanib,13-00098


## rna-seq

Each row of the rna_seq file contains the preprocessed RNA-seq data for the patient in the corresponding sample. (Note: I can get you the raw RNA-seq data if you want it; the preprocessed version is probably just easier to work with.)

RNA-seq gene expression data must be carefully preprocessed to preserve a quality signal for prediction while removing noise and batch effects. For the biological data experiments, the RNA-seq data were preprocessed as follows:

1. First, raw transcript counts were converted to fragments per kilobase of exon model per million mapped reads (FPKM). FPKM is more reflective of the molar amount of a transcript in the original sample than raw counts, as it normalizes the counts for different RNA lengths and for the total number of reads \citep{Mortazavi2008}. FPKM is calculated as $\mathrm{FPKM}_i = \frac{X_i \times 10^9}{N l_i}$, where $X_i$ is the raw count for a transcript, $l_i$ is the effective length of the transcript, and $N$ is the total number of counts.
2. Next, we removed non-protein-coding transcripts from the dataset.
3. We removed transcripts that were not meaningfully observed in our dataset by dropping any transcript for which more than $70\%$ of measurements across all samples were equal to 0.
4. We $\log_2$ transformed the data.
5. We standardized each transcript across all samples, such that the mean of the transcript was equal to zero and its variance was equal to one: $X_i' = \frac{X_i - \mu_i}{\sigma_i}$, where $X_i$ is the expression of a transcript, $\mu_i$ is the mean expression of that transcript, and $\sigma_i$ is the standard deviation of that transcript across all samples.
6. Finally, we corrected for batch effects in the measurements using the ComBat tool available in the sva R package.
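Steps 1 through 5 above can be sketched in pandas roughly as follows. This is an illustration on toy data, not the code that produced `X_rna_seq`. The function names, the `protein_coding` list, the `+ 1` pseudocount inside the log (the source does not say how zeros were handled), and the use of the population standard deviation (`ddof=0`) are all assumptions. Step 6 (ComBat) runs in R and is not reproduced here.

```python
import numpy as np
import pandas as pd


def to_fpkm(counts, lengths):
    """Step 1: FPKM = X_i * 1e9 / (N * l_i).

    counts: samples x transcripts table of raw counts.
    lengths: effective transcript lengths, indexed by transcript name.
    """
    total = counts.sum(axis=1)  # N, the total count for each sample
    return counts.mul(1e9).div(lengths, axis=1).div(total, axis=0)


def preprocess(counts, lengths, protein_coding, max_zero_frac=0.70):
    expr = to_fpkm(counts, lengths)
    # Step 2: keep protein-coding transcripts (hypothetical annotation list).
    expr = expr.loc[:, expr.columns.isin(protein_coding)]
    # Step 3: drop transcripts that are zero in more than 70% of samples.
    zero_frac = (expr == 0).mean(axis=0)
    expr = expr.loc[:, zero_frac <= max_zero_frac]
    # Step 4: log2 transform; the +1 pseudocount is an assumption.
    logged = np.log2(expr + 1.0)
    # Step 5: standardize each transcript across samples (ddof=0 assumed).
    return (logged - logged.mean(axis=0)) / logged.std(axis=0, ddof=0)


# Toy inputs, invented for illustration.
toy_counts = pd.DataFrame(
    {"A": [10, 20, 30, 40], "B": [0, 0, 0, 5], "C": [5, 5, 10, 20], "NC": [5, 5, 5, 5]},
    index=["s1", "s2", "s3", "s4"],
)
toy_lengths = pd.Series({"A": 1000, "B": 500, "C": 2000, "NC": 800})

z = preprocess(toy_counts, toy_lengths, protein_coding=["A", "B", "C"])
```

In this toy run, `NC` is removed at step 2 and `B` at step 3 (it is zero in 75% of samples), so `z` keeps only `A` and `C`, each with mean zero and unit variance.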

In [7]:
# load data
with open(DATADIR + 'X_rna_seq.p', 'rb') as f:
    X_rna_seq = pickle.load(f)

# drop duplicated gene-symbol columns, keeping the first occurrence of each
X_rna_seq = X_rna_seq.loc[:, ~X_rna_seq.columns.duplicated()]

In [8]:
X_rna_seq

Symbol,TSPAN6,DPM1,SCYL3,C1orf112,FGR,CFH,FUCA2,GCLC,NFYA,STPG1,...,MTRNR2L6,ASB3,SRXN1,CTAGE6,MILR1,GTF2H5,NUDT3,MUSTN1,DOC2B,SNURF
18923,-0.528576,1.052527,-0.379285,1.594793,-2.279940,-0.546372,-0.010282,0.646766,0.903511,-0.503991,...,-0.382778,1.954434,-0.324074,0.230234,-1.268304,-0.890903,-1.416095,-0.592036,-0.664326,-0.219599
18924,-0.528576,1.052527,-0.379285,1.594793,-2.279940,-0.546372,-0.010282,0.646766,0.903511,-0.503991,...,-0.382778,1.954434,-0.324074,0.230234,-1.268304,-0.890903,-1.416095,-0.592036,-0.664326,-0.219599
18925,-0.528576,1.052527,-0.379285,1.594793,-2.279940,-0.546372,-0.010282,0.646766,0.903511,-0.503991,...,-0.382778,1.954434,-0.324074,0.230234,-1.268304,-0.890903,-1.416095,-0.592036,-0.664326,-0.219599
18926,-0.528576,1.052527,-0.379285,1.594793,-2.279940,-0.546372,-0.010282,0.646766,0.903511,-0.503991,...,-0.382778,1.954434,-0.324074,0.230234,-1.268304,-0.890903,-1.416095,-0.592036,-0.664326,-0.219599
18927,-0.528576,1.052527,-0.379285,1.594793,-2.279940,-0.546372,-0.010282,0.646766,0.903511,-0.503991,...,-0.382778,1.954434,-0.324074,0.230234,-1.268304,-0.890903,-1.416095,-0.592036,-0.664326,-0.219599
18928,-0.528576,1.052527,-0.379285,1.594793,-2.279940,-0.546372,-0.010282,0.646766,0.903511,-0.503991,...,-0.382778,1.954434,-0.324074,0.230234,-1.268304,-0.890903,-1.416095,-0.592036,-0.664326,-0.219599
18929,-0.528576,1.052527,-0.379285,1.594793,-2.279940,-0.546372,-0.010282,0.646766,0.903511,-0.503991,...,-0.382778,1.954434,-0.324074,0.230234,-1.268304,-0.890903,-1.416095,-0.592036,-0.664326,-0.219599
18930,-0.528576,1.052527,-0.379285,1.594793,-2.279940,-0.546372,-0.010282,0.646766,0.903511,-0.503991,...,-0.382778,1.954434,-0.324074,0.230234,-1.268304,-0.890903,-1.416095,-0.592036,-0.664326,-0.219599
18931,-0.528576,1.052527,-0.379285,1.594793,-2.279940,-0.546372,-0.010282,0.646766,0.903511,-0.503991,...,-0.382778,1.954434,-0.324074,0.230234,-1.268304,-0.890903,-1.416095,-0.592036,-0.664326,-0.219599
18932,-0.528576,1.052527,-0.379285,1.594793,-2.279940,-0.546372,-0.010282,0.646766,0.903511,-0.503991,...,-0.382778,1.954434,-0.324074,0.230234,-1.268304,-0.890903,-1.416095,-0.592036,-0.664326,-0.219599
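Because each sample is a patient plus a drug, every sample from the same patient carries the same expression row, which is why the rows above are identical. The sketch below shows one way the two tables might be paired up to build features and an outcome for a single drug. It uses toy frames with invented values, and it assumes the expression table and the metadata table share the same row index, which appears to be the case in the outputs above but should be checked on the real data.

```python
import pandas as pd

# Toy stand-ins for meta_data and X_rna_seq; all values are invented.
toy_meta = pd.DataFrame(
    {
        "patient_id": [1011, 1011, 1012],
        "new drug": ["Quizartinib", "Tivozanib", "Quizartinib"],
        "IC50": [0.3, 1.5, 2.0],
    },
    index=[18923, 18926, 19000],
)
toy_rna = pd.DataFrame(
    [[-0.5, 1.0, -0.5], [-0.5, 1.0, -0.5], [0.7, -0.2, 0.7]],
    index=[18923, 18926, 19000],
    columns=["TSPAN6", "DPM1", "TSPAN6"],  # note the repeated gene symbol
)

# Same de-duplication as in the loading cell: keep the first copy of each symbol.
toy_rna = toy_rna.loc[:, ~toy_rna.columns.duplicated()]

# Assumption: both tables are indexed by the same sample ids, in the same order.
assert toy_rna.index.equals(toy_meta.index)

# Features and outcome for one drug.
drug_mask = toy_meta["new drug"] == "Quizartinib"
X = toy_rna.loc[drug_mask]
y = toy_meta.loc[drug_mask, "IC50"]
```

If the indices do not match exactly on the real data, an explicit join on the index would be a safer way to line up rows than a boolean mask.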
