# Healthcare Hack Nights: Part I

In [None]:
## RNA-Seq
Questions that can be answered by RNA-seq:
    - What genes are differentially expressed between groups of samples?
    - How does gene expression change across time or conditions? (e.g., in benign vs. malignant tumors)
    - What pathways or processes are enriched under a condition?

In [151]:
%matplotlib inline

In [15]:
import pandas as pd

In [241]:
lihc = pd.read_csv('../lihc_rnaseq.csv')
lihc.set_index('bcr_patient_barcode', inplace=True)
lihc.shape

(423, 20531)

In this count matrix, each column represents a gene (labeled `symbol|Entrez gene ID`), each row is the sequenced RNA library of one patient sample, and the values give the number of fragments assigned to the respective gene in each library. Note that the values shown below are fractional, which suggests they are normalized expression estimates rather than raw integer counts. We also have additional information on each of the patient samples (the rows of the count matrix) and on each of the genes (the columns of the matrix).
                    
We now have all the ingredients to prepare our data object in a form that is suitable for analysis, namely:

 - countdata: a table with the fragment counts

 - rowdata: a table with information about the patient samples
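
The notebook does not build the `rowdata` table itself. As a minimal sketch, assuming the standard TCGA barcode layout (project, tissue source site, participant, sample, portion, plate, center), part of the sample information can be derived from the barcodes alone. The helper `barcode_to_row` and the column names are hypothetical, and only three sample-type codes are listed:

```python
import pandas as pd

# Barcodes copied from the outputs shown in this notebook.
barcodes = [
    "TCGA-2V-A95S-01A-11R-A37K-07",
    "TCGA-2Y-A9GS-01A-12R-A38B-07",
    "TCGA-ZS-A9CF-02A-11R-A38B-07",
]

# A few standard TCGA sample-type codes (not the full list).
SAMPLE_TYPES = {
    "01": "Primary Solid Tumor",
    "02": "Recurrent Solid Tumor",
    "11": "Solid Tissue Normal",
}

def barcode_to_row(barcode):
    """Split one TCGA barcode into patient ID and sample type (hypothetical helper)."""
    parts = barcode.split("-")
    code = parts[3][:2]  # first two characters of the sample field, e.g. '01A' -> '01'
    return {
        "patient": "-".join(parts[:3]),
        "sample_type_code": code,
        "sample_type": SAMPLE_TYPES.get(code, "other"),
    }

rowdata = pd.DataFrame([barcode_to_row(b) for b in barcodes], index=barcodes)
print(rowdata)
```

Clinical variables such as recurrence or treatment would still have to come from a separate clinical table joined on the patient ID.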


In [18]:
lihc.head()

Unnamed: 0,bcr_patient_barcode,?|100130426,?|100133144,?|100134869,?|10357,?|10431,?|136542,?|155060,?|26823,?|280660,...,ZXDA|7789,ZXDB|158586,ZXDC|79364,ZYG11A|440590,ZYG11B|79699,ZYX|7791,ZZEF1|23140,ZZZ3|26009,psiTPTE22|387590,tAKR|389932
0,TCGA-2V-A95S-01A-11R-A37K-07,0.0,1.5051,3.7074,90.1124,1017.1038,0.0,141.3911,0.6516,0.0,...,24.7597,273.6602,794.2662,18.244,499.1041,3172.5037,890.0472,510.1808,3.9094,6.5157
1,TCGA-2Y-A9GS-01A-12R-A38B-07,0.0,26.412,2.6663,71.0054,639.2311,0.0,122.7206,1.4786,0.0,...,68.5067,632.8241,1153.7703,71.4638,1000.4929,5301.1336,755.5446,860.5224,6.4071,482.9966
2,TCGA-2Y-A9GT-01A-11R-A38B-07,0.0,0.0,4.4833,95.5122,742.4344,0.0,95.046,1.7933,0.8967,...,46.6263,1219.4575,1133.3782,12.5532,1289.397,3219.0092,860.7935,523.6494,14.3466,83.3894
3,TCGA-2Y-A9GU-01A-11R-A38B-07,0.0,5.7222,5.1216,61.6679,1186.9807,0.0,280.2709,0.8341,0.0,...,18.3511,285.2758,1150.2786,9.1755,941.7437,3092.9899,1339.6283,343.6655,2.5024,2.5024
4,TCGA-2Y-A9GV-01A-11R-A38B-07,0.0,11.4975,5.423,104.467,878.1726,0.0,282.5719,0.0,0.0,...,41.4552,999.154,1631.9797,4.2301,1380.7107,2902.7073,575.2961,665.8206,2.5381,119.2893


The dataset contains expression values for ~20k genes, identified by their Entrez gene IDs, across 423 de-identified patient samples.

Entrez (https://www.ncbi.nlm.nih.gov/Web/Search/entrezfs.html) is a data retrieval system that provides users access to NCBI’s databases such as PubMed, GenBank, GEO, and many others. You can access Entrez from a web browser to manually enter queries, or you can use Biopython’s Bio.Entrez module for programmatic access to Entrez. 

Entrez gene IDs are unique gene identifiers that can be used to trace a particular gene or transcript to the genome.

In [19]:
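# Get Entrez gene IDs from the `symbol|entrez_id` column labels.
# `bcr_patient_barcode` is already the index, so every column is a gene.

ids = pd.Series(lihc.columns.values).apply(lambda x: x.split('|')[1]).values
ids[:5]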

array(['100130426', '100133144', '100134869', '10357', '10431'],
      dtype=object)
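
As a small sketch of the same idea, the column labels can be split into both parts at once, which also makes it easy to count genes without a symbol (`?`). The labels below are copied from the `lihc.head()` output; the names `parts` and `n_unnamed` are illustrative only:

```python
import pandas as pd

# A handful of column labels in the `symbol|entrez_id` format used by the matrix.
labels = pd.Index(["?|100130426", "?|10357", "ZYX|7791", "ZZZ3|26009", "tAKR|389932"])

# Split each label once on '|' into two columns.
parts = labels.to_series().str.split("|", n=1, expand=True)
parts.columns = ["symbol", "entrez_id"]
parts["entrez_id"] = parts["entrez_id"].astype(int)

# Genes whose symbol is unknown are labeled '?'.
n_unnamed = int((parts["symbol"] == "?").sum())
print(parts)
print("genes without a symbol:", n_unnamed)
```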

# BioMart


In [2]:
from pybiomart import Server, Dataset

In [9]:
# Retrieving a dataset directly with known dataset name

dataset = Dataset(name='hsapiens_gene_ensembl',
                  host='http://www.ensembl.org')

dataset.query(
              filters={'chromosome_name': ['1','2']})

Unnamed: 0,Gene stable ID,Gene stable ID version,Transcript stable ID,Transcript stable ID version
0,ENSG00000200036,ENSG00000200036.1,ENST00000363166,ENST00000363166.1
1,ENSG00000252396,ENSG00000252396.1,ENST00000516587,ENST00000516587.1
2,ENSG00000252429,ENSG00000252429.2,ENST00000516620,ENST00000516620.2
3,ENSG00000221643,ENSG00000221643.1,ENST00000408716,ENST00000408716.1
4,ENSG00000264371,ENSG00000264371.1,ENST00000580572,ENST00000580572.1
...,...,...,...,...
36496,ENSG00000196290,ENSG00000196290.15,ENST00000409357,ENST00000409357.5
36497,ENSG00000196290,ENSG00000196290.15,ENST00000409129,ENST00000409129.2
36498,ENSG00000196290,ENSG00000196290.15,ENST00000409588,ENST00000409588.1
36499,ENSG00000196290,ENSG00000196290.15,ENST00000436412,ENST00000436412.1


The `attributes` attribute lists the additional fields available from the dataset.

In [7]:
list(dataset.attributes)

['ensembl_gene_id',
 'ensembl_gene_id_version',
 'ensembl_transcript_id',
 'ensembl_transcript_id_version',
 'ensembl_peptide_id',
 'ensembl_peptide_id_version',
 'ensembl_exon_id',
 'description',
 'chromosome_name',
 'start_position',
 'end_position',
 'strand',
 'band',
 'transcript_start',
 'transcript_end',
 'transcription_start_site',
 'transcript_length',
 'transcript_tsl',
 'transcript_gencode_basic',
 'transcript_appris',
 'transcript_mane_select',
 'external_gene_name',
 'external_gene_source',
 'external_transcript_name',
 'external_transcript_source_name',
 'transcript_count',
 'percentage_gene_gc_content',
 'gene_biotype',
 'transcript_biotype',
 'source',
 'transcript_source',
 'version',
 'transcript_version',
 'peptide_version',
 'phenotype_description',
 'Source_name',
 'study_external_id',
 'strain_name',
 'strain_gender',
 'p_value',
 'go_id',
 'name_1006',
 'definition_1006',
 'go_linkage_type',
 'namespace_1003',
 'goslim_goa_accession',
 'goslim_goa_description',


We can map the gene stable ID to annotation fields such as GO terms to get a sense of which pathways are linked to a particular gene. The `filters` attribute lists the fields that a query can be filtered on:

In [21]:
dataset.filters

{'link_so_mini_closure': <biomart.Filter name='link_so_mini_closure', type='list'>,
 'link_go_closure': <biomart.Filter name='link_go_closure', type='text'>,
 'link_ensembl_transcript_stable_id': <biomart.Filter name='link_ensembl_transcript_stable_id', type='text'>,
 'gene_id': <biomart.Filter name='gene_id', type='text'>,
 'transcript_id': <biomart.Filter name='transcript_id', type='text'>,
 'link_ensembl_gene_id': <biomart.Filter name='link_ensembl_gene_id', type='text'>,
 'chromosome_name': <biomart.Filter name='chromosome_name', type='text'>,
 'start': <biomart.Filter name='start', type='text'>,
 'end': <biomart.Filter name='end', type='text'>,
 'band_start': <biomart.Filter name='band_start', type='drop_down_basic_filter'>,
 'band_end': <biomart.Filter name='band_end', type='drop_down_basic_filter'>,
 'marker_start': <biomart.Filter name='marker_start', type='drop_down_basic_filter'>,
 'marker_end': <biomart.Filter name='marker_end', type=''>,
 'hsapiens_encode.type': <biomart.Fi

In [86]:
ensmbl_entrez_gene_ids = dataset.query(attributes=['ensembl_transcript_id', 'entrezgene_id'])
ensmbl_entrez_gene_ids.tail(10)

Unnamed: 0,Transcript stable ID,NCBI gene ID
251176,ENST00000644207,
251177,ENST00000647544,
251178,ENST00000642596,
251179,ENST00000643537,
251180,ENST00000644633,
251181,ENST00000642800,
251182,ENST00000645112,56169.0
251183,ENST00000642712,56169.0
251184,ENST00000646090,56169.0
251185,ENST00000643960,56169.0


In [88]:
ensmbl_entrez_gene_ids.dropna(inplace=True)
ensmbl_entrez_gene_ids['NCBI gene ID'] = ensmbl_entrez_gene_ids['NCBI gene ID'].astype(int)
ensmbl_entrez_gene_ids = ensmbl_entrez_gene_ids.set_index('NCBI gene ID').to_dict()['Transcript stable ID']
ensmbl_entrez_gene_ids

{4535: 'ENST00000361390',
 4536: 'ENST00000361453',
 4512: 'ENST00000361624',
 113219467: 'ENST00000387416',
 4513: 'ENST00000361739',
 4509: 'ENST00000361851',
 4508: 'ENST00000361899',
 4514: 'ENST00000362079',
 4537: 'ENST00000361227',
 4539: 'ENST00000361335',
 4538: 'ENST00000361381',
 4540: 'ENST00000361567',
 4541: 'ENST00000361681',
 4519: 'ENST00000361789',
 1028: 'ENST00000471157',
 51621: 'ENST00000616962',
 255027: 'ENST00000567442',
 9665: 'ENST00000549219',
 4849: 'ENST00000611667',
 102723475: 'ENST00000622690',
 9093: 'ENST00000431375',
 79165: 'ENST00000619669',
 57348: 'ENST00000423529',
 1113: 'ENST00000556876',
 5165: 'ENST00000493226',
 89927: 'ENST00000565913',
 94104: 'ENST00000464256',
 1007: 'ENST00000511822',
 147798: 'ENST00000620520',
 23532: 'ENST00000398741',
 338755: 'ENST00000338569',
 144125: 'ENST00000307401',
 54664: 'ENST00000462754',
 51277: 'ENST00000534855',
 440243: 'ENST00000612056',
 161725: 'ENST00000560598',
 2558: 'ENST00000400081',
 285148:

In [91]:
# `bcr_patient_barcode` is the index, so all columns are genes (no need to skip the first one)
ensmbl_ids = pd.Series(lihc.columns.values).apply(lambda x: x.split('|')[1]).astype(int).map(ensmbl_entrez_gene_ids).dropna()

# Had to drop ~2k that didn't align, is there a better way?
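
One possible alternative, sketched here on toy data rather than the real tables: a left merge with `indicator=True` keeps the unmatched IDs visible instead of silently dropping them, and a group-by keeps every transcript per gene (the dictionary above retains only one). The frames `matrix_ids` and `biomart` are hypothetical stand-ins for the matrix columns and the BioMart result:

```python
import pandas as pd

# Toy stand-in for the Entrez IDs taken from the expression matrix columns.
matrix_ids = pd.DataFrame({"entrez_id": [4535, 1028, 100130426]})

# Toy stand-in for the BioMart table; one gene can have several transcripts.
biomart = pd.DataFrame({
    "entrez_id": [4535, 1028, 56169, 56169],
    "transcript_id": ["ENST00000361390", "ENST00000471157",
                      "ENST00000645112", "ENST00000642712"],
})

# Left merge: every matrix ID is kept, `_merge` says whether it found a match.
merged = matrix_ids.merge(biomart, on="entrez_id", how="left", indicator=True)
unmapped = merged.loc[merged["_merge"] == "left_only", "entrez_id"].tolist()

# Keep all transcripts per gene instead of only one.
transcripts_per_gene = biomart.groupby("entrez_id")["transcript_id"].agg(list)

print("unmapped Entrez IDs:", unmapped)
print(transcripts_per_gene)
```

Inspecting `unmapped` first can show whether the missing IDs are retired or unnamed genes before deciding to drop them.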

In [96]:
# Find a faster way to do this

attributes = [
#     'gene_id',
    'entrezgene_id',
    'ensembl_gene_id',
    'ensembl_transcript_id',
    'go_id',
    'name_1006',
    'definition_1006',
    'go_linkage_type',
    'hgnc_id',
    'hgnc_symbol',
#     'hgnc_trans_name',
]
go_mappings = dataset.query(attributes=attributes)

In [97]:
go_mappings.to_csv('data/go_mappings.csv', index=False)

In [92]:
# Figure out better way to do this
attributes = [
#     'gene_id',
    'entrezgene_id',
    'ensembl_gene_id',
    'ensembl_transcript_id',
    'go_id',
    'name_1006',
    'definition_1006',
    'go_linkage_type',
    'hgnc_id',
    'hgnc_symbol',
#     'hgnc_trans_name',
]
dataset.query(attributes=attributes,
              filters={'transcript_id': ensmbl_ids.values}
             )

Unnamed: 0,NCBI gene ID,Gene stable ID,Transcript stable ID,GO term accession,GO term name,GO term definition,GO term evidence code,HGNC ID,HGNC symbol


## Examine Gene Counts



In [98]:
go_mappings = pd.read_csv('data/go_mappings.csv')

go_mappings.shape

(1392800, 9)

In [99]:
go_mappings.head()

Unnamed: 0,NCBI gene ID,Gene stable ID,Transcript stable ID,GO term accession,GO term name,GO term definition,GO term evidence code,HGNC ID,HGNC symbol
0,,ENSG00000210049,ENST00000387314,,,,,HGNC:7481,MT-TF
1,,ENSG00000211459,ENST00000389680,,,,,HGNC:7470,MT-RNR1
2,,ENSG00000210077,ENST00000387342,,,,,HGNC:7500,MT-TV
3,,ENSG00000210082,ENST00000387347,,,,,HGNC:7471,MT-RNR2
4,,ENSG00000209082,ENST00000386347,,,,,HGNC:7490,MT-TL1


In [104]:
go_mappings.loc[~go_mappings['NCBI gene ID'].isnull()]

Unnamed: 0,NCBI gene ID,Gene stable ID,Transcript stable ID,GO term accession,GO term name,GO term definition,GO term evidence code,HGNC ID,HGNC symbol
5,4535,ENSG00000198888,ENST00000361390,GO:0016020,membrane,A lipid bilayer along with all the proteins an...,IEA,HGNC:7455,MT-ND1
6,4535,ENSG00000198888,ENST00000361390,GO:0016021,integral component of membrane,The component of a membrane consisting of the ...,IEA,HGNC:7455,MT-ND1
7,4535,ENSG00000198888,ENST00000361390,GO:0055114,oxidation-reduction process,A metabolic process that results in the remova...,IEA,HGNC:7455,MT-ND1
8,4535,ENSG00000198888,ENST00000361390,GO:0005743,mitochondrial inner membrane,"The inner, i.e. lumen-facing, lipid bilayer of...",IEA,HGNC:7455,MT-ND1
9,4535,ENSG00000198888,ENST00000361390,GO:0005739,mitochondrion,"A semiautonomous, self replicating organelle t...",IEA,HGNC:7455,MT-ND1
...,...,...,...,...,...,...,...,...,...
1392787,60678,ENSG00000284869,ENST00000646013,,,,,HGNC:24614,EEFSEC
1392796,56169,ENSG00000285114,ENST00000645112,,,,,HGNC:7151,GSDMC
1392797,56169,ENSG00000285114,ENST00000642712,,,,,HGNC:7151,GSDMC
1392798,56169,ENSG00000285114,ENST00000646090,,,,,HGNC:7151,GSDMC
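
Because the table has one row per gene, transcript and GO term, it can help to collapse it to one entry per gene. A minimal sketch on a few rows modeled on the output above (the frame `go` and the variable `terms` are illustrative names, not part of the notebook):

```python
import pandas as pd

# A few rows shaped like `go_mappings`; genes without GO annotation have missing values.
go = pd.DataFrame({
    "HGNC symbol": ["MT-ND1", "MT-ND1", "MT-ND1", "GSDMC"],
    "GO term accession": ["GO:0016020", "GO:0005739", "GO:0005739", None],
    "GO term name": ["membrane", "mitochondrion", "mitochondrion", None],
})

terms = (
    go.dropna(subset=["GO term accession"])                      # skip unannotated rows
      .drop_duplicates(subset=["HGNC symbol", "GO term accession"])  # one row per gene/term
      .groupby("HGNC symbol")["GO term name"]
      .agg(lambda s: sorted(s))                                  # list of term names per gene
)
print(terms)
```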


In [284]:
# Create mapping of NCBI gene id to gene name/symbol

map_geneid_to_hfnc_symbol = go_mappings[['NCBI gene ID', 'HGNC symbol']] \
    .drop_duplicates() \
    .dropna()
# map_geneid_to_hfnc_symbol['NCBI gene ID'] = map_geneid_to_hfnc_symbol['NCBI gene ID'].astype(int)
map_geneid_to_hfnc_symbol = map_geneid_to_hfnc_symbol.set_index('NCBI gene ID') \
    .to_dict()['HGNC symbol']
map_geneid_to_hfnc_symbol

{'4535': 'MT-ND1',
 '4536': 'MT-ND2',
 '4512': 'MT-CO1',
 '113219467': 'MT-TS1',
 '4513': 'MT-CO2',
 '4509': 'MT-ATP8',
 '4508': 'MT-ATP6',
 '4514': 'MT-CO3',
 '4537': 'MT-ND3',
 '4539': 'MT-ND4L',
 '4538': 'MT-ND4',
 '4540': 'MT-ND5',
 '4541': 'MT-ND6',
 '4519': 'MT-CYB',
 '1028': 'CDKN1C',
 '51621': 'KLF13',
 '255027': 'MPV17L',
 '9665': 'MARF1',
 '4849': 'CNOT3',
 '102723475': 'KCNE1B',
 '9093': 'DNAJA3',
 '79165': 'LENG1',
 '57348': 'TTYH1',
 '1113': 'CHGA',
 '5165': 'PDK3',
 '89927': 'BMERB1',
 '94104': 'PAXBP1',
 '1007': 'CDH9',
 '147798': 'TMC4',
 '23532': 'PRAME',
 '338755': 'OR2AG2',
 '144125': 'OR2AG1',
 '54664': 'TMEM106B',
 '51277': 'DNAJC27',
 '440243': 'GOLGA6L22',
 '161725': 'OTUD7A',
 '2558': 'GABRA5',
 '285148': 'IAH1',
 '83787': 'ARMC10',
 '643401': 'PURPL',
 '7726': 'TRIM26',
 '2533': 'FYB1',
 '56245': 'C21orf62',
 '26531': 'OR11A1',
 '442194': 'OR10C1',
 '222236': 'NAPEPLD',
 '9962': 'SLC23A2',
 '81696': 'OR5V1',
 '26530': 'OR12D1',
 '81797': 'OR12D3',
 '2562': 'GAB

In [310]:
# Map NCBI gene id to gene name/symbol
lihc_processed = lihc.copy(deep=True)

lihc_processed.columns = list(map(lambda x: x.split('|')[1], lihc_processed.columns))

In [318]:
gene_ids = list(map(lambda x: str(int(x)) if type(x) is float else x, list(map_geneid_to_hfnc_symbol.keys())))
gene_ids = [gene for gene in gene_ids if gene in lihc_processed.columns] # get intersection of gene ids
lihc_processed = lihc_processed[gene_ids]
lihc_processed

Unnamed: 0_level_0,1028,51621,255027,9665,4849,9093,79165,57348,1113,5165,...,22826,93974,83667,54797,389072,1421,1420,219990,79669,257144
bcr_patient_barcode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
TCGA-2V-A95S-01A-11R-A37K-07,72.3245,768.8549,13.0314,1048.3792,1090.0798,1720.8014,289.9495,33.8817,0.0000,104.9031,...,1751.4253,3309.3338,1008.6333,533.6374,187.6527,0.0,0.0,0.0,7.8189,29.9723
TCGA-2Y-A9GS-01A-12R-A38B-07,41.3997,4116.8063,690.4879,1209.9556,736.8162,2790.0444,183.8344,25.6284,2.9571,134.5490,...,1984.2287,3247.4125,1559.8817,565.3031,324.7905,0.0,0.0,0.0,7.8857,8.8714
TCGA-2Y-A9GT-01A-11R-A38B-07,48.4196,3557.0500,642.0085,1310.0202,954.0462,1407.7561,115.6691,0.0000,0.0000,112.0825,...,1500.1121,3359.7848,720.9146,494.0596,633.0419,0.0,0.0,0.0,5.3800,17.0365
TCGA-2Y-A9GU-01A-11R-A38B-07,54.2191,2182.9435,862.5004,1270.3947,2207.9677,2332.2545,241.0664,10.0097,0.0000,40.0387,...,1516.4659,3189.7501,786.5937,320.3096,69.2336,0.0,0.0,0.0,2.5024,3.3366
TCGA-2Y-A9GV-01A-11R-A38B-07,45.6853,2036.3790,296.9543,1212.3519,833.3333,1542.3012,120.1354,1.6920,38.0711,65.9898,...,1544.8393,2682.7411,963.6210,438.2403,282.5719,0.0,0.0,0.0,0.8460,0.0000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
TCGA-ZS-A9CD-01A-11R-A37K-07,40.5710,1616.8295,468.0691,942.9001,670.9241,2818.1818,222.3892,21.7881,12.0210,85.6499,...,1625.0939,3519.9098,1054.0947,293.0128,178.0616,0.0,0.0,0.0,0.7513,0.0000
TCGA-ZS-A9CE-01A-11R-A37K-07,20.7877,4619.2560,695.8425,1242.8884,699.1247,1862.1444,237.9650,3.2823,0.0000,101.7505,...,1514.2232,4958.4245,471.0066,439.2779,235.2298,0.0,0.0,0.0,0.5470,1.0941
TCGA-ZS-A9CF-01A-11R-A38B-07,63.4045,793.9352,504.4797,1274.2936,671.2612,2253.6182,132.3225,7.5810,5.5134,94.4176,...,2202.6189,5642.3156,1602.3432,809.0972,474.1558,0.0,0.0,0.0,8.9593,6.2026
TCGA-ZS-A9CF-02A-11R-A38B-07,33.3745,2132.2621,632.8801,1221.8789,597.6514,1804.0791,192.8307,0.6180,0.0000,84.6724,...,1734.2398,4358.4672,1465.3894,851.0507,768.8504,0.0,0.0,0.0,1.2361,6.7985


In [320]:
lihc_processed.columns = pd.Series(lihc_processed.T.index).map(map_geneid_to_hfnc_symbol).values
lihc_processed.head()

Unnamed: 0_level_0,CDKN1C,KLF13,MPV17L,MARF1,CNOT3,DNAJA3,LENG1,TTYH1,CHGA,PDK3,...,DNAJC8,ATP5IF1,SESN2,MED18,PLEKHM3,CRYGD,CRYGC,OOSP2,C3orf52,GCSAM
bcr_patient_barcode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
TCGA-2V-A95S-01A-11R-A37K-07,72.3245,768.8549,13.0314,1048.3792,1090.0798,1720.8014,289.9495,33.8817,0.0,104.9031,...,1751.4253,3309.3338,1008.6333,533.6374,187.6527,0.0,0.0,0.0,7.8189,29.9723
TCGA-2Y-A9GS-01A-12R-A38B-07,41.3997,4116.8063,690.4879,1209.9556,736.8162,2790.0444,183.8344,25.6284,2.9571,134.549,...,1984.2287,3247.4125,1559.8817,565.3031,324.7905,0.0,0.0,0.0,7.8857,8.8714
TCGA-2Y-A9GT-01A-11R-A38B-07,48.4196,3557.05,642.0085,1310.0202,954.0462,1407.7561,115.6691,0.0,0.0,112.0825,...,1500.1121,3359.7848,720.9146,494.0596,633.0419,0.0,0.0,0.0,5.38,17.0365
TCGA-2Y-A9GU-01A-11R-A38B-07,54.2191,2182.9435,862.5004,1270.3947,2207.9677,2332.2545,241.0664,10.0097,0.0,40.0387,...,1516.4659,3189.7501,786.5937,320.3096,69.2336,0.0,0.0,0.0,2.5024,3.3366
TCGA-2Y-A9GV-01A-11R-A38B-07,45.6853,2036.379,296.9543,1212.3519,833.3333,1542.3012,120.1354,1.692,38.0711,65.9898,...,1544.8393,2682.7411,963.621,438.2403,282.5719,0.0,0.0,0.0,0.846,0.0


## Pre-filtering RNA-seq data

Our count matrix contains many genes with only zeros, and many more with only a few fragments in total. To reduce the size of the object and speed up later steps, we can remove the genes that carry little or no information about expression. Because genes are columns here, we keep only the columns whose total across all samples exceeds the number of samples, i.e. genes that average more than one count per sample:

In [321]:
# Keep genes (columns) whose total across samples exceeds the number of samples
lihc_processed = lihc_processed.loc[:, lihc_processed.sum(axis=0) > len(lihc_processed)]
lihc_processed.shape

(423, 17180)
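
To make the threshold concrete, here is the same filter on a toy matrix with three samples and three made-up genes. Only the gene whose total exceeds the number of samples survives:

```python
import pandas as pd

# Toy matrix: rows are samples, columns are genes (names are made up).
counts = pd.DataFrame(
    {
        "GENE_A": [0.0, 0.0, 0.0],    # never expressed
        "GENE_B": [1.0, 0.0, 0.0],    # a single count in total
        "GENE_C": [10.0, 25.0, 3.0],  # clearly expressed
    },
    index=["s1", "s2", "s3"],
)

min_total = len(counts)                  # threshold: number of samples
keep = counts.sum(axis=0) > min_total    # boolean mask, one entry per gene
filtered = counts.loc[:, keep]
print(filtered.columns.tolist())
```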

## Survey of clinical characteristics

In [None]:
## TODO: Conduct survey of clinical characteristics

In [None]:
## Label and Split data based on recurrence? drug?

# Looking for differentially expressed genes between clinical conditions

## rlog transformation

https://bioconductor.org/help/course-materials/2017/CSAMA/labs/2-tuesday/lab-03-rnaseq/rnaseqGene_CSAMA2017.html

Many common statistical methods for exploratory analysis of multidimensional data, for example clustering and principal components analysis (PCA), work best for data that generally has the same range of variance at different ranges of the mean values. When the expected amount of variance is approximately the same across different mean values, the data is said to be homoskedastic. For RNA-seq raw counts, however, the variance grows with the mean. For example, if one performs PCA directly on a matrix of size-factor-normalized read counts, the result typically depends only on the few most strongly expressed genes because they show the largest absolute differences between samples. A simple and often used strategy to avoid this is to take the logarithm of the normalized count values plus a small pseudocount; however, now the genes with the very lowest counts will tend to dominate the results because, due to the strong Poisson noise inherent to small count values, and the fact that the logarithm amplifies differences for the smallest values, these low count genes will show the strongest relative differences between samples.
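
The effect can be illustrated with simulated data. This is a sketch under simplifying assumptions: pure Poisson noise, one weakly and one strongly expressed gene, and arbitrary means of 2 and 1000:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated counts for one weakly and one strongly expressed gene.
low = rng.poisson(lam=2, size=10_000)
high = rng.poisson(lam=1000, size=10_000)

# Raw counts: the variance grows with the mean.
raw_var_low, raw_var_high = low.var(), high.var()

# log2(count + 1): the low-count gene now shows the larger spread.
log_sd_low = np.log2(low + 1).std()
log_sd_high = np.log2(high + 1).std()

print(f"raw variance   low={raw_var_low:.1f}  high={raw_var_high:.1f}")
print(f"log2(x+1) sd   low={log_sd_low:.3f}  high={log_sd_high:.3f}")
```

On the raw scale the highly expressed gene dominates, while after the log transform the weakly expressed gene does. This is the problem that variance-stabilizing transformations are meant to address.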

As a solution, DESeq2 offers transformations for count data that stabilize the variance across the mean. One such transformation is the regularized-logarithm transformation, or rlog. For genes with high counts, the rlog transformation gives a result similar to the ordinary log2 transformation of normalized counts. For genes with lower counts, however, the values are shrunken towards the genes' averages across all samples. Using an empirical Bayesian prior on inter-sample differences in the form of a ridge penalty, the rlog-transformed data then becomes approximately homoskedastic, and can be used directly for computing distances between samples and making PCA plots. Another transformation, the variance stabilizing transformation, is discussed alongside the rlog in the DESeq2 vignette.

## PCA Plots
Now that we have everything set up, the first thing to do is to generate PCA plots to check whether the samples cluster as expected: controls with controls, and treatments with treatments.

Principal components analysis (PCA) is one way to visualize sample-to-sample distances. In this ordination method, the data points (here, the samples) are projected onto the 2D plane such that they spread out in the two directions that explain most of the differences. The x-axis is the direction that separates the data points the most. The values of the samples in this direction are written PC1. The y-axis is a direction (it must be orthogonal to the first direction) that separates the data the second most. The values of the samples in this direction are written PC2. The percent of the total variance that is contained in each direction is printed in the axis label. Note that these percentages do not add to 100%, because there are more dimensions that contain the remaining variance (although each of these remaining dimensions will explain less than the two that we see).
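
The projection described above can be sketched with plain NumPy via a singular value decomposition. The matrix here is simulated (6 hypothetical samples in two groups, 50 genes), standing in for transformed expression values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical log-scale expression matrix: 6 samples x 50 genes, two groups of 3.
X = rng.normal(size=(6, 50))
X[3:, :10] += 4.0  # shift 10 genes in the second group

Xc = X - X.mean(axis=0)                       # center each gene
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * S                                # sample coordinates on the principal components
explained = S**2 / np.sum(S**2)               # fraction of total variance per component
pc1, pc2 = scores[:, 0], scores[:, 1]

print("PC1 explains {:.0%}, PC2 explains {:.0%}".format(explained[0], explained[1]))
print("PC1 coordinates:", np.round(pc1, 2))
```

Plotting `pc1` against `pc2` gives the usual PCA plot. The two simulated groups fall on opposite sides of PC1, and the first two percentages do not add up to 100%.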

In [325]:
# Use DESEQ2 R library to rlog transform

from rpy2.robjects.packages import importr
deseq = importr('DESeq2')

In [None]:
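# Create DESeq dataset.
# NOTE: `count_matrix`, `design_matrix` and `design_formula` are placeholders
# that are not defined in this notebook; they must already be R objects
# (an integer count matrix, a sample table and a formula).
dds = deseq.DESeqDataSetFromMatrix(countData=count_matrix,
                                   colData=design_matrix,
                                   design=design_formula)

rld = deseq.rlog(dds, blind=False)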

In [None]:
rld <- rlog(dds, blind=FALSE)
head(assay(rld), 3)

## Differential Gene Expression
Now we are ready to identify the differentially expressed genes between the two sets of samples: control vs. treatment. We will achieve this using the Characteristic Direction method[6](#ref6) that we developed and published in BMC Bioinformatics in 2014.

An implementation in Python of the Characteristic Direction method can be downloaded and installed from here: https://github.com/wangz10/geode.
    


In [None]:
import geode
import numpy as np

# NOTE: `expr_df` (genes x samples), `meta_df` (sample metadata) and
# `d_layout_platform` (library layout -> platform name) are not defined in
# this notebook; they must be created before this cell will run.
d_platform_cd = {}  # top up/down genes per platform
cd_results = pd.DataFrame(index=expr_df.index)

sample_classes = {}
for layout in meta_df['LibraryLayout_s'].unique():
    # sample_class: 0 = not in this layout, 1 = in this layout,
    # 2 = in this layout and Zika infected
    sample_class = np.zeros(expr_df.shape[1], dtype=np.int32)
    sample_class[meta_df['LibraryLayout_s'].values == layout] = 1
    sample_class[(meta_df['LibraryLayout_s'].values == layout) &
                 (meta_df['infection_status_s'].values == 'Zika infected')] = 2
    platform = d_layout_platform[layout]
    sample_classes[platform] = sample_class

sample_classes['combined'] = sample_classes['MiSeq'] + sample_classes['NextSeq 500']
print(sample_classes)

for platform, sample_class in sample_classes.items():
    cd_res = geode.chdir(expr_df.values, sample_class, expr_df.index,
                         gamma=.5, sort=False, calculate_sig=False)
    cd_coefs = np.array([x[0] for x in cd_res])
    cd_results[platform] = cd_coefs

    # sort CD coefficients by absolute value in descending order
    srt_idx = np.abs(cd_coefs).argsort()[::-1]
    cd_coefs = cd_coefs[srt_idx][:600]
    sorted_DEGs = expr_df.index[srt_idx][:600]
    # split into up- and down-regulated genes
    up_genes = dict(zip(sorted_DEGs[cd_coefs > 0], cd_coefs[cd_coefs > 0]))
    dn_genes = dict(zip(sorted_DEGs[cd_coefs < 0], cd_coefs[cd_coefs < 0]))
    d_platform_cd[platform+'-up'] = up_genes
    d_platform_cd[platform+'-dn'] = dn_genes

print(cd_results.head())

In [None]:
## Check the cosine distance between the two signatures
from scipy.spatial.distance import cosine
from itertools import combinations
for col1, col2 in combinations(cd_results.columns, 2):
    print(col1, col2, cosine(cd_results[col1], cd_results[col2]))
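
For reference, the cosine distance is one minus the cosine similarity of the two vectors. A small NumPy sketch of the same definition, with made-up signature values, shows the range to expect: 0 for identical directions, 1 for orthogonal ones, 2 for opposite ones.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity of two vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

sig_a = [0.5, -0.2, 0.1]                      # made-up signature coefficients
same = cosine_distance(sig_a, sig_a)
opposite = cosine_distance(sig_a, [-x for x in sig_a])
orthogonal = cosine_distance([1, 0], [0, 1])

print(same, orthogonal, opposite)
```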

## Prepare count matrices
DESeq2 expects input data as obtained, e.g., from RNA-seq or another high-throughput sequencing experiment, in the form of a matrix of integer values. The value in the i-th row and the j-th column of the matrix tells how many reads (or fragments, for paired-end RNA-seq) have been assigned to gene i in sample j. Analogously, for other types of assays, the rows of the matrix might correspond, e.g., to binding regions (with ChIP-Seq), species of bacteria (with metagenomic datasets), or peptide sequences (with quantitative mass spectrometry).

The values in the matrix should be counts of sequencing reads/fragments. This is important for DESeq2’s statistical model to hold, as only counts allow assessing the measurement precision correctly. It is important to never provide counts that were pre-normalized for sequencing depth/library size, as the statistical model is most powerful when applied to un-normalized counts, and is designed to account for library size differences internally.
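
A quick sanity check before handing a matrix to DESeq2 is to confirm that it holds non-negative whole numbers. This is a minimal sketch; the function name `looks_like_raw_counts` and the toy frames are illustrative only:

```python
import numpy as np
import pandas as pd

def looks_like_raw_counts(df):
    """Return True if every value is finite, non-negative and a whole number."""
    values = df.to_numpy(dtype=float)
    return bool(
        np.all(np.isfinite(values))
        and np.all(values >= 0)
        and np.all(values == np.round(values))
    )

raw = pd.DataFrame({"GENE_A": [0, 3], "GENE_B": [12, 7]})
# Fractional values, like those shown in the `lihc.head()` output above.
normalized = pd.DataFrame({"GENE_A": [1.5051, 26.412], "GENE_B": [3.7074, 2.6663]})

print(looks_like_raw_counts(raw))
print(looks_like_raw_counts(normalized))
```

A matrix that fails this check was most likely normalized upstream and would not satisfy the assumptions described above.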

## Align Reads to reference genome
The computational analysis of an RNA-seq experiment begins from the FASTQ files that contain the nucleotide sequence of each read and a quality score at each position. These reads must first be aligned to a reference genome or transcriptome; alternatively, the abundances and estimated counts per transcript can be estimated without alignment. In either case, it is important to know if the sequencing experiment was single-end or paired-end, as the alignment software will require the user to specify both FASTQ files for a paired-end experiment. The output of this alignment step is commonly stored in a file format called SAM/BAM.
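
To show what a FASTQ record looks like, here is a minimal parser sketch for the common four-line layout. It assumes Phred+33 quality encoding (the usual encoding in current Illumina output); the record itself is made up, and real pipelines would use an established parser instead:

```python
import io

# A made-up single-read FASTQ record: header, sequence, '+' separator, qualities.
fastq_text = "@read_1\nGATTACA\n+\nIIII!!#\n"

def parse_fastq(handle):
    """Yield (name, sequence, phred_scores) for each four-line record."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip()
        # Assumed Phred+33 encoding: score = ASCII code - 33
        yield header[1:], seq, [ord(c) - 33 for c in qual]

records = list(parse_fastq(io.StringIO(fastq_text)))
print(records)
```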


## Define gene models

## Plot counts

## PCA Plot 

## Differential Expression Analysis
## Gene Clustering





In [None]:
from diffexp.py_deseq import py_DESeq2

dds = py_DESeq2(count_matrix = df,
               design_matrix = sample_df,
               design_formula = '~ sample',
               gene_column = 'id') # <- telling DESeq2 this should be the gene ID column
    
dds.run_deseq() 
dds.get_deseq_result()
res = dds.deseq_result 
res.head()

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Volcano plot: effect size against significance
plt.scatter(res.log2FoldChange, -np.log2(res.padj))

In [2]:
from dgeclust import CountData, SimulationManager

ModuleNotFoundError: No module named 'dgeclust'