# Exploratory Data Analysis of Cancer Genomics data using TCGA

In this notebook, we will take a look at one of the canonical datasets, if not _the_ dataset, in cancer genomics: The Cancer Genome Atlas (TCGA).

We'll start by investigating the RNA sequencing (RNA-seq) and clinical data available for the cancer type LIHC (liver hepatocellular carcinoma).

The data is stored in the R package _[RTCGA](http://rtcga.github.io/RTCGA/)_.

## Load libraries

In [43]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Set variables

In [44]:
data_dir="data/"
response_name="patient.race"
rnaseq_file=data_dir+"lihc_rnaseq.csv.gz"
clinical_file=data_dir+"lihc_clinical.csv.gz"

## Load data

The data is stored in the RTCGA package in the R programming language. I've exported it to gzipped CSV files for easy use within Python.

We will be investigating the hepatocellular carcinoma dataset. Read about it [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5680778/).

The TCGA RNASeq data is Illumina HiSeq Level 3 RSEM normalized expression data. You can read about the RSEM method [here](https://academic.oup.com/bioinformatics/article/26/4/493/243395).

Essentially, these are the counts of reads that aligned to each gene transcript. Because RSEM estimates the counts statistically rather than tallying them directly, the values are generally not whole numbers. To simplify things, we'll round each value up to the next whole integer.



In [45]:
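# Read the expression matrix (rows = samples, columns = genes)
# and round every estimated count up to a whole integer.
rnaseq = (
    pd.read_csv(rnaseq_file, compression="gzip")
    .set_index("bcr_patient_barcode")
    .pipe(np.ceil)
    .astype(int)
)
display(rnaseq.shape)
display(rnaseq.head())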

(423, 20531)

bcr_patient_barcode,?|100130426,?|100133144,?|100134869,?|10357,?|10431,?|136542,?|155060,?|26823,?|280660,?|317712,...,ZXDA|7789,ZXDB|158586,ZXDC|79364,ZYG11A|440590,ZYG11B|79699,ZYX|7791,ZZEF1|23140,ZZZ3|26009,psiTPTE22|387590,tAKR|389932
TCGA-2V-A95S-01A-11R-A37K-07,0,2,4,91,1018,0,142,1,0,0,...,25,274,795,19,500,3173,891,511,4,7
TCGA-2Y-A9GS-01A-12R-A38B-07,0,27,3,72,640,0,123,2,0,0,...,69,633,1154,72,1001,5302,756,861,7,483
TCGA-2Y-A9GT-01A-11R-A38B-07,0,0,5,96,743,0,96,2,1,0,...,47,1220,1134,13,1290,3220,861,524,15,84
TCGA-2Y-A9GU-01A-11R-A38B-07,0,6,6,62,1187,0,281,1,0,0,...,19,286,1151,10,942,3093,1340,344,3,3
TCGA-2Y-A9GV-01A-11R-A38B-07,0,12,6,105,879,0,283,0,0,0,...,42,1000,1632,5,1381,2903,576,666,3,120


In [46]:
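# Column names look like "SYMBOL|ID"; genes without a symbol use "?" instead.
symbols = [x[0] for x in rnaseq.columns.str.split('|')]
# Keep symbols longer than one character (drops "?" and any one-letter symbol).
gene_name_logical = [len(s) > 1 for s in symbols]
rnaseq_sub = rnaseq.loc[:, gene_name_logical].copy()
# Relabel the remaining columns with the gene symbol only.
rnaseq_sub.columns = [s for s, keep in zip(symbols, gene_name_logical) if keep]
rnaseq_sub.head()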

bcr_patient_barcode,A1BG,A1CF,A2BP1,A2LD1,A2ML1,A2M,A4GALT,A4GNT,AAA1,AAAS,...,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3,psiTPTE22,tAKR
TCGA-2V-A95S-01A-11R-A37K-07,22283,584,0,375,0,286320,81,543,1,1032,...,25,274,795,19,500,3173,891,511,4,7
TCGA-2Y-A9GS-01A-12R-A38B-07,22642,1573,3,99,0,31169,163,2,1,903,...,69,633,1154,72,1001,5302,756,861,7,483
TCGA-2Y-A9GT-01A-11R-A38B-07,77670,1281,0,215,1,19515,119,1,3,773,...,47,1220,1134,13,1290,3220,861,524,15,84
TCGA-2Y-A9GU-01A-11R-A38B-07,9323,1253,0,2914,2,243941,72,0,0,722,...,19,286,1151,10,942,3093,1340,344,3,3
TCGA-2Y-A9GV-01A-11R-A38B-07,84243,1641,0,404,0,8756,83,0,9,828,...,42,1000,1632,5,1381,2903,576,666,3,120


The clinical data is within the RTCGA package, but is also available [here](https://portal.gdc.cancer.gov/projects/TCGA-LIHC). A fuller description of the clinical attributes is [here](https://gdc.cancer.gov/about-data/data-harmonization-and-generation/clinical-data-harmonization).

In [47]:
clinical = pd.read_csv(clinical_file,compression="gzip").set_index('patient.bcr_patient_barcode')
display(clinical.shape)
display(clinical.head())

(377, 1586)

patient.bcr_patient_barcode,admin.bcr,admin.day_of_dcc_upload,admin.disease_code,admin.file_uuid,admin.month_of_dcc_upload,admin.patient_withdrawal.withdrawn,admin.project_code,admin.year_of_dcc_upload,patient.ablation_embolization_tx_adjuvant,patient.additional_studies,...,patient.samples.sample.preservation_method,patient.samples.sample.sample_type,patient.samples.sample.sample_type_id,patient.samples.sample.shortest_dimension,patient.samples.sample.time_between_clamping_and_freezing,patient.samples.sample.time_between_excision_and_freezing,patient.samples.sample.tissue_type,patient.samples.sample.tumor_descriptor,patient.samples.sample.tumor_pathology,patient.samples.sample.vial_number
tcga-2v-a95s,nationwide children's hospital,1,lihc,47a9da83-f87b-45a0-a501-542c5a9df212,10,False,tcga,2015,no,,...,,primary tumor,1,,,,,,,a
tcga-2y-a9gs,nationwide children's hospital,1,lihc,ada84bc2-9724-428c-9557-7280af5f8297,10,False,tcga,2015,no,,...,,primary tumor,1,,,,,,,a
tcga-2y-a9gt,nationwide children's hospital,1,lihc,b9f63d47-2293-4360-b373-5acdaa2ec7cb,10,False,tcga,2015,,,...,,primary tumor,1,,,,,,,a
tcga-2y-a9gu,nationwide children's hospital,1,lihc,2b488ba2-f46b-4d0b-8c73-d9a761ab61b1,10,False,tcga,2015,no,,...,,primary tumor,1,,,,,,,a
tcga-2y-a9gv,nationwide children's hospital,1,lihc,69fc70f4-fe6a-472a-b354-cf37515c524d,10,False,tcga,2015,no,,...,,primary tumor,1,,,,,,,a


## Gene level distribution

In this section, we will investigate the value distribution of genes in our dataset.

<br>

Sample questions:

What is the range of values for a given gene?

What is the distribution of values for a given gene?

Are any genes expressed at higher or lower levels than average?
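
One way to start on these questions is sketched below. It assumes a samples-by-genes count table shaped like `rnaseq_sub`; the small `counts` frame and its gene names are synthetic stand-ins so the snippet runs without the data files. The `log1p` transform is my own choice for taming the wide range of counts, not something the data requires.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for rnaseq_sub (rows = samples, columns = genes); values are made up.
rng = np.random.default_rng(0)
counts = pd.DataFrame(
    rng.negative_binomial(n=2, p=0.01, size=(50, 4)),
    columns=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
)

# Range and central tendency of each gene across samples.
gene_summary = counts.agg(["min", "max", "mean", "median"]).T
gene_summary["range"] = gene_summary["max"] - gene_summary["min"]

# Counts are heavily right-skewed, so compare genes on a log scale.
log_counts = np.log1p(counts)

# Split genes by whether their mean log expression is above the average gene.
gene_means = log_counts.mean()
above_average = gene_means[gene_means > gene_means.mean()].index.tolist()
below_average = gene_means[gene_means <= gene_means.mean()].index.tolist()

print(gene_summary)
```

In the notebook itself, something like `sns.histplot(log_counts["GENE_A"])` would draw the distribution for a single gene.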

## Dimension reduction based on gene expression

In this section, we will investigate a lower dimensional gene expression space.
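
As a minimal sketch of what such a space could look like, the snippet below runs principal component analysis (PCA) through a plain NumPy singular value decomposition. PCA is only one possible choice of method, and the `counts` frame is a synthetic stand-in for `rnaseq_sub`, so the numbers mean nothing biologically.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for rnaseq_sub (rows = samples, columns = genes); values are made up.
rng = np.random.default_rng(0)
counts = pd.DataFrame(
    rng.poisson(lam=100, size=(30, 200)),
    index=[f"sample_{i}" for i in range(30)],
    columns=[f"GENE_{j}" for j in range(200)],
)

# Log-transform, then center each gene so PCA captures variation around the mean.
log_counts = np.log1p(counts)
centered = log_counts - log_counts.mean()

# PCA via singular value decomposition: the rows of vt are the principal axes.
u, s, vt = np.linalg.svd(centered.to_numpy(), full_matrices=False)
n_components = 2
scores = pd.DataFrame(
    u[:, :n_components] * s[:n_components],
    index=counts.index,
    columns=["PC1", "PC2"],
)

# Fraction of total variance explained by each retained component.
explained = (s**2 / np.sum(s**2))[:n_components]
print(scores.head())
print(explained)
```

Each row of `scores` places one sample in the two-dimensional space, which can then be plotted and colored by a clinical attribute.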

## Differential Expression Analysis

In this section, we will investigate differential expression results derived from the [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) package in R. See also this [vignette](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html), which shows how to run these analyses and explains the methods.
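
The sketch below shows one way to pick out up- and down-regulated genes once a DESeq2 results table has been exported from R. The column names follow the standard output of DESeq2's `results()`, but the table itself, the gene names, and both cutoffs are illustrative assumptions rather than values from this analysis.

```python
import pandas as pd

# Made-up stand-in for a DESeq2 results table exported from R; the column names
# follow DESeq2's results() output, the values are illustrative only.
de_results = pd.DataFrame(
    {
        "baseMean": [1500.0, 20.0, 300.0, 800.0],
        "log2FoldChange": [2.3, -0.4, -1.8, 0.2],
        "padj": [0.001, 0.60, 0.03, float("nan")],
    },
    index=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
)

# Illustrative cutoffs (assumptions, not values from this notebook).
padj_cutoff = 0.05
lfc_cutoff = 1.0

# DESeq2 can leave padj missing for genes it filters out, so drop those first.
tested = de_results.dropna(subset=["padj"])
significant = tested[
    (tested["padj"] < padj_cutoff) & (tested["log2FoldChange"].abs() > lfc_cutoff)
]
up = significant[significant["log2FoldChange"] > 0].index.tolist()
down = significant[significant["log2FoldChange"] < 0].index.tolist()
print(up, down)
```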

## Clinical data type investigation

In this section, we will investigate the diversity of the clinical data.

<br>

Sample Questions:

How many unique values are there for a given clinical attribute?

How can we define an appropriate response variable for supervised learning?
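
A rough sketch of how these questions could be approached with pandas follows. The `clinical_demo` frame is a made-up stand-in for `clinical`: only its column names mirror the real table, and the values are invented for illustration.

```python
import pandas as pd

# Made-up stand-in for the clinical table; only the column names mirror the real data.
clinical_demo = pd.DataFrame(
    {
        "admin.disease_code": ["lihc"] * 6,
        "patient.race": ["white", "asian", "white", None, "asian", "white"],
        "patient.samples.sample.vial_number": ["a"] * 6,
    }
)

# Number of distinct non-missing values per attribute.
n_unique = clinical_demo.nunique()

# Attributes with a single value carry no information for a model.
constant_cols = n_unique[n_unique <= 1].index.tolist()

# Class balance and missingness of a candidate response variable.
response_name = "patient.race"
class_counts = clinical_demo[response_name].value_counts(dropna=False)
print(n_unique)
print(class_counts)
```

A candidate response is more useful when it has few missing values and no class that is vanishingly rare, which is what `class_counts` helps to check.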

## Set up for supervised learning 

In this section, we will set up a supervised learning paradigm using the genes within the RNASeq data as predictors and a clinical attribute as a response variable.
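
Before any model can be fit, the two tables have to be matched up: the RNASeq rows are indexed by full, uppercase sample barcodes, while the clinical rows use lowercase patient barcodes. The sketch below shows one possible alignment on made-up stand-in frames. Keeping only primary tumor samples and the first sample per patient are my assumptions, not decisions made in this notebook.

```python
import pandas as pd

# Made-up stand-ins that mimic the two index formats shown above.
rnaseq_demo = pd.DataFrame(
    {"GENE_A": [10, 25, 7], "GENE_B": [0, 3, 12]},
    index=[
        "TCGA-2V-A95S-01A-11R-A37K-07",
        "TCGA-2Y-A9GS-01A-12R-A38B-07",
        "TCGA-2Y-A9GS-11A-12R-A38B-07",  # hypothetical normal-tissue sample
    ],
)
clinical_demo = pd.DataFrame(
    {"patient.race": ["white", "asian", "white"]},
    index=["tcga-2v-a95s", "tcga-2y-a9gs", "tcga-2y-a9gt"],
)
response_name = "patient.race"

# Characters 14-15 of a TCGA barcode encode the sample type ("01" = primary tumor).
is_tumor = rnaseq_demo.index.str[13:15] == "01"
tumor = rnaseq_demo[is_tumor].copy()

# The first 12 characters identify the patient; clinical barcodes are lowercase.
tumor.index = tumor.index.str[:12].str.lower()
tumor = tumor[~tumor.index.duplicated(keep="first")]

# Keep patients present in both tables with a non-missing response.
y = clinical_demo[response_name].dropna()
shared = tumor.index.intersection(y.index)
X = tumor.loc[shared]
y = y.loc[shared]
print(X.shape, y.shape)
```

With the real tables, `rnaseq_sub` and `clinical` would take the place of the two demo frames, leaving `X` as the predictor matrix and `y` as the response.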