In [3]:
import seaborn as sns
from ggplot import *
from matplotlib import pyplot as plt
import bokeh

import pandas as pd
import dask.dataframe as dd
import numpy as np
import scipy as sc
import statsmodels as sm
import networkx as nx

import sklearn as sk
import tensorflow as tf
import keras
import xgboost as xgb
import lightgbm as lgbm
import tpot

import sys
import os
import gc

# Data sources

## DNA, Mutation
For each sample, the mutation data list the positions on each chromosome where the base sequence differs from a normal reference. Recall that the DNA base pairs are (Adenine, Thymine) and (Guanine, Cytosine).




The types of mutations include (taken from here):

**Missense mutation**: This type of mutation is a change in one DNA base pair that results in the substitution of one amino acid for another in the protein made by a gene.

**Nonsense mutation**: A nonsense mutation is also a change in one DNA base pair. Instead of substituting one amino acid for another, however, the altered DNA sequence prematurely signals the cell to stop building a protein. This type of mutation results in a shortened protein that may function improperly or not at all.

**Insertion**: An insertion changes the number of DNA bases in a gene by adding a piece of DNA. As a result, the protein made by the gene may not function properly.

**Deletion**: A deletion changes the number of DNA bases by removing a piece of DNA. Small deletions may remove one or a few base pairs within a gene, while larger deletions can remove an entire gene or several neighboring genes. The deleted DNA may alter the function of the resulting protein(s).

**Duplication**: A duplication consists of a piece of DNA that is abnormally copied one or more times. This type of mutation may alter the function of the resulting protein.

**Frameshift mutation**: This type of mutation occurs when the addition or loss of DNA bases changes a gene's reading frame. A reading frame consists of groups of 3 bases that each code for one amino acid. A frameshift mutation shifts the grouping of these bases and changes the code for amino acids. The resulting protein is usually nonfunctional. Insertions, deletions, and duplications can all be frameshift mutations.

**Repeat expansion**: Nucleotide repeats are short DNA sequences that are repeated a number of times in a row. For example, a trinucleotide repeat is made up of 3-base-pair sequences, and a tetranucleotide repeat is made up of 4-base-pair sequences. A repeat expansion is a mutation that increases the number of times that the short DNA sequence is repeated. This type of mutation can cause the resulting protein to function improperly.
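
To make a few of these categories concrete, the sketch below translates a short coding sequence before and after three edits. The sequence, the helper name `translate`, and the deliberately partial codon table are illustrative assumptions and are not taken from the data set.

```python
# Partial codon table: only the codons needed for this toy example.
CODON_TABLE = {
    "ATG": "M", "GAG": "E", "GAA": "E", "GTG": "V",
    "AAA": "K", "CGA": "R",
    "TAA": "*", "TAG": "*", "TGA": "*",  # stop codons
}

def translate(dna):
    """Translate a coding-strand DNA string, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # "X" marks a codon that is missing from the toy table.
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

reference = "ATGGAGAAATGA"                        # hypothetical sequence
missense = reference[:4] + "T" + reference[5:]    # GAG -> GTG, one amino acid swapped
nonsense = reference[:3] + "T" + reference[4:]    # GAG -> TAG, premature stop
frameshift = reference[:3] + "C" + reference[3:]  # one inserted base shifts the frame

for name, seq in [("reference", reference), ("missense", missense),
                  ("nonsense", nonsense), ("frameshift", frameshift)]:
    print(name, translate(seq))
```

The missense edit changes a single residue, the nonsense edit truncates the protein, and the single-base insertion changes every codon downstream of it.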

### DATA FIELDS, shape (422553, 11)
```
ID      |  Location        | Change     |  Gene   | Mutation type|  Var.Allele.Frequency  | Amino acid

SampleID,| Chr, Start, Stop|  Ref, Alt  | Gene    |    Effect    |  DNA_VAF, RNA_VAF      | Amino_Acid_Change

string   |string, int, int | char, char | string  |    string    |  float, float          |  string
```
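
As a rough idea of how a table with this layout can be summarised, the sketch below builds a tiny frame with a subset of the columns and counts mutations per sample and effect. All rows, gene names and `Effect` labels are made up for illustration; the real file may use different labels.

```python
import pandas as pd

# Made-up rows that only mimic the column layout described above.
toy_mutations = pd.DataFrame({
    "SampleID": ["S1", "S1", "S2", "S2", "S2"],
    "Chr": ["1", "7", "1", "9", "17"],
    "Start": [100, 250, 100, 400, 900],
    "Stop": [100, 250, 100, 400, 900],
    "Ref": ["A", "G", "A", "G", "C"],
    "Alt": ["T", "A", "T", "C", "T"],
    "Gene": ["GENE_A", "GENE_B", "GENE_A", "GENE_C", "GENE_D"],
    "Effect": ["missense", "nonsense", "missense", "missense", "silent"],  # hypothetical labels
    "DNA_VAF": [0.42, 0.10, 0.38, 0.55, 0.05],
})

# Number of mutations per sample, broken down by effect type.
effect_counts = (toy_mutations
                 .groupby(["SampleID", "Effect"])
                 .size()
                 .unstack(fill_value=0))
print(effect_counts)

# Genes that are mutated in more than one sample.
samples_per_gene = toy_mutations.groupby("Gene")["SampleID"].nunique()
recurrent = samples_per_gene[samples_per_gene > 1]
print(recurrent)
```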

**NOTE**: this gives us direct insight into how genetic mutations lead to changes in amino acids.

## Copy Number Variations
A copy number variation (CNV) is when the number of copies of a particular gene varies from one individual to the next.

### DATA FIELDS, shape (24802, 372)
```
Gene      | Chr, Start, Stop | Strand     |   SampleID 1..SampleID N

string    |string, int, int  | int        |  int..int
```

## Methylation, gene expression regulation
The degree of methylation measures the addition of methyl groups to the DNA. Increased methylation is associated with reduced transcription of the DNA: as a rule of thumb, a methylated gene is switched OFF and an unmethylated gene is switched ON.

Alterations of DNA methylation have been recognized as an important component of cancer development.

### DATA FIELDS, shape (485577, 483)
```
probeID   | Chr, Start, Stop | Strand  | Gene   |  Relation_CpG_island | SampleID 1..SampleID N

string    |string, int, int  | int     | string |   string             | float..float
```

## RNA, gene expression
Again four building blocks: Adenine (A), Uracil (U), Guanine (G), Cytosine (C). During transcription each base of the DNA template strand is paired with its RNA complement:

(DNA) --> (RNA)

A --> U

T --> A

C --> G

G --> C
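
The mapping above can be written as a small lookup table. The function name `transcribe` and the example template are assumptions for illustration, and the sketch ignores strand orientation (it pairs bases position by position without reversing the sequence).

```python
# Complement map from a DNA template-strand base to the RNA base paired with it.
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

def transcribe(template):
    """Return the RNA complementary to a DNA template, base by base."""
    return "".join(DNA_TO_RNA[base] for base in template.upper())

print(transcribe("TACCTC"))  # AUGGAG
```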

Gene expression profiles, continuous values resulting from the normalisation of counts.

### DATA FIELDS, shape (60531, 477)
```
Gene      | Chr, Start, Stop | Strand  | SampleID 1..SampleID N

string    |string, int, int  | int     |  float..float
```

## miRNA, transcriptomics
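MicroRNAs (miRNAs) are short non-coding RNAs that regulate how much protein is produced from messenger RNA, so they form a connection between RNA production and protein creation. Perhaps miRNA expression values can therefore be associated with specific proteins.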

### DATA FIELDS, shape (2220, 458)
```
MIMATID  | Name   | Chr, Start, Stop | Strand  | SampleID 1..SampleID N

string   | string |string, int, int  | int     |  float..float
```

## Proteomes
Protein expression profiles: as with gene expression, continuous values resulting from the normalisation of counts.

### DATA FIELDS, shape (282, 355)
```
ProteinID  | SampleID 1..SampleID N

string     | float..float
```

**QUIZ**, identify our data sets in the following image!

![Quiz](https://media.nature.com/m685/nature-assets/nrg/journal/v16/n2/images/nrg3868-f1.jpg)

# GOAL
**Some degree of multi-omic or trans-omic analysis and identification of pathways.**


![Quiz](https://www.cell.com/cms/attachment/2119084140/2088971044/gr1_lrg.jpg)



## Our reality
![Quiz](https://media.springernature.com/m685/nature-assets/nrg/journal/v16/n2/images/nrg3868-f2.jpg)

In [19]:
# Melanoma_CopyNumberVariations = pd.read_table("https://storage.googleapis.com/genx_2018/Melanoma_CNV.txt", sep="\t")
# Melanoma_Mutation = pd.read_table("https://storage.googleapis.com/genx_2018/Melanoma_Mutation.txt", sep="\t")
# Melanoma_Methylation = pd.read_table("https://storage.googleapis.com/genx_2018/Melanoma_Methylation.txt", sep="\t")

mge_df = pd.read_table("https://storage.googleapis.com/genx_2018/Melanoma_GeneExpression.txt", sep="\t")
# Melanoma_miRNA = pd.read_table("https://storage.googleapis.com/genx_2018/Melanoma_miRNA.txt", sep="\t")
# Melanoma_Proteome = pd.read_table("https://storage.googleapis.com/genx_2018/Melanoma_Proteome.txt", sep="\t")





In [9]:
mmeta_df = pd.read_table("https://storage.googleapis.com/genx_2018/Melanoma_Phenotype_Metadata.txt", sep="\t")
mmeta_df = mmeta_df.set_index("SampleID")

In [21]:
gene_id_c = ["Gene", "Chr", "Start", "Stop", "Strand"]
# Build one composite key per gene: "<Gene>_<Chr>_<Start>_<Stop>_<Strand>"
g_id = mge_df[gene_id_c]\
                    .apply(lambda x: "_".join(map(str,x.values)), axis=1)

mge_df = mge_df.set_index(g_id)
mge_df = mge_df.drop(gene_id_c, axis=1)
mge_df_T = mge_df.transpose()
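
The same kind of composite gene key can also be built without a row-wise `apply`. The sketch below uses a made-up two-gene frame (`toy`, its sample columns and its values are assumptions for illustration) and `Series.str.cat`, then transposes so that samples become rows.

```python
import pandas as pd

# Made-up expression table with the same identifier columns as above.
toy = pd.DataFrame({
    "Gene": ["GENE_A", "GENE_B"],
    "Chr": ["1", "X"],
    "Start": [100, 2000],
    "Stop": [500, 2600],
    "Strand": [1, -1],
    "Sample_1": [0.5, 3.2],
    "Sample_2": [1.5, 0.0],
})
id_cols = ["Gene", "Chr", "Start", "Stop", "Strand"]

# Concatenate the identifier columns element-wise with "_" as separator.
key = toy["Gene"].str.cat([toy[c].astype(str) for c in id_cols[1:]], sep="_")

# Drop the identifier columns, index by the composite key, then put samples on the rows.
expr = toy.drop(columns=id_cols)
expr.index = key
expr_T = expr.T
print(expr_T)
```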


In [64]:
mge_df_T.index.name = "SampleID"
classification_target = "Response To Therapy"

# Join the phenotype metadata first, so the filters below are built from the joined frame
df = mge_df_T.join(mmeta_df, how="left")

# Keep only immunotherapy samples that have a recorded response
target_conditions = (df[classification_target].notnull()
                     & (df['Drug Therapy Type']=='Immunotherapy'))
df = df[target_conditions]

# The first 60531 columns are the gene expression values; the metadata columns follow
expression_df = df.iloc[:,0:60531]

# Responders (complete or partial response) -> 0, all other outcomes -> 1
target_map = {
  "Complete Response":0,
  "Clinical Progressive Disease":1,
  "Radiographic Progressive Disease":1,
  "Stable Disease":1,
  "Partial Response":0
}
target = df[classification_target].map(target_map)
#target = df.loc[df['Drug Therapy Type']=='Chemotherapy'][classification_target].map(target_map)
#target = df.loc[df['Drug Therapy Type']=='Vaccine'][classification_target].map(target_map)
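
The filtering and label mapping can be checked on a tiny metadata frame. The sample IDs and rows below are made up; only the two column names and the response labels come from this notebook. Combining boolean masks with `&` keeps them boolean, which multiplying them does not guarantee.

```python
import pandas as pd

# Made-up phenotype rows for illustration.
toy_meta = pd.DataFrame(
    {
        "Response To Therapy": ["Complete Response", None, "Stable Disease", "Partial Response"],
        "Drug Therapy Type": ["Immunotherapy", "Immunotherapy", "Chemotherapy", "Immunotherapy"],
    },
    index=pd.Index(["S1", "S2", "S3", "S4"], name="SampleID"),
)

has_response = toy_meta["Response To Therapy"].notnull()
is_immuno = toy_meta["Drug Therapy Type"] == "Immunotherapy"
keep = has_response & is_immuno  # element-wise logical AND

responder_map = {"Complete Response": 0, "Partial Response": 0, "Stable Disease": 1}
toy_target = toy_meta.loc[keep, "Response To Therapy"].map(responder_map)
print(toy_target)
```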




In [0]:
mge_df_T.index.name = "SampleID"
classification_target = "Sample Type"
df = mge_df_T.join(mmeta_df, how="left")
df = df[df[classification_target].notnull()]
target_map = {
  "Metastatic":0,
  "Primary Tumor":1,        
  "Solid Tissue Normal":2,    
  "Additional Metastatic":3,                      
}
df = df[df[classification_target].isin(["Metastatic", "Primary Tumor"])] 
target = df[classification_target].map(target_map)

expression_df = df.iloc[:,0:60531]


## Classification 
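
The classifier in this section is constructed with `scale_pos_weight=2.5`. The XGBoost documentation suggests the ratio of negative to positive samples as a typical starting value for this parameter. The helper below is a hypothetical sketch of that heuristic on made-up labels; it is not necessarily how the value 2.5 was chosen here.

```python
import numpy as np

def suggested_scale_pos_weight(y):
    """Ratio of negative (0) to positive (1) labels, a common starting value."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive samples in y")
    return n_neg / n_pos

toy_y = [0, 0, 0, 0, 0, 1, 1]  # made-up labels: 5 negatives, 2 positives
print(suggested_scale_pos_weight(toy_y))  # 2.5
```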

In [49]:
import xgboost as xgb
from sklearn import metrics, model_selection

x = expression_df.values
y = target.values

splits = model_selection.StratifiedKFold(n_splits=10)

model = xgb.XGBClassifier(scale_pos_weight=2.5)

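def benchmark_classifier(clf,x,y,splitter):
    """Fit clf on each training fold and collect the out-of-fold predictions."""
    # StratifiedKFold only uses random_state when shuffle=True, so the
    # unshuffled folds used here are already deterministic and need no seed.
    pred = np.zeros(shape=y.shape)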

    for train_index, test_index in splitter.split(x, y):
        x_train, x_test = x[train_index], x[test_index]
        y_train, y_test = y[train_index], y[test_index] 

        clf.fit(x_train,y_train)
        pred[test_index] = clf.predict(x_test)
        
        print(metrics.accuracy_score(y_test,pred[test_index]))
        print(metrics.confusion_matrix(y_test,pred[test_index]))

    return pred

predictions = benchmark_classifier(model,x,y,splits)

print(metrics.accuracy_score(y,predictions))
print(metrics.confusion_matrix(y,predictions))

0.75
[[1 1]
 [0 2]]

0.5
[[1 1]
 [1 1]]

0.25
[[0 2]
 [1 1]]

0.25
[[0 2]
 [1 1]]

KeyboardInterrupt: 
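
Each accuracy printed above follows directly from the confusion matrix under it: the diagonal holds the correctly classified samples (rows are true labels, columns are predicted labels in scikit-learn's convention). The sketch below recomputes the four per-fold accuracies and pools the four completed folds; the helper name is an assumption for illustration.

```python
def accuracy_from_confusion(matrix):
    """Accuracy = correctly classified samples (diagonal) / all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# The four confusion matrices printed before the run was interrupted.
fold_matrices = [
    [[1, 1], [0, 2]],
    [[1, 1], [1, 1]],
    [[0, 2], [1, 1]],
    [[0, 2], [1, 1]],
]

fold_accuracies = [accuracy_from_confusion(m) for m in fold_matrices]
print(fold_accuracies)  # [0.75, 0.5, 0.25, 0.25]

# Pool the folds by summing the matrices element-wise.
pooled = [[sum(m[r][c] for m in fold_matrices) for c in range(2)] for r in range(2)]
print(pooled, accuracy_from_confusion(pooled))  # [[2, 6], [3, 5]] 0.4375
```

With only four test samples per fold, a single prediction moves the fold accuracy by 0.25, so the per-fold numbers are very noisy.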