In [30]:
import seaborn as sns
from ggplot import *
from matplotlib import pyplot as plt
import bokeh

import pandas as pd
import dask.dataframe as dd
import numpy as np
import scipy as sc
import statsmodels as sm

import sklearn as sk
import tensorflow as tf
import keras
import xgboost as xgb
import lightgbm as lgbm
import tpot

import sys
import os
import gc
import re

# data sources

## DNA, [Mutation](https://ghr.nlm.nih.gov/primer/mutationsanddisorders/possiblemutations)

Literally: per genome and chromosome, the change in a base pair compared to a normal reference.
Remember that the base pairs are (Adenine, Thymine) and (Guanine, Cytosine).

The types of mutations include (taken [from here](https://ghr.nlm.nih.gov/primer/mutationsanddisorders/possiblemutations)):

Missense mutation:
A change in one DNA base pair that results in the substitution of one amino acid for another in the protein made by a gene.

Nonsense mutation:
Also a change in one DNA base pair. Instead of substituting one amino acid for another, however, the altered DNA sequence prematurely signals the cell to stop building a protein. This type of mutation results in a shortened protein that may function improperly or not at all.

Insertion: 
An insertion changes the number of DNA bases in a gene by adding a piece of DNA. As a result, the protein made by the gene may not function properly.

Deletion:
A deletion changes the number of DNA bases by removing a piece of DNA. Small deletions may remove one or a few base pairs within a gene, while larger deletions can remove an entire gene or several neighboring genes. The deleted DNA may alter the function of the resulting protein(s).

Duplication:
A duplication consists of a piece of DNA that is abnormally copied one or more times. This type of mutation may alter the function of the resulting protein.

Frameshift mutation:
This type of mutation occurs when the addition or loss of DNA bases changes a gene's reading frame. A reading frame consists of groups of 3 bases that each code for one amino acid. A frameshift mutation shifts the grouping of these bases and changes the code for amino acids. The resulting protein is usually nonfunctional. Insertions, deletions, and duplications can all be frameshift mutations.
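
The reading-frame argument can be checked directly from the `Ref` and `Alt` alleles: an insertion or deletion shifts the frame when the length difference is not a multiple of 3. A minimal sketch, assuming alleles are plain base strings and that an empty allele is written as `-` (the helper name `is_frameshift` and the `-` convention are assumptions, not taken from the data set):

```python
def is_frameshift(ref: str, alt: str) -> bool:
    """Return True when replacing `ref` by `alt` shifts the reading frame."""
    # Treat "-" as an empty allele (assumed convention for pure insertions/deletions).
    ref_len = len(ref.replace("-", ""))
    alt_len = len(alt.replace("-", ""))
    # Codons are 3 bases long, so only length changes that are not
    # a multiple of 3 shift the grouping of the downstream bases.
    return (alt_len - ref_len) % 3 != 0


print(is_frameshift("A", "G"))      # substitution, same length -> False
print(is_frameshift("-", "AT"))     # 2-base insertion -> True
print(is_frameshift("ACGT", "A"))   # 3-base deletion -> False
```

This only looks at allele lengths; it does not check whether the variant falls inside a coding region.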

Repeat expansion:
Nucleotide repeats are short DNA sequences that are repeated a number of times in a row. For example, a trinucleotide repeat is made up of 3-base-pair sequences, and a tetranucleotide repeat is made up of 4-base-pair sequences. A repeat expansion is a mutation that increases the number of times that the short DNA sequence is repeated. This type of mutation can cause the resulting protein to function improperly.

### DATA FIELDS, shape (422553, 11)
``` ID      |  Location        | Change     |  Gene   | Mutation type|  Var.Allele.Frequency  | Amino acid```

```SampleID,| Chr, Start, Stop|  Ref, Alt  | Gene    |    Effect    |  DNA_VAF, RNA_VAF      | Amino_Acid_Change```

```string   |string, int, int | char, char | string  |    string    |  float, float          |  string```

NOTE: this gives us direct insight into how genetic mutations lead to changes in amino acids.

## Copy Number Variations

A copy number variation (CNV) is when the number of copies of a particular gene varies from one individual to the next.

### DATA FIELDS, shape (24802, 372)
``` Gene      | Chr, Start, Stop | Strand     |   SampleID 1..SampleID N```

``` string    |string, int, int  | int        |  int..int```


## Methylation, gene expression regulation

The degree of [methylation](https://en.wikipedia.org/wiki/DNA_methylation)
indicates how many methyl groups have been added to the DNA. Increased methylation is associated with less transcription of the DNA.
As a rule of thumb: methylated means the gene is switched OFF, unmethylated means the gene is switched ON.

Alterations of DNA methylation have been recognized as an important component of cancer development.


### DATA FIELDS, shape (485577, 483) 
``` probeID   | Chr, Start, Stop | Strand  | Gene   |  Relation_CpG_island | SampleID 1..SampleID N```

``` string    |string, int, int  | int     | string |   string             | float..float```


## RNA, gene expression

Again four building blocks: Adenine (A), Uracil (U), Guanine (G), Cytosine (C).

(DNA) --> (RNA)

A --> U 

T --> A

C --> G

G --> C
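
The mapping above can be written as a lookup table. This is a minimal sketch that pairs each base of a template DNA strand with its RNA complement, position by position; it ignores strand orientation, and the names `DNA_TO_RNA` and `transcribe` are illustrative only:

```python
# Template-strand DNA base -> complementary RNA base, as in the mapping above.
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_dna: str) -> str:
    """Map every base of a template DNA strand to its RNA complement."""
    return "".join(DNA_TO_RNA[base] for base in template_dna.upper())


print(transcribe("TACGGT"))  # AUGCCA
```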

Gene expression profiles, continuous values resulting from the normalisation of counts.

### DATA FIELDS, shape (60531, 477)
``` Gene      | Chr, Start, Stop | Strand  | SampleID 1..SampleID N```

``` string    |string, int, int  | int     |  float..float```


## miRNA, transcriptomics

miRNAs are short non-coding RNAs that regulate gene expression after transcription, so they form a connection between RNA production and protein creation. Perhaps miRNA expression values can be associated with specific proteins.

### DATA FIELDS, shape (2220, 458)
``` MIMATID  | Name   | Chr, Start, Stop | Strand  | SampleID 1..SampleID N```

``` string   | string |string, int, int  | int     |  float..float```


## Proteomes

Protein expression profiles: as for gene expression, continuous values resulting from the normalisation of counts.


### DATA FIELDS, shape (282, 355)
``` ProteinID  | SampleID 1..SampleID N```

``` string     | float..float```

### QUIZ, identify our data sets in the following image!


![image.png](_hackathon2018/_images/overview.png)


## GOAL

Some degree of multi-omic analysis and identification of pathways.

![image.png](_hackathon2018/_images/multi_omic.png)


# load in data...

In [3]:
data_clinical = dd.read_csv('/media/koekiemonster/DATA-FAST/genetic_expression/hackathon_2/Melanoma/Melanoma_Phenotype_Metadata.txt',
                           sep="\t")
data_gene_expression = dd.read_csv('/media/koekiemonster/DATA-FAST/genetic_expression/hackathon_2/Melanoma/Melanoma_GeneExpression.txt',
                           sep="\t", dtype={'Start': 'float64', 'Stop': 'float64'})
data_copy_number = dd.read_csv('/media/koekiemonster/DATA-FAST/genetic_expression/hackathon_2/Melanoma/Melanoma_CNV.txt',
                           sep="\t",  dtype={'Start': 'float64', 'Stop': 'float64'})
data_miRNA = dd.read_csv('/media/koekiemonster/DATA-FAST/genetic_expression/hackathon_2/Melanoma/Melanoma_miRNA.txt',
                           sep="\t")
data_Mutation = dd.read_csv('/media/koekiemonster/DATA-FAST/genetic_expression/hackathon_2/Melanoma/Melanoma_Mutation.txt',
                           sep="\t")
data_Methylation = dd.read_csv('/media/koekiemonster/DATA-FAST/genetic_expression/hackathon_2/Melanoma/Melanoma_Methylation.txt',
                           sep="\t", dtype={'Start': 'float64', 'Stop': 'float64'})
data_Proteome = dd.read_csv('/media/koekiemonster/DATA-FAST/genetic_expression/hackathon_2/Melanoma/Melanoma_Proteome.txt',
                           sep="\t")

In [191]:
df_Methylation = data_Methylation.compute()



In [299]:
df_GeneExpression = data_gene_expression.compute()
df_proteome = data_Proteome.compute()
df_mutation = data_Mutation.compute()
df_copy_number = data_copy_number.compute()
df_miRNA = data_miRNA.compute()



In [120]:
df_clinical = data_clinical.compute()

# Feature manipulation

## Copy number variation

In [300]:
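# Encode the strand as +1 / -1 (anything other than '-' becomes +1)
df_copy_number['Strand'] = df_copy_number['Strand'].apply(lambda x: -1 if x=='-' else 1)
# Gene length in base pairs
df_copy_number['GeneDiff'] = df_copy_number['Stop']-df_copy_number['Start']
# Strip the 'chr' prefix; missing values and labels without the prefix are left unchanged
# (assigning only the filtered rows would turn unprefixed labels into NaN)
df_copy_number['Chr'] = df_copy_number['Chr'].str.replace('chr', '', regex=False)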

In [302]:
df_copy_number_transposed = df_copy_number.drop(
    columns=['Chr', 'Start', 'Stop', 'Strand', 'GeneDiff']).T

In [303]:
df_copy_number_transposed.columns = df_copy_number_transposed.iloc[0]
df_copy_number_transposed.drop('Gene', axis=0, inplace=True)
df_copy_number_transposed.index.rename('Sample', inplace=True)
df_copy_number_transposed.reset_index(inplace=True)
df_copy_number_transposed.index=df_copy_number_transposed.Sample

In [1]:
df_copy_number_transposed


## get patient-to-patient similarity matrix

..to find patient clusters
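
A minimal sketch of such a matrix, assuming a numeric samples x genes table shaped like `df_copy_number_transposed`. Cosine similarity is only one possible choice of measure, and the toy values and the helper name `cosine_similarity_matrix` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for a samples x genes table (values are made up).
toy = pd.DataFrame(
    [[0, 1, -1, 0], [0, 1, -1, 1], [2, -2, 0, 0]],
    index=["sample_a", "sample_b", "sample_c"],
    columns=["gene_1", "gene_2", "gene_3", "gene_4"],
    dtype=float,
)


def cosine_similarity_matrix(frame: pd.DataFrame) -> pd.DataFrame:
    """Pairwise cosine similarity between the rows of `frame`."""
    values = frame.to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows get similarity 0 instead of NaN
    unit = values / norms
    return pd.DataFrame(unit @ unit.T, index=frame.index, columns=frame.index)


patient_similarity = cosine_similarity_matrix(toy)
print(patient_similarity.round(2))
```

The result is a symmetric samples x samples table that can be passed to a clustering method that accepts precomputed similarities.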

## get Gene-to-Gene similarity matrix

..to find gene clusters
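
For genes, one simple option is the Pearson correlation between gene columns. A minimal sketch on a made-up samples x genes table; the choice of Pearson correlation is an assumption, not something fixed by this notebook:

```python
import pandas as pd

# Toy samples x genes table (values are made up).
toy = pd.DataFrame(
    {
        "gene_1": [0.0, 1.0, 2.0, 3.0],
        "gene_2": [0.0, 2.0, 4.0, 6.0],  # perfectly correlated with gene_1
        "gene_3": [3.0, 2.0, 1.0, 0.0],  # perfectly anti-correlated with gene_1
    },
    index=["sample_a", "sample_b", "sample_c", "sample_d"],
)

# DataFrame.corr() correlates columns, so the result is genes x genes.
gene_similarity = toy.corr(method="pearson")
print(gene_similarity)
```

Note that a dense genes x genes matrix for all 24802 genes of the copy number table would hold about 615 million values (roughly 5 GB as float64), so filtering genes first is probably necessary.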

## space embedding using t-SNE
Embedding in 3-dimensional space allows for visualisation. To identify clusters per target variable we need to attach the clinical data. This is useful because a priori we do not know which target variable leads to the best separation.

To avoid computational complexity issues:
* base t-SNE on exemplars
* apply PCA/LDA or some other dimension reducer before applying t-SNE
* use a faster (Barnes-Hut, multicore) t-SNE implementation: https://github.com/DmitryUlyanov/Multicore-TSNE, https://github.com/danielfrg/tsne
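
The second option in the list above (reduce first, then embed) could look roughly like this with scikit-learn. It is a sketch on random data, and the sizes, the number of principal components and the perplexity are arbitrary assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Random stand-in for a samples x features matrix (not real data).
features = rng.normal(size=(60, 200))

# Step 1: reduce to a few principal components to make t-SNE cheaper.
reduced = PCA(n_components=20, random_state=0).fit_transform(features)

# Step 2: embed the reduced data in 3 dimensions for visualisation.
# The perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=3, perplexity=10, init="pca", random_state=0)
embedding = tsne.fit_transform(reduced)

print(embedding.shape)  # (60, 3)
```

The rows of `embedding` stay in the same order as the input samples, so clinical columns can be attached afterwards for colouring.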


In [307]:
temp_merged = df_cnv_reduced.merge(df_clinical, 
                                   how='inner', 
                                   left_on='Sample', 
                                   right_on='SampleID')

Unnamed: 0,Sample,A1BG,A1CF,A2M,A2ML1,A2MP1,A3GALT2,A4GALT,A4GNT,AAAS,...,Radiation Therapy,Time To Radiation Therapy (Days),Response To Therapy,Time To Therapy (Days),Therapy Ongoing,New Tumor Event,New Tumor Event Type,New Tumor Event Anatomical Location,Subsequent Primary Melanoma,Time To New Tumor Event (Days)
0,TCGA-3N-A9WB-06,0,0,0,0,0,0,0,0,0,...,NO,,,,,YES,Distant Metastasis,,,487.0
1,TCGA-3N-A9WC-06,0,0,0,0,0,1,0,0,0,...,NO,,,,,NO,,,NO,
2,TCGA-3N-A9WD-06,0,-1,0,0,0,1,1,0,-1,...,YES,244.0,,244.0,,YES,Distant Metastasis,,,306.0
3,TCGA-BF-AAP0-06,0,-1,0,0,0,1,0,0,0,...,NO,,,,,NO,,,,
4,TCGA-D3-A1Q1-06,1,0,1,1,1,0,1,0,1,...,NO,,Clinical Progressive Disease,247.0,NO,YES,Distant Metastasis,,,469.0
5,TCGA-D3-A1Q3-06,0,0,0,0,0,0,0,0,0,...,NO,,Complete Response,362.0,NO,NO,,,,
6,TCGA-D3-A1Q4-06,0,-1,0,0,0,-1,0,0,0,...,NO,,,,,NO,,,NO,
7,TCGA-D3-A1Q5-06,1,-2,0,0,0,0,0,-1,0,...,YES,2528.0,,2528.0,,YES,Locoregional Recurrence,,,2452.0
8,TCGA-D3-A1Q6-06,1,-1,1,1,1,0,0,0,1,...,NO,,,111.0,NO,YES,Regional lymph node,,,48.0
9,TCGA-D3-A1Q7-06,0,-1,0,0,0,0,0,-1,0,...,YES,798.0,,798.0,,NO,,,NO,


# Cluster

## HDBSCAN

[HDBSCAN](http://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html)

## Affinity Propagation

## Markov Clustering
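
Markov Clustering (MCL) alternates two steps on a column-stochastic matrix: expansion (a matrix power, which simulates random-walk steps) and inflation (an element-wise power followed by renormalisation, which strengthens strong flows). A minimal dense NumPy sketch on a toy graph; the function name, the parameter defaults and the cluster-extraction threshold are assumptions for illustration:

```python
import numpy as np


def markov_cluster(adjacency, expansion=2, inflation=2.0,
                   max_iter=100, tol=1e-6, threshold=1e-3):
    """Tiny MCL sketch for a dense, symmetric, non-negative adjacency matrix."""
    matrix = np.asarray(adjacency, dtype=float)
    matrix = matrix + np.eye(len(matrix))   # add self-loops
    matrix = matrix / matrix.sum(axis=0)    # make every column sum to 1
    for _ in range(max_iter):
        previous = matrix
        matrix = np.linalg.matrix_power(matrix, expansion)  # expansion
        matrix = matrix ** inflation                         # inflation
        matrix = matrix / matrix.sum(axis=0)
        if np.allclose(matrix, previous, atol=tol):
            break
    # Attractors keep mass on the diagonal; the columns with mass in an
    # attractor's row are the members of that attractor's cluster.
    clusters = set()
    for attractor in np.flatnonzero(matrix.diagonal() > threshold):
        members = np.flatnonzero(matrix[attractor] > threshold)
        clusters.add(tuple(int(node) for node in members))
    return sorted(list(cluster) for cluster in clusters)


# Toy graph: two triangles joined by a single edge between nodes 2 and 3.
graph = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])
clusters = markov_cluster(graph)
print(clusters)
```

For the omics data, a thresholded similarity matrix could play the role of `graph`. Larger inflation values tend to give more, smaller clusters.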


## perform PCA
To get a feel for how many components carry most of the variance in the data.
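
One way to quantify this is the cumulative explained variance. A minimal NumPy sketch on synthetic data with three underlying factors; the data and the 95% cut-off are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic samples x features matrix driven by 3 latent factors (not real data).
latent = rng.normal(size=(100, 3))
loadings = rng.normal(size=(3, 50))
features = latent @ loadings + 0.01 * rng.normal(size=(100, 50))

# PCA via the singular values of the mean-centred data.
centred = features - features.mean(axis=0)
singular_values = np.linalg.svd(centred, compute_uv=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components that explains at least 95% of the variance.
n_components_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_components_95)
```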

## perform LDA

To find the dimensions that best separate the classes.
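
A minimal sketch with scikit-learn's `LinearDiscriminantAnalysis` on two made-up classes; in this notebook the labels would come from a clinical column, which is an assumption here:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two made-up classes of 40 samples each, shifted apart in the first feature only.
class_a = rng.normal(size=(40, 5))
class_b = rng.normal(size=(40, 5))
class_b[:, 0] += 4.0
features = np.vstack([class_a, class_b])
labels = np.array([0] * 40 + [1] * 40)

# With 2 classes, LDA yields at most 1 discriminant dimension.
lda = LinearDiscriminantAnalysis(n_components=1)
projected = lda.fit_transform(features, labels)

print(projected.shape)               # (80, 1)
print(lda.score(features, labels))   # training accuracy
```

LDA gives at most (number of classes - 1) dimensions, so a binary target variable yields a single axis.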

## Mutation

In [68]:
df_mutation_temp = df_mutation

In [81]:
temp_merged = df_mutation.merge(df_mutation_temp[['Gene', 'Alt', 'Chr', 'Start', 'Stop']], 
                          how='outer', 
                          on=['Gene', 'Alt', 'Chr', 'Start', 'Stop'])

In [134]:
Gene_presence_mutation = temp_merged.groupby(by='Gene').size().sort_values(ascending=False)
AminoAcid_presence_mutation = temp_merged.groupby(by='Amino_Acid_Change').size().sort_values(ascending=False)
Effect_presence_mutation = temp_merged.groupby(by='Effect').size().sort_values(ascending=False)

In [174]:
temp_merged[['Gene', 'Chr', 'Sample']].groupby(by=['Sample', 'Gene' ]).count().reset_index()

Unnamed: 0,Sample,Gene,Chr
0,TCGA-3N-A9WB-06,AC005562.1,2
1,TCGA-3N-A9WB-06,AC018890.6,6
2,TCGA-3N-A9WB-06,ADAM18,2
3,TCGA-3N-A9WB-06,ADAM7,3
4,TCGA-3N-A9WB-06,ADARB2,1
5,TCGA-3N-A9WB-06,AFF3,1
6,TCGA-3N-A9WB-06,AGPS,1
7,TCGA-3N-A9WB-06,AHCTF1,1
8,TCGA-3N-A9WB-06,AIM1L,1
9,TCGA-3N-A9WB-06,ALPP,1


In [139]:
_temp_merged = temp_merged[['Gene', 'Sample']].merge(df_clinical, 
                                      how='left', 
                                      left_on='Sample', 
                                      right_on='SampleID')

In [143]:
_temp_merged.columns

Index(['Gene', 'Sample', 'SampleID', 'Sample Type', 'Gender', 'Ethnicity',
       'Age At Diagnosis (Days)', 'Age At Diagnosis (Years)', 'Vital Status',
       'Overall Survival Status',
       'Time To Overall Survival From Diagnosis (Days)', 'BMI',
       'Anatomic Treatment Site', 'Location Distant Metastasis',
       'Breslow Depth Value', 'Clarks Level', 'Ulceration Status',
       'Mitotic Count Rate', 'Morphology', 'Site Of Resection', 'Tumor Stage',
       'T-stage', 'N-stage', 'M-stage', 'Drug Therapy Type',
       'Prior Drug Therapy Type', 'Drug Name', 'Time To Drug Therapy (Days)',
       'Radiation Therapy', 'Time To Radiation Therapy (Days)',
       'Response To Therapy', 'Time To Therapy (Days)', 'Therapy Ongoing',
       'New Tumor Event', 'New Tumor Event Type',
       'New Tumor Event Anatomical Location', 'Subsequent Primary Melanoma',
       'Time To New Tumor Event (Days)'],
      dtype='object')

In [121]:
df_clinical

Unnamed: 0,SampleID,Sample Type,Gender,Ethnicity,Age At Diagnosis (Days),Age At Diagnosis (Years),Vital Status,Overall Survival Status,Time To Overall Survival From Diagnosis (Days),BMI,...,Radiation Therapy,Time To Radiation Therapy (Days),Response To Therapy,Time To Therapy (Days),Therapy Ongoing,New Tumor Event,New Tumor Event Type,New Tumor Event Anatomical Location,Subsequent Primary Melanoma,Time To New Tumor Event (Days)
0,TCGA-3N-A9WB-06,Metastatic,male,white,26176.0,71.0,dead,1,518.0,25.469388,...,NO,,,,,YES,Distant Metastasis,,,487.0
1,TCGA-3N-A9WC-06,Metastatic,male,white,30286.0,82.0,alive,0,2022.0,20.305175,...,NO,,,,,NO,,,NO,
2,TCGA-3N-A9WD-06,Metastatic,male,white,30163.0,82.0,dead,1,395.0,34.638239,...,YES,244.0,,244.0,,YES,Distant Metastasis,,,306.0
3,TCGA-BF-A1PU-01,Primary Tumor,female,,17025.0,46.0,alive,0,387.0,22.656250,...,,,,,,,,,,
4,TCGA-BF-A1PV-01,Primary Tumor,female,,27124.0,74.0,alive,0,14.0,27.343750,...,,,,,,,,,,
5,TCGA-BF-A1PX-01,Primary Tumor,male,white,20626.0,56.0,dead,1,282.0,25.469388,...,NO,,,,,NO,,,,
6,TCGA-BF-A1PZ-01,Primary Tumor,female,white,26240.0,71.0,alive,0,853.0,21.077195,...,NO,,,,,NO,,,,
7,TCGA-BF-A1Q0-01,Primary Tumor,male,white,29380.0,80.0,alive,0,831.0,23.054562,...,NO,,,,,NO,,,,
8,TCGA-BF-A3DJ-01,Primary Tumor,female,,13332.0,36.0,alive,0,464.0,25.711662,...,,,,,,,,,,
9,TCGA-BF-A3DL-01,Primary Tumor,female,white,30805.0,84.0,alive,0,769.0,24.835764,...,NO,,,,,NO,,,,


# feature normalisation
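
One common option, not necessarily the one intended here, is to z-score every feature. A minimal pandas sketch on a made-up samples x features table:

```python
import pandas as pd

# Toy samples x features table (values are made up).
toy = pd.DataFrame(
    {"gene_1": [1.0, 2.0, 3.0], "gene_2": [10.0, 10.0, 10.0]},
    index=["sample_a", "sample_b", "sample_c"],
)

# Z-score per feature; constant columns have zero spread and are mapped to 0.
std = toy.std(axis=0, ddof=0).replace(0, 1.0)
normalised = (toy - toy.mean(axis=0)) / std
print(normalised)
```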

# feature batching and transposition

## per layer clustering

## per layer classification

# feature merging