# Deep Neural Network-based Cancer Origin Classifier

--- 

***Chunlei Zheng, Rong Xu***

**6-5-2019**

----


<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1-Project-Description" data-toc-modified-id="1-Project-Description-1">1 Project Description</a></span></li><li><span><a href="#2-Requirement-for-reproducing-this-project" data-toc-modified-id="2-Requirement-for-reproducing-this-project-2">2 Requirement for reproducing this project</a></span></li><li><span><a href="#3--Cancer-origin-classifier" data-toc-modified-id="3--Cancer-origin-classifier-3">3  Cancer origin classifier</a></span><ul class="toc-item"><li><span><a href="#3.1-Import-required-packages-and-modules" data-toc-modified-id="3.1-Import-required-packages-and-modules-3.1">3.1 Import required packages and modules</a></span></li><li><span><a href="#3.2--Make-project-structure" data-toc-modified-id="3.2--Make-project-structure-3.2">3.2  Make project structure</a></span></li><li><span><a href="#3.3-Data-preparation" data-toc-modified-id="3.3-Data-preparation-3.3">3.3 Data preparation</a></span><ul class="toc-item"><li><span><a href="#3.3.1-Train,-dev,-and-test-data-preparation" data-toc-modified-id="3.3.1-Train,-dev,-and-test-data-preparation-3.3.1">3.3.1 Train, dev, and test data preparation</a></span></li><li><span><a href="#3.3.2-Cross-validation-(10x)-data-preparatiion" data-toc-modified-id="3.3.2-Cross-validation-(10x)-data-preparatiion-3.3.2">3.3.2 Cross validation (10x) data preparatiion</a></span></li><li><span><a href="#3.3.3-Metastatic-data-from-TCGA" data-toc-modified-id="3.3.3-Metastatic-data-from-TCGA-3.3.3">3.3.3 Metastatic data from TCGA</a></span></li><li><span><a href="#3.3.4-Test-data-from-GEO" data-toc-modified-id="3.3.4-Test-data-from-GEO-3.3.4">3.3.4 Test data from GEO</a></span></li></ul></li><li><span><a href="#3.4-Build-DNN-model" data-toc-modified-id="3.4-Build-DNN-model-3.4">3.4 Build DNN model</a></span><ul class="toc-item"><li><span><a href="#3.4.1-Model-training-example" data-toc-modified-id="3.4.1-Model-training-example-3.4.1">3.4.1 Model training 
example</a></span></li><li><span><a href="#3.4.2-Model-selection" data-toc-modified-id="3.4.2-Model-selection-3.4.2">3.4.2 Model selection</a></span></li></ul></li><li><span><a href="#3.5-Evaluate-DNN-model" data-toc-modified-id="3.5-Evaluate-DNN-model-3.5">3.5 Evaluate DNN model</a></span><ul class="toc-item"><li><span><a href="#3.5.1-Cross-validation-(10x)" data-toc-modified-id="3.5.1-Cross-validation-(10x)-3.5.1">3.5.1 Cross validation (10x)</a></span></li><li><span><a href="#3.5.2-Test-set-from-TCGA" data-toc-modified-id="3.5.2-Test-set-from-TCGA-3.5.2">3.5.2 Test set from TCGA</a></span></li><li><span><a href="#3.5.3-Metastatic-data-from-TCGA" data-toc-modified-id="3.5.3-Metastatic-data-from-TCGA-3.5.3">3.5.3 Metastatic data from TCGA</a></span></li><li><span><a href="#3.5.4-Test-data-from-GEO" data-toc-modified-id="3.5.4-Test-data-from-GEO-3.5.4">3.5.4 Test data from GEO</a></span></li></ul></li></ul></li></ul></div>

## 1 Project Description

Around 5% of metastatic malignancies are cancers of unknown primary origin (CUP), and 80% of CUP patients have a poor prognosis. Determining the cancer origin and combining it with site-specific treatment has been shown to improve patient outcomes. However, determining the tissue of origin of CUP is challenging in clinical settings. Existing pathology and gene expression-based techniques are time-consuming, costly, and often have limited performance. We aim to develop a high-performance and easily implemented model for cancer origin prediction.
   
We developed a deep neural network (DNN)-based tissue-of-origin classifier using DNA methylation data from 7,342 patients in The Cancer Genome Atlas (TCGA), covering 18 different cancer origins.

## 2 Requirement for reproducing this project
To reproduce this project, we recommend the following steps:
1. Create an environment using the given .yml file
2. Launch Jupyter Notebook
3. Choose the Jupyter kernel in that environment
4. Follow the procedures below
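
Once the kernel is selected, a quick check can confirm that the packages imported in Section 3.1 are visible to it. The helper below is only a sketch: `missing_packages` is a hypothetical name and not part of this project, and it reports only which top-level packages cannot be located.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of top-level package names that cannot be located."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Packages imported in Section 3.1 (sklearn is the import name of scikit-learn).
required = ["numpy", "pandas", "scipy", "sklearn", "tensorflow"]
print("Missing packages:", missing_packages(required))
```

An empty list means the environment created from the .yml file is the one the kernel is using.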

## 3  Cancer origin classifier

In [1]:
%load_ext autoreload
%autoreload 2

### 3.1 Import required packages and modules

In [2]:
# import required modules
import sys
sys.path.append("../src/data")
sys.path.append("../src/model")

import warnings
warnings.filterwarnings("ignore")
import argparse
from collections import OrderedDict

import numpy as np
import pandas as pd
import time
import scipy as sp
import scipy.stats
from sklearn.metrics import precision_recall_curve, average_precision_score, confusion_matrix
import tensorflow as tf
from shutil import copy

# self modules
from data import Data 
from DNN_model import DNN_model
from model_training import ModelTraining
from model_evaluation import ModelEval
from model_testing import run_testing
from model_selection import run_model_selection



### 3.2  Make project structure

In [4]:
# data folder
DATA_RAW = '../data/raw/'
DATA_TRAIN_DEV_TEST = '../data/train_dev_test/'
DATA_METASTATIC = '../data/metastatic/'
DATA_GEO = '../data/GEO/'
DATA_CV = '../data/CV/'

# model folder
MODELS = '../DNN_model/'
# result folder
RESULTS = '../results/'
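
The later cells write checkpoints and CSV files into subfolders of these paths, so the folders need to exist beforehand. A minimal sketch for creating them is shown here; `make_project_dirs` is a hypothetical helper (not part of this project), and the subfolder names are the ones used in the cells of this notebook.

```python
import os

def make_project_dirs(root=".."):
    """Create the data, model, and result folders under `root`; return their paths."""
    subdirs = [
        "data/raw", "data/train_dev_test", "data/metastatic", "data/GEO", "data/CV",
        "DNN_model/train_model", "DNN_model/models",
        "DNN_model/best_model", "DNN_model/cv_model2",
        "results/CV", "results/test2", "results/metastatic", "results/GEO",
    ]
    paths = [os.path.join(root, d) for d in subdirs]
    for p in paths:
        # exist_ok=True makes the call safe to repeat on an existing layout.
        os.makedirs(p, exist_ok=True)
    return paths
```

Calling `make_project_dirs("..")` from the notebook folder would match the relative paths defined above.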



### 3.3 Data preparation

#### 3.3.1 Train, dev, and test data preparation

In [14]:
inputfile = DATA_RAW + "all_labeled_data1.csv"
inputfile_meta = DATA_RAW + "all_labeled_data1_meta.csv"
num_of_case = 100
outdir = DATA_TRAIN_DEV_TEST
# Data.train_dev_test_prep(inputfile, inputfile_meta, num_of_case, outdir)
# !python3 ./codes/data_prep.py train_dev_test \
#                                         --datafile inputfile \
#                                         --metadatafile inputfile_meta \
#                                         --outdir outdir/

<p align="center">
  <img src="../figures/data.jpg" width="800">
</p>
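
The internals of `Data.train_dev_test_prep` are not shown in this notebook. As a rough illustration of the general idea only, the sketch below splits sample indices into train, dev, and test sets while keeping each cancer origin's share roughly constant. The 60/20/20 fractions are an assumption (they are consistent with the dev and test sample size of 1468 out of 7,342 used later), `stratified_split` is a hypothetical name, and the role of `num_of_case` is left out because it is not documented here.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split sample indices into train/dev/test so each label keeps a similar share."""
    by_label = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_label[lab].append(idx)
    rng = random.Random(seed)
    train, dev, test = [], [], []
    for lab in sorted(by_label):
        idxs = by_label[lab]
        rng.shuffle(idxs)
        n_train = int(round(fractions[0] * len(idxs)))
        n_dev = int(round(fractions[1] * len(idxs)))
        train += idxs[:n_train]
        dev += idxs[n_train:n_train + n_dev]
        test += idxs[n_train + n_dev:]  # remainder goes to the test set
    return train, dev, test

# Toy labels for illustration only (not project data).
labels = ["Breast"] * 50 + ["Lung"] * 30 + ["Liver"] * 20
train, dev, test = stratified_split(labels)
print(len(train), len(dev), len(test))
```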

#### 3.3.2 Cross validation (10x) data preparatiion

In [None]:
datafile = DATA_TRAIN_DEV_TEST + "train1.csv"
outdir = DATA_CV
n_fold = 10
# Data.CV_prep(datafile, outdir, n_fold)
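
For readers unfamiliar with k-fold cross validation, the sketch below shows the index bookkeeping behind it: every sample is held out exactly once, and the remaining folds form the training set. It is a generic illustration under the assumption of a plain shuffled split; `kfold_indices` is a hypothetical name, and `Data.CV_prep` may differ (for example, by stratifying or by writing files).

```python
import random

def kfold_indices(n_samples, n_fold=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is held out exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into n_fold nearly equal folds.
    folds = [idx[i::n_fold] for i in range(n_fold)]
    for k in range(n_fold):
        test_idx = folds[k]
        train_idx = [j for f, fold in enumerate(folds) if f != k for j in fold]
        yield train_idx, test_idx

splits = list(kfold_indices(25, n_fold=5))
print([len(test_idx) for _, test_idx in splits])
```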

#### 3.3.3 Metastatic data from TCGA

In [None]:
datafile = DATA_RAW + 'metastatic_data1.csv'
metadatafile = DATA_RAW + 'metastatic_data1_meta.csv'
codesfile = DATA_TRAIN_DEV_TEST + 'code.csv'
featurefile = DATA_TRAIN_DEV_TEST + 'features.txt' 
outdir = DATA_METASTATIC
# Data.test_prep(datafile, metadatafile, codesfile, featurefile, outdir)
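
The arguments suggest that external samples are brought into the same layout as the training data: the same feature order (`features.txt`) and the same integer label codes (`code.csv`). The actual behavior of `Data.test_prep` is not shown here, so the following is only a sketch of that idea. The function names, probe IDs, label codes, and the `fill_value` used for absent features are all assumptions made for illustration.

```python
def align_features(sample, feature_names, fill_value=0.0):
    """Return the sample's values ordered like `feature_names`.

    Features absent from the sample get `fill_value` (a placeholder choice).
    """
    return [sample.get(name, fill_value) for name in feature_names]

def encode_labels(labels, codes):
    """Map tissue names to the integer codes used at training time."""
    return [codes[label] for label in labels]

feature_names = ["probe_a", "probe_b", "probe_c"]  # hypothetical probe IDs
codes = {"Breast": 0, "Lung": 1}                   # hypothetical label codes
sample = {"probe_c": 0.9, "probe_a": 0.1, "probe_x": 0.5}

row = align_features(sample, feature_names)
print(row)
print(encode_labels(["Lung", "Breast"], codes))
```

Features that the model never saw (`probe_x` here) are dropped, so the resulting row always has the length the trained network expects.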

#### 3.3.4 Test data from GEO

In [None]:
datafile = DATA_RAW + 'GEO_combined_final.csv'
metadatafile = DATA_RAW + 'GEO_combined_final_meta.csv'
codesfile = DATA_TRAIN_DEV_TEST + 'code.csv'
featurefile = DATA_TRAIN_DEV_TEST + 'features.txt'
outdir = DATA_GEO
# Data.test_prep(datafile, metadatafile, codesfile, featurefile, outdir)

### 3.4 Build DNN model

#### 3.4.1 Model training example

In [72]:
trainfile = DATA_TRAIN_DEV_TEST + 'train1.tfrecords'
unit = 64
modelfile = MODELS + 'train_model/train.ckpt'
train = ModelTraining()
train.run_training(trainfile, unit, modelfile)

Shape of features: (128, 10360)
Shape of label: (128,)
Step 0: loss = 3.19; accuracy = 0.0469 (0.100 sec)
Step 100: loss = 0.53; accuracy = 0.9062 (0.030 sec)
Step 200: loss = 0.16; accuracy = 0.9453 (0.032 sec)
Step 300: loss = 0.13; accuracy = 0.9531 (0.034 sec)
Step 400: loss = 0.09; accuracy = 0.9609 (0.033 sec)
Step 500: loss = 0.09; accuracy = 0.9688 (0.032 sec)
Step 600: loss = 0.06; accuracy = 0.9844 (0.033 sec)
Step 700: loss = 0.08; accuracy = 0.9922 (0.044 sec)
Step 800: loss = 0.06; accuracy = 0.9922 (0.032 sec)
Step 900: loss = 0.02; accuracy = 0.9922 (0.032 sec)
Step 1000: loss = 0.02; accuracy = 0.9844 (0.044 sec)
Done training for 30 epochs, 1019 steps.
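
The first logged step is a useful sanity check. Assuming the model is trained with softmax cross-entropy measured in nats (an assumption; the loss is defined in `DNN_model`, which is not shown here), a network that assigns equal probability to all 18 cancer origins would have a loss of ln 18 and, on balanced classes, an accuracy near 1/18. The logged values at step 0 (loss 3.19, accuracy 0.0469) are close to that baseline, which is what an untrained network should produce.

```python
import math

def uniform_cross_entropy(n_classes):
    """Cross-entropy (in nats) of a prediction that gives 1/n_classes to every class."""
    return -math.log(1.0 / n_classes)

def chance_accuracy(n_classes):
    """Expected accuracy of uniform random guessing on balanced classes."""
    return 1.0 / n_classes

print(round(uniform_cross_entropy(18), 2))  # 2.89
print(round(chance_accuracy(18), 4))        # 0.0556
```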


#### 3.4.2 Model selection

In [75]:
trainfile = DATA_TRAIN_DEV_TEST + 'train1.tfrecords'
testfile = DATA_TRAIN_DEV_TEST + 'dev.tfrecords'
units = [32, 64, 128]
sample_size = 1468
model_dir = MODELS + 'models/'
best_model_dir = MODELS + 'best_model/'

run_model_selection(trainfile, testfile, units, sample_size, model_dir, best_model_dir)


Running training 1 of 3 ...
Unit of hidden layer: 32
Shape of features: (128, 10360)
Shape of label: (128,)
Step 0: loss = 2.92; accuracy = 0.0469 (0.061 sec)
Step 100: loss = 1.08; accuracy = 0.7734 (0.029 sec)
Step 200: loss = 0.47; accuracy = 0.9219 (0.029 sec)
Step 300: loss = 0.31; accuracy = 0.9688 (0.027 sec)
Step 400: loss = 0.29; accuracy = 0.9375 (0.029 sec)
Step 500: loss = 0.19; accuracy = 0.9453 (0.028 sec)
Step 600: loss = 0.20; accuracy = 0.9766 (0.028 sec)
Step 700: loss = 0.11; accuracy = 0.9688 (0.028 sec)
Step 800: loss = 0.09; accuracy = 0.9531 (0.027 sec)
Step 900: loss = 0.08; accuracy = 0.9609 (0.030 sec)
Step 1000: loss = 0.11; accuracy = 0.9141 (0.029 sec)
Done training for 30 epochs, 1019 steps.
INFO:tensorflow:Restoring parameters from ./DNN_model/models/model_0.ckpt
Iteration: 1
Testing done!
Accuracy: 0.9489100575447083

Running training 2 of 3 ...
Unit of hidden layer: 64
Shape of features: (128, 10360)
Shape of label: (128,)
Step 0: loss = 3.19; accuracy

<p align="center">
  <img src="../figures/model.jpg" width="600">
</p>
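
The log above shows one training run per candidate hidden-layer size, each followed by an evaluation on the dev set. The selection step itself reduces to keeping the candidate with the highest score. The sketch below illustrates that pattern only; `select_best` and the toy scoring function are hypothetical and do not reproduce `run_model_selection` or any result from this project.

```python
def select_best(candidates, evaluate):
    """Return (best_candidate, scores), where scores[c] == evaluate(c)."""
    scores = {c: evaluate(c) for c in candidates}
    best = max(scores, key=scores.get)
    return best, scores

# Toy scoring function for illustration; in practice `evaluate` would
# train a model with the given setting and return its dev-set accuracy.
def toy_score(x):
    return -(x - 2) ** 2

best, scores = select_best([1, 2, 3], toy_score)
print(best, scores)
```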

### 3.5 Evaluate DNN model

#### 3.5.1 Cross validation (10x)

In [90]:
CVData_dir = DATA_CV
CVModel_dir = MODELS + 'cv_model2/'
n_fold = 10
sample_size = 431
units = 64
codesfile = DATA_TRAIN_DEV_TEST + 'code.csv'
results_dir = RESULTS + 'CV/'

SSPN, SSPN_stat = ModelEval.cross_validation(CVData_dir, CVModel_dir, n_fold, sample_size, units, codesfile)
print(SSPN)
print(SSPN_stat)
SSPN.to_csv(results_dir + 'sspn.csv')
SSPN_stat.to_csv(results_dir + 'sspn_stat.csv')



Running fold 1 of 10 ...
./data/CV/train_0.tfrecords
./data/CV/test_0.tfrecords
./DNN_model/cv_model2/model_0.ckpt
Shape of features: (128, 10360)
Shape of label: (128,)
Step 0: loss = 3.17; accuracy = 0.0547 (0.061 sec)
Step 100: loss = 0.31; accuracy = 0.9531 (0.033 sec)
Step 200: loss = 0.25; accuracy = 0.9609 (0.031 sec)
Step 300: loss = 0.11; accuracy = 0.9922 (0.030 sec)
Step 400: loss = 0.08; accuracy = 0.9688 (0.034 sec)
Step 500: loss = 0.06; accuracy = 0.9844 (0.034 sec)
Step 600: loss = 0.06; accuracy = 0.9844 (0.032 sec)
Step 700: loss = 0.06; accuracy = 0.9922 (0.034 sec)
Step 800: loss = 0.05; accuracy = 1.0000 (0.034 sec)
Step 900: loss = 0.03; accuracy = 0.9922 (0.032 sec)
Done training for 30 epochs, 915 steps.
INFO:tensorflow:Restoring parameters from ./DNN_model/cv_model2/model_0.ckpt
Iteration: 1
Testing done!
Accuracy: 0.9512761235237122


Running fold 2 of 10 ...
./data/CV/train_1.tfrecords
./data/CV/test_1.tfrecords
./DNN_model/cv_model2/model_1.ckpt
Shape of f

Step 400: loss = 0.06; accuracy = 0.9688 (0.043 sec)
Step 500: loss = 0.08; accuracy = 0.9453 (0.039 sec)
Step 600: loss = 0.10; accuracy = 0.9766 (0.038 sec)
Step 700: loss = 0.07; accuracy = 1.0000 (0.037 sec)
Step 800: loss = 0.07; accuracy = 0.9688 (0.034 sec)
Step 900: loss = 0.01; accuracy = 0.9922 (0.036 sec)
Done training for 30 epochs, 919 steps.
INFO:tensorflow:Restoring parameters from ./DNN_model/cv_model2/model_9.ckpt
Iteration: 1
Testing done!
Accuracy: 0.9512761235237122
                                  0         1         2         3         4  \
Specificity                0.997107  0.996913  0.997367  0.996875  0.997916   
Sensitivity                0.928515  0.917158  0.936241  0.934899  0.932885   
Postive_predictive_value   0.962885  0.967006  0.949338  0.933109  0.976234   
Negative_predictive_value  0.997155  0.997059  0.997420  0.996890  0.998031   
Accuracy                   0.951276  0.948956  0.955916  0.946636  0.965197   
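
The rows of this table follow the standard one-vs-rest definitions of specificity, sensitivity, positive predictive value (PPV), and negative predictive value (NPV). As a minimal sketch (not the code in `ModelEval`), and assuming a confusion matrix whose rows are true classes and whose columns are predicted classes, the per-class values can be computed as follows. How the per-class values are combined into one number per fold is not shown in this notebook.

```python
def sspn_from_confusion(cm):
    """Per-class specificity, sensitivity, PPV, and NPV from a square confusion matrix.

    cm[i][j] counts samples of true class i that were predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    nan = float("nan")
    out = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # class k predicted as something else
        fp = sum(cm[i][k] for i in range(n)) - tp   # other classes predicted as k
        tn = total - tp - fn - fp
        out.append({
            "specificity": tn / (tn + fp) if tn + fp else nan,
            "sensitivity": tp / (tp + fn) if tp + fn else nan,
            "ppv": tp / (tp + fp) if tp + fp else nan,
            "npv": tn / (tn + fn) if tn + fn else nan,
        })
    return out

def accuracy_from_confusion(cm):
    """Overall accuracy: correctly classified samples divided by all samples."""
    return sum(cm[k][k] for k in range(len(cm))) / sum(sum(row) for row in cm)

# Toy 3-class matrix for illustration only (not project data).
toy_cm = [[8, 2, 0],
          [1, 9, 0],
          [0, 0, 10]]
for k, metrics in enumerate(sspn_from_confusion(toy_cm)):
    print(k, metrics)
print("accuracy:", accuracy_from_confusion(toy_cm))
```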

                                  

#### 3.5.2 Test set from TCGA

In [7]:
testfile = DATA_TRAIN_DEV_TEST + 'test1.tfrecords'
testmetafile = DATA_TRAIN_DEV_TEST + 'test1_meta.csv'
sample_size = 1468
units = 64
modelfile = MODELS + 'best_model/model_0.ckpt'
codesfile = DATA_TRAIN_DEV_TEST + 'code.csv'
results_dir = RESULTS + 'test2/'

# precision, recall, average_precision, SSPN, cm, acc, preds = \
#     cancer_origin_DNN.run_testing(testfile, testmetafile, units, modelfile, sample_size, codesfile)

accuracy, precision, recall, average_precision_ser, cm_df, SSPN_df, pred_df = \
ModelEval.testeval(testfile, testmetafile, sample_size, units, modelfile, codesfile)

# display metrics
print("\nAccuracy: {}".format(round(accuracy,4)))
print("\nConfusion matrix:")
print(cm_df)
print("\nAverage precision:")
print(average_precision_ser)
print("\n SSPN:\n {}".format(SSPN_df))

# save results to files
pd.Series(accuracy, index=['accuracy']).to_csv(results_dir + "accuracy.txt", sep="\n")
cm_df.to_csv(results_dir + "confusion_matrix.csv")
SSPN_df.to_csv(results_dir + "sspn.csv")
average_precision_ser.to_csv(results_dir + "average_precision.csv")
pred_df.to_csv(results_dir + "pred_results.csv", sep=",")

INFO:tensorflow:Restoring parameters from ../DNN_model/best_model/model_0.ckpt
Iteration: 1
Testing done!
Accuracy: 0.971389651298523

Accuracy: 0.9714000225067139

Confusion matrix:
               Adrenal Gland  Bladder  Brain  Breast  Colorectal  Esophagus  \
Adrenal Gland             46        0      0       0           0          0   
Bladder                    0       81      0       0           0          1   
Brain                      0        0    128       0           0          0   
Breast                     0        0      0     155           0          0   
Colorectal                 0        0      0       0          71          0   
Esophagus                  0        1      0       0           0         25   
Head and Neck              0        0      0       1           0          7   
Kidney                     0        0      0       0           0          0   
Liver                      0        0      0       0           0          0   
Lung                       

#### 3.5.3 Metastatic data from TCGA

In [39]:
# metastatic cancer performance 2


# !python3 ./codes/cancer_origin_DNN.py test \
#                                         --testfile ./data/metastatic_data_15_20/metastatic_data1.tfrecords \
#                                         --testmetafile ./data/metastatic_data_15_20/metastatic_data1_meta.csv \
#                                         --units 64  \
#                                         --modelfile ./DNN_model/best_model/model_0.ckpt \
#                                         --codesfile ./DNN_model/code.csv \
#                                         --sample_size 143 \
#                                         --results_dir ./results/metastatic2/

In [6]:
testfile = DATA_METASTATIC + 'metastatic_data1.tfrecords'
testmetafile = DATA_METASTATIC + 'metastatic_data1_meta.csv'
sample_size = 143
units = 64
modelfile = MODELS + 'best_model/model_0.ckpt'
codesfile = DATA_TRAIN_DEV_TEST + 'code.csv'
results_dir = RESULTS + 'metastatic/'

# precision, recall, average_precision, SSPN, cm, acc, preds = \
#     cancer_origin_DNN.run_testing(testfile, testmetafile, units, modelfile, sample_size, codesfile)

accuracy, precision, recall, average_precision_ser, cm_df, SSPN_df, pred_df = \
ModelEval.testeval(testfile, testmetafile, sample_size, units, modelfile, codesfile)

# display metrics
print("\nAccuracy: {}".format(round(accuracy,4)))
print("\nConfusion matrix:")
print(cm_df)
print("\nAverage precision:")
print(average_precision_ser)
print("\n SSPN:\n {}".format(SSPN_df))

# save results to files
pd.Series(accuracy, index=['accuracy']).to_csv(results_dir + "accuracy.txt", sep="\n")
cm_df.to_csv(results_dir + "confusion_matrix.csv")
SSPN_df.to_csv(results_dir + "sspn.csv")
average_precision_ser.to_csv(results_dir + "average_precision.csv")
pred_df.to_csv(results_dir + "pred_results.csv", sep=",")

INFO:tensorflow:Restoring parameters from ../DNN_model/best_model/model_0.ckpt
Iteration: 1
Testing done!
Accuracy: 0.9370629191398621

Accuracy: 0.9370999932289124

Confusion matrix:
               Adrenal Gland  Bladder  Breast  Colorectal  Esophagus  \
Adrenal Gland              4        0       0           0          0   
Bladder                    0       27       0           0          1   
Breast                     0        0       3           0          0   
Colorectal                 0        0       0           8          0   
Esophagus                  0        0       0           0          2   
Head and Neck              0        0       1           0          6   
Kidney                     0        0       0           0          0   
Liver                      0        0       0           0          0   
Lung                       0        0       0           0          0   
Pancreas                   0        0       0           0          0   
Stomach                 

#### 3.5.4 Test data from GEO

In [5]:
testfile = DATA_GEO + 'GEO_combined_final.tfrecords'
testmetafile = DATA_GEO + 'GEO_combined_final_meta.csv'
sample_size = 581
units = 64
modelfile = MODELS + 'best_model/model_0.ckpt'
codesfile = DATA_TRAIN_DEV_TEST + 'code.csv'
results_dir = RESULTS + 'GEO/'

accuracy, precision, recall, average_precision_ser, cm_df, SSPN_df, pred_df = \
ModelEval.testeval(testfile, testmetafile, sample_size, units, modelfile, codesfile)

# display metrics
print("\nAccuracy: {}".format(round(accuracy,4)))
print("\nConfusion matrix:")
print(cm_df)
print("\nAverage precision:")
print(average_precision_ser)
print("\n SSPN:\n {}".format(SSPN_df))

# save results to files
pd.Series(accuracy, index=['accuracy']).to_csv(results_dir + "accuracy.txt", sep="\n")
cm_df.to_csv(results_dir + "confusion_matrix.csv")
SSPN_df.to_csv(results_dir + "sspn.csv")
average_precision_ser.to_csv(results_dir + "average_precision.csv")
pred_df.to_csv(results_dir + "pred_results.csv", sep=",")

INFO:tensorflow:Restoring parameters from ../DNN_model//best_model/model_0.ckpt
Iteration: 1
Testing done!
Accuracy: 0.9259896874427795

Accuracy: 0.9259999990463257

Confusion matrix:
               Adrenal Gland  Bladder  Breast  Colorectal  Head and Neck  \
Adrenal Gland             14        0       0           0              0   
Bladder                    0       25       0           0              0   
Breast                     0        0      34           0              0   
Colorectal                 0        0       1         108              0   
Head and Neck              0        0       0           0              5   
Kidney                     0        0       0           0              0   
Liver                      0        0       0           0              0   
Lung                       0        0       0           0              0   
Pancreas                   0        0       0           0              0   
Prostate                   0        0       0          