# Gene Expression for Different Tumor Types
<div style="text-align: center;">
    <img src="images/gene_protein_cancer.jpg" alt="image" width="300" height="200">
</div>
[image src: https://www.cancer.gov/about-cancer/causes-prevention/genetics]

This project aims to identify the gene expression patterns associated with 5 types of tumor:
- BRCA (Breast Cancer): the BRCA1 and BRCA2 genes are known tumor suppressors, but mutations in these genes can cause cancer.
- KIRC (Kidney Renal Clear Cell Carcinoma)
- COAD (Colon Adenocarcinoma)
- LUAD (Lung Adenocarcinoma)
- PRAD (Prostate Adenocarcinoma)

The original dataset is published at https://www.synapse.org/Synapse:syn300013/discussion/threadId=5455. The gene names in the dataset are dummy names (`gene_0`, `gene_1`, ...). Per this discussion thread, https://www.synapse.org/Synapse:syn300013/discussion/threadId=5455, the actual gene names are available at https://www.ncbi.nlm.nih.gov/gene.

In [25]:
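import os
import warnings

warnings.filterwarnings("ignore")
DATA_ANALYSIS_DIR = "data-analysis/"
# Create the output directory up front so the later to_csv calls do not fail
os.makedirs(DATA_ANALYSIS_DIR, exist_ok=True)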

## Data Loading


In [33]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from src.utils.ArrayUtils import get_list_of_items_in_both_lists


In [5]:
# read the data from files
url_reconstructed = 'TCGA-PANCAN-HiSeq-801x20531/data.csv'

# url = 'https://drive.google.com/file/d/1VXyhDXpYT8G2Buhkc6kBjw93CLG1y1f0/view?usp=drive_link'
# # Use only the Id and reconstruct the URL
# url_reconstructed = 'https://drive.google.com/uc?id=' + url.split('/')[-2]

df = pd.read_csv(url_reconstructed)
tumor_df = pd.read_csv('TCGA-PANCAN-HiSeq-801x20531/labels.csv')
df

Unnamed: 0.1,Unnamed: 0,gene_0,gene_1,gene_2,gene_3,gene_4,gene_5,gene_6,gene_7,gene_8,...,gene_20521,gene_20522,gene_20523,gene_20524,gene_20525,gene_20526,gene_20527,gene_20528,gene_20529,gene_20530
0,sample_0,0.0,2.017209,3.265527,5.478487,10.431999,0.0,7.175175,0.591871,0.0,...,4.926711,8.210257,9.723516,7.220030,9.119813,12.003135,9.650743,8.921326,5.286759,0.000000
1,sample_1,0.0,0.592732,1.588421,7.586157,9.623011,0.0,6.816049,0.000000,0.0,...,4.593372,7.323865,9.740931,6.256586,8.381612,12.674552,10.517059,9.397854,2.094168,0.000000
2,sample_2,0.0,3.511759,4.327199,6.881787,9.870730,0.0,6.972130,0.452595,0.0,...,5.125213,8.127123,10.908640,5.401607,9.911597,9.045255,9.788359,10.090470,1.683023,0.000000
3,sample_3,0.0,3.663618,4.507649,6.659068,10.196184,0.0,7.843375,0.434882,0.0,...,6.076566,8.792959,10.141520,8.942805,9.601208,11.392682,9.694814,9.684365,3.292001,0.000000
4,sample_4,0.0,2.655741,2.821547,6.539454,9.738265,0.0,6.566967,0.360982,0.0,...,5.996032,8.891425,10.373790,7.181162,9.846910,11.922439,9.217749,9.461191,5.110372,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
796,sample_796,0.0,1.865642,2.718197,7.350099,10.006003,0.0,6.764792,0.496922,0.0,...,6.088133,9.118313,10.004852,4.484415,9.614701,12.031267,9.813063,10.092770,8.819269,0.000000
797,sample_797,0.0,3.942955,4.453807,6.346597,10.056868,0.0,7.320331,0.000000,0.0,...,6.371876,9.623335,9.823921,6.555327,9.064002,11.633422,10.317266,8.745983,9.659081,0.000000
798,sample_798,0.0,3.249582,3.707492,8.185901,9.504082,0.0,7.536589,1.811101,0.0,...,5.719386,8.610704,10.485517,3.589763,9.350636,12.180944,10.681194,9.466711,4.677458,0.586693
799,sample_799,0.0,2.590339,2.787976,7.318624,9.987136,0.0,9.213464,0.000000,0.0,...,5.785237,8.605387,11.004677,4.745888,9.626383,11.198279,10.335513,10.400581,5.718751,0.000000


In [6]:
# df.info() prints its report directly and returns None, so call it on its own line
print("Dataframe before adding the class column:")
df.info()
print(f"Total # of columns: {len(df.columns)}")

Dataframe before adding the class column:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 801 entries, 0 to 800
Columns: 20532 entries, Unnamed: 0 to gene_20530
dtypes: float64(20531), object(1)
memory usage: 125.5+ MB
Total # of columns: 20532


In [7]:
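# Copy the Class column from the labels file into the main dataframe
df = df.rename(columns={'Unnamed: 0': 'sample'})
tumor_df = tumor_df.rename(columns={'Unnamed: 0': 'sample'})
# Row-wise comparison: this assumes both files list the samples in the same order.
# Where the sample ids differ, the sample id itself is stored in place of a class label.
df['Class'] = np.where(df['sample'] == tumor_df['sample'], tumor_df['Class'], df['sample'])

# df['Class'] = tumor_df['Class']
# print(f"Total # of columns: {len(df.columns)}")

# The sample id is no longer needed once the labels are attached
df.drop(columns=['sample'], inplace=True)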

df.head()


Unnamed: 0,gene_0,gene_1,gene_2,gene_3,gene_4,gene_5,gene_6,gene_7,gene_8,gene_9,...,gene_20522,gene_20523,gene_20524,gene_20525,gene_20526,gene_20527,gene_20528,gene_20529,gene_20530,Class
0,0.0,2.017209,3.265527,5.478487,10.431999,0.0,7.175175,0.591871,0.0,0.0,...,8.210257,9.723516,7.22003,9.119813,12.003135,9.650743,8.921326,5.286759,0.0,PRAD
1,0.0,0.592732,1.588421,7.586157,9.623011,0.0,6.816049,0.0,0.0,0.0,...,7.323865,9.740931,6.256586,8.381612,12.674552,10.517059,9.397854,2.094168,0.0,LUAD
2,0.0,3.511759,4.327199,6.881787,9.87073,0.0,6.97213,0.452595,0.0,0.0,...,8.127123,10.90864,5.401607,9.911597,9.045255,9.788359,10.09047,1.683023,0.0,PRAD
3,0.0,3.663618,4.507649,6.659068,10.196184,0.0,7.843375,0.434882,0.0,0.0,...,8.792959,10.14152,8.942805,9.601208,11.392682,9.694814,9.684365,3.292001,0.0,PRAD
4,0.0,2.655741,2.821547,6.539454,9.738265,0.0,6.566967,0.360982,0.0,0.0,...,8.891425,10.37379,7.181162,9.84691,11.922439,9.217749,9.461191,5.110372,0.0,BRCA
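
The `np.where` step above only gives correct labels when `data.csv` and `labels.csv` list the samples in the same row order. A minimal sketch of an alternative that joins on the `sample` key instead, so row order no longer matters. The small frames below are made-up stand-ins for the real files, and the names `expr` and `labels` are illustrative, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for data.csv and labels.csv (values are made up for illustration).
expr = pd.DataFrame({
    "sample": ["sample_0", "sample_1", "sample_2"],
    "gene_0": [0.0, 0.0, 0.0],
    "gene_1": [2.0, 0.6, 3.5],
})
labels = pd.DataFrame({
    "sample": ["sample_2", "sample_0", "sample_1"],  # deliberately shuffled
    "Class": ["PRAD", "PRAD", "LUAD"],
})

# Join on the sample id; validate= raises if an id is duplicated on either side.
merged = expr.merge(labels, on="sample", how="left", validate="one_to_one")

# A sample without a label would show up as NaN in Class.
assert merged["Class"].notna().all()

merged = merged.drop(columns="sample")
print(merged)
```

A left join keeps every expression row, and the `notna` check surfaces missing labels instead of silently writing the sample id into `Class`.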


## Exploratory Data Analysis (EDA)
### Data Cleaning and PreProcessing

Data:
- No NaNs, as stated on the data source page; the check below confirms this.


In [8]:
df.isnull().any().any()

False

In [9]:
print(f"Distribution of Class values: \n{df['Class'].value_counts()}")


Distribution of Class values: 
Class
BRCA    300
KIRC    146
LUAD    141
PRAD    136
COAD     78
Name: count, dtype: int64


In [10]:
# Stratified 80/20 split so each tumor class keeps its share in both sets
X = df.drop(columns=['Class'])
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
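
Because the classes are imbalanced (300 BRCA samples against 78 COAD), it can be worth checking that the stratified split really preserves the class shares. A small sketch under stated assumptions: the labels are synthetic but use the class counts reported above, the features are random noise, and the names `X_demo`, `y_demo`, and `shares` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts reported above; features are random noise.
counts = {"BRCA": 300, "KIRC": 146, "LUAD": 141, "PRAD": 136, "COAD": 78}
y_demo = pd.Series(np.repeat(list(counts), list(counts.values())), name="Class")
X_demo = pd.DataFrame(np.random.default_rng(0).normal(size=(len(y_demo), 3)))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Compare class shares across the full set and both splits.
shares = pd.DataFrame({
    "full": y_demo.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
})
print(shares.round(3))
```

The three columns should agree to within about one sample per class; a larger gap would suggest the split was not stratified.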


In [20]:
type(y_train)

pandas.core.series.Series

### Univariate and Multivariate Analysis
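Given the large number of variables (20,531 gene columns), plotting each one against the others is impractical. Instead, summary statistics and the feature correlation matrix are written to CSV files for inspection.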

In [None]:
from scipy import stats

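# Summary statistics per gene (describe() skips the non-numeric Class column)
df_desc = df.describe()
df_desc_t = df_desc.T
df_desc_t.to_csv(DATA_ANALYSIS_DIR + "df_describe.csv")


# NOTE: this z-scores the eight summary statistics within each gene column,
# not the raw expression values, despite the output file name
df_desc_zscore = df_desc.apply(stats.zscore)
df_desc_zscore.T.to_csv(DATA_ANALYSIS_DIR + "rawdata_zscore.csv")
# df_desc['upperbound'] = df_desc['mean'] + 3*(df_desc['std'])
# df_desc['lowerbound'] = df_desc['mean'] - 3*(df_desc['std'])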
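
The commented-out bounds above point toward a mean ± 3 standard deviations outlier check. One way that check could look when applied to the raw expression values is sketched here. The data is synthetic, the planted outlier and the constant column are illustrative assumptions, and the 3-sigma cutoff is taken from the commented lines rather than from any validated choice.

```python
import numpy as np
import pandas as pd

# Synthetic expression matrix (assumption: 200 samples x 4 genes of random values).
rng = np.random.default_rng(0)
expr = pd.DataFrame(rng.normal(5.0, 1.0, size=(200, 4)),
                    columns=[f"gene_{i}" for i in range(4)])
expr.loc[0, "gene_1"] = 25.0   # plant one obvious outlier
expr["gene_3"] = 0.0           # a constant column, like an all-zero gene

mean, std = expr.mean(), expr.std(ddof=0)

# Per-gene z-score; constant columns have std == 0, so mask them to NaN
# rather than dividing by zero.
zscores = (expr - mean) / std.where(std > 0)

# Count values lying more than 3 standard deviations from the gene mean.
outliers_per_gene = (zscores.abs() > 3).sum()
print(outliers_per_gene)
```

Masking zero-variance genes keeps them out of the outlier counts; they are better handled as a separate feature-selection question.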


In [None]:
corr_df = X.corr()
corr_df.to_csv(DATA_ANALYSIS_DIR + 'features_corr_matrix.csv')


In [None]:
# Are there any features whose correlation values are all NaN? Identify those columns
corr_nan_list = corr_df.columns[corr_df.isna().all()].tolist()
print(len(corr_nan_list))

# -- 267 columns have all-NaN correlations. Pearson correlation divides by each column's
# -- standard deviation, so these columns are most likely constant (std deviation = 0)



267
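
The zero-variance explanation can be checked directly by comparing the all-NaN correlation columns against the columns that hold a single distinct value. A minimal sketch on a toy frame (made-up values; `X_demo`, `nan_cols`, and `constant_cols` are illustrative names):

```python
import numpy as np
import pandas as pd

# Toy feature matrix (made-up values): gene_b is constant.
X_demo = pd.DataFrame({
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [0.0, 0.0, 0.0, 0.0],
    "gene_c": [2.0, 1.0, 4.0, 3.0],
})

corr = X_demo.corr()

# Columns whose correlations are all NaN ...
nan_cols = corr.columns[corr.isna().all()].tolist()
# ... versus columns holding a single distinct value.
constant_cols = X_demo.columns[X_demo.nunique() <= 1].tolist()

print(nan_cols, constant_cols)

# Dropping the constant columns leaves a fully defined correlation matrix.
X_reduced = X_demo.drop(columns=constant_cols)
print(X_reduced.corr().isna().any().any())
```

If the two lists match on the real data, the 267 columns carry no information for separating tumor classes and are natural candidates to drop before modeling.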


### Feature Selection


### Dimensionality Reduction

In [28]:
# PCA -
# -- Is PCA the right approach for this data set, where we are trying to identify gene expressions for tumors?
# -- Shouldn't we include every expression, small or big?
from sklearn.decomposition import PCA
pca = PCA()
pca_comp_df = pca.fit_transform(X_train)
print(f"PCA Explained Variance Ratio: \n{pca.explained_variance_ratio_}")
print(f"PCA Components: {pca.n_components_}")
pca_features_in = pca.feature_names_in_
print(f"PCA Feature names: {pca_features_in}")

# NOTE: pca_df gets a fresh 0..n-1 index while y_train keeps its original row labels.
# concat aligns on the index, so the class labels are not matched to the right rows here.
pca_df = pd.DataFrame(data=pca_comp_df)
pca_df = pd.concat([pca_df, y_train], axis=1, ignore_index=True)



PCA Explained Variance Ratio: 
[1.59220242e-01 1.05975337e-01 9.57299418e-02 6.79956750e-02
 3.61046300e-02 2.94611898e-02 2.76173843e-02 1.51322893e-02
 1.37501321e-02 1.28456045e-02 9.87373210e-03 8.84108003e-03
 7.96643422e-03 7.26274067e-03 6.89395205e-03 6.19461554e-03
 5.79974979e-03 5.35740114e-03 4.75495425e-03 4.66714491e-03
 4.42726706e-03 4.06232774e-03 3.99256990e-03 3.93444182e-03
 3.71252300e-03 3.54546333e-03 3.44734177e-03 3.27411548e-03
 3.19962321e-03 2.97984751e-03 2.85453857e-03 2.81858141e-03
 2.77047037e-03 2.67311588e-03 2.59320334e-03 2.53469385e-03
 2.49188122e-03 2.46640161e-03 2.42110915e-03 2.36048855e-03
 2.33856292e-03 2.26867237e-03 2.22436344e-03 2.20445720e-03
 2.17801926e-03 2.11312046e-03 2.07962187e-03 2.05199674e-03
 2.03294979e-03 1.97534966e-03 1.92704601e-03 1.87474506e-03
 1.84842961e-03 1.84698704e-03 1.79900842e-03 1.76290252e-03
 1.74256262e-03 1.72126626e-03 1.68922194e-03 1.67952706e-03
 1.64823199e-03 1.62933155e-03 1.60052468e-03 1.587509
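
The explained variance ratios are easier to act on as a running total, which shows how many components are needed to reach a target share of the variance. A rough sketch on synthetic data: the 3-factor structure and the 0.95 threshold are illustrative assumptions, not values taken from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumption): 100 samples, 20 features driven by 3 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 3))
X_demo = latent @ rng.normal(size=(3, 20)) + 0.05 * rng.normal(size=(100, 20))

pca_demo = PCA().fit(X_demo)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95  # illustrative choice
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print(n_keep, cumulative[:5].round(3))
```

scikit-learn can also make this selection itself when `n_components` is passed as a float between 0 and 1.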

In [None]:
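# NOTE: despite the file name, this saves the PCA-transformed samples plus the class label,
# not the explained variance ratios
pca_df.to_csv(f"{DATA_ANALYSIS_DIR}PCA_ExpVarianceRatio.csv")
pca_df
# TODO: How to get column names as part of the PCA output? Also clean up the all-NaN rows
# beyond 639 (a side effect of the index mismatch in the concat above)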

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,631,632,633,634,635,636,637,638,639,640
0,-51.070332,-45.412590,18.734251,-19.347881,45.952566,-33.455497,-102.502462,52.786352,-55.061777,16.577177,...,-0.227652,0.074595,0.169516,-0.043755,0.011883,-0.335815,-0.360455,0.268831,-9.225807e-14,PRAD
1,6.352113,-43.593935,-32.577610,87.716488,-11.109400,-37.356347,60.506773,15.429453,-10.541383,5.315070,...,0.811282,-0.392568,0.566179,-0.224008,0.683542,0.272688,0.248482,-0.483681,-9.225807e-14,LUAD
2,166.398019,33.762187,-5.989695,-49.206248,-6.477480,-24.217814,5.907173,-27.687875,-30.819695,21.609567,...,0.269610,0.189631,-0.776303,1.133406,0.648541,-1.220832,0.134893,0.579701,-9.225807e-14,PRAD
3,-56.541858,4.987669,52.860680,-22.613119,1.751871,4.210259,1.770921,-5.892456,-7.932082,-27.749779,...,-0.624888,-0.623122,-0.374542,0.086730,-0.427019,0.707169,-0.670984,-0.281473,-9.225807e-14,PRAD
4,-9.993597,-124.545834,-89.397177,-66.597823,-25.295944,22.772228,1.245184,-1.507611,8.685631,-3.391902,...,-0.020721,0.471720,0.865990,-0.333269,1.417393,0.074441,-0.481297,0.751671,-9.225807e-14,BRCA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
733,,,,,,,,,,,...,,,,,,,,,,KIRC
651,,,,,,,,,,,...,,,,,,,,,,BRCA
777,,,,,,,,,,,...,,,,,,,,,,KIRC
728,,,,,,,,,,,...,,,,,,,,,,BRCA
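
The all-NaN rows in the table above come from index alignment: the PCA scores are numbered 0 to 639, while `y_train` keeps the row labels of the original dataframe. One possible way to address both TODO items is sketched below: it reuses the training index for the scores and names the components. The toy data and the names `X_tr`, `y_tr`, `scores_df`, and `loadings` are illustrative. Note that principal components are linear combinations of all genes, so they have no gene names of their own; the loadings table is what links them back to the original columns.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Toy training split (made-up values) with a non-contiguous index, as train_test_split produces.
rng = np.random.default_rng(0)
X_tr = pd.DataFrame(rng.normal(size=(6, 4)),
                    columns=[f"gene_{i}" for i in range(4)],
                    index=[733, 12, 651, 5, 777, 40])
y_tr = pd.Series(["KIRC", "PRAD", "BRCA", "LUAD", "KIRC", "COAD"],
                 index=X_tr.index, name="Class")

pca_demo = PCA()
scores = pca_demo.fit_transform(X_tr)

# Reuse the training index so the scores line up with y_tr row by row,
# and give the components readable names (PC1, PC2, ...).
pc_names = [f"PC{i + 1}" for i in range(scores.shape[1])]
scores_df = pd.DataFrame(scores, index=X_tr.index, columns=pc_names)
scores_df = pd.concat([scores_df, y_tr], axis=1)

# Loadings: how strongly each original gene contributes to each component.
loadings = pd.DataFrame(pca_demo.components_, index=pc_names, columns=X_tr.columns)
print(scores_df.shape, loadings.shape)
```

With matching indexes, the concatenated frame has exactly one row per training sample and no NaN padding.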


In [None]:
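# -- Are the columns with all-NaN correlations among the features PCA was fitted on?
# -- (feature_names_in_ lists every column passed to fit(), so all of them are expected to appear)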
features_with_nan_in_pca = get_list_of_items_in_both_lists(corr_nan_list, pca_features_in.tolist())
print(len(features_with_nan_in_pca))

267
