# Analysis of Molecular Subtypes in Breast Cancer Tumors using Data Visualization
This data set contains information on breast cancer patients with four major molecular subtypes of tumors: Luminal A, Luminal B, HER2 Enriched, and Basal (triple negative).

#### This is a project for Dr. Bora Pajo's Data Visualization course in the Master of Science in Data Science program. It can also be used as a guide to data visualization.

## Table of Contents

Part 1: Introduction

Part 2: Information on Dataset

Part 3: Visualization using Matplotlib on Clinical Dataset

Part 4: Visualization using seaborn on Breast Cancer Proteomes

# Part 1: Introduction 

Breast cancer is a multifactorial disease that forms in the cells of the breast. It can occur in both men and women, but it is far more common in women. Breast cancer is the most common form of cancer in American women: the average risk of an American woman developing breast cancer in her lifetime is about 1 in 8, or about 12%. The chance that a woman will die from breast cancer is about 2.7%.

**Density map of worldwide incidents of Breast Cancer**
![breastcancerstatsworldwide](http://www.worldwidebreastcancer.com/wp-content/uploads/2011/08/breastcancerstatsworldwide.jpg)

Breast cancer is often referred to as one disease; however, there are many different types of breast cancer, distinguished by the type of tumor. Tumors can vary in location, size, shape, and grade (severity). These characteristics, along with hormone receptor status and HER2 status, affect prognosis.

**Breast Cancer by the Numbers** 
![bythenumbers](http://www.julep.com/blog/wp-content/uploads/2014/10/FINAL-2014-Infographic-Julep_EdithSanford_5x5-1.jpg)

Many factors increase one’s chances of developing breast cancer, and mutations in DNA are often the root cause. Factors that may influence one’s susceptibility include mutations to genes that assist in cell differentiation (proto-oncogenes) and mutations to genes that modulate cell division, cellular repair, and apoptosis (tumor suppressor genes).

Breast cancer awareness and research funding have helped create advances in the diagnosis and treatment of breast cancer. Medical advances and innovation in treating breast cancer have increased survival rates, lowered recurrence rates, and lowered the number of deaths associated with the disease. More recently, the introduction of precision medicine and gene therapy has the potential to transform how physicians treat breast cancer. Precision medicine refers to the tailoring of medical treatment based on the cellular profile of a disease and the patient’s genome.

Research has identified four major molecular subtypes within breast cancer tumors, based on the genes that the cancer cells express: Luminal A, Luminal B, HER2 Enriched, and Basal (triple negative). Identifying and studying these subtypes has potential for planning more effective treatment and developing new therapies. Currently, prognosis and treatment decisions are guided mainly by tumor stage, tumor grade, hormone receptor status and HER2 status. Molecular subtypes are mostly used in research settings; they are not part of a patient's report and are not used to guide treatment. However, the use of molecular subtypes has greatly expanded: by determining which genes are expressed in tumor samples, identifying tumor subtypes can improve prognosis.

**The scope of this project is to visualize the differences in gene expression, overall survival time, and average age at initial pathological diagnosis among the molecular subtypes of breast cancer.**
 

# Part 2: Information on Dataset


1. 77_cancer_proteomes_CPTAC_itraq.csv

	Includes 12553 unique genes from a total of 83 breast cancer patients


2. clinical_data_breast_cancer.csv

	Contains clinical information from 105 breast cancer patients
    
    Variables: **'Complete TCGA ID',
 'Gender',
 'Age at Initial Pathologic Diagnosis',
 'ER Status',
 'PR Status',
 'HER2 Final Status',
 'Tumor',
 'Tumor--T1 Coded',
 'Node',
 'Node-Coded',
 'Metastasis',
 'Metastasis-Coded',
 'AJCC Stage',
 'Converted Stage',
 'Survival Data Form',
 'Vital Status',
 'Days to Date of Last Contact',
 'Days to date of Death',
 'OS event',
 'OS Time',
 'PAM50 mRNA',
 'SigClust Unsupervised mRNA',
 'SigClust Intrinsic mRNA',
 'miRNA Clusters',
 'methylation Clusters',
 'RPPA Clusters',
 'CN Clusters',
 'Integrated Clusters (with PAM50)',
 'Integrated Clusters (no exp)',
 'Integrated Clusters (unsup exp)'**


3. BCGENES.csv 

    This dataset was generated using a subset reduction technique to trim down noisy gene variables and dplyr's inner_join to join variables from 77_cancer_proteomes_CPTAC_itraq.csv and clinical_data_breast_cancer.csv
    
    Variables: **'PAM50 mRNA',
 'myoferlin isoform a',
 'heat shock protein HSP 90-beta isoform a',
 'keratin, type II cytoskeletal 72 isoform 1',
 'dedicator of cytokinesis protein 1',
 'keratin, type I cytoskeletal 23',
 '1-phosphatidylinositol 4,5-bisphosphate phosphodiesterase beta-3 isoform 1',
 'TBC1 domain family member 1 isoform 2',
 'kinesin-like protein KIF21A isoform 1',
 'ubiquitin carboxyl-terminal hydrolase 4 isoform a',
 'PREDICTED: myomegalin-like',
 'KN motif and ankyrin repeat domain-containing protein 1 isoform a',
 'spermatogenesis-associated protein 5',
 'TBC1 domain family member 9',
 'receptor tyrosine-protein kinase erbB-2 isoform a precursor',
 'epidermal growth factor receptor isoform a precursor',
 'UPF0505 protein C16orf62',
 'MICAL-like protein 1',
 'UTP--glucose-1-phosphate uridylyltransferase isoform a',
 'probable ATP-dependent RNA helicase DDX6',
 'L-lactate dehydrogenase B chain',
 'myelin expression factor 2',
 'histone-lysine N-methyltransferase NSD3 isoform long',
 'histone-lysine N-methyltransferase NSD3 isoform short',
 'signal transducer and activator of transcription 6 isoform 1',
 'WD repeat-containing protein 91',
 'UPF0553 protein C9orf64',
 'ran-binding protein 9',
 'arfaptin-1 isoform 2',
 'tonsoku-like protein',
 'protein LSM14 homolog B',
 'oxysterol-binding protein-related protein 2 isoform 2',
 'PDZ domain-containing protein GIPC2',
 'calpain small subunit 2',
 'condensin-2 complex subunit G2',
 'telomere length regulation protein TEL2 homolog',
 'guanine nucleotide-binding protein G(olf) subunit alpha isoform 1',
 'squalene synthase',
 'cathepsin B preproprotein',
 "5'-nucleotidase domain-containing protein 2 isoform 1",
 'COBW domain-containing protein 1 isoform 2',
 'transcription elongation factor A protein-like 5',
 'DNA repair protein XRCC4 isoform 1',
 'transmembrane protein 132A isoform a precursor',
 'alpha-methylacyl-CoA racemase isoform 3',
 'trans-acting T-cell-specific transcription factor GATA-3 isoform 1',
 'uncharacterized protein C7orf43',
 'molybdopterin synthase catalytic subunit large subunit MOCS2B',
 'Golli-MBP isoform 1',
 'hepatocyte nuclear factor 3-gamma',
 '39S ribosomal protein L40, mitochondrial',
 'migration and invasion enhancer 1',
 'ubiquitin-conjugating enzyme E2 E3',
 'protein S100-A13',
 'transcription initiation factor TFIID subunit 8',
 'phosphoribosyltransferase domain-containing protein 1',
 'polyadenylate-binding protein-interacting protein 2'**
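
The inner join described for BCGENES.csv was done with dplyr in R. As a rough Python equivalent, the sketch below joins two toy tables on a shared patient identifier with pandas. The column name `patient_id`, the protein column, and all values are hypothetical and are not taken from the real files.

```python
import pandas as pd

# Hypothetical toy tables; IDs and values are made up for illustration.
proteome = pd.DataFrame(
    {
        "patient_id": ["P1", "P2", "P3"],
        "protein_a": [0.4, -1.2, 0.9],
    }
)
clinical = pd.DataFrame(
    {
        "patient_id": ["P2", "P3", "P4"],
        "PAM50 mRNA": ["Luminal A", "Basal-like", "Luminal B"],
    }
)

# An inner join keeps only the patients that appear in both tables.
merged = proteome.merge(clinical, on="patient_id", how="inner")
print(merged)
```

Patients present in only one table (P1 and P4 here) are dropped, which is why a joined dataset can have fewer rows than either input.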

### Data manipulation of Breast Cancer Proteomes in RStudio

The data used in this portion of the project was gathered from Kaggle. The original dataset includes 12553 unique genes from a total of 83 breast cancer patients. Applying data visualization techniques to all 12553 unique genes at once would be impractical.

Genetic datasets commonly have p, the number of predictors, larger than n, the number of observations. When p > n, there is no longer a unique least squares coefficient estimate, and high variance and overfitting are major concerns. Thus, simple, highly regularized approaches often become the methods of choice. To address this issue, there are many techniques (subset selection, shrinkage methods such as lasso and ridge, and dimension reduction) to exclude irrelevant variables from a regression or a dataset.
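
To make the p > n problem concrete, here is a minimal NumPy sketch on random toy data (not the proteome data). It builds a design matrix with more predictors than observations and shows two different coefficient vectors that both fit the response exactly, so least squares has no unique answer. The sizes and the random seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 8  # fewer observations than predictors (toy sizes)
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# The rank of X cannot exceed n, so X has a non-trivial null space when p > n.
rank = np.linalg.matrix_rank(X)

# lstsq returns the minimum-norm solution, one of infinitely many exact fits.
beta_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)

# The last right-singular vector lies in the null space of X (X @ v is ~0),
# so adding any multiple of it gives another exact solution.
_, _, vt = np.linalg.svd(X)
null_direction = vt[-1]
beta_other = beta_min_norm + 3.0 * null_direction

print("rank of X:", rank)
print("residual, solution 1:", np.linalg.norm(X @ beta_min_norm - y))
print("residual, solution 2:", np.linalg.norm(X @ beta_other - y))
```

Both residuals are zero up to floating-point error even though the two coefficient vectors differ, which is the sense in which the estimate is not unique.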

Using a shrinkage method known as the lasso in RStudio, we fit a model containing all 12553 predictors that constrains, or regularizes, the coefficient estimates or, equivalently, shrinks the coefficient estimates towards zero. Variables with a coefficient estimate of zero are left out of the final model. **The results of the lasso method on the 77_cancer_proteomes_CPTAC_itraq.csv data indicate 57 relevant variables that will be used in predicting tumor type and for data visualization.**
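
The lasso fit for this project was run in R and is not reproduced here. As an illustration of how the method drives coefficients to exactly zero, the sketch below implements a plain lasso regression by coordinate descent in NumPy on synthetic data. The helper names, the penalty `lam=0.5`, and the data are all assumptions for illustration; the project's own model predicts tumor type and may have been fit differently.

```python
import numpy as np

def soft_threshold(value, threshold):
    # Shrink value toward zero by threshold; return 0 if it is within the threshold.
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    # Minimizes (1 / (2n)) * ||y - X b||^2 + lam * ||b||_1 one coordinate at a time.
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n
    residual = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j.
            rho = X[:, j] @ (residual + X[:, j] * beta[j]) / n
            new_beta_j = soft_threshold(rho, lam) / col_scale[j]
            residual += X[:, j] * (beta[j] - new_beta_j)
            beta[j] = new_beta_j
    return beta

# Synthetic data with p > n: only the first three predictors matter.
rng = np.random.default_rng(1)
n, p = 40, 100
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:3] = [4.0, -3.0, 2.0]
y = X @ true_beta + 0.1 * rng.normal(size=n)

beta_hat = lasso_coordinate_descent(X, y, lam=0.5)
selected = np.flatnonzero(beta_hat)
print("number of predictors kept:", len(selected))
print("indices kept:", selected)
```

Predictors whose coefficient ends at exactly zero are dropped, which mirrors how the 12553 candidate variables were reduced to a short list.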

The 57 significant genes from a sample of breast cancer patients, along with the associated tumor types, are then exported from RStudio into the dataset BCGENES.csv. **BCGENES.csv is the main dataset which will be used in seaborn for data visualization purposes.** A variable importance plot from RStudio of the cancer data indicates the top 5 most influential genes we will use for data visualization purposes.

### Datasets used for Visualization 

Matplotlib is used on the clinical dataset, **clinical_data_breast_cancer.csv**

Seaborn is used on the dataset containing the 57 significant genes from a sample of breast cancer patients, **BCGENES.csv**

In [None]:
# Import the required libraries and load the dataset
# matplotlib is used to plot the data
import matplotlib.pyplot as plt

# numpy is used for numerical work with the data
import numpy as np

# seaborn is a statistical plotting library built on matplotlib
import seaborn as sns

# pandas is used to load and process the data
import pandas as pd

# Some of the plotting calls produce many warnings, so we suppress them
import warnings
warnings.filterwarnings("ignore")

# Render plots inline in the Jupyter notebook
%matplotlib inline

Clinical = pd.read_csv('../input/breastcancerproteomes/clinical_data_breast_cancer.csv')
Clinical.head(5)

In [None]:
# Check the data for any null values
# We choose not to replace null values with 0 or to remove those rows: in this dataset a null value means no information was recorded for the patient
# The number 0 cannot stand in for missing information,
# because 0 is a measurement while a null value is the absence of one.
Clinical.isnull().values.any()
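
The reasoning for leaving null values in place can be checked on a small example. The sketch below uses a made-up three-value series (not the clinical data) to show that pandas skips missing values when averaging, whereas filling them with 0 pulls the mean down.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; the middle record has no information.
days = pd.Series([300.0, np.nan, 900.0], name="days_to_death_toy")

missing_count = days.isnull().sum()          # how many records are missing
mean_ignoring_missing = days.mean()          # NaN is skipped by default
mean_with_zero_fill = days.fillna(0).mean()  # 0 is treated as a real measurement

print(missing_count, mean_ignoring_missing, mean_with_zero_fill)
```

Calling `isnull().sum()` on a whole DataFrame gives the same count per column, which is often more informative than a single True/False answer.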

In [None]:
# To get a list of the variables that might be of interest for data visualization 
list(Clinical)

# Proportion Of Different Tumor Types Using a Pie Chart
A pie chart is a standard method for illustrating the numerical proportions of the data. The pie chart below displays the numerical proportion of tumor types within our data.
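
The proportions that a pie chart draws can also be computed directly. This is a small sketch on a made-up list of subtype labels (not the clinical data), using `value_counts(normalize=True)` from pandas.

```python
import pandas as pd

# Hypothetical toy labels for illustration only.
subtypes = pd.Series(["Luminal A", "Luminal B", "Luminal B", "Basal-like"])

# normalize=True returns fractions of the total instead of raw counts.
proportions = subtypes.value_counts(normalize=True)
print(proportions)
```

Printing the fractions alongside the chart makes it easy to verify the percentage labels on each wedge.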

In [None]:
# The pie chart below displays the numerical proportion of tumor types that make up our data.
# Luminal_B is the most common type of tumor, followed by Luminal_A, Basal_like and HER2_enriched

plt.rcParams['font.size'] = 15.0
# Create a single figure and draw the pie chart on its axes
fig, ax = plt.subplots(figsize=(8, 8))
Clinical['PAM50 mRNA'].value_counts(sort=False).plot(kind='pie', autopct='%1.0f%%', ax=ax)
ax.set_title('Numerical Proportion of Tumor Types', fontsize=20)
fig.savefig('PIE.png')

## Barplot of Overall Survival Time in different Cancer Tumor Types

A bar chart is a graphical representation of categorical data with rectangular bars proportional to the values that they represent. The bars can be plotted vertically or horizontally. 

Source:https://en.wikipedia.org/wiki/Bar_chart

In cancer treatment, overall survival is the length of time, measured from either the date of diagnosis or the start of treatment for a disease such as cancer, that patients diagnosed with the disease are still alive. In a clinical trial, measuring overall survival is one way to see how well a new treatment works.

**Let’s use a bar chart to visually represent the average overall survival based on tumor type.**

Source:https://www.cancer.gov/publications/dictionaries/cancer-terms?cdrid=655245

In [None]:
# Before we can produce a barplot of Overall Survival Time, we need to group the breast cancer tumor types.
# Grouping by tumor type lets us compute the mean Overall Survival Time for each group of patients.
# Selecting 'OS Time' before calling mean() avoids averaging the non-numeric columns.
TumorOS = Clinical.groupby('PAM50 mRNA')[['OS Time']].mean()
TumorOS

This is consistent with current research: HER2_Enriched tumors are known to be among the more aggressive tumor types, and overall survival time for HER2_Enriched tumors has generally been considered significantly shorter than for other tumor types.

In [None]:
# The plot below indicates that HER2_Enriched tumors on average have a lower overall survival time.
ax = TumorOS.plot(kind='bar', figsize=(12, 8), grid=True, title='Comparing the average overall survival (OS) times among different tumor types')
ax.set_ylabel('Days', fontsize=15)

# Save the figure that holds the bar chart (not a new, empty figure)
ax.get_figure().savefig('BP1.png')

## Line Chart Of Average Age Of Initial Pathologic Diagnosis Of Breast Cancer

A line chart, or line graph, is a graphical representation that displays information as a series of data points connected by straight line segments.

Clinical research has established a positive correlation between age and breast cancer. Age is a risk factor for breast cancer: the older a woman is, the higher the chance that she will develop breast cancer. Rates of breast cancer are low in women under 40; less than 5 percent of breast cancer patients are younger than 40. The median age at diagnosis of breast cancer is 62.

**Below is a line graph where we can display the average age of initial pathologic diagnosis of cancer based on tumor type.**

In [None]:
# Using the line graph we can display the average age of initial pathologic diagnosis of cancer based on tumor type
# We group by tumor type and compare the average age of initial pathologic diagnosis of cancer.
# Our visualization indicates that Basal-like tumors are diagnosed at a younger age and are rarer.
diatime = Clinical.groupby('PAM50 mRNA')[['Age at Initial Pathologic Diagnosis']].mean()
ax = diatime.plot(kind='line', figsize=(13, 9), grid=True, title='Age at Initial Pathologic Diagnosis')

# Save the figure that holds the line chart (not a separate, empty figure)
ax.get_figure().savefig('LINE.png')

# Part 4: Visualization using seaborn on Breast Cancer Proteomes

In [None]:
#Loading the BCGENES.csv for data Visualization

BC = pd.read_csv('../input/bcgenes/BCGENES.csv')
BC.head(5)

In [None]:
# Check the data for any null values
BC.isnull().values.any()

In [None]:
# To get the names of the genes
list(BC)

In RStudio we identified the genes that are most influential in determining breast cancer tumor type and listed them in descending order of influence. The following code selects four of the most influential genes from the Breast Cancer Proteomes csv so that they can be compared among tumor types.

In [None]:
# We randomly select 4 of the top ten most influential proteins from the Breast Cancer Proteomes csv
# and assign them to a new dataframe (copied, so that renaming its columns does not affect BC).

dv = BC[['PAM50 mRNA', 'MICAL-like protein 1',
         'hepatocyte nuclear factor 3-gamma',
         'keratin, type I cytoskeletal 23', 'TBC1 domain family member 9']].copy()


# Re-assigning the protein names to short gene names
dv.columns = ['PAM50 mRNA', 'MICAL-1', 'FOXA3', 'KRT23',
              'TBC1D9']

## Pairplot Of The Four Most Influential Genes in Breast Cancer

We use the pairplot function from seaborn to identify where different tumors tend to cluster on a scatterplot with respect to certain gene measurements. A scatter plot shows how one variable varies with another and how the data is dispersed when used to compare groups. A scatter plot can suggest various kinds of correlations between variables.

Below is a pairplot containing scatterplots for the following genes: 'MICAL-like protein 1' (**MICAL-1**), 'hepatocyte nuclear factor 3-gamma' (**FOXA3**), 'keratin, type I cytoskeletal 23' (**KRT23**), and 'TBC1 domain family member 9' (**TBC1D9**).
Important conclusions from the pairplot are that Luminal_A and Luminal_B tumors have similar means in measurements of 'TBC1 domain family member 9', and that the means of Luminal_A, Luminal_B, and HER2_Enriched tumors are similar, with overlapping measurements. Measurements from Basal-like tumors, by contrast, tend to stand out on their own, with very little overlap with the other tumor types.


In [None]:
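# PairPlot of the four selected genes, colored by tumor type
# fill=True shades the KDE curves (it replaces the deprecated shade=True in newer seaborn versions)
pairplot = sns.pairplot(dv,
                        hue="PAM50 mRNA",
                        diag_kind="kde",
                        diag_kws=dict(fill=True),
                        plot_kws={'alpha': 0.6, 's': 80, 'edgecolor': 'k'})


pairplot.savefig("pairplot.png")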

## Heatmap Of The Four Most Influential Genes in Breast Cancer 

The heat map of the gene measurements indicates which measurements are highly correlated with one another.

An important conclusion from the heatmap is that 'hepatocyte nuclear factor 3-gamma' and 'TBC1 domain family member 9' are highly positively correlated with one another.
Further reading indicates that 'hepatocyte nuclear factor 3-gamma' regulates the transcription of a diverse group of genes and that TBC1 is a protein that has a role in regulating cell growth. **Perhaps mutations of the 'hepatocyte nuclear factor 3-gamma' gene lead to an overexpression of the TBC1 protein, causing a pro-tumor environment and the unregulated cell growth known as cancer.**

Source: https://en.wikipedia.org/wiki/Hepatocyte_nuclear_factors 

Other positively correlated predictors are 'keratin, type I cytoskeletal 23' and 'MICAL-like protein 1'.
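
Each cell of a correlation heatmap is a pairwise correlation coefficient. The sketch below uses three made-up columns (not the gene measurements) to show that `DataFrame.corr()`, which computes Pearson correlation by default, matches the textbook formula worked out by hand. The column names `gene_x`, `gene_y`, and `gene_z` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy measurements for illustration only.
toy = pd.DataFrame(
    {
        "gene_x": [0.1, 0.5, 0.9, 1.4, 2.0],
        "gene_y": [0.2, 0.4, 1.1, 1.3, 2.2],
        "gene_z": [1.0, 0.7, 0.2, -0.1, -0.8],
    }
)

# Matrix of pairwise Pearson correlations; this is what the heatmap colors encode.
corr_matrix = toy.corr()

# Pearson r by hand: covariance divided by the product of the standard deviations.
x = toy["gene_x"] - toy["gene_x"].mean()
y = toy["gene_y"] - toy["gene_y"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(corr_matrix)
print("manual r for gene_x vs gene_y:", r_manual)
```

Values near 1 mean two measurements rise together, values near -1 mean one falls as the other rises, and the diagonal is always 1.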

In [None]:
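# Heatmap of pairwise correlations between the four genes

# Drop the categorical tumor-type column so that only numeric columns are correlated
corr = dv.drop(columns='PAM50 mRNA').corr()
heat1 = sns.heatmap(corr,
                    xticklabels=corr.columns.values,
                    yticklabels=corr.columns.values)

figure = heat1.get_figure()
figure.savefig('heat1.png', dpi=400)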

## Seaborn FacetGrid of 'MICAL-like protein 1' gene  and 'hepatocyte nuclear factor 3-gamma' Gene Among Tumor Type

A seaborn FacetGrid is a plotting grid onto which a plotting function can be mapped; here we map a scatter plot onto it and color the points by tumor type.

In RStudio we identified 'MICAL-like protein 1' and 'hepatocyte nuclear factor 3-gamma' as the most influential genes in determining tumor type. We plot 'MICAL-like protein 1' against 'hepatocyte nuclear factor 3-gamma' for each tumor type on a scatter plot.

Data points from Luminal_A, Luminal_B, and HER2_Enriched tumors are similar and have overlapping measurements.

In [None]:
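# height sets the size of the facet (it replaces the older size parameter in newer seaborn versions)
SP = sns.FacetGrid(dv, hue="PAM50 mRNA", height=8) \
   .map(plt.scatter, 'MICAL-1', 'FOXA3') \
   .add_legend()

SP.savefig("SP.png")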

## Boxplot of the most Influential predictor : MICAL-like protein 1

A boxplot is a standardized method for graphically depicting groups of numeric data based on their distribution. Boxplots are extremely helpful in visualizing the minimum, first quartile, median, third quartile and maximum of different groups. The advantage of a boxplot is its ability to quickly display one or more sets of data graphically.

Source: https://en.wikipedia.org/wiki/Box_plot 
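
The five numbers a boxplot summarizes can be computed directly. The sketch below does so with NumPy on a made-up array (not the protein measurements). Note that by default seaborn draws whiskers out to 1.5 times the interquartile range and plots points beyond that individually, so the whisker ends are not always the literal minimum and maximum.

```python
import numpy as np

# Hypothetical toy values for illustration only.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

minimum = values.min()
q1, median, q3 = np.percentile(values, [25, 50, 75])
maximum = values.max()
iqr = q3 - q1  # height of the box

print(minimum, q1, median, q3, maximum, iqr)
```

The box spans the first to the third quartile, and the line inside the box marks the median rather than the mean.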
 
Below is a boxplot of the MICAL-like protein 1 gene from the four tumor types: Luminal A, Luminal B, HER2 Enriched, and Basal-like (triple negative).

MICAL proteins participate in various cellular activities: they regulate axonal growth cone repulsion, membrane trafficking, and apoptosis in cells. Mutations in genes regulating the expression of MICAL proteins may potentially have a pro-cancer effect on cells.

Based on the boxplot of the MICAL-like protein 1 gene, Basal_like tumors have the highest median of the tumor types; they are strikingly different in their expression of the MICAL-like protein 1 gene. The medians of HER2_enriched and Luminal_B tumors closely resemble each other.

Source: https://www.ncbi.nlm.nih.gov/pubmed/23834433

In [None]:
# The column was renamed to 'MICAL-1' when dv was created, so that name is used here
ax = sns.boxplot(x="PAM50 mRNA", y='MICAL-1', data=dv)
ax = sns.stripplot(x="PAM50 mRNA", y='MICAL-1', data=dv, jitter=True, edgecolor="gray")

figure = ax.get_figure()
figure.savefig('boxplot.png', dpi=400)

## Violin Plot of the most Influential predictor : MICAL-like protein 1

From the violin plots of the MICAL-like protein 1 gene, the probability density of Basal_like tumors seems platykurtic (a distribution with dispersed observations, resulting in a wider and flatter peak), whereas HER2_enriched, Luminal_B and Luminal_A seem leptokurtic or normal (data points that sit clustered, resulting in a higher and narrower peak).

![leptokurtic](https://img.tfd.com/mk/K/X2604-K-11.png) 

Source: https://en.wikipedia.org/wiki/Violin_plot
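
The terms platykurtic and leptokurtic can be checked numerically through excess kurtosis, which is negative for flatter distributions, positive for more peaked, heavier-tailed ones, and about zero for a normal distribution. The sketch below estimates it on simulated samples, not on the protein measurements. The helper `excess_kurtosis` and the choice of uniform, Laplace, and normal distributions are assumptions for illustration.

```python
import numpy as np

def excess_kurtosis(sample):
    # Fourth standardized moment minus 3, so a normal distribution scores about 0.
    centered = sample - sample.mean()
    variance = np.mean(centered ** 2)
    return np.mean(centered ** 4) / variance ** 2 - 3.0

rng = np.random.default_rng(42)
flat = rng.uniform(-1, 1, size=100_000)     # platykurtic example
peaked = rng.laplace(0, 1, size=100_000)    # leptokurtic example
bell = rng.normal(0, 1, size=100_000)       # mesokurtic (normal) example

print("uniform:", excess_kurtosis(flat))
print("laplace:", excess_kurtosis(peaked))
print("normal: ", excess_kurtosis(bell))
```

Applying the same function to each tumor type's measurements would give a number to back up the visual impression from the violin plot, although small groups give noisy estimates.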

The previous box plot indicated that the medians of HER2_enriched and Luminal_B tumors closely resemble each other; the violin plot indicates that the position and relative amplitude of the peaks are similar in Luminal_B and Luminal_A tumors. We also observe that HER2_enriched and Basal_like tumors have larger ranges than Luminal_B and Luminal_A tumors.

In [None]:
# The column was renamed to 'MICAL-1' when dv was created, so that name is used here
snsviolinplot = sns.violinplot(x="PAM50 mRNA", y='MICAL-1', data=dv)
figure = snsviolinplot.get_figure()