# Healthcare ML Project: Exploratory Data Analysis & Data Subsetting

In this notebook, we will accomplish the following:

* Load clinical data as a DataFrame `clinical`, which is required to define the class label `clinical['Label'] : bool`
* Load molecular data as an m×n DataFrame `genes`, then select the appropriate rows (patients) so that `genes` matches the `clinical` data.

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## SageMaker Resources

The downloaded data has been uploaded to the default S3 bucket.

In [2]:
import boto3
import sagemaker
from sagemaker import get_execution_role

In [3]:
sagemaker_session = sagemaker.Session()
sagemaker_session

<sagemaker.session.Session at 0x7f983a68bf28>

In [4]:
role = sagemaker.get_execution_role()
role

'arn:aws:iam::906713186745:role/service-role/AmazonSageMaker-ExecutionRole-20210419T071573'

In [5]:
bucket = sagemaker_session.default_bucket()
bucket

'sagemaker-us-east-1-906713186745'

In [6]:
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    print(obj.key)

assets/
assets/data/brca_tcga_clinical_data.tsv
assets/data/data_RNA_Seq_v2_mRNA_median_all_sample_Zscores.txt
assets/data/meta_RNA_Seq_v2_mRNA_median_all_sample_Zscores.txt
sagemaker-scikit-learn-2021-04-22-02-22-29-590/debug-output/training_job_end.ts
sagemaker-scikit-learn-2021-04-22-02-22-29-590/output/model.tar.gz
sagemaker-scikit-learn-2021-04-22-02-22-29-590/profiler-output/framework/training_job_end.ts
sagemaker-scikit-learn-2021-04-22-02-22-29-590/profiler-output/system/incremental/2021042202/1619058240.algo-1.json
sagemaker-scikit-learn-2021-04-22-02-22-29-590/profiler-output/system/incremental/2021042202/1619058300.algo-1.json
sagemaker-scikit-learn-2021-04-22-02-22-29-590/profiler-output/system/training_job_end.ts
sagemaker-scikit-learn-2021-04-22-02-22-29-590/rule-output/ProfilerReport-1619058149/profiler-output/profiler-report.html
sagemaker-scikit-learn-2021-04-22-02-22-29-590/rule-output/ProfilerReport-1619058149/profiler-output/profiler-report.ipynb
sagemaker-scikit-le

### Load Data from S3

In [7]:
def get_s3_uri(prefix, bucket=bucket):
    return "s3://{}/{}".format(bucket, prefix)
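
The helper above takes its default `bucket` from the notebook variable at the moment the function is defined. As a minimal sketch, a variant with the bucket passed explicitly behaves the same way without depending on notebook state. The name `build_s3_uri` is hypothetical, and the bucket string is copied from the `bucket` cell output above.

```python
def build_s3_uri(prefix: str, bucket: str) -> str:
    """Build an s3:// URI from a bucket name and an object key."""
    return f"s3://{bucket}/{prefix}"


# Bucket name copied from the notebook output; replace it with your own bucket.
example_bucket = "sagemaker-us-east-1-906713186745"
uri = build_s3_uri("assets/data/brca_tcga_clinical_data.tsv", example_bucket)
print(uri)
```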

In [8]:
path_input = get_s3_uri("assets/data/data_RNA_Seq_v2_mRNA_median_all_sample_Zscores.txt")
path_annot = get_s3_uri("assets/data/brca_tcga_clinical_data.tsv")

In [9]:
genes = pd.read_csv(path_input, delimiter="\t")
genes.shape

(20531, 1102)

In [10]:
clinical = pd.read_csv(path_annot, delimiter="\t")
# sorted(clinical.columns.to_list())
clinical.shape

(1108, 140)

## Wrangle the Clinical Information

In [25]:
COLUMNS_CLIN = [
    'Patient ID', 'Sample ID', #ids
    'Sample Type','Fraction Genome Altered', #tumor info
    'Diagnosis Age','Sex','Race Category','Ethnicity Category', #patient info
    'Informed consent verified', #ethics
    'ER Status By IHC','PR status by ihc','IHC-HER2' #required for CLASS LABEL
]

clinical = clinical[COLUMNS_CLIN]
clinical.head()

|   | Patient ID | Sample ID | Sample Type | Fraction Genome Altered | Diagnosis Age | Sex | Race Category | Ethnicity Category | Informed consent verified | ER Status By IHC | PR status by ihc | IHC-HER2 |
|---|------------|-----------|-------------|-------------------------|---------------|-----|---------------|--------------------|---------------------------|------------------|------------------|----------|
| 0 | TCGA-3C-AAAU | TCGA-3C-AAAU-01 | Primary | 0.7787 | 55.0 | Female | WHITE | NOT HISPANIC OR LATINO | YES | Positive | Positive | Negative |
| 1 | TCGA-3C-AALI | TCGA-3C-AALI-01 | Primary | 0.7164 | 50.0 | Female | BLACK OR AFRICAN AMERICAN | NOT HISPANIC OR LATINO | YES | Positive | Positive | Positive |
| 2 | TCGA-3C-AALJ | TCGA-3C-AALJ-01 | Primary | 0.534 | 62.0 | Female | BLACK OR AFRICAN AMERICAN | NOT HISPANIC OR LATINO | YES | Positive | Positive | Indeterminate |
| 3 | TCGA-3C-AALK | TCGA-3C-AALK-01 | Primary | 0.0764 | 52.0 | Female | BLACK OR AFRICAN AMERICAN | NOT HISPANIC OR LATINO | YES | Positive | Positive | Positive |
| 4 | TCGA-4H-AAAK | TCGA-4H-AAAK-01 | Primary | 0.2364 | 50.0 | Female | WHITE | NOT HISPANIC OR LATINO | YES | Positive | Positive | Equivocal |


In [13]:
# Create “Table 1” summary statistics for a patient population
!pip install tableone



In [26]:
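## Focus on primary tumors from female patients only
## .copy() makes the filtered frame independent, so adding a column later does not trigger SettingWithCopyWarning
clinical = clinical[(clinical["Sex"]=="Female") & (clinical["Sample Type"]=="Primary")].copy()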

In [27]:
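## Define triple negative: True only when ER, PR and HER2 are all 'Negative'
## (missing, 'Indeterminate' or 'Equivocal' values compare unequal and therefore give False)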
clinical['Label'] = (clinical['ER Status By IHC']=='Negative') & \
                    (clinical['PR status by ihc']=='Negative') & \
                    (clinical['IHC-HER2']=='Negative')
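
The label rule above treats any value other than `'Negative'`, including a missing value, as not triple negative. The following sketch illustrates that behaviour on toy data (four made-up patients, not TCGA records). The `any_missing` flag is an addition of this sketch, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy data: four hypothetical patients, not taken from the TCGA file.
toy = pd.DataFrame({
    "ER Status By IHC": ["Negative", "Negative", "Positive", "Negative"],
    "PR status by ihc": ["Negative", "Negative", "Negative", np.nan],
    "IHC-HER2":         ["Negative", "Equivocal", "Negative", "Negative"],
})

ihc_cols = ["ER Status By IHC", "PR status by ihc", "IHC-HER2"]

# Same rule as the notebook cell: True only if all three markers are 'Negative'.
toy["Label"] = (toy[ihc_cols] == "Negative").all(axis=1)

# Extra flag (an assumption of this sketch): rows where at least one marker is missing.
toy["any_missing"] = toy[ihc_cols].isna().any(axis=1)

print(toy[["Label", "any_missing"]])
```

Rows flagged by `any_missing` end up in the `False` class even though their true status is unknown, which may be worth keeping in mind when interpreting class counts.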

### Make a summary table stratified by _Class Label=1 (True)_ vs. _Class Label=0 (False)_

In [32]:
from tableone import TableOne

myTable1 = TableOne(
    clinical,
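    columns=COLUMNS_CLIN[3:9],      # 'Fraction Genome Altered' through 'Informed consent verified'
    categorical=COLUMNS_CLIN[5:9],  # 'Sex' through 'Informed consent verified'
    groupby='Label'                 # stratify by triple-negative status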
)
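
As a rough cross-check of a stratified summary like this one, the same kind of statistics can be computed with plain pandas. This is a sketch on made-up toy data, not a replacement for `TableOne`, and the numbers below are unrelated to the real cohort.

```python
import pandas as pd

# Toy data: six hypothetical patients.
toy = pd.DataFrame({
    "Diagnosis Age": [55.0, 50.0, 62.0, 52.0, 48.0, 70.0],
    "Race Category": ["WHITE", "ASIAN", "WHITE", "WHITE", "ASIAN", "WHITE"],
    "Label": [False, False, False, False, True, True],
})

# Continuous variable: count, mean and standard deviation per class.
age_summary = toy.groupby("Label")["Diagnosis Age"].agg(["count", "mean", "std"])

# Categorical variable: number of patients per category and class.
race_counts = pd.crosstab(toy["Race Category"], toy["Label"])

print(age_summary.round(1))
print(race_counts)
```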

In [33]:
# Column 'True' means Class=1, i.e. the triple negative tumors
print(myTable1.tabulate(tablefmt="github"))

|                                    |                                  | Missing   | Overall      | False       | True        |
|------------------------------------|----------------------------------|-----------|--------------|-------------|-------------|
| n                                  |                                  |           | 1085         | 969         | 116         |
| Fraction Genome Altered, mean (SD) |                                  | 18        | 0.3 (0.2)    | 0.3 (0.2)   | 0.4 (0.2)   |
| Diagnosis Age, mean (SD)           |                                  | 1         | 58.4 (13.2)  | 58.9 (13.3) | 54.5 (12.0) |
| Sex, n (%)                         | Female                           | 0         | 1085 (100.0) | 969 (100.0) | 116 (100.0) |
| Race Category, n (%)               | AMERICAN INDIAN OR ALASKA NATIVE | 94        | 1 (0.1)      | 1 (0.1)     |             |
|                                    | ASIAN                            |           | 61 (6.2)   

### Important: 

Even though the data is public, I verified that 100% of the data points have informed consent, a basic ethical principle in healthcare research.
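
One way to make such a consent check explicit is to assert it in code. The sketch below uses a toy DataFrame with hypothetical patient IDs. The column name and the value `'YES'` follow the `clinical.head()` output shown earlier.

```python
import pandas as pd

# Toy data: three hypothetical patients.
toy = pd.DataFrame({
    "Patient ID": ["P1", "P2", "P3"],
    "Informed consent verified": ["YES", "YES", "YES"],
})

consent = toy["Informed consent verified"]

# Fraction of records with verified consent; missing values count as not verified.
frac_consented = (consent == "YES").mean()

print(consent.value_counts(dropna=False))
assert frac_consented == 1.0, "some records lack verified informed consent"
```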

## Subset the mRNA data

* Ensure all rows in the dataframe `genes` have a matching record in `clinical` (with class labels)
* 
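
The matching step can be sketched on toy data as follows. This sketch assumes that the raw expression file stores genes as rows and samples as columns (consistent with its shape of 20531 × 1102), that it has an identifier column named `Hugo_Symbol`, and that the sample columns use the same identifiers as `Sample ID` in `clinical`. None of these assumptions is confirmed by the cells above, and all names and values below are made up.

```python
import pandas as pd

# Toy stand-in for the expression file: genes as rows, samples as columns (assumed layout).
genes_raw = pd.DataFrame({
    "Hugo_Symbol": ["GENE_A", "GENE_B"],  # assumed identifier column
    "S1-01": [0.5, -1.2],
    "S2-01": [1.1, 0.3],
    "S9-01": [-0.4, 2.0],                 # sample without a clinical record
})

# Toy stand-in for the labelled clinical table.
clinical_toy = pd.DataFrame({
    "Sample ID": ["S1-01", "S2-01", "S3-01"],  # S3-01 has no expression data
    "Label": [True, False, False],
}).set_index("Sample ID")

# Transpose so that each row is a sample (patient) and each column is a gene.
genes_t = genes_raw.set_index("Hugo_Symbol").T

# Keep only samples present in both tables, in the same order.
shared = genes_t.index.intersection(clinical_toy.index)
genes_sub = genes_t.loc[shared]
labels = clinical_toy.loc[shared, "Label"]

print(genes_sub.shape, labels.tolist())
```

Indexing both tables by the shared sample identifiers keeps the feature rows and the labels aligned, so they can later be passed to a model together.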