<img src="images/seemapldbig.png">

# Jupyter Notebooks

### &emsp;- an IDE (integrated development environment) for Python
### &emsp;- completely open source
### &emsp;- used extensively in academia and industry
### &emsp;- popular with data scientists and ML engineers
### &emsp;- oriented around "cells" (documentation, code, or output)
### &emsp;- For example, this sentence resides in an editable documentation cell.

In [None]:
# This sentence is in a comment in an editable code cell

encoded = "\x54\x68\x69\x73\x20\x49\x73\x20\x41\x6E\x20\x4F\x75\x74\x70\x75\x74\x20\x43\x65\x6C\x6C"

print(encoded)


# What's Our Goal With Survival Analysis Prediction?

<img src="images/predictor.png">

# Examples:

## &emsp;Health Outcomes

### &emsp;&emsp;-&emsp;predict patient survival time by symptoms, age, genetic expression activity, ...

## &emsp;Customer Activity

### &emsp;&emsp;-&emsp;predict when customer might cancel a subscription based on their product rating, ...

## &emsp;Machine Failure

### &emsp;&emsp;-&emsp;predict part failure from its age, temperature, lot,  ...



# Time-To-Event Is What We Want, But Realistically...

<img src="images/predictor_prob.png">
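
In practice we often do not observe every event (some patients are still alive when the study ends), so a survival model usually describes the *probability* of surviving past each point in time. As a rough illustration, here is a minimal sketch of the classic Kaplan-Meier estimate of such a curve. The function name `kaplan_meier` and the toy numbers are made up for this example; this is not how the `survival_analysis` package works internally.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability), ...] at each observed event time.

    durations: time each subject was observed
    events:    1 if the event (e.g. death) was observed, 0 if censored
    """
    # Distinct times at which at least one event actually happened
    event_times = sorted({t for t, e in zip(durations, events) if e == 1})
    survival = 1.0
    curve = []
    for t in event_times:
        # Subjects still under observation just before time t
        at_risk = sum(1 for d in durations if d >= t)
        # Events observed exactly at time t
        observed = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        survival *= 1.0 - observed / at_risk
        curve.append((t, survival))
    return curve


# Toy data (hypothetical): six subjects, two of them censored (event = 0)
toy_durations = [5, 8, 8, 12, 15, 20]
toy_events = [1, 0, 1, 1, 0, 1]

km_curve = kaplan_meier(toy_durations, toy_events)
for time, probability in km_curve:
    print(f"t={time:>2}  S(t)={probability:.3f}")
```

Censored subjects still count in the "at risk" group until they drop out, which is why they cannot simply be discarded.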

# Let's Now Train A Machine Learning Model Using A Typical Pipeline...


<img src="images/pipeline.png">

# Work-Through Example: Predicting Cancer Deaths

### -&emsp;dataset is cancer death data from 2012 called METABRIC

### -&emsp;contains about 2000 patient records with ***Gene Expression*** data

### -&emsp;goal is to train an ML model that can predict survival time of patients in a clinical trial
 


# Load The Dataset...
<img src="images/load-data.png">

In [None]:
#
# Prepare this notebook by "import"ing the python packages we will need
#

# I've pre-packaged a bunch of high-level functions for the survival analysis tutorial/workshop.
# I recommend you explore the code in this python package at a later time!
import survival_analysis

import importlib
importlib.reload(survival_analysis)

In [None]:
#
# Load the dataset
#

survival_data = survival_analysis.get_cancer_death_data()

print("dataset shape=", survival_data.shape)


# Explore The Dataset...
<img src="images/explore-data.png">

In [None]:
#
# Explore the data
#

survival_data.head(20)


# A Note About The Dataset Columns...

| Factor | Type | Description |
| --- | --- | --- |
| MKI67  | numeric | gene expression for MKI67 |
| EGFR  | numeric | gene expression for EGFR  |
| PGR  | numeric | gene expression for PGR  |
| ERBB2  | numeric | gene expression for ERBB2  |
| hormone trtmnt  | binary | patient underwent hormone treatment |
| radiothrpy  | binary | patient underwent radiotherapy |
| chemothrpy  | binary | patient underwent chemotherapy  |
| ER-pos  | binary | tumor is estrogen-receptor (ER) positive |
| age  | numeric | patient age |
| duration  | numeric | days survived |
| event  | binary | patient died during study |

### Obviously, for customer churn or machine failure prediction the columns will be different...

In [None]:
#
# Explore the "numeric" columns
#

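# Note: the column table lists the survival-time column as 'duration';
# if 'days' raises a KeyError, use 'duration' here instead.
survival_data[['MKI67', 'EGFR', 'PGR', 'ERBB2', 'age', 'days']].describe()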

# Transform The Dataset...
<img src="images/transform-data.png">

In [None]:
# Transform the dataset for our machine learning model...

train_dataset, validation_dataset, test_dataset = survival_analysis.transform_data()

print("training dataset shape=", train_dataset.shape)

print("validation dataset shape=", validation_dataset.shape)

print("test dataset shape=", test_dataset.shape)

# So Why Do We Split The Original Dataset?

### -&emsp;Training Dataset: We usually keep most of it for training the model

### -&emsp;Validation Dataset: We take a small portion to measure how well the training is going (and when to stop!)

### -&emsp;Test Dataset: We keep a "hold-out" dataset to evaluate the final model
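
To make the three-way split concrete, here is a minimal sketch using only the Python standard library. The 70/15/15 proportions, the function name `split_rows`, and the fixed seed are assumptions chosen for illustration; the actual split performed by `survival_analysis.transform_data()` may differ.

```python
import random


def split_rows(rows, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle rows and cut them into train / validation / test lists."""
    shuffled = list(rows)
    # A fixed seed makes the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # everything left over is held out
    return train, validation, test


# Stand-in for roughly 2000 patient records (just row indices here)
all_rows = list(range(2000))
train_rows, validation_rows, test_rows = split_rows(all_rows)

print("train:", len(train_rows))
print("validation:", len(validation_rows))
print("test:", len(test_rows))
```

Shuffling before cutting matters: if the records were sorted (say, by enrollment date), an unshuffled split would give the model a test set that looks systematically different from its training data.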



# Choose A Machine Learning Model...
<img src="images/choose-model.png">

In [None]:
# We will try a small neural network here called a multi-layer perceptron.
# Let's print out the "architecture" and discuss what the layers mean...

ml_model = survival_analysis.choose_multilayer_perceptron()

print("model architecture=", ml_model)
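
To give a feel for what the "layers" of a multi-layer perceptron do, here is a minimal sketch of a forward pass written in plain Python. The layer sizes (9 inputs, two hidden layers of 16, 1 output), the ReLU activation, and the random weights are all assumptions for illustration; the architecture printed by `choose_multilayer_perceptron()` may look different.

```python
import random


def dense(inputs, weights, biases):
    """Fully connected layer: output[j] = sum_i inputs[i] * weights[i][j] + biases[j]."""
    return [
        sum(x * row[j] for x, row in zip(inputs, weights)) + biases[j]
        for j in range(len(biases))
    ]


def relu(values):
    """Non-linearity: negative values become zero."""
    return [max(0.0, v) for v in values]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_in)]
    biases = [0.0] * n_out
    return weights, biases


rng = random.Random(0)
layer_sizes = [9, 16, 16, 1]  # assumed: 9 input factors -> 2 hidden layers -> 1 output
layers = [init_layer(a, b, rng) for a, b in zip(layer_sizes, layer_sizes[1:])]


def forward(features):
    """Pass one patient's features through every layer in turn."""
    activations = features
    for index, (weights, biases) in enumerate(layers):
        activations = dense(activations, weights, biases)
        if index < len(layers) - 1:  # no activation after the final layer
            activations = relu(activations)
    return activations


example_patient = [0.1] * 9  # hypothetical, already-scaled feature values
print("untrained model output:", forward(example_patient))
```

Training is the process of adjusting those weights and biases so the output becomes useful; before training, the output is essentially noise.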

# Prepare To Train The Model...

<img src="images/prepare-model.png">


In [None]:
# There are some "hyper-parameters" we will use to prepare the training.
# Let's discuss what these mean...

epochs, stop_criteria = survival_analysis.prepare_training()

print("training epochs=", epochs)
print("stop_criteria=", stop_criteria)
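
One common way these two hyper-parameters interact is "early stopping": train for at most `epochs` passes over the data, but stop sooner if the validation loss stops improving. The sketch below assumes the stop criterion is a "patience" count (the number of epochs to wait without improvement); the source does not say what `stop_criteria` actually holds, so treat the names and the loss values here as hypothetical.

```python
def find_stopping_point(validation_losses, max_epochs, patience):
    """Return (best_epoch, best_loss) under a patience-based early-stopping rule."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses[:max_epochs]):
        if loss < best_loss:
            # New best: remember it and reset the patience counter
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss has stalled; stop training
    return best_epoch, best_loss


# Made-up validation losses, one per epoch: improving at first, then drifting upward
example_losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.60, 0.62]
best_epoch, best_loss = find_stopping_point(example_losses, max_epochs=100, patience=3)
print("best epoch:", best_epoch, "best validation loss:", best_loss)
```

Stopping when the validation loss stalls helps keep the model from memorizing the training data.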

# Train The Model...

<img src="images/train-model.png">

In [None]:
# Let's go ahead and train the model and discuss the output

trained_model = survival_analysis.train_model()

# Evaluate The Model...

<img src="images/eval-model.png">

In [None]:
# Remember, we removed some of the original data for testing our final model.
# It was not used in training the model and contains "ground truth" data - that is,
# we know how long the patient survived.  We can use it to determine how well our
# model predicts survival.

# Let's look at some of that 'test data'...

test_dataset.iloc[0:10,:] 

In [None]:
# Let's focus on two patients (patient_364 and patient_577)

two_patients = test_dataset[ \
        (test_dataset['ID']=='patient_364') | (test_dataset['ID']=='patient_577') ]

two_patients

In [None]:
# Let's look at the predicted survival probability curve for these two patients and discuss

survival_analysis.predict( two_patients )

In [None]:
# We can compute a quantitative metric, called the concordance index, on the entire
# "held-out" test set to get the overall quality of the predictive model.

model_quality = survival_analysis.predict_model_concordance_index()

print("The model quality (concordance index) is", model_quality, "out of 100: roughly the percentage of comparable patient pairs the model ranks in the correct order.")
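
For intuition about what the concordance index measures, here is a minimal sketch of Harrell's concordance index in plain Python, returned on a 0 to 1 scale. It is an illustration with made-up numbers, not the implementation behind `predict_model_concordance_index()`, which appears to report a value out of 100.

```python
def concordance_index(durations, events, risk_scores):
    """Fraction of comparable pairs that the model ranks in the correct order.

    A pair (i, j) is comparable when subject i had an observed event
    strictly before subject j's last observed time. The pair is concordant
    when the model gave subject i the higher risk score. Ties count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored subject cannot be the "earlier event" in a pair
        for j in range(n):
            if durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable


# Toy data (hypothetical): four subjects, the second one censored
toy_durations = [5, 8, 12, 20]
toy_events = [1, 0, 1, 1]
toy_risk = [0.9, 0.4, 0.1, 0.6]  # higher score = model expects an earlier event

c_index = concordance_index(toy_durations, toy_events, toy_risk)
print("concordance index:", c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the index measures how well the model *orders* patients by risk rather than how close its predicted survival times are.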

# TODO: Let's review

# TODO: Other Models

# TODO: Discuss The Kaggle Challenge