# Machine learning for medicine (MedML@Emory) Workshop


## License
    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <https://www.gnu.org/licenses/>.

## Training/Testing Dataset - The Basics of Inference
Authors: Vineet Tiruvadi, Avinash Murugan, Alex Milani

## Overview
Machine learning (ML) is all about patterns, but how can we be sure the patterns we find in our data are *really there* in the thing we're trying to study?
One way is to do another experiment and see if the pattern holds.
But what if we could do this without having to run another experiment?

In this notebook we introduce the idea of 'training and testing' sets.
By leaving out a part of the data you collect, you can learn *and* validate complex patterns in your data.
This simple process becomes a major strength of ML by directly addressing how "generalizable" your results are to broader populations and/or how well your results track with the truth.

Outline:

* Code imports
* Background
  * Standard Experiments
  * ML and testing/training
* Testing/training Example
* What's "Big" about Big Data

### Code Imports



#### Standard Imports

In [0]:
# The big library for doing math + data in python
import numpy as np

# A big library that has a lot of useful functions for scientific use of python
import scipy

# The main library for plotting
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("white")
matplotlib.rcParams['figure.figsize'] = [20, 15]

# The main library used for statistics
import scipy.stats as stats

# Libraries that let us use interactive widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
from ipywidgets import HBox, Layout, VBox

# Misc stuff related to cleaning up code and displaying results in a pretty way
#from example_systems import *
from IPython.display import Markdown as md
from IPython.display import display

import logging
logging.getLogger().setLevel(logging.CRITICAL)

#### ML Related Imports

In [0]:
# Our machine learning libraries
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

#### General purpose functions used in the notebook

In [0]:
#### General Functions
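def gen_data(slope=2.34, noise=10, trial=1):
    """Simulate 200 noisy points along a line through the origin.

    `trial` is accepted for symmetry with gen_data_split but is unused here.
    """
    X = np.random.uniform(-10, 10, size=(200, 1))
    Y = slope * X + np.random.normal(0, noise, size=X.shape)

    return X, Y

def gen_data_split(slope=2.34, noise=10, trial=1, train_ratio=0.1):
    """Simulate data and split it; `trial` seeds the train/test split."""
    X, Y = gen_data(slope=slope, noise=noise, trial=trial)
    Xtr, Xte, Ytr, Yte = train_test_split(X, Y, test_size=1 - train_ratio, random_state=trial)

    return {'Xtr': Xtr, 'Xte': Xte, 'Ytr': Ytr, 'Yte': Yte, 'slope': slope, 'Xb': X}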

def gen_lin_Y(dataset,noise,train_ratio,trial,show_truth,show_tr,show_te,show_model,show_pred):
    # Do our linear regression analysis
    # Generate a fresh dataset when the caller passes the placeholder 0
    if dataset == 0:
        dataset = gen_data_split(noise=noise, train_ratio=train_ratio, trial=trial)

    X_basis = dataset['Xb']

    Xtr = dataset['Xtr']
    Ytr = dataset['Ytr']

    Xte = dataset['Xte']
    Yte = dataset['Yte']

    m = dataset['slope']

    # Fit a line through the origin using only the training data
    reg = LinearRegression(fit_intercept=False).fit(Xtr, Ytr)
    # Predict the held-out test points with the fitted model
    Ypred = reg.predict(Xte)

    slope_estimate = round(reg.coef_[0,0],4)
    if show_tr: plt.scatter(Xtr,Ytr,s=30,alpha=0.7,label='Training',color='blue')
    if show_te: plt.scatter(Xte,Yte,s=70,alpha=0.9,color='red',label='Test')
    if show_pred: plt.scatter(Xte,Ypred,s=90, facecolors='none', edgecolors='r',alpha=0.7,label='Prediction')
    if show_truth: plt.plot(X_basis,m*X_basis,'--',alpha=0.9,label='Truth',color='green')
    if show_model:
        plt.plot(X_basis, slope_estimate*X_basis, alpha=0.3, linewidth=6, color='green', label='Model')
        plt.text(0, -40, 'Estimated slope: ' + str(slope_estimate), color='green')
    
    plt.legend()
    plt.ylim((-50,50))
    plt.xlim((-15,15))
    plt.title('True slope = ' + str(m))
    sns.despine()




## Background

### Inference
The core goal in everything we're doing in science is *inference*.
This means that our goal is to build a mental understanding (model) of how something we care about works.
If we care about cardiac physiology, we need to understand how blood volume, stroke volume, and the coronary arteries all interact.

We build this understanding today largely by using *data*.
While data is just one tool in a larger toolset, data is the one that keeps us most grounded in reality.

The thing *generating* our data is typically the thing we're interested in understanding: inference is the process of understanding that thing *through* the data we collect from it.


### Experiments and data
In a typical experiment you're trying to study something about a *population*.
You then design careful controls and measurements, and estimate how many 'samples' you need to collect to confidently say there's an effect.
Then, and only then, do you recruit your sample, collect your data, and analyse *all* of it.
The statistics you run on your data try to tell you how confident you can be that what you found in your sample is what's happening in the population.
Here, you're *implicitly* making conclusions about how well your results reflect the population, or **generalize**.


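A p-value tells us how surprising our data would be if there were *actually no connection*: it is the probability of seeing a relationship at least as strong as the one we observed purely by chance.
A small p-value means chance alone is an unlikely explanation, but it is not the probability that our conclusion is wrong.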
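
As a rough illustration of a p-value, the sketch below uses `scipy.stats.linregress` on two simulated datasets: one where `y` really depends on `x`, and one where it does not. The seed, sample size, slope, and noise level are assumptions chosen to mirror the simulated data used elsewhere in this notebook, not real measurements.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical data with a real linear relationship (slope 2.34) plus noise
x = rng.uniform(-10, 10, size=200)
y_related = 2.34 * x + rng.normal(0, 10, size=x.shape)

# Hypothetical data where y has no connection to x at all
y_unrelated = rng.normal(0, 10, size=x.shape)

# linregress reports the fitted slope and the p-value for "slope == 0"
res_related = stats.linregress(x, y_related)
res_unrelated = stats.linregress(x, y_unrelated)

print(f"related:   slope = {res_related.slope:.2f}, p = {res_related.pvalue:.3g}")
print(f"unrelated: slope = {res_unrelated.slope:.2f}, p = {res_unrelated.pvalue:.3g}")
```

With a real relationship the p-value should be very small; for the unrelated data it can land anywhere between 0 and 1.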

### The ML approach
ML is all about finding patterns, but it doesn't care if they're simple or complex.
ML uses modern computing power to 'find' patterns that we can't easily 'see', like patterns across many, many variables.

ML typically relies on a very different approach than the experimental approach.

![picture](https://vineet.tiruvadi.com/tetr.png)

In ML, you collect your data and try to understand the limitations in the data you collected.
You can then **split your dataset** into a 'training' set and a 'testing' set, with zero overlapping datapoints.
The 'training' set is where you do your 'learning' -> let the algorithm of your choice find the patterns (or Model M) that it thinks are in your data.
We then take that model and see how well it 'holds' in the rest of our data.
This then gives us an idea of how the model will hold in the *other* datasets from the full population that we weren't able to observe.
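
The split-train-test steps described above can be sketched in a few lines. This is a minimal example on simulated data, assuming a 50/50 split and a line through the origin; the seed and variable names are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable

# Simulated data: a true slope of 2.34 plus measurement noise
X = rng.uniform(-10, 10, size=(200, 1))
Y = 2.34 * X + rng.normal(0, 10, size=X.shape)

# Split into non-overlapping training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.5, random_state=1
)

# Learn the model on the training set only
model = LinearRegression(fit_intercept=False).fit(X_train, Y_train)

# Check how well the learned model holds in the data it has never seen
train_r2 = model.score(X_train, Y_train)
test_r2 = model.score(X_test, Y_test)

print(f"estimated slope: {model.coef_[0, 0]:.2f}")
print(f"R^2 on training set: {train_r2:.2f}")
print(f"R^2 on testing set:  {test_r2:.2f}")
```

The testing-set score is the number that speaks to generalization, because the model never saw those points while learning.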

In essence, this is like a 'simulated' experiment.
By the simple act of *not using all your data*, you set aside a second dataset that was collected in *exactly the same way as the one you analyse*.
This simple act turns out to be a powerful approach and the foundation of more sophisticated 'statistical validation'.

As a contrast, the standard 'p-value' approach uses all the data, finds a pattern (and it always finds one), and then tries to estimate, from the variance *inside* the very data it just used, how well that pattern might reflect the population.
This makes the experiment itself, and a large sample size, critical since we're only indirectly commenting on generalizability.
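
To see why an in-sample fit alone can mislead, the sketch below fits a flexible model to pure noise, where no real pattern exists. The degree-12 polynomial, the seed, and the sample size are assumptions picked for illustration; they are not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)  # fixed seed so the sketch is repeatable

# Pure noise: Y has no relationship to X whatsoever
X = rng.uniform(-1, 1, size=(60, 1))
Y = rng.normal(0, 1, size=60)

# A flexible model (degree-12 polynomial) that can bend to fit almost anything
X_poly = PolynomialFeatures(degree=12, include_bias=False).fit_transform(X)

X_train, X_test, Y_train, Y_test = train_test_split(
    X_poly, Y, test_size=0.5, random_state=2
)
model = LinearRegression().fit(X_train, Y_train)

# R^2 on the data used for fitting vs. on the held-out data
in_sample_r2 = model.score(X_train, Y_train)
held_out_r2 = model.score(X_test, Y_test)

print(f"in-sample R^2: {in_sample_r2:.2f}")
print(f"held-out R^2:  {held_out_r2:.2f}")
```

The in-sample score looks like a pattern was found, while the held-out score, which is typically near or below zero here, shows that the pattern does not carry over to unseen data.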


### What is our goal?

Inference -> we want to make a *model* that matches the *truth*.
Data is a very powerful way of doing this.


### How much data do we need?
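How much data we need depends on how noisy our measurements are: the noisier the data, the more samples we need to pin down the same pattern.
This is similar to doing a power analysis in evidence-based medicine (EBM).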
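
A small simulation can make the link between noise and sample size concrete. The sketch below repeats a simulated experiment many times and reports how much the fitted slope varies; the helper name `slope_spread`, the true slope, and the noise levels are hypothetical choices for illustration.

```python
import numpy as np

def slope_spread(n_samples, noise, true_slope=2.34, n_repeats=500, seed=0):
    """Standard deviation of the fitted slope across repeated simulated experiments."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_repeats):
        x = rng.uniform(-10, 10, size=n_samples)
        y = true_slope * x + rng.normal(0, noise, size=n_samples)
        # Least-squares slope for a line through the origin
        estimates.append(np.sum(x * y) / np.sum(x * x))
    return float(np.std(estimates))

for noise in (1, 10):
    for n in (20, 200):
        print(f"noise={noise:>2}, n={n:>3}: slope spread = {slope_spread(n, noise):.3f}")
```

More noise widens the spread of the estimate, and more samples narrow it again, which is the same trade-off a power analysis tries to quantify before an experiment.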

### Contrast 'experiment heavy' vs 'analysis heavy'
What we would do in a standard approach...
Then we'd publish the pattern we found

Instead, in ML...

## Example
We're going to deal with a concrete example: the relationship between measured hemoglobin A1c and the lifetime risk of cardiovascular disease.

### Orientation to figure
First, let's get oriented to how we're going to be displaying our data.


In [0]:
### BLOCK LABEL - Splitting our data into training-testing
X, Y = gen_data()
train_size=0.8
X_training,X_testing,Y_training,Y_testing = train_test_split(X,Y,test_size=1-train_size)

In [0]:
### BLOCK LABEL - How to read a regression

dataset = gen_data_split()
#First we'll fix the data and just work with building up the plot
#w = interactive(gen_lin_Y,noise=(0.0,10.0,0.1),train_ratio=(0.05,0.95,0.05),trial=(1,1000,1),show_truth=True,show_tr=False,show_te=False,show_model=False,show_pred=False)
w = interactive(gen_lin_Y,dataset=fixed(dataset),noise=fixed(5.0),train_ratio=fixed(0.50),trial=fixed(1),show_truth=True,show_tr=False,show_te=False,show_model=False,show_pred=False)
#display(w)
controls = HBox(w.children[:-1], layout = Layout(flex_flow='row wrap'))
output = w.children[-1]
w_vb = VBox([controls, output])
display(w)


In [0]:
### BLOCK LABEL - The effect of train-test ratios
w = interactive(gen_lin_Y,dataset=fixed(0),noise=(0.0,50.0,0.1),train_ratio=(0.01,0.99,0.01),trial=(1,1000,1),show_truth=fixed(True),show_tr=fixed(True),show_te=False,show_model=fixed(True),show_pred=fixed(True))
display(w)


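The 'noise' slider controls how far our measurements stray from the truth.
The 'train_ratio' slider sets the fraction of our data used for training; the rest is held out for testing.
The 'trial' slider seeds the random train/test split, so each trial value picks a different set of training points.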
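
For readers without the interactive widgets, the effect of the train ratio can also be summarized numerically. This sketch fixes one simulated dataset and re-splits it many times, as the trial control does; the helper name `summarize` and the specific ratios are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
X = rng.uniform(-10, 10, size=(200, 1))
Y = 2.34 * X + rng.normal(0, 10, size=X.shape)

def summarize(train_ratio, n_trials=200):
    """Spread of the fitted slope and of the test R^2 across random splits."""
    slopes, test_scores = [], []
    for trial in range(n_trials):
        Xtr, Xte, Ytr, Yte = train_test_split(
            X, Y, train_size=train_ratio, random_state=trial
        )
        reg = LinearRegression(fit_intercept=False).fit(Xtr, Ytr)
        slopes.append(reg.coef_[0, 0])
        test_scores.append(reg.score(Xte, Yte))
    return float(np.std(slopes)), float(np.std(test_scores))

for ratio in (0.05, 0.5, 0.95):
    slope_sd, score_sd = summarize(ratio)
    print(f"train_ratio={ratio:.2f}: slope spread = {slope_sd:.3f}, "
          f"test R^2 spread = {score_sd:.3f}")
```

A very small training set makes the slope estimate unstable, while a very small testing set tends to make the test score itself unreliable, so the ratio is a trade-off rather than a setting with one right answer.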

### How does this help us in inference?
Here we're directly seeing how well our results generalize by testing them on the part of our data that we haven't looked at yet.
It's "like" a second experiment.

At the end of the day, the goal is to *infer* what's happening in our patients *through* the data we collect from them.

Train-test splits are just one tool for inference.
Many others exist.

All the different ways of doing machine learning rely on test-train in some way.
The page for [scikit-learn](https://scikit-learn.org/stable) is a great place to start to learn about other algorithms and more advanced approaches to testing statistical significance based on 'train-test' splits.




## Parting Thoughts
In summary, the testing/training split is a powerful approach to testing inference.
It lets us learn a complex pattern in half the data and assess how accurate this pattern, or *model*, is in the held-out set of observations, acting almost like a 'reproducibility' experiment.
Obviously this comes at a cost: your data!
You need a lot of *samples* to be able to do meaningful inference on half of it.

In future notebooks we'll cover more advanced topics in testing/training.
For a preview, here are some external readings [cross-validation]() and [model selection]().

Copyright 2020 MedML