# Machine learning for medicine
## Training/Testing Dataset - The Basics of Inference

## Overview
In this notebook we introduce the idea of 'training and testing' sets.
By leaving out a part of the data you collect, you can learn *and* validate complex patterns in your data.
This simple process becomes a major strength of ML by directly addressing how "generalizable" your results are to broader populations and/or how well your results track with the truth.

## Background
ML is all about finding patterns in our data.
When we have a carefully controlled and measured experiment there are some pretty simple patterns that our data could take.
But, in the clinical world, there are a lot of uncontrolled variables and messy measurements - patterns can show up that don't actually have anything to do with what you're studying.
Experiments help isolate variables so we can see if they relate to each other, but experiments are tough in clinical settings.

One way ML takes this into account is with the idea of "training" sets and "testing" sets.
Basically, you take the whole dataset you collected, split it into two sets (not necessarily equal in size), then learn the pattern in the "training" set and see how well that pattern shows up in your "testing" set.

### Code Imports

In [41]:
# The big library for doing math + data in python
import numpy as np

# A big library that has a lot of useful functions for scientific use of python
import scipy

# The main library for plotting
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("white")
matplotlib.rcParams['figure.figsize'] = [15, 10]

# The main library used for statistics
import scipy.stats as stats

# Libraries that let us use interactive widgets
from ipywidgets import interact, interactive, fixed, interact_manual, HBox, Layout, VBox
import ipywidgets as widgets

# Misc stuff related to cleaning up code and displaying results in a pretty way
from example_systems import *
from IPython.display import Markdown as md

import logging
logging.getLogger().setLevel(logging.CRITICAL)

In [42]:
# Our machine learning libraries
from sklearn.linear_model import LinearRegression

from sklearn.model_selection import train_test_split

## Experiments and data
In a typical experiment you're trying to study something about a *population*.
You then design careful controls and measurements, and estimate how many 'samples' you need to collect to be able to confidently say there's an effect.
Then, and only then, do you recruit your sample, collect your data, and analyse *all* of it.
The statistics you run in your analysis tell you how confident you can be that what you found in your sample is what's happening in the population.
Here, you're *implicitly* making conclusions about how well your results reflect the population, or **generalize**.
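
As a rough illustration of this classical workflow, the sketch below draws one sample from a simulated population and uses every sample point to compute a 95% confidence interval for the population mean. The population parameters, the sample size, and all variable names are made up for illustration.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)

# Hypothetical population: 100,000 simulated measurements (arbitrary units)
population = rng.normal(loc=6.0, scale=1.2, size=100_000)

# Recruit one sample and analyse *all* of it
sample = rng.choice(population, size=50, replace=False)
sample_mean = sample.mean()
sample_sem = stats.sem(sample)  # standard error of the mean

# 95% t-based confidence interval for the population mean
ci_low, ci_high = stats.t.interval(0.95, len(sample) - 1, loc=sample_mean, scale=sample_sem)

print(f"Sample mean: {sample_mean:.2f}")
print(f"95% CI for the population mean: ({ci_low:.2f}, {ci_high:.2f})")
print(f"Population mean (known only because we simulated it): {population.mean():.2f}")
```

The interval is a statement about the population made from the sample alone; nothing is held back to check it.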



## Training vs Testing Set
ML typically relies on a very different approach.
In ML, you collect your data and try to understand the limitations in the data you collected.
You can then **split your dataset** into a 'training' set and a 'testing' set, with zero overlapping datapoints.
The 'training' set is where you do your 'learning': you let the algorithm of your choice find the patterns that it thinks are in your data.
The 'testing' set is where you take the pattern you just learned and see how well it holds.
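
A minimal sketch of this split-then-evaluate loop, assuming scikit-learn's `train_test_split` and `LinearRegression` (the same tools imported above). The synthetic data, the slope of 2.4, and the 80/20 split are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic dataset: a noisy linear relationship (illustrative values)
X = rng.uniform(-10, 10, size=(100, 1))
Y = 2.4 * X + rng.normal(0, 5, size=X.shape)

# Split the row indices so we can confirm that no datapoint lands in both sets
idx = np.arange(len(X))
idx_train, idx_test = train_test_split(idx, test_size=0.2, random_state=0)
assert set(idx_train).isdisjoint(idx_test)

# Learn the pattern on the training set only
model = LinearRegression().fit(X[idx_train], Y[idx_train])

# See how well the learned pattern holds on data the model never saw
r2_train = model.score(X[idx_train], Y[idx_train])
r2_test = model.score(X[idx_test], Y[idx_test])
print(f"Training R^2: {r2_train:.2f}, testing R^2: {r2_test:.2f}")
```

Splitting the indices rather than the arrays is just a convenience here: it makes the "zero overlap" property easy to verify.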

It's basically a 'simulated' experiment where your full dataset plays the role of the 'population' and the training set is a subsample of it.
This lets you *explicitly* test how well your results hold in the rest of the data; in other words, you're explicitly testing the **generalization**.
This simple concept addresses a common criticism of 'pattern matching': if you analyse your data as is, you'll almost always find patterns, whether or not they reflect anything real.
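
To make this criticism concrete, here is a small sketch in which the inputs and the outcome are pure random noise, so there is no real pattern to find. The sample sizes and the use of `LinearRegression` are assumptions chosen for illustration. Because the model has more adjustable coefficients than training points, it reproduces the training outcomes almost exactly, yet the held-out testing set shows that the 'pattern' does not carry over.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# 60 'patients', 40 unrelated measurements each, and an outcome that is pure noise
X = rng.normal(size=(60, 40))
y = rng.normal(size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

model = LinearRegression().fit(X_train, y_train)

r2_train = model.score(X_train, y_train)  # looks like a strong pattern
r2_test = model.score(X_test, y_test)     # checked on data the model never saw

print(f"Training R^2: {r2_train:.2f}")
print(f"Testing R^2:  {r2_test:.2f}")
```

Without the testing set, the near-perfect training fit would be easy to mistake for a real finding.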

## COMMENT

subplot(211) average A1c of a population - to see if certain populations have diabetes more often
subplot(212) training/test block diagram!!

In [43]:
slope_widg = widgets.FloatSlider(description='Relationship:')
display(slope_widg)


### Interactive Example of Training and Testing

In [73]:
# Inputs are drawn once so that only the noise and the split change between runs
X = np.random.uniform(-10, 10, size=(100, 1))

def gen_data(noise=10, train_ratio=0.1, trial=1):
    m = 2.4  # true slope of the simulated relationship
    # Seed the noise with the trial number so redrawing the plot keeps the same data
    rng = np.random.default_rng(trial)
    Y = m * X + rng.normal(0, noise, size=X.shape)

    # train_ratio is the fraction of samples used for training; the rest are held out
    Xtr, Xte, Ytr, Yte = train_test_split(X, Y, test_size=1 - train_ratio, random_state=trial)

    return Xtr, Xte, Ytr, Yte, m

def gen_lin_Y(noise, train_ratio, trial, show_truth, show_tr, show_te, show_pred, show_model):
    # Generate the data here so the noise, train_ratio and trial arguments are actually used
    Xtr, Xte, Ytr, Yte, m = gen_data(noise, train_ratio, trial)

    # Fit a line through the origin on the training set, then predict the testing set
    reg = LinearRegression(fit_intercept=False).fit(Xtr, Ytr)
    Ypred = reg.predict(Xte)

    slope_estimate = round(reg.coef_[0, 0], 4)

    if show_tr:
        plt.scatter(Xtr, Ytr, s=30, alpha=0.7, label='Training')
    if show_te:
        plt.scatter(Xte, Yte, s=70, alpha=0.9, color='red', label='Test')
    if show_pred:
        plt.scatter(Xte, Ypred, s=90, facecolors='none', edgecolors='r', alpha=0.7, label='Prediction')
    if show_truth:
        plt.plot(X, m * X, '--', alpha=0.9, label='Truth', color='green')
    if show_model:
        plt.plot(X, slope_estimate * X, alpha=0.3, linewidth=6, color='green', label='Model')
        plt.text(0, -40, 'Estimated slope: ' + str(slope_estimate), color='green')

    plt.legend()
    plt.ylim((-50, 50))
    plt.xlim((-15, 15))
    plt.title('True slope = ' + str(m))
    sns.despine()

# First we'll fix the data and just work with building up the plot
#w = interactive(gen_lin_Y,noise=(0.0,10.0,0.1),train_ratio=(0.05,0.95,0.05),trial=(1,1000,1),show_truth=True,show_tr=False,show_te=False,show_model=False,show_pred=False)
w = interactive(gen_lin_Y, noise=fixed(5.0), train_ratio=fixed(0.20), trial=fixed(1), show_truth=True, show_tr=False, show_te=False, show_model=False, show_pred=False)
display(w)
#controls = HBox(w.children[:-1], layout = Layout(flex_flow='row wrap'))
#output = w.children[-1]
#VBox([controls, output])


In [74]:
w = interactive(gen_data,noise=(0.0,10.0,0.1),train_ratio=(0.05,0.95,0.05),trial=(1,1000,1))
w


With `train_ratio` set to 0.2, the code runs the train/test split once: 20 of the 100 datapoints go to the training set and the remaining 80 to the testing set.
To get a better idea of 
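
One way to look beyond a single split is to repeat it with different random seeds and see how much the estimated slope moves. The sketch below is an assumption about where this could go, not the notebook's own code: it mirrors the simulation above (true slope 2.4, 100 points, 20% training), but the noise level, the 200 repetitions, and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
true_slope = 2.4

# One fixed synthetic dataset, split many different ways
X = rng.uniform(-10, 10, size=(100, 1))
Y = true_slope * X + rng.normal(0, 5, size=X.shape)

slopes = []
for trial in range(200):
    # Each trial uses a different random 20% of the data for training
    Xtr, Xte, Ytr, Yte = train_test_split(X, Y, train_size=0.2, random_state=trial)
    reg = LinearRegression(fit_intercept=False).fit(Xtr, Ytr)
    slopes.append(reg.coef_[0, 0])

slopes = np.array(slopes)
print(f"Mean estimated slope: {slopes.mean():.2f}")
print(f"Spread (std) across splits: {slopes.std():.2f}")
```

The spread across splits gives a sense of how much any single train/test split can mislead you, which a single run cannot show.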

## Let's do an example!

Let's revisit our example from the previous notebook: Diabetes