# Machine learning for medicine
## Training/Testing Dataset - The Basics of Inference

## Overview
Machine learning (ML) is all about finding patterns in data.
In a carefully controlled and measured experiment, the patterns the data can take are fairly simple.
In the clinical world, though, there are many uncontrolled variables and messy measurements, so patterns can show up that have nothing to do with what you're studying.

One way ML takes this into account is with the idea of "training" sets and "testing" sets.
Basically, you take the whole dataset you collected, split it into two sets (not necessarily equal in size), learn the pattern in the "training" set, and then see how well that pattern holds up in your "testing" set.
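To make the "patterns that aren't real" idea concrete, here is a minimal sketch on made-up data. Everything in it is an assumption for illustration: 40 hypothetical patients, 200 pure-noise "measurements", and an outcome that is unrelated to any of them. We pick the measurement that looks most related to the outcome in the training half, then check the same measurement in the testing half.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 noise "measurements" per patient, outcome unrelated to all of them
n_patients, n_features = 40, 200
X = rng.normal(size=(n_patients, n_features))
y = rng.normal(size=n_patients)

# The rows are already in random order, so the first half can serve as the training set
train = np.arange(0, 20)
test = np.arange(20, 40)

def corr_with_outcome(rows):
    """Pearson correlation of every measurement with the outcome, using only `rows`."""
    Xc = X[rows] - X[rows].mean(axis=0)
    yc = y[rows] - y[rows].mean()
    return (Xc * yc[:, None]).sum(axis=0) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

r_train = corr_with_outcome(train)
r_test = corr_with_outcome(test)

# The "best" pattern is chosen by looking at the training set only
best = int(np.argmax(np.abs(r_train)))

print("Most correlated measurement in training set:", best)
print("Correlation in training set:", round(float(r_train[best]), 3))
print("Correlation in testing set: ", round(float(r_test[best]), 3))
```

With this many noise measurements and so few patients, at least one of them will usually look strongly correlated in the training set purely by chance. The testing set is where such a pattern typically falls apart.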

In [50]:
# The big library for doing math + data in python
import numpy as np

# A big library that has a lot of useful functions for scientific use of python
import scipy

# The main library for plotting
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("white")
matplotlib.rcParams['figure.figsize'] = [15, 10]

# The main library used for statistics
import scipy.stats as stats

# Libraries that let us use interactive widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

# Misc stuff related to cleaning up code and displaying results in a pretty way
from example_systems import *
from IPython.display import Markdown as md

In [51]:
# Our machine learning libraries
from sklearn.linear_model import LinearRegression

from sklearn.model_selection import train_test_split

## The standard approach
In a typical experiment you're trying to study something about a *population*.
You design careful controls and measurements, and work out how many 'samples' you need to collect to be able to say confidently that there's an effect.
Then, and only then, do you recruit your sample, collect your data, and analyse *all* of it.
The statistics you run try to tell you how confident you can be that what you found in your sample is what's happening in the population.
Here, you're *implicitly* making conclusions about how well your results reflect the population, or **generalize**.
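As a rough sketch of the standard approach, the example below fits a line to *all* of a simulated sample and reports a p-value and a 95% confidence interval for the slope. The data (true slope 2.3, noise level 5, 100 samples) and the seed are assumptions made up for illustration.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)

# Hypothetical sample: 100 measurements with a true slope of 2.3 plus Gaussian noise
x = rng.uniform(-10, 10, size=100)
y = 2.3 * x + rng.normal(0, 5.0, size=100)

# Use every data point in a single analysis
res = stats.linregress(x, y)

# 95% confidence interval for the slope from the t distribution (n - 2 degrees of freedom)
t_crit = stats.t.ppf(0.975, df=len(x) - 2)
ci_low = res.slope - t_crit * res.stderr
ci_high = res.slope + t_crit * res.stderr

print("Estimated slope:", round(res.slope, 3))
print("p-value for slope = 0:", res.pvalue)
print("95% CI for slope:", (round(ci_low, 3), round(ci_high, 3)))
```

Here the claim about the population lives entirely in the p-value and the confidence interval. No data was held back to check it.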

## Training vs Testing Set
ML typically relies on a very different approach.
In ML, you collect your data and try to understand the limitations in the data you collected.
You then **split your dataset** into a 'training' set and a 'testing' set.
These sets do not overlap.
The 'training' set is where you do your 'learning': you let the algorithm of your choice find the patterns that it thinks are in your data.
The 'testing' set is where you take the pattern you just learned and see how well it holds.
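A minimal sketch of these steps, assuming simulated data (true slope 2.3, noise level 5) and a 70/30 split chosen only for illustration. Splitting the row indices rather than the data itself makes it easy to confirm that the two sets do not overlap.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Hypothetical dataset: 100 samples, one feature, linear relationship plus noise
X = rng.uniform(-10, 10, size=(100, 1))
Y = 2.3 * X + rng.normal(0, 5.0, size=X.shape)

# Split the row indices so the overlap check below is explicit
idx = np.arange(len(X))
idx_train, idx_test = train_test_split(idx, test_size=0.3, random_state=0)
overlap = set(idx_train) & set(idx_test)
print("Training rows:", len(idx_train), "Testing rows:", len(idx_test), "Shared rows:", len(overlap))

# Learn the pattern on the training set only
reg = LinearRegression().fit(X[idx_train], Y[idx_train])

# R^2 on the data the model learned from vs. the data it has never seen
r2_train = reg.score(X[idx_train], Y[idx_train])
r2_test = reg.score(X[idx_test], Y[idx_test])
print("R^2 on training set:", round(r2_train, 3))
print("R^2 on testing set: ", round(r2_test, 3))
```

The testing-set score is the one that says something about how the learned pattern holds on new data.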

It's basically a 'simulated' experiment where your full dataset plays the role of the 'population' and the training set is your sample from it. Because the testing set was held out, you can *explicitly* test how well your results hold beyond the data they were learned from: you're explicitly testing the **generalization**.
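One way to take the 'simulated experiment' idea further is to repeat it: draw many different training sets from the same dataset and see how much the testing score moves around. The sketch below is an illustration under assumed settings (simulated data, 20% of rows used for training, 200 repeats), not a prescribed procedure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Hypothetical dataset standing in for the 'population'
X = rng.uniform(-10, 10, size=(100, 1))
Y = 2.3 * X + rng.normal(0, 5.0, size=X.shape)

test_scores = []
slopes = []
for trial in range(200):
    # Each trial draws a different 20-row training set; the other 80 rows are the testing set
    Xtr, Xte, Ytr, Yte = train_test_split(X, Y, train_size=0.2, random_state=trial)
    reg = LinearRegression().fit(Xtr, Ytr)
    slopes.append(reg.coef_[0, 0])
    test_scores.append(reg.score(Xte, Yte))

test_scores = np.array(test_scores)
slopes = np.array(slopes)

print("Slope estimates: mean", round(slopes.mean(), 3), "std", round(slopes.std(), 3))
print("Testing R^2:     mean", round(test_scores.mean(), 3), "std", round(test_scores.std(), 3))
```

The spread across trials gives a feel for how much a single split can mislead you in either direction.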

PICTURES

### Interactive example of training and testing

In [64]:
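from IPython.display import display

# Draw the x-values once so every slider setting reuses the same inputs
X = np.random.uniform(-10, 10, size=(100, 1))

def gen_lin_Y(noise, train_ratio, trial):
    # Ground truth: a line through the origin with slope m, plus Gaussian noise
    # (the noise is redrawn on every call, so Y changes whenever a slider moves)
    m = 2.3
    Y = m * X + np.random.normal(0, noise, size=X.shape)

    # `trial` seeds the shuffle, so each trial value gives a different split
    Xtr, Xte, Ytr, Yte = train_test_split(X, Y, test_size=1 - train_ratio, random_state=trial)

    # Fit the linear regression on the training set only
    reg = LinearRegression(fit_intercept=True).fit(Xtr, Ytr)
    Ypred = reg.predict(Xte)

    slope_estimate = round(reg.coef_[0, 0], 4)
    # R^2 on the data the model saw vs. the data it did not see
    r2_train = round(reg.score(Xtr, Ytr), 3)
    r2_test = round(reg.score(Xte, Yte), 3)

    # Training points (default color), held-out testing points (faint red),
    # and the model's predictions for the testing points (open red circles)
    plt.scatter(Xtr, Ytr, s=60, alpha=0.5)
    plt.scatter(Xte, Yte, s=30, alpha=0.2, color='red')
    plt.scatter(Xte, Ypred, s=80, facecolors='none', edgecolors='r')

    # True line (dashed) and fitted line, including the fitted intercept
    plt.plot(X, m * X, '--', alpha=0.9)
    plt.plot(X, reg.predict(X), alpha=0.3, linewidth=4, color='red')

    plt.ylim((-50, 50))
    plt.xlim((-15, 15))
    plt.title('True slope = ' + str(m))
    plt.text(0, -40, 'Estimated slope: ' + str(slope_estimate), color='red')
    plt.text(0, -45, 'R^2 train / test: ' + str(r2_train) + ' / ' + str(r2_test), color='red')
    sns.despine()

w = interactive(gen_lin_Y, noise=(0.0, 10.0, 0.1), train_ratio=(0.05, 0.95, 0.1), trial=(1, 1000, 1))
display(w)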


When you set the training ratio to 0.2, the code runs the train/test split only once.
To get a better idea of 