# Machine learning for medicine
## Understanding Correlation

## Overview
In medicine, we care about how different physiologic processes *relate* to each other.
Correlation is one way to measure how related two things are.
In this notebook we get hands-on with correlation.

## Correlations
Correlations are the backbone of science.
Correlations are one way to assess whether two variables are *related* to each other.

Correlation checks whether there's a *linear* relationship between two variables $X$ and $Y$.
In other words: if we *double* a change in $X$, do we double the change in $Y$?
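
To make this concrete, here is a minimal sketch of how Pearson's correlation coefficient can be computed by hand: the covariance of the two variables divided by the product of their spreads. The helper name `pearson_by_hand` and the toy data are assumptions made up for illustration; `stats.pearsonr` from SciPy is used only as a cross-check.

```python
import numpy as np
import scipy.stats as stats

def pearson_by_hand(x, y):
    """Pearson's r: co-variation of x and y, normalized by their spreads."""
    xc = x - np.mean(x)          # center x
    yc = y - np.mean(y)          # center y
    return np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2))

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
x = rng.uniform(-10, 10, size=200)
y_line = 3.0 * x + 1.0                                # exact straight line
y_noisy = 3.0 * x + rng.normal(0, 20, size=x.shape)   # same line plus noise

r_line = pearson_by_hand(x, y_line)
r_noisy = pearson_by_hand(x, y_noisy)
r_scipy = stats.pearsonr(x, y_noisy)[0]

print("exact line:", r_line)
print("noisy line:", r_noisy, "(scipy:", r_scipy, ")")
```

An exact straight line gives a coefficient of 1 (up to floating-point error), and adding noise pulls it toward 0.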

## Code Setup

In [1]:
import numpy as np
import scipy
import matplotlib.pyplot as plt
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
import scipy.stats as stats

## Introduction
In every research question, we're studying *relationships* between the things we're interested in.
Correlation is one way to see whether one variable tends to change when another variable changes.

In [8]:
%%capture
def simple_eg(slope=1.0,noise=0.0):
    # x is uniform on [-10, 10]; y is a line through the origin plus Gaussian noise
    x = np.random.uniform(-10,10,size=(100,))
    y = slope * x + np.random.normal(0,noise,size=x.shape)

    plt.figure()
    # Blue: y depends on x. Red: pure noise with no dependence on x.
    plt.scatter(x,y)
    plt.scatter(x,np.random.normal(0,noise,size=x.shape),color='red',alpha=0.4)
    plt.ylim(-10,10)
    plt.xlim(-10,10)
    plt.axis('off')
    plt.legend(['Correlated','Uncorrelated'])
    # pearsonr returns the correlation coefficient and its p-value
    corr_val = stats.pearsonr(x,y)
    plt.text(2,-5,s=f'Pearson: {corr_val[0]:.3f}\n p={corr_val[1]:.3g}')
    plt.show()

In [9]:
interact(simple_eg,slope=(-5,5,0.1),noise=(0.0,10.0,0.5));


Here, the blue points show an example where the value of y correlates with the value of x, with a strength and direction that depend on where you set the sliders.
The red points show an example where y *doesn't* correlate with the value of x.
Without any noise, it's pretty straightforward.
But the real world is noisy, and this noise can make it tough to tell whether there is a relationship.
Try this out yourself by setting the noise slider to 5.0.
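
If you'd rather see the effect of noise without moving sliders, the sketch below repeats the same idea for a few fixed noise levels and prints the Pearson coefficient for each. The noise levels, the fixed slope of 1, and the seed are arbitrary choices for illustration.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)            # fixed seed so the sketch is repeatable
x = rng.uniform(-10, 10, size=100)

results = {}
for noise in [0.0, 1.0, 5.0, 10.0, 20.0]:
    # Same underlying line every time; only the noise level changes
    y = 1.0 * x + rng.normal(0, noise, size=x.shape)
    r, p = stats.pearsonr(x, y)
    results[noise] = r
    print(f"noise={noise:5.1f}  Pearson r={r:.3f}  p={p:.2g}")
```

The relationship between x and y is identical in every row; only the noise changes, and the measured correlation drops as the noise grows.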

So much of evidence-based medicine (EBM) is designing experiments so that we can cleanly say whether y and x are related.

## Correlation Coefficient

## Scaling one variable

## Correlation is not causation

## What is 'noise'?

In [4]:

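def f(var):
    t = np.linspace(0,10,100)

    # Sine wave plus Gaussian noise; `var` is passed as the standard deviation (scale).
    # The sample takes its shape from t; size=(100,1) cannot broadcast with a (100,) mean.
    x = np.random.normal(np.sin(2 * np.pi * t),var)
    fig1 = plt.figure()
    plt.plot(t,x)
    plt.xlim((-1,11))
    plt.ylim((-10,10))
    plt.show()

interact(f,var=(0,10,0.1));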


## Linear Correlations
Linear functions are nice and easy.
We like linear functions so much that we often squint our eyes to see a line even when there isn't one.
The whole point of linear correlation is to be able to say that a variable we're trying to explain is related to a variable we're measuring by a simple multiplication.
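
One consequence worth checking for yourself: because Pearson's coefficient is normalized by the spread of each variable, multiplying one variable by a positive constant or adding an offset leaves it unchanged, while a negative multiplier only flips its sign. The sketch below is an illustration with made-up data; the gain of 10 and the offset of 100 are arbitrary.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(2)                 # fixed seed so the sketch is repeatable
x = rng.uniform(-3, 3, size=100)
y = x + rng.normal(0, 1.0, size=x.shape)       # noisy linear relationship

r_base = stats.pearsonr(x, y)[0]
r_scaled = stats.pearsonr(x, 10.0 * y)[0]      # positive gain: same r
r_shifted = stats.pearsonr(x, y + 100.0)[0]    # constant offset: same r
r_flipped = stats.pearsonr(x, -10.0 * y)[0]    # negative gain: sign flips

print(r_base, r_scaled, r_shifted, r_flipped)
```

So the coefficient tells us how tightly the points follow *some* line, not how steep that line is.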

In [5]:
def g_lin(var,mag,fp,gain=1):
    # mag and fp are not used in this linear example
    x = np.random.uniform(-3,3,size=(100,))
    # y is x plus Gaussian noise (standard deviation `var`), then scaled by `gain`
    y = gain*np.random.normal(x,var)

    plt.figure()
    plt.scatter(x,y)
    plt.xlim((-5,5))
    plt.ylim((-50,50))

    # Compare the linear (Pearson) and rank-based (Spearman) coefficients
    pears = stats.pearsonr(x,y)
    spears = stats.spearmanr(x,y)
    plt.title(f'Correlation: Pearson {pears[0]:.3f} vs Spearman {spears[0]:.3f}')
    plt.show()

interact(g_lin,var=(0,10.),mag = (1,10.,0.5),fp=(0,4,0.5),gain=(0.1,10,0.1));


## Nonlinear functions and correlation

In [6]:
def g(var,mag,fp,gain=1):
    x = np.random.uniform(-3,3,size=(100,))
    # Noise-free polynomial in x, equal to gain*mag * x**2 * (x**2 - fp**2)
    y = gain*mag*(x-fp) * (x) * (x+fp) * x

    # Center both variables, then add Gaussian noise (standard deviation `var`) to y
    x = x - np.mean(x)
    y = y - np.mean(y) + np.random.normal(0,var,size=x.shape)

    plt.figure()
    plt.scatter(x,y)
    plt.xlim((-5,5))
    plt.ylim((-50,50))

    # Compare the linear (Pearson) and rank-based (Spearman) coefficients
    pears = stats.pearsonr(x,y)
    spears = stats.spearmanr(x,y)
    plt.title(f'Correlation: Pearson {pears[0]:.3f} vs Spearman {spears[0]:.3f}')
    plt.show()

interact(g,var=(0,100.),mag = (1,10.,0.5),fp=(0,4,0.5),gain=(0.1,10,0.1));


### What is this telling us?
The Pearson correlation can come out far below 1; in the run described here it reads only about 60\%.
Pearson's is a *linear* correlation.
But this is a bit absurd.
Apart from the noise we add, Y is a very, very clean calculation on X.
Meaning, with the noise turned down to zero, if we know X, we **know** Y.

The correlations are low because Pearson measures only *linear* relationships (and Spearman only monotonic ones).
There is, by definition (since we *defined it*), a nonlinear relationship between Y and X.
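
A noise-free sketch makes the point starkly. Below, Y is computed exactly from X with no randomness at all, yet the linear correlation is nowhere near 1. The two functions (a parabola, and a cubic with roots at -2, 0 and 2) and the evenly spaced grid are choices made for illustration.

```python
import numpy as np
import scipy.stats as stats

x = np.linspace(-3, 3, 101)            # evenly spaced and symmetric about 0
y_even = x**2                          # parabola: mirror-symmetric in x
y_cubic = (x - 2) * x * (x + 2)        # cubic with roots at -2, 0 and 2

r_even = stats.pearsonr(x, y_even)[0]
r_cubic = stats.pearsonr(x, y_cubic)[0]
rho_cubic = stats.spearmanr(x, y_cubic)[0]

print("parabola, Pearson:", r_even)    # essentially zero
print("cubic, Pearson:", r_cubic)
print("cubic, Spearman:", rho_cubic)
```

For the parabola, the upward trend on the right cancels the downward trend on the left, so the coefficient is essentially zero even though knowing X tells us Y exactly.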


We'll do the same again, now plotting noisy scatter plot observations on top of the true curve (red) and a straight line (blue).

In [7]:
def relat(x):
    # Cubic with roots at -2, 0 and 2
    return (x-2) * (x) * (x+2)

def gr(nsamp,var,mag):
    x = np.random.uniform(-4.,4.,size=(nsamp,))
    # Noisy observations of the cubic; `var` is the noise standard deviation
    y = mag*np.random.normal(relat(x),var)

    # Noise-free reference curves: the true cubic (red) and a straight line (blue)
    xc = np.linspace(-4,4,100)
    yc = mag*relat(xc)
    yl = mag*xc

    fig1 = plt.figure()

    plt.scatter(x,y)
    plt.xlim((-5,5))
    plt.ylim((-50,50))
    plt.plot(xc,yc,color='red')
    plt.plot(xc,yl,color='blue')

    pears = stats.pearsonr(x,y)
    plt.title(f'Pearson: {pears[0]:.3f} (p={pears[1]:.3g})')

    plt.show()

interact(gr,nsamp=(10,100,5),var=(0,100.),mag = (0,5.));
