## Lighthouse Labs - Synaptive Medical
### Introduction to Machine Learning
### W3D1 part 1 Machine Learning & Linear Regression

Instructor: Socorro Dominguez  
December 04, 2020


## Agenda
1. Machine Learning
    - Supervised vs. Unsupervised Learning
    
    

2. Supervised Learning 101
    - `X` and `y`
    - Regression vs. Classification    
    - The golden rule: train/test split


3. Simple Linear Regression

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

## Machine Learning
Machine learning is generally regarded as a subset of AI. Machine learning algorithms build a model from sample data (the training data) in order to make predictions without being explicitly programmed to do so.

## Machine Learning: Supervised Learning
- In supervised learning, we have a set of observations (__*X*__) with an associated response (__*y*__)
- We wish to find a model function that relates __*X*__ to __*y*__
- Then use that model function to predict future observations

<img src='imgs/sup_learning.png' width="700">

## Machine Learning: Unsupervised Learning
- We have __*X*__ (the data) but no __*y*__ (associated response)

<img src='imgs/unsup_learning.png' width="630">

## Supervised Learning 101

Lots of terminology!

For tabular data:
- **examples** = rows = samples = records = instances (usually denoted by $n$)
- **features** = inputs = predictors = explanatory variables = regressors = independent variables = covariates (usually denoted by $d$) = **X**
- **targets** = outputs = outcomes = response variable = dependent variable = labels (if categorical) = **y**
- **training** = learning = fitting

## Classification vs. Regression

* Classification problems: predicting among two or more categories, also known as classes
    - Example1: Predict whether a patient has a liver disease or not
    - Example2: Predict the letter grade of a student (A, B, C, D or F)
* Regression problems: predicting a continuous value (in other words, a number)
    - Example1: Predict housing prices
    - Example2: Predict a student’s score in this course’s quiz2

## Let's load some toy data

In [2]:
Stock_Market = {'Year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
                'Month': [12, 11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
                'Interest_Rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
                'Unemployment_Rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
                'Stock_Index_Price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719]        
                }

stock_df = pd.DataFrame(Stock_Market,columns=['Year','Month','Interest_Rate','Unemployment_Rate','Stock_Index_Price'])

In [3]:
stock_df.head()

| | Year | Month | Interest_Rate | Unemployment_Rate | Stock_Index_Price |
|:-|:-|:-|:-|:-|:-|
| 0 | 2017 | 12 | 2.75 | 5.3 | 1464 |
| 1 | 2017 | 11 | 2.5 | 5.3 | 1394 |
| 2 | 2017 | 10 | 2.5 | 5.3 | 1357 |
| 3 | 2017 | 9 | 2.5 | 5.3 | 1293 |
| 4 | 2017 | 8 | 2.5 | 5.4 | 1256 |


In [4]:
stock_df.shape

(24, 5)

## Splitting out our X and y
- In this case, we are working with a regression problem. Could you say why?
- Can you help me identify what the features are?
- Along the same lines, what would the output or target variable be?

In [5]:
X = stock_df[['Interest_Rate','Unemployment_Rate']]
y = stock_df['Stock_Index_Price']

In [6]:
X.head()

| | Interest_Rate | Unemployment_Rate |
|:-|:-|:-|
| 0 | 2.75 | 5.3 |
| 1 | 2.5 | 5.3 |
| 2 | 2.5 | 5.3 |
| 3 | 2.5 | 5.3 |
| 4 | 2.5 | 5.4 |


In [7]:
y.head()

0    1464
1    1394
2    1357
3    1293
4    1256
Name: Stock_Index_Price, dtype: int64

## The golden rule
- When you're doing machine learning, now that you've identified **X** and **y**
- BEFORE YOU DO ANYTHING ELSE...
- **Including** exploratory data analysis, visualization etc.
- **You need to split your data into train and test**
- **You only work with the training data**

### Why?
- If you explore the full dataset before splitting, then as soon as you start making decisions on what features to include, drop etc., you are letting a part of the test data influence your decision-making
- Your results will not be truly representative of "unseen data"

## So... how do we split?
- The most common way is to use `train_test_split` from `sklearn`
- By default it shuffles the data first and then splits it
- 80/20, 75/25, 70/30 are common splits

In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [9]:
X_train.shape

(19, 2)

In [10]:
X_test.shape

(5, 2)
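The shapes above follow from how the test size is rounded: with 24 rows and `test_size=0.2`, the test set gets `ceil(4.8) = 5` rows and the training set gets the remaining 19. The sketch below checks this on placeholder arrays (`X_demo` and `y_demo` are made-up stand-ins, not the stock data) and also shows `random_state`, which is an addition not used in the cell above: fixing it makes the shuffle reproducible between runs.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 24
# Placeholder arrays with the same number of rows as the toy dataset (hypothetical data)
X_demo = np.arange(n_samples * 2).reshape(n_samples, 2)
y_demo = np.arange(n_samples)

# random_state is an assumption added here so that the split is repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The test size is rounded up, and the training set takes the rest
n_test = math.ceil(0.2 * n_samples)
n_train = n_samples - n_test
print(X_tr.shape, X_te.shape, n_train, n_test)

# Splitting again with the same random_state yields the same rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
same_split = np.array_equal(X_te, X_te2)
```

Without `random_state`, each run of the notebook produces a different split, so the rows shown by `X_train.head()` will vary.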

## The big picture
- We train using the **training data**
- We test what is learned by the model on the **test data**
- We have two scores: **training** vs. **test**

### Which matters more?
- A high **training score** on its own tells us little; the **test score is what matters** because it estimates performance on unseen data
- Models that generalize well tend to have **similar training and testing scores**

**We want to pick models that generalize well to unseen data**
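As a rough sketch of the two scores, the block below fits `LinearRegression` on a training split of the toy data and reports $R^2$ on both splits. The variable names and `random_state=0` are assumptions made for this illustration. With only 5 test rows the test score is very noisy, so treat the numbers as a demonstration of the workflow rather than a reliable estimate.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same toy values as the Stock_Market dictionary, repeated so the block runs on its own
interest = [2.75, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.25, 2.25, 2.25, 2, 2,
            2, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75]
unemployment = [5.3, 5.3, 5.3, 5.3, 5.4, 5.6, 5.5, 5.5, 5.5, 5.6, 5.7, 5.9,
                6, 5.9, 5.8, 6.1, 6.2, 6.1, 6.1, 6.1, 5.9, 6.2, 6.2, 6.1]
price = [1464, 1394, 1357, 1293, 1256, 1254, 1234, 1195, 1159, 1167, 1130, 1075,
         1047, 965, 943, 958, 971, 949, 884, 866, 876, 822, 704, 719]

X_demo = np.column_stack([interest, unemployment])
y_demo = np.array(price)

# Split first, then fit on the training rows only
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
lr = LinearRegression().fit(X_tr, y_tr)

# .score() returns R^2 for regressors
train_score = lr.score(X_tr, y_tr)
test_score = lr.score(X_te, y_te)
print(f"train R^2: {train_score:.3f}, test R^2: {test_score:.3f}")
```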

## The fundamental tradeoff 
(aka the bias-variance tradeoff)

| Model | Training Score relative to Test Score | Performance |
|:-|:-|:-|
|Complex|High training score compared to test score| Overfit |
|Simple|Low training score and low test score|Underfit|

- **Highly complex** models can learn very intricate relationships in the training data, including noise. When such a model has an **extremely high training score** (too good to be true) and a noticeably lower test score, it is likely **overfit**
- On the other hand, **very simple** models with **low training scores** may not have learned the relationships in the training data needed to predict well on unseen data; they are **underfit**
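The tradeoff can be illustrated with polynomial fits of increasing degree. This is a sketch on synthetic data (a noisy sine curve invented for this example, not the stock data), using `np.polyfit` as a stand-in for "models of different complexity". The training score can only go up as the degree increases; the test score typically stops improving and then gets worse.

```python
import numpy as np

rng = np.random.default_rng(0)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


# Synthetic data (assumption): a smooth curve plus noise
x = rng.uniform(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 40)
x_tr, y_tr = x[:20], y[:20]
x_te, y_te = x[20:], y[20:]

scores = {}
for degree in (0, 3, 9):
    # Fit on the training half only, then score both halves
    coeffs = np.polyfit(x_tr, y_tr, degree)
    scores[degree] = (
        r2(y_tr, np.polyval(coeffs, x_tr)),
        r2(y_te, np.polyval(coeffs, x_te)),
    )

for degree, (train_r2, test_r2) in scores.items():
    print(f"degree {degree}: train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

Degree 0 only predicts the training mean, so its training score is 0 (underfit). A large gap between the training and test scores at degree 9 would be the overfitting signature described in the table.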

## Linear Regression 101
- Used as a predictive model
- Assumes a linear relationship between the dependent variable (which is the variable we are trying to predict/estimate, **y**) and the independent variable/s (input variable/s used in the prediction, **X**)

### Let's start with **simple** linear regression
- Only one independent/input variable is used to predict the dependent variable.

## Simple Linear Regression

$$Y = C + M*X$$

$Y$ = Dependent variable (output/outcome/prediction/estimation)

$C$ = Constant (y-intercept)

$M$ = Slope of the regression line (the effect that X has on Y)

$X$ = Independent variable (input variable used in the prediction of Y)
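For a single input variable, least squares has a closed-form solution: $M = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ and $C = \bar{y} - M\bar{x}$. The sketch below applies it with `Interest_Rate` as $X$ and `Stock_Index_Price` as $Y$, and cross-checks against `np.polyfit`. It uses all 24 rows purely to illustrate the arithmetic; the variable names are chosen for this example.

```python
import numpy as np

# Interest_Rate and Stock_Index_Price from the toy dataset
interest_rate = np.array([2.75, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.25, 2.25, 2.25, 2, 2,
                          2, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75])
stock_price = np.array([1464, 1394, 1357, 1293, 1256, 1254, 1234, 1195, 1159, 1167, 1130, 1075,
                        1047, 965, 943, 958, 971, 949, 884, 866, 876, 822, 704, 719])

x_mean = interest_rate.mean()
y_mean = stock_price.mean()

# Slope: covariance of X and Y divided by the variance of X
M = np.sum((interest_rate - x_mean) * (stock_price - y_mean)) / np.sum(
    (interest_rate - x_mean) ** 2
)
# Intercept: the fitted line passes through the point of means
C = y_mean - M * x_mean

# Cross-check with NumPy's degree-1 polynomial fit (returns slope, then intercept)
M_check, C_check = np.polyfit(interest_rate, stock_price, 1)
print(f"M = {M:.2f}, C = {C:.2f}")
```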

## Multiple Linear Regression
- Many $X$ and $M$

$$Y = C + M_1X_1 + M_2X_2 + M_3X_3 ...$$

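- Each $M_i$ is the change in $Y$ for a one-unit increase in $X_i$ with the other inputs held fixed. The larger its magnitude, the more influence $X_i$ has on $Y$, although magnitudes are only directly comparable when the inputs are on similar scales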

## Matrix representation

- $\hat{y}$ is a linear function of the features $x$ and the weights $w$.

$$\hat{y} = w^Tx + b$$
        
- $\hat{y} \rightarrow$ prediction
- $w \rightarrow$ weight vector
- $b \rightarrow$ bias
- $x \rightarrow$ features

$$\hat{y} = \begin{bmatrix}w_1 & w_2 & \cdots & w_d\end{bmatrix}\begin{bmatrix}x_1 \\ x_2 \\ \vdots \\ x_d\end{bmatrix} + b$$
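In NumPy, the matrix form is a single dot product. As a sketch, the block below plugs in the coefficients reported in the OLS summary later in this notebook (weights for `Interest_Rate` and `Unemployment_Rate`, and the constant as the bias) and predicts for row 0 of the dataset. The fitted values depend on the random split, so a rerun of the notebook would give different numbers.

```python
import numpy as np

# Weights and bias copied from the OLS summary output in this notebook
w = np.array([389.7573, -173.4419])  # Interest_Rate, Unemployment_Rate
b = 1265.4821                        # const

# Features of row 0: Interest_Rate = 2.75, Unemployment_Rate = 5.3
x = np.array([2.75, 5.3])

# y_hat = w^T x + b
y_hat = w @ x + b
print(round(y_hat, 2))  # approximately 1418.07 (the observed value for this row is 1464)

# The same formula for several rows at once: one prediction per row
X_rows = np.array([[2.75, 5.3],
                   [2.5, 5.3]])
y_hat_rows = X_rows @ w + b
```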

## Let's try it!
Remembering our dataset...

In [11]:
import statsmodels.api as sm

In [12]:
X_train.head()

| | Interest_Rate | Unemployment_Rate |
|:-|:-|:-|
| 15 | 1.75 | 6.1 |
| 16 | 1.75 | 6.2 |
| 7 | 2.25 | 5.5 |
| 3 | 2.5 | 5.3 |
| 17 | 1.75 | 6.1 |


In [13]:
y_train.head()

15     958
16     971
7     1195
3     1293
17     949
Name: Stock_Index_Price, dtype: int64

## Fitting a linear model
- Let's look at the results

In [14]:
X_train = sm.add_constant(X_train) # adding a constant

model = sm.OLS(y_train, X_train).fit()
model.summary()

| | | | |
|:-|:-|:-|:-|
| Dep. Variable: | Stock_Index_Price | R-squared: | 0.908 |
| Model: | OLS | Adj. R-squared: | 0.896 |
| Method: | Least Squares | F-statistic: | 78.95 |
| Date: | Fri, 04 Dec 2020 | Prob (F-statistic): | 5.14e-09 |
| Time: | 05:57:21 | Log-Likelihood: | -104.99 |
| No. Observations: | 19 | AIC: | 216.0 |
| Df Residuals: | 16 | BIC: | 218.8 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|:-|:-|:-|:-|:-|:-|:-|
| const | 1265.4821 | 982.801 | 1.288 | 0.216 | -817.963 | 3348.927 |
| Interest_Rate | 389.7573 | 117.826 | 3.308 | 0.004 | 139.978 | 639.537 |
| Unemployment_Rate | -173.4419 | 129.985 | -1.334 | 0.201 | -448.999 | 102.115 |

| | | | |
|:-|:-|:-|:-|
| Omnibus: | 5.102 | Durbin-Watson: | 1.668 |
| Prob(Omnibus): | 0.078 | Jarque-Bera (JB): | 2.731 |
| Skew: | -0.772 | Prob(JB): | 0.255 |
| Kurtosis: | 4.031 | Cond. No. | 409.0 |


## Results interpretation

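- **Weights**: the `coef` values of the variables (`Interest_Rate`, `Unemployment_Rate`)
- **Bias**: the `coef` of the constant (`const`), i.e. the intercept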

- **R-squared** measures the proportion of the variation in your dependent variable (Y) explained by your independent variables (X) for a linear regression model
- **Adjusted R-squared** adjusts the statistic based on the number of independent variables in the model
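The adjustment can be reproduced by hand with the usual formula $\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ the number of independent variables. The sketch below plugs in the values from the summary (`R-squared` 0.908, 19 observations, 2 variables); the variable names are chosen for this example.

```python
# Values read from the OLS summary
r_squared = 0.908   # R-squared (already rounded to 3 decimals)
n_obs = 19          # No. Observations
n_features = 2      # Df Model (independent variables, excluding the constant)

# Adjusted R^2 penalizes R^2 for each additional independent variable
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_features - 1)
print(adj_r_squared)
```

The result is close to the 0.896 in the summary; any small difference comes from starting with the rounded $R^2$.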