# Regression

This notebook is an **optional** companion to the tutorial video. Following along and filling in the code may help you learn the material faster.

First, update the path to the nhanes.csv file.

In [27]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import KFold, GridSearchCV, cross_validate, train_test_split
import matplotlib.pyplot as plt

nhanes = pd.read_csv('nhanes.csv')
nhanes = nhanes.drop_duplicates('ID') # remove multiple entries from the same person
cols = ['Weight', 'Height', 'Gender', 'Age', 'BPSysAve', 'BPDiaAve',
        'TotChol', 'Diabetes', 'PhysActive', 'SmokeNow']
df = nhanes[cols].copy()
df = df.dropna()
df.head()

Unnamed: 0,Weight,Height,Gender,Age,BPSysAve,BPDiaAve,TotChol,Diabetes,PhysActive,SmokeNow
0,87.4,164.7,male,34,113.0,85.0,3.49,No,No,No
4,86.7,168.4,female,49,112.0,75.0,6.7,No,No,Yes
10,68.0,169.5,male,66,111.0,63.0,4.99,No,Yes,No
14,57.5,148.1,female,58,127.0,83.0,4.78,No,Yes,Yes
17,93.8,181.3,male,33,128.0,74.0,5.59,No,No,No


# Linear Regression

Create dummy variables for the categorical variables (Gender, Diabetes, PhysActive, SmokeNow). Creating two new variables for a binary variable is redundant, so you should also drop the first category. 

In [29]:
df=pd.get_dummies(df, drop_first=True)
df.head()

Unnamed: 0,Weight,Height,Age,BPSysAve,BPDiaAve,TotChol,Gender_male,Diabetes_Yes,PhysActive_Yes,SmokeNow_Yes
0,87.4,164.7,34,113.0,85.0,3.49,1,0,0,0
4,86.7,168.4,49,112.0,75.0,6.7,0,0,0,1
10,68.0,169.5,66,111.0,63.0,4.99,1,0,1,0
14,57.5,148.1,58,127.0,83.0,4.78,0,0,1,1
17,93.8,181.3,33,128.0,74.0,5.59,1,0,0,0
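To see why one of the two dummy columns is redundant, here is a minimal sketch on a toy DataFrame. The values in `toy` are made up for illustration and are not taken from nhanes.csv.

```python
import pandas as pd

# Toy data (hypothetical values, not from nhanes.csv)
toy = pd.DataFrame({"Gender": ["male", "female", "male"],
                    "SmokeNow": ["No", "Yes", "No"]})

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category of each variable dropped

print(list(full.columns))
print(list(reduced.columns))

# The two Gender columns always sum to 1, so either one determines the other
gender_sum = full["Gender_female"].astype(int) + full["Gender_male"].astype(int)
print((gender_sum == 1).all())
```

Categories are ordered alphabetically, which is why `female` and `No` are the ones dropped and become the reference levels.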


Create a column-vector of responses for Weight and call this `y`. 

In [3]:
y=df.Weight.values.reshape(-1, 1)
y

array([[ 87.4],
       [ 86.7],
       [ 68. ],
       ...,
       [ 85.5],
       [113.9],
       [ 92.3]])

Create a design matrix, or matrix of features to use as predictors. This should include all variables except Weight. Call this `X`. 

In [4]:
X= df.drop('Weight', axis=1).values
X

array([[164.7,  34. , 113. , ...,   0. ,   0. ,   0. ],
       [168.4,  49. , 112. , ...,   0. ,   0. ,   1. ],
       [169.5,  66. , 111. , ...,   0. ,   1. ,   0. ],
       ...,
       [173.6,  43. , 112. , ...,   0. ,   0. ,   0. ],
       [173.6,  69. , 108. , ...,   0. ,   0. ,   1. ],
       [177.3,  28. , 124. , ...,   0. ,   1. ,   1. ]])

Standardize the numeric columns of `X` using the `StandardScaler` class from scikit-learn. 

In [5]:
num_cols=[0,1,2,3,4]

In [9]:
scaler=StandardScaler()
scaler.fit(X[:, num_cols])

StandardScaler(copy=True, with_mean=True, with_std=True)

Fit a regression model to predict Weight from the other variables using the `LinearRegression` estimator.  

In [10]:
scaler.mean_

array([169.97836822,  49.02361782, 122.54804079,  69.98765432,
         5.04988191])

Print the intercept and coefficients from this model. What is the $R^2$? 

In [11]:
scaler.scale_

array([ 9.56038801, 17.21180861, 17.6642123 , 12.61002498,  1.09260241])
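The `mean_` and `scale_` attributes are the column means and standard deviations that `transform` uses: each value becomes $(x - \text{mean}) / \text{scale}$. The sketch below checks this on synthetic data; the array `A` and the name `scaler_demo` are stand-ins for illustration, not the NHANES features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for two numeric columns (values are made up)
A = rng.normal(loc=[170.0, 50.0], scale=[10.0, 17.0], size=(100, 2))

scaler_demo = StandardScaler().fit(A)
scaled = scaler_demo.transform(A)

# Manual standardization; np.std uses the population formula (ddof=0) by default
manual = (A - A.mean(axis=0)) / A.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After the transform, each column has mean 0 and standard deviation 1 up to floating-point error.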

Transform the features to create cubic polynomials of the features (e.g. if a feature is x, create $x^2$ and $x^3$ as features, along with interactions). What is the shape of this matrix?

In [12]:
X[:, num_cols]=scaler.transform(X[:, num_cols])

In [13]:
X

array([[-0.55210816, -0.87286689, -0.54053023, ...,  0.        ,
         0.        ,  0.        ],
       [-0.16509458, -0.00137219, -0.59714187, ...,  0.        ,
         0.        ,  1.        ],
       [-0.05003649,  0.98632181, -0.65375351, ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [ 0.3788164 , -0.34997007, -0.59714187, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.3788164 ,  1.16062075, -0.82358843, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.76582998, -1.22146477,  0.08219779, ...,  0.        ,
         1.        ,  1.        ]])

In [14]:
np.mean(X[1])  # mean of row 1 (a single observation); use X[:, j] for the mean of column j

0.23823825871675539

Fit a linear regression model to this data. What is the $R^2$?

In [25]:
lm=LinearRegression()
lm

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [26]:
lm.fit(X,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [27]:
lm.intercept_

array([87.20435293])

In [28]:
lm.coef_

array([[ 9.22239267, -1.08880535,  0.31035752,  2.16885274,  0.0136343 ,
        -2.81975676, 11.5808396 , -4.66002597, -4.24022203]])

In [29]:
lm.predict(X) 

array([[82.63803537],
       [82.14041686],
       [76.78372015],
       ...,
       [88.92814647],
       [80.9038575 ],
       [83.03038375]])

In [30]:
lm.score(X,y) #R-squared

0.2372999677636619
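For a regressor, `score` returns the coefficient of determination, $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. As a rough sketch, the same number can be computed by hand. The data here are synthetic and the names `X_demo`, `y_demo`, and `model` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))                      # synthetic features
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(size=200)

model = LinearRegression().fit(X_demo, y_demo)

resid = y_demo - model.predict(X_demo)
ss_res = np.sum(resid ** 2)                             # residual sum of squares
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)          # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, model.score(X_demo, y_demo))
```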

# Polynomial Regression

In [32]:
poly=PolynomialFeatures(degree=3)
Z=poly.fit_transform(X)

In [34]:
Z.shape

(1863, 220)
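The 220 columns follow from a counting argument: a full polynomial expansion of degree $d$ in $p$ features has $\binom{p+d}{d}$ terms, counting the constant column. With $p = 9$ and $d = 3$ that gives $\binom{12}{3} = 220$. A small check, using a placeholder array of zeros rather than the real data:

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features, degree = 9, 3

# Number of monomials of total degree <= 3 in 9 variables, including the constant
expected = comb(n_features + degree, degree)

# Only the number of columns matters here, so a placeholder array is enough
Z_demo = PolynomialFeatures(degree=degree).fit_transform(np.zeros((5, n_features)))

print(expected, Z_demo.shape)
```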

In [36]:
lm=LinearRegression()
lm.fit(Z,y)
lm.score(Z,y) # R-squared is higher, but the model may be overfitting

0.35851577547383245
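The training $R^2$ can only go up when features are added, so it cannot tell us whether the cubic model is overfitting. One common way to check is to hold out part of the data. The following is a sketch on synthetic data with a truly linear signal; the variable names and the 70/30 split are assumptions for illustration, not part of the tutorial.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(300, 5))                                  # synthetic features
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(size=300)   # linear signal plus noise

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

poly_demo = PolynomialFeatures(degree=3)
Z_tr = poly_demo.fit_transform(X_tr)   # fit the expansion on the training split only
Z_te = poly_demo.transform(X_te)

lin = LinearRegression().fit(X_tr, y_tr)
cub = LinearRegression().fit(Z_tr, y_tr)

print("linear train/test R^2:", lin.score(X_tr, y_tr), lin.score(X_te, y_te))
print("cubic  train/test R^2:", cub.score(Z_tr, y_tr), cub.score(Z_te, y_te))
```

A large gap between the training and test scores of the cubic model is the usual sign of overfitting.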

# Regularized Regression

Import the `Lasso` and `Ridge` classes from `sklearn.linear_model`.

In [37]:
from sklearn.linear_model import Lasso, Ridge

Fit a ridge regression model with $\alpha = 10$. Print the coefficients. 

In [38]:
ridge=Ridge(alpha=10)

In [39]:
ridge

Ridge(alpha=10, copy_X=True, fit_intercept=True, max_iter=None, normalize=False,
      random_state=None, solver='auto', tol=0.001)

In [40]:
ridge.fit(X,y)
ridge.coef_

array([[ 9.09100312e+00, -1.01901338e+00,  3.08638019e-01,
         2.15895201e+00, -3.61251469e-03, -2.58724167e+00,
         1.10155318e+01, -4.56432483e+00, -4.13335250e+00]])
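Ridge regression has a closed-form solution. With an unpenalized intercept, the coefficients are $w = (X_c^\top X_c + \alpha I)^{-1} X_c^\top y_c$, where $X_c$ and $y_c$ are the centered features and response. The sketch below compares this formula with `Ridge` on synthetic data; `X_demo`, `y_demo`, and `ridge_demo` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(100, 4))                                   # synthetic features
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(size=100)

alpha = 10.0
ridge_demo = Ridge(alpha=alpha).fit(X_demo, y_demo)

# Center the data so the intercept is handled separately and not penalized
Xc = X_demo - X_demo.mean(axis=0)
yc = y_demo - y_demo.mean()
w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]), Xc.T @ yc)
b = y_demo.mean() - X_demo.mean(axis=0) @ w

print(np.allclose(w, ridge_demo.coef_), np.isclose(b, ridge_demo.intercept_))
```

The penalty shrinks the coefficients toward zero but does not set any of them exactly to zero.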

Fit a lasso regression model with $\alpha = 1$. Print the coefficients. 

In [41]:
lasso=Lasso(alpha=1)
lasso.fit(X,y)
lasso.coef_

array([ 7.16590153,  0.        ,  0.        ,  1.33553184, -0.        ,
        0.        ,  3.17215777, -0.38666931, -0.        ])
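Several lasso coefficients above are exactly zero, while the ridge coefficients are only shrunk. This is the key practical difference between the L1 and L2 penalties. A minimal sketch on synthetic data, where only the first two of ten features matter (the data and the `alpha` values are chosen for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(4)
X_demo = rng.normal(size=(200, 10))                        # synthetic features
# Only features 0 and 1 carry signal; the other eight are pure noise
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.5 * rng.normal(size=200)

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=10.0).fit(X_demo, y_demo)

print("lasso nonzero coefficients:", np.count_nonzero(lasso_demo.coef_))
print("ridge nonzero coefficients:", np.count_nonzero(ridge_demo.coef_))
```

Because the lasso drops features entirely, it is often used as a rough form of variable selection.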

# Logistic Regression
Use Diabetes as the response variable and the other variables as predictors. Standardize the numeric columns. 

In [42]:
from sklearn.linear_model import LogisticRegression

y = df.Diabetes_Yes.values
X = df.drop('Diabetes_Yes', axis=1).values
num_cols = [0, 1, 2, 3, 4, 5]
scaler = StandardScaler()
X[:, num_cols] = scaler.fit_transform(X[:, num_cols])

Print the `y` vector. 


In [43]:
y

array([0, 0, 0, ..., 0, 0, 0], dtype=uint8)

Fit a logistic regression model (with no regularization) to predict diabetes status from the other variables. Print the coefficients.  

**NOTE: In the video I set `C=10000` to make the regularization negligible. Older releases of scikit-learn accept `penalty='none'` to turn regularization off; from version 1.2 onward the equivalent is `penalty=None`.**

In [46]:
lr=LogisticRegression(C=10000)
lr.fit(X,y)

LogisticRegression(C=10000, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [48]:
lr.coef_

array([[ 0.68092679, -0.4651806 ,  0.75231187,  0.16409294, -0.13773492,
        -0.22438786,  0.38726601, -0.39333188, -0.16258423]])

Print the binary predictions and probability predictions from the model. 

In [50]:
lr.predict(X)

array([0, 0, 0, ..., 0, 0, 0], dtype=uint8)

In [51]:
lr.predict_proba(X)

array([[0.8925019 , 0.1074981 ],
       [0.94122592, 0.05877408],
       [0.88885022, 0.11114978],
       ...,
       [0.90976258, 0.09023742],
       [0.55235967, 0.44764033],
       [0.95881831, 0.04118169]])
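The two kinds of prediction are linked. The second column of `predict_proba` is the logistic (sigmoid) function applied to the linear predictor, and `predict` returns class 1 when that probability exceeds 0.5. The sketch below checks both relationships on synthetic data. It uses the default `LogisticRegression` settings (which include L2 regularization), and the `sigmoid` helper is defined here for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def sigmoid(z):
    """Logistic function, mapping the linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(5)
X_demo = rng.normal(size=(300, 3))                       # synthetic features
p_true = sigmoid(1.5 * X_demo[:, 0] - 1.0)
y_demo = (rng.random(300) < p_true).astype(int)          # synthetic binary response

clf = LogisticRegression().fit(X_demo, y_demo)
proba = clf.predict_proba(X_demo)                        # columns: P(y=0), P(y=1)
labels = clf.predict(X_demo)

manual_p1 = sigmoid(X_demo @ clf.coef_.ravel() + clf.intercept_[0])

print(np.allclose(proba[:, 1], manual_p1))
print(np.array_equal(labels, (proba[:, 1] > 0.5).astype(int)))
```

When the positive class is rare, as with diabetes here, few probabilities exceed 0.5, which is why the binary predictions are mostly 0.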