# Linear Models

In this lesson, we will look at regression and classification using linear models. For regression, we will be predicting housing prices using the Boston Housing dataset. For classification, we will be predicting survival using the Titanic dataset.

We will make use of both scikit-learn and TensorFlow.

# Regression

In [1]:
# load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [4]:
# read in the Boston housing data
boston_df = pd.read_csv('assets/Boston/train.csv', index_col='ID')

In [5]:
# what columns are contained in this dataset? what are the types? are there any null values?
boston_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 333 entries, 1 to 506
Data columns (total 14 columns):
crim       333 non-null float64
zn         333 non-null float64
indus      333 non-null float64
chas       333 non-null int64
nox        333 non-null float64
rm         333 non-null float64
age        333 non-null float64
dis        333 non-null float64
rad        333 non-null int64
tax        333 non-null int64
ptratio    333 non-null float64
black      333 non-null float64
lstat      333 non-null float64
medv       333 non-null float64
dtypes: float64(11), int64(3)
memory usage: 39.0 KB


The data has 333 records, every column is numeric, and there are no null entries. It's our lucky day.
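If we would rather not read the `info()` output by eye, the same two checks can be done programmatically. This is a minimal sketch on a made-up `toy_df` (a hypothetical stand-in, not the Boston data), just to show the calls involved.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in column 'a'
toy_df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [1, 2, 3]})

# Count missing values per column
null_counts = toy_df.isnull().sum()
print(null_counts)

# Check that every column has a numeric dtype
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in toy_df.dtypes)
print(all_numeric)
```

On the real data, the same calls would be made on `boston_df` instead of `toy_df`.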

In [6]:
# take a peek at the records to see what they look like
boston_df.head()

ID,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
1,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
2,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
4,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
5,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
7,0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9


In [7]:
boston_df.tail()

ID,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
500,0.17783,0.0,9.69,0,0.585,5.569,73.5,2.3999,6,391,19.2,395.77,15.1,17.5
502,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0,391.99,9.67,22.4
503,0.04527,0.0,11.93,0,0.573,6.12,76.7,2.2875,1,273,21.0,396.9,9.08,20.6
504,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.9,5.64,23.9
506,0.04741,0.0,11.93,0,0.573,6.03,80.8,2.505,1,273,21.0,396.9,7.88,11.9


# scikit-learn
Our target (label) column is `medv`. To use scikit-learn, we will split our data into two parts: a training set and a validation set. A common ratio is 70% for training and 30% for validation. scikit-learn lets us create this split using `train_test_split()`.

Before splitting, we need to separate our `boston_df` into features and labels!

In [10]:
X = boston_df.drop('medv', axis=1) # this returns a DataFrame
y = boston_df['medv'] # this returns a Series

In [13]:
from sklearn.model_selection import train_test_split

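# NOTE: train_size=0.3 puts only 30% of the rows in the training set.
# For the 70/30 split described above, pass test_size=0.3 instead;
# doing so will change the scores shown in the rest of this lesson.
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42, train_size=0.3)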
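It is easy to mix up `train_size` and `test_size`, so here is a small sketch of how the two arguments behave. The arrays `X_toy` and `y_toy` are made up for illustration and have nothing to do with the Boston data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 2 features
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# test_size=0.3 holds out about 30% of the rows for validation
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42)
print(len(X_tr), len(X_va))

# train_size=0.3 does the opposite: only about 30% of the rows are used for training
X_tr2, X_va2, y_tr2, y_va2 = train_test_split(
    X_toy, y_toy, train_size=0.3, random_state=42)
print(len(X_tr2), len(X_va2))
```

In both calls every row ends up in exactly one of the two sets; only the proportions differ.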



We now have our training and validation datasets, and are ready to train our first linear model.

We will use scikit-learn to build a Linear Regression model using all of our features.

In [14]:
from sklearn.linear_model import LinearRegression

linearModel = LinearRegression()
linearModel.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

_That was simple, we have trained a Machine Learning model!_

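Is the model any good? We can find out by scoring the model. For a regressor, `score()` returns the coefficient of determination (R²), where 1.0 is a perfect fit.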

In [15]:
linearModelScore = linearModel.score(X_val, y_val)
linearModelScore

0.6649640156673967
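To see what `score()` is measuring, we can compare it with `r2_score` from `sklearn.metrics`. The sketch below uses synthetic data (the arrays `X_demo` and `y_demo` are assumptions made up for this example), so the numbers it prints will not match the Boston result above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical synthetic data: y is a linear function of two features plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Fit on the first 150 rows, hold out the last 50
model = LinearRegression().fit(X_demo[:150], y_demo[:150])

# score() on a regressor computes R^2 on the data it is given
score = model.score(X_demo[150:], y_demo[150:])
r2 = r2_score(y_demo[150:], model.predict(X_demo[150:]))
print(score, r2)
```

The two printed values should agree, which confirms that `score()` is a shortcut for predicting and then computing R².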

We can make predictions by calling `predict()` on our model.

In [16]:
linearModelPredictions = linearModel.predict(X_val)

In [17]:
print(type(linearModelPredictions))

<class 'numpy.ndarray'>


In [18]:
linearModelPredictions

array([ 2.26235712e+01,  2.34212241e+01,  2.37586031e+01,  3.23650484e+01,
        2.42125794e+01,  1.61880600e+01,  1.87279403e+01,  3.10467105e+01,
        1.58209827e+01,  2.39169645e+01,  2.70319068e+01,  2.00346550e+01,
        1.93732446e+01,  3.44786844e+01,  2.31527662e+01,  3.53225705e+01,
        2.26105398e+01,  1.43837813e+01,  2.66898889e+01,  1.58980379e+01,
        3.56766500e+01,  3.17486222e+01,  2.24073651e+01,  2.79839745e+01,
        1.55038571e+01,  3.87740295e+01,  2.77790326e+00,  3.99005215e-02,
        3.04058173e+01,  1.00014711e+01,  1.88600247e+01,  2.01080332e+01,
        2.84773446e+01,  8.81440292e+00,  1.92562123e+01,  1.17036958e+01,
        2.65171844e+01,  3.24174746e+00,  1.65123971e+01,  2.49366431e+01,
        2.26242956e+01,  2.11199715e+01,  2.46397951e+01,  4.01565956e+01,
        3.49345881e+01,  2.18745406e+01,  1.14354196e+01,  2.07182010e+01,
        1.27531309e+01,  1.95981203e+01,  1.06047576e+01,  2.92340621e+01,
        2.18628778e+01,  

Let's compare our predicted values to our ground truths.

In [20]:
comp_df = pd.DataFrame({'y': y_val, 'y_pred': linearModelPredictions})
comp_df.head(n=10)

ID,y,y_pred
44,24.7,22.623571
472,19.6,23.421224
109,19.8,23.758603
293,27.9,32.365048
85,23.9,24.212579
458,13.5,16.18806
435,11.7,18.72794
267,30.7,31.046711
317,17.8,15.820983
297,27.1,23.916965


In all honesty, we shouldn't judge the model by eye. We should use a metric like MAE (mean absolute error) or RMSE (root mean squared error) to compare one model to the next. Let's import another function from scikit-learn to give us a hand.

In [21]:
import math
from sklearn.metrics import mean_squared_error

In [23]:
mse = mean_squared_error(y_val, linearModelPredictions)
rmse = math.sqrt(mse)
print('RMSE = {}'.format(rmse))

RMSE = 5.321052760278853
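MAE, the other metric mentioned earlier, can be computed the same way with `mean_absolute_error`. As a small sketch, the snippet below uses only the first three rows of the comparison table above, so its numbers describe those three rows and not the whole validation set.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# First three rows (IDs 44, 472, 109) of the comparison table
y_true = np.array([24.7, 19.6, 19.8])
y_hat = np.array([22.623571, 23.421224, 23.758603])

# MAE averages the absolute errors
mae = mean_absolute_error(y_true, y_hat)

# RMSE squares the errors before averaging, so large misses count for more
rmse = np.sqrt(mean_squared_error(y_true, y_hat))

print('MAE = {:.3f}, RMSE = {:.3f}'.format(mae, rmse))
```

RMSE is never smaller than MAE on the same predictions, and the gap between them grows when a few errors are much larger than the rest.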


On its own, this RMSE does not tell us much beyond the typical size of our errors, expressed in the same units as `medv`. However, we can use it as a baseline. Let's try to improve it.

Right now, what we have is a linear model. Let's build a polynomial model. What that means is, we will **engineer** new features. We will keep it simple by taking each feature and crossing it with only itself, essentially creating a square of each feature.
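One way to square each feature with pandas is sketched below. The frame `X_small` is a tiny stand-in built from the `rm` and `lstat` values in the first rows shown earlier, and the `_sq` suffix is a naming choice assumed here for illustration, not necessarily the one this lesson uses.

```python
import pandas as pd

# Tiny stand-in for the feature DataFrame
X_small = pd.DataFrame({'rm': [6.575, 6.421, 6.998],
                        'lstat': [4.98, 9.14, 2.94]})

# Square every column and suffix the new names so they do not clash
X_squared = X_small.pow(2).add_suffix('_sq')

# Keep the original features and append the squared ones
X_poly = pd.concat([X_small, X_squared], axis=1)
print(X_poly.columns.tolist())
```

The model fitted on such a frame is still linear in its coefficients; only the inputs have been expanded.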

