**DS 301: Applied Data Modeling and Predictive Analysis**

# Lab 4 – Polynomial Regression

Nok Wongpiromsarn, 18 September 2020

**Credit:** https://github.com/asukul/DS301-f19/blob/master/Lab3_Polynomial-Regression_HousePrice-v2.ipynb by Adisak Sukul

- Portions of the code and theory are taken from the book Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by A. Géron.
- Portions of the visualization are taken from the Kaggle kernel "Comprehensive Data Exploration with Python" by Pedro Marcelino, February 2017: https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python

**Instructions:**
Please go over the sample code shown below and use it as a reference for your lab assignment. Perform linear and polynomial regression with 'SalePrice' as the output using the following selected features:
1. 'Year Built'
   1. Set up the training and test sets with 'YearBuilt' as input and 'SalePrice' as output.
   2. Perform linear regression and evaluate your linear regression model with MSE and RMSE.
   3. Perform polynomial regression and evaluate your polynomial regression model with MSE and RMSE. Determine the polynomial degree and explain your choice. (Hint: Use the MSE and RMSE to pick the polynomial degree.)
   4. Retrain your polynomial model by applying one of the regularization techniques. Keep the same polynomial degree. Try with at least 3 values of alpha. Don't forget to scale the data! Evaluate your new model.
   5. Plot the results of the 5 models.
2. 'Year Built' and 'Overall Quality'
   1. Set up the training and test sets with 'YearBuilt' and 'OverallQual' as input and 'SalePrice' as output.
   2. Perform linear regression and evaluate your linear regression model with MSE and RMSE.
   3. Perform polynomial regression with degree=3 and evaluate your polynomial regression model with MSE and RMSE.
   4. Retrain your polynomial model by applying one of the regularization techniques. Keep the same polynomial degree. Try with at least 3 values of alpha. Don't forget to scale the data! Evaluate your new model.
   5. Plot the results of all the 5 models.
3. Describe and compare the results of the different models.
4. Explain the computation time for the different models and features.

**Visualize the data**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("datasets/house-price.csv")

In [None]:
df.head(20)

In [None]:
df.columns

In [None]:
df['SalePrice'].describe()

In [None]:
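# Histogram of the target variable
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable plot
sns.histplot(df['SalePrice'], kde=True);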

In [None]:
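# Correlation matrix, restricted to numeric columns
# (recent pandas versions raise an error if non-numeric columns are passed to corr())
corrmat = df.select_dtypes(include='number').corr()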
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);

In [None]:
#saleprice correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

In [None]:
# Pairplot (scatterplot matrix) on a smaller set of features
# Select only features that have a high correlation with our target, SalePrice
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(df[cols], height=2.5)
plt.show()

In [None]:
#scatterplot
sns.set()
cols = ['SalePrice', 'OverallQual', 'YearBuilt']
sns.pairplot(df[cols], height = 2.5)
plt.show()

### 1. Start with YearBuilt as input and SalePrice as output

**1.1 Set up the training and test sets**

In [None]:
from sklearn.model_selection import train_test_split

X = df['YearBuilt']
X = X.values.reshape(-1,1)
y = df['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train)

**1.2 Linear regression**

Train the model

In [None]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

In [None]:
# Plot the result
X_plot = np.linspace(1870, 2010, 292).reshape(292, 1)
y_plot_linear = lin_reg.predict(X_plot)

plt.plot(X, y, "b.")
plt.plot(X_plot, y_plot_linear, "r.")
plt.show()

Evaluate the model

In [None]:
y_pred_linear = lin_reg.predict(X_test)

In [None]:
# MSE
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred_linear)
print("MSE: {}".format(mse))

# RMSE is the square root of the MSE, so reuse the value computed above
from math import sqrt
rmse = sqrt(mse)
print("RMSE: {}".format(rmse))

**1.3 Polynomial regression**

Experiment with the polynomial degree in the cell below, compare the test MSE and RMSE for each value, and pick the degree that performs best.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# Add the square of each feature in the training set as a new feature
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X_train)

# X_poly now contains the original feature of X_train plus the square of this feature
print(X_poly)

# Now fit a LinearRegression model to this extended training data
poly_reg = LinearRegression()
poly_reg.fit(X_poly, y_train)

In [None]:
# Plot the result
X_plot_poly = poly_features.transform(X_plot)
y_plot_poly = poly_reg.predict(X_plot_poly)

plt.plot(X, y, "b.")
plt.plot(X_plot, y_plot_poly, "r.")
plt.show()

Evaluate the model

In [None]:
X_test_poly = poly_features.transform(X_test)
y_pred_poly = poly_reg.predict(X_test_poly)

In [None]:
mse = mean_squared_error(y_test, y_pred_poly)
rmse = sqrt(mse)  # RMSE is the square root of the MSE
print("MSE: {}".format(mse))
print("RMSE: {}".format(rmse))
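
One way to compare polynomial degrees is to loop over a few candidates and record the test MSE and RMSE for each. The following is only a minimal sketch under stated assumptions: it uses synthetic data as a stand-in for YearBuilt and SalePrice, the names `X_demo`, `y_demo`, and `degree_rmse` are illustrative and not part of the lab template, and it standardizes the input before the polynomial expansion, which is a choice made here for numerical stability rather than something the template does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# ASSUMPTION: synthetic stand-in data; replace X_demo / y_demo with the lab's X and y.
rng = np.random.default_rng(42)
X_demo = rng.uniform(1870, 2010, size=(300, 1))
y_demo = 50000 + 15 * (X_demo[:, 0] - 1870) ** 2 + rng.normal(0, 20000, size=300)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

degree_rmse = {}
for degree in range(1, 6):
    # Scaling first keeps high powers of values near 2000 numerically manageable
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(Xd_train, yd_train)
    test_mse = mean_squared_error(yd_test, model.predict(Xd_test))
    degree_rmse[degree] = np.sqrt(test_mse)
    print("degree={}  test MSE={:.1f}  test RMSE={:.1f}".format(
        degree, test_mse, degree_rmse[degree]))

# The degree with the lowest test RMSE is one reasonable candidate to report
best_degree = min(degree_rmse, key=degree_rmse.get)
print("Lowest test RMSE at degree", best_degree)
```

If several degrees give nearly the same RMSE, preferring the smallest of them is a common tie-breaker because it is less prone to overfitting.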

**1.4 Regularized polynomial regression**

Feel free to pick your favourite regularization technique (the template uses Ridge Regression). Try at least 3 different values of alpha.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# First, we add polynomial features
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X_train)

# Then, apply scaling. This is very important when performing regularization
scaler = StandardScaler()
X_poly_scaled = scaler.fit_transform(X_poly)

# Train Ridge Regression model
ridge_reg = Ridge(alpha=0.05, solver="cholesky")
ridge_reg.fit(X_poly_scaled, y_train)

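# Plot the result
# Transform X_plot with the poly_features fitted in this cell, so the degree always matches the scaler
X_plot_ridge = scaler.transform(poly_features.transform(X_plot))
y_plot_ridge = ridge_reg.predict(X_plot_ridge)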

plt.plot(X, y, "b.")
plt.plot(X_plot, y_plot_ridge, "r.")
plt.show()

# Evaluate the model
# Apply the same polynomial expansion and scaling (fitted on the training set) to the test set
X_test_ridge = scaler.transform(poly_features.transform(X_test))
y_pred_ridge = ridge_reg.predict(X_test_ridge)
mse = mean_squared_error(y_test, y_pred_ridge)
rmse = sqrt(mse)
print("MSE: {}".format(mse))
print("RMSE: {}".format(rmse))
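
Trying several values of alpha can be done in a loop. The sketch below is one possible way to do it, not the required solution: it chains the polynomial expansion, scaling, and Ridge Regression with scikit-learn's `make_pipeline`, runs on synthetic stand-in data, and uses alpha values (0.05, 1.0, 100.0) that are arbitrary examples. The names `X_demo`, `y_demo`, `ridge_rmse`, and `coef_norm` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# ASSUMPTION: synthetic stand-in data; replace X_demo / y_demo with the lab's X and y.
rng = np.random.default_rng(0)
X_demo = rng.uniform(1870, 2010, size=(300, 1))
y_demo = 50000 + 15 * (X_demo[:, 0] - 1870) ** 2 + rng.normal(0, 20000, size=300)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

alphas = [0.05, 1.0, 100.0]  # arbitrary example values
ridge_rmse = {}
coef_norm = {}
for alpha in alphas:
    # Same order as the template: polynomial features, then scaling, then Ridge
    ridge_model = make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False),
        StandardScaler(),
        Ridge(alpha=alpha, solver="cholesky"),
    )
    ridge_model.fit(Xd_train, yd_train)
    test_mse = mean_squared_error(yd_test, ridge_model.predict(Xd_test))
    ridge_rmse[alpha] = np.sqrt(test_mse)
    # A larger alpha shrinks the coefficients more strongly
    coef_norm[alpha] = np.linalg.norm(ridge_model.named_steps["ridge"].coef_)
    print("alpha={}  test MSE={:.1f}  test RMSE={:.1f}  |coef|={:.1f}".format(
        alpha, test_mse, ridge_rmse[alpha], coef_norm[alpha]))
```

Because the pipeline is fitted on the training set only, the scaler's mean and variance never see the test data, which matches the fit-on-train, transform-on-test pattern used in the template.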

**1.5 Plot the results of the 5 models**

In [None]:
plt.plot(X, y, "b.")
plt.plot(X_plot, y_plot_linear, '-', linewidth=3, label="Linear Regression")
plt.plot(X_plot, y_plot_poly, '--', linewidth=3, label="Polynomial Regression")
plt.plot(X_plot, y_plot_ridge, ':', linewidth=3, label="Ridge Regression")

# Add the curves from the regularized polynomial models trained with the other alpha values here

# Call legend() after all curves are plotted so that every label appears
plt.legend()
plt.show()

### 2. Use both YearBuilt and OverallQual as input and SalePrice as output
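
With two input features, a degree-3 polynomial expansion also produces cross terms, so the number of columns grows quickly. The sketch below illustrates this on two made-up rows standing in for (YearBuilt, OverallQual); the values and the names `X_two` and `poly3` are hypothetical. With the lab's DataFrame, the input would presumably be selected as `df[['YearBuilt', 'OverallQual']].values`, which needs no reshape because it is already two-dimensional.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# ASSUMPTION: two made-up rows standing in for (YearBuilt, OverallQual)
X_two = np.array([[2003.0, 7.0],
                  [1976.0, 6.0]])

poly3 = PolynomialFeatures(degree=3, include_bias=False)
X_two_poly = poly3.fit_transform(X_two)

# 2 original features expand to 9 polynomial features (all terms of degree 1 to 3)
print(X_two_poly.shape)  # (2, 9)

# powers_ lists the exponent of each input feature in every output column
for exp_year, exp_qual in poly3.powers_:
    print("YearBuilt^{} * OverallQual^{}".format(exp_year, exp_qual))
```

The rest of the workflow (train/test split, fitting, scaling before regularization, MSE and RMSE) can follow the same pattern as in Part 1.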

### 3. Describe and compare the results with different models

### 4. Explain the computation time for different models and features
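
Training time can be measured by wrapping each `fit` call with `time.perf_counter()`. The following is a rough sketch on synthetic two-feature data; the data, the model settings, and the names `candidates` and `fit_seconds` are assumptions for illustration. Timings vary between machines and runs, so they should be read as relative indications only.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# ASSUMPTION: synthetic stand-in data with two input features
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(2000, 2))
y_demo = X_demo[:, 0] ** 2 + 3 * X_demo[:, 1] + rng.normal(0, 0.1, size=2000)

candidates = {
    "linear": make_pipeline(LinearRegression()),
    "polynomial (degree 3)": make_pipeline(
        PolynomialFeatures(degree=3, include_bias=False),
        LinearRegression(),
    ),
    "ridge polynomial (degree 3)": make_pipeline(
        PolynomialFeatures(degree=3, include_bias=False),
        StandardScaler(),
        Ridge(alpha=1.0),
    ),
}

fit_seconds = {}
for name, model in candidates.items():
    start = time.perf_counter()
    model.fit(X_demo, y_demo)
    # Elapsed wall-clock time for training, including the feature expansion
    fit_seconds[name] = time.perf_counter() - start
    print("{}: {:.4f} s".format(name, fit_seconds[name]))
```

In a notebook, the `%timeit` magic is an alternative that repeats the measurement several times and reports an average.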