# Linear Regression - Gradient Descent


In this notebook, we investigate an iterative optimization method known as gradient descent (GD) to solve the Linear Regression problem. 


<font color=red size=3> **Application Scenario:**</font> 
The GD approach is suitable for the following scenario.
- Dataset: can be very large (unable to fit into computer's memory)
- No. of Features: Large
- Relationship between input (features) and output (target): Linear & Nonlinear (Polynomial Regression)
- Out-of-core support: Out-of-core algorithms can handle vast quantities of data that cannot fit in a computer’s main memory. An out-of-core learning algorithm chops the data into mini-batches and uses online learning techniques to learn from these mini-batches.


## Tasks
We perform the following tasks.
- Task 1: Implement GD on a Linear Regression Model
- Task 2: Implement GD on a Polynomial Regression Model

We will see that the Polynomial Regression model has more power to discover the nonlinear pattern in the dataset. However, it suffers from the overfitting problem (high-variance and less generalizable). By implementing a regularized Polynomial Regression model using GD, we can obtain a better generalizable solution. Note that we obtained a similar optimal solution in notebook 3 using the OLS regularized Polynomial Regression model.



## Gradient Descent
The gradient descent-based methods are used when there are a large number of features or too many training instances to fit in memory.

There are three variants of the Gradient Descent Algorithm.
- Batch Gradient Descent
- Stochastic Gradient Descent (SGD)
- Mini-batch Gradient Descent

Scikit-Learn provides only the Stochastic Gradient Descent model. In this notebook, we will see how to use sklearn's SGDRegressor.
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html


### Batch Gradient Descent
Batch Gradient Descent uses the whole batch of training data at every step. As a result, it is slow on very large training sets.


### Stochastic Gradient Descent

The main problem with Batch Gradient Descent is the fact that it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large. At the opposite extreme, Stochastic Gradient Descent (SGD) just picks a random instance in the training set at every step and computes the gradients based only on that single instance. Obviously this makes the algorithm much faster since it has very little data to manipulate at every iteration. It also makes it possible to train on huge training sets, since only one instance needs to be in memory at each iteration.

On the other hand, due to its stochastic (i.e., random) nature, this algorithm is much less regular than Batch Gradient Descent. Instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average. Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around, never settling down. So once the algorithm stops, the final parameter values are good, but not optimal.


## Implementation Issues:

We need to consider the following two issues when using the gradient descent algorithms.

####  Learning Rate
If the learning rate is set to a too-low value, the algorithm will eventually reach the solution. But it will take a long time. On the other hand, if the learning rate is too high, the algorithm diverges, jumping all over the place and actually getting further and further away from the solution at every step.

To find a good learning rate, we can use a grid search. However, we may want to limit the number of iterations so that grid search can eliminate models that take too long to converge.


#### No. of Iterations
Setting the number of iterations is tricky. If it is too low, we will still be far away from the optimal solution when the algorithm stops. But if it is too high, we will waste time while the model parameters do not change anymore. A simple solution is to set a very large number of iterations but to interrupt the algorithm when the gradient vector becomes small. That is when its norm becomes smaller than a tiny number $\epsilon$ (called the tolerance). This happens when Gradient Descent has (almost) reached the minimum.


## Dataset

We use the Boston housing dataset that provides housing values in the suburbs of Boston.

URL: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html#sklearn.datasets.load_boston


The **MEDV** variable is the target variable.

### Data description

The Boston data frame has 506 rows and 14 columns.

This data frame contains the following columns:

- CRIM: per capita crime rate by town.

- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.

- INDUS: proportion of non-retail business acres per town.

- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

- NOX: nitrogen oxides concentration (parts per 10 million).

- RM: average number of rooms per dwelling.

- AGE: proportion of owner-occupied units built prior to 1940.

- DIS: weighted mean of distances to five Boston employment centers.

- RAD: index of accessibility to radial highways.

- TAX: full-value property-tax rate per $10,000.

- PTRATIO: pupil-teacher ratio by town.

- B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

- LSTAT: lower status of the population (percent).

- MEDV: median value of owner-occupied homes in $1000s.

In [1]:
import warnings
import time
import numpy as np
import pandas as pd

from sklearn import datasets
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error, r2_score

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd


## Load Data as a Pandas DataFrame Object


In [2]:
# URL for the dataset
data_url = "http://lib.stat.cmu.edu/datasets/boston"

# Read the dataset
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)

# Feature names 
feature_names = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX",
    "PTRATIO", "B", "LSTAT"
]

# Extract features and target
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])  # Features
target = raw_df.values[1::2, 2]  # Target

# Create a DataFrame with feature names
df = pd.DataFrame(data, columns=feature_names)

# Add target column 'MEDV' to the DataFrame
df['MEDV'] = target

# Display the DataFrame shape, feature names, and target array shape
print("Dataset size: ", df.shape)
print("Feature Names: ", df.columns.tolist())


#Display the top five rows
df.head()

Dataset size:  (506, 14)
Feature Names:  ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']


Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


# Description of the Data

DataFrame’s info() method is useful to get a quick description of the data, in particular the total number of rows, and each attribute’s type and number of non-null values.


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(14)
memory usage: 55.5 KB


# Explore the Data: Describe Numerical Attributes

DataFrame's describe() method shows a summary of the numerical attributes.

In [4]:
df.describe()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.613524,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534,356.674032,12.653063,22.532806
std,8.601545,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946,91.294864,7.141062,9.197104
min,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73,5.0
25%,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4,375.3775,6.95,17.025
50%,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05,391.44,11.36,21.2
75%,3.677083,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2,396.225,16.955,25.0
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97,50.0


# Looking for Numerical Correlations with the Target Column

Since the dataset is not too large, we can easily compute the standard correlation coefficient (also called Pearson’s r) between every pair of attributes using DataFrame's corr() method.


In [5]:
# Variable Correlations with the target "MEDV"
df.corr()['MEDV'].sort_values(ascending=False)

MEDV       1.000000
RM         0.695360
ZN         0.360445
B          0.333461
DIS        0.249929
CHAS       0.175260
AGE       -0.376955
RAD       -0.381626
CRIM      -0.388305
NOX       -0.427321
TAX       -0.468536
INDUS     -0.483725
PTRATIO   -0.507787
LSTAT     -0.737663
Name: MEDV, dtype: float64

# Create a Separate Feature Set (Data Matrix X) and Target (1D Array y)

Create a data matrix (X) that contains all features and a 1D target array (y) containing the target.

First, we create separate data frame objects for X and y. Then, we convert the data frame objects into arrays.

In [6]:
# Make a deep copy of the data frame object for later use
allData = df.copy()

# Create separate data frame objects for X (features) and y (target)
X = df.drop(columns='MEDV')  
y = df['MEDV'] 


X = np.asarray(X) # Data Matrix containing all features excluding the target
y = np.asarray(y) # 1D target array


print("Data Matrix (X) Shape: ", X.shape)
print("Label Array (y) Shape: ", y.shape)

print("\nData Matrix (X) Type: ", X.dtype)
print("Label Array (y) Type: ", y.dtype)

Data Matrix (X) Shape:  (506, 13)
Label Array (y) Shape:  (506,)

Data Matrix (X) Type:  float64
Label Array (y) Type:  float64


# Create Train and Test Dataset

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale The Features

We should ensure that all features have a similar scale. Otherwise, optimization algorithms (e.g., Gradient Descent based algorithms) will take much longer time to converge.

Also, regularization techniques are sensitive to the scale of data. Thus, we must scale the features before applying regularization.

In [8]:
scaler = StandardScaler()

# Fit on the training set only.
scaler.fit(X_train)

# Apply transform to both the training set and the test set.
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

## Linear Regression using Sklearn's Stochastic Gradient Descent (SGD) Model


Sklearn provides a SGDRegressor model. We use it to perform regularized regression.


### Regularization

Regularization is an effective technique for reducing overfitting.

For a linear model, regularization is typically achieved by constraining the weights of the model. We will now look at three different ways to constrain the weights.

- Ridge Regression ($l_2$ norm)
- Lasso Regression ($l_1$ norm)
- Elastic Net (it combines $l_1$ and $l_2$ priors as regularizer)


### Regularization: A Common Hyperparameter

In all regularization methods, we need to use the hyperparameter $\alpha$. It controls how much we want to regularize the model. 

- If $\alpha = 0$ then Ridge Regression is just Linear Regression. 
- If $\alpha$ is very large, then all weights end up very close to zero and the result is a flat line going through the data’s mean. 

## Evaluation Metrics

We use two evaluation metrics.

- Mean Squared Error (MSE)
- Coefficient of Determination or $R^2$ or $r^2$


### Note on $R^2$:
R-squared is a statistical measure of how close the data are to the fitted regression line. 

R-squared measures the proportion of the variance in the dependent variable that is predictable from the independent variable(s).

R-squared = Explained variation / Total variation

R-squared is always between 0 and 100%:

- 0% indicates that the model explains none of the variability of the response data around its mean.
- 100% indicates that the model explains all the variability of the response data around its mean.

In general, the higher the R-squared, the better the model fits your data. 

#### Compute $R^2$ using the sklearn:

- The "score" function of the OLS Linear Regression object
- The "r2_score" function from sklearn.metrics

#### Compute MSE using the sklearn:

- The "mean_squared_error" function from sklearn.metrics


## SGD Regressor



In the Stochastic Gradient Descent algorithm the gradient of the loss is estimated for each sample at a time and the model is updated along the way with a decreasing strength schedule (aka learning rate).

The regularizer is a penalty added to the loss function that shrinks model parameters towards the zero vector using either the squared Euclidean norm $l_2$ or the absolute norm $l_1$ or a combination of both (Elastic Net). If the parameter update crosses the 0.0 value because of the regularizer, the update is truncated to 0.0 to allow for learning sparse models and achieve online feature selection.


We need to set the following hyperparameters.


- penalty : str, ‘none’, ‘l2’, ‘l1’, or ‘elasticnet’

The penalty (aka regularization term) to be used. Defaults to ‘l2’ which is the standard regularizer for linear SVM models. ‘l1’ and ‘elasticnet’ might bring sparsity to the model (feature selection) not achievable with ‘l2’.

- alpha : float

Constant that multiplies the regularization term. Defaults to 0.0001 Also used to compute learning_rate when set to ‘optimal’.

- l1_ratio : float

The Elastic Net mixing parameter, with 0 <= l1_ratio <= 1. l1_ratio=0 corresponds to L2 penalty, l1_ratio=1 to L1. Defaults to 0.15.

- learning_rate : string, optional

The learning rate schedule:

        -- ‘constant’: eta = eta0

        -- ‘optimal’: eta = 1.0 / (alpha * (t + t0)) where t0 is chosen by a heuristic proposed by Leon Bottou.


        -- ‘invscaling’: [default] eta = eta0 / pow(t, power_t)


- eta0 : double

The initial learning rate for the ‘constant’ or ‘invscaling’ schedules. The default value is 0.0 as eta0 is not used by the default schedule ‘optimal’.


- max_iter : int, optional

The maximum number of passes over the training data (aka epochs). It only impacts the behavior in the fit method, and not the partial_fit. Defaults to 5. 


- tol : float or None, optional

The stopping criterion. If it is not None, the iterations will stop when (loss > previous_loss - tol). Defaults to None. Defaults to 1e-3 from 0.21.



## Task 1: SGD on a Linear Regression Model

First, we apply SGD for a Linear Regression model without hyperparameter tuning. We do this just to get a sense of the model's performance. Then, we perform hyperparameter tuning to determine optimal values for the hyperparameters.

In [9]:
%%time
# SGD Regression

# Create an SGDRegressor linear regression object
lin_reg_sgd = SGDRegressor()

# Train the model
lin_reg_sgd.fit(X_train, y_train)


# The intercept
print("\nIntercept: \n", lin_reg_sgd.intercept_)

# The coefficients
print("Coefficients: \n", lin_reg_sgd.coef_)

# The number of iterations
print("\nNumber of Iterations: \n", lin_reg_sgd.n_iter_)


print("\n----------------------------- Model Evaluation -----------------------------")

# Make prediction 
y_train_predicted_sgd = lin_reg_sgd.predict(X_train)
y_test_predicted_sgd = lin_reg_sgd.predict(X_test)


print("Training: Mean squared error: %.2f"
      % mean_squared_error(y_train, y_train_predicted_sgd))

print("Test: Mean squared error: %.2f"
      % mean_squared_error(y_test, y_test_predicted_sgd))


# Explained variance score: 1 is perfect prediction
print("\nTraining: Coefficient of determination r^2 variance score [1 is perfect prediction]: %.2f" % 
      r2_score(y_train, y_train_predicted_sgd))




# Explained variance score: 1 is perfect prediction
print("Test: Coefficient of determination r^2 variance score [1 is perfect prediction]: %.2f" % 
      r2_score(y_test, y_test_predicted_sgd))


Intercept: 
 [22.79627292]
Coefficients: 
 [-0.9559949   0.55043358  0.04867877  0.7252725  -1.90866413  3.18593862
 -0.19163129 -2.93060126  1.59767903 -1.08124482 -1.98547262  1.12447212
 -3.58201265]

Number of Iterations: 
 34

----------------------------- Model Evaluation -----------------------------
Training: Mean squared error: 21.71
Test: Mean squared error: 24.73

Training: Coefficient of determination r^2 variance score [1 is perfect prediction]: 0.75
Test: Coefficient of determination r^2 variance score [1 is perfect prediction]: 0.66
CPU times: user 1.8 ms, sys: 563 µs, total: 2.36 ms
Wall time: 1.74 ms


## Model Selection: Hyperparameter Tuning


A regression model is defined by a set of parameters: alpha, l1_ratio, etc. These are called hyperparameters.

We need to select the best model based on the optimal values of these hyperparameters. This process is called hyperparameter tuning.

The best way to do hyperparameter tuning is to use **cross-validation**.

We will use Scikit-Learn’s GridSearchCV to search the combinations of hyperparameter values that provide best performance.

We need to tell which hyperparameters we want the GridSearchCV to experiment with, and what values to try out. It will evaluate all the possible combinations of hyperparameter values, using cross-validation. 


### Important:

The GridSearchCV takes an argument to define the scoring metric (performance measure). 

See the list of possible scoring functions:
https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

For regression, we may use "neg_mean_squared_error" or "explained_variance" scoring function. 



## Linear Regression: Hyperparameter Tuning for SGD Regressor

In [10]:
%%time

warnings.filterwarnings('ignore')

# The param_grid tells Scikit-Learn to evaluate all combinations of the hyperparameter values
param_grid = {'alpha': [0.1, 0.01, 0.001, 0.0001], 'learning_rate': ["constant", "optimal", "invscaling"], 
              'l1_ratio': [1, 0.5, 0.2, 0], 'max_iter':[100, 500, 1000],'eta0': [0.01, 0.001],
              'loss': ['squared_loss', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive']}



sgd = SGDRegressor(penalty='elasticnet')

sgd_cv = GridSearchCV(sgd, param_grid, scoring='neg_mean_squared_error', cv=5, verbose=1, n_jobs=-1)
sgd_cv.fit(X_train, y_train)


params_optimal_sgd = sgd_cv.best_params_

print("Best Score (negative mean squared error): %f" % sgd_cv.best_score_)
print("Optimal Hyperparameter Values: ", params_optimal_sgd)
print("\n")

Fitting 5 folds for each of 1152 candidates, totalling 5760 fits




Best Score (negative mean squared error): -23.312295
Optimal Hyperparameter Values:  {'alpha': 0.001, 'eta0': 0.01, 'l1_ratio': 0.5, 'learning_rate': 'invscaling', 'loss': 'squared_epsilon_insensitive', 'max_iter': 1000}


CPU times: user 564 ms, sys: 116 ms, total: 681 ms
Wall time: 1.97 s




## Linear Regression: Select The Best Model for the SGD Regressor

Using the optimal hyperparameter values, create the best model.
Then, fit the model.



In [11]:
# SGD Regression

# Create an SGDRegressor linear regression object using the optimal hyperparameter values
lin_reg_sgd = SGDRegressor(**params_optimal_sgd)

# Start timing
start_time = time.time()

# Train the model
lin_reg_sgd.fit(X_train, y_train)

# Stop timing
end_time = time.time()

# Calculate elapsed time
elapsed_time_sgd = end_time - start_time
# Print elapsed time
print("\nSGD Training Time: {:.10f} seconds".format(elapsed_time_sgd))


# The intercept
print("Intercept: \n", lin_reg_sgd.intercept_)

# The coefficients
print("Coefficients: \n", lin_reg_sgd.coef_)

# The number of iterations
print("Number of Iterations: \n", lin_reg_sgd.n_iter_)


print("\n----------------------------- Model Evaluation -----------------------------")

# Make prediction 
y_train_predicted_sgd = lin_reg_sgd.predict(X_train)


print("Mean squared error: %.2f"
      % mean_squared_error(y_train, y_train_predicted_sgd))


# Explained variance score: 1 is perfect prediction
print("Coefficient of determination r^2 variance score [1 is perfect prediction]: %.2f" % r2_score(y_train, y_train_predicted_sgd))


SGD Training Time: 0.0008523464 seconds
Intercept: 
 [22.80561048]
Coefficients: 
 [-0.98168553  0.6630387   0.26156267  0.66624178 -1.99119449  2.97796324
 -0.14868721 -3.09489518  2.10497729 -1.43335146 -1.99236282  1.07688153
 -3.4967105 ]
Number of Iterations: 
 35

----------------------------- Model Evaluation -----------------------------
Mean squared error: 21.89
Coefficient of determination r^2 variance score [1 is perfect prediction]: 0.75


## Linear Regression: Evaluate Model Performance for SGD Regressor

Evaluate the model's performance using cross-validation. 

Use Scikit-Learn's cross_val_score function. 

Note that the "scoring" argument should be set based on the type of classification (binary/multiclass).

In [12]:
# Scoring Parameter for Regression:
# https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

scores = cross_val_score(lin_reg_sgd, X_train, y_train, scoring='neg_mean_squared_error', cv=10)
print(scores)

print("Negative Mean Squared Error: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[-14.08107966 -17.8460883  -29.77176478 -44.66878857 -22.17516146
 -26.22915252 -20.3273811  -20.94556011 -13.23036903 -33.91168374]
Negative Mean Squared Error: -24.32 (+/- 18.34)


## Linear Regression: Evaluate Model Performance using Test Data

In [13]:
# Make prediction using the test data
y_test_predicted = lin_reg_sgd.predict(X_test)


test_mse_linear = mean_squared_error(y_test, y_test_predicted)

print("Mean squared error: %.2f"
      % test_mse_linear)


# Explained variance score: 1 is perfect prediction
test_r2_linear = r2_score(y_test, y_test_predicted)
print("Coefficient of determination r^2 variance score [1 is perfect prediction]: %.2f" 
      % test_r2_linear)

Mean squared error: 23.68
Coefficient of determination r^2 variance score [1 is perfect prediction]: 0.68


## Task 2: SGD on a Polynomial Regression Model

We will use the optimal polynomial degree (=2) obtained from the previous notebook on this dataset.

First, we need to create the dataset again. Because previously we standardized it.

### Create a New Feature Set (Data Matrix X) and Target (1D Array y)

In [14]:
# Create separate data frame objects for X (features) and y (target)
X = allData.drop(columns='MEDV')  
y = allData['MEDV'] 


X = np.asarray(X) # Data Matrix containing all features excluding the target
y = np.asarray(y) # 1D target array


print("Data Matrix (X) Shape: ", X.shape)
print("Label Array (y) Shape: ", y.shape)

print("\nData Matrix (X) Type: ", X.dtype)
print("Label Array (y) Type: ", y.dtype)

Data Matrix (X) Shape:  (506, 13)
Label Array (y) Shape:  (506,)

Data Matrix (X) Type:  float64
Label Array (y) Type:  float64


## Create Train & Test Dataset

In [15]:
X_train_new, X_test_new, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Add Polynomial Features

In [16]:
# Variable that specifies the degree of the polynomial to be added to the feature vector
poly_degree = 2


# Train data: Add polynomial terms with the feature vector using the sklearn PolynomialFeatures class
# Bias should be excluded because by default SGDRegressor adds bia via the"fit_intercept" parameter
poly_features = PolynomialFeatures(degree=poly_degree, include_bias=False)
X_train_poly = poly_features.fit_transform(X_train_new)


print("No. of Original Features: ", X_train_new.shape[1])
print("No. of Augmented Features: ", X_train_poly.shape[1])

# Test data: Add polynomial terms with the feature vector using the sklearn PolynomialFeatures class
# Bias should be excluded because by default SGDRegressor adds bias via the"fit_intercept" parameter
poly_features = PolynomialFeatures(degree=poly_degree, include_bias=False)
X_test_poly = poly_features.fit_transform(X_test_new)

No. of Original Features:  13
No. of Augmented Features:  104


## Standardize the Data

In [17]:
scaler = StandardScaler()

# Fit on the training set only.
scaler.fit(X_train_poly)

# Apply transform to both the training set and the test set.
X_train_poly = scaler.transform(X_train_poly)
X_test_poly = scaler.transform(X_test_poly)

## Polynomial Regression: Hyperparameter Tuning for SGD Regressor

In [18]:
%%time
warnings.filterwarnings('ignore')

param_grid = {'alpha': [0.1, 0.01, 0.001], 'learning_rate': ["invscaling"], 
              'l1_ratio': [1, 0.5, 0.2, 0], 'max_iter':[100, 500, 1000],'eta0': [0.01, 0.001, 0.0001],
              'loss': ['squared_loss', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive']}


sgd = SGDRegressor(penalty='elasticnet')

sgd_cv = GridSearchCV(sgd, param_grid, scoring='neg_mean_squared_error', cv=10, verbose=2, n_jobs=-1)
sgd_cv.fit(X_train_poly, y_train)


params_optimal_sgd = sgd_cv.best_params_

print("Best Score (negative mean squared error): %f" % sgd_cv.best_score_)
print("Optimal Hyperparameter Values: ", params_optimal_sgd)
print("\n")



Fitting 10 folds for each of 432 candidates, totalling 4320 fits




[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=epsilon_insensitive, max_iter=100; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=epsilon_insensitive, max_iter=100; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=epsilon_insensitive, max_iter=1000; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=epsilon_insensitive, max_iter=1000; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=100; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=100; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=500; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=squa



[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=squared_loss, max_iter=100; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=squared_loss, max_iter=100; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=squared_loss, max_iter=500; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=huber, max_iter=1000; total time=   0.1s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=huber, max_iter=1000; total time=   0.1s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=0.5, learning_rate=invscaling, loss=epsilon_insensitive, max_iter=1000; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=0.5, learning_rate=invscaling, loss=epsilon_insensitive, max_iter=1000; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=0.5, learning_rate=invscaling, loss=epsilon_insensitive, max_iter=1000; total time=   0.0s
[CV] END alpha



[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=huber, max_iter=500; total time=   0.1s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=huber, max_iter=500; total time=   0.1s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=500; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=500; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=500; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=500; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=0.5, learning_rate=invscaling, loss=huber, max_iter=500; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=0.5, learning_rate=invscaling, loss=huber, max_iter=500; total time=   0

ence. Consider increasing max_iter to improve the fit.


[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=huber, max_iter=500; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=huber, max_iter=500; total time=   0.1s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=1000; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=1000; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=1000; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=1000; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=0.5, learning_rate=invscaling, loss=huber, max_iter=1000; total time=   0.1s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=0.5, learning_rate=invscaling, loss=huber, max_iter=1000; total tim



[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=squared_loss, max_iter=100; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=squared_loss, max_iter=100; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=squared_loss, max_iter=100; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=squared_loss, max_iter=100; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=squared_loss, max_iter=100; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=squared_loss, max_iter=500; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=squared_loss, max_iter=500; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=squared_loss, max_iter=500; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l



[CV] END alpha=0.01, eta0=0.001, l1_ratio=0, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=100; total time=   0.0s
[CV] END alpha=0.01, eta0=0.001, l1_ratio=0, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=100; total time=   0.0s
[CV] END alpha=0.01, eta0=0.001, l1_ratio=0, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=100; total time=   0.0s
[CV] END alpha=0.01, eta0=0.001, l1_ratio=0, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=100; total time=   0.0s
[CV] END alpha=0.01, eta0=0.001, l1_ratio=0, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=100; total time=   0.0s
[CV] END alpha=0.01, eta0=0.001, l1_ratio=0, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=100; total time=   0.0s
[CV] END alpha=0.01, eta0=0.001, l1_ratio=0, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=100; total time=   0.0s
[CV] END alpha=0.01, eta0=0.001, l



[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=huber, max_iter=1000; total time=   0.1s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=1, learning_rate=invscaling, loss=huber, max_iter=1000; total time=   0.1s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=0.5, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=100; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=0.5, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=100; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=0.5, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=100; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=0.5, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=100; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=0.5, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=500; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=0.5, learning_rate=invscaling, loss=squa



red_epsilon_insensitive, max_iter=500; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=0.5, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=500; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=0.5, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=500; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=0.5, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=500; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=0.5, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=500; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=0.5, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=1000; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=0.5, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=1000; total time=   0.0s
[CV] END alpha=0.1, eta0=0.01, l1_ratio=0.2, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_ite



 END alpha=0.001, eta0=0.001, l1_ratio=0.5, learning_rate=invscaling, loss=squared_epsilon_insensitive, max_iter=1000; total time=   0.1s
[CV] END alpha=0.001, eta0=0.001, l1_ratio=0.2, learning_rate=invscaling, loss=squared_loss, max_iter=100; total time=   0.0s
[CV] END alpha=0.001, eta0=0.001, l1_ratio=0.2, learning_rate=invscaling, loss=squared_loss, max_iter=100; total time=   0.0s
[CV] END alpha=0.001, eta0=0.001, l1_ratio=0.2, learning_rate=invscaling, loss=squared_loss, max_iter=100; total time=   0.0s
[CV] END alpha=0.001, eta0=0.001, l1_ratio=0.2, learning_rate=invscaling, loss=squared_loss, max_iter=100; total time=   0.0s
[CV] END alpha=0.001, eta0=0.001, l1_ratio=0.2, learning_rate=invscaling, loss=squared_loss, max_iter=100; total time=   0.0s
[CV] END alpha=0.001, eta0=0.001, l1_ratio=0.2, learning_rate=invscaling, loss=squared_loss, max_iter=100; total time=   0.0s
[CV] END alpha=0.001, eta0=0.001, l1_ratio=0.2, learning_rate=invscaling, loss=squared_loss, max_iter=100;



Best Score (negative mean squared error): -14.426809
Optimal Hyperparameter Values:  {'alpha': 0.001, 'eta0': 0.01, 'l1_ratio': 1, 'learning_rate': 'invscaling', 'loss': 'squared_epsilon_insensitive', 'max_iter': 1000}


CPU times: user 630 ms, sys: 116 ms, total: 746 ms
Wall time: 7.04 s




## Polynomial Regression: Select The Best Model for the SGD Regressor

Using the optimal hyperparameter values, create the best model.
Then, fit the model.



In [19]:
# SGD Regression

# Create an SGDRegressor linear regression object using the optimal hyperparameter values
lin_reg_sgd = SGDRegressor(**params_optimal_sgd)

# Start timing
start_time = time.time()

# Train the model
lin_reg_sgd.fit(X_train_poly, y_train)


# Stop timing
end_time = time.time()

# Calculate elapsed time
elapsed_time_sgd_poly = end_time - start_time
# Print elapsed time
print("\nPolynomial SGD Training Time: {:.10f} seconds".format(elapsed_time_sgd_poly))

# # The intercept
# print("Intercept: \n", lin_reg_sgd.intercept_)

# # The coefficients
# print("Coefficients: \n", lin_reg_sgd.coef_)

# The number of iterations
print("\nNumber of Iterations: \n", lin_reg_sgd.n_iter_)


print("\n----------------------------- Model Evaluation -----------------------------")

# Make prediction 
y_train_predicted_sgd = lin_reg_sgd.predict(X_train_poly)


print("Mean squared error: %.2f"
      % mean_squared_error(y_train, y_train_predicted_sgd))


# Explained variance score: 1 is perfect prediction
print("Coefficient of determination r^2 variance score [1 is perfect prediction]: %.2f" % r2_score(y_train, y_train_predicted_sgd))


Polynomial SGD Training Time: 0.0023150444 seconds

Number of Iterations: 
 33

----------------------------- Model Evaluation -----------------------------
Mean squared error: 11.61
Coefficient of determination r^2 variance score [1 is perfect prediction]: 0.87


## Polynomial Regression: Evaluate Model Performance using Test Data

In [20]:
# Make prediction using the test data
y_test_predicted = lin_reg_sgd.predict(X_test_poly)

test_mse_polynomial = mean_squared_error(y_test, y_test_predicted)

print("Mean squared error: %.2f"
      % test_mse_polynomial)

# Explained variance score: 1 is perfect prediction
test_r2_polynomial = r2_score(y_test, y_test_predicted)
print("Coefficient of determination r^2 variance score [1 is perfect prediction]: %.2f" 
      % test_r2_polynomial)

Mean squared error: 13.22
Coefficient of determination r^2 variance score [1 is perfect prediction]: 0.82


## SGD Linear Regression vs. SGD Polynomial Regression 

We observe that the SGD Polynomial Regression performs significantly better. It uses regularization to reduce the overfitting problem.

In [21]:
data = [["MSE (test)", test_mse_linear, test_mse_polynomial], 
        ["R2 Score (test)", test_r2_linear, test_r2_polynomial],
        ["Training time (sec)", elapsed_time_sgd, elapsed_time_sgd_poly]]
pd.DataFrame(data, columns=["Metric", "SGD Linear Regression", "SGD Polynomial Regression (degree 2)"])

Unnamed: 0,Metric,SGD Linear Regression,SGD Polynomial Regression (degree 2)
0,MSE (test),23.678161,13.221317
1,R2 Score (test),0.677118,0.81971
2,Training time (sec),0.000852,0.002315


# Beyond Linear Regression

Based on all experiments that we have done so far, we observe that even the optimal Linear Regression model (**regularized polynomial regression of degree 2**, implemented in notebook 3) is unable to reduce the MSE below 12. 

To further reduce the MSE we will have to use more sophisticated regression models. Below we apply two advanced regressors on the same dataset as well as use the K-Nearest Neighbors (K-NN) regressor model (for comparison):
- k-Nearest Neighbors (k-NN) Regressor 
- Support Vector Machine (Gaussian Radial Basis Function) Regressor
- Random Forest Regressor

We did not fine-tune the hyperparameters. Please refer to the notebooks on these three models for a detailed discussion of their hyperparameters. Below we use empirically obtained optimal values for the hyperparameters. Our goal is to illustrate the superiority of the advanced models.

## k-Nearest Neighbors (k-NN) Regressor 

In [22]:
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=3, p=2, weights='distance')
knn.fit(X_train, y_train)


print("\n----------------------------- Model Evaluation -----------------------------")

# Make prediction 
y_train_predicted_knn = knn.predict(X_train)


print("Train: Mean squared error: %.2f"
      % mean_squared_error(y_train, y_train_predicted_knn))


# Explained variance score: 1 is perfect prediction
print("Train: Coefficient of determination r^2 variance score [1 is perfect prediction]: %.2f" % r2_score(y_train, y_train_predicted_knn))


# Start timing
start_time = time.time()

# Make prediction using the test data
y_test_predicted_knn = knn.predict(X_test)

# Stop timing
end_time = time.time()

# Calculate elapsed time
elapsed_time_kNN = end_time - start_time
# Print elapsed time
print("\nk-NN Training Time: {:.10f} seconds".format(elapsed_time_kNN))

test_mse_knn = mean_squared_error(y_test, y_test_predicted_knn)

print("\nTest: Mean squared error: %.2f"
      % test_mse_knn)



# Explained variance score: 1 is perfect prediction
test_r2_knn = r2_score(y_test, y_test_predicted_knn)
print("Test: Coefficient of determination r^2 variance score [1 is perfect prediction]: %.2f" 
      % test_r2_knn)


----------------------------- Model Evaluation -----------------------------
Train: Mean squared error: 0.00
Train: Coefficient of determination r^2 variance score [1 is perfect prediction]: 1.00

k-NN Training Time: 0.0009522438 seconds

Test: Mean squared error: 18.21
Test: Coefficient of determination r^2 variance score [1 is perfect prediction]: 0.75


## Support Vector Machine (Gaussian Radial Basis Function) Regressor

In [23]:
from sklearn.svm import SVR

# Start timing
start_time = time.time()

svm = SVR(kernel='rbf', C=1000.0, gamma=0.01)
svm.fit(X_train, y_train)

# Stop timing
end_time = time.time()

# Calculate elapsed time
elapsed_time_svm = end_time - start_time
# Print elapsed time
print("\nSVM Training Time: {:.10f} seconds".format(elapsed_time_svm))


print("\n----------------------------- Model Evaluation -----------------------------")

# Make prediction 
y_train_predicted_svm = svm.predict(X_train)


print("\nTrain: Mean squared error: %.2f"
      % mean_squared_error(y_train, y_train_predicted_svm))


# Explained variance score: 1 is perfect prediction
print("Train: Coefficient of determination r^2 variance score [1 is perfect prediction]: %.2f" % r2_score(y_train, y_train_predicted_svm))


# Make prediction using the test data
y_test_predicted_svm = svm.predict(X_test)

test_mse_svm = mean_squared_error(y_test, y_test_predicted_svm)

print("\nTest: Mean squared error: %.2f"
      % test_mse_svm)



# Explained variance score: 1 is the perfect prediction

test_r2_svm = r2_score(y_test, y_test_predicted_svm)
print("Test: Coefficient of determination r^2 variance score [1 is perfect prediction]: %.2f" 
      % test_r2_svm)


SVM Training Time: 0.0454709530 seconds

----------------------------- Model Evaluation -----------------------------

Train: Mean squared error: 5.71
Train: Coefficient of determination r^2 variance score [1 is perfect prediction]: 0.93

Test: Mean squared error: 10.63
Test: Coefficient of determination r^2 variance score [1 is perfect prediction]: 0.86


## Random Forest Regressor

To use the Random Forest model, we need the **unscaled features**. Thus we read the features from the DataFrame object first.

In [24]:
# Create separate data frame objects for X (features) and y (target)
X = allData.drop(columns='MEDV')  
y = allData['MEDV'] 


X = np.asarray(X) # Data Matrix containing all features excluding the target
y = np.asarray(y) # 1D target array


print("Data Matrix (X) Shape: ", X.shape)
print("Label Array (y) Shape: ", y.shape)

print("\nData Matrix (X) Type: ", X.dtype)
print("Label Array (y) Type: ", y.dtype)


# Create train and test subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Data Matrix (X) Shape:  (506, 13)
Label Array (y) Shape:  (506,)

Data Matrix (X) Type:  float64
Label Array (y) Type:  float64


In [25]:
from sklearn.ensemble import RandomForestRegressor


rnd_forest_reg = RandomForestRegressor(n_estimators=500, criterion="squared_error", max_features=1.0, 
                                       verbose=1, max_depth=8, 
                                       oob_score=True, n_jobs=-1)

# Start timing
start_time = time.time()

rnd_forest_reg.fit(X_train, y_train)

# Stop timing
end_time = time.time()

# Calculate elapsed time
elapsed_time_RForest = end_time - start_time
# Print elapsed time
print("\nRandom Forest Training Time: {:.10f} seconds".format(elapsed_time_RForest))


# Make prediction 
y_train_predicted_rnd_forest = rnd_forest_reg.predict(X_train)


train_mse_rnd_forest = mean_squared_error(y_train, y_train_predicted_rnd_forest)

print("\nTrain: Mean squared error: %.2f"
      % train_mse_rnd_forest)


# Explained variance score: 1 is perfect prediction
print("Train: Coefficient of determination r^2 variance score [1 is perfect prediction]: %.2f" 
      % r2_score(y_train, y_train_predicted_rnd_forest))




y_test_predicted_rnd_forest = rnd_forest_reg.predict(X_test)


test_mse_rnd_forest = mean_squared_error(y_test, y_test_predicted_rnd_forest)

print("Test: Mean squared error: %.2f"
      % test_mse_rnd_forest)


# Explained variance score: 1 is the perfect prediction

test_r2_rnd_forest = r2_score(y_test, y_test_predicted_rnd_forest)


print("Test: Coefficient of determination r^2 variance score [1 is perfect prediction]: %.2f" 
      % test_r2_rnd_forest)

#print("\nScore of the training dataset obtained using an out-of-bag estimate: ", rnd_forest_reg.oob_score_)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed:    0.1s



Random Forest Training Time: 0.3533222675 seconds

Train: Mean squared error: 2.74
Train: Coefficient of determination r^2 variance score [1 is perfect prediction]: 0.97
Test: Mean squared error: 8.76
Test: Coefficient of determination r^2 variance score [1 is perfect prediction]: 0.88


[Parallel(n_jobs=-1)]: Done 418 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:    0.3s finished
[Parallel(n_jobs=16)]: Using backend ThreadingBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    0.0s
[Parallel(n_jobs=16)]: Done 168 tasks      | elapsed:    0.0s
[Parallel(n_jobs=16)]: Done 418 tasks      | elapsed:    0.0s
[Parallel(n_jobs=16)]: Done 500 out of 500 | elapsed:    0.0s finished
[Parallel(n_jobs=16)]: Using backend ThreadingBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    0.0s
[Parallel(n_jobs=16)]: Done 168 tasks      | elapsed:    0.0s
[Parallel(n_jobs=16)]: Done 418 tasks      | elapsed:    0.0s
[Parallel(n_jobs=16)]: Done 500 out of 500 | elapsed:    0.0s finished


# Comparison of Linear Regression & Advanced Regressor Models

In [26]:
data = [
        ["SGD Linear Regression", test_mse_linear, test_r2_linear, elapsed_time_sgd], 
        ["SGD Polynomial Regression (degree 2)", test_mse_polynomial, test_r2_polynomial, elapsed_time_sgd_poly],
        ["K-Nearest Neighbors Regressor", test_mse_knn, test_r2_knn, elapsed_time_kNN],
        ["Support Vector Machine (Gaussian RBF)", test_mse_svm, test_r2_svm, elapsed_time_svm],
        ["Random Forest", test_mse_rnd_forest, test_r2_rnd_forest, elapsed_time_RForest]
       ]


pd.DataFrame(data, columns=["Model", "MSE (test)", "R2 Score (test)", "Training Time (sec)"])

Unnamed: 0,Model,MSE (test),R2 Score (test),Training Time (sec)
0,SGD Linear Regression,23.678161,0.677118,0.000852
1,SGD Polynomial Regression (degree 2),13.221317,0.81971,0.002315
2,K-Nearest Neighbors Regressor,18.21336,0.751638,0.000952
3,Support Vector Machine (Gaussian RBF),10.628107,0.855072,0.045471
4,Random Forest,8.759039,0.880559,0.353322


# Final Observation

We observe that **Random Forest performs significantly better** than Linear Regression reducing the MSE below 9. Also, we see the K-NN regressor model's performance is comparable to that of Linear Regression making it a poor choice for this dataset.