# Workshop 8: Linear Regression


In this exercise, you will apply linear regression and Lasso regression to the dataset supplied to you, then compare their results to determine whether Lasso regression is needed for this dataset.

**Dataset description**: You are provided a dataset with 20 variables. Variables $x1\ -\ x19$ are the independent variables, while variable $y$ is your dependent variable. Training data is stored in the file `./data/regression-train.csv`.

**Note**: The TA will use a test set to verify your solution. The format (independent variables $x1\ -\ x19$, dependent variable $y$) will be the same, but the TA's file may contain a different number of data points than the test split taken from the training set. Please take this into account, and do not hard-code any dimensions.
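
As a minimal sketch of what "do not hard-code any dimensions" can look like in practice, the snippet below splits features from the target by column position only, so it works for any number of rows. The helper name `split_features_target` and the two random frames are hypothetical stand-ins, not part of the workshop data.

```python
import numpy as np
import pandas as pd

def split_features_target(df):
    # Hypothetical helper: the last column is treated as the target y,
    # every column before it as a feature. No row count is assumed.
    X = df.iloc[:, :-1]
    y = df.iloc[:, -1]
    return X, y

# Synthetic stand-ins with the same column layout but different row counts.
rng = np.random.default_rng(0)
cols = [f"x{i}" for i in range(1, 20)] + ["y"]
small = pd.DataFrame(rng.normal(size=(10, 20)), columns=cols)
large = pd.DataFrame(rng.normal(size=(66, 20)), columns=cols)

for frame in (small, large):
    X_demo, y_demo = split_features_target(frame)
    # Shapes are read from the data rather than written into the code.
    print(X_demo.shape, y_demo.shape)
```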

---

In [1]:
#Import necessary libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [2]:
#Read the data
df = pd.read_csv("./data/regression-train.csv")
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
df

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,y
0,508,44,60,718,42,234,0,0,56,52,8,216,0,1998,472,136,236,12,4,610.0
1,1020,106,198,1620,126,680,2,2,112,104,36,614,0,5744,1642,294,348,14,20,2300.0
2,1118,146,828,704,32,698,2,2,96,122,18,842,0,6324,1748,282,718,16,4,1850.0
3,922,70,452,222,48,150,0,0,108,108,22,152,2,1360,320,224,98,4,36,270.0
4,526,60,294,162,18,164,2,2,52,46,8,166,0,1776,440,140,172,8,2,500.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
127,630,50,78,996,48,188,2,2,70,120,26,136,0,1260,302,152,110,6,26,310.0
128,330,32,38,664,4,20,0,2,26,18,4,36,2,392,88,78,36,6,4,150.0
129,1146,156,262,2628,194,840,0,0,170,120,24,940,0,6396,1714,288,664,16,18,1920.0
130,472,22,398,250,2,128,0,0,54,30,26,232,2,2230,540,112,114,8,0,460.0


In [3]:
from sklearn.model_selection import train_test_split
X_train_unscaled, X_test_unscaled, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=0)

In [4]:
# For regression, it is particularly important to normalize our data before
# training the model, so we can better interpret our coefficients
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train_unscaled)
X_train = scaler.transform(X_train_unscaled)
X_test = scaler.transform(X_test_unscaled)
#print(X_train[0])

We have given you a function to calculate the RMSE of a regression model, given its predictions and the ground-truth y-values of the test data.

In [5]:
import math
def calculate_rmse(y_true, y_pred):
    # Inputs:
    # y_true: ground truth dependent variable values (vector)
    # y_pred: predictions from any regression method, with the same length as y_true

    # Outputs:
    # a single float: the root mean squared error (RMSE)
    return math.sqrt(np.sum((y_true - y_pred)**2) / len(y_true))
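
To make the RMSE computation concrete, here is a small worked example on three made-up points. The function name `rmse_sketch` and the numbers are illustrative assumptions; the function mirrors the formula used above (square the errors, average them, take the square root).

```python
import math
import numpy as np

def rmse_sketch(y_true, y_pred):
    # Illustrative re-statement of RMSE: sqrt(mean((y_true - y_pred)^2)).
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return math.sqrt(np.mean((y_true - y_pred) ** 2))

# Made-up values: errors are 1, 0 and -2, so squared errors are 1, 0 and 4.
y_true_demo = [3.0, 5.0, 7.0]
y_pred_demo = [2.0, 5.0, 9.0]

# Mean squared error is 5/3, so the RMSE is sqrt(5/3), roughly 1.291.
rmse_value = rmse_sketch(y_true_demo, y_pred_demo)
print(rmse_value)
```

Because the errors are squared before averaging, the single error of 2 contributes four times as much as the error of 1, which is why RMSE is sensitive to large misses.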

## Part 1: Linear Regression (Group)
You will write code in the function `alda_regression_linear` to train a simple linear regression model. Detailed instructions for implementation and the allowed packages are provided in the comments.

Before you begin, read the documentation on sklearn's [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).

In [6]:
from sklearn.linear_model import LinearRegression

def alda_regression_linear(X_train, X_test, y_train):
    # Perform linear regression
    # Inputs:
    # X_train: training data frame (19 variables, x1-x19)
    # X_test: test data frame (19 variables, x1-x19)
    # y_train: dependent variable, training data (vector, continuous type)

    # Output:
    # A tuple containing:
    # - The regression model and
    # - The predictions on test data (X_test) (vector)

    # allowed packages: sklearn.linear_model

    # Function hints: Read the documentation for LinearRegression (link above)

    # Build a linear regression model using X_train, y_train
    regression = LinearRegression()
    regression.fit(X_train, y_train)
    # Predict the dependent variable for the test data
    pred = regression.predict(X_test)
    return (regression, pred)

In [7]:
# Now test your model
lr_model, lr_predictions = alda_regression_linear(X_train, X_test, y_train)

print(f'Training RMSE: {calculate_rmse(y_train, lr_model.predict(X_train))}')
print(f'Test RMSE: {calculate_rmse(y_test, lr_predictions)}')
print('')

# Which attributes are most predictive of the outcome variable?
print(f'Model coefficients:\n{lr_model.coef_}')


Training RMSE: 524.953283852617
Test RMSE: 787.9099436300614

Model coefficients:
[   48.18082989    65.73979512   210.77335056   122.5616241
   124.07603501   290.38608009   123.86744484   -83.03161175
   270.17717717   -48.56806094   -23.51778917  -941.63419412
   -89.00842661 -4044.98292672  5073.44151699  -544.21209605
   137.73175811  -101.04154575   235.26578305]
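
One way to read the coefficient vector printed above is to rank the attributes by absolute coefficient. The sketch below copies the printed values and assumes the coefficients appear in column order x1 to x19, which follows from how `X` was sliced; the variable names are illustrative. Because the features were standardized, the magnitudes are comparable across attributes.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the linear regression output above.
coef = np.array([
    48.18082989, 65.73979512, 210.77335056, 122.5616241,
    124.07603501, 290.38608009, 123.86744484, -83.03161175,
    270.17717717, -48.56806094, -23.51778917, -941.63419412,
    -89.00842661, -4044.98292672, 5073.44151699, -544.21209605,
    137.73175811, -101.04154575, 235.26578305,
])

# Assumption: coefficient i corresponds to attribute x(i+1).
names = [f"x{i}" for i in range(1, len(coef) + 1)]
coef_series = pd.Series(coef, index=names)

# Order attributes from largest to smallest absolute coefficient.
order = coef_series.abs().sort_values(ascending=False).index
ranked = coef_series.reindex(order)
print(ranked.head(5))
```

The two largest coefficients (x15 and x14) have opposite signs and similar magnitude, which may indicate that those two attributes are strongly correlated with each other; a correlation check on the data would be needed to confirm this.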


In [8]:
lr_model, simple_linear_regression_result = alda_regression_linear(X_train, X_test, y_train)
np.testing.assert_equal(simple_linear_regression_result.shape, (27,))
np.testing.assert_almost_equal(calculate_rmse(y_test,simple_linear_regression_result),787.9099436300563)
np.testing.assert_almost_equal(lr_model.coef_[0], 48.18082989)

In [9]:
# Here are some test cases to make sure you're right
df_test = pd.read_csv("./data/regression-test.csv")
X_test2 = df_test.iloc[:,:-1]
y_test2 = df_test.iloc[:,-1]

lr_model, simple_linear_regression_result = alda_regression_linear(X, X_test2, y)
np.testing.assert_equal(simple_linear_regression_result.shape, (66,))
np.testing.assert_almost_equal(calculate_rmse(y_test2,simple_linear_regression_result),946.4318403560575)

## Part 2: Lasso Regression (Group)
You will write code in the function `alda_regression_lasso` to train a Lasso regression model. Detailed instructions for implementation and the allowed packages are provided in the comments.

Before you begin, read the documentation on sklearn's [LassoCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html), a Lasso regression model that uses cross-validation (CV) to tune its hyperparameters.

**Note** that LassoCV has *built-in* cross-validation, which it performs on the training dataset provided, to select the shrinkage coefficient with the lowest average error on the held-out validation folds.
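
The built-in cross-validation can be inspected after fitting. The following is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration): `alphas_` holds the candidate shrinkage values, `mse_path_` holds the validation error for each candidate on each fold, and `alpha_` is the selected value.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data: only the first two of five features influence y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

# 10-fold CV over an automatically generated grid of alpha values.
model = LassoCV(cv=10, random_state=0).fit(X_demo, y_demo)

# Average the validation MSE over the folds for each candidate alpha.
mean_mse = model.mse_path_.mean(axis=1)
best_alpha = model.alphas_[np.argmin(mean_mse)]

print(model.mse_path_.shape)      # (number of candidate alphas, number of folds)
print(best_alpha, model.alpha_)   # the selected alpha minimizes the mean validation MSE
print(model.coef_)
```

After the search, the model is refit on the full training data with the selected alpha, so `coef_` and `predict` reflect that final fit.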

In [10]:
from sklearn.linear_model import LassoCV
def alda_regression_lasso(X_train, X_test, y_train, random_state=0):
    # Perform lasso regression
    # Inputs:
    # X_train: training data frame (19 variables, x1-x19)
    # X_test: test data frame (19 variables, x1-x19)
    # y_train: dependent variable, training data (vector, continuous type)
    # random_state: a random state to use in CV model training
    # General Information:
        # use 10-fold cross validation to determine the best model hyperparameters

    # Output:
    # A tuple containing:
    # - The regression model and
    # - The predictions on test data (X_test) (vector)

    # allowed packages: sklearn.linear_model

    # Function hints: Read the documentation for LassoCV (link above)

    # Fit a lasso model with 10-fold cross validation;
    # random_state defaults to 0 for reproducibility
    lasso = LassoCV(cv=10, random_state=random_state).fit(X_train, y_train)
    # Predict the dependent variable for the test data
    lasso_pred = lasso.predict(X_test)

    return (lasso, lasso_pred)

**Before testing your model**: Do you expect the training error to be higher or lower than for linear regression? What about the testing error? What do you expect to be different about the coefficients?

In [11]:
lasso_model, lasso_predictions = alda_regression_lasso(X_train, X_test, y_train)

# Should be ~541.7
print(f'Training RMSE: {calculate_rmse(y_train, lasso_model.predict(X_train))}')
# Should be ~639.4
print(f'Test RMSE: {calculate_rmse(y_test, lasso_predictions)}')
print('')

# Which attributes are most predictive of the outcome variable?
print(f'Model coefficients:\n{lasso_model.coef_}')

print()
# Note we called this 'lambda' in class, but sklearn calls it alpha (should be ~3.196)
print(f'The shrinkage coefficient hyperparameter chosen by CV: {lasso_model.alpha_}')

Training RMSE: 541.6957360523048
Test RMSE: 639.4484997672894

Model coefficients:
[ -201.29823887   125.69640945   143.15397394   113.55339266
     0.           452.14702148    78.0967967    -28.89070358
    -0.            -8.62418868     0.           -92.58796998
   -95.66296382 -2130.41737003  2429.16734481    -0.
    -0.          -153.90630771   171.8676209 ]

The shrinkage coefficient hyperparameter chosen by CV: 3.196361962538218


In [25]:
lasso_model, lasso_regression_result = alda_regression_lasso(X_train, X_test, y_train)
np.testing.assert_equal(lasso_regression_result.shape, (27,))
np.testing.assert_almost_equal(calculate_rmse(y_test, lasso_regression_result), 639.448499767289)
np.testing.assert_almost_equal(lasso_model.coef_[0], -201.29823887)

In [24]:
# Here are some more test cases
df_test = pd.read_csv("./data/regression-test.csv")
X_test2 = df_test.iloc[:,:-1]
y_test2 = df_test.iloc[:,-1]

lasso_model, lasso_regression_result = alda_regression_lasso(X, X_test2, y)
np.testing.assert_equal(lasso_regression_result.shape, (66,))
np.testing.assert_almost_equal(calculate_rmse(y_test2, lasso_regression_result), 978.5702064182492)

From the results, compare the two regression models, including the training and testing RMSE and the coefficients. Use the output of these functions to answer the following questions:

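1. The dataset contains 19 attributes. Are all 19 attributes useful for predicting the dependent variable? Why or why not? Use your results to justify the answer. No. LassoCV shrank the coefficients of five attributes (x5, x9, x11, x16, and x17) to exactly zero, which indicates they are not needed to predict $y$ once the other attributes are in the model. Dropping them also lowered the test RMSE from about 787.9 to about 639.4, while the training RMSE rose only slightly, from about 525.0 to about 541.7.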
2. If not all attributes are predictive, use your Lasso model to perform feature selection. Which attributes should be kept? Use a correlation and/or scatter plot to justify your answer for at least one attribute (in a new cell below).  
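
The feature-selection step in question 2 could be sketched as follows. This runs on a small synthetic frame rather than the workshop data, and the names `demo`, `kept`, and `corr` are illustrative assumptions: attributes with a nonzero Lasso coefficient are kept, and their correlation with `y` is used as supporting evidence.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic frame: y depends on x1 and x3 only; x2 and x4 are pure noise.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
demo["y"] = 5.0 * demo["x1"] - 3.0 * demo["x3"] + rng.normal(scale=0.5, size=n)

# Standardize the features, then let LassoCV choose the shrinkage coefficient.
features = demo.drop(columns="y")
X_scaled = StandardScaler().fit_transform(features)
lasso = LassoCV(cv=10, random_state=0).fit(X_scaled, demo["y"])

# Keep the attributes whose coefficient was not shrunk to zero.
kept = features.columns[lasso.coef_ != 0].tolist()

# Pearson correlation of each attribute with y, as a sanity check.
corr = features.corrwith(demo["y"])
print(kept)
print(corr)
```

On the real data, the same pattern would apply to `X_train_unscaled`, `y_train`, and `lasso_model.coef_`; a scatter plot of one kept attribute against `y` (for example with matplotlib) can then serve as the visual justification the question asks for.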