<a href="https://colab.research.google.com/github/michaelwnau/ai_academy_notebooks/blob/main/WKS8_nau_tues_Student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Workshop 8: Linear Regression


In this exercise, you will apply linear regression and Lasso regression to the supplied dataset, then compare their results to determine whether Lasso regression is needed for this dataset.

**Dataset description**: You are provided a dataset with 20 variables. Variables $x1\ -\ x19$ refer to the independent variables, while variable $y$ is your dependent variable. Training data is stored in the file `/etc/data/regression-train.csv`.

**Note**: The TA will use a test set to verify your solution. The format (independent variables $x1\ -\ x19$, dependent variable $y$) will be the same, but the TA's file may contain a different number of data points than the test split taken from the training set. Please take this into account and do not hard-code any dimensions.
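
One way to honor the note above is to derive every shape from the data itself. The following is a minimal sketch, assuming only that the last column is the dependent variable; the helper name `split_features_target` and the toy frames are hypothetical and not part of the workshop files.

```python
import numpy as np
import pandas as pd

def split_features_target(df):
    # Hypothetical helper: treat the last column as y and everything before it as X,
    # without assuming how many rows the file contains.
    return df.iloc[:, :-1], df.iloc[:, -1]

# Two toy frames with the same columns but different row counts
rng = np.random.default_rng(0)
cols = [f"x{i}" for i in range(1, 20)] + ["y"]
small = pd.DataFrame(rng.normal(size=(5, 20)), columns=cols)
large = pd.DataFrame(rng.normal(size=(12, 20)), columns=cols)

shapes = []
for frame in (small, large):
    X_demo, y_demo = split_features_target(frame)
    # Row counts come from the data, never from a literal such as 27 or 66
    shapes.append((X_demo.shape, y_demo.shape))

print(shapes)
```

The same function body works for both frames because nothing in it refers to a fixed number of rows.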

---

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
#Import necessary libraries
import pandas as pd
import numpy as np

In [4]:
#Read the data
df = pd.read_csv("/content/drive/MyDrive/AI ACADEMY/2 - Data Mining/8- Week 8/WKS8_Student/data/regression-train.csv")
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
df

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,y
0,508,44,60,718,42,234,0,0,56,52,8,216,0,1998,472,136,236,12,4,610.0
1,1020,106,198,1620,126,680,2,2,112,104,36,614,0,5744,1642,294,348,14,20,2300.0
2,1118,146,828,704,32,698,2,2,96,122,18,842,0,6324,1748,282,718,16,4,1850.0
3,922,70,452,222,48,150,0,0,108,108,22,152,2,1360,320,224,98,4,36,270.0
4,526,60,294,162,18,164,2,2,52,46,8,166,0,1776,440,140,172,8,2,500.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
127,630,50,78,996,48,188,2,2,70,120,26,136,0,1260,302,152,110,6,26,310.0
128,330,32,38,664,4,20,0,2,26,18,4,36,2,392,88,78,36,6,4,150.0
129,1146,156,262,2628,194,840,0,0,170,120,24,940,0,6396,1714,288,664,16,18,1920.0
130,472,22,398,250,2,128,0,0,54,30,26,232,2,2230,540,112,114,8,0,460.0


In [5]:
from sklearn.model_selection import train_test_split
X_train_unscaled, X_test_unscaled, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=0)

In [6]:
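# For regression, it is particularly important to standardize the features before
# training the model, so the coefficients are on a comparable scale and easier to interpret.
# The scaler is fit on the training split only and then reused for the test split.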
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train_unscaled)
X_train = scaler.transform(X_train_unscaled)
X_test = scaler.transform(X_test_unscaled)
print(X_train[0])

[ 0.55646407  0.15658678  0.96233481 -0.71601746  0.76470622  0.50389386
  0.93541435  0.93541435 -0.12200241  0.78852431  1.206875    0.12092592
 -1.04880885  0.40219789  0.30992371  0.13755462  0.10537768  0.09601967
  0.81641649]
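
The scaler above is fit on the training split and then reused on the test split, so no test-set statistics influence training. A small sketch on synthetic data (the arrays and the name `scaler_demo` are made up for illustration) shows what that implies for each split:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=100.0, scale=20.0, size=(80, 3))  # synthetic stand-in for X_train_unscaled
test = rng.normal(loc=100.0, scale=20.0, size=(20, 3))   # synthetic stand-in for X_test_unscaled

scaler_demo = StandardScaler().fit(train)  # statistics come from the training split only
train_scaled = scaler_demo.transform(train)
test_scaled = scaler_demo.transform(test)

# Training columns have zero mean and unit variance by construction
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
# Test columns are only approximately standardized, since they reuse the training statistics
print(test_scaled.mean(axis=0).round(3), test_scaled.std(axis=0).round(3))
```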


We have given you a function to calculate the RMSE of a regression model, given its predictions and the ground truth y-values of the test data.

In [7]:
import numpy as np

def calculate_rmse(y_true, y_pred):
    # Inputs:
    # y_true: ground truth dependent variable values, of type vector
    # y_pred: prediction outcomes from any regression method, with the same length as y_true

    # Outputs:
    # a single value of type double, with the RMSE value

    y_true = np.array(y_true)
    y_pred = np.array(y_pred)

    return np.sqrt(np.mean((y_true - y_pred)**2))
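
As a quick sanity check, the function can be compared against a hand-worked case and against scikit-learn's `mean_squared_error`. This sketch repeats the function so that it runs on its own; the three-point vectors are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def calculate_rmse(y_true, y_pred):
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true_demo = [3.0, 5.0, 7.0]
y_pred_demo = [2.0, 5.0, 9.0]

# Errors are 1, 0, -2, so the mean squared error is 5/3 and the RMSE is sqrt(5/3)
rmse_manual = calculate_rmse(y_true_demo, y_pred_demo)
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))
print(rmse_manual, rmse_sklearn)
```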


In [8]:
from sklearn.linear_model import LinearRegression

# Define the model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate RMSE
rmse = calculate_rmse(y_test, y_pred)

print(f"RMSE: {rmse}")


RMSE: 787.9099436300614


## Part 1: Linear Regression (Group)
You will write code in the function `alda_regression_linear` to train a simple linear regression model. Detailed instructions for implementation and allowed packages are provided in the comments.

Before you begin, read the documentation on sklearn's [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).

In [10]:
from sklearn.linear_model import LinearRegression

def alda_regression_linear(X_train, X_test, y_train):
    # Perform linear regression
    # Inputs:
    # X_train: training data frame(19 variables, x1-x19)
    # X_test: test data frame(19 variables, x1-x19)
    # y_train: dependent variable, training data (vector, continuous type)

    # Output:
    # A tuple containing:
    # - The regression model and 
    # - The list of predictions on test data (X_test) (vector) 
  
    # allowed packages: sklearn.linear_model
  
    # Function hints: Read the documentation for the functions LinearRegression (link above)
    
    # write code for building a linear regression model using X_train, y_train
    raise NotImplementedError()

In [11]:
def alda_regression_linear(X_train, X_test, y_train):
    # Perform linear regression
    # Inputs:
    # X_train: training data frame(19 variables, x1-x19)
    # X_test: test data frame(19 variables, x1-x19)
    # y_train: dependent variable, training data (vector, continuous type)

    # Output:
    # A tuple containing:
    # - The regression model and 
    # - The list of predictions on test data (X_test) (vector) 

    # Define the model
    model = LinearRegression()
    # Here we're defining our model as a LinearRegression model. Linear Regression is 
    # a simple machine learning model where the response y is modelled by a linear 
    # combination of the predictors in X.

    # Train the model
    model.fit(X_train, y_train)
    # The fit method here is used to train the model. We're passing in the X_train which 
    # contains our predictor variables and the y_train which contains our output variable. 
    # The fit method adjusts weights on the input variables to find the best possible 
    # coefficients that minimize the loss (in this case, mean squared error).

    # Make predictions on the test set
    y_pred = model.predict(X_test)
    # After we've fitted the model, we can make predictions on our test data. 
    # The predict method takes in the X_test data and tries to predict the output 
    # for each row in X_test based on the coefficients it learned in the training phase.

    return model, y_pred
    # The function returns a tuple: the model itself (which contains coefficients 
    # for each input feature) and the predictions it made on the test dataset.
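
To make the "coefficients that minimize the loss" remark concrete, the sketch below checks on synthetic data that `LinearRegression` recovers the same solution as an ordinary least-squares solve with an explicit intercept column. The data and the `*_demo` names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = 1.5 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

model_demo = LinearRegression().fit(X_demo, y_demo)

# Ordinary least squares via NumPy: prepend a column of ones for the intercept
X_aug = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(X_aug, y_demo, rcond=None)

print(model_demo.intercept_, model_demo.coef_)
print(beta[0], beta[1:])
```

Both lines should print matching values up to floating-point error, because both minimize the sum of squared residuals on the same data.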


In [12]:
# Now test your model
lr_model, lr_predictions = alda_regression_linear(X_train, X_test, y_train)

print(f'Training RMSE: {calculate_rmse(y_train, lr_model.predict(X_train))}')
print(f'Test RMSE: {calculate_rmse(y_test, lr_predictions)}')
print('')

# Which attributes are most predictive of the outcome variable?
print(f'Model coefficients:\n{lr_model.coef_}')


Training RMSE: 524.953283852617
Test RMSE: 787.9099436300614

Model coefficients:
[   48.18082989    65.73979512   210.77335056   122.5616241
   124.07603501   290.38608009   123.86744484   -83.03161175
   270.17717717   -48.56806094   -23.51778917  -941.63419412
   -89.00842661 -4044.98292672  5073.44151699  -544.21209605
   137.73175811  -101.04154575   235.26578305]


In [13]:
lr_model, simple_linear_regression_result = alda_regression_linear(X_train, X_test, y_train)
np.testing.assert_equal(simple_linear_regression_result.shape, (27,))
np.testing.assert_almost_equal(calculate_rmse(y_test,simple_linear_regression_result),787.9099436300563)
np.testing.assert_almost_equal(lr_model.coef_[0], 48.18082989)

In [14]:
# Here are some test cases to check your implementation
df_test = pd.read_csv("/content/drive/MyDrive/AI ACADEMY/2 - Data Mining/8- Week 8/WKS8_Student/data/regression-test.csv")
X_test2 = df_test.iloc[:,:-1]
y_test2 = df_test.iloc[:,-1]

lr_model, simple_linear_regression_result = alda_regression_linear(X, X_test2, y)
np.testing.assert_equal(simple_linear_regression_result.shape, (66,))
np.testing.assert_almost_equal(calculate_rmse(y_test2,simple_linear_regression_result),946.4318403560575)

## Part 2: Lasso Regression (Group)
You will write code in the function `alda_regression_lasso` to train a simple Lasso regression model. Detailed instructions for implementation and allowed packages are provided in the comments.

Before you begin, read the documentation on sklearn's [LassoCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html), a Lasso regression model that uses cross-validation (CV) to tune its hyperparameters.

**Note**: The Lasso regression model has *built-in* cross-validation, which it performs on the training dataset provided, to select the shrinkage coefficient that performs best on the held-out validation folds.
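
The built-in cross-validation can be seen on a small synthetic problem. In this sketch the data, the choice of `cv=5`, and the `*_demo` names are assumptions for illustration; only the `LassoCV` attributes `alpha_`, `alphas_`, `mse_path_`, and `coef_` come from scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n_samples, n_features = 200, 10
X_demo = rng.normal(size=(n_samples, n_features))
# Only the first three features influence y; the other seven are pure noise
true_coef = np.array([5.0, -3.0, 2.0] + [0.0] * 7)
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=n_samples)

# LassoCV evaluates a grid of candidate alphas on each fold and keeps the best one
lasso_demo = LassoCV(cv=5, random_state=0).fit(X_demo, y_demo)

print(f"alpha chosen by CV: {lasso_demo.alpha_:.4f}")
print(f"candidate alphas tried: {len(lasso_demo.alphas_)}")
print(f"validation MSE grid shape (alphas x folds): {lasso_demo.mse_path_.shape}")
print(f"coefficients: {np.round(lasso_demo.coef_, 3)}")
```

No separate validation split is passed in: the folds are carved out of the training data given to `fit`.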

In [28]:
from sklearn.linear_model import LassoCV

def alda_regression_lasso(X_train, X_test, y_train, random_state=0):
    # Instantiate the LassoCV model with 10-fold cross-validation.
    # max_iter is raised above the default of 1000 to give coordinate descent more room to converge.
    lasso_cv = LassoCV(cv=10, random_state=random_state, max_iter=10000)

    # Fit the model on the training data
    lasso_cv.fit(X_train, y_train)
    
    # Make predictions on the test data
    y_pred = lasso_cv.predict(X_test)
    
    # Return the fitted model and the predictions
    return lasso_cv, y_pred


**Before testing your model**: Do you expect the training error to be higher or lower than it was for plain linear regression? What about the test error? What do you expect to be different about the coefficients?

In [29]:
lasso_model, lasso_predictions = alda_regression_lasso(X_train, X_test, y_train)

# Should be ~541.7
print(f'Training RMSE: {calculate_rmse(y_train, lasso_model.predict(X_train))}')
# Should be ~639.6
print(f'Test RMSE: {calculate_rmse(y_test, lasso_predictions)}')
print('')

# Which attributes are most predictive of the outcome variable?
print(f'Model coefficients:\n{lasso_model.coef_}')

print()
# Note we called this 'lambda' in class, but sklearn calls it alpha (should be ~3.196)
print(f'The shrinkage coefficient hyperparameter chosen by CV: {lasso_model.alpha_}')

Training RMSE: 541.6564802479181
Test RMSE: 639.6168267325154

Model coefficients:
[ -201.26105526   125.5739836    143.30724503   113.59035547
     0.           452.53752144    78.21118683   -28.92898297
    -0.            -8.80656266     0.           -91.57679107
   -95.58105028 -2134.97942061  2431.93701951    -0.
    -0.          -153.36709652   172.02434475]

The shrinkage coefficient hyperparameter chosen by CV: 3.196361962538218


In [26]:
lasso_model, lasso_regression_result = alda_regression_lasso(X_train, X_test, y_train)
np.testing.assert_equal(lasso_regression_result.shape, (27,))
np.testing.assert_almost_equal(calculate_rmse(y_test, lasso_regression_result), 639.6168267325169)
np.testing.assert_almost_equal(lasso_model.coef_[0], -201.26105526)

In [27]:
# Test Cases
df_test = pd.read_csv("/content/drive/MyDrive/AI ACADEMY/2 - Data Mining/8- Week 8/WKS8_Student/data/regression-test.csv")
X_test2 = df_test.iloc[:,:-1]
y_test2 = df_test.iloc[:,-1]

lasso_model, lasso_regression_result = alda_regression_lasso(X, X_test2, y)
np.testing.assert_equal(lasso_regression_result.shape, (66,))
np.testing.assert_almost_equal(calculate_rmse(y_test2, lasso_regression_result), 978.6205927267488)

From the results, compare the two regression models, including the training and testing RMSE, and the coefficients. Use the output of these functions to answer the questions below:

1. The dataset contains 19 attributes. Are all 19 attributes useful for predicting the dependent variable? Why or why not? Use your results to justify the answer.
2. If not all attributes are predictive, use your Lasso model to perform feature selection. Which attributes should be kept? Use a correlation and/or scatter plot to justify your answer for at least one attribute (in a new cell below).
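
One possible way to approach question 2 is to keep the attributes whose Lasso coefficient is non-zero and then look at their correlation with the outcome. The sketch below uses synthetic data with six made-up features, so the selected names are not the answer for the workshop dataset; all `*_demo` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
feature_names = [f"x{i}" for i in range(1, 7)]  # synthetic stand-in, not the workshop data
X_demo = pd.DataFrame(rng.normal(size=(150, 6)), columns=feature_names)
# In this toy setup only x1 and x4 drive y
noise = rng.normal(scale=0.5, size=150)
y_demo = pd.Series(4.0 * X_demo["x1"] - 2.5 * X_demo["x4"] + noise, name="y")

X_scaled = StandardScaler().fit_transform(X_demo)
lasso_demo = LassoCV(cv=5, random_state=0).fit(X_scaled, y_demo)

# Keep the features whose Lasso coefficient was not shrunk exactly to zero
coef = pd.Series(lasso_demo.coef_, index=feature_names)
kept = coef[coef != 0].index.tolist()
dropped = coef[coef == 0].index.tolist()
print("kept:", kept)
print("dropped:", dropped)

# Pearson correlation of each feature with y, as supporting evidence
corr_with_y = X_demo.corrwith(y_demo)
print(corr_with_y.round(3))
```

A scatter plot of one kept feature against `y` would complement the correlation values. Note that Lasso can retain a weakly correlated feature or drop a correlated one when features overlap, so the two views may not agree exactly.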