#  Lab 1: Regression 
## Multivariate Regression

In this notebook we implement multivariate regression. In particular, you will:

* Complete the function `multilinearNEWRegrPredict` to implement the multivariate regression algorithm.
* Apply the previous `SSR` function to the estimates and the true labels.


# Import libraries

The required libraries for this notebook are pandas, scikit-learn (`sklearn`) and NumPy.

In [1]:
# import libraries
import pandas
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load the data
The data comes from ***multi_regr_data.csv***. It contains 1000 records of student marks. Each record has 3 columns (the Math, Reading and Writing marks), and we use all of them for multivariate linear regression: the first 2 marks are the features used to predict the 3rd mark.

In [3]:
# Loading the CSV file
dataset=pandas.read_csv('./datasets/multi_regr_data.csv')
print(dataset.shape) #(data_number,feature_number)
dataset

(1000, 3)


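| | Math | Reading | Writing |
|---|---|---|---|
| 0 | 48 | 68 | 63 |
| 1 | 62 | 81 | 72 |
| 2 | 79 | 80 | 78 |
| 3 | 76 | 83 | 79 |
| 4 | 59 | 64 | 62 |
| ... | ... | ... | ... |
| 995 | 72 | 74 | 70 |
| 996 | 73 | 86 | 90 |
| 997 | 89 | 87 | 94 |
| 998 | 83 | 82 | 78 |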


# Split data into training and testing

In [4]:
# Split the data: the first 2 columns are the features and the 3rd column is the target.
X = dataset[list(dataset.columns)[:-1]]
#print(X.shape)
Y = dataset[list(dataset.columns)[-1]] 
#print(Y.shape)
# Split the data into training and testing sets (75% training, 25% testing, the train_test_split default)
xtrain,xtest,ytrain,ytest=train_test_split(X, Y, random_state=0)
print(xtrain.shape)
print(xtest.shape)

(750, 2)
(250, 2)
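
The 750/250 shapes above come from the default behaviour of `train_test_split`: when neither `test_size` nor `train_size` is given, 25% of the rows are held out. A minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are made-up stand-ins, not the lab dataset) that makes this explicit:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data: 1000 rows, 2 feature columns, 1 target.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 101, size=(1000, 2))
y_demo = rng.integers(0, 101, size=1000)

# With neither test_size nor train_size given, 25% of the rows go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(X_tr.shape, X_te.shape)

# Writing test_size=0.25 explicitly yields the same partition and documents the intent.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
print(np.array_equal(X_te, X_te2))
```

Passing `test_size` explicitly is optional here, but it keeps the split ratio visible in the code.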


# Use multivariate linear regression from a library

We will first see how multivariate linear regression can be implemented using already available functions from the scikit-learn library.

In [5]:
# sklearn functions implementation
def multilinearRegrPredict(xtrain, ytrain,xtest ):
    # Create linear regression object
    reg=LinearRegression()
    # Train the model using the training sets
    reg.fit(xtrain,ytrain)
    # Make predictions using the testing set
    y_pred = reg.predict(xtest)
    # See how well it works on the test data:
    # print one true target and its estimate.
    # Note: ytest is read from the notebook's global scope.
    print('For the true target: ',list(ytest)[-1])
    print('We predict as: ', list(y_pred)[-1])
    # reg.score(X, y) returns the coefficient of determination (R^2) of the
    # predictions on (X, y); it is not a classification accuracy.
    print("Overall Accuracy Score from library implementation:", reg.score(xtest, ytest))

    return y_pred

y_pred = multilinearRegrPredict(xtrain, ytrain, xtest )



For the true target:  25
We predict as:  20.603310452986545
Overall Accuracy Score from library implementation: 0.9112675801400184
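
The score printed above is the coefficient of determination, R², which is what `LinearRegression.score` returns for regressors. A small sketch on synthetic data (the arrays and coefficients below are illustrative assumptions, not the lab dataset) checking that `score` agrees with `sklearn.metrics.r2_score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score as sk_r2_score

# Synthetic regression problem with two features and some noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 2))
y_demo = 1.5 + 2.0 * X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=200)

# Fit on the first 150 rows, evaluate on the remaining 50.
model = LinearRegression().fit(X_demo[:150], y_demo[:150])
score_from_model = model.score(X_demo[150:], y_demo[150:])
score_from_metric = sk_r2_score(y_demo[150:], model.predict(X_demo[150:]))

# Both routes compute R^2 on the held-out rows.
print(np.isclose(score_from_model, score_from_metric))
```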


# Implement your own multivariate linear regression function 

Complete the `multiLinparamEstimates(xtrain, ytrain)` function, which estimates $\hat{\beta}$ as follows:

\begin{align}
\hat{\beta} & = \left(X^T X \right)^{-1} X^Ty
\end{align}
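
For reference, this estimator can be derived by minimizing the sum of squared residuals. A brief sketch, assuming $X$ already includes the column of ones for the intercept and that $X^T X$ is invertible:

```latex
\begin{align}
\mathrm{SSR}(\beta) & = (y - X\beta)^T (y - X\beta) \\
\nabla_{\beta}\,\mathrm{SSR}(\beta) & = -2 X^T (y - X\beta) = 0 \\
X^T X \hat{\beta} & = X^T y \\
\hat{\beta} & = \left(X^T X \right)^{-1} X^T y
\end{align}
```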

You are then asked to complete the `multilinearNEWRegrPredict(xtrain, ytrain,xtest)` function, or write your own, so that it returns the predicted output variable y given the input variables.

***Remember that this time we train on `xtrain` and `ytrain`!***
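
As an illustration of the formula, here is a minimal sketch on noiseless synthetic data (the arrays and the true coefficients 4, 1, 3 are assumptions made up for this example). It prepends a column of ones so that the first entry of beta acts as the intercept, and it compares the explicit inverse with `np.linalg.solve` and `np.linalg.lstsq`, which are generally preferred numerically:

```python
import numpy as np

# Noiseless synthetic data: y = 4 + 1*x1 + 3*x2, so the exact beta is known.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(100, 2))
y_demo = 4.0 + 1.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1]

# Prepend a column of ones; its coefficient is the intercept term.
X_aug = np.column_stack([np.ones(len(X_demo)), X_demo])

# Literal translation of the formula (explicit matrix inverse).
beta_inv = np.linalg.inv(X_aug.T @ X_aug) @ X_aug.T @ y_demo

# Solve the normal equations X^T X beta = X^T y without forming the inverse.
beta_solve = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_demo)

# Least-squares solver that works directly on X_aug.
beta_lstsq, *_ = np.linalg.lstsq(X_aug, y_demo, rcond=None)

print(np.round(beta_solve, 6))  # should be close to [4, 1, 3]
```

Without the column of ones, the fitted plane would be forced through the origin and the constant offset of 4 could not be recovered.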

In [26]:

def multiLinparamEstimates(xtrain, ytrain):  
    # Q: why do we need the 'intercept' column of ones?
    intercept = np.ones((xtrain.shape[0], 1))
    print(xtrain.shape)
    xtrain = np.concatenate((intercept, xtrain), axis=1)
    print(xtrain.shape)
    
    # Complete your code here.
    # beta = ...
    beta = np.linalg.inv(np.matmul(np.transpose(xtrain), xtrain))
    beta = np.matmul(beta, np.transpose(xtrain))
    beta = np.matmul(beta, ytrain)
    
    return beta

def multilinearNEWRegrPredict(xtrain, ytrain, xtest):
    beta = multiLinparamEstimates(xtrain, ytrain)
    # Complete your code here.
    # intercept = ...
    # xtest = ...
    # y_pred = ...

    # Add the same column of ones to the test features, then apply beta
    intercept = np.ones((xtest.shape[0], 1))
    xtest = np.concatenate((intercept, xtest), axis=1)
    y_pred = np.matmul(xtest, beta)

    return y_pred


# Model Evaluation - R2 Score
def r2_score(Y, Y_pred):
    mean_y = np.mean(Y)
    ss_tot = sum((Y - mean_y) ** 2)
    ss_res = sum((Y - Y_pred) ** 2)
    r2 = 1 - (ss_res / ss_tot)
    print("Accuracy Score from scratch implementation:", r2) 
    return r2


In [27]:
y_pred1 = multilinearNEWRegrPredict(np.array(xtrain.values), np.array(ytrain.values).flatten(),
                             np.array(xtest.values))
#print (y_pred1)
r2=r2_score(ytest, y_pred1)



(750, 2)
(750, 3)
Accuracy Score from scratch implementation: 0.9112675801400184
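
The from-scratch score matches the library score because both fit the same ordinary least-squares model. A hedged sketch on synthetic data (the arrays and coefficients are made-up assumptions) comparing a normal-equation estimate with the `intercept_` and `coef_` attributes of scikit-learn's `LinearRegression`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic noisy data with two features.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(300, 2))
y_demo = -2.0 + 0.7 * X_demo[:, 0] + 1.3 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

# Normal-equation estimate with an explicit intercept column.
X_aug = np.column_stack([np.ones(len(X_demo)), X_demo])
beta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_demo)

# scikit-learn stores the intercept and the slopes separately.
reg = LinearRegression().fit(X_demo, y_demo)
sk_beta = np.concatenate(([reg.intercept_], reg.coef_))

# The two parameter vectors agree up to floating-point error.
print(np.allclose(beta, sk_beta))
```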


# Sum of Squared Residuals

You are now asked to reuse the previous `SSR` function to compute the sum of squared residuals (SSR) of the predictions from both your own implementation and the library's implementation of multivariate linear regression.

In [29]:
def SSR(y_pred, yTest):
    # Complete your code here.
    #ssr = ...
    ssr = np.sum(np.square(np.subtract(y_pred, yTest)))
    
    return ssr

y_pred_SSR = SSR(y_pred, np.array(ytest.values).flatten())
#print(y_pred.shape)
#print(np.array(ytest.values).flatten().shape)
y_pred1_SSR = SSR(y_pred1, np.array(ytest.values).flatten())

print("Scikit-learn multivariate linear regression SSR: %.4f" % y_pred_SSR)
print("From scratch implementation of multivariate linear regression SSR: %.4f" % y_pred1_SSR)

Scikit-learn multivariate linear regression SSR: 4685.9382
From scratch implementation of multivariate linear regression SSR: 4685.9382
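
The SSR and the R² score reported earlier are directly related: R² = 1 - SSR / SS_tot, where SS_tot is the total sum of squares of the targets around their mean. A tiny worked example with hand-picked values (these four numbers are illustrative and unrelated to the lab data):

```python
import numpy as np

# Hand-picked targets and predictions, small enough to check by hand.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

# Sum of squared residuals: 1 + 0 + 1 + 0 = 2
ssr = np.sum((y_hat - y_true) ** 2)

# Total sum of squares around the mean (6): 9 + 1 + 1 + 9 = 20
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

# R^2 = 1 - 2/20 = 0.9
r2 = 1 - ssr / ss_tot
print(ssr, ss_tot, r2)
```

Since SS_tot depends only on the test targets, two models with equal SSR on the same test set necessarily have equal R², which is consistent with the identical values reported for both implementations.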
