#  Lab 1: Regression 
## Multivariate Regression

In this notebook we are going to implement multivariate regression. In particular, you will have to:

* Complete the function `multilinearNEWRegrPredict` to implement the multivariate regression algorithm.
* Use the previous `SSR` function on the estimates and the true labels.


# Import libraries

The required libraries for this notebook are pandas, sklearn and numpy.

In [1]:
# import libraries
import pandas
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load the data
The data come from ***multi_regr_data.csv***, which contains 1000 data points related to student marks. Each data point has 3 columns (marks), and we will use all of them for multivariate linear regression: the first 2 marks are the features used to predict the 3rd mark.

In [2]:
# Loading the CSV file
dataset=pandas.read_csv('./datasets/multi_regr_data.csv')
print(dataset.shape) #(data_number,feature_number)

(1000, 3)


# Split data into training and testing

In [3]:
# Split the data: use the first 2 columns as features and the 3rd column as the target.
X = dataset[list(dataset.columns)[:-1]]
#print(X.shape)
Y = dataset[list(dataset.columns)[-1]] 
#print(Y.shape)
# Split the data into training and testing sets (train_test_split defaults to 75% training, 25% testing)
xtrain,xtest,ytrain,ytest=train_test_split(X, Y, random_state=0)
print(xtrain.shape)
print(xtest.shape)

(750, 2)
(250, 2)
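
To make the 750/250 shapes above concrete, here is a minimal NumPy sketch of a shuffled 75%/25% split on row indices. The helper `shuffled_split` is hypothetical and is not scikit-learn's implementation; it only illustrates the idea of shuffling the rows and holding out a quarter of them.

```python
import numpy as np

def shuffled_split(n_rows, test_fraction=0.25, seed=0):
    """Hypothetical helper: return (train_idx, test_idx) after shuffling row indices."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)                    # random ordering of 0..n_rows-1
    n_test = int(np.ceil(test_fraction * n_rows))      # number of held-out rows
    return order[n_test:], order[:n_test]

train_idx, test_idx = shuffled_split(1000)
print(train_idx.shape, test_idx.shape)
```

Because the seed differs from scikit-learn's internal shuffling, the selected rows will generally not match those picked by `train_test_split(X, Y, random_state=0)`; only the sizes are comparable.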


# Use multivariate linear regression from a library

We will first see how multivariate linear regression can be implemented using already available functions from the scikit-learn library.

In [4]:
# sklearn functions implementation
def multilinearRegrPredict(xtrain, ytrain,xtest ):
    # Create linear regression object
    reg=LinearRegression()
    # Train the model using the training sets
    reg.fit(xtrain,ytrain)
    
    print(reg.coef_)
    # Make predictions using the testing set
    y_pred = reg.predict(xtest)
    # See how well it works on the test data:
    # print one true target and its estimate.
    # Note: ytest is read from the notebook's global scope.
    print('For the true target: ',list(ytest)[-1])
    print('We predict as: ', list(y_pred)[-1])
    # For a regressor, reg.score(X, y) returns the coefficient of determination R^2
    # on the given data, not a classification accuracy.
    print("Overall Accuracy Score from library implementation:", reg.score(xtest, ytest))

    return y_pred

y_pred = multilinearRegrPredict(xtrain, ytrain, xtest )



[0.08986326 0.92350445]
For the true target:  25
We predict as:  20.603310452986527
Overall Accuracy Score from library implementation: 0.9112675801400184


# Implement your own multivariate linear regression function 

You are supposed to complete the `multiLinparamEstimates(xtrain, ytrain)` function that estimates beta as follows:

\begin{align}
\hat{\beta} & = \left(X^T X \right)^{-1} X^Ty
\end{align}

You are asked to complete the `multilinearNEWRegrPredict(xtrain, ytrain, xtest)` function, or write your own, so that it returns the predicted output variable y for the given input variables.

***Remember that this time we train on `xtrain` and `ytrain`!***
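
As a sanity check on the formula above, the following sketch estimates beta on synthetic data (hypothetical, not the lab's CSV) in three ways: with the explicit inverse as written in the formula, with `np.linalg.solve`, and with `np.linalg.lstsq`. The variable names and the data-generating coefficients are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data (assumption): y = 1.5 + 2.0*x1 - 0.5*x2 + small noise
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
true_beta = np.array([1.5, 2.0, -0.5])

# Prepend a column of ones so the first coefficient acts as the intercept.
design = np.column_stack([np.ones(len(features)), features])
targets = design @ true_beta + rng.normal(scale=0.1, size=200)

# Normal equations exactly as in the formula: (X^T X)^{-1} X^T y
beta_inv = np.linalg.inv(design.T @ design) @ design.T @ targets

# Same linear system, solved without forming the inverse explicitly.
beta_solve = np.linalg.solve(design.T @ design, design.T @ targets)

# General least-squares solver applied directly to X and y.
beta_lstsq, *_ = np.linalg.lstsq(design, targets, rcond=None)

print(beta_inv)
print(np.allclose(beta_inv, beta_solve), np.allclose(beta_inv, beta_lstsq))
```

All three routes should agree on well-conditioned data like this. Avoiding the explicit inverse is usually the more numerically robust choice, but the inverse form mirrors the formula most directly.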

In [5]:

def multiLinparamEstimates(xtrain, ytrain):  
    # Q: why do we need the 'intercept' column?
    intercept = np.ones((xtrain.shape[0], 1))
    print(xtrain.shape)
    xtrain = np.concatenate((intercept, xtrain), axis=1)
    print(xtrain.shape)
    
    beta = np.matmul(np.matmul(np.linalg.inv(np.matmul(np.transpose(xtrain), xtrain)), 
                               np.transpose(xtrain)), ytrain)
    return beta

def multilinearNEWRegrPredict(xtrain, ytrain,xtest):
    beta = multiLinparamEstimates(xtrain, ytrain)
    print(beta)
    intercept = np.ones((xtest.shape[0], 1))
    xtest = np.concatenate((intercept, xtest), axis=1)
    y_pred = np.matmul(xtest, beta)
    return y_pred
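
One way to explore the question about the intercept raised in `multiLinparamEstimates` is to fit a toy dataset with and without the column of ones. The data below are hypothetical (an exact line y = 10 + 2x), and `np.linalg.lstsq` stands in for the normal-equation code above.

```python
import numpy as np

# Hypothetical toy data: one feature, exact line y = 10 + 2*x.
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 10.0 + 2.0 * x[:, 0]

# Without a column of ones the model is y = b1*x, so the fit must pass through the origin.
beta_no_intercept, *_ = np.linalg.lstsq(x, y, rcond=None)

# With a column of ones the first coefficient plays the role of the intercept b0.
x_with_ones = np.concatenate((np.ones((x.shape[0], 1)), x), axis=1)
beta_with_intercept, *_ = np.linalg.lstsq(x_with_ones, y, rcond=None)

print(beta_no_intercept)    # a single slope, distorted by the missing offset
print(beta_with_intercept)  # intercept and slope
```

With the column of ones the fit can recover both the offset and the slope; without it, the single coefficient has to absorb the offset and the predictions are biased.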



In [6]:
y_pred1 = multilinearNEWRegrPredict(np.array(xtrain.values), np.array(ytrain.values).flatten(),
                             np.array(xtest.values))
print(y_pred1)
# r2=r2_score(ytest, y_pred1)



(750, 2)
(750, 3)
[-2.02536054  0.08986326  0.92350445]
[62.83017328 59.49560853 77.0173213  66.72878943 37.49598038 71.74588438
 72.26981614 60.89330111 66.13986622 84.520092   67.06337067 62.17625862
 83.59658754 79.40350979 67.62742207 47.9891106  41.99876756 89.7666571
 68.1016102  44.07524663 87.29060535 57.67347144 55.19741969 89.85652037
 66.27947311 73.68275655 70.66752521 52.56651322 61.22788235 78.35002244
 96.47590624 62.47072022 50.19557256 63.84354099 65.71542171 74.65600463
 65.15137032 46.4365633  66.88364414 68.84538813 52.72136794 83.9560406
 96.54089769 63.30436141 61.1628909  73.68275655 54.3637785  96.86023111
 72.42467086 74.90072261 68.82051631 52.87622265 48.88774323 72.46479049
 63.18962633 66.58918254 65.87027643 95.61739324 62.02140391 64.29285731
 97.15469272 63.59882302 69.6541575  62.38085696 78.68460368 86.4970838
 62.83017328 60.174395   50.96422229 65.62555845 74.83573116 58.03292449
 61.43248069 31.1461843  57.04442859 64.31772913 83.00766433 67.7421571

# Sum of Squared Residuals

You are now asked to re-use the previous function in order to compute the SSR associated with the predictions delivered by your own or the library's implementation of multivariate linear regression.

In [7]:
def SSR(y_pred, yTest):
    # Sum of squared residuals between the true targets and the predictions
    residuals = np.subtract(yTest, y_pred)
    ssr = np.sum(np.multiply(residuals, residuals))
    return ssr

y_pred_SSR = SSR(y_pred, np.array(ytest.values).flatten())
#print(y_pred.shape)
#print(np.array(ytest.values).flatten().shape)
y_pred1_SSR = SSR(y_pred1, np.array(ytest.values).flatten())

print("Scikit-learn multivariate linear regression SSR: %.4f" % y_pred_SSR)
print("From scratch implementation of multivariate linear regression SSR: %.4f" % y_pred1_SSR)

Scikit-learn multivariate linear regression SSR: 4685.9382
From scratch implementation of multivariate linear regression SSR: 4685.9382
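
The SSR is also the ingredient behind the score printed by the scikit-learn cell: for a regressor, `score` reports the coefficient of determination, R^2 = 1 - SSR/SST, where SST is the total sum of squares of the targets around their mean. A minimal sketch on made-up numbers follows; the helper names `ssr` and `r2_from_ssr` and the demo arrays are assumptions for illustration.

```python
import numpy as np

def ssr(y_pred, y_true):
    """Sum of squared residuals."""
    residuals = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sum(residuals ** 2))

def r2_from_ssr(y_pred, y_true):
    """Coefficient of determination computed as 1 - SSR/SST."""
    y_true = np.asarray(y_true)
    sst = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    return 1.0 - ssr(y_pred, y_true) / sst

# Made-up targets and predictions, not taken from the lab data.
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 9.5])

print(ssr(y_pred_demo, y_true_demo))
print(r2_from_ssr(y_pred_demo, y_true_demo))
```

Since both implementations in this notebook produce the same SSR on the same test targets, they would also share the same R^2 under this definition.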
