<h1 style="text-align: center;"><a title="Data Science-AIMS-Cmr-2021-22">Linear Regression (Practicals)</a></h1>


Credits:
* This material is adapted from Emmanuel Dufourq's SSMDS 2019, AIMS SA


* This notebook will make use of Scikit-learn's Linear Regression class. The documentation is found here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html


* The notebooks on linear regression are modifications of various sources, which include:
  * https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html
  * https://medium.com/analytics-vidhya/linear-regression-using-python-ce21aa90ade6
  * https://www.kdnuggets.com/2019/03/beginners-guide-linear-regression-python-scikit-learn.html/2

## Various Python imports

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import train_test_split

## Load the dataset

In [None]:
dataset = pd.read_csv('https://drive.google.com/uc?export=download&id=1AWNAffJ6dG4ZGsLDn5T6snCSvk08IpxE')


## Take a look at what has been downloaded

In [None]:
dataset.head(6)

## Shuffle the data around

In [None]:
dataset = dataset.sample(frac=1)
dataset.head()
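
`sample(frac=1)` returns all rows in random order, but it keeps the original index labels and the order changes on every run. A minimal sketch on a made-up toy frame (the column values and the seed 42 are illustrative assumptions, not part of this dataset) shows how `random_state` makes the shuffle repeatable and how `reset_index(drop=True)` renumbers the rows:

```python
import pandas as pd

# Toy data standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({"feature": [10, 20, 30, 40, 50], "target": [1, 2, 3, 4, 5]})

# Same seed -> same row order on every run
shuffled_a = toy.sample(frac=1, random_state=42)
shuffled_b = toy.sample(frac=1, random_state=42)

# The shuffled frame keeps the old index labels; reset them for a clean 0..n-1 index
shuffled_clean = shuffled_a.reset_index(drop=True)

print(shuffled_a.equals(shuffled_b))
print(list(shuffled_clean.index))
```

Fixing the seed is optional here, but it makes later results easier to compare between runs.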

## Extract the data from the Pandas dataframe into X and Y

In this case we know the target column is called `target`, so to get the values of X we can use the built-in pandas `drop` method.

In [None]:
X = dataset.drop('target', axis=1).values
Y = dataset['target'].values

In [None]:
X[:, [0,-1]]
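
The indexing above selects the first and last feature columns. As a hedged sketch on a made-up frame (the column names `f1` and `f2` are assumptions for illustration), the following shows the array shapes produced by `drop(...).values`, by selecting a single column, and by `reshape(-1, 1)`, which the single-feature function below relies on:

```python
import pandas as pd

# Toy frame with two features and a target (illustrative values only)
toy = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "f2": [4.0, 5.0, 6.0], "target": [7.0, 8.0, 9.0]})

X_toy = toy.drop("target", axis=1).values   # 2-D array: (n_samples, n_features)
Y_toy = toy["target"].values                # 1-D array: (n_samples,)

single = X_toy[:, 0]               # selecting one column gives a 1-D array
single_2d = single.reshape(-1, 1)  # reshape into a one-column 2-D matrix

print(X_toy.shape, Y_toy.shape, single.shape, single_2d.shape)
```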

The code below fits a separate regression model on each feature in turn and reports, for each model, the MSE, MAE and RMSE on the test split.


In [None]:
def FeaturesAndScores(size_of_features):
    # Fit one single-feature model per column and collect its test-set errors
    scores = []
    for i in range(size_of_features):
        X_train, X_test, Y_train, Y_test = train_test_split(X[:, i], Y, random_state=0)
        # scikit-learn expects a 2-D feature matrix, even for a single feature
        X_train = X_train.reshape(-1, 1)
        X_test = X_test.reshape(-1, 1)
        model1 = LinearRegression()
        model1.fit(X_train, Y_train)
        Y_pred = model1.predict(X_test)
        mse = mean_squared_error(Y_test, Y_pred)
        scores.append((mse, mean_absolute_error(Y_test, Y_pred), np.sqrt(mse)))
    df = pd.DataFrame(scores, columns=['MSE', 'MAE', 'RMSE'])
    # Label each row with its feature name (same column order as X)
    df.index = list(dataset.drop('target', axis=1).columns)[:size_of_features]
    return df
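
The three scores reported by this function are closely related. A small worked sketch with made-up numbers (not taken from the dataset) computes them directly with NumPy, following the same definitions that `mean_squared_error` and `mean_absolute_error` use with their default settings:

```python
import numpy as np

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)     # mean squared error
mae = np.mean(np.abs(errors))  # mean absolute error
rmse = np.sqrt(mse)            # root mean squared error

print(mse, mae, rmse)
```

RMSE is the square root of MSE, so it is expressed in the same units as the target, which makes it easier to compare with MAE.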

The code below grows the feature matrix one feature at a time: at each iteration it fits a regression model on the first `i` features and records the MSE, MAE and RMSE of that model.


In [None]:
def FeaturesAndScores_Acc(size_of_features):
    
    scores = []
    for i in range( 1,size_of_features):
        X_train , X_test , Y_train , Y_test = train_test_split(X[:,:i] , Y , random_state=0)
#         X_train = X_train.reshape(-1, 1)
#         X_test = X_test.reshape(-1, 1)
        model2 = LinearRegression()
        model2.fit(X_train, Y_train)
        Y_pred = model2.predict(X_test)
        scores.append(  ( mean_squared_error(Y_test, Y_pred) , 
                        mean_absolute_error(Y_test, Y_pred), \
                        np.sqrt(mean_squared_error(Y_test, Y_pred)) ))
    df = pd.DataFrame(scores)
    df.columns = [  'MSE' , 'MAE' , 'RMSE']
    #df.index = list(dataset.columns)[:-1]  
    return df

In [None]:
FeaturesAndScores_Acc(11)

It is often good practice to normalize your features before training a model. Here, you will learn how to build a pipeline that combines the normalization step and the model in a single estimator.

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_train , X_test , Y_train , Y_test = train_test_split(X , Y ,test_size=0.25, random_state=1)

model = make_pipeline(StandardScaler(with_mean=False), LinearRegression())
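
The pipeline above is only constructed; it still has to be fitted before it can predict. As a minimal sketch on synthetic data (the data, the coefficients and the name `demo_model` are assumptions for illustration, chosen so as not to overwrite `model`), fitting the pipeline learns the scaling statistics from the training split only and reuses them at prediction time:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])
Y_demo = X_demo @ np.array([2.0, -0.5, 0.03]) + rng.normal(scale=0.1, size=200)

Xd_train, Xd_test, Yd_train, Yd_test = train_test_split(
    X_demo, Y_demo, test_size=0.25, random_state=1
)

# fit() runs the scaler on the training data, then fits the regression on the scaled features
demo_model = make_pipeline(StandardScaler(with_mean=False), LinearRegression())
demo_model.fit(Xd_train, Yd_train)

# make_pipeline names each step after its lowercased class name
scaler = demo_model.named_steps["standardscaler"]
print(scaler.scale_)  # per-feature standard deviation of the training split

# predict() applies the stored scaling to the test data before the regression
Yd_pred = demo_model.predict(Xd_test)
print(mean_absolute_error(Yd_test, Yd_pred))
```

With `with_mean=False` the features are divided by their standard deviation but not centered.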


## Task 1:


1. Adjust the functions `FeaturesAndScores` and `FeaturesAndScores_Acc` so that they use the pipeline, and train your models under the new setup.

2. Use your knowledge from previous notebooks to build a regression model between the given inputs and outputs.


In [None]:
## YOUR CODE HERE




This section is up to you: add code and solve the problem. The code below downloads another test set, on which you will make predictions. When you are satisfied with the performance of your model, submit your final score online for comparison.


## Download another test set for evaluation

Do not change the code below.

In [None]:
dataset_evaluation = pd.read_csv('https://drive.google.com/uc?export=download&id=10pz4QqiTnxTXouvJulW8h9z71ojSJ2Xk')
X_evaluation = dataset_evaluation.drop('target', axis=1).values
Y_evaluation = dataset_evaluation['target'].values

## Store your model predictions here

Do not change the name `Y_eval_prediction`.

Make sure your linear regression model variable is called `model`, otherwise the code below will not work.

In [None]:
X_evaluation.shape

In [None]:
Y_eval_prediction = model.predict(X_evaluation)

## Evaluate your model on this test data

Do not change the code.

In [None]:
mean_absolute_error(Y_evaluation, Y_eval_prediction)