# Introduction to Regression: Linear Regression with sklearn

In this introduction, we will develop linear regression from basic principles.  Other tutorials will forgo the theory and focus on existing Python libraries that are commonly used for building regression models.

While it is fun to build up a regression model by hand, there are packages in place that take care of the dirty work.

According to the website, scikit-learn (https://scikit-learn.org/stable/) is a collection of _"simple and efficient tools for data mining and data analysis."_  We will be using the linear regression tools in this tutorial.

In [None]:

import pandas as pd  # Pandas is a package which creates data frames
import numpy as np # Numpy is the package which creates/manages/operates on numerical data

from sklearn import linear_model # Notice the new syntax here ... it lets you load just a subset of a package's functionality
from sklearn import metrics

import matplotlib.pyplot as plt # Matplotlib is the plotting library

# This allows multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## The Data

This tutorial is a subset of the tutorial developed by _RishiSD_ on GitHub (https://github.com/RishiSD/Linear-regression-on-Swedish-Auto-Insurance-dataset).

To quote his tutorial:  _"The Swedish Auto Insurance Dataset involves predicting the total payment for all claims in thousands of Swedish Kronor, given the total number of claims."_

X: Number of claims. <br>
Y: Total payment for all claims in thousands of Swedish Kronor.


In [None]:
# Pull the data directly from GitHub
# Note: reading the older .xls format with pandas requires the xlrd package
swede = 'https://github.com/RishiSD/Linear-regression-on-Swedish-Auto-Insurance-dataset/raw/master/slr06.xls'
data = pd.read_excel(swede)

data.head()

_ = plt.scatter(data.X,data.Y)
_ = plt.title('Scatter plot between feature and Label')
plt.show()

## Running the model

With sklearn, model development is much more straightforward.

First, select the model with: linear_model.LinearRegression() <br>
(https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#) <br>
This creates an object which specifies that you are running a linear regression.

Second, fit the model to the data with the .fit() method.

Third, run the fitted model on your input with the .predict() method to get your estimates (which were called $\bar{y}$ in the previous tutorials).

Fourth, evaluate the results with the metrics.mean_squared_error() function.
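
As a minimal sketch of these four steps, the example below uses a small synthetic dataset. This is an assumption made here so that the true slope and intercept are known; it is not the Swedish insurance data. Note that scikit-learn expects the inputs to be two-dimensional, with one row per sample and one column per feature, which is why the one-dimensional array is reshaped.

```python
import numpy as np
from sklearn import linear_model, metrics

# Synthetic, noise-free data (illustrative only): y = 3x + 2
x = np.arange(10, dtype=float)
y = 3.0 * x + 2.0

# scikit-learn expects X with shape (n_samples, n_features)
X = x.reshape(-1, 1)

reg = linear_model.LinearRegression()        # 1. select the model
reg.fit(X, y)                                # 2. fit it to the data
y_pred = reg.predict(X)                      # 3. predict on the inputs
mse = metrics.mean_squared_error(y, y_pred)  # 4. evaluate the fit

print('slope: %f, intercept: %f, MSE: %f' % (reg.coef_[0], reg.intercept_, mse))
```

Because the data contain no noise, the fitted slope and intercept should match the values used to generate them, and the error should be essentially zero.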



In [None]:
# sklearn.linear_model can be a bit finicky
# X: should be a dataframe containing the independent variables
# Y: should be a vector of the outputs

X = pd.DataFrame(data.X)
Y = data.Y

reg = linear_model.LinearRegression()
_ = reg.fit(X,Y)
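# Use bracket notation to add a column; attribute assignment (data.Y_ = ...) does not create one
data['Y_'] = reg.predict(X)

# Look how tricky I am, I switched from mse to rmse
# mean_squared_error expects the arguments in the order (y_true, y_pred)
mse = metrics.mean_squared_error(data.Y,data.Y_)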
print('RMSE for Training set : %f' % (np.sqrt(mse)))

_ = plt.scatter(data.X,data.Y,color='r')
_ = plt.scatter(data.X,data.Y_,color='g')
_ = plt.plot(data.X,data.Y_,color='b')
plt.show()
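
Since the earlier tutorials built the regression by hand, it can be reassuring to check that sklearn arrives at the same line. The sketch below uses made-up data (not the insurance dataset) and the standard closed-form least-squares formulas for a single feature; the variable names are illustrative only.

```python
import numpy as np
from sklearn import linear_model

# Made-up data for illustration: a noisy line
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=50)
y = 3.4 * x + 20.0 + rng.normal(0, 10, size=50)

# Closed-form simple linear regression:
#   slope = cov(x, y) / var(x),  intercept = mean(y) - slope * mean(x)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# The same fit with sklearn; the fitted parameters live in coef_ and intercept_
reg = linear_model.LinearRegression().fit(x.reshape(-1, 1), y)

print('by hand: slope=%f intercept=%f' % (slope, intercept))
print('sklearn: slope=%f intercept=%f' % (reg.coef_[0], reg.intercept_))
```

The two printed lines should agree up to floating-point rounding.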


## Moving on to multiple variables:  Multiple linear regression

We still have one output variable, but now you can have multiple input variables.

Fortunately, sklearn has some built-in datasets to use for demonstrating.

Portions of this tutorial were derived from:
https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155

In [None]:
# The Boston housing dataset contains several variables which can be used in a regression model
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an earlier version
from sklearn.datasets import load_boston
boston = load_boston()

boston

In [None]:
# Unfortunately, the data doesn't look much like what we are used to.
# It's not in a table, so we have to create the table. It's a dictionary-like object (a Bunch)

boston.keys()
print(boston.DESCR)
boston.feature_names

In [None]:
# Creating new dataframes is fairly straightforward:
bostonDF = pd.DataFrame(boston.data, columns=boston.feature_names)

# As noted in the description, the goal is to predict median home value.
# By default, that isn't in our dataframe, so let's add it:
bostonDF['MEDV'] = boston.target

bostonDF.head()
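
`load_boston` is not available in recent scikit-learn releases (it was removed in version 1.2). As a hedged sketch, the same pattern of turning a built-in dataset into a dataframe is shown here with `load_diabetes`, a different built-in dataset swapped in purely for illustration. The column name `target` is chosen here and is not part of the dataset.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()

# The features arrive as a NumPy array; feature_names supplies the column labels
diabetesDF = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# The target is stored separately, so add it as its own column
diabetesDF['target'] = diabetes.target

print(diabetesDF.shape)
print(diabetesDF.head())
```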


In [None]:
# Let's see what the data looks like graphically.
# Choose a couple of variables to look at:
# (fill in two column names from boston.feature_names as strings, e.g. 'LSTAT' and 'RM')

feature1 = _______
feature2 = _______

features = [feature1, feature2]
target = bostonDF['MEDV']



_ = plt.figure(figsize=(20, 5))

# This loops over the features that we want to see
for i, col in enumerate(features):
    # It creates a subplot for each feature
    _ = plt.subplot(1, len(features) , i+1)
    x = bostonDF[col]
    y = target
    _ = plt.scatter(x, y, marker='o')
    _ = plt.title(col)
    _ = plt.xlabel(col)
    _ = plt.ylabel('MEDV')
    
plt.show()

In [None]:
# Let's see what happens if we only look at one independent variable

# Remember that the independent variables are stored in a dataframe because there could be several
X = bostonDF[[feature1]]
# Whereas the dependent variable is a single column (a pandas Series)
Y = bostonDF['MEDV']

reg = linear_model.LinearRegression()
_ = reg.fit(X,Y)
Y_ = reg.predict(X)

# Look how tricky I am, I switched from mse to rmse
mse = metrics.mean_squared_error(Y,Y_)
# Label the output with whichever feature was chosen above
print('RMSE for %s variable : %f' % (feature1, np.sqrt(mse)))
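
The switch from MSE to RMSE is just a square root, which puts the error back in the units of the target. A small sketch with made-up numbers (not from the housing data) computes both by hand and checks them against `metrics.mean_squared_error`:

```python
import numpy as np
from sklearn import metrics

# Made-up observed and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

# MSE: the mean of the squared residuals
mse_by_hand = np.mean((y_true - y_pred) ** 2)
# RMSE: its square root, in the same units as y
rmse_by_hand = np.sqrt(mse_by_hand)

mse = metrics.mean_squared_error(y_true, y_pred)
print('MSE : %f' % mse)
print('RMSE: %f' % np.sqrt(mse))
```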



In [None]:
# Now for the other variable

# Remember that the independent variables are stored in a dataframe because there could be several
X = bostonDF[[feature2]]
# Whereas the dependent variable is a single column (a pandas Series)
Y = bostonDF['MEDV']

reg = linear_model.LinearRegression()
_ = reg.fit(X,Y)
Y_ = reg.predict(X)

# Look how tricky I am, I switched from mse to rmse
mse = metrics.mean_squared_error(Y,Y_)
# Label the output with whichever feature was chosen above
print('RMSE for %s variable : %f' % (feature2, np.sqrt(mse)))



In [None]:
# Now use both variables together
# Remember that the independent variables are stored in a dataframe because there could be several
X = bostonDF[[feature1, feature2]]
# Whereas the dependent variable is a single column (a pandas Series)
Y = bostonDF['MEDV']

reg = linear_model.LinearRegression()
_ = reg.fit(X,Y)
Y_ = reg.predict(X)

# Look how tricky I am, I switched from mse to rmse
mse = metrics.mean_squared_error(Y,Y_)
print('RMSE for Training set : %f' % (np.sqrt(mse)))


<font color='red'>
# Try a few more models
# Do any of the univariate models do better?
</font>

In [None]:
# Ok, let's go for it and use every feature:

# Every column except the target is an independent variable
X = bostonDF.drop(columns='MEDV')
# Whereas the dependent variable is a single column (a pandas Series)
Y = bostonDF['MEDV']

reg = linear_model.LinearRegression()
_ = reg.fit(X,Y)
Y_ = reg.predict(X)

# Look how tricky I am, I switched from mse to rmse
mse = metrics.mean_squared_error(Y,Y_)
print('RMSE for Training set : %f' % (np.sqrt(mse)))
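
One thing to keep in mind when comparing these numbers: on the training set, an ordinary least-squares model with more features cannot have a higher RMSE than the same model fit on a subset of those features. The sketch below illustrates this with synthetic data (the column names `a`, `b`, and `noise` and the helper `training_rmse` are hypothetical); even a column of pure noise does not increase the training RMSE.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model, metrics

rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    'a': rng.normal(size=n),
    'b': rng.normal(size=n),
    'noise': rng.normal(size=n),   # unrelated to the target
})
y = 2.0 * df['a'] - 1.0 * df['b'] + rng.normal(scale=0.5, size=n)

def training_rmse(columns):
    """Fit a linear regression on the given columns and return its training RMSE."""
    X = df[columns]
    reg = linear_model.LinearRegression().fit(X, y)
    return np.sqrt(metrics.mean_squared_error(y, reg.predict(X)))

rmse_a = training_rmse(['a'])
rmse_ab = training_rmse(['a', 'b'])
rmse_all = training_rmse(['a', 'b', 'noise'])

print('a only        : %f' % rmse_a)
print('a and b       : %f' % rmse_ab)
print('a, b and noise: %f' % rmse_all)
```

This is why a lower training RMSE on its own does not show that the extra features are useful.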


