# Simple Linear Regression

If you are starting out in machine learning, linear regression models are likely the first predictive models you will learn. Regression models estimate the nature of the relationship between independent and dependent variables. Although they are conceptually simple, they have some key features that make them flexible, powerful and easy to explain.

While newer and conceptually more complicated models can outperform linear regression, linear models are still widely used, especially where data collection is expensive and highly interpretable models are of considerable value. Extensions to linear regression such as ridge and lasso can help to avoid overfitting in feature-rich models, and lasso can even perform feature selection. Logistic regression adapts the linear framework to classification problems.

This is a very simple demo of a vanilla linear regression model (LRM). Let’s look at how plain-vanilla linear regression works.

- ***weatherww2 Dataset for regression analysis***

You can find detailed data [here][1].


[1]: https://www.kaggle.com/smid80/weatherww2/data


- ***Goal***
  - Find a relationship between minimum and maximum temperature.
  - Predict the maximum temperature given the minimum temperature.

## Reading Data and Extracting the Variables

In [None]:
# Some useful packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Importing dataset
df = pd.read_csv('../input/weatherww2/Summary of Weather.csv')

# Selecting min and max temperature columns
df = df[['MinTemp', 'MaxTemp']]
df.head()

To experiment with simple linear regression, two features, `MinTemp` and `MaxTemp`, are extracted from this dataset.

Let's now visualise our target and predictor variables.

In [None]:
# Scatter plot
plt.figure(figsize=(10, 5))
plt.scatter(df['MinTemp'], df['MaxTemp'],s=10)
plt.xlabel('Min Temperature °C',fontsize=15)
plt.ylabel('Max Temperature °C',fontsize=15)
plt.show()

The graph above shows the scatter of data points for the dependent variable, maximum temperature, against the independent variable, minimum temperature. With one predictor variable the points roughly follow a line, but a few data points deviate from the general trend.

In linear regression, outliers can greatly affect the fit (the slope, r-value, and r-squared). It may be best to remove them before fitting.
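To make the effect of an outlier concrete, here is a minimal sketch on synthetic data (not the weather dataset). The data, the outlier position, and the use of `np.polyfit` for the fit are all assumptions chosen for illustration.

```python
import numpy as np

# Synthetic, perfectly linear data (illustrative only, not the weather dataset)
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# Fit a degree-1 polynomial; np.polyfit returns [slope, intercept]
slope_clean, intercept_clean = np.polyfit(x, y, 1)

# Append one hypothetical outlier far below the trend at the right edge
x_out = np.append(x, 9.0)
y_out = np.append(y, -40.0)
slope_out, intercept_out = np.polyfit(x_out, y_out, 1)

print(f"slope without outlier: {slope_clean:.3f}")
print(f"slope with outlier:    {slope_out:.3f}")
```

A single extreme point is enough to pull the fitted slope away from the trend followed by the other ten points.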

- Outlier treatment

In [None]:
# Drop anomalous data points
df.drop(df[(df['MinTemp'] < -15) & (df['MaxTemp'] > 15)].index, inplace = True)
df.drop(df[(df['MinTemp'] > 8) & (df['MaxTemp'] < -15)].index, inplace = True)

# Scatter plot after removing the anomalous data points
plt.figure(figsize=(10, 5))
plt.scatter(df['MinTemp'], df['MaxTemp'],s=10)
plt.xlabel('Min Temperature °C',fontsize=15)
plt.ylabel('Max Temperature °C',fontsize=15)
plt.show()

This is, like every real world data set, a little noisy, but there’s clearly a trend: as we increase x, y increases as well. Perhaps this relationship can be well estimated with a line. Let’s get a sense for how the model works.

## Performing Simple Linear Regression

A linear model attempts to find the simplest possible relationship between a feature variable and the output. This is often described as ‘fitting a line’.

Equation of linear regression<br>
$y = c + m_1x_1 + m_2x_2 + ... + m_nx_n$

-  $y$ is the response
-  $c$ is the intercept
-  $m_1$ is the coefficient for the first feature
-  $m_n$ is the coefficient for the nth feature<br>

For each unit we increase x, y increases by m units (or decreases if m is negative). The term c is an intercept term which shifts our line up or down without changing the slope.

In our case:

$MaxTemp = c + m_1 \times MinTemp$

The $m$ values are called the model **coefficients** or **model parameters**.

In [None]:
# Scatter plot with a few possible regression lines
plt.figure(figsize=(10, 5))
plt.scatter(df['MinTemp'], df['MaxTemp'],s=10)
plt.xlabel('Min Temperature °C',fontsize=15)
plt.ylabel('Max Temperature °C',fontsize=15)

x1 = [-38,35]
y1 = [-32,53]
plt.plot(x1, y1, color='orange')

x2 = [-38,35]
y2 = [-27,44]
plt.plot(x2, y2, color='red')
plt.show()

Which line seems to capture the trend best? That is not necessarily clear. The orange line seems to be closest to the points on the left, but as we approach the center of the distribution, it is harder to tell. On the right side, the orange line may overshoot the points and sit too high. But how do we choose which line?
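One common way to compare candidate lines is the sum of squared errors (SSE): the squared vertical distance from each point to the line, added up over all points. The sketch below scores the orange and red lines using the endpoints from the plot. The data are synthetic stand-ins generated from an assumed trend, so the winner here says nothing about the real dataset, and `line_through` and `sum_squared_errors` are hypothetical helper names.

```python
import numpy as np

def line_through(p1, p2):
    """Return (slope, intercept) of the line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept

def sum_squared_errors(x, y, slope, intercept):
    """Sum of squared vertical distances between the points and the line."""
    residuals = y - (intercept + slope * x)
    return float(np.sum(residuals ** 2))

# Synthetic data from an assumed trend (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(-38, 35, size=500)
y = 10.0 + 0.9 * x + rng.normal(0.0, 4.0, size=500)

# Candidate lines defined by the endpoints used in the plot
orange = line_through((-38, -32), (35, 53))
red = line_through((-38, -27), (35, 44))

sse_orange = sum_squared_errors(x, y, *orange)
sse_red = sum_squared_errors(x, y, *red)

# The least-squares line minimises the SSE, so no other line can score lower
slope_ols, intercept_ols = np.polyfit(x, y, 1)
sse_ols = sum_squared_errors(x, y, slope_ols, intercept_ols)

print(f"SSE orange: {sse_orange:.1f}")
print(f"SSE red:    {sse_red:.1f}")
print(f"SSE OLS:    {sse_ols:.1f}")
```

Ordinary least squares turns "which line is best?" into a well-defined question: pick the slope and intercept with the smallest SSE.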

### Generic Steps in model building using `statsmodels`

We first assign the feature variable, `MinTemp`, in this case, to the variable `X` and the response variable, `MaxTemp`, to the variable `y`.

In [None]:
X = df['MinTemp']
y = df['MaxTemp']

#### Train-Test Split

We now need to split our variables into training and testing sets. We do this by importing `train_test_split` from the `sklearn.model_selection` module. A common convention is to keep 70% of the data in the training set and the remaining 30% in the test set.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state = 100)
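Conceptually, a random split shuffles the row indices once and cuts them at the 70% mark. The sketch below illustrates that idea with NumPy only. `simple_train_test_split` is a hypothetical helper, and it will not reproduce the exact rows that `train_test_split` selects for the same seed.

```python
import numpy as np

def simple_train_test_split(x, y, train_size=0.7, seed=100):
    """Shuffle the row indices once, then cut them into train and test parts."""
    x = np.asarray(x)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))
    n_train = int(round(train_size * len(x)))
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

# Toy data: ten rows, so a 70/30 split gives seven and three rows
x = np.arange(10)
y = x * 2
x_tr, x_te, y_tr, y_te = simple_train_test_split(x, y)
print(len(x_tr), len(x_te))
```

Because the same shuffled indices are applied to both arrays, each feature value stays paired with its own response.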

In [None]:
# Let's now take a look at the train dataset

X_train.head()

In [None]:
y_train.head()

#### Building a Linear Model

First, we need to import the `statsmodels.api` module, which we will use to perform the linear regression.

In [None]:
import statsmodels.api as sm

By default, `statsmodels` fits a line that passes through the origin. To include an intercept, we need to add a constant column explicitly using the `add_constant` function of `statsmodels`. Once we've added the constant to our `X_train` dataset, we can fit a regression line using the `OLS` (Ordinary Least Squares) class of `statsmodels`, as shown below.

In [None]:
# Add a constant to get an intercept
X_train_sm = sm.add_constant(X_train)

# Fit the regression line using 'OLS'
lr = sm.OLS(y_train, X_train_sm).fit()
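To see what adding a constant does, the following sketch reproduces the idea with NumPy on synthetic data: a column of ones is placed next to the feature, and the least-squares solution then contains both an intercept and a slope. This is an illustration under assumed data, not the internals of `statsmodels`.

```python
import numpy as np

# Synthetic data from an assumed line with noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(-30, 30, size=200)
y = 10.0 + 0.9 * x + rng.normal(0.0, 3.0, size=200)

# Equivalent of adding a constant: a column of ones next to the feature
X_design = np.column_stack([np.ones_like(x), x])

# Least-squares solution; beta[0] is the intercept, beta[1] is the slope
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, slope = beta

# Without the column of ones, the fitted line is forced through the origin
beta_origin, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)
slope_origin = beta_origin[0]

print(f"with constant:    intercept={intercept:.2f}, slope={slope:.2f}")
print(f"without constant: slope={slope_origin:.2f} (intercept fixed at 0)")
```

The intercept is simply the coefficient of the all-ones column, which is why it has to be present in the design matrix.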

In [None]:
# Print the parameters, i.e. the intercept and the slope of the regression line fitted
lr.params

In [None]:
# Performing a summary operation lists out all the different parameters of the regression line fitted
print(lr.summary())

####  Looking at some key statistics from the summary

The values we are concerned with are:
1. The coefficients and significance (p-values)
2. R-squared and Adjusted R-squared 
3. F statistic and its significance

##### 1. The coefficient for MinTemp has a very low p-value
The coefficient is statistically significant, so the association is unlikely to be due to chance alone.

##### 2. R-squared and Adjusted R-squared are 0.778
This means that 77.8% of the variance in `MaxTemp` is explained by `MinTemp`.

This is a decent R-squared value.

With only one independent variable and many observations, adjusted R-squared is practically the same as R-squared.
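For reference, both quantities can be computed by hand. The sketch below uses the standard definitions on a tiny made-up example. The helper names are hypothetical, and the sample size of 80,000 in the last line is an assumed round number, not the actual size of the training set.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 minus the ratio of residual variation to total variation."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_features):
    """Penalise R-squared for the number of features used."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Tiny made-up example
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.0])

r2 = r_squared(y_true, y_pred)
adj_small = adjusted_r_squared(r2, n_samples=5, n_features=1)

# With one feature and a large (assumed) sample, the penalty is negligible
adj_large = adjusted_r_squared(0.778, n_samples=80_000, n_features=1)

print(r2, adj_small, adj_large)
```

The penalty factor `(n - 1) / (n - p - 1)` approaches 1 as the sample grows, which is why the two values agree to three decimals in the summary.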

##### 3. F-statistic has a very low p-value (practically zero)
This means that the model fit is statistically significant, and the explained variance is unlikely to be due to chance alone.

The fit is significant. Let's visualize how well the model fit the data.

From the parameters that we get, our linear regression equation becomes:

$ MaxTemp = 10.67 + 0.92 \times MinTemp $
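As a quick worked example, the equation can be applied directly to a few minimum temperatures. The sketch uses the rounded coefficients from the equation above. `predict_max_temp` is a hypothetical helper, and the input values are arbitrary.

```python
def predict_max_temp(min_temp, intercept=10.67, slope=0.92):
    """Apply the fitted line MaxTemp = intercept + slope * MinTemp."""
    return intercept + slope * min_temp

# Arbitrary example inputs in degrees Celsius
for min_temp in (0.0, 10.0, 20.0):
    print(min_temp, round(predict_max_temp(min_temp), 2))
```

Each additional degree of minimum temperature raises the predicted maximum by 0.92 degrees, and a minimum of 0 °C gives the intercept itself.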

In [None]:
# Best fit line
plt.figure(figsize=(10, 5))
plt.scatter(X_train, y_train, s=10)
plt.plot(X_train, 10.6760 + 0.9201*X_train, 'r')
plt.xlabel('Min Temperature °C',fontsize=15)
plt.ylabel('Max Temperature °C',fontsize=15)
plt.show()

## Residual analysis 
Residual analysis helps validate the assumptions of the model, and hence the reliability of inference drawn from it.

#### Distribution of the error terms
We need to check whether the error terms are normally distributed, which is in fact one of the major assumptions of linear regression. Let us plot a histogram of the error terms and see what it looks like.

In [None]:
y_train_pred = lr.predict(X_train_sm)
res = (y_train - y_train_pred)

In [None]:
fig = plt.figure(figsize=(8, 4))
sns.histplot(res, bins = 15, kde = True, stat = 'density')
fig.suptitle('Error Terms', fontsize = 15)                  # Plot heading 
plt.xlabel('y_train - y_train_pred', fontsize = 15)         # X-label
plt.show()

The residuals are approximately normally distributed with a mean of 0. All good!
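A histogram can be complemented with a few summary numbers. The sketch below fits a line to synthetic data with Gaussian noise and reports the mean, skewness, and excess kurtosis of the residuals, all of which should be close to 0 for normally distributed errors. The data and the helper functions are assumptions for illustration.

```python
import numpy as np

def skewness(values):
    """Third standardised moment; 0 for a symmetric distribution."""
    centered = values - values.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)

def excess_kurtosis(values):
    """Fourth standardised moment minus 3; 0 for a normal distribution."""
    centered = values - values.mean()
    return float(np.mean(centered ** 4) / np.mean(centered ** 2) ** 2 - 3.0)

# Synthetic data with Gaussian noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(-30, 30, size=2000)
y = 10.0 + 0.9 * x + rng.normal(0.0, 3.0, size=2000)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

print(f"mean:            {residuals.mean():.4f}")
print(f"skewness:        {skewness(residuals):.4f}")
print(f"excess kurtosis: {excess_kurtosis(residuals):.4f}")
```

Note that a residual mean of zero is guaranteed whenever the model includes an intercept, so it is the shape statistics that carry the information about normality.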

#### Looking for patterns in the residuals

In [None]:
plt.figure(figsize=(8, 4))
plt.scatter(X_train,res)
plt.show()

We are confident that the model fit isn't by chance, and has decent predictive power. The normality of residual terms allows some inference on the coefficients.

However, the variance of the residuals increases with X, which indicates that there is significant variation that this model is unable to explain.

Overall, the regression line is a reasonably good fit to the data.
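A rough numerical check for residual variance that grows with X is to compare the spread of the residuals in the lower and upper halves of the feature range. The sketch below does this on synthetic data whose noise is deliberately built to widen with `x`. The data-generating choices are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=2000)

# Noise standard deviation grows with x: heteroscedastic by construction
noise = rng.normal(0.0, 0.5 + 0.5 * x)
y = 2.0 + 1.5 * x + noise

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Compare residual variance below and above the median of x
median_x = np.median(x)
var_low = residuals[x <= median_x].var()
var_high = residuals[x > median_x].var()

print(f"residual variance, lower half of x: {var_low:.2f}")
print(f"residual variance, upper half of x: {var_high:.2f}")
```

A large gap between the two variances is the numerical counterpart of a funnel shape in the residual scatter plot.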

## Predictions on the Test Set

Now that we have fitted a regression line on our training data, it's time to make some predictions on the test data. For this, we first need to add a constant to the `X_test` data like we did for `X_train`, and then we can predict the y values corresponding to `X_test` using the `predict` method of the fitted model.

In [None]:
# Add a constant to X_test
X_test_sm = sm.add_constant(X_test)

# Predict the y values corresponding to X_test_sm
y_pred = lr.predict(X_test_sm)

In [None]:
y_pred.head()

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

##### Looking at the MSE

In [None]:
# Returns the mean squared error
mean_squared_error(y_test, y_pred)

##### Looking at the RMSE

In [None]:
# Returns the root mean squared error
np.sqrt(mean_squared_error(y_test, y_pred))
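Both metrics follow directly from their definitions. The sketch below computes them with NumPy on a tiny made-up example. The values are arbitrary and unrelated to the weather data.

```python
import numpy as np

# Tiny made-up example: true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

# MSE: the average of the squared errors
mse = np.mean((y_true - y_hat) ** 2)

# RMSE: the square root of the MSE, in the same units as the target
rmse = np.sqrt(mse)

print(mse, rmse)
```

Because the RMSE is in the units of the target, here degrees Celsius, it is usually the easier of the two to interpret.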

##### Checking the R-squared on the test set

In [None]:
r_squared = r2_score(y_test, y_pred)
r_squared

##### Visualizing the fit on the test set

In [None]:
plt.figure(figsize=(10, 5))
plt.scatter(X_test, y_test, s=10)
plt.plot(X_test, 10.6760 + 0.9201*X_test, 'r')
plt.show()

## Linear Regression using `linear_model` in `sklearn`

Apart from `statsmodels`, another package, `sklearn`, can be used to perform linear regression. We will use the `linear_model` module from `sklearn` to build the model. The split below repeats the earlier train-test split with the same `random_state`, so it yields the same partition under new variable names.

There's one small step that we need to add, though. When there's only a single feature, we need to reshape it into a two-dimensional array with one column, because `sklearn` estimators expect `X` to have shape `(n_samples, n_features)`.

In [None]:
from sklearn.model_selection import train_test_split
X_train_lm, X_test_lm, y_train_lm, y_test_lm = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state = 100)

In [None]:
X_train_lm.shape

In [None]:
X_train_lm = X_train_lm.values.reshape(-1,1)
X_test_lm = X_test_lm.values.reshape(-1,1)

In [None]:
print(X_train_lm.shape)
print(y_train_lm.shape)
print(X_test_lm.shape)
print(y_test_lm.shape)
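The effect of `reshape(-1, 1)` is easiest to see on a small array. In this sketch the values are made up; `-1` tells NumPy to infer the number of rows, and `1` fixes the number of columns.

```python
import numpy as np

# A one-dimensional feature with four made-up observations
feature = np.array([12.5, 18.0, 21.3, 7.8])
print(feature.shape)

# Reshape into a single column: one row per observation, one feature
column = feature.reshape(-1, 1)
print(column.shape)
```

The values themselves are unchanged; only the array's shape differs.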

In [None]:
from sklearn.linear_model import LinearRegression

# Create a LinearRegression object named lm
lm = LinearRegression()

# Fit the model using lm.fit()
lm.fit(X_train_lm, y_train_lm)

In [None]:
print(lm.intercept_)
print(lm.coef_)

The equation we get is the same as what we got before!

$ MaxTemp = 10.67 + 0.92 \times MinTemp $

The `sklearn` linear model is useful because it is compatible with many `sklearn` utilities (cross-validation, grid search, etc.).
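As one example of that compatibility, a `LinearRegression` estimator can be passed straight to `cross_val_score`. The sketch below runs five-fold cross-validation on synthetic data. The data and the choice of five folds are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic single-feature data, already shaped (n_samples, 1)
rng = np.random.default_rng(0)
X = rng.uniform(-30, 30, size=(300, 1))
y = 10.0 + 0.9 * X[:, 0] + rng.normal(0.0, 3.0, size=300)

# Five-fold cross-validation; for regressors the default score is R-squared
scores = cross_val_score(LinearRegression(), X, y, cv=5)

print(scores)
print(scores.mean())
```

Each fold is held out once while the model is fitted on the remaining folds, which gives a less split-dependent estimate of performance than a single train-test split.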
