# Linear Regression with sklearn API

Objective: Build a linear regression model with sklearn.

1. Dataset: California housing
2. Linear Regression API: LinearRegression
3. Training: fit (normal equation) and cross_validate (normal equation with cross-validation)
4. Evaluation: score (r2 score) and cross_val_score with different scoring parameters

We will also study model diagnosis with learning curves and learn how to examine the learned model, that is, its weight vector.
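
To make the "normal equation" in the training step concrete, here is a minimal sketch on a small synthetic problem (the data and the names X, y, w_normal and w_sklearn are illustrative assumptions, not part of this notebook). It solves the normal equation with NumPy and compares the result with what LinearRegression learns. Note that sklearn's solver may use a different numerical routine internally; the sketch only checks that both reach the same least-squares solution.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Small synthetic regression problem (illustrative, not the housing data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 4.0 + rng.normal(scale=0.1, size=200)

# Normal equation: add a column of ones for the intercept,
# then solve (X^T X) w = X^T y.
X_aug = np.column_stack([np.ones(len(X)), X])
w_normal = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

# LinearRegression solves the same least-squares problem.
model = LinearRegression().fit(X, y)
w_sklearn = np.concatenate([[model.intercept_], model.coef_])

print(w_normal)
print(w_sklearn)
print(np.allclose(w_normal, w_sklearn))
```

The first entry of each vector is the intercept and the remaining entries are the feature weights.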

In [None]:
#Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression

from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

from sklearn.metrics import mean_squared_error

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [None]:
np.random.seed(306)
plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before matplotlib 3.6

We will use ShuffleSplit cross-validation with:

* 10 folds (n_splits) and
* 20% of the examples set aside as test examples in each fold (test_size)


In [None]:
shuffle_split_cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

This creates 10 folds through shuffle split, keeping aside 20% of the examples as the test set in each fold.
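
To see what ShuffleSplit produces, here is a small sketch on a toy array of 10 samples (X_demo, demo_cv and split_sizes are hypothetical names used only for this illustration). Each split is an independent random shuffle, so unlike KFold, the same example may appear in the test set of more than one split.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy data: 10 samples with 2 features each (illustrative only).
X_demo = np.arange(20).reshape(10, 2)
demo_cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

split_sizes = []
for train_idx, test_idx in demo_cv.split(X_demo):
    # Each split yields index arrays for the train and test portions.
    split_sizes.append((len(train_idx), len(test_idx)))
    print("train:", train_idx, "test:", test_idx)

print(split_sizes)
```

With 10 samples and test_size=0.2, every split has 8 training and 2 test examples.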

# STEP 1: Load the dataset

In [None]:
features, labels = fetch_california_housing(as_frame=True, return_X_y=True)

The feature matrix is loaded in the features dataframe and the labels in the labels series. Let's examine the shapes of these two objects.

In [None]:
print(features.shape, labels.shape)

# STEP 2: Data Exploration

Data exploration is covered in a separate notebook.

# STEP 3: Preprocessing and model building

## 3.1 Train test split

The first step is to split the data into training and test sets.

In [None]:
from sklearn.model_selection import train_test_split
# With no test_size given, train_test_split holds out 25% of the data as the test set
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, random_state=42)


Let's examine the shapes of the training and test feature matrices.

In [None]:
print(train_features.shape, test_features.shape)
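
The split above does not pass test_size, so it relies on the default. Here is a minimal sketch on toy arrays (X_demo and y_demo are illustrative assumptions) showing the default 75/25 split next to an explicit 80/20 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 2 features (illustrative only).
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Default behaviour: 25% of the samples go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(X_tr.shape, X_te.shape)

# Explicit test_size: hold out 20% instead.
X_tr80, X_te20, y_tr80, y_te20 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
print(X_tr80.shape, X_te20.shape)
```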

## 3.2 Pipeline: preprocessing + model

1. StandardScaler
2. LinearRegression

In [None]:
lin_reg_pipeline = Pipeline([('feature_scaling',StandardScaler()),
                             ('lin_reg',LinearRegression())])
lin_reg_pipeline.fit(train_features, train_labels)

Let's look at the learned intercept and weight vector.

In [None]:
print(lin_reg_pipeline[-1].intercept_,lin_reg_pipeline[-1].coef_)
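
Because the pipeline standardizes the features first, the learned weights are expressed per standard deviation of each feature, and the intercept equals the mean of the training labels. Here is a minimal sketch on synthetic data (demo_pipeline, X_demo and y_demo are hypothetical names) that also shows how to convert the weights back to the original feature units.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with two features on very different scales (illustrative).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=10.0, scale=[1.0, 50.0], size=(300, 2))
y_demo = 3.0 * X_demo[:, 0] + 0.02 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

demo_pipeline = Pipeline([('feature_scaling', StandardScaler()),
                          ('lin_reg', LinearRegression())])
demo_pipeline.fit(X_demo, y_demo)

# Steps can be accessed by name or by position; both give the same object.
model = demo_pipeline.named_steps['lin_reg']
scaler = demo_pipeline.named_steps['feature_scaling']

# Weights in standardized space, and the same weights per original unit.
coef_original_units = model.coef_ / scaler.scale_
print(model.coef_, coef_original_units)

# With centered features, the intercept is the mean of the labels.
print(model.intercept_, y_demo.mean())
```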

# STEP 4: Model Evaluation



## Score

We compute the score on both the test set and the training set, with twin objectives:

* Estimation of model performance
* Comparison of errors for model diagnostics

In [None]:
test_score = lin_reg_pipeline.score(test_features, test_labels)
print(test_score)
train_score = lin_reg_pipeline.score(train_features, train_labels)
print(train_score)

The score method returns the r2 score (coefficient of determination), whose best value is 1.
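
As a quick check of what the r2 score measures, here is a sketch on synthetic data (X_demo, y_demo and r2_manual are illustrative names) that computes 1 - SS_res / SS_tot by hand and compares it with score and sklearn.metrics.r2_score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = X_demo @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X_demo, y_demo)
y_pred = model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - y_pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, model.score(X_demo, y_demo), r2_score(y_demo, y_pred))
```

A model that always predicts the mean of the labels gets a score of 0, and a model that does worse than that gets a negative score.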

## Cross validated score

In [None]:
lin_reg_score = cross_val_score(lin_reg_pipeline,
                                train_features,
                                train_labels,
                                scoring='neg_mean_squared_error',
                                cv=shuffle_split_cv)

print(lin_reg_score)

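Here we get negative errors, because sklearn scorers follow the convention that higher is better and so return the negated MSE. We can convert them to errors as follows: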

In [None]:
lin_reg_mse = -lin_reg_score

We can also use other scoring parameters. Some choices for regression are:

* explained_variance
* max_error
* neg_mean_absolute_error
* neg_root_mean_squared_error
* neg_mean_squared_log_error
* neg_median_absolute_error
* neg_mean_absolute_percentage_error
* r2
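
As an illustration of switching the scoring parameter, here is a minimal sketch on synthetic data from make_regression (the data and the names demo_cv, rmse and r2 are assumptions for this example). It evaluates the same model with neg_root_mean_squared_error and with r2.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic regression data (illustrative, not the housing data).
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)
demo_cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# Error-based scorers are negated, so flip the sign to get the RMSE.
neg_rmse = cross_val_score(LinearRegression(), X_demo, y_demo,
                           scoring='neg_root_mean_squared_error', cv=demo_cv)
rmse = -neg_rmse

# r2 is already "higher is better", so no sign flip is needed.
r2 = cross_val_score(LinearRegression(), X_demo, y_demo,
                     scoring='r2', cv=demo_cv)

print(f"RMSE: {rmse.mean():.3f} +/- {rmse.std():.3f}")
print(f"R2:   {r2.mean():.3f} +/- {r2.std():.3f}")
```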

## Cross Validation

cross_validate gives access to the models trained in each fold, along with some other statistics.

In [None]:
lin_reg_cv_results = cross_validate(lin_reg_pipeline,
                                    train_features,
                                    train_labels,
                                    cv=shuffle_split_cv,
                                    scoring='neg_mean_squared_error',
                                    return_train_score=True,
                                    return_estimator=True)

lin_reg_cv_results is a dictionary with the following contents:

* trained estimators (estimator),
* time taken for fitting (fit_time) and scoring (score_time) the models in cv,
* training scores (train_score),
* test scores (test_score)

In [None]:
lin_reg_cv_results

Each entry holds 10 values, one per fold, since the cross validation uses n_splits=10.
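
Here is a small sketch of how the entries of such a results dictionary can be read, using synthetic data and 4 splits (demo_results, demo_train_error and demo_test_error are hypothetical names for this illustration). The scores are negated MSE values, so we flip the sign to get errors.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_validate

# Synthetic regression data (illustrative only).
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)
demo_cv = ShuffleSplit(n_splits=4, test_size=0.2, random_state=0)

demo_results = cross_validate(LinearRegression(), X_demo, y_demo,
                              cv=demo_cv,
                              scoring='neg_mean_squared_error',
                              return_train_score=True,
                              return_estimator=True)

print(sorted(demo_results.keys()))

# Convert negated MSE scores into errors, one value per split.
demo_train_error = -demo_results['train_score']
demo_test_error = -demo_results['test_score']
print(f"train MSE: {demo_train_error.mean():.3f} +/- {demo_train_error.std():.3f}")
print(f"test MSE:  {demo_test_error.mean():.3f} +/- {demo_test_error.std():.3f}")
```

Comparing the training and test errors across splits is a simple way to spot overfitting or underfitting.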

## Model Examination

Let's examine how much variability exists between the weights of the cross validated models.

In [None]:
feature_names = train_features.columns
feature_names

In [None]:
coefs = [est[-1].coef_ for est in lin_reg_cv_results['estimator']]
weights_df = pd.DataFrame(coefs, columns=feature_names)

color = {'whiskers':'black','medians':'black','caps':'black'}
weights_df.plot.box(color=color,vert=False)
_=plt.title('Linear Regression Coefficients')

In [None]:
weights_df.describe()

## Selecting best model



In [None]:
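# The scores are negated MSE values; flip the sign to get the test error per fold
test_error = -lin_reg_cv_results['test_score']
best_model_index = np.argmin(test_error)
selected_model = lin_reg_cv_results['estimator'][best_model_index]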

## Model Performance

In [None]:
from sklearn.model_selection import cross_val_predict
cv_predictions = cross_val_predict(lin_reg_pipeline, train_features, train_labels)

In [None]:
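mse_cv = mean_squared_error(train_labels, cv_predictions)
print(mse_cv)

plt.scatter(train_labels, cv_predictions, color='blue')
# Red diagonal: where predictions would fall if they matched the labels exactly
plt.plot(train_labels, train_labels, '-r')
plt.xlabel('Actual label')
plt.ylabel('Cross validated prediction')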
plt.show()
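
The objective mentions learning curves for model diagnosis, and learning_curve is imported at the top. Here is a minimal sketch of how it could be used, on synthetic data rather than the housing data (X_demo, y_demo, demo_cv and the chosen train_sizes are assumptions for this illustration). It reports the mean training and validation MSE as the training set grows.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, learning_curve

# Synthetic regression data (illustrative only).
X_demo, y_demo = make_regression(n_samples=500, n_features=8,
                                 noise=20.0, random_state=0)
demo_cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# Fit the model on 5 increasing fractions of the training portion.
train_sizes, train_scores, test_scores = learning_curve(
    LinearRegression(), X_demo, y_demo,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=demo_cv,
    scoring='neg_mean_squared_error')

# Scores have shape (n_train_sizes, n_splits); average over the splits.
train_mse = -train_scores.mean(axis=1)
test_mse = -test_scores.mean(axis=1)

for n, tr, te in zip(train_sizes, train_mse, test_mse):
    print(f"n={n:4d}  train MSE={tr:10.2f}  validation MSE={te:10.2f}")
```

If the two errors converge to a similar value as the training set grows, adding more data is unlikely to help much; a persistent gap between them would instead suggest overfitting.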

# STEP 5: Predictions

We can use the best performing model from cross validation to get predictions on the test set.

In [None]:
test_predictions_cv = selected_model.predict(test_features)
test_predictions_cv[:5]

We can also obtain predictions using the initial model that we built without cross validation.

In [None]:
# Use a separate name so the cross validation predictions are not overwritten
test_predictions = lin_reg_pipeline.predict(test_features)
test_predictions[:5]

# STEP 6: Report model performance

In [None]:
score_cv = selected_model.score(test_features, test_labels)
score = lin_reg_pipeline.score(test_features, test_labels)
print(score_cv,score)