### What is data fitting?
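Data fitting is the process of finding a curve that best represents the relationship between the variables in a dataset. In linear regression that curve is a straight line chosen to pass as close as possible to the data points, on the assumption that the variables have a linear relationship.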
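
As a minimal sketch of fitting a straight line, the snippet below generates synthetic data (an assumption made only for illustration, unrelated to the dataset used later) and recovers the slope and intercept with `np.polyfit`.

```python
import numpy as np

# Synthetic data (illustrative assumption): y = 2x + 1 plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

# Least-squares fit of a degree-1 polynomial, i.e. a straight line.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The fitted values should land close to the slope 2 and intercept 1 used to generate the data, though not exactly on them because of the noise.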

### Bias and Variance
Bias is the error introduced when an algorithm has too little flexibility to learn from the data: it pays too little attention to the training data.

Variance is the algorithm's sensitivity to the specific training set it was given: it pays too much attention to the training data.

### Overfitting
A scenario where the machine learning model learns the noise in the training data along with the underlying pattern and tries to fit every data point on the curve.

##### Reason:
a. Data used for training is not cleaned and contains noise

b. The model has high variance

c. Size of training data is not enough

d. The model is too complex

### Underfitting
A scenario where the machine learning model can neither learn the relationship between the variables in the training data nor predict or classify new data.

##### Reason:

a. Data used for training is not cleaned and contains noise

b. The model has high bias

c. Size of training data is not enough

d. The model is too simple
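
The two scenarios above can be sketched with polynomial fits of different degrees. The data, the `true_fn` helper and the chosen degrees are assumptions made for illustration: degree 1 stands in for a model that is too simple, degree 9 for one that is too complex relative to 15 training points.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Hypothetical ground-truth relationship, chosen only for illustration.
    return np.sin(2 * np.pi * x)

# Small, noisy training set and a noise-free test grid.
x_train = np.sort(rng.uniform(0, 1, 15))
y_train = true_fn(x_train) + rng.normal(scale=0.2, size=x_train.shape)
x_test = np.linspace(0.05, 0.95, 100)
y_test = true_fn(x_test)

def fit_and_score(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_mse, test_mse

results = {degree: fit_and_score(degree) for degree in (1, 3, 9)}
for degree, (train_mse, test_mse) in results.items():
    print(f"degree {degree}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

The training error can only go down as the degree grows, so a low training error on its own says nothing about overfitting. The comparison between training and test error is what reveals it.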

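## What is Regularization?

Regularization techniques calibrate linear regression models by minimising an adjusted loss function: the usual loss plus a penalty on the size of the coefficients. The penalty discourages overfitting. Its strength has to be tuned, because too strong a penalty leads to underfitting.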

The two techniques covered here are:

a. Ridge (L2) Regularization

b. Lasso (L1) Regularization

## Ridge (L2) Regularization

It reduces overfitting by adding to the loss a penalty equal to the sum of the squares of the coefficients.

### $ Cost \, Function = Loss + \lambda \times \sum_j w_j^2$

$Loss$ = Sum of squared residuals

$\lambda$ = regularization strength: how heavily large coefficients are penalized

$w_j$ = coefficient (slope) of the $j$-th variable
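
To make the cost function concrete, here is a minimal NumPy sketch. The synthetic data, the `ridge_cost` helper and the value of `lam` are assumptions for illustration. The model has no intercept, so the closed-form minimiser $(X^T X + \lambda I)^{-1} X^T y$ applies directly.

```python
import numpy as np

def ridge_cost(X, y, w, lam):
    """Sum of squared residuals plus lam times the sum of squared coefficients."""
    residuals = y - X @ w
    return np.sum(residuals ** 2) + lam * np.sum(w ** 2)

# Synthetic data (illustrative assumption): 30 samples, 5 variables, no intercept.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ np.array([4.0, -3.0, 2.0, 0.0, 0.0]) + rng.normal(scale=1.0, size=30)

lam = 10.0
# Ordinary least squares: minimises the loss alone.
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
# Ridge: closed-form minimiser of the penalised cost.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

cost_ols = ridge_cost(X, y, w_ols, lam)
cost_ridge = ridge_cost(X, y, w_ridge, lam)
print("Penalised cost (OLS, ridge):", cost_ols, cost_ridge)
print("Sum of squared coefficients (OLS, ridge):", np.sum(w_ols ** 2), np.sum(w_ridge ** 2))
```

The ridge solution accepts a slightly larger residual loss in exchange for smaller coefficients, which is why its penalised cost is lower than that of the least-squares solution.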

### When to use:
Useful when we have many variables and relatively few data samples.

Ridge shrinks the coefficients towards zero but does not set them exactly to zero, which reduces overfitting while keeping every variable in the model.
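
The shrinking behaviour can be checked with scikit-learn's `Ridge`, whose `alpha` parameter plays the role of $\lambda$. The data and the list of `alpha` values below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (illustrative assumption): 40 samples, 8 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=40)

alphas = [0.1, 1.0, 10.0, 100.0]
norms = []        # L2 norm of the coefficient vector for each alpha
zero_counts = []  # number of coefficients that are exactly zero

for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    norms.append(float(np.linalg.norm(model.coef_)))
    zero_counts.append(int(np.sum(model.coef_ == 0)))
    print(f"alpha = {alpha:>5}: norm = {norms[-1]:.3f}, exact zeros = {zero_counts[-1]}")
```

The norm falls as `alpha` grows, yet no coefficient reaches exactly zero.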

## Lasso (L1) Regularization

It reduces overfitting by adding to the loss a penalty equal to the sum of the absolute values of the coefficients.

### $ Cost \, Function = Loss + \lambda \times \sum_j |w_j|$

$Loss$ = Sum of squared residuals

$\lambda$ = regularization strength: how heavily large coefficients are penalized

$w_j$ = coefficient (slope) of the $j$-th variable
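
A small worked example of this cost function, using a hand-made dataset and a hypothetical `lasso_cost` helper (both are assumptions for illustration). The coefficients fit the data exactly, so the whole cost comes from the penalty.

```python
import numpy as np

def lasso_cost(X, y, w, lam):
    """Sum of squared residuals plus lam times the sum of absolute coefficients."""
    residuals = y - X @ w
    return np.sum(residuals ** 2) + lam * np.sum(np.abs(w))

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 2.0])  # X @ w equals y, so the loss term is 0

cost = lasso_cost(X, y, w, lam=0.5)
print(cost)  # 0 + 0.5 * (|1| + |2|) = 1.5
```

Note that scikit-learn's `Lasso` minimises `(1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1`, so its `alpha` is not numerically the same as the $\lambda$ in the formula above.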

### When to use:
Preferred when we expect only a few of the variables to matter, because it also performs variable selection.

It drives the coefficients of some variables to exactly zero because of the shape of the constraint defined by the absolute value: its corners lie on the axes.
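
The following sketch illustrates that behaviour on synthetic data in which only two of ten variables influence the target. The data, `true_w` and the `alpha` values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (illustrative assumption): only the first two variables matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[0] = 3.0
true_w[1] = -2.0
y = X @ true_w + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# Count coefficients that are exactly zero in each model.
lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Exact zeros (lasso, ridge):", lasso_zeros, ridge_zeros)
```

Lasso removes most of the irrelevant variables outright, while ridge keeps all ten with small but non-zero coefficients.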

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [13]:
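# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older version (see the warning printed below).
boston_dataset = datasets.load_boston()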

boston_pd = pd.DataFrame(boston_dataset.data)
boston_pd.columns = boston_dataset.feature_names
boston_pd_target = np.asarray(boston_dataset.target)
boston_pd['House Price'] = pd.Series(boston_pd_target)

X = boston_pd.iloc[:,:-1]
y = boston_pd.iloc[:,-1]

print(boston_pd.head())

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0   
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0   

   PTRATIO       B  LSTAT  House Price  
0     15.3  396.90   4.98         24.0  
1     17.8  396.90   9.14         21.6  
2     17.8  392.83   4.03         34.7  
3     18.7  394.63   2.94         33.4  
4     18.7  396.90   5.33         36.2  



    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

In [16]:
# X and y were defined in the previous cell. No random_state is passed,
# so the split (and every number below) changes from run to run.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

print("Train data shape of X_train is {} and the shape of Y_train is {} ".format(x_train.shape, y_train.shape))

print("Test data shape of X_test is {} and the shape of Y_test is {} ".format(x_test.shape, y_test.shape))

Train data shape of X_train is (379, 13) and the shape of Y_train is (379,) 
Test data shape of X_test is (127, 13) and the shape of Y_test is (127,) 
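
If reproducible numbers are wanted, one option is to pass `random_state` to `train_test_split`. The sketch below uses placeholder arrays (`X_demo`, `y_demo`, assumed names) with the same shape as the data above rather than the dataset itself, and the seed 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the data above (506 rows, 13 columns).
X_demo = np.arange(506 * 13).reshape(506, 13)
y_demo = np.arange(506)

# Two independent calls with the same seed.
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Each call returns [X_train, X_test, y_train, y_test].
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", same)
print("Train and test shapes:", split_a[0].shape, split_a[1].shape)
```

With a fixed seed the two calls return the same rows, so downstream metrics stay stable between runs.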


In [17]:
# Apply multiple Linear Regression Model
lreg = LinearRegression()
lreg.fit(x_train,y_train)

lreg_y_pred = lreg.predict(x_test)

# Calculate the Mean Squared Error (MSE)
mean_square_error = np.mean( (lreg_y_pred - y_test)**2)
print("MSE on test set is: ",mean_square_error)

MSE on test set is:  18.46406761968643
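
The manual MSE computation above is equivalent to `sklearn.metrics.mean_squared_error`. A minimal check on toy values (the arrays are assumptions for illustration, not taken from the dataset):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# Squared errors are 0.25, 0.25, 0 and 1, so the mean is 0.375.
manual = np.mean((y_hat - y_true) ** 2)
library = mean_squared_error(y_true, y_hat)
print(manual, library)
```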


In [18]:
# Putting together the coefficient and their corresponding variable names
lreg_coefficient = pd.DataFrame()
lreg_coefficient["Columns"] = x_train.columns
lreg_coefficient["Coefficient Estimate"] = pd.Series(lreg.coef_)
print(lreg_coefficient)

    Columns  Coefficient Estimate
0      CRIM             -0.115707
1        ZN              0.049178
2     INDUS              0.064928
3      CHAS              3.682408
4       NOX            -21.229517
5        RM              3.295127
6       AGE              0.008618
7       DIS             -1.602754
8       RAD              0.360581
9       TAX             -0.015245
10  PTRATIO             -0.965489
11        B              0.007418
12    LSTAT             -0.601118


### Now we want to shrink the coefficient estimates

### 1. Ridge (L2) Regression

In [19]:
from sklearn.linear_model import Ridge

ridgeR = Ridge(alpha = 1)
ridgeR.fit(x_train,y_train)
y_pred = ridgeR.predict(x_test)

# calculate the mean squared error
mean_squared_error_ridge = np.mean((y_pred - y_test)**2)
print("The MSE is :",mean_squared_error_ridge )

# get the ridge coefficient and print them
ridge_coefficient = pd.DataFrame()
ridge_coefficient["Columns"] = x_train.columns
ridge_coefficient['Coefficient Estimate'] = pd.Series(ridgeR.coef_)
print(ridge_coefficient)

The MSE is : 17.973915218391458
    Columns  Coefficient Estimate
0      CRIM             -0.109326
1        ZN              0.050946
2     INDUS              0.025695
3      CHAS              3.376115
4       NOX            -11.021863
5        RM              3.367979
6       AGE              0.000203
7       DIS             -1.439719
8       RAD              0.344392
9       TAX             -0.016490
10  PTRATIO             -0.855682
11        B              0.007915
12    LSTAT             -0.618737


### 2. Lasso (L1) Regression

In [20]:
from sklearn.linear_model import Lasso

lasso = Lasso(alpha = 1)
lasso.fit(x_train,y_train)
y_pred = lasso.predict(x_test)

mean_squared_error_lasso = np.mean((y_pred - y_test)** 2)
print("The MSE is: ", mean_squared_error_lasso)

lasso_coefficient = pd.DataFrame()
lasso_coefficient["Columns"] = x_train.columns
lasso_coefficient["Coefficient Estimate"] = pd.Series(lasso.coef_)
print(lasso_coefficient)

The MSE is:  23.556117175525216
    Columns  Coefficient Estimate
0      CRIM             -0.074297
1        ZN              0.054410
2     INDUS             -0.000000
3      CHAS              0.000000
4       NOX             -0.000000
5        RM              0.443685
6       AGE              0.030163
7       DIS             -0.733494
8       RAD              0.327972
9       TAX             -0.018220
10  PTRATIO             -0.681699
11        B              0.006557
12    LSTAT             -0.862328
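
To compare the three models side by side, the sketch below copies the coefficient estimates printed above into arrays and summarises them. The numbers belong to that particular run. Because the split is random, a rerun will produce different values.

```python
import numpy as np

columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

# Coefficient estimates copied from the printed outputs above.
ols = np.array([-0.115707, 0.049178, 0.064928, 3.682408, -21.229517, 3.295127,
                0.008618, -1.602754, 0.360581, -0.015245, -0.965489, 0.007418,
                -0.601118])
ridge = np.array([-0.109326, 0.050946, 0.025695, 3.376115, -11.021863, 3.367979,
                  0.000203, -1.439719, 0.344392, -0.016490, -0.855682, 0.007915,
                  -0.618737])
lasso = np.array([-0.074297, 0.054410, -0.0, 0.0, -0.0, 0.443685,
                  0.030163, -0.733494, 0.327972, -0.018220, -0.681699, 0.006557,
                  -0.862328])

# L2 norm of each coefficient vector and number of exact zeros.
norms = {name: float(np.linalg.norm(w))
         for name, w in [("linear", ols), ("ridge", ridge), ("lasso", lasso)]}
for name, w in [("linear", ols), ("ridge", ridge), ("lasso", lasso)]:
    print(f"{name:>6}: norm = {norms[name]:.2f}, exact zeros = {int(np.sum(w == 0))}")

# Variables whose coefficient Lasso set to zero.
dropped = [c for c, w in zip(columns, lasso) if w == 0]
print("Dropped by Lasso:", dropped)
```

In this run ridge roughly halves the largest coefficient (NOX) without removing any variable, while Lasso with `alpha = 1` drops INDUS, CHAS and NOX entirely and has the highest test MSE of the three, which suggests that this penalty is too strong for this data.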
