
# Name: Rohit Chandra

# Regularization

## 1 Explain what is Regularization? 

👉 One of the major aspects of training your machine learning model is avoiding **overfitting.** An overfitted model will have low accuracy on unseen data. This happens because your **model is trying too hard to capture the noise in your training dataset.** By noise we mean the data points that don't really represent the true properties of your data, but are the result of random chance. Learning such data points makes your model more flexible, at the risk of overfitting.


👉 Overfitting is a phenomenon that occurs when a **Machine Learning model fits the training set too closely and is not able to perform well on unseen data.**

![](images/overtfitting.JPG)

**Regularization:**

👉 This is a form of regression that constrains, regularizes, or shrinks the coefficient estimates towards zero. In other words, this technique **discourages learning a more complex or flexible model, so as to avoid the risk of overfitting.**

***(OR)***

👉 **Regularization** is a technique used to reduce the errors by fitting the function appropriately on the given training set and avoid overfitting. 

👉 The commonly used regularization techniques are : 

+ **LASSO Regression (L1 regularization)**


+ **Ridge regression (L2 regularization)**

👉 ***Note:***

+ The full form of LASSO is **Least Absolute Shrinkage and Selection Operator regression**


+ One of the ways of avoiding overfitting is using **cross validation**, which helps in estimating the error on the test set and in deciding which parameters work best for your model
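To make the cross-validation idea concrete, here is a minimal sketch of picking a penalty strength by k-fold cross validation. The synthetic data from `make_regression` and the candidate `alphas` list are assumptions for illustration only; they are not taken from this homework's datasets.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (an assumption; stands in for a real dataset)
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# 5-fold cross validation with shuffling for reproducible splits
cv = KFold(n_splits=5, shuffle=True, random_state=0)
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]  # hypothetical candidate values

# Mean cross-validated R^2 for each candidate penalty strength
cv_scores = {
    alpha: cross_val_score(Ridge(alpha=alpha), X, y, cv=cv, scoring="r2").mean()
    for alpha in alphas
}

# Keep the alpha with the highest mean validation score
best_alpha = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("best alpha:", best_alpha)
```

The test set is never touched during this search, so it can still give an honest estimate of the final model's error.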

## 2 Explain how regularization is performed in linear regression? 

👉 Regularization works by **adding a penalty term (also called a complexity or shrinkage term) to the Residual Sum of Squares (RSS) of the model.**

👉 **Let's consider the linear regression equation:**

Here **Y represents the dependent feature or response, which is modeled by the learned relation**. Then,

Y is approximated by **β0 + β1X1 + β2X2 + … + βpXp**

Here, **X1, X2, …, Xp** are the independent features or predictors for Y, and

**β0, β1, …, βp** represent the coefficient estimates for the different variables or predictors (X), which describe the weights or magnitudes attached to the features, respectively.

👉 In linear regression, our **optimization function or loss function is known as the residual sum of squares (RSS).**

We choose the set of coefficients such that the following loss function is minimized:


👉 **Below is the cost function for linear regression:**

![](images/cost_function_Linear_regression.JPG)
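In case the image is unavailable, the RSS is usually written as follows. The notation is an assumption that follows the common textbook form: n is the number of training samples, p the number of predictors, and x_ij the value of predictor j for sample i.

$$
\mathrm{RSS} = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2
$$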

👉 Now, this will **adjust the coefficient estimates based on the training data.** If there is **noise present in the training data, then the estimated coefficients won't generalize well and will not predict future data accurately.**

👉 This is where **regularization comes into the picture: it shrinks or regularizes these learned estimates towards zero by adding a penalty term to the loss function, so that the model predicts Y more accurately on unseen data.**

## 3 Explain what is Ridge and Lasso regression? 

### Ridge regression:

👉 Ridge regression is a type of linear regression in which we introduce a small amount of bias, known as the Ridge regression penalty, so that we can get better long-term predictions.

👉 Its penalty is based on the L2 norm of the coefficients, so it is also known as L2 regularization.

👉 In this technique, the cost function is altered by adding the penalty term (shrinkage term), which multiplies λ by the sum of the squared weights of the individual features. Therefore, the optimization function (cost function) becomes:

![](images/ridge.JPG)
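In case the image is unavailable, the ridge cost function is usually written as follows (the notation is assumed from the common textbook form, with n samples and p predictors). Note that the penalty sum starts at j = 1, so the intercept β0 is not penalized.

$$
\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
$$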

👉 Here, **λ is the tuning parameter that decides how much we want to penalize the flexibility of our model.** An increase in the flexibility of a model is represented by an increase in its coefficients, and if we want to minimize the above function, then these coefficients need to be small. This is how the Ridge regression technique prevents coefficients from rising too high. Also, notice that we shrink the estimated association of each variable with the response, except the intercept β0. This intercept is a measure of the mean value of the response when xi1 = xi2 = …= xip = 0.

👉 When **λ = 0, the penalty term has no effect, and the estimates produced by ridge regression will be equal to least squares.** However, as **λ→∞, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero.** As can be seen, selecting a good value of λ is critical. Cross validation comes in handy for this purpose. The penalty used by this method is the squared L2 norm of the coefficients.
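The effect of the tuning parameter can be checked with a small experiment. This is a sketch on synthetic data (an assumption, not the homework data); scikit-learn calls the tuning parameter `alpha` rather than λ.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data, used only to illustrate the shrinkage effect
X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

# L2 norm of the unpenalized least-squares coefficients, for reference
ols_norm = np.linalg.norm(LinearRegression().fit(X, y).coef_)

# L2 norm of the ridge coefficients for increasingly strong penalties
ridge_norms = []
for alpha in [0.1, 10.0, 1000.0, 100000.0]:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    ridge_norms.append(np.linalg.norm(coef))
    print(f"alpha={alpha:>9}: ||coef||_2 = {ridge_norms[-1]:.4f}")

print(f"least squares: ||coef||_2 = {ols_norm:.4f}")
```

The printed norms should decrease as `alpha` grows, with the smallest `alpha` staying close to the least-squares value.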

👉 **Usage of Ridge Regression:**

+ When the independent variables are highly collinear (the problem of multicollinearity), ordinary linear or polynomial regression gives unstable coefficient estimates. Ridge regression can be used to solve such problems.


+ If we have more parameters than samples, least squares has no unique solution. Ridge regression can still be used in this case.

👉 **Limitation of Ridge Regression:**

+ **Does not help in Feature Selection:** It decreases the complexity of a model but does not reduce the number of independent variables, since it never sets a coefficient to zero and only shrinks it. Hence, this technique is not good for feature selection.


+ **Model Interpretability:** Its disadvantage is model interpretability, since it will shrink the coefficients of the least important predictors very close to zero but will never make them exactly zero. In other words, the final model will include all the independent variables, also known as predictors.

### Lasso Regression:


👉 Lasso regression is another variant of the regularization technique used to reduce the complexity of the model. It stands for **Least Absolute Shrinkage and Selection Operator.**

👉 It is similar to Ridge regression except that the **penalty term includes the absolute values of the weights instead of the squares of the weights.** Therefore, the optimization function becomes:

![](images/lasso.JPG)
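In case the image is unavailable, the lasso cost function is usually written as follows (the notation is assumed from the common textbook form, with n samples and p predictors). As with ridge, the intercept β0 is not penalized.

$$
\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
$$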

👉 Its penalty is based on the **L1 norm** of the coefficients, so it is also known as L1 regularization.

👉 In this technique, the **L1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large, which means some of the features are removed from the model completely.** Therefore, the lasso method also performs feature selection and is said to yield sparse models.
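The sparsity effect can be seen by counting exact zeros in the fitted coefficients. This is a minimal sketch on synthetic data in which only 5 of 20 features are informative; the data and the choice `alpha=5.0` are assumptions for illustration. Note that scikit-learn's `Lasso` scales the squared-error term by 1/(2n), so its `alpha` is not numerically identical to λ in the formula above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, of which only 5 actually influence y
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=5.0).fit(X, y)
ridge = Ridge(alpha=5.0).fit(X, y)

# Count coefficients that are exactly zero under each penalty
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("coefficients set exactly to zero by Lasso:", n_zero_lasso)
print("coefficients set exactly to zero by Ridge:", n_zero_ridge)
```

Lasso is expected to zero out several of the uninformative features, while Ridge only shrinks them towards zero.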

👉 **Limitation of Lasso Regression:**

+ Problems with some types of Dataset: If the number of predictors is greater than the number of data points, Lasso will pick at most n predictors as non-zero, even if all predictors are relevant.


+ Multicollinearity Problem: If there are two or more highly collinear variables, then LASSO regression tends to select one of them more or less arbitrarily, which is not good for the interpretation of our model.

**Key Differences between Ridge and Lasso Regression:**

👉 Ridge regression helps us reduce overfitting while keeping all the features in the model. It reduces the complexity of the model by shrinking the coefficients. Lasso regression reduces overfitting and also performs automatic feature selection.

👉 Lasso regression tends to set some coefficients to exactly zero, whereas Ridge regression never sets a coefficient to exactly zero.

## 4 Perform Ridge and Lasso regression continuing the task of the previous homework

In [90]:
#import all the libraries
import warnings
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

In [68]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
boston_dataset = load_boston()


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

In [69]:
#save in dataframe
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston.head()


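| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |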


In [70]:
# add the target variable to the dataframe
boston['MEDV'] = boston_dataset.target

In [71]:
# Selecting the independent and dependent variables
X = boston.iloc[:, boston.columns != 'MEDV']  #selecting all columns except "MEDV"
y = boston.iloc[:, -1] #selecting target(MEDV in this case)

In [72]:
# Splitting into train and test sets for model 1 and resetting the index to avoid a jumbled index

X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size = 0.25, random_state = 100)

X_train1 = X_train1.reset_index(drop = True)
X_test1 = X_test1.reset_index(drop = True)
y_train1 = y_train1.reset_index(drop = True)
y_test1 = y_test1.reset_index(drop = True)

print("shape of X_train1 is: ", X_train1.shape)
print("shape of X_test1 is: ", X_test1.shape)
print("shape of y_train1 is: ", y_train1.shape)
print("shape of y_test1 is: ", y_test1.shape)

shape of X_train1 is:  (379, 13)
shape of X_test1 is:  (127, 13)
shape of y_train1 is:  (379,)
shape of y_test1 is:  (127,)


In [73]:
#we can either use Standardization or Normalization here. Let's choose normalization

# min max scaling the variables

scaler =  MinMaxScaler()
scaler.fit(X_train1)
X_train_scaled1 = scaler.transform(X_train1)
X_test_scaled1 = scaler.transform(X_test1)

In [74]:
# training a Lasso regression model on the scaled training set

from sklearn.linear_model import Lasso, Ridge
lasso_regressor = Lasso(alpha = 0.1)

#train 1st model
lasso_regressor.fit(X_train_scaled1,y_train1)


# making predictions for the training set 1
y_train_pred1 = lasso_regressor.predict(X_train_scaled1)

# making predictions for the testing set
#dataset 1
y_test_pred1 = lasso_regressor.predict(X_test_scaled1)

In [76]:
print("R2_score_Lasso for train =", r2_score(y_train1, y_train_pred1))
print("R2_score_Lasso for test =", r2_score(y_test1, y_test_pred1))

R2_score_Lasso for train = 0.6944635338817471
R2_score_Lasso for test = 0.6657090543720645


In [77]:
ridge_regressor = Ridge(alpha = 0.1)

#train 1st model
ridge_regressor.fit(X_train_scaled1,y_train1)

y_pred = ridge_regressor.predict(X_test_scaled1)


# making predictions for the training set 1
y_train_pred1 = ridge_regressor.predict(X_train_scaled1)

# making predictions for the testing set
#dataset 1
y_test_pred1 = ridge_regressor.predict(X_test_scaled1)


In [78]:
print("R2_score_Ridge for train =", r2_score(y_train1, y_train_pred1))
print("R2_score_Ridge for test=", r2_score(y_test1, y_test_pred1))

R2_score_Ridge for train = 0.7421080815376433
R2_score_Ridge for test= 0.7233326093985133


## 5 Perform Ridge and Lasso regression on HCC.csv dataset after performing necessary pre-processing steps as mentioned in the previous homework

In [79]:
Hcc_df = pd.read_csv('D:/Masters/SJSU/Academics/sem_2/CMPE_257_ML/Assignments/HW3/data/HCC.csv')


In [80]:
Hcc_df.head()

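| | Unnamed: 0 | 1.Gen | 2.Sym | 3.Alc | 4.HepB | 6.HepB | 7.HepC | 8.Cir | 11.Dia | 12.Obe | ... | 37.Bil | 38.Ala | 39.Aspa | 40.Gam | 41.Alk | 42.Prot | 43.Crea | 44.NNod | 45.dnod | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0.0 | 1 | 0.0 | 0.0 | 0.0 | 1 | 1.0 | 0.0 | ... | 2.1 | 34.0 | 41 | 183.0 | 150.0 | 7.1 | 0.7 | 1.0 | 3.5 | 1 |
| 1 | 2 | 1 | 0.0 | 1 | 1.0 | 1.0 | 0.0 | 1 | 0.0 | 0.0 | ... | 0.4 | 58.0 | 68 | 202.0 | 109.0 | 7.0 | 2.1 | 5.0 | 13.0 | 1 |
| 2 | 3 | 1 | 1.0 | 1 | 0.0 | 0.0 | 0.0 | 1 | 1.0 | 0.0 | ... | 0.4 | 16.0 | 64 | 94.0 | 174.0 | 8.1 | 1.11 | 2.0 | 15.7 | 0 |
| 3 | 4 | 1 | 1.0 | 1 | 1.0 | 1.0 | 0.0 | 1 | 0.0 | 0.0 | ... | 0.7 | 147.0 | 306 | 173.0 | 109.0 | 6.9 | 1.8 | 1.0 | 9.0 | 1 |
| 4 | 5 | 1 | 0.0 | 1 | 0.0 | 0.0 | 0.0 | 1 | 0.0 | 1.0 | ... | 3.5 | 91.0 | 122 | 242.0 | 396.0 | 5.6 | 0.9 | 1.0 | 10.0 | 0 |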


In [81]:
Hcc_df.shape

(156, 41)

In [82]:
Hcc_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 41 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     156 non-null    int64  
 1   1.Gen          156 non-null    int64  
 2   2.Sym          156 non-null    float64
 3   3.Alc          156 non-null    int64  
 4   4.HepB         156 non-null    float64
 5   6.HepB         156 non-null    float64
 6   7.HepC         156 non-null    float64
 7   8.Cir          156 non-null    int64  
 8   11.Dia         156 non-null    float64
 9   12.Obe         156 non-null    float64
 10  13.Hem         156 non-null    float64
 11  14.Art         156 non-null    float64
 12  15.CRen        156 non-null    float64
 13  16.HIV         156 non-null    float64
 14  17.Non         156 non-null    float64
 15  19.Spl         156 non-null    float64
 16  20.PHyp        156 non-null    float64
 17  21.Thr         156 non-null    float64
 18  22.LMet   

In [83]:
# Selecting the independent and dependent variables
X_hcc = Hcc_df.iloc[:, Hcc_df.columns != 'Class']  # selecting all columns except "Class"
y_hcc = Hcc_df.iloc[:, -1]  # selecting the target ("Class" in this case)

In [84]:
# Splitting into train and test sets for model 1 and resetting the index to avoid a jumbled index

X_train1, X_test1, y_train1, y_test1 = train_test_split(X_hcc, y_hcc, test_size = 0.25, random_state = 100)

X_train1 = X_train1.reset_index(drop = True)
X_test1 = X_test1.reset_index(drop = True)
y_train1 = y_train1.reset_index(drop = True)
y_test1 = y_test1.reset_index(drop = True)

print("shape of X_train1 is: ", X_train1.shape)
print("shape of X_test1 is: ", X_test1.shape)
print("shape of y_train1 is: ", y_train1.shape)
print("shape of y_test1 is: ", y_test1.shape)

shape of X_train1 is:  (117, 40)
shape of X_test1 is:  (39, 40)
shape of y_train1 is:  (117,)
shape of y_test1 is:  (39,)


In [85]:
#1st model
scaler =  MinMaxScaler()
scaler.fit(X_train1)
X_train_scaled1 = scaler.transform(X_train1)
X_test_scaled1 = scaler.transform(X_test1)

In [86]:
# training a Lasso regression model on the scaled training set

from sklearn.linear_model import Lasso, Ridge
lasso_regressor = Lasso(alpha = 0.1)

#train 1st model
lasso_regressor.fit(X_train_scaled1,y_train1)


# making predictions for the training set 1
y_train_pred1 = lasso_regressor.predict(X_train_scaled1)

# making predictions for the testing set
#dataset 1
y_test_pred1 = lasso_regressor.predict(X_test_scaled1)

In [87]:
print("R2_score_Lasso for train =", r2_score(y_train1, y_train_pred1))
print("R2_score_Lasso for test =", r2_score(y_test1, y_test_pred1))

R2_score_Lasso for train = 0.0
R2_score_Lasso for test = -0.0028571428571422253


**Observation:**

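+  The train R2 score of 0.0 suggests that, at alpha = 0.1, the Lasso penalty has shrunk the coefficients to zero, so the model predicts a constant (the training mean). On the test data the R2 score is slightly negative, meaning the model performs marginally worse than the average fit line. Linear regression is also a poor fit here because the response (`Class`) is categorical.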

In [88]:
ridge_regressor = Ridge(alpha = 0.1)

#train 1st model
ridge_regressor.fit(X_train_scaled1,y_train1)

y_pred = ridge_regressor.predict(X_test_scaled1)


# making predictions for the training set 1
y_train_pred1 = ridge_regressor.predict(X_train_scaled1)

# making predictions for the testing set
#dataset 1
y_test_pred1 = ridge_regressor.predict(X_test_scaled1)

In [89]:
print("R2_score_Ridge for train =", r2_score(y_train1, y_train_pred1))
print("R2_score_Ridge for test =", r2_score(y_test1, y_test_pred1))

R2_score_Ridge for train = 0.5551765520335733
R2_score_Ridge for test = -0.1486760724824705


**Observation:**

+  The Ridge regressor model fits the training data moderately well (R2 ≈ 0.56) but performs worse on the test data than the average fit line, hence the negative test R2 score. This gap points to overfitting. Because the response (`Class`) is categorical, we need to apply a classification model such as logistic regression to this data to get better results.
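As a sketch of the suggested next step, the block below fits an L2-regularized logistic regression to a binary target. The data come from `make_classification` and only mimic the shape of the HCC table (156 rows, 40 features); this is an assumption, since `HCC.csv` is not loaded here, so the accuracy is not a result for that dataset. In scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength, so a smaller `C` means a stronger penalty.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic binary-classification data (stand-in for the HCC features and Class)
X, y = make_classification(n_samples=156, n_features=40, n_informative=8,
                           random_state=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=100, stratify=y)

# Scale inside a pipeline so the scaler is fitted on the training data only;
# LogisticRegression applies an L2 penalty by default
clf = make_pipeline(MinMaxScaler(), LogisticRegression(C=1.0, max_iter=1000))
clf.fit(X_train, y_train)

train_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print("train accuracy =", train_accuracy)
print("test accuracy =", test_accuracy)
```

Accuracy replaces R2 here because the model predicts class labels rather than continuous values.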


Sources:

+ https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a


+ https://www.analyticsvidhya.com/blog/2021/05/complete-guide-to-regularization-techniques-in-machine-learning/#:~:text=Regularization%20works%20by%20adding%20a,RSS)%20to%20the%20complex%20model.&text=%CE%B20%2C%20%CE%B21%2C%E2%80%A6..&text=In%20simple%20linear%20regression%2C%20our,sum%20of%20squares%20(RSS)