## **EX NO:**
##   **DATE :**
# <center>**REGULARIZATION**</center>

## **AIM:**


To evaluate the impact of Ridge and Lasso regularization on model performance by analyzing error metrics and selecting the optimal regularization technique for improved prediction accuracy.

## **Importing libraries and loading the dataset**

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('diamonds.csv')
print(df.head())
print(df.info())
print(df.describe())


   Unnamed: 0  carat      cut color clarity  depth  table  price     x     y  \
0           1   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98   
1           2   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84   
2           3   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07   
3           4   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23   
4           5   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35   

      z  
0  2.43  
1  2.31  
2  2.31  
3  2.63  
4  2.75  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  53940 non-null  int64  
 1   carat       53940 non-null  float64
 2   cut         53940 non-null  object 
 3   color       53940 non-null  object 
 4   clarity     53940 non-null  object 
 5   depth       53940 non-null  float64
 6   table       53940
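
The `Unnamed: 0` column in the output above looks like a row index that was saved into the CSV. The notebook keeps it, so it is later passed to the models as a feature, and every metric reported below was computed with it in place. A minimal sketch of one way to avoid that, assuming the column really is just a saved index; the three-row CSV text here is a hypothetical stand-in for `diamonds.csv`:

```python
import io
import pandas as pd

# Hypothetical three-row excerpt standing in for diamonds.csv. The empty
# first header field is what pandas names "Unnamed: 0".
csv_text = """,carat,cut,price
1,0.23,Ideal,326
2,0.21,Premium,326
3,0.23,Good,327
"""

# Default read: the saved row index comes back as an ordinary column.
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 turns that first column into the index, so it cannot end up in X.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```

Dropping the column after loading with `df.drop(columns=['Unnamed: 0'])` would have the same effect on the feature matrix.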

## **Check for null values**

In [None]:
#missing value check
print(df.isnull().sum())
print(df.dtypes)

Unnamed: 0    0
carat         0
cut           0
color         0
clarity       0
depth         0
table         0
price         0
x             0
y             0
z             0
dtype: int64
Unnamed: 0      int64
carat         float64
cut            object
color          object
clarity        object
depth         float64
table         float64
price           int64
x             float64
y             float64
z             float64
dtype: object


#### **No null values, so no missing-value handling is needed**

## **Label encoder**

In [None]:
from sklearn.preprocessing import LabelEncoder
categorical_columns = ['cut', 'color', 'clarity']
label_encoders = {}

for col in categorical_columns:
  le = LabelEncoder()
  df[col] = le.fit_transform(df[col])
  label_encoders[col] = le
  print(df.head())

   Unnamed: 0  carat  cut color clarity  depth  table  price     x     y     z
0           1   0.23    2     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1           2   0.21    3     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2           3   0.23    1     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3           4   0.29    3     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4           5   0.31    1     J     SI2   63.3   58.0    335  4.34  4.35  2.75
   Unnamed: 0  carat  cut  color clarity  depth  table  price     x     y  \
0           1   0.23    2      1     SI2   61.5   55.0    326  3.95  3.98   
1           2   0.21    3      1     SI1   59.8   61.0    326  3.89  3.84   
2           3   0.23    1      1     VS1   56.9   65.0    327  4.05  4.07   
3           4   0.29    3      5     VS2   62.4   58.0    334  4.20  4.23   
4           5   0.31    1      6     SI2   63.3   58.0    335  4.34  4.35   

      z  
0  2.43  
1  2.31  
2  2.31  
3  2.63  
4  2.75  
   

### **Categorical columns converted to numeric values**
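
`LabelEncoder` numbers classes in alphabetical order, which is why `Good`, `Ideal` and `Premium` become 1, 2 and 3 in the output above. For a linear model the numeric order matters, so an encoding that follows the quality order may be a better fit. A minimal sketch using an ordered pandas `Categorical`; the `cut_order` list is an assumption about the grading scale and should be checked against the data source:

```python
import pandas as pd

# Assumed quality order for `cut`, worst to best (an assumption, not taken
# from the notebook).
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

# Small illustrative sample using the first five `cut` values shown above.
sample = pd.DataFrame({"cut": ["Ideal", "Premium", "Good", "Premium", "Good"]})

# An ordered Categorical assigns codes by position in cut_order,
# whereas LabelEncoder assigns them alphabetically.
sample["cut_code"] = pd.Categorical(
    sample["cut"], categories=cut_order, ordered=True
).codes

print(sample)
```

The same pattern would apply to `color` and `clarity` once their grading orders are confirmed.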

# **Model building (without Ridge or Lasso)**

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split,cross_val_score,KFold,GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score


X = df.drop(columns=['price'])
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = mse**0.5
r2 = r2_score(y_test, y_pred)

print("Root Mean Squared Error (RMSE):", rmse)
print("R² Score:", r2)

print("\nModel Coefficients:")
print("Intercept:", model.intercept_)




Root Mean Squared Error (RMSE): 1346.1144269890717
R² Score: 0.8860134363272641

Model Coefficients:
Intercept: 15095.377604124044


#### **WITHOUT RIDGE OR LASSO: R² = 0.886 & RMSE = 1346.11**

In [None]:
print("Coefficients:", model.coef_)

Coefficients: [ 6.90036248e-03  1.08985532e+04  7.46032041e+01 -2.69096300e+02
  2.84594504e+02 -1.49355477e+02 -9.25129437e+01 -1.09577881e+03
  2.63434645e+01  1.17521232e+01]
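
The raw coefficient array is hard to read without the column names next to it. A small sketch of one way to label and sort coefficients, shown on synthetic data with hypothetical column names (`f0` to `f3`) rather than the diamonds features:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data; the column names are hypothetical.
X_arr, y_demo = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

demo_model = LinearRegression().fit(X_demo, y_demo)

# Label each coefficient with its column, then order by absolute size.
coef_table = pd.Series(demo_model.coef_, index=X_demo.columns)
coef_table = coef_table.reindex(coef_table.abs().sort_values(ascending=False).index)

print(coef_table)
```

In the notebook the equivalent would be `pd.Series(model.coef_, index=X.columns)`. Note that coefficient sizes are only comparable across features when the features share a scale.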


# **MODEL WITH RIDGE & LASSO**
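
For reference, these are the objectives that scikit-learn documents for its `Ridge` and `Lasso` estimators. The symbols are notation introduced here, not taken from the notebook: $X$ is the feature matrix, $y$ the target, $w$ the coefficient vector, $n$ the number of training samples, and $\alpha$ the penalty strength passed as `alpha`.

$$
\text{Ridge:}\quad \min_{w}\ \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso:}\quad \min_{w}\ \frac{1}{2n}\lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Because the data-fit terms are scaled differently, the same `alpha` value does not represent the same penalty strength in the two models.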

## **Ridge regression**

In [None]:
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Ridge Regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
y_pred_ridge = ridge.predict(X_test)
print("Ridge RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_ridge)))
r2_ridge = r2_score(y_test, y_pred_ridge)
print("Ridge R² :", r2_ridge)

Ridge RMSE: 1346.097759438186
Ridge R² : 0.8860162590666244
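
The Ridge penalty acts on coefficient size, so it treats features on large and small scales differently. The notebook fits Ridge on unscaled columns; a common alternative is to standardize first. A minimal sketch on synthetic data, assuming a `StandardScaler` plus `Ridge` pipeline is acceptable for the task (this is not the notebook's method):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with one column deliberately put on a larger scale.
X_demo, y_demo = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=0)
X_demo[:, 0] *= 1000.0

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler is fit on the training split only; Ridge then sees
# unit-variance columns, so the penalty treats them alike.
scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scaled_ridge.fit(X_tr, y_tr)

rmse_scaled = np.sqrt(mean_squared_error(y_te, scaled_ridge.predict(X_te)))
print("Scaled Ridge RMSE:", rmse_scaled)
```

Wrapping the scaler in a pipeline also keeps test data out of the scaling statistics when the pipeline is passed to cross-validation.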


## **Lasso regression**

In [None]:
# Lasso Regression
lasso = Lasso(alpha=0.01)
lasso.fit(X_train, y_train)
y_pred_lasso = lasso.predict(X_test)
print("Lasso RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_lasso)))
r2_lasso = r2_score(y_test, y_pred_lasso)
print("Lasso R² :", r2_lasso)

Lasso RMSE: 1346.114166732125
Lasso R² : 0.8860134804034461
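
With `alpha=0.01` the Lasso RMSE matches plain linear regression to three decimal places, which suggests the penalty is too weak to change the fit much. Counting non-zero coefficients is a quick way to see whether Lasso is selecting features at all. A sketch on synthetic data (the dataset and the alpha values are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic stand-in data: 10 features, only 3 of which drive the target.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# Fit Lasso at a weak, a moderate and an extreme penalty and record
# how many coefficients survive at each setting.
nonzero_by_alpha = {}
for alpha in (0.01, 1.0, 1e6):
    lasso_demo = Lasso(alpha=alpha).fit(X_demo, y_demo)
    nonzero_by_alpha[alpha] = int(np.count_nonzero(lasso_demo.coef_))
    print(f"alpha={alpha:g}: {nonzero_by_alpha[alpha]} non-zero coefficients")
```

In the notebook, `np.count_nonzero(lasso.coef_)` after fitting would show how many of the diamonds features are kept.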


## **Ridge regression along with K-Fold**

### k-fold alone

In [None]:
# K-Fold Cross-Validation
model = LinearRegression()
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Perform Cross-Validation
cv_scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')

# Convert negative MSE to positive RMSE
cv_rmse = np.sqrt(-cv_scores)
cv_r2_scores = cross_val_score(model, X, y, cv=kf, scoring='r2')
print("K-Fold Cross-Validation r2 Scores:", cv_r2_scores)
print("Mean r2 (K-Fold):", np.mean(cv_r2_scores))
print("K-Fold Cross-Validation RMSE Scores:", cv_rmse)
print("Mean RMSE (K-Fold):", np.mean(cv_rmse))

K-Fold Cross-Validation r2 Scores: [0.88601344 0.8870416  0.87798075 0.88614142 0.8897397 ]
Mean r2 (K-Fold): 0.8853833806985815
K-Fold Cross-Validation RMSE Scores: [1346.11442699 1334.84684679 1391.80971533 1361.22093824 1317.79025686]
Mean RMSE (K-Fold): 1350.3564368430518
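
The cell above runs `cross_val_score` twice, once per metric, so every fold is fitted twice. `cross_validate` accepts several scorers and collects them in one pass. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# One pass over the folds, collecting both metrics from the same fitted models.
cv_out = cross_validate(
    LinearRegression(), X_demo, y_demo, cv=kf_demo,
    scoring=("r2", "neg_root_mean_squared_error"),
)

# The RMSE scorer is negated by convention, so flip the sign.
fold_rmse = -cv_out["test_neg_root_mean_squared_error"]
fold_r2 = cv_out["test_r2"]

print("Mean R2 (K-Fold):", fold_r2.mean())
print("Mean RMSE (K-Fold):", fold_rmse.mean())
```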


### k-fold with ridge

In [None]:
ridge = Ridge()
param_grid = {"alpha": np.logspace(-3, 3, 10)}
ridge_cv = GridSearchCV(ridge, param_grid, cv=5, scoring='neg_mean_squared_error')
ridge_cv.fit(X_train, y_train)
y_pred_ridge = ridge_cv.best_estimator_.predict(X_test)
print("Best Ridge alpha:", ridge_cv.best_params_['alpha'])
print("Ridge Regression RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_ridge)))
print("Ridge Regression r2:",r2_score(y_test, y_pred_ridge) )

Best Ridge alpha: 2.154434690031882
Ridge Regression RMSE: 1346.0846370728375
Ridge Regression r2: 0.8860184813851615
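
Only the best alpha is printed above. Looking at the cross-validated error for every alpha in the grid shows how flat or steep the curve is around the optimum. A sketch on synthetic data that reads the scores from `cv_results_`:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

alphas = np.logspace(-3, 3, 10)
search = GridSearchCV(Ridge(), {"alpha": alphas}, cv=5, scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

# mean_test_score holds the negative MSE averaged over folds,
# so flip the sign before taking the square root.
cv_table = pd.DataFrame({
    "alpha": np.asarray(search.cv_results_["param_alpha"], dtype=float),
    "cv_rmse": np.sqrt(-search.cv_results_["mean_test_score"]),
})

print(cv_table)
print("Best alpha:", search.best_params_["alpha"])
```

If `cv_rmse` barely changes across the grid, the choice of alpha has little practical effect on the model.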


## **Lasso regression along with K-Fold**

In [None]:
lasso = Lasso()
# Search the hyperparameter alpha over 10 log-spaced values between 1e-3 and 1e3
param_grid = {"alpha": np.logspace(-3, 3, 10)}
lasso_cv = GridSearchCV(lasso, param_grid, cv=5, scoring='neg_mean_squared_error')
lasso_cv.fit(X_train, y_train)
y_pred_lasso = lasso_cv.best_estimator_.predict(X_test)
print("Best Lasso alpha:", lasso_cv.best_params_['alpha'])
print("Lasso Regression RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_lasso)))

Best Lasso alpha: 2.154434690031882
Lasso Regression RMSE: 1346.3865368272543
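
scikit-learn also provides `LassoCV`, which runs the cross-validated alpha search inside the estimator. The sketch below uses it as an alternative to `GridSearchCV` on synthetic data; the `max_iter` value is an arbitrary choice made here to give the solver room to converge:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic stand-in data: 10 features, only 3 of which drive the target.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# Same alpha grid as the notebook's GridSearchCV cell.
alphas = np.logspace(-3, 3, 10)
lasso_path = LassoCV(alphas=alphas, cv=5, max_iter=10000).fit(X_demo, y_demo)

print("Best Lasso alpha:", lasso_path.alpha_)
print("Non-zero coefficients:", np.count_nonzero(lasso_path.coef_))
```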


# **FEATURE SELECTION**

#### N=5 FEATURES

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X = df.drop(columns=['price']) # Independent variables
y = df['price'] # Target variable
model = LinearRegression()
rfe = RFE(model, n_features_to_select=5) # Selecting the best 5 features
X_selected = rfe.fit_transform(X, y)
print("Selected Features:", X.columns[rfe.support_])

Selected Features: Index(['carat', 'color', 'clarity', 'depth', 'x'], dtype='object')
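
`rfe.support_` only says which features were kept. The `ranking_` attribute also records the order in which the others were dropped. Since RFE with a linear model ranks features by coefficient size, the result can depend on feature scale. A sketch on synthetic data with hypothetical column names:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data; the column names are hypothetical.
X_arr, y_demo = make_regression(
    n_samples=300, n_features=6, n_informative=3, noise=5.0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

rfe_demo = RFE(LinearRegression(), n_features_to_select=3).fit(X_demo, y_demo)

# ranking_ is 1 for kept features; larger numbers were eliminated earlier.
ranking = pd.Series(rfe_demo.ranking_, index=X_demo.columns).sort_values()

print(ranking)
```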


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split,cross_val_score,KFold,GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

s_f=['carat','color','clarity','depth','x']
X = df[s_f]
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = mse**0.5
r2 = r2_score(y_test, y_pred)

print("Root Mean Squared Error (RMSE):", rmse)
print("R² Score:", r2)

print("\nModel Coefficients:")
print("Intercept:", model.intercept_)


Root Mean Squared Error (RMSE): 1365.3635150316768
R² Score: 0.882730171317868

Model Coefficients:
Intercept: 8795.031319940857


### **FEATURE SELECTION WITH RIDGE**

In [None]:

# Ridge Regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
y_pred_ridge = ridge.predict(X_test)
print("Ridge RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_ridge)))


Ridge RMSE: 1365.3492319554011


### **FEATURE SELECTION WITH LASSO**

In [None]:
# Lasso Regression
lasso = Lasso(alpha=0.01)
lasso.fit(X_train, y_train)
y_pred_lasso = lasso.predict(X_test)
print("Lasso RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_lasso)))

Lasso RMSE: 1365.362763890327


#### N=3 FEATURES

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X = df.drop(columns=['price']) # Independent variables
y = df['price'] # Target variable
model = LinearRegression()
rfe = RFE(model, n_features_to_select=3) # Selecting the best 3 features
X_selected = rfe.fit_transform(X, y)
print("Selected Features:", X.columns[rfe.support_])

Selected Features: Index(['carat', 'clarity', 'x'], dtype='object')


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split,cross_val_score,KFold,GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

s_f=['carat','clarity','x']
X = df[s_f]
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = mse**0.5
r2 = r2_score(y_test, y_pred)

print("Root Mean Squared Error (RMSE):", rmse)
print("R² Score:", r2)

print("\nModel Coefficients:")
print("Intercept:", model.intercept_)


Root Mean Squared Error (RMSE): 1438.0109311851272
R² Score: 0.8699189366535871

Model Coefficients:
Intercept: -171.02305432538287


In [None]:

# Ridge Regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
y_pred_ridge = ridge.predict(X_test)
print("Ridge RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_ridge)))


Ridge RMSE: 1438.0046893572148


In [None]:
# Lasso Regression
lasso = Lasso(alpha=0.01)
lasso.fit(X_train, y_train)
y_pred_lasso = lasso.predict(X_test)
print("Lasso RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_lasso)))

Lasso RMSE: 1438.0106800142912


In [None]:
from prettytable import PrettyTable

# Create table
table = PrettyTable()
table.field_names = ["Model", "Root Mean Squared Error (RMSE)"]  # one name per value passed to add_row

table.add_row([
    "Without ridge and lasso:\n",1346.11])
table.add_row([
    "with ridge alone:\n",1346.097])
table.add_row([
    "with lasso alone: \n",1346.114])
table.add_row([
    "K-fold with ridge: \n",1346.084])
table.add_row([
    "K-fold with lasso: \n",1346.386])
table.add_row([
    "Feature selection with n=5 ridge: \n",1365.349])
table.add_row([
    "Feature selection with n=5 lasso: \n",1365.362])
table.add_row([
    "Feature selection with n=3 ridge: \n",1438.004])
table.add_row([
    "Feature selection with n=3 lasso: \n",1438.010])



# Set column alignments
table.align["Model"] = "l"
table.align["Root Mean Squared Error (RMSE)"] = "c"

# Print table
print(table)

+------------------------------------+--------------------------------+
| Model                              | Root Mean Squared Error (RMSE) |
+------------------------------------+--------------------------------+
| Without ridge and lasso:           |            1346.11             |
|                                    |                                |
| with ridge alone:                  |            1346.097            |
|                                    |                                |
| with lasso alone:                  |            1346.114            |
|                                    |                                |
| K-fold with ridge:                 |            1346.084            |
|                                    |                                |
| K-fold with lasso:                 |            1346.386            |
|                                    |                                |
| Feature selection with n=5 ridge:  |            1365.349      
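
The differences between the models are easier to judge when the table is sorted and shows each model's gap to the best one. A minimal sketch using pandas, with the RMSE values copied from the table code above; the shortened labels and the `diff_vs_best` column name are choices made here:

```python
import pandas as pd

# RMSE values copied from the comparison table code above.
rmse_by_model = {
    "Without ridge and lasso": 1346.11,
    "With ridge alone": 1346.097,
    "With lasso alone": 1346.114,
    "K-fold with ridge": 1346.084,
    "K-fold with lasso": 1346.386,
    "Feature selection with n=5 ridge": 1365.349,
    "Feature selection with n=5 lasso": 1365.362,
    "Feature selection with n=3 ridge": 1438.004,
    "Feature selection with n=3 lasso": 1438.010,
}

# Sort from lowest to highest error and add the gap to the best model.
summary = pd.Series(rmse_by_model, name="RMSE").sort_values().to_frame()
summary["diff_vs_best"] = summary["RMSE"] - summary["RMSE"].iloc[0]

print(summary.round(3))
```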

# **RESULT:**

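The best-performing model is **K-fold with Ridge** (alpha ≈ 2.15, RMSE **1346.084**), which achieves the lowest test error. The margin is very small, however: plain linear regression reaches an RMSE of 1346.114, so regularization changes the error by less than 0.03 on this dataset. Ridge edges out Lasso in every comparison, possibly because Lasso's feature selection may remove useful features, and the tuned Lasso (1346.386) is slightly worse than the unregularized baseline. K-fold cross-validation gives a more stable basis for choosing alpha, while restricting the model to RFE-selected features degrades performance (RMSE 1365.35 with n=5 and 1438.00 with n=3). Thus, **K-fold Ridge is the optimal choice** for this dataset.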