In this analysis I will use the same dataset on which some time ago I perform backward and forward selection "Diamonds-Feature Selection - implementation forward and backward selection" - the notebook name. It is available here:https://github.com/sylwiaSekula/Analyses-statistics-algorithms--English/blob/main/Diamonds-Feature%20Selection%20-%20implementation%20of%20Forward%20and%20Backward%20selection.ipynb .
I will use the LASSO regression to select features and then I will compare them with the ones I selected in the above file.

In [84]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import Lasso, Ridge, LassoCV, RidgeCV

In [42]:
data_frame = pd.read_csv('./Diamonds Prices2022.csv',index_col=0)
data_frame.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 53943 entries, 1 to 53943
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    53943 non-null  float64
 1   cut      53943 non-null  object 
 2   color    53943 non-null  object 
 3   clarity  53943 non-null  object 
 4   depth    53943 non-null  float64
 5   table    53943 non-null  float64
 6   price    53943 non-null  int64  
 7   x        53943 non-null  float64
 8   y        53943 non-null  float64
 9   z        53943 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.5+ MB


In [43]:
#Label encoding for variable cut, it will order them in the order - Fair - the worst cut, Premium - the best cut
cut_names =['Fair', 'Good', 'Very Good', 'Ideal','Premium']
cut_labels = pd.factorize(cut_names)[0]
data_frame['cut'] = data_frame['cut'].map(dict(zip(cut_names, cut_labels)))

In [44]:
#Label encoding, we can also encode the colors of diamonds using Label endocding,
# because with the variable color we can order theoretically the most valuable diamonds are those with color D, 
# and the least valuable are those with color J, between them we keep the alphabetical order
color_names =['J', 'I', 'H', 'G','F','E','D']
color_labels = pd.factorize(color_names)[0]
data_frame['color'] = data_frame['color'].map(dict(zip(color_names, color_labels)))

In [45]:
#We can also order this categorical variable, since diamonds have a clarity scale 
clarity_names =['I1', 'VVS1','VVS2', 'VS1', 'VS2','SI1','SI2','IF']
clarity_labels = pd.factorize(clarity_names)[0]
data_frame['clarity'] = data_frame['clarity'].map(dict(zip(clarity_names, clarity_labels)))

In [52]:
final_df = data_frame.copy()
y = final_df['price']
X = final_df.drop('price', axis=1)

## Feature selection with Lasso 

In [53]:
#create the train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [54]:
#create the Lasso model
lasso = Lasso(alpha=2.81)

In [55]:
lasso.fit(X_train, y_train)

Lasso(alpha=2.81)

In [56]:
print(list(zip(lasso.coef_, X)))#use "coef" method to see coefficients for the variables. The variables Y and Z have
# equal 0 coefficient which means they are irrelevant

[(10555.482660661643, 'carat'), (114.25054992937571, 'cut'), (272.7465981311071, 'color'), (-210.82242788002804, 'clarity'), (-138.75305961623533, 'depth'), (-89.55106123905037, 'table'), (-990.3575428751835, 'x'), (0.0, 'y'), (-0.0, 'z')]


Feature selection with the LASSO regression allowed us to select variables for the model. These are all the features in the dataset except "Y" and "Z" columns. This is a different result than I got with the previous implementation of Forward and Backward selection - there the variables selected were: "Carat", "Color" and "Clarity".

In [57]:
lasso.predict(X_test)

array([ 445.67724492, 7495.71916963,  421.2426126 , ..., 1055.59908941,
       1254.89277241, 6177.22390295])

In [58]:
lasso.score(X_train, y_train)

0.8770077067648161

In [59]:
lasso.score(X_test, y_test) #score for the train and test datasets are similar

0.8793629962522866

## Ridge regression
Now I will perform the RIDGE regression usuing the features I selected with LASSO.

In [60]:
#scale the data
scaler = StandardScaler()

In [61]:
final_df = data_frame.copy()
final_df[["carat", "depth", "table", "price", 'x', 'y', 'z']] = scaler.fit_transform(data_frame
                                                                                     [["carat", "depth", "table", "price", 'x', 'y', 'z']])

In [63]:
y_ridge = final_df['price']#target
X_ridge = final_df.drop(['price', 'y', 'z'], axis=1)#remove the irrelevant features and the target

In [67]:
#create the train and test dataset
X_train_ridge, X_test_ridge, y_train_ridge, y_test_ridge = train_test_split(X, y, test_size=0.33, random_state=42)

In [68]:
ridge = Ridge(alpha=2.81)

In [73]:
ridge.fit(X_train_ridge, y_train_ridge) #fit

Ridge(alpha=2.81)

In [74]:
ridge.predict(X_train_ridge) #predict

array([  39.953041  ,  358.43810025, 1602.25413741, ..., -131.57678891,
       3319.17302609, 6698.72946293])

In [71]:
ridge.score(X_train_ridge, y_train_ridge)

0.8771381103111118

In [72]:
ridge.score(X_test_ridge, y_test_ridge)

0.8794895300504209

## Cross validation

In [90]:
print(cross_val_score(lasso, X_train, y_train, cv=10))

[0.86015154 0.88175656 0.87303158 0.86968177 0.86939401 0.89218804
 0.87610569 0.88371352 0.87693974 0.88411653]


In [88]:
print(cross_val_score(ridge, X_train_ridge, y_train_ridge, cv=10))

[0.85992679 0.8822906  0.87327348 0.86902155 0.8690113  0.88948488
 0.8765135  0.88372417 0.87707722 0.88457447]
