### Wassim Mecheri Lab 2
# Selecting the Best Regression Model for Job Satisfaction Prediction

## Objective

The goal of this task is to identify the best model to predict job satisfaction based on employee characteristics. You will compare three models: Lasso, Ridge, and Elastic Net, and find the best penalty value for each model.

# 0 Imports

We import everything needed for this Lab:
- Pandas for data manipulation and analysis
- The function to split the DataSet into train, validation and test
- All the needed models (Ridge, Lasso, ElasticNet)
- Numpy because we need np.inf and np.logspace
- The function that computes the MSE (mean squared error)

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
import numpy as np
from sklearn.metrics import mean_squared_error

# 1 Data Splitting

## 1.1 Read the DataSet

We first read the DataSet in csv format using Pandas.

In [2]:
df = pd.read_csv('modified_data.csv')

## 1.2 Create y and X

Then we create the features X and the target y. We must first encode Gender and Education Level since they are not numerical values!

In [3]:
df = pd.get_dummies(df, columns=['Gender', 'Education_Level'], prefix=['Gender', 'Education_Level'])

y = df['Job_Satisfaction']
X = df.drop(columns='Job_Satisfaction')
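
To make the encoding step concrete, here is a minimal sketch on a tiny made-up table. The rows and the `Age` column are hypothetical (the real columns of `modified_data.csv` are not shown in this Lab); only `Gender`, `Education_Level` and `Job_Satisfaction` come from the cell above.

```python
import pandas as pd

# Hypothetical toy data: only the three column names used in this Lab are real.
toy_df = pd.DataFrame({
    'Age': [25, 40, 31, 52],  # assumed numeric feature
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Education_Level': ['Bachelor', 'Master', 'PhD', 'Bachelor'],
    'Job_Satisfaction': [6.5, 7.0, 8.2, 5.1],
})

# One-hot encode the two categorical columns, as in the cell above.
toy_encoded = pd.get_dummies(
    toy_df,
    columns=['Gender', 'Education_Level'],
    prefix=['Gender', 'Education_Level'],
)

# Separate the target from the features.
toy_y = toy_encoded['Job_Satisfaction']
toy_X = toy_encoded.drop(columns='Job_Satisfaction')

print(list(toy_X.columns))
```

Each category becomes its own indicator column, so the original text columns disappear and every remaining feature is numeric or boolean.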

## 1.3 Split into train, validation and test

We now split the data into train, validation and test sets: three for the features and three for the target.
We use a temporary set because:
- First we split the data set once: 80% for training and 20% for both validation and test
- Then we split the remaining 20% in half to get 10% for validation and 10% for test.

In [4]:
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42) 
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
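
As a quick sanity check of the 80/10/10 proportions, the same two calls can be run on placeholder data. This sketch assumes 100 rows, the size mentioned in the notes of section 2.3; the arrays are dummy values, not the real DataSet.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 100 rows (dummy values, not the real DataSet).
demo_X = np.arange(200).reshape(100, 2)
demo_y = np.arange(100)

# Same two-step split as above: 80% train, then the remaining 20% cut in half.
dX_train, dX_temp, dy_train, dy_temp = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42)
dX_val, dX_test, dy_val, dy_test = train_test_split(
    dX_temp, dy_temp, test_size=0.5, random_state=42)

print(len(dX_train), len(dX_val), len(dX_test))
```

With 100 rows, validation and test end up with 10 rows each.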

# 2 Model Selection

To select the best model:
- We generate 10 different penalty values which will be stored in an array
- We train the three models using the training data
- We make predictions using the validation data
- We calculate the MSE of each model and keep the lowest one, to find the best penalty value for each model (Error is everything!)
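
The penalty grid from the first step can be inspected on its own. The sketch below only uses `np.logspace` with the arguments from the training cell, and also prints the fractional part of each value, which is what the training cell passes to ElasticNet as `l1_ratio`. The names `grid` and `l1_ratios` are introduced here for illustration.

```python
import numpy as np

# 10 penalty values, evenly spaced on a log scale between 1e-4 and 1e4.
grid = np.logspace(-4, 4, 10)

# The training cell uses penalty_value % 1 as l1_ratio, i.e. the fractional part.
l1_ratios = grid % 1

for alpha, ratio in zip(grid, l1_ratios):
    print(f"alpha={alpha:12.6f}  l1_ratio={ratio:.6f}")
```

Neighbouring penalty values differ by a factor of about 7.7, so the grid is coarse, and the `l1_ratio` changes together with the penalty instead of being tuned separately.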

## 2.1 Training

In [5]:
penalty_values = np.logspace(-4, 4, 10) # Array of 10 values, log-spaced from 1e-4 to 1e4

# To store the best results
results_df = pd.DataFrame({'Models': ['Ridge', 'Lasso', 'ElasticNet'], 'Penalty_Value': [np.inf, np.inf, np.inf], 'MSE': [np.inf, np.inf, np.inf]})

for penalty_value in penalty_values:
    # Create models
    ridge = Ridge(alpha = penalty_value)
    lasso = Lasso(alpha = penalty_value)
    elastic = ElasticNet(alpha = penalty_value, l1_ratio = penalty_value % 1) # l1_ratio = fractional part of the penalty, so 0 <= l1_ratio < 1 (it changes with each penalty value)

    # Train models (with training data!)
    ridge.fit(X_train, y_train)
    lasso.fit(X_train, y_train)
    elastic.fit(X_train, y_train)

    # Try to predict y (with validation data!)
    y_pred_ridge = ridge.predict(X_val)
    y_pred_lasso = lasso.predict(X_val)
    y_pred_elastic = elastic.predict(X_val)

    # Calculate the mse and store if better than previous
    mse_ridge = mean_squared_error(y_val, y_pred_ridge)
    mse_lasso = mean_squared_error(y_val, y_pred_lasso)
    mse_elastic = mean_squared_error(y_val, y_pred_elastic)

    if mse_ridge < results_df.loc[results_df['Models'] == 'Ridge', 'MSE'].values[0]:
        results_df.loc[results_df['Models'] == 'Ridge', 'Penalty_Value'] = penalty_value
        results_df.loc[results_df['Models'] == 'Ridge', 'MSE'] = mse_ridge

    if mse_lasso < results_df.loc[results_df['Models'] == 'Lasso', 'MSE'].values[0]:
        results_df.loc[results_df['Models'] == 'Lasso', 'Penalty_Value'] = penalty_value
        results_df.loc[results_df['Models'] == 'Lasso', 'MSE'] = mse_lasso

    if mse_elastic < results_df.loc[results_df['Models'] == 'ElasticNet', 'MSE'].values[0]:
        results_df.loc[results_df['Models'] == 'ElasticNet', 'Penalty_Value'] = penalty_value
        results_df.loc[results_df['Models'] == 'ElasticNet', 'MSE'] = mse_elastic

results_df

  model = cd_fast.enet_coordinate_descent(


| | Models | Penalty_Value | MSE |
|---|---|---|---|
| 0 | Ridge | 1291.549665 | 0.893453 |
| 1 | Lasso | 2.782559 | 0.928071 |
| 2 | ElasticNet | 2.782559 | 0.883021 |


## 2.2 Model Selection Results

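The table above gives the best penalty value found for each model, together with its validation MSE. These MSEs (about 0.88 to 0.93) look fairly small given the scale of our target values, so we can assume that the models are reasonable. However, the penalties were chosen to minimize exactly this validation error, so it is probably optimistic: we still need to try predictions on the test data, which we can also use to compare the models with each other.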
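
One way to judge whether an MSE is "small" is to compare it with a baseline that always predicts the mean of the training target. The sketch below uses scikit-learn's `DummyRegressor` on made-up scores; the 1 to 10 scale is an assumption for illustration only, since the real scale of `Job_Satisfaction` is not shown here.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)

# Hypothetical targets on an assumed 1-10 scale (not the real DataSet).
base_y_train = rng.integers(1, 11, size=80).astype(float)
base_y_val = rng.integers(1, 11, size=10).astype(float)

# The baseline ignores the features, so placeholder columns are enough.
base_X_train = np.zeros((80, 1))
base_X_val = np.zeros((10, 1))

# Always predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(base_X_train, base_y_train)
baseline_mse = mean_squared_error(base_y_val, baseline.predict(base_X_val))

print("Baseline MSE:", baseline_mse)
```

A model is only useful if its validation MSE is clearly below this baseline value computed on the same validation rows.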

## 2.3 Personal Notes

- 10 values is a very small sample to find the best penalty value; with many more values we could find a better one
- The DataSet is small, only 100 rows, so the validation and test sets have just 10 rows each. With more rows, the results would be more reliable
- I have no idea what the warning message is about, and I really dislike warning messages. I hope it’s nothing serious.. :(
BUT if we increase the number of iterations, as suggested by the warning message, AND set the l1_ratio parameter to 0.5, apparently a common practice, the warning message disappears! Not sure why, though... :D
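
The first and third notes can be tried together in a small sketch: a denser penalty grid, a fixed `l1_ratio` of 0.5 and a larger `max_iter` for ElasticNet. The data comes from `make_regression` and is purely synthetic, and the values 50 and 10_000 are arbitrary choices for illustration, not tuned settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real DataSet (100 rows, 8 numeric features).
syn_X, syn_y = make_regression(n_samples=100, n_features=8, noise=10.0, random_state=42)
sX_train, sX_val, sy_train, sy_val = train_test_split(
    syn_X, syn_y, test_size=0.2, random_state=42)

# Denser grid: 50 values instead of 10 over the same range.
dense_penalties = np.logspace(-4, 4, 50)

best_alpha, best_mse = None, np.inf
for alpha in dense_penalties:
    # Fixed l1_ratio and more iterations for the coordinate descent solver.
    model = ElasticNet(alpha=alpha, l1_ratio=0.5, max_iter=10_000)
    model.fit(sX_train, sy_train)
    mse = mean_squared_error(sy_val, model.predict(sX_val))
    if mse < best_mse:
        best_alpha, best_mse = alpha, mse

print("Best alpha:", best_alpha, "validation MSE:", best_mse)
```

Fixing `l1_ratio` also separates the two hyperparameters, so the penalty search no longer changes the L1/L2 mix as a side effect.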

# 3 Model Comparison

Now that we have found the best penalty value for each model, we can evaluate the models on the test data.

In [6]:
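# Create models using best values
# Note: only Ridge's best penalty is read here, and it is reused for all three
# models, so Lasso and ElasticNet do not use their own best penalty values
penalty_value = results_df.loc[results_df['Models'] == 'Ridge', 'Penalty_Value'].values[0]
best_ridge = Ridge(alpha = penalty_value)
best_lasso = Lasso(alpha = penalty_value)
best_elastic = ElasticNet(alpha = penalty_value, l1_ratio = penalty_value % 1)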

# Train models (with training data!)
best_ridge.fit(X_train, y_train)
best_lasso.fit(X_train, y_train)
best_elastic.fit(X_train, y_train)

# Try to predict y (with test data!)
y_pred_best_ridge = best_ridge.predict(X_test)
y_pred_best_lasso = best_lasso.predict(X_test)
y_pred_best_elastic = best_elastic.predict(X_test)

# Calculate the MSE on the test data
mse_best_ridge = mean_squared_error(y_test, y_pred_best_ridge)
mse_best_lasso = mean_squared_error(y_test, y_pred_best_lasso)
mse_best_elastic = mean_squared_error(y_test, y_pred_best_elastic)

# Print results for test
print("Ridge ", mse_best_ridge)
print("Lasso ", mse_best_lasso)
print("ElasticNet ", mse_best_elastic)

Ridge  1.5871022985200427
Lasso  4.364248321370532
ElasticNet  4.330725570737246


## 3.1 Model Comparison Results

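According to the results we got, Ridge looks better than Lasso and ElasticNet at predicting job satisfaction based on employee characteristics: its test MSE (about 1.59) is much lower than that of the other two models (about 4.36 and 4.33). However, the cell above uses Ridge's best penalty (about 1291.55) for all three models, so Lasso and ElasticNet are not evaluated at their own best penalty (about 2.78), and the comparison is not yet a fair one.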
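
A possible way to evaluate each model with its own best penalty is sketched below. The penalty values are copied from the results table in section 2.1; the data is synthetic (`make_regression`), so the printed numbers say nothing about the real DataSet. The dictionary names are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real DataSet.
fx, fy = make_regression(n_samples=100, n_features=6, noise=5.0, random_state=42)
fX_train, fX_temp, fy_train, fy_temp = train_test_split(fx, fy, test_size=0.2, random_state=42)
fX_val, fX_test, fy_val, fy_test = train_test_split(fX_temp, fy_temp, test_size=0.5, random_state=42)

# Best penalties per model, copied from the results table in section 2.1.
best_penalties = {'Ridge': 1291.549665, 'Lasso': 2.782559, 'ElasticNet': 2.782559}

# Each model gets its own penalty; ElasticNet keeps the l1_ratio rule of the training cell.
models = {
    'Ridge': Ridge(alpha=best_penalties['Ridge']),
    'Lasso': Lasso(alpha=best_penalties['Lasso']),
    'ElasticNet': ElasticNet(alpha=best_penalties['ElasticNet'],
                             l1_ratio=best_penalties['ElasticNet'] % 1),
}

# Train on the training data, then score on the test data.
test_mse = {}
for name, model in models.items():
    model.fit(fX_train, fy_train)
    test_mse[name] = mean_squared_error(fy_test, model.predict(fX_test))

for name, mse in test_mse.items():
    print(name, mse)
```

In the notebook itself, the same idea amounts to reading `Penalty_Value` from `results_df` once per model instead of once for Ridge only.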

## 3.2 Personal Notes

- Why are the MSEs higher now? What happened?
- Why do we split into train 80%, validation 10% and test 10%, when we train on the train data, then predict on validation, then predict on test? Predicting on validation and on test sounds like the same thing to me: both are 10% of the data, and the model never saw the validation data, just like the test data, so it's basically the same, no?!
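
One possible way to look at the second question is a toy simulation, unrelated to the real DataSet. Here every "model" is pure random noise, so no candidate is truly better than another. Still, the candidate picked because it has the lowest validation MSE looks good on validation, while its test MSE falls back to the typical level. The numbers of trials and candidates are arbitrary assumptions; only the 10 rows per set mirror this Lab.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_candidates, n_rows = 200, 50, 10  # 10 rows, like our validation/test sets

val_of_best, test_of_best = [], []
for _ in range(n_trials):
    # Random targets for a validation set and a test set.
    sim_y_val = rng.normal(size=n_rows)
    sim_y_test = rng.normal(size=n_rows)

    # Each candidate "model" predicts pure noise on both sets.
    pred_val = rng.normal(size=(n_candidates, n_rows))
    pred_test = rng.normal(size=(n_candidates, n_rows))

    val_mse = ((pred_val - sim_y_val) ** 2).mean(axis=1)
    test_mse_sim = ((pred_test - sim_y_test) ** 2).mean(axis=1)

    # Select the candidate with the lowest validation MSE.
    best = val_mse.argmin()
    val_of_best.append(val_mse[best])
    test_of_best.append(test_mse_sim[best])

mean_val = np.mean(val_of_best)
mean_test = np.mean(test_of_best)
print("Mean validation MSE of the selected candidate:", mean_val)
print("Mean test MSE of the selected candidate:", mean_test)
```

The validation data is not used for training, but it is used for choosing, so the winner's validation score is partly luck. The test set is never used for any choice, which is why it can give an honest estimate, and why a higher test MSE is not surprising with only 10 rows per set.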