# Model Tuning

Anthony Amadasun

December 15th, 2023

---

### Introduction

In this section of the notebook, I refine the predictive models to improve their performance. The work involves tuning hyperparameters, revisiting feature selection, and identifying a production-ready model. The objectives are:

- Hyperparameter Tuning/Feature Removal: Implement strategies to optimize model performance.
- Model Evaluation: Analyze the tuned models and compare their performance using relevant metrics.
- Identify a Production Model: Choose the model that best aligns with the problem statement and dataset.


---

### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso, RidgeCV
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

---

### Load in Data

In [2]:
# load preprocessed data
df_cleaned = pd.read_csv('../data/test_clean3.csv')
file_path_train = '../data/train.csv'
df_train = pd.read_csv(file_path_train)

---

### Hyperparameter Tuning/Feature Removal

In [3]:
# Create interaction term to optimize model
df_cleaned['Garage_Interaction'] = df_cleaned['Garage_Cars'] * df_cleaned['Garage_Area']
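
To make the interaction term concrete, here is a minimal sketch on made-up values (the `toy` frame and its numbers are illustrative assumptions, not rows from the dataset). It also checks how strongly the two garage columns move together, which is one plausible reason to fold them into a single feature before dropping the originals.

```python
import pandas as pd

# Toy data: the values are invented purely for illustration.
toy = pd.DataFrame({
    "Garage_Cars": [0, 1, 2, 3],
    "Garage_Area": [0, 280, 520, 840],
})

# Element-wise product: a single feature that grows with both
# garage capacity and garage area.
toy["Garage_Interaction"] = toy["Garage_Cars"] * toy["Garage_Area"]

# Pearson correlation between the two source columns.
garage_corr = toy["Garage_Cars"].corr(toy["Garage_Area"])

print(toy)
print(f"corr(Garage_Cars, Garage_Area) = {garage_corr:.3f}")
```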


In [4]:
#Feature Removal
df_cleaned = df_cleaned.drop(['Garage_Cars', 'Garage_Area', 
                              'Fireplaces', 'TotRms_AbvGrd'], axis=1)

In [5]:
numeric_columns = df_cleaned.select_dtypes(include=['number']).columns


In [6]:
#tabular format
df_cleaned[numeric_columns].corr()[['SalePrice']].sort_values(by= 'SalePrice', ascending= False)

| Feature | SalePrice |
|---|---|
| SalePrice | 1.0 |
| Overall_Qual | 0.800207 |
| Gr_Liv_Area | 0.697038 |
| Garage_Interaction | 0.690596 |
| interaction_total_bathrooms | 0.630207 |
| Year_Built | 0.571849 |
| Year_Remod/Add | 0.55037 |
| Mas_Vnr_Area | 0.503579 |
| Neighborhood_NridgHt | 0.448647 |
| Open_Porch_SF | 0.333476 |


In [7]:
X = df_cleaned.select_dtypes(include=['float64', 'int64']).drop('SalePrice', axis=1)
y = df_cleaned['SalePrice']

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=42)

In [9]:
# Scale the data.
scaler = StandardScaler()

# Fit the scaler on the training set only (to learn the mean and standard
# deviation), then apply the same transform to both the training and test sets.
Z_train = scaler.fit_transform(X_train)
Z_test = scaler.transform(X_test)

**Tuned Ridge Model**

In [10]:
# Logic for hyperparameter tuning adapted from lesson 4.06 and the
# article by Tara Boyle: https://towardsdatascience.com/linear-regression-models-4a3d14b8d368
# Define the hyperparameter grid for the Ridge model.
# The data preprocessing step already gave a rough idea of the optimal alpha.
param_grid = {'alpha': [1, 10, 100, 200, 300]}

In [11]:
# Create the Ridge model
ridge_model = Ridge()

In [12]:
# Search the hyperparameter grid for the alpha that maximizes the
# negative MSE (equivalently, minimizes the cross-validated MSE)
grid_search = GridSearchCV(estimator=ridge_model, param_grid=param_grid, scoring='neg_mean_squared_error', cv=5)


In [13]:
# Fit the grid search
grid_search.fit(Z_train, y_train)
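
One caveat with fitting the grid search on `Z_train`: the scaler has already seen every training row, so the validation part of each cross-validation fold contributed to the scaling statistics. The effect is usually small, but wrapping the scaler and the model in a `Pipeline` refits the scaler inside each fold. The sketch below shows the idea on synthetic data; every `_demo` name is hypothetical and not part of this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features (assumption, not the real data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
true_coefs = np.array([3.0, -2.0, 0.0, 1.0, 0.5])
y_demo = X_demo @ true_coefs + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is a pipeline step, so it is refit on the training part
# of every cross-validation fold.
pipe_demo = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
search_demo = GridSearchCV(
    pipe_demo,
    param_grid={"ridge__alpha": [1, 10, 100, 200, 300]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search_demo.fit(X_tr, y_tr)

print(search_demo.best_params_)
```

The fitted `search_demo` can then call `predict` on unscaled test features directly, because the pipeline applies the scaler itself.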

In [14]:
# select best hyperparameters 
best_params = grid_search.best_params_
print(f'Best Hyperparameters: {best_params}')

Best Hyperparameters: {'alpha': 100}
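
`best_params_` reports only the winner. To see how close the other candidates were, the cross-validation scores can be pulled from `cv_results_` and converted from negative MSE to RMSE. This is a sketch on synthetic data (the `_demo` names and `cv_table` are my own), not the output of the grid search above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption; not the housing data).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=15.0, random_state=42
)

search_demo = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [1, 10, 100, 200, 300]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search_demo.fit(X_demo, y_demo)

# mean_test_score holds the mean negative MSE across folds, so flip the
# sign and take the square root. This is the root of the mean fold MSE,
# which is not identical to the mean of the per-fold RMSE values.
cv_table = pd.DataFrame({
    "alpha": [p["alpha"] for p in search_demo.cv_results_["params"]],
    "cv_rmse": np.sqrt(-search_demo.cv_results_["mean_test_score"]),
}).sort_values("cv_rmse")

print(cv_table)
```

If the best alpha sits at either end of the grid, that is usually a hint to widen the grid in that direction.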


In [15]:
# use best model for predictions
best_ridge_model = grid_search.best_estimator_
y_pred_train = best_ridge_model.predict(Z_train)
y_pred_test = best_ridge_model.predict(Z_test)

In [16]:
# Evaluate the tuned model's performance
mae_test = mean_absolute_error(y_test, y_pred_test)
r2_test = r2_score(y_test, y_pred_test)
rmse_ridge_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
rmse_ridge_test = np.sqrt(mean_squared_error(y_test, y_pred_test))

In [17]:
print('Tuned Ridge Regression Results:')
print(f'MAE (Test): {mae_test:.2f}')
print(f'R² (Test): {r2_test:.4f}')
print(f'RMSE (Train): {rmse_ridge_train:.2f}')
print(f'RMSE (Test): {rmse_ridge_test:.2f}')

Tuned Ridge Regression Results:
MAE (Test): 19930.94
R² (Test): 0.8655
RMSE (Train): 29628.60
RMSE (Test): 28272.03
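
The same metrics are computed again for the Lasso model, so a small helper could remove the duplication. Below is a minimal sketch; `regression_report` is a hypothetical helper name, and the arrays are toy values rather than predictions from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MAE, RMSE, and R² for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        "RMSE": float(np.sqrt(mse)),  # square root of the mean squared error
        "R2": float(r2_score(y_true, y_pred)),
    }


# Toy values for illustration only.
y_true_demo = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_demo = np.array([110.0, 190.0, 310.0, 390.0])

report = regression_report(y_true_demo, y_pred_demo)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

In the notebook, the helper would be called once per model and split, for example `regression_report(y_test, y_pred_test)`.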


**Tuned Lasso Model**

In [18]:
#define the hyperparameter grid for Lasso Regression
lasso_param_grid = {'alpha': [1, 10, 100, 500, 1000]}

In [19]:
#Create the Lasso  model
lasso_model = Lasso()

In [20]:
# Search the hyperparameter grid for the alpha that maximizes the
# negative MSE (equivalently, minimizes the cross-validated MSE)
lasso_grid_search = GridSearchCV(estimator=lasso_model, param_grid=lasso_param_grid, scoring='neg_mean_squared_error', cv=5)

In [21]:
# Fit the grid search
lasso_grid_search.fit(Z_train, y_train)

  model = cd_fast.enet_coordinate_descent(
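
The `cd_fast.enet_coordinate_descent` output above looks like the tail of scikit-learn convergence warnings, although the warning text itself did not survive in this export, so that reading is an assumption. If it is correct, a common response is to give coordinate descent a larger iteration budget through `max_iter`. The sketch below uses synthetic data and hypothetical `_demo` names, and counts any convergence warnings instead of printing them.

```python
import warnings

from sklearn.datasets import make_regression
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption; not the housing data).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, noise=10.0, random_state=42
)

# Lasso's default max_iter is 1000; a larger budget gives coordinate
# descent more passes to converge, at the cost of longer fits.
lasso_demo = Lasso(max_iter=10_000)
search_demo = GridSearchCV(
    lasso_demo,
    param_grid={"alpha": [1, 10, 100, 500, 1000]},
    scoring="neg_mean_squared_error",
    cv=5,
)

# Record convergence warnings raised during the fits.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", category=ConvergenceWarning)
    search_demo.fit(X_demo, y_demo)

n_convergence_warnings = sum(
    issubclass(w.category, ConvergenceWarning) for w in caught
)
print(f"Convergence warnings raised: {n_convergence_warnings}")
```

Whether a higher `max_iter` silences the warnings on the real data would need to be checked by rerunning the cell.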


In [22]:
# select best hyperparameters 
best_lasso_params = lasso_grid_search.best_params_
print(f'Best Lasso Hyperparameters: {best_lasso_params}')

Best Lasso Hyperparameters: {'alpha': 500}


In [23]:
# use best model for predictions
best_lasso_model = lasso_grid_search.best_estimator_
y_pred_lasso_train = best_lasso_model.predict(Z_train)
y_pred_lasso_test = best_lasso_model.predict(Z_test)

In [24]:
# Evaluate the tuned model's performance
mae_lasso_test = mean_absolute_error(y_test, y_pred_lasso_test)
r2_lasso_test = r2_score(y_test, y_pred_lasso_test)
rmse_lasso_train = np.sqrt(mean_squared_error(y_train, y_pred_lasso_train))
rmse_lasso_test = np.sqrt(mean_squared_error(y_test, y_pred_lasso_test))

In [25]:
print('Tuned Lasso Regression Results:')
print(f'MAE (Test): {mae_lasso_test:.2f}')
print(f'R² (Test): {r2_lasso_test:.4f}')
print(f'RMSE (Train): {rmse_lasso_train:.2f}')
print(f'RMSE (Test): {rmse_lasso_test:.2f}')

Tuned Lasso Regression Results:
MAE (Test): 19902.55
R² (Test): 0.8687
RMSE (Train): 29694.36
RMSE (Test): 27929.80


---

### Model Evaluation

In [26]:
print(" Lasso ".center(18, "="))
print('Tuned Lasso Regression Results:')
print(f'MAE (Test): {mae_lasso_test:.2f}')
print(f'RMSE (Train): {rmse_lasso_train:.2f}')
print(f'RMSE (Test): {rmse_lasso_test:.2f}')
print(f'R² (Test): {r2_lasso_test:.4f}')
print()
print(" Ridge ".center(18, "="))
print('Tuned Ridge Regression Results:')
print(f'MAE (Test): {mae_test:.2f}')
print(f'RMSE (Train): {rmse_ridge_train:.2f}')
print(f'RMSE (Test): {rmse_ridge_test:.2f}')
print(f'R² (Test): {r2_test:.4f}')

Tuned Lasso Regression Results:
MAE (Test): 19902.55
RMSE (Train): 29694.36
RMSE (Test): 27929.80
R² (Test): 0.8687

Tuned Ridge Regression Results:
MAE (Test): 19930.94
RMSE (Train): 29628.60
RMSE (Test): 28272.03
R² (Test): 0.8655
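
To put the two result sets in relative terms, the reported figures can be plugged into a quick calculation. This is only arithmetic on the numbers printed above; the variable names are mine.

```python
# Values copied from the printed results above.
rmse_lasso_train_reported = 29694.36
rmse_lasso_test_reported = 27929.80
rmse_ridge_test_reported = 28272.03

# Relative gap between Lasso's train and test RMSE.
train_test_gap_pct = (
    (rmse_lasso_train_reported - rmse_lasso_test_reported)
    / rmse_lasso_train_reported * 100
)

# Relative improvement of Lasso over Ridge on the test set.
lasso_vs_ridge_pct = (
    (rmse_ridge_test_reported - rmse_lasso_test_reported)
    / rmse_ridge_test_reported * 100
)

print(f"Lasso train/test RMSE gap: {train_test_gap_pct:.1f}%")
print(f"Lasso test RMSE improvement over Ridge: {lasso_vs_ridge_pct:.1f}%")
```

The test RMSE is roughly 6% below the training RMSE for Lasso, and Lasso beats Ridge on the test set by only about 1%, which supports treating the two models as comparable on accuracy.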


--- 

### Identify Production Model

For the production model, I prioritized generalizability and interpretability and selected the Lasso regression model, for three reasons:

- Lasso and Ridge perform comparably on the test set, with similar MAE, RMSE, and R² values, so neither model significantly outperforms the other on accuracy alone.
- Lasso can set some coefficients to exactly zero, which effectively removes those features from the model. This makes the model more interpretable by highlighting the most influential features.
- Both models generalize well, but Lasso's built-in feature selection yields a simpler model than Ridge while still generalizing well.

Given the problem statement, knowing which features drive the predictions is as important as the accuracy of those predictions. The Lasso model's results are more straightforward to interpret, which is crucial for stakeholders at the real estate investment company.
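
The feature-selection behaviour described above can be seen by counting coefficients that are exactly zero. The sketch below uses synthetic data with only 5 informative features out of 20 and an arbitrary `alpha=5.0` for both models; these choices and the `_demo` names are assumptions for illustration and do not reflect the tuned models in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, of which only 5 actually drive the target.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=20, n_informative=5,
    noise=10.0, random_state=42,
)
Z_demo = StandardScaler().fit_transform(X_demo)

lasso_demo = Lasso(alpha=5.0).fit(Z_demo, y_demo)
ridge_demo = Ridge(alpha=5.0).fit(Z_demo, y_demo)

# The L1 penalty can push coefficients to exactly zero; the L2 penalty
# only shrinks them towards zero.
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))

print(f"Lasso coefficients set to zero: {n_zero_lasso} of {lasso_demo.coef_.size}")
print(f"Ridge coefficients set to zero: {n_zero_ridge} of {ridge_demo.coef_.size}")
```

Applied to the notebook's model, the same check on `best_lasso_model.coef_` paired with `X.columns` would list which features the tuned Lasso kept.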

**Further Information**

- Mean Absolute Error (MAE): On average, the Lasso model's predictions deviate from the actual sale prices by $19,902.55 on the test set.

- Root Mean Square Error (RMSE): The model's RMSE is 27,929.80 on the test set and 29,694.36 on the training set. The two values are relatively close, and the test error is not higher than the training error, which indicates that the model is not overfitting. The tuned model's RMSE is also lower than the baseline model's, a sign of better generalization performance.

- R-squared (Test): The R² value of 0.8687 indicates that the model explains 86.87% of the variance in sale price in the test set.
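
For readers who want to see where the "variance explained" reading of R² comes from, here is a small check that computes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares it with `r2_score`. The arrays are toy values, not predictions from this notebook.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values for illustration only.
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred_demo = np.array([1.1, 1.9, 3.2, 3.8, 5.0])

# Residual sum of squares: variation the predictions fail to capture.
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)

# Total sum of squares: variation of the target around its own mean.
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot

print(f"R² by hand:     {r2_manual:.4f}")
print(f"R² by r2_score: {r2_score(y_true_demo, y_pred_demo):.4f}")
```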

In [27]:
# # Save preprocessed data to a CSV file
# df_cleaned.to_csv('../data/test_clean4.csv', index=False) 