#  Model_Benchmarks

Anthony Amadasun

December 15th, 2023

---

### Introduction

In this notebook, I establish baseline models and benchmarks for predicting house prices using the Ames Housing dataset. I start by loading the preprocessed data, then build and evaluate several basic linear models.


---

### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

---

### Load in Data

In [2]:
# load preprocessed data
df_cleaned = pd.read_csv('../data/test_clean2.csv')
file_path_train = '../data/train.csv'
df_train = pd.read_csv(file_path_train)

---

### Baseline Model

In [3]:
X = df_cleaned.select_dtypes(include=['float64', 'int64']).drop('SalePrice', axis=1)
y = df_cleaned['SalePrice']

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=42)

**Linear Regression**

In [5]:
lr = LinearRegression()
lr.fit(X_train, y_train)

In [6]:
# Evaluate Linear Regression
lr_train_predictions = lr.predict(X_train)
lr_test_predictions = lr.predict(X_test)

# RMSE computed as sqrt(MSE), which does not depend on the squared= argument
lr_train_rmse = np.sqrt(mean_squared_error(y_train, lr_train_predictions))
lr_test_rmse = np.sqrt(mean_squared_error(y_test, lr_test_predictions))

In [7]:
print("Linear Regression:")
print(f"Train RMSE: {lr_train_rmse:.4f}")
print(f"Test RMSE: {lr_test_rmse:.4f}")


Linear Regression:
Train RMSE: 29563.6447
Test RMSE: 28339.5874


- The test RMSE (28,339.59) is slightly lower than the training RMSE (29,563.64), so there is no sign of overfitting on this split and the model appears to generalize well to new data.
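
To put the size of that gap in numbers, it can be expressed relative to the training RMSE. The snippet below is a minimal sketch that plugs in the values printed above; `relative_gap` is a hypothetical helper, not part of the notebook.

```python
def relative_gap(train_rmse: float, test_rmse: float) -> float:
    """Return (test - train) / train; negative means the test error is lower."""
    return (test_rmse - train_rmse) / train_rmse


# RMSE values copied from the Linear Regression output above.
lr_gap = relative_gap(29563.6447, 28339.5874)
print(f"Relative gap: {lr_gap:.2%}")  # prints: Relative gap: -4.14%
```

A gap of a few percent on a single 80/20 split could plausibly come from the particular split, so it is probably best read as "no evidence of overfitting" rather than as proof of good generalization.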

**Lasso Regression**

In [8]:
lasso = Lasso()
lasso.fit(X_train, y_train)



In [9]:
# Predict/Evaluate Lasso Regression
lasso_train_predictions = lasso.predict(X_train)
lasso_test_predictions = lasso.predict(X_test)

# RMSE computed as sqrt(MSE), which does not depend on the squared= argument
lasso_train_rmse = np.sqrt(mean_squared_error(y_train, lasso_train_predictions))
lasso_test_rmse = np.sqrt(mean_squared_error(y_test, lasso_test_predictions))

In [10]:
print("Lasso Regression:")
print(f"Train RMSE: {lasso_train_rmse:.4f}")
print(f"Test RMSE: {lasso_test_rmse:.4f}")


Lasso Regression:
Train RMSE: 29563.7079
Test RMSE: 28336.2118


- The RMSE values are nearly identical to those of Linear Regression, indicating comparable performance on both the training data and unseen data.

**Ridge Regression**

In [11]:
ridge = Ridge()
ridge.fit(X_train, y_train)



In [12]:
ridge_train_predictions = ridge.predict(X_train)
ridge_test_predictions = ridge.predict(X_test)

# RMSE computed as sqrt(MSE), which does not depend on the squared= argument
ridge_train_rmse = np.sqrt(mean_squared_error(y_train, ridge_train_predictions))
ridge_test_rmse = np.sqrt(mean_squared_error(y_test, ridge_test_predictions))

In [13]:
print("Ridge Regression:")
print(f"Train RMSE: {ridge_train_rmse:.4f}")
print(f"Test RMSE: {ridge_test_rmse:.4f}")


Ridge Regression:
Train RMSE: 29593.6042
Test RMSE: 28294.3983


- Compared to Linear and Lasso, Ridge has a slightly lower test RMSE (28,294.40 versus 28,339.59 and 28,336.21), so it generalizes marginally better on this split. Its training RMSE is slightly higher (29,593.60 versus roughly 29,563.6), which means the model fits the training data a little less closely, as expected from the added regularization. All of these differences are well under 1%.
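
The effect of regularization on training fit mentioned above can be seen on a toy problem. The sketch below is an illustration on made-up data, not the notebook's model: it fits a single-feature ridge regression without an intercept, for which the coefficient has the closed form `w = sum(x*y) / (sum(x*x) + alpha)`, and shows the training MSE growing as the penalty `alpha` grows. The helper names and the data are assumptions.

```python
def ridge_coef(xs, ys, alpha):
    """Closed-form ridge coefficient for one feature and no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


def train_mse(xs, ys, w):
    """Mean squared error of the predictions w * x on the training data."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data (hypothetical): y is roughly 2 * x plus small noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

results = []  # (alpha, coefficient, training MSE)
for alpha in (0.0, 1.0, 10.0):
    w = ridge_coef(xs, ys, alpha)
    results.append((alpha, w, train_mse(xs, ys, w)))
    print(f"alpha={alpha:>4}: w={w:.4f}, train MSE={results[-1][2]:.4f}")
```

With `alpha=0` the fit is ordinary least squares, which by construction has the lowest training error. Any positive penalty shrinks the coefficient and raises the training MSE, which is the same direction as the Ridge training RMSE reported above.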

---

### Model Evaluation Metric

**Linear Regression Metric**

In [14]:
# Calculate and print evaluation metrics for lr
mae_lr = mean_absolute_error(y_test, lr.predict(X_test))
mse_lr = mean_squared_error(y_test, lr.predict(X_test))
r2_lr = r2_score(y_test, lr.predict(X_test))

In [15]:
print(f'MAE: {mae_lr:.4f}, MSE: {mse_lr:.4f}, R²: {r2_lr:.4f}')

MAE: 20278.8879, MSE: 803132213.9853, R²: 0.8648


**Lasso Regression Metric**

In [16]:
# Calculate and print evaluation metrics for lasso
mae_lasso = mean_absolute_error(y_test, lasso.predict(X_test))
mse_lasso = mean_squared_error(y_test, lasso.predict(X_test))
r2_lasso = r2_score(y_test, lasso.predict(X_test))

In [17]:
print(f'MAE: {mae_lasso:.4f}, MSE: {mse_lasso:.4f}, R²: {r2_lasso:.4f}')

MAE: 20280.1919, MSE: 802940901.0642, R²: 0.8649


**Ridge Regression Metric**

In [18]:
# Calculate and print evaluation metrics for Ridge
mse_train_ridge = mean_squared_error(y_train, ridge.predict(X_train))
mae_ridge = mean_absolute_error(y_test, ridge.predict(X_test))
mse_ridge = mean_squared_error(y_test, ridge.predict(X_test))
r2_ridge = r2_score(y_test, ridge.predict(X_test))

In [19]:
print(f'MAE: {mae_ridge:.4f}, MSE: {mse_ridge:.4f}, MSETrain: {mse_train_ridge}, R²: {r2_ridge:.4f}')

MAE: 20277.9923, MSE: 800572975.7018, MSETrain: 875781409.2956297, R²: 0.8653
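
Since RMSE is the square root of MSE, the test MSE values in this section should agree with the test RMSE values reported under Baseline Model. The check below is a small sketch that uses only numbers printed in this notebook; the dictionary layout is an assumption made for illustration.

```python
import math

# (test MSE from this section, test RMSE from the Baseline Model section)
reported = {
    "Linear": (803132213.9853, 28339.5874),
    "Lasso": (802940901.0642, 28336.2118),
    "Ridge": (800572975.7018, 28294.3983),
}

max_diff = 0.0
for name, (mse, rmse) in reported.items():
    root = math.sqrt(mse)
    max_diff = max(max_diff, abs(root - rmse))
    print(f"{name}: sqrt(MSE)={root:.4f}, reported RMSE={rmse:.4f}")

# The same relationship holds for the Ridge training score.
ridge_train_diff = abs(math.sqrt(875781409.2956297) - 29593.6042)
```

All three pairs match to the printed precision, so the two sections describe the same predictions.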


**Summary:**

The metrics above give an overview of each baseline model's performance: lower MAE and MSE and higher R² indicate a better model. All three models perform relatively well. The R² values of 0.8648 (Linear), 0.8649 (Lasso), and 0.8653 (Ridge) show that each model explains roughly 86% of the variance in sale price on the test set. Ridge Regression has the marginally best R², along with the lowest test MSE and MAE, which suggests it may be slightly better suited to predicting house prices. Linear and Lasso fit the training data marginally more closely, but the differences between the three models are very small, so none is clearly superior at this stage.
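
For reference, the three metrics compared above can be written out from their definitions. The functions below are a plain-Python sketch with hypothetical names and toy numbers; the notebook itself uses scikit-learn's `mean_absolute_error`, `mean_squared_error`, and `r2_score`.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy sale prices (hypothetical values, not from the Ames data).
y_true = [200000.0, 150000.0, 300000.0, 250000.0]
y_pred = [210000.0, 140000.0, 290000.0, 260000.0]

print(f"MAE: {mae(y_true, y_pred):.1f}")
print(f"MSE: {mse(y_true, y_pred):.1f}")
print(f"R²:  {r2(y_true, y_pred):.3f}")
```

MAE is in the same units as the target (dollars) and MSE is in squared dollars, which is why the MSE values above are so large. R² is unitless, which makes it the easiest of the three to compare across models.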

The next notebook will fine-tune the models and explore the impact of hyperparameter tuning. In particular, it will optimize the regularization strength for the Lasso and Ridge regression models and investigate creating new features based on domain knowledge.


In [20]:
# # Save preprocessed data to a CSV file
# df_cleaned.to_csv('../data/test_clean3.csv', index=False) 