# **Modeling**

In [57]:
import pandas as pd

train = pd.read_csv('../data/train_processed.csv')
test = pd.read_csv('../data/test_processed.csv')

We encode the qualitative variables as integers, with one code per class.

In [58]:
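from sklearn.preprocessing import LabelEncoder

qualitative_columns = train.select_dtypes(include=['object']).columns
for feature in qualitative_columns:
    # Fit one encoder per column on the categories of both sets, so that a
    # given category receives the same integer code in train and in test
    le = LabelEncoder()
    le.fit(pd.concat([train[feature], test[feature]]).astype(str))
    train[feature] = le.transform(train[feature].astype(str))
    test[feature] = le.transform(test[feature].astype(str))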
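
A label encoder assigns codes from the sorted set of categories it sees when it is fitted, so two encoders fitted on two different tables can give the same category different codes. The sketch below illustrates this with a pure-Python stand-in for the encoder. The helper `fit_codes` and the category values are hypothetical and only serve the illustration.

```python
def fit_codes(values):
    """Map each distinct category to an integer, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}

# Hypothetical column values, not taken from the notebook's data
train_zone = ["RL", "RM", "FV"]
test_zone = ["RL", "RM"]

# Fitting separately: "RL" gets a different code in each table
separate_train = fit_codes(train_zone)
separate_test = fit_codes(test_zone)
print(separate_train["RL"], separate_test["RL"])

# Fitting once on the union of categories: one consistent mapping for both tables
shared = fit_codes(train_zone + test_zone)
print(shared)
```

A model trained on codes from one mapping and applied to codes from another would silently read the wrong categories, which is why a single mapping per column is the safer choice.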

We split the training set into a smaller training set (80%) and a validation set (20%).

In [59]:
from sklearn.model_selection import train_test_split

X = train.drop(columns=['SalePrice'])  
y = train['SalePrice']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

-----------

# Model optimization

You can find more detail about the models, why we use them, and how we optimize them in the file *model_explanation.pdf*.

In [60]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.metrics import root_mean_squared_error

## **Regression tree tuning**
Uncomment to run

In [61]:
"""
# We create a Decision Tree Regressor
reg_tree = DecisionTreeRegressor(random_state=42)

# Hyperparameters Grid
param_grid_tree = {
    'max_depth': [3, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [1, 100, 'sqrt', 'log2']
}

# Grid search (scored with negative mean squared error, because GridSearchCV maximizes its score)
grid_tree = GridSearchCV(estimator=reg_tree, param_grid=param_grid_tree, cv=5, scoring='neg_mean_squared_error', verbose=1)
grid_tree.fit(X_train, y_train)

# Best parameters and best model
print("Best parameters for Regression Tree:", grid_tree.best_params_)
best_tree = grid_tree.best_estimator_
y_pred_tree = best_tree.predict(X_val)

# RMSE on the validation set
print("Regression Tree RMSE:", root_mean_squared_error(y_val, y_pred_tree))"""



### **Regression Tree Results**
The Regression Tree was tuned with GridSearchCV over the 108 hyperparameter combinations of the grid above (3 × 3 × 3 × 4 values). Each combination was evaluated using 5-fold cross-validation.

### Best Parameters
The best hyperparameters for the Regression Tree model were:
- `max_depth`: 5
- `max_features`: 100
- `min_samples_leaf`: 4
- `min_samples_split`: 2

### Performance
- **RMSE** (Root Mean Squared Error) on the validation set: **37,687.26**
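
The grid search is scored with negative MSE, while the results are reported as RMSE. The two are linked by a sign flip and a square root. The sketch below spells this out in plain Python. The helper names and the numbers are illustrative assumptions, not values from the notebook.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def rmse_from_neg_mse(neg_mse):
    """Convert a 'neg_mean_squared_error' score back to an RMSE."""
    return math.sqrt(-neg_mse)

# Illustrative values only
y_true = [200_000, 150_000, 300_000]
y_pred = [210_000, 140_000, 330_000]
print(rmse(y_true, y_pred))

# A negative-MSE score of -4.0e8 corresponds to an RMSE of 20,000
print(rmse_from_neg_mse(-4.0e8))
```

Applied to the grid search, `rmse_from_neg_mse(grid_tree.best_score_)` would give a cross-validated RMSE that can be compared with the validation RMSE reported here.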


----------------------

## **XGBoost Regressor tuning**
Uncomment to run (**Warning**: this took us 10 hours!)

In [62]:
"""# We create a XGBoost Regressor
xgb_model = XGBRegressor(random_state=42)

# Hyperparameters Grid
param_grid_xgb = {
    'max_depth': [3, 5, 10],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200, 500],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'gamma': [0, 1, 5],
    'reg_alpha': [0, 0.1, 1],
    'reg_lambda': [1, 5, 10]
}

# Grid Search
grid_xgb = GridSearchCV(estimator=xgb_model, param_grid=param_grid_xgb, cv=5, scoring='neg_mean_squared_error', verbose=1)
grid_xgb.fit(X_train, y_train)

# Best parameters and best model
print("Best parameters for XGBoost:", grid_xgb.best_params_)
best_xgb = grid_xgb.best_estimator_
y_pred_xgb = best_xgb.predict(X_val)

# RMSE
print("XGBoost RMSE:", root_mean_squared_error(y_val, y_pred_xgb))"""

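
The long runtime follows from the size of the grid: an exhaustive search fits one model per hyperparameter combination and per fold. The sketch below counts the fits for the two grids used in this notebook. The helper `grid_size` is a hypothetical name introduced for the illustration.

```python
from math import prod

def grid_size(param_grid):
    """Number of hyperparameter combinations in an exhaustive grid."""
    return prod(len(values) for values in param_grid.values())

param_grid_tree = {
    'max_depth': [3, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [1, 100, 'sqrt', 'log2'],
}

param_grid_xgb = {
    'max_depth': [3, 5, 10],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200, 500],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'gamma': [0, 1, 5],
    'reg_alpha': [0, 0.1, 1],
    'reg_lambda': [1, 5, 10],
}

cv = 5  # number of cross-validation folds
for name, grid in [("tree", param_grid_tree), ("xgb", param_grid_xgb)]:
    n_candidates = grid_size(grid)
    print(f"{name}: {n_candidates} candidates, {n_candidates * cv} fits")
```

With eight hyperparameters of three values each, the XGBoost grid has 6,561 candidates and 32,805 cross-validation fits, against 108 candidates and 540 fits for the tree.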

### **XGBoost Results**

### Best Parameters
The best hyperparameters for the XGBoost model were:
- `colsample_bytree`: 0.6
- `gamma`: 0
- `learning_rate`: 0.1
- `max_depth`: 3
- `n_estimators`: 500
- `reg_alpha`: 0
- `reg_lambda`: 10
- `subsample`: 1.0

### Performance
- **RMSE** (Root Mean Squared Error) on the validation set: **27,832.37**

-----------------

# Model implementation
We fit the two optimized models.

In [63]:
# Optimized regression tree (best parameters from the grid search)
Tree = DecisionTreeRegressor(max_depth=5, max_features=100, min_samples_leaf=4,
                             min_samples_split=2, random_state=42)

# Optimized XGBoost (best parameters from the grid search)
XGBoost = XGBRegressor(colsample_bytree=0.6, gamma=0, learning_rate=0.1,
                       max_depth=3, n_estimators=500, reg_alpha=0, reg_lambda=10,
                       subsample=1.0, random_state=42)

# Fit the models
Tree.fit(X_train, y_train)
XGBoost.fit(X_train, y_train)

# Predictions
y_pred_tree = Tree.predict(X_val)
y_pred_xgb = XGBoost.predict(X_val)

# RMSE

print("Regression Tree RMSE:", round(root_mean_squared_error(y_val, y_pred_tree), 2))
print("XGBoost RMSE:", round(root_mean_squared_error(y_val, y_pred_xgb), 2))

Regression Tree RMSE: 37687.26
XGBoost RMSE: 27832.37


### Score

We tuned our models using the MSE as the scoring metric, but it is also interesting to look at their `score`.

In scikit-learn, the `score` method evaluates the performance of a regression model using the **coefficient of determination ($R^2$)** by default. This metric provides insight into how well the model explains the variance in the target variable.

The $R^2$ metric measures the proportion of the variance in the dependent variable that is predictable from the independent variables. It is computed as:


$$R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$

Where:
- $y_i$: The actual target value.
- $\hat{y}_i$: The predicted target value.
- $\bar{y}$: The mean of the actual target values.
- $\sum (y_i - \hat{y}_i)^2$: The residual sum of squares (unexplained variance).
- $\sum (y_i - \bar{y})^2$: The total sum of squares (total variance).
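
As a minimal sketch of the formula above, $R^2$ can be computed by hand in plain Python. The function name `r2_score_manual` and the sample values are illustrative assumptions and do not come from the notebook's data.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot

# Illustrative values only
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 310.0]
print(r2_score_manual(y_true, y_pred))

# Always predicting the mean of y_true gives R^2 = 0
print(r2_score_manual(y_true, [200.0, 200.0, 200.0]))
```

A model that always predicts the mean therefore scores 0, a perfect model scores 1, and a model worse than the mean predictor gets a negative score.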

In [64]:
print("Regression Tree score:", round(Tree.score(X_val, y_val), 2))
print("XGBoost score:", round(XGBoost.score(X_val, y_val), 2))

Regression Tree score: 0.81
XGBoost score: 0.9


As expected, the XGBoost regressor has a better score than the regression tree (0.90 vs. 0.81).

---

### DataFrame comparing the real sale price and the predicted price of the two models on the validation set

In [65]:
import numpy as np

# Create a DataFrame with actual and predicted values
results_df = pd.DataFrame({
    'SalePrice': y_val,
    'Pred_Tree': np.round(y_pred_tree).astype(int),
    'Pred_XGB': np.round(y_pred_xgb).astype(int),
    # Absolute errors, truncated to whole numbers by astype(int)
    'Error_Tree': np.abs(y_val - y_pred_tree).astype(int),
    'Error_XGB': np.abs(y_val - y_pred_xgb).astype(int)
})

# Display the first few rows of the DataFrame
results_df.head(20)

| | SalePrice | Pred_Tree | Pred_XGB | Error_Tree | Error_XGB |
|---|---|---|---|---|---|
| 892 | 154500 | 122421 | 140024 | 32078 | 14476 |
| 1105 | 325000 | 352098 | 332971 | 27097 | 7970 |
| 413 | 115000 | 135069 | 108053 | 20068 | 6946 |
| 522 | 159000 | 210657 | 164650 | 51656 | 5650 |
| 1036 | 315500 | 278374 | 329861 | 37125 | 14360 |
| 614 | 75500 | 86411 | 74212 | 10910 | 1288 |
| 218 | 311500 | 191189 | 220641 | 120311 | 90859 |
| 1160 | 146000 | 174099 | 137967 | 28099 | 8033 |
| 649 | 84500 | 86411 | 76908 | 1910 | 7592 |
| 887 | 135500 | 122421 | 136148 | 13078 | 648 |


### Boxplot of the error of predictions of the two models

In [66]:
import pandas as pd
import plotly.graph_objects as go

# Create the box plots
fig = go.Figure()

# Add box for Error_Tree
fig.add_trace(go.Box(
    y=results_df['Error_Tree'],
    name='Error_Tree',
    boxmean=True,  # Show mean
    marker=dict(size=8)  # Marker size for points
))

# Add box for Error_XGB
fig.add_trace(go.Box(
    y=results_df['Error_XGB'],
    name='Error_XGB',
    boxmean=True,  # Show mean
    marker=dict(size=8)  # Marker size for points
))

# Update layout for larger boxes and better visualization
fig.update_layout(
    title='Error Distribution for Regression Tree and XGBoost',
    yaxis_title='Absolute Error',
    boxmode='group',  # Grouped box plots
    width=800,  # Width of the figure
    height=600  # Height of the figure
)

# Show the plot
fig.show()

These boxplots confirm that XGBoost gives better predictions.

Some statistics:

### Tree regressor errors:
- **1st quartile**: 7,455
- **Median**: 16,306
- **Mean**: 24,964
- **3rd quartile**: 33,054


### XGBoost regressor errors:
- **1st quartile**: 5,118
- **Median**: 10,817
- **Mean**: 16,6678
- **3rd quartile**: 18,645
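
Summary statistics of this kind can be computed directly from the error columns. The sketch below uses only the standard library and, as sample input, the ten `Error_XGB` values displayed in the comparison table above. The helper `summarize_errors` is a hypothetical name, and the quartile convention used here (`method="inclusive"`) is an assumption that may differ slightly from the one used by the plotting library.

```python
import statistics

def summarize_errors(errors):
    """Quartiles, median and mean of a sequence of absolute errors."""
    q1, median, q3 = statistics.quantiles(errors, n=4, method="inclusive")
    return {"q1": q1, "median": median, "mean": statistics.fmean(errors), "q3": q3}

# The ten Error_XGB values shown in the comparison table (a sample, not the full validation set)
error_xgb_sample = [14476, 7970, 6946, 5650, 14360, 1288, 90859, 8033, 7592, 648]
print(summarize_errors(error_xgb_sample))
```

Because the sample contains only ten rows, its statistics will not match the figures listed above, which describe the whole validation set.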


------------

# Prediction on the test set

In [67]:
y_test_tree = Tree.predict(test)
y_test_xgb = XGBoost.predict(test)

We export a CSV file containing the test set and the predictions.

In [68]:
test_with_pred = test.copy()
test_with_pred['SalePrice_Tree'] = y_test_tree
test_with_pred['SalePrice_XGB'] = y_test_xgb

test_with_pred.to_csv('../data/test_with_pred.csv', index=False)