| Model                  | Best for                     | Pros                              | Cons                         |
|------------------------|------------------------------|-----------------------------------|------------------------------|
| **Linear Regression**  | Simple, linear relationships | Fast, interpretable               | Limited to linear data, sensitive to outliers |
| **Random Forest**      | Complex, tabular data        | Handles outliers, captures non-linear patterns | Slower, less interpretable |
| **XGBoost**            | High-stakes, complex data    | High accuracy, handles non-linear data, regularization | Complex tuning, resource-intensive |


In practice, start with Linear Regression if you expect a roughly linear relationship. If its fit is not good enough, try Random Forest. If you need the best possible performance and can afford more tuning and compute, use XGBoost.
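
The baseline-first workflow described above can be sketched on a small example. This is a minimal illustration, not the notebook's actual pipeline: the data is synthetic, and the names `X_demo`, `y_demo`, `candidates`, and `demo_scores` are hypothetical. The target is deliberately non-linear so that the two models behave differently.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic, hypothetical data: a non-linear target so the two models differ.
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(500, 2))
y_demo = np.sin(X_demo[:, 0]) + X_demo[:, 1] ** 2 + rng.normal(0, 0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the cheap, interpretable baseline first, then the more flexible model.
candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}
demo_scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    demo_scores[name] = r2_score(y_te, model.predict(X_te))
    print(f"{name}: R2 = {demo_scores[name]:.3f}")
```

If the baseline already scores close to the more complex model on your data, the simpler and more interpretable model is usually the better choice.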

In [103]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib

In [104]:
# Step 1: Load the Enhanced Feature Engineered Dataset
data = pd.read_csv('../../data/feature_engineered_immo_data.csv')

In [105]:
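# Define features (X) and target (y)
# Note: if the CSV also contains 'totalRent_log' (or any other column derived from
# the target), drop it from X as well; otherwise the target leaks into the features.
X = data.drop(columns=['totalRent'])  # Use 'totalRent_log' if using the log-transformed target
y = data['totalRent']  # or 'totalRent_log' for the log-transformed target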

# Step 2: Split the Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Hyperparameter Tuning for XGBoost
xgb_param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.7, 1.0],
    'colsample_bytree': [0.7, 1.0]
}

xgb_grid_search = GridSearchCV(
    estimator=XGBRegressor(random_state=42),
    param_grid=xgb_param_grid,
    scoring='r2',
    cv=3,
    verbose=1,
    n_jobs=-1
)

xgb_grid_search.fit(X_train, y_train)

# Evaluate the best XGBoost model on the held-out test set (the model is saved in Step 5)
best_xgb_model = xgb_grid_search.best_estimator_
y_pred = best_xgb_model.predict(X_test)
print("Best XGBoost R2 Score:", r2_score(y_test, y_pred))
print("Best XGBoost parameters:", xgb_grid_search.best_params_)

Fitting 3 folds for each of 108 candidates, totalling 324 fits
Best XGBoost R2 Score: 0.9999328415137748
Best XGBoost parameters: {'colsample_bytree': 1.0, 'learning_rate': 0.2, 'max_depth': 7, 'n_estimators': 300, 'subsample': 1.0}
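
The import cell brings in `mean_absolute_error` and `mean_squared_error`, but only R2 is reported. A minimal sketch of reporting all three metrics together follows. The helper name `report_metrics` and the example rent values are hypothetical; in the notebook the call would presumably be `report_metrics(y_test, y_pred)`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def report_metrics(y_true, y_pred):
    """Return MAE, RMSE and R2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": float(np.sqrt(mse)),  # same units as the target
        "R2": r2_score(y_true, y_pred),
    }


# Hypothetical values for illustration only, not taken from the dataset.
example = report_metrics([500.0, 800.0, 1200.0], [520.0, 790.0, 1150.0])
print(example)
```

MAE and RMSE are expressed in the units of the target, which makes them easier to interpret than R2 alone when judging whether an error is acceptable.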


In [107]:
# Step 4: Hyperparameter Tuning for Random Forest (Optional)
rf_param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

rf_grid_search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=rf_param_grid,
    scoring='r2',
    cv=3,
    verbose=1,
    n_jobs=-1
)

rf_grid_search.fit(X_train, y_train)

# Evaluate the best Random Forest model on the held-out test set
best_rf_model = rf_grid_search.best_estimator_
y_pred_rf = best_rf_model.predict(X_test)
print("Best Random Forest R2 Score:", r2_score(y_test, y_pred_rf))
print("Best Random Forest parameters:", rf_grid_search.best_params_)

Fitting 3 folds for each of 81 candidates, totalling 243 fits


KeyboardInterrupt: 
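
The exhaustive search above (81 candidates, 243 fits) was interrupted before it finished. One cheaper alternative, swapped in here purely as an illustration, is scikit-learn's `RandomizedSearchCV`, which samples a fixed number of candidates from the same grid. This sketch runs on small synthetic data; `X_small`, `y_small`, `rf_param_distributions`, and `rf_random_search` are hypothetical names, and `n_iter=5` is an arbitrary choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic stand-in for X_train / y_train (hypothetical data).
rng = np.random.default_rng(42)
X_small = rng.normal(size=(200, 4))
y_small = X_small[:, 0] * 2.0 + X_small[:, 1] ** 2 + rng.normal(0, 0.1, size=200)

# Same value lists as rf_param_grid in the cell above.
rf_param_distributions = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

rf_random_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=rf_param_distributions,
    n_iter=5,         # evaluate 5 sampled candidates instead of all 81
    scoring='r2',
    cv=3,
    random_state=42,  # makes the sampled candidates reproducible
)
rf_random_search.fit(X_small, y_small)
print("Sampled-search best parameters:", rf_random_search.best_params_)
```

A sampled search trades completeness for run time: it may miss the best combination in the grid, so `n_iter` is a budget to adjust to the time available.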

In [109]:
# Step 5: Save the Best Model
# The Random Forest search above was interrupted, so the R2 comparison below is
# commented out and the XGBoost model is saved directly.
# if r2_score(y_test, y_pred) > r2_score(y_test, y_pred_rf):
#     best_model = best_xgb_model
#     model_name = "XGBoost"
# else:
#     best_model = best_rf_model
#     model_name = "Random Forest"
best_model = best_xgb_model
model_name = "XGBoost"
joblib.dump(best_model, f'../../data/best_{model_name}_model.pkl')
print(f"Best model saved as 'best_{model_name}_model.pkl'")

Best model saved as 'best_XGBoost_model.pkl'
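
A saved model is only useful if it can be loaded again and still produces the same predictions. The round trip can be sketched as below. To keep the example self-contained, it uses a small stand-in model, synthetic data, and a temporary directory; `demo_model`, `X_saved`, `y_saved`, and the file name are hypothetical. In the notebook, the path would be the one passed to `joblib.dump` above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in model and data (hypothetical); the notebook saves best_model instead.
rng = np.random.default_rng(0)
X_saved = rng.normal(size=(50, 3))
y_saved = X_saved.sum(axis=1)
demo_model = RandomForestRegressor(n_estimators=10, random_state=42)
demo_model.fit(X_saved, y_saved)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'best_demo_model.pkl')
    joblib.dump(demo_model, model_path)      # write the fitted model to disk
    reloaded_model = joblib.load(model_path)  # read it back into memory

# The reloaded model should reproduce the original predictions.
same_predictions = bool(
    np.allclose(demo_model.predict(X_saved), reloaded_model.predict(X_saved))
)
print("Predictions match after reload:", same_predictions)
```

Pickle-based files are generally tied to the library versions that wrote them, so it is worth recording the scikit-learn and XGBoost versions alongside the saved model.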


In [110]:
from sklearn.model_selection import cross_val_score
import xgboost as xgb

# Re-create XGBoost with the best parameters found by the grid search (copied from its output above)
best_xgb_params = {'colsample_bytree': 1.0, 'learning_rate': 0.2, 'max_depth': 7, 'n_estimators': 300, 'subsample': 1.0}
xgb_model = xgb.XGBRegressor(**best_xgb_params, random_state=42)

# Perform 5-fold cross-validation on the full dataset (X, y)
cv_scores = cross_val_score(xgb_model, X, y, cv=5, scoring='r2')
print("Cross-validated R2 Scores:", cv_scores)
print("Average Cross-validated R2 Score:", cv_scores.mean())


Cross-validated R2 Scores: [0.99993032 0.9999244  0.99992974 0.99991676 0.99992725]
Average Cross-validated R2 Score: 0.9999256958540872
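
R2 scores this close to 1 on every fold can be genuine, but they are also a common symptom of a feature that is derived from the target. A quick, rough check is to look for numeric features that correlate almost perfectly with the target. The sketch below runs on synthetic data; `find_suspect_features`, the `0.99` threshold, and the column names are hypothetical. In the notebook the call would presumably be `find_suspect_features(X, y)`.

```python
import numpy as np
import pandas as pd


def find_suspect_features(features, target, threshold=0.99):
    """Return numeric columns whose absolute Pearson correlation
    with the target exceeds the threshold, strongest first."""
    numeric = features.select_dtypes(include='number')
    correlations = numeric.corrwith(target).abs()
    return correlations[correlations > threshold].sort_values(ascending=False)


# Hypothetical frame: 'rent_plus_fee' is derived from the target, 'area' is not.
rng = np.random.default_rng(42)
rent = pd.Series(rng.uniform(300, 2000, size=200), name='rent')
demo_features = pd.DataFrame({
    'area': rng.uniform(20, 150, size=200),
    'rent_plus_fee': rent + 50.0,
})

suspects = find_suspect_features(demo_features, rent)
print(suspects)
```

A high correlation is a prompt to inspect how the column was built, not proof of leakage, and a non-linear transform of the target can fall below the threshold and go unnoticed by this check.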
