Train the winning model.

Use XGBRegressor, XGBoost's scikit-learn-compatible regressor, to train a regression model on the training data. XGBoost is a popular gradient boosting library that handles regression tasks effectively.

Tune hyperparameters such as the learning rate, number of estimators, and maximum depth with a cross-validated grid search to optimize the model's performance.
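
The notebook's grid search runs on the full data set before the train/test split. As a hedged alternative, the sketch below holds out the test set first and tunes only on the training portion. It uses synthetic data from `make_regression`, a smaller grid, fixed seeds, and `_demo`-style names, all of which are assumptions for illustration and not part of the original analysis. If `xgboost` is not installed it falls back to scikit-learn's `GradientBoostingRegressor`, which is a different implementation that happens to accept the same three hyperparameters.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

try:
    from xgboost import XGBRegressor as Regressor
except ImportError:
    # Stand-in only: a different gradient boosting implementation
    from sklearn.ensemble import GradientBoostingRegressor as Regressor

# Synthetic stand-in for the housing features (assumption: not the Ames data)
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# Hold out the test set before tuning so it plays no part in model selection
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Seeded folds make the search repeatable
cv_demo = KFold(n_splits=5, shuffle=True, random_state=0)
small_grid = {
    'n_estimators': [50, 100],
    'learning_rate': [0.1, 0.01],
    'max_depth': [3, 4],
}

search = GridSearchCV(Regressor(random_state=0), small_grid, cv=cv_demo,
                      scoring='neg_mean_absolute_error')
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already refitted on all of X_tr
test_pred = search.best_estimator_.predict(X_te)
test_mae = mean_absolute_error(y_te, test_pred)
test_r2 = r2_score(y_te, test_pred)

print("Best hyperparameters:", search.best_params_)
print("Cross-validated MAE:", -search.best_score_)
print("Held-out MAE:", test_mae)
print("Held-out R^2:", test_r2)
```

Reporting the held-out MAE next to R^2 keeps the test metric in the same units as the cross-validated score.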

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, r2_score

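# Imputed data set (features and SalePrice)
imputed_data = pd.read_csv("imputed_data_handle_multicollinearity.csv")

# Original Ames data, used here only for the Neighborhood labels
ames = pd.read_csv("Ames_HousePrice.csv")

# Prepare the data
X = imputed_data[['OverallQual', 'GrLivArea', 'TotalBsmtSF', '1stFlrSF', 'GarageArea',
                  'YearBuilt', 'YearRemodAdd', 'BsmtFinSF1', 'BsmtUnfSF', 'Fireplaces']]

# Neighborhood is kept aside and joined back to the test rows by index later
X_neighborhood = ames[['Neighborhood']]

y = imputed_data['SalePrice']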

In [2]:
# Create an instance of XGBRegressor
xgb = XGBRegressor()

# Define the number of folds for cross-validation
num_folds = 5

# Define the cross-validation strategy
cv = KFold(n_splits=num_folds, shuffle=True)

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.1, 0.01, 0.001],
    'max_depth': [3, 4, 5]
}

# Perform grid search
grid_search = GridSearchCV(estimator=xgb, param_grid=param_grid, cv=cv, scoring='neg_mean_absolute_error')
grid_search.fit(X, y)

# Print the best hyperparameters and the corresponding mean score
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Mean MAE:", -grid_search.best_score_)

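# Split the data into train and test sets.
# Note: the grid search above was fitted on all rows, including these test rows,
# so the test-set score below may be slightly optimistic.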
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Best Hyperparameters: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 300}
Best Mean MAE: 4861.77269410126


In [3]:
X_neighborhood

Unnamed: 0,Neighborhood
0,SWISU
1,Edwards
2,IDOTRR
3,OldTown
4,NWAmes
...,...
2575,BrkSide
2576,Edwards
2577,Crawfor
2578,CollgCr


In [4]:
# Fit the model on the training data using the best hyperparameters
xgb_best = XGBRegressor(**grid_search.best_params_)
xgb_best.fit(X_train, y_train)

# Make predictions on the test set
y_pred_test = xgb_best.predict(X_test)

# Calculate R^2 score
r2 = r2_score(y_test, y_pred_test)

print("R^2 score:", r2)
print("")

R^2 score: 0.9705697060871487



In [5]:
# Calculate residuals as predicted minus actual
# (positive = the property sold for less than the model's estimate)
residuals = y_pred_test - y_test

# Create a DataFrame with residuals, actual sale prices, predicted sale prices, and neighborhoods
residuals_df = pd.DataFrame({
    'Actual Sale Price': y_test,
    'Predicted Sale Price': y_pred_test,
    'Residuals': residuals,
    'OverallQual': X_test['OverallQual'],
    'GrLivArea': X_test['GrLivArea'],
    'TotalBsmtSF': X_test['TotalBsmtSF'],
    '1stFlrSF': X_test['1stFlrSF'],
    'GarageArea': X_test['GarageArea'],
    'YearBuilt': X_test['YearBuilt'],
    'YearRemodAdd': X_test['YearRemodAdd'],
    'BsmtFinSF1': X_test['BsmtFinSF1'],
    'BsmtUnfSF': X_test['BsmtUnfSF'],
    'Fireplaces': X_test['Fireplaces']
})
# Join residuals_df with X_neighborhood based on index
residuals_merged = residuals_df.join(X_neighborhood, how='left')
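
The join above matches rows by index value, so it relies on `imputed_data` and `ames` sharing the same row index. A minimal sketch of that behavior follows, using two small hypothetical frames (`test_rows` and `neighborhoods` are illustrative names, not objects from the notebook):

```python
import pandas as pd

# Hypothetical frames standing in for residuals_df and X_neighborhood
test_rows = pd.DataFrame({'Residuals': [1200.0, -800.0]}, index=[3, 0])
neighborhoods = pd.DataFrame({'Neighborhood': ['SWISU', 'Edwards', 'IDOTRR', 'OldTown']})

# A left join keeps only the test rows and looks up each label by index value
joined = test_rows.join(neighborhoods, how='left')
print(joined)

# Any test row without a matching index would get NaN as its label
unmatched = int(joined['Neighborhood'].isna().sum())
print("Rows without a neighborhood:", unmatched)
```

Counting the missing labels after the join is a quick way to confirm that the two indexes really line up.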

In [6]:
# Filter undervalued properties (sold below the predicted price), largest gap first
undervalued_properties = residuals_merged[
    residuals_merged['Actual Sale Price'] < residuals_merged['Predicted Sale Price']
].sort_values(by='Residuals', ascending=False)

Residuals represent the differences between the predicted sale prices and the actual sale prices. To calculate residuals, subtract the actual sale prices (y_test) from the predicted sale prices (y_pred_test), so a positive residual means the property sold for less than the model predicted.

To identify undervalued properties, keep the test-set properties whose actual sale price is below the predicted price and sort them by residual in descending order, so the largest gaps come first.
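
As a small worked illustration of this sign convention, the sketch below applies the same filter and sort to four made-up properties. The prices and the `toy` names are hypothetical and are not taken from the Ames data.

```python
import pandas as pd

# Hypothetical toy prices, for illustration only
toy = pd.DataFrame({
    'Actual Sale Price':    [150000, 200000, 120000, 310000],
    'Predicted Sale Price': [180000, 195000, 160000, 310000],
})

# Same convention as the notebook: residual = predicted - actual
toy['Residuals'] = toy['Predicted Sale Price'] - toy['Actual Sale Price']

# Positive residual: the model's estimate exceeds the sale price (undervalued)
toy_undervalued = toy[toy['Residuals'] > 0].sort_values(by='Residuals', ascending=False)
print(toy_undervalued)
```

Rows 2 and 0 remain, in that order, with residuals of 40000 and 30000. The property that sold above its prediction and the one that matched it exactly are both dropped.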

In [7]:
# Print the top 10 undervalued properties with neighborhoods
print()
print("Top 10 Undervalued Properties:")
print()
top_undervalued_properties = undervalued_properties.head(10)
for index, row in top_undervalued_properties.iterrows():
    print("Neighborhood:", row['Neighborhood'])
    print("Actual Sale Price:", row['Actual Sale Price'])
    print("Predicted Sale Price:", row['Predicted Sale Price'])
    print("Residual:", row['Residuals'])
    print()


Top 10 Undervalued Properties:

Neighborhood: Veenker
Actual Sale Price: 150000
Predicted Sale Price: 380131.75
Residual: 230131.75

Neighborhood: NAmes
Actual Sale Price: 84900
Predicted Sale Price: 222556.9375
Residual: 137656.9375

Neighborhood: NAmes
Actual Sale Price: 167000
Predicted Sale Price: 271021.34375
Residual: 104021.34375

Neighborhood: MeadowV
Actual Sale Price: 151400
Predicted Sale Price: 247910.0625
Residual: 96510.0625

Neighborhood: OldTown
Actual Sale Price: 122000
Predicted Sale Price: 216369.015625
Residual: 94369.015625

Neighborhood: NWAmes
Actual Sale Price: 278000
Predicted Sale Price: 362180.96875
Residual: 84180.96875

Neighborhood: NridgHt
Actual Sale Price: 386250
Predicted Sale Price: 470245.03125
Residual: 83995.03125

Neighborhood: CollgCr
Actual Sale Price: 239000
Predicted Sale Price: 312992.75
Residual: 73992.75

Neighborhood: CollgCr
Actual Sale Price: 185000
Predicted Sale Price: 255453.078125
Residual: 70453.078125

Neighborhood: SWISU
Actual S