Train the winning model:

Use XGBRegressor, the scikit-learn-compatible regressor from the XGBoost library, to train a regression model on the training data. XGBoost is a popular gradient boosting implementation that handles regression tasks effectively.

Tune the main hyperparameters (learning rate, number of estimators, and maximum tree depth) with a cross-validated grid search instead of relying on the defaults.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, r2_score

imputed_data = pd.read_csv("imputed_data_handle_multicollinearity.csv")

# Prepare the data
X = imputed_data[['OverallQual', 'GrLivArea', 'TotalBsmtSF', '1stFlrSF', 'GarageArea',    
                   'YearBuilt', 'YearRemodAdd', 'BsmtFinSF1', 'BsmtUnfSF', 'Fireplaces',
                  'Neighborhood_Blueste', 'Neighborhood_BrDale', 'Neighborhood_BrkSide', 
                  'Neighborhood_ClearCr', 'Neighborhood_CollgCr', 'Neighborhood_Crawfor', 
                  'Neighborhood_Edwards', 'Neighborhood_Gilbert', 'Neighborhood_Greens', 
                  'Neighborhood_GrnHill', 'Neighborhood_IDOTRR', 'Neighborhood_Landmrk', 
                  'Neighborhood_MeadowV', 'Neighborhood_Mitchel', 'Neighborhood_NAmes', 
                  'Neighborhood_NPkVill', 'Neighborhood_NWAmes', 'Neighborhood_NoRidge', 
                  'Neighborhood_NridgHt', 'Neighborhood_OldTown', 'Neighborhood_SWISU', 
                  'Neighborhood_Sawyer', 'Neighborhood_SawyerW', 'Neighborhood_Somerst', 
                  'Neighborhood_StoneBr', 'Neighborhood_Timber', 'Neighborhood_Veenker']]

y = imputed_data['SalePrice']

# Create an instance of XGBRegressor
xgb = XGBRegressor()

# Define the number of folds for cross-validation
num_folds = 5

# Define the cross-validation strategy
cv = KFold(n_splits=num_folds, shuffle=True)

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.1, 0.01, 0.001],
    'max_depth': [3, 4, 5]
}

# Perform grid search
grid_search = GridSearchCV(estimator=xgb, param_grid=param_grid, cv=cv, scoring='neg_mean_absolute_error')
grid_search.fit(X, y)

# Print the best hyperparameters and the corresponding mean score
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Mean MAE:", -grid_search.best_score_)

# List of all neighborhood dummy columns (the same 27 columns already selected in X)
neighborhood_cols = [col for col in X.columns if col.startswith('Neighborhood_')]

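# Recover a single 'Neighborhood' label (a Series) from the one-hot columns
neighborhood = X[neighborhood_cols].idxmax(axis=1).str.replace('Neighborhood_', '', regex=False)
# Rows with no dummy set belong to the category dropped during encoding;
# idxmax would otherwise mislabel them as the first column ('Blueste').
# 'Baseline' is a placeholder label (assumption): the dropped level is not named in this file.
neighborhood = neighborhood.where(X[neighborhood_cols].sum(axis=1) > 0, 'Baseline')

# Split the data into train and test sets along with the 'Neighborhood' Series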
X_train, X_test, y_train, y_test, neighborhood_train, neighborhood_test = train_test_split(
    X, y, neighborhood, test_size=0.3)

# Fit the model on the training data using the best hyperparameters
xgb_best = XGBRegressor(**grid_search.best_params_)
xgb_best.fit(X_train, y_train)

# Make predictions on the test set
y_pred_test = xgb_best.predict(X_test)

# Evaluate the model
mae_test = mean_absolute_error(y_test, y_pred_test)
r2_test = r2_score(y_test, y_pred_test)

print("MAE Test:", mae_test)
print("R^2 Test:", r2_test)


Best Hyperparameters: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 300}
Best Mean MAE: 5487.087436409884
MAE Test: 6557.530848776647
R^2 Test: 0.9829312430282665
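
In the cell above, the grid search is fitted on all of X and y before the train/test split, so the rows later used for testing have already influenced the choice of hyperparameters, and the reported test MAE may be optimistic. A minimal sketch of a split-first alternative is shown here. The helper name tune_then_evaluate, the synthetic data, and the use of scikit-learn's GradientBoostingRegressor as a stand-in are assumptions for illustration; XGBRegressor follows the same estimator interface and could be passed in its place.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, KFold, train_test_split


def tune_then_evaluate(estimator, param_grid, X, y, test_size=0.3, seed=0):
    """Hold out a test set first, tune on the rest, score once on the hold-out."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    search = GridSearchCV(estimator, param_grid, cv=cv,
                          scoring='neg_mean_absolute_error')
    search.fit(X_train, y_train)  # the test rows are never seen during tuning
    # refit=True (the default) retrains the best setting on all of X_train
    test_mae = mean_absolute_error(y_test, search.predict(X_test))
    return search.best_params_, -search.best_score_, test_mae


# Synthetic stand-in data (assumption); replace with X and y from the cell above
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0,
                                 random_state=0)
best_params, cv_mae, test_mae = tune_then_evaluate(
    GradientBoostingRegressor(random_state=0),
    {'n_estimators': [50, 100], 'max_depth': [2, 3]},
    X_demo, y_demo)
print("Best Hyperparameters:", best_params)
print("Best Mean MAE (CV on training rows):", cv_mae)
print("MAE Test:", test_mae)
```

Fixing random_state in both the split and the KFold object also makes the run repeatable, which the original cell does not guarantee.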


Calculate residuals: A residual is the difference between the predicted and the actual sale price. The next cell computes it as predicted minus actual (y_pred_test - y_test), so a positive residual means the property sold for less than the model expected.

Identify undervalued properties: Keep the test-set properties whose actual sale price is below the predicted price (positive residual) and sort them by residual in descending order, so the most undervalued properties come first.

Analyze undervalued properties: Examine the properties with the largest positive residuals and consider why they sold below the model's estimate. Possible explanations are factors the model does not capture, such as unique features, renovations, location advantages, or upcoming developments.

In [2]:
# Calculate residuals
residuals = y_pred_test - y_test

# Create a DataFrame with residuals, actual sale prices, predicted sale prices, and neighborhoods
residuals_df = pd.DataFrame({
    'Actual Sale Price': y_test, 
    'Predicted Sale Price': y_pred_test, 
    'Residuals': residuals, 
    'Neighborhood': neighborhood_test 
})

# Filter undervalued properties
undervalued_properties = residuals_df[residuals_df['Actual Sale Price'] < residuals_df['Predicted Sale Price']].sort_values(by='Residuals', ascending=False)

# Print the undervalued properties with neighborhoods
print("Undervalued Properties:")
top_undervalued_properties = undervalued_properties
for index, row in top_undervalued_properties.iterrows():
    print("Property Neighborhood:", row['Neighborhood'])
    print("Actual Sale Price:", row['Actual Sale Price'])
    print("Predicted Sale Price:", row['Predicted Sale Price'])
    print("Residual:", row['Residuals'])
    print()


Undervalued Properties:
Property Neighborhood: Somerst
Actual Sale Price: 260116.0
Predicted Sale Price: 327122.5625
Residual: 67006.5625

Property Neighborhood: Timber
Actual Sale Price: 327000.0
Predicted Sale Price: 392312.1875
Residual: 65312.1875

Property Neighborhood: Gilbert
Actual Sale Price: 110000.0
Predicted Sale Price: 146002.09375
Residual: 36002.09375

Property Neighborhood: OldTown
Actual Sale Price: 221300.0
Predicted Sale Price: 255930.515625
Residual: 34630.515625

Property Neighborhood: Edwards
Actual Sale Price: 221300.0
Predicted Sale Price: 255460.65625
Residual: 34160.65625

Property Neighborhood: CollgCr
Actual Sale Price: 97500.0
Predicted Sale Price: 131128.28125
Residual: 33628.28125

Property Neighborhood: NAmes
Actual Sale Price: 97500.0
Predicted Sale Price: 130784.9375
Residual: 33284.9375

Property Neighborhood: Gilbert
Actual Sale Price: 97500.0
Predicted Sale Price: 130245.5078125
Residual: 32745.5078125

Property Neighborhood: Veenker
Actual Sale Pri

In [3]:
# Count the undervalued properties selected in the previous cell
num_undervalued_properties = len(undervalued_properties)

# Print the count
print("Number of Undervalued Properties:", num_undervalued_properties)

Number of Undervalued Properties: 390
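
The count above includes every test property with a positive residual, however small. One possible refinement, sketched here under assumptions, is to keep only properties whose residual exceeds a minimum margin. The helper filter_by_margin, the toy rows, and the 5000 margin are hypothetical; the model's test MAE (mae_test) would be one natural choice for the margin.

```python
import pandas as pd

# Toy rows shaped like residuals_df (values invented for illustration)
demo_df = pd.DataFrame({
    'Actual Sale Price': [200000.0, 150000.0, 119000.0, 300000.0],
    'Predicted Sale Price': [230000.0, 151000.0, 119008.0, 280000.0],
})
# Same sign convention as the notebook: predicted minus actual
demo_df['Residuals'] = demo_df['Predicted Sale Price'] - demo_df['Actual Sale Price']


def filter_by_margin(df, min_residual):
    """Keep rows whose predicted price exceeds the actual price by at least min_residual."""
    kept = df[df['Residuals'] >= min_residual]
    return kept.sort_values(by='Residuals', ascending=False)


# Hypothetical margin; only the first toy row clears it
clearly_undervalued = filter_by_margin(demo_df, min_residual=5000.0)
print("Number above margin:", len(clearly_undervalued))
```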


In [4]:
undervalued_properties

      Actual Sale Price  Predicted Sale Price     Residuals Neighborhood
2402           260116.0         327122.562500  67006.562500      Somerst
2427           327000.0         392312.187500  65312.187500       Timber
2424           110000.0         146002.093750  36002.093750      Gilbert
826            221300.0         255930.515625  34630.515625      OldTown
827            221300.0         255460.656250  34160.656250      Edwards
...                 ...                   ...           ...          ...
1087           118500.0         118540.148438     40.148438       BrDale
184            222000.0         222035.375000     35.375000      CollgCr
105            196000.0         196017.421875     17.421875       Timber
1092           119000.0         119008.195312      8.195312        NAmes
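
The analysis step described earlier is not implemented in the cells above. As a starting point, the residuals could be summarized per neighborhood and expressed relative to the sale price, since a 30,000 gap means more on a 100,000 house than on a 400,000 one. This is a minimal sketch on invented toy rows; the column name 'Residual Pct' and the summary column names are assumptions, and the same steps would apply to undervalued_properties.

```python
import pandas as pd

# Toy rows shaped like undervalued_properties (values invented for illustration)
demo_df = pd.DataFrame({
    'Actual Sale Price': [100000.0, 200000.0, 150000.0, 250000.0],
    'Predicted Sale Price': [120000.0, 210000.0, 165000.0, 260000.0],
    'Neighborhood': ['Gilbert', 'Timber', 'Gilbert', 'Timber'],
})
demo_df['Residuals'] = demo_df['Predicted Sale Price'] - demo_df['Actual Sale Price']
# Residual as a percentage of the actual sale price
demo_df['Residual Pct'] = 100 * demo_df['Residuals'] / demo_df['Actual Sale Price']

# Per-neighborhood count and median residual, largest median first
summary = (demo_df.groupby('Neighborhood')
           .agg(count=('Residuals', 'size'),
                median_residual=('Residuals', 'median'),
                median_residual_pct=('Residual Pct', 'median'))
           .sort_values('median_residual', ascending=False))
print(summary)
```

Neighborhoods that rank high in such a summary are candidates for a closer look at features the model does not include.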
