Train the winning model:

Use the XGBRegressor algorithm to train a regression model on the training data. XGBoost is a popular gradient boosting algorithm that can handle regression tasks effectively. 

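Choose the main hyperparameters, such as the learning rate, number of estimators, and maximum tree depth, by searching over candidate values rather than relying on the defaults. The cell below does this with a cross-validated grid search scored on mean absolute error (MAE).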
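
For a sense of what tuning these three hyperparameters costs, the short sketch below enumerates the same candidate values that the training cell searches and counts the model fits involved. The names `demo_grid` and `demo_folds` are illustrative only; the values are copied from the training cell.

```python
from itertools import product

# Candidate values copied from the grid-search cell; names are illustrative
demo_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.1, 0.01, 0.001],
    'max_depth': [3, 4, 5],
}
demo_folds = 5

# Every combination of the three hyperparameters, one dict per candidate
combinations = [dict(zip(demo_grid, values)) for values in product(*demo_grid.values())]

n_candidates = len(combinations)
n_cv_fits = n_candidates * demo_folds

print("Candidates:", n_candidates)
print("Cross-validation fits:", n_cv_fits)
```

With scikit-learn's default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on all the data it was given.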

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, r2_score

imputed_data = pd.read_csv("imputed_data_handle_multicollinearity.csv")

# Prepare the data
X = imputed_data[['OverallQual', 'GrLivArea', 'TotalBsmtSF', '1stFlrSF', 'GarageArea',    
                   'YearBuilt', 'YearRemodAdd', 'BsmtFinSF1', 'BsmtUnfSF', 'Fireplaces',
                  'Neighborhood_Blueste', 'Neighborhood_BrDale', 'Neighborhood_BrkSide', 
                  'Neighborhood_ClearCr', 'Neighborhood_CollgCr', 'Neighborhood_Crawfor', 
                  'Neighborhood_Edwards', 'Neighborhood_Gilbert', 'Neighborhood_Greens', 
                  'Neighborhood_GrnHill', 'Neighborhood_IDOTRR', 'Neighborhood_Landmrk', 
                  'Neighborhood_MeadowV', 'Neighborhood_Mitchel', 'Neighborhood_NAmes', 
                  'Neighborhood_NPkVill', 'Neighborhood_NWAmes', 'Neighborhood_NoRidge', 
                  'Neighborhood_NridgHt', 'Neighborhood_OldTown', 'Neighborhood_SWISU', 
                  'Neighborhood_Sawyer', 'Neighborhood_SawyerW', 'Neighborhood_Somerst', 
                  'Neighborhood_StoneBr', 'Neighborhood_Timber', 'Neighborhood_Veenker']]

y = imputed_data['SalePrice']

# Create an instance of XGBRegressor
xgb = XGBRegressor()

# Define the number of folds for cross-validation
num_folds = 5

# Define the cross-validation strategy (fixed seed so the folds are reproducible)
cv = KFold(n_splits=num_folds, shuffle=True, random_state=42)

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.1, 0.01, 0.001],
    'max_depth': [3, 4, 5]
}

# Perform grid search
grid_search = GridSearchCV(estimator=xgb, param_grid=param_grid, cv=cv, scoring='neg_mean_absolute_error')
grid_search.fit(X, y)

# Print the best hyperparameters and the corresponding mean score
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Mean MAE:", -grid_search.best_score_)

# Neighborhood dummy columns (the same columns listed in X above)
neighborhood_cols = [col for col in X.columns if col.startswith('Neighborhood_')]

# Recover one neighborhood label per row from the one-hot columns (a Series, not a DataFrame).
# Note: idxmax returns the first column for an all-zero row, so a row belonging to a
# neighborhood without its own dummy column would be labelled 'Blueste'.
neighborhood = X[neighborhood_cols].idxmax(axis=1)
neighborhood = neighborhood.str.replace('Neighborhood_', '', regex=False)

# Split the data into train and test sets, keeping the 'Neighborhood' labels aligned
# (fixed seed so the split is reproducible)
X_train, X_test, y_train, y_test, neighborhood_train, neighborhood_test = train_test_split(
    X, y, neighborhood, test_size=0.3, random_state=42)

# Fit the model on the training data using the best hyperparameters
xgb_best = XGBRegressor(**grid_search.best_params_)
xgb_best.fit(X_train, y_train)

# Make predictions on the test set
y_pred_test = xgb_best.predict(X_test)

# Evaluate the model
mae_test = mean_absolute_error(y_test, y_pred_test)
r2_test = r2_score(y_test, y_pred_test)

print("MAE Test:", mae_test)
print("R^2 Test:", r2_test)


Best Hyperparameters: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 300}
Best Mean MAE: 5652.416484223595
MAE Test: 6134.955174014858
R^2 Test: 0.983939979804367
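
The cell above runs the grid search on the full dataset and only afterwards splits off a test set, so the hyperparameters were chosen with the test rows in view and the reported test MAE may be optimistic. The sketch below shows one common alternative ordering: split first, tune on the training portion only, then score once on the held-out rows. It is an illustration under stated assumptions, not the notebook's method. It uses synthetic data from `make_regression` and scikit-learn's `GradientBoostingRegressor` as a stand-in so that it runs without the CSV file or xgboost; `XGBRegressor` follows the same estimator interface. The reduced grid and all `*_demo` names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

SEED = 42

# Synthetic stand-in for the housing data (assumption: illustration only)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=SEED)

# 1. Hold out the test set before any tuning
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=SEED)

# 2. Tune on the training portion only
cv_demo = KFold(n_splits=5, shuffle=True, random_state=SEED)
search_demo = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=SEED),
    param_grid={'n_estimators': [50, 100], 'max_depth': [2, 3]},  # reduced grid for speed
    cv=cv_demo,
    scoring='neg_mean_absolute_error',
)
search_demo.fit(X_tr, y_tr)

# 3. best_estimator_ has been refit on all training rows; score it once on the held-out rows
mae_holdout = mean_absolute_error(y_te, search_demo.best_estimator_.predict(X_te))

print("Best hyperparameters:", search_demo.best_params_)
print("Cross-validated MAE:", -search_demo.best_score_)
print("Held-out MAE:", mae_holdout)
```

Because the held-out rows play no part in choosing the hyperparameters, the final MAE is a cleaner estimate of performance on unseen data.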


Calculate residuals: The cell above stops at the test metrics and does not compute residuals. Here a residual is the predicted sale price minus the actual sale price (y_pred_test - y_test), so a positive residual means the property sold for less than the model predicted.

Identify undervalued properties: Keep the test-set properties whose actual sale price is below the predicted sale price (a positive residual) and sort them by residual in descending order, so that the most undervalued properties come first.

Analyze undervalued properties: Examine the properties with the largest positive residuals and consider why they sold below the predicted price. This involves looking at factors the model does not capture, such as unique features, renovations, location advantages, or upcoming developments. A large residual can also simply reflect model error.
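
The sign convention is easy to get backwards, so here is a small worked example. The three prices are made up for illustration and are not taken from the dataset; the `toy` names are hypothetical. It uses the same convention as the next cell: residual = predicted - actual.

```python
import pandas as pd

# Made-up prices for illustration only; not taken from the dataset
toy = pd.DataFrame({
    'Actual Sale Price': [200000.0, 150000.0, 310000.0],
    'Predicted Sale Price': [230000.0, 140000.0, 312000.0],
})

# Residual = predicted - actual
toy['Residuals'] = toy['Predicted Sale Price'] - toy['Actual Sale Price']

# A positive residual means the property sold below the model's estimate
toy_undervalued = toy[toy['Residuals'] > 0].sort_values(by='Residuals', ascending=False)

print(toy_undervalued)
```

The second row has a negative residual (it sold above the predicted price), so it is dropped; the remaining two rows are ordered from the largest gap to the smallest.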

In [2]:
# Calculate residuals
residuals = y_pred_test - y_test

# Create a DataFrame with residuals, actual sale prices, predicted sale prices, and neighborhoods
residuals_df = pd.DataFrame({
    'Actual Sale Price': y_test, 
    'Predicted Sale Price': y_pred_test, 
    'Residuals': residuals, 
    'Neighborhood': neighborhood_test 
})

# Filter undervalued properties
undervalued_properties = residuals_df[residuals_df['Actual Sale Price'] < residuals_df['Predicted Sale Price']].sort_values(by='Residuals', ascending=False)

# Print the top 10 undervalued properties with neighborhoods
print("Top 10 Undervalued Properties:")
top_undervalued_properties = undervalued_properties.head(10)
for index, row in top_undervalued_properties.iterrows():
    print("Property Neighborhood:", row['Neighborhood'])
    print("Actual Sale Price:", row['Actual Sale Price'])
    print("Predicted Sale Price:", row['Predicted Sale Price'])
    print("Residual:", row['Residuals'])
    print()


Top 10 Undervalued Properties:
Property Neighborhood: CollgCr
Actual Sale Price: 386250.0
Predicted Sale Price: 486842.4375
Residual: 100592.4375

Property Neighborhood: IDOTRR
Actual Sale Price: 121500.0
Predicted Sale Price: 201164.75
Residual: 79664.75

Property Neighborhood: NAmes
Actual Sale Price: 121500.0
Predicted Sale Price: 200872.859375
Residual: 79372.859375

Property Neighborhood: Edwards
Actual Sale Price: 380000.0
Predicted Sale Price: 435582.6875
Residual: 55582.6875

Property Neighborhood: Sawyer
Actual Sale Price: 380000.0
Predicted Sale Price: 434507.6875
Residual: 54507.6875

Property Neighborhood: BrkSide
Actual Sale Price: 380000.0
Predicted Sale Price: 434507.6875
Residual: 54507.6875

Property Neighborhood: Gilbert
Actual Sale Price: 63000.0
Predicted Sale Price: 108868.390625
Residual: 45868.390625

Property Neighborhood: NAmes
Actual Sale Price: 63000.0
Predicted Sale Price: 108821.2109375
Residual: 45821.2109375

Property Neighborhood: NAmes
Actual Sale Price

In [3]:
# Filter undervalued properties
undervalued_properties = residuals_df[residuals_df['Actual Sale Price'] < residuals_df['Predicted Sale Price']].sort_values(by='Residuals', ascending=False)

# Count the number of undervalued properties
num_undervalued_properties = len(undervalued_properties)

# Print the count
print("Number of Undervalued Properties:", num_undervalued_properties)

Number of Undervalued Properties: 401


In [4]:
undervalued_properties

      Actual Sale Price  Predicted Sale Price      Residuals  Neighborhood
2521           386250.0         486842.437500  100592.437500       CollgCr
2347           121500.0         201164.750000   79664.750000        IDOTRR
2348           121500.0         200872.859375   79372.859375         NAmes
1025           380000.0         435582.687500   55582.687500       Edwards
1023           380000.0         434507.687500   54507.687500        Sawyer
...                 ...                   ...            ...           ...
2183           191000.0         191128.750000     128.750000       OldTown
329            107500.0         107598.562500      98.562500       NoRidge
872            233555.0         233637.125000      82.125000       CollgCr
1459           260000.0         260053.609375      53.609375       Blueste
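
The listing ends with residuals of well under 200, far below the test MAE of about 6135 reported earlier, so many of the 401 rows fall within the model's typical error. One possible refinement, sketched below, keeps only properties whose residual exceeds a threshold and then ranks them by residual relative to the actual price. Using the test MAE as the threshold is an assumption made for illustration, not part of the original analysis. The `sample` rows are a subset of the listing above, and the `Residual %` column is a hypothetical addition.

```python
import pandas as pd

# A few rows copied from the listing above, in the same layout as residuals_df
sample = pd.DataFrame({
    'Actual Sale Price': [386250.0, 121500.0, 191000.0, 260000.0],
    'Predicted Sale Price': [486842.4375, 201164.75, 191128.75, 260053.609375],
    'Neighborhood': ['CollgCr', 'IDOTRR', 'OldTown', 'Blueste'],
})
sample['Residuals'] = sample['Predicted Sale Price'] - sample['Actual Sale Price']

# Hypothetical threshold: the test-set MAE printed by the training cell
threshold = 6134.955174014858

# Keep only properties whose gap is larger than the model's typical error
clearly_undervalued = sample[sample['Residuals'] > threshold].copy()

# Residual as a percentage of the actual price, as an alternative ranking
clearly_undervalued['Residual %'] = (
    100 * clearly_undervalued['Residuals'] / clearly_undervalued['Actual Sale Price']
)
clearly_undervalued = clearly_undervalued.sort_values(by='Residual %', ascending=False)

print(clearly_undervalued)
```

Ranking by percentage rather than by absolute residual puts a cheaper property with a proportionally larger gap ahead of an expensive one with a larger absolute gap.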
