Train the winning model:

Use the XGBRegressor algorithm to train a regression model on the training data. XGBoost is a popular gradient boosting algorithm that can handle regression tasks effectively. 

Make sure to set appropriate hyperparameters, such as the learning rate, number of estimators, and maximum depth, to optimize the model's performance.
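
To get a feel for what tuning these three hyperparameters costs, the sketch below enumerates a grid with the same values as the notebook's `param_grid` and counts the model fits a 5-fold search needs. It uses only the standard library; the names `combinations`, `n_candidates`, and `n_fits` are illustrative and do not come from the notebook.

```python
from itertools import product

# Same grid values as the notebook's param_grid.
param_grid = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.1, 0.01, 0.001],
    "max_depth": [3, 4, 5],
}
num_folds = 5

# Build every combination of one value per hyperparameter.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(combinations)
n_fits = n_candidates * num_folds  # one fit per candidate per fold

print("Candidates:", n_candidates)
print("Cross-validation fits:", n_fits)
print("First candidate:", combinations[0])
```

With this grid the search evaluates 27 candidates, which means 135 cross-validation fits. Adding one more value to any of the three lists raises that to 36 candidates and 180 fits, so the grid is worth keeping small.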

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, r2_score

imputed_data = pd.read_csv("imputed_data_handle_multicollinearity.csv")

# Prepare the data
X = imputed_data[['OverallQual', 'GrLivArea', 'TotalBsmtSF', '1stFlrSF', 'GarageArea',    
                   'YearBuilt', 'YearRemodAdd', 'BsmtFinSF1', 'BsmtUnfSF', 'Fireplaces',
                   'Neighborhood_Blueste', 'Neighborhood_BrDale', 'Neighborhood_BrkSide', 
                   'Neighborhood_ClearCr', 'Neighborhood_CollgCr', 'Neighborhood_Crawfor', 
                   'Neighborhood_Edwards', 'Neighborhood_Gilbert', 'Neighborhood_Greens', 
                   'Neighborhood_GrnHill', 'Neighborhood_IDOTRR', 'Neighborhood_Landmrk', 
                   'Neighborhood_MeadowV', 'Neighborhood_Mitchel', 'Neighborhood_NAmes', 
                   'Neighborhood_NPkVill', 'Neighborhood_NWAmes', 'Neighborhood_NoRidge', 
                   'Neighborhood_NridgHt', 'Neighborhood_OldTown', 'Neighborhood_SWISU', 
                   'Neighborhood_Sawyer', 'Neighborhood_SawyerW', 'Neighborhood_Somerst', 
                   'Neighborhood_StoneBr', 'Neighborhood_Timber', 'Neighborhood_Veenker']]

y = imputed_data['SalePrice']

# Create an instance of XGBRegressor
xgb = XGBRegressor()

# Define the number of folds for cross-validation
num_folds = 5

# Define the cross-validation strategy
cv = KFold(n_splits=num_folds, shuffle=True)

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.1, 0.01, 0.001],
    'max_depth': [3, 4, 5]
}

# Perform grid search
grid_search = GridSearchCV(estimator=xgb, param_grid=param_grid, cv=cv, scoring='neg_mean_absolute_error')
grid_search.fit(X, y)

# Print the best hyperparameters and the corresponding mean score
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Mean MAE:", -grid_search.best_score_)

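# Split the data into train and test sets
# Note: the grid search above was run on all rows, including those that land in
# the test set, so the test R^2 below may be somewhat optimistic
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
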
# Fit the model on the training data using the best hyperparameters
xgb_best = XGBRegressor(**grid_search.best_params_)
xgb_best.fit(X_train, y_train)

# Make predictions on the test set
y_pred_test = xgb_best.predict(X_test)

# Calculate R^2 score
r2 = r2_score(y_test, y_pred_test)

print("R^2 score:", r2)

# Calculate residuals as predicted minus actual; a positive value means the
# model predicts more than the property sold for
residuals = y_pred_test - y_test

# Create a DataFrame with actual sale prices, predicted sale prices, and residuals
residuals_df = pd.DataFrame({
    'Actual Sale Price': y_test,
    'Predicted Sale Price': y_pred_test,
    'Residuals': residuals
})

# Filter undervalued properties
undervalued_properties = residuals_df[residuals_df['Actual Sale Price'] < residuals_df['Predicted Sale Price']].sort_values(by='Residuals', ascending=False)

# Print the top 10 undervalued properties
print("Top 10 Undervalued Properties:")
top_undervalued_properties = undervalued_properties.head(10)
for index, row in top_undervalued_properties.iterrows():
    print("Actual Sale Price:", row['Actual Sale Price'])
    print("Predicted Sale Price:", row['Predicted Sale Price'])
    print("Residual:", row['Residuals'])
    print()

Best Hyperparameters: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 300}
Best Mean MAE: 6188.347000666182
R^2 score: 0.9711399246748238
Top 10 Undervalued Properties:
Actual Sale Price: 315000.0
Predicted Sale Price: 383322.9375
Residual: 68322.9375

Actual Sale Price: 315000.0
Predicted Sale Price: 381655.3125
Residual: 66655.3125

Actual Sale Price: 315000.0
Predicted Sale Price: 381655.3125
Residual: 66655.3125

Actual Sale Price: 67500.00000000001
Predicted Sale Price: 95442.1484375
Residual: 27942.148437499985

Actual Sale Price: 67500.00000000001
Predicted Sale Price: 95245.828125
Residual: 27745.828124999985

Actual Sale Price: 245000.0
Predicted Sale Price: 270982.78125
Residual: 25982.78125

Actual Sale Price: 99500.0
Predicted Sale Price: 123209.6328125
Residual: 23709.6328125

Actual Sale Price: 315000.0
Predicted Sale Price: 338523.625
Residual: 23523.625

Actual Sale Price: 315000.0
Predicted Sale Price: 338220.59375
Residual: 23220.59375

Actual Sale Price: 20700

Calculate residuals: The code computes residuals as the predicted sale price minus the actual sale price (y_pred_test - y_test). A residual is the difference between the two prices; with this sign convention, a positive residual means the model predicts more than the property sold for.
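
As a minimal sketch of the sign convention, the snippet below computes residuals on a few made-up prices. The values are invented for illustration and are not taken from the dataset. It assumes the predicted-minus-actual convention used in the `In [1]` cell and shows that the opposite convention only flips the sign.

```python
import pandas as pd

# Made-up prices for illustration only; not taken from the dataset.
actual = pd.Series([200000.0, 150000.0, 320000.0], index=[10, 11, 12])
predicted = pd.Series([215000.0, 140000.0, 320500.0], index=[10, 11, 12])

# Convention used in the notebook code: predicted minus actual.
residuals = predicted - actual

# The opposite convention (actual minus predicted) only flips the sign.
residuals_alt = actual - predicted

print(residuals.tolist())
print((residuals == -residuals_alt).all())
```

Whichever convention is chosen, the filter and the sort direction have to match it: under predicted minus actual, undervalued properties are the ones with positive residuals.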

Identify undervalued properties: The code keeps the test-set properties whose actual sale price is below the predicted sale price and sorts them by residual in descending order, so the properties with the largest gap come first.
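
The filtering and sorting step can be sketched on a small made-up frame. The column names match the notebook's `residuals_df`; the rows and index values are invented, and `undervalued` is an illustrative name.

```python
import pandas as pd

# Made-up rows for illustration only.
residuals_df = pd.DataFrame(
    {
        "Actual Sale Price": [200000.0, 150000.0, 320000.0, 99000.0],
        "Predicted Sale Price": [215000.0, 140000.0, 320500.0, 120000.0],
    },
    index=[10, 11, 12, 13],
)
residuals_df["Residuals"] = (
    residuals_df["Predicted Sale Price"] - residuals_df["Actual Sale Price"]
)

# Keep rows where the model predicts more than the sale price,
# then put the largest gaps first.
undervalued = residuals_df[residuals_df["Residuals"] > 0].sort_values(
    "Residuals", ascending=False
)

print(undervalued.index.tolist())
```

Filtering on `Residuals > 0` selects the same rows as comparing the actual and predicted columns directly, as long as residuals are defined as predicted minus actual.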

Analyze undervalued properties: The code does not analyze why these properties appear underpriced. This step involves examining the properties with the largest positive residuals and considering factors the model might have overlooked, such as unique features, renovations, location advantages, or upcoming developments. A large residual can also reflect model error rather than a genuine bargain.
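
One possible starting point for this analysis is to join the undervalued rows back to their features and recover a readable neighborhood label from the one-hot columns. The sketch below uses made-up rows. It assumes the indicator columns share the `Neighborhood_` prefix seen in the feature list and that the results keep the same index as `X_test`. The names `hood_cols`, `report`, and the `"other/baseline"` label are hypothetical.

```python
import pandas as pd

# Made-up feature rows standing in for X_test; index values are arbitrary.
X_test = pd.DataFrame(
    {
        "OverallQual": [7, 5, 8],
        "GrLivArea": [1710, 1040, 2200],
        "Neighborhood_CollgCr": [1, 0, 0],
        "Neighborhood_OldTown": [0, 0, 1],
    },
    index=[10, 12, 13],
)
# Made-up undervalued rows, indexed like X_test.
undervalued = pd.DataFrame({"Residuals": [21000.0, 15000.0, 500.0]}, index=[13, 10, 12])

hood_cols = [c for c in X_test.columns if c.startswith("Neighborhood_")]
onehot = X_test[hood_cols]

# idxmax picks the column holding the 1. Rows with no 1 belong to a category
# that has no indicator column, so they get a placeholder label instead.
neighborhood = onehot.idxmax(axis=1).str.replace("Neighborhood_", "", regex=False)
neighborhood = neighborhood.where(onehot.sum(axis=1) > 0, "other/baseline")

# Join on the index, then attach the decoded label (aligned by index).
report = undervalued.join(X_test[["OverallQual", "GrLivArea"]]).assign(
    Neighborhood=neighborhood
)
print(report)
```

Grouping such a report by `Neighborhood` or by `OverallQual` is one way to check whether large residuals cluster in particular segments, which would point to a gap in the model rather than to individual bargains.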

In [2]:
# Filter undervalued properties
undervalued_properties = residuals_df[residuals_df['Actual Sale Price'] < residuals_df['Predicted Sale Price']].sort_values(by='Residuals', ascending=False)

# Count the number of undervalued properties
num_undervalued_properties = len(undervalued_properties)

# Print the count
print("Number of Undervalued Properties:", num_undervalued_properties)

Number of Undervalued Properties: 394


In [3]:
undervalued_properties

      Actual Sale Price  Predicted Sale Price     Residuals
921            315000.0         383322.937500  68322.937500
923            315000.0         381655.312500  66655.312500
922            315000.0         381655.312500  66655.312500
234             67500.0          95442.148438  27942.148437
235             67500.0          95245.828125  27745.828125
...                 ...                   ...           ...
789            179000.0         179058.796875     58.796875
654            102000.0         102039.109375     39.109375
1297           150909.0         150948.031250     39.031250
1436           142500.0         142526.750000     26.750000
