# Housing Price Prediction Using XGBoost
---
### By: Tyler Trzecki

This code is adapted from the article [Medium: XGBoost Python Example](https://towardsdatascience.com/xgboost-python-example-42777d01001e).

The Boston housing prices dataset used in the article raises several ethical concerns because it relies heavily on racial and socioeconomic data. The dataset was built to study the impact of air quality on housing prices, but it did not provide adequate evidence for that purpose. Furthermore, it assumed that racial segregation had a positive impact on housing prices.[[1]](https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8)[[2]](https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air)

In its place, the California housing dataset from scikit-learn is used.

In [10]:
import pandas as pd
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [11]:
housing = fetch_california_housing()

X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target)

In [12]:
X.head()

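|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |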


In [21]:
print(housing.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bur

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [14]:
reg = xgb.XGBRegressor(
    n_estimators=100,
    reg_lambda=1,
    gamma=0,
    max_depth=3
    )

In [15]:
reg.fit(X_train,y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.300000012,
             max_delta_step=0, max_depth=3, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=100, n_jobs=16,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)

In [17]:
pd.DataFrame(reg.feature_importances_.reshape(1,-1), columns=housing.feature_names)

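|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 0 | 0.560666 | 0.052794 | 0.033029 | 0.018917 | 0.017629 | 0.139241 | 0.085581 | 0.092142 |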


Median income is the strongest home price indicator for this dataset.
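To make the ranking easier to read, the scores can be sorted from most to least important. The sketch below is a minimal illustration that copies the values from the output table above into a plain dictionary (the names `importances` and `ranked` are introduced here for illustration only). Inside the notebook, something like `pd.Series(reg.feature_importances_, index=housing.feature_names).sort_values(ascending=False)` should give the same ordering.

```python
# Importance scores copied from the feature importance output above.
importances = {
    "MedInc": 0.560666,
    "HouseAge": 0.052794,
    "AveRooms": 0.033029,
    "AveBedrms": 0.018917,
    "Population": 0.017629,
    "AveOccup": 0.139241,
    "Latitude": 0.085581,
    "Longitude": 0.092142,
}

# Sort features from most to least important.
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)

for name, score in ranked:
    print(f"{name:<12}{score:.3f}")
```

With these values, median income accounts for more than half of the total importance, followed by average occupancy and the two location features.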

In [22]:
y_pred = reg.predict(X_test)

We use the mean squared error (MSE) to evaluate model performance. The MSE is the average of the squared differences between the predicted and actual values.
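The definition above can be sketched in a few lines of plain Python. This is only an illustration of the calculation, not a replacement for `sklearn.metrics.mean_squared_error`; the helper name `mean_squared_error_manual` and the toy values are made up for this example.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Return the average of the squared differences between two sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # MSE = (1 / n) * sum((y_i - y_hat_i) ** 2)
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Toy values, not taken from the housing data.
y_true_toy = [2.0, 3.0, 5.0]
y_pred_toy = [2.5, 3.0, 4.0]

# Squared errors are 0.25, 0.0 and 1.0, so the mean is 1.25 / 3.
mse_toy = mean_squared_error_manual(y_true_toy, y_pred_toy)
print(round(mse_toy, 4))
```

Because the errors are squared, a single large miss contributes far more to the total than several small ones.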

In [26]:
print('The mean squared error is {}.'.format(round(mean_squared_error(y_test,y_pred),2)))

The mean squared error is 0.24.


Mean squared error is one of the most common loss functions for regression algorithms. Its optimal value is 0, which means every prediction exactly matches the true value. This [Medium article](https://towardsdatascience.com/https-medium-com-chayankathuria-regression-why-mean-square-error-a8cad2a1c96f) provides a good discussion of the different loss functions and why mean squared error is a good choice for regression problems.
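Because the MSE is expressed in squared target units, its square root (RMSE) is often easier to interpret. The sketch below starts from the rounded value of 0.24 printed above, so the result is approximate. It also assumes the target is in units of $100,000, as the scikit-learn documentation for this dataset states; the truncated description above does not show this.

```python
import math

# Rounded MSE printed by the notebook; the exact value varies with the random split.
mse = 0.24

# RMSE is in the same units as the target variable.
rmse = math.sqrt(mse)

# Assumption: the target is median house value in units of $100,000.
dollars_per_unit = 100_000

print(f"RMSE: {rmse:.2f} target units (about ${rmse * dollars_per_unit:,.0f})")
```

Under that assumption, the typical prediction error is roughly $49,000 in median house value.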