# **Comparison of Regression methods for House Price Prediction**

The objective of this work is to predict house prices in King County, USA, using regression models and to identify the best-fitting model. Three regression models are used in the study: Multiple Linear Regression, Decision Tree Regression and Random Forest Regression. The best model is identified by comparing the R² score (`r2_score`) of the different models on a held-out test set.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Import relevant libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

In [None]:
data = pd.read_csv('../input/housesalesprediction/kc_house_data.csv')
data.head()

In [None]:
data.info()

The DataFrame has 21613 rows and 21 columns, and no column has missing values.

In [None]:
data.describe()

In [None]:
#Find the number of unique entries in each column
data.nunique()

Classify the variables into four categories:
* Continuous variables: numeric variables that can take any value within a range of real numbers.
* Discrete variables: numeric variables that can only take distinct, separate values.
* Nominal variables: categorical variables that have no order.
* Ordinal variables: categorical variables whose values can be logically ordered or ranked.

In [None]:
continuous_variables = ['price', 'sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'sqft_living15', 'sqft_lot15']
discrete_variables = ['yr_built', 'yr_renovated']
nominal_variables = ['lat', 'long', 'zipcode']
ordinal_variables = ['bedrooms', 'bathrooms', 'floors', 'waterfront', 'view', 'condition', 'grade']

## Distribution of target variable : sales price

In [None]:
sns.set_style("darkgrid")
plt.figure(figsize =(8,5))
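# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay gives a similar plot
sns.histplot(data['price'], kde=True, stat='density')
plt.xlabel('Price')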

In [None]:
print('Skewness : %f' % data['price'].skew())
print('Kurtosis : %f' % data['price'].kurt())

Skewness measures how far a distribution departs from the symmetry of the normal (bell-shaped) distribution. The distribution above is positively skewed: its peak lies below the mean, and a long tail extends towards high prices. This suggests that most houses sell for less than the average price. Kurtosis measures whether the data are heavy-tailed or light-tailed relative to a normal distribution; pandas reports excess kurtosis, which is 0 for a normal distribution. The high kurtosis here (34.585540) is likely due to outliers in the data.
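
The effect of a long right tail on these statistics can be illustrated with a minimal sketch. It uses synthetic log-normal values as a stand-in for prices (an assumption, not the King County data), and the names `prices` and `log_prices` are hypothetical. It also shows that a log transform, a common remedy that is not applied in this notebook, reduces the skew.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" drawn from a log-normal distribution.
# This is illustrative data only, not the King County dataset.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=10_000))

# For a right-skewed distribution the mean is pulled above the median.
print('Mean   : %.0f' % prices.mean())
print('Median : %.0f' % prices.median())
print('Skewness (raw) : %.3f' % prices.skew())
print('Kurtosis (raw) : %.3f' % prices.kurt())  # excess kurtosis, 0 for a normal distribution

# A log transform pulls in the long right tail.
log_prices = np.log(prices)
print('Skewness (log) : %.3f' % log_prices.skew())
print('Kurtosis (log) : %.3f' % log_prices.kurt())
```

If the target were log-transformed before modelling, predictions would have to be mapped back with `np.exp` before computing errors in price units.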

In [None]:
sns.pairplot(data[continuous_variables], height = 2 ,kind ='scatter',diag_kind='kde')               

Almost all the continuous variables show a positive skewness. Variables 'sqft_above' and 'sqft_living' are almost linearly related.

In [None]:
sns.set_style("darkgrid")
# One row per ordinal variable: value counts, price box plot, and price trend
fig, ax = plt.subplots(len(ordinal_variables), 3, figsize=(15,30))

for i, el in enumerate(ordinal_variables):
    sns.countplot(x=el, data=data, ax=ax[i,0])
    sns.boxplot(x=el, y='price', data=data, ax=ax[i,1])
    sns.regplot(x=el, y='price', data=data, ax=ax[i,2])

plt.show()

## Observations
* Bedrooms & Bathrooms: The median house price rises with the number of bedrooms (up to 7) and bathrooms (up to 5). Beyond that, it doesn't show a linear trend.
* Floors: The median house price increases with the number of floors (up to 2.5).
* Waterfront: Houses with a waterfront are priced higher.
* View: The better the view, the higher the price.
* Condition: The median price for conditions 3, 4 and 5 remains almost the same, though prices for condition 1 and 2 houses are slightly lower.
* Grade: The median house price increases almost exponentially with grade.

## House age vs house price

In [None]:
df1 = data.copy() 
df1.drop(['id','date'], axis = 1, inplace=True)
df1['yrs_old_renovated'] = np.where(df1['yr_renovated']!= 0, 2015 - df1['yr_renovated'], 2015 - df1['yr_built'])
df1['yrs_old_bins'] = pd.cut(x = df1['yrs_old_renovated'], bins = [-1, 20, 40, 60, 80, 100, 120])
df1['price_bins'] = pd.cut(x = df1['price'], bins = [0, 1e6, 2e6, 3e6, 4e6, 5e6, 6e6, 7e6, 8e6])
df1.head()     

In [None]:
fig, ax = plt.subplots(ncols=2, figsize=(18,5))
sns.boxplot(x='yrs_old_bins', y= 'price',data=df1, ax=ax[0])
sns.countplot(x='yrs_old_bins', data=df1, ax=ax[1])
plt.show()

There is not much change in the median house price with age, so we will discard the yr_built and yr_renovated features from the training data.

Let's have a look at the 33-bedroom house and compare it with the mean and median values of the dataset.

In [None]:
data[(data['bedrooms'] == 33)]

In [None]:
df1.drop(df1[df1['bedrooms'] == 33].index, axis = 0, inplace = True)

## **Geographic location vs house price**

In [None]:
#plt.figure(figsize=(10,10))
#sns.scatterplot(x='long', y ='lat',data=df1,  sizes = (50, 300), style = 'price_bins', 
               # hue = 'price_bins', alpha = 0.4,   palette='bright')

In [None]:
# pairplot creates its own figure, so the size is set with height rather than plt.figure
g = sns.pairplot(data=df1[['long','lat','price_bins']], hue='price_bins', corner=True, height=5)

The scatter plot above roughly traces the shape of King County. Higher-priced houses are concentrated in specific regions, especially near the coasts. Specifically, the high-priced houses are located between latitudes of $47.5^{\circ}$ and $47.7^{\circ}$ and longitudes of $-122.0^{\circ}$ and $-122.4^{\circ}$. This information may be helpful for a homebuyer when making a purchase decision. It also indicates that geographical location (latitude, longitude) is a key factor in determining house price.

## Correlation between variables

In [None]:
features = continuous_variables +  ordinal_variables 
k= len(features)
cols = df1[features].corr().nlargest(k,'price')['price'].index
cm = np.corrcoef(data[cols].values.T)
mask = np.zeros_like(df1[cols].corr())
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(15, 10))
    ax = sns.heatmap(cm, cmap='viridis', mask=mask, vmax=.7, linewidths=0.01, annot = True, square=True, 
                    linecolor="white",xticklabels = cols.values ,annot_kws = {'size':12},yticklabels = cols.values)

## Feature selection
Here we select the variables that are most highly correlated with our target variable, price. Let's choose the top 10 variables: 'sqft_living', 'grade', 'sqft_above', 'sqft_living15', 'bathrooms', 'view', 'sqft_basement', 'bedrooms', 'waterfront', 'floors'.

'sqft_living' and 'sqft_above' are highly correlated, with a correlation coefficient of 0.88, so keeping one of these variables in the training set is sufficient. 'sqft_living' has a higher correlation with 'price' than 'sqft_above', so we will keep 'sqft_living' as a training feature. We will also add the geographical location parameters, 'lat' and 'long', to the training features.
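
The selection logic described above (rank by correlation with the target, then drop one of any highly correlated pair) can be sketched as follows. This is only an illustration on a toy frame: the column names `a`, `a_copy`, `b`, and the `threshold` of 0.85 are hypothetical, and the notebook itself makes the selection by inspecting the heatmap rather than with code like this.

```python
import numpy as np
import pandas as pd

# Toy data: 'a' and 'a_copy' are nearly collinear, 'b' is independent.
# Illustrative only; these are not columns of the house dataset.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
toy = pd.DataFrame({
    'a': a,
    'a_copy': a + 0.1 * rng.normal(size=500),
    'b': b,
})
toy['target'] = 3.0 * toy['a'] + 1.0 * toy['b'] + rng.normal(size=500)

corr = toy.corr()

# Rank features by absolute correlation with the target.
ranked = corr['target'].drop('target').abs().sort_values(ascending=False)

# Greedily keep a feature unless it is highly correlated with one already kept.
threshold = 0.85  # hypothetical cut-off
kept = []
for name in ranked.index:
    if all(abs(corr.loc[name, other]) < threshold for other in kept):
        kept.append(name)

print(ranked.round(2))
print('Kept features:', kept)
```

Because features are visited in order of their correlation with the target, the member of a collinear pair that is kept is the one more strongly related to the target, which mirrors the choice of 'sqft_living' over 'sqft_above'.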

In [None]:
selected_features = ['sqft_living', 'grade', 'sqft_living15', 'bathrooms', 'view', 'sqft_basement', 'bedrooms',
                     'waterfront', 'floors', 'long', 'lat']
target = ['price']
X = data[selected_features]
y = np.ravel(data[target])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=100)

## 1. Multiple Linear Regression
Multiple linear regression (MLR) models a linear relationship between several explanatory (independent) variables and the response (dependent) variable. Here we use all the selected training features to predict the house price.
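
As a minimal sketch of what the fitted model represents, the example below fits `LinearRegression` to synthetic data with known coefficients (an assumption made for illustration; `X_toy`, `y_toy`, `true_coef` and `true_intercept` are hypothetical names) and checks that a prediction is the intercept plus a weighted sum of the features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from a known linear relationship plus noise.
# Illustrative only; not the King County data.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
true_coef = np.array([2.0, -1.0, 0.5])
true_intercept = 4.0
y_toy = true_intercept + X_toy @ true_coef + 0.1 * rng.normal(size=1000)

model = LinearRegression().fit(X_toy, y_toy)

# The fitted model predicts y_hat = intercept_ + X @ coef_
manual_pred = model.intercept_ + X_toy @ model.coef_
assert np.allclose(manual_pred, model.predict(X_toy))

print('Intercept   :', round(model.intercept_, 2))
print('Coefficients:', model.coef_.round(2))
```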

In [None]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_prediction = regressor.predict(X_test)
RMSE = sqrt(mean_squared_error(y_true = y_test, y_pred = y_prediction))
mae = mean_absolute_error(y_test, y_prediction)
r2score = r2_score(y_test, y_prediction)
            
print('RMSE:', RMSE)
print('MAE:' ,mae )
print('R2score:', r2score)
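
The three metrics printed above can be written out by hand to make their definitions explicit. The sketch below uses four hypothetical true and predicted prices (not taken from the dataset) and checks the hand-computed values against the scikit-learn functions used in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted prices, for illustration only.
y_true_toy = np.array([300_000.0, 450_000.0, 520_000.0, 1_200_000.0])
y_pred_toy = np.array([320_000.0, 430_000.0, 600_000.0, 1_000_000.0])

errors = y_true_toy - y_pred_toy

rmse_manual = np.sqrt(np.mean(errors ** 2))              # root mean squared error
mae_manual = np.mean(np.abs(errors))                     # mean absolute error
ss_res = np.sum(errors ** 2)                             # residual sum of squares
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot                        # coefficient of determination

# The hand-computed values should agree with scikit-learn's functions.
assert np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true_toy, y_pred_toy)))
assert np.isclose(mae_manual, mean_absolute_error(y_true_toy, y_pred_toy))
assert np.isclose(r2_manual, r2_score(y_true_toy, y_pred_toy))

print('RMSE:', rmse_manual)
print('MAE:', mae_manual)
print('R2score:', r2_manual)
```

RMSE and MAE are in the units of the target (price), while the R² score is unitless, which is why it is convenient for comparing models.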

### True Value vs. Predicted value for Multiple Linear Regression model

In [None]:
sns.regplot(x=y_test, y=  y_prediction)
plt.xlabel('True Values [Price]')
plt.ylabel('Predictions [Price]')
plt.title('Multiple Linear Regression predictions for the test data')

## 2. Decision Tree Regression

In [None]:
max_depth = [5,10,15,20,25,30,35,40,45,50]
RMSE = []
mae = []
r2score = []
for n in max_depth:
    regressor = DecisionTreeRegressor(max_depth = n, random_state = 100)
    regressor.fit(X_train, y_train)
    y_prediction = regressor.predict(X_test)
    RMSE.append(sqrt(mean_squared_error(y_true = y_test, y_pred = y_prediction)))
    mae.append(mean_absolute_error(y_test, y_prediction))
    r2score.append(r2_score(y_test, y_prediction))
    
DTRegressor_results = pd.DataFrame({'max_depth':max_depth,'RMSE':RMSE, 'MAE': mae, 'r2score':r2score})

print(DTRegressor_results.round(2))

fig, ax1 = plt.subplots()
ax1.plot(DTRegressor_results['max_depth'], DTRegressor_results['r2score'], 'b--')
ax1.set_xlabel('max_depth')
ax1.set_ylabel('r2score')
ax1.legend(['r2score'], loc ="upper right")
ax2 = ax1.twinx()
ax2.plot(DTRegressor_results['max_depth'], DTRegressor_results['MAE'], 'r--')
ax2.set_ylabel('MAE')
ax2.legend(['MAE'],loc ="upper center") 
plt.show()

The best fitting model in this case has an r2score of 0.80 and MAE of 88514.64 with max_depth = 10.
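
The search above picks `max_depth` by its score on the test set. One alternative, which this notebook does not use, is to choose the depth by cross-validation on the training split and keep the test split for a final check. A minimal sketch with `GridSearchCV`, assuming synthetic data from `make_regression` in place of the house features (the names `X_toy`, `y_toy`, `X_tr`, `X_te`, `y_tr`, `y_te` and the grid values are hypothetical):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data stands in for the house features; illustrative only.
X_toy, y_toy = make_regression(n_samples=600, n_features=8, noise=20.0, random_state=100)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.33, random_state=100)

# 5-fold cross-validation on the training split only;
# the test split is never used to choose max_depth.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=100),
    param_grid={'max_depth': [2, 5, 10, 15, 20]},
    scoring='r2',
    cv=5,
)
search.fit(X_tr, y_tr)

print('Best max_depth   :', search.best_params_['max_depth'])
print('Mean CV R2       :', round(search.best_score_, 3))
print('Held-out test R2 :', round(search.best_estimator_.score(X_te, y_te), 3))
```

With this approach the test-set score is not used during tuning, so it gives a less optimistic estimate of how the chosen model performs on unseen data.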

### True Value vs. Predicted value for the best fitting Decision Tree Regression model

In [None]:
sns.regplot(x=y_test, y=  DecisionTreeRegressor(max_depth = 10, random_state = 100).fit(X_train, y_train).predict(X_test))
plt.xlabel('True Values [Price]')
plt.ylabel('Predictions [Price]')
plt.title('Decision Tree Regression predictions for the test data')

## 3. Random Forest Regression

In [None]:
n_estimators = [5,10,15,20,25,30, 35, 40, 45, 50]
RMSE = []
mae = []
r2score = []
for n in n_estimators:
    regressor = RandomForestRegressor(n_estimators = n, random_state = 100)
    regressor.fit(X_train, y_train)
    y_prediction = regressor.predict(X_test)
    RMSE.append(sqrt(mean_squared_error(y_true = y_test, y_pred = y_prediction)))
    mae.append(mean_absolute_error(y_test, y_prediction))
    r2score.append(r2_score(y_test, y_prediction))
    
RFRegression_results = pd.DataFrame({'n_estimators':n_estimators,'RMSE':RMSE, 'MAE': mae, 'r2score':r2score})

print(RFRegression_results.round(3))

fig, ax1 = plt.subplots()
ax1.plot(RFRegression_results['n_estimators'], RFRegression_results['r2score'], 'b--')
ax1.set_xlabel('n_estimators')
ax1.set_ylabel('r2score')
ax1.legend(['r2score'], loc ="upper right")
ax2 = ax1.twinx()
ax2.plot(RFRegression_results['n_estimators'], RFRegression_results['MAE'], 'r--')
ax2.set_ylabel('MAE')
ax2.legend(['MAE'],loc ="upper center") 
plt.show()

The best fitting model in this case has an r2score of 0.877 and MAE of 71879.346 with n_estimators = 40.

### True Value vs. Predicted value for the best fitting Random Forest Regression model

In [None]:
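# Use the best-fitting settings found above (n_estimators = 40, same random_state as the search)
sns.regplot(x=y_test, y=RandomForestRegressor(n_estimators=40, random_state=100).fit(X_train, y_train).predict(X_test))
plt.xlabel('True Values [Price]')
plt.ylabel('Predictions [Price]')
plt.title('Random Forest Regression predictions for the test data')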

## Comparing the different regressor models

In [None]:
reg1 = LinearRegression()
reg2 = DecisionTreeRegressor(max_depth = 10,  random_state = 100)
reg3 = RandomForestRegressor(n_estimators = 40, random_state = 100)

reg1.fit(X_train, y_train)
reg2.fit(X_train, y_train)
reg3.fit(X_train, y_train)

pred1 = reg1.predict(X_test[:20])
pred2 = reg2.predict(X_test[:20])
pred3 = reg3.predict(X_test[:20])


plt.figure(figsize=(20,5))
plt.plot(pred1, 'gd', label='LinearRegression')
plt.plot(pred2, 'b^', label='DecisionTreeRegressor')
plt.plot(pred3, 'ys', label='RandomForestRegressor')
plt.plot(y_test[:20], 'ro', label = 'True value')

plt.tick_params(axis='x', which='both', bottom=False, top=False,
                labelbottom=False)
plt.ylabel('price')
plt.xlabel('test samples')
plt.legend(loc="best")
plt.title('Regressor predictions and true value of 20 samples')

plt.show()

The above graph shows the house price predictions with the different regressor models used and the actual price for the first 20 samples in the test dataset.

The highest R² score (0.877) is obtained with the Random Forest Regression model.
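
The comparison can also be collected into a single table. The sketch below is one way to do that, assuming synthetic non-linear data from `make_friedman1` instead of the house data (the names `X_toy`, `y_toy`, `models`, `rows` and `comparison` are hypothetical); the ranking it prints applies only to the synthetic data and says nothing about the results reported in this notebook.

```python
from math import sqrt

import pandas as pd
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear regression data; illustrative only.
X_toy, y_toy = make_friedman1(n_samples=600, n_features=8, noise=1.0, random_state=100)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.33, random_state=100)

models = {
    'LinearRegression': LinearRegression(),
    'DecisionTreeRegressor': DecisionTreeRegressor(max_depth=10, random_state=100),
    'RandomForestRegressor': RandomForestRegressor(n_estimators=40, random_state=100),
}

# Fit each model on the training split and score it on the test split.
rows = []
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        'model': name,
        'RMSE': sqrt(mean_squared_error(y_te, pred)),
        'MAE': mean_absolute_error(y_te, pred),
        'r2score': r2_score(y_te, pred),
    })

comparison = pd.DataFrame(rows).set_index('model').sort_values('r2score', ascending=False)
print(comparison.round(3))
```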

## Conclusions
* Among the models studied in this work, the Random Forest Regression model gives the highest R² score (0.877) for house price prediction.
* High-valued houses are located between latitudes of 47.5° and 47.7° and longitudes of −122.0° and −122.4°. This may be an ideal location to invest in King County. However, the investment decision should be based on one's financial provisions and aspirations.
* The most frequently sold homes are single- or two-storey houses with 3 or 4 bedrooms. A better view and a waterfront location raise the value of a house. It would be good for a builder to keep these in mind while planning a new project.