# HCDE 530 - Mini Project 1 Housing Price Prediction
by Tianyu (Tyler) Fang

## Overview

This project is my first attempt at entry-level machine learning: I use simple and multiple linear regression to build models that predict housing prices from existing data points.

The high-level process of the project is to:

1. Clean and profile the publicly available data on housing prices for King County
2. Visualize the data against several variables (features) to identify and confirm their relevance
3. Build simple linear regression models from individual features (sqft_living and rooms were the ones ultimately used) and compare their performance
4. Build a multiple linear regression model that combines several features, and assess its performance

The data set used by this project is "House Sales in King County, USA", available on Kaggle - https://www.kaggle.com/harlfoxem/housesalesprediction. It contains sale prices for houses in King County sold between May 2014 and May 2015.

As required by the assignment, here is another data set, available from King County Public Data: https://aqua.kingcounty.gov/extranet/assessor/Real%20Property%20Sales.zip

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

## Data Profile

Like most data sets, the "House Sales in King County, USA" data set contains more variables/features than are needed for this analysis. So the first step is to load the data from the CSV file into a pandas DataFrame for examination and filtering.

In [None]:
sales = pd.read_csv('/kaggle/input/housesalesprediction/kc_house_data.csv')

# print the number of data points and types of metadata
print('Number of data points: ' + str(sales.shape[0]))
print('Types of metadata: ' + str(sales.shape[1]))

# sample top 5 lines of sales data to see what's available
# also force pandas to display all columns
pd.set_option('display.max_columns', 110)
sales.head()

There are 20 features available in the dataset. Based on conventional wisdom, I think the following variables will be particularly interesting:

1. price - the target
2. sqft_living - logically the most important factor of price
3. zipcode - location, location, location
4. yr_built/yr_renovated - newer houses should probably have higher prices, but in a tight market like Seattle this may not be true
5. bedrooms/bathrooms - if time permits, I will try to incorporate them into my model too

The next step is to filter the data set down to the variables selected above and clean it up by dropping all rows that contain NaN values.

In [None]:
filtered_data = sales.loc[:, ['price', 'sqft_living', 'yr_built', 'yr_renovated', 'zipcode', 'bedrooms', 'bathrooms']]
# take an explicit copy so that new columns can be added to cleaned_data safely later on
cleaned_data = filtered_data.dropna().copy()

# confirm the shape of cleaned data
print(cleaned_data.shape)

# Next, sample the first 5 rows of data to make sure it contains correct information
cleaned_data.head(5)

In addition, we need to process the yr_built/yr_renovated and bedrooms/bathrooms columns so that each pair shows up as one feature, which makes it easier to build the visualizations and the regression model.

For yr_built/yr_renovated, what really matters is the age since the last renovation. We will take yr_renovated if it's available, otherwise fall back to yr_built, and then subtract it from the current year (2015 for this dataset) to get the age.

However, a house renovated in 2010 should not be treated as being as "new" as a house actually built in 2010. Thus, we need to penalize renovated houses for a better prediction outcome. I'm picking an arbitrary constant of 1.5 here and will see how it impacts the result.

> **Age = yr_renovated != 0 ? (2015 - yr_renovated) * 1.5 : (2015 - yr_built)**

For bedrooms/bathrooms, I will simply combine them. Again, a bathroom should not carry the same significance as a bedroom, so I weight bathrooms by a factor of 0.8 to try it out.

> **Rooms = bedrooms + 0.8 * bathrooms**

In [None]:
# using apply function to create a new column age
cleaned_data['age'] = cleaned_data.apply(lambda row: 1.5 * (2015 - row.yr_renovated) if row.yr_renovated != 0 else (2015 - row.yr_built), axis = 1)

# using apply function to create a new column rooms
cleaned_data['rooms'] = cleaned_data.apply(lambda row: row.bedrooms + (row.bathrooms * 0.8), axis = 1)

# Next, sample the first 5 rows of data to make sure age and rooms look correct
cleaned_data.head(5)
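
The two derived features can also be checked by hand on a couple of made-up houses. The sketch below restates the Age and Rooms rules as plain Python functions. The function names, constant names, and sample values are illustrative assumptions and are not part of the notebook.

```python
# Illustrative sketch only: all names and sample values here are hypothetical.
REFERENCE_YEAR = 2015       # "current year" used for this data set in the text above
RENOVATION_PENALTY = 1.5    # arbitrary constant chosen in the text above
BATHROOM_WEIGHT = 0.8       # arbitrary constant chosen in the text above


def adjusted_age(yr_built, yr_renovated):
    """Penalized age since renovation, or plain age since construction."""
    if yr_renovated != 0:
        return RENOVATION_PENALTY * (REFERENCE_YEAR - yr_renovated)
    return REFERENCE_YEAR - yr_built


def combined_rooms(bedrooms, bathrooms):
    """Bedrooms plus down-weighted bathrooms."""
    return bedrooms + BATHROOM_WEIGHT * bathrooms


# a house built in 1955 and never renovated (yr_renovated == 0)
print(adjusted_age(1955, 0))
# a house built in 1951 and renovated in 1991
print(adjusted_age(1951, 1991))
# a house with 3 bedrooms and 2.25 bathrooms
print(combined_rooms(3, 2.25))
```

Under this rule, the house renovated in 1991 gets an age of 36. That makes it "younger" than the unrenovated 1955 house (60) but "older" than a house actually built in 1991 would be (24), which is the intent of the penalty.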

## Data Visualization

With the data cleaned and filtered, we are ready for the next step of the project: visualizing the data to confirm the relevancy guesses I made above. A scatter plot should be a good choice here.

In [None]:
import matplotlib.pyplot as plt

# keep only the columns needed from here on; yr_built, yr_renovated, bedrooms, and bathrooms
# are no longer needed because they are already folded into age and rooms
final_data = cleaned_data.loc[:, ['price', 'sqft_living', 'age', 'rooms', 'zipcode']]

final_data.plot.scatter(x='sqft_living', y='price', c='DarkBlue')
final_data.plot.scatter(x='age', y='price', c='Blue')
final_data.plot.scatter(x='rooms', y='price', c='Yellow')
final_data.plot.scatter(x='zipcode', y='price', c='Red')

From the scatter plots above, it's easy to tell that there is some linear relationship between price and both sqft_living and rooms, which makes sense. Combining them will probably yield an even more linear relationship. Let's try that with a 3D scatter plot, which can plot 2 factors together.

In [None]:
from mpl_toolkits import mplot3d

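# create 3D scatter plot with sqft_living and rooms against price
fig = plt.figure()
ax_3d = plt.axes(projection='3d')
xdata = final_data['sqft_living']
ydata = final_data['rooms']
zdata = final_data['price']
ax_3d.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Blues')
# label the axes so the plot can be read without referring back to the code
ax_3d.set_xlabel('sqft_living')
ax_3d.set_ylabel('rooms')
ax_3d.set_zlabel('price');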

Age and zipcode, however, do not have a direct relationship with price like the other two features, and will probably only be meaningful when some other variables are held fixed. Let's try that by sampling the houses with 2800 <= sqft_living <= 3200 and 6.0 <= rooms <= 8.0, and then looking at the most popular zipcodes among them. I expected to see an inverse linear relationship between age and price. UPDATE: Unfortunately, the plots do not show this.

In [None]:
# sample the data points matching the criteria
sampled_data = final_data.loc[(final_data.sqft_living >= 2800) & (final_data.sqft_living <= 3200) & (final_data.rooms >= 6.0) & (final_data.rooms <= 8.0)] 

# confirm the shape of sampled data
print(sampled_data.shape)

# use a bar chart to figure out the popular zipcodes
zip_code_data = sampled_data.groupby('zipcode')["price"].count().reset_index(name="count")
# select the top 5 zipcodes after sorting for bar chart
zip_code_data = zip_code_data.sort_values(by='count', ascending=False).head(5)
zip_code_data.plot.bar(x='zipcode', y='count', figsize=(30, 10))

# now let's try each popular zipcode one by one
for zipcode in [98075, 98059, 98052, 98006, 98038]:
    zipcode_data = sampled_data.loc[sampled_data.zipcode == zipcode]
    zipcode_data.plot.scatter(x='age', y='price', c='Blue')


## Simple Linear Regression Model

Here comes the exciting part! Let's start building the machine learning model using linear regression. I'm using the popular scikit-learn library (sklearn), which was recommended by a few tutorials. Based on the data visualization above, I will use sqft_living and rooms for this analysis.

### Model with sqft_living

Before fitting, I need to split the data into a training set and a test set in order to assess the performance of my model. I'm choosing a 75-25 split here.

In [None]:
import numpy as np 
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression 

x_sqft = np.array(final_data['sqft_living']).reshape(-1, 1)
y_price = np.array(final_data['price'])

# splitting the data for simple regression based on sqft_living
x_sqft_train, x_sqft_test, y_price_train, y_price_test = train_test_split(x_sqft,y_price,test_size=1/4, random_state=0)

# fitting simple linear regression to the training set
linear_regressor = LinearRegression()
linear_regressor.fit(x_sqft_train, y_price_train)

# using the linear regression to predict prices in training set
sqft_train_prediction = linear_regressor.predict(x_sqft_train)

# visualizing the prediction line in training set 
plt.scatter(x_sqft_train, y_price_train, color= 'yellow')
plt.plot(x_sqft_train, sqft_train_prediction, color = 'darkblue')
plt.xlabel("sqft_living")
plt.ylabel("price")
plt.show()
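
For a single feature, the line that ordinary least squares fits has a simple closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below computes both in plain Python on made-up numbers. `fit_line` is a hypothetical helper, not part of sklearn, and is only meant to illustrate what `LinearRegression` estimates in the one-feature case.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # sum of cross-deviations and sum of squared deviations of x
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# made-up data lying exactly on price = 200 * sqft + 50000
sqft = [1000, 1500, 2000, 3000]
price = [250000, 350000, 450000, 650000]

slope, intercept = fit_line(sqft, price)
print(slope, intercept)
```

Because the made-up points lie exactly on a line, the fit recovers that line. With real sale prices the points scatter around the line, which is what the error measure in the next step quantifies.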



To assess the performance of this model, I will calculate the average absolute difference between the predictions on the training data and the actual prices in the training set. I will then calculate the same quantity on the test set, and use the two numbers to derive a percentage of consistency, which will be my benchmark for assessing model performance.

> Average Prediction Difference (APD) = avg(|prediction - actual|)

> Percentage of Consistency (PoC) = (1 - |APD(training set) - APD(test set)| / APD(training set)) * 100

In [None]:
# function to calculate the average difference between prediction and actual price
def calculateDiffAvg(y_data, y_prediction):
    print('Length of data set: ' + str(len(y_prediction)))
    sum_diff = 0
    for index in range(len(y_prediction)):
        sum_diff += abs(y_prediction[index] - y_data[index])
    average_diff = sum_diff / len(y_prediction)
    return average_diff

# function to calculate the percentage of consistency
def calculatePercentConsistent(diff_train, diff_test):
    return (1 - (abs(diff_train - diff_test) / diff_train)) * 100

diff_avg_train = calculateDiffAvg(y_price_train, sqft_train_prediction)
print('Average difference in training set: ' + diff_avg_train.astype(str))

# using the linear regression to predict prices in test set
sqft_test_prediction = linear_regressor.predict(x_sqft_test)
diff_avg_test = calculateDiffAvg(y_price_test, sqft_test_prediction)
print('Average difference in test set: ' + diff_avg_test.astype(str))

# calculate the percentage of consistency
per_consist = calculatePercentConsistent(diff_avg_train,diff_avg_test)
print('Percentage of consistency: %s' % per_consist)
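
To make the two measures concrete, here is a small worked sketch on made-up prices. The function names and all numbers are illustrative assumptions. APD as defined above is what is commonly called mean absolute error.

```python
def average_prediction_difference(actual, predicted):
    """Mean of |prediction - actual| over all data points."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def percentage_of_consistency(apd_train, apd_test):
    """How close the test-set APD is to the training-set APD, in percent."""
    return (1 - abs(apd_train - apd_test) / apd_train) * 100


# made-up training data: absolute errors of 20000, 50000 and 20000
train_actual = [300000, 450000, 500000]
train_predicted = [320000, 400000, 520000]

# made-up test data: absolute errors of 30000 and 36000
test_actual = [400000, 600000]
test_predicted = [430000, 564000]

apd_train = average_prediction_difference(train_actual, train_predicted)
apd_test = average_prediction_difference(test_actual, test_predicted)
poc = percentage_of_consistency(apd_train, apd_test)
print(apd_train, apd_test, poc)
```

Here the training APD is 30000 and the test APD is 33000, so the two differ by 10% of the training value and the PoC comes out at about 90. Note that PoC only says how similar the errors on the two sets are: a model can score high on it while still having a large APD.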

### Model with rooms

The simple linear regression model built on sqft_living has an average difference of 173K, which is obviously not great in terms of accuracy. But at least it achieves a high percentage of consistency when run on the test data as well.

Now repeat the same process for rooms.

In [None]:
x_room = np.array(final_data['rooms']).reshape(-1, 1)

# splitting the data for simple regression based on rooms
x_room_train, x_room_test, y_price_train, y_price_test = train_test_split(x_room,y_price,test_size=1/4, random_state=0)

# fitting simple linear regression to the training set
linear_regressor_room = LinearRegression()
linear_regressor_room.fit(x_room_train, y_price_train)

# using the linear regression to predict prices in training set
room_train_prediction = linear_regressor_room.predict(x_room_train)

# visualizing the prediction line in training set 
plt.scatter(x_room_train, y_price_train, color= 'red')
plt.plot(x_room_train, room_train_prediction, color = 'darkblue')
plt.xlabel("rooms")
plt.ylabel("price")
plt.show()

In [None]:
diff_avg_train = calculateDiffAvg(y_price_train, room_train_prediction)
print('Average difference in training set: ' + diff_avg_train.astype(str))

# using the linear regression to predict prices in test set
room_test_prediction = linear_regressor_room.predict(x_room_test)
diff_avg_test = calculateDiffAvg(y_price_test, room_test_prediction)
print('Average difference in test set: ' + diff_avg_test.astype(str))

# calculate the percentage of consistency
per_consist = calculatePercentConsistent(diff_avg_train,diff_avg_test)
print('Percentage of consistency: %s' % per_consist)

From the result above, we can see that the simple linear regression model on rooms yields an average difference of roughly 210K, which is worse than the 173K of the model on sqft_living.

## Multiple Linear Regression Model

Now, let's try combining multiple variables into the linear regression model, in the hope that it does better on both Average Prediction Difference and Percentage of Consistency. The variables I'm picking are sqft_living, rooms and age. I'm leaving out zipcode because it is a categorical label rather than a numeric quantity, so feeding it in as a number is unlikely to contribute meaningfully.

In [None]:
# select multiple variables for linear regression this time
x_multi = final_data.loc[:, ['sqft_living', 'rooms', 'age']].values

# splitting the data for multiple regression based on sqft_living, rooms and age
x_multi_train, x_multi_test, y_price_train, y_price_test = train_test_split(x_multi,y_price,test_size=1/4, random_state=0)

# fitting linear regression to the training set
linear_regressor_multi = LinearRegression()
linear_regressor_multi.fit(x_multi_train, y_price_train)

# using the linear regression to predict prices in training set
multi_train_prediction = linear_regressor_multi.predict(x_multi_train)

diff_avg_train = calculateDiffAvg(y_price_train, multi_train_prediction)
print('Average difference in training set: ' + diff_avg_train.astype(str))

# using the linear regression to predict prices in test set
multi_test_prediction = linear_regressor_multi.predict(x_multi_test)
diff_avg_test = calculateDiffAvg(y_price_test, multi_test_prediction)
print('Average difference in test set: ' + diff_avg_test.astype(str))

# calculate the percentage of consistency
per_consist = calculatePercentConsistent(diff_avg_train,diff_avg_test)
print('Percentage of consistency: %s' % per_consist)


As shown above, by using multiple linear regression, I was able to reduce the average difference in prediction from 173K (using just sqft_living) to 168K, which is a marginal improvement, but not as much as I hoped. I suspect that if I filter the data set down to a single zipcode and redo the analysis, the result will be much better.
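
Another common way to bring location into a linear model, not tried in this notebook, is one-hot encoding: each zipcode gets its own 0/1 column, so the model can learn a separate offset per zipcode instead of treating the code as a number. The sketch below shows the idea in plain Python on made-up values. `one_hot` is a hypothetical helper, and pandas offers the same transformation for DataFrames through `pd.get_dummies`.

```python
def one_hot(values):
    """Return the sorted distinct categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


# made-up sample of zipcodes, one per house
zipcodes = [98075, 98006, 98075, 98052]

categories, encoded = one_hot(zipcodes)
print(categories)
for zipcode, row in zip(zipcodes, encoded):
    print(zipcode, row)
```

Each row contains exactly one 1, in the column of that house's zipcode. These columns could then be placed alongside sqft_living, rooms and age as inputs to the regression.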

## Final Thoughts

In this project I tried out pandas, numpy and sklearn to clean, analyze, visualize and predict data based on the classic "House Sales in King County, USA" dataset found on Kaggle. Here are a few thoughts, after completing the journey, on what I learned and what can be improved:

1. Visualization is a powerful tool for gaining insight and should always be tried before heading into the machine learning stage. For example, I had always assumed that age would have a strong negative linear relationship with price, but as the visualization showed, this assumption was not true.
2. Linear regression, as the most basic model, is obviously very limited when analyzing housing prices, which in reality are affected by many more factors than those analyzed here. A more powerful and comprehensive model/algorithm is needed to take this analysis further.
3. As I mentioned, if I filter the data set down to a single zipcode and redo the analysis, I suspect the result will be much better. Unfortunately I don't have time to finish this before the deadline.