# Melbourne house price analysis
- In this notebook, I aim to visualise and analyse Melbourne's house prices in 2016 and 2017 using the data from https://www.kaggle.com/dansbecker/melbourne-housing-snapshot.
- Data preprocessing, visualisation and prediction modelling will be covered.
- Machine learning algorithms that predict the price of a house given the features will also be developed.

In [47]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv
import seaborn as sns
import pickle
import re

## Preprocessing

In [48]:
houses = pd.read_csv('data/melb_data.csv')  
houses

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000,S,Biggin,3/12/2016,2.5,3067,...,1,1.0,202,,,Yarra,-37.79960,144.99840,Northern Metropolitan,4019
1,Abbotsford,25 Bloomburg St,2,h,1035000,S,Biggin,4/02/2016,2.5,3067,...,1,0.0,156,79.0,1900.0,Yarra,-37.80790,144.99340,Northern Metropolitan,4019
2,Abbotsford,5 Charles St,3,h,1465000,SP,Biggin,4/03/2017,2.5,3067,...,2,0.0,134,150.0,1900.0,Yarra,-37.80930,144.99440,Northern Metropolitan,4019
3,Abbotsford,40 Federation La,3,h,850000,PI,Biggin,4/03/2017,2.5,3067,...,2,1.0,94,,,Yarra,-37.79690,144.99690,Northern Metropolitan,4019
4,Abbotsford,55a Park St,4,h,1600000,VB,Nelson,4/06/2016,2.5,3067,...,1,2.0,120,142.0,2014.0,Yarra,-37.80720,144.99410,Northern Metropolitan,4019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13575,Wheelers Hill,12 Strada Cr,4,h,1245000,S,Barry,26/08/2017,16.7,3150,...,2,2.0,652,,1981.0,,-37.90562,145.16761,South-Eastern Metropolitan,7392
13576,Williamstown,77 Merrett Dr,3,h,1031000,SP,Williams,26/08/2017,6.8,3016,...,2,2.0,333,133.0,1995.0,,-37.85927,144.87904,Western Metropolitan,6380
13577,Williamstown,83 Power St,3,h,1170000,S,Raine,26/08/2017,6.8,3016,...,2,4.0,436,,1997.0,,-37.85274,144.88738,Western Metropolitan,6380
13578,Williamstown,96 Verdon St,4,h,2500000,PI,Sweeney,26/08/2017,6.8,3016,...,1,5.0,866,157.0,1920.0,,-37.85908,144.89299,Western Metropolitan,6380


In [49]:
houses.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  object 
 1   Address        13580 non-null  object 
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  object 
 4   Price          13580 non-null  int64  
 5   Method         13580 non-null  object 
 6   SellerG        13580 non-null  object 
 7   Date           13580 non-null  object 
 8   Distance       13580 non-null  float64
 9   Postcode       13580 non-null  int64  
 10  Bedroom2       13580 non-null  int64  
 11  Bathroom       13580 non-null  int64  
 12  Car            13518 non-null  float64
 13  Landsize       13580 non-null  int64  
 14  BuildingArea   7130 non-null   float64
 15  YearBuilt      8205 non-null   float64
 16  CouncilArea    12211 non-null  object 
 17  Lattitude      13580 non-null  float64
 18  Longti

- The dataset has 21 columns, several of which have the 'object' dtype (strings).
    - We will probably need to convert these to dummy variables for the modelling phase.
    - The 'Price' column will be our target variable; the remaining columns are features that help explain the price.
- There are quite a few null values (NaN) in the 'BuildingArea', 'YearBuilt' and 'CouncilArea' columns, and 'Car' is missing a handful as well (62). We need to handle these values somehow.
    1. BuildingArea
        - We only have 7130 non-null rows, meaning about 6450 values are missing for this feature. That is almost half of all our training examples.
        - Since too many rows are unobserved for this feature, it is safer to simply disregard it. Imputing such a large number of rows would likely introduce bias and fail to capture the true characteristics of the feature.
    2. YearBuilt
        - Similarly to 'BuildingArea', many values are unobserved for this feature (about 5400 of 13580). We will disregard it as well.
    3. CouncilArea
        - Relatively few values are unobserved for this feature (1369). We can aim to impute these rows and preserve the feature in our analysis.
        - Another reason to keep this feature is that the council area (the governing council for the area) is likely to be a good predictor of the house price.
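- The drop-or-impute reasoning above can be sketched as a rule on the fraction of missing values per column. This is only an illustration on a made-up toy frame; the 30% cut-off (`MAX_MISSING`) is an assumption, not a value used elsewhere in this notebook. On the real counts from `houses.info()`, 'BuildingArea' is about 47% missing, 'YearBuilt' about 40% and 'CouncilArea' about 10%.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "Rooms": [2, 3, 3, 4],
    "BuildingArea": [79.0, np.nan, np.nan, 142.0],
    "CouncilArea": ["Yarra", "Yarra", None, "Hume"],
})

# Fraction of missing values in each column.
missing_frac = toy.isnull().mean()

# Hypothetical cut-off: columns above it are dropped, the rest are imputed.
MAX_MISSING = 0.3
to_drop = missing_frac[missing_frac > MAX_MISSING].index.tolist()
to_impute = missing_frac[(missing_frac > 0) & (missing_frac <= MAX_MISSING)].index.tolist()

print(missing_frac)
print("drop:", to_drop)
print("impute:", to_impute)
```

- A fixed cut-off is a blunt tool; how informative the feature is matters as well, which is the second reason given for keeping 'CouncilArea'.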

In [50]:
houses.drop(["BuildingArea", "YearBuilt"], axis=1, inplace=True)

- Move the 'Price' target column to the last position of the dataframe.

In [51]:
price = houses.pop('Price')
houses['Price'] = price

- Let's see if we have any null values.

In [52]:
is_NaN = houses.isnull()
row_has_NaN = is_NaN.any(axis=1)
rows_with_NaN = houses[row_has_NaN]

In [53]:
rows_with_NaN.shape

(1369, 19)

- 1369 rows contain at least one NaN, which matches the number of unobserved 'CouncilArea' values (13580 - 12211).
- Let's try to impute these values, as mentioned above.

In [54]:
all_coords = list(zip(houses["Lattitude"].values, houses["Longtitude"]))

In [55]:
from math import sin, cos, sqrt, atan2, radians

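def calc_dist(coord1, coord2):
    """
    Calculate the great-circle (haversine) distance in km between two
    (latitude, longitude) coordinates given in degrees.
    """
    R = 6373.0  # approximate Earth radius in km

    # Keep the sign of each coordinate: taking abs() would give wrong results
    # for pairs on opposite sides of the equator or the prime meridian.
    lat1 = radians(coord1[0])
    lon1 = radians(coord1[1])
    lat2 = radians(coord2[0])
    lon2 = radians(coord2[1])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    # haversine formula
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    return R * c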

In [56]:
# get closest council area for each row with NaN
# (brute force: every NaN row is compared against every row, so this is slow)

store = []

for ind, row in rows_with_NaN.iterrows():
    coord = (row["Lattitude"], row["Longtitude"])

    min_dist = float("inf")
    index = -1
    # enumerate gives the position directly, avoiding a list.index() scan
    for i, one_coord in enumerate(all_coords):
        dist = calc_dist(coord, one_coord)
        # dist == 0 is the row itself (or another sale at the same coordinates)
        if dist < min_dist and dist != 0:
            min_dist = dist
            index = i
    store.append([min_dist, index])

KeyboardInterrupt: 
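- The nested Python loop above is slow enough that the cell was interrupted. A possible speed-up, sketched here under assumptions, is to compute the haversine distance from one query point to all points at once with NumPy. The function name `nearest_other_point` and the toy coordinates are made up for illustration; this is not the code that produced the pickled results loaded below.

```python
import numpy as np

EARTH_RADIUS_KM = 6373.0  # same approximate radius as calc_dist

def nearest_other_point(query, points):
    """Return (distance_km, index) of the closest row of `points` that is not
    at zero distance from `query`. Coordinates are (latitude, longitude) in degrees."""
    lat1, lon1 = np.radians(query)
    lat2 = np.radians(points[:, 0])
    lon2 = np.radians(points[:, 1])

    # Haversine formula, evaluated for every row of `points` in one pass.
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    dist = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

    dist[dist == 0] = np.inf  # skip the query itself and exact duplicates
    idx = int(np.argmin(dist))
    return float(dist[idx]), idx

# Toy coordinates in the style of the Lattitude/Longtitude columns.
points = np.array([
    [-37.7996, 144.9984],
    [-37.8079, 144.9934],
    [-37.8093, 144.9944],
    [-37.9056, 145.1676],
])
dist_km, idx = nearest_other_point(points[1], points)
print(idx, round(dist_km, 3))
```

- Unlike the loop, this sketch returns index 0 with an infinite distance when no other point exists, so that case would need separate handling.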

In [57]:
# pickle.dump(store, open('data/closest_council_for_NaN.sav', 'wb'))
store = pickle.load(open('data/closest_council_for_NaN.sav', 'rb'))

In [58]:
replace = []
for dist, ind in store:
    replace.append(houses.iloc[ind,:]["CouncilArea"])
replace = np.array(replace)

In [59]:
replace

array(['Bayside', 'Darebin', 'Moonee Valley', ..., 'Hobsons Bay',
       'Hobsons Bay', 'Maribyrnong'], dtype='<U32')

In [60]:
# fill nan values with the council area of the closest instance
# (replace is in the same order as rows_with_NaN)

houses.loc[rows_with_NaN.index, "CouncilArea"] = replace

In [61]:
is_NaN = houses.isnull()
row_has_NaN = is_NaN.any(axis=1)
rows_with_NaN = houses[row_has_NaN]

In [62]:
rows_with_NaN.shape

(62, 19)

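- 62 rows still contain NaN values, but this is a small number compared to our training set of ~13000 rows.
- The count matches the 62 missing 'Car' values reported by houses.info() (13580 - 13518), so these rows are most likely missing 'Car'. Rows whose closest instance also had a NaN council area are another possible cause.
- It is probably safe to remove these rows completely.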
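- One caveat when counting what is still missing: the imputed labels were collected with `np.array(replace)`, and NumPy stores a float NaN in a list of strings as the text 'nan'. The `'<U32'` dtype shown for `replace` is consistent with that, although this is an inference rather than something checked in the notebook. The made-up example below shows the effect and one possible way around it (keeping the labels in an object-dtype Series).

```python
import numpy as np
import pandas as pd

# Made-up labels; the second one is a genuine missing value.
labels = ["Yarra", np.nan, "Hume"]

as_array = np.array(labels)      # NumPy picks a unicode string dtype for the mix
print(as_array.dtype.kind)       # 'U' means unicode text
print(repr(as_array[1]))         # the NaN has become ordinary text

# The text is no longer recognised as a missing value.
print(pd.isnull(as_array[1]))

# Keeping the labels as Python objects preserves the real NaN.
kept = pd.Series(labels, dtype=object)
print(int(kept.isnull().sum()))
```

- If this applies here, such rows would survive `dropna()` and would have to be filtered by comparing against the exact text 'nan'.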

In [63]:
houses = houses.dropna()

In [64]:
# remove rows whose council area is a placeholder string rather than a real name
houses = houses[houses.CouncilArea != "NaN"]
houses = houses[houses.CouncilArea != "Unavailable"]

In [65]:
houses.isnull().sum()

Suburb           0
Address          0
Rooms            0
Type             0
Method           0
SellerG          0
Date             0
Distance         0
Postcode         0
Bedroom2         0
Bathroom         0
Car              0
Landsize         0
CouncilArea      0
Lattitude        0
Longtitude       0
Regionname       0
Propertycount    0
Price            0
dtype: int64

- Ok, good.
- Now, I'm going to filter out all houses that have 4 rooms or more, since these are too expensive and irrelevant to me.

In [66]:
houses = houses[houses["Rooms"] <= 3]
houses

Unnamed: 0,Suburb,Address,Rooms,Type,Method,SellerG,Date,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount,Price
0,Abbotsford,85 Turner St,2,h,S,Biggin,3/12/2016,2.5,3067,2,1,1.0,202,Yarra,-37.79960,144.99840,Northern Metropolitan,4019,1480000
1,Abbotsford,25 Bloomburg St,2,h,S,Biggin,4/02/2016,2.5,3067,2,1,0.0,156,Yarra,-37.80790,144.99340,Northern Metropolitan,4019,1035000
2,Abbotsford,5 Charles St,3,h,SP,Biggin,4/03/2017,2.5,3067,3,2,0.0,134,Yarra,-37.80930,144.99440,Northern Metropolitan,4019,1465000
3,Abbotsford,40 Federation La,3,h,PI,Biggin,4/03/2017,2.5,3067,3,2,1.0,94,Yarra,-37.79690,144.99690,Northern Metropolitan,4019,850000
5,Abbotsford,129 Charles St,2,h,S,Jellis,7/05/2016,2.5,3067,2,1,0.0,181,Yarra,-37.80410,144.99530,Northern Metropolitan,4019,941000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13570,Wantirna South,34 Fewster Dr,3,h,S,Barry,26/08/2017,14.7,3152,3,2,2.0,674,,-37.88360,145.22805,Eastern Metropolitan,7082,970000
13572,Watsonia,76 Kenmare St,2,h,PI,Morrison,26/08/2017,14.5,3087,2,1,1.0,210,Banyule,-37.70657,145.07878,Northern Metropolitan,2329,650000
13574,Westmeadows,9 Black St,3,h,S,Red,26/08/2017,16.5,3049,3,2,2.0,256,Hume,-37.67917,144.89390,Northern Metropolitan,2474,582000
13576,Williamstown,77 Merrett Dr,3,h,SP,Williams,26/08/2017,6.8,3016,3,2,2.0,333,Hobsons Bay,-37.85927,144.87904,Western Metropolitan,6380,1031000


In [67]:
houses.describe()

Unnamed: 0,Rooms,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,Lattitude,Longtitude,Propertycount,Price
count,10151.0,10151.0,10151.0,10151.0,10151.0,10151.0,10151.0,10151.0,10151.0,10151.0,10151.0
mean,2.509605,9.517693,3100.819525,2.508817,1.322628,1.440548,519.750862,-37.807822,144.988298,7580.555807,923805.4
std,0.619297,5.61654,81.867758,0.668279,0.514269,0.884005,4550.877656,0.075383,0.096952,4484.245296,480342.1
min,1.0,0.0,3000.0,0.0,0.0,0.0,0.0,-38.18255,144.43181,249.0,85000.0
25%,2.0,5.5,3044.0,2.0,1.0,1.0,129.0,-37.8536,144.928,4380.0,600000.0
50%,3.0,8.8,3079.0,3.0,1.0,1.0,314.0,-37.80196,144.99472,6567.0,813500.0
75%,3.0,12.3,3146.0,3.0,2.0,2.0,602.0,-37.75669,145.046835,10412.0,1155000.0
max,3.0,45.9,3977.0,20.0,6.0,10.0,433014.0,-37.40853,145.52635,21650.0,9000000.0


- Let's now create a correlation heatmap to see if some features are correlated.

In [68]:
# using a styled pandas dataframe from https://stackoverflow.com/a/42323184/1215012
cmap = 'coolwarm'
# numeric_only=True skips the string columns (required by pandas >= 2.0)
corr = houses.corr(numeric_only=True)

def magnify():
    return [dict(selector="th", props=[("font-size", "7pt")]),
            dict(selector="td", props=[('padding', "0em 0em")]),
            dict(selector="th:hover", props=[("font-size", "12pt")]),
            dict(selector="tr:hover td:hover", 
                 props=[('max-width', '200px'), ('font-size', '12pt')])
]

# Styler.set_precision was removed in pandas 2.0; format(precision=2) replaces it
corr.style.background_gradient(cmap, axis=1)\
    .set_properties(**{'max-width': '80px', 'font-size': '10pt'})\
    .set_caption("Hover to magnify")\
    .format(precision=2)\
    .set_table_styles(magnify())

Unnamed: 0,Rooms,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,Lattitude,Longtitude,Propertycount,Price
Rooms,1.0,0.32,-0.02,0.9,0.34,0.34,0.02,0.08,-0.0,-0.08,0.4
Distance,0.32,1.0,0.37,0.3,0.06,0.28,0.0,-0.11,0.22,-0.07,-0.17
Postcode,-0.02,0.37,1.0,-0.01,0.09,0.03,0.01,-0.44,0.47,0.07,0.12
Bedroom2,0.9,0.3,-0.01,1.0,0.34,0.33,0.02,0.08,0.0,-0.08,0.38
Bathroom,0.34,0.06,0.09,0.34,1.0,0.2,0.04,-0.06,0.04,-0.03,0.3
Car,0.34,0.28,0.03,0.33,0.2,1.0,0.02,0.01,0.03,-0.02,0.16
Landsize,0.02,0.0,0.01,0.02,0.04,0.02,1.0,0.0,0.01,-0.0,0.04
Lattitude,0.08,-0.11,-0.44,0.08,-0.06,0.01,0.0,1.0,-0.35,0.06,-0.2
Longtitude,-0.0,0.22,0.47,0.0,0.04,0.03,0.01,-0.35,1.0,0.09,0.18
Propertycount,-0.08,-0.07,0.07,-0.08,-0.03,-0.02,-0.0,0.06,0.09,1.0,-0.04


- From the correlation heatmap above, we see that the features 'Rooms' and 'Bedroom2' are highly correlated, with a correlation of 0.90.
    - According to the data source, both columns measure the same thing, the number of rooms in the house; they just come from different sources.
    - I will discard 'Bedroom2' to avoid multicollinearity, which can cause trouble when training prediction models.
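- Instead of reading the heatmap by eye, highly correlated pairs can also be listed programmatically. The sketch below uses made-up numbers and a hypothetical cut-off (`THRESHOLD = 0.85`); it only illustrates the idea and was not used to choose the column dropped in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up numeric columns; "Bedroom2" is built to track "Rooms" closely.
toy = pd.DataFrame({
    "Rooms":    [1, 2, 2, 3, 3, 4],
    "Bedroom2": [1, 2, 2, 3, 2, 4],
    "Distance": [9.0, 2.5, 14.7, 6.8, 16.5, 3.0],
})

THRESHOLD = 0.85  # hypothetical cut-off for "highly correlated"

corr = toy.corr().abs()
# Keep the strict upper triangle so each pair appears only once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high = pairs[pairs > THRESHOLD]
print(high)
```

- Which column of a flagged pair to drop is still a judgement call; here 'Rooms' is kept because it has the slightly higher correlation with 'Price' (0.40 against 0.38).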

In [69]:
houses.drop('Bedroom2', axis=1, inplace=True)

In [70]:
houses.shape

(10151, 18)

In [71]:
houses

Unnamed: 0,Suburb,Address,Rooms,Type,Method,SellerG,Date,Distance,Postcode,Bathroom,Car,Landsize,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount,Price
0,Abbotsford,85 Turner St,2,h,S,Biggin,3/12/2016,2.5,3067,1,1.0,202,Yarra,-37.79960,144.99840,Northern Metropolitan,4019,1480000
1,Abbotsford,25 Bloomburg St,2,h,S,Biggin,4/02/2016,2.5,3067,1,0.0,156,Yarra,-37.80790,144.99340,Northern Metropolitan,4019,1035000
2,Abbotsford,5 Charles St,3,h,SP,Biggin,4/03/2017,2.5,3067,2,0.0,134,Yarra,-37.80930,144.99440,Northern Metropolitan,4019,1465000
3,Abbotsford,40 Federation La,3,h,PI,Biggin,4/03/2017,2.5,3067,2,1.0,94,Yarra,-37.79690,144.99690,Northern Metropolitan,4019,850000
5,Abbotsford,129 Charles St,2,h,S,Jellis,7/05/2016,2.5,3067,1,0.0,181,Yarra,-37.80410,144.99530,Northern Metropolitan,4019,941000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13570,Wantirna South,34 Fewster Dr,3,h,S,Barry,26/08/2017,14.7,3152,2,2.0,674,,-37.88360,145.22805,Eastern Metropolitan,7082,970000
13572,Watsonia,76 Kenmare St,2,h,PI,Morrison,26/08/2017,14.5,3087,1,1.0,210,Banyule,-37.70657,145.07878,Northern Metropolitan,2329,650000
13574,Westmeadows,9 Black St,3,h,S,Red,26/08/2017,16.5,3049,2,2.0,256,Hume,-37.67917,144.89390,Northern Metropolitan,2474,582000
13576,Williamstown,77 Merrett Dr,3,h,SP,Williams,26/08/2017,6.8,3016,2,2.0,333,Hobsons Bay,-37.85927,144.87904,Western Metropolitan,6380,1031000


In [72]:
houses.drop(['Method', 'SellerG', 'Date', 'Address'], axis=1, inplace=True)

In [93]:
#pickle.dump(houses, open('data/cleaned_houses.sav', 'wb'))
houses = pickle.load(open('data/cleaned_houses.sav', 'rb'))