# Multiple Linear Regression with Melbourne Housing Data

Import the required modules

In [1]:
from matplotlib import pyplot as plt
import pandas as pd

Load the Melbourne housing dataset into a pandas DataFrame and inspect the column data types  
Source: https://www.kaggle.com/anthonypino/melbourne-housing-market/home

In [2]:
df = pd.read_csv('data/Melbourne_housing_FULL.csv')
df.dtypes

Suburb            object
Address           object
Rooms              int64
Type              object
Price            float64
Method            object
SellerG           object
Date              object
Distance         float64
Postcode         float64
Bedroom2         float64
Bathroom         float64
Car              float64
Landsize         float64
BuildingArea     float64
YearBuilt        float64
CouncilArea       object
Lattitude        float64
Longtitude       float64
Regionname        object
Propertycount    float64
dtype: object

Keep only the numeric columns, and drop Postcode, which is stored as a number but is a categorical label rather than a quantity

In [3]:
cols = [col for col in df.columns if df[col].dtype != 'object']
df = df[cols]
df = df.drop(labels='Postcode', axis=1)
df.dtypes

Rooms              int64
Price            float64
Distance         float64
Bedroom2         float64
Bathroom         float64
Car              float64
Landsize         float64
BuildingArea     float64
YearBuilt        float64
Lattitude        float64
Longtitude       float64
Propertycount    float64
dtype: object
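The dtype filter can also be written with `DataFrame.select_dtypes`, which selects numeric columns directly instead of excluding `object` ones. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "Suburb": ["Abbotsford", "Richmond"],
    "Rooms": [2, 3],
    "Price": [1035000.0, 1465000.0],
})

# Keep only integer and float columns.
numeric = toy.select_dtypes(include="number")
numeric_cols = list(numeric.columns)
```

Selecting by `"number"` states the intent explicitly and does not depend on how text columns happen to be stored.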

Find the fraction of NaN values in each column

In [4]:
df.isnull().sum()/df.shape[0]

Rooms            0.000000
Price            0.218321
Distance         0.000029
Bedroom2         0.235735
Bathroom         0.235993
Car              0.250394
Landsize         0.338813
BuildingArea     0.605761
YearBuilt        0.553863
Lattitude        0.228821
Longtitude       0.228821
Propertycount    0.000086
dtype: float64
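As a quick check on a made-up frame (the `toy` data is illustrative only), the null count divided by the row count is the same as the mean of the boolean null mask, so `isnull().mean()` is a shorter way to get these fractions:

```python
import numpy as np
import pandas as pd

# Toy frame with known amounts of missing data (values are made up).
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],                   # no nulls
    "Price": [1.0e6, np.nan, 8.5e5, np.nan], # 2 of 4 null
    "Car": [1.0, 2.0, np.nan, 1.0],          # 1 of 4 null
})

# Null count divided by number of rows.
frac_sum = toy.isnull().sum() / toy.shape[0]

# Mean of the True/False null mask gives the same fractions.
frac_mean = toy.isnull().mean()
```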

Drop all houses without a recorded price, then recompute the fraction of null values in each column

In [5]:
df = df.loc[df['Price'].notnull()]
df.isnull().sum()/df.shape[0]

Rooms            0.000000
Price            0.000000
Distance         0.000037
Bedroom2         0.236393
Bathroom         0.236613
Car              0.250450
Landsize         0.340037
BuildingArea     0.608911
YearBuilt        0.556502
Lattitude        0.229530
Longtitude       0.229530
Propertycount    0.000110
dtype: float64

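Drop all columns in which 25% or more of the values are null. Car (25.05% null) falls just over the threshold and is dropped along with Landsize, BuildingArea, and YearBuilt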

In [6]:
cols = [col for col in df.columns if df[col].isnull().sum()/df.shape[0] < 0.25]
df = df[cols]
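Because the comparison is a strict `<`, a column that sits exactly on the threshold is dropped as well. A minimal sketch on a made-up frame (the `toy` values are illustrative; the column names are borrowed from the dataset only for readability):

```python
import numpy as np
import pandas as pd

# Toy frame with known null fractions (values are made up).
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],                          # 0% null
    "Car": [1.0, np.nan, 2.0, 1.0],                 # exactly 25% null
    "YearBuilt": [np.nan, np.nan, 1990.0, np.nan],  # 75% null
})

# Fraction of nulls per column.
null_frac = toy.isnull().mean()

# Keep only columns strictly below the 25% threshold.
kept = toy.loc[:, null_frac < 0.25]
kept_cols = list(kept.columns)
```

Here only `Rooms` survives: `Car` is at exactly 0.25, so it fails the strict comparison.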

Drop the remaining rows that contain any null value

In [7]:
df = df.dropna()
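Removing rows with nulls can be done one column at a time or in a single call to `DataFrame.dropna()`. The sketch below, on a made-up frame (`toy` is illustrative only), checks that both routes leave the same rows:

```python
import numpy as np
import pandas as pd

# Toy frame with nulls in different rows (values are made up).
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Bathroom": [1.0, np.nan, 2.0, 1.0],
    "Lattitude": [-37.80, -37.81, np.nan, -37.79],
})

# Filter column by column, keeping rows where that column is not null.
looped = toy
for col in looped.columns:
    looped = looped.loc[looped[col].notnull()]

# Drop every row that has a null in any column, in one step.
dropped = toy.dropna()
```

Both keep the original index labels, so rows are not renumbered after filtering.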

In [8]:
df.describe()

| | Rooms | Price | Distance | Bedroom2 | Bathroom | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|
| count | 20778.0 | 20778.0 | 20778.0 | 20778.0 | 20778.0 | 20778.0 | 20778.0 | 20778.0 |
| mean | 3.061267 | 1090224.0 | 11.378121 | 3.046155 | 1.591298 | -37.806767 | 144.996757 | 7503.806189 |
| std | 0.944354 | 652817.3 | 6.889495 | 0.954839 | 0.700844 | 0.09172 | 0.12083 | 4403.671879 |
| min | 1.0 | 85000.0 | 0.0 | 0.0 | 0.0 | -38.19043 | 144.42379 | 83.0 |
| 25% | 2.0 | 660000.0 | 6.4 | 2.0 | 1.0 | -37.860675 | 144.925077 | 4380.0 |
| 50% | 3.0 | 910000.0 | 10.4 | 3.0 | 1.0 | -37.799805 | 145.0035 | 6567.0 |
| 75% | 4.0 | 1335000.0 | 14.2 | 4.0 | 2.0 | -37.748648 | 145.068977 | 10331.0 |
| max | 16.0 | 11200000.0 | 48.1 | 20.0 | 9.0 | -37.3978 | 145.52635 | 21650.0 |
