## Introduction
This notebook is an introductory machine learning application of the scikit-learn library, written in Python and run in a Jupyter notebook.
We will use it to try to predict real estate prices in Melbourne, but we will fail. Nevertheless, it's a good learning experience!

The dataset was previously cleaned and is available online at: https://www.kaggle.com/datasets/dansbecker/melbourne-housing-snapshot. It consists of a snapshot of Tony Pino's Melbourne Housing Dataset and contains data on the Melbourne real estate market.


In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

In [2]:
melbourne_data = pd.read_csv("melb_data.csv")
melbourne_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  object 
 1   Address        13580 non-null  object 
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  object 
 4   Price          13580 non-null  float64
 5   Method         13580 non-null  object 
 6   SellerG        13580 non-null  object 
 7   Date           13580 non-null  object 
 8   Distance       13580 non-null  float64
 9   Postcode       13580 non-null  float64
 10  Bedroom2       13580 non-null  float64
 11  Bathroom       13580 non-null  float64
 12  Car            13518 non-null  float64
 13  Landsize       13580 non-null  float64
 14  BuildingArea   7130 non-null   float64
 15  YearBuilt      8205 non-null   float64
 16  CouncilArea    12211 non-null  object 
 17  Lattitude      13580 non-null  float64
 18  Longti

In [3]:
melbourne_data.sample(4)

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
896,Bentleigh East,3 Molden St,5,h,1550000.0,S,Hodges,18/06/2016,13.9,3165.0,...,3.0,2.0,623.0,257.0,1950.0,Glen Eira,-37.9066,145.0541,Southern Metropolitan,10969.0
2534,Fitzroy,190 Gore St,2,h,1440000.0,S,Nelson,17/09/2016,1.6,3065.0,...,1.0,0.0,95.0,,,Yarra,-37.8032,144.9825,Northern Metropolitan,5825.0
6969,Hawthorn East,503 Tooronga Rd,3,h,1700000.0,VB,Jellis,10/09/2016,7.5,3123.0,...,2.0,1.0,646.0,228.0,1980.0,Boroondara,-37.8337,145.0469,Southern Metropolitan,6482.0
3417,Ivanhoe,2/92 Beatty St,2,t,710000.0,PI,Nelson,28/08/2016,7.9,3079.0,...,1.0,1.0,134.0,120.0,2012.0,Banyule,-37.7582,145.034,Eastern Metropolitan,5549.0


In [4]:
# Changing the "Date" format from string to datetime.
# The dates in the file are day-first (e.g. 18/06/2016), so the format must be %d/%m/%Y.

melbourne_data['Date'] = pd.to_datetime(melbourne_data['Date'], format="%d/%m/%Y")
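
The sample rows above show dates such as 18/06/2016, that is, day/month/year. The sketch below uses the standard library's `datetime.strptime` rather than pandas, purely to stay dependency-free; it relies on the same strftime-style format codes and shows that a day-first format parses this value while an ISO-style format does not.

```python
from datetime import datetime

raw = "18/06/2016"  # a Date value taken from the sample rows above

# A day-first format matches the data.
parsed = datetime.strptime(raw, "%d/%m/%Y")
print(parsed.date())  # 2016-06-18

# An ISO-style format does not match this string and raises ValueError.
try:
    datetime.strptime(raw, "%Y-%m-%d")
    iso_matches = True
except ValueError:
    iso_matches = False
print(iso_matches)  # False
```

A parser told to ignore errors would hide this mismatch and leave the column as plain strings.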

In [5]:
# Dropping columns with missing values:

melbourne_data = melbourne_data.dropna(axis=1)
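
`dropna(axis=1)` keeps a column only if every row has a value in it. As a rough cross-check, the sketch below applies that rule to some of the non-null counts printed by `info()` above. It is plain Python over the printed counts, not a call into pandas, and it covers only columns visible in that listing.

```python
# Non-null counts copied from the info() listing above (13580 rows in total).
# Only a subset of the visible columns is included here.
total_rows = 13580
non_null_counts = {
    "Rooms": 13580,
    "Price": 13580,
    "Bathroom": 13580,
    "Car": 13518,
    "Landsize": 13580,
    "BuildingArea": 7130,
    "YearBuilt": 8205,
    "CouncilArea": 12211,
    "Lattitude": 13580,
}

# A column survives dropna(axis=1) only when it has no missing entries.
dropped = [col for col, n in non_null_counts.items() if n < total_rows]
kept = [col for col in non_null_counts if col not in dropped]

print(dropped)  # ['Car', 'BuildingArea', 'YearBuilt', 'CouncilArea']
```

Dropping whole columns is simple, but it discards potentially useful features such as `BuildingArea` and `YearBuilt`.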

In [6]:
# We want to predict the prices in the dataframe, therefore:

y = melbourne_data["Price"]

In [7]:
# In this analysis, we will use a subset of the columns as features to make predictions:

melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
x = melbourne_data[melbourne_features]
x.describe()

Unnamed: 0,Rooms,Bathroom,Landsize,Lattitude,Longtitude
count,13580.0,13580.0,13580.0,13580.0,13580.0
mean,2.937997,1.534242,558.416127,-37.809203,144.995216
std,0.955748,0.691712,3990.669241,0.07926,0.103916
min,1.0,0.0,0.0,-38.18255,144.43181
25%,2.0,1.0,177.0,-37.856822,144.9296
50%,3.0,1.0,440.0,-37.802355,145.0001
75%,3.0,2.0,651.0,-37.7564,145.058305
max,10.0,8.0,433014.0,-37.40853,145.52635


In [8]:
"""
Using the scikit-learn library to create models:
At first we will use the whole dataset to fit the model; this will be changed later.
Setting random_state to a fixed number ensures the same results on each run.
"""

melbourne_model = DecisionTreeRegressor(random_state=1)
melbourne_model.fit(x, y)

In [9]:
# Verifying predictions (for an overfitted model):

predictions = melbourne_model.predict(x)
print(y.head())
print(predictions[:5])

0    1480000.0
1    1035000.0
2    1465000.0
3     850000.0
4    1600000.0
Name: Price, dtype: float64
[1480000. 1035000. 1465000.  850000. 1600000.]
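
The predictions match the first five prices exactly because the tree is being evaluated on the same rows it was fitted to. With default settings, a decision tree keeps splitting until its leaves are pure, so it can effectively memorize the training rows. The toy "model" below is a plain lookup table, not a decision tree, and its data are made up. It only illustrates why a perfect score on training data says little about unseen data.

```python
# Toy data (made up): feature tuple (rooms, bathrooms) -> price.
train = {(2, 1): 700_000.0, (3, 1): 900_000.0, (3, 2): 1_100_000.0}
unseen = {(4, 2): 1_400_000.0, (2, 2): 800_000.0}

fallback = sum(train.values()) / len(train)  # mean training price


def predict(features):
    """Return the memorized price, or the training mean for unknown inputs."""
    return train.get(features, fallback)


def mean_abs_error(data):
    """Average absolute difference between true and predicted prices."""
    return sum(abs(price - predict(f)) for f, price in data.items()) / len(data)


train_error = mean_abs_error(train)
unseen_error = mean_abs_error(unseen)
print(train_error)   # 0.0
print(unseen_error)  # 300000.0
```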


In [10]:
# Calculating the mean absolute error:

predicted_home_prices = melbourne_model.predict(x)
print(mean_absolute_error(y, predicted_home_prices))

1125.1804614629357
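
`mean_absolute_error` averages the absolute differences between true and predicted values, MAE = (1/n) Σ |yᵢ − ŷᵢ|. The following is a minimal pure-Python sketch of that formula using made-up numbers. The helper name `mean_abs_error` is hypothetical and is not part of scikit-learn.

```python
def mean_abs_error(actual, predicted):
    """Average of |actual - predicted| over paired values."""
    if len(actual) != len(predicted):
        raise ValueError("inputs must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up prices, for illustration only.
actual = [500_000.0, 750_000.0, 1_000_000.0]
predicted = [520_000.0, 700_000.0, 1_000_000.0]

# Errors are 20000, 50000 and 0, so the mean is 70000 / 3.
mae = mean_abs_error(actual, predicted)
print(round(mae, 2))
```

The training error printed above is small but not zero. One plausible reason is that some rows share identical feature values but have different prices, which no tree can separate.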


In [11]:
"""
Model validation: Breaking the data into two separate blocks, using one of them to fit the model
and the other to validate it.

The split is based on a random number generator. Supplying a numeric value to the random_state argument
guarantees we get the same split every time we run this script.
"""

train_x, val_x, train_y, val_y = train_test_split(x, y, random_state = 0)
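# Note: the split is seeded, but this tree is not (no random_state is passed),
# so the fitted tree and the error printed below can vary slightly between runs.
melbourne_model = DecisionTreeRegressor()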
melbourne_model.fit(train_x, train_y)

# Verifying our predictions:
val_predictions = melbourne_model.predict(val_x)
print(mean_absolute_error(val_y, val_predictions))

246845.416593029
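
To make the idea of a reproducible split concrete, here is a pure-Python sketch: shuffle the row indices with a seeded generator, then hold out a fraction for validation. It is not scikit-learn's implementation. The function name `seeded_split` is hypothetical, and the 25% holdout mirrors the default `train_test_split` uses when no size is given.

```python
import random


def seeded_split(rows, seed, holdout_fraction=0.25):
    """Shuffle row indices with a seeded generator and split them in two."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_holdout = int(len(rows) * holdout_fraction)
    holdout = [rows[i] for i in indices[:n_holdout]]
    train = [rows[i] for i in indices[n_holdout:]]
    return train, holdout


rows = list(range(20))  # stand-in for 20 data rows
train_a, val_a = seeded_split(rows, seed=0)
train_b, val_b = seeded_split(rows, seed=0)

print(len(train_a), len(val_a))  # 15 5
print(val_a == val_b)            # True: same seed, same split
```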


In [12]:
def get_mae(max_leaf_nodes, train_x, val_x, train_y, val_y):
    """Fit a tree limited to max_leaf_nodes leaves and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_x, train_y)
    preds_val = model.predict(val_x)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

In [13]:
candidate_max_leaf_nodes = [5, 50, 500, 5000]

scores = {leaf_size: get_mae(leaf_size, train_x, val_x, train_y, val_y) for leaf_size in candidate_max_leaf_nodes}
best_tree_size = min(scores, key=scores.get)
print(best_tree_size)

500
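
The selection step relies on `min(scores, key=scores.get)`, which returns the dictionary key with the smallest value. The sketch below uses hypothetical scores, not the notebook's results. They follow the usual pattern in which very small trees underfit and very large trees overfit.

```python
# Hypothetical validation MAEs per max_leaf_nodes value (not the notebook's results).
scores = {5: 350_000.0, 50: 260_000.0, 500: 240_000.0, 5000: 255_000.0}

# Iterating a dict yields its keys; key=scores.get ranks each key by its value.
best = min(scores, key=scores.get)
print(best)  # 500

# Equivalent, more explicit form: pick the (key, value) pair with the lowest value.
best_explicit = min(scores.items(), key=lambda item: item[1])[0]
print(best_explicit == best)  # True
```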


In [14]:
melbourne_final_model = DecisionTreeRegressor(max_leaf_nodes = best_tree_size, random_state = 0)
melbourne_final_model.fit(train_x, train_y)
final_predictions = melbourne_final_model.predict(val_x)
print(mean_absolute_error(val_y, final_predictions))

231301.17567588817
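
A quick bit of arithmetic on the three errors printed in this notebook puts them in perspective. The values are copied from the outputs above, and the variable names are only illustrative.

```python
# MAE values printed by the cells above.
train_mae = 1125.1804614629357       # tree fitted and scored on the same rows
baseline_val_mae = 246845.416593029  # validation data, unconstrained tree
tuned_val_mae = 231301.17567588817   # validation data, max_leaf_nodes=500

# Relative improvement from limiting the number of leaves.
improvement = (baseline_val_mae - tuned_val_mae) / baseline_val_mae

# How much worse the unconstrained tree does on unseen rows than on seen ones.
gap = baseline_val_mae / train_mae

print(f"improvement: {improvement:.1%}")
print(f"validation/training ratio: {gap:.0f}x")
```

Tuning the tree size lowers the validation error by roughly 6%, while the validation error of the unconstrained tree is about 200 times its training error.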


The resulting model could not predict house prices in Melbourne precisely. Even after tuning the tree size, the mean absolute error on the validation data is about 231,000, which exceeds the full price of some homes, so the model cannot be used effectively to predict house prices.