# 1. Problem Statement

Your cousin has made millions of dollars speculating on real estate. He's offered to become business partners with you because of your interest in data science. He'll supply the money, and you'll supply models that predict how much various houses are worth.

# 2. Explore the data

The first step in any machine learning project is to familiarize yourself with the data.

In [1]:
import pandas as pd

In [2]:
melbourne_data = pd.read_csv("melb_data.csv")

In [3]:
melbourne_data.describe()

|  | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13518.0 | 13580.0 | 7130.0 | 8205.0 | 13580.0 | 13580.0 | 13580.0 |
| mean | 2.937997 | 1075684.0 | 10.137776 | 3105.301915 | 2.914728 | 1.534242 | 1.610075 | 558.416127 | 151.96765 | 1964.684217 | -37.809203 | 144.995216 | 7454.417378 |
| std | 0.955748 | 639310.7 | 5.868725 | 90.676964 | 0.965921 | 0.691712 | 0.962634 | 3990.669241 | 541.014538 | 37.273762 | 0.07926 | 0.103916 | 4378.581772 |
| min | 1.0 | 85000.0 | 0.0 | 3000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1196.0 | -38.18255 | 144.43181 | 249.0 |
| 25% | 2.0 | 650000.0 | 6.1 | 3044.0 | 2.0 | 1.0 | 1.0 | 177.0 | 93.0 | 1940.0 | -37.856822 | 144.9296 | 4380.0 |
| 50% | 3.0 | 903000.0 | 9.2 | 3084.0 | 3.0 | 1.0 | 2.0 | 440.0 | 126.0 | 1970.0 | -37.802355 | 145.0001 | 6555.0 |
| 75% | 3.0 | 1330000.0 | 13.0 | 3148.0 | 3.0 | 2.0 | 2.0 | 651.0 | 174.0 | 1999.0 | -37.7564 | 145.058305 | 10331.0 |
| max | 10.0 | 9000000.0 | 48.1 | 3977.0 | 20.0 | 8.0 | 10.0 | 433014.0 | 44515.0 | 2018.0 | -37.40853 | 145.52635 | 21650.0 |


The count row shows that Car, BuildingArea, and YearBuilt have missing values (their counts fall below 13,580). We will deal with those by dropping the affected rows.

In [4]:
melbourne_data.dropna(axis=0, inplace=True)
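
As a minimal sketch of what `dropna(axis=0)` does, the toy frame below (hypothetical values, not taken from `melb_data.csv`) counts the missing entries per column and then drops every row that contains at least one of them.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the values are made up.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [79.0, np.nan, 142.0, np.nan],
    "YearBuilt": [1900.0, 1900.0, np.nan, 1910.0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# axis=0 drops rows (not columns) that contain any missing value.
complete_rows = toy.dropna(axis=0)
print(len(toy), "rows before,", len(complete_rows), "rows after")
```

In this notebook the same call reduces the data from 13,580 rows to 6,196, as the count rows of the two `describe()` outputs show.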

We need to select our target variable (the dependent variable).

In [5]:
melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [6]:
y = melbourne_data.Price

Next, we select the features (the independent variables) that we think are useful for predicting the price of a house. Sometimes all columns except the target are used as features, if they are all important. For now, let's select just a few.
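
The two ways of choosing features can be sketched on a toy frame. The frame and the names `toy`, `X_all`, and `X_subset` are illustrative assumptions; the values mirror the first few rows of the cleaned data.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Bathroom": [1.0, 2.0, 1.0],
    "Price": [1035000.0, 1465000.0, 1600000.0],
})

target_column = "Price"  # the dependent variable
y_toy = toy[target_column]

# Option 1: every column except the target becomes a feature.
X_all = toy.drop(columns=[target_column])

# Option 2: pick an explicit subset of columns.
X_subset = toy[["Rooms"]]

print(list(X_all.columns))
print(list(X_subset.columns))
```

On the real dataset, option 1 would also pull in text columns such as `Suburb` and `Address`, which would need to be encoded or dropped before a scikit-learn tree could use them.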

In [7]:
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']

In [8]:
X = melbourne_data[melbourne_features]

In [9]:
X.describe()

|  | Rooms | Bathroom | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude |
|---|---|---|---|---|---|---|---|
| count | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 |
| mean | 2.931407 | 1.57634 | 471.00694 | 141.568645 | 1964.081988 | -37.807904 | 144.990201 |
| std | 0.971079 | 0.711362 | 897.449881 | 90.834824 | 38.105673 | 0.07585 | 0.099165 |
| min | 1.0 | 1.0 | 0.0 | 0.0 | 1196.0 | -38.16492 | 144.54237 |
| 25% | 2.0 | 1.0 | 152.0 | 91.0 | 1940.0 | -37.855438 | 144.926198 |
| 50% | 3.0 | 1.0 | 373.0 | 124.0 | 1970.0 | -37.80225 | 144.9958 |
| 75% | 4.0 | 2.0 | 628.0 | 170.0 | 2000.0 | -37.7582 | 145.0527 |
| max | 8.0 | 8.0 | 37000.0 | 3112.0 | 2018.0 | -37.45709 | 145.52635 |


In [10]:
X.head()

|  | Rooms | Bathroom | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude |
|---|---|---|---|---|---|---|---|
| 1 | 2 | 1.0 | 156.0 | 79.0 | 1900.0 | -37.8079 | 144.9934 |
| 2 | 3 | 2.0 | 134.0 | 150.0 | 1900.0 | -37.8093 | 144.9944 |
| 4 | 4 | 1.0 | 120.0 | 142.0 | 2014.0 | -37.8072 | 144.9941 |
| 6 | 3 | 2.0 | 245.0 | 210.0 | 1910.0 | -37.8024 | 144.9993 |
| 7 | 2 | 1.0 | 256.0 | 107.0 | 1890.0 | -37.806 | 144.9954 |


# 3. Build a model

The steps to building and using a model are:

* Define: What type of model will it be? A decision tree? Some other type of model? The parameters of the chosen model type are specified at this step too.
* Fit: Capture patterns from provided data. This is the heart of modeling.
* Predict: Use the fitted model to estimate the target for given inputs.
* Evaluate: Determine how accurate the model's predictions are.
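
The four steps can be sketched end to end on toy data. The numbers below are made up for illustration, and the evaluation is done on the training rows only to keep the sketch short.

```python
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Toy data, made up for illustration: one feature (rooms) and a price.
X_toy = [[1], [2], [3], [4]]
y_toy = [300000, 500000, 700000, 900000]

# Define: choose the model type and its parameters.
model = DecisionTreeRegressor(random_state=0)

# Fit: learn patterns from the data.
model.fit(X_toy, y_toy)

# Predict: ask the fitted model for prices.
predictions = model.predict(X_toy)

# Evaluate: compare predictions with the known prices.
# Scoring on the training rows themselves, as here, is overly optimistic.
mae = mean_absolute_error(y_toy, predictions)
print(predictions, mae)
```

An unrestricted tree memorizes these four rows, so the error here is zero. A fair evaluation needs rows the model has not seen during fitting.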

<h3> 3.1 Decision Tree </h3>

We'll start with a Decision Tree. You predict the price of any house by tracing through the decision tree, always picking the path corresponding to that house's characteristics. The predicted price for the house is at the bottom of the tree. The point at the bottom where we make a prediction is called a leaf.

The splits and values at the leaves will be determined by the data.
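
To illustrate the tracing step, here is a hand-written tree in plain Python. The splits and leaf prices are hypothetical; a fitted `DecisionTreeRegressor` would learn them from the data instead.

```python
# A hand-written tree for illustration; the splits and prices are made up.
toy_tree = {
    "feature": "Rooms", "threshold": 2,
    "left": {"price": 700000},            # Rooms <= 2
    "right": {                            # Rooms > 2
        "feature": "Landsize", "threshold": 500,
        "left": {"price": 1100000},       # Landsize <= 500
        "right": {"price": 1500000},      # Landsize > 500
    },
}

def predict_price(node, house):
    """Follow the branch matching the house until a leaf is reached."""
    while "price" not in node:
        go_left = house[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["price"]

house = {"Rooms": 3, "Landsize": 245}
print(predict_price(toy_tree, house))  # 1100000
```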

In [11]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

We use data to decide how to break the houses into groups, and then again to determine the predicted price in each group.

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
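
The call above relies on the defaults of `train_test_split`. As a sketch on toy data (the variable names are illustrative), the default holds out 25% of the rows, and a fixed `random_state` makes the shuffle repeatable.

```python
from sklearn.model_selection import train_test_split

# Toy data for illustration: 8 rows with one feature each.
X_toy = [[i] for i in range(8)]
y_toy = [10 * i for i in range(8)]

# With no test_size given, 25% of the rows are held out.
Xa_train, Xa_test, ya_train, ya_test = train_test_split(X_toy, y_toy, random_state=0)
print(len(Xa_train), len(Xa_test))

# The same random_state reproduces the same split.
Xb_train, Xb_test, yb_train, yb_test = train_test_split(X_toy, y_toy, random_state=0)
print(Xa_test == Xb_test)
```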

In [13]:
melbourne_model = DecisionTreeRegressor()

The step of capturing patterns from data is called fitting or training the model. The data used to fit the model is called the training data.

In [14]:
melbourne_model.fit(X_train, y_train)

DecisionTreeRegressor()

In [15]:
print("Making predictions for the following houses")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))
print("The True values are")
print(y.tolist()[:5])

Making predictions for the following houses
   Rooms  Bathroom  Landsize  BuildingArea  YearBuilt  Lattitude  Longtitude
1      2       1.0     156.0          79.0     1900.0   -37.8079    144.9934
2      3       2.0     134.0         150.0     1900.0   -37.8093    144.9944
4      4       1.0     120.0         142.0     2014.0   -37.8072    144.9941
6      3       2.0     245.0         210.0     1910.0   -37.8024    144.9993
7      2       1.0     256.0         107.0     1890.0   -37.8060    144.9954
The predictions are
[1035000. 1440000. 1600000. 1876000. 1634000.]
The True values are
[1035000.0, 1465000.0, 1600000.0, 1876000.0, 1636000.0]


<h3>3.1.1 Measure the accuracy of the model</h3>

There are many metrics for summarizing model quality, but we'll start with the Mean Absolute Error (also called MAE).
With the MAE metric, we take the absolute value of each error (actual value − predicted value). This converts each error to a positive number. We then take the average of those absolute errors. This is our measure of model quality.
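
As a worked example, the MAE can be computed by hand for the five houses whose predictions and true prices were printed earlier in this notebook.

```python
# First five prices and predictions as printed earlier in this notebook.
actual = [1035000.0, 1465000.0, 1600000.0, 1876000.0, 1636000.0]
predicted = [1035000.0, 1440000.0, 1600000.0, 1876000.0, 1634000.0]

# Absolute error for each house.
absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]

# MAE is the average of the absolute errors.
mae = sum(absolute_errors) / len(absolute_errors)
print(absolute_errors)  # [0.0, 25000.0, 0.0, 0.0, 2000.0]
print(mae)              # 5400.0
```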

In [16]:
#Let's calculate the Mean Absolute Error for our model
from sklearn.metrics import mean_absolute_error

The Mean Absolute Error is expressed in the same units as the target (here, dollars), so its value depends on the data and cannot be compared directly across datasets with different price scales.

In [17]:
predicted_home_prices = melbourne_model.predict(X_test)
mean_absolute_error(y_test, predicted_home_prices)

261202.36475145255

<b>Check for underfitting and overfitting</b>
<p>The decision tree model has many options. The most important ones determine the size of the tree; here we control it through <code>max_leaf_nodes</code>.</p>
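
To make the two failure modes concrete, the sketch below fits trees of increasing size on synthetic data (generated here purely for illustration) and reports the error on the training rows and on held-out rows. A tree that is too small tends to score poorly on both (underfitting). A tree that is too large fits the training rows almost perfectly but does noticeably worse on held-out rows (overfitting).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration: a noisy linear relationship.
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(400, 1))
y_syn = 3.0 * X_syn[:, 0] + rng.normal(0, 2.0, size=400)

Xs_train, Xs_val, ys_train, ys_val = train_test_split(X_syn, y_syn, random_state=0)

results = {}
for leaves in [2, 20, 2000]:
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    tree.fit(Xs_train, ys_train)
    train_mae = mean_absolute_error(ys_train, tree.predict(Xs_train))
    val_mae = mean_absolute_error(ys_val, tree.predict(Xs_val))
    results[leaves] = (train_mae, val_mae)
    print(f"leaves={leaves:5d}  train MAE={train_mae:.2f}  validation MAE={val_mae:.2f}")
```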

In [18]:
def get_mae(mln, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=mln, random_state=0)
    model.fit(train_X, train_y)
    pred = model.predict(val_X)
    mae = mean_absolute_error(val_y, pred)
    
    return mae

In [19]:
#Let's compare the accuracy for different leaf values

In [20]:
for max_ln in [5, 50, 500, 5000]:
    my_mae = get_mae(max_ln, X_train, X_test, y_train, y_test)
    print(f"Max leaf nodes : {max_ln}, and Mean Absolute Error: {my_mae}")

Max leaf nodes : 5, and Mean Absolute Error: 347380.33833344496
Max leaf nodes : 50, and Mean Absolute Error: 258171.21202406782
Max leaf nodes : 500, and Mean Absolute Error: 243495.96361790417
Max leaf nodes : 5000, and Mean Absolute Error: 254983.64299548094


<b> Of the options listed, 500 is the optimal number of leaves.</b>
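
Picking the winner can also be done programmatically. This small sketch copies the four MAE values from the output above into a dictionary (the name `mae_by_leaves` is illustrative) and selects the key with the lowest error.

```python
# Validation MAE per max_leaf_nodes value, copied from the output above.
mae_by_leaves = {
    5: 347380.33833344496,
    50: 258171.21202406782,
    500: 243495.96361790417,
    5000: 254983.64299548094,
}

# The best tree size is the key with the smallest MAE.
best_leaves = min(mae_by_leaves, key=mae_by_leaves.get)
print(best_leaves)  # 500
```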

You now know the best tree size among the candidates. If you were going to deploy this model in practice, you could likely make it more accurate
by fitting it on all of the data while keeping that tree size. That is, you don't need to hold out the validation data once
you've made all your modeling decisions.

In [21]:
final_model = DecisionTreeRegressor(max_leaf_nodes=500)
final_model.fit(X, y)

DecisionTreeRegressor(max_leaf_nodes=500)

<h3>3.2 Random Forest </h3>

The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree and it works well with default parameters. If you keep modeling, you can learn more models with even better performance, but many of those are sensitive to getting the right parameters.
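
The averaging can be checked directly. The sketch below uses synthetic data (made up for illustration) and a small forest, then compares the mean of the individual trees' predictions with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(200, 2))
y_syn = 2.0 * X_syn[:, 0] + X_syn[:, 1] + rng.normal(0, 1.0, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=1)
forest.fit(X_syn, y_syn)

# Each fitted tree is available in forest.estimators_.
per_tree = np.array([tree.predict(X_syn[:3]) for tree in forest.estimators_])

# Averaging the individual trees reproduces the forest's own prediction.
manual_average = per_tree.mean(axis=0)
print(manual_average)
print(forest.predict(X_syn[:3]))
```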

In [22]:
from sklearn.ensemble import RandomForestRegressor

In [23]:
forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(X_train, y_train)
prediction = forest_model.predict(X_test)

print(mean_absolute_error(y_test, prediction))

191669.7536453626


This is already far better than the best single-tree result (an MAE of about 191,670 versus 243,496). There are parameters that allow you to change the performance of the Random Forest, much as we changed the maximum number of leaf nodes of the single decision tree. But one of the best features of Random Forest models is that they generally work reasonably well even without this tuning.

Let's see what our predictions look like

In [24]:
# Work on an explicit copy so that adding a column leaves X_test untouched
house_price_predictions = X_test.copy()

In [25]:
house_price_predictions["Price"]=prediction
house_price_predictions.reset_index(drop=True, inplace=True)

In [26]:
house_price_predictions.head()

|  | Rooms | Bathroom | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Price |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1.0 | 96.0 | 71.0 | 1880.0 | -37.8501 | 144.9953 | 947155.0 |
| 1 | 2 | 1.0 | 0.0 | 70.0 | 1965.0 | -37.8902 | 144.9907 | 540290.0 |
| 2 | 2 | 1.0 | 136.0 | 58.0 | 1892.0 | -37.85542 | 144.99571 | 976405.0 |
| 3 | 3 | 2.0 | 205.0 | 184.0 | 1995.0 | -37.7993 | 145.0267 | 1492295.0 |
| 4 | 2 | 1.0 | 400.0 | 88.0 | 1955.0 | -37.7352 | 144.9852 | 662815.0 |


Let's look at the prices in the original dataset and compare their general trends with those of our predictions.

In [27]:
melbourne_data[['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                'YearBuilt', 'Lattitude', 'Longtitude', 'Price']].head()

|  | Rooms | Bathroom | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Price |
|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1.0 | 156.0 | 79.0 | 1900.0 | -37.8079 | 144.9934 | 1035000.0 |
| 2 | 3 | 2.0 | 134.0 | 150.0 | 1900.0 | -37.8093 | 144.9944 | 1465000.0 |
| 4 | 4 | 1.0 | 120.0 | 142.0 | 2014.0 | -37.8072 | 144.9941 | 1600000.0 |
| 6 | 3 | 2.0 | 245.0 | 210.0 | 1910.0 | -37.8024 | 144.9993 | 1876000.0 |
| 7 | 2 | 1.0 | 256.0 | 107.0 | 1890.0 | -37.806 | 144.9954 | 1636000.0 |
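
The two tables above list different houses, so they only allow a rough comparison of trends. A more direct check, sketched here with made-up actual prices for illustration, places each prediction next to the actual price of the same house.

```python
import pandas as pd

# Actual prices are made up for illustration; the predictions reuse the
# first three values from the prediction table above.
actual = [950000.0, 520000.0, 1500000.0]
predicted = [947155.0, 540290.0, 1492295.0]

comparison = pd.DataFrame({"Actual": actual, "Predicted": predicted})
comparison["AbsoluteError"] = (comparison["Actual"] - comparison["Predicted"]).abs()
print(comparison)
```

In this notebook, the actual prices for the test houses are held in `y_test`, so a similar frame could presumably be built from `y_test.reset_index(drop=True)` and `prediction`.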


Quite satisfying, I must admit.