In [1]:
import pandas as pd

## Loading Data
Read the Melbourne Housing Snapshot data file into a Pandas DataFrame called `melbourne_df`.

In [2]:
melbourne_file_path = './melb_data.csv'
melbourne_df = pd.read_csv(melbourne_file_path)


## Review The Data
Use ```<df_name>.describe()``` to view summary statistics of the data.

In [3]:
# count: how many rows have non-missing values
# mean: the average
# std: the standard deviation, which measures how numerically spread out the values are
statistic = melbourne_df.describe()
print(statistic)

              Rooms         Price      Distance      Postcode      Bedroom2  \
count  13580.000000  1.358000e+04  13580.000000  13580.000000  13580.000000   
mean       2.937997  1.075684e+06     10.137776   3105.301915      2.914728   
std        0.955748  6.393107e+05      5.868725     90.676964      0.965921   
min        1.000000  8.500000e+04      0.000000   3000.000000      0.000000   
25%        2.000000  6.500000e+05      6.100000   3044.000000      2.000000   
50%        3.000000  9.030000e+05      9.200000   3084.000000      3.000000   
75%        3.000000  1.330000e+06     13.000000   3148.000000      3.000000   
max       10.000000  9.000000e+06     48.100000   3977.000000     20.000000   

           Bathroom           Car       Landsize  BuildingArea    YearBuilt  \
count  13580.000000  13518.000000   13580.000000   7130.000000  8205.000000   
mean       1.534242      1.610075     558.416127    151.967650  1964.684217   
std        0.691712      0.962634    3990.669241   

In [4]:
# Print the average land size
avg_land_size = statistic['Landsize']['mean']
print(avg_land_size)

558.4161266568483
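As a small illustration of how the summary table can be read, the sketch below uses a made-up three-row DataFrame (`toy_df` and its values are assumptions, not part of the Melbourne data). It shows that `count` skips missing values, and that a mean taken from `describe()` matches the one computed directly on the column.

```python
import pandas as pd

# Toy stand-in for melbourne_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "Landsize": [100.0, 200.0, 600.0],
    "BuildingArea": [150.0, None, 120.0],  # one missing value
})

summary = toy_df.describe()

# count only includes rows where the column is not missing.
building_area_count = summary.loc["count", "BuildingArea"]

# .loc[row_label, column_label] reads a single cell of the summary table.
mean_from_describe = summary.loc["mean", "Landsize"]

# Calling .mean() on the column gives the same value without describe().
mean_direct = toy_df["Landsize"].mean()

print(building_area_count, mean_from_describe, mean_direct)
```

This is why `BuildingArea` and `YearBuilt` report a lower `count` than the other columns in the output above.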


## Selecting Data for Modeling
The dataset has too many variables to wrap your head around, or even to print out nicely. How can you pare down this overwhelming amount of data to something you can understand?

### Get Columns

In [5]:
# To choose variables/columns, we'll need to see a list of all columns in the dataset. 
# That is done with the columns property of the DataFrame (the bottom line of code below).
attributes = melbourne_df.columns
print(attributes)

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')


### Drop missing values

In [6]:
melbourne_df = melbourne_df.dropna(axis=0)
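To make the effect of `dropna(axis=0)` concrete, here is a minimal sketch on a made-up DataFrame (`toy_df` and its values are assumptions for illustration). Any row containing at least one missing value is removed, and the surviving rows keep their original index labels.

```python
import numpy as np
import pandas as pd

# Toy data: the middle row is missing its BuildingArea.
toy_df = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "BuildingArea": [150.0, np.nan, 120.0],
    "Price": [1000000.0, 1200000.0, 900000.0],
})

# axis=0 drops rows (not columns) that contain any missing value.
cleaned = toy_df.dropna(axis=0)

print(len(toy_df), "->", len(cleaned))
print(cleaned.index.tolist())  # original labels are kept, so gaps appear
```

The preserved labels explain why later outputs of `X.head()` show index values with gaps rather than a clean 0, 1, 2, ... sequence.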

### Select the Prediction Target
You can pull out a variable with dot notation. This single column is stored in a `Series`, which is broadly like a DataFrame with only a single column of data. By convention, the prediction target is called `y`.

In [7]:
y = melbourne_df.Price
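The sketch below, on a made-up two-row DataFrame (`toy_df` is an assumption for illustration), shows that dot notation and bracket notation return the same `Series`. Bracket notation is the safer choice when a column name contains spaces or clashes with a DataFrame method name.

```python
import pandas as pd

# Toy data invented for illustration.
toy_df = pd.DataFrame({"Price": [1000000.0, 1200000.0], "Rooms": [2, 3]})

by_dot = toy_df.Price          # dot notation
by_bracket = toy_df["Price"]   # bracket notation

print(type(by_dot).__name__)
print(by_dot.equals(by_bracket))
```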

## Choosing "Features"
The columns that are fed into our model (and later used to make predictions) are called "features." In our case, those are the columns used to determine the home price.

In [8]:
melbourne_features = ['Rooms','Bathroom','Landsize','Lattitude','Longtitude']

By convention, this data is called `X`.

In [9]:
X = melbourne_df[melbourne_features]

In [10]:
X.describe()

|       | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|-------|-------|----------|----------|-----------|------------|
| count | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 |
| mean | 2.931407 | 1.57634 | 471.00694 | -37.807904 | 144.990201 |
| std | 0.971079 | 0.711362 | 897.449881 | 0.07585 | 0.099165 |
| min | 1.0 | 1.0 | 0.0 | -38.16492 | 144.54237 |
| 25% | 2.0 | 1.0 | 152.0 | -37.855438 | 144.926198 |
| 50% | 3.0 | 1.0 | 373.0 | -37.80225 | 144.9958 |
| 75% | 4.0 | 2.0 | 628.0 | -37.7582 | 145.0527 |
| max | 8.0 | 8.0 | 37000.0 | -37.45709 | 145.52635 |


In [11]:
X.head()

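|   | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|-------|----------|----------|-----------|------------|
| 1 | 2 | 1.0 | 156.0 | -37.8079 | 144.9934 |
| 2 | 3 | 2.0 | 134.0 | -37.8093 | 144.9944 |
| 4 | 4 | 1.0 | 120.0 | -37.8072 | 144.9941 |
| 6 | 3 | 2.0 | 245.0 | -37.8024 | 144.9993 |
| 7 | 2 | 1.0 | 256.0 | -37.806 | 144.9954 |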


## Building Model
Use the `sklearn` (scikit-learn) library to create the model.

The steps to building and using a model are:

+ Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
+ Fit: Capture patterns from provided data. This is the heart of modeling.
+ Predict: Just what it sounds like
+ Evaluate: Determine how accurate the model's predictions are.

In [12]:
from sklearn.tree import DecisionTreeRegressor
# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)

#Fit model
melbourne_model.fit(X,y)

We now have a fitted model that we can use to make predictions.


In [13]:
#Making prediction for the first 5 rows
print("Predictions for the first 5 houses:")
print(X.head())
print("\nPredicted prices are:")
print(melbourne_model.predict(X.head()))

Predictions for the first 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954

Predicted prices are:
[1035000. 1465000. 1600000. 1876000. 1636000.]


# Model Validation
Measuring model quality is the key to iteratively improving your models.

## Mean Absolute Error (MAE)
The prediction error for each house is:
```
error=actual−predicted
```
So, if a house cost $150,000 and you predicted it would cost $100,000 the error is $50,000.
With the MAE metric, we take the **absolute value** of each error. This converts each error to a positive number. We then take the average of those absolute errors. This is our measure of model quality.
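The calculation described above can be worked through by hand. The sketch below uses three made-up houses (the prices are assumptions for illustration) and plain Python, so each step of the definition is visible: per-house error, absolute value, then the average.

```python
# Made-up actual and predicted prices for three houses.
actual = [150000, 200000, 300000]
predicted = [100000, 210000, 300000]

# error = actual - predicted, one value per house
errors = [a - p for a, p in zip(actual, predicted)]

# Taking the absolute value makes every error non-negative,
# so over- and under-predictions cannot cancel each other out.
absolute_errors = [abs(e) for e in errors]

# MAE is the average of the absolute errors.
mae = sum(absolute_errors) / len(absolute_errors)

print(errors)  # [50000, -10000, 0]
print(mae)     # 20000.0
```

Without the absolute value, the errors of 50000 and -10000 would partly cancel, which would make the model look better than it is.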

In [14]:
from sklearn.metrics import mean_absolute_error

In [15]:
predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y,predicted_home_prices)

1115.7467183128902

The error is this small because we used a single "sample" of houses for both building the model and evaluating it. An unrestricted decision tree can nearly memorize its training data, so this score says little about how the model will perform on new houses.
The most straightforward way to measure that is to exclude some data from the model-building process, and then use it to test the model's accuracy on data it hasn't seen before. This data is called **validation data**.

## train_test_split
```train_test_split``` breaks up the data into two pieces:
1. training data to fit the model
2. validation data to calculate mean_absolute_error.
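As a rough sketch of what the split does, the example below runs `train_test_split` on a made-up array of 20 rows (`X_toy` and `y_toy` are assumptions for illustration). With no `test_size` given, scikit-learn holds out 25% of the rows for validation by default, and every row lands in exactly one of the two pieces.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows with 2 features each, plus a target per row.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

# random_state fixes the shuffle so the split is the same on every run.
train_X_toy, val_X_toy, train_y_toy, val_y_toy = train_test_split(
    X_toy, y_toy, random_state=0
)

print(len(train_X_toy), len(val_X_toy))

# No row is lost or duplicated by the split.
all_targets = sorted(list(train_y_toy) + list(val_y_toy))
print(all_targets == list(range(20)))
```

The same default explains the sizes in this notebook: 4647 of the 6196 remaining rows (75%) go to training.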

In [16]:
from sklearn.model_selection import train_test_split

In [17]:
train_X,val_X,train_y,val_y = train_test_split(X,y,random_state = 0)
print(train_X)
print(val_X)

       Rooms  Bathroom  Landsize  Lattitude  Longtitude
10385      3       1.0     206.0  -37.87107   145.04991
5805       2       1.0       0.0  -37.85900   144.97670
8488       2       1.0    2701.0  -37.81090   144.86840
6672       3       1.0     670.0  -37.81340   144.87450
776        6       3.0     708.0  -37.91810   145.04400
...      ...       ...       ...        ...         ...
9510       3       1.0     118.0  -37.81351   144.98804
6023       5       2.0     661.0  -37.76510   144.82410
2960       4       2.0     453.0  -37.70160   144.89740
4729       2       1.0      90.0  -37.83570   144.93760
4996       3       1.0     495.0  -37.75210   145.01140

[4647 rows x 5 columns]
       Rooms  Bathroom  Landsize  Lattitude  Longtitude
4850       2       1.0      96.0  -37.85010   144.99530
2307       2       1.0       0.0  -37.89020   144.99070
10090      2       1.0     136.0  -37.85542   144.99571
3645       3       2.0     205.0  -37.79930   145.02670
4930       2       1.0 

In [18]:
melbourne_model = DecisionTreeRegressor()   #Define model
melbourne_model.fit(train_X,train_y)    #Fit model

In [19]:
val_prediction = melbourne_model.predict(val_X)
mean_absolute_error(val_y, val_prediction)

278219.3412954594

# Underfitting and Overfitting
|         |Underfitting | Overfitting|
|---------|------------ | -----------|
|Description|Failing to capture relevant patterns, leading to less accurate predictions|Capturing spurious patterns that won't recur in the future, leading to less accurate predictions|
|Tree Depth| **TOO SHALLOW** | **TOO DEEP** |

We need to find a suitable tree depth that balances the two. The `max_leaf_nodes` argument controls this indirectly: it caps the number of leaves, and the more leaves we allow, the deeper the tree can grow.
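To see how `max_leaf_nodes` limits tree size, the sketch below fits trees on synthetic data (a noisy sine curve generated here purely for illustration, not the Melbourne data) and reports the resulting number of leaves and depth via `get_n_leaves()` and `get_depth()`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one feature, noisy sine target (an assumption for this demo).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.1, size=200)

leaf_counts = {}
for limit in (2, 8, None):  # None means no cap on the number of leaves
    tree = DecisionTreeRegressor(max_leaf_nodes=limit, random_state=0)
    tree.fit(X_toy, y_toy)
    leaf_counts[limit] = tree.get_n_leaves()
    print(limit, tree.get_n_leaves(), tree.get_depth())
```

A small cap forces a coarse model that tends to underfit, while leaving the cap off lets the tree keep splitting until it fits the training rows very closely, which tends to overfit.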

In [20]:
def getMAE(max_leaf_nodes,train_X,val_X,train_y,val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes,random_state=0)
    model.fit(train_X,train_y)
    pred_val = model.predict(val_X)
    mae = mean_absolute_error(val_y,pred_val)
    return mae

## Compare the Accuracy
For each `mln` (max_leaf_nodes) value in `LEAFS`, we call `getMAE` and print the MAE of the resulting model.

In [21]:
LEAFS = [5,50,500,5000]

In [22]:
for mln in LEAFS :
    cur_mae = getMAE(mln,train_X,val_X,train_y,val_y)
    print("Max leaf nodes:%d \t\t Mean Absolute Error %d" %(mln,cur_mae))

Max leaf nodes:5 		 Mean Absolute Error 385696
Max leaf nodes:50 		 Mean Absolute Error 279794
Max leaf nodes:500 		 Mean Absolute Error 261718
Max leaf nodes:5000 		 Mean Absolute Error 271320


Of the options listed, 500 is the optimal number of leaves, because it gives the lowest validation MAE (261,718).
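Rather than reading the best value off the printed output, it could also be picked programmatically. The sketch below copies the four MAE values printed above into a dictionary (`mae_by_leaves` is a name introduced here for illustration) and selects the key with the smallest error.

```python
# Validation MAE for each max_leaf_nodes value, copied from the output above.
mae_by_leaves = {5: 385696, 50: 279794, 500: 261718, 5000: 271320}

# min() over the keys, ranked by their MAE, returns the best leaf count.
best_leaves = min(mae_by_leaves, key=mae_by_leaves.get)

print(best_leaves)  # 500
```

In the notebook itself, the dictionary could be built with a comprehension such as `{mln: getMAE(mln, train_X, val_X, train_y, val_y) for mln in LEAFS}`.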