# Training and Testing ML Models (Lab 2)

### Intro and objectives


### In this lab you will learn:
1. How to train and test an ML model, through a basic worked example

## What I hope you'll get out of this lab
* The feeling that you'll "know where to start" when you need to train and test an ML-based model
* Worked examples
* How to interpret the results obtained

In [31]:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

In [32]:


# Load data

melbourne_data = pd.read_csv('https://raw.githubusercontent.com/thousandoaks/ML4DS301/main/data/melb_data.csv') 
melbourne_data

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.79960,144.99840,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.80790,144.99340,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.80930,144.99440,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.79690,144.99690,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.80720,144.99410,Northern Metropolitan,4019.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13575,Wheelers Hill,12 Strada Cr,4,h,1245000.0,S,Barry,26/08/2017,16.7,3150.0,...,2.0,2.0,652.0,,1981.0,,-37.90562,145.16761,South-Eastern Metropolitan,7392.0
13576,Williamstown,77 Merrett Dr,3,h,1031000.0,SP,Williams,26/08/2017,6.8,3016.0,...,2.0,2.0,333.0,133.0,1995.0,,-37.85927,144.87904,Western Metropolitan,6380.0
13577,Williamstown,83 Power St,3,h,1170000.0,S,Raine,26/08/2017,6.8,3016.0,...,2.0,4.0,436.0,,1997.0,,-37.85274,144.88738,Western Metropolitan,6380.0
13578,Williamstown,96 Verdon St,4,h,2500000.0,PI,Sweeney,26/08/2017,6.8,3016.0,...,1.0,5.0,866.0,157.0,1920.0,,-37.85908,144.89299,Western Metropolitan,6380.0


In [33]:
# Drop rows that have a missing value in any column (not only Price)
filtered_melbourne_data = melbourne_data.dropna(axis=0)
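
Called without a `subset` argument, `dropna(axis=0)` removes every row that has a missing value in any column, which can discard far more rows than filtering on `Price` alone. The sketch below shows the difference on a small hypothetical DataFrame (the `toy` data is made up for illustration and is not the Melbourne dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the Melbourne dataset
toy = pd.DataFrame({
    "Price": [1480000.0, np.nan, 1465000.0, 850000.0],
    "BuildingArea": [np.nan, 79.0, 150.0, np.nan],
    "Rooms": [2, 2, 3, 3],
})

# Count the missing values in each column before dropping anything
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna(axis=0) drops a row if ANY of its columns is missing
all_complete = toy.dropna(axis=0)

# subset= restricts the check to the listed columns only
price_known = toy.dropna(axis=0, subset=["Price"])

print(len(toy), len(all_complete), len(price_known))
```

Which variant is appropriate depends on the features used later: a scikit-learn decision tree may not accept missing feature values, depending on the installed version, so dropping incomplete rows is a simple (if wasteful) way to stay safe.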

In [34]:
# Choose target and features
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

### We split the dataset in two: train and test

In [43]:
# Split the data into training and test sets, for both features and target.
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)
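
When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows for testing. The lengths reported for `train_y` and `test_y` further down (4647 and 1549) are consistent with that default. The following sketch, on hypothetical toy data, checks the default proportion and shows that reusing the same `random_state` reproduces the same split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 100 rows (not the Melbourne dataset)
toy_X = pd.DataFrame({"Rooms": np.arange(100) % 5 + 1})
toy_y = pd.Series(np.arange(100) * 1000.0, name="Price")

# Two calls with the same random_state
a_train_X, a_test_X, a_train_y, a_test_y = train_test_split(
    toy_X, toy_y, random_state=0)
b_train_X, b_test_X, b_train_y, b_test_y = train_test_split(
    toy_X, toy_y, random_state=0)

# Default split sizes: 75 training rows and 25 test rows out of 100
print(len(a_train_X), len(a_test_X))

# The same seed selects the same rows for the test set
same_split = a_test_X.index.equals(b_test_X.index)
print(same_split)

# Features and target stay aligned row by row
aligned = a_train_X.index.equals(a_train_y.index)

# The proportion can also be set explicitly
c_train_X, c_test_X, c_train_y, c_test_y = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0)
print(len(c_train_X), len(c_test_X))
```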

In [44]:
train_X

Unnamed: 0,Rooms,Bathroom,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude
10385,3,1.0,206.0,110.0,1980.0,-37.87107,145.04991
5805,2,1.0,0.0,73.0,2000.0,-37.85900,144.97670
8488,2,1.0,2701.0,79.0,2011.0,-37.81090,144.86840
6672,3,1.0,670.0,116.0,1940.0,-37.81340,144.87450
776,6,3.0,708.0,275.0,1988.0,-37.91810,145.04400
...,...,...,...,...,...,...,...
9510,3,1.0,118.0,177.0,1890.0,-37.81351,144.98804
6023,5,2.0,661.0,133.0,1960.0,-37.76510,144.82410
2960,4,2.0,453.0,213.0,2007.0,-37.70160,144.89740
4729,2,1.0,90.0,106.0,2007.0,-37.83570,144.93760


In [45]:
test_X

Unnamed: 0,Rooms,Bathroom,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude
4850,2,1.0,96.0,71.0,1880.0,-37.85010,144.99530
2307,2,1.0,0.0,70.0,1965.0,-37.89020,144.99070
10090,2,1.0,136.0,58.0,1892.0,-37.85542,144.99571
3645,3,2.0,205.0,184.0,1995.0,-37.79930,145.02670
4930,2,1.0,400.0,88.0,1955.0,-37.73520,144.98520
...,...,...,...,...,...,...,...
8223,2,1.0,0.0,82.0,2011.0,-37.73240,144.93770
11190,3,2.0,590.0,151.0,1981.0,-37.89628,145.22294
8563,1,1.0,1175.0,35.0,1970.0,-37.78490,144.82720
1867,2,1.0,585.0,97.0,1950.0,-37.87920,145.09480


In [46]:
train_y

10385    1060000.0
5805      390000.0
8488      502000.0
6672     1055000.0
776      1900000.0
           ...    
9510     1875000.0
6023      605000.0
2960      709000.0
4729     1000000.0
4996      890000.0
Name: Price, Length: 4647, dtype: float64

In [47]:
test_y

4850      815000.0
2307      655000.0
10090     957500.0
3645     1330000.0
4930      722000.0
           ...    
8223      520000.0
11190     870000.0
8563      200000.0
1867     1002000.0
8375     1710000.0
Name: Price, Length: 1549, dtype: float64

### We define and fit the model using the train dataset

In [48]:
# Define model
melbourne_model = DecisionTreeRegressor()

In [49]:
# Fit model
melbourne_model.fit(train_X, train_y)

DecisionTreeRegressor()
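
A decision tree created with default settings keeps splitting until its leaves are pure, so it can reproduce the training targets almost exactly. That is why the error has to be measured on data the model has not seen. The sketch below illustrates this on synthetic data (all `toy_*` names and the data itself are assumptions for illustration). It also passes `random_state` to the tree, which makes the fitted tree reproducible between runs.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a linear signal plus noise (hypothetical, for illustration)
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 10, size=(400, 2))
toy_y = 3.0 * toy_X[:, 0] + rng.normal(0, 2.0, size=400)

tr_X, te_X, tr_y, te_y = train_test_split(toy_X, toy_y, random_state=0)

# Unconstrained tree; random_state fixes the tie-breaking between splits
toy_model = DecisionTreeRegressor(random_state=0)
toy_model.fit(tr_X, tr_y)

# Error on the rows the tree was fitted on versus held-out rows
train_mae = mean_absolute_error(tr_y, toy_model.predict(tr_X))
test_mae = mean_absolute_error(te_y, toy_model.predict(te_X))
print("train MAE:", train_mae)
print("test MAE:", test_mae)
```

The training error is essentially zero because the tree has memorised the training rows, noise included, while the test error reflects how well it generalises.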

### We test the performance of the model using the test dataset

In [50]:
# Get predicted prices on the test data and compare them with the actual prices
test_predictions = melbourne_model.predict(test_X)
print(mean_absolute_error(test_y, test_predictions))

259866.36604260813
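
The mean absolute error (MAE) is the average of the absolute differences between actual and predicted values, so it is expressed in the same unit as the target (here, dollars). As a small worked example, the sketch below computes it by hand for the first four `test_y` prices shown above, paired with hypothetical predictions (the `predicted` values are made up, not model output), and compares the result with `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# First four actual prices from the test_y output above
actual = np.array([815000.0, 655000.0, 957500.0, 1330000.0])
# Hypothetical predictions, invented for this example
predicted = np.array([800000.0, 700000.0, 900000.0, 1500000.0])

# Absolute errors: 15000, 45000, 57500, 170000
abs_errors = np.abs(actual - predicted)

# MAE is their mean: 287500 / 4 = 71875
manual_mae = abs_errors.mean()
sklearn_mae = mean_absolute_error(actual, predicted)
print(manual_mae, sklearn_mae)

# Expressing the MAE relative to the mean actual price helps judge its size
relative_mae = manual_mae / actual.mean()
print(round(relative_mae, 3))
```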


## Wow!
#### Our mean absolute error on the test dataset is about 260,000 dollars.
#### This is clearly unacceptable: many of the prices shown above are below one million dollars. We need to improve our model, for instance by adding more features or by revisiting our choice of model.
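
One possible way to revisit the model, sketched here on synthetic data, is to limit the size of the tree with `max_leaf_nodes` and compare the error of each candidate on held-out data. The candidate values, the helper `mae_for_leaf_limit`, and the data are assumptions for illustration. This is not necessarily the approach intended by this lab.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic "price" data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 10, size=(600, 3))
toy_y = (50000 * toy_X[:, 0] + 20000 * toy_X[:, 1]
         + rng.normal(0, 30000, size=600))

tr_X, te_X, tr_y, te_y = train_test_split(toy_X, toy_y, random_state=0)

def mae_for_leaf_limit(max_leaf_nodes):
    """Fit a tree with the given leaf limit and return its held-out MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes,
                                  random_state=0)
    model.fit(tr_X, tr_y)
    return mean_absolute_error(te_y, model.predict(te_X))

# None means "no limit", which matches the default tree used in this lab
results = {n: mae_for_leaf_limit(n) for n in [5, 50, 500, None]}
for n, mae in results.items():
    print(n, round(mae))

# Candidate with the lowest held-out error
best = min(results, key=results.get)
print("best max_leaf_nodes:", best)
```

Choosing a setting by looking at the test error makes that error an optimistic estimate, so in practice a separate validation split (or cross-validation) would be preferable for the comparison.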