# Understanding ML Models

## Introduction
Here we fit several machine learning models to the same data to understand how they work.
(This currently contains only code. A detailed analysis of each model will be added.)
## Setup

In [1]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd

melbourne_file_path = 'data/mel_housing/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 

## Cleanup and Parameter Selection
A few rows in this data contain missing (NaN) values, which could break the models we are using. We remove those rows before feeding the data to a model.
We also need to choose our feature and target parameters.

In [2]:
# remove rows that contain missing values (NaN)
melbourne_data = melbourne_data.dropna(axis=0)

# setting features (chosen arbitrarily here)
# 'Lattitude' and 'Longtitude' are spelled as in the dataset's column names
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
X.head()

# setting target
y = melbourne_data.Price
y.head()

1    1035000.0
2    1465000.0
4    1600000.0
6    1876000.0
7    1636000.0
Name: Price, dtype: float64
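
To make the clean-up step concrete, here is a minimal plain-Python sketch of what dropping incomplete rows amounts to. The records and the helpers `is_missing` and `drop_incomplete` are hypothetical and exist only for illustration; the notebook itself relies on pandas `dropna`.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def drop_incomplete(rows):
    """Keep only the rows in which every field has a value."""
    return [row for row in rows if not any(is_missing(v) for v in row.values())]

# Hypothetical records, not taken from the Melbourne dataset.
rows = [
    {"Rooms": 2, "Landsize": 150.0, "Price": 500_000.0},
    {"Rooms": 3, "Landsize": float("nan"), "Price": 750_000.0},
    {"Rooms": 4, "Landsize": 120.0, "Price": None},
]

clean_rows = drop_incomplete(rows)
print(len(rows), "rows before,", len(clean_rows), "rows after")
```

Only the first record survives, because each of the other two has one field without a value.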

## Splitting Data into Training and Test Sets
We validate the model on separate test data because the model might have closely fit patterns specific to the training data that do not hold for real-world data.

In [3]:
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y)
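
As a rough picture of what a hold-out split does, the sketch below shuffles row indices and reserves a quarter of them for validation, which matches the 25% that `train_test_split` holds out when no size is given. The helper `holdout_split` and its parameters are hypothetical; this is not scikit-learn's implementation.

```python
import random

def holdout_split(rows, validation_fraction=0.25, seed=None):
    """Shuffle row indices and reserve a fraction of them for validation."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_val = round(len(rows) * validation_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return [rows[i] for i in train_idx], [rows[i] for i in val_idx]

rows = list(range(20))  # stand-in for 20 data rows
train, val = holdout_split(rows, seed=1)
print(len(train), "training rows,", len(val), "validation rows")

# Reusing the seed reproduces the split exactly.
same_again = holdout_split(rows, seed=1) == (train, val)
print("same split with same seed:", same_again)
```

The call in the cell above passes no `random_state`, so the split, and therefore the errors reported later, can differ from run to run. Passing a fixed `random_state` to `train_test_split` would pin the split in the same way the seed does here.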

## Training and Evaluating Models
Here we write a function that takes a model together with training and test data as input and returns the mean absolute error.
We use it to validate various models in later steps.
It trains the model on the training data and gets predictions for the unseen validation (test) data.

In [4]:
from sklearn.metrics import mean_absolute_error
def mae(model, train_X, val_X, train_y, val_y):
    model.fit(train_X, train_y)  # train
    prediction = model.predict(val_X) # predict
    return mean_absolute_error(val_y, prediction)  # arguments are (y_true, y_pred)
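
For reference, the mean absolute error is the average of |actual - predicted| over all samples. A minimal by-hand version, assuming plain Python lists and hypothetical prices (the function name `mean_absolute_error_by_hand` is made up for this sketch), could look like this:

```python
def mean_absolute_error_by_hand(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical prices, for illustration only.
actual = [500_000.0, 750_000.0, 1_000_000.0]
predicted = [450_000.0, 800_000.0, 1_150_000.0]

error = mean_absolute_error_by_hand(actual, predicted)
print(error)  # the three errors are 50000, 50000 and 150000, averaged
```

Because the error is expressed in the same unit as the target, a value here reads directly as "off by this many dollars on average".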

## Models

### Decision Tree

The next example shows a basic decision tree. Since fitting the model involves some randomness, we set *random_state* so that the model's results don't vary from run to run.

In [5]:
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(random_state=1) # random_state is used to reproduce the results
print("Decision Tree mean absolute error - ", str(mae(model, train_X, val_X, train_y, val_y)))

Decision Tree mean absolute error -  242542.38347320852


We can configure various parameters of the model. One of them is *max_leaf_nodes*, which caps the number of leaves and so limits how deep and detailed the tree can grow. A low value gives a simple model, whereas a very high value gives a complex one.
This also influences whether our model overfits or underfits the data.

In [6]:
model = DecisionTreeRegressor(max_leaf_nodes=5, random_state=1)
print("Decision Tree with max_leaf_nodes=5 mean absolute error - ", str(mae(model, train_X, val_X, train_y, val_y)))

model = DecisionTreeRegressor(max_leaf_nodes=5000, random_state=1)
print("Decision Tree with max_leaf_nodes=5000 mean absolute error - ", str(mae(model, train_X, val_X, train_y, val_y)))

Decision Tree with max_leaf_nodes=5 mean absolute error -  363973.35401571565
Decision Tree with max_leaf_nodes=5000 mean absolute error -  234498.25823111684
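
To see how the number of leaves trades simplicity against fit, the sketch below uses a one-dimensional stand-in for a tree: the sorted training points are cut into contiguous groups ("leaves"), and each leaf predicts the median target of its points. This is a simplified illustration on synthetic data, not how `DecisionTreeRegressor` chooses its splits; `fit_leaves`, `predict`, `mae_of` and `target` are hypothetical names.

```python
import random
from bisect import bisect_right
from statistics import median

def fit_leaves(xs, ys, n_leaves):
    """Cut the x-sorted points into n_leaves contiguous groups;
    each group predicts the median target of its own points."""
    points = sorted(zip(xs, ys))
    size = len(points) / n_leaves
    thresholds, values = [], []
    for k in range(n_leaves):
        group = points[round(k * size):round((k + 1) * size)]
        values.append(median(y for _, y in group))
        if k > 0:
            thresholds.append(group[0][0])  # leaf k starts at this x
    return thresholds, values

def predict(model, x):
    """Look up the leaf whose x-range contains x and return its value."""
    thresholds, values = model
    return values[bisect_right(thresholds, x)]

def mae_of(model, xs, ys):
    return sum(abs(y - predict(model, x)) for x, y in zip(xs, ys)) / len(xs)

# Synthetic data (an assumption for illustration): a step plus Gaussian noise.
rng = random.Random(0)

def target(x):
    return (100.0 if x < 30 else 400.0) + rng.gauss(0, 30)

toy_train_x = list(range(0, 60, 2))
toy_val_x = list(range(1, 60, 2))
toy_train_y = [target(x) for x in toy_train_x]
toy_val_y = [target(x) for x in toy_val_x]

results = {}
for n_leaves in (1, 2, 5, len(toy_train_x)):
    model = fit_leaves(toy_train_x, toy_train_y, n_leaves)
    results[n_leaves] = (mae_of(model, toy_train_x, toy_train_y),
                         mae_of(model, toy_val_x, toy_val_y))
    print(n_leaves, "leaves: train/validation MAE =", results[n_leaves])
```

With a single leaf the model ignores the step entirely (underfitting). Once every training point has its own leaf, the training error is zero because the model has memorized the noise. The validation column is the one to watch when choosing a value.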


### Random Forests

The idea of random forests is to combine the predictions of multiple decision trees: for regression, the forest's prediction is the average of the individual trees' predictions.

In [7]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(random_state=1)
print("Random Forest mean absolute error - ", str(mae(model, train_X, val_X, train_y, val_y)))

Random Forest mean absolute error -  193282.4957176673
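
The averaging step can be sketched in a few lines. The three "trees" below are hypothetical hand-written rules on the number of rooms, not fitted models, and `forest_predict` is a made-up helper. A real `RandomForestRegressor` also trains each tree on a bootstrap sample of the training rows by default, which this sketch leaves out.

```python
from statistics import mean

def forest_predict(trees, x):
    """Average the predictions of the individual trees for one input."""
    return mean(tree(x) for tree in trees)

# Three hypothetical "trees", each a tiny rule on the number of rooms.
def tree_a(rooms):
    return 600_000.0 if rooms <= 2 else 1_000_000.0

def tree_b(rooms):
    return 700_000.0 if rooms <= 3 else 1_200_000.0

def tree_c(rooms):
    return 500_000.0 if rooms <= 1 else 1_100_000.0

trees = [tree_a, tree_b, tree_c]
for rooms in (2, 4):
    print(rooms, "rooms ->", forest_predict(trees, rooms))
```

For a two-room house the trees disagree (600000, 700000 and 1100000), and the forest returns their mean. Averaging lets the individual trees' errors partly cancel out, which is one reason a forest often scores better than a single tree, as in the run above.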
