# Decision Tree Classifier Demonstration

In this tutorial we demonstrate how to use the `DecisionTreeClassifier` class in `scikit-learn` to make classification predictions, and compare it against a support vector classifier (`SVC`) tuned on the same data.


## 1.0 Setup
Import modules


In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier 
from matplotlib import pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

np.random.seed(1)

## 2.0 Load data
Load data (it's already cleaned and preprocessed)


In [2]:
# Uncomment the following snippet to debug problems with finding the .csv file path.
# It prints the current working directory, which the relative paths below are resolved against.
#import os
#print(os.getcwd())

In [3]:
X_train = pd.read_csv('airbnb_train_X_price_gte_150.csv') 
y_train = pd.read_csv('airbnb_train_y_price_gte_150.csv') 
X_test = pd.read_csv('airbnb_test_X_price_gte_150.csv') 
y_test = pd.read_csv('airbnb_test_y_price_gte_150.csv') 

In [4]:
y_train = y_train['price_gte_150'].values.ravel()

In [5]:
score_measure = "precision"
kfolds = 5

param_grid = {
    'C':[0.1,1,10,143],
    'gamma':[1,0.01,0.001,0.0001],
    'kernel':['poly']
    
}


SVM_R_out = SVC()
rand_search = RandomizedSearchCV(estimator = SVM_R_out, param_distributions=param_grid, cv=kfolds, n_iter=16,
                           scoring=score_measure, verbose=1, n_jobs=-1, return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

# Best SVM found by the random search (selected on precision, not recall)
bestPrecisionSVM = rand_search.best_estimator_

Fitting 5 folds for each of 16 candidates, totalling 80 fits
The best precision score is 0.9370404920282969
... with parameters: {'kernel': 'poly', 'gamma': 0.01, 'C': 0.1}
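
The search space above has only 4 × 4 × 1 = 16 combinations, which equals `n_iter`. Because every parameter is given as a list, `RandomizedSearchCV` samples without replacement, so this random search ends up visiting every combination. A minimal standard-library sketch of the count (the names `combinations` and `n_combinations` are illustrative, not part of the notebook):

```python
from itertools import product

# Same search space as the SVM cell above.
param_grid = {
    'C': [0.1, 1, 10, 143],
    'gamma': [1, 0.01, 0.001, 0.0001],
    'kernel': ['poly'],
}

# Every combination an exhaustive search would try.
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_combinations = len(combinations)
n_iter = 16  # value passed to RandomizedSearchCV above

print(f"{n_combinations} combinations, n_iter = {n_iter}")
print("Random search covers the whole grid:", n_iter >= n_combinations)
```

With a space this small, an exhaustive search costs the same number of fits as the random search.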


In [6]:
# Grid search over the same SVM parameter grid
score_measure = "precision"
kfolds = 5

param_grid = {
    'C':[0.1,1,10,143],
    'gamma':[1,0.01,0.001,0.0001],
    'kernel':['poly']
    
}


SVM_G_out = SVC()
grid_search = GridSearchCV(estimator = SVM_G_out, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1, return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

# Best SVM found by the grid search (selected on precision, not recall)
bestPrecisionSVM = grid_search.best_estimator_

Fitting 5 folds for each of 16 candidates, totalling 80 fits
The best precision score is 0.9370404920282969
... with parameters: {'C': 0.1, 'gamma': 0.01, 'kernel': 'poly'}


## 3.0 Model the data

Conduct an initial random search across a wide range of possible parameters.

In [7]:
score_measure = "precision"
kfolds = 5

param_grid = {
    'min_samples_split': np.arange(2,24),  # an integer min_samples_split must be at least 2
    'min_samples_leaf': np.arange(1,24),
    'min_impurity_decrease': np.arange(0.001, 0.1, 0.005),
    'max_leaf_nodes': np.arange(6, 24), 
    'max_depth': np.arange(6,24), 
    'criterion': ['entropy', 'gini'],
}

dtree = DecisionTreeClassifier()
rand_search = RandomizedSearchCV(estimator = dtree, param_distributions=param_grid, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1, return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

# Best decision tree found by the random search (selected on precision, not recall)
bestPrecisionTree = rand_search.best_estimator_

Fitting 5 folds for each of 500 candidates, totalling 2500 fits
The best precision score is 0.8506202419854265
... with parameters: {'min_samples_split': 8, 'min_samples_leaf': 22, 'min_impurity_decrease': 0.001, 'max_leaf_nodes': 11, 'max_depth': 9, 'criterion': 'gini'}
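
To put `n_iter=500` in perspective, the sketch below estimates how large this search space is. It uses only the standard library and counts `min_samples_split` from 2, the smallest integer value `scikit-learn` accepts. The `sizes` dictionary is an illustrative helper, not part of the notebook.

```python
import math

# Number of candidate values per parameter in the random-search space above.
sizes = {
    'min_samples_split': len(range(2, 24)),   # valid integers 2..23
    'min_samples_leaf': len(range(1, 24)),
    # np.arange(start, stop, step) has ceil((stop - start) / step) elements
    'min_impurity_decrease': math.ceil((0.1 - 0.001) / 0.005),
    'max_leaf_nodes': len(range(6, 24)),
    'max_depth': len(range(6, 24)),
    'criterion': 2,
}

total = math.prod(sizes.values())
n_iter = 500  # value passed to RandomizedSearchCV above

print(f"{total:,} valid combinations; {n_iter} sampled ({n_iter / total:.4%})")
```

The random search therefore sees only a tiny fraction of the space, which is why a narrower follow-up search is a common next step.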


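Conduct an exhaustive grid search across a narrower range of parameters. The ranges used here differ from the best values printed by the random search above (for example, `max_leaf_nodes` of 11, `max_depth` of 9, and `criterion` of `gini`), so this search refines a different region of the parameter space.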

In [8]:
score_measure = "precision"
kfolds = 5

param_grid = {
    'min_samples_split': np.arange(2,7),  
    'min_samples_leaf': np.arange(2,5),
    'min_impurity_decrease': np.arange(0.009, 0.012,0.001),
    'max_leaf_nodes': np.arange(35,39), 
    'max_depth': np.arange(35,39), 
    'criterion': ['entropy'],
}

dtree = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator = dtree, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1, return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

# Best decision tree found by the grid search (selected on precision, not recall)
bestPrecisionTree = grid_search.best_estimator_

Fitting 5 folds for each of 960 candidates, totalling 4800 fits
The best precision score is 0.8330433636237797
... with parameters: {'criterion': 'entropy', 'max_depth': 35, 'max_leaf_nodes': 35, 'min_impurity_decrease': 0.009, 'min_samples_leaf': 2, 'min_samples_split': 2}
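
The count of 960 candidates is worth a second look: `np.arange(0.009, 0.012, 0.001)` nominally excludes the stop value and so might be expected to yield three values, yet 960 = 5 × 3 × 4 × 4 × 4 implies four. NumPy documents the output length of `arange` as `ceil((stop - start) / step)`, and with floating-point inputs that quotient can land slightly above the exact answer. The sketch below reproduces the arithmetic with the standard library only; `safe_values` is an illustrative alternative, and `np.linspace(0.009, 0.011, 3)` would be the NumPy equivalent.

```python
import math

start, stop, step = 0.009, 0.012, 0.001

# Length formula documented for np.arange with these arguments.
quotient = (stop - start) / step
n_values = math.ceil(quotient)
print(quotient, n_values)

# Candidate count for the grid above:
# min_samples_split * min_samples_leaf * min_impurity_decrease
#   * max_leaf_nodes * max_depth * criterion
n_candidates = 5 * 3 * n_values * 4 * 4 * 1
print(n_candidates)

# Building the values from an integer index avoids the rounding surprise.
safe_values = [round(start + i * step, 3) for i in range(3)]
print(safe_values)
```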


Four searches were run, all scored by 5-fold cross-validated precision. The SVM with a polynomial kernel reached a precision of about 0.937 under both random search and grid search. The two results are identical because the SVM grid has only 16 combinations and the random search drew 16 candidates, so both searches evaluated the same candidates and selected `C = 0.1` and `gamma = 0.01`. The decision tree scored lower: about 0.851 with random search and 0.833 with grid search. On this metric the SVM is therefore the more suitable model for this problem and the decision tree the less suitable one. Note that these figures are cross-validation scores computed on the training data; the test set loaded in section 2.0 is not used in this comparison.
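
For reference, precision is the share of positive predictions that are correct: TP / (TP + FP). The sketch below computes it by hand, assuming the positive class is labelled 1 (the default for `scoring="precision"` on a binary target). The labels are hypothetical and not taken from the Airbnb data, and the `precision` helper is illustrative; `sklearn.metrics.precision_score` is the library equivalent.

```python
def precision(y_true, y_pred, positive=1):
    """Return TP / (TP + FP) for the given positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    if tp + fp == 0:
        return 0.0  # no positive predictions; a convention chosen for this sketch
    return tp / (tp + fp)


# Hypothetical labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(precision(y_true, y_pred))
```

A model can reach high precision by predicting the positive class rarely, so it is usually worth checking recall or the full confusion matrix alongside it.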