# Classification & Regression with Trees

**Aim**: This notebook provides code-based examples of tree-based algorithms implemented with scikit-learn.

## Table of contents 

1. Decision Tree Classifier
2. Random Forest Classifier
3. AdaBoost Classifier
4. Decision Tree Regressor
5. Random Forest Regressor
6. Gradient Boosted Trees Regressor 
7. Ensemble Classifier

## Package Requirements

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import GridSearchCV
from io import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
from sklearn import tree

## Decision Tree Classifier

**Reading in the dataset**

In [2]:
df = pd.read_csv('fraud_prediction.csv')

In [3]:
df = df.drop(['Unnamed: 0'], axis = 1)

**Splitting the data into training & test sets**

In [4]:
#Creating the features 

features = df.drop('isFraud', axis = 1).values
target = df['isFraud'].values

In [5]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.3, random_state = 42, 
                                                    stratify = target)
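
The `stratify = target` argument asks `train_test_split` to keep the class proportions the same in both splits. A minimal sketch of that effect, assuming a synthetic stand-in for the fraud data (the arrays `X_demo` and `y_demo` and the 5% positive rate are hypothetical, not taken from `fraud_prediction.csv`):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 rows with 5% positive labels (hypothetical values)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.zeros(1000, dtype=int)
y_demo[:50] = 1

# Same call pattern as the cell above: 30% test split, stratified on the target
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# With stratification, both splits keep roughly the original positive rate
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.3f}, test positive rate: {test_rate:.3f}")
```

Without `stratify`, a rare class can end up over- or under-represented in the test set purely by chance.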

**Building the initial decision tree classifier**

In [6]:
#Initializing the DT classifier

dt = DecisionTreeClassifier(criterion = 'gini', random_state = 50)

#Fitting on the training data

dt.fit(X_train, y_train)

#Testing accuracy on the test data

dt.score(X_test, y_test)

0.9975627589568609
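
Accuracy alone can look very high when one class is rare, which is common for fraud data (the class balance of `fraud_prediction.csv` is not shown here, so this is an assumption). The sketch below uses synthetic imbalanced data, not the fraud dataset, to show how a confusion matrix and positive-class recall can be read alongside accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in (about 5% positives); not the fraud dataset
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

clf = DecisionTreeClassifier(criterion="gini", random_state=50).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)

# Recall on the positive class: share of true positives that were caught
pos_recall = recall_score(y_te, y_pred)

print(cm)
print(f"accuracy: {clf.score(X_te, y_te):.3f}, positive-class recall: {pos_recall:.3f}")
```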

**Hyper-parameter Optimization**

In [7]:
#Creating a grid of different hyper-parameters

grid_params = {
    'max_depth': [1,2,3,4,5,6],
    'min_samples_leaf': [0.02,0.04, 0.06, 0.08]
}

#Building a 10 fold Cross Validated GridSearchCV object

grid_object = GridSearchCV(estimator = dt, param_grid = grid_params, scoring = 'accuracy', cv = 10, n_jobs = -1)

In [8]:
#Fitting the grid to the training data

grid_object.fit(X_train, y_train)

In [9]:
#Extracting the best parameters

grid_object.best_params_

{'max_depth': 1, 'min_samples_leaf': 0.02}

In [10]:
#Extracting the best model

dt = grid_object.best_estimator_
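
After the search, `best_score_` holds the mean cross-validated score of the winning combination, and `cv_results_` holds one entry per combination tried. A small sketch of inspecting both, assuming synthetic data and a reduced grid (both hypothetical) so that it runs quickly:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the fraud dataset
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=50),
    param_grid={"max_depth": [1, 2, 3], "min_samples_leaf": [0.02, 0.04]},
    scoring="accuracy",
    cv=5,
)
search.fit(X_tr, y_tr)

# Mean cross-validated accuracy of the best parameter combination
cv_accuracy = search.best_score_

# Accuracy of the refit best model on the held-out test set
test_accuracy = search.best_estimator_.score(X_te, y_te)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
print(f"CV accuracy: {cv_accuracy:.3f}, test accuracy: {test_accuracy:.3f}")
```

Comparing the cross-validated score with the test score gives a rough check on whether the tuned model generalizes.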

**Visualizing the decision tree**

In [11]:
#Reading in the data

df = pd.read_csv('fraud_prediction.csv')
df = df.drop(['Unnamed: 0'], axis = 1)

#Creating the features 

features = df.drop('isFraud', axis = 1).values
target = df['isFraud'].values

In [12]:
#Initializing the DT classifier

dt = DecisionTreeClassifier(criterion = 'gini', random_state = 50, max_depth= 5)

In [13]:
#Fitting the classifier on the data

dt.fit(features, target)

In [14]:
#Extracting the feature names

feature_names = df.drop('isFraud', axis = 1)

In [15]:
#Imports needed for the visualization

from sklearn import tree
import pydotplus
from IPython.display import Image

#Creating the tree visualization

data = tree.export_graphviz(dt, out_file=None, feature_names= list(feature_names.columns), proportion= True)

graph = pydotplus.graph_from_dot_data(data) 

# Show graph
Image(graph.create_png())
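
If Graphviz or `pydotplus` is not installed, scikit-learn's `export_text` prints the same tree structure as plain text. A minimal sketch, assuming synthetic data and made-up feature names (`f0` to `f3` are hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with four features; not the fraud dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
names = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

# A shallow tree keeps the text output short
clf = DecisionTreeClassifier(criterion="gini", random_state=50, max_depth=2)
clf.fit(X_demo, y_demo)

# Each indented line is a split rule; leaves show the predicted class
tree_text = export_text(clf, feature_names=names)
print(tree_text)
```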

## Random Forest Classifier

In [16]:
#Reading in the dataset

df = pd.read_csv('fraud_prediction.csv')

#Dropping the index

df = df.drop(['Unnamed: 0'], axis = 1)

**Splitting the data into training and test sets**

In [17]:
#Creating the features 

features = df.drop('isFraud', axis = 1).values
target = df['isFraud'].values

In [18]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.3, random_state = 42, 
                                                    stratify = target)

In [19]:
#Initializing a Random Forest Classifier with default parameters

rf_classifier = RandomForestClassifier(random_state = 50)

#Fitting the classifier on the training data

rf_classifier.fit(X_train, y_train)

#Extracting the scores

rf_classifier.score(X_test, y_test)

0.9976846210090178
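
A fitted random forest exposes `feature_importances_`, the impurity-based importance of each feature averaged over the trees. The sketch below shows one way to rank them, assuming synthetic data and placeholder feature names (both hypothetical):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; not the fraud dataset
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
names = [f"feature_{i}" for i in range(6)]  # hypothetical feature names

forest = RandomForestClassifier(n_estimators=100, random_state=50)
forest.fit(X_demo, y_demo)

# Importances are normalized to sum to 1; sort to see the strongest features first
importances = pd.Series(forest.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(importances)
```

For the fraud data, the column names of `df.drop('isFraud', axis = 1)` would play the role of `names`.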

**Hyper-parameter tuning**

In [20]:
#Creating a grid of different hyper-parameters

grid_params = {
    'n_estimators': [300,400,500],
    'max_depth': [1,2,3],
    'min_samples_leaf': [0.05, 0.1, 0.2]
}

#Building a 3 fold Cross-Validated GridSearchCV object

grid_object = GridSearchCV(estimator = rf_classifier, param_grid = grid_params, scoring = 'accuracy', 
                           cv = 3, n_jobs = -1)

In [21]:
#Fitting the grid to the training data

grid_object.fit(X_train, y_train)

In [22]:
#Extracting the best parameters

grid_object.best_params_

{'max_depth': 3, 'min_samples_leaf': 0.05, 'n_estimators': 400}

In [23]:
#Extracting the best model

rf_best = grid_object.best_estimator_

## AdaBoost Classifier

In [24]:
#Reading in the dataset

df = pd.read_csv('fraud_prediction.csv')

#Dropping the index

df = df.drop(['Unnamed: 0'], axis = 1)

**Splitting the data into training & testing sets**

In [25]:
#Creating the features 

features = df.drop('isFraud', axis = 1).values
target = df['isFraud'].values

In [26]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.3, random_state = 42, 
                                                    stratify = target)

**Building the AdaBoost Classifier**

In [27]:
#Initialize a tree (Decision Tree with max depth = 1)

dt = DecisionTreeClassifier(max_depth=1, random_state = 42)

In [28]:
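#Initialize an AdaBoost classifier with the tree as the base estimator
#(passed positionally, because the keyword is base_estimator in older scikit-learn releases and estimator in newer ones)

ada_boost = AdaBoostClassifier(dt, n_estimators=100)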

In [29]:
#Fitting the AdaBoost classifier to the training set

ada_boost.fit(X_train, y_train)



In [30]:
#Extracting the accuracy scores from the classifier

ada_boost.score(X_test, y_test)

0.9978064830611747

**Hyper-parameter tuning**

In [31]:
#Creating a grid of hyper-parameters

grid_params = {
    'n_estimators': [100,200,300]
}

#Building a 3 fold CV GridSearchCV object

grid_object = GridSearchCV(estimator = ada_boost, param_grid = grid_params, scoring = 'accuracy', cv = 3, n_jobs = -1)

In [32]:
#Fitting the grid to the training data

grid_object.fit(X_train, y_train)



In [33]:
#Extracting the best parameters

grid_object.best_params_

{'n_estimators': 300}

In [34]:
#Extracting the best model

ada_best = grid_object.best_estimator_
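
As a lighter-weight complement to a grid over `n_estimators`, `staged_score` reports the accuracy after each boosting round of a single fitted model. A sketch under stated assumptions: synthetic data, a held-out validation split, and the default base learner (a depth-1 tree), none of which come from the fraud dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the fraud dataset
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The default base learner is a decision tree with max_depth=1 (a stump)
booster = AdaBoostClassifier(n_estimators=50, random_state=50)
booster.fit(X_tr, y_tr)

# One validation accuracy per boosting round actually fitted
stage_scores = list(booster.staged_score(X_val, y_val))

# First round (1-based) that reaches the highest validation accuracy
best_round = max(range(len(stage_scores)), key=stage_scores.__getitem__) + 1
print(f"best round: {best_round}, accuracy: {stage_scores[best_round - 1]:.3f}")
```

The round should be chosen on a validation split rather than the final test set, so that the test score stays an unbiased estimate.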

## Decision Tree Regressor 

In [35]:
#Reading in the dataset

df = pd.read_csv('fraud_prediction.csv')

#Dropping the index

df = df.drop(['Unnamed: 0'], axis = 1)

In [36]:
#Creating the features 

features = df.drop('amount', axis = 1).values
target = df['amount'].values

In [37]:
#Splitting the data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.3, random_state = 42)

In [38]:
#Building the decision tree regressor

dt_reg = DecisionTreeRegressor(max_depth = 10, min_samples_leaf = 0.2, random_state= 50)

In [39]:
#Fitting the tree to the training data

dt_reg.fit(X_train, y_train)
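
When `min_samples_leaf` is a float, scikit-learn treats it as a fraction of the training rows. With `0.2`, every leaf must hold at least 20% of the samples, so the tree can have at most five leaves and `max_depth = 10` is never reached. A sketch that checks this on synthetic regression data (hypothetical, not the fraud dataset):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with 500 rows; not the fraud dataset
X_demo, y_demo = make_regression(
    n_samples=500, n_features=5, noise=10.0, random_state=0
)

reg = DecisionTreeRegressor(max_depth=10, min_samples_leaf=0.2, random_state=50)
reg.fit(X_demo, y_demo)

# Structure of the fitted tree
n_leaves = reg.get_n_leaves()
depth = reg.get_depth()

# apply() returns the leaf index of each row; count the rows per leaf
_, leaf_counts = np.unique(reg.apply(X_demo), return_counts=True)
print(f"leaves: {n_leaves}, depth: {depth}, smallest leaf: {leaf_counts.min()} rows")
```

A smaller fraction, or an integer count, would let the tree grow closer to the depth limit.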

**Visualizing the decision tree**

In [40]:
#Extracting the feature names

feature_names = df.drop('amount', axis = 1)

In [41]:
#Imports needed for the visualization

from sklearn import tree
import pydotplus
from IPython.display import Image

#Creating the tree visualization

data = tree.export_graphviz(dt_reg, out_file=None, feature_names= list(feature_names.columns), proportion= True)

graph = pydotplus.graph_from_dot_data(data) 

# Show graph
Image(graph.create_png())

## Random Forest Regressor 

In [42]:
#Reading in the dataset

df = pd.read_csv('fraud_prediction.csv')

#Dropping the index

df = df.drop(['Unnamed: 0'], axis = 1)

In [43]:
#Creating the features 

features = df.drop('amount', axis = 1).values
target = df['amount'].values

In [44]:
#Splitting the data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.3, random_state = 42)

In [45]:
#Initializing a Random Forest Regressor with a depth limit and a minimum leaf size

rf_reg = RandomForestRegressor(max_depth = 10, min_samples_leaf = 0.2, random_state = 50)

#Fitting the Regressor on the training data

rf_reg.fit(X_train, y_train)
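
The regressors in this notebook are fitted but not scored. One common way to evaluate them is R² (returned by `score`) together with the root mean squared error on the test set. A minimal sketch, assuming synthetic regression data rather than the `amount` column:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the fraud dataset
X_demo, y_demo = make_regression(
    n_samples=500, n_features=5, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

reg = RandomForestRegressor(
    n_estimators=100, max_depth=10, min_samples_leaf=0.2, random_state=50
)
reg.fit(X_tr, y_tr)

# R^2 on the test set: 1.0 is a perfect fit, 0.0 matches predicting the mean
r2 = reg.score(X_te, y_te)

# RMSE is in the same units as the target
rmse = float(np.sqrt(mean_squared_error(y_te, reg.predict(X_te))))
print(f"R^2: {r2:.3f}, RMSE: {rmse:.3f}")
```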

## Gradient Boosted Trees Regressor

In [46]:
#Reading in the dataset

df = pd.read_csv('fraud_prediction.csv')

#Dropping the index

df = df.drop(['Unnamed: 0'], axis = 1)

In [47]:
#Creating the features 

features = df.drop('amount', axis = 1).values
target = df['amount'].values

In [48]:
#Splitting the data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.3, random_state = 42)

**Building the Gradient Boosted Regressor**

In [49]:
#Initializing a Gradient Boosted Regressor with explicit hyper-parameters

gb_reg = GradientBoostingRegressor(max_depth = 5, n_estimators = 100, learning_rate = 0.1, random_state = 50)

#Fitting the regressor on the training data

gb_reg.fit(X_train, y_train)
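
Because gradient boosting adds trees one at a time, `staged_predict` can show how the error changes as trees are added, which helps in judging whether `n_estimators = 100` is more or fewer than needed. A sketch on synthetic data with a validation split (both hypothetical, not the fraud dataset):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the fraud dataset
X_demo, y_demo = make_regression(
    n_samples=500, n_features=5, noise=10.0, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

gbr = GradientBoostingRegressor(
    max_depth=5, n_estimators=100, learning_rate=0.1, random_state=50
)
gbr.fit(X_tr, y_tr)

# Validation MSE after each boosting stage (one entry per tree added)
val_mse = np.array(
    [mean_squared_error(y_val, pred) for pred in gbr.staged_predict(X_val)]
)

# Number of trees (1-based) with the lowest validation error
best_n = int(val_mse.argmin()) + 1
print(f"best number of trees: {best_n}, validation MSE: {val_mse[best_n - 1]:.2f}")
```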

## Ensemble Classifier

In [50]:
#Reading in the dataset

df = pd.read_csv('fraud_prediction.csv')

#Dropping the index

df = df.drop(['Unnamed: 0'], axis = 1)

In [51]:
#Creating the features 

features = df.drop('isFraud', axis = 1).values
target = df['isFraud'].values

In [52]:
#Splitting the data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.3, random_state = 42)

**Building the DT & RF classifier to include in the Voting Classifier**

In [53]:
#Initializing the DT classifier

dt = DecisionTreeClassifier(criterion = 'gini', random_state = 50)

#Fitting on the training data

dt.fit(X_train, y_train)

In [54]:
#Initializing a Random Forest Classifier with default parameters

rf_classifier = RandomForestClassifier(random_state = 50)

#Fitting the classifier on the training data

rf_classifier.fit(X_train, y_train)

In [55]:
#Creating a list of models

models = [('Decision Tree', dt), ('Random Forest', rf_classifier)]

In [56]:
#Initialize a voting classifier 

voting_model = VotingClassifier(estimators = models)

#Fitting the model to the training data

voting_model.fit(X_train, y_train)

#Evaluating the accuracy on the test data

voting_model.score(X_test, y_test)

0.9979283451133317
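
`VotingClassifier` uses hard (majority) voting by default and fits its own clones of the listed models, so fitting them beforehand is not required. With only two models, a disagreement is a tie, which scikit-learn resolves in favor of the lower class label. Soft voting averages the predicted probabilities instead. A sketch comparing the two modes, assuming synthetic data rather than the fraud dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the fraud dataset
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Unfitted models are enough: the voting classifier fits clones of them
models = [
    ("Decision Tree", DecisionTreeClassifier(criterion="gini", random_state=50)),
    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=50)),
]

# Hard voting counts predicted labels; soft voting averages predicted probabilities
hard_vote = VotingClassifier(estimators=models, voting="hard").fit(X_tr, y_tr)
soft_vote = VotingClassifier(estimators=models, voting="soft").fit(X_tr, y_tr)

hard_acc = hard_vote.score(X_te, y_te)
soft_acc = soft_vote.score(X_te, y_te)

# predict_proba is only available with soft voting
proba = soft_vote.predict_proba(X_te[:3])
print(f"hard voting accuracy: {hard_acc:.3f}, soft voting accuracy: {soft_acc:.3f}")
print(proba)
```

Which mode scores higher depends on the data, so it is worth comparing both on a validation split.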