
In [0]:
from sklearn.datasets import make_moons

We split the data into training and test sets using **train_test_split**.
**GridSearchCV** evaluates combinations of Decision Tree hyperparameters with cross-validation and keeps the best one.

In [0]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ShuffleSplit

In [0]:
from sklearn.tree import DecisionTreeClassifier

**Classification Report** and **Confusion Matrix** are useful tools for evaluating a classifier: they break performance down per class instead of reporting a single accuracy figure.

In [0]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

In [0]:
import numpy as np


Let's generate 10,000 make_moons data points, adding Gaussian noise with a standard deviation of 0.4.

In [0]:
X,y = make_moons(n_samples=10000,noise=0.4)
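
As a quick sanity check on what `make_moons` returns, the sketch below prints the array shapes and the number of samples per class. The names `X_demo` and `y_demo` and the `random_state=0` seed are assumptions added for illustration; the notebook itself does not fix a seed here.

```python
from collections import Counter

from sklearn.datasets import make_moons

# random_state=0 is an assumption for reproducibility; the notebook call has no seed.
X_demo, y_demo = make_moons(n_samples=10000, noise=0.4, random_state=0)

print(X_demo.shape)              # one row per point, two features per row
print(y_demo.shape)              # one binary label per point
print(Counter(y_demo.tolist()))  # number of samples in each class
```

Because `n_samples` is even, the two moons get the same number of points, so plain accuracy is a reasonable headline metric for this dataset.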

In [0]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)

In [0]:
tree_classifier_model = DecisionTreeClassifier()

In [0]:
param_grid = [
              {"max_leaf_nodes":[5,10,15],"min_samples_split":[2,3,4,5,6]}
            ]

In [0]:
grid_search = GridSearchCV(tree_classifier_model,param_grid,cv=6,scoring="accuracy")
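
To get a feel for how much work this grid search does, the sketch below counts the hyperparameter combinations with `ParameterGrid` and multiplies by the number of folds. The variable names `cv_folds`, `n_candidates` and `total_fits` are illustrative only.

```python
from sklearn.model_selection import ParameterGrid

# Same grid and fold count as in the notebook.
param_grid = [
    {"max_leaf_nodes": [5, 10, 15], "min_samples_split": [2, 3, 4, 5, 6]}
]
cv_folds = 6

# Each combination of values is one candidate model.
n_candidates = len(ParameterGrid(param_grid))

# Every candidate is trained once per cross-validation fold.
total_fits = n_candidates * cv_folds

print(n_candidates, total_fits)
```

With the default `refit=True`, GridSearchCV also trains the best candidate one more time on the whole training set, which is why `grid_search.predict` can be called directly afterwards.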


In [10]:
grid_search.fit(X_train,y_train)

GridSearchCV(cv=6, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=None,
                                              splitter='best'),
             iid='deprecated', n_jobs=None,
             param_grid=[{'max_leaf_nodes': [5, 10, 15],
                     

In [0]:
predictions = grid_search.predict(X_test)

In [12]:
print (confusion_matrix(y_test,predictions))

[[1198  279]
 [ 159 1364]]


In [13]:
print (classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.88      0.81      0.85      1477
           1       0.83      0.90      0.86      1523

    accuracy                           0.85      3000
   macro avg       0.86      0.85      0.85      3000
weighted avg       0.86      0.85      0.85      3000



In [14]:
accuracy_score(y_test,predictions)

0.854
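
The figures in the classification report can be recovered by hand from the confusion matrix printed above. The following is a small sketch of that arithmetic, assuming scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
cm = np.array([[1198, 279],
               [159, 1364]])

correct_per_class = np.diag(cm)

# Precision: of everything predicted as class k, the fraction that really is class k.
precision = correct_per_class / cm.sum(axis=0)

# Recall: of everything that really is class k, the fraction predicted as class k.
recall = correct_per_class / cm.sum(axis=1)

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Accuracy: all correct predictions over all predictions.
accuracy = correct_per_class.sum() / cm.sum()

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2), accuracy)
```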

In [15]:
grid_search.best_params_

{'max_leaf_nodes': 15, 'min_samples_split': 2}

In [16]:
grid_search.best_estimator_

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=15,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

Let's now see what happens if we train many Decision Trees, each on a different small subset of the training set, and evaluate each one on the test set. Will we get the same result, or a better one?

In [20]:
len(X_train)

7000

In [0]:
n_trees = 1000
n_instances = 100

mini_sets = [] # list that holds the actual mini training sets
rs = ShuffleSplit(n_splits=n_trees, test_size=len(X_train)-n_instances , random_state=42)
for mini_train_index, mini_test_index in rs.split(X_train):
  X_mini_train = X_train[mini_train_index]
  y_mini_train = y_train[mini_train_index]
  #append as tuple so that the X_mini_train and y_mini_train data integrity is preserved
  mini_sets.append((X_mini_train,y_mini_train)) 
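
Passing `test_size=len(X_train)-n_instances` to ShuffleSplit is what makes each "train" side exactly `n_instances` rows long. A minimal sketch of that behaviour, using a placeholder array of zeros in place of the real training data and only 3 splits (both are assumptions made to keep the example small):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_rows = 7000        # len(X_train) in the notebook
n_instances = 100    # rows wanted in each mini training set

# Placeholder data: only the number of rows matters for the split sizes.
X_placeholder = np.zeros((n_rows, 2))

splitter = ShuffleSplit(n_splits=3, test_size=n_rows - n_instances, random_state=42)

# With train_size left unset, the train side is the complement of the test side.
split_sizes = [
    (len(train_index), len(test_index))
    for train_index, test_index in splitter.split(X_placeholder)
]
print(split_sizes)
```

Each split is drawn independently, so the mini training sets may share rows with one another.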
  



In [0]:
mini_accuracy_scores=[]
for X_mini_train,y_mini_train in mini_sets:
  #the hyperparameters are the best parameters found earlier by the grid search
  mini_data_tree_classifier_model = DecisionTreeClassifier(max_leaf_nodes=15,min_samples_split=2)
  mini_data_tree_classifier_model.fit(X_mini_train,y_mini_train)
  #score each tree on its own against the full test set
  mini_predictions = mini_data_tree_classifier_model.predict(X_test)
  mini_accuracy_scores.append(accuracy_score(y_test,mini_predictions))


In [43]:
np.mean(mini_accuracy_scores)

0.7983536666666666

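Across the 1,000 trees we achieved a mean accuracy of only about 80% on the test dataset, worse than the 85.4% of the single DecisionTree model we used earlier. This happened because each of the 1,000 trees was trained on only 100 training instances. Note that this figure is the average of the individual trees' accuracies; the trees' predictions are not combined in any way.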
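
The mean computed above scores each tree separately. One possible way to combine the trees into a single predictor is a majority vote over their predictions; the sketch below illustrates that idea and is not part of the original notebook. The `random_state` values, the reduced `n_trees = 200`, and the names `all_predictions`, `majority_predictions` and `ensemble_accuracy` are assumptions chosen to keep the example quick and repeatable.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Seeds are assumptions; the notebook's make_moons call is unseeded.
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

n_trees = 200       # fewer than the notebook's 1000, to keep the sketch fast
n_instances = 100
splitter = ShuffleSplit(
    n_splits=n_trees, test_size=len(X_train) - n_instances, random_state=42
)

# One row of test-set predictions per tree.
all_predictions = np.empty((n_trees, len(X_test)), dtype=int)
for tree_index, (mini_train_index, _) in enumerate(splitter.split(X_train)):
    tree = DecisionTreeClassifier(
        max_leaf_nodes=15, min_samples_split=2, random_state=42
    )
    tree.fit(X_train[mini_train_index], y_train[mini_train_index])
    all_predictions[tree_index] = tree.predict(X_test)

# Labels are 0/1, so a column mean above 0.5 means most trees voted for class 1.
# An exact tie (possible with an even number of trees) falls to class 0.
majority_predictions = (all_predictions.mean(axis=0) > 0.5).astype(int)

mean_individual_accuracy = np.mean(
    [accuracy_score(y_test, tree_predictions) for tree_predictions in all_predictions]
)
ensemble_accuracy = accuracy_score(y_test, majority_predictions)

print(mean_individual_accuracy, ensemble_accuracy)
```

Comparing the two printed numbers shows whether voting helps for a given run; the values depend on the seeds and on the number of trees.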