### Decision Trees - Synthetic dataset

First we import the libraries we will need. The first code cell also activates the *inline* mode for the graphics generated by *matplotlib*. The line that seeds NumPy's random generator is left commented out; the cells below are made reproducible through their `random_state` arguments instead.

In [None]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, export_graphviz
#np.random.seed(19)

Now we will create a synthetic classification problem. The goal is to understand the main concepts on this problem before we analyze real datasets. We will use the "make_blobs" function, which generates data from a specified number of blobs. When the number of blobs is given as an integer, as it is here, the function chooses the blob centers at random.

The parameters we can play with are commented here:

In [None]:
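Nblobs_class0 = 3 # number of blobs in class 0
Nblobs_class1 = 3 # number of blobs in class 1
nsamples_class0 = 300 # number of points to be generated in class 0
nsamples_class1 = 300 # number of points to be generated in class 1
nattributes = 2 # number of attributes (features) per point
blob_width = 5 # standard deviation of each blob (passed as cluster_std)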

class_names = ['class 0', 'class 1']
attributes_names = ['attribute 0', 'attribute 1']

X_class0, _ = make_blobs(random_state = 21,# changing this parameter will lead to different points
                         n_samples = nsamples_class0,
                         n_features = nattributes,
                         centers = Nblobs_class0, # number of blobs (clouds)
                         cluster_std = blob_width)

X_class1, _ = make_blobs(random_state = 23,# changing this parameter will lead to different points
                         n_samples = nsamples_class1,
                         n_features = nattributes,
                         centers = Nblobs_class1,  # number of blobs (clouds)
                         cluster_std = blob_width)

X = np.vstack((X_class0, X_class1))
y = np.zeros(nsamples_class0 + nsamples_class1)
y[nsamples_class0:] = 1

plt.figure(figsize=(7,5))
plt.scatter(X[y==0,0], X[y==0,1], color = 'r', label=class_names[0])
plt.scatter(X[y==1,0], X[y==1,1], color = 'b', label=class_names[1])
plt.grid()
plt.xlabel(attributes_names[0])
plt.ylabel(attributes_names[1])
plt.title('synthetic data for classification')
plt.legend(loc=2)
plt.show()


Now we will train a <a href="http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier">
DecisionTreeClassifier</a>. The most important arguments of the DecisionTreeClassifier constructor are the following:

- **criterion:** criterion used to split the tree nodes. It can be 'gini' (the default) or 'entropy' (the latter corresponds to information gain).

- **max_depth:** maximum depth of the decision tree.
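
To make the two criteria concrete, here is a minimal sketch that computes the Gini impurity and the entropy of a node from its class proportions. The helper names `gini` and `entropy` are introduced here for illustration only; they are not part of scikit-learn.

```python
import numpy as np

def gini(p):
    """Gini impurity of a vector of class proportions p (summing to 1)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy, in bits, of a vector of class proportions p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # terms with p = 0 contribute nothing to the sum
    return np.sum(p * np.log2(1.0 / p))

# Compare a balanced node, an unbalanced node and a pure node
for p in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(p, "gini = %.3f" % gini(p), "entropy = %.3f" % entropy(p))
```

Both measures are zero for a pure node and reach their maximum for a 50/50 node (0.5 for Gini, 1 bit for entropy), so a good split is one whose child nodes have lower impurity than the parent.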

The examples that reach a tree node are used to compute the statistics
that measure the quality of the candidate splits at that node. They are
also used to estimate the class to be predicted if the node is not split
any further. The number of examples at a node should therefore be large
enough for these statistics to be reliable, which motivates the following
requirements on the minimum number of examples:

- **min_samples_split:** minimum number of examples a tree node must contain in order to be split.

- **min_samples_leaf:** minimum number of examples required in a leaf (classification) node.
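
As a rough sketch of how these four arguments fit together, the snippet below fits a tree on data generated in the same way as in the cell above. The hyperparameter values and the `demo_`-style variable names are assumptions chosen for illustration, not the settings used to build the tree loaded later in this notebook.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Rebuild a two-class problem in the same way as the data cell above
X0, _ = make_blobs(n_samples=300, n_features=2, centers=3,
                   cluster_std=5, random_state=21)
X1, _ = make_blobs(n_samples=300, n_features=2, centers=3,
                   cluster_std=5, random_state=23)
X_demo = np.vstack((X0, X1))
y_demo = np.r_[np.zeros(300), np.ones(300)]

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo,
                                      test_size=0.3, random_state=5)

# Illustrative (assumed) hyperparameter values
demo_clf = DecisionTreeClassifier(criterion='entropy',
                                  max_depth=4,
                                  min_samples_split=20,
                                  min_samples_leaf=10,
                                  random_state=0)
demo_clf.fit(Xtr, ytr)

print("depth:", demo_clf.get_depth(), "leaves:", demo_clf.get_n_leaves())
print("train score: %.3f  test score: %.3f"
      % (demo_clf.score(Xtr, ytr), demo_clf.score(Xte, yte)))
```

Increasing `max_depth` or lowering the two minimum-size arguments lets the tree grow more leaves, which typically raises the training score and, beyond some point, may lower the test score.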


First we make a training set/test set partition of the dataset in order to properly validate the model. This allows us to measure the predictive quality of the model by means of the accuracy score and the confusion matrix.

Try different parameters and answer the questions posed at the end.

In [None]:
from sklearn.model_selection import train_test_split

# training/test split
testsize = 0.3 # fraction of the data held out for testing, strictly between 0 and 1 (0.3 = 30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=testsize, random_state=5)

## Loading the constructed tree

In [None]:
import pickle
with open("tree.b", 'rb') as f:
    clf = pickle.load(f)
    rules = pickle.load(f)

The next cells measure the predictive quality of the loaded decision tree and then display its structure:

In [None]:
# Predictive quality of the model

print("Score training = %f" % (clf.score(X_train, y_train)))
print("Score test = %f" % (clf.score(X_test, y_test)))

from sklearn.metrics import confusion_matrix

print("\nConfusion matrix in test:\n")
print(confusion_matrix(y_test, clf.predict(X_test))) # row: real class; column: predicted class

In [None]:
from graphviz import Source
from libreria_aux_arboles import tree_to_code, tree_to_pseudo
tree_to_code(clf, attributes_names)

#Source( export_graphviz(clf, out_file=None,
#                        feature_names=attributes_names,
#                        class_names=class_names,
#                        filled=True, rounded=True,
#                        special_characters=True))
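
If neither the helper module nor graphviz is available, scikit-learn's own `export_text` function offers a plain-text view of a fitted tree. The sketch below is an assumption-laden stand-in: it fits a small throwaway tree (`demo_tree`, a name introduced here) so that it runs on its own; in this notebook one would pass `clf` and `attributes_names` instead.

```python
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, export_text

# Small stand-in problem so the example is self-contained
X_demo, y_demo = make_blobs(n_samples=200, n_features=2,
                            centers=2, random_state=0)
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(X_demo, y_demo)

# One line per node: the split condition, or the predicted class at a leaf
report = export_text(demo_tree,
                     feature_names=['attribute 0', 'attribute 1'])
print(report)
```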

In [None]:
rules

In [None]:
# same rules, sorted by decreasing probability of the target class
target_class = 'class 1'
rules.sort(key=lambda e: e[1][class_names.index(target_class)][2], reverse=True)

In [None]:
rules