# Decision Tree
#### Decision Trees are versatile Machine Learning algorithms that can perform both classification and regression tasks. They are very powerful algorithms, capable of fitting complex datasets.
We can think of this model as breaking down our data by making decisions based on asking a series of questions. Based on the features in our training set, the decision tree model learns a series of questions to infer the class labels of the samples.

Using the decision tree algorithm, we start at the tree root and split the data on the feature that results in the largest *information gain (IG)*. In an iterative process, we can then repeat this splitting procedure at each child node until the leaves are pure. This means that the samples at each leaf all belong to the same class.

## Maximizing information gain
In order to split the nodes at the most informative features, we need to define an objective function that we want to optimize via the tree learning algorithm. Here, our objective function is to maximize the information gain at each split, which we define as follows:

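### $IG\left( \mathbf{D}_p,f\right) =  I\left(\mathbf{D}_p\right)- \sum_{j=1}^m \frac{\mathbf{N}_j}{\mathbf{N}_p}I\left(\mathbf{D}_j\right)$

Here, $f$ is the feature used for the split, $\mathbf{D}_p$ and $\mathbf{D}_j$ are the datasets of the parent node and the $j$-th child node, $I$ is the impurity measure, $\mathbf{N}_p$ is the number of samples at the parent node, and $\mathbf{N}_j$ is the number of samples at the $j$-th child node.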

As we can see, the information gain is simply the difference between the _impurity_ of the parent node and the weighted sum of the child node impurities: the lower the impurity of the child nodes, the larger the information gain.
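
Most libraries, scikit-learn included, grow binary trees, so each parent node is split into just two children. Writing those children as $\mathbf{D}_{left}$ and $\mathbf{D}_{right}$ (names introduced here purely for illustration), the sum over $m = 2$ child nodes reduces to:

### $IG\left( \mathbf{D}_p,f\right) = I\left(\mathbf{D}_p\right) - \frac{\mathbf{N}_{left}}{\mathbf{N}_p}I\left(\mathbf{D}_{left}\right) - \frac{\mathbf{N}_{right}}{\mathbf{N}_p}I\left(\mathbf{D}_{right}\right)$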

Now, the three impurity measures or splitting criteria that are commonly used in binary decision trees are Gini index ($I_G$), entropy ($I_H$), and the classification error ($I_E$).

## Entropy
### $I_H(t) = - \sum_{i=1}^c p{(i\:|\:t)}\:\log_2\:p{(i\:|\:t)}$
Here, $p{(i\:|\:t)}$ is the proportion of the samples that belong to class __i__ at a particular node __t__, and $c$ is the number of classes. The entropy is therefore 0 if all samples at a node belong to the same class, and the entropy is maximal if we have a uniform class distribution.

For example, in a binary class setting, the entropy is 0 if $p{(i=1|t)}=1$ or $p{(i=1|t)}=0$, because in both cases every sample at the node belongs to a single class. If the classes are distributed uniformly with $p{(i=1|t)}=0.5$ and $p{(i=0|t)}=0.5$, the entropy is 1. Therefore, we can say that the entropy criterion attempts to maximize the mutual information in the tree.
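
To make the binary example concrete, here is a minimal sketch of the entropy formula in plain Python. The function name `entropy` and the example proportions are our own choices for illustration; terms with a proportion of 0 are skipped, following the usual convention that $0 \cdot \log_2 0 = 0$.

```python
import math


def entropy(proportions):
    """Entropy I_H(t) of a node, given its class proportions p(i|t)."""
    # Skip p = 0 terms: log2(0) is undefined, and 0 * log2(0) is taken as 0.
    return sum(-p * math.log2(p) for p in proportions if p > 0)


pure_node = entropy([1.0, 0.0])    # every sample belongs to one class
mixed_node = entropy([0.5, 0.5])   # uniform binary class distribution
skewed_node = entropy([0.9, 0.1])  # mostly one class

print(pure_node, mixed_node, round(skewed_node, 3))  # 0.0 1.0 0.469
```

The skewed node sits between the two extremes: the closer a node gets to a single class, the lower its entropy.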

## Gini index
The Gini index can be understood as a criterion to minimize the probability of misclassification:
### $I_G(t) =1 - \sum_{i=1}^c p{(i\:|\:t)}^2$
Similar to entropy, the Gini index is maximal if the classes are perfectly mixed.
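
As a quick check of this behaviour, the sketch below evaluates the Gini formula for a pure node and for perfectly mixed nodes. The helper name `gini` and the proportions are illustrative assumptions. For $c$ equally likely classes the formula gives $1 - 1/c$, so the maximum depends on the number of classes.

```python
def gini(proportions):
    """Gini index I_G(t) of a node, given its class proportions p(i|t)."""
    return 1.0 - sum(p ** 2 for p in proportions)


pure_node = gini([1.0, 0.0])              # all samples in one class
mixed_binary = gini([0.5, 0.5])           # two classes, perfectly mixed
mixed_three = gini([1 / 3, 1 / 3, 1 / 3])  # three classes, perfectly mixed

print(pure_node, mixed_binary, round(mixed_three, 3))  # 0.0 0.5 0.667
```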

## Classification error
### $I_E(t) = 1 - \max{\{p(i\:|\:t)\}}$
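
To see how the three criteria behave inside the information gain formula, the sketch below compares two candidate splits of the same parent node. The class counts (a parent with 40 samples of each class, and the two splits A and B) are made-up numbers chosen only for illustration, and all function names are our own.

```python
import math


def proportions(counts):
    """Turn per-class sample counts into class proportions p(i|t)."""
    total = sum(counts)
    return [n / total for n in counts]


def entropy(counts):
    return sum(-p * math.log2(p) for p in proportions(counts) if p > 0)


def gini(counts):
    return 1.0 - sum(p ** 2 for p in proportions(counts))


def classification_error(counts):
    return 1.0 - max(proportions(counts))


def information_gain(parent, children, impurity):
    """Parent impurity minus the sample-weighted impurity of the children."""
    n_parent = sum(parent)
    weighted = sum(sum(child) / n_parent * impurity(child) for child in children)
    return impurity(parent) - weighted


# Hypothetical class counts [class 0, class 1] at each node.
parent = [40, 40]
split_a = [[30, 10], [10, 30]]
split_b = [[20, 40], [20, 0]]

gains = {}
for name, impurity in [("error", classification_error),
                       ("gini", gini),
                       ("entropy", entropy)]:
    gains[name] = (information_gain(parent, split_a, impurity),
                   information_gain(parent, split_b, impurity))
    print(f"{name:8s} IG(A) = {gains[name][0]:.3f}  IG(B) = {gains[name][1]:.3f}")
# error    IG(A) = 0.250  IG(B) = 0.250
# gini     IG(A) = 0.125  IG(B) = 0.167
# entropy  IG(A) = 0.189  IG(B) = 0.311
```

With these counts, the classification error cannot tell the two splits apart, while both the Gini index and entropy prefer split B, which produces one pure child node. This sensitivity to changes in the class proportions is a common argument for growing trees with Gini or entropy rather than the classification error.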

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [9]:
# df = pd.read_csv('Classified Data', index_col=0)

## Data Exploration

## Data Cleaning

## Building the Model


In [3]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
X = iris.data[:, 2:] # petal length and width
y = iris.target
tree_clf = DecisionTreeClassifier(max_depth=2,criterion='entropy')
tree_clf.fit(X, y)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [6]:
from sklearn.tree import export_graphviz
export_graphviz(
    tree_clf,
    out_file="iris_dtree.dot",
    feature_names=iris.feature_names[2:],
    class_names=iris.target_names,
    rounded=True,
    filled=True
)

Converting the .dot file to a PNG is done with the `dot` command-line tool from the graphviz package, using the following command:

**`dot -Tpng iris_dtree.dot -o iris_dtree.png`**
![img](iris_dtree.png)
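
If Graphviz is not installed, one alternative is to print the learned rules as plain text with `export_text` from `sklearn.tree` (available in recent scikit-learn releases). The sketch below refits the same kind of tree so that it runs on its own; the `random_state=42` value and the variable name `tree_rules` are our own additions, used only to make the result reproducible.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

# random_state is an assumption added here for reproducibility.
tree_clf = DecisionTreeClassifier(max_depth=2, criterion="entropy",
                                  random_state=42)
tree_clf.fit(X, y)

# One line per node: split rules for internal nodes, predicted class for leaves.
tree_rules = export_text(tree_clf, feature_names=list(iris.feature_names[2:]))
print(tree_rules)
```

Each indented line corresponds to one node of the tree, so the text can be read top to bottom in the same way as the rendered image.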

### Creating features and Labels

### Preprocessing/Scaling
One of the many qualities of Decision Trees is that they require very little data preparation. In particular, they don’t require feature scaling or centering at all.
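
A small experiment can illustrate this. The sketch below fits one tree on the raw petal features and one on standardized features, then compares their predictions on the training data. The use of `StandardScaler`, the `random_state=42` value, and the variable names are assumptions made for this illustration. Because a split only depends on the ordering of the feature values, which standardization preserves, we would expect both trees to partition the samples in the same way.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

# Standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Same hyperparameters and seed for both trees, so only the scaling differs.
raw_tree = DecisionTreeClassifier(max_depth=2, criterion="entropy",
                                  random_state=42).fit(X, y)
scaled_tree = DecisionTreeClassifier(max_depth=2, criterion="entropy",
                                     random_state=42).fit(X_scaled, y)

same_predictions = np.array_equal(raw_tree.predict(X),
                                  scaled_tree.predict(X_scaled))
print("Identical predictions:", same_predictions)
```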

### Splitting the dataset
When experimenting with any learning algorithm, it is important not to evaluate an estimator on the same data that was used to fit it, because that score says nothing about how the estimator performs on new data. This is why datasets are usually split into training and test sets.
#### A random permutation, to split the data randomly
```python
np.random.seed(42)
indices = np.random.permutation(len(X))
X_train = X[indices[:-20]]
y_train = y[indices[:-20]]
X_test = X[indices[-20:]]
y_test = y[indices[-20:]]
```
#### But we will use the `train_test_split` function from `sklearn.model_selection`

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
# X_train, X_test, y_train, y_test = train_test_split(
#    X, y, test_size=0.3, random_state=101)
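
As a runnable sketch of the call above, applied to the iris petal features used earlier: the `test_size` and `random_state` values are taken from the commented-out cell, while `stratify=y` is our own addition, an assumption that keeps the class proportions the same in both subsets.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

# Hold out 30% of the samples for testing; stratify=y is an added assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101, stratify=y)

print(X_train.shape, X_test.shape)  # (105, 2) (45, 2)
```

With 150 samples, a 30% test split leaves 105 samples for training and 45 for testing.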

### Importing the Model

### Create and fit a Classifier

## Predictions
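
A minimal sketch of the prediction step, assuming the same iris petal features, a 70/30 split, and a depth-2 entropy tree as in the cells above (the `random_state` values are only there for reproducibility). `predict` returns one class label per sample, and `predict_proba` returns the class proportions of the leaf that each sample falls into.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

# Fit on the training set only, so the test set stays unseen.
tree_clf = DecisionTreeClassifier(max_depth=2, criterion="entropy",
                                  random_state=42)
tree_clf.fit(X_train, y_train)

predictions = tree_clf.predict(X_test)          # predicted class indices
probabilities = tree_clf.predict_proba(X_test)  # one row of class proportions per sample

print(iris.target_names[predictions[:5]])
print(probabilities[:5])
```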

## Evaluation

In [12]:
from sklearn.metrics import classification_report, confusion_matrix
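
One way to use these two functions is sketched below, under the same assumptions as before: iris petal features, a 70/30 split, and a depth-2 entropy tree, with `random_state` values chosen only for reproducibility. The confusion matrix counts test samples by true class (rows) and predicted class (columns), and the classification report summarizes precision, recall, and F1-score per class.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

tree_clf = DecisionTreeClassifier(max_depth=2, criterion="entropy",
                                  random_state=42)
tree_clf.fit(X_train, y_train)
predictions = tree_clf.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, predictions, labels=[0, 1, 2])
print(cm)

print(classification_report(y_test, predictions, labels=[0, 1, 2],
                            target_names=list(iris.target_names)))
```

Off-diagonal entries in the confusion matrix show which classes the tree confuses with each other.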