# Decision Tree

- A decision tree splits the data set so that each resulting subset is as uniform as possible.
- Two representative measures of uniformity are the **information gain index**, which is based on entropy, and the **Gini coefficient**.

- **Entropy** measures the degree of disorder in a data set. If many different values are mixed together, the entropy is high; if the values are mostly the same, the entropy is low. The **information gain index** is 1 - entropy, so more uniform data gives a higher gain. The decision tree splits on the attribute with the highest information gain.
- The **Gini coefficient** is 0 when the data are perfectly uniform and moves toward 1 as the data become more uneven.
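The two measures above can be computed by hand on a small list of labels. The sketch below is illustrative only: the helper names `entropy`, `gini`, and `information_gain` are hypothetical, and the information gain is computed in the commonly used form (parent entropy minus the weighted entropy of the child subsets), which is an assumption that goes beyond the simplified "1 - entropy" description above.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity: 0 for a pure set, larger for a more mixed set."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


pure = ["a", "a", "a", "a"]    # every label is the same
mixed = ["a", "a", "b", "b"]   # two classes, evenly mixed

print(entropy(pure), gini(pure))
print(entropy(mixed), gini(mixed))
# Splitting the mixed set into two pure halves removes all of the entropy
print(information_gain(mixed, [["a", "a"], ["b", "b"]]))
```

With two classes the Gini value peaks at 0.5; it only approaches 1 when the data are spread over many classes.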

## Decision Tree Characteristics

### Advantages
- The algorithm is simple and intuitive because it rests on a single rule: "uniformity."
- Pre-processing such as feature scaling and normalization is not required, except in special cases, because the splits depend only on uniformity.
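One way to check the claim about pre-processing is to train the same tree on raw and on standardized features and compare the test accuracy. This is a minimal sketch, assuming the iris data and the same split parameters used later in this notebook; the use of `StandardScaler` and the variable names are choices made for this illustration. The two scores would be expected to match or be very close, although tiny floating-point differences in the split thresholds are possible.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=11)

# Tree trained on the raw, unscaled features
raw_clf = DecisionTreeClassifier(random_state=156).fit(X_train, y_train)
acc_raw = accuracy_score(y_test, raw_clf.predict(X_test))

# Tree trained on standardized features (scaler fitted on the training set only)
scaler = StandardScaler().fit(X_train)
scaled_clf = DecisionTreeClassifier(random_state=156).fit(
    scaler.transform(X_train), y_train)
acc_scaled = accuracy_score(y_test, scaled_clf.predict(scaler.transform(X_test)))

print(f"raw: {acc_raw:.3f}, standardized: {acc_scaled:.3f}")
```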

### Disadvantages
- Decision trees are prone to overfitting, which lowers accuracy on unseen data, so the size of the tree should be limited in advance.
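Limiting the tree size can be sketched with the `max_depth` and `min_samples_leaf` parameters of `DecisionTreeClassifier` (`min_samples_split` is another option). The specific values below (`max_depth=3`, `min_samples_leaf=4`) are arbitrary choices for illustration, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=11)

# Unrestricted tree: keeps splitting until no further split is possible
full_clf = DecisionTreeClassifier(random_state=156).fit(X_train, y_train)

# Restricted tree: at most 3 levels deep, at least 4 samples in every leaf
limited_clf = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=4, random_state=156).fit(X_train, y_train)

for name, clf in [("full", full_clf), ("limited", limited_clf)]:
    print(name,
          "depth:", clf.get_depth(),
          "leaves:", clf.get_n_leaves(),
          "train acc:", round(clf.score(X_train, y_train), 3),
          "test acc:", round(clf.score(X_test, y_test), 3))
```

A large gap between training and test accuracy for the unrestricted tree would be a sign of overfitting; on a small, clean data set such as iris the gap may be minor.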

In [10]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

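# Create the DecisionTreeClassifier with a fixed seed for reproducibility
dt_clf = DecisionTreeClassifier(random_state=156)

# Load the iris data and hold out 20% of it as a test set
iris_data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris_data.data, iris_data.target,
                                                    test_size=0.2, random_state=11)

# Train the classifier on the training set
dt_clf.fit(X_train, y_train)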

DecisionTreeClassifier(random_state=156)

In [11]:
from sklearn.tree import export_graphviz

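# Write the trained tree to tree.dot in Graphviz DOT format;
# impurity=True shows each node's impurity, filled=True colors nodes by class
export_graphviz(dt_clf, out_file="tree.dot", class_names=iris_data.target_names,
                feature_names=iris_data.feature_names, impurity=True, filled=True)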

In [12]:
import graphviz

# Read the exported DOT file and render it inline in the notebook
with open("tree.dot") as f:
    dot_graph = f.read()

graphviz.Source(dot_graph)

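ModuleNotFoundError: No module named 'graphviz'

This error means the `graphviz` Python package is not installed in the current environment. Install it with `pip install graphviz`; rendering also requires the Graphviz system binaries.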
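When the `graphviz` module is unavailable, as in the error above, the tree can still be inspected as plain text. The sketch below uses `sklearn.tree.export_text`, which is a substitute for the Graphviz rendering rather than the approach used in this notebook; it retrains the same classifier so that the cell is self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris_data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_data.data, iris_data.target, test_size=0.2, random_state=11)

dt_clf = DecisionTreeClassifier(random_state=156)
dt_clf.fit(X_train, y_train)

# Render the learned split rules as indented text (no Graphviz required)
rules = export_text(dt_clf, feature_names=list(iris_data.feature_names))
print(rules)
```

Each indented line is a split condition on one feature, and each `class:` line is a leaf with the predicted class index.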