**CS596 - Machine Learning**
<br>
Date: **19 October 2020**


Title: **Seminar 5 - Part A**
<br>
Speaker: **Dr. Shota Tsiskaridze**
<br>
Teaching Assistant: **Levan Sanadiradze**

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

<h2 align="center">Decision Tree Classifier</h2>


- **Diabetes** is a chronic condition in which the **body does not produce enough insulin or develops a resistance to it**. Insulin is a hormone that lets cells absorb glucose from the blood.


- **Diabetes affects many people worldwide** and is normally divided into **Type 1** and **Type 2** diabetes, which have **different characteristics**.


- We **are going to build a model** on the **PIMA Indian Diabetes dataset** to **predict whether a given patient is at risk of developing diabetes**.

In [None]:
# Import google colab library for loading dataset files
from google.colab import files
uploaded = files.upload()

In [None]:
# Define column names
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

# Load the CSV dataset
import pandas as pd
df = pd.read_csv("diabetes.csv", header=None, names=col_names)

In [None]:
df.head()

In [None]:
df.info()

In [None]:
# Feature Selection
feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']

# Define feature variable
X = df[feature_cols]

# Define target variable
y = df.label

In [None]:
# Split dataset into training set and test set
from sklearn.model_selection import train_test_split # Import train_test_split function
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [None]:
# Building decision tree model
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
model = DecisionTreeClassifier()

# Train decision tree model
model = model.fit(X_train,y_train)

# Predict the response for test dataset
y_pred = model.predict(X_test)

In [None]:
# Evaluating Model Accuracy
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

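- Thus, we get a classification rate of 67.53%, which is a **reasonable starting accuracy**. The exact value can vary slightly between runs because `DecisionTreeClassifier()` is created without a `random_state`.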


- However, **we can improve** this **accuracy** by **tuning the parameters** in the Decision Tree Algorithm.


- We can use scikit-learn's **export_graphviz** function to display the tree within a Jupyter notebook.


- To plot the tree, you also need to **install graphviz** and **pydotplus**.

In [None]:
%pip install graphviz

In [None]:
%pip install pydotplus

In [None]:
# Visualizing Decision Trees
from io import StringIO  # sklearn.externals.six was removed in recent scikit-learn releases
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus

dot_data = StringIO()
export_graphviz(model, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True,feature_names = feature_cols,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png('diabetes.png')
Image(graph.create_png())

- In the **decision tree chart**, each internal node has a **decision rule** that splits the data. 


- **Gini** refers to the **Gini index** (Gini impurity), which measures the impurity of the node.


- We say that a **node is pure** when **all of its records belong to the same class**. Pure nodes are not split any further and become **leaf nodes**.


- Here, the resulting **tree is unpruned**. Such a large tree is hard to explain and hard to read. Let's **optimize** it by **pruning**.
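
The impurity value shown in each node of the chart can be reproduced by hand. The following is a minimal sketch using only the standard library; the helper names `gini_impurity` and `entropy` are illustrative and are not part of scikit-learn. Entropy is included because it is the other split criterion scikit-learn's `DecisionTreeClassifier` accepts.

```python
from collections import Counter
from math import log2

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def entropy(labels):
    """Shannon entropy (in bits) of the class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

pure_node = [1, 1, 1, 1]    # all records in one class
mixed_node = [0, 0, 1, 1]   # evenly split between two classes

print(gini_impurity(pure_node))   # 0.0
print(gini_impurity(mixed_node))  # 0.5
print(entropy(mixed_node))        # 1.0
```

A pure node has impurity 0, while an evenly split two-class node reaches the maximum Gini value of 0.5.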

<h2 align="center">Optimizing Decision Tree Performance</h2>

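- In scikit-learn, a decision tree classifier is most simply **optimized** by **pre-pruning**, that is, by limiting its growth during training. Versions 0.22 and later also support cost-complexity post-pruning through the `ccp_alpha` parameter.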


- **Maximum depth** of the tree **can be used** as a **control variable** for pre-pruning. 


- Let's build and plot a decision tree on the same data with `max_depth = 3` and the split `criterion = "entropy"`:

In [None]:
# Build decision tree model
model = DecisionTreeClassifier(criterion="entropy", max_depth=3)

# Train decision tree model
model = model.fit(X_train,y_train)

# Predict the response for test dataset
y_pred = model.predict(X_test)

# Evaluate Model Accuracy
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

- The **classification rate** increased to **77.05%**, which is **better accuracy** than the previous model.
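
Whether `max_depth = 3` is a good choice can be checked by trying several depths. The following is a minimal sketch, assuming a synthetic dataset from `make_classification` as a stand-in for the diabetes features (the real `diabetes.csv` is not loaded here). The sample size, the depth range, and `random_state=0` are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Synthetic stand-in data (assumption): 7 features, binary label
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# Fit one tree per depth and record its test accuracy
depth_scores = {}
for depth in range(1, 11):
    clf = DecisionTreeClassifier(
        criterion="entropy", max_depth=depth, random_state=0
    )
    clf.fit(X_tr, y_tr)
    depth_scores[depth] = metrics.accuracy_score(y_te, clf.predict(X_te))

for depth, score in depth_scores.items():
    print(f"max_depth={depth:2d}  test accuracy={score:.4f}")

best_depth = max(depth_scores, key=depth_scores.get)
print("Best depth on this split:", best_depth)
```

Choosing the depth on the test set, as done here for brevity, gives an optimistic estimate; a separate validation set or cross-validation would be preferable in practice.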

In [None]:
# Visualizing Decision Trees
from io import StringIO  # sklearn.externals.six was removed in recent scikit-learn releases
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus

dot_data = StringIO()
export_graphviz(model, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True, feature_names = feature_cols,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png('diabetes.png')
Image(graph.create_png())

- This **pruned model** is less complex, more explainable, and **easier to understand** than the previous decision tree plot.
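
A tree this shallow is small enough to read as plain-text rules, which also avoids the graphviz dependency. The following is a minimal sketch using scikit-learn's `export_text`, assuming synthetic data from `make_classification` in place of the diabetes dataset. The column names are reused from the notebook only as labels, so the printed thresholds have no medical meaning.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Feature names borrowed from the notebook, attached to synthetic columns (assumption)
feature_names = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']
X_demo, y_demo = make_classification(
    n_samples=300, n_features=len(feature_names), n_informative=4, random_state=0
)

# Same pre-pruning settings as the model above
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X_demo, y_demo)

# Render the fitted tree as indented if/else rules
rules = export_text(clf, feature_names=feature_names)
print(rules)
```

Each indented line is one decision rule, and each `class:` line is a leaf.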

<h2 align="center">Conclusions</h2>

- Advantages:
  - Decision trees are **easy to interpret** and **visualize**.
  - They can easily **capture non-linear patterns**.
  - They require **less data preprocessing** from the user; for example, there is no need to normalize columns.
  - They can be **used for feature engineering**, such as **predicting missing values**, and are suitable for variable selection.
  - Decision trees **make no assumptions about the data distribution** because the algorithm is non-parametric.
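
The claim that columns need not be normalized can be checked directly: a tree splits on thresholds, so rescaling a feature moves the thresholds but should not change which samples fall on each side. The following is a minimal sketch, assuming synthetic data from `make_classification` and an arbitrary `random_state=0`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=0
)

# Standardize every column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_demo)

# Train one tree on the raw features and one on the standardized features
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y_demo)

# Fraction of samples on which the two trees agree
agreement = np.mean(tree_raw.predict(X_demo) == tree_scaled.predict(X_scaled))
print("Agreement between raw and scaled trees:", agreement)
```

The two trees are expected to agree on essentially every sample, up to floating-point edge cases in the threshold search.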
  

- Disadvantages:
  - They are **sensitive to noisy data** and can overfit it.
  - A **small variation** in the data can **result** in a **very different decision tree** (high **variance**). This **can be reduced** by **bagging** and **boosting**.
  - Decision trees are **biased on imbalanced datasets**, so it is **recommended** to **balance the dataset before building the decision tree**.
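
Two of these remedies can be sketched in a few lines. The example below assumes a synthetic, deliberately imbalanced dataset from `make_classification` (roughly 80% of samples in class 0). It compares a single tree, a tree with `class_weight="balanced"`, and a bagged ensemble built with `BaggingClassifier`. Note that `class_weight` reweights the classes rather than resampling the dataset, so it is an alternative to the balancing step mentioned above, not the same technique.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Imbalanced synthetic stand-in data (assumption): about 80% class 0
X_demo, y_demo = make_classification(
    n_samples=600, n_features=7, n_informative=4,
    weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "balanced tree": DecisionTreeClassifier(class_weight="balanced", random_state=0),
    # 50 trees, each trained on a bootstrap sample of the training set
    "bagged trees": BaggingClassifier(
        DecisionTreeClassifier(random_state=0), n_estimators=50, random_state=0
    ),
}

results = {}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    y_hat = m.predict(X_te)
    results[name] = (
        metrics.accuracy_score(y_te, y_hat),
        metrics.balanced_accuracy_score(y_te, y_hat),
    )
    print(f"{name:14s} accuracy={results[name][0]:.4f}  "
          f"balanced accuracy={results[name][1]:.4f}")
```

Balanced accuracy is reported alongside plain accuracy because, on imbalanced data, plain accuracy can look high even when the minority class is mostly misclassified.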


<h1 align="center">End of Part A</h1>