# Coding Block 4 - Afternoon

## Optimizing Decision Tree Performance

Which parameters can we tune to optimize the model?

- **criterion : string, optional (default="gini") or Choose attribute selection measure**: This parameter selects the attribute selection measure. Supported criteria include "gini" for the Gini index and "entropy" for the information gain.

- **splitter : string, optional (default="best") or Split Strategy**: This parameter selects the split strategy. Supported strategies are "best" to choose the best split and "random" to choose the best random split.

- **max_depth : int or None, optional (default=None) or Maximum Depth of a Tree**: The maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. A higher maximum depth tends to cause overfitting, and a lower value tends to cause underfitting ([Source](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)).
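As a minimal sketch of how these three parameters are passed to the classifier, the snippet below fits a small tree on synthetic data. The data from `make_classification` and the names `X_demo`, `y_demo` and `tree` are assumptions for illustration only; they are not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 200 samples, 5 features
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Set the three parameters explicitly instead of relying on the defaults
tree = DecisionTreeClassifier(
    criterion="entropy",  # attribute selection measure
    splitter="best",      # split strategy
    max_depth=3,          # cap the depth of the tree
    random_state=0,
)
tree.fit(X_demo, y_demo)

print(tree.get_params()["criterion"])
print(tree.get_depth())  # never larger than max_depth
```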

In scikit-learn, a decision tree classifier is commonly optimized by pre-pruning, and the maximum depth of the tree can be used as the control variable. For example, a tree fitted on the same data with max_depth=3 stays small enough to plot and read. (Post-pruning is also available through the ccp_alpha parameter, which applies minimal cost-complexity pruning.) Besides the pre-pruning parameters, you can also try another attribute selection measure such as entropy.
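To illustrate the effect of pre-pruning, the following sketch compares an unrestricted tree with trees limited to max_depth=3, once with the default Gini criterion and once with entropy. It runs on synthetic data from `make_classification` (an assumption, so that the example is self-contained); the numbers it prints say nothing about the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy data (assumption, not the diabetes dataset)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, n_informative=4, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

configs = {
    "unpruned": {},
    "max_depth=3": {"max_depth": 3},
    "max_depth=3, entropy": {"max_depth": 3, "criterion": "entropy"},
}

results = {}
for name, params in configs.items():
    clf = DecisionTreeClassifier(random_state=0, **params).fit(X_tr, y_tr)
    # Store the tree depth plus accuracy on training and test data
    results[name] = (clf.get_depth(), clf.score(X_tr, y_tr), clf.score(X_te, y_te))

for name, (depth, train_acc, test_acc) in results.items():
    print(f"{name:22s} depth={depth:2d} train={train_acc:.3f} test={test_acc:.3f}")
```

A large gap between training and test accuracy for the unpruned tree would be a typical sign of overfitting; the depth-limited trees trade some training accuracy for a simpler model.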


### Load the packages

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from plotly.offline import init_notebook_mode,iplot
import plotly.figure_factory as ff  # replaces the deprecated plotly.tools.FigureFactory
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import *
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
'''
...
'''

'\n...\n'

### Read the dataset 

In [2]:
diab=pd.read_csv('C:\\Users\\v.weber\\Documents\\000 Master Wirtschaftsinformatik FU Berlin\\I\\Applied Analytics\\github stuff\\fork\\Applied-Analytics\\data\\diabetes.csv')

## Do some hyperparameter tuning to benchmark different decision tree models

In [46]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score
import numpy as np

# Split the dataset into training and test sets
X = diab.drop(columns=['Outcome'])  # Feature variables
y = diab['Outcome']  # Outcome variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [50]:


# Define hyperparameter ranges (the stop value of range() is exclusive)
max_depth_range = range(1, 10, 1)  # Values from 1 to 9 in steps of 1
min_samples_split_range = range(2, 10, 1)  # Values from 2 to 9 in steps of 1
min_samples_leaf_range = range(30, 51, 1)  # Values from 30 to 50 in steps of 1

# Initialize variables to store the best model and score
best_f1_score = 0
best_hyperparameters = {}

# Iterate over all combinations of hyperparameters
for max_depth in max_depth_range:
    for min_samples_split in min_samples_split_range:
        for min_samples_leaf in min_samples_leaf_range:
            # Build and train the decision tree model
            model = DecisionTreeClassifier(
                random_state=42,
                max_depth=max_depth,
                min_samples_split=min_samples_split,
                min_samples_leaf=min_samples_leaf
            )
            model.fit(X_train, y_train)
            
            # Evaluate the model on the test data
            y_test_pred = model.predict(X_test)
            f1 = f1_score(y_test, y_test_pred)
            
            # Update the best model if the current one is better
            if f1 > best_f1_score:
                best_f1_score = f1
                best_hyperparameters = {
                    'max_depth': max_depth,
                    'min_samples_split': min_samples_split,
                    'min_samples_leaf': min_samples_leaf
                }

# Print the best hyperparameters and corresponding F1-score
print("Best Hyperparameters:")
print(best_hyperparameters)
print(f"Best F1-Score: {best_f1_score:.4f}")

Best Hyperparameters:
{'max_depth': 4, 'min_samples_split': 2, 'min_samples_leaf': 45}
Best F1-Score: 0.6750
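The loop above selects the hyperparameters by their F1-score on the test set, so the reported score is likely to be optimistic. One common alternative is to select them by cross-validation on the training data and use the test set only once at the end. The sketch below does this with `GridSearchCV`; it uses synthetic data and a smaller, hypothetical grid so that it runs on its own, and its results are not comparable to the output above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data (assumption)
X_syn, y_syn = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=42)

# Hypothetical, reduced grid over the same three hyperparameters
param_grid = {
    "max_depth": list(range(1, 6)),
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 10, 30],
}

# 5-fold cross-validation on the training data, scored by F1
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X_tr, y_tr)

print("Best Hyperparameters:")
print(search.best_params_)
print(f"Mean cross-validated F1-Score: {search.best_score_:.4f}")

# The test set is used once, for the final estimate
print(f"Test F1-Score: {f1_score(y_te, search.predict(X_te)):.4f}")
```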


In [51]:
from sklearn.tree import export_text

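# Export the decision rules as text.
# Note: decision_tree_model is not defined in the cells shown here; it is assumed
# to be a fitted DecisionTreeClassifier from an earlier cell.
decision_tree_rules = export_text(decision_tree_model, feature_names=list(X.columns))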
print("Decision Tree Rules:")
print(decision_tree_rules)

Decision Tree Rules:
|--- Glucose <= 143.50
|   |--- Age <= 28.50
|   |   |--- Glucose <= 126.50
|   |   |   |--- BMI <= 31.40
|   |   |   |   |--- SkinThickness <= 14.50
|   |   |   |   |   |--- class: 0
|   |   |   |   |--- SkinThickness >  14.50
|   |   |   |   |   |--- class: 0
|   |   |   |--- BMI >  31.40
|   |   |   |   |--- Glucose <= 105.50
|   |   |   |   |   |--- class: 0
|   |   |   |   |--- Glucose >  105.50
|   |   |   |   |   |--- class: 0
|   |   |--- Glucose >  126.50
|   |   |   |--- class: 0
|   |--- Age >  28.50
|   |   |--- BMI <= 26.95
|   |   |   |--- class: 0
|   |   |--- BMI >  26.95
|   |   |   |--- Glucose <= 100.50
|   |   |   |   |--- class: 0
|   |   |   |--- Glucose >  100.50
|   |   |   |   |--- DiabetesPedigreeFunction <= 0.50
|   |   |   |   |   |--- class: 0
|   |   |   |   |--- DiabetesPedigreeFunction >  0.50
|   |   |   |   |   |--- class: 1
|--- Glucose >  143.50
|   |--- Glucose <= 166.50
|   |   |--- class: 1
|   |--- Glucose >  166.50
|   |   |

## Use the information of the decision tree classifier to produce simple plots and information for stakeholders.
What are some relevant patterns to predict diabetes?

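According to the tree rules above, two patterns lead to a predicted diabetes outcome (class 1):

- A high Glucose value (> 143.5). The branch with Glucose <= 166.5 is predicted as class 1; the output for Glucose > 166.5 is cut off.

- A Glucose value between 100.5 and 143.5 combined with Age > 28.5, BMI > 26.95 and DiabetesPedigreeFunction > 0.5.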
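One simple piece of information for stakeholders is a ranking of the features by how much the tree relies on them. The sketch below derives such a ranking from `feature_importances_`. The data is synthetic and the column names `feature_0` to `feature_3` are placeholders (assumptions); with the notebook's data, the fitted model and `X.columns` would take their place.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with placeholder column names (assumption)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_df = pd.DataFrame(X_syn, columns=[f"feature_{i}" for i in range(4)])

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_df, y_syn)

# Impurity-based importances, sorted from most to least important
importances = pd.Series(clf.feature_importances_, index=X_df.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

With matplotlib installed, `importances.plot.barh()` would turn this ranking into a horizontal bar chart.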