
# **Classification with Decision Tree - hyperparameter tuning (__model selection__) with Grid Search and Cross-Validation**


---
We use the Decision Tree algorithm to build a classification model. To evaluate its performance, we apply stratified k-fold cross-validation, which averages the score over several train/validation splits and therefore gives a more reliable estimate than a single split. Finally, we tune the model by searching a grid of hyperparameter settings for the best one.



**Suppressing warnings**

In [None]:
import warnings
warnings.filterwarnings('ignore')

**Importing libraries for ML**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

**Loading Data**


> We start by preparing the environment for our machine learning workflow.
With the essential libraries imported, we load the dataset iris.csv.
Column names are supplied through the names argument, which assumes the file has no header row.
The test set size and the random state used for reproducibility are set later, when the data is split.



In [None]:
names = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']
df = pd.read_csv("iris.csv", sep=',', names=names)

**Data exploration**


> We explore the dataset to understand its structure and key statistics.
The df.head() function displays the first few rows, while df.describe() provides summary statistics.
df['class'].value_counts() shows the distribution of class labels, helping to assess class balance.




In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df['class'].value_counts()
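
To make the class-balance check above concrete, the sketch below expresses the counts as proportions with value_counts(normalize=True). The small DataFrame and its labels are invented for illustration and stand in for the real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({"class": ["a", "a", "b", "b", "c", "c"]})

# Absolute counts per class, as in the cell above
counts = toy["class"].value_counts()

# Relative frequencies: equal shares indicate a balanced dataset
proportions = toy["class"].value_counts(normalize=True)
is_balanced = proportions.max() == proportions.min()

print(proportions)
print("Balanced:", is_balanced)
```

When the proportions differ strongly, accuracy alone can be misleading, which is one reason to also look at macro-averaged metrics.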

**Split the data into the train and test sets**



> We separate the dataset into X (features) and y (target labels) for training the model.
The drop() method removes the class column from the feature matrix, since the target must not be used as an input.
The axis parameter in drop() determines whether rows (axis=0) or columns (axis=1) are removed.
We then hold out 20% of the samples as a test set, fixing random_state so the split is reproducible.



In [None]:
X = df.drop('class', axis=1)
y = df['class']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)
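
As a side illustration of the axis parameter of drop() used above, the following sketch applies both settings to a small made-up DataFrame. The column names f1, f2 and label are hypothetical and not part of the iris data.

```python
import pandas as pd

# Hypothetical toy frame; column names are illustrative only
toy = pd.DataFrame({"f1": [1.0, 2.0, 3.0],
                    "f2": [4.0, 5.0, 6.0],
                    "label": ["a", "b", "a"]})

# axis=1 removes a column: the features keep every row
features = toy.drop("label", axis=1)

# axis=0 removes a row, identified by its index label
fewer_rows = toy.drop(0, axis=0)

print(features.shape)
print(fewer_rows.shape)
```

Note that drop() returns a new DataFrame and leaves the original untouched unless inplace=True is passed.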

In [None]:
print(f"There are {X_train.shape[0]} samples in the training dataset")
print(f"Each sample has {X_train.shape[1]} features")
print(f"There are {X_test.shape[0]} samples in the testing dataset")


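**Constructing a Model**

> We train a DecisionTreeClassifier on the training data and make predictions on X_test.
The model's performance is evaluated using accuracy_score. We also record the depth of the fitted tree,
which bounds the max_depth values explored in the grid search.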



In [None]:
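# Fit an unconstrained tree as a baseline
estimator = DecisionTreeClassifier(random_state=42)
estimator.fit(X_train, y_train)
y_predict = estimator.predict(X_test)
acc = accuracy_score(y_test, y_predict)
print(f"Baseline test accuracy: {acc:.3f}")

# Depth of the fitted tree and impurity at its root node
maximum_depth = estimator.tree_.max_depth
impurity = estimator.tree_.impurity[0]
# Candidate depths for the grid search: 1 up to the depth of the unconstrained tree
depth_values = list(range(1, maximum_depth + 1))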
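
The value read from estimator.tree_.impurity[0] is the impurity of the root node, which under the default criterion is the Gini impurity of the training labels. A minimal sketch of that formula follows. The helper name gini_impurity and the class counts are assumptions made for illustration, not the actual counts of the training split.

```python
def gini_impurity(class_counts):
    """Gini impurity 1 - sum(p_k^2) for a node with the given class counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# Hypothetical counts: three equally frequent classes
balanced = gini_impurity([40, 40, 40])

# A pure node (single class) has zero impurity
pure = gini_impurity([120, 0, 0])

print(round(balanced, 4), pure)
```

With K equally frequent classes the impurity is 1 - 1/K, the largest value it can take, and each split in the tree is chosen to reduce it.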

In [None]:
scores = ['accuracy', 'recall_macro', 'f1_macro', 'precision_macro']

params = {'max_depth': depth_values,
          'criterion': ['gini', 'entropy'],
          'class_weight': ['balanced', None]}
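
Grid search evaluates every combination of the values listed in params. The sketch below enumerates such a grid with itertools.product to show how the number of candidates grows. The depth range 1 to 5 and the fold count are assumptions, since the actual depth_values depend on the fitted tree.

```python
from itertools import product

# Assumed depth range for illustration; the notebook derives it from the fitted tree
example_params = {"max_depth": [1, 2, 3, 4, 5],
                  "criterion": ["gini", "entropy"],
                  "class_weight": ["balanced", None]}

# Cartesian product of all value lists: one dict per candidate setting
keys = list(example_params)
candidates = [dict(zip(keys, values))
              for values in product(*(example_params[k] for k in keys))]

n_folds = 5  # assumed number of cross-validation folds
n_cv_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", n_cv_fits, "cross-validation fits per scoring function")
```

Because the cost is multiplicative, adding one more hyperparameter with a few values can multiply the search time several times over.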

**Looping over the scoring functions**



> We iterate over different scoring functions to evaluate the model's performance.
For each score, we run a cross-validated grid search on the training set, identify the best model, and generate predictions on the test set.
Finally, we print the best score, show the classification_report, and visualize the confusion matrix.



In [None]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for score in scores:
    clf = GridSearchCV(estimator=estimator,
                       cv=skf,
                       param_grid=params,
                       scoring=score,
                       return_train_score=False)
    clf.fit(X_train, y_train)
    print(f"Best cross-validated {score}: {clf.best_score_:.3f}")
    # With refit=True (the default), predict() uses the best model refit on all training data
    y_predict = clf.predict(X_test)
    # Use clf.classes_ for both labels and names so each report row carries the right label
    cr = classification_report(y_true=y_test, y_pred=y_predict,
                               labels=clf.classes_, target_names=clf.classes_)
    print(cr)
    cm = confusion_matrix(y_test, y_predict, labels=clf.classes_)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                                  display_labels=clf.classes_)
    disp.plot()
    plt.title(f"Scoring: {score}\nBest params: {clf.best_params_}")
    plt.show()
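
The macro-averaged scorers used above (recall_macro, precision_macro, f1_macro) compute the metric for each class and then take the unweighted mean, so every class counts equally regardless of its size. A minimal sketch of that averaging, starting from a hypothetical confusion matrix with rows as true classes and columns as predicted classes (the layout confusion_matrix uses). The numbers are invented for illustration.

```python
# Hypothetical 3-class confusion matrix: rows = true class, columns = predicted class
toy_cm = [[10, 0, 0],
          [0, 8, 2],
          [0, 1, 9]]

n_classes = len(toy_cm)

# Recall per class: correct predictions divided by the row total (all true members)
recalls = [toy_cm[k][k] / sum(toy_cm[k]) for k in range(n_classes)]

# Precision per class: correct predictions divided by the column total (all predicted members)
precisions = [toy_cm[k][k] / sum(row[k] for row in toy_cm) for k in range(n_classes)]

# Macro average: unweighted mean over classes
recall_macro = sum(recalls) / n_classes
precision_macro = sum(precisions) / n_classes

print(f"recall_macro={recall_macro:.3f}, precision_macro={precision_macro:.3f}")
```

Reading the per-class values next to the confusion matrix plot helps explain why two scoring functions may select different hyperparameters.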