<div style="color:white;display:fill;border-radius:8px;font-size:200%; letter-spacing:1.0px;"><p style="padding: 5px;color:white;text-align:left;"><b><span style='color:#fc6603'>AUTHOR: SOBIA ALAMGIR</span></b></p></div>

<a id="13"></a>
<h1 style="background-color:#435420;font-family:newtimeroman;font-size:300%;text-align:center;border-radius: 15px 50px;color:#FF9900;">Decision Tree Classifiers with Hyperparameters using Grid Search Cross Validation</h1>

- **Decision Tree Classifier:** A machine learning model used for classification tasks by splitting data into branches based on feature values.
 
- **Hyperparameters:** Settings like max_depth, min_samples_split, and criterion that control the behavior of the decision tree.
 
- **Grid Search Cross Validation:** A technique to find the best combination of hyperparameters by:
    - Specifying a set of candidate values for each hyperparameter.
    - Performing an exhaustive search over all possible combinations.
    - Using cross-validation to evaluate performance and select the optimal model configuration.
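The three steps above can be sketched by hand to show what a grid search does internally. This is only an illustration: the data comes from `make_classification` (a synthetic stand-in, not the Titanic data), and the names `demo_grid`, `X_demo`, and `y_demo` are hypothetical.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Step 1: candidate values for each hyperparameter (kept tiny on purpose)
demo_grid = {"max_depth": [2, 4], "min_samples_leaf": [1, 5]}

best_score, best_combo = -1.0, None
names = list(demo_grid)

# Step 2: visit every combination of candidate values
for values in product(*(demo_grid[name] for name in names)):
    params = dict(zip(names, values))
    clf = DecisionTreeClassifier(random_state=0, **params)
    # Step 3: mean accuracy over 5 cross-validation folds for this combination
    score = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    if score > best_score:
        best_score, best_combo = score, params

print(best_combo, round(best_score, 3))
```

`GridSearchCV` wraps this loop and adds conveniences such as parallel execution and a final refit of the best configuration.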

## Step-01 Load Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import precision_score,recall_score,f1_score,accuracy_score

## Step-02 Load Dataset 

In [2]:
df = sns.load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


## Step-03 Data Preprocessing

In [3]:
df.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

In [4]:
df.shape

(891, 15)

In [5]:
df.isnull().sum()/len(df)*100

survived        0.000000
pclass          0.000000
sex             0.000000
age            19.865320
sibsp           0.000000
parch           0.000000
fare            0.000000
embarked        0.224467
class           0.000000
who             0.000000
adult_male      0.000000
deck           77.216611
embark_town     0.224467
alive           0.000000
alone           0.000000
dtype: float64

In [6]:
df = df.drop('deck', axis=1)
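Dropping `deck` follows from its roughly 77% missing rate. As a rough sketch, the same decision can be expressed as a rule on a toy frame; the 50% cutoff and the names `toy` and `threshold` are assumptions chosen for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with different amounts of missing data per column
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],            # 0% missing
    "b": [np.nan, np.nan, np.nan, 1.0],   # 75% missing
    "c": [1.0, np.nan, 3.0, 4.0],         # 25% missing
})

# Percentage of missing values per column
missing_pct = toy.isnull().mean() * 100

threshold = 50  # hypothetical cutoff, in percent
to_drop = missing_pct[missing_pct > threshold].index

# Keep only the columns at or below the cutoff
toy_reduced = toy.drop(columns=to_drop)
print(list(toy_reduced.columns))
```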

In [7]:
df.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'embark_town', 'alive',
       'alone'],
      dtype='object')

In [8]:
# Fill null values (assign the result back: chained inplace=True raises a FutureWarning and stops working in pandas 3.0)
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
df['age'] = df['age'].fillna(df['age'].mean())

In [9]:
df.isnull().sum()

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64
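An alternative way to fill several columns in one call is to pass `fillna` a dict that maps column names to fill values. The sketch below uses a small made-up frame; `toy` and `fill_values` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value per column
toy = pd.DataFrame({
    "town": ["X", "Y", None, "X"],
    "age": [20.0, np.nan, 40.0, 30.0],
})

# Mode for the categorical column, mean for the numeric one
fill_values = {
    "town": toy["town"].mode()[0],
    "age": toy["age"].mean(),
}

# fillna with a dict fills each column with its own value and returns a new frame
toy = toy.fillna(fill_values)
print(toy)
```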

## Step-04(a) Label Encoding

**Let's label-encode the categorical variables**

In [10]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,Southampton,no,True


In [11]:
le = LabelEncoder()
df['sex'] = le.fit_transform(df['sex'])
df['embarked'] = le.fit_transform(df['embarked'])
df['class'] = le.fit_transform(df['class'])
df['who'] = le.fit_transform(df['who'])
df['adult_male'] = le.fit_transform(df['adult_male'])
df['embark_town'] = le.fit_transform(df['embark_town'])
df['alive'] = le.fit_transform(df['alive'])
df['alone'] = le.fit_transform(df['alone'])
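`LabelEncoder` sorts the distinct values and assigns integer codes in that order, which is why `female` becomes 0 and `male` becomes 1. Because the same `le` object is re-fitted on each column, only the mapping of the last column stays available afterwards. The sketch below, with hypothetical names `le_demo` and `mapping`, shows how a mapping could be recorded for one column.

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
codes = le_demo.fit_transform(["male", "female", "female", "male"])

# classes_ holds the sorted distinct values; a value's position is its code
print(list(le_demo.classes_))
print(list(codes))

# Record the value-to-code mapping so it can be inspected later
mapping = {
    str(label): int(code)
    for label, code in zip(le_demo.classes_, le_demo.transform(le_demo.classes_))
}
print(mapping)
```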

In [12]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,1,2,0,0
1,1,1,0,38.0,1,0,71.2833,0,0,2,0,0,1,0
2,1,3,0,26.0,0,0,7.925,2,2,2,0,2,1,1
3,1,1,0,35.0,1,0,53.1,2,0,2,0,2,1,0
4,0,3,1,35.0,0,0,8.05,2,2,1,1,2,0,1


**Let's check the summary statistics of all numerical features in the dataset**

In [13]:
df.describe()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
count,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,0.383838,2.308642,0.647587,29.699118,0.523008,0.381594,32.204208,1.536476,1.308642,1.210999,0.602694,1.536476,0.383838,0.602694
std,0.486592,0.836071,0.47799,13.002015,1.102743,0.806057,49.693429,0.791503,0.836071,0.594291,0.489615,0.791503,0.486592,0.489615
min,0.0,1.0,0.0,0.42,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,2.0,0.0,22.0,0.0,0.0,7.9104,1.0,1.0,1.0,0.0,1.0,0.0,0.0
50%,0.0,3.0,1.0,29.699118,0.0,0.0,14.4542,2.0,2.0,1.0,1.0,2.0,0.0,1.0
75%,1.0,3.0,1.0,35.0,1.0,0.0,31.0,2.0,2.0,2.0,1.0,2.0,1.0,1.0
max,1.0,3.0,1.0,80.0,8.0,6.0,512.3292,2.0,2.0,2.0,1.0,2.0,1.0,1.0


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     891 non-null    int64  
 1   pclass       891 non-null    int64  
 2   sex          891 non-null    int32  
 3   age          891 non-null    float64
 4   sibsp        891 non-null    int64  
 5   parch        891 non-null    int64  
 6   fare         891 non-null    float64
 7   embarked     891 non-null    int32  
 8   class        891 non-null    int32  
 9   who          891 non-null    int32  
 10  adult_male   891 non-null    int64  
 11  embark_town  891 non-null    int32  
 12  alive        891 non-null    int32  
 13  alone        891 non-null    int64  
dtypes: float64(2), int32(6), int64(6)
memory usage: 76.7 KB


## Step-05 Splitting Data into Independent Variables `X` and Dependent Variable `y`

In [30]:
X = df.drop('survived', axis = 1)
y = df['survived']

In [31]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2 , random_state=42)
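Before training, it can be worth checking that no feature simply duplicates the target. In the `describe()` output earlier, `alive` has exactly the same mean and standard deviation as `survived`, which suggests it is an encoded copy of the label. The sketch below shows one possible check on a small made-up frame; `toy`, `leaky`, and `X_toy` are hypothetical names.

```python
import pandas as pd

# Made-up frame in which one feature is a copy of the target
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1],
    "alive":    [0, 1, 1, 0, 1],   # encoded copy of the target
    "pclass":   [3, 1, 3, 2, 1],
})

target = "survived"

# Flag features whose values match the target on every row
# (== compares values, so differing integer dtypes do not matter)
leaky = [
    col for col in toy.columns
    if col != target and (toy[col] == toy[target]).all()
]
print(leaky)

# Exclude both the target and any flagged columns from the features
X_toy = toy.drop(columns=[target] + leaky)
print(list(X_toy.columns))
```

A check like this only catches exact copies; features derived from the target in other ways need to be reviewed by hand.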

## Step-06 Model Training

In [17]:
model = DecisionTreeClassifier()

## Step-06(a) Hyperparameters

In [18]:
# Hyperparameter grid to search
param_grid = {'criterion':['gini','entropy'],
              'max_depth':[None,10,20,30,40,50],
              'min_samples_split':[2,5,10],
              'min_samples_leaf':[1,2,4]
              }

**Goal:** To optimize hyperparameters for better model accuracy and generalization.

**Outcome:** Identifies the hyperparameters that yield the best cross-validation performance.
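To get a feel for the cost of the search, the number of combinations in the grid above can be counted with scikit-learn's `ParameterGrid`. The grid is copied here under the hypothetical name `param_grid_demo` so the snippet runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as the grid defined above
param_grid_demo = {
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 10, 20, 30, 40, 50],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# 2 * 6 * 3 * 3 combinations
n_combinations = len(ParameterGrid(param_grid_demo))

# Each combination is fitted once per cross-validation fold
n_folds = 5
n_fits = n_combinations * n_folds

print(n_combinations, n_fits)  # 108 540
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best combination on the full training data.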

## Step-06(b) Grid Search CV: Enhancing Model Performance

In [19]:
# Create a GridSearchCV object to perform hyperparameter tuning
grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Fit the GridSearchCV object on the training data
grid_search.fit(X_train, y_train)
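After fitting, the search object keeps the score of every combination in `cv_results_`, which helps judge how close the runner-up configurations are. The sketch below uses synthetic data and a smaller grid as stand-ins; `demo_cv_search` and `results` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

demo_cv_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, None], "min_samples_leaf": [1, 5]},
    cv=5,
    scoring="accuracy",
)
demo_cv_search.fit(X_demo, y_demo)

# One row per combination, with mean and spread of the fold scores
results = pd.DataFrame(demo_cv_search.cv_results_)
top = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
].head(3)
print(top)

# Mean cross-validated accuracy of the best combination
print(demo_cv_search.best_score_)
```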

## Step-06(c) Best Hyperparameters

In [20]:
# Get the best hyperparameters
best_params = grid_search.best_params_

In [21]:
display(best_params)

{'criterion': 'gini',
 'max_depth': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2}

## Step-07 Model with Hyperparameters

In [27]:
model_with_hyperparameters = DecisionTreeClassifier(criterion='gini',
                                                    max_depth=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2)
print(model_with_hyperparameters.get_params())

{'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'random_state': None, 'splitter': 'best'}
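Typing the best values by hand works, but it can drift out of sync if the grid changes. Two alternatives are sketched below on synthetic data: reusing `best_estimator_`, which `GridSearchCV` has already refitted on the data passed to `fit` when `refit=True` (the default), or unpacking `best_params_` into a new model. The names `demo_search`, `tuned_model`, and `rebuilt_model` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

demo_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4], "criterion": ["gini", "entropy"]},
    cv=5,
    scoring="accuracy",
)
demo_search.fit(X_demo, y_demo)

# Option 1: the best model, already fitted by the search
tuned_model = demo_search.best_estimator_

# Option 2: a fresh, unfitted model built from the best parameters
rebuilt_model = DecisionTreeClassifier(**demo_search.best_params_)

print(tuned_model.get_params()["max_depth"], rebuilt_model.get_params()["max_depth"])
```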


## Step-08 Model Fitting with Hyperparameters

In [28]:
model_with_hyperparameters.fit(X_train,y_train)

## Step-09 Model Prediction

In [32]:
y_pred = model_with_hyperparameters.predict(X_test)

## Step-10 Model Evaluation

In [34]:
accuracy = accuracy_score(y_test,y_pred)
precision = precision_score(y_test , y_pred , average='weighted')
recall = recall_score(y_test , y_pred, average='weighted')


print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")


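Accuracy: 1.00
Precision: 1.00
Recall: 1.00

**Note:** These perfect scores are a sign of target leakage rather than a perfect model. The feature matrix `X` still contains `alive`, which is the yes/no version of `survived`, so the tree can read the label directly from that column. Dropping `alive` from `X` along with `survived` is needed to get a realistic estimate of performance.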
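To make the three metrics concrete, the sketch below computes them on a tiny hand-made example where the counts can be verified by eye. The labels in `y_true` and `y_hat` are made up for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels: 4 negatives and 4 positives, with 2 mistakes
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_hat = [0, 0, 1, 1, 1, 0, 0, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()
print(cm)

# 6 of 8 predictions are correct
acc = accuracy_score(y_true, y_hat)

# Per-class scores averaged with class support as weights
prec = precision_score(y_true, y_hat, average="weighted")
rec = recall_score(y_true, y_hat, average="weighted")

print(acc, prec, rec)  # 0.75 0.75 0.75
```

With `average='weighted'`, recall always equals accuracy, because weighting each class's recall by its support adds up to the overall fraction of correct predictions. Reporting per-class values, for example with `average=None`, gives more information.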


## Step-11 Compare Actual and Predicted Values of `Survived` in a DataFrame

In [26]:
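# Work on a copy so the test features are not modified in place
comparison = X_test.copy()
comparison['Survived'] = y_test
comparison['y_pred'] = y_pred

display(comparison)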

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone,Survived,y_pred
709,3,1,29.699118,1,1,15.2458,0,2,1,1,0,1,0,1,1
439,2,1,31.000000,0,0,10.5000,2,1,1,1,2,0,1,0,0
840,3,1,20.000000,0,0,7.9250,2,2,1,1,2,0,1,0,0
720,2,0,6.000000,0,1,33.0000,2,1,0,0,2,1,0,1,1
39,3,0,14.000000,1,0,11.2417,0,2,0,0,0,1,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
433,3,1,17.000000,0,0,7.1250,2,2,1,1,2,0,1,0,0
773,3,1,29.699118,0,0,7.2250,0,2,1,1,0,0,1,0,0
25,3,0,38.000000,1,5,31.3875,2,2,2,0,2,1,0,1,1
84,2,0,17.000000,0,0,10.5000,2,1,2,0,2,1,1,1,1


<a id="13"></a>
<h1 style="background-color:#435420;font-family:newtimeroman;font-size:300%;text-align:center;border-radius: 15px 50px;color:#FF9900;">Thanks For Reading My Notebook!</h1>