### Now that we understand the intuition behind Decision Trees, let's look at how to implement one

**Import Required Libraries**

In [60]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Load the dataset `'health_lifestyle_dataset.csv'` using pandas

In [61]:
df = pd.read_csv('health_lifestyle_dataset.csv')
df

Unnamed: 0,id,age,gender,bmi,daily_steps,sleep_hours,water_intake_l,calories_consumed,smoker,alcohol,resting_hr,systolic_bp,diastolic_bp,cholesterol,family_history,disease_risk
0,1,56,Male,20.5,4198,3.9,3.4,1602,0,0,97,161,111,240,0,0
1,2,69,Female,33.3,14359,9.0,4.7,2346,0,1,68,116,65,207,0,0
2,3,46,Male,31.6,1817,6.6,4.2,1643,0,1,90,123,99,296,0,0
3,4,32,Female,38.2,15772,3.6,2.0,2460,0,0,71,165,95,175,0,0
4,5,60,Female,33.6,6037,3.8,4.0,3756,0,1,98,139,61,294,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,99996,53,Male,33.1,4726,3.9,2.0,3118,0,1,56,105,76,282,0,0
99996,99997,22,Male,35.1,11554,4.5,3.1,1967,0,0,51,149,77,192,0,0
99997,99998,37,Male,18.9,3924,3.8,1.0,2328,0,0,69,92,117,218,0,0
99998,99999,72,Female,27.8,16110,5.6,0.8,3093,0,0,93,164,72,188,0,0


**View top 5 rows of the dataset.**

In [62]:
df.head()

Unnamed: 0,id,age,gender,bmi,daily_steps,sleep_hours,water_intake_l,calories_consumed,smoker,alcohol,resting_hr,systolic_bp,diastolic_bp,cholesterol,family_history,disease_risk
0,1,56,Male,20.5,4198,3.9,3.4,1602,0,0,97,161,111,240,0,0
1,2,69,Female,33.3,14359,9.0,4.7,2346,0,1,68,116,65,207,0,0
2,3,46,Male,31.6,1817,6.6,4.2,1643,0,1,90,123,99,296,0,0
3,4,32,Female,38.2,15772,3.6,2.0,2460,0,0,71,165,95,175,0,0
4,5,60,Female,33.6,6037,3.8,4.0,3756,0,1,98,139,61,294,0,0


**Check for total null values present in our dataset**

In [63]:
df.isna().sum()

Unnamed: 0,0
id,0
age,0
gender,0
bmi,0
daily_steps,0
sleep_hours,0
water_intake_l,0
calories_consumed,0
smoker,0
alcohol,0


**Check for duplicate rows in our dataset**

In [64]:
df.duplicated().sum()

np.int64(0)

**The `gender` column holds the strings `'Male'` and `'Female'`. The model needs numeric input, so we encode these values as `0` and `1` using `LabelEncoder` from `sklearn.preprocessing`.**

In [65]:
from sklearn.preprocessing import LabelEncoder
df['gender'] = LabelEncoder().fit_transform(df['gender'])
df

Unnamed: 0,id,age,gender,bmi,daily_steps,sleep_hours,water_intake_l,calories_consumed,smoker,alcohol,resting_hr,systolic_bp,diastolic_bp,cholesterol,family_history,disease_risk
0,1,56,1,20.5,4198,3.9,3.4,1602,0,0,97,161,111,240,0,0
1,2,69,0,33.3,14359,9.0,4.7,2346,0,1,68,116,65,207,0,0
2,3,46,1,31.6,1817,6.6,4.2,1643,0,1,90,123,99,296,0,0
3,4,32,0,38.2,15772,3.6,2.0,2460,0,0,71,165,95,175,0,0
4,5,60,0,33.6,6037,3.8,4.0,3756,0,1,98,139,61,294,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,99996,53,1,33.1,4726,3.9,2.0,3118,0,1,56,105,76,282,0,0
99996,99997,22,1,35.1,11554,4.5,3.1,1967,0,0,51,149,77,192,0,0
99997,99998,37,1,18.9,3924,3.8,1.0,2328,0,0,69,92,117,218,0,0
99998,99999,72,0,27.8,16110,5.6,0.8,3093,0,0,93,164,72,188,0,0
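
To see which string ended up as which number, it can help to look at the encoder's `classes_` attribute. The sketch below uses a tiny made-up frame (`toy`) rather than the real dataset, and keeps the encoder in a variable so the mapping can be inspected afterwards.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({"gender": ["Male", "Female", "Male", "Female"]})

# Keep a reference to the encoder so we can inspect the mapping later.
encoder = LabelEncoder()
toy["gender_encoded"] = encoder.fit_transform(toy["gender"])

# classes_ is sorted, so position 0 is 'Female' and position 1 is 'Male'.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'Female': 0, 'Male': 1}
```

This agrees with the encoded table above, where the `Male` rows show `1` and the `Female` rows show `0`.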


Now, let’s divide our dataset into two parts:

**X → Independent variables (features) – the inputs used to make predictions.**

**y → Dependent variable (target) – the output we want to predict.**

In [66]:
X = df.drop('disease_risk', axis=1)
y = df['disease_risk']
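
Note that `X` as defined above still contains `Unnamed: 0` and `id`, which look like row identifiers rather than health measurements. One possible variation, shown here on a small made-up frame (`toy`), is to drop them along with the target. This is only a sketch: the accuracy values reported in this notebook were produced with those columns included, so dropping them may change the numbers.

```python
import pandas as pd

# Toy frame with the same kind of columns as the real dataset (illustrative values).
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "id": [1, 2, 3],
    "age": [56, 69, 46],
    "bmi": [20.5, 33.3, 31.6],
    "disease_risk": [0, 0, 0],
})

# Assumption: these two columns are identifiers and carry no predictive signal.
id_columns = ["Unnamed: 0", "id"]

# Drop the identifiers and the target to get the feature matrix.
X_no_ids = toy.drop(columns=id_columns + ["disease_risk"])
y_toy = toy["disease_risk"]

print(list(X_no_ids.columns))  # ['age', 'bmi']
```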

**Once we have separated the dataset into X (features) and y (target), the next step is to split the data into training and testing sets, and then use the training set to fit (train) our model.**

In [67]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
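
To make the effect of `test_size=0.2` concrete, here is a small sketch on synthetic data (the names `X_demo` and `y_demo` and the generated values are made up for illustration). It also passes `stratify`, an optional `train_test_split` argument that is not used in the cell above, to keep the class proportions similar in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, two features, an imbalanced binary target.
rng = np.random.default_rng(42)
n = 1000
X_demo = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "bmi": rng.uniform(18, 40, n),
})
y_demo = pd.Series((rng.random(n) < 0.25).astype(int), name="disease_risk")

# 80% of the rows go to training and 20% to testing; stratify keeps class ratios aligned.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)   # (800, 2) (200, 2)
print(y_tr.mean(), y_te.mean()) # share of positive labels in each part
```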

**We initialize a Decision Tree Classifier with entropy as the criterion, a maximum depth of 3, and default values for min_samples_leaf and min_samples_split, while fixing random_state=42 for reproducibility.**

In [68]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=42)
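
To recall what `criterion='entropy'` measures, the following sketch computes entropy and information gain by hand for a made-up split. The helper names `entropy` and `information_gain` are hypothetical and written only for this illustration; scikit-learn performs this computation internally.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Made-up node with 4 samples of each class, split into two children.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left = [0, 0, 0, 1]
right = [0, 1, 1, 1]

print(entropy(parent))                                  # 1.0
print(round(information_gain(parent, left, right), 4))  # 0.1887
```

At each node the tree prefers the split with the highest information gain, and `max_depth=3` limits how many such splits can be stacked.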

**Now that the model is created, it's time to train it on `X_train` and `y_train`.**

In [69]:
model = clf.fit(X_train, y_train)
model
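
Once a tree is fitted, its learned rules can be printed with `export_text` from `sklearn.tree`. The sketch below trains a tree with the same settings on synthetic data (`X_demo`, `y_demo`, and the feature names are made up), so the rules shown are not the ones learned from the health dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data: the label depends only on the first feature.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 2))
y_demo = (X_demo[:, 0] > 0.5).astype(int)

# Same settings as the classifier above, fitted on the synthetic data.
demo_clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
demo_clf.fit(X_demo, y_demo)

# Print the learned rules as indented text.
rules = export_text(demo_clf, feature_names=["feature_a", "feature_b"])
print(rules)

# Depth actually reached (never more than max_depth) and per-feature importance.
print(demo_clf.get_depth())
print(demo_clf.feature_importances_)
```

For the real model, the same call would be `export_text(model, feature_names=list(X_train.columns))`.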

**After training, it's time to make predictions with our decision tree model and measure its accuracy on the test set.**

In [70]:
from sklearn.metrics import accuracy_score
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.7521
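
Accuracy alone can be misleading when one class is much more common than the other, so it is worth comparing against a model that always predicts the majority class. The sketch below does this with `DummyClassifier` on made-up labels; the 80/20 split is an assumption for illustration and not the real class ratio of this dataset, which can be checked with `y_test.value_counts(normalize=True)`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 80 negatives and 20 positives (not the real class ratio).
y_true_demo = np.array([0] * 80 + [1] * 20)
X_placeholder = np.zeros((100, 1))  # the dummy model ignores the features

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_true_demo)
baseline_pred = baseline.predict(X_placeholder)

baseline_acc = accuracy_score(y_true_demo, baseline_pred)
cm = confusion_matrix(y_true_demo, baseline_pred)

print(baseline_acc)  # 0.8
print(cm)            # rows are true classes, columns are predicted classes
```

If the tree's accuracy is close to this kind of baseline on the real data, the confusion matrix of `y_test` against `y_pred` will show whether the minority class is being predicted at all.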


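**After making predictions with this first model, let's see whether tuning its hyperparameters improves the result.**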

### Use RandomizedSearchCV to try a fixed number of randomly sampled hyperparameter combinations for a Decision Tree, scoring each one with cross-validation and selecting the best set based on accuracy.
### We then take the best model found by the search, make predictions on the test data, and print the best parameters along with the model’s accuracy.

In [71]:
from sklearn.model_selection import RandomizedSearchCV
param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
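
To get a feel for how much of the search space a randomized search covers, the number of possible combinations can be counted with `ParameterGrid`. The sketch below repeats the dictionary above so that it runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as in the cell above, repeated so this block is self-contained.
param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Every possible combination: 2 * 4 * 3 * 3.
n_combinations = len(ParameterGrid(param_dist))
print(n_combinations)  # 72
```

With `n_iter=20`, a randomized search evaluates 20 of these 72 combinations instead of all of them.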

In [72]:
rand_search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1
)
rand_search.fit(X_train, y_train)

In [73]:
print("best parameters", rand_search.best_params_)
best_dt = rand_search.best_estimator_


best parameters {'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': 5, 'criterion': 'entropy'}
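
Besides `best_params_`, the fitted search object exposes `best_score_` and `cv_results_`, which show how close the other sampled combinations were. The sketch below runs a smaller search on synthetic data (`X_demo`, `y_demo`, and the reduced search space are made up for illustration), so its scores say nothing about the health dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with some label noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# A smaller search than the one above, to keep the example quick.
demo_search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_distributions={"max_depth": [None, 5, 10, 15], "min_samples_leaf": [1, 2, 4]},
    n_iter=5,
    cv=5,
    scoring="accuracy",
    random_state=42,
)
demo_search.fit(X_demo, y_demo)

# One row per sampled combination, best-ranked first.
results = pd.DataFrame(demo_search.cv_results_)
summary = results[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(summary.head())
print(demo_search.best_score_)  # mean cross-validated accuracy of the best combination
```

For the search above, the equivalent calls would be `rand_search.best_score_` and `pd.DataFrame(rand_search.cv_results_)`.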


In [74]:
y_pred = best_dt.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.75215

**The tuned model reaches 0.75215 on the test set, compared with 0.7521 for the first tree, so hyperparameter tuning barely changed the accuracy here.**
