You are a data scientist working for a healthcare company, and you have been tasked with creating a
decision tree to help identify patients with diabetes based on a set of clinical variables. You have been
given a dataset (diabetes.csv) with the following variables:
1. Pregnancies: Number of times pregnant (integer)

2. Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test (integer)

3. BloodPressure: Diastolic blood pressure (mm Hg) (integer)

4. SkinThickness: Triceps skin fold thickness (mm) (integer)

5. Insulin: 2-Hour serum insulin (mu U/ml) (integer)

6. BMI: Body mass index (weight in kg/(height in m)^2) (float)

7. DiabetesPedigreeFunction: Diabetes pedigree function (a function which scores likelihood of diabetes
based on family history) (float)

8. Age: Age in years (integer)

9. Outcome: Class variable (0 if non-diabetic, 1 if diabetic) (integer)

Here’s the dataset link:

https://drive.google.com/file/d/1Q4J8KS1wm4-_YTuc389enPh6O-eTNcx2/view?usp=sharing


Your goal is to create a decision tree to predict whether a patient has diabetes based on the other
variables. Here are the steps you can follow:


Q1. Import the dataset and examine the variables. Use descriptive statistics and visualizations to
understand the distribution and relationships between the variables.

Q2. Preprocess the data by cleaning missing values, removing outliers, and transforming categorical
variables into dummy variables if necessary.

Q3. Split the dataset into a training set and a test set. Use a random seed to ensure reproducibility.

Q4. Use a decision tree algorithm, such as ID3 or C4.5, to train a decision tree model on the training set. Use
cross-validation to optimize the hyperparameters and avoid overfitting.

Q5. Evaluate the performance of the decision tree model on the test set using metrics such as accuracy,
precision, recall, and F1 score. Use confusion matrices and ROC curves to visualize the results.

Q6. Interpret the decision tree by examining the splits, branches, and leaves. Identify the most important
variables and their thresholds. Use domain knowledge and common sense to explain the patterns and
trends.

Q7. Validate the decision tree model by applying it to new data or testing its robustness to changes in the
dataset or the environment. Use sensitivity analysis and scenario testing to explore the uncertainty and
risks.

## Answers

In [1]:
# Import necessary libraries (only those used below)
import pandas as pd
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.read_csv("/home/jovyan/work/diabetes.csv")

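# Q1: Inspect the first rows and the summary statistics
print(data.head())
print(data.describe())

# Q2: Count missing values per column (all variables are numeric, so no dummy encoding is needed)
print(data.isnull().sum())

# Q3: Hold out 20% of the rows as a test set, with a fixed seed for reproducibility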
X = data.drop("Outcome", axis=1)
y = data["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

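# Q4: Tune a decision tree with 5-fold cross-validation on the training set.
# Note: scikit-learn's DecisionTreeClassifier implements an optimized CART, not ID3 or C4.5.
model = DecisionTreeClassifier(random_state=42)  # fixed seed so the fitted tree is reproducible

param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4]
}

grid_search = GridSearchCV(model, param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)

# GridSearchCV refits the best configuration on the full training set by default
# (refit=True), so no extra fit() call is needed.
best_model = grid_search.best_estimator_
best_params = grid_search.best_params_

print("Best hyperparameters:", best_params)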

# Q5: Evaluate the tuned model on the held-out test set
y_pred = best_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:\n", cm)

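# ROC-AUC is computed from the predicted probability of class 1, not from hard labels
y_proba = best_model.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_proba)
print("ROC-AUC score:", roc_auc)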

# Q6: Print the learned splits and thresholds as text
# X is a DataFrame, so its column names can be passed directly
feature_names = X.columns.to_list()
tree_text = export_text(best_model, feature_names=feature_names)
print(tree_text)

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  
       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin  \
count   768.000000  768.000000     768.000000     768.000000  768.000000   
mean      3.845052  120.894531      69.105469      20.536458   79.799479   
std       3.369578   31.972618      19.355807      15.952218  115.244002   
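
The `head()` output above shows zeros in `SkinThickness` and `Insulin`, which are physiologically implausible, and `isnull().sum()` will not flag them. Below is a minimal sketch of one way to handle this for Q2. It assumes that zeros in the listed columns are placeholders for missing measurements (the dataset description does not say so), and the helper name `zeros_to_median` is hypothetical. The demo frame reuses the five rows printed above.

```python
import numpy as np
import pandas as pd

# Columns where a value of 0 is ASSUMED to mean "not measured".
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def zeros_to_median(df, columns=ZERO_AS_MISSING):
    """Return a copy in which zeros in `columns` are replaced by the
    median of the non-zero values of the same column."""
    out = df.copy()
    for col in columns:
        if col not in out.columns:
            continue  # tolerate frames that lack some of the columns
        values = out[col].replace(0, np.nan).astype(float)
        out[col] = values.fillna(values.median())
    return out


# Three columns from the five rows shown in the head() output.
demo = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
})
cleaned = zeros_to_median(demo)
print(cleaned)
```

In a real pipeline the medians would be computed on the training split only and then applied to the test split, so that no test information leaks into preprocessing.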

In [2]:
# Q7: Validation
# Caveat: this split reuses random_state=42, so X_new and y_new are the same rows
# as X_train and y_train. The score below is therefore the accuracy on the
# training rows, not on unseen data.
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("/home/jovyan/work/diabetes.csv")

X = data.drop("Outcome", axis=1)
y = data["Outcome"]
X_new, X_test, y_new, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

new_predictions = best_model.predict(X_new)

new_accuracy = accuracy_score(y_new, new_predictions)
print("Performance on new data (accuracy):", new_accuracy)

Performance on new data (accuracy): 0.7768729641693811
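
Since the split above yields the same rows the model was trained on, an out-of-sample estimate needs data the model has not seen. When no external data is available, one option is repeated stratified cross-validation. The sketch below uses a synthetic stand-in for `diabetes.csv` created with `make_classification`, and the hyperparameters are hypothetical; in the notebook, `X`, `y`, and `grid_search.best_params_` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same shape as the dataset (768 rows, 8 features).
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=4, random_state=0
)

# Hypothetical hyperparameters, not the ones found by the grid search.
tree = DecisionTreeClassifier(max_depth=5, min_samples_leaf=4, random_state=42)

# 5 folds repeated 3 times gives 15 held-out accuracy estimates.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
scores = cross_val_score(tree, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Held-out accuracy: mean=%.3f, std=%.3f over %d folds"
      % (scores.mean(), scores.std(), len(scores)))
```

The spread across folds gives a rough sense of how much the accuracy depends on which rows end up in the training set.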


In [3]:
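# Robustness check: retrain on a copy of the training set with Gaussian noise added to Glucose
import numpy as np

X_train_noise = X_train.copy()

# No seed is set here, so the noise (and the scores below) vary between runs
X_train_noise['Glucose'] += np.random.normal(loc=0, scale=1, size=X_train_noise.shape[0])

# Caveat: this tree uses default hyperparameters while the original model was tuned,
# so the gap below reflects both the added noise and the missing tuning
model_with_noise = DecisionTreeClassifier()
model_with_noise.fit(X_train_noise, y_train)

y_pred_noise = model_with_noise.predict(X_test)
noise_accuracy = accuracy_score(y_test, y_pred_noise)

print("Original model accuracy:", accuracy)
print("Model with noise accuracy:", noise_accuracy)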

Original model accuracy: 0.7597402597402597
Model with noise accuracy: 0.7272727272727273
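
A complementary check keeps the trained model fixed and perturbs the inputs it is evaluated on, which isolates the effect of measurement noise from the effect of retraining. The sketch below runs on synthetic stand-in data and perturbs the first feature column; the noise scales and hyperparameters are hypothetical. In the notebook, `best_model`, `X_test`, and the `Glucose` column would take their place.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data.
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Train once; the model is never refit inside the loop.
fixed_model = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_tr, y_tr)

rng = np.random.default_rng(42)  # seeded so the experiment is repeatable
noise_results = {}
for scale in [0.0, 0.1, 0.5, 1.0]:
    X_noisy = X_te.copy()
    # Add zero-mean Gaussian noise to the first feature of the test rows only.
    X_noisy[:, 0] += rng.normal(loc=0.0, scale=scale, size=X_noisy.shape[0])
    noise_results[scale] = accuracy_score(y_te, fixed_model.predict(X_noisy))

for scale, acc in noise_results.items():
    print(f"noise scale {scale}: accuracy {acc:.3f}")
```

The entry for scale 0.0 reproduces the unperturbed test accuracy and serves as the baseline for the other scales.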


In [5]:
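# Sensitivity analysis on BMI: retrain with BMI fixed to each of 10 values across its range
bmi_range = np.linspace(data['BMI'].min(), data['BMI'].max(), num=10)

bmi_scores = []

for bmi_value in bmi_range:
    # Caveat: a constant BMI column carries no information, so each tree is
    # effectively trained without BMI and never splits on it. The scores differ
    # only through the unseeded randomness of DecisionTreeClassifier.
    X_train_modified = X_new.copy()
    X_train_modified['BMI'] = bmi_value

    model_modified = DecisionTreeClassifier()
    model_modified.fit(X_train_modified, y_new)

    y_pred_modified = model_modified.predict(X_test)
    score = accuracy_score(y_test, y_pred_modified)

    bmi_scores.append(score)

# Report the test accuracy obtained for each fixed BMI value
for bmi_value, score in zip(bmi_range, bmi_scores):
    print(f"BMI fixed at {bmi_value:.1f}: accuracy {score:.3f}")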
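
An alternative way to probe sensitivity to BMI is to keep one trained model and vary BMI only at prediction time, then watch how the average predicted probability moves. This is a hand-rolled partial-dependence sketch on synthetic data: the column ranges, the labeling rule, and the helper name `sweep_feature` are all made up for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 400
# Synthetic stand-in data; ranges and the labeling rule are invented.
demo = pd.DataFrame({
    "Glucose": rng.uniform(60, 200, n),
    "BMI": rng.uniform(18, 50, n),
    "Age": rng.uniform(21, 80, n),
})
labels = ((demo["Glucose"] > 130) & (demo["BMI"] > 30)).astype(int)

# Train once; the model stays fixed during the sweep.
fitted = DecisionTreeClassifier(max_depth=3, random_state=42).fit(demo, labels)


def sweep_feature(model, X, feature, values):
    """Mean predicted probability of class 1 when `feature` is forced
    to each value in `values` for every row of X."""
    means = []
    for v in values:
        X_mod = X.copy()
        X_mod[feature] = v
        means.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(means)


bmi_grid = np.linspace(demo["BMI"].min(), demo["BMI"].max(), num=10)
bmi_effect = sweep_feature(fitted, demo, "BMI", bmi_grid)
for v, p in zip(bmi_grid, bmi_effect):
    print(f"BMI={v:5.1f} -> mean P(class 1)={p:.3f}")
```

Because the model is not retrained, any change in the curve comes from the thresholds the tree learned for BMI, which makes the result easier to relate to the printed tree.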

In [None]:
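# Scenario test: predict the outcome for a single hypothetical patient.
# The record is wrapped in a list because pandas cannot build a DataFrame
# from scalar values alone without an index.
scenario_data = pd.DataFrame([{'Pregnancies': 0, 'Glucose': 150, 'BloodPressure': 80,
                               'SkinThickness': 20, 'Insulin': 0, 'BMI': 25,
                               'DiabetesPedigreeFunction': 0.5, 'Age': 35}])

scenario_prediction = best_model.predict(scenario_data)

print("Predicted outcome for the scenario:", scenario_prediction[0])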