Q1. Import the Dataset and Examine the Variables
Import Libraries and Load Dataset:


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
data = pd.read_csv('diabetes.csv')

# Display the first few rows of the dataset
print(data.head())

# Get descriptive statistics
print(data.describe())

# Check for missing values
print(data.isnull().sum())

# Visualize the distribution of each feature
data.hist(figsize=(15, 10))
plt.tight_layout()
plt.show()

# Pairwise relationships
sns.pairplot(data, hue='Outcome')
plt.show()

Understanding the Variables:

Pregnancies: Discrete count variable (integer).
Glucose: Continuous variable, stored as an integer.
BloodPressure: Continuous variable, stored as an integer.
SkinThickness: Continuous variable, stored as an integer.
Insulin: Continuous variable, stored as an integer.
BMI: Continuous variable (float).
DiabetesPedigreeFunction: Continuous variable (float).
Age: Continuous variable, stored as an integer (whole years).
Outcome: Binary target variable (0 or 1).
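
The stored type of each column can be verified with dtypes and nunique() rather than read off by eye. A minimal sketch on a small made-up frame (the values are illustrative and not taken from diabetes.csv):

```python
import pandas as pd

# Toy stand-in for a few columns of the dataset; values are made up.
toy = pd.DataFrame({
    'Pregnancies': [0, 1, 3, 6],
    'BMI': [23.3, 26.6, 33.6, 43.1],
    'Outcome': [0, 0, 1, 1],
})

# Storage type and number of distinct values per column
summary = pd.DataFrame({
    'dtype': toy.dtypes.astype(str),
    'n_unique': toy.nunique(),
})
print(summary)
```

A column with only two distinct values, such as Outcome, is a binary indicator even though it is stored as an integer.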

Q2. Preprocess the Data
Handle Missing Values: If any missing values are found:


# Fill missing values with each column's median (less sensitive to skew than the mean)
data = data.fillna(data.median())
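
One caveat: isnull() only counts NaN. If diabetes.csv is the widely used Pima Indians Diabetes file (an assumption; the source does not say so), zeros in columns such as Glucose or BMI are physiologically impossible and act as missing-value placeholders that isnull() will not report. A hedged sketch on made-up values, where the column list zero_as_missing is a hypothetical choice for illustration:

```python
import numpy as np
import pandas as pd

# Made-up rows; a 0 in Glucose or BMI stands in for an unrecorded measurement.
toy = pd.DataFrame({
    'Glucose': [148, 0, 183, 89],
    'BMI': [33.6, 26.6, 0.0, 28.1],
    'Pregnancies': [6, 0, 8, 1],  # 0 is a legitimate value here
})

# Hypothetical choice of columns where 0 cannot be a real reading
zero_as_missing = ['Glucose', 'BMI']

# Convert placeholder zeros to NaN so they are counted as missing
toy[zero_as_missing] = toy[zero_as_missing].replace(0, np.nan)
n_missing = toy.isnull().sum()
print(n_missing)

# Median imputation; medians are computed per column with NaN ignored
toy = toy.fillna(toy.median())
print(toy)
```

Which columns to treat this way is a judgment call that depends on how the data was recorded.
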
Remove Outliers: You can use the IQR method to remove outliers:


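# Calculate the IQR for each feature (the binary target is excluded)
features = data.drop('Outcome', axis=1)
Q1 = features.quantile(0.25)
Q3 = features.quantile(0.75)
IQR = Q3 - Q1

# Keep only rows where every feature lies within 1.5 * IQR of the quartiles
outlier_mask = ((features < (Q1 - 1.5 * IQR)) | (features > (Q3 + 1.5 * IQR))).any(axis=1)
filtered_data = data[~outlier_mask]
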
Transform Categorical Variables: In this case, there are no categorical variables that need transformation.

Q3. Split the Dataset
Split into Training and Test Sets:


from sklearn.model_selection import train_test_split

# Define features and target from the outlier-filtered data
X = filtered_data.drop('Outcome', axis=1)
y = filtered_data['Outcome']

# Split the data (70% train, 30% test), preserving the class ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
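
With an imbalanced target, a purely random split can shift the class ratio between the training and test sets. train_test_split accepts a stratify argument that keeps the ratio roughly equal in both. A small sketch on synthetic labels (made up for illustration, not the diabetes data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 30% positive class (illustrative only)
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Compare the positive rate in each set with the overall 30%
print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```
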

Q4. Train a Decision Tree Model
Train the Model:


from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Define the model
dtree = DecisionTreeClassifier(random_state=42)

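# Define the parameter grid for cross-validation.
# Shallow depths are included because very large values (30, 40, 50)
# behave like None on a dataset with only a few hundred rows.
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 3, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}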

# Perform Grid Search with cross-validation
grid_search = GridSearchCV(dtree, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best parameters and model
print("Best Parameters:", grid_search.best_params_)
best_model = grid_search.best_estimator_

Q5. Evaluate the Model
Evaluate Performance:

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, roc_curve, auc,
)

# Make predictions
y_pred = best_model.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, best_model.predict_proba(X_test)[:,1])
roc_auc = auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.show()

Q6. Interpret the Decision Tree
Visualize and Interpret the Tree:


from sklearn.tree import export_text, plot_tree

# Print the tree structure
tree_rules = export_text(best_model, feature_names=list(X.columns))
print(tree_rules)

# Plot the tree (feature_names is passed as a list; some scikit-learn versions reject a pandas Index)
plt.figure(figsize=(20, 10))
plot_tree(best_model, feature_names=list(X.columns), class_names=['Non-Diabetic', 'Diabetic'], filled=True)
plt.show()

Identify Important Variables:

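Feature importances are available via best_model.feature_importances_: one impurity-based score per feature, normalized so that the scores sum to 1. A higher score means the feature contributed more to the tree's splits.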
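
Pairing the scores with the column names makes them easier to read. A minimal sketch on synthetic data; the feature names f0 to f3 and the fitted tree are stand-ins for X.columns and best_model, not results from the analysis:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes features (illustrative only)
X_arr, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=['f0', 'f1', 'f2', 'f3'])

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo)

# One impurity-based score per feature, labeled by column name
importances = pd.Series(tree.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

The same two lines at the end apply to the fitted model by substituting best_model and X.columns.
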

Q7. Validate the Model
Sensitivity Analysis and Scenario Testing:

Test the model with new or altered data to ensure robustness.
Perform sensitivity analysis to check how changes in input features affect the model's predictions.
Example of Sensitivity Analysis:


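# Example: shift one feature and observe how the predictions change
X_test_modified = X_test.copy()
X_test_modified['Glucose'] = X_test_modified['Glucose'] + 10  # Example modification
y_pred_modified = best_model.predict(X_test_modified)

# Compare with the original predictions, row by row
comparison = pd.DataFrame({'Original': y_pred, 'Modified': y_pred_modified}, index=X_test.index)
print("Original Predictions vs Modified Predictions")
print(comparison)

# Summarize how many predictions flipped
n_changed = (comparison['Original'] != comparison['Modified']).sum()
print(f"Predictions changed: {n_changed} of {len(comparison)}")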
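
A single shift shows only one point of the model's response. Sweeping over several perturbation sizes and recording the fraction of predictions that flip gives a fuller picture. A sketch on synthetic data; the feature name 'f0', the list of shifts, and the fitted tree are placeholders rather than values from the analysis above:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the fitted model and test set (illustrative only)
X_arr, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=['f0', 'f1', 'f2', 'f3'])
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

baseline = model.predict(X_demo)

# Fraction of predictions that change for each shift applied to one feature
flip_rates = {}
for shift in [0.0, 0.5, 1.0, 2.0]:
    X_shifted = X_demo.copy()
    X_shifted['f0'] = X_shifted['f0'] + shift
    flip_rates[shift] = (model.predict(X_shifted) != baseline).mean()

print(pd.Series(flip_rates, name='flip_rate'))
```

A flip rate that jumps sharply at a small shift suggests the model has a split threshold close to many observed values of that feature.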