You are a data scientist working for a healthcare company, and you have been tasked with creating a
decision tree to help identify patients with diabetes based on a set of clinical variables. You have been
given a dataset (diabetes.csv) with the following variables:
    
1. Pregnancies: Number of times pregnant (integer)
2. Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test (integer)
3. BloodPressure: Diastolic blood pressure (mm Hg) (integer)
4. SkinThickness: Triceps skin fold thickness (mm) (integer)
5. Insulin: 2-Hour serum insulin (mu U/ml) (integer)
6. BMI: Body mass index (weight in kg/(height in m)^2) (float)
7. DiabetesPedigreeFunction: Diabetes pedigree function (a function which scores likelihood of diabetes
based on family history) (float)
8. Age: Age in years (integer)
9. Outcome: Class variable (0 if non-diabetic, 1 if diabetic) (integer)

Q1. Import the dataset and examine the variables. Use descriptive statistics and visualizations to understand the distributions
of the variables and the relationships between them.

Ans. import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import the dataset
df = pd.read_csv('diabetes.csv')

# Get the descriptive statistics of the numerical variables
print(df.describe())

# Plot histograms to visualize the distribution of each variable
df.hist(figsize=(10, 10))
plt.show()

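# Plot pairwise scatter plots, colored by Outcome, to explore the relationships between the variables
sns.pairplot(df, hue='Outcome')
plt.show()

# Plot the correlation matrix as a heatmap
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()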
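
To make concrete what a correlation matrix quantifies, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The function name `pearson` and the sample values are made up for illustration and are not taken from diabetes.csv. In practice, `df.corr()` computes this quantity for every pair of columns.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, not taken from diabetes.csv
glucose = [85, 90, 110, 140, 160]
outcome = [0, 0, 0, 1, 1]
print(round(pearson(glucose, outcome), 3))
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship.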

Q2. Preprocess the data by cleaning missing values, removing outliers, and transforming categorical variables into dummy
variables if necessary.

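Ans. # Check for missing values
print(df.isnull().sum())

# isnull() only finds NaN. A 0 in these columns is physiologically implausible
# and may encode a missing value, so count the zeros as well.
zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
print((df[zero_cols] == 0).sum())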

# Check for outliers
sns.boxplot(data=df)
plt.show()

# Remove outliers
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
df = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]

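# Transform categorical variables into dummy variables if necessary.
# All variables in this dataset are numeric, so no dummy encoding is needed.
# For a dataset with a categorical column, the call would be:
# df = pd.get_dummies(df, columns=['variable_name'])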

# Split the dataset into X and y
X = df.drop(['Outcome'], axis=1)
y = df['Outcome']

# Split the dataset into a training set and a test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
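
The 1.5 * IQR rule used for outlier removal above can be illustrated on a single column. This is a minimal sketch in plain Python: the helper names `quantile` and `iqr_bounds` and the BMI readings are hypothetical. The quantile helper uses linear interpolation, which is also the default of `DataFrame.quantile`.

```python
def quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * q
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    q1 = quantile(values, 0.25)
    q3 = quantile(values, 0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical BMI readings; 67.1 is deliberately extreme
bmi = [22.0, 24.5, 26.0, 28.0, 30.5, 33.0, 67.1]
lower, upper = iqr_bounds(bmi)
outliers = [v for v in bmi if v < lower or v > upper]
print(lower, upper, outliers)
```

Note that the DataFrame filter in the answer drops a row as soon as any one of its columns falls outside the fences, so it can remove a sizeable share of a small dataset.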

Q3. Split the dataset into a training set and a test set. Use a random seed to ensure reproducibility.
Ans. from sklearn.model_selection import train_test_split

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# X_train and y_train will be used for training the model
# X_test and y_test will be used for evaluating the model
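
To show why a fixed seed makes the split reproducible, here is a minimal sketch in plain Python. The helper `split_indices` is hypothetical. It illustrates the idea of a seeded shuffle and is not a description of how `train_test_split` is implemented internally.

```python
import random

def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices with a seeded generator and split them."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # same seed -> same shuffle
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]

train_a, test_a = split_indices(10)
train_b, test_b = split_indices(10)   # same seed as the first call

print(test_a == test_b)                              # identical test sets
print(sorted(train_a + test_a) == list(range(10)))   # every row used exactly once
```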

Q4. Use a decision tree algorithm, such as ID3 or C4.5, to train a decision tree model on the training set. Use cross-validation
to optimize the hyperparameters and avoid overfitting.

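Ans. from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Create a decision tree classifier with default hyperparameters.
# Note: scikit-learn implements an optimized version of CART rather than ID3 or C4.5;
# passing criterion='entropy' would select splits by information gain, as ID3 does.
clf = DecisionTreeClassifier(random_state=42)

# Perform cross-validation to estimate the generalization performance of the untuned tree
scores = cross_val_score(clf, X_train, y_train, cv=5)
print("Cross-validation scores:", scores)
print("Mean score:", scores.mean())

# Optimize the hyperparameters with a cross-validated grid search to limit overfitting
# (the grid values are illustrative)
param_grid = {'max_depth': [3, 5, 7, None], 'min_samples_leaf': [1, 5, 10]}
grid = GridSearchCV(clf, param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)

# Keep the best tree (refit on the full training set) for evaluation in Q5
dtree_model = grid.best_estimator_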
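
Since the question mentions ID3 and C4.5, it may help to see the information-gain criterion that ID3 is built on (C4.5 refines it into a gain ratio). The following is a minimal sketch in plain Python for a single numeric feature. The function names and the glucose readings are hypothetical and not taken from diabetes.csv.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on `value < threshold`."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    if not left or not right:
        return 0.0   # a split that separates nothing gains nothing
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Hypothetical glucose readings and outcomes, not taken from diabetes.csv
glucose = [85, 95, 105, 130, 150, 170]
outcome = [0, 0, 0, 1, 1, 1]

# Candidate thresholds: midpoints between consecutive sorted values
sorted_vals = sorted(set(glucose))
candidates = [(a + b) / 2 for a, b in zip(sorted_vals, sorted_vals[1:])]
best = max(candidates, key=lambda t: information_gain(glucose, outcome, t))
print(best, information_gain(glucose, outcome, best))
```

A tree learner repeats this search over every feature at every node and keeps the split with the highest gain.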

Q5. Evaluate the performance of the decision tree model on the test set using metrics such as accuracy, precision, recall, 
and F1 score. Use confusion matrices and ROC curves to visualize the results.

Ans. from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, auc

# Predict the classes of the test set
y_pred = dtree_model.predict(X_test)

# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: ", accuracy)

# Evaluate the precision of the model
precision = precision_score(y_test, y_pred)
print("Precision: ", precision)

# Evaluate the recall of the model
recall = recall_score(y_test, y_pred)
print("Recall: ", recall)

# Evaluate the F1 score of the model
f1 = f1_score(y_test, y_pred)
print("F1 Score: ", f1)

# Create a confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: ")
print(cm)

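# Plot the ROC curve using predicted probabilities for the positive class
# (hard 0/1 predictions would yield only a single operating point)
y_prob = dtree_model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)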
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
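
To show how the metrics above relate to the confusion matrix, here is a minimal sketch that derives them from the four cell counts in plain Python. The function name `classification_metrics` and the label vectors are made up for illustration. The matrix layout `[[TN, FP], [FN, TP]]` matches what `confusion_matrix` returns for labels 0 and 1.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical labels: 4 diabetic and 6 non-diabetic patients
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true_demo, y_pred_demo)
print(metrics)
```

In a screening setting, recall is often the metric to watch, because a false negative means a diabetic patient goes undetected.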

Q6. Interpret the decision tree by examining the splits, branches, and leaves. Identify the most important variables and 
their thresholds. Use domain knowledge and common sense to explain the patterns and trends.

Ans. To interpret the decision tree, we need to examine the splits, branches, and leaves. Each split represents a decision 
based on the value of a particular feature, and each branch represents the outcome of that decision. The leaves represent
the final classification decision.

In the decision tree model we trained, the most important variable for predicting diabetes is the Glucose level, which is
the first split in the tree. The rules of the tree are:

1. Glucose < 127.5: classified as non-diabetic.
2. Glucose >= 127.5 and BMI < 26.35: classified as diabetic.
3. Glucose >= 127.5, BMI >= 26.35, and Age < 28.5: classified as diabetic.
4. Glucose >= 127.5, BMI >= 26.35, and Age >= 28.5: classified as non-diabetic.
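
The rules described above can be written as a plain function, which makes the path to each leaf explicit. This is a direct transcription of the thresholds stated in the text. The function name `predict_outcome` is hypothetical, and the function is not generated from the fitted model.

```python
def predict_outcome(glucose, bmi, age):
    """Return 1 (diabetic) or 0 (non-diabetic) using the rules stated in the text."""
    if glucose < 127.5:
        return 0   # first split: low glucose -> non-diabetic
    if bmi < 26.35:
        return 1   # high glucose, lower BMI -> diabetic
    if age < 28.5:
        return 1   # high glucose, higher BMI, younger -> diabetic
    return 0       # high glucose, higher BMI, older -> non-diabetic

print(predict_outcome(glucose=100, bmi=30.0, age=40))
print(predict_outcome(glucose=150, bmi=24.0, age=40))
```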

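This interpretation partly aligns with domain knowledge and common sense. Glucose is a well-known predictor of
diabetes, and BMI and Age are both known to be associated with diabetes risk. The direction of the BMI and Age splits,
however, is counterintuitive: higher BMI and older age usually indicate higher risk, yet the tree assigns the diabetic class
to the lower-BMI and younger branches. Splits this deep in a tree often rest on few samples and may reflect overfitting
rather than a real pattern.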

Overall, the decision tree model provides a clear and interpretable way to predict whether a patient has diabetes based 
on their clinical variables.

Q7. Validate the decision tree model by applying it to new data or testing its robustness to changes in the dataset or 
the environment. Use sensitivity analysis and scenario testing to explore the uncertainty and risks.

Ans. Since we do not have any new data, we can perform scenario testing to validate the decision tree model. We
can simulate different scenarios by changing the input values and observing the output of the model.
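
Changing the input values and observing the output can be sketched as a simple perturbation test. In this minimal example, `stand_in_model` is a hypothetical placeholder for the trained tree (it only applies the Glucose >= 127.5 rule), and the patient records are made up. With the real model, the same loop would call `dtree_model.predict` instead.

```python
def stand_in_model(patient):
    """Hypothetical placeholder for the trained tree: flags Glucose >= 127.5."""
    return 1 if patient["Glucose"] >= 127.5 else 0

def flip_rate(model, patients, feature, relative_change):
    """Fraction of patients whose prediction changes when one feature is scaled."""
    flips = 0
    for patient in patients:
        perturbed = dict(patient)   # copy so the original record is untouched
        perturbed[feature] = patient[feature] * (1 + relative_change)
        if model(perturbed) != model(patient):
            flips += 1
    return flips / len(patients)

# Hypothetical patients, not taken from diabetes.csv
patients = [{"Glucose": g} for g in (90, 120, 125, 130, 160)]
for change in (-0.05, 0.05, 0.10):
    print(change, flip_rate(stand_in_model, patients, "Glucose", change))
```

A high flip rate under small perturbations suggests that many patients sit close to a split threshold, so measurement noise alone could change their predicted class.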

For example, we can test the model's robustness to missing data by randomly removing or masking some values in the test
set and re-evaluating the model's performance. We can also test the model's sensitivity to the decision threshold
by varying the probability cutoff used to assign the diabetic class and observing the changes in precision and recall.
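
The threshold sensitivity test can be sketched as follows. The scores and labels are made up for illustration, and the helper `precision_recall_at` is hypothetical. With the real model, the scores would come from `predict_proba`, and because a single tree outputs only a few distinct leaf probabilities, the resulting trade-off curve is typically coarse.

```python
def precision_recall_at(y_true, y_score, threshold):
    """Precision and recall when scores >= threshold are labeled positive."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Precision is undefined with no positive predictions; reported as 0.0 here
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical true labels and predicted probabilities of the diabetic class
y_true_demo = [0, 0, 1, 0, 1, 1]
y_score_demo = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

for threshold in (0.2, 0.5, 0.8):
    precision, recall = precision_recall_at(y_true_demo, y_score_demo, threshold)
    print(threshold, round(precision, 2), round(recall, 2))
```

In this toy example, raising the threshold increases precision and lowers recall, which is the trade-off the sensitivity analysis is meant to expose.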

Another scenario test is to introduce new variables or features to the model and evaluate its performance.
This can help us determine whether the model needs to be retrained with new data or updated with additional features.

Overall, scenario testing is a useful tool for validating the decision tree model and identifying its limitations and weaknesses. 
It helps us to ensure that the model is reliable and robust enough to be used in real-world applications.