# **Decision Tree**
Question 1: What is a Decision Tree, and how does it work in the context of
classification?
- **Decision Tree in Classification**

A Decision Tree is a type of supervised learning algorithm used for classification and regression tasks. It's a tree-like model that splits data into subsets based on features or attributes.

**How Decision Trees Work**
1. Root Node: The algorithm starts with a root node representing the entire dataset.
2. Splitting: The algorithm selects the best feature to split the data into subsets based on a splitting criterion (e.g., Gini impurity or entropy).
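3. Decision Nodes: Each internal node represents a decision based on a feature, and the data is split into subsets accordingly. Splitting then repeats recursively on each subset until a stopping condition is met (for example, the node is pure or a maximum depth is reached).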
4. Leaf Nodes: The terminal nodes represent the predicted class labels.
5. Prediction: New instances are classified by traversing the tree from the root node to a leaf node based on the feature values.

**Decision Tree Example**

Suppose we want to classify whether someone will play tennis based on weather conditions (outlook, temperature, humidity, and wind).

| Outlook | Temperature | Humidity | Wind | Play Tennis |
| ------- | ----------- | -------- | ---- | ----------- |
| Sunny | Hot | High | Weak | No |
| Sunny | Hot | High | Strong | No |
| Overcast | Hot | High | Weak | Yes |
| Rain | Mild | High | Weak | Yes |


The Decision Tree might look like this:

- Root Node: Outlook
    - Sunny: Humidity
        - High: No
        - Normal: Yes
    - Overcast: Yes
    - Rain: Wind
        - Weak: Yes
        - Strong: No
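
The traversal described in the Prediction step can be sketched in a few lines of Python. This is only an illustration: the nested-dictionary layout and the names `tennis_tree` and `classify` are assumptions made for this sketch, not part of any library.

```python
# The example tree above, written as nested dictionaries.
# An internal node is {"feature": ..., "branches": {...}}; a leaf is just a class label.
tennis_tree = {
    "feature": "Outlook",
    "branches": {
        "Sunny": {"feature": "Humidity", "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"feature": "Wind", "branches": {"Weak": "Yes", "Strong": "No"}},
    },
}


def classify(tree, instance):
    """Walk from the root to a leaf, following the branch that matches each feature value."""
    node = tree
    while isinstance(node, dict):
        value = instance[node["feature"]]
        node = node["branches"][value]
    return node


# The four rows from the table above
rows = [
    {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Weak"},
    {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Strong"},
    {"Outlook": "Overcast", "Temperature": "Hot", "Humidity": "High", "Wind": "Weak"},
    {"Outlook": "Rain", "Temperature": "Mild", "Humidity": "High", "Wind": "Weak"},
]
predictions = [classify(tennis_tree, row) for row in rows]
print(predictions)
```

Note that Temperature never appears in the tree, so it has no effect on the prediction.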

**Advantages**
- Interpretability: Decision Trees are easy to understand and visualize.
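- Handling categorical features: Decision Trees can, in principle, split on categorical features directly, although some implementations (including scikit-learn's) require them to be encoded numerically first.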

**Disadvantages**
- Overfitting: Decision Trees can overfit the training data, especially when the trees are deep.
- Instability: Small changes in the data can result in significantly different trees.




In [28]:
# Python Implementation
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train a Decision Tree classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)


# This example demonstrates how to create and train a Decision Tree classifier using scikit-learn in Python.

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?
- **Impurity Measures in Decision Trees**

Gini Impurity and Entropy are two common impurity measures used in Decision Trees to determine the best split.

- **Gini Impurity**

Gini Impurity measures the probability of misclassifying a randomly chosen instance if it were labeled at random according to the class distribution of the dataset. It's calculated as:

Gini Impurity = 1 - Σ (p_i)^2

where p_i is the proportion of instances of class i in the dataset.

- **Entropy**

Entropy measures the uncertainty or randomness in a dataset. It's calculated as:

Entropy = - Σ (p_i * log2(p_i))

where p_i is the proportion of instances of class i in the dataset.

- **Impact on Splits**

Both Gini Impurity and Entropy are used to evaluate the quality of a split. The goal is to find the split that results in the largest reduction in impurity.

1. Gini Impurity: A lower Gini Impurity value indicates a purer node. Decision Trees using Gini Impurity tend to favor splits that result in large, pure nodes.
2. Entropy: A lower Entropy value indicates less uncertainty in the node. Decision Trees using Entropy tend to favor splits that result in nodes with low uncertainty.

**Comparison**
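- Gini Impurity: Slightly faster to compute because it avoids the logarithm, and it usually produces splits similar to those chosen with Entropy.
- Entropy: Involves a logarithm, so it is slightly more expensive to compute. It can occasionally select a different split than Gini Impurity, but the resulting trees are usually very similar.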

**Example**

Suppose we have a dataset with two classes (0 and 1) and a feature that splits the data into two subsets:

Subset 1: 80% class 0, 20% class 1

Subset 2: 40% class 0, 60% class 1

Using Gini Impurity:

Gini Impurity (Subset 1) = 1 - (0.8)^2 - (0.2)^2 = 0.32

Gini Impurity (Subset 2) = 1 - (0.4)^2 - (0.6)^2 = 0.48

Using Entropy:

Entropy (Subset 1) = -(0.8 * log2(0.8) + 0.2 * log2(0.2)) ≈ 0.72

Entropy (Subset 2) = -(0.4 * log2(0.4) + 0.6 * log2(0.6)) ≈ 0.97

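The Decision Tree would choose the split that results in the largest reduction in impurity, that is, the parent node's impurity minus the size-weighted average impurity of the subsets. The subset sizes are not given in this example, so only the per-subset impurities are shown.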
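
The arithmetic above can be checked with a short sketch. The helper names `gini` and `entropy` are chosen for this illustration and are not scikit-learn functions.

```python
import math


def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p ** 2 for p in proportions)


def entropy(proportions):
    """Shannon entropy in bits; classes with zero proportion contribute nothing."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)


subsets = {"Subset 1": (0.8, 0.2), "Subset 2": (0.4, 0.6)}
for name, proportions in subsets.items():
    print(f"{name}: Gini = {gini(proportions):.2f}, Entropy = {entropy(proportions):.2f}")
# Subset 1: Gini = 0.32, Entropy = 0.72
# Subset 2: Gini = 0.48, Entropy = 0.97
```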

In [7]:
# Python Implementation
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train a Decision Tree classifier using Gini Impurity
clf_gini = DecisionTreeClassifier(criterion='gini', random_state=42)
clf_gini.fit(X_train, y_train)

# Create and train a Decision Tree classifier using Entropy
clf_entropy = DecisionTreeClassifier(criterion='entropy', random_state=42)
clf_entropy.fit(X_train, y_train)


# This example demonstrates how to create and train Decision Tree classifiers using Gini Impurity and Entropy as impurity measures in scikit-learn.

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.
- **Pre-Pruning vs Post-Pruning in Decision Trees**

Pre-Pruning and Post-Pruning are two techniques used to prevent overfitting in Decision Trees.

**Pre-Pruning**

Pre-Pruning involves stopping the growth of the tree before it's fully grown. This is done by specifying a set of parameters that control the tree's growth, such as:

- Maximum depth: The maximum number of levels in the tree.
- Minimum samples split: The minimum number of samples required to split an internal node.
- Minimum samples leaf: The minimum number of samples required to be at a leaf node.

**Post-Pruning**

Post-Pruning involves growing the tree to its full depth and then removing branches that don't contribute significantly to the tree's performance. This is done by:

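- Cost complexity pruning: Repeatedly removing the "weakest link", the subtree whose removal increases the training error the least relative to the number of leaves it eliminates. A complexity parameter (ccp_alpha in scikit-learn) controls how much is pruned; larger values prune more.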

**Practical Advantages**

- Pre-Pruning: One practical advantage of pre-pruning is that it reduces computational cost. By stopping the growth of the tree early, we can avoid unnecessary computations and reduce the risk of overfitting.
- Post-Pruning: One practical advantage of post-pruning is that it allows for more flexibility. By growing the tree to its full depth, we can explore more possible splits and then prune the tree to remove unnecessary branches, potentially leading to better performance.

**Example**

Suppose we have a dataset with 1000 samples and 10 features. We can use pre-pruning to limit the maximum depth of the tree to 5 levels, or we can use post-pruning to grow the tree to its full depth and then remove its weakest branches with cost complexity pruning.


In [8]:
# Python Implementation
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pre-Pruning: Limit the maximum depth to 5
clf_pre_pruning = DecisionTreeClassifier(max_depth=5, random_state=42)
clf_pre_pruning.fit(X_train, y_train)

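# Post-Pruning: compute the pruning path of a fully grown tree, then refit with a chosen alpha
full_tree = DecisionTreeClassifier(random_state=42)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
# ccp_alphas[-1] would prune the tree down to the root node alone, so the next-largest
# alpha is used here for illustration; in practice, choose ccp_alpha by cross-validation
clf_post_pruning = DecisionTreeClassifier(random_state=42, ccp_alpha=ccp_alphas[-2])
clf_post_pruning.fit(X_train, y_train)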


# This example demonstrates how to implement pre-pruning and post-pruning in scikit-learn.
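
Rather than fixing the pruning strength by hand, ccp_alpha can be selected by cross-validation. The following is a minimal sketch under that assumption, not the only reasonable procedure; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Candidate alphas come from the pruning path of a fully grown tree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
# Drop the last alpha (it leaves only the root) and guard against tiny negative values
candidate_alphas = np.clip(path.ccp_alphas[:-1], 0.0, None)

# Mean 5-fold cross-validated accuracy for each candidate alpha
cv_scores = [
    cross_val_score(
        DecisionTreeClassifier(random_state=42, ccp_alpha=alpha), X_train, y_train, cv=5
    ).mean()
    for alpha in candidate_alphas
]
best_alpha = candidate_alphas[int(np.argmax(cv_scores))]

# Refit on the full training set with the selected alpha and check the held-out test set
pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha).fit(X_train, y_train)
test_accuracy = pruned.score(X_test, y_test)
print("Selected ccp_alpha:", best_alpha)
print("Leaves after pruning:", pruned.get_n_leaves())
print("Test accuracy:", test_accuracy)
```

The test set is used only once, after the alpha has been chosen on the training data.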

Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?
- **Information Gain in Decision Trees**

Information Gain is a measure used in Decision Trees to evaluate the quality of a split. It's calculated as the entropy of the parent node minus the size-weighted average entropy of the child nodes.

Information Gain = Entropy (Parent) - Σ (|Child| / |Parent|) * Entropy (Child)

where |Child| is the number of instances in the child node, and |Parent| is the number of instances in the parent node.

**Importance for Choosing the Best Split**

Information Gain is important for choosing the best split because it helps to:

1. Reduce uncertainty: By choosing the split with the highest Information Gain, we can reduce the uncertainty in the child nodes and make more accurate predictions.
2. Identify relevant features: Information Gain helps to identify the most relevant features for splitting the data, which can improve the overall performance of the Decision Tree.

**How Information Gain Works**

1. Calculate entropy: Calculate the entropy of the parent node and the child nodes.
2. Calculate Information Gain: Calculate the Information Gain for each possible split.
3. Choose the best split: Choose the split with the highest Information Gain.

**Example**

Suppose we have a dataset with two features (A and B) and a target variable. We can calculate the Information Gain for each feature and choose the feature with the highest Information Gain as the best split.

| Feature A | Feature B | Target |
| --------- | --------- | ------ |
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |


The parent node contains two instances of each class, so its entropy is 1.0. Splitting on Feature A yields two pure child nodes (entropy 0.0 each), for an Information Gain of 1.0 - 0.0 = 1.0. Splitting on Feature B yields two child nodes that each contain one instance of each class (entropy 1.0 each), for an Information Gain of 1.0 - 1.0 = 0.0. Feature A is therefore chosen as the best split.
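
The Information Gain of each feature in the table can be computed directly from the formula. This is a small sketch; `entropy` and `information_gain` are helper names introduced here for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(count / total) * math.log2(count / total) for count in Counter(labels).values())


def information_gain(feature_values, labels):
    """Parent entropy minus the size-weighted average entropy of the child nodes."""
    total = len(labels)
    weighted_child_entropy = 0.0
    for value in set(feature_values):
        child = [label for f, label in zip(feature_values, labels) if f == value]
        weighted_child_entropy += len(child) / total * entropy(child)
    return entropy(labels) - weighted_child_entropy


# Columns of the table above
feature_a = [0, 0, 1, 1]
feature_b = [0, 1, 0, 1]
target = [0, 0, 1, 1]

gain_a = information_gain(feature_a, target)
gain_b = information_gain(feature_b, target)
print("Information Gain (Feature A):", gain_a)  # 1.0
print("Information Gain (Feature B):", gain_b)  # 0.0
```

Feature A separates the two classes perfectly, while Feature B leaves both child nodes as mixed as the parent.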

In [9]:
# Python Implementation
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import numpy as np

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train a Decision Tree classifier using entropy
clf = DecisionTreeClassifier(criterion='entropy', random_state=42)
clf.fit(X_train, y_train)

# Get the feature importances
feature_importances = clf.feature_importances_

# Print the feature importances
for i in range(len(iris.feature_names)):
    print(f"Feature {iris.feature_names[i]}: {feature_importances[i]:.3f}")


# This example demonstrates how to use entropy as the criterion for a Decision Tree classifier in scikit-learn, which is based on Information Gain.

Feature sepal length (cm): 0.000
Feature sepal width (cm): 0.014
Feature petal length (cm): 0.895
Feature petal width (cm): 0.090


Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?
- **Real-World Applications of Decision Trees**

Decision Trees have numerous applications in various fields, including:

1. Credit Risk Assessment: Decision Trees can be used to evaluate the creditworthiness of loan applicants based on factors such as credit history, income, and employment status.
2. Medical Diagnosis: Decision Trees can be used to diagnose diseases based on symptoms, medical history, and test results.
3. Customer Segmentation: Decision Trees can be used to segment customers based on demographic and behavioral characteristics, such as age, income, and purchase history.
4. Predictive Maintenance: Decision Trees can be used to predict equipment failures and schedule maintenance based on sensor data and historical maintenance records.
5. Image Classification: Decision Trees can be used as a component of image classification systems, such as object detection and facial recognition.

**Advantages of Decision Trees**

1. Interpretability: Decision Trees are easy to understand and interpret, making them a popular choice for applications where transparency is important.
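2. Handling categorical features: Decision Trees can, in principle, split on categorical features directly, although some implementations (including scikit-learn's) require them to be encoded numerically first.
3. Handling missing values: Some implementations handle missing values natively (for example, scikit-learn 1.3 and later accept NaN values in DecisionTreeClassifier); otherwise, missing values can be treated as a separate category or filled in with imputation methods.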
4. Fast training: Decision Trees are relatively fast to train, especially compared to more complex models like neural networks.

**Limitations of Decision Trees**

1. Overfitting: Decision Trees can overfit the training data, especially when the trees are deep or the dataset is small.
2. Instability: Small changes in the data can result in significantly different trees, which can make it challenging to interpret the results.
3. Limited handling of some relationships: Each split tests a single feature against a threshold, so smooth or linear relationships and interactions between features can only be approximated with many step-like splits, which tends to require deep trees.
4. Greedy algorithm: Decision Trees use a greedy algorithm to select the best split, which can lead to suboptimal solutions.

**Example**

Suppose we want to build a Decision Tree model to predict whether a customer will churn based on their usage patterns and demographic characteristics. We can use a Decision Tree classifier to identify the most important features and predict the likelihood of churn.

In [10]:
# Python Implementation
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report

# Load the Iris dataset (using Iris as a proxy for a classification problem)
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train a Decision Tree classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))


# This example demonstrates how to use a Decision Tree classifier in scikit-learn to solve a classification problem.

Accuracy: 1.0
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



Dataset Info:

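● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or provided CSV).

● Boston Housing Dataset for regression tasks (sklearn.datasets.load_boston() or provided CSV). Note: load_boston() was removed in scikit-learn 1.2, so Question 8 uses the California Housing dataset (sklearn.datasets.fetch_california_housing()) instead.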

Question 6: Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier using the Gini criterion

● Print the model’s accuracy and feature importances

In [14]:
# Training a Decision Tree Classifier on the Iris Dataset
# Here's a Python program that loads the Iris dataset, trains a Decision Tree Classifier using the Gini criterion, and prints the model's accuracy and feature importances.


# Import necessary libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train a Decision Tree classifier using the Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Print feature importances
feature_importances = clf.feature_importances_
for i in range(len(iris.feature_names)):
    print(f"Feature '{iris.feature_names[i]}' importance: {feature_importances[i]:.3f}")


# Explanation
# 1. We load the Iris dataset using load_iris() from sklearn.
# 2. We split the data into training and test sets using train_test_split().
# 3. We create and train a Decision Tree classifier using the Gini criterion.
# 4. We make predictions on the test set and evaluate the model's accuracy.
# 5. We print the feature importances, which measure each feature's share of the total impurity reduction in the fitted tree.

# Output: The output will include the model's accuracy and feature importances. The feature importances will indicate which features the Decision Tree classifier relied on most when choosing its splits.

# Note that the feature importances are calculated based on the Gini impurity criterion, which measures the reduction in impurity after splitting on a feature. The feature importances are normalized to sum up to 1, making it easier to compare the importance of different features.

Model Accuracy: 1.0
Feature 'sepal length (cm)' importance: 0.000
Feature 'sepal width (cm)' importance: 0.017
Feature 'petal length (cm)' importance: 0.906
Feature 'petal width (cm)' importance: 0.077


Question 7: Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.


In [15]:
# Training a Decision Tree Classifier with Limited Depth
# Here's a Python program that loads the Iris dataset, trains a Decision Tree Classifier with max_depth=3, and compares its accuracy to a fully-grown tree.


# Import necessary libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train a fully-grown Decision Tree classifier
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
y_pred_full = full_tree.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)
print("Fully-grown Tree Accuracy:", accuracy_full)

# Create and train a Decision Tree classifier with max_depth=3
limited_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
limited_tree.fit(X_train, y_train)
y_pred_limited = limited_tree.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)
print("Limited Tree (max_depth=3) Accuracy:", accuracy_limited)

# Compare the accuracies
print("Difference in Accuracy:", accuracy_full - accuracy_limited)


# Explanation
# 1. We load the Iris dataset using load_iris() from sklearn.
# 2. We split the data into training and test sets using train_test_split().
# 3. We create and train a fully-grown Decision Tree classifier.
# 4. We create and train a Decision Tree classifier with max_depth=3.
# 5. We make predictions on the test set and evaluate the accuracy of both models.
# 6. We compare the accuracies of the two models.

# Output: The output will include the accuracy of the fully-grown tree and the limited tree, as well as the difference in accuracy between the two models. This will help us understand the impact of limiting the tree's depth on its performance.

Fully-grown Tree Accuracy: 1.0
Limited Tree (max_depth=3) Accuracy: 1.0
Difference in Accuracy: 0.0
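
Both trees score 1.0 on this particular 30-sample test split, so a single split cannot separate them. A hedged way to get a steadier comparison is k-fold cross-validation over the whole dataset; the sketch below assumes 5 folds, and the `cv_results` dictionary is a name introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Mean and standard deviation of 5-fold accuracy for each depth setting
cv_results = {}
for label, depth in [("Fully-grown", None), ("max_depth=3", 3)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(tree, X, y, cv=5)
    cv_results[label] = (scores.mean(), scores.std())
    print(f"{label}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

If the two means are within a standard deviation of each other, the shallower tree is usually the more attractive choice because it is simpler to read.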


Question 8: Write a Python program to:

● Load the California Housing dataset from sklearn

● Train a Decision Tree Regressor

● Print the Mean Squared Error (MSE) and feature importances

In [16]:
# Training a Decision Tree Regressor on the California Housing Dataset
# Here's a Python program that loads the California Housing dataset, trains a Decision Tree Regressor, and prints the Mean Squared Error (MSE) and feature importances.


# Import necessary libraries
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
cal_housing = fetch_california_housing()
X = cal_housing.data
y = cal_housing.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Make predictions on the test set
y_pred = regressor.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Print feature importances
feature_importances = regressor.feature_importances_
for i in range(len(cal_housing.feature_names)):
    print(f"Feature '{cal_housing.feature_names[i]}' importance: {feature_importances[i]:.3f}")


# Explanation
# 1. We load the California Housing dataset using fetch_california_housing() from sklearn.
# 2. We split the data into training and test sets using train_test_split().
# 3. We create and train a Decision Tree Regressor.
# 4. We make predictions on the test set and evaluate the model's performance using the Mean Squared Error (MSE).
# 5. We print the feature importances, which measure each feature's share of the total impurity reduction in the fitted tree.

# Output: The output will include the Mean Squared Error (MSE) and feature importances. The feature importances will indicate which features the Decision Tree Regressor relied on most when choosing its splits.

# Note that the feature importances are calculated based on the reduction in impurity (squared error, for a regressor with the default criterion) after splitting on a feature, and the feature importances are normalized to sum up to 1.

Mean Squared Error: 0.495235205629094
Feature 'MedInc' importance: 0.529
Feature 'HouseAge' importance: 0.052
Feature 'AveRooms' importance: 0.053
Feature 'AveBedrms' importance: 0.029
Feature 'Population' importance: 0.031
Feature 'AveOccup' importance: 0.131
Feature 'Latitude' importance: 0.094
Feature 'Longitude' importance: 0.083


Question 9: Write a Python program to:

● Load the Iris Dataset

● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV

● Print the best parameters and the resulting model accuracy

In [18]:
# Tuning Decision Tree Parameters using GridSearchCV
# Here's a Python program that loads the Iris dataset, tunes the Decision Tree's max_depth and min_samples_split using GridSearchCV, and prints the best parameters and the resulting model accuracy.


# Import necessary libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter grid
param_grid = {
    'max_depth': [None, 3, 5, 7],
    'min_samples_split': [2, 5, 10]
}

# Create a Decision Tree classifier
clf = DecisionTreeClassifier(random_state=42)

# Perform grid search
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Print the best parameters
print("Best Parameters:", grid_search.best_params_)

# Print the best cross-validation score
print("Best Cross-Validation Score:", grid_search.best_score_)

# Evaluate the best model on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Test Set Accuracy:", accuracy)


# Explanation
# 1. We load the Iris dataset using load_iris() from sklearn.
# 2. We split the data into training and test sets using train_test_split().
# 3. We define the parameter grid for max_depth and min_samples_split.
# 4. We perform grid search using GridSearchCV to find the best combination of parameters.
# 5. We print the best parameters and the best cross-validation score.
# 6. We evaluate the best model on the test set and print its accuracy.

# Output: The output will include the best parameters, the best cross-validation score, and the test set accuracy. This will help us understand the impact of tuning the Decision Tree's parameters on its performance.

Best Parameters: {'max_depth': None, 'min_samples_split': 2}
Best Cross-Validation Score: 0.9416666666666668
Test Set Accuracy: 1.0


Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.

Explain the step-by-step process you would follow to:

● Handle the missing values

● Encode the categorical features

● Train a Decision Tree model

● Tune its hyperparameters

● Evaluate its performance

And describe what business value this model could provide in the real-world
setting.


In [26]:
# Step-by-Step Process for Predicting Disease Using Decision Trees
# Here's a step-by-step process to handle missing values, encode categorical features, train a Decision Tree model, tune its hyperparameters, and evaluate its performance.

# Step 1: Handle Missing Values
# 1. Identify missing values: Use isnull() or isna() to detect missing values in the dataset.
# 2. Determine the type of missing values: Decide whether the missing values are Missing At Random (MAR), Missing Completely At Random (MCAR), or Missing Not At Random (MNAR).
# 3. Choose an imputation strategy: Based on the type of missing values and the dataset, choose an imputation strategy such as mean, median, mode, or imputation using a machine learning model.
# 4. Impute missing values: Use the chosen imputation strategy to fill in the missing values.

# Example code:

import pandas as pd
from sklearn.impute import SimpleImputer

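# Assume 'df' is the DataFrame with missing values and 'feature1'/'feature2' are numeric columns.
# To avoid data leakage, fit the imputer on the training split only (for example inside a
# scikit-learn Pipeline); it is fitted on the full DataFrame here only to keep the example short.
imputer = SimpleImputer(strategy='mean')
df[['feature1', 'feature2']] = imputer.fit_transform(df[['feature1', 'feature2']])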


# Step 2: Encode Categorical Features
# 1. Identify categorical features: Determine which features are categorical and need encoding.
# 2. Choose an encoding strategy: Decide on an encoding strategy such as one-hot encoding, label encoding, or ordinal encoding.
# 3. Encode categorical features: Use the chosen encoding strategy to transform categorical features into numerical features.

# Example code:

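from sklearn.preprocessing import OneHotEncoder

# Assume 'df' is the DataFrame with categorical features
# handle_unknown='ignore' encodes categories unseen during fitting as all zeros instead of raising an error
encoder = OneHotEncoder(handle_unknown='ignore')
encoded_features = encoder.fit_transform(df[['categorical_feature']])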


# Step 3: Train a Decision Tree Model
# 1. Split the data: Split the preprocessed data into training and test sets.
# 2. Train a Decision Tree model: Use the training data to train a Decision Tree classifier.

# Example code:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Assume 'X' is the feature matrix and 'y' is the target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)


# Step 4: Tune Hyperparameters
# 1. Define the hyperparameter grid: Determine the hyperparameters to tune and their possible values.
# 2. Perform grid search: Use grid search to find the best combination of hyperparameters.

# Example code:

from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid
param_grid = {
    'max_depth': [None, 3, 5, 7],
    'min_samples_split': [2, 5, 10]
}

# Perform grid search
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)


# Step 5: Evaluate Performance
# 1. Evaluate the model: Use the test data to evaluate the performance of the best model.
# 2. Calculate metrics: Calculate metrics such as accuracy, precision, recall, and F1-score.

# Example code:

from sklearn.metrics import accuracy_score, classification_report

# Evaluate the best model
y_pred = grid_search.best_estimator_.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))


# Business Value
# The Decision Tree model can provide significant business value in the real-world setting by:

# 1. Improving diagnosis accuracy: The model can help doctors diagnose diseases more accurately, reducing the risk of misdiagnosis and improving patient outcomes.
# 2. Reducing costs: By identifying high-risk patients and providing early intervention, the model can help reduce healthcare costs associated with treating advanced diseases.
# 3. Enhancing patient care: The model can help doctors develop personalized treatment plans tailored to individual patients' needs, improving patient care and outcomes.

# Overall, the Decision Tree model can be a valuable tool in the healthcare industry, providing insights that can improve diagnosis accuracy, reduce costs, and enhance patient care.
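
The five steps can also be combined into a single scikit-learn Pipeline, so that imputation and encoding are fitted on the training folds only. The sketch below is a minimal illustration under stated assumptions: the patient data is synthetic, every column name (`age`, `blood_pressure`, `smoker`, `sex`) is hypothetical, and recall is used as the tuning metric on the assumption that missed cases are the costlier error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records; all columns are hypothetical
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, size=n).astype(float),
    "blood_pressure": rng.normal(120, 15, size=n),
    "smoker": rng.choice(["yes", "no"], size=n),
    "sex": rng.choice(["F", "M"], size=n),
})
# Label loosely tied to age and smoking, plus noise
risk = (df["age"] > 60).astype(int) + (df["smoker"] == "yes").astype(int)
y = ((risk + rng.integers(0, 2, size=n)) >= 2).astype(int)
# Remove roughly 10% of the values in one numeric and one categorical column
df.loc[rng.random(n) < 0.1, "blood_pressure"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker", "sex"]

# Steps 1 and 2: impute and encode, separately per column type
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Step 3: preprocessing and the tree in one estimator
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: tune the tree's hyperparameters (note the "clf__" prefix for pipeline steps)
param_grid = {"clf__max_depth": [3, 5, None], "clf__min_samples_split": [2, 10]}
search = GridSearchCV(model, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

# Step 5: evaluate the refitted best model on the held-out test set
y_pred = search.predict(X_test)
print("Best parameters:", search.best_params_)
print(classification_report(y_test, y_pred))
```

Because the preprocessing lives inside the pipeline, each cross-validation fold fits its own imputer and encoder, which avoids leaking test-fold statistics into training.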


