# **Questions**

**Question 1: What is a Decision Tree, and how does it work in the context of
classification?**
> A decision tree is a supervised learning algorithm used for both classification and regression tasks. It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes.

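> For classification, a decision tree recursively splits the dataset on feature values into increasingly homogeneous subsets. Each internal node is a feature-based decision point, each branch is an outcome of that decision, and each leaf node corresponds to a class label. A new sample is classified by following the decisions from the root down to a leaf.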
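
> As a minimal sketch of that traversal, the snippet below walks a hand-written toy tree. The feature names, thresholds, and the `predict` helper are illustrative assumptions, not values learned from data.

```python
# A hand-written toy tree (illustrative only, not learned from data).
# Internal nodes hold a feature name and a threshold; leaves hold a class label.
toy_tree = {
    "feature": "petal_length",
    "threshold": 2.5,
    "left": {"label": "setosa"},            # petal_length <= 2.5
    "right": {                              # petal_length > 2.5
        "feature": "petal_width",
        "threshold": 1.7,
        "left": {"label": "versicolor"},    # petal_width <= 1.7
        "right": {"label": "virginica"},    # petal_width > 1.7
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, following one branch per decision."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


print(predict(toy_tree, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict(toy_tree, {"petal_length": 5.1, "petal_width": 2.0}))  # virginica
```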

**Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?**

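> Gini Impurity and Entropy both measure how mixed the class labels are within a node. Both are 0 when a node contains a single class and reach their maximum when the classes are evenly mixed, so lower impurity in the child nodes indicates a better split.

> At each node, the tree evaluates candidate splits and chooses the one that reduces impurity the most, that is, the split with the highest information gain (for entropy) or the largest drop in Gini impurity.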
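
> A small sketch of the two measures, using only the standard library. The helper names `gini` and `entropy` are chosen here for illustration; they compute the usual definitions, 1 minus the sum of squared class proportions and the negative sum of p * log2(p).

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


pure_node = ["a", "a", "a", "a"]
mixed_node = ["a", "a", "b", "b"]

print(gini(pure_node), entropy(pure_node))    # both are 0 for a pure node
print(gini(mixed_node), entropy(mixed_node))  # 0.5 and 1.0 for a 50/50 node
```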

**Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.**

> **Pre-pruning** stops a decision tree's growth during training to prevent overfitting. The advantage of pre-pruning is its efficiency, as it prevents the creation of a large, complex tree, saving computational resources.

> **Post-pruning** removes unnecessary branches after the tree has fully grown. Its advantage is accuracy: because it sees the entire tree, it can make better-informed decisions about which subtrees to remove, potentially leading to a model that generalizes better.
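
> The difference can be sketched with scikit-learn, assuming it is installed. Here pre-pruning is expressed through `max_depth` and `min_samples_leaf`, and post-pruning through cost-complexity pruning (`ccp_alpha`), which is one of several post-pruning methods. The mid-range alpha is picked purely for illustration; in practice it would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: limits set before training stop the tree from growing further.
pre_pruned = DecisionTreeClassifier(max_depth=2, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Fully-grown reference tree, also used to compute the pruning path.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Post-pruning: an illustrative mid-range alpha (clamped at 0 to guard
# against tiny negative values caused by floating-point rounding).
alpha = max(0.0, float(path.ccp_alphas[len(path.ccp_alphas) // 2]))
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

models = [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]
for name, model in models:
    acc = model.score(X_test, y_test)
    print(f"{name}: {model.tree_.node_count} nodes, test accuracy {acc:.2f}")
```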

**Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?**

> Information Gain in decision trees is the measure of how much a feature reduces the impurity in the data after a split.

> It's calculated as the difference between the entropy of the parent node and the weighted average of the entropies of the child nodes.

> Features with the highest Information Gain are chosen for splitting because they provide the most information about the class labels, leading to more homogeneous and accurate decision tree nodes.
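
> The calculation can be sketched in a few lines of standard-library Python. The helper names and the tiny label lists are made up for illustration: one split separates the classes perfectly, the other leaves every child as mixed as the parent.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes", "yes", "yes", "no", "no", "no"]
perfect_split = [["yes", "yes", "yes"], ["no", "no", "no"]]
useless_split = [["yes", "no"], ["yes", "no"], ["yes", "no"]]

# The perfect split removes all uncertainty; the useless split removes none.
print(round(information_gain(parent, perfect_split), 3))
print(round(information_gain(parent, useless_split), 3))
```

> A tree builder would prefer the first split, since it yields the larger gain.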

**Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?**

> Decision Trees are used in various fields, including healthcare for diagnosis, finance for risk assessment and fraud detection, and marketing for customer segmentation and churn prediction.

**Main Advantages**

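  * Simplicity: the learned rules are easy to interpret and visualize.
  * Little data preparation: no feature scaling or normalization is required.
  * Mixed data types: the algorithm can work with both numerical and categorical features.
  * Feature importance: the tree indicates which features contribute most to its splits.
  * Computational efficiency: prediction is fast because it follows a single root-to-leaf path.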

**Main Limitations**
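  * Overfitting: an unpruned tree can memorize noise in the training data.
  * Instability: small changes in the data can produce a very different tree.
  * Bias with imbalanced data: splits tend to favor the majority class.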

In [2]:
'''
Question 6 : Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances
'''

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Train a Decision Tree Classifier using the Gini criterion
dt_classifier = DecisionTreeClassifier(criterion='gini', random_state=42)
dt_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = dt_classifier.predict(X_test)

# 3. Print the model’s accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")

# Print feature importances
print("\nFeature Importances:")
for feature, importance in zip(feature_names, dt_classifier.feature_importances_):
    print(f"{feature}: {importance:.2f}")



Model Accuracy: 1.0000

Feature Importances:
sepal length (cm): 0.00
sepal width (cm): 0.02
petal length (cm): 0.89
petal width (cm): 0.09


In [4]:
'''
Question 7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its
accuracy to a fully-grown tree.
'''

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Decision Tree Classifier with max_depth=3
dt_classifier_3 = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_classifier_3.fit(X_train, y_train)

# Predict on the test set and calculate accuracy for the limited tree
y_pred_3 = dt_classifier_3.predict(X_test)
accuracy_3 = accuracy_score(y_test, y_pred_3)
print(f"Accuracy of Decision Tree with max_depth=3: {accuracy_3:.2f}")

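# Train a fully-grown tree (no depth limit) on the same split for comparison,
# rather than relying on an `accuracy` variable left over from an earlier cell
dt_classifier_full = DecisionTreeClassifier(random_state=42)
dt_classifier_full.fit(X_train, y_train)
accuracy_full = accuracy_score(y_test, dt_classifier_full.predict(X_test))

if accuracy_3 > accuracy_full:
    print("The Decision Tree with max_depth=3 performed better.")
elif accuracy_full > accuracy_3:
    print("The fully-grown Decision Tree performed better.")
else:
    print("Both Decision Trees achieved the same accuracy.")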

Accuracy of Decision Tree with max_depth=3: 1.00
Both Decision Trees achieved the same accuracy.


In [6]:
'''
Question 8: Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances
'''

import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

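# Load the California Housing dataset instead of Boston Housing:
# load_boston was removed from scikit-learn (version 1.2) over ethical
# concerns about the dataset, so calling it raises an error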
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train a Decision Tree Regressor
dt_regressor = DecisionTreeRegressor(random_state=42)
dt_regressor.fit(X_train, y_train)

# Make predictions on the test set
y_pred = dt_regressor.predict(X_test)

# Calculate and print the Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.2f}")

# Print feature importances
print("\nFeature Importances:")
feature_importances = pd.Series(dt_regressor.feature_importances_, index=X.columns).sort_values(ascending=False)
print(feature_importances)

Mean Squared Error (MSE): 0.50

Feature Importances:
MedInc        0.528509
AveOccup      0.130838
Latitude      0.093717
Longitude     0.082902
AveRooms      0.052975
HouseAge      0.051884
Population    0.030516
AveBedrms     0.028660
dtype: float64


In [8]:
"""
Question 9:
Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy
"""

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the parameter grid for GridSearchCV
param_grid = {
    'max_depth': [None, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'min_samples_split': [2, 5, 10, 15, 20]
}

# Initialize the Decision Tree Classifier
dtc = DecisionTreeClassifier(random_state=42)

# 2. Tune the Decision Tree's max_depth and min_samples_split using GridSearchCV
grid_search = GridSearchCV(estimator=dtc, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# 3. Print the best parameters and the resulting model accuracy
print("Best parameters found by GridSearchCV:")
print(grid_search.best_params_)

# Get the best estimator (model)
best_dtc = grid_search.best_estimator_

# Make predictions on the test set
y_pred = best_dtc.predict(X_test)

# Calculate and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the best model on the test set: {accuracy:.2f}")

Best parameters found by GridSearchCV:
{'max_depth': None, 'min_samples_split': 10}
Accuracy of the best model on the test set: 1.00


**Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.
Explain the step-by-step process you would follow to:**
  * Handle the missing values
  * Encode the categorical features
  * Train a Decision Tree model
  * Tune its hyperparameters
  * Evaluate its performance

**And describe what business value this model could provide in the real-world
setting.**

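1. **Handle Missing Values**: First, understand the nature of the missing data. Is it missing at random, or is there a pattern? How much is missing in each column?

Choose a strategy:
* Imputation: Fill gaps with the median or mean for numerical features and the most frequent value (or a dedicated "missing" category) for categorical features.
* Removal: Drop rows or columns only when few values are affected or a column is mostly empty.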

2. **Encode Categorical Features**
* Label Encoding: Assign a unique integer to each category. This implies an order that nominal categories do not have, so it is best kept for the target label or for binary features.
* One-Hot Encoding: For nominal (no inherent order) categories, create a new binary (0 or 1) column for each unique category.
* Ordinal Encoding: For ordinal (ordered) categories (e.g., severity levels), assign integers that reflect their inherent order.

3. **Train a Decision Tree Model**

* Data Splitting: Divide the dataset into training, validation, and testing sets.
* Model Initialization: Instantiate a Decision Tree classifier from a suitable machine learning library.
* Training: Fit the Decision Tree model to the training data, allowing it to learn the patterns and relationships that predict the presence of the disease. The model will recursively split the data based on the most informative features to create decision rules.

4. **Tune Its Hyperparameters**
* Identify Hyperparameters: Decision Trees have hyperparameters like max_depth (maximum tree depth), min_samples_split (minimum samples to split a node), and min_samples_leaf (minimum samples in a leaf node).
* Cross-Validation: Use techniques like k-fold cross-validation on the training data to evaluate different hyperparameter settings without touching the test set, which keeps the final evaluation unbiased.
* Hyperparameter Tuning: Employ methods like Grid Search or Random Search to systematically find the best combination of hyperparameters that optimizes model performance and generalization.

5. **Evaluate Its Performance**
* Metrics: Use several metrics to assess the model's effectiveness on the unseen test set:
  * Accuracy: The overall percentage of correct predictions.
  * Precision: Of the predicted positive cases, how many were actually positive.
  * Recall (Sensitivity): Of the actual positive cases, how many were correctly identified.
  * F1-Score: The harmonic mean of precision and recall, useful for imbalanced datasets.
* Confusion Matrix: A table showing true positives, true negatives, false positives, and false negatives to give a detailed view of model performance.
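
The five steps above can be sketched as a single scikit-learn pipeline. This is only an illustration under stated assumptions: the data is synthetic, the column names (`age`, `blood_pressure`, `smoker`, `region`) and the rule that generates the label are hypothetical, and the parameter grid is deliberately small.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; column names and the label rule are hypothetical.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["north", "south", "east"], n),
})
y = ((df["age"] > 55) & (df["smoker"] == "yes")).astype(int)

# Knock out some values to mimic missing data.
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "region"] = np.nan

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker", "region"]

# Steps 1 and 2: impute missing values, then one-hot encode the categoricals.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])

# Step 3: preprocessing and the tree are fitted together on training data only.
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: tune hyperparameters with 5-fold cross-validation on the training set.
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# Step 5: evaluate once on the held-out test set.
y_pred = search.predict(X_test)
print(search.best_params_)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

Keeping imputation and encoding inside the pipeline means they are re-fitted on each cross-validation fold, so no information from the validation folds or the test set leaks into preprocessing.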

**This model can help detect the disease early, supporting earlier diagnosis and more personalized medication and treatment.**
