# Supervised Classification: Decision Trees, SVM, and Naive Bayes
**Assignment**

## Question 1: What is Information Gain, and how is it used in Decision Trees?

**Answer:**

**Information Gain** measures the reduction in entropy (uncertainty) achieved by splitting a dataset on a specific attribute. It is the entropy of the parent node minus the weighted average entropy of the child nodes: $IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)$, where $S_v$ is the subset of $S$ for which attribute $A$ takes the value $v$.

**How it is used in Decision Trees:**
1.  **Selection Criterion:** Decision Trees use Information Gain to decide which feature to split on at each node.
2.  **Highest Gain:** The algorithm calculates the Information Gain for every candidate feature. The feature with the highest Information Gain is chosen for the split at that node (the first such split forms the root) because it yields the purest child nodes.
3.  **Recursion:** The same procedure is repeated for each child node until every leaf is pure or a stopping criterion is met.
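
The calculation described above can be sketched in a few lines of plain Python. The labels and the two candidate splits are hypothetical toy data, and the helper names `entropy` and `information_gain` are chosen here for illustration only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical toy node: 5 "yes" and 5 "no" labels.
parent = ["yes"] * 5 + ["no"] * 5

# Candidate split A separates the classes fairly well.
split_a = [["yes"] * 4 + ["no"] * 1, ["yes"] * 1 + ["no"] * 4]
# Candidate split B leaves both children as mixed as the parent.
split_b = [["yes"] * 3 + ["no"] * 3, ["yes"] * 2 + ["no"] * 2]

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print(f"Gain for split A: {gain_a:.4f}")
print(f"Gain for split B: {gain_b:.4f}")
```

A tree builder following this criterion would prefer split A, since split B does not reduce the entropy at all.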

## Question 2: What is the difference between Gini Impurity and Entropy?

**Answer:**

Both Gini Impurity and Entropy are metrics used to measure the "impurity" or disorder of a node in a Decision Tree, but they differ in their mathematical formulation and computational properties.

| Feature | Gini Impurity | Entropy |
| :--- | :--- | :--- |
| **Formula** | $1 - \sum (p_i)^2$ | $- \sum p_i \log_2(p_i)$ |
| **Range** | 0 to 0.5 (for binary classification) | 0 to 1 (for binary classification) |
| **Computation** | Computationally faster (uses simple squaring). | Computationally more expensive (uses logarithmic calculations). |
| **Tendency** | Tends to isolate the most frequent class in its own branch. | Tends to produce slightly more balanced trees; in practice the two criteria usually yield very similar splits. |
| **Use Case** | Default in libraries like Scikit-Learn (CART algorithm). | Used in algorithms like ID3 and C4.5. |
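
To make the formulas and ranges in the table concrete, the following sketch evaluates both measures for a binary node at a few class proportions. The function names are illustrative and are not taken from any library.

```python
from math import log2


def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return sum(-p * log2(p) for p in probs if p > 0)


# p is the proportion of the positive class in a binary node.
for p in (0.0, 0.1, 0.3, 0.5):
    dist = (p, 1.0 - p)
    print(f"p={p:.1f}  gini={gini(dist):.4f}  entropy={entropy(dist):.4f}")
```

Both measures are zero for a pure node and peak at a 50/50 split, where Gini reaches 0.5 and entropy reaches 1.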

## Question 3: What is Pre-Pruning in Decision Trees?

**Answer:**

**Pre-Pruning** (also known as Early Stopping) is the process of halting the growth of a Decision Tree before it perfectly classifies the training set. 

Instead of allowing the tree to grow until every leaf is pure (which often leads to overfitting), constraints are applied during the training process.

**Common Pre-Pruning Hyperparameters:**
* **Max Depth:** Limiting the maximum depth of the tree.
* **Min Samples Split:** Setting a minimum number of samples required to split an internal node.
* **Min Samples Leaf:** Setting a minimum number of samples required to be at a leaf node.
* **Max Leaf Nodes:** Limiting the total number of leaf nodes in the tree.
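
As a rough illustration, these constraints map onto constructor arguments of scikit-learn's `DecisionTreeClassifier`. The specific values below are arbitrary choices for the Iris dataset, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Unconstrained tree: grows until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: growth stops as soon as any constraint would be violated.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,           # Max Depth
    min_samples_split=10,  # Min Samples Split
    min_samples_leaf=5,    # Min Samples Leaf
    max_leaf_nodes=6,      # Max Leaf Nodes
    random_state=42,
).fit(X_train, y_train)

for name, tree in (("unconstrained", full_tree), ("pre-pruned", pruned_tree)):
    print(
        f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
        f"test accuracy={tree.score(X_test, y_test):.4f}"
    )
```

In practice these values are usually chosen by cross-validation rather than fixed by hand.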

## Question 4: Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances.

**Answer:**

```python
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# 1. Load the dataset (Using Iris dataset as a standard example)
data = load_iris()
X = data.data
y = data.target

# 2. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Initialize the Decision Tree Classifier with Gini Impurity
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# 4. Train the model
clf.fit(X_train, y_train)

# 5. Print Feature Importances
print("Feature Importances:")
for name, importance in zip(data.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.4f}")
```

Output:

```
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772
```


## Question 5: What is a Support Vector Machine (SVM)?

**Answer:**

A **Support Vector Machine (SVM)** is a supervised learning algorithm used for both classification and regression tasks.

**Key Concepts:**
* **Hyperplane:** SVM finds the optimal boundary (hyperplane) that best separates the data points of different classes.
* **Margin:** It aims to maximize the "margin," which is the distance between the hyperplane and the nearest data points from either class.
* **Support Vectors:** The data points closest to the hyperplane that influence its position and orientation are called support vectors.
* **Goal:** By maximizing the margin, SVM tries to improve the model's generalization ability to new, unseen data.
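
These concepts can be inspected on a small, linearly separable toy dataset. The six points below are hypothetical, and the very large `C` is an assumption used to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D points: class 0 on the left, class 1 on the right.
X = np.array(
    [[0, 0], [-1, 1], [-1, -1], [2, 0], [3, 1], [3, -1]], dtype=float
)
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin linear SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]       # normal vector of the separating hyperplane
b = clf.intercept_[0]  # offset of the hyperplane

# Distance from the hyperplane to the nearest point of either class.
margin = 1.0 / np.linalg.norm(w)

print("Hyperplane normal w:", w)
print("Hyperplane offset b:", b)
print("Margin:", margin)
print("Support vectors:")
print(clf.support_vectors_)
```

Only the points nearest the boundary appear in `support_vectors_`; moving any of the other points slightly would leave the hyperplane unchanged.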

## Question 6: What is the Kernel Trick in SVM?

**Answer:**

The **Kernel Trick** is a technique used in SVM to handle non-linearly separable data. 

**How it works:**
1.  **Transformation:** It implicitly maps the input data from the original feature space (where the classes may not be linearly separable) into a higher-dimensional feature space.
2.  **Linear Separation:** In that higher-dimensional space, the data may become linearly separable by a hyperplane, which corresponds to a non-linear boundary in the original space.
3.  **Efficiency:** A kernel function $K(x, z)$ returns the dot product of two points in the high-dimensional space directly from the original inputs, so the transformation is never computed explicitly. This avoids the cost of building the high-dimensional features.

**Common Kernels:** Linear, Polynomial, Radial Basis Function (RBF), and Sigmoid.
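
A small numeric check shows what "implicit" means here. This sketch uses the homogeneous degree-2 polynomial kernel on 2-D inputs, with one standard explicit feature map `phi` for that kernel. The sample points are arbitrary.

```python
from math import sqrt


def dot(a, b):
    """Ordinary dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(v):
    """Explicit degree-2 feature map for a 2-D input (maps 2-D to 3-D)."""
    x1, x2 = v
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: K(a, b) = (a . b)^2."""
    return dot(a, b) ** 2


x = (1.0, 2.0)
z = (3.0, -1.0)

explicit = dot(phi(x), phi(z))  # transform first, then take the dot product
implicit = poly_kernel(x, z)    # kernel evaluated on the original inputs

print(f"Explicit mapping: {explicit:.6f}")
print(f"Kernel function:  {implicit:.6f}")
```

Both routes give the same value, but the kernel never constructs the 3-D vectors. For the RBF kernel the corresponding feature space is infinite-dimensional, so only the kernel route is practical.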

## Question 7: Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.

**Answer:**

```python
# Import necessary libraries
from sklearn.datasets import load_wine
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# 2. Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Train SVM with Linear Kernel
svc_linear = SVC(kernel='linear', random_state=42)
svc_linear.fit(X_train, y_train)
y_pred_linear = svc_linear.predict(X_test)
acc_linear = accuracy_score(y_test, y_pred_linear)

# 4. Train SVM with RBF Kernel
svc_rbf = SVC(kernel='rbf', random_state=42)
svc_rbf.fit(X_train, y_train)
y_pred_rbf = svc_rbf.predict(X_test)
acc_rbf = accuracy_score(y_test, y_pred_rbf)

# 5. Compare Accuracies
print(f"Accuracy with Linear Kernel: {acc_linear:.4f}")
print(f"Accuracy with RBF Kernel:    {acc_rbf:.4f}")

if acc_linear > acc_rbf:
    print("\nConclusion: The Linear kernel performed better on this split.")
elif acc_rbf > acc_linear:
    print("\nConclusion: The RBF kernel performed better on this split.")
else:
    print("\nConclusion: Both kernels performed equally well.")
```

Output:

```
Accuracy with Linear Kernel: 0.9815
Accuracy with RBF Kernel:    0.7593

Conclusion: The Linear kernel performed better on this split.
```
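
A likely reason for the gap is feature scale. The RBF kernel depends on Euclidean distances between samples, and the Wine features span very different numeric ranges, so the largest-valued features dominate those distances. The sketch below is an addition to the assignment's program: it standardizes the features with `StandardScaler` inside a pipeline before fitting the RBF model on the same split.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The scaler is fitted on the training data only, then applied to the test data.
scaled_rbf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scaled_rbf.fit(X_train, y_train)

acc_scaled_rbf = accuracy_score(y_test, scaled_rbf.predict(X_test))
print(f"Accuracy with RBF Kernel (standardized features): {acc_scaled_rbf:.4f}")
```

Comparing kernels on standardized features is generally fairer, because the comparison then reflects the kernel rather than the units of the raw measurements.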


## Question 8: What is the Naïve Bayes classifier, and why is it called "Naïve"?

**Answer:**

**Naïve Bayes Classifier** is a probabilistic machine learning model based on **Bayes' Theorem**. It is used for classification tasks (like spam filtering and sentiment analysis) and predicts the probability that a given data point belongs to a particular class.

**Why it is called "Naïve":**
* It makes a "naïve" (simplistic) assumption that all features are **conditionally independent of one another given the class**.
* For example, if a fruit is described as "Red", "Round", and "Diameter of 3 inches", Naïve Bayes treats each feature as contributing independently to the probability of it being an apple, even though in reality these features may be correlated.
* Despite this strong and often unrealistic assumption, the classifier performs remarkably well in many real-world situations.
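
The independence assumption is what lets the class score be written as a prior multiplied by one likelihood per feature. The sketch below applies this to the fruit example. All probabilities are hypothetical numbers invented for illustration, and "orange" is an assumed second class.

```python
# Hypothetical priors P(class), invented for illustration.
priors = {"apple": 0.5, "orange": 0.5}

# Hypothetical per-feature likelihoods P(feature | class).
likelihoods = {
    "apple": {"red": 0.7, "round": 0.9, "about_3_inches": 0.6},
    "orange": {"red": 0.05, "round": 0.95, "about_3_inches": 0.7},
}


def posterior(observed):
    """Bayes' Theorem with the naive assumption: multiply per-feature likelihoods."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature in observed:
            score *= likelihoods[label][feature]
        scores[label] = score
    total = sum(scores.values())  # normalizing constant (the evidence)
    return {label: score / total for label, score in scores.items()}


result = posterior(["red", "round", "about_3_inches"])
for label, prob in result.items():
    print(f"P({label} | red, round, about 3 inches) = {prob:.4f}")
```

Without the independence assumption, the model would need the joint probability of every feature combination per class, which requires far more data to estimate.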

## Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes.

**Answer:**

These are variations of the Naïve Bayes algorithm tailored for different types of data distributions:

1.  **Gaussian Naïve Bayes:**
    * **Data Type:** Used when features are **continuous** (real-valued numbers).
    * **Assumption:** It assumes that the continuous values associated with each class are distributed according to a Gaussian (Normal) distribution (Bell curve).
    * **Example:** Predicting Iris flower species based on sepal length (cm).

2.  **Multinomial Naïve Bayes:**
    * **Data Type:** Used for **discrete counts**.
    * **Assumption:** It assumes the data follows a Multinomial distribution.
    * **Example:** Text classification (Spam vs. Ham), where features are the frequency counts of words in a document.

3.  **Bernoulli Naïve Bayes:**
    * **Data Type:** Used for **binary/boolean** features.
    * **Assumption:** It assumes the data follows a Multivariate Bernoulli distribution.
    * **Example:** Text classification where we only care if a word is present (1) or absent (0), rather than counting how many times it appears.
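
The three variants differ only in how they model the per-feature likelihood $P(x \mid \text{class})$. A minimal sketch of each likelihood follows, with hypothetical parameter values. The function names are illustrative and are not a library API.

```python
from math import exp, pi, sqrt


def gaussian_likelihood(x, mean, var):
    """Gaussian NB: density of a continuous value under a per-class normal."""
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)


def multinomial_likelihood(counts, word_probs):
    """Multinomial NB: product of p_w ** count_w.

    The multinomial coefficient is omitted because it is the same for every class.
    """
    result = 1.0
    for count, p in zip(counts, word_probs):
        result *= p ** count
    return result


def bernoulli_likelihood(present, word_probs):
    """Bernoulli NB: p_w if the word is present, (1 - p_w) if it is absent."""
    result = 1.0
    for is_present, p in zip(present, word_probs):
        result *= p if is_present else (1.0 - p)
    return result


# Hypothetical per-class parameters for a three-word vocabulary.
word_probs = [0.5, 0.3, 0.2]

g = gaussian_likelihood(5.0, mean=5.0, var=1.0)      # continuous feature
m = multinomial_likelihood([2, 0, 1], word_probs)    # word counts
b = bernoulli_likelihood([1, 0, 1], word_probs)      # word presence/absence

print(f"Gaussian:    {g:.4f}")
print(f"Multinomial: {m:.4f}")
print(f"Bernoulli:   {b:.4f}")
```

Note that the Bernoulli variant explicitly penalizes absent words through the `(1 - p)` factor, whereas the Multinomial variant ignores words with a count of zero.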

## Question 10: Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate accuracy.

**Answer:**

```python
# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# 2. Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Initialize Gaussian Naïve Bayes Classifier
gnb = GaussianNB()

# 4. Train the model
gnb.fit(X_train, y_train)

# 5. Make predictions
y_pred = gnb.predict(X_test)

# 6. Evaluate Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Gaussian Naïve Bayes: {accuracy:.4f}")
```

Output:

```
Accuracy of Gaussian Naïve Bayes: 0.9737
```
