# Supervised Classification: Decision Trees, SVM, and Naive Bayes | Assignment

# Question 1: What is Information Gain, and how is it used in Decision Trees?

-->Information Gain (IG) measures how much uncertainty (or impurity) is reduced after splitting a dataset on a particular feature.

In simple words:

👉 It tells us which feature gives the most “information” about the target variable.

The feature with the highest Information Gain is chosen to split the node in a decision tree.

## How Information Gain is Used in Decision Trees

Step-by-step:

1. Calculate the entropy of the parent node
2. Split the data using a feature
3. Calculate the entropy of each child node
4. Compute the weighted average entropy of the children (weighted by child size)
5. Subtract this weighted average from the parent entropy to get the Information Gain
6. Repeat for every candidate feature and choose the one with the highest IG

That feature becomes the decision node.
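
The steps above can be condensed into one formula. The notation is an assumed convention that the answer itself does not use: $S$ is the set of samples at the parent node, $A$ is the candidate feature, $S_v$ is the subset of $S$ where $A$ takes the value $v$, and $H$ is entropy.

$$
IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)
$$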

## Simple Example

Imagine a dataset for Play Tennis (Yes / No).

Entropy before split = 0.94

Split by Outlook

Weighted entropy after split = 0.69

Information Gain = 0.94 - 0.69 = 0.25
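
As a rough check of these numbers, here is a minimal sketch that computes the entropy and Information Gain from class counts. The counts are an assumption: they come from the classic 14-row Play Tennis table (9 Yes, 5 No), which this answer does not list. The function names are illustrative.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a node, given its per-class sample counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Assumed (Yes, No) counts from the classic Play Tennis table
parent = [9, 5]
outlook_children = [[2, 3], [4, 0], [3, 2]]  # Sunny, Overcast, Rain

parent_entropy = entropy(parent)
gain = information_gain(parent, outlook_children)

print(f"Parent entropy: {parent_entropy:.3f}")
print(f"Weighted child entropy: {parent_entropy - gain:.3f}")
print(f"Information Gain (Outlook): {gain:.3f}")
```

Under these assumed counts the parent entropy and the weighted child entropy round to the 0.94 and 0.69 quoted above.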


# Question 2: What is the difference between Gini Impurity and Entropy?

-->

## Difference Between Gini Impurity and Entropy

Both **Gini Impurity** and **Entropy** are metrics used in **decision trees** to measure how impure (mixed) a dataset is and to choose the best feature for splitting.

---

## 1. Gini Impurity

**Definition:**
Gini Impurity measures the **probability of misclassifying** a randomly chosen data point if it were labeled according to the class distribution in the node.

**Formula:**
$$
Gini = 1 - \sum_i p_i^2
$$
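
A minimal sketch of this formula, assuming the node is described by its per-class sample counts (the function name and the example counts are illustrative):

```python
def gini_impurity(counts):
    """Gini impurity of a node, given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

pure_node = gini_impurity([10, 0])    # only one class present
mixed_node = gini_impurity([5, 5])    # two classes, evenly mixed
skewed_node = gini_impurity([8, 2])   # one class dominates

print(pure_node, mixed_node, skewed_node)
```

A pure node scores 0, and an evenly mixed two-class node scores 0.5, which is the maximum for two classes.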

### Strengths

* Computationally **faster** (no logarithms)
* Works well with **large datasets**
* Tends to isolate the **most frequent class quickly**

### Weaknesses

* Slightly less sensitive to changes in class probabilities
* Less interpretable from an information-theory perspective

### Use Cases

* Used in **CART algorithm**
* Default choice in **scikit-learn**
* Preferred when performance and speed matter

---

## 2. Entropy

**Definition:**
Entropy measures the **amount of uncertainty or randomness** in the dataset using concepts from information theory.

**Formula:**
$$
Entropy = -\sum_i p_i \log_2(p_i)
$$
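
A minimal sketch of the entropy formula, this time taking class probabilities directly (the function name and example distributions are illustrative). Terms with $p_i = 0$ are skipped, following the usual convention that $0 \log 0 = 0$.

```python
from math import log2

def entropy_bits(probs):
    """Entropy in bits of a class distribution given as probabilities."""
    return sum(-p * log2(p) for p in probs if p > 0)

pure = entropy_bits([1.0])               # no uncertainty
two_way = entropy_bits([0.5, 0.5])       # maximum for 2 classes
four_way = entropy_bits([0.25] * 4)      # maximum for 4 classes

print(pure, two_way, four_way)
```

Unlike Gini, the maximum grows with the number of classes: it equals $\log_2 k$ for $k$ equally likely classes.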

### Strengths

* More **theoretically grounded**
* More sensitive to small changes in class distribution
* Produces slightly more **balanced trees**

### Weaknesses

* Computationally **slower** (log calculations)
* Information Gain based on entropy can be biased toward features with many unique values (C4.5 counters this with the gain ratio)

### Use Cases

* Used in **ID3 and C4.5 algorithms**
* Preferred in **academic, theoretical, and exam settings**
* Useful when interpretability matters

---

## Side-by-Side Comparison

| Aspect             | Gini Impurity                         | Entropy                   |
| ------------------ | ------------------------------------- | ------------------------- |
| Measures           | Misclassification probability         | Uncertainty / information |
| Formula complexity | Simple                                | More complex              |
| Speed              | Faster                                | Slower                    |
| Sensitivity        | Lower                                 | Higher                    |
| Tree structure     | Slightly biased toward dominant class | More balanced             |
| Algorithms         | CART                                  | ID3, C4.5                 |

---

## Key Takeaway

> **Gini Impurity is faster and preferred in practice, while Entropy is more informative and preferred for theoretical understanding. Both usually lead to similar decision tree splits.**
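
One way to see how close the two criteria are in practice is to train the same tree with each and compare cross-validated accuracy. The sketch below is an illustration only: the Iris dataset and 5-fold cross-validation are assumptions, and results on other data may differ.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=42)
    # Mean 5-fold cross-validated accuracy for this splitting criterion
    scores[criterion] = cross_val_score(tree, X, y, cv=5).mean()

for criterion, score in scores.items():
    print(f"{criterion}: mean CV accuracy = {score:.3f}")
```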

---

### Perfect One-Line Answer (Exam Ready)

**Gini Impurity measures misclassification probability and is faster, while Entropy measures uncertainty using information theory and is more sensitive; both are used to select optimal splits in decision trees.**


# Question 3: What is Pre-Pruning in Decision Trees?

-->
## What is Pre-Pruning in Decision Trees?

**Pre-Pruning** (also called **early stopping**) is a technique used to **stop the growth of a decision tree early**, *before* it becomes too complex.

👉 The goal is to **prevent overfitting** by limiting how much the tree can grow.

---

## Why Do We Need Pre-Pruning?

A fully grown decision tree:

* Fits training data **too well**
* Learns **noise**
* Performs poorly on **new/unseen data**

Pre-pruning avoids this by stopping splits that are not useful enough.

---

## How Pre-Pruning Works

During tree construction, a split is **not allowed** if it violates predefined conditions.

### Common Pre-Pruning Criteria

1. **Maximum depth**

   * Stop splitting once a certain tree depth is reached
     *(e.g., `max_depth = 5`)*

2. **Minimum samples per node**

   * Do not split if a node has fewer than *k* samples
     *(e.g., `min_samples_split = 20`)*

3. **Minimum samples in a leaf**

   * Ensures leaves have enough data
     *(e.g., `min_samples_leaf = 10`)*

4. **Minimum impurity decrease**

   * Split only if impurity reduction exceeds a threshold
     *(e.g., `min_impurity_decrease = 0.01`)*

5. **Statistical significance tests**

   * Split only if improvement is statistically meaningful
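
The first four criteria map onto constructor arguments of scikit-learn's `DecisionTreeClassifier`. The sketch below compares an unrestricted tree with a pre-pruned one. The Breast Cancer dataset, the split ratio, and the `min_impurity_decrease` value are assumptions chosen for illustration; the other thresholds reuse the example values above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Unrestricted tree: grows until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: growth stops as soon as any condition blocks a split
pruned_tree = DecisionTreeClassifier(
    max_depth=5,                  # criterion 1
    min_samples_split=20,         # criterion 2
    min_samples_leaf=10,          # criterion 3
    min_impurity_decrease=0.001,  # criterion 4 (illustrative threshold)
    random_state=42,
).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pruned_tree)]:
    print(
        f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
        f"test accuracy={tree.score(X_test, y_test):.3f}"
    )
```

Good threshold values are dataset-specific and are usually chosen by cross-validation rather than fixed up front.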

---

## Advantages of Pre-Pruning

* ✔ Reduces **overfitting**
* ✔ Faster training
* ✔ Produces **simpler, more interpretable trees**
* ✔ Uses less memory

---

## Disadvantages of Pre-Pruning

* ✘ Risk of **underfitting**
* ✘ May stop splits that are actually important
* ✘ Requires careful tuning of hyperparameters

---

## Pre-Pruning vs Post-Pruning (Quick Contrast)

| Aspect       | Pre-Pruning        | Post-Pruning             |
| ------------ | ------------------ | ------------------------ |
| When applied | During tree growth | After full tree is built |
| Approach     | Early stopping     | Cut back branches        |
| Risk         | Underfitting       | Higher computation       |
| Speed        | Faster             | Slower but more accurate |

---




# Question 4: Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).

```python
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load dataset
data = load_iris()
X = data.data          # features
y = data.target        # target
feature_names = data.feature_names

# Train Decision Tree Classifier using Gini Impurity
dt = DecisionTreeClassifier(criterion='gini', random_state=42)
dt.fit(X, y)

# Print feature importances
print("Feature Importances:")
for feature, importance in zip(feature_names, dt.feature_importances_):
    print(f"{feature}: {importance:.4f}")
```

Output:

```
Feature Importances:
sepal length (cm): 0.0133
sepal width (cm): 0.0000
petal length (cm): 0.5641
petal width (cm): 0.4226
```


# Question 5: What is a Support Vector Machine (SVM)?

-->

## What is a Support Vector Machine (SVM)?

A **Support Vector Machine (SVM)** is a **supervised machine learning algorithm** used for **classification and regression** that works by finding the **optimal decision boundary (hyperplane)** that best separates data points of different classes.

---

## Core Idea (Intuition)

👉 SVM doesn’t just separate classes; it separates them **as far apart as possible**.

* The decision boundary is called a **hyperplane**
* The closest data points to the hyperplane are called **support vectors**
* The distance between the hyperplane and the support vectors is called the **margin**
* SVM tries to **maximize this margin**

A larger margin generally means better generalization.

---

## Key Components of SVM

### 1. Hyperplane

A line (2D), plane (3D), or higher-dimensional boundary that separates classes.

### 2. Support Vectors

Data points closest to the hyperplane that **define its position**.

### 3. Margin

The distance between the hyperplane and the nearest data points from each class. SVM chooses the hyperplane that makes this distance as large as possible.
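
The three components can be read off a fitted linear `SVC`. The sketch below uses a hypothetical four-point dataset, and a very large `C` is assumed in order to approximate a hard margin. For a linear SVM with hyperplane $w \cdot x + b = 0$, the distance from the hyperplane to the nearest points is $1 / \lVert w \rVert$.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable 2D toy data
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane
b = clf.intercept_[0]   # offset of the hyperplane
margin = 1.0 / np.linalg.norm(w)  # hyperplane-to-nearest-point distance

print("Hyperplane: w =", w, ", b =", b)
print("Support vectors:\n", clf.support_vectors_)
print("Margin:", margin)
```

For this toy data the separating line sits halfway between the two columns of points (at $x_1 = 1.5$), so the margin is about 1.5.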

---

## Handling Non-Linear Data (Kernel Trick)

When data is not linearly separable, SVM uses **kernels** to implicitly map data into a higher-dimensional space where a linear separator may exist.

Common kernels:

* **Linear**
* **Polynomial**
* **RBF (Gaussian)** (most popular)
* **Sigmoid**

This is called the **kernel trick**.

---

## Advantages of SVM

* ✔ Works well with **high-dimensional data**
* ✔ Effective when number of features > samples
* ✔ Robust to overfitting (with proper kernel & parameters)

---

## Disadvantages of SVM

* ✘ Computationally expensive for large datasets
* ✘ Sensitive to **kernel choice** and hyperparameters
* ✘ Less interpretable compared to decision trees


# Question 6: What is the Kernel Trick in SVM?

-->
## What is the Kernel Trick in SVM?

The **Kernel Trick** is a technique used in **Support Vector Machines (SVMs)** to handle **non-linearly separable data** by implicitly mapping input data into a **higher-dimensional feature space**, where a **linear separation becomes possible**, **without explicitly computing that transformation**.

---

## Why Do We Need the Kernel Trick?

Some datasets cannot be separated by a straight line (or plane).

👉 Example:

* Circles, spirals, or XOR-type patterns

Instead of manually transforming features, SVM uses a kernel function to **compute inner products in higher dimensions efficiently**.

---

## How the Kernel Trick Works (Intuition)

* Original space → data is **not linearly separable**
* Higher-dimensional space → data can become **linearly separable**
* Kernel function calculates similarity **as if** data were mapped to that higher space

✨ No actual transformation is computed; that’s the “trick”.
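
The "as if" can be checked numerically. The sketch below assumes a degree-2 homogeneous polynomial kernel on 2D inputs, whose explicit feature map is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. The helper names and the sample points are illustrative.

```python
from math import isclose, sqrt

def dot(a, b):
    """Plain inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(x):
    """Explicit degree-2 feature map for a 2D point (x1, x2)."""
    x1, x2 = x
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: (x . z)^2."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(z))   # inner product computed in the 3D feature space
implicit = poly2_kernel(x, z)    # same quantity computed in the original 2D space

print(explicit, implicit, isclose(explicit, implicit))
```

Both values come out to 121 (up to floating-point rounding), but the kernel version never builds the 3D vectors. For kernels such as RBF, the implied feature space is infinite-dimensional, so the explicit route is not available at all.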

---

## Common Kernel Functions

1. **Linear Kernel**
   $$
   K(x, x') = x \cdot x'
   $$
   Used when data is already linearly separable.

2. **Polynomial Kernel**
   $$
   K(x, x') = (x \cdot x' + c)^d
   $$
   Captures polynomial relationships.

3. **RBF (Gaussian) Kernel**
   $$
   K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)
   $$
   Most widely used; handles complex boundaries.

4. **Sigmoid Kernel**
   $$
   K(x, x') = \tanh(\alpha \, x \cdot x' + c)
   $$
   Inspired by neural networks.
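
The four formulas translate almost directly into NumPy. This is a sketch for building intuition, not how any particular library implements them; the default values for `c`, `d`, `gamma`, and `alpha` are arbitrary assumptions.

```python
import numpy as np

def linear_kernel(x, z):
    return float(np.dot(x, z))

def polynomial_kernel(x, z, c=1.0, d=3):
    return float((np.dot(x, z) + c) ** d)

def rbf_kernel(x, z, gamma=0.5):
    diff = np.asarray(x) - np.asarray(z)
    return float(np.exp(-gamma * np.dot(diff, diff)))

def sigmoid_kernel(x, z, alpha=0.01, c=0.0):
    return float(np.tanh(alpha * np.dot(x, z) + c))

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print("linear:    ", linear_kernel(x, z))
print("polynomial:", polynomial_kernel(x, z, c=1.0, d=2))
print("rbf:       ", rbf_kernel(x, z))
print("sigmoid:   ", sigmoid_kernel(x, z))
```

Note that the RBF kernel of a point with itself is always 1, and the value decays toward 0 as the two points move apart.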

---

## Advantages of the Kernel Trick

* ✔ Enables SVM to solve **non-linear problems**
* ✔ Avoids expensive computations in high dimensions
* ✔ Highly flexible with different kernel choices

---

## Limitations

* ✘ Kernel and parameter selection is crucial
* ✘ Can be slow for very large datasets
* ✘ Risk of overfitting with complex kernels


# Question 7: Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.

Hint: Use `SVC(kernel='linear')` and `SVC(kernel='rbf')`, then compare accuracy scores after fitting on the same dataset.

```python
# Import required libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load Wine dataset
data = load_wine()
X = data.data
y = data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train SVM with Linear Kernel
svm_linear = SVC(kernel='linear', random_state=42)
svm_linear.fit(X_train, y_train)
y_pred_linear = svm_linear.predict(X_test)
linear_accuracy = accuracy_score(y_test, y_pred_linear)

# Train SVM with RBF Kernel
svm_rbf = SVC(kernel='rbf', random_state=42)
svm_rbf.fit(X_train, y_train)
y_pred_rbf = svm_rbf.predict(X_test)
rbf_accuracy = accuracy_score(y_test, y_pred_rbf)

# Print accuracy results
print("Accuracy with Linear Kernel SVM:", linear_accuracy)
print("Accuracy with RBF Kernel SVM:", rbf_accuracy)
```

Output:

```
Accuracy with Linear Kernel SVM: 0.9814814814814815
Accuracy with RBF Kernel SVM: 0.7592592592592593
```
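
The gap between the two kernels is likely related to feature scale: the Wine features span very different ranges, and the RBF kernel depends on Euclidean distances between samples. The sketch below is an extra check that goes beyond what the question asks. It assumes the same split as above and adds a `StandardScaler` step in front of the RBF classifier.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# RBF SVM on the raw features
raw_rbf = SVC(kernel="rbf").fit(X_train, y_train)

# RBF SVM on standardized features; the scaler is fitted on the training split only
scaled_rbf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_train, y_train)

raw_acc = raw_rbf.score(X_test, y_test)
scaled_acc = scaled_rbf.score(X_test, y_test)

print("RBF accuracy without scaling:", raw_acc)
print("RBF accuracy with scaling:   ", scaled_acc)
```

Putting the scaler inside a pipeline keeps test data out of the scaling statistics, so the comparison stays fair.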


# Question 8: What is the Naïve Bayes classifier, and why is it called "Naïve"?

-->

## What is the Naïve Bayes Classifier?

The **Naïve Bayes classifier** is a **supervised probabilistic machine learning algorithm** based on **Bayes’ Theorem**, used mainly for **classification tasks** such as text classification, spam detection, and sentiment analysis.

It predicts the class that has the **highest posterior probability** given the input features.

### Bayes’ Theorem:

$$
P(C|X) = \frac{P(X|C)\,P(C)}{P(X)}
$$
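
A small numeric illustration of the theorem, with made-up probabilities for a one-feature spam filter (all numbers below are hypothetical and are not taken from any dataset):

```python
# Hypothetical probabilities, chosen only for illustration
p_spam = 0.2               # prior P(C = spam)
p_word_given_spam = 0.6    # likelihood P(X | spam), where X = "message contains 'free'"
p_word_given_ham = 0.05    # likelihood P(X | not spam)

# Evidence P(X) via the law of total probability
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior P(spam | X) from Bayes' theorem
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(f"P(spam | 'free') = {p_spam_given_word:.2f}")
```

With these numbers the posterior is 0.75: seeing the word raises the probability of spam from the 20% prior to 75%.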

---

## Why Is It Called “Naïve”?

It is called **“naïve”** because it makes a **strong simplifying assumption**:

> 👉 **All features are conditionally independent given the class label.**

This assumption is usually **not true in real-world data**, but it simplifies calculations a lot, and surprisingly, the model often still works well in practice.

---

## Intuition (Simple Example)

For spam detection:

* Words like *“free”* and *“win”* often appear together
* Naïve Bayes **assumes they are independent** given the class
* Even with this unrealistic assumption, it often still classifies spam accurately

That’s the “naïve” part.
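
Under the independence assumption, the joint likelihood of several words is just the product of their individual likelihoods. The sketch below shows that calculation with hypothetical priors and word probabilities (none of these numbers come from real data):

```python
from math import prod

# Hypothetical model parameters
priors = {"spam": 0.4, "ham": 0.6}
word_likelihoods = {
    "spam": {"free": 0.5, "win": 0.4},
    "ham": {"free": 0.1, "win": 0.05},
}

def naive_posteriors(words):
    """Posterior per class, treating words as independent given the class."""
    # Unnormalized score: P(C) * product of P(word | C)
    scores = {
        c: priors[c] * prod(word_likelihoods[c][w] for w in words)
        for c in priors
    }
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

posteriors = naive_posteriors(["free", "win"])
print(posteriors)
```

The model never needs to estimate how often "free" and "win" occur together, which is what keeps it fast and data-efficient.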

---

## Types of Naïve Bayes

1. **Gaussian Naïve Bayes** – continuous features
2. **Multinomial Naïve Bayes** – text and word counts
3. **Bernoulli Naïve Bayes** – binary features

---

## Advantages

* ✔ Very fast and memory efficient
* ✔ Works well with **high-dimensional data**
* ✔ Performs especially well in **text classification**

---

## Limitations

* ✘ Independence assumption is often violated
* ✘ Cannot model feature interactions well
* ✘ Probability estimates can be inaccurate




# Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes

-->
## Differences Between Gaussian, Multinomial, and Bernoulli Naïve Bayes

All three are variants of the **Naïve Bayes classifier**, differing mainly in the **type of data they assume** and how they model feature distributions.

---

## 1. Gaussian Naïve Bayes

### Assumption

* Features are **continuous**
* Values follow a **normal (Gaussian) distribution**

### Probability Model

$$
P(x|C) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
$$
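
A minimal sketch of this density, where $\mu$ and $\sigma^2$ are taken to be the mean and variance of the feature within class $C$. The two classes and their height statistics below are hypothetical.

```python
from math import exp, pi, sqrt

def gaussian_likelihood(x, mean, var):
    """Gaussian density P(x | C) for a feature with the given class mean and variance."""
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)

# Hypothetical per-class height statistics (cm): (mean, variance)
class_stats = {"A": (170.0, 36.0), "B": (160.0, 25.0)}

height = 168.0
likelihoods = {
    label: gaussian_likelihood(height, mean, var)
    for label, (mean, var) in class_stats.items()
}
print(likelihoods)
```

Gaussian NB multiplies one such likelihood per feature with the class prior and picks the class with the largest product.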

### Use Cases

* Medical data
* Sensor measurements
* Any dataset with real-valued features

### Example

* Height, weight, temperature

---

## 2. Multinomial Naïve Bayes

### Assumption

* Features represent **counts or frequencies**
* Data follows a **multinomial distribution**

### Key Characteristics

* Works with **non-negative integer values**
* Considers **frequency of features**

### Use Cases

* Text classification
* Spam detection
* Document categorization

### Example

* Word counts in documents (Bag-of-Words, TF)

---

## 3. Bernoulli Naïve Bayes

### Assumption

* Features are **binary (0 or 1)**
* Presence or absence of a feature matters

### Key Characteristics

* Explicitly penalizes the **absence of a feature** (Multinomial NB simply ignores features that do not occur)
* Suitable for binary vectors

### Use Cases

* Binary text features
* Yes/No or True/False attributes

### Example

* Whether a word appears in a document or not

---

## Side-by-Side Comparison

| Aspect         | Gaussian NB     | Multinomial NB        | Bernoulli NB  |
| -------------- | --------------- | --------------------- | ------------- |
| Feature type   | Continuous      | Discrete counts       | Binary        |
| Distribution   | Gaussian        | Multinomial           | Bernoulli     |
| Values allowed | Any real number | Non-negative integers | 0 or 1        |
| Best for       | Numeric data    | Word counts           | Word presence |
| Text data      | ❌              | ✅                    | ✅            |
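
The comparison can be sketched with scikit-learn's three estimators, each fitted on the kind of input it expects. The tiny arrays below are made up for illustration and are far too small for a real model.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])  # same labels for all three toy datasets

# Continuous measurements -> Gaussian NB
X_continuous = np.array([[1.0], [1.2], [3.0], [3.2]])
gnb_pred = GaussianNB().fit(X_continuous, y).predict([[1.1]])

# Word counts per document -> Multinomial NB
X_counts = np.array([[3, 0], [2, 1], [0, 3], [1, 2]])
mnb_pred = MultinomialNB().fit(X_counts, y).predict([[4, 0]])

# Word presence/absence per document -> Bernoulli NB
X_binary = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
bnb_pred = BernoulliNB().fit(X_binary, y).predict([[1, 0]])

print("GaussianNB:   ", gnb_pred)
print("MultinomialNB:", mnb_pred)
print("BernoulliNB:  ", bnb_pred)
```

Each query point resembles the class 0 examples in its own dataset, so all three models assign it to class 0.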

---

## Key Takeaway (Exam Gold)

> **Gaussian NB is used for continuous data, Multinomial NB for count-based data, and Bernoulli NB for binary features, all under the Naïve independence assumption.**




# Question 10: Breast Cancer Dataset. Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate accuracy.

Hint: Use `GaussianNB()` from `sklearn.naive_bayes` and the Breast Cancer dataset from `sklearn.datasets`.

```python
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Gaussian Naïve Bayes classifier
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions
y_pred = gnb.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of Gaussian Naïve Bayes:", accuracy)
```

Output:

```
Accuracy of Gaussian Naïve Bayes: 0.9415204678362573
```
