**Question 1)  What is Information Gain, and how is it used in Decision Trees?**

**Answer 1)**

Information Gain is the core metric used by decision tree algorithms (like ID3) to figure out the best feature to split the data on at each step of building the tree.

Think of it as a measure of how much clarity or reduction in uncertainty you get by splitting your data based on a particular feature. The goal is to always choose the split that provides the highest information gain.

To understand Information Gain, you first need to understand Entropy.

1. What is Entropy?

    In simple terms, Entropy is a measure of impurity, disorder, or uncertainty in a dataset.

    High Entropy (Max = 1 for two classes): The dataset is maximally impure. For a classification task, this means the classes are evenly mixed (e.g., a set with 50% "Yes" and 50% "No" has the highest possible entropy). It's very hard to make a decision. With $k$ classes, the maximum is $\log_2 k$.

    Low Entropy (Min = 0): The dataset is perfectly pure. This means all samples in the set belong to the same class (e.g., 100% "Yes"). There is no uncertainty.


    Example: Imagine you have two baskets of fruit:

    i. Basket A: Contains 10 apples. It is pure. Its entropy is 0.


    ii. Basket B: Contains 5 apples and 5 oranges. It is impure. Its entropy is 1.
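The two basket values can be checked with a few lines of Python. This is a minimal sketch that assumes the standard Shannon entropy formula in bits, $H = -\sum_i p_i \log_2 p_i$, where $p_i$ is the fraction of items in class $i$. The helper name `entropy` is just an illustrative choice.

```python
import math

def entropy(class_counts):
    """Shannon entropy (in bits) of a node, given the count of each class."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count == 0:
            continue  # a class with no samples contributes nothing
        p = count / total
        result -= p * math.log2(p)
    return result

basket_a = entropy([10])    # Basket A: 10 apples only
basket_b = entropy([5, 5])  # Basket B: 5 apples, 5 oranges
print(f"Basket A entropy: {basket_a}")
print(f"Basket B entropy: {basket_b}")
```

A pure basket gives 0 and an even two-way mix gives 1, matching the example above.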

2. **How is Information Gain Used in Decision Trees?**

    The decision tree algorithm uses Information Gain to build the tree from the top down. At each "node" (a point where a decision is made), it does the following:

    i) Calculate Parent Entropy: It first calculates the entropy of the current dataset (the "parent node") before any split.

    ii) Calculate Child Entropies: For every candidate feature (e.g., "Outlook," "Temperature," "Humidity"), it calculates what the entropy would be after splitting the data on that feature. This involves:

    a. Splitting the data into subsets (the "child nodes"). For example, splitting on "Outlook" creates three subsets: "Sunny," "Overcast," and "Rainy."

    b. Calculating the entropy for each of these child nodes.

    c. Calculating the weighted average entropy of all the child nodes (subsets with more data points get a higher weight).

    iii) Calculate Information Gain: It then finds the Information Gain for that feature using this formula:

    $$\text{Information Gain} = \text{Entropy}(\text{Parent}) - \text{Weighted Average Entropy}(\text{Children})$$
  
    iv) Select the Best Feature: The algorithm repeats this for all features. The feature that results in the highest Information Gain is chosen as the splitting feature for that node.

This process is repeated at each new node until the nodes are "pure" (entropy is 0) or another stopping condition is met.
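The four steps can be sketched in plain Python. The tiny dataset below is hypothetical (six made-up rows with "Outlook" and "Windy" features), and the helper names `entropy` and `information_gain` are illustrative, not part of any library.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Parent entropy minus the weighted average entropy of the children."""
    parent_entropy = entropy(labels)
    # Split the labels into child nodes, one per value of the feature
    children = {}
    for row, label in zip(rows, labels):
        children.setdefault(row[feature], []).append(label)
    # Larger children get a proportionally larger weight
    weighted_child_entropy = sum(
        (len(child) / len(labels)) * entropy(child) for child in children.values()
    )
    return parent_entropy - weighted_child_entropy

# Hypothetical toy data, made up for illustration only
rows = [
    {"Outlook": "Sunny", "Windy": "No"},
    {"Outlook": "Sunny", "Windy": "Yes"},
    {"Outlook": "Overcast", "Windy": "No"},
    {"Outlook": "Overcast", "Windy": "Yes"},
    {"Outlook": "Rainy", "Windy": "No"},
    {"Outlook": "Rainy", "Windy": "Yes"},
]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

gains = {f: information_gain(rows, labels, f) for f in ["Outlook", "Windy"]}
best_feature = max(gains, key=gains.get)
for feature, gain in gains.items():
    print(f"{feature}: {gain:.4f}")
print("Best feature to split on:", best_feature)
```

Here "Outlook" wins because two of its three child nodes are already pure, so the weighted child entropy drops much further than it does for "Windy".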

**Question 2) What is the difference between Gini Impurity and Entropy?**

**Answer 2)**

Both **Gini Impurity** and **Entropy** are measures of how mixed or impure a dataset is at a node in a decision tree — they help the algorithm decide where to split the data.

Here’s the difference explained in simple terms:

**Entropy** comes from information theory. It measures the amount of “uncertainty” or “disorder” in a dataset. If all the samples in a node belong to the same class, the entropy is zero — there’s no uncertainty. But if the samples are evenly split between classes, the entropy is high, meaning the node is very impure. Entropy uses logarithms to measure this uncertainty, and it tells us how much “information” is gained when we make a split (this is called *Information Gain*).

**Gini Impurity**, on the other hand, measures how often a randomly chosen sample from the node would be incorrectly labeled if it were randomly assigned a label according to the class distribution. Like entropy, Gini is also zero when the node is pure, but it increases as the classes become more mixed.

In practice, both give similar results, but Gini is a bit simpler and faster to compute because it doesn’t involve logarithms. Entropy is more theoretical, connecting to information theory, while Gini is more practical and often used as the default in algorithms like CART.

In short:

1. **Entropy** measures uncertainty using information theory.
2. **Gini Impurity** measures the probability of misclassification.
3. Both quantify impurity, but Gini is faster and usually preferred for efficiency.
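To see how the two measures behave side by side, here is a small sketch that assumes the usual definitions: Gini $= 1 - \sum_i p_i^2$ and Entropy $= -\sum_i p_i \log_2 p_i$. The function names are illustrative.

```python
import math

def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Shannon entropy in bits; zero-probability classes are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Two-class nodes, from perfectly pure to perfectly mixed
for probs in ([1.0, 0.0], [0.9, 0.1], [0.7, 0.3], [0.5, 0.5]):
    print(f"{probs}: gini={gini(probs):.3f}, entropy={entropy(probs):.3f}")
```

Both are 0 for a pure node and peak at the 50/50 mix, but on different scales: for two classes Gini tops out at 0.5 while entropy tops out at 1.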



**Question 3) What is Pre-Pruning in Decision Trees?**

**Answer 3)**

Pre-pruning is a technique used in decision trees to stop the tree from growing during the training process, before it becomes fully developed and overly complex.

Think of it as setting "early stopping rules" for the tree's growth.

The primary goal of pre-pruning is to prevent overfitting. A decision tree that grows without limits will try to perfectly classify every single sample in the training data, capturing not just the underlying patterns but also the noise. This "perfect" tree will then perform poorly on new, unseen data.


Pre-pruning prevents this by halting the creation of new branches (or "nodes") if they don't meet certain criteria.

How Pre-Pruning Works

Pre-pruning works by setting hyperparameters that act as thresholds. While the tree is being built, it checks these rules at every potential split. If a rule is met, the tree stops splitting at that node, and it becomes a "leaf node" (a final decision).


Common pre-pruning techniques (and their hyperparameters) include:

1. Maximum Depth (max_depth): This is the most common technique. You set a limit on how many levels the tree can have. For example, if you set max_depth=3, the tree will stop growing after its third level of decisions, regardless of whether the nodes are "pure" or not.


2. Minimum Samples per Split (min_samples_split): This rule specifies the minimum number of data points a node must have before it's even allowed to be split. If a node has fewer samples than this threshold, it will not be split and will become a leaf.


3. Minimum Samples per Leaf (min_samples_leaf): This rule dictates that a split is only allowed if it results in both new child nodes having at least this many samples. This prevents the tree from creating tiny, highly specific leaves that are likely just fitting to noise.

4. Minimum Impurity Decrease / Information Gain Threshold (min_impurity_decrease): You can set a threshold for how much a new split must reduce impurity (measured by Gini Impurity or Entropy). If a potential split doesn't reduce the impurity by at least this amount, the tree won't make that split.
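As a rough illustration, the sketch below trains an unrestricted tree and a pre-pruned tree on the Iris dataset with scikit-learn and compares their sizes. The specific threshold values (3, 10, 5, 0.01) are arbitrary choices for demonstration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# No limits: the tree grows until its leaves are pure
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned: growth stops as soon as any of these rules applies
pruned_tree = DecisionTreeClassifier(
    max_depth=3,                 # at most 3 levels of decisions
    min_samples_split=10,        # a node needs 10+ samples to be split
    min_samples_leaf=5,          # every leaf must keep 5+ samples
    min_impurity_decrease=0.01,  # a split must reduce impurity by at least this much
    random_state=42,
).fit(X_train, y_train)

for name, tree in [("Full tree", full_tree), ("Pre-pruned tree", pruned_tree)]:
    print(f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
          f"test accuracy={tree.score(X_test, y_test):.2f}")
```

In practice these thresholds are usually tuned with cross-validation rather than set by hand.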



In [1]:
"""
Question 4) Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).

Answer 4)
"""
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load dataset (Iris dataset for example)
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Decision Tree Classifier using Gini Impurity
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Print feature importances
print("Feature Importances:")
for feature_name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")

# Optionally, check model accuracy
accuracy = clf.score(X_test, y_test)
print(f"\nModel Accuracy on Test Data: {accuracy:.2f}")


Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876

Model Accuracy on Test Data: 1.00


**Question 5) What is a Support Vector Machine (SVM)?**

**Answer 5)**

A Support Vector Machine (SVM) is a supervised machine learning algorithm used primarily for classification tasks, though it can also handle regression (in the form of Support Vector Regression). The main idea of SVM is to find the best boundary (hyperplane) that separates data points of different classes.

Here’s a detailed explanation:

1. Basic Concept

    Suppose you have data points belonging to two classes (e.g., cats vs. dogs).

    SVM tries to find a hyperplane that separates these two classes.

    i) In 2D, this hyperplane is a line.

    ii) In 3D, it’s a plane.

    iii) In higher dimensions, it’s a “hyperplane.”

    The goal is to maximize the margin, which is the distance between the hyperplane and the nearest data points from each class. These nearest points are called support vectors.

2. Support Vectors

    i) Support vectors are the data points that are closest to the hyperplane.

    ii) They are crucial because the position of the hyperplane depends entirely on them.

    iii) Other points that are farther away do not affect the hyperplane.

3. Linear vs. Non-linear SVM

    i) Linear SVM:
    Works when data is linearly separable (can be separated by a straight line or hyperplane).

    ii) Non-linear SVM:
    For data that is not linearly separable, SVM uses the kernel trick to implicitly map the data into a higher-dimensional space where it may become linearly separable.

4. Kernels

    Kernels help SVM handle non-linear data. Common kernels include:

    i) Linear kernel – for linearly separable data.

    ii) Polynomial kernel – maps data into polynomial features.

    iii) RBF (Radial Basis Function) kernel / Gaussian kernel – popular for complex boundaries.

    iv) Sigmoid kernel – similar to neural network activation.

5. Advantages of SVM

    i) Works well in high-dimensional spaces.

    ii) Effective even when number of features > number of samples.

    iii) Uses support vectors, making it memory efficient.

6. Limitations

    i) Choosing the right kernel and hyperparameters can be tricky.

    ii) Not ideal for very large datasets (training can be slow).

    iii) Sensitive to noisy data and overlapping classes.
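To make the idea of support vectors concrete, here is a minimal sketch using scikit-learn's `SVC` on eight hand-made 2D points (the points are hypothetical, chosen so the two classes are clearly separable). A large `C` is used so the model behaves approximately like a hard-margin SVM.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data: 4 points per class
X = np.array([[0, 0], [0, 1], [1, 0], [-1, -1],
              [3, 3], [3, 4], [4, 3], [5, 5]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A large C penalizes margin violations heavily (close to a hard margin)
clf = SVC(kernel="linear", C=1000.0)
clf.fit(X, y)

print("Support vectors:")
print(clf.support_vectors_)
print("Hyperplane weights:", clf.coef_[0], "intercept:", clf.intercept_[0])
print("Predictions for (0.5, 0.5) and (4, 4):", clf.predict([[0.5, 0.5], [4.0, 4.0]]))
```

Only the points nearest the boundary show up in `support_vectors_`; the points farther away do not appear there, which reflects the claim that they do not affect the hyperplane.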

**Question 6) What is the Kernel Trick in SVM?**

**Answer 6)**

The Kernel Trick is a powerful mathematical technique that allows Support Vector Machines (SVMs) to solve complex, non-linear classification problems.

At its core, the trick allows the SVM to operate in a very high-dimensional (even infinite-dimensional) feature space without ever having to compute the coordinates of the data in that space. This avoids an enormous computational cost.

Here’s a step-by-step breakdown:

1. The Problem: Non-Linear Data

A standard SVM works by finding the best linear separator (a hyperplane, which is just a line in 2D or a flat plane in 3D) that divides two classes.

This works perfectly if the data is linearly separable.

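But what about data where one class forms a ring around the other, like a donut: red dots clustered in the center and blue dots surrounding them?

You cannot draw a single straight line to separate the blue dots from the red dots. This data is not linearly separable.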

2. The Solution: Map to a Higher Dimension

The main idea to solve this is to transform the data into a higher dimension where it does become linearly separable.

Imagine the "donut" data from above. It's in 2D (features $x_1, x_2$). Let's create a new, 3D space by applying a transformation function $\phi$ (phi).

Let's define a new feature, $z$, such that $z = x_1^2 + x_2^2$. Our new space is now 3D, with coordinates $(x_1, x_2, z)$.

If we plot the data in this new 3D space, the red "inner" dots (which have small $x_1$ and $x_2$ values) will have a low $z$ value. The blue "outer" dots (which have large $x_1$ and $x_2$ values) will have a high $z$ value.

In this new 3D space, the data is now perfectly separable by a simple 2D plane! The SVM can easily find this plane. When we project this plane back down to our original 2D space, it becomes the circular boundary we needed.
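The lifting step can be tried out directly. The sketch below builds a synthetic donut dataset with NumPy (the radii 0.5 and 2.0 and the sample size are arbitrary assumptions), adds the feature $z = x_1^2 + x_2^2$, and fits a linear SVM before and after the transformation.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic "donut" data: 50 inner points (radius 0.5), 50 outer points (radius 2.0)
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, size=100)
radii = np.where(np.arange(100) < 50, 0.5, 2.0)
X2d = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = (radii > 1.0).astype(int)  # 0 = inner (red), 1 = outer (blue)

# Explicit transformation: add z = x1^2 + x2^2 as a third coordinate
z = X2d[:, 0] ** 2 + X2d[:, 1] ** 2
X3d = np.column_stack([X2d, z])

# The same linear SVM, fitted in the original 2D space and in the lifted 3D space
acc_2d = SVC(kernel="linear").fit(X2d, y).score(X2d, y)
acc_3d = SVC(kernel="linear").fit(X3d, y).score(X3d, y)
print(f"Training accuracy in 2D: {acc_2d:.2f}")
print(f"Training accuracy in 3D: {acc_3d:.2f}")
```

In the lifted space a flat plane at some constant $z$ between the two rings separates the classes, which is why the linear model succeeds there but not in 2D.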

3. The New Problem: The Cost of Transformation

This transformation approach is brilliant, but carrying it out explicitly can be extremely expensive when the new space has a very large number of dimensions.

The SVM algorithm relies heavily on one specific calculation: the dot product of data points (e.g., $\vec{x_i} \cdot \vec{x_j}$).

If we transform all our data points $\vec{x}$ into the high-dimensional space $\phi(\vec{x})$, we would then have to compute the dot product in that new, very high-dimensional space: $\phi(\vec{x_i}) \cdot \phi(\vec{x_j})$.

This is computationally infeasible for two reasons:

   i) Too Slow: Calculating $\phi(\vec{x})$ for every point can be extremely complex.

  ii) Too Big: The new dimension can be so large (even infinitely large) that we can't store the $\phi(\vec{x})$ vectors.

4. The "Trick": The Kernel Function

This is where the magic happens. The Kernel Trick is based on the discovery that we don't need to know the coordinates $\phi(\vec{x})$ at all. We only need the result of their dot product: $\phi(\vec{x_i}) \cdot \phi(\vec{x_j})$.

A kernel function, written as $K(\vec{x_i}, \vec{x_j})$, is a special, computationally cheap function that takes the original low-dimensional vectors $\vec{x_i}$ and $\vec{x_j}$ as input and directly computes the dot product of their transformed, high-dimensional versions.

In short, a kernel function lets us do this:

$K(\vec{x_i}, \vec{x_j}) = \phi(\vec{x_i}) \cdot \phi(\vec{x_j})$

  a) The Left Side (What We Do): $K(\vec{x_i}, \vec{x_j})$. A simple, fast calculation using the original, low-dimensional data.

  b) The Right Side (What We Implicitly Get): $\phi(\vec{x_i}) \cdot \phi(\vec{x_j})$. The result of the dot product in the complex, high-dimensional space.

The SVM algorithm can be rewritten to only use the kernel function $K(\vec{x_i}, \vec{x_j})$ instead of the standard dot product $\vec{x_i} \cdot \vec{x_j}$.
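A small numeric check makes the identity tangible. This sketch uses one well-known case as an assumption: the degree-2 polynomial kernel $K(\vec{x}, \vec{y}) = (\vec{x} \cdot \vec{y})^2$ for 2D inputs, whose feature map is $\phi(\vec{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. The function names `phi` and `poly_kernel` are illustrative.

```python
import numpy as np

def phi(x):
    """Explicit feature map into 3D for the degree-2 polynomial kernel."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(x, y):
    """Kernel computed directly from the original 2D vectors."""
    return np.dot(x, y) ** 2

x_i = np.array([1.0, 2.0])
x_j = np.array([3.0, 0.5])

explicit = np.dot(phi(x_i), phi(x_j))  # transform first, then take the dot product
implicit = poly_kernel(x_i, x_j)       # never leaves the 2D space

print("Dot product in the transformed space:", explicit)
print("Kernel on the original vectors:      ", implicit)
```

Both routes give the same number, but the kernel route never builds the 3D vectors. For kernels such as RBF, the corresponding feature space is infinite-dimensional, so the kernel route is the only practical one.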

Summary: Why it's a "Trick"

The Kernel Trick allows us to get all the separating power of a high-dimensional space without ever paying the computational price of transforming the data into that space.

We just plug in a non-linear kernel function, and the SVM algorithm "magically" finds a complex, non-linear boundary in the original feature space.


In [3]:
"""
Question 7) Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.

Answer 7)
"""
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create two SVM classifiers with different kernels
svm_linear = SVC(kernel='linear', random_state=42)
svm_rbf = SVC(kernel='rbf', random_state=42)

# Train both classifiers
svm_linear.fit(X_train, y_train)
svm_rbf.fit(X_train, y_train)

# Make predictions
y_pred_linear = svm_linear.predict(X_test)
y_pred_rbf = svm_rbf.predict(X_test)

# Calculate accuracies
acc_linear = accuracy_score(y_test, y_pred_linear)
acc_rbf = accuracy_score(y_test, y_pred_rbf)

# Print results
print("SVM Classifier Comparison on Wine Dataset")
print("---------------------------------------")
print(f"Linear Kernel Accuracy: {acc_linear:.4f}")
print(f"RBF Kernel Accuracy:    {acc_rbf:.4f}")

# Check which performed better
if acc_linear > acc_rbf:
    print("\n✅ Linear kernel performed better.")
elif acc_rbf > acc_linear:
    print("\n✅ RBF kernel performed better.")
else:
    print("\n⚖️ Both kernels performed equally well.")


SVM Classifier Comparison on Wine Dataset
---------------------------------------
Linear Kernel Accuracy: 0.9815
RBF Kernel Accuracy:    0.7593

✅ Linear kernel performed better.
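One likely reason for the weak RBF result above is that the Wine features are on very different scales, and the RBF kernel depends on distances between points. The sketch below is an optional follow-up, not part of the original answer: it repeats the RBF experiment with `StandardScaler` in a pipeline, keeping the same split, so the two RBF models can be compared.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# RBF kernel on raw features (same setup as the program above)
rbf_unscaled = SVC(kernel="rbf", random_state=42).fit(X_train, y_train)

# RBF kernel with features standardized to zero mean and unit variance;
# the scaler is fitted on the training data only, inside the pipeline
rbf_scaled = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=42))
rbf_scaled.fit(X_train, y_train)

acc_unscaled = rbf_unscaled.score(X_test, y_test)
acc_scaled = rbf_scaled.score(X_test, y_test)
print(f"RBF accuracy without scaling: {acc_unscaled:.4f}")
print(f"RBF accuracy with scaling:    {acc_scaled:.4f}")
```

Putting the scaler inside the pipeline keeps test data out of the scaling statistics, so the comparison stays fair.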


**Question 8) What is the Naïve Bayes classifier, and why is it called "Naïve"?**

**Answer 8)**

The Naïve Bayes classifier is a simple yet powerful and fast probabilistic machine learning algorithm. It is used for classification tasks, such as:

i) Spam filtering: Classifying an email as "Spam" or "Not Spam".

ii) Text classification: Determining the topic of an article (e.g., "Sports," "Technology," or "Politics").

iii) Medical diagnosis: Predicting whether a patient has a certain disease based on their symptoms.

The algorithm is based on Bayes' Theorem, which is a fundamental concept in probability.

The "Bayes" Part: How It Works

Bayes' Theorem provides a way to update our beliefs about something given new evidence. In the context of classification, the formula looks like this:

$$P(\text{Class} | \text{Features}) = \frac{P(\text{Features} | \text{Class}) \cdot P(\text{Class})}{P(\text{Features})}$$

Let's break this down with a spam filter example:

1. $P(\text{Class} | \text{Features})$ (Posterior Probability): This is what we want to find. "What is the probability the email is 'Spam', given that it contains the words 'Viagra' and 'free'?"

2. $P(\text{Features} | \text{Class})$ (Likelihood): "How likely are the words 'Viagra' and 'free' to appear given that an email is 'Spam'?" The algorithm learns this from the training data.

3. $P(\text{Class})$ (Prior Probability): "What is the overall probability of any email being 'Spam'?" (e.g., 20% of all emails are spam).

4. $P(\text{Features})$ (Evidence): "What is the overall probability of seeing the words 'Viagra' and 'free' in any email?" (We can often ignore this part, as it's the same for all classes).

To make a prediction, the classifier calculates the posterior probability for every class (e.g., 'Spam' and 'Not Spam'). The class that results in the highest probability is the winner.

**Why is it Called "Naïve"?**

This is the most important part of its name. The algorithm is called "Naïve" because it makes a strong, simplifying assumption about the data that is almost always false in the real world.

The "Naïve" Assumption: All features are independent of each other, given the class.

In simple terms, the classifier naïvely believes that the presence (or absence) of one feature has absolutely no effect on the presence (or absence) of any other feature.

Example of the "Naïve" Assumption

Let's stick with our spam filter. The features are the words in the email.

  a) Features: "Viagra", "free", "congratulations", "lottery"

  b) Class: "Spam"

Reality: In the real world, these words are highly correlated. An email containing "Viagra" is much more likely to also contain "free" and "lottery". The presence of one word gives us a strong hint about the others.

The Naïve Bayes Classifier's View: The classifier assumes these words are completely unrelated. It calculates the probability of "Viagra" appearing in spam, the probability of "free" in spam, and the probability of "lottery" in spam, and then multiplies them together as if they were independent coin flips.

It thinks that knowing "Viagra" is in the email gives it no information about whether "free" is also there. This is clearly an incorrect, or "naïve," way to view the world.
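Here is what that multiplication looks like as a hand calculation. Apart from the 20% spam prior mentioned earlier, every number below is hypothetical and chosen only to illustrate the arithmetic.

```python
# Prior probability of each class (20% spam, as in the example above)
prior = {"Spam": 0.2, "Not Spam": 0.8}

# Hypothetical per-word likelihoods P(word | class), made up for illustration
likelihood = {
    "Spam": {"viagra": 0.30, "free": 0.60},
    "Not Spam": {"viagra": 0.001, "free": 0.05},
}

words_in_email = ["viagra", "free"]

# Naive assumption: multiply the per-word likelihoods as if they were independent
unnormalized = {}
for cls in prior:
    score = prior[cls]
    for word in words_in_email:
        score *= likelihood[cls][word]
    unnormalized[cls] = score

# Dividing by the evidence turns the scores into probabilities that sum to 1
evidence = sum(unnormalized.values())
posterior = {cls: score / evidence for cls, score in unnormalized.items()}
prediction = max(posterior, key=posterior.get)

for cls, prob in posterior.items():
    print(f"P({cls} | words) = {prob:.4f}")
print("Prediction:", prediction)
```

Because the evidence term is the same for every class, the winner could be picked from the unnormalized scores alone, which is why the denominator is often ignored.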

**Question 9) Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes.**

**Answer 9)**

The main difference between Gaussian, Multinomial, and Bernoulli Naïve Bayes lies in the type of data they are designed to handle and the statistical distribution they assume for the features.

All three are "Naïve" because they share the same core assumption: that all features are independent of one another, given the class. Where they differ is in how they model the probability of those features.

Here’s a breakdown of each.

1. Gaussian Naïve Bayes (GNB)

    What it's for: Continuous numerical data.
    
    Core Idea: It assumes that the values for each feature are "normally distributed" (i.e., follow a Gaussian distribution, or bell curve) within each class.
    
    How it works: To calculate the probability of a feature value, the model first calculates the mean ($\mu$) and variance ($\sigma^2$) of that feature for each class from the training data. It then uses the Gaussian probability density function (PDF) to find the likelihood of a new data point.
    
    Simple Example: Predicting if a plant is Species A or Species B based on its petal_length and sepal_width in centimeters. The model would calculate the mean and variance of petal_length for all Species A plants, and separately for all Species B plants. It does the same for sepal_width.
  
2. Multinomial Naïve Bayes (MNB)

    What it's for: Discrete data, specifically "counts" or "frequencies."
    
    Core Idea: It's designed for features that represent the count of an event occurring (e.g., how many times a word appears in a document).
    
    How it works: It estimates the probability of a feature (like a specific word) from how often it occurs within each class, relative to the total count of all features in that class.
    
    Simple Example: Text classification (like spam filtering).
    
    Features: The count of each word in the vocabulary (e.g., word_count('free'), word_count('viagra'), word_count('meeting')).
    
    Data: A document might be represented as: {'free': 3, 'viagra': 1, 'meeting': 0, ...}.
    
    Model: It learns that the word "free" appears, on average, 5 times in "Spam" emails but only 0.1 times in "Not Spam" emails.
    
3. Bernoulli Naïve Bayes (BNB)

    What it's for: Binary/Boolean data (Yes/No, 1/0, True/False).
    
    Core Idea: It's used when features are binary variables, indicating the presence or absence of a feature.
    
    How it works: Instead of counting how many times a feature appears, it only cares if it appears at all.
    
    Simple Example: Also used for text classification, but with a different approach.
    
    Features: A binary value for each word in the vocabulary (e.g., contains('free'), contains('viagra'), contains('meeting')).

    Data: The same document would be represented as: {'free': 1, 'viagra': 1, 'meeting': 0, ...}. The 1 for 'free' just means "this word is present," ignoring that it appeared 3 times.
    
    Model: It learns the probability that the word "free" is present in a "Spam" email (e.g., 70% of spam emails contain 'free') vs. a "Not Spam" email (e.g., 5% of non-spam emails contain 'free').
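The count-versus-presence distinction can be seen with scikit-learn's `CountVectorizer`. The four tiny documents and their labels below are hypothetical, made up for illustration; the sketch only shows how the same text is encoded for each model, not a realistic spam filter.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical mini-corpus: 1 = Spam, 0 = Not Spam
docs = [
    "free free free viagra",
    "meeting at noon",
    "free lottery congratulations",
    "project meeting notes",
]
labels = [1, 0, 1, 0]

# Multinomial NB input: how many times each word appears
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)

# Bernoulli NB input: whether each word appears at all (1 or 0)
binary_vec = CountVectorizer(binary=True)
X_binary = binary_vec.fit_transform(docs)

free_count = X_counts[0, count_vec.vocabulary_["free"]]
free_present = X_binary[0, binary_vec.vocabulary_["free"]]
print("'free' in first document as a count:   ", free_count)
print("'free' in first document as a presence:", free_present)

mnb = MultinomialNB().fit(X_counts, labels)
bnb = BernoulliNB().fit(X_binary, labels)

new_email = ["free lottery"]
mnb_pred = mnb.predict(count_vec.transform(new_email))[0]
bnb_pred = bnb.predict(binary_vec.transform(new_email))[0]
print("Multinomial NB prediction:", mnb_pred)
print("Bernoulli NB prediction:  ", bnb_pred)
```

For continuous measurements such as petal length, `GaussianNB` would be used instead, with the raw numeric features and no vectorizer.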

In [5]:
"""
Question 10) Breast Cancer Dataset
Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate accuracy.

Answer 10)
"""
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Gaussian Naïve Bayes model
gnb = GaussianNB()

# Train (fit) the model
gnb.fit(X_train, y_train)

# Make predictions on the test set
y_pred = gnb.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of Gaussian Naïve Bayes on Breast Cancer dataset: {:.2f}%".format(accuracy * 100))

# Print detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Accuracy of Gaussian Naïve Bayes on Breast Cancer dataset: 97.37%

Classification Report:
              precision    recall  f1-score   support

   malignant       1.00      0.93      0.96        43
      benign       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

