#1. What is Information Gain, and how is it used in Decision Trees?

Information Gain tells us which attribute separates the data best into groups that are as pure (homogeneous) as possible.

1. The Core Concept: Entropy:-

To understand Information Gain, you first need to understand Entropy. Entropy is a measure of disorder or impurity in a dataset.

High Entropy: The dataset is very mixed (e.g., a basket with 50% apples and 50% oranges). It is hard to predict the label of a random item.

Low Entropy: The dataset is nearly pure (e.g., a basket with 99% apples). It is easy to predict the label. Entropy is exactly 0 only when every item has the same label.
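As a minimal sketch of the idea above, entropy can be computed with the Shannon formula $Entropy(S) = -\sum_i p_i \log_2 p_i$, where $p_i$ is the share of class $i$. The `entropy` helper and the basket contents are illustrative assumptions, not part of any library.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # p * log2(1/p) is the same as -p * log2(p)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

mixed = ["apple"] * 50 + ["orange"] * 50        # 50/50 basket
nearly_pure = ["apple"] * 99 + ["orange"] * 1   # 99% apples
pure = ["apple"] * 100                          # only apples

print("mixed:", entropy(mixed))              # highest possible for two classes
print("nearly pure:", entropy(nearly_pure))  # close to 0
print("pure:", entropy(pure))                # exactly 0
```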

2. What is Information Gain:-
Information Gain measures the reduction in entropy achieved by splitting the dataset according to a specific attribute.

Mathematically, it is the difference between the entropy of the parent node and the weighted average entropy of the child nodes.

$$IG(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \times Entropy(S_v)$$
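The formula can be checked on a small, made-up example. The sketch below assumes a parent node with 9 "yes" and 5 "no" labels that a hypothetical "Wind" attribute splits into a "weak" group (6 yes, 2 no) and a "strong" group (3 yes, 3 no). The helper names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

weak = ["yes"] * 6 + ["no"] * 2
strong = ["yes"] * 3 + ["no"] * 3
parent = weak + strong

print(round(information_gain(parent, [weak, strong]), 3))  # 0.048
```

A gain this small suggests that, for this toy data, "Wind" removes very little uncertainty about the label.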

3. How It Is Used in Decision Trees (ID3 Algorithm):-

The decision tree algorithm (specifically ID3) uses Information Gain in a greedy approach:-

Calculate Parent Entropy: It calculates the impurity of the current dataset.

Test Every Attribute: It simulates splitting the data on every available feature (e.g., "Weather", "Temperature", "Wind").

Calculate Child Entropy: For each test split, it calculates the new weighted entropy of the resulting groups.

Find the Gain: It subtracts the new entropy from the parent entropy to see how much uncertainty was removed.

Choose the Winner: The tree selects the attribute with the highest Information Gain to be the splitting node.

Repeat: The process is repeated recursively for every branch until the nodes are pure (entropy is 0) or a stopping criterion is met.
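The steps above can be sketched as a short recursive function. This is a simplified illustration of the ID3 idea under stated assumptions (categorical attributes only, no pruning, a made-up seven-row dataset), not a full implementation. The names `id3`, `information_gain`, and `rows` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Parent entropy minus weighted child entropy for a split on `attribute`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - weighted

def id3(rows, attributes, target):
    labels = [row[target] for row in rows]
    # Stop when the node is pure or no attributes are left: return the majority label
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: pick the attribute with the highest Information Gain
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    return {best: {
        value: id3([r for r in rows if r[best] == value], remaining, target)
        for value in sorted({r[best] for r in rows})
    }}

# Toy data (invented for illustration)
rows = [
    {"Weather": "Sunny", "Wind": "Weak", "Play": "yes"},
    {"Weather": "Sunny", "Wind": "Strong", "Play": "yes"},
    {"Weather": "Sunny", "Wind": "Weak", "Play": "yes"},
    {"Weather": "Sunny", "Wind": "Strong", "Play": "yes"},
    {"Weather": "Rainy", "Wind": "Weak", "Play": "yes"},
    {"Weather": "Rainy", "Wind": "Strong", "Play": "no"},
    {"Weather": "Rainy", "Wind": "Strong", "Play": "no"},
]

tree = id3(rows, ["Weather", "Wind"], "Play")
print(tree)
```

On this data "Weather" has the higher gain, so it becomes the root. The "Sunny" branch is already pure, and only the "Rainy" branch is split again on "Wind".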

4. Limitation: Bias Towards Many Values:-

One drawback of Information Gain is that it is biased towards attributes with a large number of distinct values.

Example: If you had an attribute "Date" (which is unique for every row), splitting on it would result in perfectly pure child nodes (1 row each). The Information Gain would be maximum, but the model would just be memorizing data (overfitting) and would be useless for prediction.

Solution: Advanced algorithms (like C4.5) use Gain Ratio, which penalizes attributes with too many unique branches.
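A small sketch can make the bias and its correction concrete. Gain Ratio divides Information Gain by the split information, which is the entropy of the attribute's own value distribution. The eight-row dataset and the helper `gain_and_ratio` are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    total = len(values)
    return sum((c / total) * math.log2(total / c) for c in Counter(values).values())

def gain_and_ratio(attribute_values, labels):
    """Return (information gain, gain ratio) for one attribute."""
    groups = defaultdict(list)
    for value, label in zip(attribute_values, labels):
        groups[value].append(label)
    n = len(labels)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(attribute_values)  # large when there are many small branches
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio

labels = ["yes"] * 4 + ["no"] * 4
date = [f"day-{i}" for i in range(8)]        # unique for every row
weather = ["Sunny"] * 4 + ["Rainy"] * 4      # two values, also separates the classes

date_gain, date_ratio = gain_and_ratio(date, labels)
weather_gain, weather_ratio = gain_and_ratio(weather, labels)

print("Date:    gain =", date_gain, " ratio =", round(date_ratio, 3))
print("Weather: gain =", weather_gain, " ratio =", round(weather_ratio, 3))
```

Both attributes reach the same Information Gain here, but the eight-way "Date" split is penalized by its larger split information, so Gain Ratio prefers "Weather".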

#2. What is the difference between Gini Impurity and Entropy?

| Aspect      | Gini Impurity                       | Entropy                              |
| ----------- | ----------------------------------- | ------------------------------------ |
| Concept     | Misclassification probability       | Information uncertainty              |
| Value Range | 0 to 0.5 (binary)                   | 0 to 1 (binary, log base 2)          |
| Computation | Faster (no logarithms)              | Slower (uses log)                    |
| Sensitivity | Less sensitive to small changes     | More sensitive to class distribution |
| Bias        | Favors larger, more balanced splits | Favors purer splits                  |
| Used In     | CART                                | ID3, C4.5                            |
| Metric Used | Gini Reduction                      | Information Gain                     |


Behavior Comparison:-

Pure node (all one class)

Gini = 0

Entropy = 0

Maximally mixed node (50–50 binary)

Gini = 0.5

Entropy = 1
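The values above follow from the standard definitions $Gini = 1 - \sum_i p_i^2$ and $Entropy = -\sum_i p_i \log_2 p_i$. The sketch below evaluates both for a binary node, where `p` is the proportion of one class. The function names are illustrative.

```python
import math

def gini(p):
    """Gini impurity of a binary node with class proportions p and 1 - p."""
    return 1 - (p ** 2 + (1 - p) ** 2)

def binary_entropy(p):
    """Entropy (in bits) of a binary node with class proportions p and 1 - p."""
    if p in (0, 1):
        return 0.0  # a pure node has no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.0, 0.1, 0.3, 0.5, 1.0):
    print(f"p={p:.1f}  gini={gini(p):.3f}  entropy={binary_entropy(p):.3f}")
```

Both measures are 0 for a pure node and peak at a 50-50 mix. They differ mainly in scale and in the shape of the curve in between.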



#3. What is Pre-Pruning in Decision Trees?

Pre-Pruning (Early Stopping) in Decision Trees is a technique used to stop the tree from growing further during training to prevent overfitting and improve generalization on unseen data.

Why Pre-Pruning is Needed:-

Prevents overfitting to training data

Reduces model complexity

Improves training speed

Enhances performance on test data

Example:-

Suppose a node has 10 samples.
If the rule says the minimum number of samples required to split a node is 15, then:

No further split is allowed

Node becomes a leaf
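In scikit-learn, pre-pruning rules like this one are set through constructor parameters of `DecisionTreeClassifier`, such as `max_depth`, `min_samples_split`, and `min_samples_leaf`. The sketch below compares a fully grown tree with a pre-pruned one on the Iris dataset. The specific parameter values are arbitrary choices for illustration, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Fully grown tree: no early-stopping rules
full_tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# Pre-pruned tree: a node is not split further once any rule is violated
pruned_tree = DecisionTreeClassifier(
    max_depth=3,            # never grow deeper than 3 levels
    min_samples_split=15,   # a node needs at least 15 samples to be split
    min_samples_leaf=5,     # every leaf must keep at least 5 samples
    random_state=42,
).fit(X, y)

for name, tree in [("full", full_tree), ("pre-pruned", pruned_tree)]:
    print(name, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```

In practice these values are usually chosen with cross-validation rather than fixed by hand.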

Advantages:-

Faster training

Smaller, simpler trees

Lower variance

Good for large datasets

Disadvantages:-

May stop too early

Can lead to underfitting

Requires careful parameter tuning

In [1]:
#4. Write a Python program to train a Decision Tree Classifier using Gini
#   Impurity as the criterion and print the feature importances (practical).
#   Hint: Use criterion='gini' in DecisionTreeClassifier and access .feature_importances_.

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
import pandas as pd

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Feature names
feature_names = data.feature_names

# Train Decision Tree Classifier using Gini Impurity
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X, y)

# Get feature importances
importances = model.feature_importances_

# Display feature importances
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

print(feature_importance_df)


             Feature  Importance
0  sepal length (cm)    0.013333
1   sepal width (cm)    0.000000
2  petal length (cm)    0.564056
3   petal width (cm)    0.422611


#5. What is a Support Vector Machine (SVM)?

A Support Vector Machine (SVM) is a powerful supervised machine learning algorithm used primarily for classification, though it can also be used for regression.

How SVM Works: The Core Concepts
To understand SVM, you need to visualize how it draws a line between two groups of data points.

The Hyperplane:-
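The hyperplane is the decision boundary that divides the data. In two dimensions it is a line; with more features it becomes a plane or a higher-dimensional hyperplane. For a binary classification (e.g., "Spam" vs. "Not Spam"), the SVM tries to place this boundary so that it separates the two categories as clearly as possible.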


The Margin:-
Unlike other algorithms that might just find any line that separates the data, SVM looks for the maximum margin. The margin is the distance between the hyperplane and the closest data points from either class. A larger margin provides a "safety buffer," making the model more robust and better at generalizing to new data.


Support Vectors:-
These are the most important data points in the set. They are the points that sit right on the edge of the margin. If you moved these specific points, the position of the hyperplane would change. They are called "support vectors" because they literally "support" or define the decision boundary.
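The role of support vectors can be illustrated with scikit-learn's `SVC`, which exposes them through the `support_vectors_` and `support_` attributes after fitting. The six 2D points below are made up, and the large `C` value is an assumption chosen to approximate a hard margin. Refitting on the support vectors alone should give roughly the same boundary, because the other points do not influence it.

```python
import numpy as np
from sklearn.svm import SVC

# Two small, clearly separated clusters (toy data)
X = np.array([[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1000).fit(X, y)
print("Support vectors:\n", clf.support_vectors_)

# For a linear SVM the margin width is 2 / ||w||
w = clf.coef_[0]
margin_width = 2 / np.linalg.norm(w)
print("Margin width:", margin_width)

# Refit using only the support vectors and compare the boundaries
sv_only = SVC(kernel="linear", C=1000).fit(X[clf.support_], y[clf.support_])
same_boundary = (
    np.allclose(clf.coef_, sv_only.coef_, atol=1e-2)
    and np.allclose(clf.intercept_, sv_only.intercept_, atol=1e-2)
)
print("Same boundary without the other points:", same_boundary)
```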

Real-World Applications:-

Face Detection: Classifying parts of an image as "face" or "non-face."

Text Categorization: Sorting emails into spam/not-spam or news into categories.

Bioinformatics: Classifying proteins or gene sequences.




#6.  What is the Kernel Trick in SVM?

The Kernel Trick in Support Vector Machines (SVM) is a technique that allows SVMs to handle non-linearly separable data by implicitly mapping input data into a higher-dimensional feature space, where a linear separator can be found.

Mathematical Intuition:-

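Instead of explicitly computing the mapping:

$$\phi(x): \mathbb{R}^n \to \mathbb{R}^m$$

SVM uses a kernel function that returns the inner product in the mapped space directly:

$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$$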
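A quick numerical check shows why this works. For the degree-2 polynomial kernel $(x \cdot z)^2$ on 2D inputs, one valid explicit feature map is $\phi(x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)$. The sketch below, with arbitrarily chosen vectors, confirms that the kernel gives the same value as the dot product in the mapped space without ever building $\phi$.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the kernel (x . z)^2 in two dimensions."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel with c = 0, computed in the original space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))   # map first, then take the dot product
implicit = poly_kernel(x, z)        # kernel trick: no mapping needed

print(explicit, implicit)  # both are 121 (up to floating-point rounding)
```

For kernels such as RBF the corresponding feature space is infinite-dimensional, so the kernel is the only practical way to compute these inner products.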


| Kernel         | Formula                                       | When to Use                  |
| -------------- | --------------------------------------------- | ---------------------------- |
| Linear         | $x_i \cdot x_j$                               | Linearly separable data      |
| Polynomial     | $(x_i \cdot x_j + c)^d$                       | Feature interactions         |
| RBF (Gaussian) | $\exp(-\gamma \lVert x_i - x_j \rVert^2)$     | Complex, non-linear patterns |
| Sigmoid        | $\tanh(\alpha \, x_i \cdot x_j + c)$          | Neural-network-like behavior |

Example Intuition:-

Imagine data arranged in concentric circles:

Not linearly separable in 2D (no straight line can split the inner circle from the outer one)

After kernel transformation → separable by a hyperplane
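The concentric-circles case can be reproduced with scikit-learn's `make_circles`. The sketch below compares a linear kernel with an RBF kernel on such data. The sample size, noise level, and random seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Inner circle = class 1, outer circle = class 0
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print("Linear kernel accuracy:", linear_acc)
print("RBF kernel accuracy:", rbf_acc)
```

The linear kernel should perform close to chance on this data, while the RBF kernel should separate the two rings almost perfectly.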


In [2]:
#7. Write a Python program to train two SVM classifiers with Linear and RBF
#   kernels on the Wine dataset, then compare their accuracies.
#   Hint: Use SVC(kernel='linear') and SVC(kernel='rbf'), then compare accuracy scores after fitting
#   on the same dataset.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Load Wine dataset
data = load_wine()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Feature scaling (important for SVM)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train Linear SVM
svm_linear = SVC(kernel='linear', random_state=42)
svm_linear.fit(X_train, y_train)

# Train RBF SVM
svm_rbf = SVC(kernel='rbf', random_state=42)
svm_rbf.fit(X_train, y_train)

# Predictions
y_pred_linear = svm_linear.predict(X_test)
y_pred_rbf = svm_rbf.predict(X_test)

# Accuracy scores
acc_linear = accuracy_score(y_test, y_pred_linear)
acc_rbf = accuracy_score(y_test, y_pred_rbf)

# Print results
print("Linear SVM Accuracy:", acc_linear)
print("RBF SVM Accuracy:", acc_rbf)


Linear SVM Accuracy: 0.9814814814814815
RBF SVM Accuracy: 0.9814814814814815


#8. What is the Naïve Bayes classifier, and why is it called "Naïve"?

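Naïve Bayes is a probabilistic classifier based on Bayes' theorem: it predicts the class that is most probable given the observed features. It earns the name "Naïve" because it makes a strong simplifying assumption: all features are conditionally independent of one another given the class.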

In the real world, this is almost never true. For example:

In Weather: Humidity and Temperature are often related (high humidity usually accompanies certain temperatures). Naïve Bayes ignores this and treats them as unrelated signals.

In Text: In the phrase "Stock Market," the word "Stock" is highly likely to be followed by "Market." Naïve Bayes treats the presence of "Stock" and "Market" as two completely independent events that just happened to occur in the same email.

The Core Formula:-

Naïve Bayes uses the following formula to predict the class ($y$) of a given set of features ($X = x_1, ..., x_n$):

$$P(y|x_1, ..., x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i|y)}{P(x_1, ..., x_n)}$$

Since the denominator (the evidence) is the same for all classes being compared, the classifier simply finds the class that maximizes the numerator:

$$\text{Predicted Class} = \underset{y}{\text{argmax}} \; P(y) \prod_{i=1}^{n} P(x_i|y)$$
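The formula can be worked through by hand on a tiny example. The priors and per-word likelihoods below are invented numbers for a hypothetical spam filter with two yes/no features, and the email being classified contains "prize" but not "meeting".

```python
# Assumed (made-up) probabilities for illustration
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {                       # P(word is present | class)
    "spam": {"prize": 0.70, "meeting": 0.10},
    "ham": {"prize": 0.05, "meeting": 0.50},
}
email = {"prize": True, "meeting": False}

def numerator(cls):
    """P(y) multiplied by P(x_i | y) for every feature, assuming independence."""
    score = priors[cls]
    for word, present in email.items():
        p = likelihoods[cls][word]
        score *= p if present else (1 - p)
    return score

scores = {cls: numerator(cls) for cls in priors}
evidence = sum(scores.values())       # the shared denominator
posteriors = {cls: s / evidence for cls, s in scores.items()}
predicted = max(scores, key=scores.get)

print(scores)       # numerators: about 0.252 for spam, 0.015 for ham
print(posteriors)   # normalized so they sum to 1
print(predicted)    # spam
```

Dividing by the evidence changes the scale of the scores but not their ranking, which is why the classifier can skip it.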



#9. Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes

1. Gaussian Naïve Bayes:-
Best for: Continuous, real-valued data. This variant assumes that the features follow a Normal (Gaussian) Distribution (the "Bell Curve"). Instead of counting frequencies, it calculates the mean and standard deviation for each feature per class.

Example Features: Temperature, height, weight, or blood pressure.

How it works: If you’re predicting "Gender" based on "Height," the model calculates the average height for men and women. For a new person, it checks where their height falls on those two bell curves to see which is more likely.

2. Multinomial Naïve Bayes:-
Best for: Discrete counts or frequencies. This is the most popular choice for Natural Language Processing (NLP). It assumes that features represent the number of times an event occurred.

Example Features: The number of times the word "Discount" appears in an email, or the count of specific ingredients in a recipe.

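How it works: It looks at the "Bag of Words" (word counts). For each class, it estimates the probability of each word from how often that word occurs in documents of that class, then scores a new document by combining those probabilities according to its word counts (e.g., if "Prize" makes up a much larger share of the words in spam than in normal emails, every occurrence of "Prize" pushes the prediction towards spam).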

3. Bernoulli Naïve Bayes:-
Best for: Binary/Boolean features (0 or 1). Like the Multinomial version, this is often used for text classification, but it ignores how many times a word appears. It only cares if the word is present or absent.

Example Features: Does the email contain the word "Winner"? (Yes/No), or is a specific pixel in an image black or white?

How it works: It models each feature with a Bernoulli distribution, that is, the probability that the feature is present in a given class. This is particularly useful for short documents, where a word appearing multiple times doesn't necessarily add more "information" than it appearing just once.
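The three variants map directly onto scikit-learn's `GaussianNB`, `MultinomialNB`, and `BernoulliNB`. The sketch below fits each on a tiny invented dataset of the matching type. The word-count matrix, heights, and labels are assumptions for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Word counts per email; columns = ["prize", "meeting"]; 1 = spam, 0 = not spam
X_counts = np.array([[3, 0], [2, 0], [4, 1], [0, 2], [0, 3], [1, 2]])
y_mail = np.array([1, 1, 1, 0, 0, 0])

mnb = MultinomialNB().fit(X_counts, y_mail)   # uses the counts themselves
bnb = BernoulliNB().fit(X_counts, y_mail)     # default binarize=0.0: count > 0 becomes 1

print("Multinomial:", mnb.predict([[5, 0]]))
print("Bernoulli:  ", bnb.predict([[5, 0]]))

# Bernoulli only sees presence/absence, so 5 occurrences look the same as 1
same_proba = np.allclose(bnb.predict_proba([[5, 0]]), bnb.predict_proba([[1, 0]]))
print("Bernoulli ignores counts:", same_proba)

# Continuous feature (height in cm) with two classes
X_height = np.array([[180.0], [175.0], [185.0], [160.0], [165.0], [158.0]])
y_group = np.array([1, 1, 1, 0, 0, 0])

gnb = GaussianNB().fit(X_height, y_group)
print("Per-class means:", gnb.theta_.ravel())   # one mean per class
print("Gaussian:", gnb.predict([[178.0]]))
```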



In [3]:
#10.  Breast Cancer Dataset
#     Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer
#     dataset and evaluate accuracy.
#     Hint: Use GaussianNB() from sklearn.naive_bayes and the Breast Cancer dataset from
#     sklearn.datasets.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Gaussian Naïve Bayes classifier
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions
y_pred = gnb.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)

print("Gaussian Naïve Bayes Accuracy:", accuracy)


Gaussian Naïve Bayes Accuracy: 0.9415204678362573
