Question 1: What is Information Gain, and how is it used in Decision Trees?

Information Gain is a metric that measures how much a feature reduces the uncertainty (entropy) of the target variable when it is used to split the data.

Definition

* Information Gain is defined as the reduction in entropy of the dataset after splitting on a particular attribute.

* Mathematically, for a dataset $T$ and attribute $a$: $IG(T, a) = H(T) - H(T \mid a)$, where $H(T)$ is the entropy before the split and $H(T \mid a)$ is the conditional entropy after splitting on attribute $a$.

Entropy and Intuition

* Entropy H(T) measures the impurity or randomness of class labels in a dataset; higher entropy means more mixed classes, lower entropy means purer classes.

* A split that creates child nodes with purer class distributions (lower entropy) yields higher Information Gain, meaning the feature is more informative for classification.
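
As a concrete illustration of the definitions above, the following sketch computes entropy and Information Gain with NumPy. The helper names `entropy` and `information_gain` and the toy label lists are assumptions made for this example, not part of any library.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [0, 0, 0, 0, 1, 1, 1, 1]            # evenly mixed classes
pure_split = [[0, 0, 0, 0], [1, 1, 1, 1]]    # each child holds a single class
mixed_split = [[0, 0, 1, 1], [0, 0, 1, 1]]   # children are as mixed as the parent

print("Parent entropy:", entropy(parent))
print("IG of pure split:", information_gain(parent, pure_split))
print("IG of mixed split:", information_gain(parent, mixed_split))
```

The pure split removes all uncertainty, so its gain equals the parent entropy, while the mixed split leaves the class distribution unchanged and gains nothing.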

Use in Decision Trees

* In decision tree algorithms such as ID3 and C4.5, Information Gain is used as a splitting criterion at each node. For every candidate attribute, the algorithm computes its Information Gain with respect to the current node’s data.

* The attribute with the maximum Information Gain is chosen to split the node, and this process is repeated recursively on each child node until stopping conditions are met (e.g., pure nodes, depth limit).
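
The recursive procedure described above can be sketched in plain Python. This is a minimal ID3-style illustration, not a faithful reimplementation of ID3 or C4.5: the function names, the nested-dict tree representation, the `max_depth` stopping rule, and the toy weather data are all assumptions made for this example.

```python
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    # Group the labels by the value each row takes for `attr`.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs, depth=0, max_depth=3):
    # Stopping conditions: pure node, no attributes left, or depth limit reached.
    if len(set(labels)) == 1 or not attrs or depth == max_depth:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: pick the attribute with the highest Information Gain.
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    tree = {best: {}}
    for value in sorted({row[best] for row in rows}):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            remaining, depth + 1, max_depth,
        )
    return tree

rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "overcast", "windy": "no"},
    {"outlook": "overcast", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "play", "play", "play", "stay", "stay"]

tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
```

On this toy data `outlook` has the higher gain and becomes the root; only the `rainy` branch is still impure, so it is split again on `windy`.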

Practical Role

* At the root node, the feature with the highest Information Gain becomes the root split, creating branches that best separate the classes initially.

* This greedy selection using Information Gain guides the tree to shorter, more discriminative paths, improving classification performance, although it can be biased toward attributes with many distinct values (motivation for Gain Ratio in C4.5).
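
The bias toward many-valued attributes can be seen on a small example. The sketch below assumes the usual definition of Gain Ratio (Information Gain divided by the entropy of the attribute's own value distribution, often called split information); the attribute names `group` and `row_id` and the data are hypothetical.

```python
from collections import Counter
import math

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(attr_values, labels):
    groups = {}
    for value, label in zip(attr_values, labels):
        groups.setdefault(value, []).append(label)
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

def gain_ratio(attr_values, labels):
    # Split information: entropy of the attribute's value distribution.
    split_info = entropy(attr_values)
    return information_gain(attr_values, labels) / split_info if split_info > 0 else 0.0

labels = [1, 1, 1, 1, 0, 0, 0, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "a"]  # two values, nearly separates the classes
row_id = list(range(8))                            # unique per row, like an ID column

for name, attr in (("group", group), ("row_id", row_id)):
    print(name, round(information_gain(attr, labels), 4), round(gain_ratio(attr, labels), 4))
```

Plain Information Gain prefers `row_id` because every child node contains a single sample, whereas Gain Ratio penalizes its eight-way split and prefers `group`.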


Question 2: What is the difference between Gini Impurity and Entropy?

Hint: Directly compares the two main impurity measures, highlighting strengths,
weaknesses, and appropriate use cases.

Gini Impurity and Entropy serve as impurity measures in decision trees to assess node purity and select optimal splits, but they differ in formulas, computation, and behavior.

Gini Impurity is calculated as $1 - \sum_i p_i^2$, where $p_i$ is the probability of class $i$. It ranges from 0 (pure node) to 0.5 (maximum impurity for binary classes). Entropy is computed as $-\sum_i p_i \log_2(p_i)$. It ranges from 0 to 1 for binary classes (up to $\log_2 K$ for $K$ classes) and is rooted in information theory as a measure of uncertainty.
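
To make the two formulas and their ranges concrete, the sketch below evaluates both measures on a few class-probability vectors. The function names `gini` and `entropy` are chosen for this example only.

```python
import numpy as np

def gini(p):
    """Gini impurity: 1 - sum(p_i^2)."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

def entropy(p):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)), with 0 * log2(0) taken as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

for probs in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [1 / 3, 1 / 3, 1 / 3]):
    print(probs, "gini:", round(gini(probs), 4), "entropy:", round(entropy(probs), 4))
```

Both measures are zero for a pure node and largest for a uniform class distribution; with three equally likely classes the entropy exceeds 1 bit.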

Computation Differences

Gini needs only squaring and summation, with no logarithms, so it is slightly cheaper to compute, which can matter on large datasets. Entropy involves logarithmic terms, which makes it somewhat slower to evaluate, but it responds more sharply to changes in the class probabilities.

Strengths and Weaknesses

Gini is efficient and is the default criterion in CART and scikit-learn. It tends to isolate the most frequent class in its own branch, which can make it less effective on imbalanced data. Entropy, used in ID3 and C4.5, has an information-theoretic justification and tends to produce somewhat more balanced trees, at a slightly higher computational cost. In practice the two criteria often select very similar splits.

Use Cases

Gini suits high-dimensional or large-scale training, such as random forests, where speed matters. Entropy fits smaller, balanced datasets that need precise, information-based splits.



Question 3: What is Pre-Pruning in Decision Trees?

Pre-Pruning in decision trees is a technique that stops tree growth early during construction to prevent overfitting by applying predefined stopping criteria at each potential split.

Definition and Purpose
Pre-Pruning, also called early stopping, halts the recursive splitting process before the tree becomes fully grown, avoiding complex structures that memorize training data noise. It uses heuristics to check conditions like minimum impurity decrease or sample size before creating child nodes.

Common Techniques

* Maximum tree depth limits overall height, preventing deep, overly specific branches.

* Minimum samples per split or leaf ensures nodes have sufficient data, discarding trivial splits.

* Minimum impurity decrease requires splits to reduce Gini or entropy by a threshold amount.
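
The techniques listed above map onto constructor parameters of scikit-learn's `DecisionTreeClassifier`. The following is a minimal sketch on the Iris dataset; the specific threshold values are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Unconstrained tree: grows until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: growth stops as soon as any criterion would be violated.
pre_pruned = DecisionTreeClassifier(
    max_depth=3,                 # maximum tree depth
    min_samples_split=10,        # minimum samples required to split a node
    min_samples_leaf=5,          # minimum samples required in each leaf
    min_impurity_decrease=0.01,  # minimum weighted impurity reduction per split
    random_state=42,
).fit(X_train, y_train)

print("Full tree depth / leaves:", full_tree.get_depth(), full_tree.get_n_leaves())
print("Pre-pruned depth / leaves:", pre_pruned.get_depth(), pre_pruned.get_n_leaves())
print("Pre-pruned test accuracy:", pre_pruned.score(X_test, y_test))
```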

Advantages and Risks

Pre-Pruning is computationally efficient since it avoids building then trimming a full tree, making it suitable for large datasets. However, it risks underfitting by stopping too early, missing potentially beneficial deeper splits (known as the horizon effect).

Comparison to Post-Pruning

Unlike post-pruning, which trims a complete tree afterward, pre-pruning keeps trees smaller from the start and is exposed directly in libraries like scikit-learn through parameters such as max_depth or min_samples_split.
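
The contrast can be sketched in scikit-learn. The post-pruning half uses minimal cost-complexity pruning (`ccp_alpha` and `cost_complexity_pruning_path`), which is one post-pruning method among several and is not discussed in the text above; picking the middle value of the alpha path is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: the depth limit is enforced while the tree is being grown.
pre_pruned = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_train, y_train)

# Post-pruning: grow the full tree, then collapse subtrees whose
# cost-complexity trade-off falls below the chosen alpha.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, model in (("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)):
    print(name, "nodes:", model.tree_.node_count, "test accuracy:", model.score(X_test, y_test))
```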


Question 4: Write a Python program to train a Decision Tree Classifier using Gini
Impurity as the criterion and print the feature importances (practical).

Hint: Use criterion='gini' in DecisionTreeClassifier and access .feature_importances_.
(Include your Python code and output in the code box below.)

In [1]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the Iris dataset (standard classification dataset)
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree Classifier using Gini Impurity criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Print feature importances
print("Feature Importances (Gini-based):")
for i, importance in enumerate(clf.feature_importances_):
    print(f"{feature_names[i]}: {importance:.4f}")

print("\nModel Accuracy on Test Set:", clf.score(X_test, y_test))


Feature Importances (Gini-based):
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876

Model Accuracy on Test Set: 1.0


Question 5: What is a Support Vector Machine (SVM)?

Support Vector Machine (SVM) is a supervised machine learning algorithm that finds the optimal hyperplane to separate data points of different classes while maximizing the margin between them.

Core Concept
SVM constructs a decision boundary (hyperplane) in feature space that best divides classes, with support vectors being the closest points to this boundary that define its position. The algorithm prioritizes the widest possible margin for better generalization and lower overfitting risk.

Key Components

* Hyperplane: The separating line (2D), plane (3D), or higher-dimensional equivalent represented as w⋅x+b=0.

* Margin: Distance from hyperplane to nearest support vectors, maximized during training for robustness.

* Support Vectors: Critical training points lying on margin boundaries; only these influence the model.
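
The three components above can be inspected on a fitted linear SVM. The sketch below uses a tiny, linearly separable toy dataset (an assumption for this example) and a large `C` to approximate a hard margin; `coef_`, `intercept_`, and `support_vectors_` are attributes of scikit-learn's `SVC`.

```python
import numpy as np
from sklearn.svm import SVC

# Two classes separated along the first axis.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear", C=1000.0).fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]   # offset of the hyperplane
margin = 1.0 / np.linalg.norm(w)  # distance from the hyperplane to the nearest points

print("w:", w, "b:", b)
print("Margin (hyperplane to nearest points):", margin)
print("Support vectors:\n", clf.support_vectors_)
```

For this data the separating hyperplane is the vertical line at x = 1, halfway between the two classes, so the margin on each side is approximately 1.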

Handling Nonlinear Data

For non-linearly separable data, SVM uses the kernel trick, with kernels such as polynomial or RBF, to implicitly map data into a higher-dimensional space where linear separation becomes possible, without computing the transformation explicitly. This enables effective classification of complex patterns, as in image or text data.

Applications and Strengths

SVM excels in high-dimensional spaces, binary classification, and scenarios with clear margins, such as text categorization or bioinformatics, while being memory-efficient by relying solely on support vectors. It supports both classification (SVC) and regression (SVR) variants.

Question 6: What is the Kernel Trick in SVM?

The Kernel Trick in SVM enables handling non-linearly separable data by implicitly mapping input features to a higher-dimensional space where linear separation becomes possible, without explicitly computing the costly transformation.

Core Mechanism

The kernel trick replaces dot products in the SVM dual formulation with a kernel function $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$, where $\phi$ maps data to higher dimensions. This computes similarity in the transformed space efficiently, avoiding direct feature mapping that could be computationally prohibitive.
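
A small numeric check makes the identity tangible. The sketch below assumes the homogeneous degree-2 polynomial kernel $K(u, v) = (u \cdot v)^2$ on 2-D inputs, for which one explicit feature map is $\phi(v) = (v_1^2, \sqrt{2} v_1 v_2, v_2^2)$; the function names are illustrative.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector."""
    v1, v2 = v
    return np.array([v1 ** 2, np.sqrt(2) * v1 * v2, v2 ** 2])

def poly2_kernel(u, v):
    """Same inner product, computed directly in the original 2-D space."""
    return float(np.dot(u, v) ** 2)

x_i = np.array([1.0, 2.0])
x_j = np.array([3.0, -1.0])

explicit = float(np.dot(phi(x_i), phi(x_j)))  # map first, then take the dot product
implicit = poly2_kernel(x_i, x_j)             # kernel trick: never builds phi

print("Explicit mapping:", explicit)
print("Kernel function: ", implicit)
```

Both routes give the same value, but the kernel never constructs the higher-dimensional vectors.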

Common Kernel Functions

* Linear kernel: $K(x, y) = x \cdot y$ for linearly separable data.

* Polynomial kernel: $K(x, y) = (x \cdot y + c)^d$ for polynomial boundaries.

* RBF (Gaussian) kernel: $K(x, y) = \exp(-\gamma \|x - y\|^2)$ for complex, smooth decision surfaces.
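
The three kernels listed above can be written directly in NumPy and compared against the helpers in `sklearn.metrics.pairwise`. The parameter values (`c=1.0`, `d=3`, `gamma=0.5`) and the two sample vectors are arbitrary choices for this sketch; scikit-learn's polynomial kernel also multiplies the dot product by `gamma`, so it is set to 1 here to match the formula in the list.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

def linear(x, y):
    return float(x @ y)

def polynomial(x, y, c=1.0, d=3):
    return float((x @ y + c) ** d)

def rbf(x, y, gamma=0.5):
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
X, Y = x.reshape(1, -1), y.reshape(1, -1)  # pairwise helpers expect 2-D arrays

print("linear:    ", linear(x, y), linear_kernel(X, Y)[0, 0])
print("polynomial:", polynomial(x, y), polynomial_kernel(X, Y, degree=3, gamma=1.0, coef0=1.0)[0, 0])
print("rbf:       ", rbf(x, y), rbf_kernel(X, Y, gamma=0.5)[0, 0])
```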

Advantages

It allows SVMs to create non-linear classifiers while maintaining the max-margin property and computational efficiency, relying only on support vectors. The approach scales well for high-dimensional implicit spaces, making SVM effective for images, text, and other non-linear problems.

Question 7: Write a Python program to train two SVM classifiers with Linear and RBF
kernels on the Wine dataset, then compare their accuracies.

Hint: Use SVC(kernel='linear') and SVC(kernel='rbf'), then compare accuracy scores after fitting
on the same dataset.

(Include your Python code and output in the code box below.)

In [2]:
# Import necessary libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target
feature_names = wine.feature_names

# Split data into training and test sets (same split for fair comparison)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train SVM with Linear kernel
svm_linear = SVC(kernel='linear', random_state=42)
svm_linear.fit(X_train, y_train)
y_pred_linear = svm_linear.predict(X_test)
accuracy_linear = accuracy_score(y_test, y_pred_linear)

# Train SVM with RBF kernel
svm_rbf = SVC(kernel='rbf', random_state=42)
svm_rbf.fit(X_train, y_train)
y_pred_rbf = svm_rbf.predict(X_test)
accuracy_rbf = accuracy_score(y_test, y_pred_rbf)

# Print results
print("SVM Comparison on Wine Dataset:")
print(f"Linear Kernel Accuracy: {accuracy_linear:.4f}")
print(f"RBF Kernel Accuracy:    {accuracy_rbf:.4f}")
print(f"Best Kernel: {'RBF' if accuracy_rbf > accuracy_linear else 'Linear'}")


SVM Comparison on Wine Dataset:
Linear Kernel Accuracy: 0.9815
RBF Kernel Accuracy:    0.7593
Best Kernel: Linear
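
A likely reason for the lower RBF score is that the Wine features are on very different scales, and the RBF kernel depends on Euclidean distances between samples. The sketch below is a follow-up experiment rather than part of the original answer: it assumes the same split and wraps the RBF classifier in a pipeline with `StandardScaler`, so the scaler is fitted on the training data only.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Standardize each feature to zero mean and unit variance before the RBF SVM.
scaled_rbf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scaled_rbf.fit(X_train, y_train)
accuracy_scaled_rbf = scaled_rbf.score(X_test, y_test)

print(f"RBF Kernel Accuracy (standardized features): {accuracy_scaled_rbf:.4f}")
```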


Question 8: What is the Naïve Bayes classifier, and why is it called "Naïve"?

Naive Bayes is a probabilistic supervised learning algorithm for classification that applies Bayes' Theorem to predict class labels by calculating posterior probabilities based on feature observations.

Core Mechanism

The classifier computes $P(C_k \mid x) = \frac{P(C_k) \prod_i P(x_i \mid C_k)}{P(x)}$ for each class $C_k$ and selects the class with the maximum posterior probability, where $P(C_k)$ is the prior and $P(x_i \mid C_k)$ are the conditional likelihoods. Variants like Gaussian (continuous features), Multinomial (counts), and Bernoulli (binary) handle different data types.

Why "Naive"?
It earns the "naive" label due to its strong assumption of conditional independence between all features given the class, meaning $P(x_1, x_2 \mid C_k) = P(x_1 \mid C_k) P(x_2 \mid C_k)$. This unrealistic simplification ignores feature correlations but enables fast, scalable computation with closed-form parameter estimation.
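
A hand-computed example shows how the independence assumption turns the posterior into a simple product. All probabilities and feature names below are made-up numbers for illustration, not estimates from real data.

```python
# Hypothetical priors and per-class feature likelihoods.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"contains_link": 0.8, "contains_offer": 0.5},
    "ham": {"contains_link": 0.1, "contains_offer": 0.3},
}
observed = ["contains_link", "contains_offer"]  # both features are present

# Naive assumption: multiply the per-feature likelihoods within each class.
joint = {}
for cls, prior in priors.items():
    score = prior
    for feature in observed:
        score *= likelihoods[cls][feature]
    joint[cls] = score

evidence = sum(joint.values())  # P(x), the normalizing constant
posterior = {cls: joint[cls] / evidence for cls in joint}
prediction = max(posterior, key=posterior.get)

print("Posterior:", posterior)
print("Prediction:", prediction)
```

Because $P(x)$ is the same for every class, the predicted class could also be read off the unnormalized products in `joint`.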

Strengths and Applications

Naive Bayes trains quickly on high-dimensional data like text and performs well in spam filtering, sentiment analysis, and document classification, even though the independence assumption rarely holds exactly. In principle it can handle a missing feature value by omitting the corresponding likelihood term, and it often performs reliably even with limited training samples.

Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve
Bayes, and Bernoulli Naïve Bayes.

Gaussian, Multinomial, and Bernoulli Naive Bayes are variants of the Naive Bayes classifier that differ primarily in the probability distribution they assume for features.

Gaussian Naive Bayes assumes continuous features follow a normal (Gaussian) distribution, using the per-class mean and variance to model $P(x_i \mid y)$. Multinomial Naive Bayes handles discrete count data (like word frequencies), assuming a multinomial distribution suitable for non-negative integers. Bernoulli Naive Bayes works with binary features (0/1 presence/absence), modeling them via Bernoulli trials and explicitly penalizing absent features.


Key Differences
Gaussian suits real-valued data like sensor readings or measurements (e.g., Iris dataset). Multinomial excels in text classification with term counts or TF-IDF vectors (e.g., spam detection). Bernoulli fits binary/Boolean data like bag-of-words presence (document classification) but ignores frequencies.

Formulas and Computation

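* Gaussian: $P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$, where $\mu_y$ and $\sigma_y^2$ are the mean and variance of feature $i$ within class $y$.

* Multinomial: $P(x \mid y) \propto \prod_i \theta_{yi}^{x_i}$, where $x_i$ is the count of feature $i$ and $\theta_{yi}$ is the (typically smoothed) relative frequency of feature $i$ in class $y$; in practice the computation uses log probabilities.

* Bernoulli: $P(x_i \mid y) = p_y^{x_i} (1 - p_y)^{1 - x_i}$, where $x_i \in \{0, 1\}$ and $p_y$ is the probability that feature $i$ is present in class $y$.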
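
The per-feature likelihoods can be written as small Python functions. The Gaussian and Bernoulli functions follow the formulas in the list; the Multinomial helper uses a Laplace-smoothed relative frequency, which is a common formulation but an assumption of this sketch, and all function names and input values are illustrative.

```python
import math

def gaussian_likelihood(x, mean, var):
    """Normal density of x given the class mean and variance of the feature."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def bernoulli_likelihood(x, p):
    """Probability of a binary feature value x in {0, 1} when P(x = 1 | y) = p."""
    return (p ** x) * ((1 - p) ** (1 - x))

def multinomial_feature_prob(feature_count, total_count, n_features, alpha=1.0):
    """Smoothed relative frequency of one feature within a class."""
    return (feature_count + alpha) / (total_count + alpha * n_features)

print("Gaussian, x at the class mean, unit variance:", gaussian_likelihood(0.0, 0.0, 1.0))
print("Bernoulli, feature present, p = 0.7:", bernoulli_likelihood(1, 0.7))
print("Bernoulli, feature absent, p = 0.7:", bernoulli_likelihood(0, 0.7))
print("Multinomial, 5 of 6 counts, 3 features:", multinomial_feature_prob(5, 6, 3))
```

The Bernoulli case shows the penalty for absence: an absent feature still contributes the factor $1 - p$ rather than being ignored.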

Use Cases

Choose Gaussian for continuous datasets, Multinomial for frequency-based discrete data like NLP, and Bernoulli for binary feature vectors; scikit-learn implements all three for easy selection.
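
The selection rule above can be sketched with scikit-learn's three estimators. The tiny arrays are toy data invented for this example: continuous measurements for `GaussianNB`, term counts for `MultinomialNB`, and the same counts reduced to presence/absence by `BernoulliNB` (its `binarize` threshold defaults to 0.0).

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous features -> Gaussian Naive Bayes.
X_continuous = np.array([[1.0, 2.1], [0.9, 1.9], [3.0, 4.2], [3.1, 3.9]])
gaussian_pred = GaussianNB().fit(X_continuous, y).predict([[1.0, 2.0]])

# Term counts -> Multinomial Naive Bayes.
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 2, 3], [0, 3, 1]])
multinomial_pred = MultinomialNB().fit(X_counts, y).predict([[2, 0, 1]])

# Presence/absence -> Bernoulli Naive Bayes (counts > 0 are treated as 1).
bernoulli_pred = BernoulliNB().fit(X_counts, y).predict([[2, 0, 1]])

print("GaussianNB:   ", gaussian_pred)
print("MultinomialNB:", multinomial_pred)
print("BernoulliNB:  ", bernoulli_pred)
```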


Question 10: Breast Cancer Dataset
Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer
dataset and evaluate accuracy.

Hint: Use GaussianNB() from sklearn.naive_bayes and the Breast Cancer dataset from
sklearn.datasets.
(Include your Python code and output in the code box below.)

In [4]:
# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Load the Breast Cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
feature_names = cancer.feature_names

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Gaussian Naive Bayes classifier
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions
y_pred = gnb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Gaussian Naive Bayes on Breast Cancer Dataset")
print(f"Dataset shape: {X.shape} (samples, features)")
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=cancer.target_names))
print("\nMean of feature 0 (mean radius) by class:")
print(f"Malignant mean (feature 0): {gnb.theta_[0][0]:.3f}")
print(f"Benign mean (feature 0): {gnb.theta_[1][0]:.3f}")


Gaussian Naive Bayes on Breast Cancer Dataset
Dataset shape: (569, 30) (samples, features)
Accuracy: 0.9415

Classification Report:
              precision    recall  f1-score   support

   malignant       0.93      0.90      0.92        63
      benign       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171


Mean of feature 0 (mean radius) by class:
Malignant mean (feature 0): 17.431
Benign mean (feature 0): 12.229
