## Question 1: What is Information Gain, and how is it used in Decision Trees?

**Answer (theory):**

Information Gain is a metric from information theory that measures the reduction in uncertainty about a target variable after observing a feature. In decision trees, each node is split on the feature (and threshold) that maximizes Information Gain: a good split should produce child nodes that are purer, i.e., more homogeneous with respect to class labels, than their parent. Numerically, Information Gain is the entropy of the parent node minus the weighted sum of the entropies of the child nodes, where each child is weighted by the fraction of the parent's samples it receives. Entropy quantifies impurity or disorder: it is zero when a node contains a single class and highest when the classes are evenly mixed. When constructing a decision tree, the algorithm evaluates candidate splits for each feature, computes the Information Gain of each, and selects the split with the highest gain. This greedy approach builds trees top-down, so the most informative features are used near the root. Information Gain favors features that partition the data well, but it is biased toward features with many distinct values (an ID-like feature can split the data into many tiny, pure subsets); techniques such as gain ratio are sometimes used to correct this bias.
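As a minimal sketch of the calculation described above, the snippet below computes entropy and Information Gain for one candidate split. The toy labels and the helper names `entropy` and `information_gain` are made up for illustration and are not part of any library.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# Hypothetical toy node: 5 positives and 5 negatives (maximally mixed).
parent = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
# Candidate split: 4 positives + 1 negative go left, the rest go right.
left = np.array([1, 1, 1, 1, 0])
right = np.array([1, 0, 0, 0, 0])

print(f"Parent entropy:   {entropy(parent):.4f}")                           # 1.0000
print(f"Information gain: {information_gain(parent, [left, right]):.4f}")   # 0.2781
```

A tree builder would repeat this computation for every candidate split and keep the one with the largest gain.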

## Question 2: What is the difference between Gini Impurity and Entropy?

**Answer (theory):**

Gini Impurity and Entropy are two impurity measures used to evaluate the quality of splits in classification decision trees. Entropy (from information theory) measures the average amount of information needed to identify the class of a randomly drawn sample; it is computed as the negative sum of each class proportion times its logarithm and is maximized when classes are equally probable. Gini Impurity measures the probability of misclassifying a randomly chosen sample if it were labeled at random according to the class distribution in that node; it equals one minus the sum of the squared class proportions. Both are zero for a pure node and largest for an evenly mixed node, so in practice they prefer similar splits, but they differ numerically: entropy involves logarithms and changes more steeply as class probabilities approach 0 or 1, while Gini is simpler and cheaper to compute (no logarithms), which can make training slightly faster on large datasets. Entropy (Information Gain) is grounded in information theory and may be preferable when splits are interpreted in terms of information. In many real-world tasks the difference in accuracy is minor, and the choice is often driven by the algorithm implementation, interpretability, or computational constraints.
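To make the comparison concrete, here is a small sketch that evaluates both measures on a few two-class distributions. The function names `gini` and `entropy` are illustrative helpers, not library calls.

```python
import numpy as np

def gini(p):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

def entropy(p):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Sweep from an evenly mixed node toward an almost pure node.
for p1 in (0.5, 0.7, 0.9, 0.99):
    dist = [p1, 1.0 - p1]
    print(f"p = {p1:.2f}  gini = {gini(dist):.4f}  entropy = {entropy(dist):.4f}")
```

Both values fall toward zero as the node becomes purer; for two classes Gini peaks at 0.5 and entropy at 1 bit.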

## Question 3: What is Pre-Pruning in Decision Trees?

**Answer (theory):**

Pre-pruning (also called early stopping) is a technique used during the construction of a decision tree to halt growth before the tree fully fits the training data. The goal is to avoid overfitting, which occurs when a tree becomes overly complex and models noise rather than underlying patterns. Pre-pruning imposes constraints such as a maximum depth, a minimum number of samples required to split a node, a minimum number of samples per leaf, a maximum number of leaf nodes, or a threshold on impurity decrease. During construction, if a candidate split violates these constraints or does not yield a sufficient improvement in impurity, the split is rejected and the node remains a leaf. Pre-pruning reduces model complexity, can improve generalization, and speeds up training, but it risks underfitting if the constraints are too strict. Selecting appropriate pre-pruning hyperparameters typically requires cross-validation. Compared to post-pruning (which first builds a full tree and then prunes nodes), pre-pruning avoids constructing very large trees but may miss beneficial deeper structure that would only be revealed by further splits.
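The constraints listed above map onto constructor parameters of scikit-learn's `DecisionTreeClassifier`. The sketch below compares an unconstrained tree with a pre-pruned one; the dataset choice and the specific constraint values are illustrative assumptions, not tuned settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Unconstrained tree: keeps splitting until the leaves are pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: growth stops as soon as any constraint is hit.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,                  # maximum depth
    min_samples_split=10,         # minimum samples required to split a node
    min_samples_leaf=5,           # minimum samples per leaf
    max_leaf_nodes=8,             # maximum number of leaf nodes
    min_impurity_decrease=0.001,  # threshold on impurity decrease
    random_state=42,
).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pruned_tree)]:
    print(f"{name:>10}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
          f"test accuracy={tree.score(X_test, y_test):.4f}")
```

In practice these values would be chosen with cross-validation (for example via `GridSearchCV`) rather than fixed by hand.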

## Question 4: Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).

```python
# Q4: Decision Tree with Gini Impurity - feature importances
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X, y)
importances = clf.feature_importances_
print("Feature importances (Iris dataset):")
for i, imp in enumerate(importances):
    print(f"Feature {i}: {imp:.4f}")
```

```text
Feature importances (Iris dataset):
Feature 0: 0.0133
Feature 1: 0.0000
Feature 2: 0.5641
Feature 3: 0.4226
```


## Question 5: What is a Support Vector Machine (SVM)?

**Answer (theory):**

A Support Vector Machine (SVM) is a supervised learning algorithm used primarily for classification and also for regression. An SVM seeks the decision boundary (hyperplane) that separates the classes with the maximum margin, where the margin is the distance between the hyperplane and the nearest training points of any class; those nearest points are called support vectors. Maximizing the margin often improves generalization on unseen data. For linearly separable data, the SVM finds the unique maximum-margin hyperplane; for non-separable or noisy data, it uses slack variables to allow some margin violations and a regularization parameter (C) to balance margin size against the misclassification penalty: a large C penalizes violations heavily and narrows the margin, while a small C tolerates more violations in exchange for a wider margin. For non-linear problems, SVMs are extended with kernels that implicitly map inputs into higher-dimensional feature spaces where a linear separator may exist. Popular kernels include linear, polynomial, and radial basis function (RBF). SVMs remain effective in high-dimensional spaces and work well when the classes are separated by a clear margin. Because the decision function depends only on the support vectors, a trained model can be memory efficient on many datasets.
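A short sketch of how the regularization parameter C interacts with the margin, using the number of support vectors as a proxy for margin width. The synthetic data from `make_blobs` and the chosen C values are assumptions for illustration only.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data with some overlap between the clusters.
X, y = make_blobs(
    n_samples=200, centers=[[0, 0], [3, 3]], cluster_std=1.5, random_state=42
)

sv_counts = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # n_support_ holds the number of support vectors for each class.
    sv_counts[C] = int(clf.n_support_.sum())
    print(f"C={C:<6} support vectors={sv_counts[C]:>3}  "
          f"training accuracy={clf.score(X, y):.4f}")
```

A smaller C typically widens the margin, so more training points fall on or inside it and become support vectors.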

## Question 6: What is the Kernel Trick in SVM?

**Answer (theory):**

The Kernel Trick is a technique that enables Support Vector Machines to perform non-linear classification without explicitly mapping data into a higher-dimensional space. Instead of computing the coordinates of the data in an expanded feature space, it uses a kernel function to compute the inner products between pairs of data points as if they had been mapped into that space. Because many algorithms (including SVM) can be expressed purely in terms of dot products between samples, replacing the dot product with a kernel function performs the mapping implicitly. Common kernel functions include the linear kernel, the polynomial kernel, and the Radial Basis Function (RBF) kernel. The RBF kernel, for example, computes a similarity that decays with the squared distance between two points and corresponds to an infinite-dimensional feature mapping. The kernel trick allows SVMs to learn complex decision boundaries while avoiding the computational cost of explicit transformations. Choosing an appropriate kernel and its hyperparameters is crucial: the choice determines the shape of the decision boundary and influences the bias–variance trade-off.
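The equivalence between a kernel and an explicit feature map can be checked numerically for a simple case. The sketch below uses the homogeneous degree-2 polynomial kernel on 2-D inputs; the helper names `phi` and `poly_kernel` and the sample vectors are illustrative.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (x1, x2)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: k(a, b) = (a . b)^2."""
    return float(np.dot(a, b) ** 2)

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

explicit = float(np.dot(phi(a), phi(b)))  # inner product in the 3-D feature space
implicit = poly_kernel(a, b)              # same quantity, computed in the 2-D input space

print(f"explicit mapping: {explicit:.6f}")
print(f"kernel function:  {implicit:.6f}")
```

Both routes give 121 (up to floating-point rounding), but the kernel never constructs the mapped vectors, which is what makes high- or infinite-dimensional mappings such as RBF tractable.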

## Question 7: Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.

```python
# Q7: SVM with Linear and RBF kernels on Wine dataset - compare accuracies
from sklearn.datasets import load_wine
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

svm_lin = SVC(kernel='linear', random_state=42)
svm_rbf = SVC(kernel='rbf', random_state=42)

svm_lin.fit(X_train, y_train)
svm_rbf.fit(X_train, y_train)

y_pred_lin = svm_lin.predict(X_test)
y_pred_rbf = svm_rbf.predict(X_test)

acc_lin = accuracy_score(y_test, y_pred_lin)
acc_rbf = accuracy_score(y_test, y_pred_rbf)

print(f"Linear SVM accuracy: {acc_lin:.4f}")
print(f"RBF SVM accuracy:    {acc_rbf:.4f}")
```

```text
Linear SVM accuracy: 0.9556
RBF SVM accuracy:    0.7111
```
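
A likely reason for the gap between the two kernels is that the Wine features are on very different scales, and the RBF kernel is distance-based, so large-valued features dominate it. The sketch below repeats the comparison with standardized features; the use of `StandardScaler` in a pipeline is an addition to the original exercise, and no output is reproduced here.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

scaled_acc = {}
for kernel in ("linear", "rbf"):
    # The scaler is fitted on the training split only, then applied to the test split.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    model.fit(X_train, y_train)
    scaled_acc[kernel] = model.score(X_test, y_test)
    print(f"{kernel:>6} SVM with scaling: accuracy = {scaled_acc[kernel]:.4f}")
```

Putting the scaler inside the pipeline keeps test-set statistics out of training, so the comparison stays fair.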


## Question 8: What is the Naïve Bayes classifier, and why is it called "Naïve"?

**Answer (theory):**

The Naïve Bayes classifier is a probabilistic supervised learning method based on Bayes’ theorem. It models the posterior probability of each class given the features by combining the prior class probability with the likelihood of the features under that class. The 'naïve' part stems from the strong assumption that all features are conditionally independent given the class label, an assumption that rarely holds in real data. Despite this simplification, Naïve Bayes performs surprisingly well in many domains (especially text classification) because it needs relatively few training samples, trains quickly, and handles high-dimensional input efficiently. During prediction, the model multiplies the class prior by the product of the per-feature likelihoods (in practice it adds the log prior to the sum of the log-likelihoods, for numerical stability) and selects the class with the highest posterior probability. The conditional independence assumption reduces computational complexity, and because the model parameters are simple per-class statistics, they can be updated incrementally for streaming data; however, when features are heavily correlated, performance may degrade. Still, its simplicity and interpretability make Naïve Bayes a strong baseline classifier.
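The prediction rule described above can be written out by hand. The sketch below assumes Gaussian per-feature likelihoods (one possible choice) on the Iris dataset and checks the result against scikit-learn's `GaussianNB`; the helper `log_posterior` is illustrative, not a library function.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Per-class prior, plus mean and variance of every feature within each class.
priors = np.array([np.mean(y == c) for c in classes])
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])

def log_posterior(x):
    """Unnormalized log posterior per class: log prior + sum of per-feature log-likelihoods."""
    log_lik = -0.5 * (np.log(2 * np.pi * variances) + (x - means) ** 2 / variances)
    return np.log(priors) + log_lik.sum(axis=1)  # the sum encodes the independence assumption

manual_pred = np.array([classes[np.argmax(log_posterior(x))] for x in X])
sklearn_pred = GaussianNB().fit(X, y).predict(X)

agreement = np.mean(manual_pred == sklearn_pred)
print(f"Agreement with GaussianNB: {agreement:.4f}")
```

Summing log-likelihoods across features is exactly where the conditional independence assumption enters; without it, a joint likelihood over all features would be needed.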

## Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes

**Answer (theory):**

Gaussian, Multinomial, and Bernoulli Naïve Bayes are specialized variants tailored to different data types and likelihood assumptions. Gaussian Naïve Bayes assumes continuous features are normally distributed within each class; the likelihood for each feature is modeled using a Gaussian with class-specific mean and variance. It's appropriate for real-valued inputs (e.g., sensor measurements). Multinomial Naïve Bayes models feature counts (non-negative integers) and assumes features follow a multinomial distribution; it is widely used in text classification with bag-of-words where features are word counts or term frequencies. Bernoulli Naïve Bayes models binary-valued features (presence/absence); it uses a Bernoulli distribution per feature and is useful when only whether a feature occurs matters rather than how often (e.g., binary word occurrence). Each variant uses Bayes’ theorem but differs in the likelihood calculation to match data characteristics; choosing the correct variant improves performance and aligns model assumptions with the input data type.
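As a rough illustration of matching the variant to the data type, the sketch below fits each one on a tiny, made-up dataset: word counts for `MultinomialNB`, the same counts binarized for `BernoulliNB`, and continuous measurements for `GaussianNB`. The corpus, labels, and numbers are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical toy corpus, made up for illustration.
docs = ["free prize money now", "win free money", "meeting at noon", "lunch meeting tomorrow"]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)        # bag-of-words counts
new_doc = vectorizer.transform(["free money"])

# Multinomial: models how often each word occurs.
mnb_pred = MultinomialNB().fit(counts, labels).predict(new_doc)[0]
# Bernoulli: binarizes the input (binarize=0.0 by default), so only presence/absence matters.
bnb_pred = BernoulliNB().fit(counts, labels).predict(new_doc)[0]

# Gaussian: continuous features, one mean and variance per feature and class.
X_cont = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.2], [4.1, 3.9]])
y_cont = np.array([0, 0, 1, 1])
gnb_pred = GaussianNB().fit(X_cont, y_cont).predict([[1.1, 2.0]])[0]

print(mnb_pred, bnb_pred, gnb_pred)
```

On real text data the Multinomial and Bernoulli variants can disagree, since Bernoulli also penalizes the absence of words that are typical for a class.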

## Question 10: Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate its accuracy.

```python
# Q10: Gaussian Naive Bayes on Breast Cancer dataset
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"GaussianNB accuracy on Breast Cancer dataset: {acc:.4f}")
```

```text
GaussianNB accuracy on Breast Cancer dataset: 0.9371
```
