Question 1: What is Information Gain, and how is it used in Decision Trees?



Answer:

Information Gain (IG) is a key metric used to determine the most effective feature for splitting the data at each node of a decision tree. It is the splitting criterion of ID3; C4.5 uses a normalized variant of it, the gain ratio.

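It measures the reduction in entropy (uncertainty) achieved by splitting the dataset based on a specific feature. For a dataset $S$ and a feature $A$: $IG(S, A) = H(S) - \sum_{v} \frac{|S_{v}|}{|S|} H(S_{v})$, where $H$ is the entropy and $S_{v}$ is the subset of $S$ in which $A$ takes the value $v$.

A higher Information Gain means the split does a better job of separating the data into purer subsets (a perfectly pure node contains samples of a single class).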

Usage in Decision Trees: At each node, the algorithm calculates the Information Gain for every candidate feature and splits on the one with the highest value. This greedy, one-step-at-a-time choice places the most informative features near the root. It does not guarantee a globally optimal tree, but it is cheap to compute and usually produces a compact, accurate classifier.
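The calculation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used by any particular library; the function names and the toy features (`outlook`, `windy`) are assumptions made up for the example.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(labels, feature_values):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    labels = np.asarray(labels)
    feature_values = np.asarray(feature_values)
    weighted_child_entropy = 0.0
    for value in np.unique(feature_values):
        mask = feature_values == value
        # mask.mean() is the fraction of samples that fall into this child
        weighted_child_entropy += mask.mean() * entropy(labels[mask])
    return entropy(labels) - weighted_child_entropy

# Hypothetical toy data: 8 samples, a binary target, two categorical features.
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
outlook = np.array(["sun", "sun", "sun", "sun", "rain", "rain", "rain", "rain"])
windy = np.array(["y", "n", "y", "n", "y", "n", "y", "n"])

ig_outlook = information_gain(y, outlook)  # separates the classes perfectly
ig_windy = information_gain(y, windy)      # leaves both children 50/50

print(f"IG(outlook) = {ig_outlook:.3f}")
print(f"IG(windy)   = {ig_windy:.3f}")
```

A greedy tree builder would pick `outlook` here, because it has the larger gain.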

Question 2: What is the difference between Gini Impurity and Entropy?

Both Gini Impurity and Entropy are measures of node impurity (disorder) used by decision tree algorithms to evaluate the quality of a split.


Feature,Gini Impurity,Entropy
Formula,$G(E) = 1 - \sum_{j=1}^{c}p_{j}^{2}$,$H(E) = - \sum_{j=1}^{c}p_{j}\log_{2}p_{j}$
Algorithm,Primarily used by the CART (Classification and Regression Trees) algorithm.,Primarily used by the ID3 and C4.5 algorithms.
Computational Cost,Faster and simpler to calculate as it avoids logarithmic functions.,Slower and more computationally intensive due to the use of logarithmic functions.
Goal,Measures the probability of misclassifying a randomly chosen element.,Measures the uncertainty or randomness in the data.

In practice, they produce very similar decision trees, so the choice often comes down to computational efficiency.
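To make the two formulas concrete, the sketch below evaluates both measures on a few class distributions. The distributions are hypothetical values chosen for illustration; the helper names are not from any library.

```python
import numpy as np

def gini(p):
    """Gini impurity: 1 - sum_j p_j^2."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

def entropy(p):
    """Shannon entropy in bits: -sum_j p_j log2 p_j."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # by convention, 0 * log2(0) is treated as 0
    return float(np.sum(p * np.log2(1.0 / p)))

# Hypothetical class-probability vectors for a two-class node.
distributions = {
    "pure": [1.0, 0.0],
    "skewed": [0.9, 0.1],
    "balanced": [0.5, 0.5],
}

for name, p in distributions.items():
    print(f"{name:9s} gini={gini(p):.3f}  entropy={entropy(p):.3f}")
```

Both measures are zero for a pure node and reach their maximum for a balanced one (0.5 for Gini and 1 bit for entropy in the two-class case), which is why they tend to rank candidate splits similarly.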

Question 3: What is Pre-Pruning in Decision Trees?

Answer:

Pre-Pruning, also known as early stopping, is a technique used to prevent a decision tree from growing too large and becoming overfitted to the training data.

Instead of building the full tree and then trimming it back (post-pruning), pre-pruning imposes constraints that halt the tree's growth during the construction phase.

Common criteria for pre-pruning include:

Maximum Depth: Limiting the tree to a fixed number of levels.

Minimum Samples in a Node: Specifying the minimum number of data points a node must contain before it is allowed to split.

Minimum Impurity Decrease: Requiring the split to result in a reduction of impurity (e.g., Gini or Entropy) that exceeds a certain threshold. If the gain is too small, the split is rejected, and the node becomes a leaf.
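In scikit-learn, these three criteria correspond to the `max_depth`, `min_samples_split`, and `min_impurity_decrease` parameters of `DecisionTreeClassifier`. The sketch below compares an unconstrained tree with a pre-pruned one; the specific threshold values are arbitrary assumptions for illustration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unconstrained tree: grows until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# Pre-pruned tree: growth stops as soon as any constraint is hit.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,                 # at most 3 levels of splits
    min_samples_split=10,        # a node needs at least 10 samples to split
    min_impurity_decrease=0.01,  # reject splits that barely reduce impurity
    random_state=42,
).fit(X, y)

print("full tree:   depth", full_tree.get_depth(), "leaves", full_tree.get_n_leaves())
print("pruned tree: depth", pruned_tree.get_depth(), "leaves", pruned_tree.get_n_leaves())
```

In practice the thresholds are usually tuned with cross-validation rather than fixed by hand.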

Question 4: Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).



In [4]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# 1. Load the Iris dataset
data = load_iris()
X, y = data.data, data.target
feature_names = data.feature_names

# Split the data (good practice, though not strictly required for feature importance)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Initialize and train the Decision Tree Classifier with Gini
dt_classifier = DecisionTreeClassifier(criterion='gini', random_state=42)
dt_classifier.fit(X_train, y_train)

# 3. Print the feature importances
importances = dt_classifier.feature_importances_

print("Feature Importances:")
for name, importance in zip(feature_names, importances):
    # Format the importance to 4 decimal places for clean output
    print(f"  {name}: {importance:.4f}")

Feature Importances:
  sepal length (cm): 0.0000
  sepal width (cm): 0.0191
  petal length (cm): 0.8933
  petal width (cm): 0.0876


Question 5: What is a Support Vector Machine (SVM)?

Answer:

A Support Vector Machine (SVM) is a powerful supervised machine learning algorithm primarily used for classification, but also for regression.

Its main objective in a classification task is to find the optimal hyperplane that best separates the data points of different classes in a high-dimensional space.

The "optimal" hyperplane is the one that maximizes the margin, which is the distance between the hyperplane and the closest data points from each class.

The data points that lie closest to the hyperplane (and define the width of the margin) are called the Support Vectors. These vectors are the most critical elements in the training set, as they dictate the position and orientation of the decision boundary.
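The role of the support vectors can be inspected directly on a fitted model. The following sketch uses a small, hypothetical 2-D dataset that is linearly separable; a very large `C` is an assumption used here to approximate a hard-margin SVM. For a linear SVM with weight vector $w$, the margin width is $2 / \lVert w \rVert$.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data in two dimensions.
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                         # normal vector of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)   # distance between the two margin lines

print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin_width)
```

Only the points listed in `support_vectors_` determine the boundary; removing any of the other training points and refitting would leave the hyperplane unchanged.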

Question 6: What is the Kernel Trick in SVM?

Answer:

The Kernel Trick is a fundamental technique that allows SVMs to solve non-linear classification problems without explicitly transforming the data into a higher-dimensional space.

Problem: If data is not linearly separable in its original low-dimensional space (e.g., a ring of red dots surrounding a cluster of blue dots), no straight line (or flat hyperplane) can separate the classes.

Solution (The Trick): Instead of manually calculating the coordinates for every data point in a very high-dimensional feature space (which would be computationally expensive), the kernel trick uses a kernel function (like the Radial Basis Function (RBF) or Polynomial) to calculate the dot product between two points as if they were already in that higher dimension.

This implicit mapping allows the SVM to find a linear separation (the hyperplane) in the high-dimensional space, which corresponds to a complex, non-linear decision boundary in the original low-dimensional space.
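A small numerical check makes the idea tangible. The sketch below uses the homogeneous degree-2 polynomial kernel $K(a, b) = (a \cdot b)^2$ on 2-D inputs, for which an explicit feature map $\phi$ into three dimensions is known. The function names and sample vectors are assumptions chosen for illustration.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (maps into 3-D)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: (a . b)^2."""
    return float(np.dot(a, b) ** 2)

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

explicit = float(np.dot(phi(a), phi(b)))  # map first, then take the dot product
implicit = poly_kernel(a, b)              # never leaves the original 2-D space

print("explicit mapping:", explicit)
print("kernel function: ", implicit)
```

Both routes give the same number, but the kernel route never constructs the higher-dimensional vectors. For the RBF kernel the corresponding feature space is infinite-dimensional, so the implicit route is the only practical one.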

Question 7: Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies

In [5]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Load and prepare the Wine dataset
wine = load_wine()
X, y = wine.data, wine.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale the data (essential for optimal SVM performance)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 2. Linear Kernel SVM
svc_linear = SVC(kernel='linear', random_state=42)
svc_linear.fit(X_train_scaled, y_train)
y_pred_linear = svc_linear.predict(X_test_scaled)
accuracy_linear = accuracy_score(y_test, y_pred_linear)

# 3. RBF Kernel SVM
svc_rbf = SVC(kernel='rbf', random_state=42)
svc_rbf.fit(X_train_scaled, y_train)
y_pred_rbf = svc_rbf.predict(X_test_scaled)
accuracy_rbf = accuracy_score(y_test, y_pred_rbf)

# 4. Print comparison
print(f"Accuracy of SVM with Linear Kernel: {accuracy_linear:.4f}")
print(f"Accuracy of SVM with RBF Kernel: {accuracy_rbf:.4f}")
print("\nComparison:")
if accuracy_rbf > accuracy_linear:
    print("The RBF kernel performed better for this dataset.")
elif accuracy_linear > accuracy_rbf:
    print("The Linear kernel performed better for this dataset.")
else:
    print("Both kernels achieved the same accuracy.")

Accuracy of SVM with Linear Kernel: 0.9815
Accuracy of SVM with RBF Kernel: 0.9815

Comparison:
Both kernels achieved the same accuracy.


Question 8: What is the Naïve Bayes classifier, and why is it called "Naïve"?

Answer:

Naïve Bayes (NB) Classifier: It is a family of probabilistic classifiers based on Bayes' theorem. For a given data instance, it estimates the posterior probability of each class and predicts the class with the highest posterior (the Maximum A Posteriori, or MAP, decision rule). It is often used for text classification, spam filtering, and recommendation systems.

Why it is "Naïve": It gets its name from the "Naïve independence assumption". This assumption states that all features in the dataset are conditionally independent of each other, given the class variable. For example, in an email classification task, it assumes that the presence of the word "free" is independent of the presence of the word "money," given that the email is spam. In reality, this assumption is rarely true, which is why the model is called "Naïve"; even so, NB often performs surprisingly well in practice.
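The independence assumption is what makes the computation simple: the likelihood of an email is just the product of per-word probabilities. The sketch below works through the "free"/"money" example by hand. All probabilities are hypothetical numbers invented for illustration, and the names `prior`, `p_word_given_class`, and `posterior` are not from any library.

```python
# Hypothetical probabilities, chosen only for illustration.
prior = {"spam": 0.4, "ham": 0.6}
p_word_given_class = {
    "spam": {"free": 0.5, "money": 0.4},
    "ham": {"free": 0.1, "money": 0.05},
}

def posterior(words):
    """Posterior P(class | words) under the naive independence assumption."""
    scores = {}
    for label in prior:
        score = prior[label]
        for word in words:
            # Independence given the class: multiply per-word likelihoods.
            score *= p_word_given_class[label][word]
        scores[label] = score
    total = sum(scores.values())  # normalizing constant (the evidence)
    return {label: score / total for label, score in scores.items()}

result = posterior(["free", "money"])
prediction = max(result, key=result.get)  # MAP decision

print(result)
print("predicted class:", prediction)
```

With these numbers the unnormalized scores are 0.4 × 0.5 × 0.4 = 0.08 for spam and 0.6 × 0.1 × 0.05 = 0.003 for ham, so the email is classified as spam.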

Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes.

Variant,Feature Type,Distribution Assumption,Common Use Case
Gaussian NB,Continuous,Assumes the continuous feature values are sampled from a Normal (Gaussian) distribution for each class.,"Classification problems where features are continuous, such as the Iris or Breast Cancer datasets."
Multinomial NB,Discrete/Count,"Assumes a Multinomial distribution for the features, typically modeling discrete counts (e.g., how many times a word appears).","Text classification, where features are word counts or word frequencies."
Bernoulli NB,Binary (0 or 1),"Assumes a Multivariate Bernoulli distribution, where features are independent binary variables, indicating only the presence or absence of a feature (e.g., does a word exist, yes/no).","Text classification, often used when modeling the absence of terms is important."
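The three variants share the same scikit-learn interface and differ only in the kind of features they expect. The sketch below fits each one on synthetic data of the matching type. The data-generating choices (shift of 2, Poisson rates, threshold of 2) are assumptions made up for the example, and the scores are training accuracies, shown only to confirm that each model fits its feature type.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)

# Continuous features: class 1 is shifted by 2 standard deviations.
X_cont = rng.normal(loc=2.0 * y[:, None], scale=1.0, size=(100, 3))

# Count features: each class favours a different column (like a different word).
rates = np.where(y[:, None] == 0, [5.0, 1.0, 1.0], [1.0, 1.0, 5.0])
X_counts = rng.poisson(lam=rates)

# Binary features: only record whether a count exceeds 2.
X_binary = (X_counts > 2).astype(int)

models = {
    "GaussianNB": (GaussianNB(), X_cont),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_binary),
}

scores = {name: model.fit(X, y).score(X, y) for name, (model, X) in models.items()}
for name, score in scores.items():
    print(f"{name}: training accuracy = {score:.3f}")
```

Note that `MultinomialNB` looks at how counts are distributed across features, so it needs the classes to differ in their relative feature frequencies, not just in overall magnitude.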

Question 10: Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate accuracy.

In [6]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# 1. Load the Breast Cancer dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# 2. Split data into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Initialize and train the Gaussian Naïve Bayes classifier
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# 4. Make predictions and evaluate
y_pred = gnb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# 5. Print the result
print(f"Gaussian Naïve Bayes Classifier Accuracy: {accuracy:.4f}")

Gaussian Naïve Bayes Classifier Accuracy: 0.9415
