**Question 1: What is Information Gain, and how is it used in Decision Trees?**

Answer:

Information Gain is a metric used to decide which feature should be used to split the data at each node of a Decision Tree. It measures the reduction in uncertainty (entropy) after a dataset is split on a particular feature.

A higher Information Gain indicates a better feature for splitting, because the resulting child nodes are purer than the parent.

At each node, the Decision Tree chooses the feature with the maximum Information Gain.

Formula:
Information Gain = Entropy(parent) − Weighted Entropy(children)
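
The formula can be checked by hand on a small example. The sketch below is illustrative only: the helper names `entropy` and `information_gain` and the toy labels are assumptions, not part of any library.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """Entropy(parent) minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical toy node: 4 "yes" and 4 "no" labels.
parent = ["yes"] * 4 + ["no"] * 4

# A split that separates the classes completely, and one that separates nothing.
perfect_split = [["yes"] * 4, ["no"] * 4]
poor_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

gain_perfect = information_gain(parent, perfect_split)
gain_poor = information_gain(parent, poor_split)
print("Gain of perfect split:", gain_perfect)
print("Gain of uninformative split:", gain_poor)
```

The perfect split removes all uncertainty, so its gain equals the parent entropy, while the uninformative split leaves the class mix unchanged and gains nothing.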

**Question 2: What is the difference between Gini Impurity and Entropy?
Hint: Directly compares the two main impurity measures, highlighting strengths,
weaknesses, and appropriate use cases**

Answer:

Gini Impurity and Entropy both measure how mixed the classes are at a node of a decision tree. Gini Impurity is slightly cheaper to compute because it avoids logarithms, while Entropy is grounded in information theory and can produce slightly better, more balanced splits in some cases. In practice the two criteria usually lead to very similar trees. The table below summarizes when each is the more natural choice:

| Situation                                                | Better Choice                                      |
| -------------------------------------------------------- | -------------------------------------------------- |
| You want **faster computation**                          | **Gini Impurity**                                  |
| You care about **information-theoretic interpretation**  | **Entropy**                                        |
| You’re mainly focused on **classification tree quality** | Either — test both with cross-validation           |
| You expect **imbalanced classes**                        | Entropy sometimes helps but results can be similar |


**Gini Impurity**

**Strengths, Weaknesses, and Use Cases**

**Strengths**

>Speed: Its main advantage is computational efficiency, which is why it is the default criterion in CART-based implementations such as scikit-learn's DecisionTreeClassifier.

>Robustness: It is often described as somewhat more robust to noise, though the difference from Entropy is usually small in practice.

**Weaknesses:**

>May perform less well when class distributions are highly imbalanced, as it tends to isolate the most frequent class in its own branch early.

**Appropriate Use Cases**

>Large datasets where training time is a primary constraint.

>Real-time applications or scenarios with limited computational resources.

**Entropy**

**Strengths, Weaknesses, and Use Cases**

**Strengths**

>Theoretical Soundness: It is grounded in information theory and is the quantity that Information Gain is built on.

>Sensitivity: More sensitive to class distribution, potentially finding finer, slightly more accurate splits, especially with balanced classes.

**Weaknesses:**

>Speed: Slower to compute due to the logarithmic operations at every node split.

>Can potentially lead to deeper trees, which might increase the risk of overfitting.

**Appropriate Use Cases**

>Smaller datasets where computational time is less critical.

>Situations requiring subtle distinctions between classes or when the goal is to explicitly maximize information gain.
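
To make the comparison concrete, both measures can be computed for a few class distributions. This is a minimal sketch; the function names `gini` and `entropy` and the example distributions are chosen here for illustration.

```python
import math

def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Hypothetical two-class distributions: balanced, skewed, and pure.
for probs in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(probs, "gini =", round(gini(probs), 4), "entropy =", round(entropy(probs), 4))
```

Both measures are largest for the balanced node and zero for the pure node. They differ in scale (for two classes, Gini peaks at 0.5 and Entropy at 1 bit) and in the need for a logarithm, which is where the small speed difference comes from.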

**Question 3: What is Pre-Pruning in Decision Trees?**

Answer:

Pre-Pruning is a technique used to stop the tree from growing too deep by setting limits during training.

Common pre-pruning methods:



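>Maximum depth (max_depth): limits how many levels the tree can grow.

>Minimum samples split (min_samples_split): the minimum number of samples a node must contain before it can be split.

>Minimum samples leaf (min_samples_leaf): the minimum number of samples required in each leaf.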

Benefit:
Reduces overfitting, improves generalization, and speeds up training.
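
The effect of these limits can be seen by comparing an unrestricted tree with a pre-pruned one. The sketch below uses the Iris dataset, and the specific limit values are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unrestricted tree: grows until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pre-pruned tree: growth stops as soon as any limit is reached.
pruned_tree = DecisionTreeClassifier(
    max_depth=2,           # at most two levels of splits
    min_samples_split=10,  # do not split nodes with fewer than 10 samples
    min_samples_leaf=5,    # every leaf must keep at least 5 samples
    random_state=0,
).fit(X, y)

print("Full tree:   depth =", full_tree.get_depth(), "leaves =", full_tree.get_n_leaves())
print("Pruned tree: depth =", pruned_tree.get_depth(), "leaves =", pruned_tree.get_n_leaves())
```

A smaller tree is not automatically a better one, so limits like these are normally tuned with cross-validation.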

**Question 4**: Write a Python program to train a Decision Tree Classifier using Gini
Impurity as the criterion and print the feature importances (practical).

Hint: Use criterion='gini' in DecisionTreeClassifier and access .feature_importances_.

(Include your Python code and output in the code box below.)

In [None]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. Load a sample dataset (Iris)
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# 2. Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Train Decision Tree Classifier with Gini impurity
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# 4. Get feature importances
importances = clf.feature_importances_

# 5. Print feature importance values
print("Feature Importances (Gini-based):")
for feature, importance in zip(X.columns, importances):
    print(f"{feature}: {importance:.4f}")


Feature Importances (Gini-based):
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876


**Question 5: What is a Support Vector Machine (SVM)?**

Answer:

Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression. It finds the optimal hyperplane that maximizes the margin between different classes.

* Uses support vectors (critical data points)

* Effective in high-dimensional spaces
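
A small sketch can show which points end up as support vectors. The six 2-D points below are made up for illustration, and the large `C` value is an assumption used to approximate a hard-margin SVM on this separable data.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data: two clusters of three points.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C penalizes margin violations heavily (close to a hard margin).
clf = SVC(kernel="linear", C=1e3).fit(X, y)

# For a linear SVM the margin width is 2 / ||w||.
w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)

print("Support vectors:")
print(clf.support_vectors_)
print("Margin width:", margin)
```

Only the points closest to the boundary are kept as support vectors. The remaining points could be moved or removed (as long as they stay outside the margin) without changing the hyperplane.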

**Question 6: What is the Kernel Trick in SVM?**

Answer:

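The Kernel Trick allows SVM to solve non-linear problems by implicitly mapping data into a higher-dimensional space. A kernel function returns the inner product of two points in that space directly, so the transformation itself never has to be computed.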

**Common kernels:**

* Linear
* Polynomial
* RBF (Gaussian)
* Sigmoid
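
The idea can be verified numerically for a degree-2 polynomial kernel. In this sketch, `phi` and `poly_kernel` are illustrative helper names, and the homogeneous kernel (a · b)² is chosen because its explicit feature map is short enough to write out.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: (a . b)^2."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

explicit = np.dot(phi(a), phi(b))  # inner product after mapping to 3-D
implicit = poly_kernel(a, b)       # same value, computed in the original 2-D space

print("Explicit mapping:", explicit)
print("Kernel function: ", implicit)
```

Both routes give the same number, but the kernel never builds the higher-dimensional vectors. For the RBF kernel the corresponding feature space is infinite-dimensional, so the implicit route is the only practical one.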

**Question 7**: Write a Python program to train two SVM classifiers with Linear and RBF
kernels on the Wine dataset, then compare their accuracies.

Hint: Use SVC(kernel='linear') and SVC(kernel='rbf'), then compare accuracy scores after fitting
on the same dataset.

(Include your Python code and output in the code box below.)


In [None]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Load the Wine dataset
wine = load_wine()
X, y = wine.data, wine.target

# 2. Split into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Standardize the data (crucial for SVM performance)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Train SVM with Linear Kernel
linear_svc = SVC(kernel='linear')
linear_svc.fit(X_train_scaled, y_train)
linear_pred = linear_svc.predict(X_test_scaled)
linear_acc = accuracy_score(y_test, linear_pred)

# 5. Train SVM with RBF (Radial Basis Function) Kernel
rbf_svc = SVC(kernel='rbf')
rbf_svc.fit(X_train_scaled, y_train)
rbf_pred = rbf_svc.predict(X_test_scaled)
rbf_acc = accuracy_score(y_test, rbf_pred)

# 6. Compare results
print(f"Accuracy with Linear Kernel: {linear_acc:.4f}")
print(f"Accuracy with RBF Kernel:    {rbf_acc:.4f}")


Accuracy with Linear Kernel: 0.9815
Accuracy with RBF Kernel:    0.9815


**Question 8**: What is the Naïve Bayes classifier, and why is it called "Naïve"?

Answer:

A Naïve Bayes classifier is a probabilistic machine learning algorithm used primarily for classification tasks, such as spam detection and sentiment analysis. It is based on Bayes' Theorem, which calculates the posterior probability of a class given certain evidence (features).

**Why is it called "Naïve"?**

The algorithm is considered "naïve" because it makes a strong, simplifying assumption that all features are independent of each other given the class label.

* The Assumption: Once the class is known, the presence or absence of one feature is assumed to tell you nothing about any other feature.

* Example: To classify a fruit as an orange, the algorithm considers its color (orange), shape (round), and size (3.5 inches) independently. Even if these characteristics are related in nature, the model treats them as separate, independent contributors to the final probability.
* Why it's "Naïve": In real-world data, features are almost always correlated. For instance, in an email, the word "Free" is often followed by "Money," yet the model ignores this dependency.
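
Under this assumption, the score for each class is simply the prior multiplied by one likelihood per feature. The sketch below works through the "Free"/"Money" email example; every probability in it is a made-up number used only for illustration.

```python
# Hypothetical probabilities, chosen only for illustration.
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.5, "money": 0.4},
    "ham": {"free": 0.1, "money": 0.05},
}

def posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    scores = {}
    for label in prior:
        # Unnormalized score: P(class) * product of P(word | class)
        score = prior[label]
        for word in words:
            score *= likelihood[label][word]
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

result = posterior(["free", "money"])
print(result)
```

Each word contributes its own factor, and no term describes how "free" and "money" occur together. That missing term is exactly the dependency the model ignores.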

**Key Benefits of this "Naivety"**

Despite its unrealistic core assumption, this simplification provides several practical advantages:

* Computational Efficiency: It requires significantly less training data and is much faster to train than more complex models like SVMs or Neural Networks.
* Handling High Dimensions: It excels in text classification where there may be thousands of features (words) because it doesn't need to model complex interactions between them.
* Scalability: It is highly scalable, because the number of parameters grows only linearly with the number of features.

**Common Types of Naïve Bayes**

Different variants are used depending on the nature of the data.
* Gaussian: For continuous features that follow a normal distribution (e.g., height or temperature).
* Multinomial: For discrete data, typically word frequencies in text classification.
* Bernoulli: For binary data (e.g., whether a word is present or absent in a document).

**Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve
Bayes, and Bernoulli Naïve Bayes**


Answer:

1. **Gaussian Naïve Bayes**

Best for: Continuous numerical features

Typical use cases: Sensor data, real-valued measurements (height, weight, temperature)

**How it works**

* Assumes each feature follows a Gaussian (normal) distribution within each class.

* Estimates the mean and variance of each feature within each class and uses them to calculate likelihoods.

**Example intuition**

If you’re trying to classify whether a plant is healthy using leaf width and length, which are real numbers, Gaussian NB can model how these continuous measurements vary by class.

2. **Multinomial Naïve Bayes**

Best for: Count or frequency features

Typical use cases: Text classification with word counts (spam detection, topic labeling)

**How it works**

* Assumes features come from a multinomial distribution (think lots of discrete counts).

* Uses the counts of features (e.g., number of times each word appears) to compute likelihoods.

**Why it shines in text**

Text represented as a “bag of words” becomes a vector of word counts. Multinomial NB naturally models that, making it ideal for many NLP tasks.

3. **Bernoulli Naïve Bayes**

Best for: Binary (presence/absence) features

Typical use cases: Indicators like “word present or not in document”, feature flags, yes/no attributes

**How it works**

* Assumes each feature is Bernoulli-distributed, i.e. takes values 0 or 1.

* Models both the presence and absence of features rather than counts.

**When this helps**

If you only care whether a word appears at all (not how many times), Bernoulli NB captures that pattern.



| Variant        | Feature Type           | Distribution Assumed | Typical Use Case             |
| -------------- | ---------------------- | -------------------- | ---------------------------- |
| Gaussian NB    | Continuous real-valued | Normal (Gaussian)    | Numeric measurements         |
| Multinomial NB | Discrete counts        | Multinomial          | Text with word frequencies   |
| Bernoulli NB   | Binary flags           | Bernoulli            | Binary presence/absence data |

Source: [GeeksforGeeks][1]
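
The three variants can be tried side by side on matching feature types. This is a minimal sketch with made-up data: four samples, two classes, and three hypothetical word columns.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Continuous features (hypothetical leaf width and length in cm) -> Gaussian NB
X_cont = np.array([[1.0, 2.1], [1.2, 1.9], [3.0, 4.2], [3.1, 3.9]])

# Word counts per document (three hypothetical words) -> Multinomial NB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])

# Presence/absence of the same words -> Bernoulli NB
X_binary = (X_counts > 0).astype(int)

gnb = GaussianNB().fit(X_cont, y)
mnb = MultinomialNB().fit(X_counts, y)
bnb = BernoulliNB().fit(X_binary, y)

print("Gaussian NB:   ", gnb.predict([[1.1, 2.0]]))
print("Multinomial NB:", mnb.predict([[4, 0, 1]]))
print("Bernoulli NB:  ", bnb.predict([[0, 1, 1]]))
```

The estimator API is identical across the three. What changes is the likelihood model, so the choice should follow the feature type rather than the task.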

[1]: https://www.geeksforgeeks.org/bernoulli-naive-bayes/?utm_source=chatgpt.com "Bernoulli Naive Bayes - GeeksforGeeks"


**Question 10: Breast Cancer Dataset**

Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer
dataset and evaluate accuracy.

Hint: Use GaussianNB() from sklearn.naive_bayes and the Breast Cancer dataset from
sklearn.datasets.

(Include your Python code and output in the code box below.)


In [24]:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_breast_cancer(return_X_y=True)

# Split into 80% train / 20% test (no random_state is set, so the accuracy varies between runs)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train model
model = GaussianNB()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)


Accuracy: 0.956140350877193
