### SVM and Naïve Bayes Assignment Q/A

### 1) What is Information Gain, and how is it used in Decision Trees?
#### > Information Gain (IG) is a key concept used in building Decision Trees for classification problems. It measures how much a feature contributes toward separating the classes: it quantifies the reduction in entropy (uncertainty) about the target variable after splitting the dataset on that feature.
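#### Formula: $\text{Information Gain} = \text{Entropy}(\text{Parent}) - \sum_i \frac{n_i}{n}\,\text{Entropy}(\text{Child}_i)$, where $n_i$ is the number of samples in child $i$ and $n$ is the number of samples in the parent.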
#### **How it is used in Decision Trees**:
##### For each feature:
- Calculate the entropy of the parent dataset.
- Split the data on that feature and calculate the weighted average entropy of the resulting subsets.
- Compute Information Gain for that split.
##### Choose the feature with the highest Information Gain as the best split (most reduces uncertainty).
##### Repeat the process recursively for each branch until:
- All samples are classified (pure nodes), or
- A stopping criterion is met (example: max depth or min samples).
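
The calculation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular library implements it; the helper names `entropy` and `information_gain` and the toy labels are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy data (made up for illustration): 4 "yes" and 4 "no" labels.
parent = ["yes"] * 4 + ["no"] * 4

# A hypothetical feature that separates the classes perfectly.
perfect_split = [["yes"] * 4, ["no"] * 4]

# A hypothetical feature that leaves both children as mixed as the parent.
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(f"Parent entropy: {entropy(parent):.3f}")
print(f"IG (perfect split): {information_gain(parent, perfect_split):.3f}")
print(f"IG (useless split): {information_gain(parent, useless_split):.3f}")
```

A tree builder would evaluate `information_gain` for every candidate feature and split on the one with the largest value.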

### 2) What is the difference between Gini Impurity and Entropy?
#### > **Gini Impurity** : Measures how often a randomly chosen element would be mislabeled if it were labeled at random according to the class distribution of the node: $G = 1 - \sum_i p_i^2$, where $p_i$ is the proportion of class $i$.
#### **Strengths**
- Computationally faster (no logarithms)
- Produces similar results to entropy
- Works well for balanced datasets
#### **Weaknesses**
- Slightly less sensitive to class imbalance
- May bias toward features with many levels
#### **Use Cases**
- When speed and efficiency are important (large datasets)
- Used by the CART algorithm (the default criterion in sklearn.tree.DecisionTreeClassifier)
#### > **Entropy** : Measures the amount of uncertainty in the class distribution of a node: $H = -\sum_i p_i \log_2 p_i$, where $p_i$ is the proportion of class $i$.
#### **Strengths**
- Based on information theory (clear theoretical meaning)
- More sensitive to rare classes or uneven splits
#### **Weaknesses**
- Computationally slower (uses log function)
- Adds little practical benefit over Gini in many cases
#### **Use Cases**
- When information gain interpretation is desired
- Used in ID3, C4.5, and C5.0 decision tree algorithms
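
To make the comparison concrete, here is a small sketch that evaluates both measures on a few binary class distributions. The function names and the example distributions are assumptions chosen for illustration.

```python
import math


def gini(probs):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)), skipping p_i = 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


# Hypothetical class distributions for a binary node.
for probs in [(1.0, 0.0), (0.9, 0.1), (0.7, 0.3), (0.5, 0.5)]:
    print(f"p={probs}: gini={gini(probs):.3f}, entropy={entropy(probs):.3f}")
```

For a binary node both measures are 0 when the node is pure and reach their maximum at a 50/50 split (0.5 for Gini, 1 bit for entropy), which is why they usually rank candidate splits in a similar order.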

### 3) What is Pre-Pruning in Decision Trees?
#### > Pre-Pruning (also called early stopping) is a technique used in Decision Trees to stop the tree from growing too large during training, before it perfectly fits (or overfits) the training data. It halts the tree's growth early if further splitting does not significantly improve model performance. Instead of letting the tree grow fully and then trimming it (as in post-pruning), we set constraints that limit its depth or size during construction.
#### **Pre-Pruning Parameters (scikit-learn names)**
- max_depth
- min_samples_split
- min_samples_leaf
- max_leaf_nodes
- min_impurity_decrease
#### **Advantages**
- Prevents overfitting early
- Reduces training time
- Produces simpler, more interpretable trees
#### **Disadvantages**
- Might underfit the data if pruning is too aggressive
- Requires careful tuning of parameters to find the right balance
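
As a rough sketch of how these constraints are applied, the snippet below fits one unconstrained tree and one pre-pruned tree on the Iris dataset with scikit-learn. The specific limits (`max_depth=3`, `min_samples_leaf=5`) are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Unconstrained tree: keeps splitting until leaves are pure or cannot be split.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: the limits below are illustrative, not tuned values.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,
    min_samples_leaf=5,
    random_state=42,
).fit(X_train, y_train)

# Compare the size and test accuracy of the two trees.
for name, tree in [("full", full_tree), ("pre-pruned", pruned_tree)]:
    print(
        f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
        f"test accuracy={tree.score(X_test, y_test):.2f}"
    )
```

In practice the limits are usually chosen by cross-validation rather than fixed by hand.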

### 4) Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

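# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 30% of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a decision tree that uses Gini impurity to choose splits
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

# Print the importance of each feature (the values sum to 1)
print("Feature Importances:")
for name, importance in zip(iris.feature_names, model.feature_importances_):
    print(f"{name}: {importance:.4f}")

# Evaluate accuracy on the held-out test set
accuracy = model.score(X_test, y_test)
print(f"\nModel Accuracy: {accuracy:.2f}")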

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876

Model Accuracy: 1.00


### 5) What is a Support Vector Machine (SVM)?
#### > A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks, most commonly classification. It aims to find the decision boundary (called a hyperplane) that separates data points of different classes with the maximum margin. The boundary is a line in 2D, a plane in 3D, and a hyperplane in higher dimensions.
#### **Types of SVM**
- Linear SVM - Works when data can be separated by a straight line (or plane).
- Non-linear SVM - Uses kernel functions (like RBF, polynomial, sigmoid) to separate data that is not linearly separable.
#### **Advantages**
- Works well for high-dimensional data
- Effective when there's a clear margin of separation
- Robust to overfitting (especially with regularization)
#### **Disadvantages**
- Not ideal for very large datasets (computationally expensive)
- Performance drops when classes overlap heavily
- Requires tuning of kernel and regularization parameters
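
The idea of a separating hyperplane and its margin can be illustrated without training anything. In the sketch below the weight vector `w`, the bias `b`, and the sample points are hand-picked assumptions, not the result of fitting an SVM.

```python
import math

# Hypothetical, hand-picked hyperplane w.x + b = 0 in 2D (not a trained model).
w = (1.0, 1.0)
b = -3.0


def decision_function(x):
    """Signed score w.x + b; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(x):
    """Map the signed score to a class label of +1 or -1."""
    return 1 if decision_function(x) >= 0 else -1


# For a hard-margin SVM the support vectors satisfy |w.x + b| = 1,
# so the distance between the two margin boundaries is 2 / ||w||.
margin_width = 2.0 / math.sqrt(sum(wi * wi for wi in w))

points = [(1.0, 1.0), (2.0, 2.0), (0.5, 1.5), (3.0, 1.0)]
for x in points:
    print(f"x={x}: score={decision_function(x):+.2f}, class={predict(x):+d}")
print(f"Margin width: {margin_width:.3f}")
```

Training an SVM amounts to searching for the `w` and `b` that make this margin as wide as possible while still classifying the training points correctly (or nearly so, when a soft margin is used).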

### 6) What is the Kernel Trick in SVM?
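#### > The Kernel Trick in Support Vector Machines (SVMs) is a mathematical technique that allows the algorithm to classify data that isn't linearly separable by implicitly mapping it into a higher-dimensional space, without ever computing the mapping explicitly. A kernel function K(x, z) returns the inner product of two points in that space directly from the original features, which is all the SVM optimization needs.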
#### **Advantages**
- Makes SVMs powerful for non-linear classification
- Avoids the computational cost of explicitly transforming data
- Works well with complex boundaries
#### **Disadvantages**
- Computationally expensive for large datasets
- Hard to choose the right kernel and parameters
- Poor interpretability
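
A small numeric check can make the trick less abstract. The sketch below uses the homogeneous degree-2 polynomial kernel on 2D points and compares it with the inner product of an explicit feature map; the helper names `phi`, `dot`, and `poly_kernel` and the two sample points are assumptions for this example.

```python
import math


def phi(x):
    """Explicit degree-2 feature map for a 2D point: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def dot(a, b):
    """Inner product of two equal-length tuples."""
    return sum(ai * bi for ai, bi in zip(a, b))


def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


x = (1.0, 2.0)
z = (3.0, 4.0)

explicit = dot(phi(x), phi(z))  # inner product computed in the 3D feature space
implicit = poly_kernel(x, z)    # same value computed in the original 2D space

print(f"Explicit mapping: {explicit:.4f}")
print(f"Kernel shortcut:  {implicit:.4f}")
```

Both routes give the same number (up to floating-point rounding), but the kernel never builds the higher-dimensional vectors. For kernels such as RBF the implicit space is infinite-dimensional, so the shortcut is the only practical option.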

### 7) Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.


In [None]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Stratified 70/30 split keeps the class proportions in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Standardize features; fit the scaler on the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# SVM with a linear kernel
svm_linear = SVC(kernel='linear', random_state=42)
svm_linear.fit(X_train_scaled, y_train)
y_pred_linear = svm_linear.predict(X_test_scaled)
accuracy_linear = accuracy_score(y_test, y_pred_linear)

# SVM with an RBF kernel
svm_rbf = SVC(kernel='rbf', random_state=42)
svm_rbf.fit(X_train_scaled, y_train)
y_pred_rbf = svm_rbf.predict(X_test_scaled)
accuracy_rbf = accuracy_score(y_test, y_pred_rbf)

# Compare test accuracies of the two kernels
print(f"SVM with Linear kernel Accuracy: {accuracy_linear:.2f}")
print(f"SVM with RBF kernel Accuracy: {accuracy_rbf:.2f}")

SVM with Linear kernel Accuracy: 0.96
SVM with RBF kernel Accuracy: 0.98


### 8) What is the Naïve Bayes classifier, and why is it called "Naïve"?
#### > The Naïve Bayes (NB) classifier is a probabilistic machine learning algorithm based on Bayes' Theorem. It is used for classification tasks, especially text classification, spam detection, and sentiment analysis.
#### Bayes' Theorem formula: $P(C \mid X) = \dfrac{P(X \mid C)\,P(C)}{P(X)}$
#### Where:
- $P(C \mid X)$ - probability of class C given features X (posterior)
- $P(X \mid C)$ - probability of features X given class C (likelihood)
- $P(C)$ - prior probability of class C
- $P(X)$ - probability of features X (evidence)
##### The classifier predicts the class C that maximizes $P(C \mid X)$.
#### **It is called “Naïve” because it makes a strong assumption** : All features are independent of each other given the class. In reality, features are often correlated, but Naïve Bayes ignores these dependencies. Despite this simplification, it often works surprisingly well in practice.
#### **Characteristics**
- Simple and fast to train.
- Performs well with high-dimensional data.
#### **Common variants**
- Gaussian Naïve Bayes - for continuous features
- Multinomial Naïve Bayes - for count-based features (text)
- Bernoulli Naïve Bayes - for binary features
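
The following sketch works through Bayes' Theorem with the naive independence assumption on a toy spam example. All priors and word likelihoods are made-up numbers, and the `posterior` helper is a hypothetical name used only for this illustration.

```python
# Made-up numbers for illustration only.
priors = {"spam": 0.4, "ham": 0.6}

# P(word appears | class), assumed independent across words given the class.
likelihoods = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.5},
}


def posterior(words):
    """Return P(class | words) using the naive independence assumption."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for word in words:
            score *= likelihoods[cls][word]  # multiply per-word likelihoods
        scores[cls] = score
    evidence = sum(scores.values())  # P(X), the normalizing constant
    return {cls: score / evidence for cls, score in scores.items()}


result = posterior(["free"])
print(result)
print("Predicted class:", max(result, key=result.get))
```

Because the evidence term is the same for every class, a classifier only needs to compare the unnormalized scores; real implementations typically also work with log-probabilities to avoid numerical underflow when many features are multiplied.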




### 9) Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes.
#### > **Gaussian Naïve Bayes** : Gaussian Naïve Bayes (GNB) is a type of Naïve Bayes classifier that assumes the features (predictor variables) follow a normal (Gaussian) distribution. It is commonly used for continuous data.
#### **When to Use**
- The features are continuous.
- The features approximately follow a normal distribution (Example: height, weight, test scores, etc.).
#### **Advantages**
- Simple and fast to train.
- Works well with small datasets.
- Performs surprisingly well even with the independence assumption.
#### **Disadvantages**
- Assumes normality, so it is not suitable for clearly non-Gaussian data.
- Assumes features are independent (which is rarely true).
- Sensitive to outliers.
#### > **Multinomial Naïve Bayes** : Multinomial Naïve Bayes (MNB) is a type of Naïve Bayes classifier designed for discrete (count-based) data, such as word counts or term frequencies in text classification tasks. It is one of the most popular algorithms for document classification, for example spam detection, sentiment analysis, and news categorization.
#### **When to Use**
- Your features represent counts (Example: term frequency vectors).
- Data consists of non-negative integers (Example: number of occurrences).
#### **Advantages**
- Works very well for text data (Example: word counts).
- Simple and computationally efficient.
- Performs well with large feature spaces (like vocabulary).
#### **Disadvantages**
- Assumes feature independence.
- Doesn't perform well with continuous data.
- Requires non-negative feature values.
#### > **Bernoulli Naïve Bayes** : Bernoulli Naïve Bayes (BNB) is a variant of the Naïve Bayes classifier used for binary/boolean features where each feature can take only two values: 1 (present) or 0 (absent). It is commonly used in text classification, especially when we care only about whether a word appears in a document, not how many times it appears.
#### **When to Use**
- Features are binary (Example: yes/no, present/absent).
- You're working with text data represented as binary word presence vectors.
- You want the absence of a word to count as evidence: Bernoulli NB explicitly penalizes words that do not occur in a document.
#### **Advantages**
- Works well with binary features (like word presence).
- Efficient and easy to implement.
- Good for high-dimensional data (Example: text).
#### **Disadvantages**
- Not suitable for continuous or count-based data.
- Assumes features are independent.
- May perform worse than Multinomial NB when word frequency matters.
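
The three variants differ mainly in how they model the per-class likelihood of a feature. The sketch below writes out each likelihood for a single class in plain Python; the function names and all parameter values are hypothetical, chosen only to show the shape of each model.

```python
import math


def gaussian_likelihood(x, mean, var):
    """Gaussian NB: density of a continuous value under N(mean, var)."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def multinomial_likelihood(counts, word_probs):
    """Multinomial NB (up to a constant factor): product of p_i ** count_i."""
    return math.prod(p ** c for p, c in zip(word_probs, counts))


def bernoulli_likelihood(present, presence_probs):
    """Bernoulli NB: p_i if the word is present, (1 - p_i) if it is absent."""
    return math.prod(p if x else 1 - p for p, x in zip(presence_probs, present))


# Hypothetical per-class parameters for a three-word vocabulary.
mnb_word_probs = [0.7, 0.2, 0.1]      # word probabilities, sum to 1
bnb_presence_probs = [0.9, 0.4, 0.3]  # independent presence probabilities

counts = [3, 0, 1]                    # word counts in one document
present = [c > 0 for c in counts]     # same document as presence/absence

print(f"Gaussian:    {gaussian_likelihood(170.0, mean=175.0, var=25.0):.4f}")
print(f"Multinomial: {multinomial_likelihood(counts, mnb_word_probs):.4f}")
print(f"Bernoulli:   {bernoulli_likelihood(present, bnb_presence_probs):.4f}")
```

Note how the absent second word contributes a factor of 1 to the multinomial likelihood but a factor of (1 - 0.4) to the Bernoulli likelihood: only the Bernoulli model treats absence as evidence.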

### 10) Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate its accuracy.

In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

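# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Hold out 30% of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Gaussian Naïve Bayes classifier
model = GaussianNB()

model.fit(X_train, y_train)

# Predict labels for the test set
y_pred = model.predict(X_test)

# Report overall accuracy and per-class metrics
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of Gaussian Naïve Bayes:", accuracy)

print("\nClassification Report:")
print(classification_report(y_test, y_pred))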

Accuracy of Gaussian Naïve Bayes: 0.9415204678362573

Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.90      0.92        63
           1       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171

