#Q 1: What is Information Gain, and how is it used in Decision Trees?
:-   Information Gain (IG) measures how effectively a feature separates the training data according to its target classes when a Decision Tree is built.
     It quantifies the reduction in entropy (impurity) that results from splitting a dataset on a particular feature:
     IG(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)
     where S is the set of samples at the node and S_v is the subset of S for which feature A takes the value v.
     How it's used:
     Decision Trees are built top-down. At each node, the algorithm calculates the Information Gain for every available feature.
     The feature that yields the highest Information Gain (the biggest drop in entropy) is chosen as the splitting criterion for that node.
     This process is repeated recursively until a stopping condition is met. The goal is to maximize the purity of the resulting child nodes.
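
The calculation described above can be sketched in a few lines. This is a minimal illustration, not how any particular library implements it; the toy labels and the two candidate splits are made up for the example.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical toy node: 5 "yes" and 5 "no" samples.
parent = np.array(["yes"] * 5 + ["no"] * 5)

# Candidate split A separates the classes fairly well; split B barely helps.
split_a = [np.array(["yes"] * 4 + ["no"]), np.array(["yes"] + ["no"] * 4)]
split_b = [np.array(["yes"] * 3 + ["no"] * 2), np.array(["yes"] * 2 + ["no"] * 3)]

ig_a = information_gain(parent, split_a)
ig_b = information_gain(parent, split_b)
print(f"IG of split A: {ig_a:.3f}")
print(f"IG of split B: {ig_b:.3f}")
```

A tree that splits on Information Gain would pick split A here, because it leaves purer child nodes.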

#Q 2: What is the difference between Gini Impurity and Entropy?

:-   Gini Impurity and Entropy are both measures used to quantify the impurity or randomness in a set of data (a node in a Decision Tree).
    The main difference lies in their calculation and properties:
| Feature | Gini Impurity | Entropy |
|---|---|---|
| Formula | Gini = 1 - \sum_{i=1}^{C} p_i^2 | Entropy = - \sum_{i=1}^{C} p_i \log_2(p_i) |
| What it measures | The probability of incorrectly classifying a randomly chosen element if it is labeled at random according to the node's class distribution. | The degree of randomness or uncertainty in the data. |
| Range | [0, 0.5] for binary classification; [0, 1 - 1/C] for C classes | [0, 1] for binary classification; [0, \log_2 C] for C classes |
| Computational Cost | Generally faster to compute as it avoids logarithmic calculations. | Involves logarithms, making it slightly slower to compute. |
| Preferred Split | Aims to minimize Gini Impurity. | Aims to minimize Entropy, which maximizes Information Gain. |
| Use Case | Used in algorithms like CART (Classification and Regression Trees). | Used in algorithms like ID3 and C4.5. |
    Both measures achieve a similar goal: a value of zero indicates a perfectly pure node (all samples belong to the same class), while a higher value indicates a less pure node.
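
To see how the two measures behave side by side, the sketch below evaluates both formulas on a few binary class distributions. The distributions are arbitrary illustrative values, and the helper names `gini` and `entropy` are chosen for this example only.

```python
import numpy as np

def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    probs = np.asarray(probs, dtype=float)
    return float(1.0 - np.sum(probs ** 2))

def entropy(probs):
    """Shannon entropy in bits, written as sum(p * log2(1/p))."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]  # a class with probability 0 contributes nothing
    return float(np.sum(nonzero * np.log2(1.0 / nonzero)))

for dist in ([1.0, 0.0], [0.9, 0.1], [0.7, 0.3], [0.5, 0.5]):
    print(f"p={dist}: gini={gini(dist):.3f}, entropy={entropy(dist):.3f}")
```

Both values are zero for the pure node and reach their binary maximum (0.5 for Gini, 1 for entropy) at the 50/50 split.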
       
#Q 3: What is Pre-Pruning in Decision Trees?

:-   Pre-Pruning is a technique used to prevent a Decision Tree from growing too large and overfitting the training data.
     It stops tree construction early by imposing constraints on the tree's growth that are checked before a split is made at a node.
     Common pre-pruning techniques/stopping criteria include:
     Maximum Tree Depth: Limiting the maximum number of levels (e.g., max_depth=5).
     Minimum Samples for a Split: Requiring a minimum number of samples in a node before a split can be considered (e.g., min_samples_split=20).
     Minimum Samples in a Leaf Node: Requiring a minimum number of samples for any new leaf node that would be created by a split (e.g., min_samples_leaf=10).
     Minimum Impurity Decrease (or Minimum Information Gain): Stopping if the split doesn't reduce impurity by at least a certain threshold (e.g., min_impurity_decrease).
     Advantage: Pre-pruning is computationally less expensive than Post-Pruning (where you grow the full tree and then cut back).
     Disadvantage: It can sometimes stop the tree from finding a split that, while not immediately beneficial,
     leads to much better splits later on, potentially resulting in an underfitted model.
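
The effect of these constraints can be checked by comparing an unconstrained tree with a pre-pruned one. The sketch below uses scikit-learn's Breast Cancer dataset; the specific threshold values are illustrative and untuned, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Unconstrained tree: grows until the leaves are pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: growth stops as soon as any constraint is violated.
# All four values below are illustrative assumptions.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,
    min_samples_split=20,
    min_samples_leaf=10,
    min_impurity_decrease=0.001,
    random_state=42,
).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pruned_tree)]:
    print(
        f"{name:>10}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
        f"train acc={tree.score(X_train, y_train):.3f}, "
        f"test acc={tree.score(X_test, y_test):.3f}"
    )
```

Comparing train and test accuracy for the two trees gives a rough sense of how much the unconstrained tree overfits; in practice the thresholds would be chosen by cross-validation.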

#Q 4: Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).
:-
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# 1. Load the dataset (using Iris for a practical example)
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# 2. Split data into training and testing sets (optional but good practice)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Initialize the Decision Tree Classifier with Gini criterion
# The criterion='gini' is the default, but we explicitly set it.
dtc = DecisionTreeClassifier(criterion='gini', random_state=42)

# 4. Train the model
dtc.fit(X_train, y_train)

# 5. Get and print the feature importances
importances = dtc.feature_importances_

print("Feature Importances for Decision Tree (Gini Criterion):")
for name, importance in zip(feature_names, importances):
    # Print the importance for each feature, formatted as a percentage
    print(f"{name}: {importance*100:.2f}%")

# 6. Evaluate the model (optional)
accuracy = dtc.score(X_test, y_test)
print(f"\nModel Accuracy on Test Set: {accuracy*100:.2f}%")
Feature Importances for Decision Tree (Gini Criterion):
sepal length (cm): 0.00%
sepal width (cm): 0.00%
petal length (cm): 91.13%
petal width (cm): 8.87%

Model Accuracy on Test Set: 100.00%
#Q 5: What is a Support Vector Machine (SVM)?

:-  A Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression tasks, though primarily for classification.
    The core idea of SVM is to find an optimal hyperplane that separates the data points of different classes in the feature space.
    The "optimal" hyperplane is the one with the largest margin, that is, the maximum distance between the hyperplane and the nearest data points of any class.
    Support Vectors: These are the data points that lie closest to the hyperplane (on the margin). They are the most important points in the dataset,
    because they alone determine the position and orientation of the optimal hyperplane.
    Margin: The distance between the hyperplane and the support vectors. SVM aims to maximize this margin for better generalization.
    SVM is particularly effective in high-dimensional spaces, and it can handle classes that are not linearly separable by using the Kernel Trick (Question 6).
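
The hyperplane, support vectors, and margin can be inspected directly on a fitted linear SVM. The sketch below uses a made-up, linearly separable 2-D dataset and a very large `C` to approximate a hard margin; both are assumptions for illustration. For a linear SVM with weight vector w, the distance from the hyperplane to the support vectors is 1 / ||w||, so the full gap between the two classes is 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data in 2-D.
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("Support vectors:\n", clf.support_vectors_)
print(f"Hyperplane: {w[0]:.3f}*x1 + {w[1]:.3f}*x2 + {b:.3f} = 0")
print(f"Margin width (2 / ||w||): {margin_width:.3f}")
```

Only the points nearest the boundary appear in `support_vectors_`; moving any of the other points (without crossing the margin) would leave the hyperplane unchanged.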

#Q 6: What is the Kernel Trick in SVM?

:-   The Kernel Trick is a fundamental technique that allows SVM to handle non-linearly separable data without explicitly transforming the data into a higher-dimensional space.
     The Problem: Many real-world classification problems involve data that cannot be separated by a simple straight line (a linear hyperplane) in the original feature space.
     The Solution (The "Trick"): Instead of performing the computationally expensive mapping of each data point \mathbf{x} to a higher-dimensional feature space \phi(\mathbf{x}),
     the Kernel Trick uses a Kernel Function (e.g., RBF, polynomial, sigmoid) to calculate the dot product of the data points as if they were already in that higher-dimensional space.
     The Benefit: The Kernel Function K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j) directly computes the similarity between two points in the high-dimensional
     space without ever calculating the coordinates \phi(\mathbf{x}) themselves. This significantly reduces computational cost while still allowing a linear decision boundary
     to be found in the new, higher-dimensional space.
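
The identity K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) can be verified numerically for a kernel whose feature map is small enough to write out. The sketch below uses the homogeneous degree-2 polynomial kernel K(x, z) = (x \cdot z)^2 on 2-D inputs, chosen here only because its feature map has just three coordinates; the two input vectors are arbitrary.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def poly_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: K(x, z) = (x . z)^2."""
    return float(np.dot(x, z) ** 2)

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = float(np.dot(phi(x), phi(z)))  # dot product in the 3-D feature space
implicit = poly_kernel(x, z)              # same value, computed in the 2-D input space

print(f"phi(x) . phi(z) = {explicit:.1f}")
print(f"K(x, z)         = {implicit:.1f}")
```

Both lines print 121.0. For the RBF kernel the corresponding feature space is infinite-dimensional, so the explicit route is not available at all and the kernel function is the only practical way to compute the dot product.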

#Q 7: Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.
:-
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Load the dataset
wine = load_wine()
X = wine.data
y = wine.target

# 2. Split data into training and testing sets
# Note: the features are left unscaled here. SVMs are sensitive to feature scale,
# so standardizing is usually recommended, especially for the RBF kernel.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Train the Linear SVM Classifier
svc_linear = SVC(kernel='linear', random_state=42)
svc_linear.fit(X_train, y_train)
y_pred_linear = svc_linear.predict(X_test)
accuracy_linear = accuracy_score(y_test, y_pred_linear)

# 4. Train the RBF (Radial Basis Function) SVM Classifier
svc_rbf = SVC(kernel='rbf', random_state=42) # RBF is the default kernel
svc_rbf.fit(X_train, y_train)
y_pred_rbf = svc_rbf.predict(X_test)
accuracy_rbf = accuracy_score(y_test, y_pred_rbf)

# 5. Compare Accuracies
print("--- SVM Kernel Accuracy Comparison (Wine Dataset) ---")
print(f"Linear Kernel Accuracy: {accuracy_linear*100:.2f}%")
print(f"RBF Kernel Accuracy:    {accuracy_rbf*100:.2f}%")
--- SVM Kernel Accuracy Comparison (Wine Dataset) ---
Linear Kernel Accuracy: 98.15%
RBF Kernel Accuracy:    70.37%
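
The large gap between the two kernels is probably caused by feature scaling rather than by the kernels themselves: the Wine features span very different numeric ranges, and the RBF kernel depends on raw distances between points. A minimal sketch of the same RBF experiment with standardized features follows. It assumes the same split as above and uses a scikit-learn pipeline so the scaler is fit on the training data only.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The pipeline fits the scaler on the training data, then applies the same
# transformation to the test data before the SVM sees it.
scaled_rbf = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=42))
scaled_rbf.fit(X_train, y_train)

scaled_accuracy = scaled_rbf.score(X_test, y_test)
print(f"RBF Kernel Accuracy (standardized features): {scaled_accuracy*100:.2f}%")
```

Running this alongside the unscaled version makes it possible to compare the two kernels on equal footing.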
            
#Q 8: What is the Naïve Bayes classifier, and why is it called "Naïve"?

:-   The Naïve Bayes classifier is a family of probabilistic machine learning algorithms based on applying Bayes' theorem with a "naïve" assumption of feature independence:
     P(C|X) = \frac{P(X|C) P(C)}{P(X)}
     where C is the class and X is the vector of features.
     Core Function: It calculates the probability of an observation belonging to a certain class C given its feature values X, and then predicts
     the class with the highest probability (Maximum A Posteriori hypothesis).
     Why is it called "Naïve"?
     The classifier is called "Naïve" because it makes the simplifying, yet often effective, assumption that all features are conditionally independent given the class:
     P(X|C) = \prod_{i=1}^{n} P(x_i|C)
     In reality, features are rarely perfectly independent (e.g., a person's height and weight are related).
     However, this assumption drastically simplifies the computation, making Naïve Bayes models very fast to train and highly efficient,
     especially for high-dimensional data like text classification.
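
The prediction rule can be worked through by hand on a tiny example. In the sketch below, the priors and per-word likelihoods are hypothetical numbers picked for illustration, not estimates from real data.

```python
# Hypothetical, hand-picked probabilities for a two-word spam filter.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.5},
}

def posterior(observed_words):
    """Return P(class | words) under the naive independence assumption."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in observed_words:
            score *= likelihoods[label][word]  # multiply per-feature likelihoods
        scores[label] = score
    evidence = sum(scores.values())  # P(X), the normalizing constant
    return {label: score / evidence for label, score in scores.items()}

result = posterior(["free", "meeting"])
prediction = max(result, key=result.get)
print(result, "->", prediction)
```

The "naïve" step is the inner loop: each word's likelihood is multiplied in independently, with no term for how the words relate to each other.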

#Q 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes.

:-    These are the three most common variants of the Naïve Bayes classifier, distinguished by the underlying distribution they assume for the features (P(X|C)):
    
1. Gaussian Naïve Bayes
Used for continuous numeric features.
Assumes each feature follows a normal (Gaussian) distribution within each class.
Applications: Iris dataset, medical data, sensor data.

2. Multinomial Naïve Bayes
Used for count data.
Feature values must be non-negative, typically integer counts.
Ideal for:
✔ text classification
✔ bag-of-words
✔ term frequency counts

3. Bernoulli Naïve Bayes
Used for binary features (0/1).
Features represent the presence or absence of something.
Examples:
✔ Email spam classification with 0/1 indicators
✔ Word present or not present
Key Differences Table
| Type | Data Type | Best For |
|---|---|---|
| Gaussian NB | Continuous values | Sensor, numeric datasets |
| Multinomial NB | Counts (integers) | NLP text classification |
| Bernoulli NB | Binary (0/1) | Spam detection, document classification |
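
The pairing of variant and data type can be sketched with scikit-learn's three estimators. The tiny arrays below are made up for illustration; each one has the kind of values its variant expects.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_continuous = np.array([[1.2, 0.5], [0.9, 0.7], [3.1, 2.2], [2.8, 2.5]])
# Word counts per document -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
# Word present (1) or absent (0) -> BernoulliNB
X_binary = (X_counts > 0).astype(int)

models = {
    "GaussianNB": (GaussianNB(), X_continuous),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_binary),
}
for name, (model, X) in models.items():
    model.fit(X, y)
    print(f"{name}: predictions on training data = {model.predict(X)}")
```

The three classes share the same `fit`/`predict` interface; what changes is the distribution each one assumes for P(X|C), which is why the input representation matters.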
         
#Q 10: Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate accuracy.
          
:-
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# 1. Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# 2. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Initialize the Gaussian Naive Bayes Classifier
gnb = GaussianNB()

# 4. Train the model
gnb.fit(X_train, y_train)

# 5. Make predictions
y_pred = gnb.predict(X_test)

# 6. Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)

print("--- Gaussian Naïve Bayes on Breast Cancer Dataset ---")
print(f"Number of test samples: {len(X_test)}")
print(f"Accuracy Score: {accuracy*100:.2f}%")
--- Gaussian Naïve Bayes on Breast Cancer Dataset ---
Number of test samples: 171
Accuracy Score: 93.57%