# DA-AG-009 Assignment
## Supervised Classification: Decision Trees, SVM, and Naive Bayes

### Question 1: What is Information Gain, and how is it used in Decision Trees?

Information Gain measures how much a split on a given feature reduces uncertainty about the target class. It is built on the concept of entropy from information theory: the Information Gain of a split is the entropy of the parent node minus the weighted average entropy of the child nodes, where each child is weighted by the fraction of samples it receives. At each node, the tree-building algorithm chooses the feature with the highest Information Gain, because that split produces the purest child nodes.
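The calculation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the routine scikit-learn uses internally; the helper names `entropy` and `information_gain` and the toy labels are assumptions made for this example.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted_child_entropy = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted_child_entropy

# Toy node with three samples of each class (hypothetical data)
parent = np.array([0, 0, 0, 1, 1, 1])

# A split that separates the classes completely
perfect_split = [parent[:3], parent[3:]]
# A split that leaves both children almost as mixed as the parent
poor_split = [np.array([0, 0, 1]), np.array([0, 1, 1])]

ig_perfect = information_gain(parent, perfect_split)
ig_poor = information_gain(parent, poor_split)
print(ig_perfect, ig_poor)
```

The perfect split recovers the full parent entropy as gain, while the poor split gains very little, which is why a tree would prefer the first.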

### Question 2: Difference between Gini Impurity and Entropy

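Gini Impurity and Entropy are both measures used to evaluate the quality of splits in decision trees. Gini Impurity is the probability that a randomly chosen sample would be misclassified if it were labeled at random according to the class distribution of the node; it equals one minus the sum of the squared class proportions. Entropy measures the uncertainty of the class distribution as the negative sum of each class proportion times its base-2 logarithm. Both are zero for a pure node and largest when the classes are evenly mixed. Gini avoids the logarithm, so it is slightly cheaper to compute, and it is the criterion used in CART. Entropy comes from information theory and underlies the Information Gain criterion of ID3 and C4.5 (C4.5 normalizes it into a gain ratio). In practice the two criteria usually produce very similar trees.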
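To make the comparison concrete, here is a small sketch that evaluates both measures on a few two-class distributions. The function names `gini` and `entropy` and the example distributions are chosen for illustration only.

```python
import numpy as np

def gini(probs):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    probs = np.asarray(probs, dtype=float)
    return float(1.0 - np.sum(probs ** 2))

def entropy(probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]
    return float(-np.sum(nonzero * np.log2(nonzero)))

# Increasingly mixed two-class nodes (hypothetical class proportions)
for dist in ([0.9, 0.1], [0.7, 0.3], [0.5, 0.5]):
    print(dist, round(gini(dist), 3), round(entropy(dist), 3))
```

Both measures grow as the node becomes more mixed; for two classes Gini peaks at 0.5 and entropy at 1 bit.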

### Question 3: What is Pre-Pruning in Decision Trees?

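Pre-pruning is a technique that stops the growth of a decision tree early to prevent overfitting. It works by setting constraints before training, such as a maximum tree depth, a minimum number of samples required to split a node, or a minimum impurity decrease that a split must achieve. Because the tree cannot grow complex enough to memorize noise in the training data, pre-pruning often improves generalization to unseen data, although constraints that are too strict can cause underfitting.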
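In scikit-learn these constraints correspond to constructor parameters of `DecisionTreeClassifier`. The sketch below compares an unconstrained tree with a pre-pruned one on Iris; the specific values (`max_depth=3`, `min_samples_split=10`, `min_impurity_decrease=0.01`) and the train/test split are arbitrary choices for illustration, not tuned settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Unconstrained tree: grows until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: growth stops as soon as any constraint is hit
pruned_tree = DecisionTreeClassifier(
    max_depth=3,                 # never grow deeper than 3 levels
    min_samples_split=10,        # do not split nodes with fewer than 10 samples
    min_impurity_decrease=0.01,  # skip splits that barely reduce impurity
    random_state=42,
).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pruned_tree)]:
    print(name, tree.get_depth(), tree.get_n_leaves(), tree.score(X_test, y_test))
```

Comparing depth, leaf count, and test accuracy side by side shows how much model complexity the constraints remove and what that costs or gains on held-out data.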

### Question 4: Decision Tree using Gini Impurity (Practical)

In [1]:

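from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load the Iris features and class labels as NumPy arrays
X, y = load_iris(return_X_y=True)

# Gini impurity is scikit-learn's default criterion; it is set explicitly for clarity.
# No random_state is fixed, so ties between equally good splits may be broken
# differently from run to run and the importances can vary slightly.
model = DecisionTreeClassifier(criterion='gini')
model.fit(X, y)

# Impurity-based importance of each feature (the values sum to 1)
model.feature_importances_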


array([0.02666667, 0.        , 0.55072262, 0.42261071])

### Question 5: What is a Support Vector Machine (SVM)?

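A Support Vector Machine is a supervised learning algorithm used for classification and regression tasks. For classification, it finds the hyperplane that separates the classes with the largest possible margin, that is, the greatest distance to the nearest training points of each class. The training points that lie on or inside the margin are called support vectors, and they alone determine the position of the decision boundary.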
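A tiny example can make the role of support vectors visible. The sketch below fits a linear `SVC` to four hand-picked points (hypothetical data) with a very large `C` to approximate a hard margin, then reads off the support vectors and the margin width, which for a linear SVM is 2 divided by the norm of the weight vector.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes along the diagonal (hypothetical toy data)
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                       # normal vector of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin lines

print("support vector indices:", clf.support_)
print("support vectors:", clf.support_vectors_)
print("margin width:", margin_width)
```

Only the two inner points, (1, 1) and (3, 3), are expected to be support vectors; moving the outer points further away would leave the boundary unchanged.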

### Question 6: What is the Kernel Trick in SVM?

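The Kernel Trick allows SVMs to learn non-linear decision boundaries by implicitly mapping the data into a higher-dimensional feature space, where a linear separator may exist. A kernel function, such as the polynomial or radial basis function (RBF) kernel, returns the inner product of two points in that space directly from the original inputs, so the mapping itself is never computed and the computational cost stays low. The linear kernel is the special case with no mapping: it is the ordinary inner product in the input space.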
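The equivalence between a kernel and an inner product in a mapped space can be checked numerically. The sketch below uses the homogeneous degree-2 polynomial kernel on 2-D inputs, whose explicit feature map is known in closed form; the helper names `phi` and `poly2_kernel` and the sample points are assumptions for this illustration.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (maps into 3-D)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: (a . b)^2."""
    return float(np.dot(a, b) ** 2)

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = float(np.dot(phi(x), phi(z)))  # inner product after mapping
implicit = poly2_kernel(x, z)             # same value, no mapping computed

print(explicit, implicit)
```

Both routes give the same number, but the kernel only ever touches the original 2-D vectors. For the RBF kernel the corresponding feature space is infinite-dimensional, so the implicit route is the only practical one.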

### Question 7: SVM with Linear and RBF Kernels (Practical)

In [2]:

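from sklearn.datasets import load_wine
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Wine dataset and hold out 30% of the samples for testing
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Same data, two kernels; all other hyperparameters are left at their defaults.
# The features are not standardized here, and the RBF kernel is sensitive to
# feature scale, which is a likely reason for its lower accuracy below.
svm_linear = SVC(kernel='linear')
svm_rbf = SVC(kernel='rbf')

svm_linear.fit(X_train, y_train)
svm_rbf.fit(X_train, y_train)

# Accuracy of each model on the held-out test set
acc_linear = accuracy_score(y_test, svm_linear.predict(X_test))
acc_rbf = accuracy_score(y_test, svm_rbf.predict(X_test))

acc_linear, acc_rbf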


(0.9814814814814815, 0.7592592592592593)
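The gap between the two accuracies above is probably caused by feature scaling rather than by the kernel itself: the Wine features span very different ranges, and the RBF kernel depends on raw distances between samples. One way to check this, sketched below under the assumption that standardizing is acceptable for this task, is to put a `StandardScaler` in front of each SVM with `make_pipeline`. The variable names are chosen for this example, and no result is quoted because it was not part of the original run.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The scaler is fitted on the training data only, then applied to the test data
scaled_linear = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scaled_rbf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

scaled_linear.fit(X_train, y_train)
scaled_rbf.fit(X_train, y_train)

acc_scaled_linear = scaled_linear.score(X_test, y_test)
acc_scaled_rbf = scaled_rbf.score(X_test, y_test)

print(acc_scaled_linear, acc_scaled_rbf)
```

Wrapping the scaler and classifier in one pipeline keeps test-set statistics out of the fitting step, so the comparison with the unscaled models stays fair.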

### Question 8: What is the Naïve Bayes classifier, and why is it called 'Naïve'?

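Naïve Bayes is a probabilistic classifier based on Bayes’ Theorem. It predicts the class that maximizes the prior probability of the class multiplied by the likelihood of each feature value given that class. It is called 'Naïve' because it assumes that all features are conditionally independent of each other given the class label, an assumption that rarely holds in real-world data. The classifier often works well anyway, because a correct prediction only requires the right class to receive the highest score, not accurate probability estimates.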

### Question 9: Differences between Gaussian, Multinomial, and Bernoulli Naïve Bayes

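The three variants differ in how they model the distribution of each feature within a class. Gaussian Naïve Bayes is used for continuous data and assumes each feature is normally distributed within each class. Multinomial Naïve Bayes is suited to discrete counts, such as word frequencies in text. Bernoulli Naïve Bayes works with binary features, such as whether a word is present or absent, and is commonly applied to document classification.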
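The sketch below pairs each variant with the kind of input it expects. The data is synthetic and generated only for illustration (normal draws for continuous features, Poisson draws standing in for word counts, and a thresholded copy of the counts for binary features); it is not meant as a benchmark.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = np.array([0] * 20 + [1] * 20)  # two classes, 20 samples each

# Continuous features: class means differ
X_cont = np.vstack([
    rng.normal(0.0, 1.0, size=(20, 3)),
    rng.normal(3.0, 1.0, size=(20, 3)),
])

# Count features: each class favors a different column (like word counts)
X_counts = np.vstack([
    rng.poisson([5, 1, 1], size=(20, 3)),
    rng.poisson([1, 1, 5], size=(20, 3)),
])

# Binary features: 1 if the count exceeds 2, else 0 (presence/absence style)
X_binary = (X_counts > 2).astype(int)

models = {
    "gaussian": (GaussianNB(), X_cont),
    "multinomial": (MultinomialNB(), X_counts),
    "bernoulli": (BernoulliNB(), X_binary),
}

train_accuracy = {}
for name, (model, features) in models.items():
    model.fit(features, y)
    train_accuracy[name] = model.score(features, y)

print(train_accuracy)
```

Feeding a variant the wrong kind of input usually still runs, but the likelihood model no longer matches the data; `MultinomialNB`, for instance, rejects negative values outright.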

### Question 10: Gaussian Naïve Bayes on Breast Cancer Dataset (Practical)

In [3]:

from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset and hold out 30% of the samples for testing
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# All features are continuous, so the Gaussian variant is the natural choice
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Accuracy on the held-out test set
accuracy_score(y_test, gnb.predict(X_test))


0.9415204678362573