# Classification

Classification is one of the fundamental tasks in supervised machine learning. The goal is to train a classifier on a labelled dataset so that the model can assign unseen data to one of a set of predefined classes. Classification algorithms learn a mapping function from the input features to the output labels.

Recall that in a regression task the model is trained to predict a continuous target variable. In a classification task, by contrast, the model is trained to predict discrete values (i.e., labels or classes).

## Load dataset

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.datasets import load_wine

In [None]:
wine = load_wine(as_frame=True)

## Dataset description

In [None]:
print(wine.DESCR)

In [None]:
wine.data

## Visualize the data

In [None]:
# copy the feature DataFrame so that adding the target column does not touch wine.data
df = wine.data.copy()
df['target'] = wine.target

In [None]:
plt.figure(figsize=(12,6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt='.2f')

> Before performing any classification task, check whether the target classes are **balanced**.
>
> If the classes are **imbalanced**, you may need an additional step to **resample** the training set.
>
> The following plot shows that the target classes have somewhat imbalanced counts. Let's use them as they are first and see how the models perform.

In [None]:
sns.countplot(data=df, x = 'target')
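
The counts behind the bar chart can also be checked numerically. The sketch below is a minimal, library-free illustration: the `labels` list is rebuilt from the class sizes given in the dataset description (59, 71 and 48 samples), and `imbalance_ratio` is only an illustrative name, not part of any library.

```python
from collections import Counter

# Class labels reconstructed from the wine class sizes (59, 71, 48).
labels = [0] * 59 + [1] * 71 + [2] * 48

counts = Counter(labels)
total = sum(counts.values())

# Share of each class in the full dataset.
proportions = {cls: n / total for cls, n in sorted(counts.items())}

# Ratio between the largest and smallest class; 1.0 means perfectly balanced.
imbalance_ratio = max(counts.values()) / min(counts.values())

for cls, share in proportions.items():
    print(f"class {cls}: {counts[cls]} samples ({share:.1%})")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

In the notebook itself, the same counts are available from `df['target'].value_counts()`.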

In [None]:
sns.pairplot(data=df)

> `pairplot` shows the marginal distribution of each column as a histogram along the diagonal. Each off-diagonal scatter plot shows the relationship between a pair of columns. See this [link](https://seaborn.pydata.org/generated/seaborn.pairplot.html) for further information on pairplot.
>
> **Double click** the plot to make it larger.
>
> Try the following code and see what changes!

```python
sns.pairplot(data=df, hue='target')
```

## Prepare dataset
Define the predictor and target variables, then split the dataset into train and test sets.

In [None]:
from sklearn.model_selection import train_test_split

X = wine.data # feature for predictor
y = wine.target # target to predict

# split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
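
Because the classes are not perfectly balanced, one optional refinement is a stratified split, which keeps the class proportions roughly equal in the train and test sets. The sketch below is an assumption about how this could be done, not what this notebook uses: it passes `stratify=y` to `train_test_split`, and the `_strat` variable names are made up for illustration so that the original split stays untouched.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine(as_frame=True)
X, y = wine.data, wine.target

# stratify=y keeps the class proportions of y in both subsets.
X_train_strat, X_test_strat, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Compare class shares in the full data and in each subset.
print(y.value_counts(normalize=True).sort_index())
print(y_train_strat.value_counts(normalize=True).sort_index())
print(y_test_strat.value_counts(normalize=True).sort_index())
```

The results reported in the rest of this notebook come from the unstratified split above.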

## Train the classifier model

We train three different classifiers: [KNN](https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification), [Complement Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html#complement-naive-bayes), and [SVM](https://scikit-learn.org/stable/modules/svm.html#classification).

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import ComplementNB
from sklearn.svm import SVC

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

knn

In [None]:
cnb = ComplementNB()
cnb.fit(X_train, y_train)
cnb

In [None]:
svm = SVC(random_state=42, probability=False, kernel='rbf', decision_function_shape='ovo', C=2)
svm.fit(X_train, y_train)
svm

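> For multi-class classification, `SVC` always trains with a one-vs-one scheme internally. The `decision_function_shape` parameter only controls the shape of the decision function that is returned: `ovo` keeps the raw pairwise outputs, while the default `ovr` converts them to one column per class. `C` is the regularization parameter; the regularization strength is inversely proportional to `C`, so smaller values regularize more strongly and reduce the risk of overfitting. See [here](https://scikit-learn.org/stable/modules/svm.html#multi-class-classification) for further explanation.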
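
To make the one-vs-one (`ovo`) idea concrete: with $n$ classes, a one-vs-one scheme trains one binary classifier per pair of classes, i.e. $n(n-1)/2$ classifiers. The short sketch below only enumerates those pairs for the three wine classes; it is an illustration, not scikit-learn's internal code.

```python
from itertools import combinations

class_names = ["class_0", "class_1", "class_2"]

# One binary classifier is trained for every unordered pair of classes.
pairs = list(combinations(class_names, 2))
n_classes = len(class_names)

for first, second in pairs:
    print(f"{first} vs {second}")

# The number of pairs matches n * (n - 1) / 2.
print(len(pairs), n_classes * (n_classes - 1) // 2)
```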

## 📊 Evaluate the model

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report, precision_recall_curve, ConfusionMatrixDisplay

# predict using knn
y_pred_knn = knn.predict(X_test)
print(classification_report(y_test,y_pred_knn))

ConfusionMatrixDisplay.from_estimator(knn, 
                                      X_test, 
                                      y_test, 
                                      display_labels=wine.target_names, 
                                      cmap=plt.cm.Blues)

In [None]:
# predict using cnb
y_pred_cnb = cnb.predict(X_test)
print(classification_report(y_test,y_pred_cnb))

ConfusionMatrixDisplay.from_estimator(cnb, 
                                      X_test, 
                                      y_test, 
                                      display_labels=wine.target_names, 
                                      cmap=plt.cm.Blues)

In [None]:
# predict using svm
y_pred_svm = svm.predict(X_test)
print(classification_report(y_test,y_pred_svm))

ConfusionMatrixDisplay.from_estimator(svm, 
                                      X_test, 
                                      y_test, 
                                      display_labels=wine.target_names, 
                                      cmap=plt.cm.Blues)
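
The per-class precision, recall and F1 values printed by `classification_report` can be derived from a confusion matrix. The sketch below does this by hand for a made-up 3x3 confusion matrix (rows are true classes, columns are predicted classes, matching the layout of `ConfusionMatrixDisplay`). The numbers are purely illustrative and are not the results of the models above.

```python
# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm = [
    [18, 1, 0],
    [2, 15, 4],
    [0, 6, 8],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)

# Accuracy: correctly classified samples (the diagonal) over all samples.
accuracy = sum(cm[i][i] for i in range(n_classes)) / total

metrics = {}
for k in range(n_classes):
    tp = cm[k][k]
    predicted_k = sum(cm[i][k] for i in range(n_classes))  # column sum
    actual_k = sum(cm[k])                                   # row sum
    precision = tp / predicted_k if predicted_k else 0.0
    recall = tp / actual_k if actual_k else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    metrics[k] = (precision, recall, f1)
    print(f"class {k}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

print(f"accuracy={accuracy:.2f}")
```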

## 🧠 Which one is better?

**KNN**

- Performs well on class 0, moderately on class 1, and poorly on class 2.
- Overall **accuracy** is 0.74.
- Likely struggles with class imbalance.
- Sensitive to the choice of $k$ and feature scaling.

**Complement Naive Bayes**

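- Performs perfectly on class 2 and moderately, at a similar level, on classes 0 and 1.
- The reason is unclear. Note that Complement NB is designed for non-negative, count-like features; the assumption of normally distributed features belongs to Gaussian NB.
- Overall **accuracy** is 0.67, the lowest of the three models.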

**Support Vector Machine (SVM)**

- Excellent on class 0, good performance on class 1, but poor on class 2.
- Overall **accuracy** is 0.78, the best model so far.
- Likely has trouble separating class 2 due to overlapping decision boundaries.
- May benefit from kernel tuning or class weighting.
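
Since KNN and the RBF-kernel SVM both rely on distances between samples, features with large numeric ranges can dominate the result. One common remedy, sketched below as an assumption rather than as part of this notebook's workflow, is to standardize the features inside a pipeline. `make_pipeline` and `StandardScaler` are scikit-learn utilities; the `knn_raw` and `knn_scaled` names are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42
)

# Baseline: KNN on the raw, unscaled features.
knn_raw = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# The scaler is fitted on the training data only, then applied to the test data.
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_scaled.fit(X_train, y_train)

acc_raw = knn_raw.score(X_test, y_test)
acc_scaled = knn_scaled.score(X_test, y_test)
print(f"unscaled accuracy: {acc_raw:.2f}")
print(f"scaled accuracy:   {acc_scaled:.2f}")
```

The same pattern applies to `SVC`. For the class-weighting idea, `SVC` also accepts `class_weight='balanced'`, which could be tried as an alternative to resampling.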

---

<div class="alert alert-block alert-info">
<b>NOTE:</b> 
<p>Both KNN &amp; SVM struggle with class 2, possibly due to class imbalance.</p>
<p>Complement NB is the weakest model; it does not generalize well to unseen data.</p>
</div>


## Resampling class for balancing

There are many techniques for resampling a dataset. We will use [SMOTE](https://imbalanced-learn.org/dev/references/generated/imblearn.over_sampling.SMOTE.html) ('Synthetic Minority Over-sampling Technique') to balance the dataset.

Resampling is performed only on the **train** set, so the class proportions in the **test** set are left unchanged.

First, make sure that the package is already installed by running the following code.

In [None]:
!pip show imbalanced-learn

> Run this code if it is not installed yet.

In [None]:
!pip install imbalanced-learn

In [None]:
from imblearn.over_sampling import SMOTE

# fixed seed so the synthetic samples are reproducible between runs
resample = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = resample.fit_resample(X_train, y_train)
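
For intuition, SMOTE builds each synthetic sample by picking a minority-class point, choosing one of its nearest minority-class neighbours, and placing a new point at a random position on the line segment between them. The sketch below illustrates only that interpolation step in plain Python. `interpolate` is a hypothetical helper and the two points are made up; this is not the imbalanced-learn implementation.

```python
import random

def interpolate(point, neighbour, rng):
    """Return a synthetic point on the segment between point and neighbour."""
    gap = rng.random()  # random position in [0, 1) along the segment
    return [p + gap * (n - p) for p, n in zip(point, neighbour)]

rng = random.Random(42)

# Two made-up minority-class samples with two features each.
minority_point = [1.0, 10.0]
minority_neighbour = [3.0, 14.0]

synthetic = interpolate(minority_point, minority_neighbour, rng)
print(synthetic)
```

Because every synthetic point lies between two existing minority samples, SMOTE adds new samples inside the region the minority class already occupies rather than duplicating rows.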

### Check the class proportion

In [None]:
print('Total initial sample:', len(X_train))
# check the target class proportion
y_train.value_counts()

In [None]:
print('Total sample after resampling:', len(X_train_resampled))
y_train_resampled.value_counts()

> Before resampling, the training set contains **124** samples with a **different** `count` for each target class.
>
> After resampling, the classes are balanced at **50 samples each**, for a new total of **150** samples.

## 🔍 Retrain the model
Let's train the models again with the resampled training set.

In [None]:
knn_resampled = KNeighborsClassifier(n_neighbors=5)
knn_resampled.fit(X_train_resampled, y_train_resampled)

# predict using knn
y_pred_knn_resampled = knn_resampled.predict(X_test)
print(classification_report(y_test, y_pred_knn_resampled))

ConfusionMatrixDisplay.from_estimator(knn_resampled, 
                                      X_test, 
                                      y_test, 
                                      display_labels=wine.target_names, 
                                      cmap=plt.cm.Blues)

In [None]:
cnb_resampled = ComplementNB()
cnb_resampled.fit(X_train_resampled, y_train_resampled)

# predict using cnb
y_pred_cnb_resampled = cnb_resampled.predict(X_test)
print(classification_report(y_test, y_pred_cnb_resampled))

ConfusionMatrixDisplay.from_estimator(cnb_resampled, 
                                      X_test, 
                                      y_test, 
                                      display_labels=wine.target_names, 
                                      cmap=plt.cm.Blues)

In [None]:
svm_resampled = SVC(random_state=42, decision_function_shape='ovo', C=2,
                    probability=False, kernel='rbf')
svm_resampled.fit(X_train_resampled, y_train_resampled)

# predict using svm
y_pred_svm_resampled = svm_resampled.predict(X_test)
print(classification_report(y_test, y_pred_svm_resampled))

ConfusionMatrixDisplay.from_estimator(svm_resampled, 
                                      X_test, 
                                      y_test, 
                                      display_labels=wine.target_names, 
                                      cmap=plt.cm.Blues)

> The performance of all three models improves after balancing the training set.
>
> The accuracy of each model:
> - **KNN**: 0.74 → 0.76
> - **CNB**: 0.67 → 0.78 (largest improvement: 11 percentage points, about 16% relative)
> - **SVM**: 0.78 → 0.80 (still the best performing model)
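
The difference between an absolute gain (in percentage points) and a relative gain (in percent of the original accuracy) can be checked with a few lines of arithmetic. The sketch below uses the rounded accuracies listed above; the helper name `improvement` is illustrative.

```python
# Accuracy before and after resampling, as listed above.
accuracies = {
    "KNN": (0.74, 0.76),
    "CNB": (0.67, 0.78),
    "SVM": (0.78, 0.80),
}

def improvement(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = (after - before) * 100
    relative = (after - before) / before * 100
    return absolute, relative

for name, (before, after) in accuracies.items():
    absolute, relative = improvement(before, after)
    print(f"{name}: +{absolute:.0f} points, +{relative:.1f}% relative")
```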