# Classification Exercises

This exercise uses the [wheat seeds](../dataset/wheat-seeds.csv) dataset, downloaded from this GitHub [repository](https://github.com/jbrownlee/Datasets).

**Data Set Information:**

The examined group comprised kernels belonging to three different varieties of wheat: **Kama**, **Rosa** and **Canadian**.

**Attribute Information:**

To construct the data, seven geometric parameters of wheat kernels were measured:
1. area,
2. perimeter,
3. compactness,
4. length of kernel,
5. width of kernel,
6. asymmetry coefficient,
7. length of kernel groove.

All of these parameters are real-valued and continuous.

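As a quick sanity check on how these attributes relate, the sketch below assumes compactness is defined as C = 4πA / P², which is the definition given in the UCI description of this dataset and is not stated in this notebook. The function name `compactness` is made up for illustration. The inputs are the area and perimeter of the first kernel in the dataset preview.

```python
import math


def compactness(area: float, perimeter: float) -> float:
    """Compactness C = 4*pi*A / P^2 (assumed definition; equals 1 for a perfect circle)."""
    return 4 * math.pi * area / perimeter ** 2


# First kernel in the dataset preview: area = 15.26, perimeter = 14.84
c = compactness(15.26, 14.84)
print(round(c, 3))  # should be close to the recorded compactness of 0.8710
```

Because the recorded area and perimeter are rounded, the recomputed value only agrees with the stored one to about three decimal places.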

To complete this exercise, refer to the [classification](classification.ipynb) notebook.

```python
# load libraries
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
```

```python
wheat = pd.read_csv('../dataset/wheat-seeds.csv')
wheat
```

| Unnamed: 0 | area | perimeter | compactness | length_kernel | width_kernel | asymmetry_coefficient | length_kernel_groove | variety |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 15.26 | 14.84 | 0.8710 | 5.763 | 3.312 | 2.221 | 5.220 | 1 |
| 1 | 14.88 | 14.57 | 0.8811 | 5.554 | 3.333 | 1.018 | 4.956 | 1 |
| 2 | 14.29 | 14.09 | 0.9050 | 5.291 | 3.337 | 2.699 | 4.825 | 1 |
| 3 | 13.84 | 13.94 | 0.8955 | 5.324 | 3.379 | 2.259 | 4.805 | 1 |
| 4 | 16.14 | 14.99 | 0.9034 | 5.658 | 3.562 | 1.355 | 5.175 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 205 | 12.19 | 13.20 | 0.8783 | 5.137 | 2.981 | 3.631 | 4.870 | 3 |
| 206 | 11.23 | 12.88 | 0.8511 | 5.140 | 2.795 | 4.325 | 5.003 | 3 |
| 207 | 13.20 | 13.66 | 0.8883 | 5.236 | 3.232 | 8.315 | 5.056 | 3 |
| 208 | 11.84 | 13.21 | 0.8521 | 5.175 | 2.836 | 3.598 | 5.044 | 3 |


## Exercise 1
Get basic information about the dataset and a statistical summary.

<details>
  <summary>Click for answer</summary>

  ```python
  # display basic info
  wheat.info()

  # statistical summary with 2 decimal places
  wheat.describe().round(2)
  ```
</details>

## Exercise 2
Check the class distribution of the target feature, `variety`, by plotting its counts.

<details>
  <summary>Click for answer</summary>

  ```python
  sns.countplot(data=wheat, x='variety')
  ```
</details>

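A count plot shows absolute counts. If you also want the proportions as numbers, one option is `value_counts(normalize=True)`. The sketch below uses a small hypothetical Series in place of `wheat['variety']`, so its values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for wheat['variety']; replace with the real column.
variety = pd.Series([1, 1, 1, 2, 2, 3], name='variety')

# Fraction of rows per class, ordered by class label
proportions = variety.value_counts(normalize=True).sort_index()
print(proportions.round(2))
```

On the real data, `wheat['variety'].value_counts(normalize=True)` would give the same kind of summary.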
## Exercise 3
1. Use all features as predictors and `variety` as the target.
2. Split the dataset into `train` and `test` sets with proportions of **75%** and **25%**, respectively.
3. Calculate the `counts` for each target class in the **train** set.

<details>
  <summary>Click for answer</summary>

  ```python
  from sklearn.model_selection import train_test_split

  X = wheat.drop('variety', axis=1)  # use all features as predictors, except variety
  y = wheat['variety']  # target to predict

  # split dataset: 75% train, 25% test
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

  # class counts in the train set
  y_train.value_counts()
  ```
</details>

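The split in the answer is purely random, so the class counts in the train set can drift away from those in the full data. As an optional variation, `train_test_split` accepts a `stratify` argument that preserves class ratios in both splits. The sketch below uses hypothetical toy arrays (`X_toy`, `y_toy`) rather than the wheat data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 120 samples, three equally sized classes
X_toy = np.arange(240).reshape(120, 2)
y_toy = np.repeat([1, 2, 3], 40)

# stratify=y_toy keeps the class ratio the same in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print(np.unique(y_tr, return_counts=True))
print(np.unique(y_te, return_counts=True))
```

A stratified split of already balanced data yields a balanced train set, so it would leave little for the resampling step in Exercise 6 to correct. Treat it as an alternative to explore, not as a replacement for the answer above.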
## Exercise 4
1. Fit `KNN`, `SVM`, and `RandomForest` classifiers on the train set.
2. Refer to this [link](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) for more information about **Random Forest**.

<details>
  <summary>Click for answer</summary>

  ```python
  # import necessary packages
  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.svm import SVC

  # create knn classifier. Feel free to set initial number of neighbors
  knn = KNeighborsClassifier(n_neighbors=5)
  knn.fit(X_train, y_train)

  # create SVM classifier model
  svm = SVC(random_state=42, probability=False, kernel='linear')
  svm.fit(X_train, y_train)

  # create random forest classifier model
  rf = RandomForestClassifier(max_depth=2, random_state=42)
  rf.fit(X_train, y_train)
  ```
</details>

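KNN and SVM both depend on distances between samples, so features on larger scales (for example `area`, with values around 11 to 16 in the preview) can outweigh features on smaller ones (such as `compactness`, around 0.85 to 0.90). Random forests are largely insensitive to this. The answer above does not scale the features. The sketch below shows one optional way to add scaling with a pipeline, using hypothetical synthetic data (`X_demo`, `y_demo`) in place of `X_train` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for X_train / y_train
X_demo, y_demo = make_classification(
    n_samples=150, n_features=7, n_informative=4,
    n_classes=3, random_state=42
)

# The scaler is fitted inside the pipeline, on the data passed to fit() only
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_scaled.fit(X_demo, y_demo)

preds = knn_scaled.predict(X_demo[:5])
print(preds)
```

Wrapping the scaler and the classifier together means the same transformation is applied automatically at prediction time, which avoids fitting the scaler on test data by mistake.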
## Exercise 5
1. Make predictions on the test set with each model.
2. Evaluate the models' performance by printing the classification report and plotting the confusion matrix.

<details>
  <summary>Click for answer</summary>
    
  ```python
    # import necessary packages
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
    from sklearn.metrics import classification_report, precision_recall_curve, ConfusionMatrixDisplay
    
    # get unique class value
    target_labels = np.unique(y_test)
    
    # predict using knn
    y_pred_knn = knn.predict(X_test)
    
    # zero_division=0 suppresses the warning raised when a classifier does not predict any instances of a particular class
    print('------ Classification report of KNN classifier ------')
    print(classification_report(y_test,y_pred_knn, zero_division=0))
    
    ConfusionMatrixDisplay.from_estimator(knn, 
                                          X_test, 
                                          y_test, 
                                          display_labels=target_labels,
                                          cmap=plt.cm.Blues)
    plt.title('Confusion Matrix of KNN Classifier')
    
    
    # predict using svm
    y_pred_svm = svm.predict(X_test)
    
    print('\n------ Classification report of SVM classifier ------')
    print(classification_report(y_test,y_pred_svm, zero_division=0))
    
    ConfusionMatrixDisplay.from_estimator(svm, 
                                          X_test, 
                                          y_test, 
                                          display_labels=target_labels,
                                          cmap=plt.cm.Blues)
    plt.title('Confusion Matrix of SVM Classifier')
    
    # predict using random forest
    y_pred_rf = rf.predict(X_test)
    
    print('\n------ Classification report of RF classifier ------')
    print(classification_report(y_test,y_pred_rf, zero_division=0))
    
    ConfusionMatrixDisplay.from_estimator(rf, 
                                          X_test, 
                                          y_test, 
                                          display_labels=target_labels,
                                          cmap=plt.cm.Blues)
    plt.title('Confusion Matrix of RF Classifier')
  ```
</details>

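To see how the numbers in a classification report relate to the confusion matrix, the sketch below computes per-class precision and recall by hand. The label arrays are hypothetical, and rows are taken as true classes and columns as predicted classes, which is the layout scikit-learn uses.

```python
import numpy as np

# Hypothetical true and predicted labels for a three-class problem
y_true = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])
y_hat = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 1])

labels = np.unique(y_true)

# cm[i, j] = number of samples with true class labels[i] predicted as labels[j]
cm = np.array([[np.sum((y_true == t) & (y_hat == p)) for p in labels] for t in labels])

precision = np.diag(cm) / cm.sum(axis=0)  # correct / all samples predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # correct / all samples truly in that class

print(cm)
print(precision.round(2), recall.round(2))
```

If a class is never predicted, its column sum is zero and precision is undefined. That is the case the `zero_division` argument of `classification_report` handles.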
## Exercise 6
1. Resample the **train** set using SMOTE to balance the classes.
2. Fit the models on the resampled train set.
3. Evaluate their performance.

<details>
  <summary>Click for answer of resampling</summary>

  ```python
  # run this code if imblearn package is not installed yet
  !pip install imbalanced-learn

  # run the following code in a different code cell
  from imblearn.over_sampling import SMOTE

  # fixed seed so the resampling is reproducible
  resample = SMOTE(random_state=42)
  X_train_resampled, y_train_resampled = resample.fit_resample(X_train, y_train)
  y_train_resampled.value_counts()
  ```
</details>

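For intuition about what SMOTE does, the sketch below illustrates its core idea: a synthetic sample is placed at a random position on the line segment between a minority-class sample and one of its nearest neighbours. This is a simplified illustration with hypothetical data and a made-up helper, `synthesize`. It is not the `imblearn` implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class samples with two features
minority = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 1.8], [2.0, 1.5]])


def synthesize(samples, k, rng):
    """Create one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(samples))
    dists = np.linalg.norm(samples - samples[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]  # skip position 0, the sample itself
    j = rng.choice(neighbours)
    gap = rng.random()  # position along the segment, in [0, 1)
    return samples[i] + gap * (samples[j] - samples[i])


new_point = synthesize(minority, k=2, rng=rng)
print(new_point)
```

Because each synthetic point is an interpolation between two existing samples, it always lies within the range already covered by the minority class.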
<details>
  <summary>Click for answer of fitting with resampled data</summary>

  ```python
  # import necessary packages
  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.svm import SVC

  # create knn classifier. Feel free to set initial number of neighbors
  knn = KNeighborsClassifier(n_neighbors=5)
  knn.fit(X_train_resampled, y_train_resampled)

  # create SVM classifier model
  svm = SVC(random_state=42, probability=True, kernel='linear')
  svm.fit(X_train_resampled, y_train_resampled)

  # create random forest classifier model
  rf = RandomForestClassifier(max_depth=2, random_state=42)
  rf.fit(X_train_resampled, y_train_resampled)
  ```
</details>

<details>
  <summary>Click for answer of model evaluation</summary>
    
  ```python
    # import necessary packages
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
    from sklearn.metrics import classification_report, precision_recall_curve, ConfusionMatrixDisplay
    
    # get unique class value
    target_labels = np.unique(y_test)
    
    # predict using knn
    y_pred_knn = knn.predict(X_test)
    
    # zero_division=0 suppresses the warning raised when a classifier does not predict any instances of a particular class
    print('------ Classification report of KNN classifier ------')
    print(classification_report(y_test,y_pred_knn, zero_division=0))
    
    ConfusionMatrixDisplay.from_estimator(knn, 
                                          X_test, 
                                          y_test, 
                                          display_labels=target_labels,
                                          cmap=plt.cm.Blues)
    plt.title('Confusion Matrix of KNN Classifier')
    
    
    # predict using svm
    y_pred_svm = svm.predict(X_test)
    
    print('\n------ Classification report of SVM classifier ------')
    print(classification_report(y_test,y_pred_svm, zero_division=0))
    
    ConfusionMatrixDisplay.from_estimator(svm, 
                                          X_test, 
                                          y_test, 
                                          display_labels=target_labels,
                                          cmap=plt.cm.Blues)
    plt.title('Confusion Matrix of SVM Classifier')
    
    # predict using random forest
    y_pred_rf = rf.predict(X_test)
    
    print('\n------ Classification report of RF classifier ------')
    print(classification_report(y_test,y_pred_rf, zero_division=0))
    
    ConfusionMatrixDisplay.from_estimator(rf, 
                                          X_test, 
                                          y_test, 
                                          display_labels=target_labels,
                                          cmap=plt.cm.Blues)
    plt.title('Confusion Matrix of RF Classifier')
  ```
</details>

> Compared with the previous results, do you see any change in performance?
>
> What is your analysis?