# Categorical Support Vector Classifier

This example uses a dataset about mushrooms. The goal of the classifier is to predict, from a mushroom's features, whether it is edible or poisonous.

The first column, `poisonous`, is the dependent variable: the one we'll be trying to predict.

## Imports

In [None]:
import pandas as pd 
from sklearn.preprocessing import MinMaxScaler # to rescale the data to a fixed range
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder # to encode the categorical data into numerical values
from sklearn import svm # the support vector machine algorithms
import matplotlib.pyplot as plt 
from sklearn.metrics import ( # for model evaluation
    confusion_matrix, accuracy_score, precision_score, recall_score,
    f1_score, classification_report, ConfusionMatrixDisplay,
)
from sklearn.model_selection import GridSearchCV # for cross-validation and parameter tuning

## Loading the dataset

#### Read the csv file and study the dataset

In [None]:
mushroom_data = pd.read_csv("data/mushrooms-full-dataset.csv", dtype = str)
mushroom_data.head()

In [None]:
mushroom_data['poisonous'].value_counts()

In [None]:
mushroom_data.isnull().sum()

## Preprocessing

#### Define the target and the inputs

In [None]:
target = mushroom_data['poisonous']
inputs = mushroom_data.drop(['poisonous'],axis=1)

#### Create a training and a testing dataset

In [None]:
x_train, x_test, y_train, y_test = train_test_split(inputs, target, test_size=0.2, random_state=365, stratify = target)

#### Check the result from the stratification

In [None]:
y_train.value_counts(normalize = True)

In [None]:
y_test.value_counts(normalize = True)

#### Define a separate encoder for the target and the inputs 

In [None]:
enc_t = LabelEncoder() # encodes the target labels as integers; the original labels ('e' for edible, 'p' for poisonous) are kept in enc_t.classes_
enc_i = OrdinalEncoder() # encodes each categorical input column as integer codes; the original categories are kept in enc_i.categories_
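
To make the difference between the two encoders concrete, here is a minimal sketch on made-up data. The toy labels and feature values below, and all names starting with `toy_`, are assumptions for illustration and are not taken from the mushroom dataset.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical toy data, not taken from the mushroom dataset
toy_target = ["p", "e", "e", "p"]
toy_inputs = [["x", "n"], ["b", "y"], ["x", "y"]]

# LabelEncoder works on a 1-D target: each label becomes its index in classes_
toy_enc_t = LabelEncoder()
toy_target_codes = toy_enc_t.fit_transform(toy_target)
print(toy_enc_t.classes_)   # sorted unique labels
print(toy_target_codes)     # integer code for every sample

# OrdinalEncoder works on a 2-D feature matrix: one set of codes per column
toy_enc_i = OrdinalEncoder()
toy_input_codes = toy_enc_i.fit_transform(toy_inputs)
print(toy_enc_i.categories_)  # sorted categories found in each column
print(toy_input_codes)        # codes are returned as floats

# The integer codes can be mapped back to the original labels
print(toy_enc_t.inverse_transform(toy_target_codes))
```

Both encoders assign codes in sorted order of the values they see during fitting, so the codes carry no meaning beyond identifying a category.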

#### Apply the fit_transform() method on the training data and the transform() method on the test data.

In [None]:
x_train_transf = enc_i.fit_transform(x_train)
x_test_transf = enc_i.transform(x_test)

y_train_transf = enc_t.fit_transform(y_train)
y_test_transf = enc_t.transform(y_test)

In [None]:
# just checking if the encoding was successful
y_train_transf

In [None]:
# just checking if the encoding was successful
x_train_transf

## Rescaling

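> __Important__: SVMs are sensitive to the scale of their inputs, so the inputs (but not the targets) should be rescaled before fitting the SVC. Here we rescale them to the range [-1, 1], fitting the scaler on the training data only.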

In [None]:
scaling = MinMaxScaler(feature_range=(-1,1)).fit(x_train_transf)
x_train_rescaled = scaling.transform(x_train_transf)
x_test_rescaled = scaling.transform(x_test_transf)
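
As a small illustration of what fitting the scaler on the training data means, the sketch below uses a made-up single-feature array (the `toy_` names and values are assumptions). The training values map exactly onto [-1, 1], while test values are transformed with the training minimum and maximum and so can land outside that range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data
toy_train = np.array([[0.0], [2.0], [4.0]])
toy_test = np.array([[1.0], [5.0]])  # 5.0 lies outside the training range

# The minimum and maximum are learned from the training data only
toy_scaler = MinMaxScaler(feature_range=(-1, 1)).fit(toy_train)

toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_train_scaled.ravel())  # [-1.  0.  1.]
print(toy_test_scaled.ravel())   # [-0.5  1.5]
```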

## Classification

We'll start off by trying a linear SVM.

In [None]:
C = 1.0 # regularization parameter: a smaller C tolerates more margin violations (wider margin), a larger C penalizes them more (narrower margin)
svc = svm.SVC(kernel='linear', C=C).fit(x_train_rescaled, y_train_transf)
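
The effect of `C` on the margin can be checked on synthetic data. The sketch below is an illustration under stated assumptions: the two overlapping Gaussian clusters and the helper `margin_width` are made up for this example. For a linear kernel the margin width is 2 / ||w||, where w is the fitted `coef_`.

```python
import numpy as np
from sklearn import svm

# Synthetic data: two overlapping 2-D Gaussian clusters (hypothetical)
rng = np.random.default_rng(0)
toy_x = np.vstack([
    rng.normal(loc=-1.0, scale=1.0, size=(50, 2)),
    rng.normal(loc=1.0, scale=1.0, size=(50, 2)),
])
toy_y = np.array([0] * 50 + [1] * 50)

def margin_width(c_value):
    """Fit a linear SVC with the given C and return the margin width 2 / ||w||."""
    model = svm.SVC(kernel="linear", C=c_value).fit(toy_x, toy_y)
    return 2.0 / np.linalg.norm(model.coef_)

margin_small_c = margin_width(0.001)  # weak penalty on margin violations
margin_large_c = margin_width(100.0)  # strong penalty on margin violations

print("margin width with C=0.001:", margin_small_c)
print("margin width with C=100:  ", margin_large_c)
```

On this data the small-`C` model ends up with the wider margin, which is the trade-off the `C` parameter controls.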

In [None]:
enc_t.classes_

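#### Create a new dataframe with the encoded variables

In [None]:
features_list = inputs.columns # all columns except the target
features_list

In [None]:
# encoded training inputs together with the encoded training target
data_enc = pd.DataFrame(x_train_transf, columns = features_list)
data_enc['poisonous'] = y_train_transf
data_enc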

## Evaluation

#### Evaluate the model on the test data

In [None]:
y_pred_test = svc.predict(x_test_rescaled)

In [None]:
fig, ax = plt.subplots(figsize=(8, 5))

cmp = ConfusionMatrixDisplay(
    confusion_matrix(y_test_transf, y_pred_test),
    display_labels=["Edible", "Poisonous"],
)

cmp.plot(ax=ax);

In [None]:
print(classification_report(y_test_transf, y_pred_test, target_names = ["Edible", "Poisonous"]))
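
The precision and recall figures in the classification report can be traced back to the confusion matrix. The following sketch uses made-up labels (the `toy_` arrays are assumptions, with 1 standing for the positive class) and compares the manual calculation with scikit-learn's `precision_score` and `recall_score`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels (0 = negative class, 1 = positive class)
toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

manual_precision = tp / (tp + fp)  # share of predicted positives that are correct
manual_recall = tp / (tp + fn)     # share of actual positives that are found

print(manual_precision, precision_score(toy_true, toy_pred))  # 0.6 0.6
print(manual_recall, recall_score(toy_true, toy_pred))        # 0.75 0.75
```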

## Hyperparameter Tuning with GridSearchCV

#### Choose the best kernel and the optimal C parameter based on cross-validation on the training data

In [None]:
tuned_parameters = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["poly"], "C":[1, 10]},
    {"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10]}
]

In [None]:
scores = ["precision", "recall"]

In [None]:
for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(svm.SVC(), tuned_parameters, scoring="%s_macro" % score)
    clf.fit(x_train_rescaled, y_train_transf)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_["mean_test_score"]
    stds = clf.cv_results_["std_test_score"]
    for mean, std, params in zip(means, stds, clf.cv_results_["params"]):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test_transf, clf.predict(x_test_rescaled)
    print(classification_report(y_true, y_pred))
    print()
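
The grid scores printed in the loop can also be inspected as a table. The sketch below is one possible way to do that, shown on synthetic data from `make_classification` so that it runs on its own; the `toy_` names and the reduced grid are assumptions for illustration, not the tuning setup used on the mushroom data.

```python
import pandas as pd
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (hypothetical stand-in for the real inputs)
toy_x, toy_y = make_classification(n_samples=120, n_features=5, random_state=0)

toy_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10]},
]

toy_search = GridSearchCV(svm.SVC(), toy_grid, scoring="recall_macro", cv=5)
toy_search.fit(toy_x, toy_y)

# cv_results_ is a dict of arrays, so it converts directly to a DataFrame
toy_results = (
    pd.DataFrame(toy_search.cv_results_)
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)

print(toy_results.to_string(index=False))
print(toy_search.best_params_)
```

Sorting by `rank_test_score` puts the best-scoring parameter combination in the first row, which makes near ties between kernels easier to spot than in the printed list.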
