Ref:

https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html 
https://www.vebuso.com/2020/03/svm-hyperparameter-tuning-using-gridsearchcv/ 
https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html 

(a) Download the Anuran Calls (MFCCs) Data Set from: https://archive.ics.uci.edu/ml/datasets/Anuran+Calls+%28MFCCs%29. Choose 70% of the data
randomly as the training set.

In [49]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [50]:
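data_path = "../data/"
data_file = data_path + "Frogs_MFCCs.csv"
data = pd.read_csv(data_file)

col = list(data.columns)
# The last four columns are Family, Genus, Species, RecordID.
# This reversed slice picks the three label columns in the order Species, Genus, Family.
labels = col[-2:-5:-1]

# Features: every column except the three labels and RecordID (the 22 MFCC columns).
x = data[sorted(set(col) - set(col[-1:-5:-1]))]
y = data[labels]

# Hold out 30% of the rows for testing; the remaining 70% form the training set.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=50)
print(x.shape, y.shape)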

(7195, 22) (7195, 3)


(b) Each instance has three labels: Families, Genus, and Species. Each of the labels
has multiple classes. We wish to solve a multi-class and multi-label problem.
One of the most important approaches to multi-label classification is to train a
classifier for each label (binary relevance). We first try this approach:


i. Research exact match and Hamming score/loss methods for evaluating multi-label classification and use them in evaluating the classifiers in this problem.


Ref : https://scikit-learn.org/stable/modules/generated/sklearn.metrics.hamming_loss.html

https://mmuratarat.github.io/2020-01-25/multilabel_classification_metrics

The exact match ratio extends single-label accuracy to multi-label prediction: an instance counts as correct only if every one of its labels is predicted correctly, so partially correct predictions are treated as incorrect. The disadvantage of this measure is that it does not distinguish between completely incorrect and partially correct predictions, which can be considered harsh.


The Hamming loss is the fraction of labels that are incorrectly predicted.

In multilabel classification, the Hamming loss is different from the subset zero-one loss. The zero-one loss considers the entire set of labels for a given sample incorrect if it does not entirely match the true set of labels. Hamming loss is more forgiving in that it penalizes only the individual labels.

The Hamming loss is upper-bounded by the subset zero-one loss when the normalize parameter of the zero-one loss is set to True. It always lies between 0 and 1, with lower values being better.
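
The two measures can be compared on a small hand-made example. The sketch below is only an illustration: the helper names (exact_match_ratio, hamming_loss_multilabel) and the four toy rows are assumptions and are not part of the notebook or of scikit-learn.

```python
def exact_match_ratio(y_true, y_pred):
    """Fraction of instances whose whole label tuple is predicted correctly."""
    hits = sum(tuple(t) == tuple(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def hamming_loss_multilabel(y_true, y_pred):
    """Fraction of individual (instance, label) entries predicted incorrectly."""
    wrong = sum(t_k != p_k
                for t, p in zip(y_true, y_pred)
                for t_k, p_k in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))


# Toy rows made up for illustration: (Family, Genus, Species).
y_true = [
    ("FamA", "GenA1", "SpA1x"),
    ("FamA", "GenA1", "SpA1y"),
    ("FamB", "GenB1", "SpB1x"),
    ("FamB", "GenB1", "SpB1y"),
]
y_pred = [
    ("FamA", "GenA1", "SpA1x"),  # all three labels correct
    ("FamA", "GenA1", "SpA1x"),  # only the species is wrong
    ("FamB", "GenB1", "SpB1x"),  # all three labels correct
    ("FamA", "GenA1", "SpA1y"),  # all three labels wrong
]

emr = exact_match_ratio(y_true, y_pred)       # 2 of 4 rows match entirely
hl = hamming_loss_multilabel(y_true, y_pred)  # 4 of 12 entries are wrong
print("Exact match ratio:", emr)
print("Hamming loss:", hl)
print("Subset zero-one loss:", 1 - emr)
```

The second and fourth rows both count as failures for exact match, while the Hamming loss charges them one and three errors respectively, which is why it never exceeds the subset zero-one loss.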

ii. Train an SVM for each of the labels, using Gaussian kernels and one versus
all classifiers. Determine the weight of the SVM penalty and the width of
the Gaussian Kernel using 10 fold cross validation. You are welcome to try
to solve the problem with both standardized and raw attributes and report
the results.
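
In scikit-learn's SVC the width of the Gaussian kernel is controlled by gamma: the kernel value is exp(-gamma * ||x - z||^2), so a larger gamma gives a narrower kernel, and the penalty weight is C. A minimal numeric sketch of the kernel, using made-up points and a helper name (rbf_kernel_value) that is not part of the notebook:

```python
import numpy as np


def rbf_kernel_value(x, z, gamma):
    """Gaussian (RBF) kernel value exp(-gamma * ||x - z||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))


x = [0.0, 0.0]
z = [1.0, 1.0]  # squared Euclidean distance to x is 2

# As gamma grows, the similarity between the same two points shrinks towards 0.
for gamma in (0.1, 1.0, 10.0):
    print(gamma, rbf_kernel_value(x, z, gamma))
```

With the common parametrization by a standard deviation sigma, gamma corresponds to 1 / (2 * sigma^2), so searching over gamma is equivalent to searching over the kernel width.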

In [52]:
from sklearn.metrics import hamming_loss

In [53]:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import RepeatedKFold, StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import multilabel_confusion_matrix, ConfusionMatrixDisplay, confusion_matrix
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

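# RBF (Gaussian) kernel SVM. Note: SVC always trains one-vs-one internally;
# decision_function_shape="ovr" only changes the shape of the decision values.
svm = SVC(random_state=42, kernel="rbf", decision_function_shape="ovr")

# 10-fold stratified cross-validation, shared by the grid searches below.
cross_validator = StratifiedKFold(n_splits=10, shuffle=True, random_state=69)
# Coarse search, split into a low and a high range of C (penalty weight) and gamma (kernel width).
param1 = {'C': [0.0001, 0.001, 0.01, 0.1], 'gamma': [0.0, 0.2, 0.4, 0.6, 0.8, 1]}
param2 = {'C': [1, 10, 100, 1000], 'gamma': [1.2, 1.4, 1.6, 1.8, 2]}
gridcv_low = GridSearchCV(svm, param_grid=param1, cv=cross_validator, refit=True)
gridcv_high = GridSearchCV(svm, param_grid=param2, cv=cross_validator, refit=True)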

In [54]:
gridcv_low0 = gridcv_low.fit(x_train,y_train[labels[0]])
gridcv_high0 = gridcv_high.fit(x_train,y_train[labels[0]])
print(gridcv_low0.best_estimator_, gridcv_high0.best_estimator_ )

SVC(C=0.1, gamma=1, random_state=42) SVC(C=10, gamma=1.6, random_state=42)


In [55]:
gridcv_low1 = gridcv_low.fit(x_train,y_train[labels[1]])
gridcv_high1 = gridcv_high.fit(x_train,y_train[labels[1]])
print(gridcv_low1.best_estimator_, gridcv_high1.best_estimator_ )

SVC(C=0.1, gamma=1, random_state=42) SVC(C=100, gamma=1.4, random_state=42)


In [56]:
gridcv_low2 = gridcv_low.fit(x_train,y_train[labels[2]])
gridcv_high2 = gridcv_high.fit(x_train,y_train[labels[2]])
print(gridcv_low2.best_estimator_, gridcv_high2.best_estimator_ )

SVC(C=0.1, gamma=1, random_state=42) SVC(C=100, gamma=1.8, random_state=42)
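
One caveat about the three cells above: GridSearchCV.fit returns the search object itself, so gridcv_low0, gridcv_low1 and gridcv_low2 are all names for the same object, and each new fit overwrites the previous one. The printed results are still correct because each print runs right after its fit. A sketch of one way to keep an independent search per label, assuming synthetic stand-in data (make_classification) and illustrative names (targets, searches) rather than the frog data:

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): one feature matrix, two label columns.
X, y_a = make_classification(n_samples=200, n_features=6, n_informative=4,
                             n_classes=3, random_state=0)
y_b = (y_a > 0).astype(int)  # a coarser second label, built only for illustration
targets = {"label_a": y_a, "label_b": y_b}

base_search = GridSearchCV(
    SVC(kernel="rbf", decision_function_shape="ovr"),
    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)

# clone() returns an unfitted copy, so every label gets its own fitted search.
searches = {name: clone(base_search).fit(X, y) for name, y in targets.items()}

for name, search in searches.items():
    print(name, search.best_params_, round(search.best_score_, 3))
```

Keeping the fitted searches separate makes it possible to call predict for any label later without refitting.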


In [57]:
param = { 'C':[0.1,1,10],'gamma': [1,1.2,1.4,1.6]}
gridcv = GridSearchCV(svm,param_grid= param, cv= cross_validator,refit=True)
gridcv0 = gridcv.fit(x_train,y_train[labels[0]])
print(gridcv0.best_estimator_)
y_test_pred0 = gridcv0.predict(x_test)

SVC(C=10, gamma=1.6, random_state=42)


In [74]:
print(hamming_loss(y_test[labels[0]], y_test_pred0))


0.008800370541917554


In [59]:
param = { 'C':[0.1,1,10, 100],'gamma': [1,1.2,1.4]}
gridcv = GridSearchCV(svm,param_grid= param, cv= cross_validator,refit=True)
gridcv1 = gridcv.fit(x_train,y_train[labels[1]])
print(gridcv1.best_estimator_)
y_test_pred1 = gridcv1.predict(x_test)

SVC(C=100, gamma=1.4, random_state=42)


In [75]:
print(hamming_loss(y_test[labels[1]], y_test_pred1))

0.009726725335803613


In [61]:
param = { 'C':[0.1,1,10, 100],'gamma': [1,1.2,1.4,1.6,1.8]}
gridcv = GridSearchCV(svm,param_grid= param, cv= cross_validator,refit=True)
gridcv2 = gridcv.fit(x_train,y_train[labels[2]])
print(gridcv2.best_estimator_)
y_test_pred2 = gridcv2.predict(x_test)

SVC(C=100, gamma=1.8, random_state=42)


In [76]:
print(hamming_loss(y_test[labels[2]], y_test_pred2))

0.007874015748031496


In [117]:
def exact_match(df1, df2):
    """Fraction of rows in which every label column of df2 equals df1.

    Rows are compared by position (not by index), using the columns of df1.
    """
    matches = df1.values == df2[df1.columns].values
    return matches.all(axis=1).mean()

In [119]:
y_pred_svm = pd.DataFrame({labels[0]:y_test_pred0, labels[1]:y_test_pred1,labels[2]:y_test_pred2}, columns = labels)
print("Exact match -", exact_match(y_test,y_pred_svm))
print("Avg Hamming loss - ", (hamming_loss(y_test[labels[0]], y_test_pred0) + hamming_loss(y_test[labels[1]], y_test_pred1) + hamming_loss(y_test[labels[2]], y_test_pred2) )/3 )

Exact match - 0.9888837424733673
Avg Hamming loss -  0.008800370541917554


(iii) Repeat 1(b)ii with L1-penalized SVMs. Remember to standardize
the attributes. Determine the weight of the SVM penalty using 10 fold cross validation

In [120]:
#Standardizing
from sklearn import preprocessing
from sklearn.svm import LinearSVC

scaler = preprocessing.StandardScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_train_scaled = pd.DataFrame(x_train_scaled, columns = x_train.columns)

x_test_scaled = scaler.transform(x_test) 
x_test_scaled = pd.DataFrame(x_test_scaled, columns = x_test.columns)

linear_svm = LinearSVC(penalty='l1',  multi_class='ovr',  dual=False, max_iter =100000)
param = { 'C':[0.0001,0.001,0.01,0.1,1,10,100,1000]}
cross_validator = StratifiedKFold(n_splits=10, shuffle=True, random_state=69)
gridcv_linear = GridSearchCV(linear_svm,param_grid= param, cv= cross_validator,refit=True)
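
Standardizing here means subtracting each attribute's training mean and dividing by its training standard deviation, then applying those same statistics to the test set. A minimal NumPy sketch of that computation on made-up numbers (the arrays train and test are illustrative, not the frog data):

```python
import numpy as np

# Made-up feature matrices standing in for x_train and x_test.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Statistics come from the training rows only (population std, ddof=0).
mean = train.mean(axis=0)
std = train.std(axis=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics on the test rows

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

After scaling, the training columns have zero mean and unit variance; the test rows generally do not, because they are transformed with the training statistics.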


In [121]:
gridcv_linear0 = gridcv_linear.fit(x_train_scaled,y_train[labels[0]])
print(gridcv_linear0.best_estimator_)
y_test_pred_linear0 = gridcv_linear0.predict(x_test_scaled)



LinearSVC(C=1, dual=False, max_iter=100000, penalty='l1')


In [122]:
gridcv_linear1 = gridcv_linear.fit(x_train_scaled,y_train[labels[1]])
print(gridcv_linear1.best_estimator_)
y_test_pred_linear1 = gridcv_linear1.predict(x_test_scaled)

LinearSVC(C=100, dual=False, max_iter=100000, penalty='l1')


In [123]:
gridcv_linear2 = gridcv_linear.fit(x_train_scaled,y_train[labels[2]])
print(gridcv_linear2.best_estimator_)
y_test_pred_linear2 = gridcv_linear2.predict(x_test_scaled)

LinearSVC(C=1, dual=False, max_iter=100000, penalty='l1')


In [124]:
# labels is ordered Species, Genus, Family, so print the actual column names.
print(f"hamming_loss for label {labels[0]} - ", hamming_loss(y_test[labels[0]], y_test_pred_linear0))
print(f"hamming_loss for label {labels[1]} - ", hamming_loss(y_test[labels[1]], y_test_pred_linear1))
print(f"hamming_loss for label {labels[2]} - ", hamming_loss(y_test[labels[2]], y_test_pred_linear2))

hamming_loss for label Species -  0.04353867531264474
hamming_loss for label Genus -  0.05141269106067624
hamming_loss for label Family -  0.06530801296896711


In [125]:
y_pred_linear = pd.DataFrame({labels[0]:y_test_pred_linear0, labels[1]:y_test_pred_linear1,labels[2]:y_test_pred_linear2}, columns = labels)
print("Exact match -", exact_match(y_test,y_pred_linear))
print("Avg Hamming loss - ", (hamming_loss(y_test[labels[0]], y_test_pred_linear0) + hamming_loss(y_test[labels[1]], y_test_pred_linear1) + hamming_loss(y_test[labels[2]], y_test_pred_linear2) )/3 )

Exact match - 0.9124594719777674
Avg Hamming loss -  0.05341979311409603


(iv) Repeat 1(b)iii by using SMOTE or any other method you know to remedy
class imbalance. Report your conclusions about the classifiers you trained.

In [126]:
from imblearn.pipeline import Pipeline as imbpipeline
from imblearn.over_sampling import SMOTE
n_folds = 10
param = {'classifier__C': [1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100, 1000, 10000]}
linear_svm = LinearSVC(penalty='l1', multi_class='ovr', dual=False, max_iter=100000)
# SMOTE sits inside the pipeline so that oversampling happens only when fitting,
# i.e. on the training part of each cross-validation fold.
pipeline = imbpipeline(steps=[['smote', SMOTE(random_state=11)],
                              ['classifier', linear_svm]
                              ])
cross_validator = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=69)
gridSearchCV = GridSearchCV(estimator=pipeline, param_grid=param, cv=cross_validator, n_jobs=10, scoring='f1_micro')
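
The core idea of SMOTE is to create synthetic minority-class rows by interpolating between a minority row and one of its nearest minority neighbours. The sketch below illustrates that step in NumPy under simplifying assumptions; the function name (smote_like_samples) and the toy points are made up, and this is not the imblearn implementation used in the cell above.

```python
import numpy as np


def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows and their neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise squared distances between minority rows.
    d2 = ((minority[:, None, :] - minority[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)                # a row is not its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest rows
    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))      # pick a minority row at random
        nb = neighbours[base, rng.integers(k)]  # pick one of its k neighbours
        u = rng.random()                        # interpolation weight in [0, 1)
        synthetic[i] = minority[base] + u * (minority[nb] - minority[base])
    return synthetic


# Four made-up minority-class points at the corners of the unit square.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = smote_like_samples(minority, n_new=5)
print(new_rows)
```

Every synthetic row lies on a line segment between two existing minority rows, so the oversampled class stays inside the region the original minority points already span.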


In [138]:
girdcv_sm0 = gridSearchCV.fit(x_train_scaled,y_train[labels[0]])

In [139]:
cv_score = girdcv_sm0.best_score_
y_test_pred_sm0 = girdcv_sm0.predict(x_test_scaled)
print(hamming_loss(y_test[labels[0]], y_test_pred_sm0))

0.04955998147290412


In [141]:
girdcv_sm1 = gridSearchCV.fit(x_train_scaled,y_train[labels[1]])

In [142]:
cv_score1 = girdcv_sm1.best_score_
y_test_pred_sm1 = girdcv_sm1.predict(x_test_scaled)

print(hamming_loss(y_test[labels[1]], y_test_pred_sm1))

0.0968040759610931


In [132]:
girdcv_sm2 = gridSearchCV.fit(x_train_scaled,y_train[labels[2]])

In [133]:
cv_score = girdcv_sm2.best_score_
y_test_pred_sm2 = girdcv_sm2.predict(x_test_scaled)

print(hamming_loss(y_test[labels[2]], y_test_pred_sm2))

0.0856878184344604


In [134]:
y_pred_sm = pd.DataFrame({labels[0]:y_test_pred_sm0, labels[1]:y_test_pred_sm1,labels[2]:y_test_pred_sm2}, columns = labels)
print("Exact match -", exact_match(y_test,y_pred_sm))
print("Avg Hamming loss - ", (hamming_loss(y_test[labels[0]], y_test_pred_sm0) + hamming_loss(y_test[labels[1]], y_test_pred_sm1) + hamming_loss(y_test[labels[2]], y_test_pred_sm2) )/3 )

Exact match - 0.8499305233904585
Avg Hamming loss -  0.07735062528948587
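
To compare the three approaches side by side, the test-set numbers printed in the cells above can be collected in one table. The sketch below only copies those reported values; the dictionary name (results) and the row descriptions are illustrative.

```python
# Test-set results copied from the cell outputs above:
# (exact match ratio, average Hamming loss over the three labels).
results = {
    "Gaussian SVM (raw attributes)": (0.9888837424733673, 0.008800370541917554),
    "L1-penalized linear SVM (standardized)": (0.9124594719777674, 0.05341979311409603),
    "L1-penalized linear SVM + SMOTE": (0.8499305233904585, 0.07735062528948587),
}

print(f"{'Classifier':<42}{'Exact match':>12}{'Avg Hamming loss':>18}")
for name, (exact, hamming) in results.items():
    print(f"{name:<42}{exact:>12.4f}{hamming:>18.4f}")

best = max(results, key=lambda name: results[name][0])
lowest_loss = min(results, key=lambda name: results[name][1])
print("Highest exact match:", best)
print("Lowest average Hamming loss:", lowest_loss)
```

On this train/test split the Gaussian-kernel SVM scores best on both measures, and the SMOTE variant scores below the plain L1-penalized linear SVM. Both measures weight every test instance equally, so they may not show whether SMOTE helped the rare classes; per-class metrics would be needed to check that.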
