# HW5
## Yidan Wang 2973331278

### 1. Multi-class and Multi-Label Classiﬁcation Using Support Vector Machines

#### (a) process the data

In [2]:
# load the data and packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

raw_data = pd.read_csv('../data/Anuran Calls (MFCCs)/Frogs_MFCCs.csv')


x_data = raw_data.iloc[:,:-4]
y = raw_data.iloc[:,-4:-1]

X_train,X_test, Y_train_all, Y_test_all = train_test_split(x_data, y, test_size=0.3, random_state=434, stratify=y)

#### (b) Each instance has three labels: Families, Genus, and Species. Each of the labels has multiple classes. We wish to solve a multi-class and multi-label problem. One of the most important approaches to multi-label classiﬁcation is to train a classiﬁer for each label (binary relevance). We ﬁrst try this approach:

#### i. Research exact match and hamming score/ loss methods for evaluating multilabel classiﬁcation and use them in evaluating the classiﬁers in this problem.

#### we can use svm.score(X, y, sample_weight=None) to get the exact match. In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
#### we can use sklearn.metrics.hamming_loss(y_true, y_pred, *, sample_weight=None) to get the hamming loss.

#### ii. Train a SVM for each of the labels, using Gaussian kernels and one versus all classiﬁers. Determine the weight of the SVM penalty and the width of the Gaussian Kernel using 10 fold cross validation. 1 You are welcome to try to solve the problem with both standardized 2 and raw attributes and report the results.

#### Answer:
#### from the results shown below, we can see the raw attributes are better than the standardized attributes. 

In [5]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.metrics import hamming_loss
from sklearn.model_selection import GridSearchCV

# find the C range roughly
def find_c(k, X_train, y_train):
    C_range_scale = np.logspace(-3, 5, 9)
    C_range = []
    for c in C_range_scale:
        clf = SVC(decision_function_shape='ovr', C=c, kernel=k)
        clf.fit(X_train, y_train)
        s_score = clf.score(X_train, y_train)
        if s_score == 1:
            C_range.append(c)
            break
        elif s_score > 0.7:
            C_range.append(c)
    return np.array(C_range)

def CV_processing(k, X_train, y_train, class_name):
    C_range = find_c(k=k, X_train = X_train, y_train=y_train)
    gamma_range = np.linspace(0.1, 3.5, 18)
    param_grid = dict(gamma=gamma_range, C=C_range)
    grid = GridSearchCV(SVC(decision_function_shape='ovr', kernel=k), param_grid=param_grid, cv=10, n_jobs=-1)
    grid.fit(X_train, y_train)
    print(
    "The best parameters of %s are %s with a CV score of %0.2f"
    % (class_name, grid.best_params_, grid.best_score_))
    return grid.best_params_['C'], grid.best_params_['gamma']

def fit_predict_result(k, class_name, X_train = X_train, X_test = X_test):
    y_train = Y_train_all[class_name]
    y_test = Y_test_all[class_name]
    c, g = CV_processing(k, X_train, y_train, class_name)
    model = SVC(kernel=k, C=c, gamma=g, decision_function_shape='ovr')
    model.fit(X_train, y_train)
    emr = model.score(X_test, y_test)
    y_pred = model.predict(X_test)
    hl = hamming_loss(y_test, y_pred)
    return emr, hl


In [6]:
# for raw X

print("Gaussian Kernel:")
print("Raw attributes:")
for class_name in ["Family", "Genus", "Species"]:
    k = "rbf" # default
    emr_1, hl_1 = fit_predict_result(k = k, class_name = class_name)   
    print("For class: %s, the test score(exact match rate) is %s, while the hamming loss is %s."
         % (class_name, emr_1, hl_1))
    print("\n")  

    
# for std X
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)
train_x_st = scaler.transform(X_train)
test_x_st = scaler.transform(X_test)

print("Standardized attributes:")
for class_name in ["Family", "Genus", "Species"]:
    k = "rbf" # default
    emr_2, hl_2 = fit_predict_result(k = k, X_train = train_x_st, X_test = test_x_st, class_name = class_name)   
    print("For class: %s, the test score(exact match rate) is %s, while the hamming loss is %s."
         % (class_name, emr_2, hl_2))
    print("\n")  


Gaussian Kernel:
Raw attributes:
The best parameters of Family are {'C': 10.0, 'gamma': 2.5} with a CV score of 0.99
For class: Family, the test score(exact match rate) is 0.9967577582213988, while the hamming loss is 0.003242241778601204.


The best parameters of Genus are {'C': 10.0, 'gamma': 1.5} with a CV score of 0.99
For class: Genus, the test score(exact match rate) is 0.9921259842519685, while the hamming loss is 0.007874015748031496.


The best parameters of Species are {'C': 10.0, 'gamma': 1.5} with a CV score of 0.99
For class: Species, the test score(exact match rate) is 0.9930523390458545, while the hamming loss is 0.006947660954145438.


Standardized attributes:
The best parameters of Family are {'C': 10.0, 'gamma': 0.1} with a CV score of 0.99
For class: Family, the test score(exact match rate) is 0.9930523390458545, while the hamming loss is 0.006947660954145438.


The best parameters of Genus are {'C': 10.0, 'gamma': 0.1} with a CV score of 0.99
For class: Genus, the t

#### iii. Repeat 1(b)ii with L1 -penalized SVMs. Remember to standardize the attributes. Determine the weight of the SVM penalty using 10 fold cross validation.

In [7]:
from sklearn.svm import LinearSVC


def CV_processing_2(X_train, y_train, class_name):
    C_range = np.logspace(-3, 6, 10)
    param_grid = dict(C=C_range)
    grid = GridSearchCV(LinearSVC(penalty="l1", dual=False, max_iter=10000), param_grid=param_grid, cv=10, n_jobs=-1)
    grid.fit(X_train, y_train)
    print(
    "The best parameters of %s are %s with a CV score of %0.2f"
    % (class_name, grid.best_params_, grid.best_score_))
    return grid.best_params_['C']

def fit_predict_result_2(class_name, X_train = X_train, X_test = X_test):
    y_train = Y_train_all[class_name]
    y_test = Y_test_all[class_name]
    c = CV_processing_2(X_train, y_train, class_name)
    model = LinearSVC(penalty="l1", C=c, dual=False, max_iter=10000)
    model.fit(X_train, y_train)
    emr = model.score(X_test, y_test)
    y_pred = model.predict(X_test)
    hl = hamming_loss(y_test, y_pred)
    return emr, hl

print("Linear Kernel:")

for class_name in  ["Family", "Genus", "Species"]:
    emr_3, hl_3 = fit_predict_result_2(X_train = train_x_st, X_test = test_x_st, class_name = class_name)   
    print("For class: %s, the test score(exact match rate) is %s, while the hamming loss is %s."
         % (class_name, emr_3, hl_3))
    print("\n")  

Linear Kernel:
The best parameters of Family are {'C': 1000.0} with a CV score of 0.93
For class: Family, the test score(exact match rate) is 0.9411764705882353, while the hamming loss is 0.058823529411764705.


The best parameters of Genus are {'C': 1.0} with a CV score of 0.95
For class: Genus, the test score(exact match rate) is 0.9541454377026402, while the hamming loss is 0.04585456229735989.


The best parameters of Species are {'C': 10.0} with a CV score of 0.96
For class: Species, the test score(exact match rate) is 0.9615562760537286, while the hamming loss is 0.03844372394627142.




#### iv. Repeat 1(b)iii by using SMOTE or any other method you know to remedy class imbalance. Report your conclusions about the classiﬁers you trained.

#### use linear kernal and smote in standardized attributes.

In [8]:
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
import warnings
warnings.filterwarnings("ignore")


def CV_processing_3(X_train, y_train, class_name):
    C_range = np.logspace(-3, 6, 10)
    param_grid = {'clf__C' : C_range},
    model = Pipeline([
        ('sampling', SMOTE(random_state=424, n_jobs=-1)),
        ('clf', LinearSVC(penalty="l1", dual=False, max_iter=10000))
    ])
    grid = GridSearchCV(model, param_grid=param_grid, cv=10, n_jobs=-1)
    grid.fit(X_train, y_train)
    print(
    "The best parameters of %s are %s with a CV score of %0.2f"
    % (class_name, grid.best_params_, grid.best_score_))
    return grid.best_params_['clf__C']

def fit_predict_result_3(class_name, X_train = X_train, X_test = X_test):
    y_train = Y_train_all[class_name]
    y_test = Y_test_all[class_name]
    c = CV_processing_3(X_train, y_train, class_name)
    model = Pipeline([
        ('sampling', SMOTE(random_state=424, n_jobs=-1)),
        ('clf', LinearSVC(penalty="l1", C=c, dual=False, max_iter=10000))
    ])
    model.fit(X_train, y_train)
    emr = model.score(X_test, y_test)
    y_pred = model.predict(X_test)
    hl = hamming_loss(y_test, y_pred)
    return emr, hl

print("Linear Kernel with SMOTE:")

for class_name in  ["Family", "Genus", "Species"]:
    emr_4, hl_4 = fit_predict_result_3(X_train = train_x_st, X_test = test_x_st, class_name = class_name)   
    print("For class: %s, the test score(exact match rate) is %s, while the hamming loss is %s."
         % (class_name, emr_3, hl_3))
    print("\n")  


Linear Kernel with SMOTE:




The best parameters of Family are {'clf__C': 10.0} with a CV score of 0.92
For class: Family, the test score(exact match rate) is 0.9615562760537286, while the hamming loss is 0.03844372394627142.


The best parameters of Genus are {'clf__C': 10.0} with a CV score of 0.92
For class: Genus, the test score(exact match rate) is 0.9615562760537286, while the hamming loss is 0.03844372394627142.


The best parameters of Species are {'clf__C': 0.1} with a CV score of 0.95
For class: Species, the test score(exact match rate) is 0.9615562760537286, while the hamming loss is 0.03844372394627142.





#### In my mind, there is no need to do the Chain method, because using binary relevance and Gaussian kernel, the test score can reach to 0.99.