#### Hi, welcome to my project! Today I'm going to run Support Vector Machine algorithms with different kernels (linear, Gaussian, polynomial) and tune parameters such as C, gamma and degree to find the best-performing model for recognizing voice gender.

## Importing all the necessary libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

import matplotlib.pyplot as plt

%matplotlib inline

## Reading the comma separated values file as a dataframe

In [None]:
df=pd.read_csv('../input/voice-gender/voice_gender.csv')
df # Just a quick look at our dataframe

In [None]:
df.shape  

Our dataframe has 3168 instances and 21 columns (20 features plus the label column).

## Checking the correlation between each feature

In [None]:
df.drop(columns='label').corr()  # correlation is only defined for the numeric columns

In [None]:
df.isnull().sum()

In [None]:
print ('Unique values in label column: ', df.label.unique())
print ('How many non null values we have in our label column: ', len(df.label))
print ('Value counts for each class in our label: ') 
df.label.value_counts()

Thus, the label column contains an equal number of male and female samples.

## Separating features and labels

In [None]:
X=df.iloc[:,:-1]
X.head()

## Converting our categorical labels to int type values using label encoding

As you know, machine learning models expect the label to be numeric, so in our case we have to convert our label to a machine-readable form. For this we use LabelEncoder from sklearn, which performs exactly this transformation.

In [None]:
from sklearn.preprocessing import LabelEncoder
y=df.iloc[:,-1]

# Encode label category
# male -> 1
# female -> 0

gender_encoder=LabelEncoder()
y=gender_encoder.fit_transform(y)
y

### Just to check the encoding, let's use the following array and see its corresponding transformation

In [None]:
test=['male','female', 'male']
gender_encoder.transform(test)
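As a quick sanity check on the mapping, here is a minimal sketch that fits a fresh `LabelEncoder` on a small hand-made list (the list `toy_labels` is made up for illustration and is not the notebook's data). It shows that the classes are numbered in sorted order and that `inverse_transform` decodes the integers back to strings.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels, made up for illustration only
toy_labels = ['male', 'female', 'male', 'female']

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

# classes_ holds the original labels in sorted order; the index is the code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the integer codes
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

Because the order is alphabetical, 'female' comes before 'male', which is why the comments in the cell above list female as 0 and male as 1.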

## Data Standardisation

Standardization refers to shifting and rescaling the distribution of each attribute so that it has a mean of zero and a standard deviation of one (unit variance). It matters for SVMs in particular because the kernels are computed from distances or dot products between samples, so attributes with large numeric ranges would otherwise dominate. Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
scaler.fit(X)
X=scaler.transform(X)
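To make the transformation concrete, the sketch below compares `StandardScaler` with a manual z-score on a tiny matrix. The array `toy` is invented for illustration; nothing here touches the voice data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix, made up for illustration only
toy = np.array([[1.0, 200.0],
                [2.0, 400.0],
                [3.0, 600.0]])

# Manual z-score: subtract the column mean, divide by the column std (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# StandardScaler applies the same formula column by column
scaled = StandardScaler().fit_transform(toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1 even though their original ranges differed by a factor of 200.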

### Let's see the distribution of the first column before and after standardization:

In [None]:
df.loc[:,'meanfreq'].hist(bins=30)   # Original distribution in column 'meanfreq'

In [None]:
XX=pd.DataFrame(X)
XX[0].hist(bins=30)  # Column 'meanfreq' standardized

In [None]:
sns.boxplot(x=XX[0])  # pass the column by keyword so seaborn treats it as the x variable

### Splitting dataset into training set and testing set for better generalisation

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
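Note that the scaler above was fitted on all of X before the split, so it has seen the test rows as well. One common alternative, sketched here under the assumption that synthetic data from `make_classification` can stand in for the voice data, is to wrap the scaler and the classifier in a pipeline so that the scaling statistics come from the training rows only. The names `X_demo`, `y_demo` and `pipe` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the voice data (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training rows only, then fits the SVM
pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_tr, y_tr)

# score() scales the test rows with the training statistics before predicting
test_accuracy = pipe.score(X_te, y_te)
print(np.round(test_accuracy, 3))
```

The same pipeline object can be passed to `cross_val_score` or `GridSearchCV`, in which case the scaler is refitted inside every fold.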

## Running SVM with default hyperparameter.

In [None]:
from sklearn.svm import SVC
from sklearn import metrics

svc=SVC() #Default hyperparameters
svc.fit(X_train,y_train)
y_pred=svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(y_test,y_pred))

### Default Linear kernel

In [None]:
svc=SVC(kernel='linear')
svc.fit(X_train,y_train)
y_pred=svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(y_test,y_pred))

### Default RBF kernel

In [None]:
svc=SVC(kernel='rbf')
svc.fit(X_train,y_train)
y_pred=svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(y_test,y_pred))

The accuracy matches the default run above, which confirms that SVC uses the rbf kernel by default.

### Default Polynomial kernel

In [None]:
svc=SVC(kernel='poly')
svc.fit(X_train,y_train)
y_pred=svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(y_test,y_pred))

The polynomial kernel is performing poorly. The reason may be that it is overfitting the training dataset.

## Performing K-fold cross validation with different kernels

### CV on Linear kernel

In [None]:
from sklearn.model_selection import cross_val_score
svc=SVC(kernel='linear')
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy') 
print(scores)

We can see above that the accuracy score is different for every fold. This shows that the accuracy score depends on how the dataset is split.

In [None]:
print(scores.mean())

In K-fold cross validation we generally take the mean of all the scores.

### CV on rbf kernel

In [None]:
svc=SVC(kernel='rbf')
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy') 
print(scores)

In [None]:
print(scores.mean())

### CV on Polynomial kernel

In [None]:
svc=SVC(kernel='poly')
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy') 
print(scores)

In [None]:
print(scores.mean())

**When K-fold cross validation is applied we obtain a different score in each iteration. The same effect is hidden when we use the train_test_split method: the dataset is split at random into a training and a testing set, so the single accuracy we get depends on which samples end up in the training set and which in the testing set. Because the selection is random, we could hit a hypothetical case where a group of samples with a skew or other particular characteristics lands mostly on one side of the split, which would make the accuracy estimate misleading.**

**With K-fold cross validation the dataset is split into 10 equal parts, and each part is used once as the testing set while the other 9 are used for training, so every sample is covered in both roles. This is the reason why we got 10 different accuracy scores.**
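The fold-by-fold procedure described above can be sketched by hand. This is a minimal illustration on synthetic data from `make_classification` (an assumption; it is not the voice dataset), using `StratifiedKFold`, which is what an integer `cv` resolves to for a classifier in scikit-learn.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the voice data (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

svc_demo = SVC(kernel='linear')
folds = StratifiedKFold(n_splits=10)  # 10 folds that preserve the class balance

manual_scores = []
for train_idx, test_idx in folds.split(X_demo, y_demo):
    model = clone(svc_demo)  # fresh, unfitted copy for every fold
    model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

manual_scores = np.array(manual_scores)
auto_scores = cross_val_score(svc_demo, X_demo, y_demo, cv=10, scoring='accuracy')
print(manual_scores.mean(), auto_scores.mean())
```

The manual loop and `cross_val_score` produce the same ten scores, since the folds are built without shuffling and are therefore deterministic.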

### Taking all the values of C and checking out the accuracy score with kernel as linear:

The C parameter tells the SVM optimization how much you want to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points.

Thus very large values of C can cause overfitting and very small values of C can cause underfitting. The value of C must therefore be chosen so that the model generalises well to unseen data.
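One way to see this trade-off is to compare training and validation accuracy across a wide range of C. The sketch below uses `validation_curve` on synthetic, deliberately noisy data; the data, the values in `C_grid` and the choice of the rbf kernel (where the effect tends to be more visible than with a linear kernel) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Synthetic, slightly noisy data: flip_y mislabels a fraction of the points
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     flip_y=0.1, random_state=0)

C_grid = [0.001, 0.1, 10, 1000]

# One row per value of C, one column per cross-validation fold
train_scores, val_scores = validation_curve(
    SVC(kernel='rbf'), X_demo, y_demo,
    param_name='C', param_range=C_grid, cv=5, scoring='accuracy')

for c, tr, va in zip(C_grid, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f'C={c}: train={tr:.3f}, validation={va:.3f}')
```

A large gap between training and validation accuracy at high C points to overfitting, while low accuracy on both at very small C points to underfitting.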

In [None]:
C_range=list(range(1,26))
acc_score=[]
for c in C_range:
    svc = SVC(kernel='linear', C=c)
    scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
    acc_score.append(scores.mean())
print(acc_score)    

In [None]:
C_values=list(range(1,26))
# plot the value of C for SVM (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(C_values,acc_score)
plt.xticks(np.arange(0,27,2))
plt.xlabel('Value of C for SVC')
plt.ylabel('Cross-Validated Accuracy')

**From the above plot we can see that accuracy is close to 97% for values of C around 1, then drops a bit and remains constant.**

### Let's look in more detail at the exact value of C that gives us the best accuracy score

In [None]:
C_range=list(np.arange(0.1,2,0.1))
acc_score=[]
for c in C_range:
    svc = SVC(kernel='linear', C=c)
    scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
    acc_score.append(scores.mean())
print(acc_score)    

In [None]:
C_values=list(np.arange(0.1,2,0.1))
# plot the value of C for SVM (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(C_values,acc_score)
plt.xticks(np.arange(0.0,2,0.2))
plt.xlabel('Value of C for SVC ')
plt.ylabel('Cross-Validated Accuracy')

**Accuracy score is highest for C=0.1.**

### Taking kernel as rbf and evaluating accuracy with different values of gamma:

Technically, the gamma parameter is inversely related to the variance of the RBF kernel (Gaussian function), which is used as a similarity measure between two points: the kernel is exp(-gamma * ||x - x'||^2), so gamma = 1 / (2 * sigma^2) for a Gaussian with standard deviation sigma. Intuitively, a small gamma value defines a Gaussian function with a large variance. In this case, two points can be considered similar even if they are far from each other. On the other hand, a large gamma value defines a Gaussian function with a small variance, and in this case two points are considered similar only if they are close to each other.
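A small numeric sketch of this intuition: the two points `a` and `b` below are made up for illustration and sit at Euclidean distance 5 from each other. Their RBF similarity is computed by hand and with `rbf_kernel` from scikit-learn for a small and a large gamma.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two illustrative points at Euclidean distance 5 from each other
a = np.array([[0.0, 0.0]])
b = np.array([[3.0, 4.0]])

sq_dist = np.sum((a - b) ** 2)  # squared distance, 25.0

similarities = {}
for gamma in [0.01, 1.0]:
    manual = np.exp(-gamma * sq_dist)            # exp(-gamma * ||a - b||^2)
    library = rbf_kernel(a, b, gamma=gamma)[0, 0]
    similarities[gamma] = (manual, library)
    print(gamma, manual, library)
```

With gamma=0.01 the two distant points still count as fairly similar, while with gamma=1.0 their similarity is practically zero.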

In [None]:
gamma_range=[0.0001,0.001,0.01,0.1,1,10,100]
acc_score=[]
for g in gamma_range:
    svc = SVC(kernel='rbf', gamma=g)
    scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
    acc_score.append(scores.mean())
print(acc_score)    

In [None]:
gamma_range=[0.0001,0.001,0.01,0.1,1,10,100]

# plot the value of gamma for SVM (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(gamma_range,acc_score)
plt.xlabel('Value of gamma for SVC ')
plt.xticks(np.arange(0.0001,100,5))
plt.ylabel('Cross-Validated Accuracy')

We can see that for gamma=10 to 100 the kernel is performing poorly. We can also see a slight dip in accuracy score when gamma is around 1. Let's look into more details for the range 0.0001 to 1.

In [None]:
gamma_range=[0.0001,0.001,0.01,0.1,1]
acc_score=[]
for g in gamma_range:
    svc = SVC(kernel='rbf', gamma=g)
    scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
    acc_score.append(scores.mean())
print(acc_score)    

In [None]:
gamma_range=[0.0001,0.001,0.01,0.1,1]

# plot the value of gamma for SVM (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(gamma_range,acc_score)
plt.xlabel('Value of gamma for SVC ')
plt.ylabel('Cross-Validated Accuracy')

The score increases steadily, reaches its peak before 0.1 and then decreases until gamma=1. Thus gamma should be around 0.01.

In [None]:
gamma_range=[0.01,0.02,0.03,0.04,0.05]
acc_score=[]
for g in gamma_range:
    svc = SVC(kernel='rbf', gamma=g)
    scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
    acc_score.append(scores.mean())
print(acc_score)    

In [None]:
gamma_range=[0.01,0.02,0.03,0.04,0.05]

# plot the value of gamma for SVM (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(gamma_range,acc_score)
plt.xlabel('Value of gamma for SVC ')
plt.ylabel('Cross-Validated Accuracy')

**We can see a steady decrease in the accuracy score once gamma is greater than 0.03. Thus gamma=0.01 is the best parameter.**

### Taking polynomial kernel with different degrees:

In [None]:
degree=[2,3,4,5,6]
acc_score=[]
for d in degree:
    svc = SVC(kernel='poly', degree=d)
    scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
    acc_score.append(scores.mean())
print(acc_score)

In [None]:
degree=[2,3,4,5,6]

# plot the polynomial degree for SVM (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(degree,acc_score,color='r')
plt.xlabel('degrees for SVC ')
plt.ylabel('Cross-Validated Accuracy')

**Accuracy score is highest for the third-degree polynomial and then drops as the degree increases. A higher polynomial degree makes the model more complex and, as a result, prone to overfitting.**
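For reference, the polynomial kernel has the form (gamma * <u, v> + coef0) ** degree. The sketch below evaluates it by hand and with `polynomial_kernel` from scikit-learn for a few degrees. The vectors and the values of `gamma` and `coef0` are arbitrary choices for illustration; `SVC` itself defaults to `coef0=0.0` and `gamma='scale'`.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

# Two illustrative vectors (made up for this example)
u = np.array([[1.0, 2.0]])
v = np.array([[0.5, -1.0]])

gamma, coef0 = 0.5, 1.0  # arbitrary values chosen for illustration

poly_values = {}
for degree in [2, 3, 6]:
    # (gamma * <u, v> + coef0) ** degree, written out by hand
    manual = (gamma * (u @ v.T)[0, 0] + coef0) ** degree
    library = polynomial_kernel(u, v, degree=degree, gamma=gamma, coef0=coef0)[0, 0]
    poly_values[degree] = (manual, library)
    print(degree, manual, library)
```

The degree is the exponent applied to the same inner product, so raising it lets the decision boundary bend more sharply around individual training points.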

### Performing SVM taking hyperparameter C=0.1 and kernel as linear:

In [None]:
from sklearn.svm import SVC
svc= SVC(kernel='linear',C=0.1)
svc.fit(X_train,y_train)
y_predict=svc.predict(X_test)
accuracy_score= metrics.accuracy_score(y_test,y_predict)
print(accuracy_score)

#### With K-fold cross validation (where K=10):

In [None]:
from sklearn.model_selection import cross_val_score
svc=SVC(kernel='linear',C=0.1)
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
print(scores.mean())

The accuracy is slightly better without K-fold cross validation, but that number comes from a single split and may not reflect how well the model generalises to unseen data. Hence it is advisable to rely on K-fold cross validation, where every sample is used for testing once, for a more reliable estimate.
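To illustrate how much a single-split accuracy can move around, here is a minimal sketch that repeats the split with different `random_state` values and compares the results with the 10-fold mean. The synthetic data and the five seeds are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the voice data (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     flip_y=0.05, random_state=0)

# Accuracy of the same model on five different random 80/20 splits
single_split_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed)
    model = SVC(kernel='linear', C=0.1).fit(X_tr, y_tr)
    single_split_scores.append(model.score(X_te, y_te))

# Mean accuracy over 10 folds covering the whole dataset
cv_mean = cross_val_score(SVC(kernel='linear', C=0.1), X_demo, y_demo, cv=10).mean()

print(np.round(single_split_scores, 3), round(cv_mean, 3))
```

Any one of the single-split numbers could have been the one reported, which is why the averaged cross-validated score is the safer figure to quote.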

### Performing SVM taking hyperparameter gamma=0.01 and kernel as rbf:

In [None]:
from sklearn.svm import SVC
svc= SVC(kernel='rbf',gamma=0.01)
svc.fit(X_train,y_train)
y_predict=svc.predict(X_test)
metrics.accuracy_score(y_test,y_predict)

#### With K-fold cross validation (where K=10):

In [None]:
svc=SVC(kernel='rbf',gamma=0.01)
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
print(scores.mean())

### Performing SVM taking hyperparameter degree=3 and kernel as poly:

In [None]:
from sklearn.svm import SVC
svc= SVC(kernel='poly',degree=3)
svc.fit(X_train,y_train)
y_predict=svc.predict(X_test)
accuracy_score= metrics.accuracy_score(y_test,y_predict)
print(accuracy_score)

#### With K-fold cross validation (where K=10):

In [None]:
svc=SVC(kernel='poly',degree=3)
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
print(scores.mean())

## Let's perform Grid search technique to find the best hyperparameters:

In [None]:
from sklearn.svm import SVC
svm_model = SVC()

In [None]:
tuned_parameters = {
    'degree': [2, 3, 4],
    'gamma': [0.01, 0.02, 0.03],
    'C': np.arange(0.1, 0.5, 0.1),
    'kernel': ['linear', 'rbf', 'poly'],
}
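A single dictionary combines every value with every other, including combinations that have no effect, such as `degree` with the linear kernel. `GridSearchCV` also accepts a list of dictionaries, one per kernel, which avoids those redundant fits. The sketch below only counts the candidates with `ParameterGrid`; the explicit list `C_grid` is an assumption standing in for the `np.arange` call above.

```python
from sklearn.model_selection import ParameterGrid

# Explicit values standing in for np.arange(0.1, 0.5, 0.1) (assumption)
C_grid = [0.1, 0.2, 0.3, 0.4]
gammas = [0.01, 0.02, 0.03]

# Every parameter crossed with every kernel, as in a single dictionary
single_dict = {'degree': [2, 3, 4], 'gamma': gammas, 'C': C_grid,
               'kernel': ['linear', 'rbf', 'poly']}

# One dictionary per kernel, listing only the parameters that kernel uses
per_kernel = [
    {'kernel': ['linear'], 'C': C_grid},
    {'kernel': ['rbf'], 'C': C_grid, 'gamma': gammas},
    {'kernel': ['poly'], 'C': C_grid, 'gamma': gammas, 'degree': [2, 3, 4]},
]

n_single = len(ParameterGrid(single_dict))
n_per_kernel = len(ParameterGrid(per_kernel))
print(n_single, n_per_kernel)
```

With 10-fold cross validation each candidate costs ten fits, so trimming the grid this way roughly halves the search time without dropping any distinct model.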

In [None]:
from sklearn.model_selection import GridSearchCV

model_svm = GridSearchCV(svm_model, tuned_parameters,cv=10,scoring='accuracy')

In [None]:
model_svm.fit(X_train, y_train)

In [None]:
print(model_svm.best_score_)

We are going to use the following code to see the best parameters found by the function:

In [None]:
print(model_svm.best_estimator_)

In [None]:
print(model_svm.best_params_)

In [None]:
y_predict=model_svm.predict(X_test)
print(metrics.accuracy_score(y_test,y_predict))  # true labels first, predictions second

Check the accuracy above by refitting the best model and computing its accuracy:

In [None]:
from sklearn.svm import SVC
svc= SVC(kernel='linear',C=0.4,gamma=0.01,degree=2)
svc.fit(X_train,y_train)
y_predict=svc.predict(X_test)
accuracy_score= metrics.accuracy_score(y_test,y_predict)
print(accuracy_score)

Above we can see that the result of our best model matches the outcome of GridSearchCV. But accuracy is only one metric; it would be better to see the performance of the model in a single plot. For this we will use the confusion matrix.

### Let's plot a confusion matrix for this best model:

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [None]:
cm=confusion_matrix(y_test,y_predict, labels=model_svm.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model_svm.classes_)
disp.plot(cmap='Blues')
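The four cells of a binary confusion matrix are enough to derive accuracy, precision and recall. The sketch below does this on a tiny hand-made example; `y_true_demo` and `y_pred_demo` are invented for illustration and are not the model's predictions.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels and predictions, for illustration only
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes;
# ravel() flattens the 2x2 matrix in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the samples predicted 1, how many really are 1
recall = tp / (tp + fn)     # of the samples that are 1, how many were found

print(tn, fp, fn, tp)
print(accuracy, precision, recall)
```

With the encoding used in this notebook (female as 0, male as 1), precision and recall computed this way would refer to the male class.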