**Experiment No.07**

---

**Aim:** To perform classification on a dataset.

---


**Objectives**
1. To implement Classification on a dataset.

---
**Theory:**

In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. 

Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.).

Classification is an example of pattern recognition.

In the terminology of machine learning, classification is considered an instance of supervised learning, i.e., learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering, and involves grouping data into categories based on some measure of inherent similarity or distance.


Classification can be performed using various algorithms available in sklearn, such as KNN, SVM, Decision Trees, and Random Forests.
In this experiment, I have used a Support Vector Machine (SVM) to classify binary data.
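As a rough illustration of how the algorithms named above share the same fit/score interface in scikit-learn, the sketch below trains each of them on a small synthetic dataset. The synthetic data, the `models` dictionary, and the split size are assumptions made for this example; they are not part of the experiment and the scores say nothing about the voice dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption, not voice.csv)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Every scikit-learn classifier exposes the same fit / score methods
models = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the test split
    print(f"{name}: {scores[name]:.3f}")
```

Because the interface is uniform, swapping one classifier for another usually only means changing the constructor call.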

**Dataset**:
Voice Gender
Gender Recognition by Voice and Speech Analysis

This database was created to identify a voice as male or female, based upon acoustic properties of the voice and speech. The dataset consists of 3,168 recorded voice samples, collected from male and female speakers. The voice samples are pre-processed by acoustic analysis in R using the seewave and tuneR packages, with an analyzed frequency range of 0 Hz to 280 Hz (human vocal range).

The following acoustic properties of each voice are measured and included within the CSV:

- meanfreq: mean frequency (in kHz)
- sd: standard deviation of frequency
- median: median frequency (in kHz)
- Q25: first quartile (in kHz)
- Q75: third quartile (in kHz)
- IQR: interquartile range (in kHz)
- skew: skewness (see note in specprop description)
- kurt: kurtosis (see note in specprop description)
- sp.ent: spectral entropy
- sfm: spectral flatness
- mode: mode frequency
- centroid: frequency centroid (see specprop)
- peakf: peak frequency (frequency with highest energy)
- meanfun: average of fundamental frequency measured across acoustic signal
- minfun: minimum fundamental frequency measured across acoustic signal
- maxfun: maximum fundamental frequency measured across acoustic signal
- meandom: average of dominant frequency measured across acoustic signal
- mindom: minimum of dominant frequency measured across acoustic signal
- maxdom: maximum of dominant frequency measured across acoustic signal
- dfrange: range of dominant frequency measured across acoustic signal
- modindx: modulation index, calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range
- label: male or female

The dataset is taken from: https://www.kaggle.com/primaryobjects/voicegender/download

A detailed description of the dataset can be found at the above URL.

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


In [0]:
df = pd.read_csv('voice.csv')

In [0]:
df.head(5)

Unnamed: 0,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,mode,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx,label
0,0.059781,0.064241,0.032027,0.015071,0.090193,0.075122,12.863462,274.402906,0.893369,0.491918,0.0,0.059781,0.084279,0.015702,0.275862,0.007812,0.007812,0.007812,0.0,0.0,male
1,0.066009,0.06731,0.040229,0.019414,0.092666,0.073252,22.423285,634.613855,0.892193,0.513724,0.0,0.066009,0.107937,0.015826,0.25,0.009014,0.007812,0.054688,0.046875,0.052632,male
2,0.077316,0.083829,0.036718,0.008701,0.131908,0.123207,30.757155,1024.927705,0.846389,0.478905,0.0,0.077316,0.098706,0.015656,0.271186,0.00799,0.007812,0.015625,0.007812,0.046512,male
3,0.151228,0.072111,0.158011,0.096582,0.207955,0.111374,1.232831,4.177296,0.963322,0.727232,0.083878,0.151228,0.088965,0.017798,0.25,0.201497,0.007812,0.5625,0.554688,0.247119,male
4,0.13512,0.079146,0.124656,0.07872,0.206045,0.127325,1.101174,4.333713,0.971955,0.783568,0.104261,0.13512,0.106398,0.016931,0.266667,0.712812,0.007812,5.484375,5.476562,0.208274,male


These are sample rows from the dataset.

All the columns except the last one are input features for the model, and the label column indicates whether the voice is male or female.



In [0]:
print(df.columns)

Index(['meanfreq', 'sd', 'median', 'Q25', 'Q75', 'IQR', 'skew', 'kurt',
       'sp.ent', 'sfm', 'mode', 'centroid', 'meanfun', 'minfun', 'maxfun',
       'meandom', 'mindom', 'maxdom', 'dfrange', 'modindx', 'label'],
      dtype='object')


These are the attributes that are given to the model. A detailed study of the input data is outside the scope of this experiment.

In [0]:
len(df)

3168

In [0]:
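Y = df['label'].copy()  # target column: 'male' or 'female'
# Map numeric codes back to text labels. voice.csv already stores the
# labels as strings, so these calls leave Y unchanged.
Y = Y.replace(0, 'male')
Y = Y.replace(1, 'female')

The above code separates the target column from the input features.
The only text data is the label, i.e. male or female.
The replace calls map 0 to male and 1 to female, which only matters if the labels are numerically encoded. Since the labels in voice.csv are already text, Y keeps the values male and female. SVC accepts string class labels directly, which is why the classification report lists the classes as female and male.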
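If numeric class labels are wanted, one way to encode them is an explicit mapping with `Series.map`. The sketch below uses a small made-up Series instead of voice.csv; the names `labels`, `mapping`, and `encoded` are chosen only for this illustration.

```python
import pandas as pd

# Toy label column (an assumption standing in for df['label'])
labels = pd.Series(["male", "female", "female", "male"], name="label")

# Explicit text-to-number mapping: male -> 0, female -> 1
mapping = {"male": 0, "female": 1}
encoded = labels.map(mapping)

print(encoded.tolist())  # [0, 1, 1, 0]
```

An explicit dictionary keeps the meaning of each code visible, which helps when reading the confusion matrix later.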

In [0]:
df.drop(['label'], inplace=True, axis = 1)

In [0]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df)
scaled_data = scaler.transform(df)
type(scaled_data)

numpy.ndarray

We scale the entire dataset before we pass it to the SVM model.
Many machine learning algorithms work better when features are on a relatively similar scale and close to normally distributed.

MinMaxScaler, RobustScaler, StandardScaler, and Normalizer are scikit-learn classes for preprocessing data for machine learning.

StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance. 

Scaling to unit variance means dividing all the values by the standard deviation.

StandardScaler results in a distribution with a standard deviation equal to 1. The variance is also equal to 1, because variance is the standard deviation squared, and 1 squared is 1.

StandardScaler makes the mean of the distribution 0. If a feature is approximately normally distributed, about 68% of its values will lie between -1 and 1.
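The properties described above can be checked on a tiny array. This is a minimal sketch with made-up numbers (the array `X` is an assumption, not taken from voice.csv); it compares StandardScaler with the manual formula of subtracting the column mean and dividing by the column standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

scaled = StandardScaler().fit_transform(X)

# Manual standardization: (x - mean) / std, using the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))  # the two results agree
print(scaled.mean(axis=0))          # each column mean is (numerically) 0
print(scaled.std(axis=0))           # each column std is 1
```

After scaling, both columns hold the same values, which shows that the original difference in magnitude no longer affects the model.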

In [0]:
scaled_df = pd.DataFrame(scaled_data)

The values in the DataFrame after applying StandardScaler are as follows.

In [0]:
scaled_df.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,-4.049248,0.427355,-4.224901,-2.576102,-5.693607,-0.214778,2.293306,1.762946,-0.039083,0.471575,-2.14121,-4.049248,-1.812038,-1.097998,0.565959,-1.564205,-0.708404,-1.431422,-1.419137,-1.454772
1,-3.841053,0.611669,-3.999293,-2.486885,-5.588987,-0.258485,4.548056,4.433008,-0.065236,0.594431,-2.14121,-3.841053,-1.079594,-1.091533,-0.29403,-1.561916,-0.708404,-1.418107,-1.405818,-1.014103
2,-3.463066,1.603848,-4.095851,-2.706986,-3.928699,0.909326,6.513656,7.326207,-1.08373,0.398261,-2.14121,-3.463066,-1.365368,-1.100397,0.41048,-1.563866,-0.708404,-1.429203,-1.416917,-1.065344
3,-0.992157,0.899998,-0.759454,-0.901418,-0.711205,0.63269,-0.449858,-0.240099,1.516383,1.79734,-1.054576,-0.992157,-1.666966,-0.988934,-0.29403,-1.195367,-0.708404,-1.273867,-1.261532,0.614286
4,-1.53064,1.322561,-1.676948,-1.268395,-0.792029,1.005588,-0.480911,-0.23894,1.708336,2.11474,-0.790514,-1.53064,-1.127233,-1.034015,0.260185,-0.22166,-0.708404,0.124154,0.136933,0.289046


We now split the dataset into training data and test data using train_test_split, with 70% of the samples held out for testing (test_size = 0.7), and pass the training data to the Support Vector classifier.

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
X_train, X_test, y_train, y_test = train_test_split(scaled_df, Y, test_size  = 0.7 , random_state = 42)
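A common variation on this step, shown here only as a sketch and not as the method used in this experiment, is to split first and let a scikit-learn pipeline fit the scaler on the training split alone, so that statistics of the test data do not influence the scaling. The synthetic data and the `stratify=y` argument are assumptions added for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the feature matrix and labels (not voice.csv)
X, y = make_classification(n_samples=400, n_features=20, random_state=42)

# Same 70% test share as in the experiment; stratify keeps class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.7, random_state=42, stratify=y
)

# The pipeline fits StandardScaler on X_train only, then trains the SVC
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)

# score() applies the already-fitted scaler to X_test before predicting
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Wrapping both steps in one object also means the same scaling is applied automatically whenever the model predicts on new data.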

In [0]:
from sklearn.svm import SVC

In [0]:
svc = SVC()
svc.fit(X_train, y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [0]:
predictions = svc.predict(X_test)

In [0]:
from sklearn.metrics import classification_report, confusion_matrix

The results of the model are as follows.

In [0]:
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))

              precision    recall  f1-score   support

      female       0.98      0.98      0.98      1111
        male       0.98      0.98      0.98      1107

    accuracy                           0.98      2218
   macro avg       0.98      0.98      0.98      2218
weighted avg       0.98      0.98      0.98      2218

[[1084   27]
 [  27 1080]]
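The figures in the classification report can be recomputed from the confusion matrix. The sketch below copies the matrix printed above and assumes scikit-learn's convention that rows are true classes and columns are predicted classes, in sorted label order (female, then male).

```python
import numpy as np

# Confusion matrix from the output above: rows = true, columns = predicted
cm = np.array([[1084, 27],
               [27, 1080]])

correct = np.diag(cm)               # correctly classified samples per class
accuracy = correct.sum() / cm.sum() # 2164 / 2218
precision = correct / cm.sum(axis=0)  # correct / all predicted as that class
recall = correct / cm.sum(axis=1)     # correct / all truly in that class
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(["female", "male"], precision, recall, f1):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Out of 2218 test samples, 54 are misclassified (27 in each direction), which is where the 0.98 values in the report come from.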


**Conclusion**

Thus, in the above experiment, I have learnt about the classification of data into classes. I have implemented classification using a Support Vector classifier, which reached an accuracy of 0.98 on the test data.
