<a href="https://colab.research.google.com/github/Nikunjbansal99/GenderPrediction/blob/main/GenderRecognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **About Data:**

**To analyze gender by voice and speech, a training database was required. A database was built using thousands of samples of male and female voices, each labeled as male or female. Voice samples were collected from the following resources:**

*   **The Harvard-Haskins Database of Regularly-Timed Speech**
*   **Telecommunications & Signal Processing Laboratory (TSP) Speech Database at McGill University**
*   **VoxForge Speech Corpus**
*   **Festvox CMU_ARCTIC Speech Database at Carnegie Mellon University**

**Each voice sample is stored as a .WAV file, which is then pre-processed for acoustic analysis using the specan function from the warbleR R package. Specan measures 22 acoustic parameters on acoustic signals for which the start and end times are provided.**

**The output from the pre-processed WAV files was saved into a CSV file containing 3168 rows and 21 columns (20 feature columns and one label column for the classification of male or female). You can download the pre-processed dataset in CSV format, using the link above.**



# **Methodology**


*   Importing Some Basic Libraries
*   Importing Data
*   Performing Descriptive Analysis on the dataset
    *   Data Description
    *   Checking null values
*   Processing Categorical Values using encoding
*   Analysis of Target Variable
    *   Plotting Kernel Density Estimate Plot
    *   Plotting Distribution Plot
    *   Plotting Correlation Matrix and Heat Map
*   Select Features based on above analysis
*   Splitting voice_df into 70% and 30% to construct Training data and Testing data respectively
*   Optimizing Best Parameters for SVM Classifier
*   Applying Dimensionality reduction
*   Visualization
*   Creating Final SVM Classifier
    *   Perform Prediction on Training Data
    *   Perform Prediction on Testing Data
*   For Training data, Evaluating Model based on Confusion Matrix and Classification Report
*   For Testing data, Evaluating Model based on Confusion Matrix and Classification Report
*   Save predictions of Testing data in gender_pred.csv
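
The core of the steps above (split, fit an SVM, score on held-out data) can be sketched in a few lines. This is only an illustration on synthetic data: the two Gaussian blobs, the variable names and the choice `C=1.0` are assumptions made for the sketch, not the notebook's dataset or tuned settings.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the voice features (NOT the real dataset):
# two Gaussian blobs in three dimensions, one per class.
rng = np.random.default_rng(0)
n_per_class = 200
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(n_per_class, 3)),   # class 0
    rng.normal(loc=3.0, scale=1.0, size=(n_per_class, 3)),   # class 1
])
y = np.array([0] * n_per_class + [1] * n_per_class)

# 70/30 split, then fit an SVM and score it on the held-out part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=14
)
model = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy on synthetic data: {test_accuracy:.3f}")
```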

# **Importing Some Basic Libraries**

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import sys, os
from matplotlib import pyplot as plt
from sklearn.utils import resample
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score 
from sklearn.preprocessing import LabelEncoder
from itertools import product
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# **Importing Data**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
gender_data_dir = "/kaggle/input/voicegender/"
voice_df = pd.read_csv(os.path.join(gender_data_dir, "voice.csv"))

In [None]:
voice_df.head()

# **Descriptive Analysis of the dataset**

In [None]:
print("Size of Gender Recognition dataset       : {}".format(voice_df.shape))

## **Data Description**

In [None]:
voice_df.info()

In [None]:
voice_df.describe().T

## **Checking NULL/NaN Values :**

In [None]:
voice_df.isna().sum()                        # Printing a count of missing values for each feature in voice_df

# **Analysis of Target Variable**

In [None]:
plt.figure(figsize=(9,6))
sns.countplot(x='label', data=voice_df, order=["male", "female"] )

In [None]:
voice_df['label'].value_counts()           # Prints the count of different classes in 'label'

**Hence, we found that our data is balanced.**

## **Processing Categorical Values:**

In [None]:
# creating instance of labelencoder
label_encode = LabelEncoder()

In [None]:
# Perform encoding by converting the 'label' feature into numerical form
voice_df['label'] = label_encode.fit_transform(voice_df['label'])

In [None]:
voice_df.head()
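
The integer codes used in the later plots follow from how `LabelEncoder` works: it sorts the class names, so `female` becomes 0 and `male` becomes 1. A small sketch on a toy list of labels (the list and the name `mapping` are illustrative, not taken from the dataset):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["male", "female", "female", "male"]   # toy labels, not the dataset
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so index 0 is 'female' and index 1 is 'male'.
mapping = {name: int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())

# inverse_transform recovers the original strings from the codes.
decoded = encoder.inverse_transform(encoded)
```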

## **Kernel Density Estimate Plot :**

**It is analogous to a histogram, but it represents the data using a continuous probability density curve.**
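
To make the idea concrete, here is a minimal hand-written Gaussian KDE in NumPy. It is a sketch of the general technique, not seaborn's implementation; the function name `gaussian_kde_1d`, the fixed bandwidth of 0.3 and the random samples are all assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] is the scaled distance from grid point i to sample j.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=500)   # toy data
grid = np.linspace(-6.0, 6.0, 601)
density = gaussian_kde_1d(samples, grid, bandwidth=0.3)

# A density curve should enclose an area of about 1.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(f"area under the estimated density: {area:.4f}")
```

Unlike a histogram, the curve has no bin edges; the bandwidth plays the role that bin width plays in a histogram.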

In [None]:
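fig, axes = plt.subplots(4, 5, figsize=(30, 30))
feature_columns = voice_df.columns[:20]            # the first 20 columns are the acoustic features
for ax, col in zip(axes.ravel(), feature_columns):
    ax.set_title(col)
    # after label encoding: 0 = female, 1 = male
    sns.kdeplot(voice_df.loc[voice_df['label'] == 0, col], color='red', label='female', ax=ax)
    sns.kdeplot(voice_df.loc[voice_df['label'] == 1, col], color='brown', label='male', ax=ax)
    ax.legend()                                    # show which curve belongs to which class
plt.show()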

**Hence, it is clearly visible that the Q25, IQR and meanfun features will play an important role in classification, since they separate male and female voices more effectively than the other features.**

## **Distribution Plot :**

In [None]:
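fig = plt.figure(figsize = (20, 15))
feature_columns = [col for col in voice_df.columns if col != 'label']   # skip the target itself
for j, col in enumerate(feature_columns, start=1):
    plt.subplot(5, 5, j)
    # sns.distplot is deprecated; histplot with kde=True draws a histogram plus density curve
    sns.histplot(voice_df.loc[voice_df['label'] == 0, col], color='r', label='Female', kde=True, stat='density')
    sns.histplot(voice_df.loc[voice_df['label'] == 1, col], color='b', label='Male', kde=True, stat='density')
    plt.legend(loc='best')
fig.suptitle('Voice Data Analysis')
fig.tight_layout()
fig.subplots_adjust(top=0.90)
plt.show()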

**Again, it is clearly visible that the Q25, IQR and meanfun features will play an important role in classification, since they separate male and female voices more effectively than the other features.**

## **Correlation Matrix and Heat Map**

In [None]:
corr_data = voice_df.corr()                              # calculating correlation data between features
plt.figure(figsize=(32, 20))                            # setting figure size
sns.set_style('ticks')                                  # setting plot style
sns.heatmap(corr_data, cmap='viridis',annot=True)       # plotting heatmap using sns library
plt.show()

In [None]:
# ranking features by absolute correlation with the target variable, i.e. 'label'
# (iloc[1:21] skips the first entry, which is 'label' itself with correlation 1)
top_correlated_features = corr_data['label'].abs().sort_values(ascending=False).iloc[1:21][::-1]
plt.figure(figsize=(25,12))
top_correlated_features.plot(kind='barh',color='red')
plt.title("Top highly correlated features", size=20, pad=26)
plt.xlabel("Absolute correlation coefficient")
plt.ylabel("Features")

**If we set a threshold of absolute correlation coefficient >= 0.5, we get three features: meanfun, IQR and Q25.**
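
The thresholding step can also be done programmatically rather than read off the bar chart. The sketch below uses a made-up DataFrame (`toy_df` with columns `strong`, `weak`, `noise`), so the column names and numbers are assumptions for illustration only, not values from the voice dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
label = rng.integers(0, 2, size=n)
toy_df = pd.DataFrame({
    "strong": label + rng.normal(scale=0.3, size=n),      # tracks the label closely
    "weak": 0.2 * label + rng.normal(scale=1.0, size=n),  # only loosely related
    "noise": rng.normal(size=n),                          # unrelated to the label
    "label": label,
})

threshold = 0.5
# Absolute correlation of every feature with the target, excluding the target itself.
abs_corr = toy_df.corr()["label"].drop("label").abs()
selected = abs_corr[abs_corr >= threshold].index.tolist()
print(abs_corr.sort_values(ascending=False))
print(selected)
```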

# **Selected Features :**

**Using the above analysis (KDE plot, distribution plot and correlation coefficient) on the voice DataFrame, we found three important features: IQR, Q25 and meanfun.**

In [None]:
selected_features = ['IQR','Q25','meanfun']

In [None]:
voice_df_X = voice_df[selected_features]
voice_df_y = voice_df.label

In [None]:
voice_df_X.head()

In [None]:
voice_df_y.head()

# **Train-Test Splitting :**

In [None]:
# Splitting voice_df into 70% and 30% to construct Training and Testing Data respectively.
trainX, testX, trainy, testy = train_test_split(voice_df_X, voice_df_y,test_size=0.3,random_state=14)
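
As a quick sanity check on what a 70/30 split means for 3168 rows, the sketch below splits placeholder arrays of the same length. The arrays `X_dummy` and `y_dummy` are hypothetical stand-ins; only the row count comes from the data description.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 3168                      # row count stated in the data description
X_dummy = np.zeros((n_rows, 3))    # placeholder features, values are irrelevant
y_dummy = np.arange(n_rows) % 2    # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y_dummy, test_size=0.3, random_state=14
)
# The test size is rounded up, and the training set takes the remainder.
print(X_tr.shape, X_te.shape)
```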

In [None]:
trainX.shape

In [None]:
trainX.head()

In [None]:
trainy.shape

In [None]:
trainy.head()

In [None]:
testX.shape

In [None]:
testX.head()

In [None]:
testy.shape

In [None]:
testy.head()

# **Optimizing Best Parameters for SVM Classifier :**

In [None]:
def svm_kernel(trainX, trainy, testX, testy):
    rate=[]
    kernel=['rbf','poly','linear']
    for i in kernel:
        SVM_Model = SVC(kernel=i).fit(trainX,trainy)
        y_pred = SVM_Model.predict(trainX)
        print(i, 'Accuracy of Train Data : ', accuracy_score(trainy,y_pred))
        y_pred = SVM_Model.predict(testX)
        print(i, 'Accuracy of Test Data : ', accuracy_score(testy,y_pred))
        rate.append(accuracy_score(testy,y_pred))
    nloc = rate.index(max(rate))
    print("Highest accuracy is %s occurs at %s kernel." % (rate[nloc], kernel[nloc]))
    return kernel[nloc]

In [None]:
def svm_error(k,C,x_train,y_train,x_test,y_test):
    error_rate = []
    c_values = range(1,C)                                   # candidate C values: 1 .. C-1
    for i in c_values:
        model = SVC(kernel=k,C=i).fit(x_train,y_train)      # fit once per candidate value
        y_pred = model.predict(x_test)
        error_rate.append(np.mean(y_pred != y_test))        # fraction of misclassified samples
    cloc = error_rate.index(min(error_rate))
    print("Lowest error is %s occurs at C=%s." % (error_rate[cloc], c_values[cloc]))

    plt.plot(c_values, error_rate, color='red', linestyle='dashed', marker='o', markerfacecolor='green', markersize=10)
    plt.title('Error Rate Vs C Value')
    plt.xlabel('C')
    plt.ylabel('Error Rate')
    plt.show()
    return c_values[cloc]

In [None]:
k = svm_kernel(trainX, trainy, testX, testy)

**Hence, the RBF kernel is selected for our final SVM model.**

In [None]:
c = svm_error(k, 10, trainX, trainy, testX, testy)

**Hence, C = 9 is selected for our final SVM model.**
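
The two helper functions above pick the kernel and C by scoring on the test split. A common alternative, shown here only as a sketch and not as what this notebook does, is to search both parameters with cross-validation on the training split alone, so the test split stays untouched until the final evaluation. The synthetic data and the name `param_grid` are assumptions; the grid mirrors the kernels and the C range 1 to 9 used above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic two-class data (NOT the voice dataset).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(150, 3)),
    rng.normal(2.5, 1.0, size=(150, 3)),
])
y = np.array([0] * 150 + [1] * 150)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=14
)

param_grid = {"kernel": ["rbf", "poly", "linear"], "C": list(range(1, 10))}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)          # only the training split is used for selection

best_params = search.best_params_
held_out_accuracy = search.score(X_test, y_test)   # test split used once, at the end
print(best_params, round(held_out_accuracy, 3))
```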

# **Applying Dimensionality Reduction :**

In [None]:
# Initializing Principal Component Analysis(PCA)
PCA_method = PCA(n_components=2)

In [None]:
# Fit and transform data
traindf= PCA_method.fit_transform(trainX)
testdf = PCA_method.transform(testX)
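
The fit-on-train, transform-on-test pattern used above can be checked on toy data, along with how much variance two components retain. The arrays below are random placeholders with deliberately unequal spread per column; they are assumptions for illustration and say nothing about the voice features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
spread = np.array([3.0, 1.0, 0.1])                # unequal spread per column
train = rng.normal(size=(100, 3)) * spread        # toy training features
test = rng.normal(size=(40, 3)) * spread          # toy test features

pca = PCA(n_components=2)
train_2d = pca.fit_transform(train)   # learn the projection from training data only
test_2d = pca.transform(test)         # reuse that projection for the test data

# Fraction of the training variance kept by the two components.
retained = float(pca.explained_variance_ratio_.sum())
print(train_2d.shape, test_2d.shape, round(retained, 4))
```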

# **Visualization :**

In [None]:
# Plotting decision regions
x_min, x_max = traindf[:, 0].min() - 1, traindf[:, 0].max() + 1
y_min, y_max = traindf[:, 1].min() - 1, traindf[:, 1].max() + 1

xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))

f, ax = plt.subplots(figsize=(20, 12))

SVM_Model = SVC(kernel=k, C=c).fit(traindf,trainy)

for clf, tt in zip([SVM_Model],['RBF Kernel SVM']):
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.5)
    ax.scatter(traindf[:, 0], traindf[:, 1], c=trainy, s=30, edgecolor='k')
    ax.set_title(tt)
plt.show()

# **Creating Final SVM Classifier :**

In [None]:
# Initializing the final SVM classifier
Final_SVM_Model = SVC(kernel=k, C=c)
# Train the model using the training sets
Final_SVM_Model.fit(trainX, trainy)

### **Perform Prediction on Training Data :**

In [None]:
Final_SVM_Model_train_predictions = Final_SVM_Model.predict(trainX)

### **Perform Prediction on Testing Data :**

In [None]:
Final_SVM_Model_test_predictions = Final_SVM_Model.predict(testX)

# **Evaluation**

### **On Training :**

In [None]:
print("SVM Model Confusion Matrix:")
print(confusion_matrix(trainy, Final_SVM_Model_train_predictions))

print("SVM Model Classification Report")
print(classification_report(trainy, Final_SVM_Model_train_predictions))

### **On Testing :**

In [None]:
print("SVM Model Confusion Matrix:")
print(confusion_matrix(testy, Final_SVM_Model_test_predictions))

print("SVM Model Classification Report")
print(classification_report(testy, Final_SVM_Model_test_predictions))
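
For readers unfamiliar with the report, the sketch below shows how accuracy, precision and recall follow from the four cells of a binary confusion matrix. The six labels are a hand-made toy example (1 standing for male, as in the encoding above), not model output.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]   # toy ground truth
y_pred = [0, 1, 1, 1, 0, 0]   # toy predictions

cm = confusion_matrix(y_true, y_pred)
# For binary labels, rows are true classes and columns are predicted classes.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_male = tp / (tp + fp)   # of those predicted male, how many were male
recall_male = tp / (tp + fn)      # of the actual males, how many were found
print(cm)
print(accuracy, precision_male, recall_male)
```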

# **Predictions on Test Data :**

In [None]:
OutputDF = pd.DataFrame({'Actual_label':testy,'Predicted_label':Final_SVM_Model_test_predictions})

In [None]:
#Save to csv
OutputDF.to_csv('gender_pred.csv',index=False)
OutputDF.head()
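
The saved file contains the encoded labels 0 and 1. If readable labels are preferred, one option is to map the codes back with `inverse_transform` before writing. This is a sketch with toy values: the lists `actual` and `predicted` are made up, and the CSV is written to an in-memory buffer instead of `gender_pred.csv`.

```python
import io

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Recreate the encoding used earlier: female -> 0, male -> 1.
encoder = LabelEncoder().fit(["female", "male"])

actual = [1, 0, 1]       # toy encoded ground truth
predicted = [1, 0, 0]    # toy encoded predictions

readable_df = pd.DataFrame({
    "Actual_label": encoder.inverse_transform(actual),
    "Predicted_label": encoder.inverse_transform(predicted),
})

buffer = io.StringIO()                 # stands in for a file on disk
readable_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```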

**Thank you**,<br>
Nikunj Bansal,<br>
R177218063,<br>
B2 Batch<br>