# Predicting gender from voice features

The data set used for this exercise is taken from Kaggle: https://www.kaggle.com/primaryobjects/voicegender/home <br>
<br>
I chose this data set because recognising audio features (including voice) is a topic that has recently caught my interest, which makes this data set personally interesting to work with.<br>
<br>
<strong>Objective</strong><br>
The objective of this exercise is to predict the outcomes in a data set using either Random Forest, Decision Tree or k-NN. <br>
<br>
For this exercise, I chose a Random Forest model to predict the gender label of each voice.

<hr><br>
As a first step, import all the libraries and dependencies needed for the exercise.

In [56]:
import seaborn as sns
import sklearn as sk
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier 
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

Below are the first 5 rows of the data set provided for this exercise, voice.csv.

In [57]:
df = pd.read_csv('voice.csv')
df = df.dropna() #first get rid of rows with empty cells
df.head()

Unnamed: 0,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,...,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx,label
0,0.059781,0.064241,0.032027,0.015071,0.090193,0.075122,12.863462,274.402906,0.893369,0.491918,...,0.059781,0.084279,0.015702,0.275862,0.007812,0.007812,0.007812,0.0,0.0,male
1,0.066009,0.06731,0.040229,0.019414,0.092666,0.073252,22.423285,634.613855,0.892193,0.513724,...,0.066009,0.107937,0.015826,0.25,0.009014,0.007812,0.054688,0.046875,0.052632,male
2,0.077316,0.083829,0.036718,0.008701,0.131908,0.123207,30.757155,1024.927705,0.846389,0.478905,...,0.077316,0.098706,0.015656,0.271186,0.00799,0.007812,0.015625,0.007812,0.046512,male
3,0.151228,0.072111,0.158011,0.096582,0.207955,0.111374,1.232831,4.177296,0.963322,0.727232,...,0.151228,0.088965,0.017798,0.25,0.201497,0.007812,0.5625,0.554688,0.247119,male
4,0.13512,0.079146,0.124656,0.07872,0.206045,0.127325,1.101174,4.333713,0.971955,0.783568,...,0.13512,0.106398,0.016931,0.266667,0.712812,0.007812,5.484375,5.476562,0.208274,male


The following shows how many rows carry each value in the 'label' column.

In [58]:
df['label'].value_counts()

female    1584
male      1584
Name: label, dtype: int64

The data set is balanced: it contains 1584 female and 1584 male voices. Because the 'label' column is categorical, I create dummy variables (0/1 indicator columns) so that the label can be included in the correlation calculation further down.

In [59]:
dummies = pd.get_dummies(df['label'])
df = pd.concat([df, dummies], axis=1)
df.head()

Unnamed: 0,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,...,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx,label,female,male
0,0.059781,0.064241,0.032027,0.015071,0.090193,0.075122,12.863462,274.402906,0.893369,0.491918,...,0.015702,0.275862,0.007812,0.007812,0.007812,0.0,0.0,male,0,1
1,0.066009,0.06731,0.040229,0.019414,0.092666,0.073252,22.423285,634.613855,0.892193,0.513724,...,0.015826,0.25,0.009014,0.007812,0.054688,0.046875,0.052632,male,0,1
2,0.077316,0.083829,0.036718,0.008701,0.131908,0.123207,30.757155,1024.927705,0.846389,0.478905,...,0.015656,0.271186,0.00799,0.007812,0.015625,0.007812,0.046512,male,0,1
3,0.151228,0.072111,0.158011,0.096582,0.207955,0.111374,1.232831,4.177296,0.963322,0.727232,...,0.017798,0.25,0.201497,0.007812,0.5625,0.554688,0.247119,male,0,1
4,0.13512,0.079146,0.124656,0.07872,0.206045,0.127325,1.101174,4.333713,0.971955,0.783568,...,0.016931,0.266667,0.712812,0.007812,5.484375,5.476562,0.208274,male,0,1
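As a minimal sketch of what the dummy encoding does, the snippet below applies it to a tiny hand-made frame. The frame `toy`, its values, and the `is_female` column are hypothetical and only stand in for voice.csv; for a binary label, a single 0/1 column carries the same information as the two dummy columns.

```python
import pandas as pd

# Hypothetical toy data standing in for voice.csv
toy = pd.DataFrame({
    "meanfun": [0.08, 0.17, 0.10, 0.18],
    "label": ["male", "female", "male", "female"],
})

# One indicator column per category, in alphabetical order
dummies = pd.get_dummies(toy["label"], dtype=int)
toy = pd.concat([toy, dummies], axis=1)

# For a two-class label, one 0/1 column is enough
toy["is_female"] = (toy["label"] == "female").astype(int)

print(toy)
```

The 'female' and 'male' columns always sum to 1, which is why their correlations with any other variable are mirror images of each other.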


I'm interested in the topic but have never worked with voice recognition variables, so I don't know in advance which variables have strong predictive value. Below, I therefore calculate the correlation between all variables and sort the result by correlation with the 'female' dummy column, which is what I want to predict (the correlations with 'male' are the same values with the opposite sign).

In [60]:
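# drop the text column first: newer pandas versions raise an error on non-numeric columns in corr()
df.drop(columns='label').corr().sort_values('female', ascending=False)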

Unnamed: 0,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,...,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx,female,male
female,0.337415,-0.479539,0.283919,0.511455,-0.066906,-0.618916,-0.036627,-0.087195,-0.490552,-0.357499,...,0.833921,0.136692,0.166461,0.191067,0.194974,0.195657,0.192213,-0.030801,1.0,-1.0
meanfun,0.460844,-0.466281,0.414909,0.545035,0.155091,-0.534462,-0.167668,-0.19456,-0.513194,-0.421066,...,1.0,0.339387,0.31195,0.27084,0.162163,0.277982,0.275154,-0.054858,0.833921,-0.833921
Q25,0.911416,-0.846931,0.774922,1.0,0.47714,-0.874189,-0.319475,-0.350182,-0.648126,-0.766875,...,0.545035,0.320994,0.199841,0.467403,0.302255,0.459683,0.454394,-0.141377,0.511455,-0.511455
meanfreq,1.0,-0.739039,0.925445,0.911416,0.740997,-0.627605,-0.322327,-0.316036,-0.601203,-0.784332,...,0.460844,0.383937,0.274004,0.536666,0.229261,0.519528,0.51557,-0.216979,0.337415,-0.337415
centroid,1.0,-0.739039,0.925445,0.911416,0.740997,-0.627605,-0.322327,-0.316036,-0.601203,-0.784332,...,0.460844,0.383937,0.274004,0.536666,0.229261,0.519528,0.51557,-0.216979,0.337415,-0.337415
median,0.925445,-0.562603,1.0,0.774922,0.731849,-0.477352,-0.257407,-0.243382,-0.502005,-0.66169,...,0.414909,0.337602,0.251328,0.455943,0.191169,0.438919,0.435621,-0.213298,0.283919,-0.283919
maxdom,0.519528,-0.482278,0.438919,0.459683,0.335114,-0.337877,-0.305651,-0.2745,-0.324253,-0.436649,...,0.277982,0.31786,0.35539,0.812838,0.02664,1.0,0.999838,-0.425531,0.195657,-0.195657
mindom,0.229261,-0.357667,0.191169,0.302255,-0.02375,-0.357037,-0.061608,-0.103313,-0.294869,-0.289593,...,0.162163,0.082015,-0.243426,0.099656,1.0,0.02664,0.008666,0.200212,0.194974,-0.194974
dfrange,0.51557,-0.475999,0.435621,0.454394,0.335648,-0.331563,-0.30464,-0.272729,-0.319054,-0.43158,...,0.275154,0.316486,0.35988,0.811304,0.008666,0.999838,1.0,-0.429266,0.192213,-0.192213
meandom,0.536666,-0.482726,0.455943,0.467403,0.359181,-0.333362,-0.336848,-0.303234,-0.293562,-0.428442,...,0.27084,0.375979,0.337553,1.0,0.099656,0.812838,0.811304,-0.180954,0.191067,-0.191067
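Sorting by the signed correlation places strongly negative values at the bottom of the table. A possible alternative, sketched here on synthetic data, is to rank by the absolute value instead. The column names `pos_feature`, `neg_feature` and `noise` are made up for illustration and are not part of voice.csv.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
female = rng.integers(0, 2, size=n)

# Synthetic features (hypothetical, for illustration only)
toy = pd.DataFrame({
    "female": female,
    "pos_feature": female + rng.normal(0, 0.5, n),   # positive correlation
    "neg_feature": -female + rng.normal(0, 0.3, n),  # stronger, but negative
    "noise": rng.normal(0, 1, n),                    # unrelated to the label
})

# Correlation of every feature with the target, target itself removed
corr_with_target = toy.corr()["female"].drop("female")

# Rank by strength of the relationship, ignoring its direction
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

With a signed sort, `neg_feature` would appear last even though it is the most strongly related to the label; the absolute ranking puts it first.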


## Training the algorithm ##

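Sorted this way, the variables with the highest positive correlation to 'female' are 'meanfun', 'Q25', 'meanfreq', 'centroid', 'median', 'maxdom' and 'mindom', so these are the variables I use to predict the label. Two caveats: sorting by the signed value leaves out variables with a strong negative correlation, such as 'IQR' (-0.62), and 'centroid' is perfectly correlated with 'meanfreq' (1.0), so it adds no new information.<br>
<br>
In the cell below, I also separate the training and test data. The train_test_split function from scikit-learn splits the data set randomly into a training set and a test set. With test_size=0.3, 70% of the rows become training data and 30% test data; random_state=1 makes the split reproducible.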

In [61]:
X = df[['meanfun', 'Q25', 'meanfreq', 'centroid', 'median', 'maxdom', 'mindom']]

y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

X_train.head()

| | meanfun | Q25 | meanfreq | centroid | median | maxdom | mindom |
|---|---|---|---|---|---|---|---|
| 1866 | 0.174179 | 0.043235 | 0.141083 | 0.141083 | 0.179804 | 3.59375 | 0.007812 |
| 767 | 0.100181 | 0.106603 | 0.167021 | 0.167021 | 0.187109 | 2.367188 | 0.007812 |
| 2862 | 0.179095 | 0.223822 | 0.234486 | 0.234486 | 0.240764 | 8.765625 | 0.023438 |
| 1064 | 0.138659 | 0.138016 | 0.197073 | 0.197073 | 0.217386 | 8.015625 | 0.023438 |
| 270 | 0.108932 | 0.094235 | 0.142018 | 0.142018 | 0.139775 | 4.75 | 0.007812 |
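A purely random split does not guarantee that the test set keeps the 50/50 class balance of the full data (in this notebook the test set ends up with 457 female and 494 male voices). One option, shown here as a sketch on a hypothetical toy frame rather than on voice.csv, is the `stratify` argument of `train_test_split`, which preserves the class proportions in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 10 female and 10 male rows
toy = pd.DataFrame({
    "meanfun": [i / 100 for i in range(20)],
    "label": ["female", "male"] * 10,
})
X = toy[["meanfun"]]
y = toy["label"]

# stratify=y keeps the female/male ratio the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

print(len(X_train), len(X_test))
print(y_test.value_counts())
```

Whether stratification matters here is a judgment call: with 3168 rows the imbalance from a random split is small, so the notebook's unstratified split is a reasonable choice too.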


For this exercise, I chose a random forest model. Why random forest? A random forest consists of many individual decision trees, each trained on a random sample of the training data, whose predictions are combined into one. Random forests are typically more accurate than a single decision tree.<br>
<br>
The random forest model below is the RandomForestClassifier class from scikit-learn. Because the algorithm uses randomness, I set random_state so that the results stay the same between runs, which is useful for presentation purposes.
<br>
<br>
I've also set the number of trees (n_estimators) to 100. This is the default from scikit-learn 0.22 onward; earlier versions defaulted to 10. Current literature suggests using more trees than was traditional, and the extra computing power this requires is now readily available.

In [62]:
rf = RandomForestClassifier(random_state=1, n_estimators=100)
rf = rf.fit(X_train, y_train)
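Once a forest is fitted, its `feature_importances_` attribute offers another view of which variables carry predictive value, complementing the correlation table. The sketch below uses synthetic data with made-up column names (`informative`, `noise`); it is not the notebook's model, only an illustration of the attribute.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 300
y = rng.integers(0, 2, size=n)

# Synthetic features (hypothetical): one tied to the label, one pure noise
X = pd.DataFrame({
    "informative": y + rng.normal(0, 0.3, n),
    "noise": rng.normal(0, 1, n),
})

rf_toy = RandomForestClassifier(n_estimators=100, random_state=1)
rf_toy.fit(X, y)

# Impurity-based importances; they sum to 1 across all features
importances = pd.Series(
    rf_toy.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Applied to the model above, the same pattern would be `pd.Series(rf.feature_importances_, index=X_train.columns)`.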

## Evaluating the model ##
Let's evaluate the model using our standard approach for a *classification* problem: making a confusion matrix and calculating accuracy, precision and recall.

The confusion matrix uses the *sorted* labels, so 'female' comes first and 'male' second. Rows are the actual labels and columns are the predicted labels.

In [63]:
y_test_pred = rf.predict(X_test) #the predicted values
cm = confusion_matrix(y_test, y_test_pred) #creates a "confusion matrix"
cm

array([[447,  10],
       [ 17, 477]])

In [64]:
y_pred = rf.predict(X_test)  # predicted labels for the test set
conf_matrix = pd.DataFrame(
    confusion_matrix(y_test, y_pred),  # rows = actual, columns = predicted
    index=['Female (actual)', 'Male (actual)'],
    columns=['Female (predicted)', 'Male (predicted)'])
conf_matrix

| | Female (predicted) | Male (predicted) |
|---|---|---|
| Female (actual) | 447 | 10 |
| Male (actual) | 17 | 477 |


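As the confusion matrix shows, the model misclassifies 10 of the 457 female voices and 17 of the 494 male voices, so both classes are predicted well, with female voices missed slightly less often. In the *classification_report* below, I calculate precision and recall.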

In [65]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

      female       0.96      0.98      0.97       457
        male       0.98      0.97      0.97       494

    accuracy                           0.97       951
   macro avg       0.97      0.97      0.97       951
weighted avg       0.97      0.97      0.97       951
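The figures in the report can be reproduced by hand from the confusion matrix shown earlier. The sketch below does the arithmetic in plain Python: precision divides the correct predictions by the column total (everything predicted as that class), while recall divides by the row total (everything that actually is that class).

```python
# Confusion matrix from the notebook: rows = actual, columns = predicted,
# both in the order female, male.
cm = [[447, 10],
      [17, 477]]
labels = ["female", "male"]

total = sum(sum(row) for row in cm)
metrics = {}
for i, name in enumerate(labels):
    correct = cm[i][i]
    predicted_as_class = sum(cm[r][i] for r in range(len(labels)))  # column sum
    actually_class = sum(cm[i])                                     # row sum
    metrics[name] = {
        "precision": correct / predicted_as_class,
        "recall": correct / actually_class,
    }

accuracy = sum(cm[i][i] for i in range(len(labels))) / total

for name, m in metrics.items():
    print(name, round(m["precision"], 2), round(m["recall"], 2))
print("accuracy", round(accuracy, 2))
```

For example, female precision is 447 / (447 + 17) and female recall is 447 / (447 + 10).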



The precision for both female and male voices is very good (close to 1): only about 2% of the voices predicted as male are actually female, and about 4% of the voices predicted as female are actually male.<br>
<br>
The recall is also very high. The model misses only 2% of the female voices and 3% of the male voices.<br>
<br>
Overall, I would say these are solid predictions, given that only 2-4% of the predictions in each class are wrong.