# Exploring the "Gender Recognition by Voice" dataset

Let's explore the "Voice Gender" dataset. It contains various acoustic characteristics of human voices, labeled by gender. I will analyse these characteristics to find the most informative ones, and then build a classification model for the dataset.

## Data analysis

Let's inspect a few rows from the dataset.

In [None]:
import pandas as pd

df = pd.read_csv('../input/voicegender/voice.csv')
df.head()

Here we see plenty of numerical voice characteristics and the "label" column, which specifies the gender. Let's inspect how strongly each column correlates with gender.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

label = df['label'].map({ 'male': 0, 'female': 1 }, na_action='ignore')
corr_data = df.drop('label', axis=1).corrwith(label).abs().sort_values()

plt.figure(figsize=(12, 8))
sns.barplot(x=corr_data.values, y=corr_data.index)
plt.title('Correlation with gender')
plt.xlabel('correlation (abs)')
plt.show()

The plot shows that the "meanfun" column correlates strongly with gender. Below are descriptions of the most informative columns.
* meanfun: "average of fundamental frequency measured across acoustic signal"
* IQR: "interquantile range (in kHz)"
* Q25: "first quantile (in kHz)"
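
The ranking above works because the Pearson correlation of a numeric column with a 0/1-encoded label is a valid measure of association (the point-biserial correlation). Here is a minimal sketch of the same computation on synthetic data. The helper `abs_label_correlation` and the columns `pitch` and `noise` are hypothetical names used only for illustration, and the numbers are not taken from voice.csv.

```python
import numpy as np
import pandas as pd

def abs_label_correlation(features: pd.DataFrame, label: pd.Series) -> pd.Series:
    """Absolute Pearson correlation of each column with a 0/1 label, sorted ascending."""
    return features.corrwith(label).abs().sort_values()

# Synthetic illustration only; these are not values from voice.csv.
rng = np.random.default_rng(0)
demo_label = pd.Series(rng.integers(0, 2, size=500))
demo = pd.DataFrame({
    'pitch': demo_label * 1.0 + rng.normal(0, 0.3, size=500),  # depends on the label
    'noise': rng.normal(0, 1, size=500),                        # independent of the label
})

ranking = abs_label_correlation(demo, demo_label)
print(ranking)
```

A column that depends on the label ends up at the bottom of the sorted series (highest absolute correlation), while an unrelated column stays near zero.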

Next, I plot the distribution of the "meanfun" values grouped by gender.

In [None]:
plt.figure(figsize=(10, 6))
sns.swarmplot(x=df['label'], y=df['meanfun'])
plt.title('meanfun distribution by gender')
plt.show()

Here we see that the "meanfun" values (average fundamental frequency) are lower for men than for women. This reflects the fact that men's voices are usually lower-pitched.
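
The visual impression can also be backed by per-group summary statistics. The sketch below assumes a frame with a "label" column, as in this dataset. The helper `summarize_by_label` is a hypothetical name, and the tiny frame is made up for illustration rather than taken from voice.csv.

```python
import pandas as pd

def summarize_by_label(df: pd.DataFrame, column: str, label_column: str = 'label') -> pd.DataFrame:
    """Return count, mean and median of `column` for each class in `label_column`."""
    return df.groupby(label_column)[column].agg(['count', 'mean', 'median'])

# Tiny made-up frame for illustration; not rows from voice.csv.
toy = pd.DataFrame({
    'label': ['male', 'male', 'male', 'female', 'female', 'female'],
    'meanfun': [0.10, 0.12, 0.11, 0.17, 0.19, 0.18],
})

summary = summarize_by_label(toy, 'meanfun')
print(summary)
```

On the real data, the same call would be `summarize_by_label(df, 'meanfun')`.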

To inspect the IQR and Q25 columns, I built a joint distribution plot of the two.

In [None]:
sns.jointplot(x=df['IQR'], y=df['Q25'], kind='kde')
plt.show()

The distribution of these values shows two overlapping clusters, which most likely correspond to the two genders.

## Model training

Because the correlations are high, I prefer a simple classification model. I used only the most informative columns for training.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

y = df['label'].map({ 'male': 0, 'female': 1 }, na_action='ignore')
X = df[['meanfun', 'IQR', 'Q25', 'sp.ent', 'sd', 'sfm', 'centroid', 'meanfreq', 'median']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Fixed seed so that the reported accuracy is reproducible between runs
tree = DecisionTreeClassifier(random_state=1)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

print('Accuracy is', metrics.accuracy_score(y_test, y_pred))
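
A fitted decision tree also reports how much each column contributed to its splits, which is another way to check which columns matter. The sketch below uses scikit-learn's `feature_importances_` attribute on synthetic data. The helper `tree_importances` and the columns `informative` and `noise` are hypothetical and stand in for the real features.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def tree_importances(X: pd.DataFrame, y: pd.Series, random_state: int = 0) -> pd.Series:
    """Fit a decision tree and return its impurity-based feature importances, highest first."""
    model = DecisionTreeClassifier(random_state=random_state)
    model.fit(X, y)
    return pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)

# Synthetic illustration only; these are not values from voice.csv.
rng = np.random.default_rng(1)
y_demo = pd.Series(rng.integers(0, 2, size=400))
X_demo = pd.DataFrame({
    'informative': y_demo + rng.normal(0, 0.2, size=400),  # separates the classes well
    'noise': rng.normal(0, 1, size=400),                    # unrelated to the classes
})

importances = tree_importances(X_demo, y_demo)
print(importances)
```

On the real data, the equivalent check would be `pd.Series(tree.feature_importances_, index=X.columns)` after fitting.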

Let's check the model's accuracy without the "meanfun" column.

In [None]:
y = df['label'].map({ 'male': 0, 'female': 1 }, na_action='ignore')
X = df[['IQR', 'Q25', 'sp.ent', 'sd', 'sfm', 'centroid', 'meanfreq', 'median']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Same fixed seed as before, so the two accuracies are comparable
tree = DecisionTreeClassifier(random_state=1)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

print('Accuracy is', metrics.accuracy_score(y_test, y_pred))
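
A single train/test split can make the comparison between the two feature sets noisy. One possible refinement, which this notebook does not use, is to average the accuracy over several cross-validation folds. The sketch below shows the idea on synthetic data. The helper `cv_accuracy` and the columns `strong` and `weak` are hypothetical, with `strong` playing the role of "meanfun".

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def cv_accuracy(X: pd.DataFrame, y: pd.Series, folds: int = 5, random_state: int = 1) -> float:
    """Mean accuracy of a decision tree over `folds` stratified cross-validation folds."""
    model = DecisionTreeClassifier(random_state=random_state)
    scores = cross_val_score(model, X, y, cv=folds, scoring='accuracy')
    return scores.mean()

# Synthetic illustration only; these are not values from voice.csv.
rng = np.random.default_rng(2)
y_demo = pd.Series(rng.integers(0, 2, size=600))
X_demo = pd.DataFrame({
    'strong': y_demo + rng.normal(0, 0.2, size=600),  # highly informative column
    'weak': y_demo + rng.normal(0, 1.0, size=600),    # weakly informative column
})

acc_all = cv_accuracy(X_demo, y_demo)
acc_without_strong = cv_accuracy(X_demo.drop(columns='strong'), y_demo)
print('With all columns:', acc_all)
print('Without "strong":', acc_without_strong)
```

On the real data, the same function could be called once with and once without the "meanfun" column.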

## Thanks

This is my first analysis on Kaggle. I would be glad to receive your comments and corrections.