This notebook shows a simple approach to detecting gender from voice samples using several classifiers, and then combines them with an ensembling technique. Overall, the best classifier gives an accuracy of 98.4% on the test set.

We will first use well-known classifiers like *Logistic Regression, Random Forest, Multilayer Perceptron (Neural Network)* and *Gradient Boosting*.
We will then feed the last three classifiers into an *Ensemble* classifier to see if we can get slightly better performance than with the individual classifiers. Overall, this data set seemed well behaved and relatively easy to work with, although it took several hours to tune the hyper-parameters for each of these classifiers to achieve the advertised performance.

## Environment Setup
First, we set up the environment by importing the relevant libraries, loading the data and pre-processing it.

In [1]:


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
import xgboost as xgb
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import accuracy_score

# Input data files are available in the "../input/" directory.

In [2]:
from sklearn.model_selection import train_test_split

# Read data into pandas data frame
df = pd.read_csv('../input/voice.csv')

# Convert categorical label data to integers
df['label'] = df['label'].map({'male': 0, 'female': 1})
# Separate features from the label; pop() removes 'label' from df in place
y = df.pop('label')
x = df
# Split data into training and test samples;
# random_state is fixed to replicate the same results in each run
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

## Logistic Regression

We will use a ***Logistic Regression*** classifier to get a quick idea of how to proceed with the analysis. We will tune the hyper-parameters using *Grid Search*. However, I soon found out that the classifier accuracy stopped improving beyond a certain point.

In [3]:
logr = LogisticRegression()

logr.fit(X_train, y_train)


In [4]:
logr.score(X_train, y_train)

This looks ok for a first-cut, but let's see if the accuracy improves with a hyperparameter search.

In [5]:
grid_params = {'C': [0.01, 0.1, 1, 2, 5],
               'solver': ['newton-cg', 'lbfgs']}
grid = GridSearchCV(logr, grid_params, cv=10)
grid.fit(X_train, y_train)

In [6]:
grid.score(X_train, y_train)

In [7]:
grid.best_params_

In [9]:
y_pred_logr = grid.predict(X_test)
accuracy_score(y_test, y_pred_logr)

We will try to refine the classifier around these best parameters. I did not see a significant change in either the score or the accuracy of the predictions.

In [11]:
grid_params = {'C': [4.9, 5.0, 5.1, 5.5, 6.0], 'solver': ['newton-cg', 'lbfgs'], 'max_iter':[100, 300, 600]}
grid = GridSearchCV(logr, grid_params, cv=10)
grid.fit(X_train, y_train)

In [12]:
grid.score(X_train, y_train)

In [13]:
grid.best_params_
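
To see how flat the search really is, it can help to look at the whole table of cross-validation scores rather than only the best parameters. The following is a minimal sketch on synthetic data (the `make_classification` data set and the names `X_demo`, `demo_grid` and `cv_table` are assumptions for illustration, not part of the voice data analysis). It reads the `cv_results_` attribute that `GridSearchCV` fills in after fitting.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the voice data (assumption: 20 numeric features, binary label)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)

demo_grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {'C': [0.01, 0.1, 1, 5]},
    cv=5,
)
demo_grid.fit(X_demo, y_demo)

# One row per parameter setting, sorted best-first by mean validation accuracy
cv_table = (
    pd.DataFrame(demo_grid.cv_results_)
    [['param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']]
    .sort_values('rank_test_score')
)
print(cv_table)
```

If the spread in `mean_test_score` across rows is smaller than `std_test_score`, further tuning of that parameter is unlikely to pay off.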

## Gradient Boosting

We now turn to ***Gradient Boosting (XGBoost)***. Even with default parameters, we get good results compared to *Logistic Regression*, as expected.

In [14]:
gbm = xgb.XGBClassifier()
gbm.fit(X_train, y_train)
gbm.score(X_train, y_train)

Starting with the default parameters and increasing the number of estimators to 300, we get a perfect score on the training set!

In [15]:
# Deprecated arguments (silent, nthread, seed) and those left at None are omitted
gbm = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                        colsample_bytree=1, gamma=0, learning_rate=0.1,
                        max_delta_step=0, max_depth=3, min_child_weight=1,
                        n_estimators=300, n_jobs=1, objective='binary:logistic',
                        random_state=42, reg_alpha=0, reg_lambda=1,
                        scale_pos_weight=1, subsample=1)
gbm.fit(X_train, y_train)
gbm.score(X_train, y_train)

In [16]:
accuracy_score(y_test, gbm.predict(X_test))
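
A perfect training score says little by itself, so it is worth checking how held-out accuracy moves as trees are added. Here is a rough sketch of that check. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost (a different implementation, chosen only because its `staged_predict` method makes the per-tree curve easy to get), and it runs on synthetic data. All variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the voice data (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# scikit-learn's gradient boosting, not XGBoost, with similar settings
booster = GradientBoostingClassifier(n_estimators=300, learning_rate=0.1,
                                     max_depth=3, random_state=42)
booster.fit(X_tr, y_tr)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees
train_curve = [accuracy_score(y_tr, p) for p in booster.staged_predict(X_tr)]
val_curve = [accuracy_score(y_val, p) for p in booster.staged_predict(X_val)]

best_n = int(np.argmax(val_curve)) + 1
print(f"train accuracy with all trees: {train_curve[-1]:.3f}")
print(f"best held-out accuracy {max(val_curve):.3f} at {best_n} trees")
```

If the held-out curve flattens long before the last tree, the extra estimators are mostly fitting the training set.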

## Random Forest
Following a similar strategy of hyper-parameter tuning with the *Random Forest* classifier, we get the following score and prediction results. Increasing the number of estimators all the way up to 1000 gave the best performance (at which point, hyperparameter tuning fatigue set in).

In [17]:
# max_features='sqrt' is what 'auto' meant for classifiers; 'auto' and
# min_impurity_split are no longer accepted by recent scikit-learn releases
forest = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)
forest.fit(X_train, y_train)
forest.score(X_train, y_train)

In [18]:
accuracy_score(y_test, forest.predict(X_test))

In [19]:
# increase n_estimators to 1000
# max_features='sqrt' is what 'auto' meant for classifiers; 'auto' and
# min_impurity_split are no longer accepted by recent scikit-learn releases
forest = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)
forest.fit(X_train, y_train)
forest.score(X_train, y_train)

In [20]:
accuracy_score(y_test, forest.predict(X_test))

## Feature Importances
At this point I wanted to find out whether any features could be removed. We can compute and plot the feature importances for this forest.

In [21]:
# Plot feature importances, most important first
imports = forest.feature_importances_
indices = np.argsort(imports)[::-1]
# Standard deviation of each feature's importance across the trees in the forest
std = np.std([t.feature_importances_ for t in forest.estimators_], axis=0)

plt.figure()
plt.bar(range(X_train.shape[1]), imports[indices], yerr=std[indices], color='r', align='center')
plt.xticks(range(X_train.shape[1]), indices)
plt.xlim([-1, X_train.shape[1]])
plt.show()


As we can see, most of the features have a non-zero importance. However, we can still check whether removing some features has any impact. I decided to remove the least important feature, "***maxfun***".

In [22]:
df.columns[indices]
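
Matching index positions back to column names by eye is error-prone, so a labelled ranking can be handy. The snippet below is a small sketch of one way to do that with a `pandas.Series`. It runs on synthetic data with made-up column names (`feat_0` ... `feat_5` are hypothetical, as are `demo_forest` and `ranked`).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical column names
X_arr, y_demo = make_classification(n_samples=300, n_features=6,
                                    n_informative=3, random_state=42)
X_demo = pd.DataFrame(X_arr, columns=[f'feat_{i}' for i in range(6)])

demo_forest = RandomForestClassifier(n_estimators=100, random_state=42)
demo_forest.fit(X_demo, y_demo)

# Pair each importance with its column name and sort, most important first
ranked = pd.Series(demo_forest.feature_importances_,
                   index=X_demo.columns).sort_values(ascending=False)
print(ranked)
```

The last entry of `ranked` is the candidate to drop first.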

In [23]:
# Dropping the least important feature...

forest.fit(X_train.drop('maxfun', axis=1), y_train)
forest.score(X_train.drop('maxfun', axis=1), y_train)

In [24]:
accuracy_score(y_test, forest.predict(X_test.drop('maxfun', axis=1)))

That slightly improved the performance! However, dropping one or two more of the least important features did not help any further.
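
Small differences like this are easier to trust when they come from cross-validation rather than from a single test split. As a hedged sketch of the drop-the-weakest-features experiment, the loop below removes the `k` least important features and scores each variant with 5-fold cross-validation. It uses synthetic data, and every name (`feat_*`, `base`, `order`, `cv_scores`) is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with hypothetical column names
X_arr, y_demo = make_classification(n_samples=300, n_features=8,
                                    n_informative=4, random_state=42)
X_demo = pd.DataFrame(X_arr, columns=[f'feat_{i}' for i in range(8)])

base = RandomForestClassifier(n_estimators=100, random_state=42)
# Column names ordered from least to most important
order = X_demo.columns[np.argsort(base.fit(X_demo, y_demo).feature_importances_)]

cv_scores = {}
for k in range(4):
    kept = X_demo.drop(columns=list(order[:k]))  # drop the k weakest features
    cv_scores[k] = cross_val_score(base, kept, y_demo, cv=5).mean()
    print(f"dropped {k} feature(s): mean CV accuracy {cv_scores[k]:.3f}")
```

Note that the importance ranking here is computed once on the full feature set. Re-ranking after each drop is a reasonable variation.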

## Multilayer Perceptron (MLP)
We now turn our attention to MLP. Again, the approach was to start with the default settings and work my way up to better performing settings. The changes that really made a difference were the number of iterations, the solver type and switching the activation from the default *relu* to *tanh*.

In [25]:
mlp = MLPClassifier(activation='tanh', alpha=0.1, hidden_layer_sizes=(100, 100, 100), max_iter=3000, random_state=42, solver='lbfgs', tol=0.0001)
mlp.fit(X_train, y_train)
mlp.score(X_train, y_train)

In [26]:
accuracy_score(y_test, mlp.predict(X_test))
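
The manual trial and error over solver, activation and iteration count could also be handed to `RandomizedSearchCV`, which samples a fixed number of settings instead of trying every combination. Below is a minimal sketch under stated assumptions: synthetic data, a deliberately small network so it runs quickly, and a made-up search space (`param_dist`). It is not the search that produced the settings above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the voice data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=42)

# Hypothetical search space covering the parameters discussed above
param_dist = {
    'activation': ['relu', 'tanh'],
    'solver': ['lbfgs', 'adam'],
    'alpha': [1e-4, 1e-2, 1e-1],
    'max_iter': [200, 500],
}
search = RandomizedSearchCV(
    MLPClassifier(hidden_layer_sizes=(20,), random_state=42),
    param_dist,
    n_iter=5,          # sample 5 of the 24 possible combinations
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Some sampled settings may emit convergence warnings. That is expected with low `max_iter` values and does not stop the search.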

## Ensemble of classifiers
We now take the 3 best performing classifiers (*Random Forest, XGBoost and MLP*) and combine their outputs to see if performance improves even more. We will use the *VotingClassifier* with either (1) a majority vote (hard decision thresholding) or (2) the average of the predicted probabilities (soft decision thresholding) to predict the target classes.
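
To make the difference between the two voting modes concrete, here is a toy illustration in plain NumPy. The probabilities are invented for the example, and the code only mimics the idea behind hard and soft voting for a binary problem. It is not the `VotingClassifier` implementation.

```python
import numpy as np

# Made-up P(class 1) from three classifiers (rows) for four samples (columns)
proba_class1 = np.array([
    [0.90, 0.40, 0.55, 0.20],
    [0.60, 0.45, 0.45, 0.30],
    [0.30, 0.95, 0.45, 0.10],
])

# Hard voting: each classifier casts a 0/1 vote, and the majority wins
votes = (proba_class1 > 0.5).astype(int)
hard = (votes.sum(axis=0) >= 2).astype(int)

# Soft voting: average the probabilities, then threshold at 0.5
soft = (proba_class1.mean(axis=0) > 0.5).astype(int)

print("hard:", hard)  # hard: [1 0 0 0]
print("soft:", soft)  # soft: [1 1 0 0]
```

The second sample shows where the two modes part ways: two classifiers lean slightly towards class 0, but one very confident classifier pulls the averaged probability to 0.6, so soft voting predicts class 1 while hard voting predicts class 0.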

In [27]:
vclf = VotingClassifier(estimators=[('forest', forest), ('mlp', mlp), ('gbm', gbm)], voting='hard')
vclf.fit(X_train, y_train)
accuracy_score(y_test, vclf.predict(X_test))

In [28]:
vclf = VotingClassifier(estimators=[('forest', forest), ('mlp', mlp), ('gbm', gbm)], voting='soft')
vclf.fit(X_train, y_train)
accuracy_score(y_test, vclf.predict(X_test))

Bummer :). Ensembling did not help. A VotingClassifier is not always guaranteed to do better than its members. So for now, *Random Forest* at 98.4% it is.