# Predicting the gender of a voice

The goal of this project is to build a machine learning pipeline that classifies voices as male or female from the acoustic properties of the voice and speech. It fits a logistic regression from `spark.ml` on Kaggle's [voice.csv](https://www.kaggle.com/primaryobjects/voicegender/data) dataset.

The following acoustic properties of each voice are measured and included within the CSV:

* `meanfreq`: mean frequency (in kHz)
* `sd`: standard deviation of frequency
* `median`: median frequency (in kHz)
* `Q25`: first quartile (in kHz)
* `Q75`: third quartile (in kHz)
* `IQR`: interquartile range (in kHz)
* `skew`: skewness (see note in specprop description)
* `kurt`: kurtosis (see note in specprop description)
* `sp.ent`: spectral entropy (named `sp_ent` in the schema used in this notebook)
* `sfm`: spectral flatness
* `mode`: mode frequency
* `centroid`: frequency centroid (see specprop)
* `peakf`: peak frequency (frequency with highest energy)
* `meanfun`: average of fundamental frequency measured across acoustic signal
* `minfun`: minimum fundamental frequency measured across acoustic signal
* `maxfun`: maximum fundamental frequency measured across acoustic signal
* `meandom`: average of dominant frequency measured across acoustic signal
* `mindom`: minimum of dominant frequency measured across acoustic signal
* `maxdom`: maximum of dominant frequency measured across acoustic signal
* `dfrange`: range of dominant frequency measured across acoustic signal
* `modindx`: modulation index. Calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range
* `label`: male or female
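
Some of these properties are derived from others. As a quick sanity check, the sketch below uses values copied from one row of the dataset (the row printed in the model evaluation section of this notebook) to confirm that `IQR` equals `Q75 - Q25` and that `dfrange` matches `maxdom - mindom`. The second relationship is an assumption read off the data, not something the dataset description states explicitly.

```python
import math

# Values copied from one row of voice.csv.
row = {
    "Q25": 0.00177966101694915,
    "Q75": 0.117457627118644,
    "IQR": 0.115677966101695,
    "mindom": 0.0078125,
    "maxdom": 1.4140625,
    "dfrange": 1.40625,
}

# Interquartile range: distance between the third and first quartiles.
iqr_ok = math.isclose(row["Q75"] - row["Q25"], row["IQR"], abs_tol=1e-12)

# Assumed relationship: range of dominant frequency = maximum - minimum.
dfrange_ok = math.isclose(row["maxdom"] - row["mindom"], row["dfrange"], abs_tol=1e-12)

print(iqr_ok, dfrange_ok)
```

Because such columns are exact functions of others, the feature set contains redundant information, which is one reason regularization can help the logistic regression.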

In [1]:
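# `spark` (the SparkSession) is assumed to be provided by the notebook environment
from pyspark.sql.types import (
    StructType, StructField, DoubleType, StringType, IntegerType
)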
import pyspark.ml.feature as ft
import pyspark.ml.classification as cl
from pyspark.ml import Pipeline

# libraries for model tuning
import pyspark.ml.tuning as tune
import pyspark.ml.evaluation as ev

## Data loading and preprocessing

After downloading the dataset onto the master node of the cluster and copying it into the `tmp` folder in Hadoop, we can move on to preprocessing.

In [2]:
voice_path = "/tmp/voice.csv"

In [3]:
# metadata / schema for the voice.csv dataset
schema = StructType([
    StructField("meanfreq", DoubleType(), True),    
    StructField("sd", DoubleType(), True),
    StructField("median", DoubleType(), True),
    StructField("Q25", DoubleType(), True),
    StructField("Q75", DoubleType(), True),
    StructField("IQR", DoubleType(), True),
    StructField("skew", DoubleType(), True),
    StructField("kurt", DoubleType(), True),
    StructField("sp_ent", DoubleType(), True),
    StructField("sfm", DoubleType(), True),
    StructField("mode", DoubleType(), True),
    StructField("centroid", DoubleType(), True),
    StructField("meanfun", DoubleType(), True),
    StructField("minfun", DoubleType(), True),
    StructField("maxfun", DoubleType(), True),
    StructField("meandom", DoubleType(), True),
    StructField("mindom", DoubleType(), True),
    StructField("maxdom", DoubleType(), True),
    StructField("dfrange", DoubleType(), True),
    StructField("modindx", DoubleType(), True),
    StructField("label", StringType(), True)    
])

# loading the data into a dataframe using the schema defined above
voice = spark.read.csv(voice_path, header=True, schema=schema)

In [4]:
# dataset structure and sample instance
voice.show(1)

+------------------+------------------+-----------------+------------------+------------------+------------------+----------------+----------------+-----------------+-----------------+----+------------------+-----------------+------------------+-----------------+---------+---------+---------+-------+-------+-----+
|          meanfreq|                sd|           median|               Q25|               Q75|               IQR|            skew|            kurt|           sp_ent|              sfm|mode|          centroid|          meanfun|            minfun|           maxfun|  meandom|   mindom|   maxdom|dfrange|modindx|label|
+------------------+------------------+-----------------+------------------+------------------+------------------+----------------+----------------+-----------------+-----------------+----+------------------+-----------------+------------------+-----------------+---------+---------+---------+-------+-------+-----+
|0.0597809849598081|0.0642412677031359|0.03202691337

In [5]:
voice.count()

3168

The dataset contains 3,168 recorded voice samples.

### Converting the data into correct format

We need to convert the gender labels from `StringType` to integer values before fitting the model. The comparison below encodes `male` as 1 and `female` as 0:

In [6]:
voice = voice.withColumn("label", (voice["label"]=="male").cast(IntegerType()))
voice.printSchema()

root
 |-- meanfreq: double (nullable = true)
 |-- sd: double (nullable = true)
 |-- median: double (nullable = true)
 |-- Q25: double (nullable = true)
 |-- Q75: double (nullable = true)
 |-- IQR: double (nullable = true)
 |-- skew: double (nullable = true)
 |-- kurt: double (nullable = true)
 |-- sp_ent: double (nullable = true)
 |-- sfm: double (nullable = true)
 |-- mode: double (nullable = true)
 |-- centroid: double (nullable = true)
 |-- meanfun: double (nullable = true)
 |-- minfun: double (nullable = true)
 |-- maxfun: double (nullable = true)
 |-- meandom: double (nullable = true)
 |-- mindom: double (nullable = true)
 |-- maxdom: double (nullable = true)
 |-- dfrange: double (nullable = true)
 |-- modindx: double (nullable = true)
 |-- label: integer (nullable = true)
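
The comparison-then-cast above maps `"male"` to 1 and every other non-null string to 0. The plain-Python sketch below mirrors that rule. `encode_label` is a hypothetical helper, not part of Spark, and it shows why a misspelled or differently capitalized label would silently become 0 instead of raising an error.

```python
def encode_label(label):
    """Hypothetical helper mimicking (label == "male") cast to an integer."""
    if label is None:
        return None  # a missing label stays missing
    return int(label == "male")

encoded = [encode_label(x) for x in ["male", "female", "Male", None]]
print(encoded)
```

If the raw labels might be inconsistent, checking the distinct values of the `label` column before encoding is a cheap safeguard.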



### Split the data into training and testing samples

We split the dataset into roughly 70% training data and 30% testing data. `randomSplit` assigns rows at random, so the resulting sizes are approximate, not exact.

In [7]:
# split data into training (70% of the samples) and testing (30% of the samples)
voice_train, voice_test = voice.randomSplit([0.7, 0.3], seed=1)
print("Voice_train shape: ", voice_train.count())
print(" Voice_test shape: ", voice_test.count())

Voice_train shape:  2227
 Voice_test shape:  941
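
The counts above (2,227 and 941, about 70.3% and 29.7% of 3,168) are close to the requested proportions but not exactly 70/30. A minimal sketch of why, assuming each row is assigned independently with a random draw: this is a plain-Python illustration, not Spark's actual implementation, and `random_split` is a hypothetical helper whose counts will differ from Spark's.

```python
import random

def random_split(rows, train_weight, seed):
    """Toy split: each row goes to train with probability train_weight."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(3168))  # stand-in for the 3,168 samples
train, test = random_split(rows, train_weight=0.7, seed=1)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which matters when comparing models across runs.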


## Using pipeline to fit the logistic regression

We use an **ML (machine learning) pipeline**, which chains multiple transformers and estimators together to specify an ML workflow. One advantage of this paradigm is that the same transformations are applied to both the training and the testing data. For training, the final stage is an estimator; for testing, it is the model that the estimator produced.
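
To make the transformer/estimator distinction concrete, here is a toy sketch in plain Python. None of these classes are Spark's: `Assembler`, `ThresholdEstimator`, `ThresholdModel`, `fit_pipeline` and `run_pipeline` are hypothetical names, and the rows are made-up values, not samples from voice.csv. The point is only the pattern: fitting replaces each estimator with the model it produces, and the fitted stages are then reused unchanged on test data.

```python
class Assembler:
    """Toy transformer: gathers the chosen columns into one 'features' list."""
    def __init__(self, input_cols):
        self.input_cols = input_cols

    def transform(self, rows):
        return [{**r, "features": [r[c] for c in self.input_cols]} for r in rows]


class ThresholdModel:
    """Fitted model: a transformer that adds a 'prediction' column."""
    def __init__(self, cut, positive_above):
        self.cut, self.positive_above = cut, positive_above

    def transform(self, rows):
        return [
            {**r, "prediction": int((r["features"][0] > self.cut) == self.positive_above)}
            for r in rows
        ]


class ThresholdEstimator:
    """Toy estimator: learns a cut-off halfway between the two class means."""
    def fit(self, rows):
        pos = [r["features"][0] for r in rows if r["label"] == 1]
        neg = [r["features"][0] for r in rows if r["label"] == 0]
        pos_mean, neg_mean = sum(pos) / len(pos), sum(neg) / len(neg)
        return ThresholdModel((pos_mean + neg_mean) / 2, positive_above=pos_mean > neg_mean)


def fit_pipeline(stages, rows):
    """Fit stages in order; estimators are replaced by the models they produce."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            stage = stage.fit(rows)
        fitted.append(stage)
        rows = stage.transform(rows)
    return fitted


def run_pipeline(fitted, rows):
    """Apply already-fitted stages to new data."""
    for stage in fitted:
        rows = stage.transform(rows)
    return rows


# Made-up rows for illustration only.
train = [
    {"meanfun": 0.10, "label": 1}, {"meanfun": 0.11, "label": 1},
    {"meanfun": 0.19, "label": 0}, {"meanfun": 0.20, "label": 0},
]
test = [{"meanfun": 0.12, "label": 1}, {"meanfun": 0.18, "label": 0}]

fitted = fit_pipeline([Assembler(["meanfun"]), ThresholdEstimator()], train)
predictions = [r["prediction"] for r in run_pipeline(fitted, test)]
print(predictions)
```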

### Creating transformers

We now define a transformer (`VectorAssembler`) that collects all the feature columns into a single column. This column is the feature vector passed to the model.

In [8]:
featuresCreator = ft.VectorAssembler(
    inputCols = voice.columns[:-1],  # input columns
    outputCol = 'features'           # output is a single column with all the features
)

In [9]:
featuresCreator.transform(voice).head().features


DenseVector([0.0598, 0.0642, 0.032, 0.0151, 0.0902, 0.0751, 12.8635, 274.4029, 0.8934, 0.4919, 0.0, 0.0598, 0.0843, 0.0157, 0.2759, 0.0078, 0.0078, 0.0078, 0.0, 0.0])

### Creating an estimator 

We will instantiate a **logistic regression model** with specific parameters:

In [10]:
lr_model = cl.LogisticRegression(  # logistic regression model 
    maxIter = 10,                  # maximum number of iterations (>= 0)
    regParam = 0.01,               # regularization parameter (>= 0)
    labelCol = 'label')            # label column name.

### Creating a pipeline 

We now chain the transformer and the estimator into a single `Pipeline`, which is an ordered list of stages:

In [11]:
pipeline = Pipeline(stages=[
        featuresCreator, # transformer
        lr_model         # estimator
    ])

## Hyper-parameter tuning

### Grid search

We use grid search to find training parameters that work well: we specify the model and the list of parameter values to loop through. Specifically, we test different maximum numbers of iterations and different regularization parameters (L2).

In [12]:
logistic = cl.LogisticRegression(labelCol = 'label')

# we can build a grid and add the parameters we want to tune
grid = tune.ParamGridBuilder() \
    .addGrid(logistic.maxIter, [2, 10, 50]) \
    .addGrid(logistic.regParam, [0.01, 0.05, 0.3]) \
    .build()
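
The grid built above is the Cartesian product of the two value lists, so it contains 3 × 3 = 9 candidate configurations. A plain-Python sketch of the same expansion follows; the dictionary keys are just labels here, and the order of entries in Spark's own list may differ.

```python
from itertools import product

max_iter_values = [2, 10, 50]
reg_param_values = [0.01, 0.05, 0.3]

# Every pairing of the two lists is one candidate configuration.
grid = [
    {"maxIter": m, "regParam": r}
    for m, r in product(max_iter_values, reg_param_values)
]

print(len(grid))
print(grid[0])
```

The grid grows multiplicatively with each added parameter, so it is worth keeping each value list short.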

In [13]:
# evaluator for binary classification; the default metric is area under ROC
evaluator = ev.BinaryClassificationEvaluator( 
    rawPredictionCol = 'probability', 
    labelCol = 'label')

In [14]:
# cross-validation setup (numFolds is left at its default of 3)
cv = tune.CrossValidator(
    estimator = logistic, 
    estimatorParamMaps = grid, 
    evaluator = evaluator
)
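
With the default of three folds, each of the nine candidate configurations is fitted three times, each time validated on a different third of the training data, and the three scores are averaged. The sketch below shows the fold bookkeeping in plain Python. It assigns rows round-robin for readability, whereas Spark assigns rows to folds randomly; `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs; each row is validated exactly once."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, valid in enumerate(folds):
        train = [j for f_idx, fold in enumerate(folds) if f_idx != i for j in fold]
        yield train, valid

splits = list(k_fold_indices(n_rows=10, k=3))
for train_idx, valid_idx in splits:
    print(sorted(train_idx), valid_idx)
```

After the search, the best configuration is refitted on the full training set, which is the model used for predictions later.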

We now create a transform-only `Pipeline` that contains just the feature assembler, fit it, and use its output to cross-validate the logistic regression.

In [15]:
# fitting a transformer-only pipeline (featuresCreator) on the training data;
# the logistic regression itself is fitted separately by the CrossValidator
pipeline = Pipeline(stages = [featuresCreator])
data_transformer = pipeline.fit(voice_train)

We can now run the cross-validated grid search on the transformed training data and look for the best combination of parameters.

In [16]:
cvModel = cv.fit(data_transformer.transform(voice_train))

In [17]:
results = [
    (
        [{param.name: value} for param, value in params.items()],
        metric
    )
    for params, metric
    in zip(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics)
]

# best parameter combination by average cross-validation metric
sorted(results,
       key=lambda el: el[1],
       reverse=True)[0]

([{'maxIter': 50}, {'regParam': 0.01}], 0.9925411535016605)

## Model evaluation

In [18]:
data_test = data_transformer.transform(voice_test)
results = cvModel.transform(data_test)

In [19]:
results.take(1)

[Row(meanfreq=0.0621823118609672, sd=0.0878894037873831, median=0.0109745762711864, Q25=0.00177966101694915, Q75=0.117457627118644, IQR=0.115677966101695, skew=9.61220808953177, kurt=114.803500510109, sp_ent=0.786650358576641, sfm=0.329569856469261, mode=0.000889830508474576, centroid=0.0621823118609672, meanfun=0.0997761550401998, minfun=0.0171122994652406, maxfun=0.258064516129032, meandom=0.0955528846153846, mindom=0.0078125, maxdom=1.4140625, dfrange=1.40625, modindx=0.105777777777778, label=1, features=DenseVector([0.0622, 0.0879, 0.011, 0.0018, 0.1175, 0.1157, 9.6122, 114.8035, 0.7867, 0.3296, 0.0009, 0.0622, 0.0998, 0.0171, 0.2581, 0.0956, 0.0078, 1.4141, 1.4062, 0.1058]), rawPrediction=DenseVector([-2.689, 2.689]), probability=DenseVector([0.0636, 0.9364]), prediction=1.0)]

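The last three fields, `rawPrediction`, `probability` and `prediction`, are added by the model. For this sample the predicted probability of class 1 (male) is 0.9364, so the prediction is 1.0, which matches the label.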
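
For binary logistic regression, the `probability` vector is the logistic (sigmoid) transform of the margin stored in `rawPrediction`. The sketch below reproduces that step for the row above, starting from the rounded margin 2.689, so the result should agree with the printed probabilities only up to rounding.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real-valued margin to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

margin = 2.689  # second entry of rawPrediction in the row above
p_male = sigmoid(margin)
p_female = 1.0 - p_male
print(round(p_male, 4), round(p_female, 4))
```

A positive margin gives a probability above 0.5, which is why this sample is predicted as class 1.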

We check how well the model performs on the test set using the area under the ROC curve (AUROC) and the area under the precision-recall curve (AUPR):

In [20]:
print('Area under ROC:', evaluator.evaluate(results, {evaluator.metricName: 'areaUnderROC'}))
print(' Area under PR:', evaluator.evaluate(results, {evaluator.metricName: 'areaUnderPR'}))

Area under ROC: 0.9944933323696312
 Area under PR: 0.9953270478148899


Both scores are close to 1 on the held-out test set, so the model separates the two classes very well and maintains high precision and recall across decision thresholds.
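
One way to read the area under the ROC curve is as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it from that pairwise definition on made-up scores. It is an illustration of the metric, not the method Spark uses internally, and `auc_roc` is a hypothetical helper.

```python
def auc_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0]
scores = [0.94, 0.80, 0.40, 0.55, 0.06]  # made-up predicted probabilities of class 1
auc = auc_roc(labels, scores)
print(auc)
```

Here five of the six positive/negative pairs are ordered correctly, so the toy score is 5/6. A value near 0.99, as reported above, means almost every such pair is ordered correctly.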