# Final Exam <font color='red'>PART B</font> | BUAN 6341 Applied Machine Learning

<font color='#ccc'>**PART A:** Answer the MCQ Q1-Q20.</font><br>
<font color='red'>**PART B:** Answer the following questions and write your own code/answers in the boxes below. (A Jupyter Notebook simulator is available below as a courtesy tool for you to test and debug your answer.)</font>

In this part of the exam, you need to:
 - Write your code and the corresponding discussion in the cells below each question.
 - Use the online Jupyter simulator, which is available as a courtesy tool, to test your code.
   - To create a .ipynb file, navigate to 'Notebook' -> 'Python (Pyodide)'. Then copy the Python code blocks provided in the question and paste them into the simulator.
   - After testing your code, copy your answers back into the box below each question for submission.
   - If the simulator is unavailable or you are unsure how to use it, you may type your answers directly into the text boxes provided.


In [1]:
%matplotlib inline 
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

# <font color='red'>PART B1. Gender Recognition by Speech Analysis</font> 
You need to include both code and discussion in this coding file.


##### Setting and Data
This dataset was created to identify a voice as male or female based on acoustic properties of the voice and speech. It consists of 3,168 recorded voice samples collected from male and female speakers. The voice samples were pre-processed by acoustic analysis, with an analyzed frequency range of 0 Hz to 280 Hz (human vocal range).

The CSV file contains 20 acoustic properties of each voice, and one outcome variable, “label”, which identifies the gender of the speaker. The detailed information is listed below (you do NOT need to read through the variable description). 

- meanfreq: mean frequency (in kHz)
- sd: standard deviation of frequency
- median: median frequency (in kHz)
- Q25: first quartile (in kHz)
- Q75: third quartile (in kHz)
- IQR: interquartile range (in kHz)
- skew: skewness (see note in specprop description)
- kurt: kurtosis (see note in specprop description)
- sp.ent: spectral entropy
- sfm: spectral flatness
- mode: mode frequency
- centroid: frequency centroid (see specprop)
- meanfun: average of fundamental frequency measured across acoustic signal
- minfun: minimum fundamental frequency measured across acoustic signal
- maxfun: maximum fundamental frequency measured across acoustic signal
- meandom: average of dominant frequency measured across acoustic signal
- mindom: minimum of dominant frequency measured across acoustic signal
- maxdom: maximum of dominant frequency measured across acoustic signal
- dfrange: range of dominant frequency measured across acoustic signal
- modindx: modulation index. Calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range
- label: male or female

## Preliminaries
Use the code below to load data and check the variable names.


In [2]:
import pandas as pd
voice = pd.read_csv('voice.csv')
voice.columns

Index(['meanfreq', 'sd', 'median', 'Q25', 'Q75', 'IQR', 'skew', 'kurt',
       'sp.ent', 'sfm', 'mode', 'centroid', 'meanfun', 'minfun', 'maxfun',
       'meandom', 'mindom', 'maxdom', 'dfrange', 'modindx', 'label'],
      dtype='object')

Convert the values in the 'label' column of the DataFrame `voice` into numerical codes while preserving the categorical nature of the data. We would like to use the acoustic variables to predict the gender of the speaker (label). To start, define the input and output, split the data using the following code, and observe the sizes of the data sets.

In [3]:
voice['label'] = voice['label'].astype('category').cat.codes

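# Inputs: the first 19 columns (meanfreq through dfrange); note that 0:19 leaves out 'modindx' (column 19)
X = voice.iloc[:, 0:19]
# Output: the encoded 'label' column (column 20)
y = voice.iloc[:, 20]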

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

X.shape, y.shape

((3168, 19), (3168,))
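As a rough illustration of the split sizes, the sketch below applies `train_test_split` with its default settings to a placeholder array of the same shape as `X`. The arrays `X_demo` and `y_demo` are hypothetical stand-ins, not the voice data; the point is only that the default `test_size` of 0.25 leaves 2,376 rows for training and 792 for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as X and y (not the real voice data)
n_rows = 3168
X_demo = np.zeros((n_rows, 19))
y_demo = np.arange(n_rows) % 2

# With no test_size given, scikit-learn holds out 25% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

print(X_tr.shape, X_te.shape)  # (2376, 19) (792, 19)
```

Passing `stratify=y_demo` would additionally keep the class proportions equal in both parts; the exam code above does not do this.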

In [4]:
y  # the label column

0       1
1       1
2       1
3       1
4       1
       ..
3163    0
3164    0
3165    0
3166    0
3167    0
Name: label, Length: 3168, dtype: int8
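To see how the numerical codes relate to the original labels, here is a minimal sketch on a toy Series. It assumes the column holds the strings 'male' and 'female', as the variable description states; `labels` is a hypothetical example, not the real column. Because pandas sorts string categories lexically, 'female' becomes 0 and 'male' becomes 1.

```python
import pandas as pd

# Toy label column (hypothetical values, same two classes as the dataset description)
labels = pd.Series(['male', 'female', 'female', 'male'])

cat = labels.astype('category')
codes = cat.cat.codes  # integer code for each row

# Recover the code -> label mapping from the category order
mapping = dict(enumerate(cat.cat.categories))

print(mapping)         # {0: 'female', 1: 'male'}
print(codes.tolist())  # [1, 0, 0, 1]
```

Keeping `mapping` around makes it easier to report predictions with the original labels later.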

In [5]:
X_train.head()

| | meanfreq | sd | median | Q25 | Q75 | IQR | skew | kurt | sp.ent | sfm | mode | centroid | meanfun | minfun | maxfun | meandom | mindom | maxdom | dfrange |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1326 | 0.183874 | 0.058597 | 0.200797 | 0.122709 | 0.229243 | 0.106534 | 1.022852 | 3.1476 | 0.921592 | 0.449218 | 0.225896 | 0.183874 | 0.122806 | 0.048 | 0.277457 | 1.119141 | 0.023438 | 4.617188 | 4.59375 |
| 478 | 0.170463 | 0.075548 | 0.169768 | 0.112567 | 0.24031 | 0.127743 | 1.301147 | 4.114986 | 0.963778 | 0.734822 | 0.242311 | 0.170463 | 0.129276 | 0.03125 | 0.266667 | 1.25 | 0.007812 | 7.0 | 6.992188 |
| 868 | 0.160308 | 0.058925 | 0.165 | 0.103667 | 0.197667 | 0.094 | 1.424574 | 4.94475 | 0.935771 | 0.534223 | 0.187 | 0.160308 | 0.087385 | 0.023952 | 0.202532 | 0.664062 | 0.085938 | 3.601562 | 3.515625 |
| 2539 | 0.208885 | 0.032823 | 0.209317 | 0.191743 | 0.229233 | 0.03749 | 2.478713 | 11.444726 | 0.863735 | 0.179466 | 0.206974 | 0.208885 | 0.179614 | 0.047198 | 0.27907 | 1.036659 | 0.023438 | 10.101562 | 10.078125 |
| 1049 | 0.192763 | 0.059136 | 0.2163 | 0.126 | 0.2408 | 0.1148 | 2.616276 | 11.28992 | 0.900464 | 0.392967 | 0.24045 | 0.192763 | 0.110624 | 0.047856 | 0.275862 | 1.017379 | 0.023438 | 9.1875 | 9.164062 |
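Some of the columns appear to be derived from others. The sketch below re-enters a few columns of the five rows shown above and checks three relationships: IQR = Q75 - Q25, dfrange = maxdom - mindom, and centroid = meanfreq. This only confirms the pattern for these five rows; whether it holds for the full dataset would need to be checked against `voice` itself.

```python
import numpy as np
import pandas as pd

# Selected columns of the five X_train.head() rows, copied from the output above
head = pd.DataFrame(
    {
        "meanfreq": [0.183874, 0.170463, 0.160308, 0.208885, 0.192763],
        "centroid": [0.183874, 0.170463, 0.160308, 0.208885, 0.192763],
        "Q25": [0.122709, 0.112567, 0.103667, 0.191743, 0.126],
        "Q75": [0.229243, 0.24031, 0.197667, 0.229233, 0.2408],
        "IQR": [0.106534, 0.127743, 0.094, 0.03749, 0.1148],
        "mindom": [0.023438, 0.007812, 0.085938, 0.023438, 0.023438],
        "maxdom": [4.617188, 7.0, 3.601562, 10.101562, 9.1875],
        "dfrange": [4.59375, 6.992188, 3.515625, 10.078125, 9.164062],
    },
    index=[1326, 478, 868, 2539, 1049],
)

# Tolerance of 1e-5 absorbs the rounding to six decimals in the printed table
iqr_ok = np.allclose(head["Q75"] - head["Q25"], head["IQR"], atol=1e-5)
range_ok = np.allclose(head["maxdom"] - head["mindom"], head["dfrange"], atol=1e-5)
centroid_ok = np.array_equal(head["centroid"].to_numpy(), head["meanfreq"].to_numpy())

print(iqr_ok, range_ok, centroid_ok)  # True True True
```

Redundant inputs like these matter for models that are sensitive to collinearity, such as unregularized linear models.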


In [6]:
X_train.describe()

| | meanfreq | sd | median | Q25 | Q75 | IQR | skew | kurt | sp.ent | sfm | mode | centroid | meanfun | minfun | maxfun | meandom | mindom | maxdom | dfrange |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2376.0 | 2376.0 | 2376.0 | 2376.0 | 2376.0 | 2376.0 | 2376.0 | 2376.0 | 2376.0 | 2376.0 | 2376.0 | 2376.0 | 2376.0 | 2376.0 | 2376.0 | 2376.0 | 2376.0 | 2376.0 | 2376.0 |
| mean | 0.181003 | 0.057017 | 0.185839 | 0.140734 | 0.224674 | 0.083939 | 3.135955 | 36.357424 | 0.895336 | 0.409211 | 0.165477 | 0.181003 | 0.143202 | 0.036818 | 0.259047 | 0.831274 | 0.05281 | 5.084506 | 5.031696 |
| std | 0.029407 | 0.016619 | 0.035595 | 0.048322 | 0.023019 | 0.042831 | 4.178764 | 133.747697 | 0.044909 | 0.177692 | 0.076873 | 0.029407 | 0.032207 | 0.018977 | 0.029578 | 0.522162 | 0.062812 | 3.518041 | 3.517817 |
| min | 0.048254 | 0.018363 | 0.010975 | 0.000235 | 0.058268 | 0.014558 | 0.141735 | 2.068455 | 0.747569 | 0.036876 | 0.0 | 0.048254 | 0.055565 | 0.009775 | 0.105263 | 0.007812 | 0.004883 | 0.007812 | 0.0 |
| 25% | 0.164173 | 0.041954 | 0.170274 | 0.112088 | 0.208891 | 0.042225 | 1.653858 | 5.705607 | 0.861865 | 0.258601 | 0.118937 | 0.164173 | 0.117175 | 0.01837 | 0.253968 | 0.419828 | 0.007812 | 2.318359 | 2.304688 |
| 50% | 0.184462 | 0.05922 | 0.189852 | 0.140771 | 0.225201 | 0.093211 | 2.202395 | 8.379475 | 0.902052 | 0.396176 | 0.186512 | 0.184462 | 0.141882 | 0.045662 | 0.271186 | 0.769927 | 0.023438 | 5.0 | 4.96875 |
| 75% | 0.198861 | 0.066914 | 0.210207 | 0.175626 | 0.242841 | 0.114034 | 2.945119 | 13.708137 | 0.929242 | 0.535338 | 0.220977 | 0.198861 | 0.169805 | 0.047904 | 0.277457 | 1.192142 | 0.070312 | 7.195312 | 7.154297 |
| max | 0.247041 | 0.114508 | 0.257417 | 0.242124 | 0.269852 | 0.24877 | 34.725453 | 1309.612887 | 0.981997 | 0.842936 | 0.28 | 0.247041 | 0.237636 | 0.204082 | 0.279114 | 2.957682 | 0.449219 | 21.84375 | 21.820312 |


In [7]:
y_train

1326    1
478     1
868     1
2539    0
1049    1
       ..
763     1
835     1
1653    0
2607    0
2732    0
Name: label, Length: 2376, dtype: int8

Many machine learning methods require data scaling. In this problem, we first scale the data with a standard scaler so that it can be used with **supervised learning models**.
<font color="red">Note: Use this scaled data for all classification models (i.e., Q21 – Q25).</font>

In [8]:
from sklearn.preprocessing import StandardScaler

standscaler = StandardScaler()
X_stand_scaled = standscaler.fit_transform(X)
X_stand_scaled

array([[-4.04924806,  0.4273553 , -4.22490077, ..., -0.70840431,
        -1.43142165, -1.41913712],
       [-3.84105325,  0.6116695 , -3.99929342, ..., -0.70840431,
        -1.41810716, -1.4058184 ],
       [-3.46306647,  1.60384791, -4.09585052, ..., -0.70840431,
        -1.42920257, -1.41691733],
       ...,
       [-1.29877326,  2.32272355, -0.05197279, ..., -0.70840431,
        -0.5992661 , -0.58671739],
       [-1.2452018 ,  2.012196  , -0.01772849, ..., -0.70840431,
        -0.41286326, -0.40025537],
       [-0.51474626,  2.14765111, -0.07087873, ..., -0.70840431,
        -1.27608595, -1.2637521 ]])
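For reference, standard scaling subtracts each column's mean and divides by its standard deviation, so every column ends up with mean 0 and unit variance. The sketch below checks this against a manual NumPy computation on synthetic data; `X_demo` and its chosen means and spreads are made-up values for illustration, not the voice features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative values only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[0.18, 5.0, 36.0], scale=[0.03, 3.5, 130.0], size=(200, 3))

scaled = StandardScaler().fit_transform(X_demo)

# Manual z-score: StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for .std()
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(scaled, manual))  # True
```

Note that `fit_transform` learns the means and standard deviations from whatever array it is given. In the cell above that is the full `X`, so the statistics include the rows that later serve as test data.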