In [1]:
import pandas as pd
df = pd.read_csv('voice.csv')

In [2]:
df.head()

Unnamed: 0,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,...,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx,label
0,0.059781,0.064241,0.032027,0.015071,0.090193,0.075122,12.863462,274.402906,0.893369,0.491918,...,0.059781,0.084279,0.015702,0.275862,0.007812,0.007812,0.007812,0.0,0.0,male
1,0.066009,0.06731,0.040229,0.019414,0.092666,0.073252,22.423285,634.613855,0.892193,0.513724,...,0.066009,0.107937,0.015826,0.25,0.009014,0.007812,0.054688,0.046875,0.052632,male
2,0.077316,0.083829,0.036718,0.008701,0.131908,0.123207,30.757155,1024.927705,0.846389,0.478905,...,0.077316,0.098706,0.015656,0.271186,0.00799,0.007812,0.015625,0.007812,0.046512,male
3,0.151228,0.072111,0.158011,0.096582,0.207955,0.111374,1.232831,4.177296,0.963322,0.727232,...,0.151228,0.088965,0.017798,0.25,0.201497,0.007812,0.5625,0.554688,0.247119,male
4,0.13512,0.079146,0.124656,0.07872,0.206045,0.127325,1.101174,4.333713,0.971955,0.783568,...,0.13512,0.106398,0.016931,0.266667,0.712812,0.007812,5.484375,5.476562,0.208274,male


Part 1: Data exploration

1. The variables in this dataset are acoustic measurements, mostly frequency statistics, used to identify a voice as male or female based on the human vocal range: a typical adult male has a fundamental frequency of 85 to 180 Hz, and a typical adult female 165 to 255 Hz.
2. The median frequency is the value found at the exact middle of all recorded frequencies for one voice, once they are sorted.
3. The output label states whether the voice is from a male or a female.
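
As a small illustration of points 1 and 2, the sketch below computes a median frequency and checks it against the typical ranges quoted above. The recorded values and the helper `plausible_labels` are hypothetical, the values are assumed to be in Hz, and none of this is taken from voice.csv.

```python
from statistics import median

# Typical adult fundamental-frequency ranges in Hz, as quoted above.
TYPICAL_RANGE_HZ = {"male": (85, 180), "female": (165, 255)}

def plausible_labels(f0_hz):
    """Return the labels whose typical range contains f0_hz (hypothetical helper)."""
    return [label for label, (low, high) in TYPICAL_RANGE_HZ.items()
            if low <= f0_hz <= high]

# Hypothetical frequencies (Hz) recorded for one voice, not taken from voice.csv.
recorded_hz = [110, 95, 130, 120, 105]
median_hz = median(recorded_hz)  # middle value after sorting

print(median_hz, plausible_labels(median_hz))
print(plausible_labels(170))  # 170 Hz lies in the overlap of both ranges
```

The two ranges overlap between 165 and 180 Hz, which is one reason a single frequency threshold is not enough and a trained classifier is used instead.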

Part 2: Data preparation

In [3]:
from sklearn.model_selection import KFold
frequency_folder = KFold(n_splits=10, shuffle=True)

In [4]:
import numpy as np
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 0], [3, 1], [4, 7], [9, 2], [5, 8], [0, 6]])
y = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])  # one label per row of X
for train_index, test_index in frequency_folder.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("X_train:", *X_train, ", X_test:", *X_test, sep = " ")
    print("y_train:", *y_train, ", y_test:", *y_test, sep = " ")

X_train: [1 2] [3 4] [5 6] [7 8] [9 0] [3 1] [4 7] [5 8] [0 6] , X_test: [9 2]
y_train: 0 1 2 3 4 5 6 8 9 , y_test: 7
X_train: [1 2] [3 4] [5 6] [7 8] [9 0] [3 1] [4 7] [9 2] [0 6] , X_test: [5 8]
y_train: 0 1 2 3 4 5 6 7 9 , y_test: 8
X_train: [1 2] [3 4] [5 6] [7 8] [3 1] [4 7] [9 2] [5 8] [0 6] , X_test: [9 0]
y_train: 0 1 2 3 5 6 7 8 9 , y_test: 4
X_train: [1 2] [3 4] [7 8] [9 0] [3 1] [4 7] [9 2] [5 8] [0 6] , X_test: [5 6]
y_train: 0 1 3 4 5 6 7 8 9 , y_test: 2
X_train: [1 2] [3 4] [5 6] [7 8] [9 0] [4 7] [9 2] [5 8] [0 6] , X_test: [3 1]
y_train: 0 1 2 3 4 6 7 8 9 , y_test: 5
X_train: [1 2] [3 4] [5 6] [7 8] [9 0] [3 1] [9 2] [5 8] [0 6] , X_test: [4 7]
y_train: 0 1 2 3 4 5 7 8 9 , y_test: 6
X_train: [3 4] [5 6] [7 8] [9 0] [3 1] [4 7] [9 2] [5 8] [0 6] , X_test: [1 2]
y_train: 1 2 3 4 5 6 7 8 9 , y_test: 0
X_train: [1 2] [3 4] [5 6] [9 0] [3 1] [4 7] [9 2] [5 8] [0 6] , X_test: [7 8]
y_train: 0 1 2 4 5 6 7 8 9 , y_test: 3
X_train: [1 2] [5 6] [7 8] [9 0] [3 1] [4 7] [9 2] [5 8]
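
With 10 samples and n_splits=10, each fold holds out exactly one sample for testing and trains on the other nine. The sketch below reproduces that index bookkeeping in plain Python. The function `kfold_indices` is a hypothetical, simplified stand-in for `KFold.split`, not scikit-learn's implementation, and its shuffle order will differ from the output above.

```python
import random

def kfold_indices(n_samples, n_splits, seed=0):
    """Yield (train, test) index lists; a simplified stand-in for KFold.split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # plays the role of shuffle=True
    # The first n_samples % n_splits folds get one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = sorted(indices[start:start + size])
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        yield train, test
        start += size

folds = list(kfold_indices(n_samples=10, n_splits=10))
for train, test in folds:
    print("train:", train, "test:", test)
```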

In [5]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn import neighbors

X_data = df.iloc[:,:-1]
y_data = df['label']

1. Logistic Regression

In [6]:
model_genderlogi = make_pipeline(LogisticRegression(solver='lbfgs'))

Logistic regression fits a sigmoid function to the given data; its output is the estimated probability that a voice is from a male (1) rather than a female (0).
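
A minimal sketch of the sigmoid step follows. The score `z` stands for the weighted sum of features that the model learns; the helper `predict_label` and the 0.5 threshold are assumptions for illustration, not values read from the fitted pipeline.

```python
import math

def sigmoid(z):
    """Map any real score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(z, threshold=0.5):
    """Hypothetical helper: 1 (male) if the probability reaches the threshold, else 0 (female)."""
    return 1 if sigmoid(z) >= threshold else 0

for score in (-2.0, 0.0, 2.0):
    print(score, round(sigmoid(score), 3), predict_label(score))
```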

2. Support Vector Machine classifier

In [7]:
model_gendersvm = make_pipeline(svm.SVC(gamma='scale'))

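With these settings, SVC does not fit a linear hard-margin classifier: scikit-learn's defaults are an RBF kernel and a soft margin (C=1.0), so some noise in the training data is tolerated. The hard-margin picture still conveys the idea: when the two classes are perfectly separable, the decision boundary is placed midway between the closest data points of the two classes.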
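
The "midway between the closest points" idea can be sketched in one dimension, where it reduces to simple arithmetic. The feature values below are hypothetical and assumed to be perfectly separable; this is an illustration of the maximum-margin idea, not what `svm.SVC` computes on voice.csv.

```python
# Hypothetical 1-D feature values (e.g. a frequency in Hz), not from voice.csv.
male_hz = [100.0, 120.0, 140.0]
female_hz = [200.0, 210.0, 230.0]

# This sketch assumes every male value lies below every female value.
closest_male = max(male_hz)
closest_female = min(female_hz)

boundary = (closest_male + closest_female) / 2  # midpoint between the closest points
margin = (closest_female - closest_male) / 2    # distance from the boundary to either point

def classify(value_hz):
    """Assign a label by which side of the boundary the value falls on."""
    return "male" if value_hz < boundary else "female"

print(boundary, margin, classify(150.0), classify(190.0))
```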

3. Decision Tree classifier

In [8]:
model_genderdecision = make_pipeline(DecisionTreeClassifier())

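With default settings, DecisionTreeClassifier keeps splitting the data until every leaf contains a single class. This can give perfect accuracy on the training data, but the tree may overfit, so the best accuracy on unseen data is not guaranteed.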
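
To make "dividing the data" concrete, the sketch below computes Gini impurity, the default split criterion of DecisionTreeClassifier, for a node and for a candidate split. The helper names and the four-sample example are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure node, 0.5 for an even two-class mix."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

def split_impurity(left, right):
    """Weighted impurity of a candidate split (lower is better)."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)

parent = ["male", "male", "female", "female"]
print(gini(parent))
print(split_impurity(["male", "male"], ["female", "female"]))
```

A split that sends each class to its own side drives the impurity to zero, at which point the tree stops splitting that branch.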

4. k-Nearest Neighbors classifier

In [9]:
model_genderknn = make_pipeline(neighbors.KNeighborsClassifier())

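KNN does not fit an explicit model. It stores the training data and classifies a given data point by majority vote among its k nearest neighbors (k = 5 by default in KNeighborsClassifier). It is often reasonably accurate, but because neighbors are found by distance, the result depends on k and on how the features are scaled.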
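
A minimal sketch of the neighbor vote, using Euclidean distance. The function `knn_predict` and the two-feature points are hypothetical and not taken from voice.csv; the example uses k=3 only to keep the toy data small.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Label query by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature points, not taken from voice.csv.
points = [(0.10, 0.20), (0.12, 0.22), (0.11, 0.19),
          (0.20, 0.40), (0.21, 0.42), (0.19, 0.41)]
labels = ["male", "male", "male", "female", "female", "female"]

print(knn_predict(points, labels, query=(0.11, 0.21), k=3))
print(knn_predict(points, labels, query=(0.20, 0.41), k=3))
```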