# Gender Classification by Voice with Machine Learning
In this notebook I build ML models with sklearn to predict gender from voice features. I created it to practice the classification algorithms I have learned. I am a beginner, so the notebook may contain errors. Please leave any mistakes you find or suggestions you have in the comments. Thank you!

# Contents
* [Importing Packages and Libraries](#Importing)
* [Loading and Viewing Dataset](#Loading-Dataset)
* [Data Preprocessing](#Data-Preprocessing)
* [Train test split](#Train-test-split)
* [Data Visualization](#Visualization)
* [Logistic Regression Classification](#LR)
* [K-Nearest Neighbor Classification](#KNN)
* [Support Vector Machine Classification](#SVM)
* [Naive Bayes Classification](#NBC)
* [Decision Tree Classification](#DTC)
* [Random Forest Classification](#RFC)

<a id="Importing"></a>
# Importing Packages and Libraries

We will use numpy for mathematical operations, pandas for reading and processing data, matplotlib and seaborn for data visualization, and sklearn to build our classification models.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id="Loading-Dataset"></a>
# Loading and Viewing Dataset

We load the dataset with the pandas library. Then we get general information about it and examine the first five rows.

In [None]:
data=pd.read_csv('../input/voicegender/voice.csv')

In [None]:
data.info()

In [None]:
data.head()

<a id="Data-Preprocessing"></a>
# Data Preprocessing

First, we use the isnull function to check whether the dataset contains any null values, because missing values would have to be handled before training the models.

In [None]:
data.isnull().values.any()

Most ML workflows need a numeric target, and our dependent variable, the "label" column, holds the string values "male" and "female". So we encode it first, replacing "male" with 0 and "female" with 1:

In [None]:
data.label=[0 if each=="male" else 1 for each in data.label]

Now let's determine the dependent (y) and independent (x) variables of the dataset. The features are min-max normalized so that every column lies in the range [0, 1]:

In [None]:
y=data.label
x_data=data.drop(["label"],axis=1)
# Column-wise min-max normalization
x=(x_data-x_data.min())/(x_data.max()-x_data.min())

<a id="Train-test-split"></a>
# Train Test Split

Before building our ML models, we split the data into training and test data as a final step:

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)
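
One caveat about the order of the steps: the min-max normalization in the preprocessing section uses the minimum and maximum of the whole dataset, so the test rows influence the scaling. A stricter alternative is to split first and fit sklearn's `MinMaxScaler` on the training part only. The following is a minimal sketch of that idea; the `features` and `target` arrays are hypothetical stand-ins, not the voice data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in data: 100 samples, 3 features, balanced binary target
rng = np.random.default_rng(42)
features = rng.normal(size=(100, 3))
target = np.array([0, 1] * 50)

# Split first; stratify keeps the class ratio the same in both parts
f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target
)

# Learn min and max from the training rows only, then apply them to both parts
scaler = MinMaxScaler()
f_train_scaled = scaler.fit_transform(f_train)
f_test_scaled = scaler.transform(f_test)

print(f_train_scaled.min(axis=0), f_train_scaled.max(axis=0))
```

With this approach the scaled test values can fall slightly outside [0, 1], which is expected: the test set is treated as unseen data.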

We are now ready to build machine learning models. Before doing so, however, we will visualize the data to understand it better.

<a id="Visualization"></a>
# Data Visualization

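Let's visualize the distribution of gender with a pie chart:

In [None]:
# Distribution of gender
counts=y.value_counts()
print(counts)
# Build the slice labels from the index so that each slice gets the right name
labels=["Male" if v in (0,"male") else "Female" for v in counts.index]
plt.pie(counts,labels=labels,colors=["pink","green"],autopct='%1.0f%%')
plt.title("Distribution of gender")
plt.show()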

We create a correlation heatmap to see how the variables relate to each other:

In [None]:
data_corr=data.corr()
plt.figure(figsize=(12,12))
sns.heatmap(data_corr,annot=True, fmt= '.2f')

Scatter plot of spectral entropy against the average fundamental frequency measured across the acoustic signal:

In [None]:
plt.figure(figsize=(8,5))
sns.scatterplot(data=data, x="meanfun", y="sp.ent",hue="label")
plt.title("Scatter of Spectral Entropy by Average of Fundamental Frequency")
plt.xlabel("Average of Fundamental Frequency")
plt.ylabel("Spectral Entropy")
plt.show()

Scatter plot of mean frequency (in kHz) against the average fundamental frequency measured across the acoustic signal:

In [None]:
plt.figure(figsize=(8,5))
sns.scatterplot(data=data, x="meanfun", y="meanfreq",hue="label")
plt.title("Scatter of Mean Frequency by Average of Fundamental Frequency")
plt.xlabel("Average of Fundamental Frequency")
plt.ylabel("Mean Frequency (in kHz)")
plt.show()

IQR by Gender (Stripplot):

*(IQR: interquartile range (in kHz))*

In [None]:
plt.figure(figsize=(7,5))
sns.stripplot(data=data, x="label", y="IQR",jitter=True)
plt.title("IQR by Gender (Stripplot)")
plt.xlabel("Gender")
plt.ylabel("IQR")
plt.show()

<a id="LR"></a>
# Logistic Regression Classification

* Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables determine an outcome.
* In this method the outcome takes one of two values (True-False, Yes-No, 1-0, Cat-Dog, Good-Bad ...).
* The purpose of logistic regression is to build the model that best describes the link between a set of independent variables and this binary outcome. Classification is then done with the fitted model.


You can examine the structure of this model in detail: [Gender Classification (Logistic Regression)](https://www.kaggle.com/ahmetozdemir1071/gender-classification-logistic-regression)
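
To make the link between the independent variables and the binary outcome concrete, here is a minimal sketch under stated assumptions: the toy arrays `X_toy` and `y_toy` are hypothetical, and the `sigmoid` helper is written here for illustration. It checks that, for a binary problem, the class-1 probability reported by sklearn matches the sigmoid of the linear score `w * x + b`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: one feature, class 1 tends to have larger values
X_toy = np.array([[0.1], [0.2], [0.3], [0.4], [0.6], [0.7], [0.8], [0.9]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X_toy, y_toy)

# Linear score for every sample, then squashed into a probability
z = X_toy @ model.coef_.T + model.intercept_
manual_proba = sigmoid(z).ravel()
sklearn_proba = model.predict_proba(X_toy)[:, 1]

print(np.allclose(manual_proba, sklearn_proba))
```

A sample is assigned to class 1 when this probability is above 0.5, which is the same as the linear score being positive.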

In [None]:
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression()
lr.fit(x_train,y_train)
print("Test Accuracy: {} %".format(lr.score(x_test,y_test)*100))

In [None]:
#Confusion matrix:
y_pred_lr=lr.predict(x_test)
y_true=y_test
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_true,y_pred_lr)  # sklearn expects (y_true, y_pred): rows are true labels
f,ax=plt.subplots(figsize=(3,3))
sns.heatmap(cm,annot=True,linecolor="blue",linewidth=0.3,fmt=".0f",ax=ax)
plt.xlabel("y_pred")
plt.ylabel("y_true")
plt.show()
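
Accuracy alone does not show which kind of mistake a model makes, and the confusion matrix does. As a small sketch, assuming hypothetical label arrays (`true_labels` and `pred_labels` are made up for illustration), precision and recall can be read off the four cells of the matrix and compared with sklearn's own metric functions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration (0 = male, 1 = female)
true_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
pred_labels = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# True labels come first: rows = true class, columns = predicted class
cm_demo = confusion_matrix(true_labels, pred_labels)
tn, fp, fn, tp = cm_demo.ravel()

precision = tp / (tp + fp)   # of the samples predicted as 1, how many are really 1
recall = tp / (tp + fn)      # of the samples that are really 1, how many were found
accuracy = (tp + tn) / cm_demo.sum()

print(cm_demo)
print(precision, recall, accuracy)
print(precision_score(true_labels, pred_labels), recall_score(true_labels, pred_labels))
```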

<a id="KNN"></a>
# K-Nearest Neighbor (KNN) Classification

* KNN (K-nearest neighbor) is one of the simplest machine learning algorithms.
* This algorithm memorizes the dataset rather than learning a model from it.
* To make a prediction, the nearest neighbors of the new sample are searched for in the entire training set.
* The value of k we choose determines how many neighbors are examined. For a sample to be classified, its distance to every training sample is calculated, the distances are sorted, and the sample is assigned to the class that is most common among its k nearest neighbors.
* The Euclidean distance is usually used to calculate the distance.
    Euclidean distance:

![eucl.PNG](attachment:eucl.PNG)
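
The steps above can be sketched from scratch in a few lines. This is only an illustration, not how sklearn implements KNN internally; the helper names `euclidean_distance` and `knn_predict` and the toy points are hypothetical.

```python
import numpy as np
from collections import Counter

def euclidean_distance(a, b):
    """d(a, b) = sqrt(sum_i (a_i - b_i)^2)"""
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = [euclidean_distance(p, query) for p in train_points]
    nearest = np.argsort(distances)[:k]   # indices of the k smallest distances
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two small clusters
train_points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                         [1.0, 1.0], [0.9, 0.8], [0.8, 0.9]])
train_labels = np.array([0, 0, 0, 1, 1, 1])

print(euclidean_distance([0, 0], [3, 4]))                      # 5.0
print(knn_predict(train_points, train_labels, [0.15, 0.1]))    # 0
print(knn_predict(train_points, train_labels, [0.85, 0.95]))   # 1
```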



In [None]:
from sklearn.neighbors import KNeighborsClassifier

scoreList=[]
#Let's find the optimal number k
for each in range(1,15):
    knn=KNeighborsClassifier(n_neighbors=each)
    knn.fit(x_train,y_train)
    scoreList.append(knn.score(x_test,y_test))
plt.figure(figsize=(7,5))
plt.plot(range(1,15),scoreList)
plt.xlabel("K Values")
plt.ylabel("Accuracy")
plt.show()

As the graph above shows, k = 9 gives the highest test accuracy.
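
Choosing k from the test accuracy means the test set takes part in tuning, so the reported accuracy can be a little optimistic. A common alternative is to pick k with cross-validation on the training data and to use the test set only once at the end. The sketch below assumes synthetic data from `make_classification` as a stand-in for the voice features; all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in for the voice features
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Mean 5-fold cross-validated accuracy on the training data for each k
cv_scores = {}
for k in range(1, 15):
    knn_demo = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(knn_demo, Xd_train, yd_train, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)

# The test set is touched only once, with the chosen k
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(Xd_train, yd_train)
test_accuracy = final_knn.score(Xd_test, yd_test)
print(best_k, test_accuracy)
```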

In [None]:
#Confusion matrix (k=9):
knn=KNeighborsClassifier(n_neighbors=9)
knn.fit(x_train,y_train)
print("Test Accuracy: {} %".format(knn.score(x_test,y_test)*100))
y_pred_knn=knn.predict(x_test)
y_true=y_test
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_true,y_pred_knn)  # sklearn expects (y_true, y_pred): rows are true labels
f,ax=plt.subplots(figsize=(3,3))
sns.heatmap(cm,annot=True,linecolor="blue",linewidth=0.3,fmt=".0f",ax=ax)
plt.xlabel("y_pred")
plt.ylabel("y_true")
plt.show()

<a id="SVM"></a>
# Support Vector Machine Classification

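* Support Vector Machine is a supervised learning algorithm based on statistical learning theory.
* It tries to find the optimal decision boundary (a line in two dimensions, a hyperplane in general) that separates the two labeled classes.
* The optimal boundary is the one that lies furthest from the nearest members of both classes, that is, the one with the largest margin.
* Maximizing the margin makes the model fairly robust to overfitting, although it can still overfit depending on the kernel and the regularization parameter C.
* It is easy to apply.
* Its accuracy is often high.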
![svm.PNG](attachment:svm.PNG)
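
The idea of the widest margin can be inspected directly when the kernel is linear. The sketch below is an illustration on hypothetical, linearly separable toy points; note that it uses `kernel="linear"` and a large `C` as assumptions for readability, whereas the notebook's model keeps sklearn's default kernel.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data
X_svm = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.5],
                  [4.0, 4.0], [4.5, 3.5], [5.0, 4.5]])
y_svm = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel makes the separating hyperplane w . x + b = 0 directly readable
svm_demo = SVC(kernel="linear", C=1000)
svm_demo.fit(X_svm, y_svm)

w = svm_demo.coef_[0]
b = svm_demo.intercept_[0]
margin_width = 2 / np.linalg.norm(w)   # distance between the two margin lines

print("support vectors:\n", svm_demo.support_vectors_)
print("margin width:", margin_width)

demo_predictions = svm_demo.predict([[1.2, 0.8], [4.8, 4.2]])
print(demo_predictions)
```

Only the support vectors, the points closest to the boundary, determine where the boundary lies; the remaining points could be moved further away without changing it.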

In [None]:
from sklearn.svm import SVC
svm=SVC(random_state=42)
svm.fit(x_train,y_train)
print("Test Accuracy: {} %".format(svm.score(x_test,y_test)*100))

In [None]:
#Confusion matrix:
y_pred_svm=svm.predict(x_test)
y_true=y_test
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_true,y_pred_svm)  # sklearn expects (y_true, y_pred): rows are true labels
f,ax=plt.subplots(figsize=(3,3))
sns.heatmap(cm,annot=True,linecolor="blue",linewidth=0.3,fmt=".0f",ax=ax)
plt.xlabel("y_pred")
plt.ylabel("y_true")
plt.show()

<a id="NBC"></a>
# Naive Bayes Classification

* Naive Bayes classification is based on Bayes' Theorem.
* Bayes' Theorem describes the relationship between the conditional and marginal probabilities of random events. (Wikipedia)
* The method is called "naive" because it assumes that the features are independent of each other given the class.
* In this algorithm, the probability of each class is calculated for an element, and the element is assigned to the class with the highest probability.
![nb,.PNG](attachment:nb,.PNG)

            P(A|B): The probability that event A occurs given that event B has occurred
            P(A): The probability that event A occurs
            P(B|A): The probability that event B occurs given that event A has occurred
            P(B): The probability that event B occurs
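
A short worked example may make the formula easier to follow. All numbers below are hypothetical and chosen only for illustration: A stands for "the speaker is female" and B for "the voice has a high pitch".

```python
# Hypothetical probabilities, chosen only to illustrate the formula
p_a = 0.5               # P(A): the speaker is female
p_b_given_a = 0.9       # P(B|A): high pitch given that the speaker is female
p_b_given_not_a = 0.1   # P(B|not A): high pitch given that the speaker is male

# The law of total probability gives P(B)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

print(round(p_b, 3), round(p_a_given_b, 3))   # 0.5 0.9
```

With these numbers, observing a high pitch raises the probability that the speaker is female from 0.5 to 0.9.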

In [None]:
from sklearn.naive_bayes import GaussianNB
nb=GaussianNB()
nb.fit(x_train,y_train)
print("Test Accuracy: {} %".format(nb.score(x_test,y_test)*100))

In [None]:
#Confusion matrix:
y_pred_nb=nb.predict(x_test)
y_true=y_test
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_true,y_pred_nb)  # sklearn expects (y_true, y_pred): rows are true labels
f,ax=plt.subplots(figsize=(3,3))
sns.heatmap(cm,annot=True,linecolor="blue",linewidth=0.3,fmt=".0f",ax=ax)
plt.xlabel("y_pred")
plt.ylabel("y_true")
plt.show()

<a id="DTC"></a>
# Decision Tree Classification

* Decision tree classification is a method that builds a model in the form of a tree structure, consisting of decision nodes and leaf nodes, from the features and the target.
* It can process both numerical and categorical data.
* It is easy to understand and interpret.
* This model can address multi-output problems.
* Overfitting may occur in this model.

![dt.PNG](attachment:dt.PNG)
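
Two of the points above, interpretability and overfitting, can be seen in a small experiment. This is a sketch on synthetic data from `make_classification` (an assumption, not the voice dataset), with hypothetical feature names `f0` to `f3`. It compares an unrestricted tree with one limited by `max_depth` and prints the smaller tree as readable rules.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical stand-in data
X_tree, y_tree = make_classification(n_samples=300, n_features=4, random_state=0)
Xt_train, Xt_test, yt_train, yt_test = train_test_split(
    X_tree, y_tree, test_size=0.2, random_state=0
)

# An unrestricted tree keeps splitting until it fits the training data
full_tree = DecisionTreeClassifier(random_state=0).fit(Xt_train, yt_train)
# Limiting the depth is one simple way to reduce overfitting
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(Xt_train, yt_train)

print("full tree  depth:", full_tree.get_depth(),
      "train acc:", full_tree.score(Xt_train, yt_train),
      "test acc:", full_tree.score(Xt_test, yt_test))
print("small tree depth:", small_tree.get_depth(),
      "train acc:", small_tree.score(Xt_train, yt_train),
      "test acc:", small_tree.score(Xt_test, yt_test))

# The decision nodes and leaves of the small tree as if/else rules
print(export_text(small_tree, feature_names=["f0", "f1", "f2", "f3"]))
```

A large gap between training and test accuracy for the unrestricted tree is the usual sign of overfitting.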

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt=DecisionTreeClassifier()
dt.fit(x_train,y_train)
print("Test Accuracy: {} %".format(dt.score(x_test,y_test)*100))

In [None]:
#Confusion matrix:
y_pred_dt=dt.predict(x_test)
y_true=y_test
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_true,y_pred_dt)  # sklearn expects (y_true, y_pred): rows are true labels
f,ax=plt.subplots(figsize=(3,3))
sns.heatmap(cm,annot=True,linecolor="blue",linewidth=0.3,fmt=".0f",ax=ax)
plt.xlabel("y_pred")
plt.ylabel("y_true")
plt.show()

<a id="RFC"></a>
# Random Forest Classification

* The basic structure of this model consists of decision trees.
* If we build n trees, each on a random sample of the dataset, we have in effect created a random forest model. A random forest is therefore a collection of decision trees, each built from a random sample of the data.
* The algorithm builds many trees and combines their results: the trees vote, and the class with the most support becomes the prediction.
* The biggest problem of decision trees is overfitting. Random forests suffer less from it because each tree is trained on a different sample of the data.

![random%20forest.png](attachment:random%20forest.png)
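
The voting step can be made visible by asking each fitted tree for its own prediction. The sketch below uses synthetic data from `make_classification` as an assumption; the variable names are hypothetical. Note that sklearn's `RandomForestClassifier` averages the class probabilities of its trees, which is a soft form of voting rather than a plain count of votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in data
X_rf, y_rf = make_classification(n_samples=300, n_features=5, random_state=1)

forest = RandomForestClassifier(n_estimators=10, random_state=42)
forest.fit(X_rf, y_rf)

# Ask every tree in the forest for its prediction on one sample
sample = X_rf[:1]
tree_votes = np.array([tree.predict(sample)[0] for tree in forest.estimators_])
print("votes per tree:", tree_votes.astype(int))

# Average the per-tree class probabilities, as the forest does
mean_proba = np.mean(
    [tree.predict_proba(sample)[0] for tree in forest.estimators_], axis=0
)
print("averaged probabilities:", mean_proba)
print("forest prediction:", forest.predict(sample)[0])
```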

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier(n_estimators=10,random_state=42)
rf.fit(x_train,y_train)
print("Test Accuracy = {} %".format(rf.score(x_test,y_test)*100))

In [None]:
#Confusion Matrix
y_pred_rf=rf.predict(x_test)
y_true=y_test
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_true,y_pred_rf)  # sklearn expects (y_true, y_pred): rows are true labels
f,ax=plt.subplots(figsize=(3,3))
sns.heatmap(cm,annot=True,linecolor="blue",linewidth=0.3,fmt=".0f",ax=ax)
plt.xlabel("y_pred")
plt.ylabel("y_true")
plt.show()

# Last Words
* Among the algorithms we applied, the Support Vector Machine reached the highest accuracy and Naive Bayes the lowest. However, this does not mean that these two algorithms are in general the best and the worst for our data; the results can change with a different train-test split or different hyperparameters.
* Some of the models we used can handle both classification and regression, but here we examined only classification.
* All the models we applied are supervised learning algorithms.

* Thanks [DATAI](https://www.kaggle.com/kanncaa1/notebooks)
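
As noted in the first point above, a single train-test split is not enough to rank algorithms reliably. One way to get a steadier comparison is k-fold cross-validation, which reports a mean accuracy and its spread for each model. The sketch below is only an illustration on synthetic data from `make_classification` (an assumption, not the voice dataset), with three of the models used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Hypothetical stand-in data, not the voice dataset
X_cmp, y_cmp = make_classification(n_samples=300, n_features=5, random_state=7)

candidates = {
    "Logistic Regression": LogisticRegression(),
    "SVM": SVC(random_state=42),
    "Naive Bayes": GaussianNB(),
}

# Mean and standard deviation of accuracy over 5 folds for each model
cv_results = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X_cmp, y_cmp, cv=5)
    cv_results[name] = (fold_scores.mean(), fold_scores.std())
    print("{}: {:.3f} +/- {:.3f}".format(name, *cv_results[name]))
```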

