# Introduction

A couple of days ago, I worked on my first data set: Titanic survivability prediction. With that, I officially dipped my toes into data science and Python. I will continue to dip my toes, and perhaps my feet, with this project. The features in this data set are all numeric, and the label column is binary. The feature columns describe various voice characteristics, and the label column indicates whether the voice is male or female. A quick peek tells me that the data is clean and usable with no null values, which is perfect for a newbie.

There are 3168 observations.

In [None]:
import matplotlib.pyplot as plt
from pandas import read_csv
import seaborn as sns
import pandas as pd
import numpy as np


vdf = read_csv('/kaggle/input/voicegender/voice.csv')
print(vdf.shape)
vdf.info()
#NO null data, all numeric except label

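The label contains male and female entries, which we encode as a numeric label as follows:

$$ y = \begin{cases} 0 &\mbox{if } \text{label = 'female'}\\ 
                     1 & \mbox{if } \text{label = 'male'}
        \end{cases}.$$

Note that pandas assigns category codes in sorted order, so 'female' gets 0 and 'male' gets 1.

Once the label is converted to a numeric code ($y$), we drop the label column from the dataframe and keep the remaining columns as features ($x$).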

In [None]:
print(vdf['label'].unique())
vdf["label"] = vdf["label"].astype('category')
y = vdf["label"].cat.codes #save label codes as the y variable

#drop label from dataframe
x = vdf.drop(['label'],axis=1)
features = x.columns.tolist() #save all the features
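
Since `cat.codes` decides which class becomes 0 and which becomes 1, it is worth printing the mapping once. Here is a minimal sketch on a toy series (the `labels` data below is made up for illustration and only stands in for `vdf["label"]`):

```python
import pandas as pd

# Toy labels standing in for vdf["label"] (hypothetical data)
labels = pd.Series(["male", "female", "female", "male"], dtype="category")

# Categories are stored in sorted order, and the codes index into them
mapping = dict(enumerate(labels.cat.categories))
codes = labels.cat.codes.tolist()

print(mapping)  # {0: 'female', 1: 'male'}
print(codes)    # [1, 0, 0, 1]
```

The same `dict(enumerate(...cat.categories))` call should work on the real label column to confirm which code belongs to which class.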

In [None]:
print(x[features].round(2).describe().transpose())

# Is data in standard format?

No, the data is not centered and scaled: the summary above shows non-zero means and standard deviations that differ from one. We can standardize each feature as follows:

In [None]:
feature_mean = x.mean()
feature_std = x.std()
#center and scale the data
x = (x - feature_mean)/feature_std
print(x[features].round(2).describe().transpose())
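
To convince myself that this really gives zero mean and unit standard deviation, here is a small sketch on a toy dataframe (the `toy` data is made up for illustration). One thing I am assuming matters little here: pandas `std()` uses the sample standard deviation (`ddof=1`), so the result can differ very slightly from scalers that use the population version.

```python
import numpy as np
import pandas as pd

# Toy features on very different scales (hypothetical data)
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                    "b": [10.0, 20.0, 30.0, 50.0]})

# Center and scale column by column; pandas std() defaults to ddof=1
z = (toy - toy.mean()) / toy.std()

means = z.mean().to_numpy()
stds = z.std().to_numpy()
print(means.round(6), stds.round(6))
```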

# Is there any linear dependency between columns?


Next, we check for linearly dependent columns. We can check the rank of the data matrix: if rank < num_features, then we have collinearity. Here, the rank is 17 < 20, so we remove highly correlated columns and, hopefully, end up with a full column rank matrix. We look at a heatmap of the absolute pairwise correlations and flag the pairs where the correlation is more than 0.9. The heatmap below shows that features dfrange, meanfun, kurt, Q25, and median have high correlation values and should be dropped from the dataframe.

As a result, we have 15 features and a full column rank matrix. I found the code to drop linearly dependent columns [here](https://chrisalbon.com/machine_learning/feature_selection/drop_highly_correlated_features/).

In [None]:
from numpy.linalg import matrix_rank
print(matrix_rank(x))
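
The rank check can be illustrated on a tiny example. This is only a sketch with random toy data (the arrays `a` and `m` are made up): adding a column that is an exact linear combination of two others adds a column but not rank.

```python
import numpy as np
from numpy.linalg import matrix_rank

rng = np.random.default_rng(0)
a = rng.normal(size=(50, 3))          # three independent columns

# Fourth column is a linear combination of the first two
dependent = a[:, 0] + 2 * a[:, 1]
m = np.column_stack([a, dependent])

rank_full = matrix_rank(a)
rank_def = matrix_rank(m)
print(rank_full, a.shape[1])  # rank equals number of columns
print(rank_def, m.shape[1])   # rank is smaller than number of columns
```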

In [None]:
max_corr = 0.9 #largest acceptable correlation value
corr_matrix = x.corr().abs() #get absolute values for correlation
#work with upper triangular matrix, corr_matrix is symmetric
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)) #np.bool was removed in recent NumPy, use the builtin bool
sns.heatmap(upper>max_corr); #check for high collinearity

In [None]:
#drop columns/features
to_drop = [column for column in upper.columns if any(upper[column] > max_corr)]
x.drop(to_drop, axis=1, inplace=True)
print('Drop features: ', to_drop)
print('Rank: ', matrix_rank(x), '\nShape: ', x.shape)
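
To see why only the upper triangle is used, here is a sketch of the same filter on toy data (the columns `f1`, `f2`, `f3` are made up for illustration). For each highly correlated pair, only the later column is flagged, so one member of the pair always survives.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "f1": base,
    "f2": base + 0.01 * rng.normal(size=200),  # near-copy of f1
    "f3": rng.normal(size=200),                # unrelated feature
})

corr = toy.corr().abs()
# Keep entries strictly above the diagonal; everything else becomes NaN
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is dropped if it correlates strongly with any earlier column
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
print(to_drop)
```

Note that this is a heuristic: which column of a pair gets dropped depends only on column order, not on which one is more useful.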

In [None]:
#check the new correlation matrix
corr_matrix = x.corr().abs();
sns.heatmap(corr_matrix);

# Is data balanced?

Yes, observations have 50/50 split across male and female labels.

In [None]:
sns.countplot(x=y); #equal counts of male and female data
plt.xticks(np.arange(2), ('Female','Male')); #cat.codes sorts categories: 0 = female, 1 = male
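
The same check can be done numerically instead of reading it off a plot. A minimal sketch on a toy label vector (`y_toy` is made up and only stands in for `y`):

```python
import pandas as pd

# Toy binary labels (hypothetical data)
y_toy = pd.Series([0, 1, 1, 0, 1, 0])

# Fraction of observations per class, ordered by class code
share = y_toy.value_counts(normalize=True).sort_index()
print(share)
```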

# Cross-validation and Model training

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
#split into training and testing data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,shuffle=True)
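
One variation I could have used: passing `stratify` keeps the class ratio the same in both splits, and `random_state` makes the split repeatable. This is a sketch on toy arrays (`X_toy` and `y_toy` are made up), not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features, perfectly balanced labels
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

print(X_tr.shape, X_te.shape)
print(int(y_tr.sum()), int(y_te.sum()))  # class-1 counts in each split
```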

# **SVM**


In [None]:
from sklearn.svm import SVC
#create classifier objects.
svm = SVC(kernel='linear')
#fit the model
svm.fit(x_train,y_train)
#perform cross validation
scores = cross_val_score(svm,x,y)#get cross validation score
#do prediction
y_pred = svm.predict(x_test)
print("SVM cross-validation accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))
print("SVM prediction accuracy: %0.2f" % accuracy_score(y_test, y_pred))
#check confusion matrix
confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sns.heatmap(confusion_matrix, annot=True, cmap="Greens");
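
As a sanity check on how to read the confusion matrix, here is a small sketch with made-up predictions (`y_true` and `y_hat` are toy data). The diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the total.

```python
import numpy as np
import pandas as pd

# Toy labels and predictions (hypothetical data)
y_true = pd.Series([0, 0, 1, 1, 1])
y_hat = np.array([0, 1, 1, 1, 0])

table = pd.crosstab(y_true, y_hat, rownames=['Actual'], colnames=['Predicted'])

# Correct predictions sit on the diagonal
accuracy = np.trace(table.values) / table.values.sum()
print(table)
print(accuracy)
```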

# **Logistic Regression**

In [None]:
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(max_iter=200)
LR.fit(x_train, y_train)
scores = cross_val_score(LR,x,y)
y_pred = LR.predict(x_test) #predict with the logistic regression model, not the SVM
print("LR cross-validation accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))
print("LR prediction accuracy: %0.2f" % accuracy_score(y_test, y_pred))
confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sns.heatmap(confusion_matrix, annot=True, cmap="Greens");

# **Random Forest**

In [None]:
from sklearn.ensemble import RandomForestClassifier
RF = RandomForestClassifier(max_depth=10)
RF.fit(x_train, y_train)
y_pred = RF.predict(x_test)
scores = cross_val_score(RF,x,y)
print("RF cross-validation accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))
print("RF prediction accuracy: %0.2f" % accuracy_score(y_test, y_pred))
confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sns.heatmap(confusion_matrix, annot=True, cmap="Greens");

# **Neural Network**

In [None]:
from sklearn.neural_network import MLPClassifier

NN = MLPClassifier(random_state = 100,max_iter=500)
NN.fit(x_train, y_train);
scores = cross_val_score(NN,x,y)
y_pred = NN.predict(x_test)
print("NN cross-validation accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))
print("NN prediction accuracy: %0.2f" % accuracy_score(y_test, y_pred))
confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sns.heatmap(confusion_matrix, annot=True, cmap="Greens");
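
The four model cells repeat the same steps, so they could be folded into one loop. Below is a sketch of that idea on synthetic data: `make_classification` only stands in for the voice features, and `cv=5`, the sample size, and the `random_state` values are my own assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the voice data (assumption for illustration)
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "SVM": SVC(kernel="linear"),
    "LR": LogisticRegression(max_iter=200),
    "RF": RandomForestClassifier(max_depth=10, random_state=0),
    "NN": MLPClassifier(random_state=100, max_iter=500),
}

results = {}
for name, model in models.items():
    # One accuracy score per fold; summarize by mean and spread
    scores = cross_val_score(model, X_syn, y_syn, cv=5)
    results[name] = (scores.mean(), scores.std())
    print("%s CV accuracy: %0.2f (+/- %0.2f)" % (name, *results[name]))
```

Keeping the mean and standard deviation on the same scale makes the "+/-" part directly comparable to the accuracy it follows.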