### Domain – Media
### Focus – Optimize selection process
### Business challenge/requirement

#### Motion Studios is the largest radio production house in Europe, with total revenue of more than 1 billion dollars. The company has launched a new reality show, "The Star RJ", to find a new radio jockey who will be the star presenter on upcoming shows.
#### In the first round, participants upload a voice clip online, and experts evaluate the clip for selection into the next round. Separate teams evaluate male and female voices in the first round.

#### The response to the show is unprecedented, and the company is flooded with voice clips.
#### As an ML expert, you have to classify each voice as male or female so that the first level of filtration is quicker.

In [1]:
import pandas as pd

In [2]:
dj_df = pd.read_csv("voice-classification.csv")
dj_df

Unnamed: 0,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,...,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx,label
0,0.059781,0.064241,0.032027,0.015071,0.090193,0.075122,12.863462,274.402906,0.893369,0.491918,...,0.059781,0.084279,0.015702,0.275862,0.007812,0.007812,0.007812,0.000000,0.000000,male
1,0.066009,0.067310,0.040229,0.019414,0.092666,0.073252,22.423285,634.613855,0.892193,0.513724,...,0.066009,0.107937,0.015826,0.250000,0.009014,0.007812,0.054688,0.046875,0.052632,male
2,0.077316,0.083829,0.036718,0.008701,0.131908,0.123207,30.757155,1024.927705,0.846389,0.478905,...,0.077316,0.098706,0.015656,0.271186,0.007990,0.007812,0.015625,0.007812,0.046512,male
3,0.151228,0.072111,0.158011,0.096582,0.207955,0.111374,1.232831,4.177296,0.963322,0.727232,...,0.151228,0.088965,0.017798,0.250000,0.201497,0.007812,0.562500,0.554688,0.247119,male
4,0.135120,0.079146,0.124656,0.078720,0.206045,0.127325,1.101174,4.333713,0.971955,0.783568,...,0.135120,0.106398,0.016931,0.266667,0.712812,0.007812,5.484375,5.476562,0.208274,male
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3163,0.131884,0.084734,0.153707,0.049285,0.201144,0.151859,1.762129,6.630383,0.962934,0.763182,...,0.131884,0.182790,0.083770,0.262295,0.832899,0.007812,4.210938,4.203125,0.161929,female
3164,0.116221,0.089221,0.076758,0.042718,0.204911,0.162193,0.693730,2.503954,0.960716,0.709570,...,0.116221,0.188980,0.034409,0.275862,0.909856,0.039062,3.679688,3.640625,0.277897,female
3165,0.142056,0.095798,0.183731,0.033424,0.224360,0.190936,1.876502,6.604509,0.946854,0.654196,...,0.142056,0.209918,0.039506,0.275862,0.494271,0.007812,2.937500,2.929688,0.194759,female
3166,0.143659,0.090628,0.184976,0.043508,0.219943,0.176435,1.591065,5.388298,0.950436,0.675470,...,0.143659,0.172375,0.034483,0.250000,0.791360,0.007812,3.593750,3.585938,0.311002,female


In [3]:
dj_df.label.value_counts() # The data has equal counts for the male and female labels

male      1584
female    1584
Name: label, dtype: int64

In [9]:
label = dj_df.label.map({"male":1, "female":0})
features = dj_df.drop('label', axis = 1)


# Standardize the features to zero mean and unit variance

In [12]:
# Standardize each feature to zero mean and unit variance (StandardScaler does not bound values to [-1, 1])
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(features)
features = scaler.transform(features)
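The following is a minimal sketch of what `StandardScaler` does, using a small made-up array (`toy` is hypothetical and not part of the voice dataset). It illustrates that each column ends up with mean 0 and standard deviation 1, and that individual values can still fall outside the range -1 to 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0],
                [10.0, 1000.0]])

toy_scaled = StandardScaler().fit_transform(toy)

col_means = toy_scaled.mean(axis=0)   # approximately 0 for every column
col_stds = toy_scaled.std(axis=0)     # approximately 1 for every column
max_abs = np.abs(toy_scaled).max()    # largest magnitude after scaling

print("means:", col_means)
print("stds:", col_stds)
print("largest absolute value:", max_abs)
```

If values strictly bounded to a fixed interval are needed, `MinMaxScaler(feature_range=(-1, 1))` from the same module is one option.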

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size = 0.3, random_state = 5)
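In the cells above, the scaler is fitted on all rows before the split, so statistics from the test rows influence the scaling. A common alternative, sketched here on synthetic data, is to split first and wrap the scaler and the model in a pipeline so the scaler only sees training rows. The names `X_demo`, `y_demo`, and `pipe` are illustrative assumptions, and the synthetic data stands in for the voice features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the voice features (assumption, not the real data)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=5)

# Split first; stratify keeps the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=5, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only,
# then applies the same transformation to the test rows when scoring
pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_tr, y_tr)

demo_accuracy = pipe.score(X_te, y_te)
print("held-out accuracy on synthetic data:", demo_accuracy)
```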

### LogisticRegression

In [14]:
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

logReg = LogisticRegression(solver = 'lbfgs', max_iter=1000)
logReg.fit(X_train, y_train)
y_pred = logReg.predict(X_test)

print("Logistic Regression Accuracy Score: ")
accuracy_score(y_test, y_pred)  # signature is (y_true, y_pred)

0.9716088328075709

### GaussianNB

In [15]:
from sklearn.naive_bayes import GaussianNB 

# GaussianNB model
gauss_model = GaussianNB()
gauss_model.fit(X_train, y_train)
y_pred = gauss_model.predict(X_test)

print("GaussianNB Accuracy Score: ")
accuracy_score(y_test, y_pred)  # signature is (y_true, y_pred)

0.8927444794952681

### Train a Model
### It's time to train a Support Vector Machine classifier.

### Create an SVC() model from sklearn and fit it to the training data.

In [17]:
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.metrics import classification_report,confusion_matrix

In [25]:
# All default hyperparameters
svc_model = SVC()
svc_model.fit(X_train, y_train)
y_pred = svc_model.predict(X_test)

print("SVC Accuracy Score: ", accuracy_score(y_test, y_pred))


print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

SVC Accuracy Score:  0.982124079915878
[[463  10]
 [  7 471]]
              precision    recall  f1-score   support

           0       0.99      0.98      0.98       473
           1       0.98      0.99      0.98       478

    accuracy                           0.98       951
   macro avg       0.98      0.98      0.98       951
weighted avg       0.98      0.98      0.98       951
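As a cross-check, the figures in the report can be recomputed by hand from the confusion matrix printed above. This sketch assumes scikit-learn's convention that rows are true labels and columns are predicted labels, with class 0 (female) first.

```python
import numpy as np

# Confusion matrix copied from the SVC output above
# rows = true class, columns = predicted class
cm = np.array([[463, 10],
               [7, 471]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / actually that class

print("accuracy:", accuracy)
print("precision per class:", precision.round(2))
print("recall per class:", recall.round(2))
```

The accuracy works out to 934 / 951, which matches the printed SVC accuracy score.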



## Gridsearch
1. Grid search is a hyperparameter-tuning procedure: it evaluates a model at every combination of the candidate values supplied and keeps the combination that scores best. A model's performance can depend heavily on its hyperparameters, and the best values cannot be known in advance, so in practice we try a set of candidate values and compare the results. Doing this by hand takes considerable time and effort, so we use GridSearchCV to automate the search; it scores each combination with cross-validation.
    
2. We pass predefined values for the hyperparameters to GridSearchCV. We do this by defining a dictionary that maps each hyperparameter name to the list of values to try. For example:
 { 'C': [0.1, 1, 10, 100, 1000],
   'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
   'kernel': ['rbf', 'linear', 'sigmoid'] }
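To make the size of such a search concrete, the sketch below enumerates the example dictionary with `ParameterGrid`. The variable names are illustrative, and the fold count of 5 is an assumption taken from GridSearchCV's default `cv` setting.

```python
from sklearn.model_selection import ParameterGrid

# The example grid from the text: 5 values of C, 5 of gamma, 3 kernels
example_grid = {
    'C': [0.1, 1, 10, 100, 1000],
    'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
    'kernel': ['rbf', 'linear', 'sigmoid'],
}

# ParameterGrid yields one dict per combination of values
candidates = list(ParameterGrid(example_grid))
n_candidates = len(candidates)   # 5 * 5 * 3 combinations

n_folds = 5                      # assumed: the default number of CV folds
n_fits = n_candidates * n_folds  # model fits needed, excluding the final refit

print(n_candidates, "candidates,", n_fits, "fits")
print("first candidate:", candidates[0])
```

The number of fits grows multiplicatively with every hyperparameter added, which is why the grid is usually kept small.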
   

In [28]:
from sklearn.model_selection import GridSearchCV

In [29]:
param_grid = {'C': [0.1,1, 10, 100], 'gamma': [1,0.1,0.01,0.001]} 
grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=2)
grid.fit(X_train,y_train)

Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV] END .....................................C=0.1, gamma=1; total time=   0.1s
[CV] END .....................................C=0.1, gamma=1; total time=   0.1s
[CV] END .....................................C=0.1, gamma=1; total time=   0.1s
[CV] END .....................................C=0.1, gamma=1; total time=   0.1s
[CV] END .....................................C=0.1, gamma=1; total time=   0.1s
[CV] END ...................................C=0.1, gamma=0.1; total time=   0.0s
[CV] END ...................................C=0.1, gamma=0.1; total time=   0.0s
[CV] END ...................................C=0.1, gamma=0.1; total time=   0.0s
[CV] END ...................................C=0.1, gamma=0.1; total time=   0.0s
[CV] END ...................................C=0.1, gamma=0.1; total time=   0.0s
[CV] END ..................................C=0.1, gamma=0.01; total time=   0.0s
[CV] END ..................................C=0.1

GridSearchCV(estimator=SVC(),
             param_grid={'C': [0.1, 1, 10, 100],
                         'gamma': [1, 0.1, 0.01, 0.001]},
             verbose=2)
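After fitting, the selected combination and its cross-validated score can be read from the search object. The sketch below shows this on synthetic data with the same grid as above; `X_demo`, `y_demo`, and `demo_grid` are illustrative names, and the values it prints are not the ones obtained on the voice dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption, not the voice dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

demo_grid = GridSearchCV(
    SVC(),
    {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001]},
)
demo_grid.fit(X_demo, y_demo)

# Best combination found and its mean cross-validated accuracy
print("best parameters:", demo_grid.best_params_)
print("best CV score:", demo_grid.best_score_)

# cv_results_ holds one entry per candidate combination
n_tried = len(demo_grid.cv_results_['params'])
print("combinations tried:", n_tried)
```

Because `refit=True` is the default, `demo_grid.predict` uses a model refitted on all the training data with the best parameters.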

In [30]:
grid_predictions = grid.predict(X_test)

In [31]:
print('Accuracy Score:')
print(metrics.accuracy_score(y_test,grid_predictions))

Accuracy Score:
0.9894847528916929


In [32]:
print(confusion_matrix(y_test,grid_predictions))

[[469   4]
 [  6 472]]


In [33]:
print(classification_report(y_test,grid_predictions))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       473
           1       0.99      0.99      0.99       478

    accuracy                           0.99       951
   macro avg       0.99      0.99      0.99       951
weighted avg       0.99      0.99      0.99       951

