# Discriminating Sonar Signal
- The task is to train a classifier that discriminates between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock.
- Only Decision Tree, Random Forest and Support Vector classifiers are to be used.
- No data cleansing or feature selection is required, as the focus is on model building and optimization.

## Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Importing Dataset

In [None]:
dataset = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data", header=None)

## Separating & Pre-processing the independent & dependent variables

In [None]:
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,-1].values

In [None]:
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
y = label.fit_transform(y)

## Splitting the dataset into Train and Test set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 42)

## Function for accuracy analysis

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import cross_val_score

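def performance_metrics(y_test,y_pred,classifier):
    """Print test-set accuracy, confusion matrix and F1 score, plus 5-fold CV accuracy on the train set."""
    print("Model accuracy = {:.2f} %".format(accuracy_score(y_test,y_pred)*100))
    print("\nConfusion Matrix \n", confusion_matrix(y_test,y_pred))
    # f1_score treats label 1 as the positive class by default
    print("\nF1 score = {:.2f}".format(f1_score(y_test,y_pred)))
    # cross_val_score clones the classifier and refits it on each fold of the global X_train, y_train
    accuracies = cross_val_score(classifier,X_train,y_train,cv = 5, n_jobs = -1)
    print("\nMean accuracy for 5-fold cross validation on train set = {:.2f} %".format(accuracies.mean()*100))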
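
As a sanity check on what this function prints, accuracy and F1 can be recomputed by hand from a 2x2 confusion matrix. The sketch below uses the matrix reported for the default SVC in this notebook, `[[21, 5], [2, 14]]`, and assumes scikit-learn's layout: rows are true labels, columns are predicted labels, and label 1 is the positive class. The helper name `metrics_from_confusion` is hypothetical and not part of the notebook.

```python
def metrics_from_confusion(cm):
    """Return (accuracy, f1) for a 2x2 confusion matrix.

    Assumes rows are true labels and columns are predicted labels,
    with label 1 treated as the positive class.
    """
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


accuracy, f1 = metrics_from_confusion([[21, 5], [2, 14]])
print("Accuracy = {:.2f} %".format(accuracy * 100))
print("F1 score = {:.2f}".format(f1))
```

These values agree with the 83.33 % accuracy and 0.80 F1 score printed for the default SVC.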

## Function for hyperparameter tuning

In [None]:
from sklearn.model_selection import GridSearchCV
def grid_search_function(classifier,parameters):
    grid_search = GridSearchCV(estimator = classifier, param_grid = parameters, scoring = 'accuracy', cv = 5, n_jobs = -1)
    grid_search.fit(X_train, y_train)
    best_accuracy = grid_search.best_score_
    best_parameters = grid_search.best_params_
    print("Best Accuracy: {:.2f} %".format(best_accuracy*100))
    print("Best Parameters:", best_parameters)
    return best_parameters
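
The function above returns only the best parameters, so each model is rebuilt by hand afterwards. As a possible alternative, `GridSearchCV` refits the winning configuration on the full training data by default (`refit=True`) and exposes it as `best_estimator_`. The sketch below illustrates this on synthetic data; `X_demo`, `y_demo`, `demo_search` and the small grid are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the sonar set).
X_demo, y_demo = make_classification(n_samples=100, n_features=10, random_state=42)

demo_search = GridSearchCV(
    estimator=SVC(random_state=42),
    param_grid=[{"C": [0.5, 1], "kernel": ["linear"]},
                {"C": [0.5, 1], "kernel": ["rbf"], "gamma": [0.1, 0.5]}],
    scoring="accuracy",
    cv=5,
)
demo_search.fit(X_demo, y_demo)

# With refit=True (the default), best_estimator_ is already trained on all
# of X_demo using the winning parameters.
best_model = demo_search.best_estimator_
print(demo_search.best_params_)
print(best_model.predict(X_demo[:5]))
```

Using `best_estimator_` avoids looking up individual keys in `best_params_`, which matters when the grid is a list of dictionaries with different keys.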

# Classification Models

Shuffling and splitting the data inside these classifiers introduces some randomness into the results. To keep the results consistent whenever this code is run, we specify a 'random_state' seed.
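
The effect of a fixed seed can be seen with the `train_test_split` call already used in this notebook. This is a minimal sketch on toy data (`demo` is an assumption, not the sonar set): two splits with the same `random_state` are identical, while an unseeded split is reshuffled on every call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

demo = np.arange(20)  # toy data, not the sonar set

# Same seed: the two splits are identical.
_, test_a = train_test_split(demo, test_size=0.2, random_state=42)
_, test_b = train_test_split(demo, test_size=0.2, random_state=42)
print(np.array_equal(test_a, test_b))

# No seed: each call reshuffles, so the split usually differs between runs.
_, test_c = train_test_split(demo, test_size=0.2)
print(test_c)
```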

## Support Vector Classifier

In [None]:
from sklearn.svm import SVC
svm_classifier = SVC(random_state=42)
svm_classifier.fit(X_train,y_train)

SVC(random_state=42)

### Performance of the classifier with default hyper-parameters

In [None]:
y_pred = svm_classifier.predict(X_test)
# print(np.concatenate((y_pred.reshape(len(y_pred),1),y_test.reshape(len(y_test),1)),1))

In [None]:
performance_metrics(y_test,y_pred,svm_classifier)

Model accuracy = 83.33 %

Confusion Matrix 
 [[21  5]
 [ 2 14]]

F1 score = 0.80

Mean accuracy for 5-fold cross validation on train set = 74.12 %


### Hyper-parameter tuning
Several hyper-parameters can be tuned for an SVM model. Common ones are **C**, the regularization parameter; the type of **kernel**; the kernel coefficient **gamma** _(for the 'rbf' or 'poly' kernel);_ and the **degree** of the polynomial kernel _(for the 'poly' kernel only)_.

In [None]:
parameters = [{'C': [0.25, 0.5, 0.75, 1], 'kernel': ['linear']},
              {'C': [0.25, 0.5, 0.75, 1], 'kernel': ['rbf'], 'gamma': [0.1, 0.3, 0.5, 0.7, 0.9]},
              {'C': [0.25, 0.5, 0.75, 1], 'kernel': ['poly'], 'gamma': [0.1, 0.3, 0.5, 0.7, 0.9], 'degree': [2, 3, 4, 5]}]
best_parameter = grid_search_function(svm_classifier,parameters)

Best Accuracy: 85.53 %
Best Parameters: {'C': 1, 'degree': 2, 'gamma': 0.7, 'kernel': 'poly'}
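
To get a feel for the cost of this search, the number of candidate settings can be counted with scikit-learn's `ParameterGrid`, which expands a list of dictionaries the same way `GridSearchCV` does. The sketch below re-declares the SVM grid from above under the hypothetical name `svm_grid`.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.25, 0.5, 0.75, 1]
gammas = [0.1, 0.3, 0.5, 0.7, 0.9]
svm_grid = [
    {"C": C_values, "kernel": ["linear"]},
    {"C": C_values, "kernel": ["rbf"], "gamma": gammas},
    {"C": C_values, "kernel": ["poly"], "gamma": gammas, "degree": [2, 3, 4, 5]},
]

# Each dictionary is expanded separately: 4 + 4*5 + 4*5*4 candidates.
n_candidates = len(ParameterGrid(svm_grid))
n_fits = n_candidates * 5  # cv = 5 folds per candidate
print(n_candidates, n_fits)  # 104 520
```

Because each dictionary is expanded on its own, a winning 'linear' or 'rbf' candidate would come back without a 'degree' entry in the best parameters.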


In [None]:
# Unpack the tuned parameters; this also works when the best kernel has no 'gamma' or 'degree' entry
svm_best_classifier = SVC(**best_parameter, random_state=42)
svm_best_classifier.fit(X_train,y_train)
y_pred = svm_best_classifier.predict(X_test)
performance_metrics(y_test,y_pred,svm_best_classifier)

Model accuracy = 85.71 %

Confusion Matrix 
 [[22  4]
 [ 2 14]]

F1 score = 0.82

Mean accuracy for 5-fold cross validation on train set = 85.53 %


## Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train,y_train)

DecisionTreeClassifier(random_state=42)

### Performance of the classifier with default hyper-parameters

In [None]:
y_pred = dt_classifier.predict(X_test)
# print(np.concatenate((y_pred.reshape(len(y_pred),1),y_test.reshape(len(y_test),1)),1))

In [None]:
performance_metrics(y_test,y_pred,dt_classifier)

Model accuracy = 71.43 %

Confusion Matrix 
 [[19  7]
 [ 5 11]]

F1 score = 0.65

Mean accuracy for 5-fold cross validation on train set = 74.69 %


### Hyper-parameter tuning
Several hyper-parameters can be tuned for a Decision Tree model. Common ones are the **criterion** used to measure the quality of a split; the **splitter** strategy used to choose the split at each node; **max_features**, the number of features considered when looking for the best split; and **min_samples_split**, the minimum number of samples an internal node must contain before it can be split.
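
The `max_features` setting accepts strings and fractions, which scikit-learn resolves to an integer feature count at fit time and stores in the fitted tree's `max_features_` attribute. The sketch below checks this on random stand-in data with 60 features, the width of the sonar data; `X_demo`, `y_demo` and `resolved` are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Random stand-in data with 60 features (same width as the sonar data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 60))
y_demo = np.arange(40) % 2  # alternating labels so both classes are present

resolved = {}
for setting in ["sqrt", "log2", 0.1]:
    tree = DecisionTreeClassifier(max_features=setting, random_state=42)
    tree.fit(X_demo, y_demo)
    resolved[setting] = tree.max_features_  # inferred integer feature count

print(resolved)
```

With 60 features, 'sqrt' resolves to 7 features per split, 'log2' to 5, and the fraction 0.1 to 6.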

In [None]:
parameters = [{'criterion': ['gini', 'entropy'], 'splitter': ['best', 'random'],
               'max_features': ['sqrt', 'log2', 0.1, 0.2, 0.3], 'min_samples_split': [2, 3, 4, 5]}]
best_parameter  = grid_search_function(dt_classifier,parameters)

Best Accuracy: 77.79 %
Best Parameters: {'criterion': 'entropy', 'max_features': 0.1, 'min_samples_split': 5, 'splitter': 'random'}


In [None]:
dt_best_classifier = DecisionTreeClassifier(criterion=best_parameter['criterion'],
                                            splitter=best_parameter['splitter'],
                                            max_features=best_parameter['max_features'],
                                            min_samples_split=best_parameter['min_samples_split'],
                                            random_state=42)
dt_best_classifier.fit(X_train,y_train)
y_pred = dt_best_classifier.predict(X_test)
performance_metrics(y_test,y_pred,dt_best_classifier)

Model accuracy = 66.67 %

Confusion Matrix 
 [[15 11]
 [ 3 13]]

F1 score = 0.65

Mean accuracy for 5-fold cross validation on train set = 77.79 %


## Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train,y_train)

RandomForestClassifier(random_state=42)

### Performance of the classifier with default hyper-parameters

In [None]:
y_pred = rf_classifier.predict(X_test)
# print(np.concatenate((y_pred.reshape(len(y_pred),1),y_test.reshape(len(y_test),1)),1))

In [None]:
performance_metrics(y_test,y_pred,rf_classifier)

Model accuracy = 85.71 %

Confusion Matrix 
 [[22  4]
 [ 2 14]]

F1 score = 0.82

Mean accuracy for 5-fold cross validation on train set = 78.32 %


### Hyper-parameter tuning
Several hyper-parameters can be tuned for a Random Forest model. Common ones are **n_estimators**, the number of randomized decision trees that make up the forest; the **criterion** used to measure the quality of a split; **max_features**, the number of features considered when looking for the best split; and **min_samples_split**, the minimum number of samples an internal node must contain before it can be split.
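
To illustrate what `n_estimators` controls, the sketch below fits a small forest on synthetic data and shows that its predicted class probabilities are the average of the probabilities from its individual trees, which scikit-learn exposes as `estimators_`. The data and the names `forest` and `tree_probas` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the sonar set).
X_demo, y_demo = make_classification(n_samples=100, n_features=10, random_state=42)

forest = RandomForestClassifier(n_estimators=25, random_state=42)
forest.fit(X_demo, y_demo)

# Average the class probabilities predicted by each individual tree.
tree_probas = np.mean(
    [tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0
)

print(len(forest.estimators_))
print(np.allclose(tree_probas, forest.predict_proba(X_demo)))
```

More trees generally give a smoother average at the cost of longer training, which is why `n_estimators` is included in the search grid.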

In [None]:
parameters = [{'n_estimators': [10, 25, 50], 'criterion': ['gini', 'entropy'],
               'max_features': ['sqrt', 'log2', 0.1, 0.2], 'min_samples_split': [5, 7, 9]}]
best_parameter = grid_search_function(rf_classifier,parameters)

Best Accuracy: 87.31 %
Best Parameters: {'criterion': 'entropy', 'max_features': 'sqrt', 'min_samples_split': 5, 'n_estimators': 25}


In [None]:
rf_best_classifier = RandomForestClassifier(n_estimators=best_parameter['n_estimators'],
                                            criterion=best_parameter['criterion'],
                                            max_features=best_parameter['max_features'],
                                            min_samples_split=best_parameter['min_samples_split'],
                                            random_state=42)
rf_best_classifier.fit(X_train,y_train)
y_pred = rf_best_classifier.predict(X_test)
performance_metrics(y_test,y_pred,rf_best_classifier)

Model accuracy = 88.10 %

Confusion Matrix 
 [[22  4]
 [ 1 15]]

F1 score = 0.86

Mean accuracy for 5-fold cross validation on train set = 87.31 %
