**Our main goal is to create a machine learning model capable of distinguishing a rock from a mine based on the response at 60 separate sonar frequencies.**

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv("../input/connectionist-bench-sonar-mines-vs-rocks/sonar.all-data.csv")

In [None]:
df.info()

In [None]:
df.head(10)

We have 60 frequencies as our features, and our label is either R or M, meaning the detected object is a Rock or a Mine.

In [None]:
sns.countplot(data=df, x='Label')

In [None]:
fig = plt.figure(figsize=(10,10), dpi=100)

# Drop the string Label column so corr() only sees numeric data
sns.heatmap(df.drop('Label', axis=1).corr(), cmap="BuPu")

The purple band along the diagonal of our heatmap shows that frequencies close to each other are more strongly correlated.
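
As a rough sketch of what each heatmap cell represents: `df.corr()` computes the Pearson correlation by default. The helper `pearson` and the two short lists below are made up for illustration and are not taken from the sonar data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up readings for two neighbouring frequency bands
freq_a = [0.10, 0.20, 0.30, 0.40, 0.50]
freq_b = [0.12, 0.19, 0.33, 0.41, 0.48]

r = pearson(freq_a, freq_b)
print(round(r, 3))
```

Values close to 1 mean the two bands rise and fall together, which is what the dark cells near the diagonal indicate.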

Next, I want to find the frequencies that are most correlated with our label. The first step is to convert the label to 0 and 1, which I am going to do with the pandas map() method.

In [None]:
df['NumbLabel'] = df['Label'].map({'R':0, 'M':1})

We can see the correlations as numbers:

In [None]:
df.drop('Label', axis=1).corr()['NumbLabel'].sort_values(ascending=False)

# Train|Test Split

Since scikit-learn has no problem with categorical labels, we are going to use our Label column as it is, although we already converted it to 0 and 1 in NumbLabel.

In [None]:
from sklearn.model_selection import train_test_split
X = df.drop(['Label', 'NumbLabel'], axis = 1)
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.1, random_state=101)

Since we're going to use cross-validation, X_test and y_test act more as a hold-out set. Therefore, I set test_size to 10 percent.
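
The relationship between the hold-out set and the cross-validation folds can be sketched with plain index arithmetic. This is an illustration only: `n_samples = 200` is a hypothetical row count, and the contiguous folds below ignore both the shuffling done by `train_test_split` and the stratified splitting that scikit-learn applies when cross-validating a classifier.

```python
n_samples = 200      # hypothetical row count, not taken from the dataset
test_size = 0.1      # same fraction as in the split above
n_folds = 5          # 5-fold cross-validation, as an example

indices = list(range(n_samples))
n_holdout = round(n_samples * test_size)
holdout = indices[:n_holdout]          # set aside, never seen during tuning
train = indices[n_holdout:]            # used for cross-validation

# Cut the training indices into equally sized contiguous folds
fold_size = len(train) // n_folds
folds = [train[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]

for i, val_fold in enumerate(folds):
    # Each fold takes one turn as validation data; the rest is used for fitting
    fit_part = [idx for idx in train if idx not in val_fold]
    print(f"fold {i}: fit on {len(fit_part)} rows, validate on {len(val_fold)} rows")
```

The hold-out rows never appear in any fold, so they remain untouched until the final evaluation.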

In KNN modeling, feature scaling is necessary because we are calculating distances, and all our features should be on the same scale so that they can be compared fairly.
Thus, I am going to build a pipeline that includes scaling and our KNN model.
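
To illustrate why scaling matters for a distance-based model, the sketch below standardizes two features by hand (subtract the mean, divide by the population standard deviation, which is what `StandardScaler` does by default) and compares Euclidean distances before and after. The three points and the helper `standardize` are invented for the example; they are not sonar readings.

```python
import math
import statistics

# Hypothetical points: feature 0 is on a much larger scale than feature 1
points = [(1000.0, 0.1), (1010.0, 0.9), (1500.0, 0.1)]

def standardize(rows):
    """Column-wise z-scores using the population standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

scaled = standardize(points)

# Raw distances from the first point: feature 0 dominates the result
raw_d01 = math.dist(points[0], points[1])
raw_d02 = math.dist(points[0], points[2])

# After scaling, the difference in feature 1 contributes as well
scaled_d01 = math.dist(scaled[0], scaled[1])
scaled_d02 = math.dist(scaled[0], scaled[2])

print(raw_d01, raw_d02)
print(scaled_d01, scaled_d02)
```

On the raw values the second point looks far closer to the first than the third does, purely because of units. After standardization the two distances are of similar size.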

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
scaler = StandardScaler()
knn = KNeighborsClassifier()

operations = [('scaler', scaler),('knn', knn)]

In [None]:
from sklearn.pipeline import Pipeline
pipe = Pipeline(operations)

In the next step, we are going to run a grid search, testing different values of k to find out which k is best for our model.
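
To make the role of k concrete, here is a minimal from-scratch sketch of a k-nearest-neighbors vote. It is not how scikit-learn implements `KNeighborsClassifier` internally; the function `knn_predict` and the toy points are hypothetical and chosen so that the prediction changes with k.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Predict the majority label among the k training points closest to query."""
    by_distance = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy 2-D data: one 'R' point is closest to the query, but most nearby points are 'M'
train_X = [(0.0, 0.0), (1.0, 1.0), (1.2, 0.8), (0.9, 1.3), (3.0, 3.0)]
train_y = ['R', 'M', 'M', 'M', 'R']
query = (0.3, 0.3)

for k in (1, 3, 5):
    print(k, knn_predict(train_X, train_y, query, k))
```

With k=1 only the single closest point votes, while larger k lets the surrounding points outvote it. The grid search tries each candidate k and keeps the one with the best cross-validated score.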

In [None]:
from sklearn.model_selection import GridSearchCV
k_values = list(range(1,30))
param_grid = {'knn__n_neighbors': k_values}

In [None]:
cv_classifier = GridSearchCV(pipe, param_grid, cv = 5, scoring='accuracy')

In [None]:
cv_classifier.fit(X_train,y_train)

The grid search has been performed. Now let's see what our best parameters are:

In [None]:
cv_classifier.best_estimator_.get_params()

In [None]:
pd.DataFrame(cv_classifier.cv_results_)

We can see the results for the different n_neighbors values in the data frame above.

In [None]:
fig = plt.figure(figsize=(9,6), dpi=100)
# Plot against k_values so the x-axis shows k rather than the row index
plt.plot(k_values, cv_classifier.cv_results_['mean_test_score'])
plt.xlabel('n_neighbors (k)')

This shows why we should use only 1 neighbor: as the k value increases, the mean cross-validated accuracy decreases, and k=1 has the best accuracy.
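
How the best k falls out of the mean scores can be sketched as follows. The `example_scores` values are invented placeholders, not this notebook's results; with real output they would come from `cv_classifier.cv_results_['mean_test_score']`, which for a single-parameter grid should line up with the candidate k values in order.

```python
# Hypothetical mean cross-validated accuracies for k = 1..5 (placeholder values)
example_k_values = [1, 2, 3, 4, 5]
example_scores = [0.84, 0.78, 0.77, 0.75, 0.76]

# Index of the highest score; max() keeps the first index on ties, so the smallest k wins
best_index = max(range(len(example_scores)), key=lambda i: example_scores[i])
best_k = example_k_values[best_index]

print(best_k, example_scores[best_index])
```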

# Model Evaluation

In [None]:
y_pred = cv_classifier.predict(X_test)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
confusion_matrix(y_test, y_pred)

In [None]:
print(classification_report(y_test,y_pred))
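
The numbers in the classification report can be derived from the confusion matrix. The sketch below does this by hand for one class, using made-up counts rather than this notebook's actual output. It assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, in sorted label order ('M' then 'R').

```python
# Hypothetical confusion matrix: rows = true labels, columns = predicted labels
# Label order: ['M', 'R']
cm = [[10, 2],   # true M: 10 predicted M, 2 predicted R
      [1, 8]]    # true R: 1 predicted M, 8 predicted R

tp = cm[0][0]    # true M predicted as M
fn = cm[0][1]    # true M predicted as R
fp = cm[1][0]    # true R predicted as M
tn = cm[1][1]    # true R predicted as R

# Metrics for the 'M' class, plus overall accuracy
precision_m = tp / (tp + fp)
recall_m = tp / (tp + fn)
f1_m = 2 * precision_m * recall_m / (precision_m + recall_m)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision_m:.2f} recall={recall_m:.2f} f1={f1_m:.2f} accuracy={accuracy:.2f}")
```

For a mine detector, recall on the 'M' class is worth watching closely, since a low value means mines are being classified as rocks.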