In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('../input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

## First look at the csv file

Using the pandas library (imported as pd), we call the function [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) to open the file and take a look at how the data is displayed.

In [None]:
pd.read_csv('../input/mines-vs-rocks/sonar.all-data.csv')

1. This file has no header row, so we must read it with the parameter header=None.

2. The table above shows 207 rows and 61 columns because the first observation was consumed as the header. The last column is the label we wish to predict.
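The effect of header=None can be checked on a tiny stand-in file. This is a minimal sketch: the three-row CSV text below is hypothetical and only mimics the layout of the sonar file (numeric columns followed by an "R"/"M" label).

```python
import io
import pandas as pd

# Hypothetical three-row CSV with no header line, standing in for the sonar file.
csv_text = "0.02,0.03,R\n0.04,0.05,M\n0.06,0.07,R\n"

# Default behaviour: the first line is interpreted as column names.
with_inferred_header = pd.read_csv(io.StringIO(csv_text))

# header=None: every line is kept as data and columns are numbered 0, 1, 2, ...
without_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_inferred_header.shape)  # one row fewer than the file has lines
print(without_header.shape)        # all rows kept
```

Comparing the two shapes makes it easy to spot when an observation has been lost to the header.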

In [None]:
main_df = pd.read_csv('/kaggle/input/mines-vs-rocks/sonar.all-data.csv',header=None)
main_df

Now the columns are numbered from 0 to 60.

The last column holds the values "R" and "M", which stand for "Rock" and "Mine".

We will try to predict column 60, which says whether the observation is a rock or a mine, from columns 0 to 59.

Let's check the class balance.

In [None]:
main_df[60].value_counts().plot(kind='barh')

There is not much difference between the class proportions, so I will not rebalance the data.
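Besides the bar plot, the balance can be read as numbers. A minimal sketch, assuming a small hypothetical label Series in place of main_df[60]:

```python
import pandas as pd

# Hypothetical labels; in the notebook this would be main_df[60].
labels = pd.Series(["M", "M", "M", "R", "R"])

# normalize=True returns the fraction of rows in each class instead of raw counts.
proportions = labels.value_counts(normalize=True)
print(proportions)
```

Proportions close to each other suggest that plain accuracy is a reasonable first metric.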

I chose to split the inputs (the first 60 columns, 0 to 59) from the targets (column 60, dummy-encoded) and then use them as the model inputs and outputs.

First the inputs.

In [None]:
inputs_df = main_df.drop(60, axis=1)
inputs_df.head()

Then the outputs.

We get the [dummy data for our classification column](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html), which creates one new column per class indicating whether each row belongs to that class (1) or not (0).

In [None]:
targets_df = pd.get_dummies(main_df[60])
targets_df

For the targets we now have the columns 'R', which stands for Rock, and 'M', which stands for Mine.

In these columns the value 1 means "belongs to" and 0 means "doesn't belong to" the column's class.
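The encoding can be illustrated on a toy Series. This is a sketch with hypothetical labels; note that recent pandas versions return boolean columns, so the cast to int below is only there to show 0/1 values.

```python
import pandas as pd

# Hypothetical labels standing in for main_df[60].
labels = pd.Series(["R", "M", "R"])

# One indicator column per class; columns come out in sorted order ('M', then 'R').
dummies = pd.get_dummies(labels)

# Cast to int so the indicator reads as 0/1 regardless of the pandas version.
mine = dummies["M"].astype(int)
rock = dummies["R"].astype(int)
print(dummies)
```

With only two classes, each column is the complement of the other, so a single column carries all the information.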

I chose to split it into two Series objects so that I can test the classification results for each one.

In [None]:
rock_y_df = targets_df['R']
mine_y_df = targets_df['M']

We must then [split our data into train and test](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) sets so that we can measure how well the model generalizes when it predicts data it has not seen.

This step has a great impact on the model selection stage.

I chose to predict 1 if it's a mine and 0 if it's a rock.
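To get a feel for what a 30% test split does with 208 rows, here is a minimal sketch on placeholder arrays. The stratify=y argument is an optional extra that the notebook does not use; it keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the sonar dataset.
X = np.arange(208 * 2).reshape(208, 2)
y = np.arange(208) % 2  # alternating 0/1 labels, purely illustrative

# stratify=y is an assumption added for illustration, not part of the original split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

print(len(X_tr), len(X_te))      # sizes of the train and test parts
print(y_tr.mean(), y_te.mean())  # share of class 1 in each part
```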

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(inputs_df, mine_y_df, test_size=0.30, random_state=42)

We will use [PolynomialFeatures](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) to create new features, since we are dealing with numerical data.

Then we will use a [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) that chains the feature creator with the classifier we choose later on to produce our predictions.
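As an illustration of what degree-2 polynomial features produce, the sketch below expands one small hypothetical row and then counts the output columns for 60 inputs (the count depends only on the number of input columns, so an all-zero placeholder array is enough).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical row with two features a=2, b=3.
small = np.array([[2.0, 3.0]])
expanded = PolynomialFeatures(2).fit_transform(small)
# Columns are: 1, a, b, a^2, a*b, b^2
print(expanded)

# Number of generated columns for 60 input features (bias, linear, squares, pairwise products).
n_out_60 = PolynomialFeatures(2).fit(np.zeros((1, 60))).n_output_features_
print(n_out_60)
```

The expansion gives far more columns than there are observations, which is worth keeping in mind when reading the training scores later on.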


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# For feature creation
poly = PolynomialFeatures(2)

We will import some sklearn classifiers, test them, and select the best one for our problem.

In [None]:
#Importing classifiers
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier

Time to declare our classification models for testing, so we can then choose the one that generalizes best.

In [None]:
classifiers_ = [
    ("AdaBoost",AdaBoostClassifier()),
    ("Decision Tree", DecisionTreeClassifier(max_depth=10)),
    ("Gaussian Process", GaussianProcessClassifier(1.0 * RBF(1.0))),
    ("Linear SVM", SVC(kernel="linear", C=0.025,probability=True)),
    ("Naive Bayes",GaussianNB()),
    ("Nearest Neighbors",KNeighborsClassifier(3)),
    ("Neural Net",MLPClassifier(alpha=1)),
    ("QDA", QuadraticDiscriminantAnalysis()),
    ("Random Forest",RandomForestClassifier(n_jobs=2, random_state=1)),
    ("RBF SVM",SVC(gamma=2, C=1,probability=True)),
    ("SGDClassifier", SGDClassifier(max_iter=1000, tol=10e-3,penalty='elasticnet'))
    ]

Train each classifier and record its training and test scores.

In [None]:
clf_names = []
train_scores = []
test_scores = []
for n,clf in classifiers_:
    clf_names.append(n)
    # Model declaration with pipeline
    clf = Pipeline([('POLY', poly),('CLF',clf)])
    
    # Model training
    clf.fit(X_train, y_train)
    print(n+" training done!")
    
    # Measure training accuracy and score
    train_scores.append(clf.score(X_train, y_train))
    print(n+" training score done!")
    
    # Measure test accuracy and score
    test_scores.append(clf.score(X_test, y_test))
    print(n+" testing score done!")
    print("---")

We can plot each classifier's results for comparison.

In [None]:
#Plot results
# Horizontal bars, one per classifier: the classifiers are categories, not a sequence
plt.title('Accuracy Training Score')
plt.grid()
plt.barh(clf_names, train_scores)
plt.show()

plt.title('Accuracy Test Score')
plt.grid()
plt.barh(clf_names, test_scores)
plt.show()

Of the 11 classifiers we used, 7 overfit, reaching 100% accuracy on the training data, but the test scores show that only a few of them were able to generalize.

As seen in the test score results, the Gaussian Process generalizes best, followed by the Neural Net, K-Nearest Neighbors, and then SGD.

We will then train a model using the Gaussian Process method together with Polynomial Features, as this combination shows the best results in this experiment.
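The comparison above picks the winner from test-set scores. A common alternative, shown here only as a hedged sketch and not as what this notebook does, is to rank candidates by cross-validation on the training data and keep the test set for the final check. The data below is synthetic (make_classification) and the two candidates are arbitrary examples.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=150, n_features=10, random_state=0)

candidates = {
    "Nearest Neighbors": KNeighborsClassifier(3),
    "Decision Tree": DecisionTreeClassifier(max_depth=10, random_state=0),
}

# Mean accuracy over 5 folds for each candidate.
cv_means = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(cv_means, key=cv_means.get)
print(cv_means, best_name)
```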

In [None]:
clf = GaussianProcessClassifier(1.0 * RBF(1.0))

# Step named 'GPC' to match the classifier actually used (Gaussian Process)
clf = Pipeline([('POLY', poly),
                ('GPC', clf)])

# Training our model
%time clf.fit(X_train, y_train)

Measure its performance on the training set.

In [None]:
clf.score(X_train, y_train)

It shows a kind of overfitting, where the model's high complexity lets it fit the whole training dataset.

This can become a problem depending on the context you're dealing with, but first let's check its score on the test dataset.

In [None]:
clf.score(X_test, y_test)

The accuracy of 92% on the test dataset shows that the model generalized well on the task of classifying whether an observation is a rock or a mine.

Let's count how many mines our classifier finds in the test dataset.

In [None]:
clf.predict(X_test).sum()

Let's count how many there really are.

In [None]:
y_test.sum()


To draw better conclusions, it is a good idea to plot a [confusion matrix](https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html), which describes our model's accuracy in more detail on both the train and test data.
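Matching totals do not guarantee matching predictions, which is why the confusion matrix is more informative than the two counts above. A minimal sketch with hypothetical labels (1 = mine, 0 = rock):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions; both contain three mines in total.
y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("missed mines (false negatives):", fn)
```

Here the predicted and true mine counts are equal, yet one mine is missed and one rock is flagged.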

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is its replacement.
disp = ConfusionMatrixDisplay.from_estimator(clf, X_train, y_train,
                                             display_labels=['ROCK','MINE'],
                                             cmap=plt.cm.Blues,
                                             normalize=None)
disp.ax_.set_title('Confusion matrix')

print('Train results: confusion matrix')
print(disp.confusion_matrix)

The training dataset confusion matrix shows 100% accuracy: every observation is classified correctly.

We must then take measurements on the confusion matrix of the test data:

In [None]:
disp = ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test,
                                             display_labels=['ROCK','MINE'],
                                             cmap=plt.cm.Blues,
                                             normalize=None)
disp.ax_.set_title('Confusion matrix')

print('Test results: confusion matrix')
print(disp.confusion_matrix)

Our model misclassified some samples in the test dataset.

The model predicted some rocks as mines. That error puts no one in danger: a rock handled with the care given to a mine will not explode.

On the other hand, it predicted some mines as rocks, which could put people's lives in danger if the task were detecting real mines, even though the chances are low.

One possible solution, if the risk is not worth taking, is to attach this predictor to a robot to avoid injuries.
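Another option, sketched here under assumptions and not taken from the notebook, is to lower the probability threshold at which an observation is called a mine, trading more false alarms for fewer missed mines. The data is synthetic and the 0.3 threshold is an arbitrary illustrative value.

```python
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; class 1 plays the role of "mine".
X_demo, y_demo = make_classification(n_samples=120, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

model = GaussianProcessClassifier(1.0 * RBF(1.0), random_state=0).fit(X_tr, y_tr)

# Probability of the positive class for each test row.
mine_proba = model.predict_proba(X_te)[:, 1]

default_pred = (mine_proba >= 0.5).astype(int)   # usual decision rule
cautious_pred = (mine_proba >= 0.3).astype(int)  # hypothetical, more cautious rule

# Count mines predicted as rocks under each rule.
missed_default = int(((y_te == 1) & (default_pred == 0)).sum())
missed_cautious = int(((y_te == 1) & (cautious_pred == 0)).sum())
print(missed_default, missed_cautious)
```

The cautious rule can never miss more mines than the default one, since every row flagged at 0.5 is also flagged at 0.3.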