# Supervised Machine Learning - Classification

## Splitting the initial dataset

As mentioned in the introduction, supervised machine learning is the process of predicting an unknown result with a model that an algorithm builds from known results.

Now it is time to apply machine learning algorithms in order to make a prediction. But what kind of prediction can be made when the results are already known? In this dataset the result (the class) of every observation is already known.

Let us make an agreement. In order to continue, we divide our dataset into two groups. The first and larger one (usually 70%-80% of the initial dataset) is our **training set**: the data, together with its known results, that the algorithm uses to create a model. The rest of the initial dataset is used as a **test set**: the set to which we apply the model in order to test it, by counting how often the predicted results agree with the observed ones. The proportion of agreement is the **accuracy** of the model.
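
The idea of holding out a test set and counting agreement can be sketched in plain Python. This is only an illustration under stated assumptions: `holdout_split` and `accuracy` are hypothetical helper names, the rows and labels are made up, and the notebook itself relies on scikit-learn for both steps.

```python
import random

def holdout_split(rows, test_size=0.25, seed=42):
    """Shuffle row indices reproducibly and split the rows into train and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

def accuracy(observed, predicted):
    """Fraction of predictions that agree with the observed labels."""
    matches = sum(o == p for o, p in zip(observed, predicted))
    return matches / len(observed)

rows = list(range(20))  # hypothetical stand-in for 20 observations
train_rows, test_rows = holdout_split(rows)
print(len(train_rows), len(test_rows))  # 15 5

observed = ["M", "R", "R", "M"]   # hypothetical observed labels
predicted = ["M", "R", "M", "M"]  # hypothetical predicted labels
print(accuracy(observed, predicted))  # 0.75
```

Fixing the seed makes the split reproducible, which is the same role `random_state` plays in scikit-learn's `train_test_split`.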

In [None]:
from sklearn.decomposition import PCA

In [None]:
import pandas as pd
df = pd.read_csv('sonar.all-data.csv',header=None)
#all rows all columns but last
sonar = df.iloc[:, :-1].values
#all rows, only the last column
sonar_class = df.iloc[:, -1].values

In [None]:
sonar

In [None]:
sonar_class

In [None]:
from sklearn.model_selection import train_test_split
# test_size is the fraction of the complete dataset that goes to the test set
# random_state is the seed of the pseudorandom number generator; fixing it
# reproduces the same split every time we run the script
sonar_train, sonar_test, sonar_class_train, sonar_class_test = train_test_split(sonar, sonar_class, test_size=0.25, random_state=42)

In [None]:
sonar_train

In [None]:
sonar_test

In [None]:
# the train set responses of the dataset
sonar_class_train

In [None]:
#the test set responses of the dataset
sonar_class_test

# Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
from pandas import DataFrame
# create the scaler
scaler = StandardScaler()
# fit on the training set only, then apply the same scaling to the test set
# (both results are NumPy arrays)
sonar_train_scaled = scaler.fit_transform(sonar_train)
sonar_test_scaled = scaler.transform(sonar_test)
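
Standardization replaces each value x with (x - mean) / std. The following is a minimal sketch, assuming a single made-up feature, of why the mean and standard deviation are estimated on the training values only and then reused on the test values, which is the pattern of `fit_transform` followed by `transform`. The helper `standardize` is a hypothetical name, not part of scikit-learn.

```python
from statistics import fmean, pstdev

train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up training values
test = [5.0, 9.0]                                  # made-up test values

# the statistics are estimated on the training values only
mu = fmean(train)
sigma = pstdev(train)  # population standard deviation

def standardize(values, mean, std):
    """Apply z = (x - mean) / std to every value."""
    return [(v - mean) / std for v in values]

train_scaled = standardize(train, mu, sigma)
test_scaled = standardize(test, mu, sigma)  # reuses the training statistics

print(mu, sigma)    # 5.0 2.0
print(test_scaled)  # [0.0, 2.0]
```

Reusing the training statistics keeps information about the test set out of the preprocessing step.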

## PCA

In [None]:
#Take the first 10 components
#n_components = 10 # Example value
#pca = PCA(n_components=n_components)

# Take as many components as needed to explain 80% of the total variance
pca = PCA(n_components=0.8, svd_solver='full')

# Fit PCA on training data and transform both training and test data
sonar_train_pca = pca.fit_transform(sonar_train_scaled)
sonar_test_pca = pca.transform(sonar_test_scaled)

In [None]:
pca.n_components_
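
When `n_components` is a fraction, the number of retained components follows from the cumulative explained variance. A rough sketch of that selection rule, assuming made-up variance ratios (`components_needed` is a hypothetical helper, not a scikit-learn function):

```python
from itertools import accumulate

def components_needed(explained_variance_ratio, threshold=0.8):
    """Smallest number of leading components whose cumulative ratio exceeds threshold."""
    for count, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative > threshold:
            return count
    return len(explained_variance_ratio)

ratios = [0.45, 0.25, 0.15, 0.10, 0.05]  # hypothetical ratios, in decreasing order
print(components_needed(ratios))  # 3
```

On a fitted model, the real ratios are available as `pca.explained_variance_ratio_`, so the same cumulative sum can be inspected directly.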

# KNN classification algorithm (k-nearest neighbors)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
# choosing the Euclidean distance: Minkowski metric with p = 2 (see the classifier's help)
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(sonar_train_pca, sonar_class_train)
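
To make the idea behind the classifier concrete, here is a minimal from-scratch sketch of k-nearest neighbors with Euclidean distance and a majority vote. It is an illustration on made-up two-dimensional points, not the scikit-learn implementation; `knn_predict` is a hypothetical name.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Label a query point by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# made-up points forming two well separated groups
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_labels = ["M", "M", "M", "R", "R", "R"]

print(knn_predict(train_points, train_labels, (1.1, 1.0), k=3))  # M
print(knn_predict(train_points, train_labels, (4.0, 4.0), k=3))  # R
```

Because the vote depends on distances, features on larger scales would dominate, which is one reason the data is standardized before fitting.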

## The prediction of the type M or R

In [None]:
sonar_test_pred = classifier.predict(sonar_test_pca)

In [None]:
sonar_test_pred

## The test set responses

In [None]:
sonar_class_test

In [None]:
#Calculating the confusion matrix and the accuracy 
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(sonar_class_test, sonar_test_pred)
print(cm)
accuracy_score(sonar_class_test, sonar_test_pred)
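
The confusion matrix is a table of counts with observed classes in rows and predicted classes in columns, and the accuracy is the share of counts on its diagonal. A small sketch on made-up labels (`confusion_counts` is a hypothetical helper):

```python
def confusion_counts(observed, predicted, labels):
    """Count (observed, predicted) pairs; rows are observed, columns are predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for o, p in zip(observed, predicted):
        matrix[index[o]][index[p]] += 1
    return matrix

observed = ["M", "M", "M", "R", "R", "R", "R", "M"]   # made-up observed labels
predicted = ["M", "R", "M", "R", "R", "M", "R", "M"]  # made-up predicted labels

toy_cm = confusion_counts(observed, predicted, labels=["M", "R"])
correct = sum(toy_cm[i][i] for i in range(len(toy_cm)))  # diagonal = agreements
total = sum(sum(row) for row in toy_cm)

print(toy_cm)           # [[3, 1], [1, 3]]
print(correct / total)  # 0.75
```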

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=classifier.classes_)
disp.plot()
plt.show()

In [None]:
from sklearn.metrics import classification_report
print(classification_report(sonar_class_test, sonar_test_pred))
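
The per-class figures in the report can be derived from the confusion matrix. The sketch below assumes a made-up 2x2 matrix and does not handle classes with no predictions or no observations; `per_class_scores` is a hypothetical helper.

```python
def per_class_scores(matrix, class_index):
    """Precision, recall and F1 for one class; rows are observed, columns predicted."""
    tp = matrix[class_index][class_index]
    fn = sum(matrix[class_index]) - tp                 # observed as the class, predicted otherwise
    fp = sum(row[class_index] for row in matrix) - tp  # predicted as the class, observed otherwise
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

example_cm = [[20, 5], [3, 24]]  # hypothetical counts
precision, recall, f1 = per_class_scores(example_cm, 0)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.870 recall=0.800 f1=0.833
```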

### ROC curve and AUC

In [None]:
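from sklearn.metrics import roc_curve, roc_auc_score

# predicted class probabilities; the columns follow the order of classifier.classes_
y_pred_prob_knn = classifier.predict_proba(sonar_test_pca)

# column 1 holds the probability of classifier.classes_[1],
# so that class is treated as the positive one
fpr, tpr, _ = roc_curve(sonar_class_test, y_pred_prob_knn[:, 1], pos_label=classifier.classes_[1])

plt.plot(fpr, tpr, label="knn")

# a ROC curve plots the true positive rate against the false positive rate
plt.xlabel('False Positive Rate', fontsize=18)
plt.ylabel('True Positive Rate', fontsize=18)
plt.legend(fontsize=15)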

In [None]:
print(f'model 1 AUC score: {roc_auc_score(sonar_class_test, y_pred_prob_knn[:,1])}') 
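
One way to read the AUC is as the probability that a randomly chosen positive observation receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal sketch of that pairwise definition on made-up labels and scores (`pairwise_auc` is a hypothetical helper, and it is quadratic in the number of observations):

```python
def pairwise_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]           # made-up true classes
scores = [0.1, 0.4, 0.35, 0.8]  # made-up predicted probabilities of class 1
print(pairwise_auc(labels, scores))  # 0.75
```

Ties matter for a k-nearest neighbors model, because with 5 neighbors the predicted probabilities can only take a handful of distinct values.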