## Data Pre-Processing 

In [43]:
# Load libraries

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import MinMaxScaler
import sklearn.metrics as metrics

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [44]:
# Load data
dfPoints = pd.read_csv("df_points.txt", delimiter="\t")

In [45]:
# Check some rows
dfPoints.head()

Unnamed: 0.1,Unnamed: 0,x,y,z,label
0,0,326.488285,188.988808,-312.205307,0.0
1,1,-314.287214,307.276723,-179.037412,1.0
2,2,-328.20891,181.627758,446.311062,1.0
3,3,-148.65889,147.027947,-27.477959,1.0
4,4,-467.065931,250.467651,-306.47533,1.0


In [47]:
# The solver argument in our models triggers several warnings.
# They are harmless here, so I will disable them.
# If this code is reused in the future, it may be better to keep warnings enabled.

import warnings
warnings.filterwarnings('ignore')

## (a) Segregate a test and training frame

In [48]:
# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(dfPoints[['x','y','z']], dfPoints[['label']], random_state=0)
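
The call above relies on the default `test_size=0.25` of `train_test_split`, so a quarter of the rows go to the test set. As a minimal sketch of a variant, the split could also be stratified on the label so that both sets keep the same class balance. The data below is a synthetic stand-in (the name `df_demo` and its values are assumptions, not the notebook's `df_points.txt`):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for dfPoints (hypothetical values, same column names)
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "x": rng.uniform(-500, 500, 1000),
    "y": rng.uniform(-500, 500, 1000),
    "z": rng.uniform(-500, 500, 1000),
    "label": rng.integers(0, 2, 1000).astype(float),
})

# Explicit test_size (0.25 is the default) and stratification on the label
X_tr, X_te, y_tr, y_te = train_test_split(
    df_demo[["x", "y", "z"]],
    df_demo["label"],
    test_size=0.25,
    stratify=df_demo["label"],
    random_state=0,
)

print("Train shape:", X_tr.shape, "Test shape:", X_te.shape)
print("Share of label 1 in train / test:", y_tr.mean(), y_te.mean())
```

Passing the label as a Series (`df_demo["label"]`) instead of a one-column DataFrame also avoids the need for `.values.ravel()` when fitting.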

## (b) Logistic Regression Model and Results

In [49]:
# Run logistic regression (C is the inverse of regularization strength; the solver warning is not a problem)
clf = LogisticRegression(C=0.001).fit(X_train, y_train.values.ravel())

# Check accuracy
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

Accuracy of Logistic regression classifier on training set: 0.54
Accuracy of Logistic regression classifier on test set: 0.53


Let's check other evaluation metrics besides accuracy:<br>
- Precision = TP / (TP + FP);<br>
- Recall = TP / (TP + FN), also known as sensitivity or true positive rate;<br>
- F1 = 2 * Precision * Recall / (Precision + Recall).

In [50]:
print('Precision: {:.2f}'.format(precision_score(y_test, clf.predict(X_test))))
print('Recall: {:.2f}'.format(recall_score(y_test, clf.predict(X_test))))
print('F1: {:.2f}'.format(f1_score(y_test, clf.predict(X_test))))

Precision: 0.51
Recall: 0.67
F1: 0.58


Also, let's check the confusion matrix:

In [51]:
labels = np.unique(y_test)
a =  confusion_matrix(y_test, clf.predict(X_test), labels=labels)
print("Confusion Matrix\n",pd.DataFrame(a, index=labels, columns=labels))

Confusion Matrix
      0.0  1.0
0.0  503  777
1.0  402  818


In [52]:
# Another way to look at the Confusion Matrix
tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test), labels=labels).ravel()
print("TN = ", tn, "\nFP = ", fp, "\nFN = ",fn, "\nTP = ", tp)

TN =  503 
FP =  777 
FN =  402 
TP =  818
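
As a quick sanity check, the four counts printed above can be plugged directly into the formulas for accuracy, precision, recall and F1. This is only a worked-arithmetic sketch; the counts are copied from the output above:

```python
# Counts copied from the confusion matrix output above
tn, fp, fn, tp = 503, 777, 402, 818

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy: {:.2f}".format(accuracy))    # 0.53
print("Precision: {:.2f}".format(precision))  # 0.51
print("Recall: {:.2f}".format(recall))        # 0.67
print("F1: {:.2f}".format(f1))                # 0.58
```

The values agree with the scores reported by scikit-learn in the cells above.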


In [53]:
# Checking the AUC (computed from hard class predictions, so the ROC curve has a single threshold point)

fpr, tpr, threshold = metrics.roc_curve(y_test, clf.predict(X_test))

x = fpr
y = tpr 

# AUC
auc = np.trapz(y,x)
print(auc)

0.5317302766393442
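
Note that the AUC above is computed from `clf.predict`, i.e. from hard 0/1 labels, so the ROC curve has only one intermediate point. A more common approach is to pass predicted probabilities, which lets the curve sweep over all thresholds. The sketch below shows both variants side by side on synthetic data (`make_classification` is used as a stand-in; it is an assumption, not the notebook's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with three features (hypothetical, not df_points.txt)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)

# AUC from hard labels: a single operating point (threshold 0.5)
auc_labels = roc_auc_score(y_te, model.predict(X_te))

# AUC from the predicted probability of class 1: all thresholds
auc_scores = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print("AUC from hard labels:   {:.3f}".format(auc_labels))
print("AUC from probabilities: {:.3f}".format(auc_scores))
```

`roc_auc_score` integrates the ROC curve with the trapezoidal rule, so it replaces the manual `roc_curve` plus `np.trapz` steps.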


As we can see, the accuracy of the chosen model is poor: it is barely above 50%. However, we didn't normalize our data.<br>
In other words, the features may be on different scales and our model could benefit from normalization.

In [54]:
# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(dfPoints[['x','y','z']], dfPoints[['label']], random_state=0)

# scale
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Run logistic regression (tuning the C regularization parameter does not improve results in this case)
clf = LogisticRegression().fit(X_train_scaled, y_train.values.ravel())

# Check accuracy
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(clf.score(X_train_scaled, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
     .format(clf.score(X_test_scaled, y_test)))

Accuracy of Logistic regression classifier on training set: 0.52
Accuracy of Logistic regression classifier on test set: 0.51


In [55]:
print('Precision: {:.2f}'.format(precision_score(y_test, clf.predict(X_test_scaled))))
print('Recall: {:.2f}'.format(recall_score(y_test, clf.predict(X_test_scaled))))
print('F1: {:.2f}'.format(f1_score(y_test, clf.predict(X_test_scaled))))

print("")

labels = np.unique(y_test)
a =  confusion_matrix(y_test, clf.predict(X_test_scaled), labels=labels)
print("Confusion Matrix\n",pd.DataFrame(a, index=labels, columns=labels))

print("")

tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test_scaled), labels=labels).ravel()
print("TN = ", tn, "\nFP = ", fp, "\nFN = ",fn, "\nTP = ", tp)

Precision: 0.50
Recall: 0.72
F1: 0.59

Confusion Matrix
      0.0  1.0
0.0  393  887
1.0  336  884

TN =  393 
FP =  887 
FN =  336 
TP =  884


In [56]:
fpr, tpr, threshold = metrics.roc_curve(y_test, clf.predict(X_test_scaled))

x = fpr
y = tpr 

# AUC
auc = np.trapz(y,x)
print(auc)

0.5158107069672131


It looks like feature scale isn't the problem.<br>
Logistic regression doesn't look like the best option here.
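
One possible explanation, which is only an assumption here since we haven't inspected the shape of the data, is that the classes are not separable by a plane. Logistic regression is a linear model, so rescaling the features cannot help in that situation. The sketch below illustrates the idea on synthetic points whose label depends on the distance from the origin (the data and all names are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic points in a cube; the label depends only on distance from the origin
rng = np.random.default_rng(0)
X_demo = rng.uniform(-500, 500, size=(4000, 3))
radius = np.linalg.norm(X_demo, axis=1)
y_demo = (radius > np.median(radius)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# 1) Raw features
acc_raw = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

# 2) Min-max scaled features (scaler fitted on the training set only)
scaler = MinMaxScaler().fit(X_tr)
acc_scaled = (
    LogisticRegression(max_iter=1000)
    .fit(scaler.transform(X_tr), y_tr)
    .score(scaler.transform(X_te), y_te)
)

# 3) Squared features, which make a spherical boundary linear
sq_tr, sq_te = (X_tr / 500.0) ** 2, (X_te / 500.0) ** 2
acc_squared = LogisticRegression(max_iter=1000).fit(sq_tr, y_tr).score(sq_te, y_te)

print("Raw: {:.2f}  Scaled: {:.2f}  Squared: {:.2f}".format(acc_raw, acc_scaled, acc_squared))
```

In this toy setting, the raw and scaled models stay close to chance, while the squared features let the same linear model separate the classes. Whether the real data behaves this way would need to be checked, for example with a 3D scatter plot.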

## (c) Model Chosen by Me: Random Forest

<b>Why Random Forest?</b><br>
A few models could be tested here, such as Decision Trees, SVM and so on.<br>
However, since I have to choose one, I would like to go with Random Forest because:<br>
- Random Forest is a much more robust model, i.e. it usually performs well compared to models such as logistic regression;<br>
- Since we average several trees, there is less variance than with a single decision tree (which would be another good choice);<br>

The main disadvantages are:<br>
- It's less interpretable than models such as decision trees or even logistic regression;<br>
- Depending on how many trees you are training, it can be computationally expensive. This is not our case, so this problem can be ignored;<br>
- Considerable risk of overfitting (even though combining several trees reduces the risk compared to a single decision tree).<br>

Two useful sources for understanding Random Forest:
- https://towardsdatascience.com/why-random-forest-is-my-favorite-machine-learning-model-b97651fa3706 <br>
- https://towardsdatascience.com/understanding-random-forest-58381e0602d2 <br>

In [57]:
# Load libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(dfPoints[['x','y','z']], dfPoints[['label']], random_state=0)

# run random forest classifier
clf = RandomForestClassifier().fit(X_train, y_train)

print('Accuracy of RF classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of RF classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

Accuracy of RF classifier on training set: 0.98
Accuracy of RF classifier on test set: 0.72


<b>Regarding the model above:</b> the difference between the performance on the training set (0.98) and on the test set (0.72) is too large.<br>
I.e., the model might be overfitting. By lowering the max_depth argument, we might get better results.<br>
Let's try a new approach:

In [58]:
# run random forest classifier
clf = RandomForestClassifier(max_depth=15).fit(X_train, y_train)

print('Accuracy of RF classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of RF classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

Accuracy of RF classifier on training set: 0.87
Accuracy of RF classifier on test set: 0.76
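
The value `max_depth=15` above was picked by hand. As a minimal sketch of a more systematic alternative, several depths could be compared with cross-validation on the training data, which also keeps the test set out of the tuning loop. The data below is synthetic, and the depth grid and `n_estimators=50` are arbitrary assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with some label noise (hypothetical, not df_points.txt)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=3, n_informative=3, n_redundant=0,
    flip_y=0.2, random_state=0,
)

# Mean 5-fold cross-validated accuracy for each candidate depth
results = {}
for depth in (3, 5, 10, 15, None):  # None lets the trees grow fully
    forest = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=0)
    scores = cross_val_score(forest, X_demo, y_demo, cv=5)
    results[depth] = scores.mean()
    print("max_depth={}: {:.3f}".format(depth, results[depth]))

best_depth = max(results, key=results.get)
print("Best depth by CV accuracy:", best_depth)
```

Fixing `random_state` in the classifier also makes the scores reproducible between runs, which the cells above do not guarantee.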


In [59]:
print('Precision: {:.2f}'.format(precision_score(y_test, clf.predict(X_test))))
print('Recall: {:.2f}'.format(recall_score(y_test, clf.predict(X_test))))
print('F1: {:.2f}'.format(f1_score(y_test, clf.predict(X_test))))

print("")

labels = np.unique(y_test)
a =  confusion_matrix(y_test, clf.predict(X_test), labels=labels)
print("Confusion Matrix\n",pd.DataFrame(a, index=labels, columns=labels))

print("")

tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test), labels=labels).ravel()
print("TN = ", tn, "\nFP = ", fp, "\nFN = ",fn, "\nTP = ", tp)

Precision: 0.75
Recall: 0.75
F1: 0.75

Confusion Matrix
      0.0  1.0
0.0  978  302
1.0  301  919

TN =  978 
FP =  302 
FN =  301 
TP =  919


In [60]:
import matplotlib.pyplot as plt
import numpy as np
import sklearn.metrics as metrics

fpr, tpr, threshold = metrics.roc_curve(y_test, clf.predict(X_test))

# AUC
auc = np.trapz(tpr,fpr)
print(auc)

0.758670594262295


This model is our final choice.

## (d) Comparing the Results 

As we could see, the Random Forest model's accuracy was far better than that of the logistic regression approach.<br>
For the final model, we limited the max_depth argument to 15, which lowered accuracy on the training set (0.98 to 0.87) but improved it on the test set (0.72 to 0.76).<br>
This suggests we reduced overfitting, so the model should perform better on new data.<br>
Due to the advantages already mentioned above, test accuracy improved by about 23pp over the logistic regression (0.53 to 0.76).<br>
Additionally, our model is now acceptable for use, whereas the logistic regression was roughly equivalent to leaving things to chance.<br>
