# Task 4

Categorical features are problematic for clustering, because the usual distance for categorical features gives bad results with algorithms such as K-Means.

That distance is given by
$$d(a, b) = \left\{
\begin{array}{rcl}
     0 & \mbox{ , if } & a = b
  \\ 1 & \mbox{ , if } & a\neq b
\end{array}
\right.$$
In order to adapt K-Means to the case of categorical features, there is an algorithm called K-Modes that is able to handle such features in an efficient way.
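As a minimal sketch of how such a distance extends to whole samples, the per-feature distance (taken here as 0 for equal categories and 1 for different ones) can be summed over all features, which simply counts the mismatches. The function name `matching_dissimilarity` and the toy samples are illustrative assumptions, not part of the kmodes package.

```python
import numpy as np


def matching_dissimilarity(a, b):
    """Count the features on which two categorical samples disagree."""
    a = np.asarray(a)
    b = np.asarray(b)
    # Element-wise comparison gives True (1) for every mismatching feature
    return int(np.sum(a != b))


# Hypothetical samples described by three categorical features
x = ["red", "small", "round"]
y = ["red", "large", "round"]

print(matching_dissimilarity(x, y))  # the samples differ only in the second feature
print(matching_dissimilarity(x, x))  # identical samples are at distance 0
```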

The documentation of the kmodes package is available at https://pypi.org/project/kmodes/. To install it, you will have to use pip, because it is not available in any Anaconda package repository. In general, installing packages with pip is not recommended for Anaconda users, but I have never had problems with this package.


The dataset 'USCrimeMDLP.csv' contains 300 samples and 79 categorical features, plus a class feature. Use that dataset for the following exercises:
1.	Split the data into train and test, keeping 200 samples for training. Use stratification. Always use random_state=0.
<br><br>
2.	Use the K-Means algorithm with 2 clusters, and evaluate it using the area under the ROC curve (AUC) as an external measure (we can do it because the class column is available). Take into account that the classes are {-1, 1} and the cluster names are {0, 1}. Did you get a strange value? Why can the AUC be so low? Correct the problem. (Hint: the prediction made by clustering assigns the name of the cluster as the class, but the names are just tag names.) Is the K-Means algorithm adequate in this case?
<br><br>
3.	Consider the clustering you have obtained as a classification algorithm, i.e. each cluster predicting a class. Together with the AUC obtained above, calculate the classification report and the confusion matrix.
<br><br>
4.	Compare these results with the classification performed with random forest (n_estimators=100), and with SVC (C=2.0) and NuSVC (nu=0.001).
<br><br>
5.	Use K-Modes algorithm with 2 clusters, evaluating it in the same way as with K-Means. Comparing with the result in exercise 2, should we discard kmodes as an unsupervised classification procedure?
<br><br>
6.	Make summary comments of all the exercises as a general conclusion.

## Solution

In [1]:
# 1.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from collections import Counter

df = pd.read_csv('USCrimeMDLP.csv')

# The features are all the columns but the last one; the class is the last column
X, y = df.values[:, :-1], df.values[:, -1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=100, random_state=0, stratify=y)
print(Counter(y_train))
print(Counter(y_test))

Counter({1: 100, -1: 100})
Counter({1: 50, -1: 50})


Both splits keep the 50/50 class balance of the original data, so the stratification worked.

In [2]:
# 2.
from sklearn.metrics import roc_auc_score
from sklearn.cluster import KMeans

kmeans = KMeans(2, random_state=0)  # 2 clusters
kmeans.fit(X_train)  # unsupervised: y_train is not used

y_pred_means = kmeans.predict(X_test)  # cluster tags in {0, 1}

AUC = roc_auc_score(y_test, y_pred_means)
print("score test: " + str(AUC))

score test: 0.31000000000000005


The obtained AUC score is very low. K-Means does not know the target classes, so it tags the clusters 0 and 1 in an arbitrary order instead of using -1 and 1.

An AUC this far below 0.5 means that the tags, read as scores, rank the classes the wrong way round. We are implicitly assuming:
0 when the sample belongs to -1
1 when the sample belongs to 1

In reality it is the opposite: cluster 0 mostly contains samples of class 1, and cluster 1 mostly contains samples of class -1.
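The effect of swapping the tags can be checked on a small example. The sketch below uses made-up labels (not the USCrimeMDLP data) to show that reversing the two cluster tags turns an AUC of `a` into `1 - a`, and that mapping the tags explicitly onto the class labels with `np.where` gives the same AUC as the swap.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true classes and arbitrary cluster tags
y_true = np.array([-1, -1, -1, 1, 1, 1])
clusters = np.array([1, 1, 0, 0, 0, 1])

# AUC when the raw tags are used as scores
auc_raw = roc_auc_score(y_true, clusters)

# AUC after swapping the two tags (0 <-> 1)
auc_swapped = roc_auc_score(y_true, 1 - clusters)

# Explicit mapping onto the class labels: cluster 0 -> 1, cluster 1 -> -1
y_pred = np.where(clusters == 0, 1, -1)
auc_mapped = roc_auc_score(y_true, y_pred)

print(auc_raw, auc_swapped, auc_mapped)
```

The explicit mapping has the advantage that the predictions use the same labels as the classes, which matters for label-based metrics such as the confusion matrix.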

In [3]:
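# Reverse the order of the two cluster tags: {0, 1} -> {0, -1}.
# The AUC only depends on the ranking of the scores, so this is enough to correct it.
y_pred_means = y_pred_means * -1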

In [4]:
AUC = roc_auc_score(y_test, y_pred_means)
print("score test: " + str(AUC))

score test: 0.69


In [5]:
# 3.
#classification report
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred_means))


              precision    recall  f1-score   support

          -1       0.91      0.42      0.58        50
           0       0.00      0.00      0.00         0
           1       0.00      0.00      0.00        50

    accuracy                           0.21       100
   macro avg       0.30      0.14      0.19       100
weighted avg       0.46      0.21      0.29       100





In [6]:
from sklearn.metrics import confusion_matrix
print("confusion matrix test: \n")
print(confusion_matrix(y_test, y_pred_means))

confusion matrix test: 

[[21 29  0]
 [ 0  0  0]
 [ 2 48  0]]


In [7]:
AUC = roc_auc_score(y_test, y_pred_means)
print("\n ROC AUC score test: " + str(AUC))


 ROC AUC score test: 0.69


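The confusion matrix has three rows and columns because the labels present are -1, 0 and 1: after multiplying by -1, the predictions take the values {-1, 0}, so the column for 0 actually holds the samples assigned to the cluster that matches class 1. This is also why the classification report shows zeros for class 1 and an accuracy of 0.21. Reading the matrix with -1 as the positive class and the 0 column as the prediction of class 1, we get:

TP = 21

FP = 2

TN = 48

FN = 29

The number of false negatives is very high: 29 of the 50 samples of class -1 fall into the other cluster. With the clusters matched to the classes, the accuracy is (21 + 48) / 100 = 0.69.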
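The counts above are enough to recompute the usual metrics by hand. The short sketch below does so, taking -1 as the positive class as in the text. It also illustrates that, for hard predictions with only two possible values, the area under the ROC curve equals the mean of recall and specificity, which matches the AUC reported earlier.

```python
# Counts read from the confusion matrix, with class -1 as the positive class
tp, fn = 21, 29  # samples of class -1: correctly clustered / in the other cluster
fp, tn = 2, 48   # samples of class 1: in the wrong cluster / correctly clustered

recall = tp / (tp + fn)        # fraction of class -1 that is recovered
specificity = tn / (tn + fp)   # fraction of class 1 that is recovered
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + fn + fp + tn)

# With a single operating point, the ROC AUC is the mean of recall and specificity
auc = (recall + specificity) / 2

print(f"recall={recall:.2f} specificity={specificity:.2f} "
      f"precision={precision:.2f} accuracy={accuracy:.2f} auc={auc:.2f}")
```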

In [8]:
# 4.
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.svm import NuSVC

rf = RandomForestClassifier(n_estimators=100, random_state=0)
svc  = SVC(C=2.0, random_state=0)
nu_svc = NuSVC(nu=0.001, random_state=0)

rf.fit(X_train, y_train)
svc.fit(X_train, y_train)
nu_svc.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)
y_pred_svc = svc.predict(X_test)
y_pred_nu_svc = nu_svc.predict(X_test)

print("RF score: " + str(roc_auc_score(y_test, y_pred_rf)))
print("SVC score: " + str(roc_auc_score(y_test, y_pred_svc)))
print("NuSVC score: " + str(roc_auc_score(y_test, y_pred_nu_svc)))

RF score: 0.86
SVC score: 0.84
NuSVC score: 0.88




All three supervised classifiers reach a higher AUC (0.84 to 0.88) than K-Means (0.69).
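The AUC values above are computed from hard class predictions, which keeps them comparable with the clustering results. For classifiers that can output a score, the AUC is often computed from that score instead. The sketch below shows both variants on synthetic data generated with `make_classification`; the data and the variable names are assumptions for illustration only and say nothing about the values that would be obtained on USCrimeMDLP.csv.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the real dataset
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=100, random_state=0, stratify=y)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_tr, y_tr)

# AUC from hard predictions: a single operating point on the ROC curve
auc_labels = roc_auc_score(y_te, rf.predict(X_te))

# AUC from the predicted probability of the positive class: the whole ROC curve
auc_scores = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])

print(auc_labels, auc_scores)
```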

In [9]:
#5.
from kmodes.kmodes import KModes
from sklearn.metrics import roc_auc_score

km = KModes(n_clusters=2, random_state=0)

km.fit(X_train)

y_pred_kmodes = km.predict(X_test)

# Same tag reversal as with K-Means: {0, 1} -> {0, -1}
y_pred_kmodes = y_pred_kmodes * -1

AUC = roc_auc_score(y_test, y_pred_kmodes)
print("score test: " + str(AUC))

score test: 0.72
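K-Modes differs from K-Means in two ways: the centre of a cluster is the mode (most frequent category) of each feature instead of the mean, and samples are compared with the centre by counting mismatching features. The sketch below illustrates both ideas with pandas on a made-up cluster; it does not use the kmodes package, and the feature names and values are hypothetical.

```python
import pandas as pd

# Hypothetical cluster of four samples with two categorical features
cluster = pd.DataFrame({
    "color": ["red", "red", "blue", "red"],
    "size": ["small", "large", "large", "large"],
})

# Cluster centre: the most frequent category of each feature
mode = cluster.mode().iloc[0]

# Dissimilarity of each sample to the centre: number of mismatching features
dissimilarity = cluster.ne(mode, axis="columns").sum(axis=1)

print(mode.to_dict())
print(dissimilarity.tolist())
```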


#### 6.

Let's recap the AUC obtained by each method on the test set.

KMeans = 0.69

KModes = 0.72

NuSVC = 0.88

SVC = 0.84

RandomForest = 0.86

We can observe that the clustering methods score lower than the three proposed classifiers. This is expected, because the classifiers are supervised and use the class labels during training.

When comparing the clustering methods, we observe that K-Modes scores slightly better than K-Means (0.72 vs 0.69). That makes sense, because K-Means is not the best option for categorical data. For this kind of data we should choose K-Modes instead, so there is no reason to discard it as an unsupervised procedure.