# Task 4

When clustering, categorical features are problematic: the usual distance for categorical values gives poor results with algorithms such as K-Means, which rely on Euclidean distances and cluster means.

That distance is given by
$$d(a, b) = \left\{
\begin{array}{rcl}
     0 & \mbox{ , if } & a = b
  \\ 1 & \mbox{ , if } & a\neq b
\end{array}
\right.$$
In order to adapt K-Means to the case of categorical features, there is an algorithm called K-Modes that is able to handle such features in an efficient way.
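To make the distance above concrete, the following minimal sketch sums it over the features of two records, which gives the simple matching dissimilarity that K-Modes is generally described as using. The helper name `matching_dissimilarity` and the toy records are assumptions for illustration only; they are not part of the kmodes package.

```python
import numpy as np

def matching_dissimilarity(a, b):
    """Count the features on which two categorical records disagree."""
    a = np.asarray(a)
    b = np.asarray(b)
    return int(np.sum(a != b))

# Three toy records with integer-coded categories (codes are labels, not magnitudes).
r1 = [0, 3, 1]
r2 = [0, 0, 1]
r3 = [0, 2, 1]

# Both pairs differ in exactly one feature.
d12 = matching_dissimilarity(r1, r2)
d13 = matching_dissimilarity(r1, r3)

# Euclidean distance instead treats code 3 as "further" from 0 than from 2,
# even though the codes carry no order.
e12 = float(np.linalg.norm(np.subtract(r1, r2)))
e13 = float(np.linalg.norm(np.subtract(r1, r3)))

print(d12, d13)
print(e12, e13)
```

The matching dissimilarity rates both pairs as equally different, whereas the Euclidean distance depends on how the categories happen to be numbered. This is the kind of artefact that makes K-Means questionable on integer-coded categories.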

The documentation of the kmodes package is available at https://pypi.org/project/kmodes/. To install it you will have to use pip, because it is not available in any Anaconda package repository. In general, installing packages with pip is not recommended for Anaconda users, but I have never had problems with this package.


The dataset 'USCrimeMDLP.csv' contains 300 samples and 79 categorical features, plus a class feature. Use that dataset for the following exercises:
1.	Split the data into train and test, keeping 200 samples for training. Use stratification. Always use random_state=0.
<br><br>
2.	Use the K-Means algorithm with 2 clusters, and evaluate it using the area under the ROC curve (AUC) as an external measure (we can do this because the class column is available). Take into account that the classes are {-1, 1} and the cluster names are {0, 1}. Did you get a strange value? Why can the AUC be so low? Correct the problem. (Hint: the prediction made by clustering assigns the name of the cluster as the class, but the names are just tags.) Is the K-Means algorithm adequate in this case?
<br><br>
3.	Consider the clustering you have obtained as a classification algorithm, i.e. each cluster predicting a class. Apart from the AUC, obtained above, calculate the classification report, area under the ROC curve (AUC), and confusion matrix. 
<br><br>
4.	Compare these results with the classification performed with random forest (n_estimators=100), and with SVC (C=2.0) and NuSVC (nu=0.001).
<br><br>
5.	Use K-Modes algorithm with 2 clusters, evaluating it in the same way as with K-Means. Comparing with the result in exercise 2, should we discard kmodes as an unsupervised classification procedure?
<br><br>
6.	Make summary comments of all the exercises as a general conclusion.

## Solution

In [1]:
seed = 0

### 1. Split the data into train and test, keeping 200 samples for training. Use stratification. Always use random_state=0. 

In [2]:
import numpy as np
import pandas as pd
from sklearn.utils import shuffle

df = pd.read_csv('USCrimeMDLP.csv')
df.head()

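| Unnamed: 0 | v1 | v3 | v4 | v6 | v8 | v9 | v11 | v12 | v13 | v14 | ... | v89 | v90 | v91 | v92 | v94 | v97 | v98 | v99 | v100 | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 3 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | -1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | 0 | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | -1 |
| 3 | 0 | 2 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 2 | ... | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | -1 |
| 4 | 0 | 1 | 3 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | -1 |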


In [3]:
from sklearn.model_selection import train_test_split

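# Features are all columns except the last one; the last column is the class label.
# Note: this also keeps the 'Unnamed: 0' column shown above, which looks like a saved row index.
X = df.iloc[:,:-1]
y = df.iloc[:,-1]

# Stratified split: 200 training samples, class proportions preserved in both parts.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=200, random_state=seed, stratify=y)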

In [4]:
y_train.value_counts()

 1    100
-1    100
Name: class, dtype: int64

In [5]:
y_test.value_counts()

 1    50
-1    50
Name: class, dtype: int64

### 2. Use the K-Means algorithm with 2 clusters, and evaluate it using the area under the ROC curve (AUC) as an external measure (we can do this because the class column is available). Take into account that the classes are {-1, 1} and the cluster names are {0, 1}. Did you get a strange value? Why can the AUC be so low? Correct the problem. (Hint: the prediction made by clustering assigns the name of the cluster as the class, but the names are just tags.) Is the K-Means algorithm adequate in this case?

Fit the K-Means algorithm on the training data, then predict the cluster of every sample in both the train and test sets.

In [6]:
from sklearn.cluster import KMeans

kmeans_model = KMeans(n_clusters=2, random_state=seed).fit(X_train)

train_pred_clusters = kmeans_model.predict(X_train)
test_pred_clusters = kmeans_model.predict(X_test)

In [7]:
train_pred_clusters

array([0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 0])

In [8]:
test_pred_clusters

array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1])

The predicted cluster labels are 0 or 1, whereas the class labels are -1 or 1. We therefore relabel cluster 0 as -1 so that both use the same pair of values.

In [9]:
train_pred_clusters[train_pred_clusters==0] = -1
test_pred_clusters[test_pred_clusters==0] = -1

In [10]:
train_pred_clusters

array([-1, -1,  1, -1, -1,  1, -1,  1, -1,  1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1,  1,  1,  1,  1,  1, -1, -1,  1,  1, -1,  1, -1, -1, -1, -1,
       -1, -1, -1, -1,  1, -1, -1, -1, -1, -1,  1, -1, -1, -1, -1,  1, -1,
       -1, -1, -1, -1,  1,  1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  1,
        1,  1, -1, -1, -1, -1, -1,  1, -1,  1,  1,  1,  1, -1,  1, -1, -1,
        1, -1, -1,  1, -1, -1, -1,  1, -1, -1,  1, -1, -1,  1, -1, -1,  1,
       -1, -1, -1,  1,  1, -1,  1, -1, -1,  1,  1, -1,  1,  1, -1,  1,  1,
       -1, -1, -1, -1, -1, -1,  1,  1, -1, -1, -1, -1,  1, -1, -1, -1,  1,
        1, -1, -1,  1,  1, -1, -1, -1, -1, -1,  1, -1, -1, -1, -1, -1, -1,
        1,  1, -1, -1,  1,  1, -1, -1, -1, -1,  1, -1,  1,  1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1,  1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1,  1, -1, -1,  1, -1, -1, -1, -1])

In [11]:
test_pred_clusters

array([-1, -1, -1, -1, -1, -1,  1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        1, -1,  1, -1, -1,  1, -1, -1, -1,  1, -1, -1,  1, -1, -1, -1, -1,
       -1, -1,  1, -1, -1, -1,  1,  1, -1, -1,  1, -1, -1, -1,  1, -1, -1,
       -1, -1,  1,  1, -1, -1,  1, -1,  1, -1, -1, -1, -1, -1, -1,  1, -1,
        1,  1, -1, -1,  1, -1, -1, -1, -1,  1, -1,  1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1,  1, -1, -1, -1, -1, -1, -1, -1, -1,  1])

We now compute the AUC score for the train and test data.

In [12]:
from sklearn.metrics import roc_auc_score

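# Note: roc_auc_score is documented as roc_auc_score(y_true, y_score).
# These calls pass the cluster labels first, so the clusters act as the reference labels.
train_auc = roc_auc_score(train_pred_clusters, y_train)
test_auc = roc_auc_score(test_pred_clusters, y_test)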

print('Train AUC: ' + str(train_auc))
print('Test AUC: ' + str(test_auc))

Train AUC: 0.1574107464839524
Test AUC: 0.23178994918125354


In [13]:
from sklearn.metrics import roc_auc_score

train_auc = roc_auc_score(train_pred_clusters*-1, y_train)
test_auc = roc_auc_score(test_pred_clusters*-1, y_test)

print('Train AUC: ' + str(train_auc))
print('Test AUC: ' + str(test_auc))

Train AUC: 0.8425892535160476
Test AUC: 0.7682100508187465
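An AUC far below 0.5 usually means the predictions are systematically inverted rather than uninformative. Here that follows from the fact that K-Means numbers its clusters arbitrarily: cluster 1 happened to collect mostly class -1 samples, so relabelling 0 as -1 matched the tags the wrong way round. Multiplying by -1, as in the cell above, swaps the two tags. A slightly more general sketch is to try both assignments and keep the one that agrees better with the known classes. The helper `align_binary_clusters` and the toy arrays are hypothetical and only illustrate the idea.

```python
import numpy as np

def align_binary_clusters(clusters, y_true, classes=(-1, 1)):
    """Map cluster IDs {0, 1} to class labels using the better of the two assignments."""
    clusters = np.asarray(clusters)
    y_true = np.asarray(y_true)
    neg, pos = classes
    direct = np.where(clusters == 1, pos, neg)    # cluster 1 -> positive class
    flipped = np.where(clusters == 1, neg, pos)   # cluster 1 -> negative class
    acc_direct = np.mean(direct == y_true)
    acc_flipped = np.mean(flipped == y_true)
    return direct if acc_direct >= acc_flipped else flipped

# Toy example: cluster 0 mostly corresponds to class 1, so the tags must be swapped.
clusters = np.array([0, 0, 1, 1, 0, 1])
y_true = np.array([1, 1, -1, -1, 1, 1])
aligned = align_binary_clusters(clusters, y_true)
print(aligned)
```

In practice the mapping should be chosen on the training labels only and then reused unchanged on the test set, so that the test labels are not used to tune the prediction.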


### 3. Consider the clustering you have obtained as a classification algorithm, i.e. each cluster predicting a class. Apart from the AUC, obtained above, calculate the classification report, area under the ROC curve (AUC), and confusion matrix.

Imports.

In [14]:
# Imports
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix

Utils.

In [15]:
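def print_report(y_clusters, y_true):
    # Note: the cluster labels are passed as the first (reference) argument of each metric,
    # so the rows of the confusion matrix are clusters and the columns are true classes.
    print('roc_auc_score: ' + str(roc_auc_score(y_clusters, y_true)) + '\n')
    print('confusion_matrix:')
    print(confusion_matrix(y_clusters, y_true))
    print('\n')
    print(classification_report(y_clusters, y_true))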
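The scikit-learn metrics used here are documented with the true labels as the first argument, whereas this notebook passes the cluster labels first. With hard labels the two orders generally give different AUC values: with the true labels first the score is the mean of the per-class recalls, and with the order swapped it is the mean of the per-class precisions. The sketch below rebuilds label arrays from the train confusion matrix reported in this exercise (`[[42, 99], [58, 1]]`, rows being clusters) and evaluates both orders. The variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Rebuild label arrays from the train confusion matrix of this exercise
# (rows = cluster label, columns = true class): [[42, 99], [58, 1]].
y_true = np.array([-1] * 42 + [1] * 99 + [-1] * 58 + [1] * 1)
y_pred = np.array([-1] * 141 + [1] * 59)

auc_true_first = roc_auc_score(y_true, y_pred)  # documented order: (y_true, y_score)
auc_pred_first = roc_auc_score(y_pred, y_true)  # order used in the cells of this notebook

print(auc_true_first)
print(auc_pred_first)
```

Under these counts the documented order gives about 0.215 and the swapped order about 0.157, the latter matching the train value printed earlier. Either way the score is well below 0.5, so the conclusion about inverted cluster tags does not depend on the argument order.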

Train data.

In [16]:
print_report(train_pred_clusters, y_train)

roc_auc_score: 0.1574107464839524

confusion_matrix:
[[42 99]
 [58  1]]


              precision    recall  f1-score   support

          -1       0.42      0.30      0.35       141
           1       0.01      0.02      0.01        59

    accuracy                           0.21       200
   macro avg       0.21      0.16      0.18       200
weighted avg       0.30      0.21      0.25       200



In [17]:
print_report(train_pred_clusters*-1, y_train)

roc_auc_score: 0.8425892535160476

confusion_matrix:
[[58  1]
 [42 99]]


              precision    recall  f1-score   support

          -1       0.58      0.98      0.73        59
           1       0.99      0.70      0.82       141

    accuracy                           0.79       200
   macro avg       0.78      0.84      0.78       200
weighted avg       0.87      0.79      0.79       200



Test data.

In [18]:
# Test
print_report(test_pred_clusters, y_test)

roc_auc_score: 0.23178994918125354

confusion_matrix:
[[29 48]
 [21  2]]


              precision    recall  f1-score   support

          -1       0.58      0.38      0.46        77
           1       0.04      0.09      0.05        23

    accuracy                           0.31       100
   macro avg       0.31      0.23      0.26       100
weighted avg       0.46      0.31      0.36       100



In [19]:
print_report(test_pred_clusters*-1, y_test)

roc_auc_score: 0.7682100508187465

confusion_matrix:
[[21  2]
 [29 48]]


              precision    recall  f1-score   support

          -1       0.42      0.91      0.58        23
           1       0.96      0.62      0.76        77

    accuracy                           0.69       100
   macro avg       0.69      0.77      0.67       100
weighted avg       0.84      0.69      0.71       100



### 4. Compare these results with the classification performed with random forest (n_estimators=100), and with SVC (C=2.0) and NuSVC (nu=0.001). 

Random Forest

In [20]:
from sklearn.ensemble import RandomForestClassifier

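# random_state is fixed as the exercise requires; the output below came from a run without it and may differ slightly.
rf_model = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_train, y_train)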
y_pred = rf_model.predict(X_test)

print(classification_report(y_pred, y_test))
print('roc_auc_score: ' + str(roc_auc_score(y_pred, y_test)) + '\n')

              precision    recall  f1-score   support

          -1       0.88      0.85      0.86        52
           1       0.84      0.88      0.86        48

    accuracy                           0.86       100
   macro avg       0.86      0.86      0.86       100
weighted avg       0.86      0.86      0.86       100

roc_auc_score: 0.860576923076923



SVC

In [21]:
from sklearn.svm import SVC

svc_model = SVC(C=2.0, gamma='auto').fit(X_train, y_train)
y_pred = svc_model.predict(X_test)

print(classification_report(y_pred, y_test))
print('roc_auc_score: ' + str(roc_auc_score(y_pred, y_test)) + '\n')

              precision    recall  f1-score   support

          -1       0.78      0.89      0.83        44
           1       0.90      0.80      0.85        56

    accuracy                           0.84       100
   macro avg       0.84      0.84      0.84       100
weighted avg       0.85      0.84      0.84       100

roc_auc_score: 0.8449675324675325



NuSVC

In [22]:
from sklearn.svm import NuSVC

nusvc_model = NuSVC(nu=0.001, gamma='scale').fit(X_train, y_train)
y_pred = nusvc_model.predict(X_test)

print(classification_report(y_pred, y_test))
print('roc_auc_score: ' + str(roc_auc_score(y_pred, y_test)) + '\n')

              precision    recall  f1-score   support

          -1       0.82      0.87      0.85        47
           1       0.88      0.83      0.85        53

    accuracy                           0.85       100
   macro avg       0.85      0.85      0.85       100
weighted avg       0.85      0.85      0.85       100

roc_auc_score: 0.851264552388599



We get the best performance with RandomForestClassifier.
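For a like-for-like comparison it can help to run all classifiers through one loop with the same split and the same metric call. The following is a minimal sketch on synthetic stand-in data (an assumption, not `USCrimeMDLP.csv`), using the hyperparameters from the exercise and passing the true labels first to `roc_auc_score`. The names `models` and `results` are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, NuSVC

# Synthetic stand-in data: integer-coded categorical features in which
# feature 0 carries the class signal (codes 0/1 for class -1, 2/3 for class 1).
rng = np.random.default_rng(0)
y = np.repeat([-1, 1], 150)
X = rng.integers(0, 4, size=(300, 10))
X[:, 0] = np.where(y == 1, 2, 0) + rng.integers(0, 2, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=200, random_state=0, stratify=y
)

models = {
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVC": SVC(C=2.0, gamma="auto"),
    "NuSVC": NuSVC(nu=0.001, gamma="scale"),
}

results = {}
for name, model in models.items():
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = roc_auc_score(y_te, y_hat)  # true labels first

for name, auc in results.items():
    print(f"{name}: {auc:.3f}")
```

The numbers from this sketch say nothing about the real dataset; the point is only that every model is scored by the same code path, which makes differences easier to attribute to the models themselves.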

### 5. Use K-Modes algorithm with 2 clusters, evaluating it in the same way as with K-Means. Comparing with the result in exercise 2, should we discard kmodes as an unsupervised classification procedure?

Fit the K-Modes algorithm on the training data, then predict the cluster of every sample in both the train and test sets.
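As background for the library call, here is a minimal NumPy sketch of the general K-Modes idea: assign each record to the mode with the fewest mismatching features, then recompute each cluster's mode column by column. This is an illustration under simplifying assumptions (fixed initial modes, no restarts), not the kmodes package's implementation. The function names `column_modes` and `k_modes_sketch` are hypothetical.

```python
import numpy as np

def column_modes(X):
    """Most frequent value in each column (ties go to the smallest value)."""
    modes = []
    for col in X.T:
        values, counts = np.unique(col, return_counts=True)
        modes.append(values[np.argmax(counts)])
    return np.array(modes)

def k_modes_sketch(X, init_modes, n_iter=10):
    """Alternate between assignment by mismatch count and mode updates."""
    X = np.asarray(X)
    modes = np.asarray(init_modes).copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Mismatch count between every record and every mode: shape (n, k).
        dist = (X[:, None, :] != modes[None, :, :]).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_modes = modes.copy()
        for k in range(len(modes)):
            members = X[labels == k]
            if len(members) > 0:  # keep the old mode if a cluster becomes empty
                new_modes[k] = column_modes(members)
        if np.array_equal(new_modes, modes):
            break  # modes stopped changing, so the assignment is stable
        modes = new_modes
    return labels, modes

# Toy data with two visible groups of categorical records.
X = np.array([
    [0, 0, 1],
    [0, 0, 0],
    [0, 1, 0],
    [2, 3, 3],
    [2, 3, 2],
    [3, 3, 3],
])
labels, modes = k_modes_sketch(X, init_modes=X[[0, 3]])
print(labels)
print(modes)
```

Because the update uses modes and mismatch counts rather than means and Euclidean distances, the result does not depend on how the categories are numbered.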

In [23]:
from kmodes.kmodes import KModes

km = KModes(n_clusters=2, random_state=seed).fit(X_train)

train_pred_clusters = km.predict(X_train).astype(int)
test_pred_clusters = km.predict(X_test).astype(int)

In [24]:
train_pred_clusters[train_pred_clusters==0] = -1
test_pred_clusters[test_pred_clusters==0] = -1

In [25]:
train_pred_clusters

array([-1, -1,  1, -1, -1,  1, -1,  1, -1,  1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1,  1,  1,  1,  1,  1, -1, -1,  1,  1, -1,  1, -1,  1, -1, -1,
       -1, -1, -1, -1,  1, -1, -1, -1, -1, -1,  1, -1, -1, -1, -1,  1, -1,
       -1, -1, -1, -1,  1,  1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  1,
        1,  1, -1, -1, -1, -1, -1,  1, -1,  1,  1,  1,  1, -1,  1, -1, -1,
        1, -1, -1,  1, -1, -1, -1,  1, -1, -1,  1, -1, -1,  1, -1, -1,  1,
       -1, -1, -1,  1,  1, -1,  1, -1, -1,  1,  1, -1,  1,  1, -1,  1,  1,
       -1, -1, -1,  1, -1, -1,  1,  1, -1, -1, -1, -1,  1, -1, -1, -1,  1,
        1, -1, -1,  1,  1, -1, -1, -1,  1, -1,  1, -1,  1, -1, -1, -1, -1,
        1,  1, -1, -1,  1,  1, -1, -1, -1, -1,  1, -1,  1,  1, -1, -1, -1,
       -1, -1, -1,  1, -1, -1, -1, -1,  1, -1, -1, -1, -1, -1, -1, -1,  1,
       -1, -1,  1, -1,  1,  1, -1, -1,  1, -1, -1, -1, -1])

In [26]:
test_pred_clusters

array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  1, -1, -1, -1, -1,
        1, -1,  1, -1, -1,  1, -1, -1, -1,  1, -1, -1,  1,  1, -1, -1, -1,
       -1, -1,  1, -1, -1, -1,  1,  1, -1, -1,  1, -1, -1, -1,  1, -1, -1,
       -1, -1,  1,  1, -1, -1,  1, -1,  1,  1, -1, -1, -1, -1, -1,  1, -1,
        1,  1, -1, -1,  1, -1, -1, -1, -1,  1, -1,  1, -1, -1, -1, -1, -1,
       -1, -1,  1, -1, -1,  1, -1, -1, -1, -1, -1, -1, -1, -1,  1])

We now compute the AUC score for the train and test data, first with the raw relabelling and then with the cluster tags swapped.

In [27]:
from sklearn.metrics import roc_auc_score

train_auc = roc_auc_score(train_pred_clusters, y_train)
test_auc = roc_auc_score(test_pred_clusters, y_test)

print('Train AUC: ' + str(train_auc))
print('Test AUC: ' + str(test_auc))

Train AUC: 0.14650432050274945
Test AUC: 0.21413721413721412


In [28]:
from sklearn.metrics import roc_auc_score

train_auc = roc_auc_score(train_pred_clusters*-1, y_train)
test_auc = roc_auc_score(test_pred_clusters*-1, y_test)

print('Train AUC: ' + str(train_auc))
print('Test AUC: ' + str(test_auc))

Train AUC: 0.8534956794972507
Test AUC: 0.7858627858627859


### 6. Make summary comments of all the exercises as a general conclusion.