# Task 4

When clustering, categorical features are problematic, because the usual distance for categorical values gives poor results with algorithms such as K-Means.

That distance is given by
$$d(a, b) = \left\{
\begin{array}{rcl}
     0 & \mbox{ , if } & a = b
  \\ 1 & \mbox{ , if } & a\neq b
\end{array}
\right.$$
In order to adapt K-Means to the case of categorical features, there is an algorithm called K-Modes that is able to handle such features in an efficient way.
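As a rough illustration of the idea behind K-Modes, the sketch below applies the matching distance to whole records (counting the positions where two records differ) and uses the per-column mode as the cluster representative. The function names and the toy records are hypothetical and are not part of the kmodes package.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Count the positions where two categorical records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of features")
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Return the most frequent category of each column (the cluster 'mode')."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*records)]


# Toy cluster with three categorical features (illustrative values only).
cluster = [
    ["red", "small", "round"],
    ["red", "large", "round"],
    ["blue", "small", "round"],
]

centroid = column_modes(cluster)
distances = [matching_dissimilarity(record, centroid) for record in cluster]

print(centroid)   # ['red', 'small', 'round']
print(distances)  # [0, 1, 1]
```

K-Means would instead average the coded values, which has no meaning when the codes are arbitrary category names.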

The documentation of the kmodes package is available at https://pypi.org/project/kmodes/. You will have to install it with pip, because it is not available in any Anaconda package repository. Installing packages with pip is generally not recommended for Anaconda users, but I have never had problems with this package.


The dataset 'USCrimeMDLP.csv' contains 300 samples and 79 categorical features, plus a class feature. Use that dataset for the following exercises:
1.	Split the data into train and test, keeping 200 samples for training. Use stratification. Always use random_state=0.
<br><br>
2.	Use the K-Means algorithm with 2 clusters, and evaluate it using the area under the ROC curve (AUC) as an external measure (we can do this because the class column is available). Take into account that the classes are {-1, 1} and the cluster names are {0, 1}. Did you get a strange value? Why can the AUC be so low? Correct the problem. (Hint: the prediction made by clustering assigns the name of the cluster as the class, but the names are just tags.) Is the K-Means algorithm adequate in this case?
<br><br>
3.	Consider the clustering you have obtained as a classification algorithm, i.e. each cluster predicting a class. Apart from the AUC, obtained above, calculate the classification report, area under the ROC curve (AUC), and confusion matrix. 
<br><br>
4.	Compare these results with the classification performed with random forest (n_estimators=100), and with SVC (C=2.0) and NuSVC (nu=0.001).
<br><br>
5.	Use K-Modes algorithm with 2 clusters, evaluating it in the same way as with K-Means. Comparing with the result in exercise 2, should we discard kmodes as an unsupervised classification procedure?
<br><br>
6.	Make summary comments of all the exercises as a general conclusion.

## Solution

### 1. Split the data into train and test, keeping 200 samples for training. Use stratification. Always use random_state=0. 

In [1]:
import numpy as np
import pandas as pd
from sklearn.utils import shuffle

df = pd.read_csv('USCrimeMDLP.csv')
df = shuffle(df).reset_index(drop=True)
df.head()

Unnamed: 0,v1,v3,v4,v6,v8,v9,v11,v12,v13,v14,...,v89,v90,v91,v92,v94,v97,v98,v99,v100,class
0,1,0,0,1,1,1,2,1,0,0,...,0,2,0,1,0,1,1,1,1,1
1,0,2,2,0,1,1,1,0,0,1,...,1,1,1,0,0,0,0,0,0,1
2,0,0,3,0,1,1,1,0,0,2,...,0,0,1,0,0,0,0,1,0,-1
3,0,2,2,0,1,1,0,1,0,1,...,1,0,1,0,1,0,1,1,0,1
4,0,1,3,0,0,0,1,0,0,1,...,1,1,1,0,0,0,0,0,0,-1


In [2]:
df_train = df[:200]
df_test = df[-100:]
print('Train shape: ' + str(df_train.shape))
print('Test shape: ' + str(df_test.shape))

Train shape: (200, 80)
Test shape: (100, 80)


In [3]:
X_train = df_train.iloc[:,:-1]
y_train = df_train.iloc[:,-1]

X_test = df_test.iloc[:,:-1]
y_test = df_test.iloc[:,-1]

In [4]:
y_train.value_counts()

-1    101
 1     99
Name: class, dtype: int64

In [5]:
y_test.value_counts()

 1    51
-1    49
Name: class, dtype: int64
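The cells above shuffle the frame and slice it, which gives a nearly balanced split here but is neither stratified nor seeded. A minimal sketch of a stratified split with `random_state=0`, assuming scikit-learn's `train_test_split` and using a synthetic stand-in for the data (the real features would come from `USCrimeMDLP.csv`):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 300 samples, 79 integer-coded features,
# and a balanced class column with values -1 and 1.
rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(300, 79))
y = np.repeat([-1, 1], 150)

# train_size=200 keeps 200 samples for training; stratify=y preserves the
# class proportions in both parts; random_state=0 makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=200, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape)
print((y_train == 1).sum(), (y_test == 1).sum())
```

With a pandas frame, the same call works on `df.drop(columns='class')` and `df['class']`.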

### 2. Use the K-Means algorithm with 2 clusters, and evaluate it using the area under the ROC curve (AUC) as an external measure (we can do this because the class column is available). Take into account that the classes are {-1, 1} and the cluster names are {0, 1}. Did you get a strange value? Why can the AUC be so low? Correct the problem. (Hint: the prediction made by clustering assigns the name of the cluster as the class, but the names are just tags.) Is the K-Means algorithm adequate in this case?

Fit the KModes algorithm on the training data. Then predict the clusters of the train and test data.

In [6]:
from kmodes.kmodes import KModes

km = KModes(n_clusters=2).fit(X_train)

train_pred_clusters = km.predict(X_train)
test_pred_clusters = km.predict(X_test)

The predicted cluster names are 0 or 1. Their data type is unsigned, so we cannot assign negative values to them. To keep things simple, we recode the class column instead, mapping -1 to 0.

In [7]:
print(type(y_train[0]))
print(type(test_pred_clusters[0]))

<class 'numpy.int64'>
<class 'numpy.uint16'>


In [8]:
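# Recode class -1 as 0. replace() returns a new Series, which avoids
# assigning into a slice of the original DataFrame.
y_train = y_train.replace(-1, 0)
y_test = y_test.replace(-1, 0)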

In [9]:
y_train.value_counts()

0    101
1     99
Name: class, dtype: int64

In [10]:
y_test.value_counts()

1    51
0    49
Name: class, dtype: int64

We are going to calculate the AUC score for train and test data.

Note: how do we know whether cluster 1 corresponds to one class and cluster 0 to the other? If the labels come out in the order we want (1 = 1 and -1 = 0), the evaluation is correct, but if they come out the other way around (1 = 0 and -1 = 1), it is not.
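One way to settle which cluster stands for which class, assuming exactly two clusters and 0/1 class labels, is to check the agreement with the known classes and swap the cluster names when fewer than half of the samples agree. The helper name `align_cluster_labels` is hypothetical. Because it uses the true labels, it only makes sense for external evaluation.

```python
def align_cluster_labels(y_true, clusters):
    """Swap binary cluster names (0 <-> 1) if they disagree with most labels."""
    matches = sum(int(t == c) for t, c in zip(y_true, clusters))
    agreement = matches / len(y_true)
    if agreement < 0.5:
        # The tags are arbitrary, so flipping them loses no information.
        return [1 - c for c in clusters]
    return list(clusters)


# Toy example: the clustering separates the classes well, but with swapped names.
y_true = [0, 0, 0, 1, 1, 1]
clusters = [1, 1, 0, 0, 0, 0]

aligned = align_cluster_labels(y_true, clusters)
print(aligned)  # [0, 0, 1, 1, 1, 1]
```

A swapped naming is also what produces an AUC far below 0.5 in the first place: the clusters carry the information, only under the wrong names.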

In [11]:
from sklearn.metrics import roc_auc_score

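# scikit-learn documents the signature as roc_auc_score(y_true, y_score).
# These calls pass the cluster labels first, so the clusters act as the
# reference labels in the values printed below.
train_auc = roc_auc_score(train_pred_clusters, y_train)
test_auc = roc_auc_score(test_pred_clusters, y_test)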

print('Train AUC: ' + str(train_auc))
print('Test AUC: ' + str(test_auc))

Train AUC: 0.8566778557623844
Test AUC: 0.8833333333333333


### 3. Consider the clustering you have obtained as a classification algorithm, i.e. each cluster predicting a class. Apart from the AUC, obtained above, calculate the classification report, area under the ROC curve (AUC), and confusion matrix.

Imports.

In [16]:
# Imports
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix

Train data.

In [22]:
# Train
print('roc_auc_score: ' + str(roc_auc_score(train_pred_clusters, y_train)) + '\n')
print('confusion_matrix:')
print(confusion_matrix(train_pred_clusters, y_train))
print('\n')
print(classification_report(train_pred_clusters, y_train))

roc_auc_score: 0.8566778557623844

confusion_matrix:
[[79  8]
 [22 91]]


              precision    recall  f1-score   support

           0       0.78      0.91      0.84        87
           1       0.92      0.81      0.86       113

    accuracy                           0.85       200
   macro avg       0.85      0.86      0.85       200
weighted avg       0.86      0.85      0.85       200



Test data.

In [23]:
# Test
print('roc_auc_score: ' + str(roc_auc_score(test_pred_clusters, y_test)) + '\n')
print('confusion_matrix:')
print(confusion_matrix(test_pred_clusters, y_test))
print('\n')
print(classification_report(test_pred_clusters, y_test))

roc_auc_score: 0.8833333333333333

confusion_matrix:
[[38  2]
 [11 49]]


              precision    recall  f1-score   support

           0       0.78      0.95      0.85        40
           1       0.96      0.82      0.88        60

    accuracy                           0.87       100
   macro avg       0.87      0.88      0.87       100
weighted avg       0.89      0.87      0.87       100
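The scikit-learn metrics take the true labels as their first argument, while the cells above pass the cluster labels first. That is why the supports in the reports (87/113 and 40/60) are cluster sizes rather than class sizes. For hard 0/1 predictions the AUC reduces to the mean of the true positive rate and the true negative rate, so the effect of the argument order can be checked directly from the train confusion matrix. This is a sketch: the helper `auc_hard_labels` is hypothetical and only valid for binary 0/1 labels.

```python
def auc_hard_labels(y_true, y_pred):
    """AUC for hard 0/1 predictions: the mean of TPR and TNR."""
    pred_for_pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    pred_for_neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    tpr = sum(pred_for_pos) / len(pred_for_pos)
    tnr = 1 - sum(pred_for_neg) / len(pred_for_neg)
    return (tpr + tnr) / 2


# Rebuild (cluster, class) pairs from the train confusion matrix above,
# whose rows are clusters and whose columns are classes.
pairs = [(0, 0)] * 79 + [(0, 1)] * 8 + [(1, 0)] * 22 + [(1, 1)] * 91
clusters = [cluster for cluster, _ in pairs]
classes = [cls for _, cls in pairs]

cluster_as_reference = auc_hard_labels(clusters, classes)
class_as_reference = auc_hard_labels(classes, clusters)

print(round(cluster_as_reference, 4))  # 0.8567, the value printed above
print(round(class_as_reference, 4))    # 0.8507
```

The two orders give close but different values here. Passing the class column first is the order that answers the question "how well do the clusters predict the class".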



### 4. Compare these results with the classification performed with random forest (n_estimators=100), and with SVC (C=2.0) and NuSVC (nu=0.001). 

Random Forest

In [25]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
y_pred = rf_model.predict(X_test)

print(classification_report(y_pred, y_test))
print('roc_auc_score: ' + str(roc_auc_score(y_pred, y_test)) + '\n')

              precision    recall  f1-score   support

           0       0.86      0.91      0.88        46
           1       0.92      0.87      0.90        54

    accuracy                           0.89       100
   macro avg       0.89      0.89      0.89       100
weighted avg       0.89      0.89      0.89       100

roc_auc_score: 0.8917069243156199



SVC

In [27]:
from sklearn.svm import SVC

svc_model = SVC(C=2.0, gamma='auto').fit(X_train, y_train)
y_pred = svc_model.predict(X_test)

print(classification_report(y_pred, y_test))
print('roc_auc_score: ' + str(roc_auc_score(y_pred, y_test)) + '\n')

              precision    recall  f1-score   support

           0       0.82      0.89      0.85        45
           1       0.90      0.84      0.87        55

    accuracy                           0.86       100
   macro avg       0.86      0.86      0.86       100
weighted avg       0.86      0.86      0.86       100

roc_auc_score: 0.8626262626262625



NuSVC

In [29]:
from sklearn.svm import NuSVC

nusvc_model = NuSVC(nu=0.001, gamma='scale').fit(X_train, y_train)
y_pred = nusvc_model.predict(X_test)

print(classification_report(y_pred, y_test))
print('roc_auc_score: ' + str(roc_auc_score(y_pred, y_test)) + '\n')

              precision    recall  f1-score   support

           0       0.84      0.89      0.86        46
           1       0.90      0.85      0.88        54

    accuracy                           0.87       100
   macro avg       0.87      0.87      0.87       100
weighted avg       0.87      0.87      0.87       100

roc_auc_score: 0.8715780998389694



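We get the best test performance with RandomForestClassifier (accuracy 0.89, AUC 0.892), followed by NuSVC (0.87, 0.872) and SVC (0.86, 0.863). The clustering evaluated in exercise 3 (accuracy 0.87, AUC 0.883) is in the same range, even though it never used the class column for training.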
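The three classifiers can also be compared in a single loop, with `random_state=0` fixed where the estimator accepts it and with the true labels passed first to the metrics. This is a sketch on synthetic stand-in data, not on `USCrimeMDLP.csv`, so its scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, NuSVC

# Synthetic stand-in (assumption): 300 samples, 10 integer-coded features.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 150)
X = rng.integers(0, 3, size=(300, 10))
X[:, :3] += y[:, None]  # shift three columns by the class so they carry signal

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=200, stratify=y, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svc": SVC(C=2.0, gamma="auto"),
    "nusvc": NuSVC(nu=0.001, gamma="scale"),
}

results = {}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    # True labels go first in scikit-learn metrics.
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_pred),
    }

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, auc={scores['auc']:.3f}")
```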

### 5. Use K-Modes algorithm with 2 clusters, evaluating it in the same way as with K-Means. Comparing with the result in exercise 2, should we discard kmodes as an unsupervised classification procedure?

In [14]:
# 5.

### 6. Make summary comments of all the exercises as a general conclusion.

In [15]:
# 6.