# Erwin Antepuesto

Instructions:
-

1. Read the article: https://www.sciencedirect.com/science/article/abs/pii/S0031320322001753
2. Replicate the study using the same dataset.
3. Read articles about Adjusted Rand Index, Normalized Mutual Information, and Fowlkes-Mallows Index (only use papers published in IEEE, ScienceDirect, SpringerLink, or Taylor & Francis).
4. Aside from the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), use the Fowlkes-Mallows Index (FMI), and compare the result of each performance index.
5. Compare and contrast each performance index, what are the advantages and disadvantages of ARI, NMI, and FMI, and when to use each?
6. Using Kmodes and Hierarchical Clustering, use the same dataset and perform categorical data clustering, use FMI, ARI, and NMI for the comparison of performance.
7. Write your report using LaTeX. Your report should be focused on the "why's and the what's" of each performance metric (i.e. why is FMI always greater than ARI and NMI? What's the problem with ARI and NMI?).

### 1. Read the article: https://www.sciencedirect.com/science/article/abs/pii/S0031320322001753
### 2. Replicate the study using the same dataset.

Some datasets in the study were not available. The second dataset repository link listed in the paper, https://cs.nyu.edu/roweis/data.html, returned a 404 error.

The table below lists the datasets along with their details and availability status.

In [1]:
# Datasets
import pandas as pd

data = {
    'Data set': ['Soybean', 'Zoo', 'Heart disease', 'Breast cancer', 'Dermatology', 'Letters(E,F)', 'DNA', 'Mushroom', 'Iris', 'Isolet', 'COIL20', 'OpticalDigits', 'PenDigits'],
    'Type': ['Categorical', 'Categorical', 'Categorical', 'Categorical', 'Categorical', 'Categorical', 'Categorical', 'Categorical', 'Numerical', 'Numerical', 'Numerical', 'Numerical', 'Numerical'],
    'Availability': ['Available', 'Available', 'Available', 'Available', 'Available', 'Not Available', 'Not Available', 'Available', 'Available', 'Not Available', 'Not Available', 'Not Available', 'Not Available']
}

df = pd.DataFrame(data)
df

Unnamed: 0,Data set,Type,Availability
0,Soybean,Categorical,Available
1,Zoo,Categorical,Available
2,Heart disease,Categorical,Available
3,Breast cancer,Categorical,Available
4,Dermatology,Categorical,Available
5,"Letters(E,F)",Categorical,Not Available
6,DNA,Categorical,Not Available
7,Mushroom,Categorical,Available
8,Iris,Numerical,Available
9,Isolet,Numerical,Not Available


In [21]:
# Imports

# Libraries Used
from ucimlrepo import fetch_ucirepo 
import pandas as pd
import numpy as np
from sklearn.metrics.cluster import adjusted_rand_score, normalized_mutual_info_score, fowlkes_mallows_score
from sklearn.cluster import KMeans
from sklearn.manifold import SpectralEmbedding
from sklearn.preprocessing import OneHotEncoder
import networkx as nx
import itertools
import warnings
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import LabelEncoder 
  
# Datasets Import
# Soybean
soybean_small = fetch_ucirepo(id=91)
sbf = soybean_small.data.features
sbt = soybean_small.data.targets
soybean_df = pd.merge(sbf, sbt, left_index=True, right_index=True)

# Zoo
zoo = fetch_ucirepo(id=111)
zg = zoo.data.features
zt = zoo.data.targets
zoo_df = pd.merge(zg, zt, left_index=True, right_index=True)

# Heart Disease
heart_disease = fetch_ucirepo(id=45)
hdg = heart_disease.data.features
hdt = heart_disease.data.targets
heart_disease_df = pd.merge(hdg, hdt, left_index=True, right_index=True)

# Breast Cancer
breast_cancer_wisconsin_original = fetch_ucirepo(id=15)
bcf = breast_cancer_wisconsin_original.data.features
bct = breast_cancer_wisconsin_original.data.targets
breast_cancer_df = pd.merge(bcf, bct, left_index=True, right_index=True)

# Dermatology
dermatology = fetch_ucirepo(id=33)
df = dermatology.data.features
dt = dermatology.data.targets
dermatology_df = pd.merge(df, dt, left_index=True, right_index=True)

# Mushroom
mushroom = fetch_ucirepo(id=73)
mf = mushroom.data.features
mt = mushroom.data.targets
mushroom_df = pd.merge(mf, mt, left_index=True, right_index=True)

# Iris
iris = fetch_ucirepo(id=53)
if_ = iris.data.features
it_ = iris.data.targets
iris_df = pd.merge(if_, it_, left_index=True, right_index=True)

soybean_df = soybean_df.dropna()
zoo_df = zoo_df.dropna()
heart_disease_df = heart_disease_df.dropna()
dermatology_df = dermatology_df.dropna()
breast_cancer_df = breast_cancer_df.dropna()
mushroom_df = mushroom_df.dropna()
iris_df = iris_df.dropna()

In [22]:
def jaccard_coefficient(set1, set2):
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    return intersection / union

def ochiai_coefficient(set1, set2):
    intersection = len(set1.intersection(set2))
    denominator = np.sqrt(len(set1) * len(set2))
    return intersection / denominator

def overlap_coefficient(set1, set2):
    intersection = len(set1.intersection(set2))
    min_length = min(len(set1), len(set2))
    return intersection / min_length

def dice_coefficient(set1, set2):
    intersection = len(set1.intersection(set2))
    denominator = len(set1) + len(set2)
    return 2 * intersection / denominator

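def graph_based_representation(data, p=10):
    # Build a feature-by-feature Jaccard similarity matrix and embed it into
    # p dimensions (the default matches p = 10 set in the driver cell).
    num_samples, num_features = data.shape
    similarity_matrix = np.zeros((num_features, num_features))
    for i, j in itertools.combinations(range(num_features), 2):
        # NOTE: set(data[:, i]) is the set of distinct values in column i; for a
        # one-hot column that is {0.0, 1.0}, not the set of rows with the category.
        similarity_matrix[i, j] = jaccard_coefficient(set(data[:, i]), set(data[:, j]))
        similarity_matrix[j, i] = similarity_matrix[i, j]
    # With its default affinity, SpectralEmbedding treats each row of the matrix
    # as a feature vector and builds a nearest-neighbors graph from those rows.
    embedding = SpectralEmbedding(n_components=p)
    representation_matrix = embedding.fit_transform(similarity_matrix)
    return representation_matrix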

def joint_operation(data, representation_matrix):
    return np.dot(data, representation_matrix)

def mean_operation(data, representation_matrix):
    return np.mean(np.dot(data, representation_matrix), axis=1)

def perform_clustering(data, k):
    kmeans = KMeans(n_clusters=k, n_init=10)
    return kmeans.fit_predict(data)
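
The Jaccard call in `graph_based_representation` compares the sets of distinct values in two columns. The sketch below illustrates what that yields on a toy one-hot matrix and contrasts it with one possible alternative: comparing the sets of row indices where each category is present. The toy matrix, the variable names, and the row-index reading of the paper's similarity are assumptions made for illustration, not the authors' confirmed method.

```python
import numpy as np

def jaccard(set1, set2):
    """Jaccard coefficient |A & B| / |A | B| of two Python sets."""
    return len(set1 & set2) / len(set1 | set2)

# Toy one-hot matrix: 4 samples, columns = [color=red, color=blue, size=big]
onehot = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0],
])

# Sets of distinct cell values: every column is {0, 1}, so every pair looks identical
value_sets = [set(onehot[:, j]) for j in range(onehot.shape[1])]
sim_values = jaccard(value_sets[0], value_sets[2])

# Assumed alternative: sets of row indices where the category is present
row_sets = [set(np.flatnonzero(onehot[:, j])) for j in range(onehot.shape[1])]
sim_red_big = jaccard(row_sets[0], row_sets[2])    # rows {0, 1} vs rows {0, 2}
sim_red_blue = jaccard(row_sets[0], row_sets[1])   # rows {0, 1} vs rows {2, 3}

print(sim_values, sim_red_big, sim_red_blue)
```

With value sets, any two columns that each contain both a 0 and a 1 get a similarity of 1, so the matrix carries little information about the data. With row-index sets, mutually exclusive categories such as red and blue score 0 and partially co-occurring ones score in between.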

### 3. Read articles about Adjusted Rand Index, Normalized Mutual Information, and Fowlkes-Mallows Index (only use papers published in IEEE, ScienceDirect, SpringerLink, or Taylor & Francis).

### 4. Aside from the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), use the Fowlkes-Mallows Index (FMI), and compare the result of each performance index.

In [37]:
p = 10
q = 10 
k = 3

results = []

datasets = ["soybean_df", "zoo_df", "heart_disease_df", "dermatology_df", "breast_cancer_df", "mushroom_df", "iris_df"]
for dataset_name in datasets:
    dataset = globals()[dataset_name]
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=UserWarning)
        try:
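            # NOTE: X is the full frame, so the label column (last column) is
            # one-hot encoded together with the features.
            X = dataset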
            enc = OneHotEncoder()
            X_encoded = enc.fit_transform(X)
            representation_matrix = graph_based_representation(X_encoded.toarray())
            integrated_data = joint_operation(X_encoded.toarray(), representation_matrix)
            labels = perform_clustering(integrated_data, k)

            true_labels = dataset.iloc[:, -1] 
            ARI = adjusted_rand_score(true_labels, labels)
            NMI = normalized_mutual_info_score(true_labels, labels)
            FMI = fowlkes_mallows_score(true_labels, labels)
            results.append([dataset_name, ARI, NMI, FMI])
        except UserWarning as e:
            print(f"Warning: {e}")

results_df = pd.DataFrame(results, columns=["Dataset", "ARI", "NMI", "FMI"])
results_df

Unnamed: 0,Dataset,ARI,NMI,FMI
0,soybean_df,0.537097,0.648384,0.680706
1,zoo_df,0.700482,0.703857,0.799681
2,heart_disease_df,0.213258,0.209126,0.485568
3,dermatology_df,0.494677,0.595919,0.65619
4,breast_cancer_df,0.36276,0.383401,0.653116
5,mushroom_df,0.39347,0.435907,0.659381
6,iris_df,0.572363,0.579197,0.715891
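
The driver cell uses `k = 3` for every dataset, encodes the whole frame including the label column, and leaves `KMeans` unseeded. A minimal sketch of one alternative setup is shown here on a toy frame: features are split from the last column, `k` is taken from the number of distinct true labels, and `random_state` is fixed so repeated runs return the same labels. The toy data and the name `toy_df` are hypothetical, and whether the paper sets `k` this way is an assumption.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics.cluster import adjusted_rand_score
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame; the last column plays the role of the class label
toy_df = pd.DataFrame({
    "color": ["red", "red", "blue", "blue", "red", "blue"],
    "size": ["big", "big", "small", "small", "big", "small"],
    "label": ["a", "a", "b", "b", "a", "b"],
})

X = toy_df.iloc[:, :-1]              # features only, label column excluded
true_labels = toy_df.iloc[:, -1]
k = int(true_labels.nunique())       # one cluster per true class

X_encoded = OneHotEncoder().fit_transform(X).toarray()

# A fixed random_state makes the K-means initialization repeatable
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_encoded)
ari = adjusted_rand_score(true_labels, labels)
print(k, X_encoded.shape, ari)
```

Keeping the label out of `X` matters for the comparison, since a clustering that has seen the label is easier to score well on every index.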


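### Analysis of Dataset Performance

The values below come from an earlier run of the cell (the In [33] output in section 7), so they differ from the In [37] table above.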
##### Soybean Dataset (soybean_df)

ARI: 0.339349
NMI: 0.475268
FMI: 0.530018

Interpretation: Moderate clustering performance, with NMI and FMI somewhat higher than ARI.

##### Zoo Dataset (zoo_df)
ARI: 0.723236
NMI: 0.732977
FMI: 0.813887

Interpretation: High performance across all indices, indicating effective clustering with good agreement between predicted and true labels.

##### Heart Disease Dataset (heart_disease_df)

ARI: 0.123240
NMI: 0.147469
FMI: 0.422173

Interpretation: Poor performance in ARI and NMI, but moderate in FMI, suggesting some agreement in pairwise precision and recall despite overall poor clustering.

##### Dermatology Dataset (dermatology_df)

ARI: 0.416896
NMI: 0.551940
FMI: 0.598561

Interpretation: Moderate performance. NMI and FMI are clearly higher than ARI, indicating a decent amount of shared information between clusterings.

##### Breast Cancer Dataset (breast_cancer_df)

ARI: 0.556564
NMI: 0.413921
FMI: 0.787260

Interpretation: Moderate ARI and high FMI suggest reasonably effective clustering with high pairwise precision and recall, though NMI is relatively lower.

##### Mushroom Dataset (mushroom_df)

ARI: 0.342453
NMI: 0.380637
FMI: 0.629212

Interpretation: Moderate performance with the best results in FMI, indicating reasonable pairwise precision and recall.

##### Iris Dataset (iris_df)

ARI: 0.655938
NMI: 0.664691
FMI: 0.771520

Interpretation: The iris dataset shows fairly strong performance across all indices, particularly in FMI, indicating effective clustering with good pairwise precision and recall.

### 5. Compare and contrast each performance index, what are the advantages and disadvantages of ARI, NMI, and FMI, and when to use each?

##### Adjusted Rand Index (ARI)

Advantages:
Chance Adjustment: ARI adjusts for the chance grouping of elements, which makes it more robust and reliable for assessing the similarity between the true and predicted clusters.

Bound Range: ARI is at most 1, which indicates perfect agreement. Values near 0 indicate agreement no better than random labeling, and negative values indicate agreement worse than expected by chance.

Disadvantages:
Sensitivity to Cluster Size: ARI can be sensitive to the number of clusters and the size of the dataset, which may make it less suitable for datasets with a large number of small clusters.

When to Use:
ARI is particularly useful when the true cluster assignments are known, and an objective measure of the agreement between the predicted and true clusters is needed. It is suitable for datasets where the number of clusters is relatively stable.
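
The chance adjustment can be sketched directly from pair counts. The function below is a plain-Python illustration of the standard contingency-table formula, not scikit-learn's implementation; the name `ari_from_counts` and the toy labelings are made up for this example.

```python
from collections import Counter
from math import comb

def ari_from_counts(labels_true, labels_pred):
    """Adjusted Rand Index from the contingency table of two labelings."""
    n = len(labels_true)
    # Pairs grouped together in both labelings
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs grouped together within each labeling on its own
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)   # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:                       # degenerate case, e.g. one cluster in both
        return 1.0
    return (sum_cells - expected) / (maximum - expected)

ari_perfect = ari_from_counts([0, 0, 1, 1], [1, 1, 0, 0])                 # same partition, relabeled
ari_partial = ari_from_counts([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])     # partial agreement
ari_discordant = ari_from_counts([0, 0, 1, 1], [0, 1, 0, 1])              # every true pair is split
print(ari_perfect, ari_partial, ari_discordant)
```

The third case comes out negative because the observed number of agreeing pairs (zero) is below the number expected by chance.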

##### Normalized Mutual Information (NMI)
Advantages:
Normalization: NMI is normalized between 0 and 1, where 1 indicates identical clusterings (up to a relabeling) and 0 indicates no shared information. This makes it easy to interpret and compare across different datasets.

Information-Theoretic Basis: NMI measures the mutual information between the two labelings relative to their entropies, so it does not depend on how the clusters are named and does not require both labelings to have the same number of clusters.

Disadvantages:
Potential Bias: NMI is not adjusted for chance. It tends to increase with the number of clusters, so comparing clusterings with very different numbers of clusters can lead to misleadingly high scores.

When to Use:
NMI is useful for comparing the effectiveness of different clustering algorithms on the same dataset, especially when the compared clusterings have a similar number of clusters. It is also beneficial in scenarios where clusters have hierarchical relationships.
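
The bias toward many clusters can be checked with a small experiment. The sketch below implements NMI in plain Python with the arithmetic mean of the two entropies as normalizer (this is scikit-learn's default `average_method`; the helper names and the random-labeling setup are assumptions for illustration) and scores purely random labelings with a growing number of clusters.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (in nats) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(labels_true, labels_pred):
    """Mutual information divided by the arithmetic mean of the two entropies."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    mi = sum(
        (nij / n) * math.log(nij * n / (count_true[a] * count_pred[b]))
        for (a, b), nij in joint.items()
    )
    normalizer = (entropy(labels_true) + entropy(labels_pred)) / 2
    return mi / normalizer if normalizer > 0 else 1.0

rng = random.Random(0)
labels_true = [i % 3 for i in range(150)]   # three balanced classes

# Average NMI of 20 random labelings for each number of clusters k
mean_nmi = {}
for k in (2, 10, 50):
    scores = [nmi(labels_true, [rng.randrange(k) for _ in labels_true]) for _ in range(20)]
    mean_nmi[k] = sum(scores) / len(scores)
print(mean_nmi)
```

None of these labelings carries information about the classes, yet the average score rises with `k`, which is the effect a chance-adjusted index is meant to remove.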

##### Fowlkes-Mallows Index (FMI)
Advantages:
Interpretability: FMI is the geometric mean of pairwise precision and recall, making it intuitive and easy to understand. It directly relates to the probability that a pair of elements is correctly classified together.

Bound Range: Like NMI, FMI ranges from 0 to 1, where 1 indicates perfect clustering performance.

Disadvantages:
Pairwise Comparisons: FMI considers only pairs of points, which might not capture the overall structure of the data effectively, especially in complex multi-dimensional datasets. It is also not adjusted for chance, so random labelings score above 0.

When to Use:
FMI is particularly useful when the focus is on the accuracy of clustering in terms of pairs of points. It is suitable for applications where pairwise relationships are more critical than the global structure, such as in image segmentation or document clustering.
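
The pairwise definition can be written out by counting pairs explicitly. This is a brute-force sketch for small inputs, with a hypothetical function name and toy labelings; `fowlkes_mallows_score` in scikit-learn computes the same quantity from a contingency table instead.

```python
from itertools import combinations
from math import sqrt

def fmi_from_pairs(labels_true, labels_pred):
    """Fowlkes-Mallows Index as the geometric mean of pairwise precision and recall."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1          # pair together in both labelings
        elif same_pred:
            fp += 1          # pair together only in the predicted clustering
        elif same_true:
            fn += 1          # pair together only in the true labels
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return sqrt(precision * recall)

# Pairwise precision is 2/3 and pairwise recall is 2/6 for this example
fmi_partial = fmi_from_pairs([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])
fmi_perfect = fmi_from_pairs([0, 0, 1, 1], [1, 1, 0, 0])
print(fmi_partial, fmi_perfect)
```

Pairs that are separated in both labelings never enter the formula, and nothing is subtracted for pairs that would agree by chance.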

### 6. Using Kmodes and Hierarchical Clustering, use the same dataset and perform categorical data clustering, use FMI, ARI, and NMI for the comparison of performance.

In [36]:
from kmodes.kmodes import KModes
results_categorical = []

for dataset_name in datasets:
    dataset = globals()[dataset_name]
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=UserWarning)
        try:
            X_cat = dataset.iloc[:, :-1]
            true_labels_cat = dataset.iloc[:, -1]

            encoder = LabelEncoder()
            X_cat_encoded = X_cat.apply(encoder.fit_transform)

            km = KModes(n_clusters=k, init='Huang', n_init=5, verbose=0)
            km_labels = km.fit_predict(X_cat_encoded)
            ARI_km = adjusted_rand_score(true_labels_cat, km_labels)
            NMI_km = normalized_mutual_info_score(true_labels_cat, km_labels)
            FMI_km = fowlkes_mallows_score(true_labels_cat, km_labels)

            ac = AgglomerativeClustering(n_clusters=k, linkage='ward')
            ac_labels = ac.fit_predict(X_cat_encoded)
            ARI_ac = adjusted_rand_score(true_labels_cat, ac_labels)
            NMI_ac = normalized_mutual_info_score(true_labels_cat, ac_labels)
            FMI_ac = fowlkes_mallows_score(true_labels_cat, ac_labels)
            
            results_categorical.append([dataset_name + " (Kmodes)", ARI_km, NMI_km, FMI_km])
            results_categorical.append([dataset_name + " (Hierarchical)", ARI_ac, NMI_ac, FMI_ac])
        except UserWarning as e:
            print(f"Warning: {e}")

results_categorical_df = pd.DataFrame(results_categorical, columns=["Dataset", "ARI", "NMI", "FMI"])
results_categorical_df

Unnamed: 0,Dataset,ARI,NMI,FMI
0,soybean_df (Kmodes),0.494492,0.62121,0.64827
1,soybean_df (Hierarchical),0.653689,0.837666,0.783908
2,zoo_df (Kmodes),0.72157,0.71451,0.812035
3,zoo_df (Hierarchical),0.461701,0.585056,0.645736
4,heart_disease_df (Kmodes),0.141519,0.205702,0.44928
5,heart_disease_df (Hierarchical),0.009543,0.010944,0.353434
6,dermatology_df (Kmodes),0.505016,0.596038,0.664983
7,dermatology_df (Hierarchical),0.032201,0.078147,0.293614
8,breast_cancer_df (Kmodes),0.506449,0.480632,0.738602
9,breast_cancer_df (Hierarchical),0.781678,0.688725,0.893555
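
Ward linkage in the cell above works on Euclidean distances between the integer codes produced by `LabelEncoder`, which imposes an arbitrary order on the categories. One possible alternative, sketched here under the assumption that a simple matching distance is acceptable, is average-linkage hierarchical clustering on Hamming distances with SciPy. The toy matrix `X_codes` is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy label-encoded categorical data (rows = samples, columns = attributes)
X_codes = np.array([
    [0, 1, 2],
    [0, 1, 1],
    [0, 0, 2],
    [3, 2, 0],
    [3, 2, 1],
    [3, 0, 0],
])

# Hamming distance = fraction of attributes on which two samples disagree,
# so only equality of codes matters, not their numeric order
condensed = pdist(X_codes, metric="hamming")

# Average linkage accepts arbitrary dissimilarities; Ward assumes Euclidean distance
Z = linkage(condensed, method="average")

# Cut the dendrogram into at most 2 flat clusters
hamming_labels = fcluster(Z, t=2, criterion="maxclust")
print(hamming_labels)
```

The resulting labels can be passed to `adjusted_rand_score`, `normalized_mutual_info_score`, and `fowlkes_mallows_score` in the same way as `ac_labels`.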


### 7. Write your report using LaTeX. Your report should be focused on the "why's and the what's" of each performance metric (i.e. why is FMI always greater than ARI and NMI? What's the problem with ARI and NMI?).



Data and Performance Metrics

The following tables summarize the clustering performance on each dataset for the baseline (graph-based representation followed by K-means), K-modes, and hierarchical clustering, measured with ARI, NMI, and FMI:

In [33]:
results_df

Unnamed: 0,Dataset,ARI,NMI,FMI
0,soybean_df,0.339349,0.475268,0.530018
1,zoo_df,0.723236,0.732977,0.813887
2,heart_disease_df,0.12324,0.147469,0.422173
3,dermatology_df,0.416896,0.55194,0.598561
4,breast_cancer_df,0.556564,0.413921,0.78726
5,mushroom_df,0.342453,0.380637,0.629212
6,iris_df,0.655938,0.664691,0.77152


In [35]:
results_categorical_df

Unnamed: 0,Dataset,ARI,NMI,FMI
0,soybean_df (Kmodes),0.433978,0.620631,0.6143
1,soybean_df (Hierarchical),0.653689,0.837666,0.783908
2,zoo_df (Kmodes),0.72157,0.71451,0.812035
3,zoo_df (Hierarchical),0.461701,0.585056,0.645736
4,heart_disease_df (Kmodes),0.249916,0.180914,0.513638
5,heart_disease_df (Hierarchical),0.009543,0.010944,0.353434
6,dermatology_df (Kmodes),0.329082,0.383521,0.516354
7,dermatology_df (Hierarchical),0.032201,0.078147,0.293614
8,breast_cancer_df (Kmodes),0.359518,0.383678,0.651834
9,breast_cancer_df (Hierarchical),0.781678,0.688725,0.893555


A consistent pattern emerges: FMI is higher than ARI in every row, and higher than NMI in all but the two soybean rows of the In [35] table. Let's delve into the underlying factors.

##### Why FMI Tends to Be Higher:

Geometric Mean of Pairwise Precision and Recall: FMI is TP / sqrt((TP + FP)(TP + FN)), where TP counts pairs of points grouped together in both labelings, FP counts pairs grouped together only in the predicted clustering, and FN counts pairs grouped together only in the true labels. Every correctly co-clustered pair raises the score, and nothing is subtracted for pairs that would agree by chance. ARI subtracts the agreement expected by chance, and NMI rescales mutual information by the entropies of the two labelings.

Sensitivity to Cluster Sizes: FMI stays high when cluster sizes are unbalanced. If a clustering algorithm assigns most data points to a single large cluster, pairwise recall is close to 1, so FMI remains well above 0 despite poor cluster separation, while ARI and NMI fall toward 0. A gap of this kind appears in the heart_disease_df (Hierarchical) row: ARI 0.009543 and NMI 0.010944 against FMI 0.353434.

Effect of Class Imbalance: With imbalanced class distributions, most same-class pairs belong to the majority class. A clustering algorithm that favors the majority class still places many of those pairs together, which keeps FMI high; ARI and NMI do not reward this behavior.
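
A small experiment can make the gap between FMI and the chance-corrected ARI concrete. The sketch below scores two uninformative clusterings against three balanced classes: one that puts every point in a single cluster, and one that assigns three clusters at random. The synthetic labels and the seed are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics.cluster import (
    adjusted_rand_score,
    fowlkes_mallows_score,
    normalized_mutual_info_score,
)

rng = np.random.default_rng(0)
labels_true = np.repeat([0, 1, 2], 100)      # three balanced classes, 300 points

candidates = {
    # Every point in one cluster: pairwise recall is 1, but nothing is separated
    "single cluster": np.zeros_like(labels_true),
    # Three clusters assigned uniformly at random, independent of the classes
    "random k=3": rng.integers(0, 3, size=labels_true.size),
}

scores = {}
for name, pred in candidates.items():
    scores[name] = (
        adjusted_rand_score(labels_true, pred),
        normalized_mutual_info_score(labels_true, pred),
        fowlkes_mallows_score(labels_true, pred),
    )
    print(name, scores[name])
```

Both clusterings carry no information about the classes, so ARI and NMI sit at or near 0, while FMI stays clearly positive. With only a few clusters, as with `k = 3` in this notebook, that baseline is large enough to place FMI above the other two indices on most datasets.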

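##### Limitations of ARI and NMI:

Focus on Agreement, Not Quality: ARI and NMI are external indices. They measure agreement between a clustering and the reference labels, not the compactness or separation of the clusters themselves. NMI is also not adjusted for chance, so a clustering solution with many small clusters can yield a high NMI value even when the clusters are poorly separated.

Dependence on Cluster Sizes: Both metrics depend on the cluster-size distribution. ARI's expected agreement is computed from the cluster sizes, and NMI's normalizer is built from the entropies of the two labelings, so the same number of misassigned points can change the score by different amounts depending on class balance. ARI can also be 0 or negative, which makes it look pessimistic next to FMI for the same clustering.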

##### Choosing the Right Metric:

The selection of an appropriate performance metric depends on the specific clustering task and the desired properties.

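FMI: Well-suited for scenarios where pairwise precision and recall are both of interest. Because it is not adjusted for chance, it is best compared between clusterings of the same dataset rather than read as an absolute score.

ARI/NMI: More suitable when the primary concern is agreement between the clustering and the reference labels. ARI is the safer choice when the compared clusterings differ in their number of clusters, since it is adjusted for chance; NMI gives an information-theoretic view of the same agreement.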

##### Conclusion:

FMI's reliance on pairwise precision and recall, together with its lack of a chance correction, often leads to higher values compared to ARI and NMI in clustering evaluation. However, the choice of metric should be guided by the specific clustering task and its objectives.
