Implement and Evaluate Other Hierarchical Clustering Approaches
In the current behavior analysis, different clustering approaches were implemented and tested; at the moment, BIRCH is the only hierarchical clustering approach in use. In this project, we implement additional hierarchical clustering algorithms and different cluster selection approaches. Along the way, other clustering and evaluation metrics may be tested. All implemented features will be evaluated and compared to the existing ones.
https://github.com/research-iobserve/iobserve-analysis
In order to use the hierarchical clustering approaches implemented here, you need iobserve-repository as well as iobserve-analysis and your input data for the experiment. First, follow the instructions of iobserve-analysis to set it up.
Next, you need to set up your experiment. For that, an analysis.config file is needed, which you can find in iobserve-analysis/analysis. This config file sets all necessary parameters for your clustering experiment. There, you want to set sourceDirectory, pcm.directory.db, pcm.directory.init, baseURL and outputURL accordingly.
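For orientation, a minimal analysis.config sketch could look like the following. All values are placeholders and the exact set of required keys should be taken from the template in iobserve-analysis/analysis; only the parameter names listed above are taken from this project.

```properties
# Placeholder values, adjust to your environment.
sourceDirectory=/path/to/monitoring-data
pcm.directory.db=/path/to/pcm/database
pcm.directory.init=/path/to/pcm/initial-model
baseURL=file:///path/to/pcm/model
outputURL=file:///path/to/output
```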
For the Hierarchical Clustering, you can choose from different distance metrics for your experiment (a short sketch of both metrics follows the list):
- euclidean: Use the Euclidean distance metric.
- manhatten: Use the Manhattan distance metric.
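The following standalone Java sketch illustrates what the two distance metrics compute over a pair of feature vectors. It is an illustration only, not the implementation used in iobserve-analysis:

```java
public final class DistanceMetrics {

    /** Euclidean distance: square root of the summed squared coordinate differences. */
    public static double euclidean(final double[] a, final double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            final double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Manhattan distance: sum of the absolute coordinate differences. */
    public static double manhattan(final double[] a, final double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }
}
```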
Additionally, you can choose from different cluster selection methods for your experiment in order to select an appropriate number of clusters:
- elbow: Use the Elbow Method.
- avgsil: Use the Average Silhouette Method.
- gap: Use the Gap Statistic Method.
After you are done configuring the analysis.config, you can execute the Hierarchical Clustering using Eclipse. Make sure to set the program arguments of your Run Configuration to -c analysis.config and your VM arguments to -Dlog4j.configuration=file:/iobserve-analysis/log4j.cfg.
You can also use the shell script execute-analysis.sh in order to generate an analysis.config file for you. This script can also be found in the directory iobserve-analysis/analysis.
This project uses agglomerative clustering in order to cluster user behavior data from the JPetStore into distinct clusters. The goal is to find rules and similarity patterns in user behaviors and to create user behavior models for predicting future user behavior. Agglomerative clustering starts by putting each data point of the user behavior into its own cluster. Then, it iteratively merges the two most similar clusters into a single cluster until only one cluster remains that contains all data points. This procedure can be visualized by a so-called dendrogram.
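As an illustration of the merge loop described above, here is a condensed, hypothetical Java sketch of agglomerative clustering. It is not the code used in iobserve-analysis, and the linkage function is left abstract (see the next paragraph):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

public final class AgglomerativeClusteringSketch {

    /** Merges the two closest clusters until only a single cluster is left. */
    public static List<List<double[]>> cluster(final List<double[]> points,
            final BiFunction<List<double[]>, List<double[]>, Double> linkage) {
        // Start with one singleton cluster per data point.
        final List<List<double[]>> clusters = new ArrayList<>();
        for (final double[] point : points) {
            final List<double[]> singleton = new ArrayList<>();
            singleton.add(point);
            clusters.add(singleton);
        }

        while (clusters.size() > 1) {
            int bestI = 0;
            int bestJ = 1;
            double bestDistance = Double.MAX_VALUE;
            // Find the pair of clusters with the smallest linkage distance.
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    final double distance = linkage.apply(clusters.get(i), clusters.get(j));
                    if (distance < bestDistance) {
                        bestDistance = distance;
                        bestI = i;
                        bestJ = j;
                    }
                }
            }
            // Merge the closest pair into a single cluster. A full implementation
            // would also record each merge in order to build the dendrogram.
            clusters.get(bestI).addAll(clusters.remove(bestJ));
        }
        return clusters;
    }
}
```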
In order to merge clusters, there are different similarity measurements, the so-called linkages, that can be used to determine the similarity between two clusters. In this project, it is possible to choose between single, average and complete linkage. Depending on which linkage is used, the resulting dendrograms might differ.
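The following hypothetical Java sketch shows how the three linkages differ; each one aggregates the pairwise point distances between two clusters in a different way. It reuses the DistanceMetrics helper sketched above and is not the project's actual code:

```java
import java.util.List;

public final class LinkageSketch {

    /** Single linkage: distance between the two closest points of the clusters. */
    public static double single(final List<double[]> a, final List<double[]> b) {
        double min = Double.MAX_VALUE;
        for (final double[] x : a) {
            for (final double[] y : b) {
                min = Math.min(min, DistanceMetrics.euclidean(x, y));
            }
        }
        return min;
    }

    /** Complete linkage: distance between the two farthest points of the clusters. */
    public static double complete(final List<double[]> a, final List<double[]> b) {
        double max = 0.0;
        for (final double[] x : a) {
            for (final double[] y : b) {
                max = Math.max(max, DistanceMetrics.euclidean(x, y));
            }
        }
        return max;
    }

    /** Average linkage: mean distance over all point pairs of the two clusters. */
    public static double average(final List<double[]> a, final List<double[]> b) {
        double sum = 0.0;
        for (final double[] x : a) {
            for (final double[] y : b) {
                sum += DistanceMetrics.euclidean(x, y);
            }
        }
        return sum / (a.size() * b.size());
    }
}
```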
Unlike other clustering methods such as KMeansClustering, the user does not choose the number of clusters for the clustering result. Instead, this is done by an algorithm. There are different ways to choose an appropriate number of clusters, and each of them might provide better results than the others for certain input data. This project focuses on the Elbow Method, the Average Silhouette Method and the Gap Statistic Method, which the user can choose from. The following paragraphs briefly explain each of these methods.
The Elbow Method determines the number of clusters at which adding another cluster no longer significantly increases the quality of the clustering; this point is called the elbow. For that, it computes a clustering for each k = 1, ..., n, where n is the number of data points. For each k, it calculates the sum of the within-cluster dispersions of every cluster in the clustering and then searches for the elbow in the resulting curve. The problem with this method is that, depending on the input data, there is not always a clear elbow. In this case, the selected number of clusters might be suboptimal.
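A rough Java sketch of one way to pick the elbow, assuming the within-cluster dispersion has already been computed for every candidate k. Picking the point of the sharpest bend via the second difference is only one heuristic among several, and this is not the project's implementation:

```java
public final class ElbowSketch {

    /**
     * Picks the elbow from the within-cluster dispersion values, where
     * dispersionPerK[k - 1] belongs to the clustering with k clusters.
     * The elbow is approximated as the k where the curve bends most sharply,
     * i.e. where the second difference of the curve is largest.
     */
    public static int selectNumberOfClusters(final double[] dispersionPerK) {
        int bestK = 1;
        double bestBend = Double.NEGATIVE_INFINITY;
        for (int k = 2; k < dispersionPerK.length; k++) {
            // Second difference of the dispersion curve at k clusters.
            final double bend = dispersionPerK[k - 2] - 2.0 * dispersionPerK[k - 1] + dispersionPerK[k];
            if (bend > bestBend) {
                bestBend = bend;
                bestK = k;
            }
        }
        return bestK;
    }
}
```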
The Average Silhouette Method determines how well each data point is located in its assigned cluster by comparing its position to those of all other data points in the same cluster. Additionally, it compares its position to the data points of every other cluster to determine how poorly it would fit the clusters it is not assigned to. The number of clusters with the best average silhouette value over all data points is then selected.
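A simplified Java sketch of the silhouette value of a single data point, using the usual definition s = (b - a) / max(a, b), where a is the mean distance to the point's own cluster and b the mean distance to the nearest other cluster. It reuses the DistanceMetrics helper from above and is again an illustration, not the project code:

```java
import java.util.List;

public final class SilhouetteSketch {

    /** Silhouette value of one point: (b - a) / max(a, b). */
    public static double silhouette(final double[] point, final List<double[]> ownCluster,
            final List<List<double[]>> otherClusters) {
        final double a = meanDistance(point, ownCluster);
        double b = Double.MAX_VALUE;
        for (final List<double[]> other : otherClusters) {
            b = Math.min(b, meanDistance(point, other));
        }
        return (b - a) / Math.max(a, b);
    }

    private static double meanDistance(final double[] point, final List<double[]> cluster) {
        double sum = 0.0;
        int count = 0;
        for (final double[] other : cluster) {
            if (other != point) { // skip the point itself in its own cluster
                sum += DistanceMetrics.euclidean(point, other);
                count++;
            }
        }
        return count == 0 ? 0.0 : sum / count;
    }
}
```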
The Gap Statistic Method compares the within-cluster dispersion of each possible clustering to the within-cluster dispersion of clusterings of reference data sets generated from a uniform random distribution. The goal is to choose the clustering with a large gap in comparison to the clusterings of the reference data sets, i.e. a clustering that is much more compact than one of structureless data.
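For illustration, a hypothetical Java sketch of the core gap value for a single candidate k, assuming the within-cluster dispersions of the real data and of the reference data sets have already been computed. The full Gap Statistic Method additionally uses the standard error of the reference values when selecting k, which is omitted here:

```java
public final class GapStatisticSketch {

    /**
     * Gap value for one k: mean of log(W_k) over the reference data sets
     * minus log(W_k) of the real data. A larger gap indicates that the real
     * clustering is much more compact than clusterings of random data.
     */
    public static double gap(final double realDispersion, final double[] referenceDispersions) {
        double meanLogReference = 0.0;
        for (final double reference : referenceDispersions) {
            meanLogReference += Math.log(reference);
        }
        meanLogReference /= referenceDispersions.length;
        return meanLogReference - Math.log(realDispersion);
    }
}
```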
An evaluation of the three implemented cluster selection methods with fixed user behavior data for the JPetStore shows that the Average Silhouette Method provides satisfactory results, while the Gap Statistic Method and the Elbow Method provide mixed results. Additional testing with more diverse user input data is necessary. Another important point to be addressed is that user input data needs to be weighted in order to create a model for future user behavior. As it is now, a user buying a dog is considered as similar to a user buying a cat as to a user buying a snake, even though there might be a big difference in user preferences.