This repository contains the code that implements the F-EM clustering algorithm corresponding to https://arxiv.org/pdf/1907.01660.pdf. This algorithm follows a two-step EM scheme focused on robustness to noise.
The clustering algorithm described in Section 2 of the paper is implemented as a class in the file `_fem.py`. It can be called in the following way:

```python
fem = FEM(K)
fem.fit(data)
```

where `K` is the number of clusters and `data` is a NumPy array with one observation per row.
To get the cluster labels, access the `labels_` attribute of the fitted object:

```python
fem.labels_
```

Rejected outliers are labelled with -1.
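Since rejected points carry the label -1, a common post-processing step is to separate them from the regular cluster assignments. A minimal NumPy sketch (the `labels` array below is a hypothetical stand-in for `fem.labels_`):

```python
import numpy as np

# Hypothetical fem.labels_ output for 6 observations and 2 clusters;
# -1 marks points rejected as outliers.
labels = np.array([0, 1, -1, 1, 0, -1])

inlier_mask = labels != -1                     # True for clustered points
n_outliers = int((~inlier_mask).sum())         # number of rejected points
cluster_sizes = np.bincount(labels[inlier_mask])  # observations per cluster

print(n_outliers)               # 2
print(cluster_sizes.tolist())   # [2, 2]
```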
You can specify a value `b` between 0 (reject nothing) and 1 (reject everything) for the outlier rejection:

```python
fem = FEM(K, thres=b)
fem.fit(dataset)
```

To classify new data points after the model is fitted, you can run the following line:
```python
classif_labels = fem.predict(new_data)
```

In case you are interested in the parameters of the model, you can access them like this:
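FEM follows the familiar scikit-learn estimator conventions (`fit`, `predict`, `labels_`). As a self-contained illustration of that workflow, here is the analogous sequence of calls using scikit-learn's `GaussianMixture` as a (non-robust) stand-in; the FEM calls mirror these one-for-one, but `FEM` itself lives in `_fem.py` and is not imported here:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two well-separated Gaussian blobs, one observation per row.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=8.0, scale=1.0, size=(100, 2)),
])

# Analogous to: fem = FEM(2); fem.fit(data)
gm = GaussianMixture(n_components=2, random_state=0).fit(data)

# Analogous to: classif_labels = fem.predict(new_data)
new_data = np.array([[0.0, 0.0], [8.0, 8.0]])
classif_labels = gm.predict(new_data)

# One point per blob, so the two predicted labels differ.
print(classif_labels[0] != classif_labels[1])
```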
```python
fem.mu_
fem.Sigma_
```

Libraries required to run the algorithm: numpy, scipy, and scikit-learn (plus the standard-library modules math and random).
You can download the datasets used to compare the different clustering algorithms:
- MNIST (https://www.kaggle.com/oddrationale/mnist-in-csv)
- smallNORB (https://cs.nyu.edu/~ylclab/data/norb-v1.0-small/)
- 20newsgroups (`fetch_20newsgroups` from the sklearn library)
The notebooks
- experiments-MNIST.ipynb
- experiments-NORBand20newsgroup.ipynb
contain the experiments and comparisons described in Section 3 of the paper. The plotnine, umap-learn, and matplotlib packages are required for the plots.
We run the t-EM algorithm implemented by the function EmSkew from the R library EMMIXskew.
Authors
- Violeta Roizman (violeta.roizman@centralesupelec.fr)
- Matthieu Jonckheere
- Frédéric Pascal
Copyright (c) 2019 CentraleSupelec.
The Python wrapper used to read the smallNORB data is available here (Copyright (c) 2017 Andrea Palazzi).