This repository is devoted to implementation of the approach proposed in:
Vasilii Feofanov, Emilie Devijver and Massih-Reza Amini - Transductive Bounds for the Multi-class Majority Vote Classifier.
In Proceedings of the AAAI Conference on Artificial Intelligence 33, 3566-3573.
The multi-class semi-supervised framework is considered. The goal is to infer a model based on given few labeled examples and lots of unlabeled ones. The proposed algorithm iteratively assigns pseudo-labels to a subset of unlabeled training examples that have their associated class margin above a threshold obtained from the transductive bound proposed in the paper. The algorithm is based on a supervised approach that can be any classifier that outputs posteriors. In our implementation, we use the random forest approach.
The algorithm is implemented in Python 3 and can be found at self_learning.py. Some functions are re-written in Cython to reduce runtime and are located at self_learning_cython.pyx. By default, the msla function makes use of Cython, which can be manually changed if there is no possibility to install Cython package.
To run succesfully the code, it requires:
- Python 3
- scikit-learn
- NumPy
- Pandas
- Matplotlib (only for plotting)
- Cython (optional)
To validate our approach, we compare the MSLA algorithm with the following methods:
- Purely supervised approach. It is a scikit-learn implementation of the Random Forest.
- The Label Propagation by scikit-learn.
- Transductive SVM extended to the multi-class case by the one-versus-all approach.
- FSLA: the self-learning algorithm with a fixed threshold equal to 0.7.
This test can be executed to verify that the code works on a machine. It performs classification using all classifiers under consideration on the DNA dataset with one random split on labeled and ulabeled data. In a terminal, the code is executed in the following way:
python3 simple_test.py
In case of success, the basic information will be displayed as well as the following graph:
In addition, the folder with TSVM input and output for each class will be created.
This test performs experiments with the setup described in the paper. 20 random splits on labeled and unlabeled parts are performed for a dataset. To run the test in a terminal you specify two arguments: name of a dataset, and the labeled/unlabeled split. For instance, for the dataset pendigits with the split 0.99, one types the following:
python3 experiment_test.py pendigits 0.99
The output will create several file in the output folder.