CZ4041 Machine Learning -- Semi-Supervised Learning (Research Base Project)
Python 3.6.2
pip install -r requirements.txt
- dual-learning-ssl.py
Python source code implementing dual learning-based safe semi-supervised learning. It uses the nursery-ssl10 dataset in the data/ folder for SSL training and generates .csv files in the result folder.
This CSV dataset file contains 10% labelled and 90% unlabelled training data. It is used to train the initial SSL model, which then labels the 90% unlabelled portion. The SSL model is then retrained on the dataset it has just labelled.
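The train, label, retrain loop described above can be sketched as follows. This is a minimal, dependency-free illustration and not the code in dual-learning-ssl.py: the nearest-centroid classifier, the function names (`fit_centroids`, `predict`, `self_train`), and the toy data are all assumptions chosen to keep the example short.

```python
def fit_centroids(X, y):
    """Return {label: mean feature vector} computed from labelled rows."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], row))


def self_train(X_lab, y_lab, X_unlab):
    # Step 1: train the initial model on the labelled portion only.
    model = fit_centroids(X_lab, y_lab)
    # Step 2: let that model label the unlabelled portion.
    pseudo = [predict(model, row) for row in X_unlab]
    # Step 3: retrain on the labelled rows plus the newly labelled rows.
    model = fit_centroids(X_lab + X_unlab, y_lab + pseudo)
    return model, pseudo


# Toy data (hypothetical): two labelled rows, four unlabelled rows.
X_lab = [[0.0, 0.0], [10.0, 10.0]]
y_lab = ["a", "b"]
X_unlab = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0], [10.0, 9.0]]

model, pseudo_labels = self_train(X_lab, y_lab, X_unlab)
print(pseudo_labels)
```

Any classifier with a fit and a predict step could take the place of the nearest-centroid model here; the loop itself stays the same.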
This CSV dataset contains the class labels that are not available in the nursery-ssl10-10-1tra.csv dataset. In other words, it is the answer key for the unlabelled data instances in the training dataset. We use it to check how well our SSL model labels the unlabelled training data.
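Checking predicted labels against this answer key amounts to a simple match count. The sketch below is illustrative only: `labelling_accuracy` is a hypothetical helper, and the label values are sample class names used as placeholders, not output from the project.

```python
def labelling_accuracy(predicted, actual):
    """Fraction of instances whose predicted label matches the answer key."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        return 0.0
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(predicted)


# Hypothetical labels assigned by the SSL model to four unlabelled instances.
predicted = ["priority", "not_recom", "spec_prior", "priority"]
# The corresponding entries from the answer key.
answer_key = ["priority", "not_recom", "priority", "priority"]

score = labelling_accuracy(predicted, answer_key)
print(score)
```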
This csv dataset is the test dataset used to measure the accuracy of the SSL model implemented.
This file is generated by dual-learning-ssl.py. It contains a safe subset of data\nursery-ssl10-10-1tra.csv, filtered by the dual-learning algorithm, and therefore contains both labelled and unlabelled data. This dataset can be safely used for training by any SSL algorithm.
This file is generated by dual-learning-ssl.py. The difference between dataset-ssl-safe-w-real-y.csv and dataset-ssl-safe-unlabelled.csv is that the former is fully labelled while the latter contains unlabelled instances.
This file is generated by dual-learning-ssl.py and is a safe subset of the training dataset. The class label for each of the unlabelled instances in this dataset was pre-labelled using a prediction from a Regularised Least Squares model, so the class labels in this .csv file may not be 100% correct. Another supervised model can be trained on this dataset to complete the SSL model.
This file is generated by dual-learning-ssl.py. It contains the statistics for the risk calculation performed by the dual learning model: the number of true positives and false positives for whether a data instance is truly safe at a particular risk threshold. A data instance is defined as really safe when its risk is below the threshold and the dual learning model's prediction of its label is exactly the same as its real label.
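The counting rule above can be sketched as follows. This is an illustration under stated assumptions, not the statistics code in dual-learning-ssl.py: `safe_counts` is a hypothetical helper, the risk scores and labels are made up, and treating "below the threshold but wrongly predicted" as a false positive is an assumption inferred from the definition of a really safe instance.

```python
def safe_counts(risks, predicted, actual, threshold):
    """Count true and false positives among instances flagged as safe.

    An instance is flagged as safe when its risk is strictly below the threshold.
    True positive: flagged as safe and the predicted label equals the real label.
    False positive (assumed definition): flagged as safe but the labels differ.
    """
    true_pos = false_pos = 0
    for risk, pred, real in zip(risks, predicted, actual):
        if risk < threshold:
            if pred == real:
                true_pos += 1
            else:
                false_pos += 1
    return true_pos, false_pos


# Hypothetical risk scores, model predictions, and real labels for five instances.
risks = [0.05, 0.20, 0.40, 0.10, 0.60]
predicted = ["a", "b", "a", "b", "a"]
actual = ["a", "b", "b", "a", "a"]

for threshold in (0.15, 0.30, 0.50):
    print(threshold, safe_counts(risks, predicted, actual, threshold))
```

Sweeping the threshold this way shows the trade-off: a higher threshold admits more instances as safe, which can raise both counts.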
This file is the dummy-coded version of the data\nursery-ssl10-10-1tst.csv test dataset.
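Dummy coding replaces each categorical attribute with one 0/1 column per category. The sketch below shows the idea in plain Python. It is not the project's encoding routine: `dummy_code` is a hypothetical helper, the column names and values are sample data, and whether the project keeps every category column or drops one per attribute is not stated, so this version keeps them all.

```python
def dummy_code(rows, columns):
    """One-hot encode categorical columns.

    rows: list of dicts mapping column name to category value.
    Returns (header, encoded_rows), one 0/1 column per (column, category) pair.
    """
    # Sort the categories so the column layout is deterministic.
    categories = {col: sorted({row[col] for row in rows}) for col in columns}
    header = [
        "{}_{}".format(col, value) for col in columns for value in categories[col]
    ]
    encoded = [
        [1 if row[col] == value else 0 for col in columns for value in categories[col]]
        for row in rows
    ]
    return header, encoded


# Hypothetical sample rows with two categorical attributes.
rows = [
    {"parents": "usual", "finance": "convenient"},
    {"parents": "pretentious", "finance": "inconv"},
    {"parents": "great_pret", "finance": "convenient"},
]

header, encoded = dummy_code(rows, ["parents", "finance"])
print(header)
for line in encoded:
    print(line)
```

In practice the test set has to be encoded with the same category layout as the training set, otherwise the columns will not line up.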