CZ4041 Machine Learning -- Semi-Supervised Learning (Research Base Project)

sohjunjie/CZ4041---ML-SSL

CZ4041---ML-SSL

Version Information

Python 3.6.2

Install Python requirements

pip install -r requirements.txt

Description of files in root folder

1. dual-learning-ssl.py

Python source code implementing dual learning-based safe semi-supervised learning (SSL). It uses the nursery-ssl10 dataset in the data folder for SSL training and writes .csv files to the result folder.

Description of files in data folder

1. data\nursery-ssl10-10-1tra.csv

This CSV file contains the training data, of which 10% is labelled and 90% is unlabelled. It is used to train the initial SSL model, which then labels the 90% of unlabelled data. The SSL model is then retrained on the dataset it has just labelled.
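The train, label, retrain workflow described above can be sketched as follows. This is a minimal illustration, not the code in dual-learning-ssl.py: the 1-nearest-neighbour classifier with Hamming distance is a stand-in chosen because it needs no external libraries, the use of None to mark unlabelled instances is an assumption, and the function names and toy rows are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length feature tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def predict_1nn(train_rows, train_labels, row):
    """Return the label of the closest training row (first match wins ties)."""
    best = min(range(len(train_rows)), key=lambda i: hamming(train_rows[i], row))
    return train_labels[best]


def self_train(rows, labels):
    """Label every unlabelled row with a model fitted on the labelled rows.

    rows   : list of categorical feature tuples
    labels : list of class labels, with None marking an unlabelled row (assumption)
    Returns a fully labelled copy of `labels`, ready for retraining.
    """
    labelled_rows = [r for r, y in zip(rows, labels) if y is not None]
    labelled_y = [y for y in labels if y is not None]
    return [
        y if y is not None else predict_1nn(labelled_rows, labelled_y, r)
        for r, y in zip(rows, labels)
    ]


# Toy data: two labelled and two unlabelled instances.
rows = [("a", "x", "1"), ("a", "x", "2"), ("b", "y", "2"), ("b", "y", "1")]
labels = ["p", None, "q", None]

filled = self_train(rows, labels)
# The retrained model now uses all four rows with their (pseudo-)labels.
retrained_prediction = predict_1nn(rows, filled, ("b", "y", "3"))
print(filled, retrained_prediction)
```

With a 1-nearest-neighbour stand-in, "retraining" simply means predicting against the full, newly labelled set; a parametric model would be refitted at that point instead.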

2. data\nursery-ssl10-10-1trs.csv

This CSV file contains the class labels that are missing from the nursery-ssl10-10-1tra.csv dataset. In other words, it is the answer key for the unlabelled instances in the training dataset. We use it to check how well our SSL model labels the unlabelled training data.
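A check of labelling quality against the answer key might look like the sketch below. It assumes that the answer key lines up row by row with the training file and that unlabelled instances are marked with None; both are assumptions for illustration, and `labelling_accuracy` is a hypothetical helper, not a function from dual-learning-ssl.py.

```python
def labelling_accuracy(original, predicted, answer_key):
    """Fraction of originally unlabelled instances that received the correct label.

    original   : labels from the training file, None where unlabelled (assumption)
    predicted  : labels after the SSL model has filled in the gaps
    answer_key : real labels, aligned by position with `original` (assumption)
    """
    unlabelled = [i for i, y in enumerate(original) if y is None]
    if not unlabelled:
        raise ValueError("no unlabelled instances to evaluate")
    correct = sum(predicted[i] == answer_key[i] for i in unlabelled)
    return correct / len(unlabelled)


original = ["p", None, None, "q", None]
predicted = ["p", "p", "q", "q", "q"]
answer_key = ["p", "p", "p", "q", "q"]

accuracy = labelling_accuracy(original, predicted, answer_key)
print(f"{accuracy:.3f}")  # 2 of the 3 unlabelled instances are correct: 0.667
```

Only the originally unlabelled instances are scored, so the labelled 10% cannot inflate the result.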

3. data\nursery-ssl10-10-1tst.csv

This CSV file is the test dataset used to measure the accuracy of the implemented SSL model.

Description of files in result folder

1. result\dataset-ssl-safe-unlabelled.csv

This file is generated by dual-learning-ssl.py. It contains the safe subset of data\nursery-ssl10-10-1tra.csv selected by the dual-learning algorithm. It keeps both labelled and unlabelled instances, so it can be safely used as training data by any SSL algorithm.

2. result\dataset-ssl-safe-w-real-y.csv

This file is generated by dual-learning-ssl.py. It differs from dataset-ssl-safe-unlabelled.csv in that every instance in dataset-ssl-safe-w-real-y.csv is labelled, while dataset-ssl-safe-unlabelled.csv still contains unlabelled instances.

3. result\dataset-ssl-safe.csv

This file is generated by dual-learning-ssl.py and is a safe subset of the training dataset. The class label of each unlabelled instance in this dataset was pre-labelled using a prediction from a Regularised Least Squares model, so the class labels in this .csv file may not be 100% correct. Another supervised model can be trained on this dataset to complete the SSL model.

4. result\dual-ssl-stats.csv

This file is generated by dual-learning-ssl.py. It contains the statistics for the risk calculation of the dual learning model. For each risk threshold, it records the number of true positives and false positives with respect to whether a data instance is truly safe. A data instance is defined as truly safe when its risk is below the threshold and the dual learning model's predicted label is exactly the same as its real label.
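The counting described above can be sketched as follows. How the risk score itself is computed is not described here, so the sketch takes the scores as given. It also assumes that a false positive is an instance below the threshold whose predicted label differs from its real label; the function name and dictionary keys are hypothetical and do not necessarily match the columns of dual-ssl-stats.csv.

```python
def safety_counts(risks, predicted, real, thresholds):
    """Count true and false positives among instances below each risk threshold.

    true positive  : risk below threshold and predicted label equals real label
    false positive : risk below threshold but predicted label differs (assumption)
    """
    stats = []
    for threshold in thresholds:
        true_pos = false_pos = 0
        for risk, y_pred, y_real in zip(risks, predicted, real):
            if risk < threshold:  # instance is treated as safe at this threshold
                if y_pred == y_real:
                    true_pos += 1
                else:
                    false_pos += 1
        stats.append(
            {"threshold": threshold, "true_positive": true_pos, "false_positive": false_pos}
        )
    return stats


risks = [0.1, 0.4, 0.6, 0.2]
predicted = ["p", "q", "p", "q"]
real = ["p", "p", "p", "q"]

stats = safety_counts(risks, predicted, real, thresholds=[0.3, 0.5])
for row in stats:
    print(row)
```

Raising the threshold admits more instances as safe, which in this toy example adds one false positive at 0.5 that was excluded at 0.3.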

5. result\tst-dummy.csv

This file is the dummy-coded version of the data\nursery-ssl10-10-1tst.csv test dataset.
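Dummy coding replaces each categorical attribute with one 0/1 column per category value. The sketch below illustrates the idea with plain Python; the `dummy_code` helper, the column naming scheme, and the two sample rows are illustrative assumptions, and the script may encode the data differently (for example by dropping one reference level per attribute).

```python
def dummy_code(rows, columns):
    """One-hot encode categorical rows.

    rows    : list of dicts mapping column name to category value
    columns : column names to encode, in output order
    Returns (header, encoded_rows) with one 0/1 column per category value.
    """
    # Collect the distinct values of each column in a stable (sorted) order.
    levels = {c: sorted({r[c] for r in rows}) for c in columns}
    header = [f"{c}_{v}" for c in columns for v in levels[c]]
    encoded = [
        [1 if r[c] == v else 0 for c in columns for v in levels[c]]
        for r in rows
    ]
    return header, encoded


sample = [
    {"parents": "usual", "health": "priority"},
    {"parents": "pretentious", "health": "recommended"},
]

header, encoded = dummy_code(sample, ["parents", "health"])
print(header)
print(encoded)
```

The test set has to be encoded with the same category levels as the training set, otherwise the column layout of the two files will not match.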
