Releases: retarfi/language-pretraining

v2.2.1

28 Apr 11:28
571ecf8
  • The SentencePiece algorithm can now be selected
  • create_datasets.py can now use multiprocessing
  • Moved the ELECTRA model file into the models directory
  • Added a DeBERTaV3 (alpha) implementation
    • This implementation backpropagates through the generator and the discriminator at the same time
    • In my experiments, models from this implementation perform worse than models from my DeBERTaV2 implementation
    • The implementation therefore needs improvement, but I do not have time to work on it.

v2.2.0

04 Mar 15:41
37e1ade

The main changes are as follows:

  • jptranstokenizer is now used as the tokenizer
    • It enables other word tokenizers such as Juman++, Sudachi, and spaCy LUW.
  • Replaced requirements.txt with pyproject.toml
    • This is unstable, especially the PyTorch part, and should be adjusted to your own environment.
    • If you get an error in run_pretraining.py, it may be due to pydantic. Updating pydantic to the latest version may solve the problem, although the version compatibility constraints will then not be satisfied.
  • Added a Pre-mask option
    • To use this option, specify --mask_style and use the --is_dataset_masked option in run_pretraining.py.
  • Added DeBERTa and DeBERTaV2
  • Changed the license from Apache 2.0 to MIT

There are further, more detailed changes as well.
Please read Readme.md.
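
As a rough illustration of the Pre-mask option listed above, the sketch below assumes that "pre-masking" means applying the masked-language-model mask once, when the dataset is built, instead of drawing a fresh mask at every training step. This is an assumption and not the repository's actual code: the function pre_mask, the MASK_TOKEN constant, and the parameters mask_prob and seed are hypothetical names, and the usual 80/10/10 replacement scheme is left out for brevity.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical mask token, for illustration only


def pre_mask(tokens, mask_prob=0.15, seed=0):
    """Mask tokens once, at dataset-creation time (simplified sketch).

    Returns (masked_tokens, labels). labels holds the original token at
    masked positions and None at positions that were left unchanged.
    """
    rng = random.Random(seed)  # fixed seed, so the same mask is reproduced
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = pre_mask(tokens, mask_prob=0.3, seed=42)
```

Because the mask is stored with the dataset, every epoch sees the same masked positions. With dynamic masking the positions would instead change from step to step.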

v2.1.0

06 Aug 11:50
a9514df

v2.0.0

06 Jun 00:41

Apply Hugging Face's datasets library
https://github.com/retarfi/language-pretraining/tree/336c3699679dd59be788acc21f83188efa76b95b

New features:

  • Apply the datasets library
    • You need to run create_datasets.py before running run_pretraining.py
    • See README.md#Create Dataset for how to run create_datasets.py
  • Log the discriminator and generator losses of ELECTRA
  • Additional pre-training from a checkpoint is available

v1.0

06 Oct 09:52

First release