PyTorch Implementation of "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations"
The original paper can be found here (arXiv: 1909.11942).
This repository will contain a PyTorch implementation of ALBERT and various wrappers around it for pre-training and fine-tuning tasks.
I will use 🤗Tokenizers and 🤗Datasets for tokenization and dataset preprocessing; a native Python implementation simply cannot compete speed-wise.
Repository layout:
- `albert/`: directory containing the ALBERT architecture implementation and associated wrapper modules.
- `trainer/`: directory containing trainer classes for different language model tasks.
- `main-TASK.py`: main script to run the task `TASK`.
- `hps.py`: hyperparameter configuration file.
Configurations are provided via `types.SimpleNamespace` to generate namespaces from Python dictionaries.
- `_common` contains shared hyperparameters for all tasks.
- `_pretrain` contains hyperparameters for pre-training tasks.
- Dictionaries with `_albert_` as a suffix are for specific model configurations, which can be selected with the `--model` flag in training scripts.
- The exception to the above is `_albert_shared`, which contains hyperparameters shared by all configurations.
- More specific configurations override common configurations.
To select a configuration, import `HPS` from `hps.py`, then retrieve the namespace using a `(task, model)` key-tuple, as in the sketch below.
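A minimal sketch of that lookup, assuming a pre-training task keyed as `"pretrain"` and the default `"base"` model (the exact key strings and attribute names are defined in `hps.py`):

```python
from types import SimpleNamespace  # configurations are exposed as SimpleNamespace objects

from hps import HPS

# Hypothetical key names; check hps.py for the exact task/model identifiers.
hps = HPS[("pretrain", "base")]

# Hyperparameters are then read as attributes (e.g. something like hps.batch_size,
# depending on what the underlying dictionaries define) rather than dict lookups.
assert isinstance(hps, SimpleNamespace)
```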
Start the pre-training script:
python main-pretrain.py
With optional flags:
--cpu # do not use GPU acceleration, only use the CPU.
--tqdm-off # disable tqdm progress bars. useful when running on a server.
--small # use a smaller split of the dataset. useful for debugging.
--model # select model to use. defaults to `base`. see `hps.py` for details.
--no-save # do not save outputs to disk. all data will be lost on termination. note that HuggingFace may still cache results.
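For example, `python main-pretrain.py --cpu --small --no-save` runs a quick debugging pass on the CPU, using the reduced dataset split and writing nothing to disk.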
The script will save outputs to `runs/pretrain-{DATE}_{TIME}/`, where you can retrieve model checkpoints.
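Checkpoints from that directory can be restored with plain PyTorch. The path below is only a placeholder; the actual file names depend on what the trainer writes:

```python
import torch

# Placeholder run directory and checkpoint name; substitute the files
# actually produced by your pre-training run.
state = torch.load("runs/pretrain-DATE_TIME/checkpoint.pt", map_location="cpu")
```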
TODO: Instructions on fine-tuning for downstream tasks
TODO: Instructions on model inference
- Option to use the Nyströmformer self-attention approximation rather than full softmax attention. Defaults to Nyströmformer.
TODO: Add model checkpoints
- ALBERT Core
- BERT alternative option
- Pretraining Wrappers
- Finetuning Wrappers
- Preprocessing Pipeline
- 🤗Version (Faster)
- Native (Slower)
- Pre-training Scripts
- Fine-tuning Scripts
- Inference Scripts
- More attention approximation options
- Fancy Logging
- Automatic Mixed-Precision Operations
- Distributed Training
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
@misc{lan2020albert,
title={ALBERT: A Lite BERT for Self-supervised Learning of Language Representations},
author={Zhenzhong Lan and Mingda Chen and Sebastian Goodman and Kevin Gimpel and Piyush Sharma and Radu Soricut},
year={2020},
eprint={1909.11942},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention
@misc{xiong2021nystromformer,
title={Nystr\"omformer: A Nystr\"om-Based Algorithm for Approximating Self-Attention},
author={Yunyang Xiong and Zhanpeng Zeng and Rudrasis Chakraborty and Mingxing Tan and Glenn Fung and Yin Li and Vikas Singh},
year={2021},
eprint={2102.03902},
archivePrefix={arXiv},
primaryClass={cs.CL}
}