🚀 Welcome to the repository for ANGOFA, a project that builds pre-trained language models tailored to Angolan languages by combining OFA embedding initialization with synthetic data.
In recent years, pre-trained language models (PLMs) have shown a remarkable capacity to transcend linguistic barriers and facilitate knowledge transfer across diverse languages. However, very low-resource languages like those spoken in Angola have been largely overlooked, leaving a gap in the multilingual landscape. ANGOFA addresses this gap by introducing four PLMs fine-tuned specifically for Angolan languages.
- 🌍 Multilingual Adaptation: PLMs tailored to Angolan languages via the Multilingual Adaptive Fine-tuning (MAFT) approach (see the sketch after this list).
- 📚 Enhanced Performance: Investigates the role of informed embedding initialization and synthetic data in improving PLM performance on downstream tasks.
- ✨ Variants: Includes ANGXLMR and ANGOFA variants, each with distinct fine-tuning processes and configurations.
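MAFT adapts an existing multilingual PLM by continuing masked-language-model pretraining on text from the target languages. Below is a minimal sketch of what that looks like with Hugging Face `transformers`; the base checkpoint, corpus path, and hyperparameters are illustrative placeholders, not the exact setup used in the paper.

```python
# Sketch of MAFT-style adaptation: continued masked-language-model (MLM)
# pretraining on Angolan-language text. The checkpoint, corpus path, and
# hyperparameters are illustrative placeholders, not the paper's setup.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "xlm-roberta-base"  # assumed multilingual starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Hypothetical monolingual corpus, one sentence per line.
raw = load_dataset("text", data_files={"train": "angolan_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard dynamic token masking for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="angofa-maft",
        per_device_train_batch_size=8,
        num_train_epochs=3,
        learning_rate=5e-5,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```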
The ANGOFA and ANGXLMR model families are available on the Hugging Face model hub for easy access and experimentation: AngXLMR, AngOFA, AngXLMR-SYN, and AngOFA-SYN.
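Any of the checkpoints can be loaded with `transformers` as sketched below; the `<org>` placeholder stands for the Hugging Face organization hosting the models, which you should take from the model cards linked above.

```python
# Illustrative loading snippet; replace <org> with the actual Hugging Face
# organization hosting the checkpoints (see the model cards above).
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "<org>/AngOFA"  # or AngXLMR, AngXLMR-SYN, AngOFA-SYN
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Quick fill-mask sanity check.
text = f"Luanda is the capital of {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
```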
This work was supported in part by Oracle Cloud credits and related resources provided by Oracle.
If you find this work helpful, please consider citing our paper:
@misc{quinjica2024angofa,
  title={ANGOFA: Leveraging OFA Embedding Initialization and Synthetic Data for Angolan Language Model},
  author={Osvaldo Luamba Quinjica and David Ifeoluwa Adelani},
  year={2024},
  eprint={2404.02534},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}