
FinBERT

FinBERT is a BERT model pre-trained on financial communication text. Its purpose is to enhance financial NLP research and practice. It is trained on the following three financial communication corpora, with a total size of 4.9B tokens:

  • Corporate Reports 10-K & 10-Q: 2.5B tokens
  • Earnings Call Transcripts: 1.3B tokens
  • Analyst Reports: 1.1B tokens

FinBERT achieves state-of-the-art performance on financial sentiment classification, a core financial NLP task. With the release of FinBERT, we hope practitioners and researchers can apply it to a wider range of applications where the prediction target goes beyond sentiment, such as financial outcomes including stock returns, stock volatility, and corporate fraud.

Download FinBERT

We provide four versions of pre-trained weights, one for each vocabulary listed below: FinBERT-FinVocab-Uncased, FinBERT-FinVocab-Cased, FinBERT-BaseVocab-Uncased, and FinBERT-BaseVocab-Cased.

FinVocab is a new WordPiece vocabulary built from our financial corpora using the SentencePiece library. We produce both cased and uncased versions of FinVocab, with sizes of 28,573 and 30,873 tokens respectively. This is very close to the 28,996 and 30,522 token sizes of the original BERT cased and uncased BaseVocab.
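For readers who want to build a similar vocabulary, a minimal sketch with the SentencePiece trainer might look like the following. The corpus path, output prefix, and model type are illustrative assumptions, not the authors' exact settings.

```python
# Minimal sketch: training a subword vocabulary on a financial corpus with SentencePiece.
# "financial_corpus.txt" is a hypothetical file with one sentence per line.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="financial_corpus.txt",
    model_prefix="finvocab-uncased",  # writes finvocab-uncased.model / .vocab
    vocab_size=30873,                 # uncased FinVocab size reported above
    model_type="bpe",                 # assumption: SentencePiece's closest match to WordPiece
)
```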

Using FinBERT for financial sentiment classification

Financial sentiment classification is a core NLP task in finance. FinBERT is shown to outperform the vanilla BERT model on several financial sentiment classification tasks. Since FinBERT uses the same format as BERT, please refer to Google's BERT repo for downstream tasks.

As a demonstration, we provide a script for fine-tuning FinBERT on the Financial Phrase Bank dataset, a financial sentiment classification dataset. We also provide a Jupyter notebook showing how to load a fine-tuned model and use it to predict on novel sentences. The notebook compares two models, FinBERT-FinVocab-Uncased and a Naive Bayes model, both fine-tuned on the 10K HKUST dataset, as mentioned in the paper.
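As a rough illustration of the prediction step, the sketch below loads a fine-tuned checkpoint with the Hugging Face transformers library. The checkpoint path and the label order are assumptions; the notebook in this repo remains the authoritative reference.

```python
# Minimal sketch: scoring a sentence with a fine-tuned FinBERT checkpoint.
# The checkpoint directory and the label ordering are assumptions.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

model_dir = "path/to/finbert-finvocab-uncased"  # hypothetical local checkpoint
tokenizer = BertTokenizer.from_pretrained(model_dir)
model = BertForSequenceClassification.from_pretrained(model_dir, num_labels=3)
model.eval()

inputs = tokenizer("Profits rose sharply in the third quarter.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
label = ["negative", "neutral", "positive"][logits.argmax(dim=-1).item()]
print(label)
```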

Downloading Financial Phrase Bank Dataset

The dataset was collected by Malo et al. (2014) and can be downloaded from this link. The zip file for the Financial Phrase Bank dataset has been provided for ease of download and use.

Environment:

To set up the environment used to train and test the model, run `pip install -r requirements.txt`.

We would like to give special thanks to the creators of pytorch-pretrained-bert (i.e., pytorch-transformers).

To fine-tune FinBERT on the Financial Phrase Bank dataset, run the script as follows:

```bash
python train_bert.py --cuda_device (cuda:device_id) --output_path (output directory) --vocab (vocab chosen) \
    --vocab_path (path to new vocab txt file) --data_dir (path to downloaded dataset) --weight_path (path to downloaded weights)
```

There are four vocabularies to choose from: FinVocab-Uncased, FinVocab-Cased, and Google's BERT Base-Uncased and Base-Cased.

Note that to run the script, one should first download the model weights and the Financial Phrase Bank dataset.
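For concreteness, an invocation of the script above might look like the following; every path here is hypothetical and should be adjusted to your local setup.

```bash
# Illustrative invocation with hypothetical paths
python train_bert.py \
    --cuda_device cuda:0 \
    --output_path ./output \
    --vocab FinVocab-Uncased \
    --vocab_path ./FinVocab/FinVocab-Uncased.txt \
    --data_dir ./FinancialPhraseBank \
    --weight_path ./weights/FinBERT-FinVocab-Uncased
```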

Citation

```bibtex
@misc{yang2020finbert,
    title={FinBERT: A Pretrained Language Model for Financial Communications},
    author={Yi Yang and Mark Christopher Siy UY and Allen Huang},
    year={2020},
    eprint={2006.08097},
    archivePrefix={arXiv},
}
```

Contact

Please post a GitHub issue or contact imyiyang@ust.hk if you have any questions.
