Python package to pre-train BART from scratch with your corpus

sobamchan/engawa

engawa

NOT YET FULLY TESTED

A simple implementation to pre-train BART from scratch with your own corpus.

Usage

Soon, I will make this pip-installable with CLI commands, but at the moment you need to run it from the repository.

Installation

pip install engawa

Build tokenizer

engawa train-tokenizer --data-path /path/to/train.txt --save-dir /path/to/save

# Check out the other options with
engawa train-tokenizer --help
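
If you would rather drive this step from a Python script than from the shell, a minimal sketch follows. It only wraps the command shown above with `subprocess`; the helper names `build_train_tokenizer_cmd` and `train_tokenizer` are hypothetical and not part of engawa, and creating the save directory up front is an assumption about what the CLI tolerates.

```python
import shutil
import subprocess
from pathlib import Path


def build_train_tokenizer_cmd(data_path, save_dir):
    """Return the argv list for `engawa train-tokenizer`.

    Only the two flags shown in this README are used.
    """
    return [
        "engawa", "train-tokenizer",
        "--data-path", str(data_path),
        "--save-dir", str(save_dir),
    ]


def train_tokenizer(data_path, save_dir):
    """Run the tokenizer step; raise if the CLI is missing or exits non-zero."""
    if shutil.which("engawa") is None:
        raise RuntimeError("`engawa` was not found on PATH; install it first")
    # Assumption: pre-creating the directory is harmless even if the CLI does it too.
    Path(save_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_train_tokenizer_cmd(data_path, save_dir), check=True)
```

Calling `train_tokenizer("/path/to/train.txt", "/path/to/save")` is then equivalent to the shell command above, with a failure surfaced as a Python exception.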

Pre-train BART

engawa train-model \
  --tokenizer-file /path/to/tokenizer.json \
  --train-file /path/to/train.txt \
  --val-file /path/to/val.txt \
  --default-root-dir /path/to/save/things

# Check out the other options with
engawa train-model --help
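
The command above takes separate training and validation files. If you only have a single corpus file, the sketch below shows one way to split it. It assumes one example per line, which this README does not confirm; the function name `split_corpus`, the default 5% validation ratio, and the fixed seed are illustrative choices, not engawa defaults.

```python
import random
from pathlib import Path


def _write_lines(path, lines):
    """Write lines with a trailing newline; write an empty file if there are none."""
    text = "\n".join(lines) + "\n" if lines else ""
    Path(path).write_text(text, encoding="utf-8")


def split_corpus(corpus_path, train_path, val_path, val_ratio=0.05, seed=0):
    """Shuffle the non-empty lines of a corpus and split them into train/val files.

    Returns (n_train, n_val). The whole file is read into memory.
    """
    text = Path(corpus_path).read_text(encoding="utf-8")
    lines = [line for line in text.splitlines() if line.strip()]
    random.Random(seed).shuffle(lines)  # seeded, so the split is reproducible
    # Keep at least one validation line whenever there is more than one line.
    n_val = max(1, int(len(lines) * val_ratio)) if len(lines) > 1 else 0
    val, train = lines[:n_val], lines[n_val:]
    _write_lines(train_path, train)
    _write_lines(val_path, val)
    return len(train), len(val)
```

The two output paths can then be passed as `--train-file` and `--val-file`. For a corpus too large to hold in memory, a streaming split would be needed instead.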
