python preprocess.py <datadir>
python train_test_split.py <datadir> <min_examples> <test_size>
Use the following to save a summary of the final dataset:
TOKENIZERS_PARALLELISM=false python -m spacy debug data config.cfg --ignore-warnings --verbose --no-format --paths.train train.spacy --paths.dev test.spacy > data-summary.txt
You may have to switch the spancat component in config.cfg to use the spancat factory instead of spancat_singlelabel to get more details on the span labels.
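
As a rough sketch, the switch could look like the excerpt below. The section name [components.spancat] is an assumption based on the component being called spancat; use whatever name your config.cfg actually gives the component.

```ini
# config.cfg (excerpt); section name assumed, adjust to your pipeline
[components.spancat]
# was: factory = "spancat_singlelabel"
factory = "spancat"
```

Settings that only spancat_singlelabel accepts (for example negative_weight or allow_overlap, if present in your config) may need to be removed or commented out for the config to validate. Editing a copy of config.cfg for the summary keeps the training configuration unchanged.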