Corpus-Specific Automatic Hyperlinking

Preprocess the data, then split it into training and test sets:

python preprocess.py <datadir>
python train_test_split.py <datadir> <min_examples> <test_size>

Use the following to save a summary of the final dataset:

TOKENIZERS_PARALLELISM=false python -m spacy debug data config.cfg --ignore-warnings --verbose --no-format --paths.train train.spacy --paths.dev test.spacy > data-summary.txt

To get more detail on the span labels, you may need to switch the spancat component in config.cfg from the spancat_singlelabel factory to the spancat factory.
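One way to make that switch without hand-editing is a small text substitution. The sketch below assumes the component's block in config.cfg contains a line of the form factory = "spancat_singlelabel"; the function name use_multilabel_spancat is hypothetical and not part of this repository. It only swaps the factory name, so any settings specific to spancat_singlelabel elsewhere in the block may still need adjusting by hand.

```python
import re

# Matches a whole line such as:  factory = "spancat_singlelabel"
# (assumed form; leading indentation and spacing around "=" are tolerated)
_FACTORY_LINE = re.compile(
    r'^([ \t]*factory[ \t]*=[ \t]*)"spancat_singlelabel"[ \t]*$',
    re.MULTILINE,
)


def use_multilabel_spancat(config_text: str) -> str:
    """Return config_text with the spancat_singlelabel factory replaced by spancat.

    Every other line is left untouched, so comments and ordering are preserved.
    """
    return _FACTORY_LINE.sub(r'\1"spancat"', config_text)
```

Writing the result to a separate file (say, a hypothetical config-debug.cfg) and passing that file to spacy debug data in place of config.cfg keeps the training config unchanged.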

Resources