rynoV/CPSC-599-NLP-project

Corpus-Specific Automatic Hyperlinking

python preprocess.py <datadir>
python train_test_split.py <datadir> <min_examples> <test_size>

Use the following to save a summary of the final dataset:

TOKENIZERS_PARALLELISM=false python -m spacy debug data config.cfg --ignore-warnings --verbose --no-format --paths.train train.spacy --paths.dev test.spacy > data-summary.txt

To get more detail on the span labels, you may need to change the spancat component in config.cfg to use the spancat factory instead of spancat_singlelabel.
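The factory switch can be scripted rather than edited by hand. The following is a minimal sketch, assuming config.cfg contains a line of the form factory = "spancat_singlelabel" in the spancat component section; the helper name switch_spancat_factory is hypothetical and not part of this repository or of spaCy. Other settings in that section may be specific to one factory and may also need adjusting, which this sketch does not attempt.

```python
import re

# Hypothetical helper: rewrites only the factory line, nothing else.
# Assumes the config uses the usual `factory = "..."` line format.
_FACTORY_LINE = re.compile(
    r'^([ \t]*factory[ \t]*=[ \t]*)"spancat_singlelabel"[ \t]*$',
    re.MULTILINE,
)


def switch_spancat_factory(config_text: str) -> str:
    """Return config text with the spancat_singlelabel factory replaced by spancat."""
    return _FACTORY_LINE.sub(r'\1"spancat"', config_text)


if __name__ == "__main__":
    example = '[components.spancat]\nfactory = "spancat_singlelabel"\n'
    print(switch_spancat_factory(example))
```

Writing the result to a separate copy of the config and passing that copy to the debug data command keeps the training configuration itself unchanged.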
