New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

New unified Python interface for training and evaluating NER models #738

Merged

AngledLuffa merged 5 commits into dev from run_ner

Jun 30, 2021

Collaborator

AngledLuffa commented Jun 30, 2021

Replaces the run_ner.sh script with a run_ner.py script - a bit more portable with a bit more functionality.

Also included are some upgrades to the charlm parameters for ner_tagger.py

AngledLuffa added 5 commits

June 29, 2021 00:28


          Add explicit flags for specifying where the charlm files are. Will st…

616202a

…ill guess the location if those flags are not given


          Make prepare_ner_dataset runnable as a module

33c0d2d


          Convert the run_ner.sh script to python

1b61580


          Pass the charlm filenames to the trainer so that different charlm loc…

ea03c90

…ations can be used when testing


          Get the char_hidden_dim from the charlm if a charlm is used

cfbf896

AngledLuffa merged commit 28b619b into dev

AngledLuffa deleted the run_ner branch

June 30, 2021 17:01

yuhaozhang reviewed

View reviewed changes

stanza/utils/training/run_ner.py

+                python3 -m stanza.utils.training.run_ner.py hu_combined
+              This script expects the prepared data to be in
+                data/ner/dataset.train.json, dev.json, test.json

Member

yuhaozhang Jun 30, 2021

data/ner/dataset.train.json should be data/ner/dataset/train.json?

Collaborator Author

AngledLuffa Jun 30, 2021

I guess it should be in the form of an expression... the original preparation script put data in data/ner/{lang}_{dataset}.train.json

Member

yuhaozhang Jun 30, 2021

How about data/ner/{dataset}.train.json, {dataset}.dev.json, {dataset}.test.json? Minor issue though.

yuhaozhang reviewed

View reviewed changes

stanza/utils/training/run_ner.py

+              most of the datasets online is in that script.  Some of the datasets
+              have licenses which must be agreed to, so no attempt is made to
+              automatically download the data.
+              """

Member

yuhaozhang Jun 30, 2021

I like these instructions. They are very informative and well structured. 👍

Collaborator Author

AngledLuffa Jun 30, 2021

:)

yuhaozhang reviewed

View reviewed changes

stanza/utils/training/run_ner.py

+              def find_charlm(direction, language, charlm):
+                  saved_path = 'saved_models/charlm/{}_{}_{}_charlm.pt'.format(language, charlm, direction)
+                  if os.path.exists(saved_path):
+                      logger.info('Using model %s for %s charlm' % (saved_path, direction))

Member

yuhaozhang Jun 30, 2021

Minor style comment: for formatting the strings, I am an advocate for either using the .format() function or using the python f-string. They are more pythonic and also more consistent with other places in the existing Stanza codebase.

yuhaozhang reviewed

View reviewed changes

stanza/utils/training/run_ner.py



		def main():
		common.main(run_treebank, "ner", "nertagger", add_ner_args)

Member

yuhaozhang Jun 30, 2021

I just noticed that you are now having all the training logic code in common.main(). Very nice move! Can you also write some instructions/comments in your new common script, just like the comments that you wrote for all other util scripts?

yuhaozhang reviewed

View reviewed changes

stanza/utils/training/run_ner.py

+                  logger.info("Using pretrain found in %s  To use a different pretrain, specify --wordvec_pretrain_file" % pretrains[0])
+                  return pretrains[0]
+              def run_treebank(mode, paths, treebank, short_name,

Member

yuhaozhang Jun 30, 2021

I am a little uncertain about the name "treebank" here in both the function name and the argument. I understand that you want to have a consistent interface with other training scripts. But the issue with "treebank" is that NER training often does not require a conventional "treebank", but rather a specific NER dataset/corpus. I think there are two ways you can go to avoid confusion:

Rename all "treebank" to "dataset" to be more general. This will benefit all potential future extensions of Stanza that do not rely on training with "treebanks", such as sentiment, entity linking, etc. OR
Still use "treebank" in the function names and arguments, but write some comments here explaining that this is not really "treebank" for NER training.

yuhaozhang changed the title ~~Run ner~~ New unified Python interface for training and evaluating NER models

Member

yuhaozhang commented Jun 30, 2021

The changes all look very good to me! I like the move of having a unified common script for the training logic, and then having specific implementations for training various models. Thanks for making this possible!

I have added some comments to specific lines of changes. I also changed to title to be more specific about what's happening for better searchability.

I haven't run the latest changes due to the environment setup needed. But I suggest you to at least run a full NER training + evaluation cycle for two languages to double check if the numbers are correct with the new changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment