Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New unified Python interface for training and evaluating NER models #738

Merged
merged 5 commits into from
Jun 30, 2021

Conversation

AngledLuffa
Copy link
Collaborator

Replaces the run_ner.sh script with a run_ner.py script - a bit more portable with a bit more functionality.

Also included are some upgrades to the charlm parameters for ner_tagger.py

@AngledLuffa AngledLuffa merged commit 28b619b into dev Jun 30, 2021
@AngledLuffa AngledLuffa deleted the run_ner branch June 30, 2021 17:01
python3 -m stanza.utils.training.run_ner.py hu_combined

This script expects the prepared data to be in
data/ner/dataset.train.json, dev.json, test.json
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data/ner/dataset.train.json should be data/ner/dataset/train.json?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it should be in the form of an expression... the original preparation script put data in data/ner/{lang}_{dataset}.train.json

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about data/ner/{dataset}.train.json, {dataset}.dev.json, {dataset}.test.json? Minor issue though.

most of the datasets online is in that script. Some of the datasets
have licenses which must be agreed to, so no attempt is made to
automatically download the data.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like these instructions. They are very informative and well structured. 👍

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:)

def find_charlm(direction, language, charlm):
saved_path = 'saved_models/charlm/{}_{}_{}_charlm.pt'.format(language, charlm, direction)
if os.path.exists(saved_path):
logger.info('Using model %s for %s charlm' % (saved_path, direction))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor style comment: for formatting the strings, I am an advocate for either using the .format() function or using the python f-string. They are more pythonic and also more consistent with other places in the existing Stanza codebase.



def main():
common.main(run_treebank, "ner", "nertagger", add_ner_args)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed that you are now having all the training logic code in common.main(). Very nice move! Can you also write some instructions/comments in your new common script, just like the comments that you wrote for all other util scripts?

logger.info("Using pretrain found in %s To use a different pretrain, specify --wordvec_pretrain_file" % pretrains[0])
return pretrains[0]

def run_treebank(mode, paths, treebank, short_name,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little uncertain about the name "treebank" here in both the function name and the argument. I understand that you want to have a consistent interface with other training scripts. But the issue with "treebank" is that NER training often does not require a conventional "treebank", but rather a specific NER dataset/corpus. I think there are two ways you can go to avoid confusion:

  • Rename all "treebank" to "dataset" to be more general. This will benefit all potential future extensions of Stanza that do not rely on training with "treebanks", such as sentiment, entity linking, etc. OR

  • Still use "treebank" in the function names and arguments, but write some comments here explaining that this is not really "treebank" for NER training.

@yuhaozhang yuhaozhang changed the title Run ner New unified Python interface for training and evaluating NER models Jun 30, 2021
@yuhaozhang
Copy link
Member

The changes all look very good to me! I like the move of having a unified common script for the training logic, and then having specific implementations for training various models. Thanks for making this possible!

I have added some comments to specific lines of changes. I also changed to title to be more specific about what's happening for better searchability.

I haven't run the latest changes due to the environment setup needed. But I suggest you to at least run a full NER training + evaluation cycle for two languages to double check if the numbers are correct with the new changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants