New unified Python interface for training and evaluating NER models #738
Conversation
…ill guess the location if those flags are not given
…ations can be used when testing
python3 -m stanza.utils.training.run_ner.py hu_combined

This script expects the prepared data to be in
data/ner/dataset.train.json, dev.json, test.json
Should data/ner/dataset.train.json be data/ner/dataset/train.json?
I guess it should be in the form of an expression... the original preparation script put data in data/ner/{lang}_{dataset}.train.json
How about data/ner/{dataset}.train.json, {dataset}.dev.json, {dataset}.test.json? A minor issue, though.
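For illustration, the naming scheme suggested above could be expressed as a small helper. Note that `dataset_paths` is a hypothetical function written only to make the layout concrete; it is not part of the PR:

```python
import os

def dataset_paths(base_dir, dataset):
    # Hypothetical helper sketching the data/ner/{dataset}.{shard}.json
    # layout suggested in the comment above; not code from the actual PR.
    return {shard: os.path.join(base_dir, "ner", f"{dataset}.{shard}.json")
            for shard in ("train", "dev", "test")}
```

For example, `dataset_paths("data", "hu_combined")["train"]` would point at data/ner/hu_combined.train.json.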
most of the datasets online are in that script. Some of the datasets
have licenses which must be agreed to, so no attempt is made to
automatically download the data.
"""
I like these instructions. They are very informative and well structured. 👍
:)
def find_charlm(direction, language, charlm):
    saved_path = 'saved_models/charlm/{}_{}_{}_charlm.pt'.format(language, charlm, direction)
    if os.path.exists(saved_path):
        logger.info('Using model %s for %s charlm' % (saved_path, direction))
Minor style comment: for string formatting, I advocate using either the .format() method or Python f-strings. They are more Pythonic and more consistent with the rest of the existing Stanza codebase.
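To make the suggestion concrete, the %-style line in the diff and its f-string equivalent produce identical output (the path here is just a placeholder value, not a real model file):

```python
# Placeholder values standing in for the real arguments:
saved_path = 'saved_models/charlm/en_1billion_forward_charlm.pt'
direction = 'forward'

# Old-style % formatting, as in the diff:
old_style = 'Using model %s for %s charlm' % (saved_path, direction)

# Equivalent f-string, as suggested above:
f_string = f'Using model {saved_path} for {direction} charlm'
```

Either string can then be passed to logger.info() directly, with no change in the logged message.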
def main():
    common.main(run_treebank, "ner", "nertagger", add_ner_args)
I just noticed that all the training logic now lives in common.main(). Very nice move! Can you also add some instructions/comments to your new common script, like the comments you wrote for all the other util scripts?
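As a rough illustration of the dispatch pattern being discussed, a shared main() might look something like the sketch below. The signature and argument names here are assumptions for illustration; the real stanza.utils.training.common may differ:

```python
import argparse

def main(run_fn, task_name, model_name, add_args_fn=None, argv=None):
    """Hypothetical sketch of a shared dispatcher: parse the common CLI
    arguments, then call run_fn once per requested dataset."""
    parser = argparse.ArgumentParser(
        description=f"Train/evaluate a {task_name} model ({model_name})")
    parser.add_argument("datasets", nargs="+",
                        help="datasets to process, e.g. hu_combined")
    parser.add_argument("--mode", default="train",
                        choices=("train", "score_dev", "score_test"))
    if add_args_fn is not None:
        add_args_fn(parser)   # let the task add its own flags
    args = parser.parse_args(argv)
    for dataset in args.datasets:
        run_fn(args.mode, dataset)
```

With this shape, each task script only supplies its run function and any extra arguments, which is what makes the one-line main() in the diff possible.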
    logger.info("Using pretrain found in %s  To use a different pretrain, specify --wordvec_pretrain_file" % pretrains[0])
    return pretrains[0]

def run_treebank(mode, paths, treebank, short_name,
I am a little uncertain about the name "treebank" here, in both the function name and the argument. I understand that you want a consistent interface with the other training scripts, but the issue with "treebank" is that NER training usually does not use a conventional "treebank"; it uses a specific NER dataset/corpus. I think there are two ways you can go to avoid confusion:
-
Rename all "treebank" to "dataset" to be more general. This would benefit all potential future extensions of Stanza that do not rely on training with treebanks, such as sentiment, entity linking, etc. OR
-
Still use "treebank" in the function names and arguments, but write some comments here explaining that this is not really a "treebank" for NER training.
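The second option might look roughly like this. The docstring wording is just a sketch of the suggested comment, and the body is elided; the real run_treebank takes more arguments than shown here:

```python
def run_treebank(mode, paths, treebank, short_name):
    """Run one training/evaluation stage for a single dataset.

    Note: despite the name, for NER "treebank" is an NER dataset/corpus
    identifier such as "hu_combined", not a UD treebank.  The name is
    kept only for consistency with the other run_* training scripts.
    """
```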
The changes all look very good to me! I like having a unified common script for the training logic, with specific implementations for training the various models. Thanks for making this possible! I have added some comments on specific lines of the changes. I also changed the title to be more specific about what's happening, for better searchability. I haven't run the latest changes due to the environment setup needed, but I suggest you at least run a full NER training + evaluation cycle for two languages to double-check that the numbers are correct with the new changes.
Replaces the run_ner.sh script with a run_ner.py script - a bit more portable with a bit more functionality.
Also included are some upgrades to the charlm parameters for ner_tagger.py