adding option for spacy large tokenizer #17
base: master
Conversation
Thank you so much for putting up this PR! I've added some comments that might need changes, but feel free to ignore the "Nit" ones.
Also, can you share the results you got with this model? If they are significantly better and reproducible, we can also change the defaults in the code!
Remove extra space
Co-authored-by: Smit Kiri <smit.kiri@gmail.com>
In terms of spacy_lg, I found that it does better than scispacy (both models) at tokenizing the clinical notes into BIO format. Using spacy_lg, I could match a greater proportion of the BIO annotations than with scispacy tokenization. (I haven't done the e2e task yet, just the BIO formatting so far.)
Thanks for making the changes! But I noticed that some of the tokens spacy_lg generates are just `\n`, which throws off the format of `train.txt` / `test.txt` used by the models. Every line in `train.txt` represents one token and should have the form `[token]<space>[label]`, but because of the `\n` characters, some lines are blank (just `\n`) followed by a line containing only the label.
This was the reason I used the function `scispacy_plus_tokenizer` for training the BiLSTM-CRF model; using plain `scispacy` raised errors. (Now that I'm writing this, I feel I should remove the `scispacy` option altogether.)
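A minimal sketch of one possible fix: drop whitespace-only tokens before writing the BIO lines. The helper name, token list, and labels below are hypothetical illustrations, not code from this repo:

```python
def write_bio_lines(tokens, labels):
    """Emit "[token] [label]" lines, skipping whitespace-only tokens
    (e.g. the bare "\\n" tokens that spacy_lg produces)."""
    lines = []
    for tok, lab in zip(tokens, labels):
        if not tok.strip():
            continue  # a "\n" token would produce a blank line + orphaned label
        lines.append(f"{tok} {lab}")
    return "\n".join(lines)

# Hypothetical example: the "\n" token is silently dropped.
print(write_bio_lines(["Aspirin", "\n", "81", "mg"],
                      ["B-DRUG", "O", "B-STRENGTH", "I-STRENGTH"]))
```

This keeps `train.txt` well-formed without changing which tokenizer is used.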
From a quick look, I agree with you that the spacy large tokens look better than the scispacy ones. Did you face any issues when trying to train a model with the generated files?
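One quick way to spot the malformed lines described above is to scan the generated file for non-blank lines that don't have exactly two fields. This is a hypothetical checker, not part of the repo:

```python
def find_bad_bio_lines(text):
    """Return 1-based line numbers of non-blank lines that are not
    exactly "[token] [label]" (e.g. an orphaned label after a "\\n" token)."""
    bad = []
    for i, line in enumerate(text.splitlines(), 1):
        if not line.strip():
            continue  # blank separator lines are allowed
        if len(line.split()) != 2:
            bad.append(i)
    return bad

# Hypothetical example: line 3 holds an orphaned label.
print(find_bad_bio_lines("Aspirin B-DRUG\n\nO"))
```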
```python
elif args.tokenizer == 'spacy_lg':
    import spacy
    tokenizer = spacy.load("en_core_web_lg")
```
It is much faster to run `generate_data.py` if you change this line to `tokenizer = spacy.load("en_core_web_lg").tokenizer`, since that skips the rest of the pipeline.
I found that the spacy large model performed better at retrieving annotated entities than scispacy ... it would be good to have this additional tokenizing option!