adding option for spacy large tokenizer #17

yenniejun · 2022-03-10T20:00:10Z

I found that the spaCy large model performed better than scispacy at retrieving annotated entities, so it would be good to have this additional tokenizer option!

smitkiri

Thank you so much for opening this PR!! I've added some comments that may require changes, but feel free to ignore the "Nit" ones.

Also, can you share the results that you got with this model? If they are significantly better and reproducible, we can also change the defaults in the code!


yenniejun · 2022-03-16T21:20:26Z

As for spacy_lg, I found that it does a better job than scispacy (both models) of tokenizing the clinical notes for conversion into BIO format. With spacy_lg tokenization I could match a greater proportion of the BIO annotations than with scispacy tokenization. (I have not run the e2e task yet; so far I have only done the BIO formatting.)
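To make the "proportion of matched annotations" comparison concrete, here is a minimal sketch of how character-span annotations might be aligned to tokens and tagged in BIO format. It is an illustration under stated assumptions, not the code in generate_data.py: the names `bio_tags` and `matched_fraction`, the tuple layouts, and the example sentence are all hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, int, int]   # (text, start_char, end_char), assumed layout
Entity = Tuple[int, int, str]  # (start_char, end_char, label), assumed layout


def bio_tags(tokens: List[Token], entities: List[Entity]) -> List[str]:
    """Assign a BIO tag to each token from character-level entity spans."""
    tags = ["O"] * len(tokens)
    for ent_start, ent_end, label in entities:
        inside = False
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            # A token is tagged only if it lies fully inside the entity span.
            if tok_start >= ent_start and tok_end <= ent_end:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags


def matched_fraction(tokens: List[Token], entities: List[Entity]) -> float:
    """Share of entities whose start and end both fall on token boundaries."""
    if not entities:
        return 1.0
    starts = {start for _, start, _ in tokens}
    ends = {end for _, _, end in tokens}
    hits = sum(1 for s, e, _ in entities if s in starts and e in ends)
    return hits / len(entities)


# Hypothetical note text: "Take aspirin 81mg daily"
entities = [(5, 12, "Drug"), (13, 15, "Strength")]

# A coarse tokenization keeps "81mg" together, so "81" cannot be matched.
coarse = [("Take", 0, 4), ("aspirin", 5, 12), ("81mg", 13, 17), ("daily", 18, 23)]
# A finer tokenization splits "81" from "mg", so both entities align.
fine = [("Take", 0, 4), ("aspirin", 5, 12), ("81", 13, 15), ("mg", 15, 17),
        ("daily", 18, 23)]

coarse_tags = bio_tags(coarse, entities)
fine_tags = bio_tags(fine, entities)
```

A tokenizer whose boundaries agree more often with the annotated spans yields a higher `matched_fraction`, which is one way to quantify the difference described above.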

smitkiri

Thanks for making the changes!! But I noticed that some tokens that spacy_lg generates contain just \n, which throws off the format of train.txt / test.txt used by the models. Every line in train.txt represents one token and should have the form [token]<space>[label], but because of the \n characters, there are some lines that are just blank (\n), each followed by a line with just the label.

This was the reason why I use the function scispacy_plus_tokenizer for training the BiLSTM-CRF model; using just scispacy raised errors. (Now that I'm writing this, I feel I should remove the scispacy option altogether.)
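One possible way to guard against this, sketched here as an assumption rather than as the fix adopted in this PR, is to skip whitespace-only tokens before writing the [token]<space>[label] lines. The helper name `to_conll_lines` is hypothetical, and the sketch assumes whitespace tokens always carry the O label, so dropping them does not break a BIO sequence.

```python
from typing import Iterable, List, Tuple


def to_conll_lines(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Build "[token] [label]" lines, dropping whitespace-only tokens."""
    lines = []
    for token, label in pairs:
        # A token such as "\n" or "  " would otherwise produce a blank line
        # followed by a line that holds only the label.
        if not token.strip():
            continue
        lines.append(f"{token} {label}")
    return lines


pairs = [("aspirin", "B-Drug"), ("\n", "O"), ("81", "B-Strength"), ("  ", "O")]
lines = to_conll_lines(pairs)
```

Writing `"\n".join(lines)` to train.txt would then give exactly one token and one label per line.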

From a quick look, I agree with you that the spacy large tokens look better than scispacy ones. Did you face any issues when trying to train a model with the generated files?

smitkiri · 2022-03-19T21:48:45Z

generate_data.py

+
+    elif args.tokenizer == 'spacy_lg':
+        import spacy
+        tokenizer = spacy.load("en_core_web_lg")


It is much faster to run generate_data.py if you change this line to tokenizer = spacy.load("en_core_web_lg").tokenizer
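The speedup comes from the fact that calling the full pipeline object also runs its other components (such as the tagger, parser, and NER) on every text, whereas its .tokenizer attribute only splits the text into tokens. Both return a Doc when called, so the code that iterates over the tokens does not need to change. A minimal sketch, in which the helper names `load_tokenizer_only` and `token_texts` are hypothetical and en_core_web_lg is assumed to be installed:

```python
from typing import Callable, List


def load_tokenizer_only(model_name: str = "en_core_web_lg"):
    """Load a spaCy model and return only its tokenizer component."""
    import spacy  # imported lazily; assumes spaCy and the model are installed

    return spacy.load(model_name).tokenizer


def token_texts(tokenizer: Callable, text: str) -> List[str]:
    """Return token strings; works for a full pipeline or a bare tokenizer."""
    return [token.text for token in tokenizer(text)]


# Usage (not executed here because it needs the downloaded model):
#   tokenizer = load_tokenizer_only()
#   token_texts(tokenizer, "Take aspirin 81 mg daily.")
```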

adding option for spacy large tokenizer

5571a57

smitkiri self-requested a review March 10, 2022 22:36

smitkiri requested changes Mar 10, 2022


yenniejun and others added 4 commits March 16, 2022 16:55

Update generate_data.py

293c799

Remove extra space Co-authored-by: Smit Kiri <smit.kiri@gmail.com>

Moving spacy_lg elif statement up

2522fd8

Update req.txt to download spacy_lg model

b8f6bef

Merge branch 'master' of https://github.com/yenniejun/ehr-relation-ex…

fe4151b

…traction

smitkiri reviewed Mar 19, 2022
