Idea: Allow for custom entity wordlist #7
Comments
@foxik also made NameTag, which does NER. We have NER training data only for a few languages, but it would still be interesting to merge UDPipe and NameTag into one tool. Both are partially (will be fully) NN-based, so they could be trained jointly on the two tasks, with the lower layers shared. |
@martinpopel As always, the problem with training data :) I see that NameTag only has one model, for Czech. I'm looking to use this for 6-8 languages simultaneously, something that UDPipe is great at. |
Hi @EmilStenstrom, it is not obvious in which phase to use the gazetteer lists. Ideally they would be used during training, so that they are only a feature of the algorithm; but then the lists would have to be collected for all languages by me (I am saying only collected, not generated), and users would not be able to add their own "custom" PROPN from a domain which is not covered by the training gazetteer lists (the second issue could be alleviated by allowing users to specify additional content for the training gazetteer lists). Or people could use the gazetteer lists not as a "soft signal" (i.e., as a feature in the classifier), but as a "hard signal", by forcing the PROPN tag for the given words (ignoring the classifier). As for the second possibility, that is already possible with UDPipe to some extent -- after calling [...] Anyway, even if gazetteer lists would most likely improve PROPN, I am not going to deal with them in UDPipe for the time being, because I will implement a new tagging algorithm in the future (~6-9 months), which will be based on recurrent neural networks with character-level embeddings, amongst others. Similar architectures can perform NER at a state-of-the-art level without any gazetteers (http://arxiv.org/abs/1603.06270 or https://arxiv.org/abs/1603.01360), so I hope they will be able to recognize PROPN better than the current algorithm (which works quite differently and is quite bad at recognizing PROPN). But if this does not help, I will turn to gazetteer lists :-) |
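The "hard signal" option described above can be sketched as a post-processing step over CoNLL-U output. This is only an illustration in plain Python, not UDPipe's API: `force_propn` is a hypothetical helper, and it assumes the tagger's output is already available as a CoNLL-U string and that the gazetteer holds single-token word forms.

```python
from typing import Set


def force_propn(conllu: str, gazetteer: Set[str]) -> str:
    """Set UPOS to PROPN for every token whose form is in `gazetteer`.

    Hypothetical helper: it ignores the classifier's decision entirely,
    which is what makes this a "hard" rather than a "soft" signal.
    """
    out = []
    for line in conllu.splitlines():
        cols = line.split("\t")
        # Ordinary token lines have 10 tab-separated columns and a purely
        # numeric ID. Comments, blank lines, multiword-token ranges ("1-2")
        # and empty nodes ("1.1") pass through unchanged.
        if len(cols) == 10 and cols[0].isdigit() and cols[1] in gazetteer:
            cols[3] = "PROPN"  # index 3 is the UPOS column
        out.append("\t".join(cols))
    return "\n".join(out)


sample = (
    "# text = Visit Spotify today\n"
    "1\tVisit\tvisit\tVERB\t_\t_\t0\troot\t_\t_\n"
    "2\tSpotify\tspotify\tNOUN\t_\t_\t1\tobj\t_\t_\n"
    "3\ttoday\ttoday\tNOUN\t_\t_\t1\tobl\t_\t_\n"
)
fixed = force_propn(sample, {"Spotify"})
```

A side effect of overriding the tag after the fact is that the lemma, features and dependency relation were all predicted for the original tag, so they may no longer be consistent with PROPN.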
Hi! Thanks for a great reply. Sorry for the late reply, I've been away on vacation and just got back. Thanks for the detailed walkthrough of all the different options. Option 1 seems hard to get right. Option 2 is essentially building my own NER and plugging it into UDPipe (too hard for me at least). So that leaves Option 3, to patiently wait until a magical solution that solves all problems is created :) Really looking forward to trying it out on real data! |
One way to slightly improve the current system during training would be to pick out some entity types which have better gazetteers than others. https://clavin.bericotechnologies.com/clavin-core/ for instance, uses the free GeoNames data to classify locations: http://download.geonames.org/export/dump/ I'm thinking this could be used as a feature when training the models? |
I'm back! :) 6 months have gone by and I just wanted to hear if the plans to implement an algorithm for improved PROPN detection are still intact? Is there a separate GitHub project I should follow for progress? Let me know if there's anything I can do to help. If you end up looking for gazetteers for different languages you can use a Wikidata dump. It's available in lots of different languages. |
Hi there. One of our students defended a Master's thesis about entity linking -- the implementation should get merged into UDPipe (probably in several months). Therefore, I am not planning any special PROPN handling until then... Also, we plan to change the tagging algorithm to use neural networks (again, in several months). That may also help a bit (if we pretrain embeddings on a large corpus or use some Wikidata information as you suggest). I am leaving this open as a reminder until any of the above lands in UDPipe. |
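As a rough illustration of the GeoNames idea, the sketch below turns dump rows into a per-token 0/1 "is a known location" feature. It relies on the GeoNames dump being tab-separated with the place name in the second column; the function names are hypothetical, the sample rows are made up and abbreviated, and nothing in this thread says UDPipe's trainer can consume such a feature.

```python
from typing import Iterable, List, Set


def load_geonames_names(lines: Iterable[str]) -> Set[str]:
    """Collect lowercased place names from GeoNames dump rows."""
    names = set()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Column 1 is the geonameid, column 2 the name of the place.
        if len(cols) > 1 and cols[1]:
            names.add(cols[1].lower())
    return names


def location_feature(tokens: List[str], names: Set[str]) -> List[int]:
    """Return 1 for tokens found in the location list, else 0."""
    return [1 if tok.lower() in names else 0 for tok in tokens]


# Made-up, abbreviated rows; real dump rows have more columns.
rows = ["1\tStockholm\tStockholm\t\n", "2\tPraha\tPrague\t\n"]
names = load_geonames_names(rows)
feature = location_feature(["I", "visited", "Stockholm"], names)
```

With a real dump, the rows would come from iterating over the opened file instead of a list. Multi-word place names would need a span lookup rather than this token-by-token check.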
That's great news! Looking forward to that contribution being merged. Thanks for keeping me up to date, and all the great work you do on UDPipe! :) |
Regarding NER: a few questions, as I am looking to benchmark a few traditional and neural-network-based tools, and possibly also to create an R wrapper around NameTag.
|
Sorry, I overlooked the notification mail. We have much better results than the original NameTag, reported in https://arxiv.org/abs/1908.06926. This is the algorithm (the LSTM+CRF or seq2seq as denoted in Table 1/2, depending on which is better for which dataset) we plan to integrate into UDPipe -- so completely different from the current NameTag (which is very weak for English currently). For NEL, I had a master's student working on a first Czech prototype, which is described in https://is.cuni.cz/webapps/zzp/detail/176335/?lang=en. However, that was already three years ago, so our current approach is quite different; but we do not have any released code/numbers/... |
would both be great additions to UDPipe! |
Closing, the development of new models has moved to https://github.com/ufal/linpipe. |
I've been playing around with UDPipe and it's simply the best NLP library I've seen so far. Great work! :)
The only problems I've seen are related to miscategorized PROPN tags. I see how these are the hardest ones to get right, given that they follow very few rules and are often multi-token. Given that PROPN detection is the same thing as named entity recognition, and NER is often solved with big lists of entities (gazetteers), I think that's something that UDPipe could do too.
Mind you, I'm not saying that you should bundle these lists yourself. Instead you could let the developer that uses your library (and that knows the domain it's going to be used in) point to a text file with entities. You could then look up whether a word is an entity or not, and use that when deciding if something should be a PROPN or not. It could just be another feature, nothing more than that.
Is this a good idea? Would it solve the problems I'm seeing with missing PROPN tags?
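To make the proposal concrete, here is a minimal sketch of the lookup I have in mind, written in plain Python rather than against UDPipe. Everything in it is an assumption for illustration: the file format (one entity per line, tokens separated by whitespace) and the names `load_entities` and `mark_entities` are hypothetical. It handles multi-token entities by trying the longest match first.

```python
from typing import Iterable, List, Set, Tuple


def load_entities(lines: Iterable[str]) -> Set[Tuple[str, ...]]:
    """Read one entity per line; blank lines are skipped."""
    return {tuple(line.split()) for line in lines if line.strip()}


def mark_entities(tokens: List[str], entities: Set[Tuple[str, ...]]) -> List[bool]:
    """Flag each token that is part of a listed entity (longest match first)."""
    max_len = max((len(e) for e in entities), default=0)
    flags = [False] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span starting at i, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in entities:
                matched = n
                break
        if matched:
            for j in range(i, i + matched):
                flags[j] = True
            i += matched
        else:
            i += 1
    return flags


entities = load_entities(["New York\n", "Spotify\n", "\n"])
flags = mark_entities(["I", "love", "New", "York", "and", "Spotify"], entities)
```

The resulting flags could be handed to the tagger as one more feature, or used to override its output; which of the two is better is exactly what I'm asking about.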