Idea: Allow for custom entity wordlist #7

EmilStenstrom · 2016-08-28T19:24:17Z

I've been playing around with UDPipe and it's simply the best NLP library I've seen so far. Great work! :)

The only problems I've seen are related to miscategorized PROPN tags. I see how these are the hardest ones to get right, given that they follow very few rules and are often multi-token. Given that PROPN detection is essentially the same thing as NER, and NER is often solved with big lists of entities (gazetteers), I think that's something that UDPipe could do too.

Mind you, I'm not saying that you should bundle these lists yourself. Instead you could let the developer who uses your library (and who knows the domain it's going to be used in) point to a text file with entities. You could then look up whether a word is an entity or not, and use that when deciding if something should be a PROPN or not. It could just be another feature, nothing more than that.
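To make the proposal concrete, here is a minimal sketch of the lookup side only. It assumes a hypothetical file format (one entity per line, `#` for comments) and case-insensitive matching; neither is something UDPipe defines, and the function names are made up for illustration.

```python
from typing import Iterable, Set


def load_gazetteer(lines: Iterable[str]) -> Set[str]:
    """Collect one entity per line, skipping blank lines and '#' comments.

    The one-entity-per-line format is an assumption, not a UDPipe format.
    """
    entities = set()
    for line in lines:
        entry = line.strip()
        if entry and not entry.startswith("#"):
            # casefold() gives case-insensitive matching (an assumption)
            entities.add(entry.casefold())
    return entities


def is_entity(word: str, gazetteer: Set[str]) -> bool:
    """Return True if the word appears in the gazetteer."""
    return word.casefold() in gazetteer


# In-memory stand-in for the contents of a user-supplied text file
sample_lines = ["# cities", "Stockholm", "", "Prague  "]
gazetteer = load_gazetteer(sample_lines)
```

With a real file the same function could be fed an open file handle, since iterating over a text file yields its lines. The boolean result is what would be handed to the tagger as one extra signal.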

Is this a good idea? Would it solve the problems I'm seeing with missing PROPN tags?

martinpopel · 2016-08-28T19:50:16Z

@foxik also created NameTag for NER. We have NER training data only for a few languages, but it would still be interesting to merge UDPipe and NameTag into one tool. Both are partially (will be fully) NN-based, so they could be trained jointly on the two tasks, with the lower layers shared.
Adding gazetteers as a feature for UDPipe is orthogonal to this (and much easier to implement I guess).

EmilStenstrom · 2016-08-28T19:57:59Z

@martinpopel As always, the problem is with training data :) I see that NameTag only has one model, for Czech. I'm looking to use this for 6-8 languages simultaneously, something that UDPipe is great at.

foxik · 2016-08-29T18:24:40Z

Hi @EmilStenstrom,

it is not obvious in which phase to use the gazetteer lists -- ideally they would be used during training, so that they are just one more feature for the algorithm; but then I would have to collect the lists for all languages (I am saying only collected, not generated), and users would not be able to add their own "custom" PROPN from a domain that is not covered by the training gazetteer lists (this second issue could be alleviated by allowing users to specify additional content for the training gazetteer lists).

Or people could use the gazetteer lists not as a "soft signal" (i.e., used as a feature in the classifier), but as a "hard signal", by forcing the PROPN tag for the given words (ignoring the classifier). As for the second possibility, that is already possible with UDPipe to some extent -- after calling model.tag, you can go through the words and replace the predicted tags by PROPN. (Then you can call model.parse so that the parser can see the tags.) (Of course, it would be a bit better to fix the tags before calling the tagger, so that the tagger can benefit from the already fixed POS tags like PROPN. Actually, this is quite an interesting possibility -- maybe we could have a way of saying in model.tag: "these tags/lemmas/features are already generated and you have to respect them". That would be useful also for stuff like emails, URLs, etc.)
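The "hard signal" variant can be sketched roughly as follows. This is only an illustration: `Word` is a stand-in for whatever word objects the UDPipe binding returns after tagging, `force_propn` is a hypothetical helper, and the case-sensitive longest-match strategy for multi-token names is an assumption. With UDPipe itself, the loop would run between model.tag and model.parse.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Word:
    """Stand-in for a tagged word; only the fields used here."""
    form: str
    upostag: str


def force_propn(words: List[Word],
                gazetteer: Set[Tuple[str, ...]],
                max_len: int = 4) -> int:
    """Overwrite the predicted tag with PROPN on every gazetteer match.

    Tries the longest token span first so multi-token names win over
    shorter ones. Returns how many tags were actually changed.
    """
    changed = 0
    i = 0
    while i < len(words):
        matched = 0
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = tuple(w.form for w in words[i:i + n])
            if key in gazetteer:
                matched = n
                break
        if matched:
            for w in words[i:i + matched]:
                if w.upostag != "PROPN":
                    w.upostag = "PROPN"
                    changed += 1
            i += matched
        else:
            i += 1
    return changed


# Pretend this is tagger output with two wrong tags
gazetteer = {("New", "York"), ("Emil",)}
sentence = [Word("Emil", "NOUN"), Word("visited", "VERB"),
            Word("New", "ADJ"), Word("York", "PROPN")]
changed = force_propn(sentence, gazetteer)
```

Because the override happens after tagging, the tagger itself never benefits from the corrected tags; only the parser does, which is the limitation noted above.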

Anyway, even if gazetteer lists would most likely improve PROPN, I am not going to deal with them in UDPipe for the time being, because I will implement a new tagging algorithm in the future (~6-9 months), which will be based on recurrent neural networks with character-level embeddings, amongst others. Similar architectures can perform NER at a state-of-the-art level without any gazetteers (http://arxiv.org/abs/1603.06270 or https://arxiv.org/abs/1603.01360), so I hope they will be able to recognize PROPN better than the current algorithm (which works quite differently and is quite bad at recognizing PROPN). But if this does not help, I will turn to gazetteer lists :-)

EmilStenstrom · 2016-09-11T16:04:21Z

Hi! Thanks for a great reply. Sorry for the late reply, I've been away on vacation and just got back.

Thanks for the detailed walkthrough of all the different options. Option 1 seems hard to get right. Option 2 is essentially building my own NER and plugging it into UDPipe (too hard for me at least). So that leaves Option 3, to patiently wait until a magical solution that solves all problems is created :)

Really looking forward to trying it out on real data!

EmilStenstrom · 2016-09-17T11:55:24Z

One way to slightly improve the current system during training would be to detect some entity types which have better gazetteers than others. https://clavin.bericotechnologies.com/clavin-core/ for instance, uses the free geonames data to classify locations: http://download.geonames.org/export/dump/

I'm thinking this could be used as a feature when training the models?
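As a rough sketch of how a location gazetteer could be pulled out of such a dump: the geonames files are tab-separated, with the place name in the second column and comma-separated alternate names in the fourth. The helper name and the sample rows below are hand-written for illustration and are not copied from the real dump.

```python
import csv
from typing import Iterable, Set


def names_from_geonames(lines: Iterable[str]) -> Set[str]:
    """Collect primary and alternate place names from geonames-style rows.

    Expects tab-separated rows with the name in column 2 and a
    comma-separated list of alternate names in column 4.
    """
    names = set()
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 4:
            continue  # skip malformed or truncated rows
        names.add(row[1])
        names.update(alt for alt in row[3].split(",") if alt)
    return names


# Illustrative rows only (id, name, asciiname, alternatenames)
sample_rows = [
    "1\tStockholm\tStockholm\tEstocolmo,Tukholma",
    "2\tPrague\tPrague\tPraha",
    "broken row",
]
locations = names_from_geonames(sample_rows)
```

The resulting set could then serve either as a training-time feature or as input to a tag override, depending on which of the options discussed above is chosen.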

EmilStenstrom · 2017-02-18T18:13:32Z

I'm back! :) 6 months have gone by and I just wanted to hear if the plans to implement an algorithm for improved PROPN detection are still intact? Is there a separate GitHub project I should follow for progress? Let me know if there's anything I can do to help.

If you end up looking for gazetteers for different languages you can use a wikidata dump. It's available in lots of different languages.

foxik · 2017-02-19T14:51:08Z

Hi there. One of our students defended a Master's thesis about entity linking -- the implementation should get merged into UDPipe (probably in several months). Therefore, I am not planning any special PROPN handling until then...

Also, we plan to change the tagging algorithm to use neural networks (again, in several months). That may also help a bit (if we pretrain embeddings on a large corpus or use some wikidata information as you suggest).

I am leaving this open as a reminder until any of the above lands in UDPipe.

EmilStenstrom · 2017-02-20T09:24:08Z

That's great news! Looking forward to that contribution being merged.

Thanks for keeping me up to date, and all the great work you do on UDPipe! :)

jwijffels · 2020-03-09T08:48:40Z

Regarding NER: a few questions, as I am looking into benchmarking a few traditional and neural-network-based tools, and possibly also creating an R wrapper around NameTag.

Are you planning to integrate https://github.com/ufal/nametag into UDPipe, or will the model structure be different and contain algorithms other than the maximum entropy Markov model?
Regarding the entity linking work by the master's student: is there a paper and/or code, and where is it available?

foxik · 2020-03-17T17:54:37Z

Sorry, I overlooked the notification mail.

We have much better results than the original NameTag, reported in https://arxiv.org/abs/1908.06926. This is the algorithm (the LSTM+CRF or seq2seq as denoted in Table 1/2, depending on what is better for which dataset) we plan to integrate into UDPipe -- so completely different from the current NameTag (which is very weak for English currently).

For NEL, I had a master student working on a first Czech prototype, which is described in https://is.cuni.cz/webapps/zzp/detail/176335/?lang=en. However, that was three years ago, so our current approach is quite different; but we do not have any released code/numbers/...

jwijffels · 2020-03-18T07:38:35Z

Both would be great additions to UDPipe!

foxik · 2023-02-16T09:01:50Z

Closing, the development of new models has moved to https://github.com/ufal/linpipe.

arademaker mentioned this issue Aug 29, 2017

method to plug a morpological guesser #34

Closed

foxik closed this as completed Feb 16, 2023
