r wrapper / morphodita #10

jwijffels · 2020-05-19T15:06:56Z

FYI.
I've built an R wrapper around nametag https://github.com/bnosac/nametagger so that I can easily use it to construct a baseline NER model and compare it to a baseline CRF or other deep-learning approaches which require more computing resources.

While doing this, I was wondering whether there is an easy way to extract a MorphoDiTa model from a .udpipe file, so that I can use it with tagger morphodita:model?

foxik · 2020-05-19T20:46:19Z

BTW, I consider nametag to be very weak currently -- it is not very accurate (it is unchanged since ~2014) and requires a tagger+lemmatizer to work; we use it only for Czech.

As for extracting the tagger -- the released UDPipe models actually contain two MorphoDiTa models -- one is a tagger predicting UPOS, XPOS & Feats, and the other one is a lemmatizer predicting UPOS & Lemmas. I do not think it is possible to extract the models using the existing binaries, but it would be trivial to write a binary that does, if you want it.

jwijffels · 2020-05-20T07:42:45Z

I have some .udpipe models where the part-of-speech tagger and the lemmatizer were trained as 1 MorphoDiTa model, for which I can still use tagger external for now to test NameTag out.

Some background:

I'm working on 15th-19th century corpora whose text is a combination of Dutch dialects with French & Latin, and which are obtained either by manual transcription of images or by automated (error-prone) extraction of text from images using Transkribus or Tesseract.

I don't mind using pre-deep-learning machine learning techniques; my laptop is still from 2013 and the users of the models are historians who have no clue about computer programming.

Feel free to provide any advice on tooling that would be more suitable. The requirements that I have are:

- a named entity recognition model can be trained and scored on a regular CPU-only computer in decent time
- the toolkit should not assume pretrained embeddings exist
- preferably written in C++ without any very complex Makefile wizardry so that I can easily wrap it up in an R package in 1 day instead of 1 week

For example I couldn't find any open-source biLSTM-CRF model which matches the above requirements. Would be interested in pointers to tooling you advise.

foxik · 2020-05-24T13:37:58Z

I do not really have any suggestions -- NameTag generally fulfils the "not much computational performance required" condition. The disadvantages are the required morphological model (but if you already have it, it is not a problem) and lower than state-of-the-art performance (it does not even use a CRF layer -- it uses a MEMM with dynamic decoding only; and the implemented feature templates are not that strong). But I do not have any low-resource alternative (we are still using it for Czech)...

When the new UDPipe appears (yes, it is bordering on vaporware at this moment, I am unfortunately aware), we plan NER + NEL modules too; but they will require substantially more computational resources (especially for training)...

jwijffels · 2020-05-25T07:11:30Z

Thanks for the messages and the advice. Looking forward to the vaporware announcements :)

jwijffels closed this as completed May 25, 2020
