r wrapper / morphodita #10

Closed
jwijffels opened this issue May 19, 2020 · 4 comments

@jwijffels

FYI.
I've built an R wrapper around nametag https://github.com/bnosac/nametagger so that I can easily use it to construct a baseline NER model and compare it to a baseline CRF or other deep-learning approaches which require more computing resources.

While doing this, I was wondering whether there is an easy way to extract a morphodita model from a .udpipe file, so that I can use it with tagger morphodita:model?

@foxik
Member

foxik commented May 19, 2020

BTW, I consider nametag to be very weak currently -- it is not very accurate (it is unchanged since ~2014) and requires a tagger+lemmatizer to work; we use it only for Czech.

As for extracting the tagger -- the released UDPipe models actually contain two MorphoDita models -- one is a tagger predicting UPOS, XPOS & Feats, and the other one is a lemmatizer predicting UPOS & Lemmas. I do not think it is possible to extract the models using existing binaries, but it would be trivial to write one, if you want it.

@jwijffels
Author

jwijffels commented May 20, 2020

I have some .udpipe models in which the part-of-speech tagger and the lemmatizer were trained as a single morphodita model. For those I can still use tagger external for now to test NameTag out.

Some background:

  • I'm working on 15th-19th century corpora whose text is a combination of Dutch dialects with French & Latin.
  • The texts were obtained either by manual transcription of images or by automated (error-ridden) extraction of text from images with Transkribus or Tesseract.

I don't mind using pre-deep-learning machine learning techniques: my laptop is still from 2013 and the users of the models are historians who have no clue about computer programming.

Feel free to provide any advice on tooling that would be more suitable. The requirements that I have are:

  • a named entity recognition model can be trained and scored on a regular CPU-only computer in decent time
  • the toolkit should not assume pretrained embeddings exist
  • preferably written in C++ without any very complex Makefile wizardry, so that I can easily wrap it up in an R package in 1 day instead of 1 week

For example, I couldn't find any open-source biLSTM-CRF model that matches the above requirements. I would be interested in pointers to any tooling you would advise.

@foxik
Member

foxik commented May 24, 2020

I do not really have any suggestions -- NameTag generally fulfils the "low computational requirements" criterion. The disadvantages are the required morphological model (but if you already have it, it is not a problem) and lower than state-of-the-art performance (it does not even use a CRF layer -- it uses a MEMM with dynamic decoding only; and the implemented feature templates are not that strong). But I do not have any low-resource alternative (we are still using it for Czech)...
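
To illustrate the MEMM remark above: a MEMM scores each label with a locally normalised classifier conditioned on the previous label, and the best label sequence is then found by dynamic programming. The following is only a minimal Python sketch under that reading (it assumes "dynamic decoding" means Viterbi-style dynamic-programming decoding); it is not NameTag's actual code. The function names, the two-label set, and the toy probabilities are all hypothetical.

```python
import math

def viterbi_memm(n_steps, labels, local_log_prob, start="<s>"):
    """Most probable label sequence under a MEMM-style local model.

    local_log_prob(t, prev_label, label) is assumed to return
    log P(label | prev_label, observation at position t).
    """
    if n_steps == 0:
        return []
    # best[t][y]: best log-probability of any path ending in label y at position t
    best = [{y: local_log_prob(0, start, y) for y in labels}]
    back = []  # back[t-1][y]: best predecessor of label y at position t
    for t in range(1, n_steps):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + local_log_prob(t, p, y))
            scores[y] = best[-1][prev] + local_log_prob(t, prev, y)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Start from the best final label and follow the back-pointers.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Hypothetical toy model: capitalised tokens are likely person names,
# and a name is likely to continue after a name.
TOKENS = ["Jan", "van", "Eyck"]
LABELS = ["O", "PER"]

def toy_log_prob(t, prev_label, label):
    if TOKENS[t][0].isupper():
        p_per = 0.7
    elif prev_label == "PER":
        p_per = 0.6
    else:
        p_per = 0.1
    return math.log(p_per if label == "PER" else 1.0 - p_per)

print(viterbi_memm(len(TOKENS), LABELS, toy_log_prob))
```

A CRF would use the same kind of decoding but normalise scores over the whole sequence rather than per position, which is the difference the comment alludes to.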

When the new UDPipe appears (yes, it is bordering on vaporware at this moment, I am unfortunately aware), we plan NER + NEL modules too; but they will require substantially more computational resources (especially for training)...

@jwijffels
Author

Thanks for the messages and the advice. Looking forward to the vaporware announcements :)
