
entity labeling #7

Open
jtyoui opened this issue Jun 26, 2018 · 6 comments
Comments


jtyoui commented Jun 26, 2018

Excuse me, how is the entity labeling done?

mikeiannacone (Member) commented

I'm not sure what you mean. Can you be more specific?


jtyoui commented Jun 27, 2018

Could you tell me how you label entities for the NER?


jtyoui commented Jun 27, 2018

I want to know how I can extract entity relationships in Chinese. If I annotate the entities first, what should I do with them?


mikeiannacone commented Jun 27, 2018

Ok, I'm still not quite sure what you need.

Do you want to present the data to the user in Chinese? This wouldn't really need any changes to the NLP, database, or anything else on the back end. I could show you what changes would be needed in the UI (and possibly in the REST API, if needed.)

Do you have documents in Chinese that you would like to label and store? This would definitely require new sets of training data, and may also need some other changes to the NLP. That would probably be more difficult, but I would have to ask the other team members for more input on specifics.

Anyway, let me know what you need and then hopefully I'll have more specific information.

wawang250 commented

Hi, actually we are trying to do Named Entity Recognition on a set of Chinese documents, but these documents are not labeled. We have tried your project on English files and it worked very well.

We would like to know how we should label our documents, or the entities in these documents, so that we can make a proper training set. Or could you give us a small demo of your labeled data set, so we have a clue of where to start?

Thanks again for your reply.

mikeiannacone (Member) commented

Ok, after thinking through this, and getting some input from the rest of the team, I think I can point you guys in the right direction on this.

To support any non-English documents, you will need to make some changes to the entity extractor and the relation extractor. Both repos contain updated and reasonably detailed README files that describe them, but to summarize: the entity extractor labels the "entities" it finds in the text, and the relation extractor decides how those entities are related to each other. For example, if a sentence contains two version numbers and two software products, the entity extractor would find and label them, and the relation extractor would match each product with its version.

The entity extractor uses Stanford’s CoreNLP for a lot of non-domain-specific tasks, including sentence splitting, tokenizing, part of speech (POS) tagging, and generating the parse tree. This library apparently has Chinese models that can be loaded, but you'll need to look through their documentation for the specifics.
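As a rough illustration (not the extractor's actual setup code), loading CoreNLP with its Chinese models looks something like the sketch below. It assumes the Chinese models jar is on the classpath, along with the StanfordCoreNLP-chinese.properties file it ships with; check the CoreNLP documentation for the version you are using.

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.Properties;

public class ChineseCoreNlpSketch {
    public static void main(String[] args) throws Exception {
        // The Chinese models jar ships a properties file that configures Chinese
        // word segmentation, sentence splitting, POS tagging, and parsing.
        Properties props = new Properties();
        props.load(ChineseCoreNlpSketch.class.getClassLoader()
                .getResourceAsStream("StanfordCoreNLP-chinese.properties"));

        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("微软发布了新版本的软件。");
        pipeline.annotate(doc);

        // Print the tokens of each segmented sentence as a quick sanity check.
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            System.out.println(sentence.get(CoreAnnotations.TokensAnnotation.class));
        }
    }
}
```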

After all of that pre-processing has been done, the entity extractor then uses gazetteer(s) (basically a dictionary) to label known entities (e.g. "Microsoft"). After that it uses a trained Apache OpenNLP averaged perceptron model to find entities not contained in the gazetteer (e.g. "Obscure Developer LLC"). You would need to replace or expand these gazetteers - ours were generated from sources like Freebase and Wikipedia, which should include many languages. To generate a new Apache OpenNLP model, you'll need your own training corpus. Information about how we generated those models is in our recent publication here: https://ieeexplore.ieee.org/document/8260670/. The models and dictionaries you're replacing are contained in the resources directory of that project.
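For the OpenNLP side, the general shape of training a new name-finder model from annotated sentences is sketched below. The file names are placeholders, the entity type is up to you, and the exact training setup we used is described in the paper; this is just OpenNLP's standard training flow, with the perceptron trainer selected via the training parameters.

```java
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class TrainNameFinderSketch {
    public static void main(String[] args) throws Exception {
        // Training data in OpenNLP's name-finder format: one tokenized sentence per
        // line, with entities marked like <START:software> ... <END>.
        // ("zh-ner.train" and "zh-ner.bin" are placeholder file names.)
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("zh-ner.train")),
                StandardCharsets.UTF_8);
        ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

        // Ask for the perceptron trainer instead of the default maxent trainer.
        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.ALGORITHM_PARAM, "PERCEPTRON");

        TokenNameFinderModel model = NameFinderME.train(
                "zh", null, samples, params, new TokenNameFinderFactory());

        // Save the trained model so the entity extractor can load it from resources.
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream("zh-ner.bin"))) {
            model.serialize(out);
        }
        samples.close();
    }
}
```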

The relation extraction can be done with either pattern matching or SVM models, depending on which branch is checked out. (The master branch uses pattern matching; the "svm" branch uses SVM.) Either one would need to be updated with Chinese sentence patterns or new SVM models, respectively. In either branch, those will be in that project's resources directory. A toy sketch of the pattern-matching idea follows below.
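To give a sense of what the pattern-matching side is doing conceptually, here is a toy sketch. The class and label names are made up for illustration and are not the project's actual code; the real patterns (or SVM features) live in the relation extractor's resources directory. It simply pairs each labeled software entity with the nearest version entity that follows it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RelationPatternSketch {

    // A labeled entity roughly as the entity extractor would hand it off:
    // surface text, a label, and a token position. (Illustrative only.)
    static class Entity {
        final String text;
        final String label;
        final int tokenIndex;

        Entity(String text, String label, int tokenIndex) {
            this.text = text;
            this.label = label;
            this.tokenIndex = tokenIndex;
        }
    }

    // A single toy "pattern": link each software entity to the nearest version
    // entity that follows it in the sentence.
    static List<String> matchProductVersion(List<Entity> entities) {
        List<String> relations = new ArrayList<>();
        for (Entity product : entities) {
            if (!"software".equals(product.label)) continue;
            Entity nearest = null;
            for (Entity version : entities) {
                if (!"version".equals(version.label)) continue;
                if (version.tokenIndex > product.tokenIndex
                        && (nearest == null || version.tokenIndex < nearest.tokenIndex)) {
                    nearest = version;
                }
            }
            if (nearest != null) {
                relations.add(product.text + " -[hasVersion]-> " + nearest.text);
            }
        }
        return relations;
    }

    public static void main(String[] args) {
        // Entities as extracted from:
        // "Apache Struts 2.3.34 and Microsoft Edge 42 are affected."
        List<Entity> entities = Arrays.asList(
                new Entity("Apache Struts", "software", 0),
                new Entity("2.3.34", "version", 2),
                new Entity("Microsoft Edge", "software", 4),
                new Entity("42", "version", 6));
        matchProductVersion(entities).forEach(System.out::println);
    }
}
```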

If you can change the entity extractor to load the appropriate CoreNLP models, and then replace all of the gazetteers and models in the resources directories in both of those projects, you guys should be able to get that working in any language you like. Generating those models and gazetteers was somewhat difficult, but that publication I linked above should help get you started with generating and evaluating them.
