entity labeling #7
Comments
I'm not sure what you mean. Can you be more specific?
Could you tell me how you label the entities for NER?
I want to know how I can extract entity relationships in Chinese, if I annotate the entities first. What should I do with them?
Ok, I'm still not quite sure what you need. Do you want to present the data to the user in Chinese? That wouldn't really need any changes to the NLP, database, or anything else on the back end; I could show you what changes would be needed in the UI (and possibly in the REST API).

Or do you have documents in Chinese that you would like to label and store? That would definitely require new sets of training data, and may also need other changes to the NLP. It would probably be more difficult, and I would have to ask the other team members for input on the specifics.

Anyway, let me know what you need and then hopefully I'll have more specific information.
Hi, actually we are trying to do Named Entity Recognition on a set of Chinese documents, but these documents are not labeled. We have tried your project on English files and it worked very well. We would like to know how we should label our documents, or the entities in them, so that we can build a proper training set. Or could you give us a small demo of your labeled data set, so we have a clue of where to start? Thanks again for your reply.
Ok, after thinking through this and getting some input from the rest of the team, I think I can point you guys in the right direction. To support any non-English documents, you will need to make some changes to both the entity extractor and the relation extractor. Both repos contain updated and reasonably detailed README files that describe them, but to summarize: the entity extractor labels the "entities" it finds in the text, and the relation extractor decides how those entities are related to each other. For example, if a sentence contains two version numbers and two software products, the entity extractor would find and label them, and the relation extractor would match each product with its version.

The entity extractor uses Stanford's CoreNLP for a lot of non-domain-specific tasks, including sentence splitting, tokenizing, part-of-speech (POS) tagging, and generating the parse tree. This library apparently has Chinese models that can be loaded, but you'll need to look through their documentation for the specifics. After all of that pre-processing has been done, the entity extractor uses gazetteer(s) (basically dictionaries) to label known entities (e.g. "Microsoft"). After that it uses a trained Apache OpenNLP averaged perceptron model to find entities not contained in the gazetteer (e.g. "Obscure Developer LLC"). You would need to replace or expand these gazetteers - ours was generated from sources like Freebase and Wikipedia, which should include many languages. To generate a new Apache OpenNLP model, you'll need your own training corpus. Information about how we generated those models is in our recent publication here: https://ieeexplore.ieee.org/document/8260670/. The models and dictionaries you're replacing are contained in the resources directory of that project.

The relation extraction can be done with either pattern matching or SVM models, depending on which branch is checked out (the master branch uses pattern matching, the "svm" branch uses SVM). Either one would need to be updated with Chinese sentence patterns, or new SVM models. In either branch, those will be in that project's resources directory. If you can change the
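To make the two stages a bit more concrete, here is a minimal, hypothetical sketch of the pipeline described above: a gazetteer lookup labels known entities, a regex catches version numbers (standing in for the OpenNLP perceptron step), and a simple pattern pairs each product with its version. None of this is the project's actual code; all names and patterns are illustrative only.

```python
import re

# Toy gazetteer; the real one was built from sources like Freebase/Wikipedia.
GAZETTEER = {
    "microsoft": "SOFTWARE_VENDOR",
    "windows": "SOFTWARE_PRODUCT",
}

# Stand-in for the trained model: version numbers like "10" or "3.1.4".
VERSION_PATTERN = re.compile(r"\d+(\.\d+)*")

def label_entities(tokens):
    """Label each token via gazetteer lookup, then the version regex.
    (The real extractor also runs an OpenNLP averaged perceptron model
    for entities the gazetteer misses.)"""
    labels = []
    for tok in tokens:
        if tok.lower() in GAZETTEER:
            labels.append((tok, GAZETTEER[tok.lower()]))
        elif VERSION_PATTERN.fullmatch(tok):
            labels.append((tok, "VERSION"))
        else:
            labels.append((tok, "O"))  # outside any entity
    return labels

def match_product_versions(labeled):
    """Pattern-matching relation step: pair each product with the
    nearest following version number."""
    relations = []
    product = None
    for tok, label in labeled:
        if label == "SOFTWARE_PRODUCT":
            product = tok
        elif label == "VERSION" and product is not None:
            relations.append((product, "HAS_VERSION", tok))
            product = None
    return relations

labeled = label_entities("Microsoft released Windows 10".split())
print(labeled)
print(match_product_versions(labeled))
```

For Chinese, the same structure would apply, but the tokens would have to come from CoreNLP's Chinese segmenter rather than whitespace splitting, and the gazetteer and patterns would need Chinese entries.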
Excuse me, how is the entity labeling done?