about chinese dataset #20

Closed
leileilin opened this issue Jul 7, 2022 · 14 comments

Comments

@leileilin

Hello, thank you for your great open-source work. I want to process Chinese datasets following your pipeline, but the convert_to_jsonlines.py step reports an error. Do you know why?
Thanks.

@vdobrovolskii
Owner

Hi!

Could you please post the error stack trace?
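
A CalledProcessError only reports the exit status, so the actual Java message is easy to miss. As a minimal sketch (the helper name run_and_report is hypothetical and not part of this repository), the command can be run with its stderr captured so the full error can be pasted here:

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd and return (exit_code, stderr_text) instead of raising."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Any Java exception text ends up on stderr, not in the
        # CalledProcessError message, so print it explicitly.
        print(result.stderr, file=sys.stderr)
    return result.returncode, result.stderr
```

Passing the same argument list that appears in the failing subprocess call should surface whatever the Java process printed before exiting.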

@leileilin
Author

Hi! Here is the error:
subprocess.CalledProcessError: Command '['java', '-cp', 'downloads/stanford-parser.jar', 'edu.stanford.nlp.trees.EnglishGrammaticalStructure', '-basic', '-keepPunct', '-conllx', '-treeFile', 'temp/data/conll-2012/v4/data/development/data/chinese/annotations/bc/cctv/00/cctv_0000.v4_gold_conll']' returned non-zero exit status 1.

@leileilin
Author

I think this is caused by not choosing a Chinese parser, but I don't know where to start.

@vdobrovolskii
Owner

Have you tried manually running java -cp downloads/stanford-parser.jar edu.stanford.nlp.trees.EnglishGrammaticalStructure -basic -keepPunct -conllx -treeFile temp/data/conll-2012/v4/data/development/data/chinese/annotations/bc/cctv/00/cctv_0000.v4_gold_conll?
Also, note that the command passes EnglishGrammaticalStructure as the Java class, while for Chinese you might need something like ChineseGrammaticalStructure (check the docs to be sure).

@leileilin
Author

This is the point where I am confused. I changed the class to edu.stanford.nlp.trees.GrammaticalStructure, but I still get the following error:
Exception in thread "main" java.lang.IllegalArgumentException: No head rule defined for DNP using class edu.stanford.nlp.trees.SemanticHeadFinder in DNP-27

@vdobrovolskii
Owner

I am not sure how to do it with Chinese, but have a look here, it might help:
https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/international/pennchinese/ChineseGrammaticalStructure.html
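
As a rough sketch of what switching the class could look like, the command can be built per language. The Chinese class name is taken from the javadoc link above; whether it accepts exactly the same flags as the English class is an assumption I have not verified, and the helper and dictionary names are hypothetical:

```python
import subprocess

# English class: from the failing command in this thread.
# Chinese class: from the javadoc URL above.
GRAMMATICAL_STRUCTURE_CLASSES = {
    "english": "edu.stanford.nlp.trees.EnglishGrammaticalStructure",
    "chinese": "edu.stanford.nlp.trees.international.pennchinese."
               "ChineseGrammaticalStructure",
}


def build_conversion_command(tree_file, language,
                             jar="downloads/stanford-parser.jar"):
    """Build the java command line for the given language."""
    return ["java", "-cp", jar, GRAMMATICAL_STRUCTURE_CLASSES[language],
            "-basic", "-keepPunct", "-conllx", "-treeFile", tree_file]


def convert(tree_file, language):
    """Run the conversion; return (exit_code, stdout, stderr)."""
    result = subprocess.run(build_conversion_command(tree_file, language),
                            capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr
```

Keeping the class name in a lookup table means the rest of the command stays identical to the English one, so any remaining failure points at the class rather than the flags.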

@leileilin
Author

Thank you for your answer. With ChineseGrammaticalStructure the command does run successfully, but messages like the following appear:
Correcting error: treebank tree is not phrasal; wrapping in FRAG: (PU --)

@leileilin
Author

Another problem is that the documentation does not describe what the command-line parameters (-basic, -keepPunct, -conllx, -treeFile) do. Where did you learn about them?

@vdobrovolskii
Owner

Regarding "Correcting error: treebank tree is not phrasal; wrapping in FRAG: (PU --)": does it occur on all the documents?

@leileilin
Author

No, only on some sentences in the document.

@vdobrovolskii
Owner

Hm. If it's just a couple of sentences, why don't you ignore this error and see if everything else works?

@leileilin
Author

The parsing results for those sentences are wrong, so I simply discard them.

@leileilin
Author

In fact, discarding those sentences has a drawback: it breaks the integrity of the data.

@vdobrovolskii
Owner

It really comes down to the percentage of such sentences. What is it?
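
One way to estimate it, as a sketch under assumptions: count the "not phrasal" messages in the captured stderr and divide by the number of sentences in the CoNLL-X output. This assumes the converter prints one such message per affected sentence and that sentences are separated by blank lines; the function names are hypothetical:

```python
WARNING_PREFIX = "Correcting error: treebank tree is not phrasal"


def count_warnings(stderr_text):
    """Count stderr lines that start with the 'not phrasal' message."""
    return sum(1 for line in stderr_text.splitlines()
               if line.startswith(WARNING_PREFIX))


def count_sentences(conllx_text):
    """Count blank-line-separated blocks of token lines."""
    count, inside = 0, False
    for line in conllx_text.splitlines():
        if line.strip():
            if not inside:
                count += 1  # first token line of a new sentence
            inside = True
        else:
            inside = False
    return count


def affected_percentage(stderr_text, conllx_text):
    """Percentage of sentences that triggered the warning."""
    total = count_sentences(conllx_text)
    return 100.0 * count_warnings(stderr_text) / total if total else 0.0
```

Running this per document, or summed over the whole split, would show whether the affected sentences are a negligible fraction or a large enough share to matter.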
