about chinese dataset #20

leileilin · 2022-07-07T10:46:27Z

Hello, thank you for open-sourcing your great work. I want to process a Chinese dataset following your pipeline, but the convert_to_jsonlines.py step reports an error. Do you know why?
Thanks.

vdobrovolskii · 2022-07-07T10:49:19Z

Hi!

Could you please post the error stack trace?

leileilin · 2022-07-07T11:33:00Z

Here it is:
subprocess.CalledProcessError: Command '['java', '-cp', 'downloads/stanford-parser.jar', 'edu.stanford.nlp.trees.EnglishGrammaticalStructure', '-basic', '-keepPunct', '-conllx', '-treeFile', 'temp/data/conll-2012/v4/data/development/data/chinese/annotations/bc/cctv/00/cctv_0000.v4_gold_conll']' returned non-zero exit status 1.
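
The CalledProcessError above only reports the exit status; the Java exception itself is written to the child process's stderr. A minimal sketch for capturing it, assuming the command is launched through Python's subprocess module (as the error type suggests). `run_and_capture` is a hypothetical helper, not a function from the repository:

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd and return (exit_code, stdout, stderr) without raising."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Any Java stack trace ends up here, not in CalledProcessError's message.
        print(result.stderr, file=sys.stderr)
    return result.returncode, result.stdout, result.stderr
```

Passing the same argument list that appears in the error message would print the underlying Java stack trace instead of just "non-zero exit status 1".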

leileilin · 2022-07-07T11:36:55Z

I think this is caused by not choosing a Chinese parser, but I don't know where to start.

vdobrovolskii · 2022-07-07T11:36:59Z

Have you tried running the command manually?
java -cp downloads/stanford-parser.jar edu.stanford.nlp.trees.EnglishGrammaticalStructure -basic -keepPunct -conllx -treeFile temp/data/conll-2012/v4/data/development/data/chinese/annotations/bc/cctv/00/cctv_0000.v4_gold_conll
Also, note that the command uses the EnglishGrammaticalStructure class, while for Chinese you might need something like ChineseGrammaticalStructure (check the docs to be sure).

leileilin · 2022-07-07T11:43:36Z

This is the part that confuses me. I changed the class to edu.stanford.nlp.trees.GrammaticalStructure, but I still get the following error:
Exception in thread "main" java.lang.IllegalArgumentException: No head rule defined for DNP using class edu.stanford.nlp.trees.SemanticHeadFinder in DNP-27

vdobrovolskii · 2022-07-07T12:09:21Z

I am not sure how to do it with Chinese, but have a look here, it might help:
https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/international/pennchinese/ChineseGrammaticalStructure.html
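
One way to try this is to make the class name depend on the language when building the command. The sketch below reuses the flags from the command earlier in this thread and takes the Chinese class path from the Javadoc URL above. It assumes the Chinese class's command-line entry point accepts the same flags as the English one, which should be checked against the docs. `GRAMMATICAL_STRUCTURE_CLASSES` and `build_parser_command` are hypothetical names, not part of the repository:

```python
# Fully qualified class names; the Chinese one is taken from the Javadoc URL.
GRAMMATICAL_STRUCTURE_CLASSES = {
    "english": "edu.stanford.nlp.trees.EnglishGrammaticalStructure",
    "chinese": (
        "edu.stanford.nlp.trees.international.pennchinese."
        "ChineseGrammaticalStructure"
    ),
}


def build_parser_command(tree_file, language, jar="downloads/stanford-parser.jar"):
    """Build the java command line for converting one treebank file."""
    # Assumption: both classes accept -basic, -keepPunct, -conllx and -treeFile.
    return [
        "java", "-cp", jar,
        GRAMMATICAL_STRUCTURE_CLASSES[language],
        "-basic", "-keepPunct", "-conllx",
        "-treeFile", tree_file,
    ]
```

The returned list can be passed to subprocess.run in place of the hard-coded English command.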

leileilin · 2022-07-08T01:25:09Z

Thank you for your answer. With ChineseGrammaticalStructure the command does run, but messages like the following appear:
Correcting error: treebank tree is not phrasal; wrapping in FRAG: (PU －－)

leileilin · 2022-07-08T01:30:03Z

Another problem is that the documentation does not describe what the command-line parameters (-basic, -keepPunct, -conllx, -treeFile) do. Where did you learn about them?

vdobrovolskii · 2022-07-08T14:26:22Z

Does the "treebank tree is not phrasal; wrapping in FRAG" message occur on all the documents?

leileilin · 2022-07-09T09:37:32Z

No, only on some sentences in the documents.

vdobrovolskii · 2022-07-09T09:44:07Z

Hm. If it's just a couple of sentences, why don't you ignore this error and see if everything else works?

leileilin · 2022-07-09T10:51:18Z

The parsing results for those sentences are wrong, so I simply discarded them.

leileilin · 2022-07-09T12:48:56Z

In fact, my approach has a shortcoming: discarding those sentences breaks the integrity of the data.

vdobrovolskii · 2022-07-13T08:33:00Z

It really comes down to the percentage of such sentences. What is it?
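
The percentage can be estimated from the converter's own output. A rough sketch, assuming the "Correcting error" messages were captured from stderr as text, that each affected sentence produces exactly one such message, and that sentences in the CoNLL-X output are separated by blank lines. `frag_warning_percentage` is a hypothetical helper:

```python
WARNING_PREFIX = "Correcting error: treebank tree is not phrasal"


def frag_warning_percentage(stderr_text, conllx_text):
    """Return the share (in percent) of sentences that triggered the warning."""
    # Count warning lines; assumes one line per affected sentence.
    warnings = sum(
        1 for line in stderr_text.splitlines() if line.startswith(WARNING_PREFIX)
    )
    # CoNLL-X sentences are blocks of token lines separated by a blank line.
    sentences = sum(1 for block in conllx_text.split("\n\n") if block.strip())
    if sentences == 0:
        return 0.0
    return 100.0 * warnings / sentences
```

If the resulting share is very small, dropping or keeping those sentences is unlikely to matter much; a large share would suggest the conversion itself needs fixing.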

vdobrovolskii closed this as completed Aug 26, 2022

vdobrovolskii mentioned this issue Dec 25, 2023 in "How to modify to Chinese data set" #45 (open)
