
Are the latest Chinese models significantly worse than the Stanford online parser? #985

Closed
lingvisa opened this issue Jan 16, 2020 · 21 comments


@lingvisa

I tested the latest Chinese CoreNLP 3.9.2 release and found the results quite horrible. Here are a few examples:

我的朋友 ("my friend"): always tags "我的" as one NN token.
我的狗吃苹果 ("my dog eats apples"): "我的狗" tagged as one NN token.
他的狗吃苹果 ("his dog eats apples"): "狗吃" tagged as one NN token.
高质量就业成时代 (roughly "high-quality employment becomes the era"): "就业" tagged as VV.

When I compared them with the results from http://nlp.stanford.edu:8080/parser/index.jsp, surprisingly, all of these examples come out correct there. Why is that? Are the models different? Is there a bug in the new 3.9.2 models?

@AngledLuffa commented Jan 16, 2020 via email

@lingvisa commented Jan 16, 2020

I found the reason: it is using the CTB model. The PKU model doesn't have this issue on these examples. I'm now switching the parameter to use the PKU model. Really, PKU should be the default; CTB is horrible!

I found this by running the segmenter alone, where I can switch between PKU and CTB. The full pipeline package doesn't expose an easy way to switch.
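
(For reference, a minimal sketch of overriding the segmenter model inside the full pipeline. It assumes the property names and model paths from StanfordCoreNLP-chinese.properties in the Chinese models jar — segment.model, pku.gz vs. ctb.gz — so check them against your jar version before relying on this.)

import java.util.Properties;
import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class PkuSegmenterDemo {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    // Start from the shipped Chinese defaults in the models jar.
    props.load(IOUtils.readerFromString("StanfordCoreNLP-chinese.properties"));
    // Swap just the segmenter model: PKU instead of CTB (path assumed).
    props.setProperty("segment.model",
        "edu/stanford/nlp/models/segmenter/chinese/pku.gz");
    props.setProperty("annotators", "tokenize,ssplit,pos");

    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    CoreDocument doc = new CoreDocument("我的狗吃苹果");
    pipeline.annotate(doc);
    // Print each token with its POS tag to compare PKU vs. CTB output.
    doc.tokens().forEach(t -> System.out.println(t.word() + "/" + t.tag()));
  }
}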

@AngledLuffa commented Jan 16, 2020 via email

@AngledLuffa commented Jan 17, 2020 via email

@lingvisa

That's great! I just ran a test, and it reports an error about the data format. Should I compress it myself?

java -mx2g -cp ./*: edu.stanford.nlp.ie.crf.CRFClassifier -sighanCorporaDict ./data -textFile test.simp.utf8 -inputEncoding UTF-8 -sighanPostProcessing true -keepAllWhitespaces false -loadClassifier data/ctb9.train.chris6.ser.gz -serDictionary data/dict-chris6.ser.gz 0
Invoked on Fri Jan 17 00:16:49 PST 2020 with arguments: -sighanCorporaDict ./data -textFile test.simp.utf8 -inputEncoding UTF-8 -sighanPostProcessing true -keepAllWhitespaces false -loadClassifier data/ctb9.train.chris6.ser.gz -serDictionary data/dict-chris6.ser.gz 0
serDictionary=data/dict-chris6.ser.gz
loadClassifier=data/ctb9.train.chris6.ser.gz
sighanCorporaDict=./data
inputEncoding=UTF-8
textFile=test.simp.utf8
sighanPostProcessing=true
keepAllWhitespaces=false
=0
Exception in thread "main" edu.stanford.nlp.io.RuntimeIOException: Resource or file looks like a gzip file, but is not: data/ctb9.train.chris6.ser.gz
at edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(IOUtils.java:491)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1503)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifierNoExceptions(AbstractSequenceClassifier.java:1516)
at edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:2993)
Caused by: java.util.zip.ZipException: Not in GZIP format
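
(One quick way to check whether the file is actually gzipped — a minimal sketch that just looks for the GZIP magic bytes 0x1f 0x8b at the start of the file:)

import java.io.FileInputStream;

public class GzipCheck {
  public static void main(String[] args) throws Exception {
    try (FileInputStream in = new FileInputStream(args[0])) {
      // Real gzip files begin with the two magic bytes 0x1f 0x8b.
      int b1 = in.read();
      int b2 = in.read();
      boolean isGzip = (b1 == 0x1f) && (b2 == 0x8b);
      System.out.println(args[0] + (isGzip ? ": gzip" : ": NOT gzip"));
    }
  }
}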

@lingvisa

Hi, John: Have you had a chance to look at the error message? I'd love to use this and would appreciate your help!

@AngledLuffa commented Jan 17, 2020 via email

@lingvisa commented Jan 17, 2020

Hi, John: It seems there is a type-casting issue. I simply unzipped it and passed it on the command line:

java -mx2g -cp ./*: edu.stanford.nlp.ie.crf.CRFClassifier -sighanCorporaDict ./data -textFile test.simp.utf8 -inputEncoding UTF-8 -sighanPostProcessing true -keepAllWhitespaces false -loadClassifier data/ctb.gz -serDictionary data/ctb9.train.chris6.ser.gz 0
Invoked on Fri Jan 17 11:02:45 PST 2020 with arguments: -sighanCorporaDict ./data -textFile test.simp.utf8 -inputEncoding UTF-8 -sighanPostProcessing true -keepAllWhitespaces false -loadClassifier data/ctb.gz -serDictionary data/ctb9.train.chris6.ser.gz 0
serDictionary=data/ctb9.train.chris6.ser.gz
loadClassifier=data/ctb.gz
sighanCorporaDict=./data
inputEncoding=UTF-8
textFile=test.simp.utf8
sighanPostProcessing=true
keepAllWhitespaces=false
=0
Loading classifier from data/ctb.gz ... done [5.8 sec].
Loading Chinese dictionaries from 1 file:
data/ctb9.train.chris6.ser.gz
java.lang.ClassCastException: java.util.ArrayList cannot be cast to [Ljava.util.Set;
edu.stanford.nlp.wordseg.ChineseDictionary.loadDictionary(ChineseDictionary.java:69)
edu.stanford.nlp.wordseg.ChineseDictionary.&lt;init&gt;(ChineseDictionary.java:118)
edu.stanford.nlp.wordseg.ChineseDictionary.&lt;init&gt;(ChineseDictionary.java:98)
edu.stanford.nlp.wordseg.Sighan2005DocumentReaderAndWriter.init(Sighan2005DocumentReaderAndWriter.java:104)
edu.stanford.nlp.ie.AbstractSequenceClassifier.makeReaderAndWriter(AbstractSequenceClassifier.java:243)
edu.stanford.nlp.ie.AbstractSequenceClassifier.defaultReaderAndWriter(AbstractSequenceClassifier.java:118)
edu.stanford.nlp.ie.AbstractSequenceClassifier.plainTextReaderAndWriter(AbstractSequenceClassifier.java:142)
edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:3067)
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassCastException: java.util.ArrayList cannot be cast to [Ljava.util.Set;
at edu.stanford.nlp.wordseg.ChineseDictionary.loadDictionary(ChineseDictionary.java:72)
at edu.stanford.nlp.wordseg.ChineseDictionary.&lt;init&gt;(ChineseDictionary.java:118)
at edu.stanford.nlp.wordseg.ChineseDictionary.&lt;init&gt;(ChineseDictionary.java:98)
at edu.stanford.nlp.wordseg.Sighan2005DocumentReaderAndWriter.init(Sighan2005DocumentReaderAndWriter.java:104)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.makeReaderAndWriter(AbstractSequenceClassifier.java:243)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.defaultReaderAndWriter(AbstractSequenceClassifier.java:118)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.plainTextReaderAndWriter(AbstractSequenceClassifier.java:142)
at edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:3067)
Caused by: java.lang.ClassCastException: java.util.ArrayList cannot be cast to [Ljava.util.Set;
at edu.stanford.nlp.wordseg.ChineseDictionary.loadDictionary(ChineseDictionary.java:69)
... 7 more

@AngledLuffa commented Jan 17, 2020 via email

@lingvisa

Sorry, it's my fault. I passed the new model to the -serDictionary parameter, which is wrong. I corrected it and it works fine. Another question: I am adding my own dictionary with the command:

java -mx2g -cp ./*: edu.stanford.nlp.wordseg.ChineseDictionary -inputDicts data/dict-chris6.ser.gz,data/foo.txt -output data/dict-chris6.ser.2.gz

However, when I use the new dictionary "dict-chris6.ser.2.gz" with the model, the log says there are only 4 entries in the dictionary, which is wrong. I checked the code of ChineseDictionary, and my command above seems to be the right way to merge in my own dictionary. What is wrong? If I don't create a new dictionary file but just pass the extra dict file via -serDictionary, it works fine. I want to create a single merged dict file to make management easier.
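
(For reference, a sketch of the intended end-to-end flow, assuming a plain-text input dictionary like foo.txt contains one word per line — that format is an assumption, not verified here. The merged file goes to -serDictionary, while -loadClassifier keeps pointing at the CRF model:)

java -mx2g -cp ./*: edu.stanford.nlp.wordseg.ChineseDictionary -inputDicts data/dict-chris6.ser.gz,data/foo.txt -output data/dict-chris6.ser.2.gz
java -mx2g -cp ./*: edu.stanford.nlp.ie.crf.CRFClassifier -sighanCorporaDict ./data -textFile test.simp.utf8 -inputEncoding UTF-8 -sighanPostProcessing true -keepAllWhitespaces false -loadClassifier data/ctb.gz -serDictionary data/dict-chris6.ser.2.gz 0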

@AngledLuffa commented Jan 19, 2020 via email

@lingvisa

Output is below:

./segment.sh ctb test.simp.utf8 UTF-8 0
(CTB):
File: test.simp.utf8
Encoding: UTF-8

Invoked on Sun Jan 19 20:29:56 PST 2020 with arguments: -sighanCorporaDict ./data -textFile test.simp.utf8 -inputEncoding UTF-8 -sighanPostProcessing true -keepAllWhitespaces false -loadClassifier ./data/ctb.gz -serDictionary ./data/dict-chris6.ser.2.gz
serDictionary=./data/dict-chris6.ser.2.gz
loadClassifier=./data/ctb.gz
sighanCorporaDict=./data
inputEncoding=UTF-8
textFile=test.simp.utf8
sighanPostProcessing=true
keepAllWhitespaces=false
Loading classifier from ./data/ctb.gz ... done [5.5 sec].
Loading Chinese dictionaries from 1 file:
./data/dict-chris6.ser.2.gz
./data/dict-chris6.ser.2.gz: 4 entries
Done. Unique words in ChineseDictionary is: 4.
Loading character dictionary file from ./data/dict/character_list [done].
Loading affix dictionary from ./data/dict/in.ctb [done].

As you can see, it reports "Done. Unique words in ChineseDictionary is: 4.", which is wrong, and the test sentences are barely segmented.

@AngledLuffa commented Jan 20, 2020 via email

@lingvisa commented Jan 20, 2020

The log message looks normal:

$ java -mx2g -cp ./*: edu.stanford.nlp.wordseg.ChineseDictionary -inputDicts data/dict-chris6.ser.gz,data/foo.txt -output data/dict-chris6.ser.2.gz
Loading Chinese dictionaries from 2 files:
data/dict-chris6.ser.gz
data/foo.txt
data/foo.txt: 6 entries
Done. Unique words in ChineseDictionary is: 423202.
Serializing dictionaries to data/dict-chris6.ser.2.gz ...
done.

As can be seen, it correctly reports that the expanded dictionary size is 423202. However, when I use it for segmentation, it reports only 4 entries in the new dictionary.

@AngledLuffa commented Jan 20, 2020 via email

@lingvisa

Just tested: the issue doesn't occur when running the full pipeline package instead of the standalone segmenter package. Thanks for the info.

@AngledLuffa

Good to hear!

@lingvisa

Hi, John, a follow-up question regarding the segmenter dictionary dict-chris6.ser.gz: are those 1-6 character entries meaningful words or just n-grams? They look like n-grams, but a lot of them are indeed valid words. Could you confirm whether they are meaningful words extracted from the training data, or just n-grams extracted from it? If they are words, the 2-character entries alone number 125336.
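
(One way to check this directly is to open the serialized dictionary yourself. A minimal sketch, assuming — based only on the ClassCastException earlier in this thread — that the file deserializes to an array of Set&lt;String&gt; indexed by word length; that layout is an assumption, not confirmed:)

import java.io.FileInputStream;
import java.io.ObjectInputStream;
import java.util.Set;
import java.util.zip.GZIPInputStream;

public class DictInspect {
  public static void main(String[] args) throws Exception {
    try (ObjectInputStream ois = new ObjectInputStream(
        new GZIPInputStream(new FileInputStream(args[0])))) {
      // Assumed layout: one Set<String> per word length, matching the
      // cast in ChineseDictionary.loadDictionary in the stack trace above.
      @SuppressWarnings("unchecked")
      Set<String>[] byLength = (Set<String>[]) ois.readObject();
      for (int len = 0; len < byLength.length; len++) {
        int n = (byLength[len] == null) ? 0 : byLength[len].size();
        System.out.println("length " + len + ": " + n + " entries");
      }
      // Eyeball a few 2-character entries to judge words vs. n-grams.
      if (byLength.length > 2 && byLength[2] != null) {
        byLength[2].stream().limit(20).forEach(System.out::println);
      }
    }
  }
}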

@AngledLuffa commented Jan 27, 2020 via email

@lingvisa

I can easily spot some 2-character entries that normally wouldn't make good words, like:
归由
胜数
心来
开缺
老而
弄绉
缺顶
肤泛
应负
胡早

3-character:
嫁妆箱
杨岐黄
子模性
磁县都

Do you have a way to retrieve the original sentences where these words occur? They look very unusual, though I'm seeing them without context.

@AngledLuffa commented Jan 28, 2020 via email
