
Are the latest Chinese models significantly worse than the Stanford online parser? #985

Closed
lingvisa opened this issue Jan 16, 2020 · 21 comments


@lingvisa

I tested the latest Chinese CoreNLP 3.9.2 release and found the results quite horrible. Here are a few examples:

我的朋友 ("my friend"): always tags "我的" as one NN token.
我的狗吃苹果 ("my dog eats apples"): "我的狗" tagged as one NN token.
他的狗吃苹果 ("his dog eats apples"): "狗吃" tagged as one NN token.
高质量就业成时代 (roughly "high-quality employment becomes the era"): "就业" tagged as VV.

When I compared them with the results from http://nlp.stanford.edu:8080/parser/index.jsp, surprisingly, all of these examples come out correct there. Why is that? Are the models different? Is there a bug in the new 3.9.2 models?

@AngledLuffa commented Jan 16, 2020 via email

@lingvisa commented Jan 16, 2020

I found the reason: it is using the CTB model. The PKU model doesn't have this issue on these examples. I'm now switching the parameter to use the PKU model. Really, PKU should be the default; CTB is horrible!

I found this by running the segmenter alone, where I can switch between PKU and CTB. The full pipeline package doesn't expose an easy way to switch.
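
(For reference, a minimal sketch of overriding the segmenter model inside the full pipeline. It assumes the property names and model paths from StanfordCoreNLP-chinese.properties in the Chinese models jar — segment.model, pku.gz vs. ctb.gz — so check them against your jar version before relying on this.)

import java.util.Properties;
import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class PkuSegmenterDemo {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    // Start from the shipped Chinese defaults in the models jar.
    props.load(IOUtils.readerFromString("StanfordCoreNLP-chinese.properties"));
    // Swap just the segmenter model: PKU instead of CTB (path assumed).
    props.setProperty("segment.model",
        "edu/stanford/nlp/models/segmenter/chinese/pku.gz");
    props.setProperty("annotators", "tokenize,ssplit,pos");

    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    CoreDocument doc = new CoreDocument("我的狗吃苹果");
    pipeline.annotate(doc);
    // Print each token with its POS tag to compare PKU vs. CTB output.
    doc.tokens().forEach(t -> System.out.println(t.word() + "/" + t.tag()));
  }
}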

@AngledLuffa commented Jan 16, 2020 via email

@AngledLuffa commented Jan 17, 2020 via email

@lingvisa

That's great! I just ran a test, and it reports an error about the data format. Should I compress it myself?

java -mx2g -cp ./*: edu.stanford.nlp.ie.crf.CRFClassifier -sighanCorporaDict ./data -textFile test.simp.utf8 -inputEncoding UTF-8 -sighanPostProcessing true -keepAllWhitespaces false -loadClassifier data/ctb9.train.chris6.ser.gz -serDictionary data/dict-chris6.ser.gz 0
Invoked on Fri Jan 17 00:16:49 PST 2020 with arguments: -sighanCorporaDict ./data -textFile test.simp.utf8 -inputEncoding UTF-8 -sighanPostProcessing true -keepAllWhitespaces false -loadClassifier data/ctb9.train.chris6.ser.gz -serDictionary data/dict-chris6.ser.gz 0
serDictionary=data/dict-chris6.ser.gz
loadClassifier=data/ctb9.train.chris6.ser.gz
sighanCorporaDict=./data
inputEncoding=UTF-8
textFile=test.simp.utf8
sighanPostProcessing=true
keepAllWhitespaces=false
=0
Exception in thread "main" edu.stanford.nlp.io.RuntimeIOException: Resource or file looks like a gzip file, but is not: data/ctb9.train.chris6.ser.gz
at edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(IOUtils.java:491)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1503)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifierNoExceptions(AbstractSequenceClassifier.java:1516)
at edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:2993)
Caused by: java.util.zip.ZipException: Not in GZIP format
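
(One quick way to check whether the file is actually gzipped — a minimal sketch that just looks for the GZIP magic bytes 0x1f 0x8b at the start of the file:)

import java.io.FileInputStream;

public class GzipCheck {
  public static void main(String[] args) throws Exception {
    try (FileInputStream in = new FileInputStream(args[0])) {
      // Real gzip files begin with the two magic bytes 0x1f 0x8b.
      int b1 = in.read();
      int b2 = in.read();
      boolean isGzip = (b1 == 0x1f) && (b2 == 0x8b);
      System.out.println(args[0] + (isGzip ? ": gzip" : ": NOT gzip"));
    }
  }
}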

@lingvisa

Hi, John: Have you had a chance to look at the error message? I'd love to use this and would appreciate your help!

@AngledLuffa commented Jan 17, 2020 via email

@lingvisa commented Jan 17, 2020

Hi, John: It seems there is a type-casting issue. I simply unzipped it and passed it on the command line:

java -mx2g -cp ./*: edu.stanford.nlp.ie.crf.CRFClassifier -sighanCorporaDict ./data -textFile test.simp.utf8 -inputEncoding UTF-8 -sighanPostProcessing true -keepAllWhitespaces false -loadClassifier data/ctb.gz -serDictionary data/ctb9.train.chris6.ser.gz 0
Invoked on Fri Jan 17 11:02:45 PST 2020 with arguments: -sighanCorporaDict ./data -textFile test.simp.utf8 -inputEncoding UTF-8 -sighanPostProcessing true -keepAllWhitespaces false -loadClassifier data/ctb.gz -serDictionary data/ctb9.train.chris6.ser.gz 0
serDictionary=data/ctb9.train.chris6.ser.gz
loadClassifier=data/ctb.gz
sighanCorporaDict=./data
inputEncoding=UTF-8
textFile=test.simp.utf8
sighanPostProcessing=true
keepAllWhitespaces=false
=0
Loading classifier from data/ctb.gz ... done [5.8 sec].
Loading Chinese dictionaries from 1 file:
data/ctb9.train.chris6.ser.gz
java.lang.ClassCastException: java.util.ArrayList cannot be cast to [Ljava.util.Set;
edu.stanford.nlp.wordseg.ChineseDictionary.loadDictionary(ChineseDictionary.java:69)
edu.stanford.nlp.wordseg.ChineseDictionary.&lt;init&gt;(ChineseDictionary.java:118)
edu.stanford.nlp.wordseg.ChineseDictionary.&lt;init&gt;(ChineseDictionary.java:98)
edu.stanford.nlp.wordseg.Sighan2005DocumentReaderAndWriter.init(Sighan2005DocumentReaderAndWriter.java:104)
edu.stanford.nlp.ie.AbstractSequenceClassifier.makeReaderAndWriter(AbstractSequenceClassifier.java:243)
edu.stanford.nlp.ie.AbstractSequenceClassifier.defaultReaderAndWriter(AbstractSequenceClassifier.java:118)
edu.stanford.nlp.ie.AbstractSequenceClassifier.plainTextReaderAndWriter(AbstractSequenceClassifier.java:142)
edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:3067)
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassCastException: java.util.ArrayList cannot be cast to [Ljava.util.Set;
at edu.stanford.nlp.wordseg.ChineseDictionary.loadDictionary(ChineseDictionary.java:72)
at edu.stanford.nlp.wordseg.ChineseDictionary.&lt;init&gt;(ChineseDictionary.java:118)
at edu.stanford.nlp.wordseg.ChineseDictionary.&lt;init&gt;(ChineseDictionary.java:98)
at edu.stanford.nlp.wordseg.Sighan2005DocumentReaderAndWriter.init(Sighan2005DocumentReaderAndWriter.java:104)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.makeReaderAndWriter(AbstractSequenceClassifier.java:243)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.defaultReaderAndWriter(AbstractSequenceClassifier.java:118)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.plainTextReaderAndWriter(AbstractSequenceClassifier.java:142)
at edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:3067)
Caused by: java.lang.ClassCastException: java.util.ArrayList cannot be cast to [Ljava.util.Set;
at edu.stanford.nlp.wordseg.ChineseDictionary.loadDictionary(ChineseDictionary.java:69)
... 7 more

@AngledLuffa commented Jan 17, 2020 via email

@lingvisa

Sorry, it's my fault. I passed the new model to the -serDictionary parameter, which is wrong. I corrected it and it works fine. Another question: I am adding my own dictionary with the command:

java -mx2g -cp ./*: edu.stanford.nlp.wordseg.ChineseDictionary -inputDicts data/dict-chris6.ser.gz,data/foo.txt -output data/dict-chris6.ser.2.gz

However, when I use the new dictionary "dict-chris6.ser.2.gz" with the model, the log says there are only 4 entries in the dictionary, which is wrong. I checked the code of ChineseDictionary, and my command above seems to be the right way to merge in my own dictionary. What is wrong? If I don't create a new dictionary file but just pass the extra dict file via -serDictionary, it works fine. I want to create a single merged dict file to make management easier.
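
(For reference, a sketch of the intended end-to-end flow, assuming a plain-text input dictionary like foo.txt contains one word per line — that format is an assumption, not verified here. The merged file goes to -serDictionary, while -loadClassifier keeps pointing at the CRF model:)

java -mx2g -cp ./*: edu.stanford.nlp.wordseg.ChineseDictionary -inputDicts data/dict-chris6.ser.gz,data/foo.txt -output data/dict-chris6.ser.2.gz
java -mx2g -cp ./*: edu.stanford.nlp.ie.crf.CRFClassifier -sighanCorporaDict ./data -textFile test.simp.utf8 -inputEncoding UTF-8 -sighanPostProcessing true -keepAllWhitespaces false -loadClassifier data/ctb.gz -serDictionary data/dict-chris6.ser.2.gz 0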

@AngledLuffa commented Jan 19, 2020 via email

@lingvisa

Output is below:

./segment.sh ctb test.simp.utf8 UTF-8 0
(CTB):
File: test.simp.utf8
Encoding: UTF-8

Invoked on Sun Jan 19 20:29:56 PST 2020 with arguments: -sighanCorporaDict ./data -textFile test.simp.utf8 -inputEncoding UTF-8 -sighanPostProcessing true -keepAllWhitespaces false -loadClassifier ./data/ctb.gz -serDictionary ./data/dict-chris6.ser.2.gz
serDictionary=./data/dict-chris6.ser.2.gz
loadClassifier=./data/ctb.gz
sighanCorporaDict=./data
inputEncoding=UTF-8
textFile=test.simp.utf8
sighanPostProcessing=true
keepAllWhitespaces=false
Loading classifier from ./data/ctb.gz ... done [5.5 sec].
Loading Chinese dictionaries from 1 file:
./data/dict-chris6.ser.2.gz
./data/dict-chris6.ser.2.gz: 4 entries
Done. Unique words in ChineseDictionary is: 4.
Loading character dictionary file from ./data/dict/character_list [done].
Loading affix dictionary from ./data/dict/in.ctb [done].

As you can see, it reports "Done. Unique words in ChineseDictionary is: 4.", which is wrong, and the test sentences are barely segmented.

@AngledLuffa commented Jan 20, 2020 via email

@lingvisa commented Jan 20, 2020

The log message looks normal:

$ java -mx2g -cp ./*: edu.stanford.nlp.wordseg.ChineseDictionary -inputDicts data/dict-chris6.ser.gz,data/foo.txt -output data/dict-chris6.ser.2.gz
Loading Chinese dictionaries from 2 files:
data/dict-chris6.ser.gz
data/foo.txt
data/foo.txt: 6 entries
Done. Unique words in ChineseDictionary is: 423202.
Serializing dictionaries to data/dict-chris6.ser.2.gz ...
done.

As can be seen, it correctly reports that the expanded dictionary size is 423202. However, when I use it for segmentation, it reports only 4 entries in the new dictionary.

@AngledLuffa commented Jan 20, 2020 via email

@lingvisa

Just tested: the issue doesn't occur when running the full pipeline package instead of the standalone segmenter package. Thanks for the info.

@AngledLuffa

Good to hear!

@lingvisa

Hi, John, a follow-up question regarding the segmenter dictionary dict-chris6.ser.gz: are those 1-6 character entries meaningful words or just n-grams? They look like n-grams, but a lot of them are indeed valid words. Could you confirm whether they are meaningful words extracted from the training data, or just n-grams extracted from it? If they are words, the 2-character entries alone number 125336.
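
(One way to check this directly is to open the serialized dictionary yourself. A minimal sketch, assuming — based only on the ClassCastException earlier in this thread — that the file deserializes to an array of Set&lt;String&gt; indexed by word length; that layout is an assumption, not confirmed:)

import java.io.FileInputStream;
import java.io.ObjectInputStream;
import java.util.Set;
import java.util.zip.GZIPInputStream;

public class DictInspect {
  public static void main(String[] args) throws Exception {
    try (ObjectInputStream ois = new ObjectInputStream(
        new GZIPInputStream(new FileInputStream(args[0])))) {
      // Assumed layout: one Set<String> per word length, matching the
      // cast in ChineseDictionary.loadDictionary in the stack trace above.
      @SuppressWarnings("unchecked")
      Set<String>[] byLength = (Set<String>[]) ois.readObject();
      for (int len = 0; len < byLength.length; len++) {
        int n = (byLength[len] == null) ? 0 : byLength[len].size();
        System.out.println("length " + len + ": " + n + " entries");
      }
      // Eyeball a few 2-character entries to judge words vs. n-grams.
      if (byLength.length > 2 && byLength[2] != null) {
        byLength[2].stream().limit(20).forEach(System.out::println);
      }
    }
  }
}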

@AngledLuffa commented Jan 27, 2020 via email

@lingvisa

I can easily spot some 2-character entries that normally wouldn't make good words, like:
归由
胜数
心来
开缺
老而
弄绉
缺顶
肤泛
应负
胡早

3-character:
嫁妆箱
杨岐黄
子模性
磁县都

Do you have a way to retrieve the original sentences where these words occur? They look very unusual, though I'm seeing them without context.

@AngledLuffa commented Jan 28, 2020 via email
