Garbage characters in printed result #894

jiangweiatgithub · 2021-12-08T11:37:27Z

When I run the following python code:

import stanza
from stanza.server import CoreNLPClient
text = "中国是一个伟大的国家。"
print(text)
with CoreNLPClient(
properties='chinese',
classpath=r'F:\StanfordCoreNLP\stanford-corenlp-4.2.2*',
strict=False,
start_server=stanza.server.StartServer.TRY_START ,
annotators=['tokenize','ssplit','pos','lemma','ner', 'parse', 'depparse'],
timeout=30000,
memory='16G') as client:

pattern = 'NP'
matches = client.tregex(text, pattern)
# You can access matches similarly
print(matches['sentences'][0]['0']['match'])

I got:
中国是一个伟大的国家。
2021-12-08 19:35:10 INFO: Using CoreNLP default properties for: chinese. Make sure to have chinese models jar (available for download here: https://stanfordnlp.github.io/CoreNLP/) in CLASSPATH
2021-12-08 19:35:10 INFO: Connecting to existing CoreNLP server at localhost:9000
2021-12-08 19:35:10 INFO: Connecting to existing CoreNLP server at localhost:9000
(NP (NNP �й��һ��ΰ��Ĺ��) (SYM ��))

Any idea about the garbage characters?

Process finished with exit code 0

The text was updated successfully, but these errors were encountered:

AngledLuffa · 2021-12-08T13:07:28Z

This is clearly an encoding issue. What platform are you using?

…

On Wed, Dec 8, 2021 at 3:37 AM jiangweiatgithub ***@***.***> wrote: *When I run the following python code:* import stanza from stanza.server import CoreNLPClient text = "中国是一个伟大的国家。" print(text) with CoreNLPClient( properties='chinese', classpath=r'F:\StanfordCoreNLP\stanford-corenlp-4.2.2*', strict=False, start_server=stanza.server.StartServer.TRY_START , annotators=['tokenize','ssplit','pos','lemma','ner', 'parse', 'depparse'], timeout=30000, memory='16G') as client: pattern = 'NP' matches = client.tregex(text, pattern) # You can access matches similarly print(matches['sentences'][0]['0']['match']) *I got:* 中国是一个伟大的国家。 2021-12-08 19:35:10 INFO: Using CoreNLP default properties for: chinese. Make sure to have chinese models jar (available for download here: https://stanfordnlp.github.io/CoreNLP/) in CLASSPATH 2021-12-08 19:35:10 INFO: Connecting to existing CoreNLP server at localhost:9000 2021-12-08 19:35:10 INFO: Connecting to existing CoreNLP server at localhost:9000 (NP (NNP �й��һ��ΰ��Ĺ��) (SYM ��)) Any idea about the garbage characters? Process finished with exit code 0 — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#894>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AA2AYWOYTS4KM22K5Q4ZDITUP47QHANCNFSM5JTSUOMQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

jiangweiatgithub · 2021-12-08T13:09:01Z

Windows 11, with Simplified Chinese enabled.

jiangweiatgithub · 2021-12-09T01:53:14Z

FYI, when I trying browsing the localhost server: I get garbage results as well:

AngledLuffa · 2021-12-09T23:52:49Z

For the corenlp / stanza connection, there's a setting where the text needs to converted to utf-8, and then on the receiving end the text needs to be converted back. Otherwise it doesn't work on Windows. Unfortunately this means both packages need to be released in order to fix this! I'll send a temporary update soon-ish.

jiangweiatgithub · 2021-12-10T00:53:57Z

Sure. Let me know when it is ready and how I can get it. Thanks you for your quick response!

AngledLuffa · 2021-12-10T08:22:36Z

Ok, this should work. Stanza side:

pip install --no-deps --force git+git://github.com/stanfordnlp/stanza.git@f1a427c48bc9ec6a88f4bdbdffabfb4bf99a9bc5

CoreNLP side:

https://nlp.stanford.edu/software/stanford-corenlp-4.3.2b.zip

... for posterity, those will both eventually be incorporated into future releases, in case the branch and/or temp corenlp release don't exist in the future.

stale · 2022-02-08T09:02:27Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

AngledLuffa · 2022-02-08T18:16:29Z

CoreNLP with this update is now available - 4.4.0

stanza update should be soon

AngledLuffa · 2022-04-23T07:02:00Z

stanza 1.4.0 and corenlp 4.4.0 together should have this fix

AngledLuffa mentioned this issue Dec 9, 2021

Make the server write json stuff in utf-8, even on windows systems stanfordnlp/CoreNLP#1231

Merged

stale bot added the stale label Feb 8, 2022

stale bot removed the stale label Feb 8, 2022

AngledLuffa added the fixed on dev label Feb 8, 2022

AngledLuffa closed this as completed Apr 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Garbage characters in printed result #894

Garbage characters in printed result #894

jiangweiatgithub commented Dec 8, 2021

AngledLuffa commented Dec 8, 2021 via email

jiangweiatgithub commented Dec 8, 2021

jiangweiatgithub commented Dec 9, 2021 •

edited

AngledLuffa commented Dec 9, 2021

jiangweiatgithub commented Dec 10, 2021 •

edited

AngledLuffa commented Dec 10, 2021

stale bot commented Feb 8, 2022

AngledLuffa commented Feb 8, 2022

AngledLuffa commented Apr 23, 2022

Garbage characters in printed result #894

Garbage characters in printed result #894

Comments

jiangweiatgithub commented Dec 8, 2021

AngledLuffa commented Dec 8, 2021 via email

jiangweiatgithub commented Dec 8, 2021

jiangweiatgithub commented Dec 9, 2021 • edited

AngledLuffa commented Dec 9, 2021

jiangweiatgithub commented Dec 10, 2021 • edited

AngledLuffa commented Dec 10, 2021

stale bot commented Feb 8, 2022

AngledLuffa commented Feb 8, 2022

AngledLuffa commented Apr 23, 2022

jiangweiatgithub commented Dec 9, 2021 •

edited

jiangweiatgithub commented Dec 10, 2021 •

edited