Garbage characters in printed result #894

Closed
jiangweiatgithub opened this issue Dec 8, 2021 · 9 comments

Comments

@jiangweiatgithub

When I run the following python code:

import stanza
from stanza.server import CoreNLPClient

text = "中国是一个伟大的国家。"
print(text)
with CoreNLPClient(
        properties='chinese',
        classpath=r'F:\StanfordCoreNLP\stanford-corenlp-4.2.2*',
        strict=False,
        start_server=stanza.server.StartServer.TRY_START,
        annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'parse', 'depparse'],
        timeout=30000,
        memory='16G') as client:
    pattern = 'NP'
    matches = client.tregex(text, pattern)
    # You can access matches similarly
    print(matches['sentences'][0]['0']['match'])

I got:
中国是一个伟大的国家。
2021-12-08 19:35:10 INFO: Using CoreNLP default properties for: chinese. Make sure to have chinese models jar (available for download here: https://stanfordnlp.github.io/CoreNLP/) in CLASSPATH
2021-12-08 19:35:10 INFO: Connecting to existing CoreNLP server at localhost:9000
2021-12-08 19:35:10 INFO: Connecting to existing CoreNLP server at localhost:9000
(NP (NNP �й���һ��ΰ��Ĺ���) (SYM ��))

Any idea about the garbage characters?

Process finished with exit code 0

@AngledLuffa
Collaborator

AngledLuffa commented Dec 8, 2021 via email

@jiangweiatgithub
Author

Windows 11, with Simplified Chinese enabled.

@jiangweiatgithub
Author

jiangweiatgithub commented Dec 9, 2021

FYI, when I try browsing the localhost server, I get garbage results as well:

Screenshot 2021-12-09 095445

@AngledLuffa
Collaborator

For the corenlp / stanza connection, there's a setting where the text needs to be converted to utf-8, and then on the receiving end the text needs to be converted back. Otherwise it doesn't work on Windows. Unfortunately this means both packages need to be released in order to fix this! I'll send a temporary update soon-ish.
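
To illustrate the kind of mismatch described here, a minimal sketch follows. It is not the actual stanza or CoreNLP change: the `send` and `receive` helpers are hypothetical, and the use of GBK on one end is an assumption based on the reporter's Simplified Chinese Windows setup.

```python
# Illustrative only: send/receive are hypothetical helpers,
# not functions from stanza or CoreNLP.
TEXT = "中国是一个伟大的国家。"


def send(text: str, encoding: str = "utf-8") -> bytes:
    """Encode text into bytes, as a client would before sending it."""
    return text.encode(encoding)


def receive(payload: bytes, encoding: str = "utf-8") -> str:
    """Decode received bytes; undecodable bytes become U+FFFD."""
    return payload.decode(encoding, errors="replace")


# Both ends agree on UTF-8: the text round-trips unchanged.
round_trip = receive(send(TEXT))

# Assumed mismatch: one end writes GBK, the other expects UTF-8.
garbled = receive(send(TEXT, encoding="gbk"))

if __name__ == "__main__":
    print(round_trip == TEXT)
    print("\ufffd" in garbled)
```

In the mismatched case the decoded string contains U+FFFD replacement characters, which is the same kind of garbage seen in the printed parse tree.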

@jiangweiatgithub
Author

jiangweiatgithub commented Dec 10, 2021

Sure. Let me know when it is ready and how I can get it. Thank you for your quick response!

@AngledLuffa
Collaborator

Ok, this should work. Stanza side:

pip install --no-deps --force git+git://github.com/stanfordnlp/stanza.git@f1a427c48bc9ec6a88f4bdbdffabfb4bf99a9bc5

CoreNLP side:

https://nlp.stanford.edu/software/stanford-corenlp-4.3.2b.zip

... for posterity, those will both eventually be incorporated into future releases, in case the branch and/or temp corenlp release don't exist in the future.

@stale

stale bot commented Feb 8, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Feb 8, 2022
@AngledLuffa
Collaborator

CoreNLP with this update is now available - 4.4.0

stanza update should be soon

@AngledLuffa
Collaborator

stanza 1.4.0 and corenlp 4.4.0 together should have this fix
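
A rough way to check whether the installed stanza is at least the release mentioned here is sketched below. The helper names are hypothetical and the version parsing is deliberately simplified (only the leading digits of the first three components are compared). It does not check the CoreNLP side, which has to be verified from the downloaded distribution.

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile

MIN_STANZA = (1, 4, 0)  # stanza release named in this thread as carrying the fix


def parse_version(v: str) -> tuple:
    """Turn '1.4.0' into (1, 4, 0), keeping only leading digits per component."""
    parts = []
    for piece in v.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:  # pad '1.4' to (1, 4, 0)
        parts.append(0)
    return tuple(parts)


def stanza_has_fix():
    """Return True/False for the installed stanza, or None if it is not installed."""
    try:
        installed = version("stanza")
    except PackageNotFoundError:
        return None
    return parse_version(installed) >= MIN_STANZA


if __name__ == "__main__":
    print(stanza_has_fix())
```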
