CoreNLP server interprets input as Latin-1 #125
I believe you have to set the content type in the HTTP header, rather than in the URL parameters. For example, see the demo server: https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/demo/corenlp-brat.js#L610 It's true that by default we do not decode as UTF-8. This is actually on purpose; the default encoding for a web request is ISO-8859-1.
That said, perhaps this is me being too pedantic. Jason+Chris: should I change this to UTF-8?
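A minimal sketch of declaring the charset in the request header rather than in the URL, using Python's standard library (the `localhost:9000` endpoint and the `properties` parameter here are assumptions about a default local server, not taken from this thread):

```python
import urllib.parse
import urllib.request

# Hypothetical local CoreNLP server URL; endpoint and port are assumptions.
url = "http://localhost:9000/?" + urllib.parse.urlencode(
    {"properties": '{"annotators": "tokenize,pos"}'}
)

text = "ação"  # Portuguese word with accented characters

req = urllib.request.Request(
    url,
    data=text.encode("utf-8"),
    # The charset belongs in the Content-Type header,
    # not in the URL query string.
    headers={"Content-Type": "text/plain; charset=utf-8"},
)

print(req.get_header("Content-type"))  # text/plain; charset=utf-8
```

Building the `Request` object does not send anything; the point is only where the `charset` declaration goes.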
My bad! I didn't check the HTTP header. Still, if you decide to keep this behavior, I would suggest mentioning it in the documentation.
Well, you're right that it's a bit of a dumb default for NLP applications. I've pushed a change to the code where it decodes as UTF-8 by default, unless you pass in the -strict flag.
It seems there's still something wrong. The behavior of the server is different depending on whether or not I add the charset metadata to the HTTP header.
I've got the same issue: a request in German, encoded in UTF-8, returns a response encoded in ISO-8859-1. This is a bit of a problem, because JSON documents can only be encoded in UTF-8, UTF-16 or UTF-32, according to the standard: https://tools.ietf.org/html/rfc7159#section-8.1
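The RFC 7159 point can be seen with Python's stdlib `json` parser, which accepts UTF-8/16/32 bytes but rejects a Latin-1-encoded document (a small illustrative sketch, not from this thread):

```python
import json

payload = {"word": "Äpfel"}  # German, contains a non-ASCII character

utf8_doc = json.dumps(payload, ensure_ascii=False).encode("utf-8")
latin1_doc = json.dumps(payload, ensure_ascii=False).encode("iso-8859-1")

# UTF-8 is fine: conforming parsers must accept it.
print(json.loads(utf8_doc))  # {'word': 'Äpfel'}

# Latin-1 is not one of the encodings RFC 7159 allows,
# and a conforming parser can reject it outright.
try:
    json.loads(latin1_doc)
except UnicodeDecodeError:
    print("Latin-1-encoded JSON is rejected")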
OK, I think I found the problem. Can you try again with the most recent version from GitHub? |
I'll interpret the lack of complaints as indicating that this fixed the problem. |
I ran into the bug report eventually, and thank you, but I am not sure I have understood correctly. The docs currently say: "The official HTTP 1.1 specification recommends ISO-8859-1 as the encoding of a request, unless a different encoding is explicitly set by using the Content-Type header. However, for most NLP applications this is an unintuitive default, and so the server instead defaults to UTF-8. To enable the ISO-8859-1 default, pass in the -strict flag to the server at startup." But AIUI, this is not true for 3.6.0, which is what the docs say they are talking about, though it will be true in the next version (soon may it be released...!). I was not able to find any set of headers I could send to the CoreNLP server to get a UTF-8 response back from a UTF-8 submission with 3.6.0.
The server in 3.6.0 is fairly broken, and this is one of the bugs (responses are never in UTF-8). I believe the GitHub version of the code should have fixed this; there should (I also hope) be a new release in a few months. |
Great, we will try to work out how to compile (not java users usually...) and find out, thanks. It might be worth updating the site in the meantime so this is clear. |
I'll try to get the pagerank of our StackOverflow documentation up: http://stackoverflow.com/documentation/stanford-nlp/4122/introduction-to-stanford-corenlp#t=201608021728295977479 (directions on how to install from GitHub) |
That is dead useful, thank you, hadn't seen that. |
I'm using the CoreNLP server with a POS tagger I trained on Portuguese data. The tagger was trained on UTF-8 files and its properties file explicitly tells it to use UTF-8.
Running the tagger via command line works fine with UTF-8 input (calling either the tagger class or the StanfordCoreNLP class), but the server apparently always interprets input data as Latin-1. I noticed it because the tagger makes blatant mistakes on accented words when I encode the data as UTF-8, but not when it's encoded as Latin-1.
I also tried including
encoding=utf-8
in the URL parameters, but it had no effect.
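The symptom described above (blatant mistakes only on accented words) is classic mojibake, and can be reproduced directly: UTF-8 bytes misread as ISO-8859-1 split each accented character into two garbage characters, which the tagger then mis-handles. A small sketch:

```python
text = "coração"  # Portuguese for "heart"
utf8_bytes = text.encode("utf-8")

# What a server sees if it wrongly assumes the request is ISO-8859-1:
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)  # coraÃ§Ã£o

# The damage is reversible only because Latin-1 maps every byte to a
# character; re-encoding and decoding correctly recovers the original.
recovered = garbled.encode("iso-8859-1").decode("utf-8")
print(recovered)  # coração
```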