
CoreNLP server interprets input as Latin-1 #125

Closed
erickrf opened this issue Jan 16, 2016 · 12 comments

Comments

@erickrf

erickrf commented Jan 16, 2016

I'm using the CoreNLP server with a POS tagger I trained on Portuguese data. The tagger was trained on UTF-8 files, and its properties file explicitly tells it to use UTF-8.

Running the tagger via command line works fine with UTF-8 input (calling either the tagger class or the StanfordCoreNLP class), but the server apparently always interprets input data as Latin-1. I noticed it because the tagger makes blatant mistakes on accented words when I encode the data as UTF-8, but not when it's encoded as Latin-1.

I also tried including encoding=utf-8 in the URL parameters, but it had no effect.

@gangeli
Member

gangeli commented Jan 16, 2016

I believe you have to set the content type in the HTTP header, rather than the URL parameters. For example, from the demo server: https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/demo/corenlp-brat.js#L610

It's true that by default we do not decode to UTF-8. This is actually on purpose; the default encoding for a web request is ISO-8859-1:

It is very important to always label Web documents explicitly. HTTP 1.1 says that the default charset is ISO-8859-1.

That said, perhaps this is me being too pedantic. Jason+Chris: should I change this to UTF-8?

@erickrf
Author

erickrf commented Jan 16, 2016

My bad! I didn't check the http header. Still, if you decide to keep this behavior, I would suggest mentioning it in the documentation.

@gangeli
Member

gangeli commented Jan 17, 2016

Well, you're right that it's a bit of a dumb default for NLP applications. I've pushed a change to the code where it decodes in UTF-8 by default, unless you pass in the -strict tag to the server. It should show up on GitHub soon, and will hopefully even make it into the upcoming Maven release.

@gangeli gangeli closed this as completed Jan 17, 2016
@erickrf
Author

erickrf commented Jan 19, 2016

It seems there's still something wrong. The server behaves differently depending on whether or not I add the charset metadata to the HTTP header.
When I add the charset, it understands the data correctly but replies in ISO-8859-1. If I don't, it apparently tries to decode from ISO-8859-1 but replies in UTF-8.
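The kind of mismatch described here is easy to reproduce client-side. A minimal sketch (the sample string is a stand-in for a server reply, not actual CoreNLP output):

```python
# Stand-in for a reply the server encoded as ISO-8859-1.
reply_bytes = "ação".encode("iso-8859-1")

# Decoding with the wrong charset fails (or garbles) accented characters:
try:
    wrong = reply_bytes.decode("utf-8")
except UnicodeDecodeError:
    wrong = None

# Decoding with the charset the server actually used recovers the text:
right = reply_bytes.decode("iso-8859-1")
print(right)
```

This is why the client and server must agree on the charset in both directions: the request body and the response body each need a correct `charset=` declaration.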

@basvandenbroek

I've got the same issue: a request in German, encoded in UTF-8, returns a response encoded in ISO-8859-1. This is a bit of a problem, because JSON documents can only be encoded in UTF-8, UTF-16, or UTF-32 according to the standard: https://tools.ietf.org/html/rfc7159#section-8.1

@gangeli
Member

gangeli commented Feb 10, 2016

OK, I think I found the problem. Can you try again with the most recent version from GitHub?

@gangeli gangeli reopened this Feb 10, 2016
@gangeli
Member

gangeli commented Feb 23, 2016

I'll interpret the lack of complaints as indicating that this fixed the problem.

@willmoy
Contributor

willmoy commented Jul 31, 2016

I ran into this bug report eventually, and thank you, but I am not sure I have understood correctly.

The docs currently say: "The official HTTP 1.1 specification recommends ISO-8859-1 as the encoding of a request, unless a different encoding is explicitly set by using the Content-Type header. However, for most NLP applications this is an unintuitive default, and so the server instead defaults to UTF-8. To enable the ISO-8859-1 default, pass in the -strict flag to the server at startup."
http://stanfordnlp.github.io/CoreNLP/corenlp-server.html#character-encoding

But AIUI, this is not true for 3.6.0, which is what the docs say they are talking about? But will be true in the next version (soon may it be released...!)

I was not able to find any set of headers I could send to CoreNLP server to get a UTF-8 response back from a UTF-8 submission with 3.6.0.

@gangeli
Member

gangeli commented Jul 31, 2016

The server in 3.6.0 is fairly broken, and this is one of the bugs (responses are never in UTF-8). I believe the GitHub version of the code should have fixed this; there should (I also hope) be a new release in a few months.

@willmoy
Contributor

willmoy commented Aug 2, 2016

Great, we will try to work out how to compile (not Java users usually...) and find out, thanks.

It might be worth updating the site in the meantime so this is clear.

@gangeli
Member

gangeli commented Aug 2, 2016

I'll try to get the pagerank of our StackOverflow documentation up: http://stackoverflow.com/documentation/stanford-nlp/4122/introduction-to-stanford-corenlp#t=201608021728295977479 (directions on how to install from GitHub)

@willmoy
Contributor

willmoy commented Aug 2, 2016

That is dead useful, thank you, hadn't seen that.
