
CoreNLP server interprets input as Latin-1 #125

Closed
erickrf opened this issue Jan 16, 2016 · 12 comments

Comments

@erickrf

erickrf commented Jan 16, 2016

I'm using the CoreNLP server with a POS tagger I trained on Portuguese data. The tagger was trained on UTF-8 files, and its properties file explicitly tells it to use UTF-8.

Running the tagger via command line works fine with UTF-8 input (calling either the tagger class or the StanfordCoreNLP class), but the server apparently always interprets input data as Latin-1. I noticed it because the tagger makes blatant mistakes on accented words when I encode the data as UTF-8, but not when it's encoded as Latin-1.

I also tried including encoding=utf-8 in the URL parameters, but it had no effect.

@gangeli
Member

gangeli commented Jan 16, 2016

I believe you have to set the content type in the HTTP header, rather than the URL parameters. For example, from the demo server: https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/demo/corenlp-brat.js#L610

It's true that by default we do not decode to UTF-8. This is actually on purpose; the default encoding for a web request is ISO-8859-1:

It is very important to always label Web documents explicitly. HTTP 1.1 says that the default charset is ISO-8859-1.

That said, perhaps this is me being too pedantic. Jason+Chris: should I change this to UTF-8?

@erickrf
Author

erickrf commented Jan 16, 2016

My bad! I didn't check the http header. Still, if you decide to keep this behavior, I would suggest mentioning it in the documentation.

@gangeli
Member

gangeli commented Jan 17, 2016

Well, you're right that it's a bit of a dumb default for NLP applications. I've pushed a change to the code where it decodes in UTF-8 by default, unless you pass in the -strict tag to the server. It should show up on GitHub soon, and will hopefully even make it into the upcoming Maven release.

@gangeli gangeli closed this as completed Jan 17, 2016
@erickrf
Author

erickrf commented Jan 19, 2016

It seems there's still something wrong. The server behaves differently depending on whether or not I add the charset metadata to the HTTP header.
When I add the charset, it understands the data correctly but replies in ISO-8859-1. If I don't, it apparently tries to decode from ISO-8859-1 but replies in UTF-8.
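The kind of mismatch described here is easy to reproduce client-side. A minimal sketch (the sample string is a stand-in for a server reply, not actual CoreNLP output):

```python
# Stand-in for a reply the server encoded as ISO-8859-1.
reply_bytes = "ação".encode("iso-8859-1")

# Decoding with the wrong charset fails (or garbles) accented characters:
try:
    wrong = reply_bytes.decode("utf-8")
except UnicodeDecodeError:
    wrong = None

# Decoding with the charset the server actually used recovers the text:
right = reply_bytes.decode("iso-8859-1")
print(right)
```

This is why the client and server must agree on the charset in both directions: the request body and the response body each need a correct `charset=` declaration.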

@basvandenbroek

I've got the same issue: a request in German, encoded in UTF-8, returns a response encoded in ISO-8859-1. This is a bit of a problem, because JSON documents can only be encoded in UTF-8, UTF-16, or UTF-32 according to the standard: https://tools.ietf.org/html/rfc7159#section-8.1

@gangeli
Member

gangeli commented Feb 10, 2016

OK, I think I found the problem. Can you try again with the most recent version from GitHub?

@gangeli gangeli reopened this Feb 10, 2016
@gangeli
Member

gangeli commented Feb 23, 2016

I'll interpret the lack of complaints as indicating that this fixed the problem.

@willmoy
Contributor

willmoy commented Jul 31, 2016

I ran into this bug report eventually, and thank you, but I am not sure I have understood correctly.

The docs currently say: "The official HTTP 1.1 specification recommends ISO-8859-1 as the encoding of a request, unless a different encoding is explicitly set by using the Content-Type header. However, for most NLP applications this is an unintuitive default, and so the server instead defaults to UTF-8. To enable the ISO-8859-1 default, pass in the -strict flag to the server at startup."
http://stanfordnlp.github.io/CoreNLP/corenlp-server.html#character-encoding

But AIUI, this is not true for 3.6.0, which is what the docs say they are talking about? But will be true in the next version (soon may it be released...!)

I was not able to find any set of headers I could send to CoreNLP server to get a UTF-8 response back from a UTF-8 submission with 3.6.0.

@gangeli
Member

gangeli commented Jul 31, 2016

The server in 3.6.0 is fairly broken, and this is one of the bugs (responses are never in UTF-8). I believe the GitHub version of the code should have fixed this; there should (I also hope) be a new release in a few months.

@willmoy
Contributor

willmoy commented Aug 2, 2016

Great, we will try to work out how to compile (not Java users usually...) and find out, thanks.

It might be worth updating the site in the meantime so this is clear.

@gangeli
Member

gangeli commented Aug 2, 2016

I'll try to get the pagerank of our StackOverflow documentation up: http://stackoverflow.com/documentation/stanford-nlp/4122/introduction-to-stanford-corenlp#t=201608021728295977479 (directions on how to install from GitHub)

@willmoy
Contributor

willmoy commented Aug 2, 2016

That is dead useful, thank you, hadn't seen that.
