moved: Body parsing bug due to special characters/encoding? #7

Open
tj opened this Issue Apr 29, 2011 · 2 comments


@tj
Owner
tj commented Apr 29, 2011

I was working on a bookmarklet that, among other things, form-posts the title of whatever page you're on to my server running Express, and I'm seeing Connect's body parser choke on some pages from Amazon.

Here's a super simple test case:

https://gist.github.com/947895

Run that website locally, drag the bookmarklet to your toolbar, and click it on any of the provided Amazon links. You should see an error message like this one:

URIError: URI malformed
    at decodeURIComponent (native)
    at /usr/local/lib/node/.npm/qs/0.1.0/package/lib/querystring.js:28:18
    at Array.reduce (native)
    at /usr/local/lib/node/.npm/qs/0.1.0/package/lib/querystring.js:27:6
    at IncomingMessage.<anonymous> (/usr/local/lib/node/.npm/connect/1.3.0/package/lib/middleware/bodyParser.js:74:15)
    at IncomingMessage.emit (events.js:61:17)
    at HTTPParser.onMessageComplete (http.js:132:23)
    at Socket.ondata (http.js:1007:22)
    at Socket._onReadable (net.js:677:27)
    at IOWatcher.onReadable [as callback] (net.js:177:10)

This happens on Amazon pages where the title has special characters, like é or ü. You can change the title of an Amazon page (e.g. by setting document.title in the console) to just é, for example, and it will cause the bug.

I've done some investigating and can give you some more info, but at a high level, it seems that the browser in this case encodes the form differently than encodeURIComponent() does, which causes decodeURIComponent() — used by Connect's body parser — to choke.

For example, calling encodeURIComponent() on that é yields %C3%A9 everywhere, but what the server receives in the form body from these Amazon pages is %E9. Attempting to decodeURIComponent() on %E9 causes this error.
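The mismatch described above can be reproduced without a browser or a server. This is only a minimal sketch; the variable names are made up for illustration. It shows that UTF-8 encodes é as two bytes, while a lone %E9 (the ISO-8859-1 byte for é) is not a valid UTF-8 sequence and makes decodeURIComponent() throw.

```javascript
// encodeURIComponent() always percent-encodes the UTF-8 bytes of the string.
var utf8Encoded = encodeURIComponent('é'); // two bytes: C3 A9

// Decoding the UTF-8 form round-trips cleanly.
var roundTrip = decodeURIComponent(utf8Encoded);

// %E9 is the single ISO-8859-1 byte for é. Read as UTF-8, 0xE9 is the lead
// byte of a three-byte sequence with no continuation bytes, so decoding fails.
var latin1Error = null;
try {
  decodeURIComponent('%E9');
} catch (err) {
  latin1Error = err;
}

console.log(utf8Encoded);                       // %C3%A9
console.log(roundTrip);                         // é
console.log(latin1Error && latin1Error.name);   // URIError
```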

I tried making a sample page for this, but the form post matched encodeURIComponent(). I'm guessing the behavior on Amazon is related to encoding, but I haven't been able to confirm, maybe because Express sends a Content-Type header that specifies utf-8.

All said, it seems that Connect's body parser shouldn't break on these encodings. Hope this info helps. Thanks!
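One way a parser could avoid crashing on such input is to fall back to the raw value when decoding fails. The sketch below is an assumption about what a tolerant decoder might look like, not how Connect or qs actually behave; safeDecode is a hypothetical helper name.

```javascript
// Hypothetical helper: try UTF-8 percent-decoding, and if the input is not
// valid UTF-8, return the raw (still percent-encoded) string instead of throwing.
function safeDecode(str) {
  try {
    return decodeURIComponent(str);
  } catch (err) {
    if (err instanceof URIError) {
      return str; // leave the value undecoded rather than crash the request
    }
    throw err; // anything else is unexpected, so re-throw it
  }
}

console.log(safeDecode('%C3%A9')); // decodes normally
console.log(safeDecode('%E9'));    // returned as-is
```

The trade-off is that the application then receives an undecoded value and has to decide what to do with it, but the request no longer fails with an exception.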

@tj
Owner
tj commented Apr 29, 2011

^ moved from senchalabs/connect

@hokaccha

I ran into a similar case. When a form is POSTed as Shift_JIS, decodeURIComponent cannot decode the body.

This is because decodeURIComponent only handles UTF-8. Other charsets need an appropriate decoding function.

For example, this is a Shift_JIS decoder library:
http://lightbox.on.coocan.jp/ecl_new.txt

How about something like this?
hokaccha/connect@1f7c870
hokaccha@8c0d514

Then:

var express = require('express');
express.bodyParser.qs.decoder = UnescapeSJIS;
...

However, I could not find an ISO-8859-1 decoder.
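For ISO-8859-1 specifically, a decoder can be sketched by hand, because each byte 0x00-0xFF maps directly to the Unicode code point with the same number. The function below is only an illustration: decodeLatin1 is a made-up name, it handles percent escapes only (turning + into a space is left to the caller), and browsers that label content ISO-8859-1 often really send windows-1252, which differs in the 0x80-0x9F range.

```javascript
// Hypothetical ISO-8859-1 percent-decoder: each %XX escape becomes the
// character whose code point equals the byte value.
function decodeLatin1(str) {
  return str.replace(/%([0-9A-Fa-f]{2})/g, function (match, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

console.log(decodeLatin1('%E9'));       // é
console.log(decodeLatin1('caf%E9'));    // café
console.log(decodeLatin1('M%FCnchen')); // München
```

Wiring this in as express.bodyParser.qs.decoder would depend on the decoder hook from the proposed commits above being available.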
