non-ascii characters in demo gallery image filenames cause problems #463

Closed
TierraDelFuego opened this Issue Nov 5, 2012 · 3 comments


@TierraDelFuego

I see that a similar issue was raised and closed, but the real fix is to avoid non-ASCII characters in filenames. See RFC 3986.

(NB: This analysis comes from the support staff at tigertech.net.)

This might look like the proper filename-part of a URI
Ávila%2C%20Spain.jpg

However, this didn't work. Some investigation revealed that this is
because there are two separate ways to write an "A" with an acute accent
in Unicode, and they look identical on screen.

It can be written as an ASCII "A" followed by a "COMBINING ACUTE ACCENT"
(see http://www.fileformat.info/info/unicode/char/301/index.htm), or
as a "LATIN CAPITAL LETTER A WITH ACUTE" (see
http://www.fileformat.info/info/unicode/char/c1/index.htm).

The trouble is that the UTF-8 byte sequences for these two forms are
completely different. In the first case, it looks like this in hex:

41 cc 81

And in the second case, it looks like this:

c3 81
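The two byte sequences above can be reproduced with a short Python 3 sketch. The helper name `utf8_hex` is made up for illustration; `unicodedata.normalize` is the standard-library call that converts between the two forms.

```python
import unicodedata

decomposed = "A\u0301"   # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT
precomposed = "\u00c1"   # LATIN CAPITAL LETTER A WITH ACUTE


def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(f"{byte:02x}" for byte in text.encode("utf-8"))


print(utf8_hex(decomposed))    # 41 cc 81
print(utf8_hex(precomposed))   # c3 81

# The strings render identically but do not compare equal...
print(decomposed == precomposed)                                # False
# ...until both are brought to the same normalization form (NFC here).
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```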

That really matters, because there is actually no such thing as a UTF-8
HTTP URL. Instead, browsers translate whatever you type into what they
believe is the correct byte representation, percent-encode it, and send
that to the server. But that translation step can go wrong: the reason
that
Ávila%2C%20Spain.jpg didn't seem to work was
that the two browsers we tested prefer the second form of the encoding,
and send that to the server as:

%C3%81vila%2C%20Spain.jpg

But that's not how you saved the filename -- instead, you saved it as
the first form, an ASCII "A" followed by hex "cc 81":

A%CC%81vila%2C%20Spain.jpg

Which, as you can see, works fine if you force the browser to send that
representation.
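As a rough illustration of the mismatch, the two percent-encoded forms can be generated with Python 3's `urllib.parse.quote`. This is only a sketch: the variable names are invented here, and it assumes the browser sends the precomposed (NFC) form, as the two tested browsers did.

```python
import unicodedata
from urllib.parse import quote

# Filename as stored on disk: "A" followed by a combining acute accent.
name_on_disk = "A\u0301vila, Spain.jpg"
# Assumed browser behaviour: the precomposed (NFC) form of the same text.
name_from_browser = unicodedata.normalize("NFC", name_on_disk)

# safe="" makes quote() escape every reserved character, including "," and "/".
print(quote(name_from_browser, safe=""))  # %C3%81vila%2C%20Spain.jpg
print(quote(name_on_disk, safe=""))       # A%CC%81vila%2C%20Spain.jpg
```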

And this discussion doesn't even contemplate the possibility that
someone might be using a browser with a non-UTF-8 character set that
sends the accented "A" character as a completely different byte
representation in another charset. For example, the character is simply
the hex byte c1 in ISO-8859-1, so a browser using that charset may send
it as:

%C1vila%2C%20Spain.jpg

The server has no way to know that these three different representations
are all supposed to be the same file; it's just matching bytes.
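To see what "just matching bytes" means, here is a small Python 3 sketch that decodes the three request paths quoted above into the raw bytes a server would compare against the filename. The variable names are illustrative only.

```python
from urllib.parse import quote, unquote_to_bytes

# The three percent-encoded names discussed above.
requests = [
    "A%CC%81vila%2C%20Spain.jpg",  # UTF-8, "A" + combining accent
    "%C3%81vila%2C%20Spain.jpg",   # UTF-8, precomposed character
    "%C1vila%2C%20Spain.jpg",      # ISO-8859-1, precomposed character
]

# Undo the percent-encoding without assuming any character set.
raw_names = [unquote_to_bytes(path) for path in requests]
for raw in raw_names:
    print(raw)

# Three distinct byte strings, so at most one can match the file on disk.
print(len(set(raw_names)))  # 3

# The ISO-8859-1 form can be produced by passing an explicit encoding.
print(quote("\u00c1vila, Spain.jpg", safe="", encoding="iso-8859-1"))
```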

So we really have two choices about how to make this work reliably. The
first is to avoid non-ASCII URLs entirely; that's our advice.

The second is to determine the actual hex filename representation (UTF-8
byte sequence), then force that in URLs, as in the last example above.
To find the filename byte representation, you could try something like
this in the shell:

ls | perl -pe 's/([\x20\x2c\x5c\x80-\xff])/"%" . uc sprintf "%02x",ord($1)/eg;'

This produces an encoding that should work in any browser:

A%CC%81vila%2C%20Spain.jpg
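A rough Python 3 equivalent of the shell one-liner, for anyone without Perl to hand. The function name `percent_encoded_names` is hypothetical. Note one difference: the Perl regex escapes only space, comma, backslash, and high bytes, whereas `quote(..., safe="")` escapes every character outside the unreserved set.

```python
import os
from urllib.parse import quote


def percent_encoded_names(directory="."):
    """Return the percent-encoded form of each filename's raw bytes."""
    # Passing a bytes path makes os.listdir return names as raw bytes,
    # so no locale or filesystem-encoding guess is involved.
    raw_names = os.listdir(os.fsencode(directory))
    return sorted(quote(name, safe="") for name in raw_names)


if __name__ == "__main__":
    for encoded in percent_encoded_names("."):
        print(encoded)
```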

Bear in mind that I am new to Python, Django, and Mezzanine all at once.

The thumb_url looks correct to me but the image_url does not.

Local vars shown in the browser with DEBUG = True:

/path/lib/python2.6/site-packages/mezzanine/core/templatetags/mezzanine_tags.py
in thumbnail

244 raise FileSystemEncodingChanged()

Variable                   Value
thumb_name                 u'A\u0301vila, Spain-75x75.jpg'
height                     75
image_prefix               u'A\u0301vila, Spain'
thumb_url                  u'uploads/gallery/.thumbnails/A%CC%81vila%2C%20Spain-75x75.jpg'
image_dir                  u'uploads/gallery'
filetype                   'JPEG'
image_name                 u'A\u0301vila, Spain.jpg'
FileSystemEncodingChanged
thumb_path                 u'/path/static/media/uploads/gallery/.thumbnails/A\u0301vila, Spain-75x75.jpg'
width                      75
image_url                  u'uploads/gallery/A\u0301vila, Spain.jpg'
image_url_path             u'uploads/gallery'
quality                    95
thumb_dir                  u'/path/static/media/uploads/gallery/.thumbnails'
image_ext                  u'.jpg'

@stephenmcd
Owner

This is generally the result of the locale for your filesystem and/or database not supporting unicode.

Can you check those?
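One way to check the Python side of this is sketched below for Python 3 (the traceback in this thread is from Python 2.6, where the same `sys` and `locale` calls exist but the print syntax differs). The helper name `encoding_report` is made up for illustration. Run it as the same user and in the same environment as the web process, since the locale can differ between shells.

```python
import locale
import os
import sys


def encoding_report():
    """Collect the encodings Python will use for filenames and text I/O."""
    return {
        "filesystem": sys.getfilesystemencoding(),
        "preferred": locale.getpreferredencoding(False),
        "LANG": os.environ.get("LANG", ""),
        "LC_ALL": os.environ.get("LC_ALL", ""),
    }


for key, value in encoding_report().items():
    print(f"{key:>10}: {value!r}")
```

A filesystem or preferred encoding that is ASCII-only, rather than UTF-8, would be consistent with the locale problem described here.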

@TierraDelFuego

This is the Mezzanine demo using sqlite3. The filesystem can't be the problem: the files are on disk with those filenames. No doubt you are busy and may not have had time to read my entire long report. I also noticed that your demo site will not allow renaming files to these kinds of filenames. The install was a slam dunk, so that stage passed, but I am still stalled at this one. These "first impressions" matter. It seems converting to ASCII is all that's needed. Would you accept a patch for this?
FWIW I couldn't even find the gallery on your demo page.
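For concreteness, the kind of ASCII conversion proposed here could look like the Python 3 sketch below. This is not Mezzanine's code and no patch was attached to this issue; `ascii_filename` is a hypothetical helper. It decomposes accented letters and drops the combining marks, and it silently discards any character that has no ASCII base letter.

```python
import unicodedata


def ascii_filename(name):
    """Fold accented letters to plain ASCII; drop characters with no ASCII base."""
    # NFKD splits "Á" into "A" plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", name)
    # Encoding with errors="ignore" then discards every non-ASCII code point.
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Both Unicode spellings of the name collapse to the same ASCII filename.
print(ascii_filename("A\u0301vila, Spain.jpg"))  # Avila, Spain.jpg
print(ascii_filename("\u00c1vila, Spain.jpg"))   # Avila, Spain.jpg
```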

@stephenmcd
Owner

I did read through it; it's just that this is an issue that was ironed out many months ago, and it usually turned out to be a locale issue. It can also manifest itself when the demo data is installed without a correct locale set and the correct locale is set afterwards, or vice versa. You mentioned you had some help from your hosting company, so I suspect that this may have occurred as they investigated it.

The demo site doesn't necessarily contain the demo data since it's open to anyone to edit - I've reset the demo data and right at this moment at least, you can see the gallery working fine with unicode filenames in it: http://mezzanine.jupo.org/gallery/ - in fact the reason the demo data contains filenames with unicode characters is entirely to raise this issue when it does come up.

Also, for reference, you can see in the bundled fabfile that we define a locale that supports UTF-8 here:

https://github.com/stephenmcd/mezzanine/blob/master/mezzanine/project_template/fabfile.py#L49

which we apply first when provisioning a server, then log back in before continuing, to ensure everything is correct:

https://github.com/stephenmcd/mezzanine/blob/master/mezzanine/project_template/fabfile.py#L331-335

If it is possible that your hosting provider changed the locale after you'd installed the demo data in order to try and rectify this, you might be able to get it working by reinstalling now that the locale is correct. Keep in mind though that this is only demo data, and a quicker path might be simply to remove the demo gallery, given that the locale may be correctly configured now.

@stephenmcd stephenmcd closed this May 14, 2013