A couple of encoding related fixes #27

wants to merge 5 commits into


None yet

4 participants

shajith commented Sep 27, 2012
  1. When passed a block, autolink should yield a string in the right encoding to that block.
  2. Autolinking a URL with a cyrillic x was broken.

Both illustrated by tests.

@vmg vmg commented on the diff Oct 1, 2012
@@ -20,7 +20,18 @@
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
+#include "ruby.h"
vmg Oct 1, 2012 Owner

This cannot go here -- note that autolink.c is a backport from vmg/sundown, GitHub's Markdown parser. This parser is language agnostic, so it cannot use Ruby's specific overrides to have encoding aware helpers.

On top of that, we have a very strict policy of enforcing UTF-8 everywhere, so encodings are rather irrelevant.

shajith Oct 1, 2012 Contributor

My bad, I didn't know about autolink.c being language-agnostic. Also: Are the ctype.h versions of isalpha etc sufficient for UTF-8 input?

vmg Oct 1, 2012 Owner

We assume them to be good enough: according to the IEEE standard for URLs, any characters that escape the extended range need to be percent-encoded in an URL anyway, so all these functions matching the lower range work as expected for all valid URLs.

...This is one of the few times when standards throw us a hand. :)

jamesarosen Oct 3, 2012

Are you saying that if you rinku sees "http://example.com/х" in, for example, an email, it's by definition not a URL and thus shouldn't be auto-linked? The autolinker that GitHub is applying to this comment doesn't have a problem with that. It uses the original string in its original encoding for the <a> contents, and the URL-encoded version for the href:

<a href="http://example.com/%D1%85">http://example.com/х</a>
vmg commented Oct 1, 2012

There is a valid issue with strings returned with the wrong encoding here, but both Rinku and Redcarpet/Sundown are strictly UTF-8-aware libraries, so blindly copying the encodings is not the right answer. The proper fix is to properly set UTF-8 as the encoding of all generated strings, and to verify that the string that gets passed to Rinku is either UTF-8 or UTF8-compatible.

shajith commented Oct 1, 2012

Thanks for reviewing, I will take a shot at verifying the input to be UTF-8 instead of using the encoding of the input string. Do you think it should refuse non-UTF-8 strings (via ArgumentError, i.e)?

vmg commented Oct 1, 2012

Thanks to you for the PR!

Yeah, rejecting invalid encodings is the approach we took in Redcarpet. It makes more sense than re-encoding the string, because the user most of the time doesn't expect a reencoding anyway.

By the way, we should probably accept not only UTF-8, but all UTF8-compatible encodings (i.e. ASCII also applies). From that point of view, the code to copy the encoding index in this PR is already working nicely.

shajith commented Oct 20, 2012

Closing in favor of #28

@shajith shajith closed this Oct 20, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment