Both illustrated by tests.
auto_link changes link encoding when inside a block
removed assertion that was failing the test
switched to rb_ versions of the ctype functions
need to work with no HAVE_RUBY_ENCODING_H
yielding text with correct encoding to the link text block
This cannot go here -- note that autolink.c is a backport from vmg/sundown, GitHub's Markdown parser. That parser is language-agnostic, so it cannot use Ruby-specific overrides to get encoding-aware helpers.
On top of that, we have a very strict policy of enforcing UTF-8 everywhere, so encodings are rather irrelevant.
My bad, I didn't know about autolink.c being language-agnostic. Also: Are the ctype.h versions of isalpha etc sufficient for UTF-8 input?
We assume them to be good enough: according to the IETF standard for URIs (RFC 3986), any characters outside the ASCII range need to be percent-encoded in a URL anyway, so these functions, which only match the lower (ASCII) range, work as expected for all valid URLs.
...This is one of the few times when standards lend us a hand. :)
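To make the percent-encoding point concrete, here is a minimal Ruby sketch. It is not code from Rinku or Sundown; `percent_encode_non_ascii` is a hypothetical helper that assumes the input is a valid UTF-8 string and escapes every non-ASCII character byte by byte, which is what leaves only ASCII for the ctype-style checks to look at.

```ruby
# Hypothetical helper for illustration only; not part of Rinku or Sundown.
# Assumes `url` is a valid UTF-8 string.
def percent_encode_non_ascii(url)
  # Replace each non-ASCII character with the %XX form of its UTF-8 bytes.
  url.gsub(/[^[:ascii:]]/) do |char|
    char.bytes.map { |byte| format("%%%02X", byte) }.join
  end
end

href = percent_encode_non_ascii("http://example.com/х")
puts href              # http://example.com/%D1%85
puts href.ascii_only?  # true
```

Once a URL is in this form, every byte falls in the ASCII range, so classifying bytes with ASCII-only predicates cannot misread part of a multi-byte character.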
Are you saying that if Rinku sees "http://example.com/х" in, for example, an email, it's by definition not a URL and thus shouldn't be auto-linked? The autolinker that GitHub is applying to this comment doesn't have a problem with that. It uses the original string in its original encoding for the <a> contents, and the URL-encoded version for the href:
There is a valid issue with strings returned with the wrong encoding here, but both Rinku and Redcarpet/Sundown are strictly UTF-8-aware libraries, so blindly copying the encodings is not the right answer. The proper fix is to set UTF-8 as the encoding of all generated strings, and to verify that the string that gets passed to Rinku is either UTF-8 or UTF-8-compatible.
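As a rough illustration of what "set UTF-8 as the encoding of all generated strings" means at the Ruby level, here is a small sketch. It does not show Rinku's actual C code; the `raw` string is only a stand-in for bytes that an extension might return tagged as binary even though they are valid UTF-8.

```ruby
# Stand-in for output whose bytes are valid UTF-8 but whose encoding tag is binary.
raw = "<a href=\"http://example.com/\">caf\xC3\xA9</a>".b

# force_encoding only changes the encoding tag; the bytes are not transcoded.
html = raw.dup.force_encoding(Encoding::UTF_8)

puts raw.encoding == Encoding::BINARY  # true
puts html.encoding                     # UTF-8
puts html.valid_encoding?              # true
puts html.bytes == raw.bytes           # true
```

Tagging the result this way is different from copying the input's encoding: the output is always labelled UTF-8, regardless of what the caller passed in.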
Thanks for reviewing. I will take a shot at verifying that the input is UTF-8 instead of using the encoding of the input string. Do you think it should refuse non-UTF-8 strings (e.g. by raising ArgumentError)?
Thanks to you for the PR!
Yeah, rejecting invalid encodings is the approach we took in Redcarpet. It makes more sense than re-encoding the string, because the user most of the time doesn't expect a reencoding anyway.
By the way, we should probably accept not only UTF-8, but all UTF-8-compatible encodings (e.g. US-ASCII also qualifies). From that point of view, the code to copy the encoding index in this PR is already working nicely.
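A minimal Ruby sketch of the "reject instead of re-encode" approach discussed above. The names `UTF8_COMPATIBLE` and `ensure_utf8_compatible!` are hypothetical, and treating only UTF-8 and US-ASCII as compatible is an assumption for illustration; this is not how Redcarpet or Rinku actually implement the check.

```ruby
# Hypothetical allow-list; assumes only these two encodings count as UTF-8-compatible.
UTF8_COMPATIBLE = [Encoding::UTF_8, Encoding::US_ASCII].freeze

# Hypothetical guard: returns the text unchanged, or raises instead of re-encoding.
def ensure_utf8_compatible!(text)
  unless UTF8_COMPATIBLE.include?(text.encoding)
    raise ArgumentError, "expected UTF-8 or US-ASCII input, got #{text.encoding}"
  end
  text
end

ensure_utf8_compatible!("plain ascii".encode(Encoding::US_ASCII))  # accepted
ensure_utf8_compatible!("http://example.com/х")                    # accepted

begin
  ensure_utf8_compatible!("café".encode(Encoding::ISO_8859_1))
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

Raising keeps the behaviour predictable for callers: the string they get back is never silently transcoded.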
Closing in favor of #28