A couple of encoding related fixes #27

Closed
wants to merge 5 commits into from

4 participants

@shajith
  1. When passed a block, autolink should yield a string in the right encoding to that block.
  2. Autolinking a URL containing a Cyrillic х was broken.

Both illustrated by tests.
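
The first issue can be reproduced in plain Ruby. This is a minimal sketch, assuming the string handed to the block was built from raw bytes without an encoding tag (as `rb_str_new` does in a C extension); the variable names are illustrative only.

```ruby
# A UTF-8 URL ending in a Cyrillic "х" (U+0445, bytes 0xD1 0x85).
url = "http://example.com/\u0445"

# Simulate a string rebuilt from raw bytes with no encoding tag:
# String#b returns a copy tagged as binary (ASCII-8BIT).
raw = url.b
puts raw.encoding == Encoding::BINARY   # true
puts raw == url                         # false: same bytes, incompatible encodings

# Tagging the same bytes as UTF-8 restores equality without changing any byte.
tagged = raw.dup.force_encoding(Encoding::UTF_8)
puts tagged == url                      # true
```

This is why a block that compares or concatenates the yielded URL with UTF-8 strings can misbehave even though the bytes are correct.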

@vmg vmg commented on the diff
ext/rinku/autolink.c
@@ -20,7 +20,18 @@
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
+#include "ruby.h"
+
+#ifdef HAVE_RUBY_ENCODING_H
@vmg Owner
vmg added a note

This cannot go here -- note that autolink.c is a backport from vmg/sundown, GitHub's Markdown parser. That parser is language-agnostic, so it cannot use Ruby-specific overrides to get encoding-aware helpers.

On top of that, we have a very strict policy of enforcing UTF-8 everywhere, so encodings are rather irrelevant.

@shajith
shajith added a note

My bad, I didn't know that autolink.c is language-agnostic. Also: are the ctype.h versions of isalpha etc. sufficient for UTF-8 input?

@vmg Owner
vmg added a note

We assume them to be good enough: according to the IETF standard for URLs (RFC 3986), any characters outside the ASCII range need to be percent-encoded in a URL anyway, so all these functions, which only match the lower range, work as expected for all valid URLs.

...This is one of the few times when standards throw us a hand. :)
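
The reasoning above rests on a property of UTF-8: every byte of a multi-byte character is 0x80 or higher, so a classifier that only recognises ASCII never matches part of one. A small sketch of that property follows; `ascii_alnum?` is a hypothetical stand-in for `isalnum` in the "C" locale, not a function from Rinku or Sundown.

```ruby
# Hypothetical ASCII-only classifier, standing in for ctype's isalnum
# as it behaves in the "C" locale (digits, A-Z, a-z).
def ascii_alnum?(byte)
  (0x30..0x39).cover?(byte) || (0x41..0x5A).cover?(byte) || (0x61..0x7A).cover?(byte)
end

cyrillic_ha = "\u0445"       # Cyrillic "х"
bytes = cyrillic_ha.bytes    # two bytes: 0xD1, 0x85

puts bytes.all? { |b| b >= 0x80 }           # true
puts bytes.any? { |b| ascii_alnum?(b) }     # false
puts "a".bytes.all? { |b| ascii_alnum?(b) } # true
```

The consequence is that such a scanner treats a raw non-ASCII character as a non-alphanumeric byte, which is harmless for percent-encoded URLs but matters for the unencoded URL discussed next.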

Are you saying that if Rinku sees "http://example.com/х" in, for example, an email, it's by definition not a URL and thus shouldn't be auto-linked? The autolinker that GitHub is applying to this comment doesn't have a problem with that. It uses the original string in its original encoding for the <a> contents, and the URL-encoded version for the href:

<a href="http://example.com/%D1%85">http://example.com/х</a>
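
For illustration, that behaviour can be approximated as follows. This is a rough sketch, not GitHub's or Rinku's implementation: `percent_encode_non_ascii` and `link_for` are hypothetical helpers, and the sketch does no HTML escaping.

```ruby
# Hypothetical helper: percent-encode every byte outside the ASCII range,
# leaving ASCII bytes (including existing %XX escapes) untouched.
def percent_encode_non_ascii(url)
  url.bytes.map { |b| b < 0x80 ? b.chr : format("%%%02X", b) }.join
end

# Hypothetical helper: encoded URL in the href, original text as the link body.
# No HTML escaping is performed here; real code would need it.
def link_for(url)
  %(<a href="#{percent_encode_non_ascii(url)}">#{url}</a>)
end

puts link_for("http://example.com/\u0445")
# prints: <a href="http://example.com/%D1%85">http://example.com/х</a>
```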
@vmg
Owner
vmg commented

There is a valid issue with strings returned with the wrong encoding here, but both Rinku and Redcarpet/Sundown are strictly UTF-8-aware libraries, so blindly copying the encodings is not the right answer. The proper fix is to set UTF-8 as the encoding of all generated strings, and to verify that the string passed to Rinku is either UTF-8 or UTF-8-compatible.

@shajith

Thanks for reviewing, I will take a shot at verifying the input to be UTF-8 instead of using the encoding of the input string. Do you think it should refuse non-UTF-8 strings (e.g. by raising ArgumentError)?

@vmg
Owner
vmg commented

Thanks to you for the PR!

Yeah, rejecting invalid encodings is the approach we took in Redcarpet. It makes more sense than re-encoding the string, because the user most of the time doesn't expect a reencoding anyway.

By the way, we should probably accept not only UTF-8, but all UTF-8-compatible encodings (e.g. ASCII also qualifies). From that point of view, the code to copy the encoding index in this PR is already working nicely.
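
A Ruby-level sketch of the check discussed here might look like the following. It assumes that "UTF-8-compatible" means UTF-8 or US-ASCII only; `ensure_utf8_compatible!` and `ACCEPTED_ENCODINGS` are hypothetical names, and the actual check in Rinku or Redcarpet may differ.

```ruby
# Assumption: only these two encodings count as UTF-8-compatible.
ACCEPTED_ENCODINGS = [Encoding::UTF_8, Encoding::US_ASCII].freeze

# Hypothetical guard: reject input instead of silently re-encoding it.
def ensure_utf8_compatible!(text)
  unless ACCEPTED_ENCODINGS.include?(text.encoding)
    raise ArgumentError, "expected a UTF-8 compatible string, got #{text.encoding}"
  end
  # A string can be tagged UTF-8 and still contain invalid byte sequences.
  raise ArgumentError, "string contains invalid bytes" unless text.valid_encoding?
  text
end

ensure_utf8_compatible!("http://example.com/\u0445")   # accepted

begin
  ensure_utf8_compatible!("http://example.com/".encode(Encoding::UTF_16LE))
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

Raising keeps the caller in control of any transcoding, which matches the point above that users rarely expect their string to be re-encoded.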

@shajith

Closing in favor of #28

@shajith shajith closed this
Showing with 45 additions and 11 deletions.
  1. +11 −1 ext/rinku/autolink.c
  2. +13 −8 ext/rinku/rinku.c
  3. +21 −2 test/autolink_test.rb
12 ext/rinku/autolink.c
@@ -20,7 +20,18 @@
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
+#include "ruby.h"
+
+#ifdef HAVE_RUBY_ENCODING_H
+#include <ruby/encoding.h>
+#define isalnum(s) rb_isalnum(s)
+#define isspace(s) rb_isspace(s)
+#define isalpha(s) rb_isalpha(s)
+#define ispunct(s) rb_ispunct(s)
+#else
#include <ctype.h>
+#endif
+
#if defined(_WIN32)
#define strncasecmp _strnicmp
@@ -293,4 +304,3 @@ sd_autolink__url(
return link_end;
}
-
21 ext/rinku/rinku.c
@@ -24,6 +24,8 @@
#include <ruby/encoding.h>
#else
#define rb_enc_copy(dst, src)
+#define rb_enc_set_index(str, idx)
+#define rb_enc_get_index(str) 1
#endif
#include "autolink.h"
@@ -74,7 +76,7 @@ static const char *g_hrefs[] = {
};
static void
-autolink__print(struct buf *ob, const struct buf *link, void *payload)
+autolink__print(struct buf *ob, const struct buf *link, void *payload, int enc_index)
{
bufput(ob, link->data, link->size);
}
@@ -191,7 +193,8 @@ rinku_autolink(
unsigned int flags,
const char *link_attr,
const char **skip_tags,
- void (*link_text_cb)(struct buf *ob, const struct buf *link, void *payload),
+ void (*link_text_cb)(struct buf *ob, const struct buf *link, void *payload, int enc_index),
+ int enc_index,
void *payload)
{
size_t i, end;
@@ -267,7 +270,7 @@ rinku_autolink(
BUFPUTSL(ob, "\">");
}
- link_text_cb(ob, link, payload);
+ link_text_cb(ob, link, payload, enc_index);
BUFPUTSL(ob, "</a>");
link_count++;
@@ -287,10 +290,11 @@ rinku_autolink(
* Ruby code
*/
static void
-autolink_callback(struct buf *link_text, const struct buf *link, void *block)
+autolink_callback(struct buf *link_text, const struct buf *link, void *block, int enc_index)
{
VALUE rb_link, rb_link_text;
rb_link = rb_str_new(link->data, link->size);
+ rb_enc_set_index(rb_link, enc_index);
rb_link_text = rb_funcall((VALUE)block, rb_intern("call"), 1, rb_link);
Check_Type(rb_link_text, T_STRING);
bufput(link_text, RSTRING_PTR(rb_link_text), RSTRING_LEN(rb_link_text));
@@ -346,8 +350,8 @@ const char **rinku_load_tags(VALUE rb_skip)
* HTML, Rinku is smart enough to skip the links that are already enclosed in `<a>`
* tags.`
*
- * - `mode` is a symbol, either `:all`, `:urls` or `:email_addresses`,
- * which specifies which kind of links will be auto-linked.
+ * - `mode` is a symbol, either `:all`, `:urls` or `:email_addresses`,
+ * which specifies which kind of links will be auto-linked.
*
* - `link_attr` is a string containing the link attributes for each link that
* will be generated. These attributes are not sanitized and will be include as-is
@@ -392,7 +396,7 @@ rb_rinku_autolink(int argc, VALUE *argv, VALUE self)
ID mode_sym;
rb_scan_args(argc, argv, "14&", &rb_text, &rb_mode,
- &rb_html, &rb_skip, &rb_flags, &rb_block);
+ &rb_html, &rb_skip, &rb_flags, &rb_block);
Check_Type(rb_text, T_STRING);
@@ -434,6 +438,7 @@ rb_rinku_autolink(int argc, VALUE *argv, VALUE self)
rb_raise(rb_eTypeError,
"Invalid linking mode (possible values are :all, :urls, :email_addresses)");
+
count = rinku_autolink(
output_buf,
RSTRING_PTR(rb_text),
@@ -443,6 +448,7 @@ rb_rinku_autolink(int argc, VALUE *argv, VALUE self)
link_attr,
skip_tags,
RTEST(rb_block) ? &autolink_callback : NULL,
+ rb_enc_get_index(rb_text),
(void*)rb_block);
if (count == 0)
@@ -465,4 +471,3 @@ void RUBY_EXPORT Init_rinku()
rb_define_method(rb_mRinku, "auto_link", rb_rinku_autolink, -1);
rb_define_const(rb_mRinku, "AUTOLINK_SHORT_DOMAINS", INT2FIX(SD_AUTOLINK_SHORT_DOMAINS));
}
-
23 test/autolink_test.rb
@@ -32,7 +32,7 @@ def test_global_skip_tags
Rinku.skip_tags = nil
assert_not_equal Rinku.auto_link(url), url
end
-
+
def test_auto_link_with_single_trailing_punctuation_and_space
url = "http://www.youtube.com"
url_result = generate_result(url)
@@ -138,7 +138,7 @@ def test_auto_link_at_eol
url2 = "http://www.ruby-doc.org/core/Bar.html"
assert_equal %(<p><a href="#{url1}">#{url1}</a><br /><a href="#{url2}">#{url2}</a><br /></p>), Rinku.auto_link("<p>#{url1}<br />#{url2}<br /></p>")
- end
+ end
def test_block
link = Rinku.auto_link("Find ur favorite pokeman @ http://www.pokemon.com") do |url|
@@ -149,6 +149,12 @@ def test_block
assert_equal link, "Find ur favorite pokeman @ <a href=\"http://www.pokemon.com\">POKEMAN WEBSITE</a>"
end
+ def test_links_with_cyrillic_x
+ url = "http://example.com/х"
+
+ assert_linked "<a href=\"#{url}\">#{url}</a>", url
+ end
+
def test_autolink_works
url = "http://example.com/"
assert_linked "<a href=\"#{url}\">#{url}</a>", url
@@ -285,6 +291,19 @@ def test_copies_source_encoding
ret = Rinku.auto_link str
assert_equal str.encoding, ret.encoding
end
+
+ def test_block_encoding
+ url = "http://example.com/х"
+ assert_equal "UTF-8", url.encoding.to_s
+
+ link = Rinku.auto_link(url) do |u|
+ assert_equal "UTF-8", u.encoding.to_s
+ u
+ end
+
+ assert_equal link.encoding.to_s, "UTF-8"
+ end
+
end
def generate_result(link_text, href = nil)