Clear coderange when rb_str_resize change size #9552

tompng · 2024-01-16T12:57:42Z

Fixes https://bugs.ruby-lang.org/issues/20189

rb_str_resize was clearing coderange only when size shrinks

if (slen > len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) {
  ENC_CODERANGE_CLEAR(str);
}

But coderange should be cleared also when size is extended.

Example of coderange change with extend
"\x7f" (invalid utf16) → "\x7f\x00" (valid utf16)
"\x7f\x7f" (valid utf16) → "\x7f\x7f\x00" (invalid utf16)

Every existing encoding except utf16 utf32 does not end with "\0".

Encoding.list.select do |encoding|
  (0..0x10ffff).any? do |codepoint|
    bytes = codepoint.chr(encoding).bytes
    bytes != [0] && bytes.last == 0
  rescue
    false
  end
end
#=> [#<Encoding:UTF-16BE>, #<Encoding:UTF-16LE>, #<Encoding:UTF-32BE>, #<Encoding:UTF-32LE>]

Extending with \0 will change coderange only on these four encodings. These are the only encodings which is TERM_LEN > 1.

nobu · 2024-01-16T13:55:32Z

string.c

@@ -3110,7 +3110,7 @@ rb_str_resize(VALUE str, long len)
    int independent = str_independent(str);
    long slen = RSTRING_LEN(str);

-    if (slen > len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) {
+    if (slen != len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) {


With moving termlen from the following block to here, and:

Suggested change

if (slen != len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) {

if ((slen > len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) ||

(termlen > 1 && (slen % termlen || len % termlen))) {

Changed it to (slen % termlen == 0) != (len % termlen == 0) because I think it's more precise

nobu · 2024-01-16T13:56:20Z

diff --git i/ext/-test-/string/set_len.c w/ext/-test-/string/set_len.c
index 049da2cdb5c..b55ef6f4691 100644
--- i/ext/-test-/string/set_len.c
+++ w/ext/-test-/string/set_len.c
@@ -16,9 +16,17 @@ bug_str_append(VALUE str, VALUE addendum)
     return str;
 }
 
+static VALUE
+bug_str_resize(VALUE str, VALUE len)
+{
+    rb_str_resize(str, NUM2LONG(len));
+    return str;
+}
+
 void
 Init_string_set_len(VALUE klass)
 {
     rb_define_method(klass, "set_len", bug_str_set_len, 1);
     rb_define_method(klass, "append", bug_str_append, 1);
+    rb_define_method(klass, "resize", bug_str_resize, 1);
 }
diff --git i/test/-ext-/string/test_set_len.rb w/test/-ext-/string/test_set_len.rb
index e3eff75d9b9..ecd50083650 100644
--- i/test/-ext-/string/test_set_len.rb
+++ w/test/-ext-/string/test_set_len.rb
@@ -63,4 +63,13 @@
     assert_not_predicate str, :ascii_only?
     assert_equal u, str
   end
+
+  def test_valid_encoding_after_resized
+    s = "\0\0".force_encoding(Encoding::UTF_16BE)
+    str = Bug::String.new(s)
+    str.resize(1)
+    assert_not_predicate str, :valid_encoding?
+    str.resize(2)
+    assert_predicate str, :valid_encoding?
+  end
 end

In some encoding like utf-16 utf-32, expanding the string with null bytes can change coderange to either broken or valid.

tompng · 2024-01-16T16:34:58Z

test/-ext-/string/test_set_len.rb

+    str = Bug::String.new(s)
+    assert_predicate str, :valid_encoding?
+    str.resize(2)
+    assert_not_predicate str, :valid_encoding?


This is a test case for shrinking surrogate pair that len % termlen is not changed but valid_encoding? is changed.
This is covered by slen > len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT branch.

byroot · 2024-01-17T00:02:28Z

string.c


-    if (slen > len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) {
+    if ((slen > len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) ||
+        (termlen > 1 && (slen % termlen == 0) != (len % termlen == 0))) {


I think you need to clear if slen > len (truncation) regardless.

I'd advocate for the simplest condition that catch the most common cases. I don't think the goal should be to never clear the coderange when it's theoretically avoidable, just to use fast and robust heuristics to avoid clearing it in the overwhelming majority of cases.

I'd advocate for the simplest condition

I'd like to use slen != len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT because it's simple and respects previous optimization for 7bit coderange. (not updated the pull request to this condition yet)

I think there are about four options.

// 1 slen != len // Most robust // 2 slen != len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT // ENC_CODERANGE_7BIT means it's ascii compatible encoding and all bytes are <=127, // so truncate and extend will never change coderange. // 3 (slen > len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) || (slen < len && termlen > 1) // Need non obvious knowledge that every character(codepoint = 1..0x10ffff) in every termlen==1 encodings currently defined in `Encoding.list` does not end with "\0". // 4 (slen > len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) || (termlen > 1 && (slen % termlen == 0) != (len % termlen == 0)) // Most efficient // Need additional knowledge that every termlen > 1 encodings (utf16 and utf32) are fixed-width, // and the latter part of surrogate pair(utf16) is not "\0\0" so extending by 2bytes does not change broken to valid.

What do you think?

I think the cases that are worth optimizing for are the very common encodings, namely: ASCII, ASCII-BIT and UTF-8, and in the increase case, which is common when appending strings.

Truncation is much rarer in my opinion, so not worth the complexity.

So I would advocate for:

bool truncating = slen > len; if (truncating || termlen != 1) { CLEAR_CODERANGE(...) }

We lose the optimization for truncating ASCII-only strings, but I really doubt it's common enough to justify the complexity.

Of course I don't pretend to be an expert in this area, so I'm happy to defer to @nobu's opinion.

It make sense to me.
I updated to if (slen > len || (termlen != 1 && slen < len)) because rb_str_resize(str, RSTRING_LEN(str)) is used in rb_str_freeze and some other places.

nobu reviewed Jan 16, 2024

View reviewed changes

tompng added 2 commits January 17, 2024 01:14

Clear coderange when rb_str_resize change size

3830ffa

In some encoding like utf-16 utf-32, expanding the string with null bytes can change coderange to either broken or valid.

Add rb_str_resize coderange test

d200c55

tompng force-pushed the fix_str_resize_coderange branch from 20ba9cb to d200c55 Compare January 16, 2024 16:28

tompng commented Jan 16, 2024

View reviewed changes

byroot reviewed Jan 17, 2024

View reviewed changes

Simplify rb_str_resize clear range condition

6abc1d5

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clear coderange when rb_str_resize change size #9552

Clear coderange when rb_str_resize change size #9552

tompng commented Jan 16, 2024 •

edited

nobu Jan 16, 2024

tompng Jan 16, 2024

nobu commented Jan 16, 2024

tompng Jan 16, 2024

byroot Jan 17, 2024 •

edited

tompng Jan 17, 2024

byroot Jan 17, 2024

tompng Jan 18, 2024

byroot Jan 18, 2024

	if (slen != len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) {
	if ((slen > len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) \|\|
	(termlen > 1 && (slen % termlen \|\| len % termlen))) {

Clear coderange when rb_str_resize change size #9552

Are you sure you want to change the base?

Clear coderange when rb_str_resize change size #9552

Conversation

tompng commented Jan 16, 2024 • edited

nobu Jan 16, 2024

Choose a reason for hiding this comment

tompng Jan 16, 2024

Choose a reason for hiding this comment

nobu commented Jan 16, 2024

tompng Jan 16, 2024

Choose a reason for hiding this comment

byroot Jan 17, 2024 • edited

Choose a reason for hiding this comment

tompng Jan 17, 2024

Choose a reason for hiding this comment

byroot Jan 17, 2024

Choose a reason for hiding this comment

tompng Jan 18, 2024

Choose a reason for hiding this comment

byroot Jan 18, 2024

Choose a reason for hiding this comment

tompng commented Jan 16, 2024 •

edited

byroot Jan 17, 2024 •

edited