Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clear coderange when rb_str_resize change size #9552

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

tompng
Copy link
Member

@tompng tompng commented Jan 16, 2024

Fixes https://bugs.ruby-lang.org/issues/20189

rb_str_resize was clearing coderange only when size shrinks

if (slen > len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) {
  ENC_CODERANGE_CLEAR(str);
}

But coderange should be cleared also when size is extended.

Example of coderange change with extend
"\x7f" (invalid utf16) → "\x7f\x00" (valid utf16)
"\x7f\x7f" (valid utf16) → "\x7f\x7f\x00" (invalid utf16)

Every existing encoding except utf16 utf32 does not end with "\0".

Encoding.list.select do |encoding|
  (0..0x10ffff).any? do |codepoint|
    bytes = codepoint.chr(encoding).bytes
    bytes != [0] && bytes.last == 0
  rescue
    false
  end
end
#=> [#<Encoding:UTF-16BE>, #<Encoding:UTF-16LE>, #<Encoding:UTF-32BE>, #<Encoding:UTF-32LE>]

Extending with \0 will change coderange only on these four encodings. These are the only encodings which is TERM_LEN > 1.

string.c Outdated
@@ -3110,7 +3110,7 @@ rb_str_resize(VALUE str, long len)
int independent = str_independent(str);
long slen = RSTRING_LEN(str);

if (slen > len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) {
if (slen != len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With moving termlen from the following block to here, and:

Suggested change
if (slen != len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) {
if ((slen > len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) ||
(termlen > 1 && (slen % termlen || len % termlen))) {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to (slen % termlen == 0) != (len % termlen == 0) because I think it's more precise

@nobu
Copy link
Member

nobu commented Jan 16, 2024

diff --git i/ext/-test-/string/set_len.c w/ext/-test-/string/set_len.c
index 049da2cdb5c..b55ef6f4691 100644
--- i/ext/-test-/string/set_len.c
+++ w/ext/-test-/string/set_len.c
@@ -16,9 +16,17 @@ bug_str_append(VALUE str, VALUE addendum)
     return str;
 }
 
+static VALUE
+bug_str_resize(VALUE str, VALUE len)
+{
+    rb_str_resize(str, NUM2LONG(len));
+    return str;
+}
+
 void
 Init_string_set_len(VALUE klass)
 {
     rb_define_method(klass, "set_len", bug_str_set_len, 1);
     rb_define_method(klass, "append", bug_str_append, 1);
+    rb_define_method(klass, "resize", bug_str_resize, 1);
 }
diff --git i/test/-ext-/string/test_set_len.rb w/test/-ext-/string/test_set_len.rb
index e3eff75d9b9..ecd50083650 100644
--- i/test/-ext-/string/test_set_len.rb
+++ w/test/-ext-/string/test_set_len.rb
@@ -63,4 +63,13 @@
     assert_not_predicate str, :ascii_only?
     assert_equal u, str
   end
+
+  def test_valid_encoding_after_resized
+    s = "\0\0".force_encoding(Encoding::UTF_16BE)
+    str = Bug::String.new(s)
+    str.resize(1)
+    assert_not_predicate str, :valid_encoding?
+    str.resize(2)
+    assert_predicate str, :valid_encoding?
+  end
 end

In some encoding like utf-16 utf-32, expanding the string with null bytes can change coderange to either broken or valid.
str = Bug::String.new(s)
assert_predicate str, :valid_encoding?
str.resize(2)
assert_not_predicate str, :valid_encoding?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a test case for shrinking surrogate pair that len % termlen is not changed but valid_encoding? is changed.
This is covered by slen > len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT branch.

string.c Outdated

if (slen > len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) {
if ((slen > len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) ||
(termlen > 1 && (slen % termlen == 0) != (len % termlen == 0))) {
Copy link
Member

@byroot byroot Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to clear if slen > len (truncation) regardless.

I'd advocate for the simplest condition that catch the most common cases. I don't think the goal should be to never clear the coderange when it's theoretically avoidable, just to use fast and robust heuristics to avoid clearing it in the overwhelming majority of cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd advocate for the simplest condition

I'd like to use slen != len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT because it's simple and respects previous optimization for 7bit coderange. (not updated the pull request to this condition yet)

I think there are about four options.

// 1
slen != len
// Most robust

// 2
slen != len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT
// ENC_CODERANGE_7BIT means it's ascii compatible encoding and all bytes are <=127,
// so truncate and extend will never change coderange.

// 3
(slen > len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) || (slen < len && termlen > 1)
// Need non obvious knowledge that every character(codepoint = 1..0x10ffff) in
every termlen==1 encodings currently defined in `Encoding.list` does not end with "\0".

// 4
(slen > len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) || (termlen > 1 && (slen % termlen == 0) != (len % termlen == 0))
// Most efficient
// Need additional knowledge that every termlen > 1 encodings (utf16 and utf32) are fixed-width,
// and the latter part of surrogate pair(utf16) is not "\0\0" so extending by 2bytes does not change broken to valid.

What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the cases that are worth optimizing for are the very common encodings, namely: ASCII, ASCII-BIT and UTF-8, and in the increase case, which is common when appending strings.

Truncation is much rarer in my opinion, so not worth the complexity.

So I would advocate for:

bool truncating = slen > len;
if (truncating || termlen != 1) {
  CLEAR_CODERANGE(...)
}

We lose the optimization for truncating ASCII-only strings, but I really doubt it's common enough to justify the complexity.

Of course I don't pretend to be an expert in this area, so I'm happy to defer to @nobu's opinion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It make sense to me.
I updated to if (slen > len || (termlen != 1 && slen < len)) because rb_str_resize(str, RSTRING_LEN(str)) is used in rb_str_freeze and some other places.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
3 participants