New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clear coderange when rb_str_resize change size #9552
base: master
Are you sure you want to change the base?
Conversation
string.c
Outdated
@@ -3110,7 +3110,7 @@ rb_str_resize(VALUE str, long len) | |||
int independent = str_independent(str); | |||
long slen = RSTRING_LEN(str); | |||
|
|||
if (slen > len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) { | |||
if (slen != len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With moving termlen
from the following block to here, and:
if (slen != len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) { | |
if ((slen > len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) || | |
(termlen > 1 && (slen % termlen || len % termlen))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to (slen % termlen == 0) != (len % termlen == 0)
because I think it's more precise
diff --git i/ext/-test-/string/set_len.c w/ext/-test-/string/set_len.c
index 049da2cdb5c..b55ef6f4691 100644
--- i/ext/-test-/string/set_len.c
+++ w/ext/-test-/string/set_len.c
@@ -16,9 +16,17 @@ bug_str_append(VALUE str, VALUE addendum)
return str;
}
+static VALUE
+bug_str_resize(VALUE str, VALUE len)
+{
+ rb_str_resize(str, NUM2LONG(len));
+ return str;
+}
+
void
Init_string_set_len(VALUE klass)
{
rb_define_method(klass, "set_len", bug_str_set_len, 1);
rb_define_method(klass, "append", bug_str_append, 1);
+ rb_define_method(klass, "resize", bug_str_resize, 1);
}
diff --git i/test/-ext-/string/test_set_len.rb w/test/-ext-/string/test_set_len.rb
index e3eff75d9b9..ecd50083650 100644
--- i/test/-ext-/string/test_set_len.rb
+++ w/test/-ext-/string/test_set_len.rb
@@ -63,4 +63,13 @@
assert_not_predicate str, :ascii_only?
assert_equal u, str
end
+
+ def test_valid_encoding_after_resized
+ s = "\0\0".force_encoding(Encoding::UTF_16BE)
+ str = Bug::String.new(s)
+ str.resize(1)
+ assert_not_predicate str, :valid_encoding?
+ str.resize(2)
+ assert_predicate str, :valid_encoding?
+ end
end |
In some encoding like utf-16 utf-32, expanding the string with null bytes can change coderange to either broken or valid.
20ba9cb
to
d200c55
Compare
str = Bug::String.new(s) | ||
assert_predicate str, :valid_encoding? | ||
str.resize(2) | ||
assert_not_predicate str, :valid_encoding? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a test case for shrinking surrogate pair that len % termlen
is not changed but valid_encoding?
is changed.
This is covered by slen > len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT
branch.
string.c
Outdated
|
||
if (slen > len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) { | ||
if ((slen > len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) || | ||
(termlen > 1 && (slen % termlen == 0) != (len % termlen == 0))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to clear if slen > len
(truncation) regardless.
I'd advocate for the simplest condition that catch the most common cases. I don't think the goal should be to never clear the coderange when it's theoretically avoidable, just to use fast and robust heuristics to avoid clearing it in the overwhelming majority of cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd advocate for the simplest condition
I'd like to use slen != len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT
because it's simple and respects previous optimization for 7bit coderange. (not updated the pull request to this condition yet)
I think there are about four options.
// 1
slen != len
// Most robust
// 2
slen != len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT
// ENC_CODERANGE_7BIT means it's ascii compatible encoding and all bytes are <=127,
// so truncate and extend will never change coderange.
// 3
(slen > len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) || (slen < len && termlen > 1)
// Need non obvious knowledge that every character(codepoint = 1..0x10ffff) in
every termlen==1 encodings currently defined in `Encoding.list` does not end with "\0".
// 4
(slen > len && ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) || (termlen > 1 && (slen % termlen == 0) != (len % termlen == 0))
// Most efficient
// Need additional knowledge that every termlen > 1 encodings (utf16 and utf32) are fixed-width,
// and the latter part of surrogate pair(utf16) is not "\0\0" so extending by 2bytes does not change broken to valid.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the cases that are worth optimizing for are the very common encodings, namely: ASCII
, ASCII-BIT
and UTF-8
, and in the increase case, which is common when appending strings.
Truncation is much rarer in my opinion, so not worth the complexity.
So I would advocate for:
bool truncating = slen > len;
if (truncating || termlen != 1) {
CLEAR_CODERANGE(...)
}
We lose the optimization for truncating ASCII-only strings, but I really doubt it's common enough to justify the complexity.
Of course I don't pretend to be an expert in this area, so I'm happy to defer to @nobu's opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It make sense to me.
I updated to if (slen > len || (termlen != 1 && slen < len))
because rb_str_resize(str, RSTRING_LEN(str))
is used in rb_str_freeze and some other places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Fixes https://bugs.ruby-lang.org/issues/20189
rb_str_resize
was clearing coderange only when size shrinksBut coderange should be cleared also when size is extended.
Every existing encoding except utf16 utf32 does not end with
"\0"
.Extending with \0 will change coderange only on these four encodings. These are the only encodings which is
TERM_LEN > 1
.