Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coerce symbols internal fstrings in UTF8 rather than ASCII #2242

Closed
wants to merge 1 commit into from
Closed

Coerce symbols internal fstrings in UTF8 rather than ASCII #2242

wants to merge 1 commit into from

Conversation

casperisfine
Copy link
Contributor

@casperisfine casperisfine commented Jun 19, 2019

Ref: https://bugs.ruby-lang.org/issues/15940

It's not uncommon for symbols to have literal string counterparts, e.g.

class User
  attr_accessor :name

  def as_json
    { 'name' => name }
  end
end

Since the default source encoding is UTF-8, and that symbols coerce their
internal fstring to ASCII when possible, the above snippet will actually keep
two instances of "name" in the fstring registry. One in ASCII, the other
in UTF-8.

Considering that UTF-8 is a strict superset of ASCII, storing the symbols
fstrings as UTF-8 instead makes no real difference, but allows in most cases
to reuse the equivalent string literals.

The only notable behavioral change is Symbol#to_s.

Previously :name.to_s.encoding would be #<Encoding:US-ASCII>.
After this patch it's #<Encoding:UTF-8>. I can't foresee any significant
compatibility impact of this change on existing code.

There are several ruby specs asserting this behavior. If this specification is impossible to change, then we could consider changing the encoding of the String returned by Symbol#to_s, e.g )ruby pseudo code:

def to_s
  str = fstr.dup
  str.force_encoding(Encoding::ASCII) if str.ascii_only?
  str
end

@casperisfine
Copy link
Contributor Author

@eregon CI is now green, please let me know if I made mistake in the specs.

@k0kubun k0kubun changed the base branch from trunk to master August 15, 2019 17:08
@k0kubun
Copy link
Member

k0kubun commented Aug 17, 2019

Now master branch has a required check "check_branch", but I couldn't update your branch to include it. Could you rebase this from master to run it?

It's not uncommon for symbols to have literal string counterparts, e.g.

```
class User
  attr_accessor :name

  def as_json
    { 'name' => name }
  end
end
```

Since the default source encoding is UTF-8, and that symbols coerce their
internal fstring to ASCII when possible, the above snippet will actually keep
two instances of `"name"` in the fstring registry. One in ASCII, the other
in UTF-8.

Considering that UTF-8 is a strict superset of ASCII, storing the symbols
fstrings as UTF-8 instead makes no real difference, but allows in most cases
to reuse the equivalent string literals.

The only notable behavioral change is `Symbol#to_s`.

Previously `:name.to_s.encoding` would be `#<Encoding:US-ASCII>`.
After this patch it's `#<Encoding:UTF-8>`. I can't foresee any significant
compatibility impact of this change on existing code.
@casperisfine
Copy link
Contributor Author

I rebased and the check_branch status was properly triggered.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
4 participants