Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing code with an invalid encoding raises #998

Closed
Earlopain opened this issue Mar 10, 2024 · 8 comments · Fixed by #999
Closed

Parsing code with an invalid encoding raises #998

Earlopain opened this issue Mar 10, 2024 · 8 comments · Fixed by #999

Comments

@Earlopain
Copy link
Contributor

Parser::CurrentRuby.parse("# encoding: utf")

# /home/earlopain/.rbenv/versions/3.3.0/lib/ruby/gems/3.3.0/gems/parser-3.3.0.5/lib/parser/source/buffer.rb:69:in `find': unknown encoding name - utf (ArgumentError)
#
#          Encoding.find(result[3] || result[4] || result[6])

Ruby does raise an error when trying to execute the file:

[earlopain@DESKTOP-PC whitequark-parser]$ ruby test.rb
test.rb:1: unknown encoding name: utf (ArgumentError)

It doesn't look like a SyntaxError, not quite sure what this actually is, where it's coming from. However I believe the parser gem here shouldn't raise an ArgumentError in this case. This makes tools like RuboCop raise as well when trying to analyze these invalid files.

@iliabylich
Copy link
Collaborator

What Ruby does also looks wrong to me but parser simply follows this behaviour. Ripper also throws ArgumentError:

> Ripper.sexp("# encoding: utf\n")
-:1: unknown encoding name: utf (ArgumentError)

To me this error looks very special, a set of encodings that is supported by your Ruby implementation/parser is not defined by any spec, and if it's not well-defined then it can't be a SyntaxError (because it's not a part of the syntax itself).

What should be thrown/returned in your opinion?

@Earlopain
Copy link
Contributor Author

I'm not quite sure what the right thing to do would be to be honest.

This is what Prism returns:

irb(main):002> Prism.parse("# encoding: utf\n")
=> 
#<Prism::ParseResult:0x000075467d799cb0
 @comments=[#<Prism::InlineComment @location=#<Prism::Location @start_offset=0 @length=15 start_line=1>>],
 @data_loc=nil,
 @errors=
  [#<Prism::ParseError @message="unknown or invalid encoding in the magic comment" @location=#<Prism::Location @start_offset=12 @length=3 start_line=1> @level=:argument>],
 @magic_comments=[#<Prism::MagicComment @key="encoding" @value="utf">],
 @source=#<Prism::Source:0x000075467d8a2af8 @offsets=[0, 16], @source="# encoding: utf\n", @start_line=1>,
 @value=
  @ ProgramNode (location: (1,0)-(1,0))
  ├── locals: []
  └── statements:
      @ StatementsNode (location: (1,0)-(1,0))
      └── body: (length: 0),
 @warnings=[]>

They appear to have the luxury of defining each encoding themselves and since they only support one Ruby version currently it's not that complicated https://github.com/ruby/prism/blob/a09fc1f913fed81f820c23534674f3d1c1695aa2/src/encoding.c#L5011-L5198

I don't think that doing this here is worth the effort.

Before reporting I made this commit Earlopain@ce9f99a which just returns nil when an unrecognized encoding is encountered. This results in the default encoding of the ruby version to be used. Does that sound like an acceptable solution to you?

@iliabylich
Copy link
Collaborator

This results in the default encoding of the ruby version to be used. Does that sound like an acceptable solution to you?

No, it makes parser swallow the error (which IMO is the worst of "return something special"/"throw something"/"return nothing" options). Invalid encoding should result in an error of some kind, but I'm not sure that we can find a way that is reasonable and backward-compatible at the same time (even the former is not easy by itself).

@Earlopain
Copy link
Contributor Author

A subclass of ArgumentError, like Parser::UnknownEncodingInMagicComment? If there are consumers that currently handle this behaviour they should continue to work with this change.

This would make the error more actionable. You can currently catch the ArgumentError of course but having something concrete like that is just nicer.

@iliabylich
Copy link
Collaborator

Agree, makes sense to me. It doesn't make this error "less special" but at least it'll be trivial to catch it.

Feel free to send a PR.

@Earlopain
Copy link
Contributor Author

Will do, thank you 👍

@koic
Copy link
Collaborator

koic commented Mar 10, 2024

Note, Prism performs error-tolerant parsing. It would be good to keep in mind that there can be differences due to encoding:

Invalid Encoding

$ ruby -rprism -e 'p Prism.parse_success?("# encoding: utf")'
false

$ ruby -rprism -e 'p Prism.parse("# encoding: utf")'
#<Prism::ParseResult:0x00007fe565ab7c08 @value=@ ProgramNode (location: (1,0)-(1,0))
├── locals: []
└── statements:
    @ StatementsNode (location: (1,0)-(1,0))
    └── body: (length: 0)
, @comments=[#<Prism::InlineComment @location=#<Prism::Location @start_offset=0 @length=15 start_line=1>>], @magic_comments=[#<Prism::MagicComment @key="encoding" @value="utf">], @data_loc=nil, @errors=[#<Prism::ParseError @message="unknown or invalid encoding in the magic comment" @location=#<Prism::Location @start_offset=12 @length=3 start_line=1> @level=:argument>], @warnings=[], @source=#<Prism::Source:0x00007fe565abc140 @source="# encoding: utf", @start_line=1, @offsets=[0]>>

Valid Encoding

$ ruby -rprism -e 'p Prism.parse_success?("# encoding: utf-8")'
true

$ ruby -rprism -e 'p Prism.parse("# encoding: utf-8")'
#<Prism::ParseResult:0x00007fd2ff857c88 @value=@ ProgramNode (location: (1,0)-(1,0))
├── locals: []
└── statements:
    @ StatementsNode (location: (1,0)-(1,0))
    └── body: (length: 0)
, @comments=[#<Prism::InlineComment @location=#<Prism::Location @start_offset=0 @length=17 start_line=1>>], @magic_comments=[#<Prism::MagicComment @key="encoding" @value="utf-8">], @data_loc=nil, @errors=[], @warnings=[], @source=#<Prism::Source:0x00007fd2ff85c080 @source="# encoding: utf-8", @start_line=1, @offsets=[0]>>

@iliabylich
Copy link
Collaborator

iliabylich commented Mar 10, 2024

@koic Of course I fully support the idea of returning as much data in case of an error as possible, but what prism does can be unsafe, especially this part from your output above:

@value="utf-8"

There are no guarantees that what comes after # encoding is a valid UTF-8 string and as a consequence it may return a broken string which may trigger an error later. Ideally an AST that is returned from the parser must e 100% safe, or otherwise you'll need a ton of rescue EncodingError wrappers in the downstream code.

IMO the only safe context of the error are source maps which are somewhat obvious, and like I said before I think this error is very special, the whole parsing output should look like this:

type Output =
  | OK(<ast>, <syntax errors/warnings>, ...)
  | EncodingError(<location>)

which at the end makes no sense. This error is so unusual that it becomes a leaky abstraction that infects types and contracts, probably it can only be "fixed" by deprecating it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants