ast.parse outputs ast.Strs which do not differentiate between the ASCII codepoint 12 (literal new line) and the ASCII codepoints 134 and 156 ("\n") #81092

Closed
hawkowl mannequin opened this issue May 14, 2019 · 6 comments
Labels
3.8 (EOL) end of life stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@hawkowl
Mannequin

hawkowl mannequin commented May 14, 2019

BPO 36911
Nosy @mdickinson, @ericvsmith, @Carreau, @hawkowl

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2019-05-14.23:12:53.592>
created_at = <Date 2019-05-14.02:02:23.534>
labels = ['3.8', 'type-bug', 'library', 'invalid']
title = 'ast.parse outputs ast.Strs which do not differentiate between the ASCII codepoint 12 (literal new line) and the ASCII codepoints 134 and 156 ("\\n")'
updated_at = <Date 2019-05-14.23:12:53.591>
user = 'https://github.com/hawkowl'

bugs.python.org fields:

activity = <Date 2019-05-14.23:12:53.591>
actor = 'eric.smith'
assignee = 'none'
closed = True
closed_date = <Date 2019-05-14.23:12:53.592>
closer = 'eric.smith'
components = ['Library (Lib)']
creation = <Date 2019-05-14.02:02:23.534>
creator = 'hawkowl'
dependencies = []
files = []
hgrepos = []
issue_num = 36911
keywords = []
message_count = 6.0
messages = ['342417', '342422', '342511', '342514', '342519', '342524']
nosy_count = 4.0
nosy_names = ['mark.dickinson', 'eric.smith', 'mbussonn', 'hawkowl']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue36911'
versions = ['Python 2.7', 'Python 3.8']

@hawkowl
Mannequin Author

hawkowl mannequin commented May 14, 2019

reproducing case:

file.py:

"""
Hello \n blah.
"""

And then in a REPL (2.7 or 3+):

>>> import ast
>>> f = ast.parse(open("file.py", 'rb').read())
>>> f
<_ast.Module object at 0x7f609d0a4d68>
>>> f.body[0]
<_ast.Expr object at 0x7f609d0a4e10>
>>> f.body[0].value
<_ast.Str object at 0x7f609d02b780>
>>> f.body[0].value.s
'\nHello \n blah.\n'
>>> repr(f.body[0].value.s)
"'\\nHello \\n blah.\\n'"

Expected behaviour:

>>> repr(f.body[0].value.s)
"'\\nHello \\\\n blah.\\n'"

@hawkowl hawkowl mannequin added 3.8 (EOL) end of life stdlib Python modules in the Lib dir labels May 14, 2019
@Carreau
Mannequin

Carreau mannequin commented May 14, 2019

I believe this happens even before the AST, in the tokenizer. That said, the AST also does some normalisation of identifiers: for example, “ε” (U+03B5 Greek Small Letter Epsilon) and “ϵ” (U+03F5 Greek Lunate Epsilon Symbol) get normalized to the same identifier, which is problematic because they look different but end up being the same name.

I'd be interested in an opt-in flag to not do this normalisation (I have a prototype with this for the identifier normalisation in ast, but I have not looked at the tokenizer), which might be useful for some linting tools.
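
To make the identifier normalisation concrete, here is a minimal sketch. It assumes a Python 3 interpreter; the variable names are only illustrative and are not part of any proposed flag or prototype.

```python
import ast
import unicodedata

LUNATE = "\u03f5"   # GREEK LUNATE EPSILON SYMBOL
EPSILON = "\u03b5"  # GREEK SMALL LETTER EPSILON

# Parse an assignment whose target is written with the lunate epsilon.
tree = ast.parse(f"{LUNATE} = 1")
name = tree.body[0].targets[0].id

# The identifier stored in the AST is the NFKC-normalised form,
# so the original spelling is no longer visible on the Name node.
print(name == EPSILON)
print(unicodedata.normalize("NFKC", LUNATE) == EPSILON)
```

A linter that only sees the AST therefore cannot tell which of the two spellings was used in the source.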

@ericvsmith
Member

The existing behavior is what I'd expect.

Using python3:

>>> import ast
>>> s = open('file.py', 'rb').read()
>>> s
b'"""\nHello \\n blah.\n"""\n'
>>> ast.dump(ast.parse(s))
"Module(body=[Expr(value=Str(s='\\nHello \\n blah.\\n'))])"
>>> eval(s)
'\nHello \n blah.\n'

As always with the AST, some information is lost. It's not designed to be able to round-trip back to the source text.
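
The same check can be sketched without a file on disk. This assumes Python 3.8 or later, where the string node is an `ast.Constant` and the text lives in `.value` (older versions used `ast.Str` and `.s`, as in the transcripts above).

```python
import ast

# The source text as it would sit in file.py: inside the docstring there is
# a backslash followed by "n", not a literal newline.
source = '"""\nHello \\n blah.\n"""\n'

node = ast.parse(source).body[0].value
value = node.value  # assumption: Python 3.8+, node is ast.Constant

# The AST stores the evaluated string, exactly what the literal evaluates to.
evaluated = ast.literal_eval(source)
print(value == evaluated)

# Three newline characters and no backslash survive in the stored value.
print(value.count("\n"), value.count("\\"))
```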

@hawkowl
Mannequin Author

hawkowl mannequin commented May 14, 2019

There's a difference between round-tripping back to the source text and correctly representing the text in the source, though.

Since I'm using this module to perform static analysis of a Python module to retrieve class/function definitions and their docstrings to create API documentation, the string being the same as what it is in the file is important to me.

@mdickinson
Member

The AST _does_ correctly represent the Python string object in the source, though. After:

>>> s = """
... Hello \n world
... """

we have a Python object s of type str, which contains exactly three newlines, zero "n" characters, and zero backslashes. So:

>>> s == '\nHello \n world\n'
True

If the AST Str node value were '\nHello \\n world\n' as you suggest, that would represent a different string to s: one containing two newline characters, one "n" and one backslash.

If you need to operate directly on the source as text, then the AST representation probably isn't what you want.
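
One way to operate on the source as text is the standard library's `tokenize` module, which reports string literals exactly as written. This is only a sketch of that idea, not something proposed in this thread; the `raw_literals` name is illustrative.

```python
import io
import tokenize

source = '"""\nHello \\n blah.\n"""\n'

# Collect the raw text of every plain string literal, quotes included.
raw_literals = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type == tokenize.STRING
]

# The backslash-n escape is still two characters here, unlike in the AST.
print("\\n" in raw_literals[0])
```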

@ericvsmith
Member

I agree with Mark: the string is being correctly interpreted by the AST parser, per Python's tokenizer rules.

You might want to look at lib2to3, which I think is also used by black. It's also possible that mypy or another static analyzer would be using some library you can leverage.
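
For the docstring use case described earlier, another option that stays within the `ast` module is `ast.get_source_segment`, available from Python 3.8. The sketch below is an assumption about how that could be wired up, not a recommendation made in this thread; the function `f` and the variable names are hypothetical.

```python
import ast

source = 'def f():\n    """\n    Hello \\n blah.\n    """\n'

tree = ast.parse(source)
func = tree.body[0]
doc_node = func.body[0].value  # the Constant node holding the docstring

# Evaluated docstring: escape sequences have already been processed.
evaluated = ast.get_docstring(func, clean=False)

# Raw literal text, sliced out of the source using the node's position info.
raw = ast.get_source_segment(source, doc_node)

print("\\" in evaluated)  # backslash is gone after evaluation
print("\\n" in raw)       # but still present in the raw source text
```

This keeps the AST for locating classes and functions while reading the literal's original spelling from the source text.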

@ericvsmith ericvsmith added invalid type-bug An unexpected behavior, bug, or error labels May 14, 2019
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022