Triple quoted literal definition and 7 in row #103594

Closed
VasilijKolomiets opened this issue Apr 17, 2023 · 8 comments
Labels
docs Documentation in the Doc dir pending The issue will be closed if no feedback is provided

Comments

@VasilijKolomiets

Documentation

Here we can read that

  • longstringchar ::= <any source character except "\"> (for example, ")
  • longstringitem ::= [longstringchar] (again, " is valid)
  • longstring ::= "'''" longstringitem* "'''" | '"""' longstringitem* '"""' (so """"""" is valid)

But """"""" gives a syntax error.
So we need to either change the interpreter or clarify something in the documentation, don't we?

@VasilijKolomiets VasilijKolomiets added the docs Documentation in the Doc dir label Apr 17, 2023
@VasilijKolomiets
Author

There was a discussion here, but it seems nobody understood what I was saying:
https://stackoverflow.com/questions/76036574/why-does-the-literal-string-seven-quotes-give-an-error

@terryjreedy
Member

As should be expected, 7 quotes are interpreted as a triple quote, a triple quote, and an unmatched single quote, and that last quote is an error.

>>> '''''''
  File "<stdin>", line 1
    '''''''
          ^
SyntaxError: unterminated string literal (detected at line 1)
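
The same reading can be checked for runs of other lengths. This is only a minimal sketch: the helper name `quote_run_compiles` is hypothetical (not part of any library), and it simply asks the built-in `compile` whether a run of n quote characters is a valid expression.

```python
def quote_run_compiles(n, quote="'"):
    """Return True if a run of n quote characters is a valid expression."""
    source = quote * n
    try:
        compile(source, "<demo>", "eval")
    except SyntaxError:
        return False
    return True


# 6 quotes: ''' + ''' is an empty triple-quoted literal.
# 7 quotes: the empty literal is followed by one unmatched quote.
# 8 quotes: the empty literal is followed by '' (an empty short literal),
#           and adjacent string literals are concatenated.
results = {n: quote_run_compiles(n) for n in (6, 7, 8)}
print(results)
```

Under this reading, only the run of 7 should be rejected among these three lengths.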

The quoted lexical analysis section says:
shortstringchar ::= <any source character except "\" or newline or the quote>
longstringchar ::= <any source character except "\">
As I also said on the SO answer, the grammar in the docs is only partly 'formal'. These two productions are partly informal. The doc grammar is not and cannot be the grammar used to define the parser. It is a human-readable version that must be read along with the text explanation that follows. This explanation starts with "One syntactic restriction not indicated by these productions...". Another is given by "In triple-quoted literals, unescaped newlines and quotes are allowed (and are retained), except that three unescaped quotes in a row terminate the literal." The informal "a quote" in the productions and in the sentence just quoted is then explained as the character used to open the literal, i.e. either ' or ".
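
The termination rule quoted from the docs can be illustrated with a few literals. This is a sketch under the assumption that `ast.literal_eval` is an acceptable stand-in for the interpreter's own handling of string literals; the example strings are made up for illustration.

```python
import ast

# Each entry is the source text of one triple-quoted literal.
examples = [
    '""" " """',      # a single unescaped quote inside the literal
    '"""a""b"""',     # two quotes in a row do not terminate it
    "'''\"\"\"'''",   # three " in a row are fine inside a ''' literal
]

values = [ast.literal_eval(src) for src in examples]
for src, value in zip(examples, values):
    print(f"{src!s:<14} -> {value!r}")
```

Only three unescaped quotes of the opening kind, in a row, end the literal; anything shorter, or the other quote character, is kept as content.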

The longstringchar definition could be extended with "or quote part of a triple quote". But I am not sure that this really adds anything to the sentence given later, and it might possibly be a bit confusing.

I am inclined to close this as 'not planned', but will wait for opinions other than ours.

@zware zware added the pending The issue will be closed if no feedback is provided label Apr 17, 2023
@terryjreedy terryjreedy changed the title Tripple quoted literals Bacus-Nour definition anf 7 in row " Triple quoted literal definition and 7 in row Apr 18, 2023
@VasilijKolomiets
Author

First of all, sorry for my English... )

Ok. Maybe I am a bit too formal. But the documentation indeed has no real definition of the term "triple-quoted literal". Where the term first appears, it says 'see below', but below there is again no formal-like definition. So it is written as if I have to know what a triple-quoted literal is BEFORE reading this text.

So I propose restructuring the docs to explain this term a bit earlier, and the text "In triple-quoted literals, unescaped newlines and quotes are allowed (and are retained), except that three unescaped quotes in a row terminate the literal." has to be reworded. Because I read "triple-quoted literal" as "literal inside a triple-quote frame" or "set of symbols between the opening triple quote and the closing triple quote". So in fact 7 in a row is not a case of "three unescaped quotes in a row terminate the literal", because the literal is the thing that was triple-quoted, i.e. at least 9 " symbols in a row in total would cause this situation.

Please take this into account if possible.

@VasilijKolomiets
Author

VasilijKolomiets commented Apr 18, 2023

First of all, I appreciate your patience and your careful consideration of all sides. Thanks a lot. You say:

The doc grammar is not and can not be the grammar used to define the parser.

What is the problem with placing the real grammar that defines the parser here, in the Python documentation?
Or maybe it does not exist? ;-)
How can I tell that the parser works properly if I cannot see the formal grammar?

Look. I love Python so much that I believe everything connected with it is perfect.
Sorry for my formalism.

@terryjreedy
Member

terryjreedy commented Apr 18, 2023

The doc grammar is for people to read and can have informalities. The formal grammar is for a program to read. One benefit of separating the two was that we could switch from a formal context-free grammar to a formal parsing expression grammar (PEG) in 3.10 without changing the docs. You are free to read the PEG. Literal identification is part of the tokenizer in the Parser directory. As far as I could tell, this is done with hand-crafted C rather than generated from a grammar. The Python-coded tokenize module uses regular expressions. See lines 105-118 for the string regexes. Most people find the semi-formal doc productions more readable.
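
As a rough illustration of what the tokenizer reports, the standard `tokenize` module can be asked for the STRING tokens in a piece of source. The helper name `string_tokens` and the sample source line are assumptions made for this sketch.

```python
import io
import tokenize


def string_tokens(source):
    """Return the text of every STRING token that tokenize finds in source."""
    readline = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type == tokenize.STRING
    ]


# The opening """ is followed by an ordinary " that belongs to the content.
tokens = string_tokens('x = """"TEXT" """\n')
print(tokens)
```

The whole literal, including the fourth quote at the start, comes back as a single STRING token.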

@VasilijKolomiets
Author

VasilijKolomiets commented Apr 18, 2023

Thanks a lot.
I am satisfied. Maybe if the PEG links had been near the Backus-Naur description, I would not have asked anything.
But I will make one last attempt to step away from the traditional view.

Thanks for being humane to strange people like me.

Please do not close this issue for 3 days; I will read the tokenizer ))

@VasilijKolomiets
Author

VasilijKolomiets commented Apr 18, 2023

The doc grammar is for people to read and can have informalities. The formal grammar is for a program to read. One benefit of separating the two was that we could switch from a formal context-free grammar to a formal parsing expression grammar (PEG) in 3.10 without changing the docs. You are free to read the PEG. Literal identification is part of the tokenizer in the Parser directory. As far as I could tell, this is done with hand-crafted C rather than generated from a grammar. The Python-coded tokenize module uses regular expressions. See lines 105-118 for the string regexes. Most people find the semi-formal doc productions more readable.

Ok. I looked at these lines. Look at line 114:

# Tail end of """ string.       
Double3 = r'[^"\\]*(?:(?:\\.|"(?!""))[^"\\]*)*"""'

As I read it,
"(?!"")" means a double quote that is not followed by another double quote (i.e., it matches a single double quote).

So """"""" should again be valid, because " + """ is a valid tail end of a """ string.

But indeed
re.fullmatch(r'[^"\\]*(?:(?:\\.|"(?!""))[^"\\]*)*"""', '""""')

returns None. I.e. logically 7*" is allowed, but it is prohibited by the regex written in line 114.
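
One way to see why the pattern rejects this input is to test the lookahead directly. In the sketch below the pattern is copied from the line quoted above; the variable names are made up for illustration. Note that `(?!"")` is a negative lookahead for two quotes, so it rejects a quote only when two more quotes follow it.

```python
import re

# Pattern copied from the line quoted above (tail end of a """ string).
DOUBLE3 = r'[^"\\]*(?:(?:\\.|"(?!""))[^"\\]*)*"""'

# (?!"") is a negative lookahead for TWO quotes, so a quote that is
# followed by two more quotes cannot be consumed as string content.
full = re.fullmatch(DOUBLE3, '""""')       # None: first " is followed by ""
prefix = re.match(DOUBLE3, '""""')         # matches only the first three quotes
escaped = re.fullmatch(DOUBLE3, r'\""""')  # an escaped quote can precede """

print(full, prefix.end(), escaped is not None)
```

With `re.match` the pattern stops after the first three quotes, which mirrors the interpreter: the string ends there, and the fourth quote starts a new, unterminated literal.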

So the question is: which behavior is correct? An error, or the symbol " as the result?

Here is another example of inconsistent behavior. Four " in a row are allowed at the start but prohibited at the end:

In[5] :""""TEXT" """
Out[5]: '"TEXT" '

In[6] :""""TEXT""""
  File "C:\Users\vasil\AppData\Local\Temp\ipykernel_5516\2295884511.py", line 1
    """"TEXT""""
                ^
SyntaxError: EOL while scanning string literal
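
For a value that must end with a double quote, two spellings work under the current rules. This is a small sketch that uses `ast.literal_eval` on the literal's source text; the variable names are hypothetical.

```python
import ast

# Source text of two literals whose value ends with a double quote.
escaped_tail = r'''""""TEXT\""""'''   # escape the final content quote
other_delim = "'''\"TEXT\"'''"        # or delimit with ''' instead

values = [ast.literal_eval(src) for src in (escaped_tail, other_delim)]
print(values)
```

Both literals evaluate to the same six-character string, with the quotes kept as content.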

@VasilijKolomiets
Author

I love Python; I just want it to be perfect. I just want """"TEXT"""" to give "TEXT". For me that would be perfect. Sorry. They say I am aggressive. Maybe I am, I don't know.
I am closing my issue.

@terryjreedy terryjreedy closed this as not planned Apr 18, 2023