-
-
Notifications
You must be signed in to change notification settings - Fork 31.7k
shlex.split() does not tokenize like the shell #43667
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
When shlex.split defines tokens, it doesn't properly The shell treats the following as identical cases, but echo hi&&echo bye echo hi;echo bye echo hi&echo bye shlex.split makes these cases ambiguous: echo 'foo&' echo '&&exit' |
Thanks for the report. Would you like to work on a patch, or translate your examples into unit tests? The docs do not mention “&” at all, and platform discrepancies have to be taken into account too, so I really don’t know if this is a bug fix for the normal mode, the POSIX mode, or a feature request requiring a new argument to the shlex function to preserve compatibility. |
It's been a while since I looked at this. I'm not really in a I don't think POSIX mode existed when I first reported this, but There are really two cases in one bug. The first part is that the shell will split tokens at characters that The proper handling of quotes/escapes requires some kind of new shlex.get_token2()
Return a tuple of the token and the original text of the token
(including quotes and escapes). Otherwise, this is the same as
shlex.get_token(). Comparing the two values for equality (or maybe identity) would tell -Dan On Fri, Sep 3, 2010 at 10:27 AM, Éric Araujo <report@bugs.python.org> wrote:
|
Thanks for the comments.
See also bpo-7611. |
Of course, that's how it's used. That's all it can do right now. I was was splitting and combining commands (using ;, &&, and ||) and It's kind of like a makefile command block. You want to be able to Once things are split properly, then understanding the shell control I ended up doing my own parsing. I'm not actually at that company I'll see if I can come up with a reference case and maybe a unittest -Dan On Thu, Nov 24, 2011 at 9:20 AM, Éric Araujo <report@bugs.python.org> wrote:
|
|
I've attached a diff to test_shlex.py and a script that I used to I'm completely ignoring the quotes issue for now. That should ref_shlex is python 2 syntax. python -3 shows that it should convert cleanly. Getting this into the mainline will be interesting. I would think it -Dan On Fri, Nov 25, 2011 at 10:01 AM, Éric Araujo <report@bugs.python.org> wrote:
|
I just realized that I left out a major case. The shell will also I've attached updated versions of ref_shlex.py and test_shlex.diff. -Dan On Fri, Nov 25, 2011 at 12:25 PM, Dan Christian <report@bugs.python.org> wrote:
|
Thanks for the diff and test. (I removed the older versions; there are “edit” links in the list of files leading to pages where it’s possible to remove them, if one has the required permissions.) Your script passes with dash, which is probably the most POSIX-compliant shell we can find. (bash has extensions, zsh/csh don’t use the POSIX shell language, so I think the behavior of dash should be our reference, not the bash man page.)
python-dev takes compatibility seriously. Some things are clearly bugs and we fix them, even if it will break buggy code out there. For example, we recently fixed bugs in HTML parsing: We had a specification to decide that they were really bugs, and we judged that no sane program could be relying on the exact behavior of the parser. shlex is another case; in my opinion, it’s been used for years to implement parsing similar, but not identical in all cases, to the shell’s, and as there is code out there that depends on the current behavior of shlex and does not need to support && || ; ( ), if we add support for these tokens we should not break the existing code. Given that we can’t test all programs that use shlex, I think we’ll have to add a new parameter, with a default value which gets us the previous behavior, as I said in my previous message. (BTW, would you mind editing the quoted section when you reply by email? Otherwise we get unhelpful, distracting walls of quoted texts. Thanks in advance.) |
On Sat, Nov 26, 2011 at 7:12 AM, Éric Araujo <report@bugs.python.org> wrote:
I was just looking for a reference where I didn't have to sift through
Here's a thought on how that might work (just brainstorming). shlex There might be a shell specific script (or maybe it's left to the -Dan |
|
The point of ref_shlex.py is that all shells act the same for common
You don't have to do a subclass (although that might have some
Correct, quotes wouldn't change.
shlex is a pretty simple lexer (as lexers go), and I wouldn't want it
Ideally, the final tokens have exact meanings. It easier to write -Dan |
I've made a patch which implements this functionality, together with docs and tests. Please review. |
This time you should have received an email from Rietveld, I made sure that your ID was expanded to an email address. I like all the suggestions you made in reply to my comments. |
I updated the patch to reflect Éric's comments on Rietveld, but there are also some other changes: Previously when punctuation chars were set, wordchars was being augmented by '-'. This was incomplete, so the augmentation is now with '~-./*?=' which allows for wildcards, filename chars and argument flags. I added a token_type attribute whose value is 'a' for alphanumeric tokens and 'c' for punctuation tokens. This token type is internally tracked anyway - we just expose it now. It is needed for when multiple punctuation tokens need to be disambiguated, because we might return two logically separate punctuation tokens as one if they are not separated by whitespace in the source being tokenised. New attributes and the changes to wordchars have been documented, and a test added for token_type return values. |
Plus I also changed a few instances of the anachronism a = a + b to a += b |
Overall great patch! Dan, do you have time to test it (or read the new examples in the patch) to tell us if it meets what you wanted? |
Yes, I believe that testSyntaxSplitCustom covers this.
Thanks! It was a bit fiddly, shlex is somewhat difficult to extend cleanly. I developed this functionality for a subprocess ease-of-use-wrapper module called sarge, and I had to basically copy and modify the whole read_token method :-( |
I haven't been following this much. Sorry. My day job isn't in this area any more (and I'm stuck using 2.4 :-(). Looking at the docs, I notice the "old" is different from what it used to be. Notably: 'e;' gets split into two tokens; and ">'abc';" gets split into 3. I'm pretty sure that baseline code doesn't split those at all. So there is a question of if "old" is fully backward compatible. The "new" functionality looks great. That's what I was looking for when I filed the bug. Thank you! |
I've received no comments on the latest revision of my patch (incorporating comments on the previous version); is it OK to commit this? |
I'd like to take a look at this (I wasn't aware of it before). I'll try to do that some time in the next 24 hours, and if I don't you shouldn't wait for me :) Did you address Dan's concern about 'old' possibly not matching the old behavior completely? |
I believe Dan meant that the behaviour of shlex.split() now is different from what it was when he first raised the issue (in July 2006). Looking at the default branch of CPython, this is what I see: Python 3.3.0a2+ (default:ff6593aa8376, Apr 22 2012, 12:39:08)
[GCC 4.3.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> list(shlex.shlex('e;'))
['e', ';']
>>> list(shlex.shlex(">'abc';"))
['>', "'abc'", ';'] Likewise, on the 2.6 branch: Python 2.6.8+ (unknown, Apr 22 2012, 12:44:43)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> list(shlex.shlex('e;'))
['e', ';']
>>> list(shlex.shlex(">'abc';"))
['>', "'abc'", ';'] So from what Dan is saying, it would seem that he is saying that shlex behaviour (before my patch being applied) is different now to how he remembers it - not that the patch introduces any incompatibility. Still, another set of eyeballs on the patch would be good. |
BTW, what is the shlex unicode bug you mentioned a few times on Rietveld? The one I know is fixed now. |
I am interested in shell stuff in general. The unicode bug is bpo-1170. |
I've updated the patch following comments by RDM - it probably could do with a code review (now that I've addressed RDM's comments on the docs). |
Review, including a code-but-not-algorithm review :), posted. |
Let's hope we can get this into 3.5. I updated my patch a while ago to address RDM's comments. |
Is there any chance of getting this into 3.6? We are still in a situation where the shlex module misleads developers into believing that it has functionality to parse things the way the shell does. I've had to vendor the copy of shlex with patches from this bug applied (thanks Vinay!) |
This has been knocking around since 3.3, but never got enough attention to make it in. Barring objections from anyone, I'd like to commit this patch once I check that it applies cleanly against 3.6, before we get into 3.6 beta. |
No objection from me. I'm not likely to have the time to give it the kind of thorough review I'd *like* to, but I don't think it is really needed. |
Okay, I've updated with a new patch addressing SilentGhost's comments, and addressed the comments on that patch. If I don't hear any objections by Friday, I plan to commit this change. |
New changeset ea99e2f0b829 by Vinay Sajip in branch 'default': |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: