Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python: Can't match unicode snowman literals in strings #4336

Closed
1 of 3 tasks
craigds opened this issue Nov 26, 2021 · 8 comments · Fixed by #4415
Closed
1 of 3 tasks

Python: Can't match unicode snowman literals in strings #4336

craigds opened this issue Nov 26, 2021 · 8 comments · Fixed by #4415
Assignees
Labels
lang:python priority:low user:external requested by someone outside of r2c

Comments

@craigds
Copy link

craigds commented Nov 26, 2021

Describe the bug

I'm trying to match a string, in python, which contains a literal snowman character (). I can't find a way to do so.

To Reproduce

https://semgrep.dev/s/craigds:unicode-snowman

Expected behavior
I expected to just be able to use the string itself

semgrep --lang=python --pattern '"Test ☃"'

It didn't match. Neither did any of the other things I tried, e.g.:

  • "Test \x{FE0F}"
  • "Test \u2603"

What is the priority of the bug to you?

  • P0: blocking your adoption of Semgrep or workflow
  • P1: important to fix or quite annoying
  • P2: regular bug that should get fixed

Environment

  • semgrep 0.70.0 via homebrew on MacOS 11.6
  • Doesn't work on semgrep.dev either though, see link above
@aryx
Copy link
Collaborator

aryx commented Nov 29, 2021

might be a unicode issue.

@aryx aryx changed the title Can't match snowman literals in strings Can't match unicode snowman literals in strings Nov 29, 2021
@aryx aryx changed the title Can't match unicode snowman literals in strings Python: Can't match unicode snowman literals in strings Nov 29, 2021
@aryx aryx added the user:external requested by someone outside of r2c label Nov 30, 2021
@aryx
Copy link
Collaborator

aryx commented Dec 7, 2021

@mjambon can you have a look at it? You're the unicode expert ...

@mjambon
Copy link
Member

mjambon commented Dec 8, 2021

@pad here's what I got:

$ semgrep-core -lang python -e '"snowman ☃"' <(echo '"snowman ☃"') -fast
/tmp/semgrep-core-cf24d0-63:1
 "snowman ☃"
$ semgrep-core -lang python -co <(echo '"snowman ☃"')
/tmp/semgrep-core-da6d22-63:1 with rule -
 "snowman ☃"
$ semgrep-core -lang python -config snowman.yml <(echo '"snowman ☃"') -fast

with the following rule file snowman.yml:

rules:
- id: '-'
  pattern: '"snowman ☃"'
  message: '"snowman ☃"'
  languages:
  - python
  severity: ERROR

The problem is when using a rule file, -fast, and some non-ascii input.

@mjambon
Copy link
Member

mjambon commented Dec 9, 2021

For background, see #2111. The hack that was implemented to work around bad locations replaces each non-ascii byte by a Z, resulting in false positives like the following:

$ semgrep-core -lang python -e '"😀"' <(echo '"🚀"')
/tmp/semgrep-core-124947-63:1
 "🚀"

This match happens because both 😀 and 🚀 and parsed as ZZZZ due to our unicode hack. However it breaks our -fast ("filter irrelevant rules") optimization which compares the parsed pattern (containing ZZZZ or ZZZ) against the raw source file still containing the original utf8 🚀 or ☃. It sees that the string ZZZZ doesn't exist in the source file and decides to skip it.

@mjambon
Copy link
Member

mjambon commented Dec 9, 2021

A proper fix would involve eliminating the hack that replaces non-ascii bytes by Zs, and then should do whatever is necessary to report proper locations. This would be a bit of work and needs to be done right.

Alternatively, we could extend the Unicode hack to work with the -fast optimization. This could make the optimization slower, in addition to making the code more complicated.

@emjin
Copy link
Collaborator

emjin commented Dec 9, 2021

imo we should hack a solution for -fast. Can we just exclude Z* from string filtering?

@mjambon
Copy link
Member

mjambon commented Dec 10, 2021

@emjin great idea. I was worried about the cost of editing each target but editing the pattern should be fine.

mjambon added a commit that referenced this issue Dec 10, 2021
mjambon added a commit that referenced this issue Dec 11, 2021
* Work around non-ascii byte substitution which was breaking the -fast
(filter irrelevant rules) optimization.
Fixes #4336

* Split big test "full rule" into one test case per file pair

* Explain expectations for unicode matching

* Update changelog

* typo

Co-authored-by: Emma Jin <emjin@users.noreply.github.com>

Co-authored-by: Emma Jin <emjin@users.noreply.github.com>
@mjambon
Copy link
Member

mjambon commented Dec 11, 2021

@craigds we just merged a "fix" for this, which is a hack on top of another hack. It will be in the next semgrep release. At some point, we'll need to handle Unicode correctly. Meanwhile, expect some false positives. See #4415.

mjambon added a commit that referenced this issue Dec 14, 2021
* Work around non-ascii byte substitution which was breaking the -fast
(filter irrelevant rules) optimization.
Fixes #4336

* Split big test "full rule" into one test case per file pair

* Explain expectations for unicode matching

* Update changelog

* Add tests for unicode hack

* Use correct version of pfff
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lang:python priority:low user:external requested by someone outside of r2c
Development

Successfully merging a pull request may close this issue.

5 participants