Python: Can't match unicode snowman literals in strings #4336

craigds · 2021-11-26T01:26:01Z

Describe the bug

I'm trying to match a string, in python, which contains a literal snowman character (☃). I can't find a way to do so.

To Reproduce

https://semgrep.dev/s/craigds:unicode-snowman

Expected behavior

I expected to be able to use the string itself as the pattern:

semgrep --lang=python --pattern '"Test ☃"'

It didn't match. Neither did any of the other things I tried, e.g.:

"Test \x{FE0F}"
"Test \u2603"
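For reference, a minimal sketch in plain Python (standard library only, nothing Semgrep-specific) of how these spellings relate. In Python source, `"\u2603"` is the same string as a literal `"☃"`, and the snowman takes three bytes in UTF-8. U+FE0F, the code point used in the first attempt, is a variation selector rather than the snowman itself.

```python
import unicodedata

snowman = "\u2603"

# The escape and the literal character denote the same one-character string.
same_string = snowman == "☃"

# Official Unicode names of the two code points mentioned above.
snowman_name = unicodedata.name(snowman)
fe0f_name = unicodedata.name("\ufe0f")

# The snowman is encoded as three bytes in UTF-8.
snowman_utf8 = snowman.encode("utf-8")

print(same_string, snowman_name, fe0f_name, snowman_utf8)
```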

What is the priority of the bug to you?

P0: blocking your adoption of Semgrep or workflow
P1: important to fix or quite annoying
P2: regular bug that should get fixed

Environment

semgrep 0.70.0 via Homebrew on macOS 11.6
Doesn't work on semgrep.dev either though, see link above


aryx · 2021-11-29T11:56:24Z

might be a unicode issue.

aryx · 2021-12-07T09:35:20Z

@mjambon can you have a look at it? You're the unicode expert ...

mjambon · 2021-12-08T02:44:56Z

@pad here's what I got:

$ semgrep-core -lang python -e '"snowman ☃"' <(echo '"snowman ☃"') -fast
/tmp/semgrep-core-cf24d0-63:1
 "snowman ☃"
$ semgrep-core -lang python -co <(echo '"snowman ☃"')
/tmp/semgrep-core-da6d22-63:1 with rule -
 "snowman ☃"
$ semgrep-core -lang python -config snowman.yml <(echo '"snowman ☃"') -fast

with the following rule file snowman.yml:

rules:
- id: '-'
  pattern: '"snowman ☃"'
  message: '"snowman ☃"'
  languages:
  - python
  severity: ERROR

The third command prints nothing: the match is lost when a rule file, -fast, and non-ASCII input are combined.

mjambon · 2021-12-09T06:52:19Z

For background, see #2111. The hack that was implemented to work around bad locations replaces each non-ascii byte by a Z, resulting in false positives like the following:

$ semgrep-core -lang python -e '"😀"' <(echo '"🚀"')
/tmp/semgrep-core-124947-63:1
 "🚀"

This match happens because both 😀 and 🚀 are parsed as ZZZZ due to our unicode hack. However, the same substitution breaks our -fast ("filter irrelevant rules") optimization, which compares the parsed pattern (containing ZZZZ or ZZZ) against the raw source file, which still contains the original UTF-8 🚀 or ☃. The optimization sees that the string ZZZZ doesn't exist in the source file and decides to skip the rule.
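The interaction described above can be sketched in a few lines of Python. This is an illustration only, not the semgrep-core code: `mask_non_ascii` and `prefilter_keeps_rule` are hypothetical names, and the prefilter is reduced to a plain substring check on the assumption that this is roughly what the "filter irrelevant rules" step does with string literals.

```python
def mask_non_ascii(text: str) -> str:
    """Replace every non-ASCII UTF-8 byte with 'Z' (stand-in for the parser hack)."""
    return "".join(chr(b) if b < 0x80 else "Z" for b in text.encode("utf-8"))


def prefilter_keeps_rule(pattern_literal: str, raw_source: str) -> bool:
    """Hypothetical prefilter: keep the rule only if the literal occurs in the raw file."""
    return pattern_literal in raw_source


# Both emoji are four bytes in UTF-8, so they collapse to the same masked string.
masked_grin = mask_non_ascii("😀")
masked_rocket = mask_non_ascii("🚀")

# The pattern is parsed (and masked), but the target file is compared unmasked.
masked_pattern = mask_non_ascii("snowman ☃")
raw_source = '"snowman ☃"\n'
kept = prefilter_keeps_rule(masked_pattern, raw_source)

print(masked_grin, masked_rocket, masked_pattern, kept)
```

The masked pattern `snowman ZZZ` never appears in the unmasked file, so the rule is filtered out before matching is even attempted.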

mjambon · 2021-12-09T07:07:19Z

A proper fix would involve eliminating the hack that replaces non-ascii bytes by Zs, and then doing whatever is necessary to report proper locations. This would be a bit of work and needs to be done right.

Alternatively, we could extend the Unicode hack to work with the -fast optimization. This could make the optimization slower, in addition to making the code more complicated.

emjin · 2021-12-09T19:11:16Z

imo we should hack a solution for -fast. Can we just exclude Z* from string filtering?

mjambon · 2021-12-10T00:30:36Z

@emjin great idea. I was worried about the cost of editing each target but editing the pattern should be fine.
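One way to read the suggestion, sketched in Python under stated assumptions: drop runs of Z from the pattern literal before the prefilter's substring check, so only the remaining fragments have to appear in the raw file. The names `filter_fragments` and `prefilter_keeps_rule` are hypothetical, and this is not the code merged for this issue.

```python
import re
from typing import List


def filter_fragments(masked_literal: str) -> List[str]:
    """Split a masked pattern literal on runs of 'Z' and keep the non-empty pieces."""
    return [frag for frag in re.split(r"Z+", masked_literal) if frag]


def prefilter_keeps_rule(masked_literal: str, raw_source: str) -> bool:
    """Keep the rule if every remaining fragment occurs in the raw source.

    With no fragments left (pattern was only Zs), the rule is always kept.
    """
    return all(frag in raw_source for frag in filter_fragments(masked_literal))


fragments = filter_fragments("snowman ZZZ")
keeps_snowman = prefilter_keeps_rule("snowman ZZZ", '"snowman ☃"')
keeps_emoji_only = prefilter_keeps_rule("ZZZZ", '"🚀"')
keeps_unrelated = prefilter_keeps_rule("snowman ZZZ", '"rocket 🚀"')

print(fragments, keeps_snowman, keeps_emoji_only, keeps_unrelated)
```

Only the pattern is edited, never the targets, which is the cost concern raised above. A genuine ASCII Z in a pattern is dropped as well, which makes the filter more permissive but cannot cause it to skip a rule that would have matched.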


* Work around non-ascii byte substitution which was breaking the -fast (filter irrelevant rules) optimization. Fixes #4336
* Split big test "full rule" into one test case per file pair
* Explain expectations for unicode matching
* Update changelog
* typo

Co-authored-by: Emma Jin <emjin@users.noreply.github.com>

mjambon · 2021-12-11T09:48:33Z

@craigds we just merged a "fix" for this, which is a hack on top of another hack. It will be in the next semgrep release. At some point, we'll need to handle Unicode correctly. Meanwhile, expect some false positives. See #4415.

* Work around non-ascii byte substitution which was breaking the -fast (filter irrelevant rules) optimization. Fixes #4336
* Split big test "full rule" into one test case per file pair
* Explain expectations for unicode matching
* Update changelog
* Add tests for unicode hack
* Use correct version of pfff

aryx added the lang:python label Nov 29, 2021

aryx changed the title ~~Can't match snowman literals in strings~~ Can't match unicode snowman literals in strings Nov 29, 2021

aryx changed the title ~~Can't match unicode snowman literals in strings~~ Python: Can't match unicode snowman literals in strings Nov 29, 2021

aryx added the user:external requested by someone outside of r2c label Nov 30, 2021

ievans added the priority:low label Dec 2, 2021

aryx assigned mjambon Dec 7, 2021

mjambon added a commit that referenced this issue Dec 10, 2021

Work around non-ascii byte substitution which was breaking the -fast (filter irrelevant rules) optimization. Fixes #4336 (717acc3)

mjambon mentioned this issue Dec 10, 2021

Add hack to allow matches on Unicode data with -fast #4415

Merged


mjambon closed this as completed in #4415 Dec 11, 2021
