Import inconsistencies between languages #975

mschwager · 2020-06-11T19:12:27Z

$ semgrep --version
0.10.1

Consider the following Python code:

$ cat test.py 
import os.path.join

When using a pattern looking for imports we get the following results:

$ semgrep --json --lang python --pattern 'import $X' test.py | python -mjson.tool
{
    "errors": [],
    "results": [
        {
            "check_id": "-",
            "end": {
                "col": 20,
                "line": 1
            },
            "extra": {
                "lines": "import os.path.join",
                "message": "import os",
                "metadata": {},
                "metavars": {
                    "$X": {
                        "abstract_content": "os",
                        "end": {
                            "col": 10,
                            "line": 1,
                            "offset": 9
                        },
                        "start": {
                            "col": 8,
                            "line": 1,
                            "offset": 7
                        },
                        "unique_id": {
                            "md5sum": "aa9320c62d85df5585dec78521910201",
                            "type": "AST"
                        }
                    }
                },
                "severity": "ERROR"
            },
            "path": "test.py",
            "start": {
                "col": 1,
                "line": 1
            }
        }
    ]
}

Here the $X metavariable content is os. This doesn't feel correct looking at the code, I'd expect it to be os.path.join. An argument could be made for this behavior, but it doesn't even seem consistent across languages. For example, consider the following Java code:

$ cat test.java 
import deep.imp.statement;

Running a similar import check gives the following results:

$ semgrep --json --lang java --pattern 'import $X;' test.java | python -mjson.tool
{
    "errors": [],
    "results": [
        {
            "check_id": "-",
            "end": {
                "col": 26,
                "line": 1
            },
            "extra": {
                "lines": "import deep.imp.statement;",
                "message": "import statement;",
                "metadata": {},
                "metavars": {
                    "$X": {
                        "abstract_content": "statement",
                        "end": {
                            "col": 26,
                            "line": 1,
                            "offset": 25
                        },
                        "start": {
                            "col": 17,
                            "line": 1,
                            "offset": 16
                        },
                        "unique_id": {
                            "md5sum": "876b2cf6cf7324525b2283402c5df855",
                            "type": "AST"
                        }
                    }
                },
                "severity": "ERROR"
            },
            "path": "test.java",
            "start": {
                "col": 1,
                "line": 1
            }
        }
    ]
}

Here the $X metavariable is statement. The last module instead of the first that Python uses. Personally I think the full module path should be used in both cases, but whatever we choose it should at least be consistent.

This feels similar to #806.

The text was updated successfully, but these errors were encountered:

aryx · 2020-06-12T14:38:28Z

Yep, in both case they should really match the entire path.

mschwager · 2020-06-22T16:22:49Z

I think it would be useful to apply consistent behavior to all supported languages. It seems we have a few other open issues for this, I'll include them here so we can track in one place:

dlukeomalley · 2020-06-22T17:11:36Z

@DrewDennison @aryx Thoughts on where/when to prioritize this work?

clintgibler · 2020-08-26T01:55:32Z

For what it's worth, I've had this bite me a few times, in Java and Python.

It would be nice if import $X matched the full path regardless of syntax, though in some cases that could be confusing for the new user (e.g. from foo import bar).

If it's not too hard to support matching the full path of imports, I would be happy 🙏

aryx · 2020-08-26T12:52:49Z

Do you have a semgrep.dev example of what you would like? Just so I can use that later as a set of tests.

clintgibler · 2020-08-26T14:42:24Z

Hm, to be honest I'm not totally sure what the behavior semgrep should have is.

There are 2 main contenders in my mind:

import $X matches entire import (foo.bar.baz in import foo.bar.baz and from foo.bar import baz)
Match syntax more directly: import $X matches import foo.bar.baz but not from foo.bar import baz because there’s no from in the pattern

I think either are OK, as long as we’re consistent across languages.

might have slightly surprising behavior the first time you see it, but it’s nice in that it tries to make things easier for you behind the scenes, like when Semgrep handles import aliasing (e.g. import exec as safe_function; safe_function(foo))
is nice in that we’re aligning directly with syntax and might initially seem more intuitive to users, but then we’re putting the onus on the user of having to remember the various semantically equivalent ways things can be expressed.

So maybe this comes does to some meta questions about Semgrep’s philosophy:

Should Semgrep patterns align as closely as possible with the original syntax? How much should Semgrep try to “automagically” handle for users?

Import aliasing and matching function annotations or keyword dict args regardless of order are great examples of when Semgrep abstracts things for users in a way that’s intuitive (does the expected thing) and makes users’ lives easier.

I could see this case going either way

clintgibler · 2020-08-26T16:40:06Z

What should import $X match for the following 2 lines?

import one.two
from foo.bar import baz

@ievans and @colleend vote that import $X should match the full import path for both cases, $X matching one.two and foo.bar.baz respectively.

mschwager · 2020-08-26T17:19:27Z

I think it's worth going over all the different import styles in Python. Python is fairly representative here, so we can extrapolate to other languages as well. To my knowledge, here are the different ways you can import something in Python:

import foo.bar.baz
import foo.bar as baz

from foo.bar import baz
from foo.bar import baz as qux
from foo import bar, baz
from foo import bar as baz, qux as quine

# We can avoid wildcard imports for now
from foo import *

# Let's also avoid all relative imports for now
from . import foo

IMO, we should always return the FQIN (fully-qualified import name) as the metavariable abstract_content. We can define the FQIN as the full import path discarding any aliasing (using as). Using import $X with that logic against the above list would produce:

foo.bar.baz
foo.bar

foo.bar.baz
foo.bar.baz
foo.bar and foo.baz?
foo.bar and foo.qux?

The last example is another shortcoming of the automagic import detection initially highlighted in #806. We can put that aside for now. Here are the reasons I think always returning the FQIN is superior:

Users specify the FQIN. If you're looking for os.system you don't care that system was aliased to something else.
When performing post-processing on abstract_content you'll want the FQIN. If I'm looking for os.system and get back something else that makes my post-processing logic very difficult.
You can always slice the FQIN, you can't easily add more context. E.g. if we return foo from from foo import bar we can't easily add the bar context, but we can grab foo from foo.bar.

IMO it's fine to use the aliased import name when showing results on the CLI. Users will naturally be looking at code side-by-side with CLI Semgrep results. Using the aliased name makes matching those easier. However, users will typically be feeding JSON results into some post-processing functionality. It's unlikely that post-processing functionality is referencing back to the code, so returning the FQIN makes sense in JSON output.

clintgibler · 2020-08-26T17:32:28Z

@mschwager Thanks so much for the details and examples, this comment is 💯

I agree with your reasoning, I think having the metavariable return the fully-qualified import name makes the most sense.

In the last two examples you provided:

from foo import bar, baz
from foo import bar as baz, qux as quine

To me, I think the expected behavior should be what you've proposed, 2 matches for each line:

foo.bar and foo.baz
foo.bar and foo.qux

aryx · 2020-08-28T14:10:12Z

Ok I'll have a try at it next week.

Fixes #1771 This should also help #975 Test plan: $ semgrep -f /tmp/test2.yml tests/python/import_metavar_fullpath.py running 1 rules... tests/python/import_metavar_fullpath.py severity:error rule:tmp.import-auth: spooky import 12:import a.auth 14:import b.auth 16:import a.b.auth ran 1 rules on 1 files: 3 findings

#2094) Fixes #1771 This should also help #975 Test plan: $ semgrep -f /tmp/test2.yml tests/python/import_metavar_fullpath.py running 1 rules... tests/python/import_metavar_fullpath.py severity:error rule:tmp.import-auth: spooky import 12:import a.auth 14:import b.auth 16:import a.b.auth ran 1 rules on 1 files: 3 findings

aryx · 2020-11-19T15:59:05Z

Here are results on develop with the latest fix:
https://semgrep.dev/s/D8Do/?version=develop

Note that $X contains the right thing internally (if you look at the value binded to $X internally), but in the message we output sometimes the wrong thing because of the new way we print the content of metavariable. We used to rely on abstract_content, but there was some problems with missing tokens, but now @brendongo is using the range of the mached code, and here if you do from foo.bar import baz, then X will be binded to foo.bar.baz and the range will be from foo to baz, which includes unfortunately the 'import' token ...

aryx · 2020-11-19T15:59:32Z

Still, there's progress, we now bind the fully qualified name in Python.

aryx · 2021-02-10T21:21:58Z

And now we bind the fully qualified name in Java

mschwager · 2021-02-10T21:33:22Z

w00t, slowly but surely!

stale · 2021-05-12T03:48:21Z

This issue is being marked stale because there hasn't been any activity in 30 days. Please leave a comment if you think this issue is still relevant and should be prioritized, otherwise it will be automatically closed in 7 days (you can always reopen it later).

aryx · 2021-05-12T05:25:20Z

Wave

stale · 2021-08-10T05:58:31Z

This issue is being marked stale because there hasn't been any activity in 30 days. Please leave a comment if you think this issue is still relevant and should be prioritized, otherwise it will be automatically closed in 7 days (you can always reopen it later).

stale · 2021-11-09T03:56:24Z

This issue is being marked stale because there hasn't been any activity in 30 days. Please leave a comment if you think this issue is still relevant and should be prioritized, otherwise it will be automatically closed in 7 days (you can always reopen it later).

stale · 2022-01-20T17:33:48Z

This issue is being marked stale because there hasn't been any activity in 14 days and either it wasn't prioritized or its priority is high. Please apply the appropriate priority:* label before removing the stale label.

colleend · 2022-01-20T19:42:33Z

wave

ievans · 2022-10-10T18:06:52Z

Looks like this is fixed actually -- at least all the issues in the original comment! @aryx I'm going to close but feel free to re-open if there's another thing we're waiting for

mschwager added lang:python lang:java labels Jun 11, 2020

mschwager added lang:c lang:golang lang:javascript labels Jun 22, 2020

aryx self-assigned this Aug 28, 2020

mschwager mentioned this issue Oct 2, 2020

import not matching what's expected #1771

Closed

dlukeomalley added the alpha Relates to an experimental feature label Oct 29, 2020

ievans mentioned this issue Nov 3, 2020

Document how to detect "import x.[some-regex]" #1954

Open

aryx mentioned this issue Nov 19, 2020

Correctly bind a metavariable in an import to the fully qualified name #2094

Merged

mschwager mentioned this issue Dec 15, 2020

TypeScript: add support CommonJS-style import and export #2234

Closed

clintgibler mentioned this issue Feb 5, 2021

import $VAR metavariable does not match entire import path #2502

Closed

aryx mentioned this issue Feb 10, 2021

[Java] import $LIB should binds $LIB to a full import path #2541

Merged

colleend mentioned this issue Apr 26, 2021

Python import alias #2992

Closed

stale bot added the stale needs prioritization; label applied by stalebot label May 12, 2021

stale bot removed the stale needs prioritization; label applied by stalebot label May 12, 2021

stale bot added the stale needs prioritization; label applied by stalebot label Aug 10, 2021

colleend removed the stale needs prioritization; label applied by stalebot label Aug 10, 2021

stale bot added the stale needs prioritization; label applied by stalebot label Nov 9, 2021

colleend removed the stale needs prioritization; label applied by stalebot label Nov 9, 2021

stale bot added the stale needs prioritization; label applied by stalebot label Jan 20, 2022

stale bot removed the stale needs prioritization; label applied by stalebot label Jan 20, 2022

ievans added the planned-project Do not make stale label Jan 28, 2022

aryx added the pattern:inconsistency label Jul 28, 2022

aryx added priority:low and removed planned-project Do not make stale labels Sep 17, 2022

ievans closed this as completed Oct 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Import inconsistencies between languages #975

Import inconsistencies between languages #975

mschwager commented Jun 11, 2020

aryx commented Jun 12, 2020

mschwager commented Jun 22, 2020

dlukeomalley commented Jun 22, 2020

clintgibler commented Aug 26, 2020

aryx commented Aug 26, 2020

clintgibler commented Aug 26, 2020

clintgibler commented Aug 26, 2020

mschwager commented Aug 26, 2020 •

edited

Loading

clintgibler commented Aug 26, 2020

aryx commented Aug 28, 2020

aryx commented Nov 19, 2020

aryx commented Nov 19, 2020

aryx commented Feb 10, 2021

mschwager commented Feb 10, 2021

stale bot commented May 12, 2021

aryx commented May 12, 2021

stale bot commented Aug 10, 2021

stale bot commented Nov 9, 2021

stale bot commented Jan 20, 2022

colleend commented Jan 20, 2022

ievans commented Oct 10, 2022

Import inconsistencies between languages #975

Import inconsistencies between languages #975

Comments

mschwager commented Jun 11, 2020

aryx commented Jun 12, 2020

mschwager commented Jun 22, 2020

dlukeomalley commented Jun 22, 2020

clintgibler commented Aug 26, 2020

aryx commented Aug 26, 2020

clintgibler commented Aug 26, 2020

clintgibler commented Aug 26, 2020

mschwager commented Aug 26, 2020 • edited Loading

clintgibler commented Aug 26, 2020

aryx commented Aug 28, 2020

aryx commented Nov 19, 2020

aryx commented Nov 19, 2020

aryx commented Feb 10, 2021

mschwager commented Feb 10, 2021

stale bot commented May 12, 2021

aryx commented May 12, 2021

stale bot commented Aug 10, 2021

stale bot commented Nov 9, 2021

stale bot commented Jan 20, 2022

colleend commented Jan 20, 2022

ievans commented Oct 10, 2022

mschwager commented Aug 26, 2020 •

edited

Loading