Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import inconsistencies between languages #975

Closed
mschwager opened this issue Jun 11, 2020 · 21 comments
Closed

Import inconsistencies between languages #975

mschwager opened this issue Jun 11, 2020 · 21 comments

Comments

@mschwager
Copy link
Contributor

$ semgrep --version
0.10.1

Consider the following Python code:

$ cat test.py 
import os.path.join

When using a pattern looking for imports we get the following results:

$ semgrep --json --lang python --pattern 'import $X' test.py | python -mjson.tool
{
    "errors": [],
    "results": [
        {
            "check_id": "-",
            "end": {
                "col": 20,
                "line": 1
            },
            "extra": {
                "lines": "import os.path.join",
                "message": "import os",
                "metadata": {},
                "metavars": {
                    "$X": {
                        "abstract_content": "os",
                        "end": {
                            "col": 10,
                            "line": 1,
                            "offset": 9
                        },
                        "start": {
                            "col": 8,
                            "line": 1,
                            "offset": 7
                        },
                        "unique_id": {
                            "md5sum": "aa9320c62d85df5585dec78521910201",
                            "type": "AST"
                        }
                    }
                },
                "severity": "ERROR"
            },
            "path": "test.py",
            "start": {
                "col": 1,
                "line": 1
            }
        }
    ]
}

Here the $X metavariable content is os. This doesn't feel correct looking at the code, I'd expect it to be os.path.join. An argument could be made for this behavior, but it doesn't even seem consistent across languages. For example, consider the following Java code:

$ cat test.java 
import deep.imp.statement;

Running a similar import check gives the following results:

$ semgrep --json --lang java --pattern 'import $X;' test.java | python -mjson.tool
{
    "errors": [],
    "results": [
        {
            "check_id": "-",
            "end": {
                "col": 26,
                "line": 1
            },
            "extra": {
                "lines": "import deep.imp.statement;",
                "message": "import statement;",
                "metadata": {},
                "metavars": {
                    "$X": {
                        "abstract_content": "statement",
                        "end": {
                            "col": 26,
                            "line": 1,
                            "offset": 25
                        },
                        "start": {
                            "col": 17,
                            "line": 1,
                            "offset": 16
                        },
                        "unique_id": {
                            "md5sum": "876b2cf6cf7324525b2283402c5df855",
                            "type": "AST"
                        }
                    }
                },
                "severity": "ERROR"
            },
            "path": "test.java",
            "start": {
                "col": 1,
                "line": 1
            }
        }
    ]
}

Here the $X metavariable is statement. The last module instead of the first that Python uses. Personally I think the full module path should be used in both cases, but whatever we choose it should at least be consistent.

This feels similar to #806.

@aryx
Copy link
Collaborator

aryx commented Jun 12, 2020

Yep, in both case they should really match the entire path.

@mschwager
Copy link
Contributor Author

I think it would be useful to apply consistent behavior to all supported languages. It seems we have a few other open issues for this, I'll include them here so we can track in one place:

@dlukeomalley
Copy link
Member

@DrewDennison @aryx Thoughts on where/when to prioritize this work?

@clintgibler
Copy link
Contributor

For what it's worth, I've had this bite me a few times, in Java and Python.

It would be nice if import $X matched the full path regardless of syntax, though in some cases that could be confusing for the new user (e.g. from foo import bar).

If it's not too hard to support matching the full path of imports, I would be happy 🙏

@aryx
Copy link
Collaborator

aryx commented Aug 26, 2020

Do you have a semgrep.dev example of what you would like? Just so I can use that later as a set of tests.

@clintgibler
Copy link
Contributor

Hm, to be honest I'm not totally sure what the behavior semgrep should have is.

There are 2 main contenders in my mind:

  1. import $X matches entire import (foo.bar.baz in import foo.bar.baz and from foo.bar import baz)
  2. Match syntax more directly: import $X matches import foo.bar.baz but not from foo.bar import baz because there’s no from in the pattern

I think either are OK, as long as we’re consistent across languages.

  1. might have slightly surprising behavior the first time you see it, but it’s nice in that it tries to make things easier for you behind the scenes, like when Semgrep handles import aliasing (e.g. import exec as safe_function; safe_function(foo))

  2. is nice in that we’re aligning directly with syntax and might initially seem more intuitive to users, but then we’re putting the onus on the user of having to remember the various semantically equivalent ways things can be expressed.

So maybe this comes does to some meta questions about Semgrep’s philosophy:

Should Semgrep patterns align as closely as possible with the original syntax? How much should Semgrep try to “automagically” handle for users?

Import aliasing and matching function annotations or keyword dict args regardless of order are great examples of when Semgrep abstracts things for users in a way that’s intuitive (does the expected thing) and makes users’ lives easier.

I could see this case going either way

@clintgibler
Copy link
Contributor

What should import $X match for the following 2 lines?

import one.two
from foo.bar import baz

@ievans and @colleend vote that import $X should match the full import path for both cases, $X matching one.two and foo.bar.baz respectively.

@mschwager
Copy link
Contributor Author

mschwager commented Aug 26, 2020

I think it's worth going over all the different import styles in Python. Python is fairly representative here, so we can extrapolate to other languages as well. To my knowledge, here are the different ways you can import something in Python:

import foo.bar.baz
import foo.bar as baz

from foo.bar import baz
from foo.bar import baz as qux
from foo import bar, baz
from foo import bar as baz, qux as quine

# We can avoid wildcard imports for now
from foo import *

# Let's also avoid all relative imports for now
from . import foo

IMO, we should always return the FQIN (fully-qualified import name) as the metavariable abstract_content. We can define the FQIN as the full import path discarding any aliasing (using as). Using import $X with that logic against the above list would produce:

foo.bar.baz
foo.bar

foo.bar.baz
foo.bar.baz
foo.bar and foo.baz?
foo.bar and foo.qux?

The last example is another shortcoming of the automagic import detection initially highlighted in #806. We can put that aside for now. Here are the reasons I think always returning the FQIN is superior:

  • Users specify the FQIN. If you're looking for os.system you don't care that system was aliased to something else.
  • When performing post-processing on abstract_content you'll want the FQIN. If I'm looking for os.system and get back something else that makes my post-processing logic very difficult.
  • You can always slice the FQIN, you can't easily add more context. E.g. if we return foo from from foo import bar we can't easily add the bar context, but we can grab foo from foo.bar.

IMO it's fine to use the aliased import name when showing results on the CLI. Users will naturally be looking at code side-by-side with CLI Semgrep results. Using the aliased name makes matching those easier. However, users will typically be feeding JSON results into some post-processing functionality. It's unlikely that post-processing functionality is referencing back to the code, so returning the FQIN makes sense in JSON output.

@clintgibler
Copy link
Contributor

@mschwager Thanks so much for the details and examples, this comment is 💯

I agree with your reasoning, I think having the metavariable return the fully-qualified import name makes the most sense.

In the last two examples you provided:

from foo import bar, baz
from foo import bar as baz, qux as quine

To me, I think the expected behavior should be what you've proposed, 2 matches for each line:

foo.bar and foo.baz
foo.bar and foo.qux

@aryx aryx self-assigned this Aug 28, 2020
@aryx
Copy link
Collaborator

aryx commented Aug 28, 2020

Ok I'll have a try at it next week.

@dlukeomalley dlukeomalley added the alpha Relates to an experimental feature label Oct 29, 2020
aryx added a commit that referenced this issue Nov 19, 2020
Fixes #1771

This should also help
#975

Test plan:

$ semgrep -f /tmp/test2.yml tests/python/import_metavar_fullpath.py
running 1 rules...
tests/python/import_metavar_fullpath.py
severity:error rule:tmp.import-auth: spooky import

12:import a.auth
14:import b.auth
16:import a.b.auth
ran 1 rules on 1 files: 3 findings
aryx added a commit that referenced this issue Nov 19, 2020
#2094)

Fixes #1771

This should also help
#975

Test plan:

$ semgrep -f /tmp/test2.yml tests/python/import_metavar_fullpath.py
running 1 rules...
tests/python/import_metavar_fullpath.py
severity:error rule:tmp.import-auth: spooky import

12:import a.auth
14:import b.auth
16:import a.b.auth
ran 1 rules on 1 files: 3 findings
@aryx
Copy link
Collaborator

aryx commented Nov 19, 2020

Here are results on develop with the latest fix:
https://semgrep.dev/s/D8Do/?version=develop

Note that $X contains the right thing internally (if you look at the value binded to $X internally), but in the message we output sometimes the wrong thing because of the new way we print the content of metavariable. We used to rely on abstract_content, but there was some problems with missing tokens, but now @brendongo is using the range of the mached code, and here if you do from foo.bar import baz, then X will be binded to foo.bar.baz and the range will be from foo to baz, which includes unfortunately the 'import' token ...

@aryx
Copy link
Collaborator

aryx commented Nov 19, 2020

Still, there's progress, we now bind the fully qualified name in Python.

@aryx
Copy link
Collaborator

aryx commented Feb 10, 2021

And now we bind the fully qualified name in Java

@mschwager
Copy link
Contributor Author

w00t, slowly but surely!

@stale
Copy link

stale bot commented May 12, 2021

This issue is being marked stale because there hasn't been any activity in 30 days. Please leave a comment if you think this issue is still relevant and should be prioritized, otherwise it will be automatically closed in 7 days (you can always reopen it later).

@stale stale bot added the stale needs prioritization; label applied by stalebot label May 12, 2021
@aryx
Copy link
Collaborator

aryx commented May 12, 2021

Wave

@stale stale bot removed the stale needs prioritization; label applied by stalebot label May 12, 2021
@stale
Copy link

stale bot commented Aug 10, 2021

This issue is being marked stale because there hasn't been any activity in 30 days. Please leave a comment if you think this issue is still relevant and should be prioritized, otherwise it will be automatically closed in 7 days (you can always reopen it later).

@stale stale bot added the stale needs prioritization; label applied by stalebot label Aug 10, 2021
@colleend colleend removed the stale needs prioritization; label applied by stalebot label Aug 10, 2021
@stale
Copy link

stale bot commented Nov 9, 2021

This issue is being marked stale because there hasn't been any activity in 30 days. Please leave a comment if you think this issue is still relevant and should be prioritized, otherwise it will be automatically closed in 7 days (you can always reopen it later).

@stale stale bot added the stale needs prioritization; label applied by stalebot label Nov 9, 2021
@colleend colleend removed the stale needs prioritization; label applied by stalebot label Nov 9, 2021
@stale
Copy link

stale bot commented Jan 20, 2022

This issue is being marked stale because there hasn't been any activity in 14 days and either it wasn't prioritized or its priority is high. Please apply the appropriate priority:* label before removing the stale label.

@stale stale bot added the stale needs prioritization; label applied by stalebot label Jan 20, 2022
@colleend
Copy link
Contributor

wave

@stale stale bot removed the stale needs prioritization; label applied by stalebot label Jan 20, 2022
@ievans ievans added the planned-project Do not make stale label Jan 28, 2022
@aryx aryx added priority:low and removed planned-project Do not make stale labels Sep 17, 2022
@ievans
Copy link
Member

ievans commented Oct 10, 2022

Looks like this is fixed actually -- at least all the issues in the original comment! @aryx I'm going to close but feel free to re-open if there's another thing we're waiting for

@ievans ievans closed this as completed Oct 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

6 participants