This repository has been archived by the owner on Nov 26, 2023. It is now read-only.

resolve *args, **kwargs #8

Closed
rotemiman opened this issue Jul 26, 2021 · 5 comments

Comments

@rotemiman

Does PyCG resolve higher-order functions passed via *args or **kwargs? From the snippets, it seems to work amazingly well for normal dicts and lists, so I'm assuming it is possible, but I can't find a snippet for that.
For clarity, I mean usage like:

def foo(f):
    f()

def bar():
    pass

foo(**{"f": bar})

To get a call graph like: main -> foo -> bar

@vitsalis
Owner

vitsalis commented Jul 27, 2021

Currently, this is one of the TODOs. I will just write down some thoughts about an action plan on how this can be implemented.

On the function definition, if it has a kwargs or args argument we need to implement a corresponding dictionary/list that will contain those definitions. Since we can't know which items this dictionary/list will contain at function definition, we just create the dictionary object, and its items will be filled in on function call.

In the case of a function call with the **kwargs parameter, we just iterate the items of the dictionary that is passed as a parameter, and update the object that corresponds to the **kwargs that was created on function definition. The internal functionality of pycg will take care of the rest I believe.
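A toy model of that plan may make it more concrete. This is only a sketch and not PyCG's actual internals: the class `ToyAnalysis` and its methods `define` and `call_with_kwargs` are hypothetical names invented for illustration, and functions are represented by plain strings.

```python
from collections import defaultdict


class ToyAnalysis:
    """Hypothetical stand-in for the analysis state; not PyCG's real data model."""

    def __init__(self):
        self.bindings = {}        # function -> {parameter name -> possible targets}
        self.invoked_params = {}  # function -> parameters that its body calls
        self.edges = defaultdict(set)  # caller -> callees

    def define(self, func, invoked_params=()):
        # At definition time the contents are unknown, so only create the container.
        self.bindings[func] = defaultdict(set)
        self.invoked_params[func] = tuple(invoked_params)

    def call_with_kwargs(self, caller, callee, kwargs):
        self.edges[caller].add(callee)
        # Iterate the dictionary passed with ** and fill in the callee's container.
        for name, target in kwargs.items():
            self.bindings[callee][name].add(target)
        # Every parameter the callee invokes now resolves to concrete targets.
        for name in self.invoked_params[callee]:
            self.edges[callee] |= self.bindings[callee][name]


analysis = ToyAnalysis()
analysis.define("foo", invoked_params=["f"])            # def foo(f): f()
analysis.define("bar")                                  # def bar(): pass
analysis.call_with_kwargs("main", "foo", {"f": "bar"})  # foo(**{"f": bar})

call_graph = {caller: sorted(callees) for caller, callees in analysis.edges.items()}
print(call_graph)
```

The resulting edges are main -> foo and foo -> bar, which matches the call graph requested in the original question.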

So, a TODO list would be:

  • Create micro-benchmark items that correspond to kwargs and args.
  • Implement args functionality as a list.
  • Implement kwargs functionality as a dictionary.
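As a starting point for the first item, a candidate micro-benchmark input could look like the sketch below. The function names are made up for illustration, and the `expected` dictionary assumes the module is called `main` and that edges are written as fully qualified names; the exact file layout and naming the benchmark suite expects should be checked against the existing test cases.

```python
# Candidate micro-benchmark input covering both *args and **kwargs.
def bar():
    pass


def baz():
    pass


def call_all(*args):
    # Every positional argument is treated as a callable and invoked.
    for f in args:
        f()


def call_named(**kwargs):
    # The callable is looked up by keyword name before being invoked.
    kwargs["f"]()


call_all(*[bar, baz])
call_named(**{"f": bar})

# Edges a complete analysis would be expected to report
# (assumption: module name "main", fully qualified function names).
expected = {
    "main": ["main.call_all", "main.call_named"],
    "main.call_all": ["main.bar", "main.baz"],
    "main.call_named": ["main.bar"],
    "main.bar": [],
    "main.baz": [],
}
```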

I will be happy to provide more detail if anyone wants to assign this to themselves.

@rotemiman
Author

Sounds awesome!
I'm considering using PyCG in one of our projects, and I can see us addressing this issue (and perhaps others).

Recently, I cloned the repo, ran the benchmarks, and saw that a lot of them fail, many of them due to completeness rather than soundness problems. Even though I'm not worried about certain complex cases, could you provide a more definitive list of limitations?

In addition, we may want to add a more programmatic, configurable API to PyCG, so that we can call it from other code. Would that be something you're interested in?

@vitsalis
Owner

vitsalis commented Jul 29, 2021

Could you provide some examples of the tests that fail due to completeness issues (category-test)? Some test cases are new and experimental (especially those related to external calls), but on my local machine (macOS) most tests fail due to soundness issues (i.e. false negatives).

Regarding an API for PyCG, sure I would be interested in that. What is the main use case that you have in mind?

As far as concrete limitations:

  • External Calls: PyCG is pretty good at identifying calls that are related to internal calls within a package but has trouble identifying external calls due to not having their source code available. This problem can mainly be solved by heuristics that identify the namespaces of external entities (which will not lead to maximal recall and might even harm precision) or by implementing an extension of PyCG that can handle the analysis of many packages at once. The main challenge for the latter is identifying the exact location of an external module that is imported, but I believe we can come up with an efficient design.
  • Built-in Methods: Currently there is no support for the effects of built-in methods. So, for example, if there's a call to list.append, PyCG won't identify the effects of that call and won't store the newly identified element. This problem can be solved by modeling built-in functions and their effects (e.g. whenever we see an append call, we internally store the new element). As another example, consider the call hash(obj), which leads to a call of that object's __hash__() method. There are many such calls in Python's standard library, but I believe we are interested in only a subset of those.
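The two built-in examples above can be seen in a few lines of ordinary Python. This is only an illustration of the runtime behaviour that an analysis would have to model; the names `greet`, `Key`, and `handlers` are invented for the example.

```python
calls = []  # runtime record of which functions actually ran


def greet():
    calls.append("greet")


class Key:
    def __hash__(self):
        calls.append("Key.__hash__")
        return 42


handlers = []
handlers.append(greet)  # the analysis must model append to know the list holds greet
handlers[0]()           # the edge to greet is only visible if append was modeled

hash(Key())             # the built-in hash() dispatches to Key.__hash__

print(calls)
```

At runtime both `greet` and `Key.__hash__` are reached, yet neither appears as a direct call in the source, which is why these edges are missed unless the built-ins are modeled.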

These are the most important issues that come to mind right now. There are some other minor limitations that can be found by executing the test cases. Also, if you find any other limitations or would like to test a particular case that is not included in the micro-benchmark, it would be really cool if you created a pull request with those new test cases!

@rotemiman
Author

I am mostly concerned about false negatives (i.e. missing information). Below is a screenshot of which tests fail for me.

Essentially, what we're trying to do here is trace call paths to specific external methods. As we do not have control over the analyzed code, we cannot assume it is typed. We'd prefer not to work with files more than necessary, and remain in Python-land.

To accomplish this, we must resolve the problem of the external call. We are only interested in a closed set of libraries, so we could simply download all the libraries, or a typeshed version of them, ahead of analysis. However, from what I've seen, PyCG not only does not leverage type annotations to enhance results, but it also ignores any clauses with type annotations. As an example, if you changed test_assignments/chained on line 7 to something like b: Callable = func1, the test will ignore the assignment and won't know anything about b even though it ran successfully without the type annotation.

To wrap things up, how difficult would it be to:

  1. Provide support for type annotations, including .pyi stubs, and leverage that to improve results
  2. Identify the call paths to a closed set of libraries (and methods of external classes).

My locally failing tests:
[Image: Screen Shot 2021-07-29 at 20 35 32]


By the way, according to my understanding (and as described at https://steemit.com/software/@cpuu/the-difference-between-soundness-and-completeness), having false negatives corresponds to completeness problems. Am I right?

@vitsalis
Owner

vitsalis commented Aug 6, 2021

Regarding completeness and soundness: in general usage, completeness relates to false negatives, but as far as I understand, in the program analysis world it relates to false positives. I'll use the term false negatives to avoid any confusion.

Regarding type annotations, PyCG does not need any information about the types of variables, since it infers any potential types during its analysis. In this case, I'm not confident that the analysis of .pyi stubs will provide any additional benefit. However, the fact that PyCG ignores elements with a type annotation is probably due to the AST Visitor. Specifically, we need to define a method for each kind of node that is visited. For example, for function definitions, we must define the method visit_FunctionDef. The AST Visitor pattern probably requires a different method for a typed assignment (in Python's ast module an annotated assignment is an AnnAssign node, handled by visit_AnnAssign, rather than an Assign node). Are there any other cases with type definitions that PyCG ignores? It would be very useful to have a list of those and find the relevant methods (I could not find any documentation of the available visitor methods, but there are some hacky ways of finding them).
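The visitor behaviour can be reproduced with the standard `ast` module alone. The sketch below does not use PyCG's own visitor classes; `PlainAssignOnly` and `WithAnnAssign` are hypothetical names, and the source string mirrors the `b: Callable = func1` example from the earlier comment.

```python
import ast

SOURCE = """
def func1():
    pass

a = func1
b: Callable = func1
"""


class PlainAssignOnly(ast.NodeVisitor):
    """Records assignment targets, but only handles un-annotated assignments."""

    def __init__(self):
        self.targets = []

    def visit_Assign(self, node):
        self.targets.extend(t.id for t in node.targets if isinstance(t, ast.Name))
        self.generic_visit(node)


class WithAnnAssign(PlainAssignOnly):
    """Additionally handles annotated assignments such as `b: Callable = func1`."""

    def visit_AnnAssign(self, node):
        # AnnAssign has a single `target`; `value` is None for a bare `x: int`.
        if node.value is not None and isinstance(node.target, ast.Name):
            self.targets.append(node.target.id)
        self.generic_visit(node)


tree = ast.parse(SOURCE)
node_types = [type(node).__name__ for node in tree.body]

plain = PlainAssignOnly()
plain.visit(tree)

annotated = WithAnnAssign()
annotated.visit(tree)

print(node_types)         # node classes of the three top-level statements
print(plain.targets)      # misses b
print(annotated.targets)  # sees both a and b
```

Because `ast.NodeVisitor` dispatches on the node class name, a visitor that defines only `visit_Assign` silently skips the annotated statement, which is consistent with the behaviour described in the report.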

Regarding external calls to a closed set of libraries, the most straightforward way of accomplishing this is by providing the source code for all packages to PyCG. Currently, PyCG is designed to work with one specific package, but I do not see any issues that would arise from extending this functionality to work for multiple packages.

I see two main engineering tasks:

  1. Retrieving the correct namespace of a module under analysis based on the current package under analysis.
  2. Making the import mechanism identify the correct namespace of an imported module.

For the first one, we retrieve the namespace of a module using the operation to_mod_name(os.path.relpath(module_path, package_path)). If we had the correct package path for the module we are currently analyzing, this would lead to the correct namespace. The second one is a bit more tricky. One needs to identify whether the imported module resides in the current package or in an external package. This can be done through heuristics, but it needs some experimentation before settling on a concrete action plan.
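Both tasks can be sketched with the standard library. The helpers below are hypothetical stand-ins, not PyCG functions: `path_to_module_name` only approximates what a path-to-namespace conversion might do, and `resides_in_package` is one possible heuristic for the second task. Note that `os.path.relpath` takes the target path first and the start directory second.

```python
import os


def path_to_module_name(module_path, package_path):
    """Hypothetical helper: dotted module name of a file relative to a package root."""
    rel = os.path.relpath(module_path, package_path)
    stem, _ext = os.path.splitext(rel)
    return ".".join(stem.split(os.sep))


def resides_in_package(import_name, package_path):
    """Hypothetical heuristic: an import is internal if a matching file or
    sub-package exists under the package root; otherwise treat it as external."""
    base = os.path.join(package_path, *import_name.split("."))
    return os.path.isfile(base + ".py") or os.path.isfile(
        os.path.join(base, "__init__.py")
    )


name = path_to_module_name(
    os.path.join(os.sep, "repo", "pkg", "sub", "mod.py"),
    os.path.join(os.sep, "repo", "pkg"),
)
print(name)
```

When several packages are analyzed at once, the same check could be repeated against each known package root to decide which package an imported module belongs to; that extension is an assumption about a possible design, not something the project implements.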
