This repository has been archived by the owner on Nov 26, 2023. It is now read-only.

Add Support for Calls to External Class Instances #61

Merged
merged 2 commits into from
May 30, 2023

gdrosos
Collaborator

@gdrosos gdrosos commented May 30, 2023

Description of the Problem

Until now, for Python code of the form:

# test.py
import externalPackage
node = externalPackage.Class().instance_method()

PyCG would capture only the call to the external Class.

{"test": ["externalPackage.Class"], "externalPackage.Class": []}

This incomplete call graph does not capture the external call to the class instance method, which limits the analysis of code that uses external packages and their class instances.
An enriched call graph containing all the external functions and classes that are called would also significantly improve the later stitching process with externalPackage.

Root Cause

During the call graph processing phase, when PyCG visits a call within the AST, it aims to find the actual namespace of the calls performed. The namespace provides the complete path or location of the function or class within the program's module structure.

names = self.retrieve_call_names(node)

Note: The calls might be more than one in cases like the following:
package.Class().method().method()
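
As a rough illustration of why one expression yields several calls, the sketch below uses only the standard ast module (not PyCG) to list every ast.Call node nested in that chain. The helper name chained_calls is made up for this example, and ast.unparse assumes Python 3.9 or later.

```python
import ast


def chained_calls(source):
    """Return the source text of every ast.Call in `source`, outermost first."""
    tree = ast.parse(source)
    # ast.walk is breadth-first, so the outermost call is visited first.
    return [ast.unparse(n) for n in ast.walk(tree) if isinstance(n, ast.Call)]


calls = chained_calls("package.Class().method().method()")
for call in calls:
    print(call)
```

Each nested call is a separate node, which is why a single visit can produce more than one name to resolve.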

Then, for each identified call, if it is an external call, PyCG creates an edge.

for pointer in names:
    pointer_def = self.def_manager.get(pointer)
    if not pointer_def or not isinstance(pointer_def, Definition):
        continue
    if pointer_def.is_callable():
        if pointer_def.get_type() == utils.constants.EXT_DEF:
            ext_modname = pointer.split(".")[0]
            create_ext_edge(pointer, ext_modname)
            continue

Inside the retrieve_call_names method, _retrieve_attribute_names is called to find the call name when the callee is an attribute (a call of the form something.something):

elif isinstance(node.func, ast.Attribute):
    names = self._retrieve_attribute_names(node.func)

Within _retrieve_attribute_names, we call _retrieve_parent_names to identify the parent namespace of a specific attribute node.
For example, in the following attribute:

 Class().instance_method()

the parent namespace of the attribute is Class, while in this example:

x = Class()
x.instance_method()

the parent namespace of the x.instance_method() attribute is again Class

So, to find the parent namespace (and consequently its definition), we just decode the parent node:

def _retrieve_parent_names(self, node):
    if not isinstance(node, ast.Attribute):
        raise Exception("The node is not an attribute")
    decoded = self.decode_node(node.value)

The decode_node method is a fundamental method that tries to find the definition stored in the symbol table for a given AST node.
In our case, for the call externalPackage.Class().instance_method(), the parent node of .instance_method() that will be decoded is externalPackage.Class(), which is an instance of ast.Call.

When decoding call nodes, PyCG decodes the node performing the call and, based on the type of its definition, returns the appropriate namespace: for functions it returns the namespace of the return value, and for classes it returns the namespace of the class (since the result is a class instance). Until now, however, PyCG ignored calls performed by nodes with external definitions, because it could not resolve, for example, the return type of such calls.

elif isinstance(node, ast.Call):
    decoded = self.decode_node(node.func)
    return_defs = []
    for called_def in decoded:
        if not isinstance(called_def, Definition):
            continue
        return_ns = utils.constants.INVALID_NAME
        if called_def.get_type() == utils.constants.FUN_DEF:
            return_ns = utils.join_ns(
                called_def.get_ns(), utils.constants.RETURN_NAME
            )
        elif called_def.get_type() == utils.constants.CLS_DEF:
            return_ns = called_def.get_ns()
        defi = self.def_manager.get(return_ns)
        if defi:
            return_defs.append(defi)
    return return_defs

Proposed Change

To tackle the aforementioned issue, we handle external definitions performing calls in the same way as internal classes performing calls, i.e. we return the namespace of the external node performing the call. In this way, all calls to external functions or classes are stored in the call graph.
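
The rule can be modelled with a small toy, shown below. This is not the PyCG diff: ToyDefinition, the three kind constants, RETURN_NAME's value and the handle_external flag are all hypothetical stand-ins introduced only to contrast the old and new behaviour.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for definition kinds; values are illustrative only.
FUN_DEF, CLS_DEF, EXT_DEF = "function", "class", "external"
RETURN_NAME = "<return>"


@dataclass
class ToyDefinition:
    ns: str
    def_type: str


def return_namespace(called_def, handle_external=True):
    """Namespace standing for the result of calling `called_def`, or None."""
    if called_def.def_type == FUN_DEF:
        return f"{called_def.ns}.{RETURN_NAME}"
    if called_def.def_type == CLS_DEF:
        return called_def.ns
    # Assumed new behaviour: treat an external definition like a class.
    if handle_external and called_def.def_type == EXT_DEF:
        return called_def.ns
    return None


ext_cls = ToyDefinition("externalPackage.Class", EXT_DEF)
before = return_namespace(ext_cls, handle_external=False)
after = return_namespace(ext_cls)
print(before, after)
```

With the external case handled, a following attribute such as instance_method can be joined onto the returned namespace instead of being dropped.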

With this change, we store calls to external class instances in the call graph. There are two types of such calls.
The first is a call of the form node = externalPackage.Class().instance_method(), which will be stored as

{"test": ["externalPackage.Class", "externalPackage.Class.instance_method"], "externalPackage.Class": [], "externalPackage.Class.instance_method": []}

This is a sound representation, which will lead to sound stitching at a later stage.
The second is a call of the form node = externalPackage.method().instance_method(), where method is a function returning an instance of some class and instance_method is an instance method of that class.
This call will be represented in the call graph as follows:

{"test": ["externalPackage.method", "externalPackage.method.instance_method"], "externalPackage.method": [], "externalPackage.method.instance_method": []}

This representation is not sound, since externalPackage.method.instance_method is not the actual namespace of instance_method. However, PyCG cannot provide any additional insight into the returned class, so the best it can do is defer the resolution to the stitching process.

Note:

With the proposed changes, all types of external calls can be handled:
Test Case 1

import externalPackage
externalPackage.Class().instance_method()
{"test": ["externalPackage.Class.instance_method", "externalPackage.Class"], "externalPackage.Class": [], "externalPackage.Class.instance_method": []}

Test Case 2

import externalPackage
x = externalPackage.Class()
x.instance_method()
{"test": ["externalPackage.Class.instance_method", "externalPackage.Class"], "externalPackage.Class": [], "externalPackage.Class.instance_method": []}

Test Case 3

import externalPackage
x = externalPackage.Class().method().Class2().method2()
{"test": ["externalPackage.Class.method.Class2.method2", "externalPackage.Class.method.Class2", "externalPackage.Class", "externalPackage.Class.method"], "externalPackage.Class": [], "externalPackage.Class.method": [], "externalPackage.Class.method.Class2": [], "externalPackage.Class.method.Class2.method2": []}
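
The namespaces in Test Case 3 follow from appending each attribute name to the namespace of the call before it. The sketch below reproduces only that naming rule with the standard ast module; it does not run PyCG, and call_namespaces is a helper invented for this example.

```python
import ast


def call_namespaces(expr):
    """Dotted name of every call in `expr`, treating each call result
    as if it lived in the namespace of its callee."""

    def name_of(node):
        if isinstance(node, ast.Call):
            return name_of(node.func)
        if isinstance(node, ast.Attribute):
            return f"{name_of(node.value)}.{node.attr}"
        if isinstance(node, ast.Name):
            return node.id
        raise ValueError(f"unsupported node: {type(node).__name__}")

    tree = ast.parse(expr, mode="eval")
    return sorted({name_of(n) for n in ast.walk(tree) if isinstance(n, ast.Call)})


names = call_namespaces("externalPackage.Class().method().Class2().method2()")
# Same shape as the JSON above: one caller, and each external callee with no edges.
graph = {"test": names, **{name: [] for name in names}}
print(graph)
```

The resulting keys match those listed for Test Case 3, although the order of the callees under "test" may differ.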

@gdrosos gdrosos added the enhancement New feature or request label May 30, 2023
@gdrosos gdrosos requested a review from vitsalis May 30, 2023 15:13
@gdrosos gdrosos force-pushed the add-support-for-external-instance-calls branch from 7cec760 to 298c5a0 Compare May 30, 2023 15:25
Owner

@vitsalis vitsalis left a comment

Thanks for the explanation! While this indeed reduces soundness I think the extra output would be useful for a processor that knows PyCG's limitations and can lead to better stitching.

@vitsalis vitsalis merged commit bc4d41e into main May 30, 2023
@gdrosos gdrosos deleted the add-support-for-external-instance-calls branch May 31, 2023 10:37