Was wondering whether it is possible to get a quick breakdown of what proportion of function calls in a piece of Python code belongs to which library. Thought about using `ast`, with its ability to pick out function calls.

Ref: https://greentreesnakes.readthedocs.io/en/latest/

In [15]:
import ast

In [41]:
# From: http://alexleone.blogspot.com/2010/01/python-ast-pretty-printer.html
# Further modified to be usable in Python 3
# ast.dump drives me nuts
import astpp
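
Side note: on Python 3.9 and later, the standard library's `ast.dump` takes an `indent` argument, which gives a similar multi-line layout without a third-party printer. A minimal sketch, assuming a 3.9+ interpreter (the one-line snippet parsed here is only for illustration):

``` python
import ast

snippet = ast.parse("a = np.array([1, 2, 3])")

# indent=2 puts each child node on its own indented line (Python 3.9+)
pretty = ast.dump(snippet, indent=2)
print(pretty)
```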

In [73]:
tree = ast.parse(
"""
import numpy as np
from pandas import *
from sklearn.metrics import mean_squared_error
import scipy
a = np.array([1, 2, 3])
b = a.sum()
c = b.mean()
Timestamp.now()
isinstance(10, list)
scipy.linalg.svd(a)
""")

In [74]:
print(astpp.dump(tree))

Module(body=[
    Import(names=[
        alias(name='numpy', asname='np'),
      ]),
    ImportFrom(module='pandas', names=[
        alias(name='*', asname=None),
      ], level=0),
    ImportFrom(module='sklearn.metrics', names=[
        alias(name='mean_squared_error', asname=None),
      ], level=0),
    Import(names=[
        alias(name='scipy', asname=None),
      ]),
    Assign(targets=[
        Name(id='a', ctx=Store()),
      ], value=Call(func=Attribute(value=Name(id='np', ctx=Load()), attr='array', ctx=Load()), args=[
        List(elts=[
            Num(n=1),
            Num(n=2),
            Num(n=3),
          ], ctx=Load()),
      ], keywords=[])),
    Assign(targets=[
        Name(id='b', ctx=Store()),
      ], value=Call(func=Attribute(value=Name(id='a', ctx=Load()), attr='sum', ctx=Load()), args=[], keywords=[])),
    Assign(targets=[
        Name(id='c', ctx=Store()),
      ], value=Call(func=Attribute(value=Name(id='b', ctx=Load()), attr='mean', ctx=Load()), args=[], 

In [18]:
tree.body

[<_ast.Import at 0x7f15943bf150>,
 <_ast.Import at 0x7f15943bf1d0>,
 <_ast.Assign at 0x7f15943bf250>,
 <_ast.Assign at 0x7f15943bf490>,
 <_ast.Assign at 0x7f15943bf5d0>,
 <_ast.Expr at 0x7f15943b8390>]

In [19]:
ast.dump(tree.body[2])

"Assign(targets=[Name(id='a', ctx=Store())], value=Call(func=Attribute(value=Name(id='np', ctx=Load()), attr='array', ctx=Load()), args=[List(elts=[Num(n=1), Num(n=2), Num(n=3)], ctx=Load())], keywords=[]))"

Was wondering if I can just index all the `Call`s. But why is the structure returned as a string like this? That makes extracting the `Call`s a bit sticky.

In [20]:
tree.body[2].value

<_ast.Call at 0x7f15943bf2d0>

Never mind, the node can be reached through attribute access like this. Reminds me of a time I was digging through an XML tree, where objects further down the tree were also reachable through their properties. So `ast.dump` just returns a string representation of the node and its child nodes.

In [27]:
ast.dump(tree.body[5])

"Expr(value=Call(func=Attribute(value=Attribute(value=Name(id='pd', ctx=Load()), attr='Timestamp', ctx=Load()), attr='now', ctx=Load()), args=[], keywords=[]))"

In [34]:
ast.dump(tree.body[5].value)

"Call(func=Attribute(value=Attribute(value=Name(id='pd', ctx=Load()), attr='Timestamp', ctx=Load()), attr='now', ctx=Load()), args=[], keywords=[])"

In [35]:
ast.dump(tree.body[5].value.func)

"Attribute(value=Attribute(value=Name(id='pd', ctx=Load()), attr='Timestamp', ctx=Load()), attr='now', ctx=Load())"

In [29]:
isinstance(tree.body[5].value, ast.Call)

True

I can probably walk the entire tree and mark nodes that are `ast.Call` objects.
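
Before reaching for a visitor class, a quick way to test this idea is `ast.walk`, which yields every node in the tree. A minimal sketch, using a small made-up snippet rather than the sample above:

``` python
import ast

source = """
import numpy as np
a = np.array([1, 2, 3])
b = a.sum()
print(len(str(b)))
"""

sample_tree = ast.parse(source)

# ast.walk yields every node in the tree (in no guaranteed order),
# so filtering by type is enough to collect the calls.
calls = [node for node in ast.walk(sample_tree) if isinstance(node, ast.Call)]
print(len(calls))
```

This only collects the nodes. It says nothing yet about where each call comes from.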

From https://greentreesnakes.readthedocs.io/en/latest/manipulating.html#working-on-the-tree, I can subclass `ast.NodeVisitor` and override the methods that correspond to the node types. The following code snippet is from the reference:
``` python
class FuncLister(ast.NodeVisitor):
    def visit_FunctionDef(self, node):
        print(node.name)
        self.generic_visit(node)

FuncLister().visit(tree)
```

Inspected https://greentreesnakes.readthedocs.io/en/latest/nodes.html#Call and wrote the following.

In [38]:
class FuncLister(ast.NodeVisitor):
    def visit_Call(self, node):
        print(ast.dump(node.func))
        # Call self.generic_visit(node) to include child nodes
        self.generic_visit(node)

FuncLister().visit(tree)

Attribute(value=Name(id='np', ctx=Load()), attr='array', ctx=Load())
Attribute(value=Name(id='a', ctx=Load()), attr='sum', ctx=Load())
Attribute(value=Name(id='b', ctx=Load()), attr='mean', ctx=Load())
Attribute(value=Attribute(value=Name(id='pd', ctx=Load()), attr='Timestamp', ctx=Load()), attr='now', ctx=Load())


Hmm, `Attribute` objects. From reading the docs, they represent attribute access on an object (`value.attr`).
+ `value`: the node whose attribute is being accessed. Typically a `Name`, or another `Attribute` for chained access.
+ `attr`: the name of the attribute, as a bare string
+ `ctx`: `Load`, `Store` or `Del`, depending on how the attribute is used

`Name` objects have `id` which holds the actual name as a string, and `ctx`.
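
To make the nesting concrete, here is a rough sketch that rebuilds the dotted name of a call from its `Attribute`/`Name` chain. The helper name `dotted_name` is made up for this note and is not part of `ast`:

``` python
import ast

def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    # Each Attribute holds the last part in .attr and the rest in .value
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None  # e.g. a call on a subscript or on another call's result

call = ast.parse("scipy.linalg.svd(a)", mode="eval").body
print(dotted_name(call.func))  # scipy.linalg.svd
```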

So tracking down each function call is possible. But determining if they are part of a library or not can be a bit tricky. `ast` does not actually run code, so it is not possible to determine what is in a variable! 

What I've gathered thus far:
+ It is possible to pick out all function calls.
+ The whole name of the function can be traced. Something like `pd.np.nan()` can be traced back to `pd`.
+ Imports are their own explicit object in AST, thus it is simple to determine the origin of functions that can be traced back to imported modules. 
+ Function definitions are their own explicit objects in AST, thus it is also possible to trace function calls that are defined in the script itself.
+ If a function cannot be traced just from reading the code itself, it cannot be traced in AST. For example, a built-in such as `isinstance` appears as a bare `Name` with nothing above it. Functions that come from an asterisk import look the same, so their lineage cannot be traced either.
+ Object methods need some effort to trace to the host library as well, and even then it can't be guaranteed. In the example above, `b` and `c` are assigned the results of method calls on `a` and `b`. The assignment of `a` can be traced to a `np.array` call, but it is not guaranteed that `a` holds an object defined in `np`! Will just have to take this as an assumption, and have the method calls on `a` and `b` count towards `np`. The idea being that they operate on objects that came from `numpy`, so they are assigned to it.

Starting to see something take shape here. Workflow is basically:
+ Identify all imports and function definitions
+ Identify all function calls
+ For each function call
    + Assign it to a library, a function definition, or an object (if it is an object method). If that is not possible, the function is either a built-in or not mappable, and is mapped to an unknown class.
+ Objects that have function calls mapped to them are then mapped to their own source, and those calls are re-classified to that source.
+ Tabulate everything and print results

In [79]:
# The idea is to have a dict that can be queried to find the final library to assign to
class ImportTracker(ast.NodeVisitor):
    
    def __init__(self):
        super().__init__()
        self.libs = {}
    
    def visit_Import(self, node):
        """
        node.names holds what is imported, as ast.alias objects
        (each with a name and an optional asname).
        Note that ast.Import has no module attribute; only
        ast.ImportFrom does.
        
        The idea is to have self.libs be a lookup, so that
        the modules where functions can be imported from can
        be traced.
        """
        for i in node.names:
            assert isinstance(i, ast.alias)
            name = i.name
            asname = i.asname
                    
            if asname is None:
                self.libs[name] = name
            else:
                self.libs[asname] = name
                        
        # Call self.generic_visit(node) to include child nodes
        self.generic_visit(node)
        
    def visit_ImportFrom(self, node):
        for i in node.names:
            assert isinstance(i, ast.alias)
            name = i.name
            asname = i.asname
                    
            if asname is None:
                if name != "*":
                    pass
                    # This omits the library from the list! 
                    # self.libs[name] = node.module
                else:
                    self.libs[name] = "unknown"
            else:
                self.libs[asname] = {name:node.module}
                        
        # Call self.generic_visit(node) to include child nodes
        self.generic_visit(node)

obj = ImportTracker()
obj.visit(tree)
print(obj.libs)

{'np': 'numpy', '*': 'unknown', 'scipy': 'scipy'}
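
The `from ... import ...` names are dropped above. One possible way to keep them is to map each imported name (or its alias) straight to its module, and to record star imports separately. This is only a sketch of an alternative: the class name `FromImportTracker` and the `star_modules` list are made up here, and this is not the tracker used in the rest of these notes.

``` python
import ast

class FromImportTracker(ast.NodeVisitor):
    """Hypothetical variant: map each from-imported name to its module."""

    def __init__(self):
        super().__init__()
        self.libs = {}
        self.star_modules = []

    def visit_ImportFrom(self, node):
        # node.module is None for relative imports such as `from . import x`
        module = node.module or "." * node.level
        for alias in node.names:
            if alias.name == "*":
                # The imported names are unknowable without running the code
                self.star_modules.append(module)
            else:
                self.libs[alias.asname or alias.name] = module

tracker = FromImportTracker()
tracker.visit(ast.parse(
    "from pandas import *\n"
    "from sklearn.metrics import mean_squared_error as mse\n"
    "from os import path\n"
))
print(tracker.libs)
print(tracker.star_modules)
```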


In [72]:
# Identify all function definitions
class FunctionDefTracker(ast.NodeVisitor):
    
    def __init__(self):
        super().__init__()
        self.functiondefs = []
    
    def visit_FunctionDef(self, node):
        """
        node.name is the function name. Just have to track that.
        """
        self.functiondefs.append(node.name)

        # Call self.generic_visit(node) to include child nodes
        self.generic_visit(node)

obj = FunctionDefTracker()
obj.visit(tree)
print(obj.functiondefs)

[]


In [83]:
class CallTracker(ast.NodeVisitor):
    
    def __init__(self):
        super().__init__()
        self.calls = []
        
    def visit_Call(self, node):
        # node.func is usually either an ast.Name or an ast.Attribute
        if isinstance(node.func, ast.Name):
            # Bare call such as isinstance(...)
            self.calls.append(node.func.id)
        elif isinstance(node.func, ast.Attribute):
            # Dotted call such as scipy.linalg.svd(...)
            # Find the top-level name and store it! 
            toplvlname = self.find_top_lvl_name(node.func)
            self.calls.append(toplvlname)
        else:
            # Anything else (e.g. calling the result of another call) is skipped
            pass
        
        # Call self.generic_visit(node) to include child nodes
        self.generic_visit(node)
        
    def find_top_lvl_name(self, func):
        # Follow the .value of each ast.Attribute until an ast.Name is found
        current_layer = func
        while True:
            if isinstance(current_layer, ast.Name):
                return current_layer.id
            elif isinstance(current_layer, ast.Attribute):
                current_layer = current_layer.value
            else:
                # e.g. a method called on a call result or on a subscript
                raise TypeError(f"cannot trace {type(current_layer).__name__}")
        
obj = CallTracker()
obj.visit(tree)
obj.calls

['np', 'a', 'b', 'Timestamp', 'isinstance', 'scipy']

In [85]:
class AssignTracker(ast.NodeVisitor):
    
    def __init__(self):
        super().__init__()
        self.assigns = {}
        
    def visit_Assign(self, node):
        # In an ast.Assign, `targets` is a list of nodes and `value` is a single node.
        # Only track assignments of the form `name = some.call(...)`; skip the rest
        # instead of asserting, so that e.g. `x = 1` does not crash the visitor.
        if isinstance(node.value, ast.Call):
            for i in node.targets:
                if isinstance(i, ast.Name):
                    self.assigns[i.id] = self.find_top_lvl_name(node.value.func)
        
        # Call self.generic_visit(node) to include child nodes
        self.generic_visit(node)

    def find_top_lvl_name(self, func):
        # Follow the .value of each ast.Attribute until an ast.Name is found
        current_layer = func
        while True:
            if isinstance(current_layer, ast.Name):
                return current_layer.id
            elif isinstance(current_layer, ast.Attribute):
                current_layer = current_layer.value
            else:
                # e.g. a method called on a call result or on a subscript
                raise TypeError(f"cannot trace {type(current_layer).__name__}")
    
        
obj = AssignTracker()
obj.visit(tree)
obj.assigns

{'a': 'np', 'b': 'a', 'c': 'b'}
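
The chain `c -> b -> a -> np` still has to be followed back to an import. A rough sketch of that lookup, assuming the dicts have the shapes printed above (the function name `resolve_origin` is made up for this note). It treats the assignments as one flat dict, so reassignments and scopes are ignored:

``` python
def resolve_origin(name, assigns, libs):
    """Follow variable assignments back until an imported name is reached."""
    seen = set()
    while name not in libs:
        # Stop on a dead end, or on a cycle such as x = y.f(); y = x.g()
        if name in seen or name not in assigns:
            return "unknown"
        seen.add(name)
        name = assigns[name]
    return libs[name]

assigns = {'a': 'np', 'b': 'a', 'c': 'b'}
libs = {'np': 'numpy', '*': 'unknown', 'scipy': 'scipy'}

print(resolve_origin('c', assigns, libs))          # numpy
print(resolve_origin('Timestamp', assigns, libs))  # unknown
```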

I think these are most of the pieces! Just need to merge them.

In [91]:
class LibSum(ImportTracker, FunctionDefTracker, CallTracker, AssignTracker, ast.NodeVisitor):
    pass

In [96]:
obj = LibSum()
obj.visit(tree)

In [97]:
obj.assigns

{'a': 'np', 'b': 'a', 'c': 'b'}

In [98]:
obj.calls

['np', 'a', 'b', 'Timestamp', 'isinstance', 'scipy']

In [100]:
obj.functiondefs

[]

In [101]:
obj.libs

{'np': 'numpy', '*': 'unknown', 'scipy': 'scipy'}

Merging these four items into `{library: count}` is all that is left!
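
A first rough pass at that merge, using the four values printed above as plain literals. The `classify` helper and the `<local>` / `builtins` / `unknown` labels are placeholders made up for this sketch, not a settled design:

``` python
import builtins
from collections import Counter

libs = {'np': 'numpy', '*': 'unknown', 'scipy': 'scipy'}
assigns = {'a': 'np', 'b': 'a', 'c': 'b'}
calls = ['np', 'a', 'b', 'Timestamp', 'isinstance', 'scipy']
functiondefs = []

def classify(name):
    """Map the top-level name of a call to a library label."""
    seen = set()
    # Follow variables back to whatever they were assigned from
    while name in assigns and name not in libs and name not in seen:
        seen.add(name)
        name = assigns[name]
    if name in libs:
        return libs[name]
    if name in functiondefs:
        return "<local>"
    if hasattr(builtins, name):
        return "builtins"
    return "unknown"  # e.g. names that came from a star import

summary = Counter(classify(name) for name in calls)
print(dict(summary))
```

With the sample above, `numpy` gets three calls (`np.array`, `a.sum`, `b.mean`), while `Timestamp.now()` lands in `unknown` because it came from the asterisk import.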