SemForms is powered by static analysis of programs. This means it can analyze a large repository of programs efficiently. Most importantly, it does not actually run the programs, so it does not need to resolve software dependencies, and it **does not need access to the data**.

SemForms is built on [WALA](https://github.com/wala/WALA), which performs inter-procedural analysis of programs. WALA builds control flow graphs (how execution flows through the constructs within a program), call graphs (how calls flow through the program), and, most importantly for SemForms, **data flow**, which specifies how objects get created and used within the program.
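To make the idea of a call graph concrete, here is a minimal sketch that derives one from source text alone, using Python's standard `ast` module. It is not how WALA or SemForms work internally: it is purely syntactic, does not resolve imports or aliases, and only looks at top-level functions. The helper names `call_name` and `build_call_graph` are made up for this illustration. What it does show is that the script is parsed and never executed, so `houses.csv` does not need to exist.

```python
import ast

# A small script to analyze. It is only parsed, never run,
# so pandas and houses.csv do not need to be available.
SOURCE = """
import pandas as pd

def read_df():
    return pd.read_csv('houses.csv')

def manipulate_df(houses_df):
    houses_df['popdf'] = houses_df['population'] / houses_df['households']

def main():
    h_df = read_df()
    manipulate_df(h_df)

main()
"""

def call_name(node):
    """Return a dotted name for a call target such as pd.read_csv, or None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None

def build_call_graph(source):
    """Map each top-level function (or '<module>') to the names it calls."""
    graph = {}
    for stmt in ast.parse(source).body:
        scope = stmt.name if isinstance(stmt, ast.FunctionDef) else "<module>"
        callees = graph.setdefault(scope, [])
        for node in ast.walk(stmt):
            if isinstance(node, ast.Call):
                name = call_name(node.func)
                if name is not None and name not in callees:
                    callees.append(name)
    return graph

call_graph = build_call_graph(SOURCE)
for caller, callees in call_graph.items():
    print(caller, "->", callees)
```

Note that this sketch records `pd.read_csv` only as a dotted name. Knowing that it is a pandas library call, and following its result into other functions, is where an inter-procedural analysis is needed.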

While many tools exist for static analysis, Python is especially challenging to analyze because it is a dynamically typed language. In addition, for SemForms we needed to be able to analyze each script independently for maximal flexibility and to handle all sorts of imports of user libraries. We built a layer of analysis technology on top of [WALA](https://github.com/wala/WALA) that handles calls to arbitrary libraries and creates data flow and control flow graphs from the code (see [Graph4Code](https://arxiv.org/pdf/2002.09440.pdf)). We note that many popular static analysis tools for Python lack the mechanisms to support the data flow graph we need for expression extraction.

Let's look at a simple example.

In [1]:
!cat example/test.py

import pandas as pd

def read_df():
  return pd.read_csv('houses.csv')

def manipulate_df(houses_df):
    houses_df['beds_to_total'] = houses_df['total_bedrooms'] / houses_df['total_rooms']
    houses_df['popdf'] = houses_df['population' ] / houses_df['households']

def main():
    h_df = read_df()
    manipulate_df(h_df)

main()




Notice how the code creates a dataframe in one function and passes it to a different function, where its individual fields are manipulated. To capture this, we extended [Graph4Code](https://arxiv.org/pdf/2002.09440.pdf) to model field reads and writes to library data structures.
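As a rough illustration of what "field reads and writes" means for this example, the sketch below lists the string-keyed subscripts in `manipulate_df` and labels each one as a read or a write. It is an assumption-laden stand-in, not the Graph4Code extension itself: it uses only Python's `ast` module (Python 3.9 or later, where a subscript key is stored directly in `node.slice`), it works on one function at a time, and the helper name `field_accesses` is hypothetical.

```python
import ast

SOURCE = """
def manipulate_df(houses_df):
    houses_df['beds_to_total'] = houses_df['total_bedrooms'] / houses_df['total_rooms']
    houses_df['popdf'] = houses_df['population'] / houses_df['households']
"""

def field_accesses(source):
    """Collect (variable, field, 'read' or 'write') for string-keyed subscripts."""
    accesses = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Subscript):
            continue
        base, key = node.value, node.slice
        if (isinstance(base, ast.Name)
                and isinstance(key, ast.Constant)
                and isinstance(key.value, str)):
            # A Store context means the subscript is an assignment target.
            kind = "write" if isinstance(node.ctx, ast.Store) else "read"
            accesses.append((base.id, key.value, kind))
    return accesses

accesses = field_accesses(SOURCE)
writes = sorted(field for _, field, kind in accesses if kind == "write")
reads = sorted(field for _, field, kind in accesses if kind == "read")
print("writes:", writes)
print("reads:", reads)
```

This syntactic pass sees that `houses_df` has fields read and written, but it cannot tell that `houses_df` is the dataframe created by `pd.read_csv` in `read_df`. Connecting the two requires following the value through `main`, which is the inter-procedural data flow described above.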

Compare this with how `python_graphs`, a popular recent static analysis tool for Python, analyzes the same code. It provides a control flow graph for each function, but no flow connects the creation of the dataframe to its use, and nothing models reads from or writes to it. For instance, it only knows that `houses_df` is passed in as an argument to `manipulate_df`. It does not recognize that `read_csv` is a library call to pandas either.

![Control flow for main](python_graphs_output/example-controlflow-main-graph.png)

![Control flow for manipulate](python_graphs_output/example-controlflow-manip-graph.png)

![Control flow for read](python_graphs_output/example-controlflow-read-graph.png)

Now let us look at the graph that SemForms uses for extraction.

In [1]:
import requests
import json
from visualize_graph import show_analysis

with open('example/test.json') as f:
    data = json.load(f)
response = requests.post('http://localhost:4567/analyze_code', json=data)
graph = response.json()
g = show_analysis(graph)



Local cdn resources have problems on chrome/safari when used in jupyter-notebook. 


In [2]:
g.show('example.html')