# Safe parsing for `eval`

We often want to let users execute arbitrary command-like statements without having
to write a parser from scratch, while still guarding against a user who attempts
something malicious.

Let's start with an example where a user who knows nothing of Python needs to perform
some math using a custom function. The data source is a table, and the function
must be applied to each row.

| # | A | B | C | D |
|---|---|---|---|---|
|1|123| 2 | 32|321|
|2|133| 2.1 | 33|123|
|3|143| 3.1e4 | 34|111|
|4|163| 3 | 35|222|
|5|143| 4 | 36|333|
|6|123| 4 | 37|444|

The user defines the function for column **`E`** as:  

    func = "(int(round(('A'*'B')+('C'*'A') + 4, 0)) + max('D', 'B'))/'A'"

When looping over the table, we want to add column `E` to the table as the result of the function.  
To do so, we will need to:  

1. check the function for malicious content.
2. if okay, evaluate the function and return the result.

We start with the table:

In [1]:
table = {
    0: ['A',   'B', 'C', 'D'],
    1: [123,     2,  32, 321],
    2: [133,   2.1,  33, 123],
    3: [143, 3.1e4,  34, 111],
    4: [163,     3,  35, 222],
    5: [143,     4,  36, 333],
    6: [123,     4,  37, 444],
}

and the function provided by the user:


In [2]:
func = "(int(round(('A'*'B')+('C'*'A') + 4, 0)) + max('D', 'B'))/'A'"
new_column_name = "E"

Using the filters defined below, we can check the function by pruning everything
that is permitted and seeing whether anything is left over.


## Part 1: The slow way.


First we need the filters, with the longest word first, so we don't accidentally
remove a shorter word that is a sub-string of a longer word, for example
`round`, which is contained in `rounddown`:

In [3]:
operators = ["max", "min", "int", "round", "+", "-", "/", "*", "(", ")", " ", ",", "'", '"']
operators.sort(key=len, reverse=True)  # guarantees longest word first.
numbers = list('1234567890e.')
permitted = operators + numbers

The pruning function is straightforward:


In [4]:
def strip(text, replacements):
    for word in replacements:
        text = text.replace(word, "")
        if not text:
            break
    return text
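
To see the pruning check in action, here is a minimal sketch that repeats the filters and `strip` so that it runs on its own. The `headers` list mirrors the table headings, and the `malicious` string is a made-up attack used only for illustration.

```python
# Filters and pruning function repeated from the cells above so this runs standalone.
operators = ["max", "min", "int", "round", "+", "-", "/", "*", "(", ")", " ", ",", "'", '"']
operators.sort(key=len, reverse=True)
permitted = operators + list('1234567890e.')


def strip(text, replacements):
    for word in replacements:
        text = text.replace(word, "")
        if not text:
            break
    return text


headers = ['A', 'B', 'C', 'D']  # assumed column names, as in the table above

benign = "(int(round(('A'*'B')+('C'*'A') + 4, 0)) + max('D', 'B'))/'A'"
malicious = "__import__('os').system('ls')"  # hypothetical attack string

benign_remainder = strip(benign, headers + permitted)
malicious_remainder = strip(malicious, headers + permitted)

print(repr(benign_remainder))     # nothing is left: the expression passes
print(repr(malicious_remainder))  # leftover characters: the expression is rejected
```

Anything that survives the pruning is, by construction, not in the permitted list, which is why a non-empty remainder is treated as an error.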

Now we can check the function and raise a `ValueError` if it contains anything that is not permitted, using something like:

```
table_headers = table[0]

remainder = strip(func, table_headers + permitted)
if remainder:
    raise ValueError(f"Bad sign near '{remainder}' in '{func}'")
```


At this point we know that the function itself is not malicious, but we don't know whether
some combination of column names and data can produce malicious content.

To process the rows, we therefore have to substitute each heading with its data
and check the result in the same manner. To do so we need a short helper:

In [5]:
def replace(text, dictionary):
    for k, v in dictionary.items():
        text = text.replace(k, str(v))
    return text
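
As a sketch of what the substitution does for a single row, the snippet below applies `replace` to the first data row and then removes the text marks. The `row` dictionary is simply row 1 of the table written out by hand.

```python
def replace(text, dictionary):
    for k, v in dictionary.items():
        text = text.replace(k, str(v))
    return text


func = "(int(round(('A'*'B')+('C'*'A') + 4, 0)) + max('D', 'B'))/'A'"
row = {'A': 123, 'B': 2, 'C': 32, 'D': 321}  # row 1 of the table

# Substitute headings with values, then drop the text marks.
expression = replace(func, row).replace("'", "").replace('"', "")
print(expression)
# (int(round((123*2)+(32*123) + 4, 0)) + max(321, 2))/123
```

Because the substitution is plain text replacement, it assumes that no heading is a sub-string of another heading or of a permitted word; a heading named `a`, for example, would also alter `max`.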

The user-defined function (UDF) can now be processed by:

1. checking the UDF for invalid content.
2. processing each row, by:
    1. replacing the headers with values from the row.
    2. getting rid of the remaining text marks.
    3. checking that no malicious content is left in the string.
    4. evaluating the string as if it was math and updating the table with the calculated value.

Like this:


In [6]:
def evaluate_custom_expression(user_defined_function, new_column_name, table):
    """
    Add a column to the table, computed row by row from a user-defined expression.

    :param user_defined_function: str, expression that references column names.
    :param new_column_name: str, heading of the column to add.
    :param table: dict of lists, where key 0 holds the headings.
    :return: None, the table is updated in place.
    """

    table_headers = list(table[0])  # copy, so the new heading isn't treated as an input column.
    table[0] += [new_column_name]

    remainder = strip(user_defined_function, table_headers + permitted)
    if remainder:
        raise ValueError(f"Bad sign near '{remainder}' in '{user_defined_function}'")

    for row_index in (i for i in table if i > 0):
        data = {k: str(v) for k, v in zip(table_headers, table[row_index])}
        # replace column names with values and remove text marks.
        new_func = strip(replace(user_defined_function, data), ["'", '"'])

        remainder = strip(new_func, permitted)  # check that no malicious content is left in the string
        if remainder:
            raise ValueError(f"Bad sign near '{remainder}' in '{user_defined_function}'")

        table[row_index] += [eval(new_func)]  # evaluate the string as if it was math.


Example:

In [7]:
evaluate_custom_expression(func, new_column_name, table)
for k, v in table.items():
    print(k, ":", v)

0 : ['A', 'B', 'C', 'D', 'E']
1 : [123, 2, 32, 321, 36.642276422764226]
2 : [133, 2.1, 33, 123, 36.05263157894737]
3 : [143, 31000.0, 34, 111, 31250.81118881119]
4 : [163, 3, 35, 222, 39.38650306748466]
5 : [143, 4, 36, 333, 42.35664335664335]
6 : [123, 4, 37, 444, 44.642276422764226]


## Part 2: The faster way.

However, there is a better way.

The code above effectively parses the function again for every row. What if we parsed it only once?

Let's try:

In [8]:
import ast  # Python's abstract syntax tree module.

def expression_interpreter(expression, columns):
    # use Python's own compiler to parse the expression.
    req_columns = ",".join(columns)
    # Wrap the expression in a function definition. ast.parse raises a SyntaxError
    # if the result is not valid Python, e.g. when a heading contains a space.

    script = f"def f({req_columns}):\n    return {expression}"
    tree = ast.parse(script)
    code = compile(tree, filename="blah", mode="exec")
    namespace = {**globals(), **locals()}
    exec(code, namespace)
    f = namespace["f"]
    if not callable(f):
        raise ValueError(f"The expression could not be parsed: {expression}")
    return f

Now we can turn an expression into a regular Python function.

Let's try it out:

In [9]:
func = "(int(round((A*B)+(C*A) + 4, 0)) + max(D, B))/A"
udf = expression_interpreter(func, ['A','B','C','D'])
print(udf)


<function f at 0x7f28b43553a0>


In [10]:
from inspect import signature
print(signature(udf))

(A, B, C, D)


In [11]:
udf(123, 2, 32, 321)  # expected: 36.642276422764226

36.642276422764226

We can now rewrite `evaluate_custom_expression` to use the compiled function:

In [12]:
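def evaluate_custom_expression(new_column_name, table):
    """
    Add a column to the table by calling the compiled function `udf`
    (defined in the cells above) on each row.

    :param new_column_name: str, heading of the column to add.
    :param table: dict of lists, where key 0 holds the headings.
    :return: None, the table is updated in place.
    """
    table_headers = list(table[0])  # copy, so the new heading isn't treated as an input column.
    table[0] += [new_column_name]

    keys = signature(udf).parameters  # the argument names that udf expects.

    for row_index in (i for i in table if i > 0):
        data = {k: v for k, v in zip(table_headers, table[row_index]) if k in keys}
        table[row_index] += [udf(**data)]  # call the compiled function; nothing is re-parsed.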

In [13]:
evaluate_custom_expression('F', table)
for k, v in table.items():
    print(k, ":", v)

0 : ['A', 'B', 'C', 'D', 'E', 'F']
1 : [123, 2, 32, 321, 36.642276422764226, 36.642276422764226]
2 : [133, 2.1, 33, 123, 36.05263157894737, 36.05263157894737]
3 : [143, 31000.0, 34, 111, 31250.81118881119, 31250.81118881119]
4 : [163, 3, 35, 222, 39.38650306748466, 39.38650306748466]
5 : [143, 4, 36, 333, 42.35664335664335, 42.35664335664335]
6 : [123, 4, 37, 444, 44.642276422764226, 44.642276422764226]


As the output above shows, the table now has the column `"F"` with exactly the same values as column `"E"`, but the expression was parsed only once instead of once per row.
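
To get a rough feel for the difference, the two approaches can be compared with `timeit`. This is only a sketch under assumptions: `per_row_eval` and `compile_once` are simplified stand-ins for the code in Part 1 and Part 2 (the safety checks are left out, which understates the cost of Part 1), and the timings depend entirely on the machine.

```python
import timeit

HEADERS = ["A", "B", "C", "D"]
ROW = [123, 2, 32, 321]  # row 1 of the table
EXPRESSION = "(int(round((A*B)+(C*A) + 4, 0)) + max(D, B))/A"


def per_row_eval(expression, headers, row):
    """Part 1 style: substitute the values, then parse and evaluate the string."""
    for name, value in zip(headers, row):
        expression = expression.replace(name, str(value))
    return eval(expression)


def compile_once(expression, headers):
    """Part 2 style: parse the expression a single time into a function."""
    namespace = {}
    exec(f"def f({','.join(headers)}):\n    return {expression}", namespace)
    return namespace["f"]


udf = compile_once(EXPRESSION, HEADERS)
slow_result = per_row_eval(EXPRESSION, HEADERS, ROW)
fast_result = udf(*ROW)

# Time 10,000 evaluations of the same row with each approach.
slow = timeit.timeit(lambda: per_row_eval(EXPRESSION, HEADERS, ROW), number=10_000)
fast = timeit.timeit(lambda: udf(*ROW), number=10_000)
print(f"per-row eval: {slow:.4f}s, compiled once: {fast:.4f}s")
```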

**Note**: You still have to scrub the expression for illegal items as in Part 1.
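
One way to combine the two parts is to run the Part 1 check before compiling. The sketch below is an illustration under assumptions, not the original implementation: the name `checked_expression_interpreter` is hypothetical, and restricting the namespace to the four permitted functions (instead of passing `globals()`) is an extra precaution added here.

```python
operators = ["max", "min", "int", "round", "+", "-", "/", "*", "(", ")", " ", ",", "'", '"']
operators.sort(key=len, reverse=True)  # longest word first, as in Part 1
permitted = operators + list('1234567890e.')


def strip(text, replacements):
    for word in replacements:
        text = text.replace(word, "")
        if not text:
            break
    return text


def checked_expression_interpreter(expression, columns):
    """Hypothetical helper: scrub the expression as in Part 1, then compile it once."""
    remainder = strip(expression, list(columns) + permitted)
    if remainder:
        raise ValueError(f"Bad sign near '{remainder}' in '{expression}'")

    script = f"def f({','.join(columns)}):\n    return {expression}"
    code = compile(script, filename="<udf>", mode="exec")
    # Assumption: expose only the permitted functions rather than all globals.
    namespace = {"__builtins__": {}, "max": max, "min": min, "int": int, "round": round}
    exec(code, namespace)
    return namespace["f"]


columns = ['A', 'B', 'C', 'D']
udf = checked_expression_interpreter("(int(round((A*B)+(C*A) + 4, 0)) + max(D, B))/A", columns)
print(udf(123, 2, 32, 321))  # 36.642276422764226, matching row 1 above

try:
    checked_expression_interpreter("__import__('os').system('ls')", columns)
except ValueError as error:
    print("rejected:", error)
```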