# Information Flow

In this chapter, we explore in depth how to track information flows in Python by tagging input strings with their origin, and tracking that origin across string operations.

Some material on `eval` exploitation is adapted from the excellent [blog post](https://nedbatchelder.com/blog/201206/eval_really_is_dangerous.html) by Ned Batchelder.

**Prerequisites**

* You should have read the [chapter on coverage](Coverage.ipynb).
* Some knowledge of inheritance in Python is required.

We first set up our infrastructure so that we can make use of previously defined functions.

In [None]:
import fuzzingbook_utils

In [None]:
from ExpectError import ExpectError

In [None]:
import inspect
import enum

Say we want to implement a *calculator* service in Python. A rather easy way to do that is to rely on the `eval()` function in Python. However, unrestricted `eval()` can be used by users to execute arbitrary commands. Since we want to restrict our users to using only the *calculator* functionality, and do not want the users to trash our server, we use `eval()` with empty `locals` and `globals` (as recommended [elsewhere](https://www.programiz.com/python-programming/methods/built-in/eval)).

In [None]:
def my_calculator(my_input):
    result = eval(my_input, {}, {})
    print("The result of %s was %d" % (my_input, result))

It works as expected:

In [None]:
my_calculator('1+2')

Does it?

In [None]:
with ExpectError():
    my_calculator('__import__("os").popen("ls").read()')

As you can see from the error, `eval()` itself completed successfully and the system command `ls` was executed; the error only stems from formatting the resulting string with `%d`. It is easy enough for the user to see the output if needed.

In [None]:
my_calculator(
    "1 if __builtins__['print'](__import__('os').popen('pwd').read()) else 0")

The problem is that the Python `__builtins__` is [inserted by default](https://docs.python.org/3/library/functions.html#eval) when one uses `eval()`. We can avoid this by restricting `__builtins__` in `eval` explicitly (again as recommended [elsewhere](http://lybniz2.sourceforge.net/safeeval.html)).
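
Before restricting it, a quick check shows the default insertion at work, and that an entry supplied by the caller is left untouched. This is only a sketch: the names `env` and `restricted_env` are illustrative and not part of the calculator, and an empty dict stands in for the restricted value.

```python
# eval() adds a '__builtins__' entry to the globals dict it is given,
# unless the caller has already supplied one.
env = {}
eval("1 + 2", env, {})
builtins_inserted = "__builtins__" in env

# Supplying the key ourselves prevents the default insertion.
restricted_env = {"__builtins__": {}}
eval("1 + 2", restricted_env, {})
left_untouched = restricted_env["__builtins__"] == {}
```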

In [None]:
def my_calculator(my_input):
    result = eval(my_input, {"__builtins__": None}, {})
    print("The result of %s was %d" % (my_input, result))

Does it help?

In [None]:
with ExpectError():
    my_calculator(
        "1 if __builtins__['print'](__import__('os').popen('pwd').read()) else 0")

But does it actually?

In [None]:
my_calculator("1 if [x['print'](x['__import__']('os').popen('pwd').read()) for x in ([x for x in (1).__class__.__base__.__subclasses__() if x.__name__ == 'Sized'][0].__len__.__globals__['__builtins__'],)] else 0")
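
The one-liner above is dense, so here is a sketch that unpacks the same walk step by step outside of `eval()`. The variable names are illustrative assumptions; the explicit import of `collections.abc` only makes sure that the `Sized` class is loaded.

```python
import collections.abc  # makes sure the Sized class is loaded

# Step 1: from any literal we can walk up to the root class `object`.
root = (1).__class__.__base__

# Step 2: `object` knows its direct subclasses that are currently loaded.
sized = [c for c in root.__subclasses__() if c.__name__ == 'Sized'][0]

# Step 3: Sized.__len__ is a plain Python function, so it carries the
# globals of the module that defines it, including that module's builtins.
found = sized.__len__.__globals__['__builtins__']
# Depending on the module, this is a dict or the builtins module itself.
found = found if isinstance(found, dict) else vars(found)
has_import = '__import__' in found
```

No name from the caller's `globals` or `locals` is needed at any step, which is why emptying them does not help.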

The problem here is that when the user has a way to inject **uninterpreted strings** that can reach a dangerous routine such as `eval()` or `exec()`, they can inject dangerous code. What we need is a way to prevent uninterpreted input string fragments from reaching dangerous portions of code.

## A Simple Taint Tracker

To capture information flows, we need a new string class. The idea is to use the new origined string class `tstr` as a wrapper around the original `str` class. However, `str` is an *immutable* class: its value is fixed when the object is constructed in `__new__()`, and `str.__new__()` does not accept the extra arguments (such as `origin`) that we want to pass along. A subclass that only overrides `__init__()` would therefore fail before its `__init__()` is ever reached. To get our initialization routine called with these arguments, we need to [hook into `__new__`](https://docs.python.org/3/reference/datamodel.html#basic-customization) and return an instance of our own class.

We therefore define `__new__()` in `tstr` so that it passes only the string value on to `str`, and record the origin information, including the `tstr` instance this one was derived from, in `tstr.__init__()`.
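
The following sketch illustrates why the `__new__()` hook is needed once the constructor takes extra arguments. The classes `NaiveStr` and `HookedStr` are hypothetical and exist only for this illustration.

```python
class NaiveStr(str):
    # Hypothetical subclass that only overrides __init__().
    def __init__(self, value, origin=None):
        self.origin = origin


try:
    NaiveStr('hello', origin=0)
    naive_ok = True
except TypeError:
    # str.__new__() rejects the extra argument before __init__() runs.
    naive_ok = False


class HookedStr(str):
    # Overriding __new__() lets us pass only the value on to str.
    def __new__(cls, value, *args, **kw):
        return str.__new__(cls, value)

    def __init__(self, value, origin=None):
        self.origin = origin


hooked = HookedStr('hello', origin=0)
```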

There are various levels of origin tracking that one can perform. The simplest is to track that a string fragment originated in an untrusted environment and has not undergone an origin removal process. For this, we simply need to wrap the original string from the untrusted environment in `tstr`, and produce `tstr` instances on each operation that results in another string fragment.

Distinguishing between various untrusted sources may be accomplished by tagging each source with its own label (called *colors* in dynamic taint research). You will see an instance of this technique in the chapter on [Grammar Mining](GrammarMiner.ipynb).

In this chapter, we carry *character-level* origins. That is, given a fragment that resulted from a portion of the original origined string, one will be able to tell which portion of the input string the fragment was taken from. In essence, each input character index from an origined source gets its own color.
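
As a minimal sketch of the idea, using plain lists rather than the class developed in this chapter, a string and its origins can be kept side by side and sliced in lockstep. All names here are illustrative.

```python
text = "hello world"
origin = list(range(len(text)))  # one origin index per input character

# Taking a fragment means slicing the text and its origins in lockstep.
fragment, fragment_origin = text[6:], origin[6:]

# Each character of the fragment can be traced back to the input.
traced = [text[i] for i in fragment_origin]
```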

More complex forms of origin tracking, such as *bitmap origins*, are possible, where a single character may result from multiple origined character indexes (as in *checksum* operations on strings). We do not consider these in this chapter.

We now define our initialization code in `__init__()`.

The variable `origin` contains non-overlapping origins mapped to the original string. The variable `taint` holds a reference to the `tstr` instance from which this instance was derived.

In [None]:
class tstr(str):
    def __new__(cls, value, *args, **kw):
        return str.__new__(cls, value)

    def __init__(self, value, origin=None, taint=None, **kwargs):
        self.taint = taint
        l = len(self)
        if origin is None:
            origin = 0
        self.origin = list(range(origin, origin + l)) if isinstance(
            origin, int) else origin
        assert len(self.origin) == l

In [None]:
class tstr(tstr):
    def __repr__(self):
        # return self
        # FIXME: Is this better?
        return tstr(str.__repr__(self))

In [None]:
class tstr(tstr):
    def __str__(self):
        return str.__str__(self)

For example, if we wrap `hello` in `tstr`, then we should be able to access its origins as the indices `0..4`.

In [None]:
thello = tstr('hello')

In [None]:
thello

In [None]:
assert thello.origin == [0, 1, 2, 3, 4]

We can also specify the starting origin, as below. Here, the origins are `6..10`.

In [None]:
tworld = tstr('world', origin=6)
assert tworld.origin == [6, 7, 8, 9, 10]

`str()` returns an unorigined `str` instance.

In [None]:
assert type(str(thello)) == str

However, `repr()` returns an origined representation of the object.

In [None]:
assert type(repr(thello)) == type(thello)

By default, when we wrap a string, it is origined. Hence, we also need a way to remove the origin from a string. One way is to simply return a `str` instance as above. However, one may sometimes wish to remove the origin from an existing instance. This is accomplished with `unorigin()`, which simply sets all origin indexes to `None`. It comes with a companion method, `has_origin()`, which checks whether a `tstr` instance is currently origined. A third method, `origin_in()`, checks whether all origins of a string are contained in those of another.

In [None]:
class tstr(tstr):
    def unorigin(self):
        self.origin = [None] * len(self)
        return self

    def has_origin(self):
        return any(True for i in self.origin if i is not None)

    def origin_in(self, gsentence):
        return set(self.origin) <= set(gsentence.origin)

In [None]:
thw = tstr('hello world')
thw.unorigin()
assert not thw.has_origin()

While basic origined string creation works, origins are not yet propagated through string operations. For example, getting a substring does not transfer the origin from the original string.

In [None]:
with ExpectError():
    t = tstr('hello world')
    t[0:5].has_origin()

In Python, the substring as shown above is implemented using `slice`. We implement this next.

### Create

We need to create new substrings that are wrapped in `tstr`. However, we also want to allow our subclasses to create their own instances. Hence we provide a `create()` method that produces a new `tstr` instance.

In [None]:
class tstr(tstr):
    def create(self, res, origin):
        return tstr(res, origin, self)

In [None]:
thello = tstr('hello')
tworld = thello.create('world', 6)
assert (tworld.taint.origin, tworld.origin) == (
    [0, 1, 2, 3, 4], [6, 7, 8, 9, 10])

### Index

In Python, indexing is provided through `__getitem__()`. Indexing on non-negative integers is simple enough. However, there are two additional wrinkles. The first is that a negative index counts backwards from the end of the string. That is, the last character has the index `-1`.

In [None]:
class tstr(tstr):
    def __getitem__(self, key):
        res = super().__getitem__(key)
        if isinstance(key, int):
            key = len(self) + key if key < 0 else key
            return self.create(res, [self.origin[key]])
        elif isinstance(key, slice):
            return self.create(res, self.origin[key])
        else:
            assert False

In [None]:
hello = tstr('hello')
assert (hello[0], hello[-1]) == ('h', 'o')

The other wrinkle is that `__getitem__()` can accept a slice. We discuss this next.

### Slice

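In Python, the `slice` operator `[n:m]` is passed to `__getitem__()` as a `slice` object, which our implementation above already handles. Iterating over a string, however, does not go through our `__getitem__()` by default. Hence, we define the `__iter__()` method, which returns a custom `iterator` that yields `tstr` characters.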

In [None]:
class tstr(tstr):
    def __iter__(self):
        return tstr_iterator(self)

#### The iterator class
The `__iter__()` method requires a supporting `iterator` object. The `iterator` is used to save the state of the current iteration, which it does by keeping a reference to the original `tstr`, and the current index of iteration `_str_idx`.

In [None]:
class tstr_iterator():
    def __init__(self, tstr):
        self._tstr = tstr
        self._str_idx = 0

    def __next__(self):
        if self._str_idx == len(self._tstr):
            raise StopIteration
        # calls tstr getitem should be tstr
        c = self._tstr[self._str_idx]
        assert isinstance(c, tstr)
        self._str_idx += 1
        return c

Bringing all these together:

In [None]:
thw = tstr('hello world')
assert thw[0:5].has_origin()
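
To see what `__getitem__()` receives for the two kinds of keys, here is a small sketch. The class `KeyProbe` is a hypothetical helper that returns whatever key the subscript passes in.

```python
class KeyProbe:
    # Hypothetical helper: returns whatever key the subscript passes in.
    def __getitem__(self, key):
        return key


probe = KeyProbe()
int_key = probe[-1]      # a plain integer
slice_key = probe[0:5]   # a slice object

# Applying the same slice to a list of origins keeps exactly the
# origins of the selected characters.
origins = list(range(11))
selected = origins[slice_key]
```

This is why applying the key to `self.origin` is enough to carry the origins of a substring along.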

### Concatenation

If two origined strings are concatenated together, it may be desirable to transfer the origins from each to the corresponding portion of the resulting string. The concatenation of strings is accomplished by overriding `__add__()`.

In [None]:
class tstr(tstr):
    def __add__(self, other):
        if isinstance(other, tstr):
            return self.create(str.__add__(self, other),
                               (self.origin + other.origin))
        else:
            return self.create(str.__add__(self, other),
                               (self.origin + [None for i in other]))

Testing concatenations between two `tstr` instances:

In [None]:
thello = tstr("hello")
tworld = tstr("world", origin=6)
thw = thello + tworld
assert thw.origin == [0, 1, 2, 3, 4, 6, 7, 8, 9, 10]

What if a `tstr` is concatenated with a `str`? The characters of the plain `str` carry no origin, which we mark with `None`, as in `unorigin()`.

In [None]:
space = "  "
th_w = thello + space + tworld
assert th_w.origin == [0, 1, 2, 3, 4, None, None, 6, 7, 8, 9, 10]

One wrinkle here is that when adding a `tstr` and a `str`, the user may place the `str` first, in which case one would expect `__add__()` to be called on the `str` instance rather than on the `tstr` instance. However, Python provides a solution: because `tstr` is a subclass of `str`, a `__radd__()` method defined on `tstr` is called in preference to `str.__add__()`.

In [None]:
class tstr(tstr):
    def __radd__(self, other):
        origin = other.origin if isinstance(other, tstr) else [
            None for i in other]
        return self.create(str.__add__(other, self), (origin + self.origin))

We test it out:

In [None]:
shello = "hello"
tworld = tstr("world")
thw = shello + tworld
assert thw.origin == [None, None, None, None, None, 0, 1, 2, 3, 4]
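
The dispatch rule can be observed in isolation. In this sketch, `Tagged` is a hypothetical `str` subclass that merely reports which method handled the `+`.

```python
class Tagged(str):
    # Hypothetical str subclass that records which method handled `+`.
    def __add__(self, other):
        return ('__add__', str.__add__(self, other))

    def __radd__(self, other):
        return ('__radd__', str.__add__(other, self))


left = Tagged('a') + 'b'    # the left operand is a Tagged instance
right = 'a' + Tagged('b')   # the right operand is a Tagged instance
```

In the second case the subclass gets to handle the operation even though the left operand is a plain `str`.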

These two operations, slicing and concatenation, are sufficient to implement other string methods that result in a string and do not change the characters underneath (i.e., no case change). Before turning to these, we look at a helper method.

### Extract Origined String

Given a specific input index, the method `x()` extracts the corresponding origined portion from a `tstr`. As a convenience it supports `slices` along with `ints`.

In [None]:
class tstr(tstr):
    class TaintException(Exception):
        pass

    def x(self, i=0):
        if not self.origin:
            raise tstr.TaintException('Invalid request idx')
        if isinstance(i, int):
            return [self[p]
                    for p in [k for k, j in enumerate(self.origin) if j == i]]
        elif isinstance(i, slice):
            r = range(i.start or 0, i.stop or len(self), i.step or 1)
            return [self[p]
                    for p in [k for k, j in enumerate(self.origin) if j in r]]

In [None]:
thw = tstr('hello world', origin=100)

In [None]:
assert thw.x(101) == ['e']

In [None]:
assert thw.x(slice(101, 105)) == ['e', 'l', 'l', 'o']

### Replace

The `replace()` method replaces a portion of the string with another.

In [None]:
class tstr(tstr):
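    def replace(self, a, b, n=None):
        old_origin = self.origin
        # A plain str replacement carries no origin
        b_origin = b.origin if isinstance(b, tstr) else [None] * len(b)
        mystr = str(self)
        i = 0
        start = 0  # resume the search after the previous replacement
        while True:
            if n and i >= n:
                break
            idx = mystr.find(a, start)
            if idx == -1:
                break
            last = idx + len(a)
            # Replace exactly the occurrence found at idx
            mystr = mystr[:idx] + str(b) + mystr[last:]
            partA, partB = old_origin[0:idx], old_origin[last:]
            old_origin = partA + b_origin + partB
            # Skip the inserted text, so that a replacement containing
            # the pattern is not replaced again
            start = idx + len(b)
            i += 1
        return self.create(mystr, old_origin)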

In [None]:
my_str = tstr("aa cde aa")
res = my_str.replace('aa', 'bb')
assert (res, res.origin) == ('bb cde bb',
                             [None, None, 2, 3, 4, 5, 6, None, None])

In [None]:
my_str = tstr("aa cde aa")
res = my_str.replace('aa', tstr('bb', origin=100))
assert (res, res.origin) == (('bb cde bb'), [100, 101, 2, 3, 4, 5, 6, 100, 101])

### Split

We essentially have to re-implement split operations, and split by space is slightly different from other splits.
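
The difference is already visible on plain `str`, as this short sketch shows (variable names are illustrative):

```python
text = 'ab   cd '

# With an explicit separator, every occurrence splits, so runs of
# spaces yield empty strings and each separator has length len(sep).
by_space = text.split(' ')

# Without a separator, a run of whitespace counts as one separator of
# varying length, and leading or trailing whitespace is dropped.
by_whitespace = text.split()
```

With an explicit separator, the position of each piece follows from the fixed separator length; without one, the length of each whitespace run has to be measured.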

In [None]:
class tstr(tstr):
    def _split_helper(self, sep, splitted):
        result_list = []
        last_idx = 0
        first_idx = 0
        sep_len = len(sep)

        for s in splitted:
            last_idx = first_idx + len(s)
            item = self[first_idx:last_idx]
            result_list.append(item)
            first_idx = last_idx + sep_len
        return result_list

    def _split_space(self, splitted):
        result_list = []
        last_idx = 0
        first_idx = 0
        sep_len = 0
        for s in splitted:
            last_idx = first_idx + len(s)
            item = self[first_idx:last_idx]
            result_list.append(item)
            v = str(self[last_idx:])
            sep_len = len(v) - len(v.lstrip(' '))
            first_idx = last_idx + sep_len
        return result_list

    def rsplit(self, sep=None, maxsplit=-1):
        splitted = super().rsplit(sep, maxsplit)
        if not sep:
            return self._split_space(splitted)
        return self._split_helper(sep, splitted)

    def split(self, sep=None, maxsplit=-1):
        splitted = super().split(sep, maxsplit)
        if not sep:
            return self._split_space(splitted)
        return self._split_helper(sep, splitted)

In [None]:
my_str = tstr('ab cdef ghij kl')
ab, cdef, ghij, kl = my_str.rsplit(sep=' ')
assert (ab.origin, cdef.origin, ghij.origin,
        kl.origin) == ([0, 1], [3, 4, 5, 6], [8, 9, 10, 11], [13, 14])

my_str = tstr('ab cdef ghij kl', origin=list(range(0, 15)))
ab, cdef, ghij, kl = my_str.rsplit(sep=' ')
assert(ab.origin, cdef.origin, kl.origin) == ([0, 1], [3, 4, 5, 6], [13, 14])

In [None]:
my_str = tstr('ab   cdef ghij    kl', origin=100)
ab, cdef, ghij, kl = my_str.rsplit()
assert (ab.origin, cdef.origin, ghij.origin,
        kl.origin) == ([100, 101], [105, 106, 107, 108], [110, 111, 112, 113],
                      [118, 119])

my_str = tstr('ab   cdef ghij    kl', origin=list(range(0, 20)))
ab, cdef, ghij, kl = my_str.split()
assert (ab.origin, cdef.origin, kl.origin) == ([0, 1], [5, 6, 7, 8], [18, 19])

### Strip

In [None]:
class tstr(tstr):
    def strip(self, cl=None):
        return self.lstrip(cl).rstrip(cl)

    def lstrip(self, cl=None):
        res = super().lstrip(cl)
        # res is a suffix of self; find() would return 0 for an empty res
        i = len(self) - len(res)
        return self[i:]

    def rstrip(self, cl=None):
        res = super().rstrip(cl)
        return self[0:len(res)]


In [None]:
my_str1 = tstr("  abc  ")
v = my_str1.strip()
assert (v, v.origin) == ('abc', [2, 3, 4])

In [None]:
my_str1 = tstr("  abc  ")
v = my_str1.lstrip()
assert (v, v.origin) == ('abc  ', [2, 3, 4, 5, 6])

In [None]:
my_str1 = tstr("  abc  ")
v = my_str1.rstrip()
assert (v, v.origin) == ('  abc', [0, 1, 2, 3, 4])

### Expand Tabs

In [None]:
class tstr(tstr):
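    def expandtabs(self, n=8):
        parts = self.split('\t')
        res = super().expandtabs(n)
        all_parts = []
        for i, p in enumerate(parts):
            all_parts.extend(p.origin)
            if i < len(parts) - 1:
                # Spaces needed to reach the next tab stop from the
                # current column (assumes a single line of text)
                l = n - len(all_parts) % n
                # The padding inherits the origin of the preceding
                # character, if there is one
                fill = p.origin[-1] if p.origin else None
                all_parts.extend([fill] * l)
        return self.create(res, all_parts)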

In [None]:
my_str = str("ab\tcd")
my_tstr = tstr("ab\tcd")
v1 = my_str.expandtabs(4)
v2 = my_tstr.expandtabs(4)
assert str(v1) == str(v2)
assert (len(v1), repr(v2), v2.origin) == (6, "'ab  cd'", [0, 1, 1, 1, 3, 4])
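
For reference, the number of spaces that `str.expandtabs()` inserts for a tab depends on the current column. The sketch below checks this arithmetic against plain `str` for a single line of text; `tab_padding` is a hypothetical helper, not part of `tstr`.

```python
def tab_padding(column, tabsize):
    # Number of spaces needed to advance from `column` to the next tab stop.
    return tabsize - (column % tabsize)


# Compare against str.expandtabs() for a few prefix lengths.
checks = []
for prefix in ['', 'a', 'ab', 'abc', 'abcd']:
    expanded = (prefix + '\tX').expandtabs(4)
    spaces = len(expanded) - len(prefix) - 1
    checks.append(spaces == tab_padding(len(prefix), 4))
```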

### Join

In [None]:
class tstr(tstr):
    def join(self, iterable):
        mystr = ''
        myorigin = []
        sep_origin = self.origin
        lst = list(iterable)
        for i, s in enumerate(lst):
            sorigin = s.origin if isinstance(s, tstr) else [None] * len(s)
            myorigin.extend(sorigin)
            mystr += str(s)
            if i < len(lst) - 1:
                myorigin.extend(sep_origin)
                mystr += str(self)
        # Use the materialized list: a generator would already be exhausted
        res = super().join(lst)
        assert len(res) == len(mystr)
        return self.create(res, myorigin)

In [None]:
my_str = tstr("ab cd", origin=100)
(v1, v2), v3 = my_str.split(), 'ef'
assert (v1.origin, v2.origin) == ([100, 101], [103, 104])
v4 = tstr('').join([v2, v3, v1])
assert (v4, v4.origin) == ('cdefab', [103, 104, None, None, 100, 101])

In [None]:
my_str = tstr("ab cd", origin=100)
(v1, v2), v3 = my_str.split(), 'ef'
assert (v1.origin, v2.origin) == ([100, 101], [103, 104])
v4 = tstr(',').join([v2, v3, v1])
assert (v4, v4.origin) == ('cd,ef,ab', [103, 104, 0, None, None, 0, 100, 101])

### Partitions

In [None]:
class tstr(tstr):
    def partition(self, sep):
        partA, sep, partB = super().partition(sep)
        return (self.create(partA, self.origin[0:len(partA)]),
                self.create(sep, self.origin[len(partA):len(partA) + len(sep)]),
                self.create(partB, self.origin[len(partA) + len(sep):]))

    def rpartition(self, sep):
        partA, sep, partB = super().rpartition(sep)
        return (self.create(partA, self.origin[0:len(partA)]),
                self.create(sep, self.origin[len(partA):len(partA) + len(sep)]),
                self.create(partB, self.origin[len(partA) + len(sep):]))

### Justify

In [None]:
class tstr(tstr):
    def ljust(self, width, fillchar=' '):
        res = super().ljust(width, fillchar)
        padding = len(res) - len(self)
        # Padding inherits the origin of the fill character, if it has one
        t = fillchar.origin[0] if isinstance(fillchar, tstr) else None
        # ljust() pads on the right
        return self.create(res, self.origin + [t] * padding)

    def rjust(self, width, fillchar=' '):
        res = super().rjust(width, fillchar)
        padding = len(res) - len(self)
        # Padding inherits the origin of the fill character, if it has one
        t = fillchar.origin[0] if isinstance(fillchar, tstr) else None
        # rjust() pads on the left
        return self.create(res, [t] * padding + self.origin)

### String methods that do not change origin

In [None]:
class tstr(tstr):
    def swapcase(self):
        return self.create(str(self).swapcase(), self.origin)

    def upper(self):
        return self.create(str(self).upper(), self.origin)

    def lower(self):
        return self.create(str(self).lower(), self.origin)

    def capitalize(self):
        return self.create(str(self).capitalize(), self.origin)

    def title(self):
        return self.create(str(self).title(), self.origin)

In [None]:
a = tstr('aa', origin=100).upper()
a, a.origin

In [None]:
def origin_include(gword, gsentence):
    return set(gword.origin) <= set(gsentence.origin)

### General wrappers

These are not strictly needed for operation, but can be useful for tracing.

In [None]:
def make_str_wrapper(fun):
    def proxy(*args, **kwargs):
        res = fun(*args, **kwargs)
        return res
    return proxy

In [None]:
import types
tstr_members = [name for name, fn in inspect.getmembers(tstr, callable)
                if isinstance(fn, types.FunctionType) and fn.__qualname__.startswith('tstr')]

for name, fn in inspect.getmembers(str, callable):
    if name not in set(['__class__', '__new__', '__str__', '__init__',
                        '__repr__', '__getattribute__']) | set(tstr_members):
        setattr(tstr, name, make_str_wrapper(fn))
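
As a sketch of how such a proxy could be used for tracing, the variant below records the name of each wrapped method as it is called. The names `calls`, `make_tracing_wrapper`, and `TracedStr` are hypothetical, and only two methods are wrapped for illustration.

```python
calls = []  # hypothetical trace log


def make_tracing_wrapper(fun):
    # Like a pass-through proxy, but records the name of each call.
    def proxy(*args, **kwargs):
        calls.append(fun.__name__)
        return fun(*args, **kwargs)
    return proxy


class TracedStr(str):
    pass


# Wrap two str methods for illustration.
for name in ['upper', 'startswith']:
    setattr(TracedStr, name, make_tracing_wrapper(getattr(str, name)))

result = TracedStr('abc').upper()
starts = TracedStr('abc').startswith('a')
```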

### Methods yet to be translated

These methods generate strings from other strings. However, we do not have the right implementations for any of these. Hence these are marked as dangerous until we can generate the right translations.

In [None]:
def make_str_abort_wrapper(fun):
    def proxy(*args, **kwargs):
        raise tstr.TaintException('%s Not implemented in TSTR' % fun.__name__)
    return proxy

In [None]:
for name, fn in inspect.getmembers(str, callable):
    if name in ['__format__', '__rmod__', '__mod__', 'format_map', 'format',
                '__mul__', '__rmul__', 'center', 'zfill', 'decode', 'encode', 'splitlines']:
        setattr(tstr, name, make_str_abort_wrapper(fn))

## Lessons Learned

* One can track the information flow from input to the internals of a system.

## Background

\cite{Lin2008}
