# Tracking Information Flow

In this chapter, we explore in depth how to track information flows in Python by tagging input strings with their origin, and tracking that origin across string operations.

Some material on `eval` exploitation is adapted from the excellent [blog post](https://nedbatchelder.com/blog/201206/eval_really_is_dangerous.html) by Ned Batchelder.

**Prerequisites**

* You should have read the [chapter on coverage](Coverage.ipynb).
* Some knowledge of inheritance in Python is required.

We first set up our infrastructure so that we can make use of previously defined functions.

In [None]:
import fuzzingbook_utils

In [None]:
import Fuzzer

In [None]:
from ExpectError import ExpectError

Say we want to implement a *calculator* service in Python. A rather easy way to do that is to rely on the `eval()` function in Python. However, unrestricted `eval()` can be used by users to execute arbitrary commands. Since we want to restrict our users to using only the *calculator* functionality, and do not want the users to trash our server, we use `eval()` with empty `locals` and `globals` (as recommended [elsewhere](https://www.programiz.com/python-programming/methods/built-in/eval)).

In [None]:
def my_calculator(my_input):
    result = eval(my_input, {}, {})
    print("The result of %s is %d" % (my_input, result))

It works as expected:

In [None]:
my_calculator('1+2')

Does it?

In [None]:
with ExpectError():
    my_calculator('__import__("os").popen("ls").read()')

As the error shows, `eval()` itself completed: the system command `ls` ran, and only the `%d` formatting of its (string) output failed. It is easy enough for the user to see the output if needed.

In [None]:
my_calculator(
    "1 if __builtins__['print'](__import__('os').popen('pwd').read()) else 0")

The problem is that the Python `__builtins__` is [inserted by default](https://docs.python.org/3/library/functions.html#eval) when one uses `eval()`. We can avoid this by restricting `__builtins__` in `eval` explicitly (again as recommended [elsewhere](http://lybniz2.sourceforge.net/safeeval.html)).

In [None]:
def my_calculator(my_input):
    result = eval(my_input, {"__builtins__": None}, {})
    print("The result of %s was %d" % (my_input, result))
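
The default insertion of `__builtins__` can be observed in isolation. The following is a minimal sketch; the names `env` and `restricted` are illustrative and not part of the chapter's code. It checks that `eval()` adds a `__builtins__` entry to an empty globals dictionary, and that plain arithmetic still evaluates when `__builtins__` is set to `None`.

```python
# Illustrative names only; not part of the chapter's code.
env = {}
eval("1 + 2", env, {})

# eval() has added a '__builtins__' entry to the (previously empty) globals
builtins_inserted = "__builtins__" in env

# With '__builtins__' explicitly set, eval() leaves the entry alone;
# constant arithmetic needs no name lookup and still works
restricted = {"__builtins__": None}
arithmetic_result = eval("6 * 7", restricted, {})
```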

Does it help?

In [None]:
with ExpectError():
    my_calculator(
        "1 if __builtins__['print'](__import__('os').popen('pwd').read()) else 0")

But does it actually?

In [None]:
my_calculator("1 if [x['print'](x['__import__']('os').popen('pwd').read()) for x in ([x for x in (1).__class__.__base__.__subclasses__() if x.__name__ == 'Sized'][0].__len__.__globals__['__builtins__'],)] else 0")

The problem here is that when the user has a way to inject **uninterpreted strings** that can reach a dangerous routine such as `eval()` or `exec()`, they can inject dangerous code. What we need is a way to prevent uninterpreted input string fragments from reaching dangerous portions of code.

## Tracking Taints

There are various levels of origin tracking that one can perform. The simplest is to track that a string fragment originated in a specific environment, and has not undergone a taint removal process. For this, we simply wrap the original string together with an environment identifier (the _taint_) in a `tstr` object, and produce `tstr` instances on each operation that results in another string fragment.

The attribute `taint` holds a label identifying the environment from which this instance was derived.

For capturing information flows we need a new string class. The idea is to use the new tainted string class `tstr` as a wrapper on the original `str` class. However, `str` is an *immutable* class: its value is fixed when the object is created in `__new__()`, and `str.__new__()` does not accept extra arguments such as `taint`. If we want to pass additional arguments to our constructor, we need to [hook into `__new__()`](https://docs.python.org/3/reference/datamodel.html#basic-customization) and return an instance of our own class, handing only the string value to `str`. We combine this with our initialization code in `__init__()`, which stores the taint.

In [None]:
class tstr(str):
    def __new__(cls, value, *args, **kw):
        return str.__new__(cls, value)

    def __init__(self, value, taint=None, **kwargs):
        self.taint = taint
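
To see why `__new__()` needs to be overridden, here is a minimal sketch using two hypothetical classes, `NaiveTagged` and `Tagged`, which are not part of this chapter's code. The first overrides only `__init__()`, so the extra `taint` keyword reaches `str.__new__()` and is rejected; the second filters the arguments in `__new__()` first.

```python
class NaiveTagged(str):
    # Hypothetical class: overrides __init__() only
    def __init__(self, value, taint=None):
        self.taint = taint


try:
    NaiveTagged('hello', taint='LOW')
    rejected = False
except TypeError:
    # str.__new__() does not know the 'taint' keyword
    rejected = True


class Tagged(str):
    # Hypothetical class: passes only the value on to str.__new__()
    def __new__(cls, value, *args, **kw):
        return str.__new__(cls, value)

    def __init__(self, value, taint=None, **kwargs):
        self.taint = taint


tagged = Tagged('hello', taint='LOW')
```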

In [None]:
class tstr(tstr):
    def __repr__(self):
        return tstr(str.__repr__(self), taint=self.taint)

In [None]:
class tstr(tstr):
    def __str__(self):
        return str.__str__(self)

For example, if we wrap `"hello"` in `tstr`, then we should be able to access its taint:

In [None]:
thello = tstr('hello', taint='LOW')

In [None]:
thello.taint

In [None]:
repr(thello).taint

By default, when we wrap a string, it is tainted. Hence we also need a way to untaint the string. One way is to simply return a `str` instance as above. However, one may sometimes wish to remove the taint from an existing instance. This is accomplished with `clear_taint()`, which simply sets the taint to `None`. It comes with a companion method `has_taint()`, which checks whether a `tstr` instance is currently tainted.

In [None]:
class tstr(tstr):
    def clear_taint(self):
        self.taint = None
        return self

    def has_taint(self):
        return self.taint is not None

### String Operators

To propagate the taint, we have to extend string functions, such as operators.  We can do so in one single big step, overloading all string methods and operators.

When we create a new string from an existing tainted string, we propagate its taint.

In [None]:
class tstr(tstr):
    def create(self, s):
        # print("New tstr from", repr(s))
        return tstr(s, taint=self.taint)

The `make_str_wrapper()` function creates a wrapper around an existing string method which attaches the taint to the result of the method:

In [None]:
def make_str_wrapper(fun):
    def proxy(self, *args, **kwargs):
        res = fun(self, *args, **kwargs)
        # print(fun, args, kwargs, "=", repr(res))
        return self.create(res)
    return proxy

We do this for all string methods that return a string:

In [None]:
# 'encode' is left out: it returns bytes, not a string
for name in ['__format__', '__getitem__', '__add__', '__mul__', '__rmul__',
             'capitalize', 'casefold', 'center',
             'expandtabs', 'format', 'format_map', 'join', 'ljust', 'lower',
             'lstrip', 'replace', 'rjust', 'rstrip', 'strip', 'swapcase',
             'title', 'translate', 'upper']:
    fun = getattr(str, name)
    setattr(tstr, name, make_str_wrapper(fun))

The one missing operator is `+` with a regular string on the left side and a tainted string on the right side.  Python supports a `__radd__()` method which is invoked if the associated object is used on the right side of an addition.

In [None]:
class tstr(tstr):
    def __radd__(self, s):
        # print("__radd__", repr(s))
        return self.create(s + str(self))
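
The dispatch to `__radd__()` can be checked in isolation. The sketch below uses a hypothetical `str` subclass named `Tagged` (not part of the chapter's code) and assumes nothing beyond standard Python operator dispatch: with a plain `str` on the left and an instance of the subclass on the right, the subclass's `__radd__()` produces the result.

```python
class Tagged(str):
    # Hypothetical subclass, for illustration only
    def __radd__(self, other):
        # 'other' is the left operand; str(self) yields a plain str
        return Tagged(other + str(self))


result = 'foo' + Tagged('bar')
```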

With this, we are already done.  Let us create a string `thello` with a taint `LOW`.

In [None]:
thello = tstr('hello', taint='LOW')

Now, any substring will also be tainted:

In [None]:
thello[0].taint

In [None]:
thello[1:3].taint

String additions will return a `tstr` object with the taint:

In [None]:
(tstr('foo', taint='HIGH') + 'bar').taint

Our `__radd__()` method ensures this also works if the `tstr` occurs on the right side of a string addition:

In [None]:
('foo' + tstr('bar', taint='HIGH')).taint

In [None]:
thello += ', world'

In [None]:
thello.taint

Other operators such as multiplication also work:

In [None]:
(thello * 5).taint

## Applications

So, what can one do with tainted strings?

### Tracking Untrusted Input

We reconsider the `my_calculator()` example.  We define a "better" calculator which only accepts strings tainted as `"TRUSTED"`.

In [None]:
def better_calculator(s):
    assert isinstance(s, tstr), "Need a tainted string"
    assert s.taint == 'TRUSTED', "Need a string with trusted taint"
    return my_calculator(s)

Feeding a string with an "unknown" (i.e., non-existing) trust level will cause `better_calculator()` to fail:

In [None]:
with ExpectError():
    better_calculator("2 + 2")

Additionally, any user input would be originally tagged with `"UNTRUSTED"` as taint.  If we place an untrusted string into our better calculator, it will also fail:

In [None]:
bad_user_input = tstr('__import__("os").popen("ls").read()', taint='UNTRUSTED')
with ExpectError():
    better_calculator(bad_user_input)

Hence, somewhere along the computation, we have to turn the "untrusted" inputs into "trusted" strings.  This process is called *sanitization*.  A simple sanitization function for our purposes could ensure that the input consists only of a few allowed characters (not including letters or quotes); if this is the case, then the input gets a new `"TRUSTED"` taint.  If not, we turn the string into an (untrusted) empty string; other alternatives would be to raise an error or to escape or delete "untrusted" characters.

In [None]:
import re

In [None]:
def sanitize(user_input):
    assert isinstance(user_input, tstr)
    if re.match(r'^[-0-9.+*/%() ]*$', user_input):
        return tstr(user_input, taint='TRUSTED')
    else:
        return tstr('', taint='UNTRUSTED')
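
One detail of the pattern is worth checking: in Python's `re` module, `$` also matches just before a trailing newline, so an input ending in `'\n'` passes a `^...$` check. Whether that matters depends on what consumes the sanitized string. As a sketch of a stricter alternative (the name `ALLOWED` is illustrative), `re.fullmatch()` requires the whole string to match:

```python
import re

# Illustrative name; the character class is the one used in sanitize()
ALLOWED = r'[-0-9.+*/%() ]*'

# '$' matches before a trailing newline, so this input is accepted
lenient = re.match('^' + ALLOWED + '$', '2 + 2\n') is not None

# fullmatch() requires every character to be in the allowed set
strict = re.fullmatch(ALLOWED, '2 + 2\n') is not None
strict_clean = re.fullmatch(ALLOWED, '2 + 2') is not None
```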

In [None]:
good_user_input = tstr("2 + 2", taint='UNTRUSTED')
sanitized_input = sanitize(good_user_input)
sanitized_input

In [None]:
sanitized_input.taint

In [None]:
better_calculator(sanitized_input)

Let us now try out our untrusted input:

In [None]:
sanitized_input = sanitize(bad_user_input)
sanitized_input

In [None]:
sanitized_input.taint

In [None]:
with ExpectError():
    better_calculator(sanitized_input)

In a similar fashion, we can prevent SQL and code injections discussed in [the chapter on Web fuzzing](WebFuzzer.ipynb).

### Preventing Privacy Leaks

Using taints, we can also ensure that secret information does not leak out.  We can assign a special taint `"SECRET"` to strings whose information must not leak out:

In [None]:
secrets = tstr('<Plenty of secret keys>', taint='SECRET')

Accessing any substring of `secrets` will propagate the taint:

In [None]:
secrets[1:3].taint

Consider the _heartbeat_ security leak from [the chapter on Fuzzing](Fuzzer.ipynb), in which a server would accidentally reply with not only the user input sent to it, but also secret memory.  If the reply consists only of the user input, there is no taint associated with it:

In [None]:
user_input = "hello"
reply = user_input

In [None]:
isinstance(reply, tstr)

If, however, the reply contains _any_ part of the secret, the reply will be tainted:

In [None]:
reply = user_input + secrets[0:5]

In [None]:
reply

In [None]:
reply.taint

The output function of our server would now ensure that the data sent back does not contain any secret information:

In [None]:
def send_back(s):
    # Plain strings carry no taint; tainted strings must not be SECRET
    assert not isinstance(s, tstr) or s.taint != 'SECRET'
    ...

In [None]:
with ExpectError():
    send_back(reply)

## Tracking Origins

Our `tstr` solution can help to identify information leaks, but it is by no means complete.  If we actually take the `heartbeat()` implementation from [the chapter on Fuzzing](Fuzzer.ipynb), we will see that _any_ reply is marked as `SECRET`, even those not accessing secret memory at all:

In [None]:
from Fuzzer import heartbeat

In [None]:
reply = heartbeat('hello', 5, memory=secrets)

In [None]:
reply.taint

Why is this?  If we look into the implementation of `heartbeat()`, we will see that it first builds a long string `memory` from the (non-secret) reply and the (secret) memory, before returning the first characters from `memory`.
```python
    # Store reply in memory
    memory = reply + memory[len(reply):]
```
At this point, the whole memory still is tainted as `SECRET`, _including_ the non-secret part from `reply`.

We may be able to circumvent the issue by tagging the `reply` as `PUBLIC` – but then, this taint would be in conflict with the `SECRET` tag of `memory`.  What happens if we compose a string from two differently tainted strings?

In [None]:
thilo = tstr("High", taint='HIGH') + tstr("Low", taint='LOW')

It turns out that in this case, the `__add__()` method takes precedence over the `__radd__()` method, which means that the right-hand `"Low"` string is treated as a regular (non-tainted) string.

In [None]:
thilo

In [None]:
thilo.taint

We could set up the `__add__()` and other methods with special handling for conflicting taints.  However, the way this conflict should be resolved would be highly _application-dependent_:

* If we use taints to indicate _privacy levels_, `SECRET` privacy should take precedence over `PUBLIC` privacy.  Any combination of a `SECRET`-tainted string and a `PUBLIC`-tainted string thus should have a `SECRET` taint.

* If we use taints to indicate _origins_ of information, an `UNTRUSTED` origin should take precedence over a `TRUSTED` origin.  Any combination of an `UNTRUSTED`-tainted string and a `TRUSTED`-tainted string thus should have an `UNTRUSTED` taint.

Of course, such conflict resolutions can be implemented.  But even so, they will not help us in the `heartbeat()` example differentiating secret from non-secret output data.
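
As a rough sketch of what such a resolution could look like, the hypothetical helper `combine_taints()` below picks the taint with the higher precedence from an application-supplied ordering. Both the helper and the two orderings are assumptions for illustration; they are not part of `tstr`.

```python
# Hypothetical precedence orders: later entries take precedence
PRIVACY_ORDER = ['PUBLIC', 'SECRET']
TRUST_ORDER = ['TRUSTED', 'UNTRUSTED']


def combine_taints(a, b, order):
    """Return the taint of a string composed from taints a and b.
    None stands for 'no taint'."""
    if a is None:
        return b
    if b is None:
        return a
    # The taint that appears later in 'order' wins
    return max(a, b, key=order.index)
```

With these orderings, `combine_taints('PUBLIC', 'SECRET', PRIVACY_ORDER)` yields `'SECRET'`, and combining `'TRUSTED'` with `'UNTRUSTED'` yields `'UNTRUSTED'`.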

Fortunately, there is a better, more generic way to achieve this.  The key to composition of differently tainted strings is to assign taints not only to strings, but actually to every bit of information – in our case, characters.  If every character has a taint on its own, a new composition of characters will simply inherit this very taint _per character_.  To this end, we introduce a second bit of information named _origin_.

Distinguishing various untrusted sources may be accomplished by tagging each source with its own origin (called *colors* in dynamic taint research). You will see an instance of this technique in the chapter on [Grammar Mining](GrammarMiner.ipynb).

In this chapter, we carry *character level* origins. That is, given a fragment that resulted from a portion of the original origin-tagged string, one will be able to tell which portion of the input string the fragment was taken from. In essence, each input character index from an origin-tagged source gets its own color.

More complex origin schemes such as *bitmap origins* are possible, where a single character may result from multiple origin-tagged character indexes (such as *checksum* operations on strings). We do not consider these in this chapter.

The _origin_ of a character indicates its source.  It is a consecutive number in a particular range (by default, starting with zero) indicating its _position_ within a specific origin.

In [None]:
class ostr(str):
    def __new__(cls, value, *args, **kw):
        return str.__new__(cls, value)

    def __init__(self, value, taint=None, origin=None, **kwargs):
        self.taint = taint

        if origin is None:
            origin = 0
        if isinstance(origin, int):
            self.origin = list(range(origin, origin + len(self)))
        else:
            self.origin = origin
        assert len(self.origin) == len(self)

In [None]:
class ostr(ostr):
    def create(self, s):
        return ostr(s, taint=self.taint, origin=self.origin)

In [None]:
class ostr(ostr):
    def __repr__(self):
        return ostr(str.__repr__(self), taint=self.taint, origin=[None] + self.origin + [None])

In [None]:
class ostr(ostr):
    def __str__(self):
        return str.__str__(self)

By default, character origins start with `0`:

In [None]:
thello = ostr('hello')
assert thello.origin == [0, 1, 2, 3, 4]

We can also specify the starting origin. Here, the origins of the five characters are `6` to `10`:

In [None]:
tworld = ostr('world', origin=6)
assert tworld.origin == [6, 7, 8, 9, 10]

`str()` returns a `str` instance without origin or taint information:

In [None]:
assert type(str(thello)) == str

`repr()`, however, keeps the origin information for the original string:

In [None]:
repr(thello)

In [None]:
repr(thello).origin

Just as with taints, we can clear origins and check whether an origin is present:

In [None]:
class ostr(ostr):
    def clear_taint(self):
        self.taint = None
        return self

    def has_taint(self):
        return self.taint is not None

In [None]:
class ostr(ostr):
    def clear_origin(self):
        self.origin = [None] * len(self)
        return self

    def has_origin(self):
        return any(True for i in self.origin if i is not None)

### Create

We need to create new substrings that are wrapped in `ostr` objects. However, we also want to allow our subclasses to create their own instances. Hence we again provide a `create()` method that produces a new `ostr` instance.

In [None]:
class ostr(ostr):
    def create(self, res, origin=None):
        return ostr(res, taint=self.taint, origin=origin)

In [None]:
thello = ostr('hello', taint='HIGH')
tworld = thello.create('world', origin=6)

In [None]:
tworld.origin

In [None]:
tworld.taint

In [None]:
assert (thello.origin, tworld.origin) == (
    [0, 1, 2, 3, 4], [6, 7, 8, 9, 10])

### Index

In Python, indexing is provided through `__getitem__()`. Indexing on non-negative integers is simple enough. However, there are two additional wrinkles. The first is that, if the index is negative, it is counted backwards from the end of the string. That is, the last character has the index `-1`.

In [None]:
class ostr(ostr):
    def __getitem__(self, key):
        res = super().__getitem__(key)
        if isinstance(key, int):
            key = len(self) + key if key < 0 else key
            return self.create(res, [self.origin[key]])
        elif isinstance(key, slice):
            return self.create(res, self.origin[key])
        else:
            assert False
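
The slice branch relies on the fact that applying the same `slice` object to the string and to its origin list selects the same positions. A small sketch on plain Python values (the names `text` and `origin` are illustrative) checks this for a few slices, including negative bounds and steps:

```python
text = 'hello world'
origin = list(range(100, 100 + len(text)))

slices = [slice(0, 5), slice(-5, None), slice(None, None, -1),
          slice(1, None, 2)]

# For each slice, pair up the selected characters with the selected origins
aligned = []
for key in slices:
    chars, origins = text[key], origin[key]
    aligned.append(len(chars) == len(origins) and
                   all(text[o - 100] == c for c, o in zip(chars, origins)))

# A negative integer index is normalized the same way as in __getitem__()
key = -1
normalized = len(text) + key if key < 0 else key
```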

In [None]:
hello = ostr('hello', taint='HIGH')
assert (hello[0], hello[-1]) == ('h', 'o')
hello[0].taint

The other wrinkle is that `__getitem__()` can accept a slice. We discuss this next.

### Slices

Slicing with `[n:m]` is already handled by `__getitem__()` above, which receives a `slice` object as its key. To make iteration over an `ostr` yield `ostr` characters as well, we define the `__iter__()` method, which returns a custom iterator.

In [None]:
class ostr(ostr):
    def __iter__(self):
        return ostr_iterator(self)

The `__iter__()` method requires a supporting `iterator` object. The `iterator` is used to save the state of the current iteration, which it does by keeping a reference to the original `ostr`, and the current index of iteration `_str_idx`.

In [None]:
class ostr_iterator():
    def __init__(self, ostr):
        self._ostr = ostr
        self._str_idx = 0

    def __next__(self):
        if self._str_idx == len(self._ostr):
            raise StopIteration
        # Indexing calls ostr.__getitem__(), so c should be an ostr
        c = self._ostr[self._str_idx]
        assert isinstance(c, ostr)
        self._str_idx += 1
        return c

Bringing all these together:

In [None]:
thw = ostr('hello world', taint='HIGH')
thw[0:5]

In [None]:
assert thw[0:5].has_taint()
assert thw[0:5].has_origin()

In [None]:
thw[0:5].taint

In [None]:
thw[0:5].origin

### Splits

In [None]:
def make_split_wrapper(fun):
    def proxy(self, *args, **kwargs):
        lst = fun(self, *args, **kwargs)
        return [self.create(elem) for elem in lst]
    return proxy

In [None]:
for name in ['split', 'rsplit', 'splitlines']:
    fun = getattr(str, name)
    setattr(ostr, name, make_split_wrapper(fun))

In [None]:
thello = ostr('hello world', taint='LOW')
thello == 'hello world'

In [None]:
thello.split()[0].taint

In [None]:
thw = ostr('hello world')
thw.clear_origin()
assert not thw.has_origin()

In [None]:
with ExpectError():
    t = ostr('hello world')
    t[0:5].has_origin()

### Concatenation

If two origin-tagged strings are concatenated, it may be desirable to transfer the origins from each to the corresponding portion of the resulting string. The concatenation of strings is accomplished by overriding `__add__()`.

In [None]:
class ostr(ostr):
    def __add__(self, other):
        if isinstance(other, ostr):
            return self.create(str.__add__(self, other),
                               (self.origin + other.origin))
        else:
            return self.create(str.__add__(self, other),
                               (self.origin + [-1 for i in other]))

In [None]:
thello = ostr("hello")
tworld = ostr("world", origin=6)
thw = thello + tworld
assert thw.origin == [0, 1, 2, 3, 4, 6, 7, 8, 9, 10]

What if an `ostr` is concatenated with a `str`?

In [None]:
space = "  "
th_w = thello + space + tworld
assert th_w.origin == [0, 1, 2, 3, 4, -1, -1, 6, 7, 8, 9, 10]

One wrinkle here is that when adding an `ostr` and a `str`, the user may place the `str` first, in which case `__add__()` would be called on the `str` instance, not on the `ostr` instance. However, Python provides a solution: if one defines `__radd__()` on `ostr`, that method is called instead of `str.__add__()`.

In [None]:
class ostr(ostr):
    def __radd__(self, other):
        origin = other.origin if isinstance(other, ostr) else [
            None for i in other]
        return self.create(str.__add__(other, self), (origin + self.origin))

We test it out:

In [None]:
shello = "hello"
tworld = ostr("world")
thw = shello + tworld
assert thw.origin == [None, None, None, None, None, 0, 1, 2, 3, 4]

These two operations, slicing and concatenation, are sufficient to implement other string methods that result in a string and do not change the characters underneath (i.e., no case change). Hence, we look at a helper method next.

### Extract Origin String

Given a specific input index, the method `x()` extracts the corresponding origin-tagged portion from an `ostr`. As a convenience, it supports `slices` along with `ints`.

In [None]:
class ostr(ostr):
    class TaintException(Exception):
        pass

    def x(self, i=0):
        if not self.origin:
            raise ostr.TaintException('Invalid request idx')
        if isinstance(i, int):
            return [self[p]
                    for p in [k for k, j in enumerate(self.origin) if j == i]]
        elif isinstance(i, slice):
            r = range(i.start or 0, i.stop or len(self), i.step or 1)
            return [self[p]
                    for p in [k for k, j in enumerate(self.origin) if j in r]]

In [None]:
thw = ostr('hello world', origin=100)

In [None]:
assert thw.x(101) == ['e']

In [None]:
assert thw.x(slice(101, 105)) == ['e', 'l', 'l', 'o']

### Replace

The `replace()` method replaces a portion of the string with another.

In [None]:
class ostr(ostr):
    def replace(self, a, b, n=None):
        old_origin = self.origin
        b_origin = b.origin if isinstance(b, ostr) else [None] * len(b)
        mystr = str(self)
        i = 0
        while True:
            if n and i >= n:
                break
            idx = mystr.find(a)
            if idx == -1:
                break
            last = idx + len(a)
            mystr = mystr.replace(a, b, 1)
            partA, partB = old_origin[0:idx], old_origin[last:]
            old_origin = partA + b_origin + partB
            i += 1
        return self.create(mystr, old_origin)
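
The core of each loop iteration is a splice on the origin list: the origins of the matched portion are dropped and the origins of the replacement are put in their place. As a sketch on plain lists, with a hypothetical helper `splice_origin()` that is not part of `ostr`:

```python
def splice_origin(origin, idx, old_len, new_origin):
    """Replace origin[idx:idx + old_len] by new_origin (hypothetical helper)."""
    return origin[:idx] + new_origin + origin[idx + old_len:]


# 'aa cde aa' with origins 0..8; replace both 'aa' by an untracked 'bb'
origin = list(range(9))
origin = splice_origin(origin, 0, 2, [None, None])
origin = splice_origin(origin, 7, 2, [None, None])
```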

In [None]:
my_str = ostr("aa cde aa")
res = my_str.replace('aa', 'bb')
assert (res, res.origin) == ('bb cde bb',
                             [None, None, 2, 3, 4, 5, 6, None, None])

In [None]:
my_str = ostr("aa cde aa")
res = my_str.replace('aa', ostr('bb', origin=100))
assert (res, res.origin) == (('bb cde bb'), [100, 101, 2, 3, 4, 5, 6, 100, 101])

### Split

We essentially have to re-implement split operations, and split by space is slightly different from other splits.

In [None]:
class ostr(ostr):
    def _split_helper(self, sep, splitted):
        result_list = []
        last_idx = 0
        first_idx = 0
        sep_len = len(sep)

        for s in splitted:
            last_idx = first_idx + len(s)
            item = self[first_idx:last_idx]
            result_list.append(item)
            first_idx = last_idx + sep_len
        return result_list

    def _split_space(self, splitted):
        result_list = []
        last_idx = 0
        first_idx = 0
        sep_len = 0
        for s in splitted:
            last_idx = first_idx + len(s)
            item = self[first_idx:last_idx]
            result_list.append(item)
            v = str(self[last_idx:])
            sep_len = len(v) - len(v.lstrip(' '))
            first_idx = last_idx + sep_len
        return result_list

    def rsplit(self, sep=None, maxsplit=-1):
        splitted = super().rsplit(sep, maxsplit)
        if not sep:
            return self._split_space(splitted)
        return self._split_helper(sep, splitted)

    def split(self, sep=None, maxsplit=-1):
        splitted = super().split(sep, maxsplit)
        if not sep:
            return self._split_space(splitted)
        return self._split_helper(sep, splitted)
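
The bookkeeping in `_split_helper()` amounts to recomputing where each part starts and ends in the original string. The following sketch does the same on a plain `str` with an explicit separator; the function name `split_spans()` is an assumption made for illustration.

```python
def split_spans(text, sep):
    """Return (start, end) index pairs of the parts of text.split(sep).
    Hypothetical helper; sep must be a non-empty string."""
    spans = []
    start = 0
    for part in text.split(sep):
        end = start + len(part)
        spans.append((start, end))
        # Skip over the separator that followed this part
        start = end + len(sep)
    return spans


spans = split_spans('ab cdef ghij kl', ' ')
```

Slicing the original string with each span yields the corresponding part, which is why slicing an `ostr` with the same indexes yields the part together with its origins.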

In [None]:
my_str = ostr('ab cdef ghij kl')
ab, cdef, ghij, kl = my_str.rsplit(sep=' ')
assert (ab.origin, cdef.origin, ghij.origin,
        kl.origin) == ([0, 1], [3, 4, 5, 6], [8, 9, 10, 11], [13, 14])

my_str = ostr('ab cdef ghij kl', origin=list(range(0, 15)))
ab, cdef, ghij, kl = my_str.rsplit(sep=' ')
assert(ab.origin, cdef.origin, kl.origin) == ([0, 1], [3, 4, 5, 6], [13, 14])

In [None]:
my_str = ostr('ab   cdef ghij    kl', origin=100, taint='HIGH')
ab, cdef, ghij, kl = my_str.rsplit()
assert (ab.origin, cdef.origin, ghij.origin,
        kl.origin) == ([100, 101], [105, 106, 107, 108], [110, 111, 112, 113],
                      [118, 119])

my_str = ostr('ab   cdef ghij    kl', origin=list(range(0, 20)), taint='HIGH')
ab, cdef, ghij, kl = my_str.split()
assert (ab.origin, cdef.origin, kl.origin) == ([0, 1], [5, 6, 7, 8], [18, 19])
assert ab.taint == 'HIGH'

### Strip

In [None]:
class ostr(ostr):
    def strip(self, cl=None):
        return self.lstrip(cl).rstrip(cl)

    def lstrip(self, cl=None):
        res = super().lstrip(cl)
        # Number of characters stripped from the left; unlike find(),
        # this also works when everything is stripped
        i = len(self) - len(res)
        return self[i:]

    def rstrip(self, cl=None):
        res = super().rstrip(cl)
        return self[0:len(res)]


In [None]:
my_str1 = ostr("  abc  ")
v = my_str1.strip()
assert (v, v.origin) == ('abc', [2, 3, 4])

In [None]:
my_str1 = ostr("  abc  ")
v = my_str1.lstrip()
assert (v, v.origin) == ('abc  ', [2, 3, 4, 5, 6])

In [None]:
my_str1 = ostr("  abc  ")
v = my_str1.rstrip()
assert (v, v.origin) == ('  abc', [0, 1, 2, 3, 4])

### Expand Tabs

In [None]:
class ostr(ostr):
    def expandtabs(self, n=8):
        parts = self.split('\t')
        res = super().expandtabs(n)
        all_parts = []
        for i, p in enumerate(parts):
            all_parts.extend(p.origin)
            if i < len(parts) - 1:
                # A tab advances to the next multiple of n
                padding = n - len(all_parts) % n
                all_parts.extend([p.origin[-1]] * padding)
        return self.create(res, all_parts)
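
The number of spaces a tab expands to depends on the column at which it occurs: it fills up to the next multiple of the tab size. A short sketch on plain strings, with a hypothetical helper `tab_padding()`, compares this rule against `str.expandtabs()` for strings containing a single tab:

```python
def tab_padding(col, tabsize=8):
    """Number of spaces a tab at column col expands to (hypothetical helper)."""
    return tabsize - (col % tabsize)


checks = []
for text in ['ab\tcd', 'a\tb', 'abcd\te']:
    col = text.index('\t')
    expected = text[:col] + ' ' * tab_padding(col, 4) + text[col + 1:]
    checks.append(text.expandtabs(4) == expected)
```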

In [None]:
my_str = str("ab\tcd")
my_ostr = ostr("ab\tcd")
v1 = my_str.expandtabs(4)
v2 = my_ostr.expandtabs(4)
assert str(v1) == str(v2)
assert (len(v1), repr(v2), v2.origin) == (6, "'ab  cd'", [0, 1, 1, 1, 3, 4])

### Join

In [None]:
class ostr(ostr):
    def join(self, iterable):
        mystr = ''
        myorigin = []
        sep_origin = self.origin
        lst = list(iterable)
        for i, s in enumerate(lst):
            sorigin = s.origin if isinstance(s, ostr) else [None] * len(s)
            myorigin.extend(sorigin)
            mystr += str(s)
            if i < len(lst) - 1:
                myorigin.extend(sep_origin)
                mystr += str(self)
        # Use the materialized list: 'iterable' may already be exhausted
        res = super().join(lst)
        assert len(res) == len(mystr)
        return self.create(res, myorigin)

In [None]:
my_str = ostr("ab cd", origin=100)
(v1, v2), v3 = my_str.split(), 'ef'
assert (v1.origin, v2.origin) == ([100, 101], [103, 104])
v4 = ostr('').join([v2, v3, v1])
assert (v4, v4.origin) == ('cdefab', [103, 104, None, None, 100, 101])

In [None]:
my_str = ostr("ab cd", origin=100)
(v1, v2), v3 = my_str.split(), 'ef'
assert (v1.origin, v2.origin) == ([100, 101], [103, 104])
v4 = ostr(',').join([v2, v3, v1])
assert (v4, v4.origin) == ('cd,ef,ab', [103, 104, 0, None, None, 0, 100, 101])

### Partitions

In [None]:
class ostr(ostr):
    def partition(self, sep):
        partA, sep, partB = super().partition(sep)
        return (self.create(partA, self.origin[0:len(partA)]),
                self.create(sep, self.origin[len(partA):len(partA) + len(sep)]),
                self.create(partB, self.origin[len(partA) + len(sep):]))

    def rpartition(self, sep):
        partA, sep, partB = super().rpartition(sep)
        return (self.create(partA, self.origin[0:len(partA)]),
                self.create(sep, self.origin[len(partA):len(partA) + len(sep)]),
                self.create(partB, self.origin[len(partA) + len(sep):]))
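
Since the three parts returned by `partition()` always add up to the original string, their origins can be obtained by cutting the origin list at the same lengths. A sketch on plain values (the helper name `partition_origins()` is illustrative) shows this, including the case where the separator is not found:

```python
def partition_origins(text, origin, sep):
    """Split origin alongside text.partition(sep) (hypothetical helper)."""
    before, found, after = text.partition(sep)
    a, b = len(before), len(before) + len(found)
    return (before, origin[:a]), (found, origin[a:b]), (after, origin[b:])


text = 'key=value'
origin = list(range(100, 100 + len(text)))
parts = partition_origins(text, origin, '=')
missing = partition_origins(text, origin, ';')
```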

### Justify

In [None]:
class ostr(ostr):
    def ljust(self, width, fillchar=' '):
        res = super().ljust(width, fillchar)
        padding = len(res) - len(self)
        # Origin of the fill character, if it carries one; -1 otherwise
        t = fillchar.origin[0] if isinstance(fillchar, ostr) else -1
        # ljust() keeps the text on the left and pads on the right
        return self.create(res, self.origin + [t] * padding)

    def rjust(self, width, fillchar=' '):
        res = super().rjust(width, fillchar)
        padding = len(res) - len(self)
        t = fillchar.origin[0] if isinstance(fillchar, ostr) else -1
        # rjust() keeps the text on the right and pads on the left
        return self.create(res, [t] * padding + self.origin)
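
As a reminder of where the fill characters end up: `str.ljust()` pads on the right, and `str.rjust()` pads on the left. The sketch below, on plain values with an illustrative marker `FILL` for untracked fill characters, builds the matching origin lists:

```python
FILL = -1  # illustrative marker for fill characters without an origin

text = 'ab'
origin = [100, 101]

left = text.ljust(5, '*')    # text first, then padding
right = text.rjust(5, '*')   # padding first, then text

left_origin = origin + [FILL] * (len(left) - len(text))
right_origin = [FILL] * (len(right) - len(text)) + origin
```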

### String methods that do not change origin

In [None]:
class ostr(ostr):
    def swapcase(self):
        return self.create(str(self).swapcase(), self.origin)

    def upper(self):
        return self.create(str(self).upper(), self.origin)

    def lower(self):
        return self.create(str(self).lower(), self.origin)

    def capitalize(self):
        return self.create(str(self).capitalize(), self.origin)

    def title(self):
        return self.create(str(self).title(), self.origin)

In [None]:
a = ostr('aa', origin=100).upper()
a, a.origin

### General wrappers

These are not strictly needed for operation, but can be useful for tracing.

In [None]:
def make_str_wrapper(fun):
    def proxy(*args, **kwargs):
        res = fun(*args, **kwargs)
        return res
    return proxy

In [None]:
import inspect

In [None]:
import types

In [None]:
ostr_members = [name for name, fn in inspect.getmembers(ostr, callable)
                if isinstance(fn, types.FunctionType) and fn.__qualname__.startswith('ostr')]

for name, fn in inspect.getmembers(str, callable):
    if name not in set(['__class__', '__new__', '__str__', '__init__',
                        '__repr__', '__getattribute__']) | set(ostr_members):
        setattr(ostr, name, make_str_wrapper(fn))
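
To illustrate how such a pass-through wrapper could be used for tracing, here is a sketch in which the wrapper records each call in a list. The names `make_tracing_wrapper()`, `TracedStr`, and `call_log` are hypothetical and not part of the chapter's code.

```python
def make_tracing_wrapper(fun, log):
    """Wrap fun such that every call is recorded in log (hypothetical)."""
    def proxy(*args, **kwargs):
        res = fun(*args, **kwargs)
        log.append((fun.__name__, args, kwargs, res))
        return res
    return proxy


class TracedStr(str):
    pass


call_log = []
for name in ['upper', 'find']:
    setattr(TracedStr, name, make_tracing_wrapper(getattr(str, name), call_log))

upper_result = TracedStr('abc').upper()
find_result = TracedStr('abc').find('c')
```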

### Methods yet to be translated

These methods generate strings from other strings. However, we do not have the right implementations for any of these. Hence these are marked as dangerous until we can generate the right translations.

In [None]:
def make_str_abort_wrapper(fun):
    def proxy(*args, **kwargs):
        raise ostr.TaintException('%s Not implemented in `ostr`' % fun.__name__)
    return proxy

In [None]:
for name, fn in inspect.getmembers(str, callable):
    if name in ['__format__', '__rmod__', '__mod__', 'format_map', 'format',
                '__mul__', '__rmul__', 'center', 'zfill', 'decode', 'encode', 'splitlines']:
        setattr(ostr, name, make_str_abort_wrapper(fn))

### Origin Checkers

We can also check whether a string originates from another string:

In [None]:
class ostr(ostr):
    def origin_in(self, origin_set):
        return set(self.origin) <= set(origin_set)

    def origin_from(self, originating_string):
        return self.origin_in(originating_string.origin)

In [None]:
s = ostr("hello", origin=100)
str(s[1])

In [None]:
s.origin_from(s)

In [None]:
s[1].origin

In [None]:
s[1].origin_from(s)

In [None]:
t = ostr("world", origin=200)

In [None]:
s.origin_from(t)

### Privacy leaks revisited

With all this implemented, we now have full-fledged `ostr` strings where we can easily check the origin of each and every character.

In [None]:
SECRET_ORIGIN = 1000

In [None]:
secret = ostr('<again, some super-secret input>', origin=SECRET_ORIGIN)

In [None]:
s = heartbeat('hello', 5, memory=secret)
str(s)

In [None]:
s.origin

In [None]:
s.origin_in([None])

In [None]:
s.origin_in(list(range(SECRET_ORIGIN, SECRET_ORIGIN + 1000)))

In [None]:
s = heartbeat('hello', 32, memory=secret)
str(s)

In [None]:
s.origin_in([None])

In [None]:
s.origin_in([None] + list(range(SECRET_ORIGIN, SECRET_ORIGIN + 1000)))

In [None]:
s.origin
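
Based on such origin lists, an output routine could refuse any reply that contains characters from the secret range. The following sketch works on plain origin lists; the function `leaked_positions()` and the sample lists are assumptions made for illustration.

```python
def leaked_positions(origins, secret_start, secret_length):
    """Return the reply positions whose origin lies in the secret range
    (hypothetical helper; None stands for 'no origin')."""
    secret_range = range(secret_start, secret_start + secret_length)
    return [i for i, o in enumerate(origins)
            if o is not None and o in secret_range]


# A reply built only from user input, and one that also includes
# three characters of secret memory starting at origin 1000
harmless_origins = [None] * 5
leaking_origins = [None] * 5 + [1005, 1006, 1007]

harmless = leaked_positions(harmless_origins, 1000, 32)
leaking = leaked_positions(leaking_origins, 1000, 32)
```

An empty result means that no character of the reply stems from the secret; any non-empty result pinpoints the leaking positions.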

## Lessons Learned

* One can track the information flow from input to the internals of a system.

## Background

\cite{Lin2008}
