# Trees

A tree is a structure made up of nodes, which can have children (which are other nodes). In the example, node `C` has as children nodes `F` and `I`, and is itself the child of node `A`.

<!--
digraph G {
 node [margin=0 fontsize=12 width=0.5 shape=circle style=filled]
 A -> {B C D}
 B -> E
 C -> {F I}
 F -> {G H}
}
-->

<figure>
<table><tr>
    <td><img src="files/tree-01.png" style="width: 200px"></td>
</tr></table>
</figure>

Two types of nodes have a special status: nodes that have no children, called **leaves**, and the node that is not a child of any other node, called the **root**. There must be exactly one root; otherwise the structure would consist of several disjoint trees. In the example, `A` is the root and `D`, `E`, `G`, `H`, `I` are the leaves.

<figure>
<table><tr>
    <td><img src="files/tree-01-with-infos.png" style="width: 380px"></td>
</tr></table>
</figure>

In a tree, we also require that every node is a descendant of the root and that there are no cycles (i.e. it is not possible to pass twice through the same node when following the "children relation").

For example, the following structures are not trees: the first one because the node `E` cannot be reached from the root `A`, and the second because the nodes `A`, `B` and `C` form a cycle (node `C` can be reached both directly from `A` and through `B`).

<!--
digraph G {
 node [margin=0 fontsize=12 width=0.5 shape=circle style=filled]
 A -> {B C D}
 B -> E [style=invis]
 C -> {F I}
 F -> {G H}
}

digraph G {
 node [margin=0 fontsize=12 width=0.5 shape=circle style=filled]
 A -> {B C D}
 B -> E
 C -> {F I}
 F -> {G H}
 B -> C [constraint=false]
}
-->

<figure>
<table><tr>
    <td><img src="files/not-tree-01.png" style="width: 200px"></td>
    <td><img src="files/not-tree-02.png" style="width: 200px"></td>
</tr></table>
</figure>

We are usually not interested in the structure of the tree alone (what is the root, where are the leaves, how many children a node has, etc.): we also want to store information in it. We will therefore consider trees where each node contains a value (for example an integer). Moreover, trees with additional restrictions are often used (number of children, relationship between children and parents, etc.). In particular, binary trees, where each node has no more than two children, are very common.

As an example, you can find below a tree where each node is labeled with an integer, where all children are labeled with values strictly smaller than their parent's, and where the sum of the labels of the children of a node does not exceed the label of that node (this is purely arbitrary, but we will find such restrictions in the following courses).

<!--
digraph G {
 node [margin=0 fontsize=12 width=0.5 shape=circle style=filled]
 A [label="10"]
 B [label="2"]
 C [label="5"]
 D [label="2"]
 E [label="1"]
 F [label="3"]
 G [label="2"]
 H [label="1"]
 I [label="1"]
 A -> {B C D}
 B -> E
 C -> {F I}
 F -> {G H}
}
-->

<figure>
<table><tr>
    <td><img src="files/tree-02.png" style="width: 200px"></td>
</tr></table>
</figure>

Finally, we can give an inductive (a.k.a. recursive) definition of a tree: a tree with labels in $\mathcal{L}$ is defined by a value $l \in \mathcal{L}$ (its label) and a (possibly empty) sequence (or list) of trees (its children). For example, the tree of our first example has label `A` and has 3 children: the trees *rooted* at `B`, `C`, `D`. In turn, the first child of `A` has label `B` and has 1 child: the tree *rooted* at `E`, and so on.

## Representing trees in Python

The usual representation of trees in Python relies on a class `Node` whose instances represent the nodes of a tree. Following the recursive definition of trees, each instance has two data attributes: one for the label and one for the list of the node's children.

In [1]:
class Node:
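    def __init__(self, label, children=None):
        """ Create a new tree node with label `label` and children
            `children` that must be a list of instances of Node. """
        self.label = label  # We store the node label
        # By default, the node is created with no children. We do not use
        # `children=[]` as a default: that single list would be shared by
        # every node created without an explicit list of children.
        self.children = children if children is not None else []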
     
# Create a Python representation of the tree of the first example
mytree = \
  Node("A", [
      Node("B", [Node("E")]),
      Node("C", [
          Node("F", [Node("G"), Node("H")]),
          Node("I")
      ]),
      Node("D")
  ])

print(mytree.label)
print(mytree.children[1].label)
print(mytree.children[0].children[0].label)

A
C
E
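One Python detail is worth keeping in mind when writing such a constructor: a default value such as `[]` is evaluated only once, when the function is defined, so every node created without an explicit list would share the same list object. The following minimal sketch illustrates the pitfall; the class names `SharedNode` and `SafeNode` are hypothetical and only serve the illustration.

```python
class SharedNode:
    # Pitfall: the default list is created once and reused by every call.
    def __init__(self, label, children=[]):
        self.label = label
        self.children = children

class SafeNode:
    # Common idiom: use None as the default and create a fresh list per node.
    def __init__(self, label, children=None):
        self.label = label
        self.children = [] if children is None else children

a, b = SharedNode("a"), SharedNode("b")
a.children.append(SharedNode("x", []))
print(len(b.children))  # 1: `b` sees the child that was added to `a`

c, d = SafeNode("c"), SafeNode("d")
c.children.append(SafeNode("x"))
print(len(d.children))  # 0: each node owns its own list
```

As long as the list of children is never mutated after construction, both variants behave the same, which is why the problem can go unnoticed for a long time.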


Sometimes, we may use more specialized tree representations. For example, binary trees are trees whose nodes have at most two children, which are referred to as the *left child* and the *right child*. In that case, we can use the following representation:

In [2]:
class BinaryNode:
    def __init__(self, label, left = None, right= None):
        self.label = label # Node label
        self.left  = left  # Left child (might be None when not present)
        self.right = right # Right child (might be None when not present)

Using the class `BinaryNode`, we can represent the following binary tree as shown in the next Python cell.

<!--
digraph G {
 graph [ordering="out"];
 node [margin=0 fontsize=12 width=0.5 shape=circle style=filled]
 BL [style = invis]
 A -> {B C}
 B -> BL [style = dashed, arrowhead = none]
 B -> D
 C -> {E F}
 F -> {G H}
}
-->

<figure>
<table><tr>
    <td><img src="files/tree-03.png" style="width: 250px"></td>
</tr></table>
</figure>

In [3]:
# Create a Python representation of the binary tree above
mybintree = \
    BinaryNode("A",
        BinaryNode("B", None, BinaryNode("D")),
        BinaryNode("C",
            BinaryNode("E"),
            BinaryNode("F", BinaryNode("G"), BinaryNode("H"))
        )
    )

## Computing functions on trees

We are now interested in computing functions on trees. Here are a few functions that can be defined recursively on the structure of trees.

### Size of a tree

The size of a tree is defined as the number of nodes that it contains (i.e. the number of nodes that can be reached from the root, including the root itself). The reasoning for computing it is quite simple: a tree is a node and the list of its children, so its size is one (the node), plus the size of the subtrees starting from each child:

$$|\text{node}| = 1 + \sum_{c\ \in\ \text{node.children}} |c|$$

where $|\text{node}|$ denotes the size of the tree rooted at `node`. On our first example, this gives:

<figure>
<table><tr>
    <td><img src="files/tree-size.png" style="height: 150px"></td>
</tr></table>
</figure>

Of course, we are going to compute the size of the children using a recursive call to the same function. In Python, this gives:

In [4]:
def size(node):
    # One liner: return 1 + sum(size(x) for x in node.children)
    aout = 0
    for child in node.children:
        aout += size(child)
    return 1 + aout

print(size(mytree))

9


Note that we could have defined the size computation using a method of the class `Node`:

In [5]:
class Node:
    def __init__(self, label, children=None):
        """ Create a new tree node with label `label` and children
            `children` that must be a list of instances of Node. """
        self.label = label  # We store the node label
        # By default, the node is created with no children (a fresh list,
        # not a mutable default argument shared between nodes)
        self.children = children if children is not None else []
        
    def size(self):
        aout = 0
        for child in self.children:
            aout += child.size()
        return 1 + aout
     
# Since we redefined `Node`, we have to recreate `mytree`
# using the new implementation!
mytree = \
  Node("A", [
      Node("B", [Node("E")]),
      Node("C", [
          Node("F", [Node("G"), Node("H")]),
          Node("I")
      ]),
      Node("D")
  ])

print(mytree.size())

9


**Note**: in the remainder of this notebook, I am going to define all operations using functions.

### Textual representation of a tree

Using recursion, we can easily define a textual representation of a tree, as follows:

In [6]:
def node2str(node):
    aout = str(node.label)
    if node.children:
        aout += "[{}]".format(", ".join(node2str(x) for x in node.children))
    return aout

print(node2str(mytree))

A[B[E], C[F[G, H], I], D]


### Gathering all labels of a tree

For our last example, we are interested in gathering all the labels (with duplicates) of a given tree. Here too, the reasoning is quite simple: the list of labels of a tree rooted at `node` is composed of the `node`'s label and all the labels of the subtrees rooted at its children. This leads to the following function:

In [7]:
def labels(node, accumulator):
    accumulator.append(node.label)
    for child in node.children:
        labels(child, accumulator)
    return accumulator

print(labels(mytree, []))

['A', 'B', 'E', 'C', 'F', 'G', 'H', 'I', 'D']


Note that I used an accumulator for storing the labels. I could have written the function in a more functional way (i.e. in a way closer to its mathematical definition):

In [8]:
def labelsF(node):
    return [node.label] + [x for c in node.children for x in labelsF(c)]

print(labelsF(mytree))

['A', 'B', 'E', 'C', 'F', 'G', 'H', 'I', 'D']


However, for the next remark, we are going to look at the first version, `labels`. If we look at the code, we see that we decided to consider the label of the node before considering its children. But note that this choice is arbitrary: we could have decided to visit the node's children first, leading to the following implementation:

In [9]:
def labels_post(node, accumulator):
    for child in node.children:
        labels_post(child, accumulator)
    accumulator.append(node.label)
    return accumulator

print(labels_post(mytree, []))

['E', 'B', 'G', 'H', 'F', 'I', 'C', 'D', 'A']


You see that the answers are different, but both of them are valid: they are permutations of each other and both contain the labels of our tree. However, they clearly show that there exist several ways to traverse a tree. Here, we saw two of them: depth-first pre-order (you consider the node and then its children) and depth-first post-order (you consider the node's children and then the node). We will consider tree (and graph) traversals in more detail in the next course.

## Binary Search Tree

A binary search tree (BST) is a tree-like data structure representing a set whose keys belong to a totally ordered set (e.g. the integers). A binary search tree allows quick operations to search for an element, or to insert or delete an element.

More specifically, a binary search tree is defined as a binary tree whose labels are taken from an ordered set (in our examples, we are going to have integer labels) such that:

  1. all the labels in the left subtree of the root are smaller than or equal to the root label,
  2. all the labels in the right subtree of the root are larger than or equal to the root label, and
  3. the left and right subtrees are themselves binary search trees.

For example, the following tree is a BST:

<!--
digraph G {
 graph [ordering="out"];
 node [margin=0 fontsize=12 width=0.5 shape=circle style=filled]
 A [label = "10"]
 B [label = "5"]
 C [label = "11"]
 D [label = "7"]
 E [label = "11"]
 F [label = "20"]
 G [label = "15"]
 H [label = "24"]
 BL [style = invis]
 A -> {B C}
 B -> BL [style = dashed, arrowhead = none]
 B -> D
 C -> {E F}
 F -> {G H}
}
-->

<figure>
<table><tr>
    <td><img src="files/bst-01.png" style="width: 250px"></td>
</tr></table>
</figure>  
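The three conditions of the definition can be turned into a small checker. This is only a sketch: the function name `is_bst` and its `low` / `high` parameters are my own choices, and `BinaryNode` is repeated so that the cell is self-contained. The idea is to carry, while going down, the inclusive bounds that every label of the current subtree must respect.

```python
class BinaryNode:
    # Same class as earlier in the notebook, repeated for self-containment
    def __init__(self, label, left=None, right=None):
        self.label = label
        self.left = left
        self.right = right

def is_bst(node, low=None, high=None):
    """ Check that every label of the tree rooted at `node` lies in
        [low, high] (None means "no bound") and that the BST conditions
        hold recursively. """
    if node is None:
        return True  # An empty tree is a valid BST
    if low is not None and node.label < low:
        return False
    if high is not None and node.label > high:
        return False
    # Left labels must be <= node.label, right labels must be >= node.label
    return (is_bst(node.left, low, node.label) and
            is_bst(node.right, node.label, high))

# The BST of the figure
good = BinaryNode(10,
    BinaryNode(5, None, BinaryNode(7)),
    BinaryNode(11,
        BinaryNode(11),
        BinaryNode(20, BinaryNode(15), BinaryNode(24))))

# Not a BST: 12 sits in the left subtree of 10
bad = BinaryNode(10, BinaryNode(5, None, BinaryNode(12)))

print(is_bst(good))  # True
print(is_bst(bad))   # False
```

Note that comparing each node only with its direct children would not be enough: in `bad`, the node `12` is correctly placed with respect to its parent `5`, but it violates the condition imposed by the root.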

Let's first define a function computing the textual representation of a BST:

In [10]:
def bst_str(node):
    if node is None:
        return '.'
    return "{}[L={}, R={}]".format(node.label, bst_str(node.left), bst_str(node.right))

### Searching

When searching for a value `x` in a BST rooted at `node`, one can use the following algorithm:

  - if `node` is empty (`None`), then `x` is not in the tree.
  - if the label of `node` is `x`, then we are done.
  - if `x` is (strictly) smaller than the label of `node`, then we (recursively) search for `x` in the left subtree of `node` (but we do **not** search for it in the right subtree).
  - likewise, if `x` is (strictly) larger than the label of `node`, then we (recursively) search for `x` in the right subtree of `node` (but we do **not** search for it in the left subtree).
  
As you can see, at each step we go down the tree and never backtrack (i.e. revise a previous decision). So we perform at most a number of comparisons that is proportional to the height of the tree. If the tree is balanced (we will define this more formally later), the height is proportional to the **logarithm** of the number of elements. This is a big improvement over a linear search in a list and is comparable to a binary search in a sorted list.

But why can we decide to ignore some subtrees? Because of the data invariant that we imposed on BSTs. For example, assume that `x` is strictly smaller than the label of `node`. Since we have a BST, we know that all the labels in the right subtree of `node` are larger than or equal to `node`'s label. Hence, all labels in the right subtree of `node` are strictly larger than `x`. Consequently, `x` **cannot** be in the right subtree. It remains to check whether `x` is in the left subtree or not.

For example, if we go back to the previous tree, we will follow the red path when searching for the value `15`:

<!--
digraph G {
 graph [ordering="out"];
 node [margin=0 fontsize=12 width=0.5 shape=circle style=filled]
 A [label = "10"]
 B [label = "5"]
 C [label = "11"]
 D [label = "7"]
 E [label = "11"]
 F [label = "20"]
 G [label = "15"]
 H [label = "24"]
 BL [style = invis]
 A -> B
 A -> C [color = red]
 B -> BL [style = dashed, arrowhead = none]
 B -> D
 C -> E
 C -> F [color = red]
 F -> G [color = red]
 F -> H
}
-->

<figure>
<table><tr>
    <td><img src="files/bst-02.png" style="width: 250px"></td>
</tr></table>
</figure>

Here is the Python implementation:

In [11]:
mystree = \
    BinaryNode(10,
        BinaryNode(5, None, BinaryNode(7)),
        BinaryNode(11,
            BinaryNode(11),
            BinaryNode(20, BinaryNode(15), BinaryNode(24))
        )
    )

def bst_search(node, x):
    if node is None:
        return False
    if node.label == x:
        return True
    return bst_search(node.left if x < node.label else node.right, x)

print(bst_search(mystree, 15))
print(bst_search(mystree, 14))

True
False
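To make the "one node per level" behaviour visible, here is a sketch of an iterative variant that also counts how many nodes are visited. The name `bst_search_steps` is hypothetical, and `BinaryNode` is repeated so that the cell runs on its own.

```python
class BinaryNode:
    # Same class as earlier in the notebook, repeated for self-containment
    def __init__(self, label, left=None, right=None):
        self.label = label
        self.left = left
        self.right = right

def bst_search_steps(node, x):
    """ Return a pair (found, steps) where `steps` is the number of
        nodes visited while searching for `x`. """
    steps = 0
    while node is not None:
        steps += 1
        if node.label == x:
            return True, steps
        # Go down exactly one level, left or right, never both
        node = node.left if x < node.label else node.right
    return False, steps

tree = BinaryNode(10,
    BinaryNode(5, None, BinaryNode(7)),
    BinaryNode(11,
        BinaryNode(11),
        BinaryNode(20, BinaryNode(15), BinaryNode(24))))

print(bst_search_steps(tree, 15))  # (True, 4): visits 10, 11, 20, 15
print(bst_search_steps(tree, 14))  # (False, 4): visits 10, 11, 20, 15
print(bst_search_steps(tree, 7))   # (True, 3): visits 10, 5, 7
```

The number of visited nodes never exceeds the number of levels of the tree, even though the tree contains 8 nodes.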


### Inserting

To insert a value `x` in a tree rooted at `node`, we start as for a search, going down the tree and choosing the left or right subtree based on the comparison of `x` with the current node label. However, when we cannot progress further, we insert a new leaf with `x` as its label. This gives the following algorithm:

  - if `node` is an actual node, if `x` is smaller than or equal to (resp. strictly larger than) `node.label`, then we insert `x` in the left (resp. right) subtree, and
  - if `node` is `None`, we create a new `BinaryNode`, with `x` as its label and with no children.
  
Here, the algorithm works because: 1. after insertion, the BST has been augmented with a single `BinaryNode` whose label is `x`, and 2. by pushing `x` to the left (resp. right) when it is smaller than or equal to (resp. strictly larger than) the node label, we insert `x` in a place where it does not break the BST invariants.

Here too, when the tree is balanced, the insertion will make a number of comparisons that is proportional to the logarithm of the number of nodes of the tree.

This gives the following Python code:

In [12]:
mystree = \
    BinaryNode(10,
        BinaryNode(5, None, BinaryNode(7)),
        BinaryNode(11,
            BinaryNode(11),
            BinaryNode(20, BinaryNode(15), BinaryNode(24))
        )
    )

def bst_insert(node, x):
    if node is None:
        return BinaryNode(x)
    if x <= node.label:
        node.left = bst_insert(node.left, x)
    else:
        node.right = bst_insert(node.right, x)
    return node
        
print(bst_search(mystree, 15))
print(bst_search(mystree, 14))
print(bst_str(mystree))
mystree = bst_insert(mystree, 14)
print(bst_str(mystree))
print(bst_search(mystree, 15))
print(bst_search(mystree, 14))

True
False
10[L=5[L=., R=7[L=., R=.]], R=11[L=11[L=., R=.], R=20[L=15[L=., R=.], R=24[L=., R=.]]]]
10[L=5[L=., R=7[L=., R=.]], R=11[L=11[L=., R=.], R=20[L=15[L=14[L=., R=.], R=.], R=24[L=., R=.]]]]
True
True
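A consequence of the BST invariants is that visiting the left subtree, then the node, then the right subtree enumerates the labels in non-decreasing order. The sketch below checks this on a tree built by repeated insertions; the function name `inorder` is my own, and `BinaryNode` and `bst_insert` are repeated from the cells above so that the example is self-contained.

```python
class BinaryNode:
    # Same class as earlier in the notebook, repeated for self-containment
    def __init__(self, label, left=None, right=None):
        self.label = label
        self.left = left
        self.right = right

def bst_insert(node, x):
    # Same insertion as above: `x <= label` goes left, otherwise right
    if node is None:
        return BinaryNode(x)
    if x <= node.label:
        node.left = bst_insert(node.left, x)
    else:
        node.right = bst_insert(node.right, x)
    return node

def inorder(node, accumulator):
    """ Append the labels of the tree rooted at `node` to `accumulator`,
        visiting left subtree, node, right subtree. """
    if node is not None:
        inorder(node.left, accumulator)
        accumulator.append(node.label)
        inorder(node.right, accumulator)
    return accumulator

root = None
for value in [10, 5, 11, 7, 11, 20, 15, 24, 14]:
    root = bst_insert(root, value)

print(inorder(root, []))  # [5, 7, 10, 11, 11, 14, 15, 20, 24]
```

This gives a cheap sanity check after a series of insertions: if the result is not sorted, an invariant has been broken somewhere.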


### When things go wrong

Note that there are cases where binary search trees degenerate into lists, leading to searches and insertions that take linear time (in the number of elements). For example:

In [13]:
mylsttree = None
for i in range(20):
    mylsttree = bst_insert(mylsttree, i)
print(bst_str(mylsttree))

0[L=., R=1[L=., R=2[L=., R=3[L=., R=4[L=., R=5[L=., R=6[L=., R=7[L=., R=8[L=., R=9[L=., R=10[L=., R=11[L=., R=12[L=., R=13[L=., R=14[L=., R=15[L=., R=16[L=., R=17[L=., R=18[L=., R=19[L=., R=.]]]]]]]]]]]]]]]]]]]]


You can see that this tree is completely degenerate: the left subtrees are always empty and its height is equal to its number of elements. During the tutorial, we will see how we can solve this issue by *rotating* subtrees while inserting new values, leading to binary search trees that are always well balanced.
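To make the effect of the insertion order concrete, here is a small sketch that compares the height of two trees containing 15 values each. The helpers `height` and `build` are hypothetical names introduced for this illustration, and the height is counted here as the number of nodes on the longest path from the root to a leaf (an assumption chosen to match the remark above). `BinaryNode` and `bst_insert` are repeated so that the cell runs on its own.

```python
class BinaryNode:
    # Same class as earlier in the notebook, repeated for self-containment
    def __init__(self, label, left=None, right=None):
        self.label = label
        self.left = left
        self.right = right

def bst_insert(node, x):
    # Same insertion as above: `x <= label` goes left, otherwise right
    if node is None:
        return BinaryNode(x)
    if x <= node.label:
        node.left = bst_insert(node.left, x)
    else:
        node.right = bst_insert(node.right, x)
    return node

def height(node):
    """ Number of nodes on the longest root-to-leaf path (0 if empty). """
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))

def build(values):
    """ Insert `values` one by one, in order, into an initially empty BST. """
    root = None
    for value in values:
        root = bst_insert(root, value)
    return root

sorted_tree = build(range(15))
mixed_tree = build([8, 4, 12, 2, 6, 10, 14, 1, 3, 5, 7, 9, 11, 13, 15])

print(height(sorted_tree))  # 15: one node per level, as in a list
print(height(mixed_tree))   # 4: every level is completely filled
```

The second insertion order was picked by hand so that each value splits the remaining ones evenly. In general we do not control the order in which values arrive, which is why a rebalancing technique is needed.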