<div style="text-align: right">
    <i>
        LIN 537: Computational Linguistics 1 <br>
        Fall 2019 <br>
        Alëna Aksënova
    </i>
</div>

# Notebook 14: tries

This notebook introduces **tries**, or **prefix trees**.
It is a tree-based data structure that stores a finite list of strings compactly in memory.


## Building prefix trees

Imagine that we need to store the following list of words in memory.
The cell below stores it as a list of strings.

In [None]:
words = ["linguist", "language", "physics"]

However, there is another _data structure_ that allows us to store a collection of strings more succinctly.

A **trie**, or a **prefix tree**, stores strings in a tree that branches wherever the strings stop sharing a prefix. The empty string is the top node of the tree, because every string has something in common: _every string starts with the empty substring_. The words in `words` can start in two ways, with "l" or with "p", so the node of the empty string branches to the nodes "l" and "p", and so on.

<img src="images/14_1.png" width="500">
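One way to see where the savings come from: every node of a prefix tree corresponds to one distinct prefix of the stored words, so shared prefixes are stored only once. The cell below is a minimal sketch of that count. The helper name `distinct_prefixes` is hypothetical, and the word list adds "linguistics" to the three words above purely for illustration.

```python
def distinct_prefixes(words):
    """Collect every distinct prefix of the given words (one per trie node)."""
    prefixes = {""}  # the empty prefix is the root node
    for w in words:
        for i in range(1, len(w) + 1):
            prefixes.add(w[:i])
    return prefixes


# assumption: a four-word list, chosen so that some prefixes are shared
example_words = ["linguist", "language", "physics", "linguistics"]

print(len(distinct_prefixes(example_words)))   # number of trie nodes: 26
print(sum(len(w) for w in example_words))      # symbols in the plain list: 34
```

The more prefixes the words share, the bigger the gap between the two numbers becomes.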

A **prefix tree** can also be viewed as an automaton, where the empty prefix is the initial state and the arrows point to the right (when drawn as in the picture above).
This lets us move through the automaton and read off the strings formed by its states and transitions.

To encode a lexicon, we also need to mark some states as final, meaning that the sequence up to that point forms a word stored in the tree.

<img src="images/14_2.png" width="600">

To encode it, we can use a dictionary whose keys are the strings stored in the nodes and whose values are the dictionaries of their sub-trees.

In [None]:
dictionary = {'': {'l': {'a': {'n': {'g': {'u': {'a': {'g': {'e': {}}}}}}},
                         'i': {'n': {'g': {'u': {'i': {'s': {'t': {'i': {'c': {'s': {}}}}}}}}}}},
                   'p': {'h': {'y': {'s': {'i': {'c': {'s': {}}}}}}}}}

But this representation does not encode the final states.
A simple way to do so is to append a special symbol (here, `#`) to every word, and store those symbols in the prefix tree.
The appearance of that symbol in the tree means that the path from the empty prefix to the special symbol is a word.

In [None]:
dictionary = {'': {'l': {'a': {'n': {'g': {'u': {'a': {'g': {'e': {'#': {}}}}}}}},
                         'i': {'n': {'g': {'u': {'i': {'s': {'t': {'#': {},
                                                                   'i': {'c': {'s': {'#': {}}}}}}}}}}}},
                   'p': {'h': {'y': {'s': {'i': {'c': {'s': {'#': {}}}}}}}}}}
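To make the role of the end marker concrete, here is a minimal sketch that walks such a nested dictionary like an automaton and reads off every stored word. The function name `list_words` and the small hand-written trie are assumptions made for this illustration; the sketch relies only on the `#` convention described above.

```python
def list_words(node, prefix=""):
    """Yield every word stored under `node`, given the symbols read so far."""
    for symbol, subtree in node.items():
        if symbol == "#":
            # the end marker: the path read so far is a complete word
            yield prefix
        else:
            # follow the transition and extend the prefix by one symbol
            yield from list_words(subtree, prefix + symbol)


# a small hand-written trie for the words "a", "an", "and", "to"
small_trie = {"": {"a": {"#": {}, "n": {"#": {}, "d": {"#": {}}}},
                   "t": {"o": {"#": {}}}}}

print(sorted(list_words(small_trie)))   # ['a', 'an', 'and', 'to']
```

Note that "a" and "an" are reported even though they are prefixes of "and": without the `#` entries, only the paths to the leaves could be recovered.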

Consider `create_prefix_tree`, which generates a prefix tree like the one from the previous cell.

In [None]:
def create_prefix_tree(words):
    """ Creates a prefix tree given the list of words. """
    
    trie = {}

    for w in words:
        annotated_word = w + "#"
        
        # start from the upper level of the dictionary
        current_level = trie

        # iterate through symbols of the word
        for s in annotated_word:

            # if the current level doesn't store this key already, add it
            if s not in current_level:
                current_level[s] = {}

            # move inside the dictionary that the current symbol stores as its value
            current_level = current_level[s]

    # add the empty string root node above all symbols
    trie = {"": trie}

    return trie

In [None]:
from pprint import pprint


words = ["linguist", "language", "physics", "linguistics"]
trie = create_prefix_tree(words)
pprint(trie)

Indeed, these two prefix trees are identical.

In [None]:
print(trie == dictionary)

**Practice.** Implement a function that inserts a word into the trie.

In [None]:
def insert_word(trie, word):
    """
    Inserts the given word into the given prefix tree.
    
    Arguments:
        trie (dict): a prefix tree;
        word (str): a word that needs to be inserted.
        
    Returns:
        dict: an updated prefix tree.
    """
    pass

## Searching in tries

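Tries make searching a collection of strings more efficient than searching a list: a lookup follows at most one branch per symbol of the word, regardless of how many words are stored, whereas a list may have to be compared against every string it contains.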
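As a rough illustration of why the cost depends on the length of the query rather than on the size of the collection, the sketch below checks whether any stored word starts with a given prefix. The name `has_prefix` and the tiny example trie are hypothetical. The sketch deliberately ignores the `#` marker, so it does not tell whether the prefix itself is a stored word.

```python
def has_prefix(trie, prefix):
    """Tell whether some path in the trie starts with the given prefix."""
    node = trie[""]  # start below the empty-string root
    for symbol in prefix:
        # one dictionary lookup per symbol, however many words are stored
        if symbol not in node:
            return False
        node = node[symbol]
    return True


# a small hand-written trie for the words "to" and "top"
small_trie = {"": {"t": {"o": {"#": {}, "p": {"#": {}}}}}}

print(has_prefix(small_trie, "to"))   # True
print(has_prefix(small_trie, "ta"))   # False
```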

**Practice.** Write a function that takes two arguments: `trie` and `word`. It returns True if `word` is represented in `trie`, and False otherwise.

In [None]:
def word_in_trie(trie, word):
    """
    Tells if a given word is represented in the given prefix tree.
    
    Arguments:
        trie (dict): a prefix tree;
        word (str): a word that might or might not be a part of the trie.
        
    Returns:
        bool: True if the word is a part of the trie, False otherwise.
    """
    pass

Test your code using the following strings:

In [None]:
words_test = ["linguist", "linguistics", "lang", "", "math"]
for word in words_test:
    print(word.rjust(12), "\t", word_in_trie(trie, word))

## Deleting words from tries

There are several cases we need to consider when deleting a word from a trie:

* the word is a prefix of some other word;
* the word contains a prefix of some other word;
* the word is unique;
* the word is not represented in the trie.
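To get a feel for the four cases before deciding how to handle them, the sketch below classifies words against a plain word list. It does not touch the trie and does not delete anything. The helper name `deletion_case` is hypothetical, and it rests on one assumption: "contains a prefix of some other word" is read as "shares a non-empty prefix with another word".

```python
from os.path import commonprefix


def deletion_case(words, target):
    """Name the deletion case that `target` falls under for this word list."""
    if target not in words:
        return "not in trie"
    others = [w for w in words if w != target]
    if any(w.startswith(target) for w in others):
        return "prefix of another word"
    # assumption: a shared non-empty prefix counts as the second case
    if any(commonprefix([w, target]) for w in others):
        return "shares a prefix with another word"
    return "unique"


example_words = ["linguist", "language", "physics", "linguistics"]
for w in ["linguist", "linguistics", "language", "physics", "math"]:
    print(w.rjust(12), "\t", deletion_case(example_words, w))
```

With this list, "linguist" is a prefix of another word, "linguistics" and "language" share a prefix with another word, "physics" is unique, and "math" is not stored.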

**Practice 1.** How should each of the cases above be handled?

**Practice 2. (optional)** Implement removing a word from the trie.

In [None]:
def remove_word_from_trie(trie, word):
    """
    Removes the given word from the given prefix tree.
    
    Arguments:
        trie (dict): a prefix tree;
        word (str): a word that needs to be removed.
        
    Returns:
        dict: an updated prefix tree.
    """
    pass