<div style="text-align: right">
    <i>
        LIN 537: Computational Linguistics 1 <br>
        Fall 2019 <br>
        Alëna Aksënova
    </i>
</div>

# Notebook 14: tries

This notebook introduces **tries**, or **prefix trees**.
A trie is a tree-based data structure that stores a finite list of strings compactly in memory.

Using sequences of X-SAMPA symbols (an "industrial" representation of IPA), an exercise at the end of the notebook shows how prefix trees can be used to parse phonemic sequences, i.e. to predict both "ice cream" and "I scream" given the input `[aI s k r i m]`.


## Building prefix trees

Imagine that we have the following list of words and need to store it in memory.
The cell below stores it as a list of strings.

In [1]:
words = ["linguist", "language", "physics"]

However, there is another _data structure_ that stores a collection of strings more succinctly.

A **trie**, or a **prefix tree**, stores strings in a tree that branches wherever the strings differ at some position. The empty string is the top node of the tree, because every string has something in common: _every string starts with an empty substring_. Next, there are two ways to start a word in the list `words`: "l" and "p", so the node of the empty string branches to the nodes "l" and "p", and so on.

<img src="images/14_1.png" width="500">

A **prefix tree** can also be viewed as an automaton, where the empty prefix is the initial state and the arrows point to the right (when drawn as in the picture above).
Moving through the automaton along its transitions yields the strings spelled by the states visited.

To encode a lexicon, we also need to mark some states as final, meaning that the sequence up to that point forms a word stored in the tree.

<img src="images/14_2.png" width="600">

To represent a trie, we can use a dictionary whose keys are the strings stored in the nodes and whose values are the dictionaries of the corresponding sub-trees.

In [2]:
dictionary = {'': {'l': {'a': {'n': {'g': {'u': {'a': {'g': {'e': {}}}}}}},
                         'i': {'n': {'g': {'u': {'i': {'s': {'t': {'i': {'c': {'s': {}}}}}}}}}}},
                   'p': {'h': {'y': {'s': {'i': {'c': {'s': {}}}}}}}}}

But this representation does not encode the final states.
A simple way to do so is to append a special end-of-word symbol (here `#`) to every word, and store those symbols in the prefix tree.
The appearance of that symbol in the tree means that the path from the empty prefix to the special symbol spells a word.

In [3]:
dictionary = {'': {'l': {'a': {'n': {'g': {'u': {'a': {'g': {'e': {'#': {}}}}}}}},
                         'i': {'n': {'g': {'u': {'i': {'s': {'t': {'#': {},
                                                                   'i': {'c': {'s': {'#': {}}}}}}}}}}}},
                   'p': {'h': {'y': {'s': {'i': {'c': {'s': {'#': {}}}}}}}}}}
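To see how the `#` marker is read back, here is a minimal sketch that walks a prefix tree like an automaton and collects every path that ends in `#`. The function name `list_words` and the small `toy_trie` are illustrative assumptions, not part of the notebook's own code.

```python
def list_words(trie, end="#"):
    """Collect every word stored in a nested-dictionary prefix tree."""
    found = []

    def walk(level, prefix):
        # visit the children in sorted order so the result is deterministic
        for symbol in sorted(level):
            if symbol == end:
                # the path from the root to this marker spells a word
                found.append(prefix)
            else:
                walk(level[symbol], prefix + symbol)

    # skip the empty-string root node and start from its children
    walk(trie[""], "")
    return found


# hypothetical toy trie storing "an", "ant" and "at"
toy_trie = {"": {"a": {"n": {"#": {}, "t": {"#": {}}},
                       "t": {"#": {}}}}}
print(list_words(toy_trie))  # ['an', 'ant', 'at']
```

Note that "an" is reported even though the path continues to "ant": the marker, not the end of a branch, decides what counts as a word.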

Consider `create_prefix_tree`, a function that generates a prefix tree like the one from the previous cell.

In [4]:
def create_prefix_tree(words):
    """ Creates a prefix tree given the list of words. """
    
    trie = {}

    for w in words:
        annotated_word = w + "#"
        
        # start from the upper level of the dictionary
        current_level = trie

        # iterate through symbols of the word
        for s in annotated_word:

            # if the current level doesn't store this key already, add it
            if s not in current_level:
                current_level[s] = {}

            # move inside the dictionary that the current symbol stores as its value
            current_level = current_level[s]

    # add the empty string root node above all symbols
    trie = {"": trie}

    return trie

Notice that the code above relies on the fact that assignment in Python does not copy a dictionary: `current_level` stores _a reference_ to a dictionary inside `trie`, so modifying `current_level` also modifies `trie`. To refresh this topic, see Notebook 3, which explains copying of lists: exactly the same applies here.
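The reference behaviour that `create_prefix_tree` depends on can be seen in isolation in the short sketch below. The names `demo` and `shallow` are chosen only for this illustration.

```python
demo = {}
current_level = demo           # no copy: both names refer to the same dictionary

current_level["l"] = {}        # adds the key to that shared dictionary
print(demo)                    # {'l': {}}
print(current_level is demo)   # True

current_level = current_level["l"]   # rebinding the name does not change demo
current_level["i"] = {}              # but mutating the inner dictionary does
print(demo)                          # {'l': {'i': {}}}

# an explicit copy, by contrast, gets its own top-level dictionary
shallow = dict(demo)
shallow["p"] = {}
print("p" in demo)                   # False
```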

In [None]:
from pprint import pprint


words = ["linguist", "language", "physics", "linguistics"]
trie = create_prefix_tree(words)
pprint(trie)

Indeed, the generated prefix tree is identical to the `dictionary` written by hand above.

In [None]:
print(trie == dictionary)
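How much sharing the trie achieves can be estimated by counting its nodes. The sketch below is only an illustration: `build_trie` is a condensed restatement of `create_prefix_tree` so that the cell runs on its own, and `count_nodes` and `sample_words` are hypothetical names.

```python
def build_trie(words, end="#"):
    """Condensed restatement of create_prefix_tree, for a self-contained example."""
    root = {}
    for word in words:
        level = root
        for symbol in word + end:
            # setdefault returns the existing sub-dictionary or creates a new one
            level = level.setdefault(symbol, {})
    return {"": root}


def count_nodes(level):
    """Count every key in the nested dictionaries, including '' and '#'."""
    return sum(1 + count_nodes(child) for child in level.values())


sample_words = ["linguist", "language", "physics", "linguistics"]
print(sum(len(w) for w in sample_words))        # 34
print(count_nodes(build_trie(sample_words)))    # 30
```

Of the 30 nodes, one is the root and four are `#` markers, so 25 symbol nodes cover the 34 characters of the list. The saving grows as more words share prefixes.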

**Practice.** Implement a function that inserts a word into the trie.

In [None]:
def insert_word(trie, word):
    """
    Inserts the given word into the given prefix tree.
    
    Arguments:
        trie (dict): a prefix tree;
        word (str): a word that needs to be inserted.
        
    Returns:
        dict: an updated prefix tree.
    """
    pass

## Searching in tries

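Tries make searching a collection of strings more efficient than scanning a list: checking whether a word is stored takes a number of steps proportional to the length of the word, no matter how many words the trie holds.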

**Practice.** Write a function that takes two arguments: `trie` and `word`. It returns True if `word` is represented in `trie`, and False otherwise.

In [None]:
def word_in_trie(trie, word):
    """
    Tells if a given word is represented in the given prefix tree.
    
    Arguments:
        trie (dict): a prefix tree;
        word (str): a word that might or might not be a part of the trie.
        
    Returns:
        bool: True if the word is a part of the trie, False otherwise.
    """
    pass

Test your code using the following strings:

In [None]:
words_test = ["linguist", "linguistics", "lang", "", "math"]
for word in words_test:
    print(word.rjust(12), "\t", word_in_trie(trie, word))

## Deleting words from tries

There are several cases we need to consider when deleting words from tries:

* a word is a prefix of some other word;
* a word shares a prefix with some other word;
* a word is unique;
* a word is not represented in the trie.

**Practice 1.** How should each of the cases above be handled?

**Practice 2. (optional, very advanced)** Implement removing a word from the trie.

## Parsing sequences of phonemes

Tries are used to store lexical items or their phonemic representations, because this data structure makes searching for its members efficient.
Consider the task of taking a sequence of phonemes as input and assigning it a phonetically possible parse.

In industry, IPA symbols are often represented using an ASCII-based alphabet called **X-SAMPA**: Extended Speech Assessment Methods Phonetic Alphabet ([wiki](https://en.wikipedia.org/wiki/X-SAMPA)).

<img src="images/14_3.gif" width="300">

Consider the following X-SAMPA transcription: `[aI s k r i m]`. The task of the alignment module is to determine which lexical items the transcription contains, and in which order.

Assume we have the following words in the lexicon:

In [None]:
lexicon = ["aI", "s k r i m", "aI s", "k r i m", "k r i m s"]

_Part 1._ Translate every lexical item to a list of strings, where every string is a single X-SAMPA symbol.

    Expected output: [['aI'], ['s', 'k', 'r', 'i', 'm'], ['aI', 's'], ['k', 'r', 'i', 'm'],
                      ['k', 'r', 'i', 'm', 's']]

_Part 2._ Build a prefix tree representing the lexicon. You might want to reuse the `create_prefix_tree` function with one small change.

    Expected output: {'': {'aI': {'#': {},
                                  's': {'#': {}}},
                           'k': {'r': {'i': {'m': {'#': {},
                                                   's': {'#': {}}}}}},
                           's': {'k': {'r': {'i': {'m': {'#': {}}}}}}}}

_Part 3._ Take a look at the implementation of `find_words`, a function that finds all words in the given string that are present in the given trie.

In [None]:
def find_words(trie, string):
    """
    Finds all words in the given sequence that can be generated
    using the given prefix tree.
    
    Arguments:
        trie (dict): prefix tree;
        string (str or list): a sequence of symbols that needs to be parsed.
        
    Returns:
        list: a list of words detected in that sequence.
    """
    
    words_detected = []
    current_level = trie[""]
    
    # iterating over indices of the string!
    for i in range(len(string)):

        # if a string cannot be parsed using a given trie
        if string[i] not in current_level:
            break
                
        # if a stop symbol can follow the current symbol,
        # the part of that string up until now is a possible word
        if "#" in current_level[string[i]]:
            words_detected.append(string[:i+1])

            # if the current symbol is not the last one,
            # parse the remaining part of the string
            if i < len(string) - 1:
                words_detected.extend(find_words(trie, string[i+1:]))
            
        # go one level deeper to read the following symbol
        current_level = current_level[string[i]]
        
    return words_detected

Assuming `trie` now holds the prefix tree built in Part 2, the next cell should produce the following output:

    [['aI'], ['s', 'k', 'r', 'i', 'm'], ['aI', 's'], ['k', 'r', 'i', 'm']]

In [None]:
input_string = "aI s k r i m"
words_detected = find_words(trie, input_string.split())
print(words_detected)

_Part 4._ Assume that we know in advance that there are only two words in the `input_string`. Using the output of `find_words`, find all working alignments for the `input_string`.

_Part 5._ Now find all working alignments for `alternative_input`.

In [None]:
alternative_input = "aI s k r i m s"