In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import urllib.request
from collections import defaultdict, Counter
%config InlineBackend.figure_formats = ['svg']

## Exercise 6: Markov Model of language (optional)

_Meta-comment 1_: this is more of a programming exercise. There was some talk of having it in DSCI 511, but we ended up putting it here. However, it's definitely good practice in _using_ Python data structures like dictionaries. There are some more challenging questions about time/space complexity at the end, which you can skip if you don't have enough time for them. Overall, this is not a perfect thematic fit with DSCI 512, but it's very good practice (and hopefully fun!).

In this exercise we will try to synthesize English text by "learning" from some input text, also known as a _corpus_. As an example, let's say the input text is the following, taken from the MDS website:

> Data is everywhere. Continuously generated and collected across every domain, it is a vast and largely untapped resource of information with the potential to reveal insights about every aspect of our lives and the world we live in. However, the ability to uncover these insights is a highly specialized skill possessed by far too few. 

Our algorithm involves a parameter, which we'll call $n$. Let me first explain the approach when $n=1$: 

- We will start with an initial character, say "y". There are 8 occurrences of "y" in the input text above. What character typically comes after "y"? It turns out (according to the input text above) the next letter is "w" the first time and " " (space character) all the other 7 times. So we estimate the conditional probability distribution of the next character, given that the current character is "y", to be:
    - P("w" after "y") = 1/8 (_the probability that "w" comes after "y" is 1/8_)
    - P(" " after "y") = 7/8
    - probability zero for all other characters
- To generate the next character, we generate a sample from this simple distribution. Say we pick " ", so we add a " " to our output text and it is now "y ". Now " " is our current character. To generate the next character, we'd need the probability distribution of what comes after " " so that we could sample from it. We'd repeat this until the output text reaches a pre-specified length.
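
The counts in this worked example can be checked with a few lines of Python. This is a minimal sketch for illustration only: the names `sample_text`, `counts` and `probs` are made up here, and the snippet looks at a single character rather than precomputing every distribution.

```python
from collections import Counter

# The example paragraph quoted above (hypothetical variable name).
sample_text = (
    "Data is everywhere. Continuously generated and collected across every "
    "domain, it is a vast and largely untapped resource of information with "
    "the potential to reveal insights about every aspect of our lives and the "
    "world we live in. However, the ability to uncover these insights is a "
    "highly specialized skill possessed by far too few."
)

# Count which character follows each occurrence of "y" (the n = 1 case).
counts = Counter(
    nxt for cur, nxt in zip(sample_text, sample_text[1:]) if cur == "y"
)

# Normalize the counts into an empirical conditional distribution.
total = sum(counts.values())
probs = {char: count / total for char, count in counts.items()}
print(probs)
```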

What about larger $n$? For $n=3$, we pick the next character by looking at the _preceding 3 characters_. We use the name [_n-gram_](https://en.wikipedia.org/wiki/N-gram) for a sequence of $n$ characters. Our method should work for any $n>0$.

For example, take our initial text to be the 3 characters "is ". There are 3 occurrences of this $n$-gram in the text. In this case, the next letter is "e" once and "a" twice, so we estimate the conditional distribution to be:
- P("e" after "is ") = 1/3
- P("a" after "is ") = 2/3

So we pick randomly from this distribution. Say we pick "e". Then our output text is now "is e" but our current $n$-gram is just "s e" because we're only using $n=3$. So to pick the next character after this, we'd look at what happens after occurrences of "s e". And so on.
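
The $n=3$ example can be sketched the same way. The helper `next_char_counts` below is hypothetical and exists only to check the numbers above; it scans the text for one $n$-gram, which is fine for a spot check but is not how the full model should be built.

```python
from collections import Counter

sample_text = (
    "Data is everywhere. Continuously generated and collected across every "
    "domain, it is a vast and largely untapped resource of information with "
    "the potential to reveal insights about every aspect of our lives and the "
    "world we live in. However, the ability to uncover these insights is a "
    "highly specialized skill possessed by far too few."
)
n = 3

def next_char_counts(text, ngram):
    """Count the characters that follow each occurrence of `ngram` in `text`."""
    k = len(ngram)
    return Counter(
        text[i + k]
        for i in range(len(text) - k)
        if text[i:i + k] == ngram
    )

counts = next_char_counts(sample_text, "is ")

# After appending a sampled character, only the last n characters
# form the current n-gram.
output = "is " + "e"
current_ngram = output[-n:]
print(counts, repr(current_ngram))
```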

In order to implement this idea efficiently, you will pre-compute the conditional probability distribution for every possible $n$-gram. To do that we need to count, for every possible $n$-gram, the frequencies of the possible next characters, and then normalize them into probability distributions.

*Attribution*: this exercise is adapted with permission from Princeton COS 126, [_Markov Model of Natural Language_](http://www.cs.princeton.edu/courses/archive/fall15/cos126/assignments/markov.html). The original assignment was developed by Bob Sedgewick and Kevin Wayne. If you are interested in more background info, you can take a look at the original version. The original paper by Shannon, [A Mathematical Theory of Communication](https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf), essentially created the field of information theory and is thought to be one of the best scientific papers ever written (in terms of both impact and readability).

In [4]:
# Grimms' Fairy Tales by Jacob and Wilhelm Grimm
data_url = 'http://www.gutenberg.org/files/2591/2591-0.txt'
corpus = urllib.request.urlopen(data_url).read().decode("utf-8")

# remove the first chunk of characters, which contains some header stuff
corpus = corpus[2820:]

In [5]:
print(corpus[:200])  # print out the first 200 characters

   THE STORY OF THE YOUTH WHO WENT FORTH TO LEARN WHAT FEAR WAS
     KING GRISLY-BEARD
     IRON HANS
     CAT-SKIN
     SNOW-WHITE AND ROSE-RED




THE BROTHERS GRIMM FAIRY TALES




TH


#### 6(a): implementation
rubric={accuracy:10,quality:10}

You will implement the above algorithm in a class called `MarkovModel`. Your class will have the following methods:

- `__init__`, which is already implemented for you.
- `fit`, which calculates and stores the _frequencies_ of all possible next characters given an $n$-gram. These frequencies should be stored in a `dict` of `dicts`, where the keys of the outer `dict` are the $n$-grams and the keys of the inner `dict` are the possible next characters, and the values of the inner `dict` are the frequencies (counts). Then, at the end of `fit`, normalize these frequencies into empirical probabilities and store them in `self.probabilities`.
**Note:** before starting the calculations, append the first $n$ characters of your corpus to the end of the corpus, making it "circular"; this will avoid a situation where your `generate` function might get stuck when your loop reaches the end of the corpus.
- `generate`, which creates a random text of a specified length by generating one character at a time from the appropriate (discrete) probability distribution. To perform the random sampling, use the parameter `p=` of `np.random.choice`. You can start the output text with the first $n$ characters of the input text.
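
As a rough illustration of the sampling step, the sketch below draws from a small dictionary-based distribution using the `p=` parameter of `np.random.choice`. The distribution is the hypothetical "y" example from earlier, and the seed is set only to make the sketch repeatable.

```python
import numpy as np

# Hypothetical distribution: the n = 1 example for the character "y".
probs = {"w": 1 / 8, " ": 7 / 8}

np.random.seed(0)  # seeded only so the sketch is repeatable
chars = list(probs.keys())
weights = list(probs.values())

# Draw a single next character according to the weights.
next_char = np.random.choice(chars, p=weights)

# Draw many characters at once to compare empirical frequencies with 7/8.
samples = np.random.choice(chars, size=10_000, p=weights)
space_fraction = np.mean(samples == " ")
print(repr(str(next_char)), space_fraction)
```

Because `dict` preserves insertion order, `list(probs.keys())` and `list(probs.values())` line up element by element.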

**Note:** you may find some of the fancy dictionaries in the [`collections`](https://docs.python.org/3.7/library/collections.html) package useful, namely `defaultdict` and/or `Counter`. However, you can also just use `dict`; either way is fine.
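
To show how these two classes combine, here is a small sketch on a made-up toy string with $n=1$. The string `toy` and the variable names are assumptions for illustration, and the text is not made circular here.

```python
from collections import Counter, defaultdict

toy = "abracadabra"  # hypothetical toy corpus, not circular

# Outer keys are 1-grams; inner Counters map next characters to counts.
counts = defaultdict(Counter)
for cur, nxt in zip(toy, toy[1:]):
    counts[cur][nxt] += 1  # missing keys are created on first access

# Normalize one inner Counter into probabilities.
total = sum(counts["a"].values())
probs_after_a = {char: c / total for char, c in counts["a"].items()}
print(dict(counts["a"]), probs_after_a)
```

With a plain `dict`, each of the two lookups would need an explicit check for a missing key before incrementing.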

**Hint:** if you find yourself searching for all occurrences of an $n$-gram in the text, you are approaching this incorrectly - in that case, ask us for help!

In [6]:
class MarkovModel:
    """A Markov model of languages based on character frequencies in text."""

    def __init__(self, n):
        self.n = n
        self.probabilities = None
        self.starting_chars = None

    def fit(self, text):
        """
        Fit a Markov model and create a transition matrix.

        Parameters
        ----------
        text : str
            a corpus of text 
        """
        
        # store the first n characters of the training text, as we will use these
        # to seed our `generate` function
        self.starting_chars = text[:self.n]
        
        # make text circular so Markov chain doesn't get stuck
        circ_text = text + text[:self.n]

        # Step 1: Compute frequencies
        # FILL IN THE REST OF THE CODE HERE
        # BEGIN SOLUTION
        # count the number of occurrences of each letter following a given n-gram
        frequencies = defaultdict(Counter)
        for i in range(len(text)): 
            ngram = circ_text[i:i+self.n]
            next_char = circ_text[i+self.n]
            frequencies[ngram][next_char] += 1
        # END SOLUTION
        
        # Step 2: Normalize the frequencies into probabilities
        # FILL IN THE REST OF THE CODE HERE
        # BEGIN SOLUTION
        self.probabilities = defaultdict(dict)
        for ngram, counts in frequencies.items():
            total_count = np.sum(list(counts.values()))
            for next_char, count in counts.items():
                self.probabilities[ngram][next_char] = count / total_count
        # END SOLUTION
        
    def generate(self, seq_len):
        """
        Generate a sequence of length seq_len using the Markov model learned in `fit`.

        Parameters
        ----------
        seq_len : int
            the desired length of the sequence

        Returns
        -------
        str
            the generated sequence
        """
        s = self.starting_chars
        while len(s) < seq_len:
            current_ngram = s[-self.n:]
            
            # FILL IN THE REST OF THE CODE HERE
            # BEGIN SOLUTION
            probs = self.probabilities[current_ngram]
            s += np.random.choice(list(probs.keys()), p=list(probs.values()))
            # END SOLUTION
        return s

Here are some tests that should pass if `fit` is implemented correctly:

In [7]:
mm = MarkovModel(n=2)
test_corpus = "2 + 2 = 4; 2 + 3 = 5; 3 + 3 = 9; 3 + 2 = 5;"
mm.fit(test_corpus)

assert mm.starting_chars == '2 '
assert mm.probabilities['2 ']['+'] == 1/2
assert mm.probabilities[' 3'][' '] == 1
assert mm.probabilities[';2'][' '] == 1

In [8]:
print(mm.generate(40))

2 + 2 = 9; 3 + 2 = 9; 3 + 2 + 3 + 2 + 2 


And here we run it on our fairy tales corpus:

In [9]:
mm = MarkovModel(n=5)
mm.fit(corpus)

In [10]:
print(mm.generate(200))

   THE HORSE



The princess, they were soon learn it to a deep good
so that the sea.
When the children an unluckily seemed as to cut with her, and enjoy in the better.’ But they crowd
plucked 


#### 6(b): fun with language models
rubric={reasoning:5}

1. Explain what happens as you increase $n$ from 1 to larger and larger values. At what point does it start to look like English? At what point is your model just memorizing the input corpus?

2. Generate some random sequences using the data set of your choice. Submit your favourite randomly generated sequence as well as the link to the data you used to generate it. If you are out of ideas, you may find some text files of popular books [here](http://www.gutenberg.org/).

1. As $n$ gets larger the output makes more sense. Around $n=5$ it looks like English. Around $n=15$ it looks like it's memorizing the training set.
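
One possible way to put a number on "memorizing" is to measure how many distinct $n$-grams have only a single observed next character; when that fraction approaches 1, generation mostly replays the corpus. The function `deterministic_fraction` and the toy string below are hypothetical, and the sketch assumes $n$ is smaller than the length of the text.

```python
from collections import defaultdict

def deterministic_fraction(text, n):
    """Fraction of distinct n-grams in `text` with exactly one observed next character."""
    successors = defaultdict(set)
    for i in range(len(text) - n):
        successors[text[i:i + n]].add(text[i + n])
    single = sum(len(s) == 1 for s in successors.values())
    return single / len(successors)

toy = "abracadabra"  # hypothetical toy corpus
for n in (1, 2, 3):
    print(n, deterministic_fraction(toy, n))
```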

In [11]:
## IGNORE ##
# 2.
model = MarkovModel(n=10)
model.fit(corpus)
print(model.generate(500))

   THE STORY OF THE YOUTH WHO WENT FORTH TO LEARN WHAT FEAR WAS
     KING GRISLY-BEARD


A great king of a land far away in time. Just as he was come round, they thought what had happened. ‘My stars!’ said he, ‘I have not bent one hair of mine.’ Then the fox
and the little
fish, and has a
little pointed mouth?’ ‘Yes,’ said the other; ‘let us follow the
prince thought it with me, and was as strong and well behaved, and scorned to ask what you can draw him to your grandmother is here, and 


#### (challenging) 6(c): time complexity of `fit`
rubric={reasoning:1}

For the above implementation, what is the (worst case) time complexity of running `fit` in terms of:

- $n$, the length of each $n$-gram
- the length of the corpus, which we'll call $N$
- the length of the sequence to generate, `seq_len`, which we'll call $T$

You can assume `np.random.choice` takes $O(1)$ time. You can also assume $n \ll N$ and $n \ll T$.

**Short answer / acceptable answer:** time complexity is $O(N)$ if we assume all the dictionary operations are $O(1)$ time. The first loop runs $N$ times, and the second loop visits each distinct ($n$-gram, next character) pair once; there cannot be more than $N$ such pairs, so the second loop does not change the bound.

**Optional assumption:** It is valid (but not required) to assume that inserting a key of size $k$ into a dictionary takes $O(k)$ time, because the key must be hashed, and hashing is a linear-time operation. This would yield $O(Nn)$ running time as we insert an $n$-sized ngram (at most) $N$ times in both loops.

**Optional, more precise answer:** Although the above is accurate, we can obtain a more precise time bound based on $n$ in some circumstances. For example, if $n=1$ then you can only have at most $c$ keys in the outer dictionary, where $c$ is the number of possible characters in your alphabet. (Note that we're mainly concerned with the outer dictionary here, because the inner dictionaries can only have at most $c$ keys each, which we can ignore as a constant factor.) In general, the number of possible $n$-grams given an alphabet size of $c$ is $c^n$. So, if $N$ is huge such that you are able to see all those $c^n$ different $n$-grams in the corpus, then you can expect $O(c^n)$ to appear in the time complexity. However, once $c^n\geq N$ then this effect disappears, because you cannot possibly have more than $N$ keys in your outer dictionary. Thus you could say the time complexity is $c^n$ if $c^n<N$ or $N$ otherwise.

This piecewise function can be succinctly written as $O(\min(c^n,N))$, where the min function returns the smaller of the two inputs. This expression is still not quite right, because we always need to do a loop over the corpus, which takes $O(N)$ time. That is, even if $c^n < N$, such that the worst case number of keys is $c^n$, we still have an $O(N)$ step. We can capture all this by saying the running time is $O(\min(c^n,N)+N)$. In other words, if $c^n < N$ then the running time is $O(c^n+N)$, and otherwise it's just $O(N)$. Finally, if we simply assume that $c$ is very large then we'll be in the $O(N)$ situation; in other words, we revert back to the "short answer" above.

Also: these are worst case running times, which assume a very un-repetitive text. The typical case might be different and depends on the corpus. As an extreme example, if the corpus is all the same letter, like "aaaaaaaaaaaa", then the dictionary will only have 1 element in it and `fit` will run super fast.
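
The bound on the number of outer-dictionary keys can be checked empirically. The sketch below counts distinct $n$-grams in the circular text and compares the count with $\min(c^n, N)$; the helper `distinct_ngrams` and the two toy strings are assumptions for illustration, and $c$ is taken to be the number of distinct characters actually seen in each string.

```python
def distinct_ngrams(text, n):
    """Number of distinct n-grams in the circular version of `text`."""
    circ_text = text + text[:n]
    return len({circ_text[i:i + n] for i in range(len(text))})

for text in ("abracadabra", "aaaaaaaaaaaa"):
    N = len(text)
    c = len(set(text))  # alphabet size observed in this text
    for n in (1, 2, 3):
        keys = distinct_ngrams(text, n)
        bound = min(c ** n, N)
        print(repr(text), n, keys, bound)
```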

#### (challenging) 6(d): time complexity of `generate`
rubric={reasoning:0.5}

For the above implementation, what is the (worst case) time complexity of running `generate` in terms of $n$, $N$, and $T$?

**Short answer / expected answer:** for `generate`, the time complexity is $O(T)$. 

**Optional assumption:** It is valid (but not required) to assume that inserting a key of size $k$ into a dictionary, or retrieving it, takes $O(k)$ time, because the key must be hashed, and hashing is a linear-time operation. This would yield $O(Tn)$ running time as we retrieve an $n$-sized ngram $T$ times.

**Longer answer:** if one wants to worry about string concatenation taking time proportional to the length of the string, then it might actually be $O(T^2)$. However, there are still ways to implement this in $O(T)$ time by preallocating an array of length $T$ and then filling it in character by character. So, while the particular implementation above might be slower, $O(T)$ is easily achievable if necessary. 
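
The preallocation idea could look roughly like the sketch below. The function `generate_linear` is a hypothetical standalone variant, not part of the `MarkovModel` class above; it assumes `seq_len` is at least the length of the starting text and that every $n$-gram reached during generation is a key of `probabilities`.

```python
import numpy as np

def generate_linear(probabilities, starting_chars, seq_len):
    """Generate text by filling a preallocated list, then joining once."""
    n = len(starting_chars)
    chars = [""] * seq_len        # preallocated output buffer of length T
    chars[:n] = starting_chars    # seed with the first n characters
    for i in range(n, seq_len):
        current_ngram = "".join(chars[i - n:i])  # O(n) work per step
        probs = probabilities[current_ngram]
        chars[i] = np.random.choice(list(probs.keys()), p=list(probs.values()))
    return "".join(chars)         # single join at the end

# Toy model in which "a" is always followed by "b" and vice versa.
toy_probs = {"a": {"b": 1.0}, "b": {"a": 1.0}}
result = generate_linear(toy_probs, "a", 10)
print(result)
```

Each step still builds an $n$-character key, so this matches the $O(Tn)$ bound under the hashing assumption and $O(T)$ when $n$ is treated as a constant.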

#### (challenging) 6(e): total time complexity
rubric={reasoning:0.5}

What is the total time complexity of running `fit` once and then `generate` once, in terms of $n$, $N$, and $T$?

To answer this question we need to add the time complexities from the answers above. Using the short answers of $O(N)$ for `fit` and $O(T)$ for `generate`, the overall time complexity is $O(N+T)$.

**Note:** Of course, if they provided one of the optional/longer answers to the questions above, then we should accept $O(F+G)$, where $F$ is their time complexity for `fit` and $G$ is their time complexity for `generate`.

#### (challenging) 6(f): space complexity of `fit`
rubric={reasoning:0.5}

What is the space complexity of `fit`?

`fit`: The short answer is that in the worst case we could have $N$ different $n$-grams in our dictionary, and each one takes up $O(n)$ storage, meaning a space complexity of $O(Nn)$. However, there is a longer story as with the parts above. $O(Nn)$ space complexity is only true if the $n$-grams are all different. We can try to bound the number by looking at the size of the alphabet, $c$. With only $c$ possible characters, we could only have $c^n$ possible $n$-grams, which may be smaller than $N$ if $c$ and/or $n$ is small. So we again have $O(\min(c^n,N)n)$.

#### (challenging) 6(g): space complexity of `generate`
rubric={reasoning:0.5}

What is the space complexity of `generate`?

`generate`: Space complexity is $O(T)$. 