# Bigram Language Model

In the following series of tasks, you will work with 2-grams, or bigrams, as they are more commonly called. The objective is to create a function that calculates the probability that a particular sentence could occur in a corpus of text, based on the probabilities of its component bigrams. We'll do this in stages:

* Task 1 - Extract tokens and bigrams from a sentence
* Task 2 - Calculate probabilities for bigrams
* Task 3 - Calculate the log probability of a given sentence based on a corpus of text using bigrams

**Assumptions and terminology**

We will assume that the text data is in the form of sentences with no punctuation, one sentence per line. To each sentence we will add a start-of-sentence token, `<s>`, and an end-of-sentence token, `</s>`. For example, the sentence "I love language models." will appear in code as:

```python
'I love language models'
```

The tokens for this sentence are represented as an ordered list of the lower case words plus the start and end sentence tags:

```python
tokens = ['<s>', 'i', 'love', 'language', 'models', '</s>']
```

The bigrams for this sentence are represented as a list of lower case ordered pairs of tokens:

```python
bigrams = [('<s>', 'i'), ('i', 'love'), ('love', 'language'), ('language', 'models'), ('models', '</s>')]
```

## 1. Bigrams

In the task below, write a function that returns a list of `tokens` (i.e., words plus the sentence tags) and a list of `bigrams` for a given sentence. You will first need to split the sentence into a list of words, then add a `<s>` token to the start of the list and a `</s>` token to the end to mark the sentence boundaries.

Your final lists should be in the format shown above and called out in the function docstring.

In [11]:
def sentence_to_bigrams(sentence):
    """
    Add start '<s>' and stop '</s>' tags to the sentence and tokenize it into a list
    of lower-case words (sentence_tokens) and bigrams (sentence_bigrams)
    :param sentence: string
    :return: list, list
        sentence_tokens: ordered list of words found in the sentence
        sentence_bigrams: a list of ordered two-word tuples found in the sentence
    """
    sentence_tokens = ['<s>'] + sentence.lower().split() + ['</s>']
    sentence_bigrams = []
    for index in range(len(sentence_tokens) - 1):
        sentence_bigrams.append((sentence_tokens[index], sentence_tokens[index+1]))
    return sentence_tokens, sentence_bigrams

In [13]:
test_sentences = [
    'the old man spoke to me',
    'me to spoke man old the',
    'old man me old man me',
]

for sentence in test_sentences:
    print('\n*** Sentence: "{}"'.format(sentence))
    t, b = sentence_to_bigrams(sentence)
    print('tokens = {}'.format(t))
    print('bigrams = {}'.format(b))


*** Sentence: "the old man spoke to me"
tokens = ['<s>', 'the', 'old', 'man', 'spoke', 'to', 'me', '</s>']
bigrams = [('<s>', 'the'), ('the', 'old'), ('old', 'man'), ('man', 'spoke'), ('spoke', 'to'), ('to', 'me'), ('me', '</s>')]

*** Sentence: "me to spoke man old the"
tokens = ['<s>', 'me', 'to', 'spoke', 'man', 'old', 'the', '</s>']
bigrams = [('<s>', 'me'), ('me', 'to'), ('to', 'spoke'), ('spoke', 'man'), ('man', 'old'), ('old', 'the'), ('the', '</s>')]

*** Sentence: "old man me old man me"
tokens = ['<s>', 'old', 'man', 'me', 'old', 'man', 'me', '</s>']
bigrams = [('<s>', 'old'), ('old', 'man'), ('man', 'me'), ('me', 'old'), ('old', 'man'), ('man', 'me'), ('me', '</s>')]
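
The index loop in `sentence_to_bigrams` can also be written by zipping the token list with a copy of itself shifted by one position. The sketch below is an alternative formulation for comparison, not the notebook's own solution; the helper name `tokens_to_bigrams` is hypothetical.

```python
def tokens_to_bigrams(tokens):
    """Pair each token with its successor (hypothetical helper for illustration)."""
    # zip stops at the shorter sequence, so the final token only ever
    # appears as the second element of the last pair.
    return list(zip(tokens, tokens[1:]))


tokens = ['<s>'] + 'the old man spoke to me'.lower().split() + ['</s>']
bigrams = tokens_to_bigrams(tokens)
print(bigrams[:3])  # [('<s>', 'the'), ('the', 'old'), ('old', 'man')]
```

A sentence with n words yields n + 2 tokens and n + 1 bigrams, which matches the outputs printed above.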


## 2. Probabilities and Likelihoods with Bigrams

Recall from a previous video that the probability of a series of words can be calculated from the chained probabilities of its history:

$$P(w_1w_2...w_n)=\prod_i P(w_i|w_1w_2...w_{i-1})$$

The probabilities of sequence occurrences in a large textual corpus can be calculated this way and used as a language model to add grammar and contextual knowledge to a speech recognition system. However, there is a prohibitively large number of calculations for all the possible sequences of varying length in a large textual corpus.

To address this problem, we use the Markov Assumption to approximate a sequence probability with a shorter sequence:

$$P(w_1w_2...w_n)\approx \prod_i P(w_i|w_{i-k}...w_{i-1})$$
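
As a worked instance of this approximation, take a history of a single word and apply it to the example sentence from the terminology section. The sentence probability then reduces to one factor per bigram in its bigram list:

$$P(\langle s\rangle\ \text{i love language models}\ \langle /s\rangle)\approx P(\text{i}|\langle s\rangle)\,P(\text{love}|\text{i})\,P(\text{language}|\text{love})\,P(\text{models}|\text{language})\,P(\langle /s\rangle|\text{models})$$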

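For bigrams, $k=1$, so each word is conditioned only on the word immediately before it. We can estimate these probabilities from counts of the bigrams and of the individual tokens, where $c(\cdot)$ is the number of occurrences in the corpus: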

$$P(w_i|w_{i-1})=\frac{c(w_{i-1},w_i)}{c(w_{i-1})}$$

In Python, the [Counter](https://docs.python.org/2/library/collections.html) class is useful for this task:

In [14]:
from collections import Counter

# Sentence: "i am as i am"
tokens = ['<s>', 'i', 'am', 'as', 'i', 'am', '</s>']
bigrams = [('<s>', 'i'), ('i', 'am'), ('am', 'as'), ('as', 'i'), ('i', 'am'), ('am', '</s>')]

In [15]:
token_counts = Counter(tokens)
token_counts

Counter({'</s>': 1, '<s>': 1, 'am': 2, 'as': 1, 'i': 2})

In [16]:
bigram_counts = Counter(bigrams)
bigram_counts

Counter({('<s>', 'i'): 1,
         ('am', '</s>'): 1,
         ('am', 'as'): 1,
         ('as', 'i'): 1,
         ('i', 'am'): 2})

In [19]:
bigram = ('i', 'am')
# P('am' | 'i')
P = bigram_counts[bigram] / token_counts[bigram[0]]
print(P)

1.0


In [20]:
def bigrams_from_transcript(filename):
    """
    Read a file of sentences, adding start '<s>' and stop '</s>' tags; tokenize it into a list of lower-case words
    and bigrams
    :param filename: string
        filename: path to a text file consisting of lines of non-punctuated text; assume one sentence per line
    :return: list, list
        tokens: ordered list of words found in the file
        bigrams: a list of ordered two-word tuples found in the file
    """
    tokens = []
    bigrams = []
    with open(filename, 'r') as f:
        for line in f:
            line_tokens, line_bigrams = sentence_to_bigrams(line)
            tokens.extend(line_tokens)
            bigrams.extend(line_bigrams)
    return tokens, bigrams

In [22]:
tokens, bigrams = bigrams_from_transcript('transcripts.txt')

Write a function that returns a probability dictionary when given lists of tokens and bigrams.

In [25]:
def bigram_mle(tokens, bigrams):
    """
    provide a dictionary of probabilities for all bigrams in a corpus of text
    the calculation is based on maximum likelihood estimation and does not include
    any smoothing.  A tag '<unk>' has been added for unknown probabilities.
    :param tokens: list
        tokens: list of all tokens in the corpus
    :param bigrams: list
        bigrams: list of all two word tuples in the corpus
    :return: dict
        bg_mle_dict: a dictionary of bigrams:
            key: tuple of two bigram words, in order OR <unk> key
            value: float probability

    """
    token_counts = Counter(tokens)
    bigram_counts = Counter(bigrams)
    bg_mle_dict = {}
    for key, val in bigram_counts.items():
        bg_mle_dict[key] = bigram_counts[key] / token_counts[key[0]]
    bg_mle_dict['<unk>'] = 0.
    return bg_mle_dict

In [29]:
test_sentences = [
    'the old man spoke to me',
    'me to spoke man old the',
    'old man me old man me',
]

bg_dict = bigram_mle(tokens, bigrams)
print("Probability bigram dictionary:")
# print(bg_dict)

Probability bigram dictionary:
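
One way to sanity-check an MLE dictionary like this is to confirm that, for every first word, the probabilities of all observed next words sum to 1. The sketch below does this on a small toy corpus, which is an assumption made for illustration (it is not `transcripts.txt`), and rebuilds the probabilities inline so that it runs on its own.

```python
import math
from collections import Counter, defaultdict

# Toy corpus (assumed for illustration): two already-tokenized sentences.
tokens = ['<s>', 'i', 'am', 'as', 'i', 'am', '</s>',
          '<s>', 'i', 'think', '</s>']
bigrams = [('<s>', 'i'), ('i', 'am'), ('am', 'as'), ('as', 'i'),
           ('i', 'am'), ('am', '</s>'),
           ('<s>', 'i'), ('i', 'think'), ('think', '</s>')]

token_counts = Counter(tokens)
bigram_counts = Counter(bigrams)

# Same estimate as the formula above: c(w_prev, w) / c(w_prev).
mle = {bg: count / token_counts[bg[0]] for bg, count in bigram_counts.items()}

# Accumulate P(next | first) over every observed next word.
totals = defaultdict(float)
for (first, _), prob in mle.items():
    totals[first] += prob

for first, total in totals.items():
    print(first, math.isclose(total, 1.0))
```

The end tag `</s>` never appears as a first word because bigrams are built one sentence at a time, so it has no entry in `totals`.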


## 3. Smoothing and logs

There are still a couple of problems to sort out before we use the bigram probability dictionary to calculate the probabilities of new sentences:

1. Some possible combinations may not exist in our probability dictionary but are still possible. We don't want to multiply in a probability of 0 just because our original corpus was deficient. This is solved through "smoothing". There are a number of methods for this, but a simple one is [Laplace smoothing](https://en.wikipedia.org/wiki/Additive_smoothing) with the "add-one" estimate, where $V$ is the size of the vocabulary of the corpus, i.e. the number of unique tokens:

$$P_{add1}(w_i|w_{i-1})=\frac{c(w_{i-1},w_i)+1}{c(w_{i-1})+V}$$

2. Repeated multiplications of small probabilities can cause underflow problems in computers when the values become too small. To solve this, we will calculate all probabilities in log space. Working in log space has the added advantage that multiplying probabilities becomes adding their logs:

$$\log(p_1\times p_2\times p_3\times p_4) = \log p_1 + \log p_2 + \log p_3 + \log p_4$$
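
To make the underflow point concrete, the sketch below multiplies an assumed per-bigram probability of `1e-5` one hundred times. Both numbers are arbitrary illustrations, not values taken from the corpus. The running product collapses to `0.0` in double precision, while the sum of logs remains an ordinary finite number.

```python
import math

p = 1e-5   # assumed per-bigram probability (illustrative only)
n = 100    # assumed number of bigrams (illustrative only)

# Direct multiplication: the true value 1e-500 is far below the smallest
# representable double, so the product underflows to exactly 0.0.
product = 1.0
for _ in range(n):
    product *= p

# Log space: add the logs instead of multiplying the probabilities.
log_sum = sum(math.log(p) for _ in range(n))

print(product)   # 0.0
print(log_sum)   # roughly -1151.29
```

Once the product has reached `0.0`, every sentence looks equally impossible, whereas log probabilities can still be compared with each other.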

In the following quiz, the function `bigram_add1_logs` generates bigram log probabilities with Laplace smoothing. Write a function that calculates the log probability for a given sentence, using this log probability dictionary. If all goes well, you should observe that more likely sentences yield higher values for the log probabilities.

In [36]:
import numpy as np

def bigram_add1_logs(tokens, bigrams):
    """
    provide a smoothed log probability dictionary 
    :param tokens: list
        tokens: list of all tokens in the corpus
    :param bigrams: list
        bigrams: list of all two word tuples in the corpus
    :return: dict
        bg_add1_log_dict: dictionary of smoothed bigrams log probabilities including
        tags: <s>: start of sentence, </s>: end of sentence, <unk>: unknown placeholder probability
    """
    
    token_counts = Counter(tokens)
    bigram_counts = Counter(bigrams)
    vocab_size = len(token_counts)
    bg_add1_dict = {}
    for key, val in bigram_counts.items():
        bg_add1_dict[key] = np.log((bigram_counts[key] + 1)/ (token_counts[key[0]] + vocab_size))
    bg_add1_dict['<unk>'] = np.log(1. / vocab_size)
    return bg_add1_dict
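
As a small worked example of the add-one formula, the sketch below applies it to the sentence "i am as i am" used earlier. The helper `add1_prob` is hypothetical and exists only for this illustration. Note that `bigram_add1_logs` stores one shared `<unk>` value of log(1/V) for every unseen bigram, whereas the formula gives an unseen bigram the probability 1/(c(w_{i-1}) + V), which depends on the preceding word; the sketch computes the latter directly.

```python
import math
from collections import Counter

tokens = ['<s>', 'i', 'am', 'as', 'i', 'am', '</s>']
bigrams = list(zip(tokens, tokens[1:]))

token_counts = Counter(tokens)
bigram_counts = Counter(bigrams)
V = len(token_counts)  # 5 unique tokens, including the two tags


def add1_prob(prev, word):
    """Add-one estimate of P(word | prev); hypothetical helper for illustration."""
    # Counter returns 0 for a bigram that never occurred.
    return (bigram_counts[(prev, word)] + 1) / (token_counts[prev] + V)


seen = add1_prob('i', 'am')     # (2 + 1) / (2 + 5) = 3/7
unseen = add1_prob('i', 'as')   # (0 + 1) / (2 + 5) = 1/7

# Smoothed probabilities over the whole vocabulary still sum to 1.
total = sum(add1_prob('i', w) for w in token_counts)
print(math.log(seen), math.log(unseen), total)
```

Smoothing moves probability mass away from the observed bigram: the unsmoothed estimate for `('i', 'am')` is 1.0, while the smoothed one is 3/7.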

In [39]:
def log_prob_of_sentence(sentence, bigram_log_dict):
    # get the sentence bigrams with sentence_to_bigrams
    # look up the bigrams from the sentence in the bigram_log_dict
    # add all the log probabilities together
    # if a bigram doesn't exist, be sure to use the value of the '<unk>' lookup instead
    
    tokens, bigrams = sentence_to_bigrams(sentence)
    
    total_log_prob = 0.
    for bg in bigrams:
        if bg in bigram_log_dict:
            total_log_prob += bigram_log_dict[bg]
        else:
            total_log_prob += bigram_log_dict['<unk>']
    return total_log_prob

In [40]:
test_sentences = [
    'the old man spoke to me',
    'me to spoke man old the',
    'old man me old man me',
]

bigram_log_dict = bigram_add1_logs(tokens, bigrams)
for sentence in test_sentences:
    print('*** "{}"'.format(sentence))
    print(log_prob_of_sentence(sentence, bigram_log_dict))

*** "the old man spoke to me"
-34.80495531345013
*** "me to spoke man old the"
-39.34280606002005
*** "old man me old man me"
-36.59899481268447


## 4. Bigram Conditional Probability

In the functions `bigram_mle` and `bigram_add1_logs`, we use the whole bigram (as a tuple) as the key to look up its conditional probability. However, some applications require us to predict the next word given the current word (or phrase). A given word may be followed by several possible next words, each with its own probability. We therefore index first by the current word and then by the next word to access these probabilities.

In [47]:
from collections import defaultdict

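def bigram_conditional_prob(tokens, bigrams):
    """
    provide a nested dictionary of MLE bigram probabilities (no smoothing)
    :param tokens: list of all tokens in the corpus
    :param bigrams: list of all two word tuples in the corpus
    :return: dict
        bg_cond_dict[word][next_word]: float probability P(next_word | word)
    """
    bg_cond_dict = defaultdict(dict)
    token_counts = Counter(tokens)
    bigram_counts = Counter(bigrams)
    for (word, next_word), count in bigram_counts.items():
        bg_cond_dict[word][next_word] = count / token_counts[word]
        # placeholder probability for an unseen next word
        bg_cond_dict[word]['<unk>'] = 0
    # placeholder for an unseen current word
    bg_cond_dict['<unk>'] = 0
    return bg_cond_dict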
    

In [50]:
bigram_cond_dict = bigram_conditional_prob(tokens, bigrams)
print(bigram_cond_dict['will'])

{'excuse': 0.2, 'bring': 0.2, '<unk>': 0, 'be': 0.2, 'you': 0.2, 'oblige': 0.2}
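
With probabilities indexed this way, predicting the next word amounts to picking the entry with the highest probability under the current word. The sketch below shows one possible approach on a toy sentence; the helper `most_likely_next` is hypothetical, and the nested dictionary is rebuilt inline (without the `<unk>` placeholders) so that the block runs on its own.

```python
from collections import Counter, defaultdict

tokens = ['<s>', 'i', 'am', 'as', 'i', 'am', '</s>']
bigrams = list(zip(tokens, tokens[1:]))

token_counts = Counter(tokens)
cond = defaultdict(dict)
for (prev, word), count in Counter(bigrams).items():
    cond[prev][word] = count / token_counts[prev]


def most_likely_next(word, cond_dict):
    """Return the highest-probability successor of `word`, or None if unseen.

    Hypothetical helper for illustration; ties go to the first entry seen.
    """
    # .get avoids creating an empty entry in the defaultdict for unseen words.
    candidates = cond_dict.get(word)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


print(most_likely_next('i', cond))      # am
print(most_likely_next('zebra', cond))  # None
```

When several next words share the same probability, as in the `'will'` output above, `max` simply returns the first of them, so a real application would need its own tie-breaking rule.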


## References

https://github.com/udacity/AIND-VUI-quizzes

https://github.com/oucler/NLND-End2End-Speech-Recognition