<a href="https://colab.research.google.com/github/scskalicky/LING-226-vuw/blob/main/33_Word_Probabilities.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Word Probabilities and ngram models**
The application of word probabilities we are concerned about here has to do with *predicting* which word will come next given any one particular sequence of words – predicting the future! (well, not really...)

As you have hopefully come to learn, language is largely patterned, and corpora are very useful for finding and analysing these patterns. The word probabilities discussed in this notebook will also exploit the patterned nature of language. However, the word probabilities are going to be a bit different in that they use frequency statistics and chained probabilities (eek, math!) to predict words.


## **N-grams**
You have seen reference made to the term `n-grams`. I'm probably repeating myself from an earlier notebook, but as a reminder, to understand what this means, consider that a single word can also be called a `unigram`, a pair of words can be called a `bigram`, three is a `trigram`, and so on. So the "n" in `n-gram` can be any number, meaning an n-gram can be a sequence of words of any length (e.g., you could have a 100-gram). However, think about our work with corpus patterns – most of them are not this large! As NLTK explained a long time ago, bi-grams and tri-grams end up being very helpful for predicting future words.

There are two other books which I thought about using for this course and have informed this notebook. One is called [*Language and Computers*](https://www.wiley.com/en-us/Language+and+Computers-p-9781405183055), and the other is called [*Speech and Language Processing*](https://web.stanford.edu/~jurafsky/slp3/).

When discussing word probabilities, both texts use a similar approach to describing how ngram models are used in NLP (*L&C*, pp. 26-28; *SLP*, pp. 29-35). Each uses an uncompleted sentence to make the same point – you probably have a really good guess for which words come after other certain words. (*L&C* uses the example "I dreamed I saw the knights in...", whereas *SLP* uses "Please turn your homework...").

But, why do **you** have that knowledge, as a human? What does this say about the brain and language – is our language knowledge a series of statistical connections? Is the brain a computer? Keep this question in the back of your mind while you read on and complete this notebook.

## **Word probabilities for single tokens**

Let's start small by making sure we understand how to calculate probabilities. We just need to learn some new terms to unpack probabilities. You might have taken a statistics course that covered this concept, or this might be your first time. It should be just a light touch!

`P(A)` is a shorthand for representing the probability (or likelihood) that something (`A`) will occur. For word probabilities, we can think of this as telling us how likely a single token will occur in some larger context, such as the likelihood of randomly drawing a particular token from a document. Because in this scenario we assume all tokens would have an equally likely chance of being selected, we can rely on a fairly simple method for calculating our probability: counting.

If we want to predict the probability of randomly selecting a single token from a document we only have two possible outcomes: the randomly selected token either *is* or *is not* our target token. So, the probability of choosing any one token is based on the total number of tokens in the document/space we are searching. If I had a sentence of four different tokens, each token would have a 25% probability of being randomly chosen. Once we start repeating the same token in a text, the chances of randomly selecting tokens of that type increase.

We can apply this calculation to an entire text, keeping in mind the distinction between types and tokens. To calculate this, we count how many tokens for our target occur in the document/space, and divide that by the total space we are sampling from (i.e., the total number of tokens in a document).

Therefore, the probability of choosing a word (`P(A)`) is directly linked to the overall frequency of the tokens representing that word in a document.

For example, if the word `pretzels` occurs 100 times in a 1000 word document, the probability of choosing a random token from the document and that token being `pretzels` would be 100/1000, which is 0.1, or 10%.

So the `P(A)` for this example says there is a 10% likelihood to encounter the token `pretzels` if we randomly sampled one token from this hypothetical document.

Of course, that's not very useful yet. We want to make predictions about the presence of one token in light of *other* tokens. So, we need to modify the approach to take this information into account.

## **Changing the search space to other words**

We can calculate the probability for a word occurring after *other* words by changing a word's search space to be specific patterns of tokens (rather than an entire document). In the example above, our search space was every possible word in the document, which does not take into account any other information about where the word occurs.

However if we predefine a smaller context for a search space, such as a smaller sequence of tokens, we can then divide by only the number of times that sequence appears (rather than the size of an entire document).

Say for instance we wanted to know the probability of the word `"thirsty"` appearing after the words `"these pretzels are making me"`. To get this prediction, we need to count the total instances of the complete sentence `"these pretzels are making me thirsty"` (instead of counting the total number of times `thirsty` appears).

If I search for the phrase "`these pretzels are making me thirsty`" in Google, I get 69,500 hits. Let's pretend that's the total frequency of the phrase in a corpus.
<br>
><img src = https://i.imgur.com/9un805E.png>
<br>


We could then calculate the probability of `"thirsty"` appearing in this context by dividing that count by the total number of times the phrase occurs followed by *any* word, not just `"thirsty"`. In Google, I get 7,080,000 results for the phrase `"these pretzels are making me"` followed by any word (including "thirsty").

> <img src = https://i.imgur.com/9bDdbSw.png>

We then divide the frequency of the full phrase *with* our target word by the frequency of the phrase followed by any word. In this case it would be `69500/7080000`, which is slightly less than 1%. That might not seem like a lot, but a 1% predictive chance is likely a lot higher than any other word which can fit into that slot (and is aided by the frequency of this phrase being associated with a Seinfeld episode).

In [None]:
69500/7080000

## **Issues with using full sentences as contexts**

As is pointed out by the authors of *SLP*, this method brings with it a few challenges. Language is creative and even similar sentences might be phrased in slightly different ways, leading to a lot of different possible sentences that need to be counted. Language corpora cannot be guaranteed to contain all of these creative examples, and thus prediction will suffer. (Moreover, using Google like I did is not really a good method because we are not really sampling from a corpus of language – I mainly used it to represent the concept).   

The only reason my sentence above was so frequent is because it is a well known quote from *Seinfeld* and has apparently been repeated many, many times on the internet!

The solution to address this challenge is to avoid trying to calculate the probability of a word based on the entire sentence context, but instead calculate the individual probabilities of `n-grams` in a sentence, and then multiply these probabilities together as they appear in sentences. This means we want to calculate the combined probability of `"thirsty"` coming after `"me"`, as well as the probability of `"me"` coming after the word `"making"`, and so on. This smaller combined probability approximates using the entire sentence context.

In other words, this approach makes an assumption that the probabilities associated with the prior 1-2 words of a word are predictive enough to stand in for longer sentence contexts - which *SLP* explains is a **Markov assumption**.
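Written out as a formula, the bigram version of this approximation looks roughly like the following. This is only a sketch: the subscripted $w_1 \dots w_n$ notation is shorthand introduced here for the words of a sentence in order, and the product starts at the second word, so the probability of the very first word is left out (which matches the chain of bigram probabilities used later in this notebook).

$$
P(w_1 w_2 \dots w_n) \approx \prod_{k=2}^{n} P(w_k \mid w_{k-1})
$$

For the pretzels sentence, $n = 6$, so the product has five terms, one for each bigram in the sentence.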


## **Conditional probabilities and n-grams**
So, how can we go about doing this? Now we are more interested in how to predict a word *given a prior word context* that is limited to just one or two words. Let's just stick with one word for now.

We need to expand our probability notation to include the likelihood of something occurring when something *else* also occurs. This is the exact same as what we have done above, except we are changing our contexts of counting. Let's also add to our notation for probabilities.

`P(A|B)` is a shorthand for representing the probability (or likelihood) that something (A) will occur **if** something else occurs (B). This is called the [`conditional probability`](http://www.stat.yale.edu/Courses/1997-98/101/condprob.htm). If you clicked that link and read the examples (or remember from prior courses / reading), you'll see that conditional probability is just more counting and then some division.

In this approach, our *condition* is the appearance of a prior word, and we want to see how strongly that condition is associated with the occurrence of our target word.

For predicting a probability on a conditional event (such as word A occurring after word B), we divide the total count of the event occurring (such as word A occurring after word B) by the total number of times the event *could* occur (such as the word B occurring before another word) – this is what I did above with the pretzels example, except now we are looking at individual words rather than full sentences.


## **Let's calculate a conditional probability**

*SLP* gives the following formula to calculate the conditional probability of word co-occurrence:

`P(Wn|Wn-1) = c(Wn-1Wn)/c(Wn-1)`

Here is what the symbols mean:

`W = word`

`n = the index or identity of a word (think of it like a Python list index)`

`Wn-1 = the word occurring one index *before* the word we want to predict`

`c = the frequency count`

Let's break the formula down using the example of `"these pretzels"`. If we wanted to predict the likelihood of `"pretzels"` coming after the word `"these"`, the notation is thus:

```
# divide the total frequency of "these pretzels" by the total frequency of "these"
P("pretzels"|"these") = c("these pretzels")/c("these")
```

Hypothetically, if in a particular search space the bigram `"these pretzels"` occurred 20 times and `"these"` occurred 128 times, then:

`P("pretzels" | "these") = 20/128 = .156`

Our definition above says we should divide the total count of our target bigram (`these pretzels`) by the total number of times the prior word (`these`) could occur before *any word*, including our target word. So technically this value should be the number of bigrams involving the first word. As is pointed out by *SLP*, however, we can use the simple frequency of the word before our target word as the denominator.

 **[SLP](https://web.stanford.edu/~jurafsky/slp3/3.pdf) says "the reader should take a moment to be convinced by this" (p. 32)**

Are you? Think...can the total frequency of the number of times word B (`these`) potentially occurs before any other word ever be different than the total number of times word B occurs on its own?
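If you would rather check this empirically than take it on faith, here is a minimal sketch using only the standard library. It is an illustration under some assumptions: a plain `.split()` stands in for a real tokenizer, and the variable names are made up for this example. It compares how often each word occurs with how often it occurs as the *first* word of a bigram.

```python
from collections import Counter

# Toy text; a plain .split() stands in for a real tokenizer here.
tokens = "these pretzels are making me thirsty they are also making me tired".split()

# Pair each token with the token that follows it to build bigrams.
bigrams = list(zip(tokens, tokens[1:]))

# How often does each word occur on its own?
word_counts = Counter(tokens)
# How often does each word occur as the first word of a bigram?
first_word_counts = Counter(first for first, _ in bigrams)

for word in word_counts:
    print(word, word_counts[word], first_word_counts[word])
```

In this toy text the two counts should agree for every word except the very last token (`tired`), which has no following word and so never starts a bigram. In a large text that single mismatch is negligible.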



## **A longer example**
Let's work with more data and start writing some functions to calculate these probabilities for an input text.

Consider the following sentence:

>`"these pretzels are making me thirsty! They are also making me tired! I love pretzels because they make me thirsty!"`

Let's now consider the  probabilities of the word `"thirsty"`. In our sentence, `"thirsty"` occurs two times, and both times after the word `"me"`. We also see that `"me"` occurs three times, two times before `"thirsty"` but also one time before `"tired"`. Let's calculate some basic probabilities for these words.

### **Calculate single word probabilities**

We can write some quick code to:

1. Count the total number of words
2. Count the frequency of the words we are interested in (`"thirsty," "tired,"` and `"me"`)

In [None]:
# Create the data
# triple quote to break lines
text_input = """these pretzels are making me thirsty
They are also making me tired
I love pretzels because they make me thirsty"""

# split data into separate words
text_input_words = text_input.split()
print(text_input_words)

# total search space
len(text_input_words)

In [None]:
# Get frequency of each word and save to a dictionary
word_count = dict()

# here are our target words
target_words = ['thirsty', 'tired', 'me']

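# loop and add the freq to dict
for word in target_words:
  # using .count on the list of tokens to get frequency. Could also use FreqDist
  # (calling .count on the raw string would also match substrings inside longer words)
  word_count[word] = text_input_words.count(word)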

word_count

In [None]:
# what is the probability of each word, on its own?
for word in word_count:
  print(f'word: {word} | total frequency: {word_count[word]} | probability: {word_count[word]/len(text_input_words)}')

We know that `thirsty` occurs twice, both times after `me`, so there are two instances of our target bigram, `me thirsty`. We also know that `me` occurs three times in total, twice before `thirsty` and once before `tired`.

So `P(thirsty|me) = c("me thirsty")/c("me") = 2/3 = .67`

Within this search space, the word `thirsty` has a 10% chance of occurring overall, but a 67% chance of coming after `me`. With such a small sentence, this isn't very useful. We want to start increasing the search space and looking directly at bigrams.
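The same arithmetic can be checked in a few lines. This is a minimal sketch rather than the function built in the next section: it uses `.split()` instead of the NLTK tokenizer, and `conditional_probability` is a hypothetical helper name made up for this illustration.

```python
from collections import Counter

text_input = """these pretzels are making me thirsty
They are also making me tired
I love pretzels because they make me thirsty"""

# lowercase and split into tokens, then pair each token with the next one
tokens = text_input.lower().split()
bigrams = list(zip(tokens, tokens[1:]))

unigram_counts = Counter(tokens)
bigram_counts = Counter(bigrams)

def conditional_probability(word, prior):
    """P(word | prior) = c(prior word) / c(prior). Assumes prior occurs in the text."""
    return bigram_counts[(prior, word)] / unigram_counts[prior]

print(conditional_probability("thirsty", "me"))  # 2 of the 3 instances of "me"
print(conditional_probability("tired", "me"))    # 1 of the 3 instances of "me"
```

Note that the two probabilities for words following `me` add up to 1, because `thirsty` and `tired` are the only words that ever follow `me` in this text.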

### **Calculate bigram probabilities**

The next step is to expand this program to calculate the bigram probabilities of a text. Our program will take an input text and an input word, and then calculate all of the bigrams and their probabilities for that word given the input text.

Below, I define a function that takes in a raw string and tokenizes it using the NLTK tokenizer.

I then initialize empty dictionaries to store frequency values of each bigram including our target word (`bigram_counts`), as well as the frequency of the word which appears *before* our target word (`headword_frequencies`). I also create a third dictionary to store the calculated probabilities (`bigram_probabilities`).

I then loop through each word in the text, searching for our target word. If I find that word, I then check to see if there is a token before that word, and also whether that prior token is not punctuation. If so, I then add that prior token as the `headword`. I calculate the overall frequency of the `headword` and store it in a dictionary, but only if the headword is not already in the dictionary.

I then use the `.get()` method to update the frequency of that particular bigram in the `bigram_counts` dictionary (see the comments in the function for how this works).

Once the loop is complete, I then loop over each bigram and calculate the bigram probability by dividing the frequency of the bigram by the total frequency of the headword. I save this to a dictionary although that's probably not necessary (well, I'm sure most of this function is not necessary :))

I then use a set of print statements to output the information about the bigram and headword, so that you can see how the calculations are being made.

In [None]:
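# want to use nltk.word_tokenize()
import nltk
nltk.download('punkt')
# newer NLTK releases may look for 'punkt_tab' instead, so grab that too
nltk.download('punkt_tab')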

In [None]:
# calculate bigrams in a text for target words
def bigram_counter(text, target):
  """return probability of a target word for every bigram headword it is associated with"""

  # tokenize the lowered text.
  text_lower = nltk.word_tokenize(text.lower())

  #### initialize empty dictionaries to store frequencies/probabilities ####

  # frequency of target bigrams
  bigram_counts = dict()
  # frequency of the bigram headword
  headword_frequencies = dict()
  # store probabilities here
  bigram_probabilities = dict()

  ### calculate bigrams ###

  # loop through input text using enumerate, which provides an index
  for word_index, word in enumerate(text_lower):

    # find instances of target word, then extract headword
    if word == target and word_index != 0: # for all instances that are not sentence initial
      if text_lower[word_index-1].isalpha(): # this will avoid treating punctuation as headwords
        headword = text_lower[word_index-1]

      # add the total frequency of the headword to the dictionary if it doesn't yet exist.
        if headword not in headword_frequencies.keys():

          headword_frequencies[headword] = text_lower.count(headword)

      # update the bigram dictionary
      # .get() will return the value for a key in a dictionary,
      # if the key does not exist, it will return the default instead
      # so this says make an entry in the dictionary using the headword
      # if the entry does not exist, set it to the default, which is 0
      # if the entry does exist, grab the value
      # then add 1 to either the default or the value
      # (now you can see why defaultdict might be better)
        bigram_counts[(headword, word)] = bigram_counts.get((headword, word), 0) + 1

  # once the loop for the target word is complete,
  # calculate probability of word for each headword
  # print out the information
  print(f'total frequency of {target} is {text_lower.count(target)} \n')
  print(f'probabilities for bigrams with {target} are:\n')
  for bigram in bigram_counts:
    bigram_probabilities[bigram] = bigram_counts[bigram] / headword_frequencies[bigram[0]] # slicing just the headword
    print(f'total frequency of bigram {bigram}: {bigram_counts[bigram]}')
    print(f'total frequency of headword {bigram[0]} is {headword_frequencies[bigram[0]]}') # slicing just the headword
    # I use .upper() to make the words stand out
    print(f'probability of {target.upper()} after {bigram[0].upper()}: {bigram_probabilities[bigram]} \n')

Let's practice our function on Bill Clinton's 2000 State of the Union Address.

In [None]:
import nltk
nltk.download('state_union')
from nltk.corpus import state_union

In [None]:
clinton = state_union.raw('2000-Clinton.txt')

In [None]:
clinton

In [None]:
# let's pick three words to search for. what do you think about their bigram probabilities?
clinton_targets = ['americans', 'america', 'fellow']

for target in clinton_targets:
  bigram_counter(clinton, target)

## **Chaining probabilities**
Our function thus computes target bigrams for any one word and then gives us the probability of that word's occurrence after a particular headword. That's kinda cool, right? We probably should calculate sentence boundaries and maybe do this for each sentence rather than a whole text, but hopefully you get the idea of how this function works now. We could next write a function which does the reverse – takes in a target word and asks for the probability of all words that come *after* it, sort of a forward bigram counter.

However, instead of doing that, let's now think about how we can use this information to predict the probability of a longer string of words using bigram probabilities. We want to be able to predict the overall probability of a word after more than one word, say, to finish a complete sentence (such as the ones given at the start of this notebook).

As was teased above, we can do this by chaining together the individual bigram probabilities of all the words in a target string that come before a word. This means if we wanted to compute the likelihood of the sentence `'these pretzels are making me thirsty'`, we would calculate:

`P(pretzels|these) * P(are|pretzels) * P(making|are) * P(me|making) * P(thirsty|me)`

Can we create a function which will do this? We need to:

- read in a text and tokenize it
- accept a target phrase
- for each bigram in the phrase, find the probability of that bigram
- multiply the bigram probabilities to find the overall probability of the phrase
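
Before building this with NLTK, the four steps above can be sketched on the toy pretzels text using only the standard library. This is an illustration under some assumptions: `.split()` stands in for the tokenizer, punctuation has been left out of the toy text, and `chained_probability` is a hypothetical name made up for this example.

```python
from collections import Counter

# step 1: read in a (tiny) reference text and tokenize it
reference = ("these pretzels are making me thirsty they are also making me tired "
             "i love pretzels because they make me thirsty")
tokens = reference.split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def chained_probability(phrase):
    """Multiply P(word | previous word) across every bigram in the phrase."""
    # step 2: accept a target phrase and split it into bigrams
    words = phrase.lower().split()
    probability = 1.0
    for prior, word in zip(words, words[1:]):
        if unigram_counts[prior] == 0:
            return 0.0  # the prior word never occurs, so there is no evidence at all
        # steps 3 and 4: look up each bigram probability and multiply it in
        probability *= bigram_counts[(prior, word)] / unigram_counts[prior]
    return probability

print(chained_probability("these pretzels are making me thirsty"))
```

Working through it by hand, the five bigram probabilities are 1/1, 1/2, 1/2, 2/2, and 2/3, so the chained probability comes out to 1/6, or roughly 0.17.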

## **Phrase likelihood**

We can create a final function to do so. We did it the hard way above (or, at least I did...). This time, we'll use the NLTK bigrams function to much more easily count bigrams and calculate their probabilities. We *could* also use the `nltk.ConditionalFreqDist`, similar to [Chapter 2, Section 2.4 in NLTK](https://www.nltk.org/book/ch02.html) but maybe this function will make things a bit more transparent because we'll manually create the distribution. And, building functions which turn out to be redundant/inferior to built-in methods is one of the joys of programming ;)


- I'll start off by reading in a text and creating a tokenized version of the lowercased text.
- I'll then use the `nltk.bigrams()` function to create sets of bigrams for the entire text.
- I'll then create a dictionary containing *every* bigram in the text, and then return it (so you can see what's going on).

If you read through the bigrams, you'll see we have one entry for each bigram with the bigram probability. Notice that when a bigram repeats in the text, it still appears only once in the dictionary. For example, there is no second entry for the bigram `(of, the)` between `(state, of)` and `(the, union)`, because the bigram already appeared as the 8th key in the dictionary.

In [None]:
def phrase_likelihood(text):

  # create tokenized, lower case version of text and targets
  text_tokens = nltk.word_tokenize(text.lower())

  # create bigrams for text (could combine this with above for a one-liner)
  text_bigrams = [text_bigram for text_bigram in nltk.bigrams(text_tokens)]

  # dict to store the bigram probabilities
  bigram_pbs = dict()

  # use .count() to divide bigram count by headword count (note the slice to get just headword)
  for bigram in text_bigrams:
    if bigram[0].isalpha() and bigram[1].isalpha(): # let's avoid bigrams containing punctuation
      bigram_pbs[bigram] = text_bigrams.count(bigram)/text_tokens.count(bigram[0])

  return bigram_pbs

In [None]:
phrase_likelihood(clinton)

## **chained probabilities**
Now it is relatively straightforward to query our probability dictionary to create chained bigram probabilities from our reference corpus. I add a second argument to the function, `target`, which allows for a target phrase to be queried.

I then lowercase, tokenize, and create bigrams from the target phrase. I loop through these bigrams and multiply their values to get the overall chained probability of the phrase.

In [None]:
def phrase_likelihood(text, target):

  # create tokenized, lower case version of text and targets
  text_tokens = nltk.word_tokenize(text.lower())

  # create bigrams for text (could combine this with above for a one-liner)
  text_bigrams = [text_bigram for text_bigram in nltk.bigrams(text_tokens)]

  # dict to store the bigram probabilities
  bigram_pbs = dict()

  # use .count() to divide bigram count by headword count (note the slice to get just headword)
  for bigram in text_bigrams:
    if bigram[0].isalpha() and bigram[1].isalpha(): # let's avoid bigrams containing punctuation
      bigram_pbs[bigram] = text_bigrams.count(bigram)/text_tokens.count(bigram[0])

  # now create bigrams of our target phrase.
  target_tokens = nltk.word_tokenize(target.lower())
  target_bigrams = [target_bigram for target_bigram in nltk.bigrams(target_tokens)]

  # safety first
  if target_bigrams:
    # start the multiplication with the first probability
    pb = bigram_pbs[target_bigrams[0]]
    # start the loop with the second probability
    for bigram in target_bigrams[1:]:
      pb = pb * bigram_pbs[bigram]

    print(f'probability of phrase {target} is {pb}')

  else:
    print('ain\'t nothing here')

In [None]:
phrase_likelihood(clinton, 'my fellow americans')

Note that the function takes a few seconds to run, since we're redoing the bigram probabilities every time. Let's separate this out and do this process *once*, then query the result with our phrase to speed things up when querying new phrases.

I split the functions into two. `bigram_model` creates our probabilities and returns a dict.



In [None]:
# make our reference model
def bigram_model(text):
  """calculate probabilities for bigrams in a text"""
  # create tokenized, lower case version of text and targets
  text_tokens = nltk.word_tokenize(text.lower())

  # create bigrams for text (could combine this with above for a one-liner)
  text_bigrams = [text_bigram for text_bigram in nltk.bigrams(text_tokens)]

  # dict to store the bigram probabilities
  bigram_pbs = dict()

  # use .count() to divide bigram count by headword count (note the slice to get just headword)
  for bigram in text_bigrams:
    if bigram[0].isalpha() and bigram[1].isalpha(): # let's avoid bigrams containing punctuation
      bigram_pbs[bigram] = text_bigrams.count(bigram)/text_tokens.count(bigram[0])

  return bigram_pbs


`phrase_probability` takes a new argument, `model`, which should be an ngram dict. I've also changed the dictionary call to match the argument (i.e., `model`).

I've also tweaked the bigram loop to account for keys not in the model, using 0 instead. Remember what happens if we multiply by 0? We will get a total value of zero, which makes sense and would reflect that a phrase is *not* in the reference model. This of course means a model needs to be very large to be effective, much larger than a single state of the union speech. Are there other ways we might want to deal with phrases that have some missing bigrams?

In [None]:
def phrase_probability(model, target):
  """calculate chained probabilities of target phrase from reference model"""

  # create bigrams of our target phrase.
  target_tokens = nltk.word_tokenize(target.lower())
  target_bigrams = [target_bigram for target_bigram in nltk.bigrams(target_tokens)]


  if target_bigrams:
    print(target_bigrams)
    # start the multiplication with the first probability
    pb = model.get(target_bigrams[0], 0)
    # start the loop with the probability of the second bigram (if it exists)
    for bigram in target_bigrams[1:]:
      print(bigram)
      if bigram in model:
        pb = pb * model[bigram]
      else:
        pb = pb * 0

    print(f'probability of phrase {target} is {pb}')

In [None]:
# test our new functions out
clinton_bigram_probs = bigram_model(clinton)

Now that we've trained our model, we can query it. Holy crap..we've made an n-gram language model!!

In [None]:
phrase_probability(clinton_bigram_probs, 'my fellow americans')

In [None]:
phrase_probability(clinton_bigram_probs, 'my fellow romulans')
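
Multiplying by zero wipes out the whole phrase as soon as a single bigram is missing from the model. One common alternative, described in *SLP* but not used in this notebook, is add-one (Laplace) smoothing: pretend every possible bigram was seen one extra time, so nothing ever gets a probability of exactly zero. Here is a minimal sketch on a made-up six-word text. The name `smoothed_probability` is hypothetical, and the vocabulary size is simply the number of distinct words in the toy text (a fuller treatment would also reserve a slot for unknown words).

```python
from collections import Counter

# made-up reference text, just for illustration
tokens = "my fellow americans my fellow citizens".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

# V: number of distinct word types in the reference text
vocabulary_size = len(unigram_counts)

def smoothed_probability(word, prior):
    """Add-one estimate: (c(prior word) + 1) / (c(prior) + V)."""
    return (bigram_counts[(prior, word)] + 1) / (unigram_counts[prior] + vocabulary_size)

print(smoothed_probability("americans", "fellow"))  # seen bigram
print(smoothed_probability("romulans", "fellow"))   # unseen bigram: small, but not zero
```

The trade-off is that the seen bigrams give up some of their probability to the unseen ones, so `P(americans|fellow)` drops from 1/2 in the unsmoothed model to 2/6 here.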

I don't really recommend trying to train a larger model using brown or something like that. I tried and got bored of waiting after like 5 minutes. NPS chat is small enough to work though.
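
The waiting is mostly down to `.count()`: every call rescans the entire list, and `bigram_model` calls it twice for every bigram in the text, so the work grows roughly with the square of the text length. A possible way around this, sketched below under a few assumptions, is to count everything once with `collections.Counter`. The name `fast_bigram_model` is made up for this example, and it takes a list of tokens (for instance the output of `nltk.word_tokenize(text.lower())`) rather than a raw string, so that the sketch runs without NLTK.

```python
from collections import Counter

def fast_bigram_model(tokens):
    """Bigram probabilities from a token list, counting each item only once."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {
        bigram: count / unigram_counts[bigram[0]]
        for bigram, count in bigram_counts.items()
        # skip bigrams containing punctuation, like the functions above
        if bigram[0].isalpha() and bigram[1].isalpha()
    }

# tiny made-up example so the numbers are easy to check by hand
toy_tokens = "go to the shop , then go to bed".split()
toy_model = fast_bigram_model(toy_tokens)
print(toy_model[("go", "to")])   # "go" occurs twice, both times before "to"
print(toy_model[("to", "the")])  # "to" occurs twice, once before "the"
```

The returned dictionary has the same shape as the one from `bigram_model` (bigram tuples as keys, probabilities as values), so it should be usable with `phrase_probability` as is.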

In [None]:
nltk.download('book')
from nltk.book import nps_chat

In [None]:
nps_words = ' '.join([w for w in nps_chat.words()])

In [None]:
# this corpus has some rather interesting language in it...
nps_words[205:230]

In [None]:
# takes about a minute to run.
chat_ngrams = bigram_model(nps_words)

**longer phrases**

All right. Now, check out what happens to the probability as we expand our ngram into larger and larger phrases. The probability will decrease.

You're witnessing a method used to predict upcoming text. We could now write a function that takes a model and a phrase stem (such as 'please hand your homework...') and then finds the words with the highest overall phrase probability, eventually locating 'please hand your homework **in**' as the most likely combination, depending on which model we use. In this way, we could use this as a simplistic text prediction algorithm.
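
As a rough sketch of that idea: in a bigram model, every candidate completion shares the same chained probability for the stem itself, so ranking candidates by the full phrase probability gives the same order as ranking them by the final bigram alone. The function below does just that. The name `predict_next` and the toy model (with invented probabilities) are made up for this illustration; a dictionary returned by `bigram_model` has the same shape.

```python
def predict_next(model, stem, top_n=3):
    """Rank candidate next words by P(candidate | last word of the stem)."""
    # assumes the stem contains at least one word
    last_word = stem.lower().split()[-1]
    # collect every word the model has seen after the last word of the stem
    candidates = {
        bigram[1]: probability
        for bigram, probability in model.items()
        if bigram[0] == last_word
    }
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]

# a tiny hand-made model with invented probabilities, just to show the lookup
toy_model = {
    ("homework", "in"): 0.6,
    ("homework", "over"): 0.3,
    ("homework", "tomorrow"): 0.1,
    ("your", "homework"): 0.2,
}
print(predict_next(toy_model, "please hand your homework"))
```

If the last word of the stem never appears as a headword in the model, the function returns an empty list.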

Think back to some of the questions I opened the notebook with about how ***you*** can predict upcoming words. One argument is that your brain contains some form of probabilistic distribution of language patterns, which is why we are so good at predicting upcoming words. Whether this computational approximation using n-grams reflects the same process can't be fully proven (yet), but as language models get larger, and the methods for computing probabilities become more sophisticated, the performance of NLP models and predictive text applications continues to improve and to match the way humans can predict.

In [None]:
phrase_probability(chat_ngrams, 'wanna chat')

In [None]:
phrase_probability(chat_ngrams, 'wanna chat with')

In [None]:
phrase_probability(chat_ngrams, 'wanna chat with me')

In [None]:
phrase_probability(chat_ngrams, 'go to')

In [None]:
phrase_probability(chat_ngrams, 'go to the')

# Generating Text from Bigram Model

The models trained above include relative probability information, but we don't even need that level of information in order to create a model capable of generating text. This is demonstrated in Chapter 2 of the NLTK book, which has a section which shows how to use bigrams as a way to generate text. Let's look at this real fast to see how easy it is to build a text generation model. The first thing you need to do is create a list of bigrams from text.

In [None]:
import nltk
nltk.download('punkt')

In [None]:
!wget 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/sample-texts/marine_biologist.txt'

In [None]:
mb = open('marine_biologist.txt').read()

In [None]:
mb_bigrams = list(nltk.bigrams(nltk.word_tokenize(mb.lower())))

In [None]:
mb_bigrams[:5]

Here is the code NLTK provides to create the generative model. What is the code doing?

The arguments expect a *conditional* frequency distribution, a word, and a number. The function then loops as many times as specified by `num`. In the loop, the function prints the current value of `word`, which on the first pass is the word specified when running the function. Then, the value of `word` is overwritten with another word. Crucially, that word is whichever word has the highest frequency of occurring after the previous word. The `max()` is simply finding the entry with the highest frequency.

What is the result? Basically, you enter a word, the program finds the most frequent word that comes after that word, prints it, and repeats for the word that it just printed, until it has looped `num` times.

How effective is it?

In [None]:
# NLTK's word generation model.

def generate_model(cfdist, word, num = 15):
    for i in range(num):
        print(word, end = ' ')
        word = cfdist[word].max()

In [None]:
mb_cfd = nltk.ConditionalFreqDist(mb_bigrams)

In [None]:
generate_model(mb_cfd, 'well')

In [None]:
# hmm, what's happening here?
generate_model(mb_cfd, 'jerry')

It seems pretty good at first, but when we start asking for longer texts, the model just repeats itself. This is a problem with the function in that words will get stuck in loops, and it is a problem the NLTK book points out. But, this tiny little model contains the same basic idea that other generative text AI models use - predicting the next word based on information about the likelihood of all the other possible words.

In [None]:
generate_model(mb_cfd, 'george', num = 50)

Let's try it out on our entire corpus of Seinfeld episodes (14 episodes).

~~Download the [seinfeld data from here](https://github.com/scskalicky/LING-226-vuw/tree/main/other-data), unzip it, and put it in your Google Drive (a previous notebook already had you do this, so maybe you already have this?).~~

Or, just use `!wget` and `!unzip`



Below I read in the files from the corpus, and create a list where each value in the list is a tokenized text from the corpus.

In [None]:
!wget 'https://github.com/scskalicky/LING-226-vuw/raw/main/other-data/seinfeld.zip'

In [None]:
!unzip 'seinfeld.zip' -d 'seinfeld'

In [None]:
import glob

# this is where the text files are
# (the !unzip cell above extracts them into this folder)
root = '/content/seinfeld/'

# get a list of all the files
files = glob.glob(root + '*')

# start with an empty string
seinfeld = ''

# use a loop and append. can you turn this into a list comprehension?
for file in files:
  text =  open(file).read()
  seinfeld = seinfeld + ' ' + text.lower()
  # output the number of words
  print(len(seinfeld))

In [None]:
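# let's do some brute force preprocessing
import re
# a raw string (r'...') avoids invalid-escape warnings for \[ \( \{
# note: "-_ inside the brackets is read as a character range, not three symbols
seinfeld = re.sub(r'[.,;.!\'"-_\[\]\(\)\{\}1234567890]', repl = '', string = seinfeld)
len(seinfeld)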
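
A note on the pattern above: inside square brackets, a hyphen between two characters defines a range, so `"-_` matches every character from `"` to `_` in ASCII order rather than three separate symbols. That range includes the digits, `?`, `:`, and the uppercase letters, among others. The short check below runs on a made-up string (`sample_text` is just an illustrative name) and compares the range with an escaped, literal hyphen.

```python
import re

sample_text = 'who? me: yes-no a_b 42'

# '"-_' is read as a range: ", #, $, ..., digits, :, ;, ?, @, A-Z, ..., _
as_range = re.sub(r'["-_]', '', sample_text)

# escaping the hyphen makes it a literal character
as_literals = re.sub(r'["\-_]', '', sample_text)

print(repr(as_range))
print(repr(as_literals))
```

For brute force cleaning the wider range may be exactly what you want, but it is worth knowing which characters are really being removed.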

In [None]:
sf_bigrams = list(nltk.bigrams(nltk.word_tokenize(seinfeld)))
sf_cfdist = nltk.ConditionalFreqDist(sf_bigrams)

In [None]:
sf_bigrams[:10]

In [None]:
# very repetitive!
generate_model(sf_cfdist, "jerry")

We still have the problem identified in the NLTK book - at some point we get stuck in a loop. Their suggestion is to randomly sample from among the available words. Can we improve this? To randomly sample in Python, we can use the `sample()` function from the `random` module.

In [None]:
from random import sample
# sample 4 items from this list
sample([1,2,3,4,5,6,7,8,9], 4)

In [None]:
sample(list(sf_cfdist['hello']), 1)

In [None]:
def generate_model2(cfdist, word, num = 15):
    for i in range(num):
        print(word, end = ' ')
        # randomly sample from the available words
        # need to index the output with [0] since sample returns a list
        word = sample(list(cfdist[word]),1)[0]

The results are pretty interesting and funny - you can run this cell multiple times and get different text each time. But we seem to have gone too far in the other direction. The output is mostly gibberish. This is because words with low frequencies are just as likely to be chosen as words with high frequencies.

What can be done to address this? One solution might be to force the sample to choose from among the highest frequency options, rather than just the maximum.
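
Another option, shown here only as an aside and not used in the rest of this notebook, is to keep every candidate word but weight the draw by frequency, so that common followers are picked more often than rare ones. The sketch below uses `random.choices` from the standard library with a hand-made table of counts; `toy_followers` and `weighted_next_word` are illustrative names. The same idea should carry over to an entry of a `ConditionalFreqDist`, since each entry maps words to counts, but treat that as an assumption to check on your own data.

```python
import random
from collections import Counter

# made-up counts of the words seen after 'hello' in some corpus
toy_followers = Counter({'jerry': 8, 'newman': 3, 'elaine': 1})

def weighted_next_word(followers, rng=random):
    """Pick one follower, with probability proportional to its count."""
    words = list(followers.keys())
    weights = list(followers.values())
    return rng.choices(words, weights=weights, k=1)[0]

# draw many times to see that frequent followers dominate
rng = random.Random(226)  # fixed seed so the draws are repeatable
draws = Counter(weighted_next_word(toy_followers, rng) for _ in range(1000))
print(draws.most_common())
```

Rare words can still appear with this approach, just less often, which sits somewhere between always taking the maximum and sampling uniformly.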



In [None]:
generate_model2(sf_cfdist, 'jerry')

We can see that each frequency distribution inside a `ConditionalFreqDist` is ordered by frequency, from most to least frequent, so we can just grab the first n words from the list and randomly sample from those.

In [None]:
sf_cfdist['jerry']

In [None]:
# run this cell a few times to get different results
sample(list(sf_cfdist['jerry'])[:5], 1)

In [None]:
# and, slicing longer than the list is not an issue
[1,2,3][:10]

Let's add this functionality to our `generate_model` function. We will add a new argument which allows the user to set the total range for word samples, with a default of `5`.

In [None]:
def generate_model3(cfdist, word, num = 15, samples = 5):
    for i in range(num):
        print(word, end = ' ')

        word = sample(list(cfdist[word])[:samples],1)[0]

Try out the new function - is it better? Yeah, it should be. Play with the number of samples and length of the output to see the different effects it has on the generated text.

In [None]:
generate_model3(sf_cfdist, 'well', samples = 3)

In [None]:
generate_model3(sf_cfdist, 'hello', num = 25, samples = 5)

We can see that for this data, the "i don't know" loop occurs when we use `max()` or, as here, a sample size of 1.

In [None]:
generate_model3(sf_cfdist, 'the', samples = 1)

So keeping the sample above 1 helps create much more varied text.

In [None]:
generate_model3(sf_cfdist, 'the', samples = 10)

## **Use trigrams instead of bigrams**

How else could this model be improved? We could use probabilities instead of frequencies, and we could train on longer sequences of words, such as trigrams, in order to get even more natural looking output. The `nltk.trigrams()` function works the same way as `nltk.bigrams()`.
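
As a brief aside on the first of those ideas: turning frequencies into probabilities just means dividing each count by the total for that condition, so the values for all the words that follow a given word sum to 1. The sketch below does this by hand for a made-up table of counts; `toy_followers` and `to_probabilities` are illustrative names rather than anything from NLTK.

```python
from collections import Counter

# made-up counts of the words seen after 'the'
toy_followers = Counter({'car': 6, 'soup': 3, 'puffy': 1})

def to_probabilities(followers):
    """Convert raw counts into relative frequencies that sum to 1."""
    total = sum(followers.values())
    return {word: count / total for word, count in followers.items()}

probs = to_probabilities(toy_followers)
for word, p in probs.items():
    print(f"P({word} | the) = {p:.2f}")
```

The rest of this section follows the second idea and moves from bigrams to trigrams.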

In [None]:
# create trigrams from same data
sf_trigrams = list(nltk.trigrams(nltk.word_tokenize(seinfeld)))
sf_trigrams[:10]

However, we cannot give `nltk.ConditionalFreqDist()` a list of trigrams directly, because it expects pairs in the form `(condition, sample)`. So we can reshape each trigram into a nested tuple of the form `(word1, (word2, word3))`, and then pass that to a `ConditionalFreqDist`.

In [None]:
sf_trigrams2 = [(trigram[0], (trigram[1], trigram[2])) for trigram in sf_trigrams]
sf_trigram_cfdist = nltk.ConditionalFreqDist(sf_trigrams2)
# preview the reshaped trigrams (only the last line of a cell is displayed)
sf_trigrams2[:10]

The results now show the frequency of two words after any one word.

In [None]:
sf_trigram_cfdist['jerry']

We also have to modify the `generate_model()` function to account for the longer stretches of words. Below, the function still starts by printing out the word given to the function. It then samples one of the two-word continuations, using the same logic as before. The result of the sample is a list containing a single tuple of two words, `[(word1, word2)]`, so a second print line prints the first of these words (using a double index: `[0]` to first go inside the list, then `[0]` to select the first item in the tuple). Instead of printing the second word in the tuple, that word is assigned to `word`, allowing the loop to repeat with new words.

This means that words are now added in pairs that actually occurred together after the current word, rather than one word at a time. Each choice still depends on a single preceding word, though.

In [None]:
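def generate_model4(cfdist, word, num = 15, samples = 5):
    for i in range(num):
        print(word, end = ' ')
        # sample() returns a list holding one (word2, word3) tuple
        new_words = sample(list(cfdist[word])[:samples],1)
        # print the first word of the pair
        print(new_words[0][0], end = ' ')
        # the second word of the pair starts the next pass through the loop
        word = new_words[0][1]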

How does it fare? Run the cells a few times to see the effects of input word + sample size on the quality of text generated.

In [None]:
# we still run into the looping problem when we go with the most frequent pairs
generate_model4(sf_trigram_cfdist, 'george', 15, samples = 1)

In [None]:
# sort of better?
generate_model4(sf_trigram_cfdist, 'parking', 15, samples = 5)

In [None]:
# what happens when we choose a function word and a large sample size?
generate_model4(sf_trigram_cfdist, 'the', 15, samples = 20)

In [None]:
# giving it a name has different effects
generate_model4(sf_trigram_cfdist, 'elaine', 15, samples = 20)
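
As an aside, a fuller trigram model would condition each new word on the two words before it, whereas `generate_model4` looks up a single word and adds a pair. The sketch below shows one way that could look, using only the standard library and a made-up sentence. The names `build_context_counts` and `generate_from_context` are hypothetical, and a dictionary of `Counter` objects again stands in for a `ConditionalFreqDist`. Each trigram `(w1, w2, w3)` is stored under the key `(w1, w2)`, and the top-n sampling idea from `generate_model3` is reused.

```python
import random
from collections import Counter, defaultdict

def build_context_counts(tokens):
    """Map each two-word context (w1, w2) to a Counter of the words that followed it."""
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        counts[(w1, w2)][w3] += 1
    return counts

def generate_from_context(counts, w1, w2, num=10, samples=3, rng=random):
    """Generate text where each new word depends on the previous two words."""
    output = [w1, w2]
    for _ in range(num):
        followers = counts.get((w1, w2))
        if not followers:
            break  # this context never occurred, so there is nothing to sample
        # keep the `samples` most frequent followers, then pick one at random
        candidates = [word for word, _ in followers.most_common(samples)]
        w1, w2 = w2, rng.choice(candidates)
        output.append(w2)
    return output

toy_tokens = "i do not know what you do not know about what you do".split()
toy_counts = build_context_counts(toy_tokens)
print(' '.join(generate_from_context(toy_counts, 'do', 'not', rng=random.Random(226))))
```

With a two-word context, every three-word window in the output is a sequence that really occurred in the training text. The trade-off is that many contexts have very few possible followers, so a small corpus will tend to reproduce its source closely.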

### **Conclusion**

You can see that word distributions on their own are actually very informative for NLP. Using word probabilities, we can calculate the likelihood of certain words occurring in sequence. Using simple frequencies of two and three word stretches (bigrams and trigrams), we can even create a simple text generation function. Combining these ideas of word probabilities and generative text output is a natural next step, and is where generative AI has gone in more recent years. The algorithms for doing so are better than the simple bigram, trigram, and word probability scores in this notebook, but the underlying logic is still quite similar.