<a href="https://colab.research.google.com/github/simulate111/Introduction-to-Human-Language-Technology/blob/main/ex9_intro_to_hlt_2023_solved_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Watch out: with n=5 this notebook stretches Colab memory, so on a full re-run you may need to use "Restart and run all",
since data from the previous run otherwise stays in memory.**

In this exercise, you'll generate text with an n-gram model. During generation, we use the last n-1 generated words as the prefix, and the n-gram counts to establish the distribution of possible continuations. We can run this off the following data structure:

* A master dictionary, where the keys are (n-1)-grams
* Each value is another dictionary
* In this inner dictionary, the key is a word
* And the value is its count

So, when generating, we can take the last n-1 words, look them up in the master dictionary, and we get a dictionary of all seen continuations and their counts.
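
As a rough illustration of this structure, here is a minimal sketch for a 3-gram model. The name `toy_ngrams` and all the counts in it are made up for the example and do not come from the IMDB data.

```python
# Toy master dictionary for a 3-gram model (hypothetical, hand-made counts).
# Keys are (n-1)-gram tuples, values map each seen continuation to its count.
toy_ngrams = {
    ("the", "movie"): {"was": 3, "is": 2, "<eos>": 1},
    ("movie", "was"): {"great": 2, "awful": 1},
}

n = 3
generated = ["<bos>", "<bos>", "the", "movie"]

# Take the last n-1 generated words and look them up in the master dictionary
prefix = tuple(generated[-(n - 1):])
continuations = toy_ngrams[prefix]
print(prefix, continuations)
```

Tuples are used as keys because lists are not hashable and so cannot be dictionary keys.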

Let us divide this into the following tasks:

1. Generate n-grams from a corpus of text, e.g. the IMDB dataset
2. Count the n-grams, i.e. build the master dictionary

With these data structures, the generation can proceed quite easily. Say we have a 4-gram model.

* Given a prior context $w_1w_2w_3$
* Look up the word-count dictionary of possible words $w_4$
* The counts, once normalized to sum up to 1, form a distribution over the words that can continue $w_1w_2w_3$, and we can sample the next word from this distribution.
* Then we append the generated word to our list of already generated words, and repeat the process
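
The normalize-and-sample step can be sketched as follows. This is only an illustration with made-up counts; it uses the standard library's `random.choices` and no temperature, whereas the notebook's own sampling function further below works differently.

```python
import random

# Hypothetical counts of the words seen after one particular prefix
counts = {"great": 6, "awful": 3, "long": 1}

# Normalize the counts so that they sum up to 1
total = sum(counts.values())
probs = {word: count / total for word, count in counts.items()}

# Sample one continuation according to that distribution
rng = random.Random(0)  # seeded only to make the sketch repeatable
next_word = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
print(probs, next_word)
```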


Other remarks:

* We want to pad all texts with `<bos>` (beginning of sequence) and `<eos>` (end of sequence). We put `<bos>` there n-1 times, so we can use it as the initial prompt and let the model learn how sequences start. The `<eos>` lets us stop generating, and prevents a crash on unknown n-grams at the very end of a sequence: if an n-gram $w_1w_2w_3w_4$ was seen only once, at the end of a "training" sequence, then an attempt to continue it during generation would crash, since our simple, unsmoothed model knows no n-gram that continues $w_2w_3w_4$ :)
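
To make the padding concrete, here is a small sketch on a single toy sentence with n=3. The helpers `pad` and `windows` are hypothetical names written in plain Python for this illustration; they are not the functions the notebook uses.

```python
def pad(tokens, n):
    # n-1 <bos> symbols at the start, a single <eos> at the end
    return ["<bos>"] * (n - 1) + tokens + ["<eos>"]

def windows(tokens, n):
    # All consecutive n-token windows, as tuples
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

n = 3
padded = pad(["I", "like", "dogs"], n)
# padded is ['<bos>', '<bos>', 'I', 'like', 'dogs', '<eos>']
trigrams = windows(padded, n)
# The first trigram starts from the all-<bos> prompt,
# the last one gives the prefix ('like', 'dogs') the continuation '<eos>'
print(trigrams[0], trigrams[-1])
```

Without the `<eos>` padding, the prefix `('like', 'dogs')` would never appear as a key, and looking it up during generation would fail.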


# Task A: Generate n-grams

* Write a generator function (using `yield` rather than `return`) which yields n-grams as tuples $(w_1,...,w_n)$ from all sections of the IMDB dataset
* a vectorizer from `sklearn` can be used as a trivial tokenizer
* `more-itertools` is a nifty library to achieve the n-gram generation
* remember to pad with n-1 `<bos>` symbols at the beginning, and one `<eos>` symbol at the end

You can give this a shot, or simply use the code below.

In [1]:
!pip3 install datasets more-itertools


In [2]:
import datasets
import sklearn.feature_extraction

In [3]:
dset=datasets.load_dataset("imdb")


In [4]:
# Few remarks here:
# 1. we don't need the vectorizer per se, we just want its analyzer function, which basically tokenizes the text for us, and somewhat unfortunately drops punctuation
# 2. the default token pattern in sklearn drops 1-letter words (like "I" and "a") so I modify it a bit
# 3. it's a pretty lousy tokenizer, but it will do for this toy exercise
cvectorizer=sklearn.feature_extraction.text.CountVectorizer(lowercase=False,stop_words=None,token_pattern=r"(?u)\b\w+\b" )
analyzer=cvectorizer.build_analyzer()
analyzer("I have a dog at home, it likes to shred newspapers.")

['I',
 'have',
 'a',
 'dog',
 'at',
 'home',
 'it',
 'likes',
 'to',
 'shred',
 'newspapers']

In [5]:
# Now we tokenize the IMDB dataset the usual way
def tokenize(ex):
    return {"tokenized":analyzer(ex["text"])}

dset=dset.map(tokenize,num_proc=4)


In [6]:
from collections import Counter
from more_itertools import sliding_window #more-itertools is an awesome library!
import tqdm

def generate_ngrams(dset,n):
    for ex in tqdm.tqdm(dset):
        tokens=["<bos>"]*(n-1)+ex["tokenized"]+["<eos>"]
        for ngram in sliding_window(tokens,n):
            yield ngram



# Task B

* Now we can combine the different sections of the IMDB dataset and count our n-grams


In [7]:
# Here we can concatenate all the individual datasets (train,test,unlabeled) in IMDB
# the "master" dataset is a dictionary of these, so dset.values() has the datasets of the individual sections (train,test,unlabeled)
combined_dataset=datasets.concatenate_datasets(list(dset.values()))


In [8]:
ngrams={} #This is the master dictionary
for ngram in generate_ngrams(combined_dataset,5): #we use 5-grams here, you can try 3- and 4-grams too!
    prefix=ngram[:-1]
    word=ngram[-1]
    d=ngrams.setdefault(prefix,{}) #looks up the inner dictionary for this prefix, or makes one if it's not there yet
    d[word]=d.get(word,0)+1 #count +1 for this one

100%|██████████| 100000/100000 [01:28<00:00, 1134.60it/s]
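
For comparison, the same counting can be sketched with `collections.defaultdict` and `collections.Counter` on a tiny hand-made corpus. The names `toy_texts` and `toy_counts` are made up for this illustration; this is an alternative formulation, not the code the notebook runs.

```python
from collections import Counter, defaultdict

# Three tiny, already tokenized "documents" (hypothetical data)
toy_texts = [["I", "like", "dogs"], ["I", "like", "cats"], ["I", "like", "dogs"]]
n = 3

toy_counts = defaultdict(Counter)  # prefix tuple -> Counter of next words
for tokens in toy_texts:
    padded = ["<bos>"] * (n - 1) + tokens + ["<eos>"]
    for i in range(len(padded) - n + 1):
        *prefix, word = padded[i:i + n]
        toy_counts[tuple(prefix)][word] += 1

# Continuations of the prefix ("I", "like"), most frequent first
print(toy_counts[("I", "like")].most_common())
# → [('dogs', 2), ('cats', 1)]
```

One caveat of this variant: indexing a `defaultdict` with an unseen prefix silently creates an empty entry, so a membership test (`prefix in toy_counts`) is the safer way to check whether a prefix is known.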


# Task C

* Generate new text, starting from `<bos> <bos> ...` (n-1 times) and ending after, say, 40 words, or once `<eos>` is generated
* I will give you a support function `sample_from`, which receives a list of counts and a temperature parameter, samples according to the resulting distribution, and returns the index of the sampled item
* The temperature sampling is described here: https://towardsdatascience.com/how-to-sample-from-language-models-682bceb97277
* By all means, if you want to try, do try writing this function yourself!


In [9]:
import numpy

def softmax(x):
    return numpy.exp(x)/sum(numpy.exp(x))

def sample_from(counts,temperature=1.0):
    """
    counts: list of counts that form the distribution
    temperature: the "how wild the generation should be" parameter; values close
                 to 0 are very conservative, values close to or above 1 lead to quite
                 wild generations
    """

    counts_array=numpy.array(counts)
    #Make these sum up to 1.
    counts_array_norm=counts_array/counts_array.sum()
    #Divide by temperature, that is what the algorithm does
    counts_array_norm/=temperature
    #Renormalize into a distribution using the softmax function, that is what the algorithm does
    final_distribution=softmax(counts_array_norm)
    #A good way to sample from a distribution is the following function from numpy
    x=numpy.random.multinomial(n=1,pvals=final_distribution)
    selected_word=numpy.argmax(x).flatten()
    return selected_word[0]

sample_from([1,1,1,17],temperature=0.5) #Try running this several times each, with temps 0.1, 0.5, 1.0 ... see how temp 0.1 sticks to picking the max value, but higher temps don't?

1
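
To see what the temperature does before any sampling happens, the sketch below repeats the same steps as `sample_from` (normalize, divide by temperature, softmax) and prints the resulting distribution. The helper name `temperature_distribution` is made up for this illustration.

```python
import numpy

def temperature_distribution(counts, temperature):
    # Same steps as sample_from above, minus the final sampling
    probs = numpy.array(counts, dtype=float)
    probs = probs / probs.sum()      # normalize the counts to sum up to 1
    scaled = probs / temperature     # divide by the temperature
    exps = numpy.exp(scaled)
    return exps / exps.sum()         # softmax

for temp in (0.1, 0.5, 1.0, 5.0):
    print(temp, temperature_distribution([1, 1, 1, 17], temp).round(3))
```

At temperature 0.1 nearly all of the probability mass sits on the largest count, while at higher temperatures the distribution moves towards uniform. Note that because the softmax is applied to normalized counts rather than to log-probabilities, temperature 1.0 does not reproduce the raw count distribution.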

# Task D: piece it all together

* Again, I will give you the skeleton

In [10]:
from pprint import pprint

def generate(ngrams,n,max_len=40,temperature=1.0,prompt=None):
    """
    ngrams: the master dictionary
    n: the n in n-gram
    max_len: how many words max?
    temperature: the generation temperature
    prompt: the initial prompt, as a tuple, if not given n-1 <bos> symbols will be used
    """

    if prompt is None:
        prompt=["<bos>"]*(n-1)

    generated=list(prompt) #this list will grow with words
    for _ in range(max_len):
        prefix=tuple(generated[-n+1:]) #pick the last n-1 from what we have generated so far
        d=ngrams[prefix] #the inner dictionary
        # Now we need to separate the words, and the counts for sampling
        # I do it with a less pythonic, more explicit code, could be done with list comprehensions of course
        possible_words=[] #list of words which could continue this ngram
        counts=[] #and their counts
        for word,count in d.items():
            possible_words.append(word)
            counts.append(count)
        new_word_index=sample_from(counts,temperature) #now sample which of the words gets selected
        new_word=possible_words[new_word_index]
        generated.append(new_word)
        if generated[-1]=="<eos>": #stop on end of sequence
            break
    return generated

# Now we can test it!

# make sure to match the n below to the n which was used to create
# the master dictionary
for temp in (0.1,0.5,1.0,2.0,5.0):
    generated=generate(ngrams=ngrams,n=5,max_len=60,temperature=temp)
    print(f"Temp={temp}:")
    pprint(" ".join(generated))
    print("-----------")



Temp=0.1:
('<bos> <bos> <bos> <bos> En route to a small town and a war vet returning '
 'seemed a set up for the finale she was always one of my favorites but I d '
 'say his enthusiasm is catching but if it was it wasn t the first time a '
 'couple of years previously turned up on the three broadcast channels we '
 'received Even to my')
-----------
Temp=0.5:
('<bos> <bos> <bos> <bos> Walter Matthau one of Hollywood s earliest instances '
 'of criminal lovers on the lam story and a surprisingly alert and moving one '
 'as well For a 40 year old rather than a child of 8 or 9 and it freaked me '
 'out beyond terror Not that I would cuddle with a nut but you get the picture '
 'A WASTE of')
-----------
Temp=1.0:
('<bos> <bos> <bos> <bos> Destined to be a time filler on Sci fi channel when '
 'they ve overused everything else from their library and barely better than '
 'the original some critics such as John Charles have stated that Project A2 '
 'is better than the fragile but at least 

# Done!

OK, the generations are quite funny. Clearly, this is no ChatGPT, but it is also not entirely bad for a model that is basically two dictionaries...