# Lab 1: Corpora and Tokenization (with solutions)
#### Accelerated Natural Language Processing (INFR11125)

## Don't read this file until you've worked through the lab!

<div class="alert alert-danger">
This file contains completed code and answers to questions. We provide it so you can check your answers or if you get stuck. But, you should make a real attempt to solve the questions on your own first.
</div>

# Part 1: Downloading and exploring the data

## 1. What is Hugging Face?

Hugging Face is an AI company that provides a centralized repository, the Hugging Face Hub, which is widely used in NLP. There are many useful datasets and models on the Hub, and we'll be using some of them in this course. 

However, you should also be careful when using resources from the Hub, because anyone can create an account and upload datasets or models. So you don't necessarily know whether the resources are high quality, or whether they raise ethical issues. We'll talk about some of these issues later in the course.

Hugging Face also provides Python libraries that provide standardized interfaces for the datasets and models, so it's easy to swap between different datasets and models without changing your own code. Today we'll be using the `datasets` and `transformers` libraries. 

**YOU TRY:** Have a look at the datasets on the Hugging Face Hub: [https://huggingface.co/datasets](https://huggingface.co/datasets). Then:
1. Use the sorting and filters to find the *most downloaded text dataset*. What kinds of tasks is it intended for?
2. Now click on the `languages` filter, and further filter the results by clicking the `Japanese` filter. You should see that the most downloaded dataset is now `allenai/c4`.
3. Look at the page for this dataset. You should see some sample data, but it's not in Japanese! Why not? Can you see how to preview the Japanese subset of the data? (*Hint*: the two-letter language code for Japanese is `ja`.)

#### Answers

As of Sept 2025, 
1. the most downloaded text dataset was `nebius/SWE-rebench`, which contains data to train models for software engineering tasks (so, it's text, but not natural language data).
2. (no question)
3. This dataset is a cleaned version of the Common Crawl corpus, containing web crawl data, so although it contains Japanese, it is actually multilingual, and the preview is showing the first language in the data set (Afrikaans, "af"). To see the Japanese, you need to click on the `subset` drop-down and choose "ja".

## 2. What data are we using today?

In today's lab, we will be using a dataset that we preprocessed and uploaded to the Hugging Face Hub specially for ANLP!

Our dataset comes from the Nunavut Hansard corpus. Like the Europarl corpus that was discussed in the lectures, this is a **parallel corpus** containing parliamentary proceedings in different languages. But here it's the parliament for the territory of Nunavut in Canada. 

The proceedings have been translated into both **English** and **Inuktitut**, an indigenous language spoken in Nunavut with about 30,000 speakers. 

If you want, you can look briefly at the Hugging Face Hub page for the data [here](https://huggingface.co/datasets/EdinburghNLP/nunavut-hansard-plusplus), then go to the next step.

**YOU TRY:** Execute the following cell to download the data from Hugging Face. Notice we are creating two separate datasets, one for each language, and taking only the first 100k lines of data from the "train" split. We'll be talking about dataset splits late in Week 2, so don't panic if you aren't familiar with this term yet! 

**Note**: the first time you run this, it will take about 30-60 sec to download 5 files. You might get an error about pip's dependency resolver---just ignore this!

In [None]:
!pip install -U datasets
from datasets import load_dataset
# We need to specify which dataset, which subset of it, and which split.
data_eng = load_dataset("EdinburghNLP/nunavut-hansard-plusplus", "english", split="train[:100000]") 
data_iku = load_dataset("EdinburghNLP/nunavut-hansard-plusplus", "inuktitut", split="train[:100000]")

## 3. A first look at the data

It's always a good idea to have a look at your data, to get a sense of what's in it and to make sure you understand the formatting! 

(Looking at the data is also useful later on in developing a system, for example to diagnose errors.)

**YOU TRY:** Run the following code to print out the first few lines of each corpus. Does it look like what you would expect from parliamentary proceedings? Can you guess what the word for "first" is in Inuktitut?

#### Answer

It looks about as expected! The word "first" seems to be translated as "sivuliqpaat" (repeated in the second and third lines).

In [None]:
from pprint import pp # this helps print large python objects in a more readable way
print("@@ ENGLISH @@")
pp(data_eng["text"][:10])
print("@@ INUKTITUT @@")
pp(data_iku["text"][:10])

One thing you might notice about the beginning of the corpus is there are no full sentences. Let's look at a later portion to try to find some.

**YOU TRY**: Complete the following code to print out some sentences from a bit later in the corpus, then run it! (If you want to, you can change the range of indices to look at some other parts of the corpus too.) Then consider:

1. Based on what you can see, do you think this corpus is *sentence-aligned* or not? That is, are the ENG and IKU sentences with the same ID number always translations of each other, or could the sentences be split up differently in the two languages? Give specific evidence to support your answer.

2. What's another way you could check whether the ENG and IKU subsets are sentence-aligned?

#### Answer

The code is given below.

1. Looking at lines 100-120, the data does not appear to be sentence-aligned. There are many examples you could use to support this claim; here are just two:

   ENG line 100 includes a name, parenthetical phrase, and multiple words, while IKU line 100 just has a single word, which doesn't seem like a likely translation. 

   Meanwhile, it looks like IKU 101 and ENG 103 probably *are* translations of each other: IKU 101 starts with "mis puukisan" followed by a parenthetical and a single word, and ENG 103 starts with "Ms. Perkison" followed by a parenthetical and "thank you".

2. You could provide further evidence by looking at the total number of sentences (lines) in each dataset. Remember, we downloaded exactly 100k lines of each, so you need to go back to the Hugging Face page to check the full sizes. You will see that indeed the two subsets have different numbers of lines in them, so they can't be sentence-aligned.


In [None]:
for i in range(100,120):
    print(f'ENG {i}: {"fix me"}') # fix to print the i'th English sentence (look at previous code cell for hints)
    print(f'IKU {i}: {"fix me"}') # fix to print the i'th Inuktitut sentence

In [None]:
# solution
for i in range(100,120):
    print(f'ENG {i}: {data_eng["text"][i]}') 
    print(f'IKU {i}: {data_iku["text"][i]}') 

## 4. Looking deeper: tokens and lemmas

So far, we've just looked at the `["text"]` field of the dataset, but there is more information available. 

**YOU TRY:** Run the following cell to print out a full row from the English data. What fields does each row include? What information is available for each token? (*Warning:* the additional info for each token was annotated automatically using NLP tools, so there could be errors!)

#### Answer

Each row contains `text`, which is a full sentence, but also `tokens`, a list of the tokens in the sentence. Each token is a dictionary containing the word itself, its part of speech, and its lemma.

In [None]:
print("English:")
pp(data_eng[206])

**YOU TRY:** Now, run the cell below to look at the corresponding sentence from Inuktitut.

1. What information do you see there that was not in the English data?
2. Based on the data you have seen so far (here or in previous parts of the lab), do you think Inuktitut is *more* or *less*  morphologically rich than English? Why?

#### Answer

1. Each Inuktitut token also contains `lemma_trans`, the English translation of the lemma.
2. It appears to be richer than English, because many of the words are long, but the lemmas are much shorter. This implies that a lot of each word consists of morphological affixes. You might also notice that while  the English  sentence contains words with many different parts of speech, the Inuktitut sentence  is almost all nouns and verbs. This can happen in a morphologically rich language, where things that might be separate words (prepositions, pronouns, auxiliary verbs, etc) in English are instead expressed using morphemes attached to other words.
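
One rough way to quantify the "long words, short lemmas" observation is to compare average word length with average lemma length. The sketch below runs on made-up English toy rows rather than the real corpus, and it assumes the lemma is stored under a `lemma` key (check the row you printed above for the actual field name). The helper `mean_lengths` is a hypothetical name, not part of the lab code.

```python
# Toy rows imitating the dataset layout. The 'lemma' key name and the
# example words are assumptions for illustration, not real corpus data.
toy_rows = [
    {"tokens": [
        {"text": "walked", "pos": "VERB", "lemma": "walk"},
        {"text": "houses", "pos": "NOUN", "lemma": "house"},
        {"text": ".", "pos": "PUNCT", "lemma": "."},
    ]},
]

def mean_lengths(rows, lemma_key="lemma"):
    """Return (mean word length, mean lemma length), skipping punctuation."""
    word_total = lemma_total = n = 0
    for row in rows:
        for token in row["tokens"]:
            if token["pos"] in ("PUNCT", "SYM"):
                continue
            word_total += len(token["text"])
            lemma_total += len(token[lemma_key])
            n += 1
    if n == 0:
        return 0.0, 0.0
    return word_total / n, lemma_total / n

word_len, lemma_len = mean_lengths(toy_rows)
print(f"mean word length: {word_len:.2f}, mean lemma length: {lemma_len:.2f}")
```

A larger gap between the two averages would be consistent with more of each word being made up of affixes, although character counts are only a crude proxy for morphological complexity.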

In [None]:
print("Inuktitut:")
pp(data_iku[200]) #This is the translation of sentence 206 in ENG, which you printed above.

# Part 2: Word frequencies and Zipf plots

Now that we have some idea of what our data looks like, and an initial hypothesis about the morphology of Inuktitut, let's see whether we can find further support for that hypothesis by looking at corpus statistics.

In particular, we will do some simple analyses just based on word frequencies. 

## 5. Collecting word frequencies

Let's start by collecting the word frequencies. 

**YOU TRY:** Take a look at the code below, and talk through it with your partner to make sure you both understand how it works. For example,

- What does the code loop over? 
- What is stored as a "word" in Inuktitut, and what are the keys and values of the dictionary that the function returns? 
- Is every token counted?

#### Answer

The code loops over each row in the dataset. It skips over punctuation, but counts all other tokens. It returns a dictionary whose keys are the unique word types in the data, and whose values are the frequencies of those words. For Inuktitut, each "word" in the dictionary is actually a string that concatenates the Inuktitut word and the  English translation of its lemma.

**FOLLOW-UP QUESTION:** Is it possible that the way Inuktitut words are treated here might cause errors in our counting? Why or why not?

#### Answer

It is possible, because we aren't actually counting unique Inuktitut words, we are counting unique *token*/*lemma translation* pairs. 

In general, we would expect that if two tokens are the same, the lemma translation would also be the same, so the two methods of counting would behave the same. However, it's possible that the same token could be translated differently in some cases, either because it's truly ambiguous, or because the translation system made an error. In this case, our function will return different counts than if we just counted unique tokens.

These cases are likely to be rare, so for the purposes of this lab we thought it would be better to include the translations so you can understand the data more easily. 
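
To make the difference between the two counting methods concrete, here is a small sketch on invented data. The token strings and translations are placeholders, not corpus entries; the point is only that one token with two different lemma translations becomes two separate dictionary keys.

```python
from collections import Counter

# Hypothetical (token, lemma translation) pairs; the strings are made up
# for illustration and are not taken from the corpus.
toy_tokens = [
    ("token_a", "and"),
    ("token_a", "and"),
    ("token_a", "also"),   # same token, different lemma translation
    ("token_b", "house"),
]

# Counting by surface token only.
by_token = Counter(text for text, _ in toy_tokens)

# Counting by "token (translation)" strings, in the same spirit as the
# Inuktitut branch of compute_frequencies.
by_pair = Counter(f"{text} ({trans})" for text, trans in toy_tokens)

print(len(by_token), "types when counting tokens")
print(len(by_pair), "types when counting token/translation pairs")
print(by_token["token_a"], "vs", by_pair["token_a (and)"])
```

Here the pair-based count reports one extra type and splits the frequency of `token_a` across two keys, which is exactly the kind of discrepancy described above.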

In [None]:
from tqdm import tqdm # displays a progress bar while looping
def compute_frequencies(dataset):
    ''' Takes the dataset with each row containing a 'tokens' field (list of tokens), 
    and returns a dictionary with the counts of all words in the data. 
    If the dataset contains lemma translations, the translation is appended to each "word".'''
    counter = {}
    for row in tqdm(dataset):
        for token in row['tokens']:
            # skip punctuation
            if token['pos'] in ['PUNCT', 'SYM']:
                continue
            if "lemma_trans" in token:
                # for Inuktitut, we'll show both the word and the translation
                # of the lemma, so we can understand the data better.
                word = token['text'].lower() + " (" + token['lemma_trans'] +")"
            else:
                # upper or lower case shouldn't matter,
                # so we'll make all words lower case
                word = token['text'].lower()
            if word not in counter:
                counter[word] = 1
            else:
                counter[word] += 1
    return counter

**YOU TRY:** Now, let's actually compute the frequencies!
1. Run the cell above to define the `compute_frequencies` function.
2. Next, add code in the cell below to call the function on your datasets, so that `count_eng` and `count_iku` store the two dictionaries of word counts.
3. Run your code to compute the frequencies. It will take a few seconds, so in the meantime go to the next cell and fix the code to print out the frequencies of the four English words. Which one do you expect to have the highest frequency? The lowest? 

In [None]:
# solution
count_eng = compute_frequencies(data_eng)
count_iku = compute_frequencies(data_iku)

In [None]:
#solution
print("Frequencies in English data:")
for word in ["on", "window", "happy", "went"]:
    print(word, count_eng[word]) 

## 6. Comparing Zipf plots

**YOU TRY:** Run the code below to create Zipf plots for both languages, and then scroll down to answer the questions about them.


In [None]:
import matplotlib.pyplot as plt
import numpy as np
def plot_zipf(counter, language):
    ''' Takes a dictionary of word counts, and a string with the language name.
    Produces a Zipf plot (rank vs frequency, on log scales) '''
    freqs = np.array(sorted(counter.values(), reverse=True))
    ranks = np.arange(1, len(freqs) + 1)

    plt.figure(figsize=(4.5, 4.5))
    plt.plot(ranks, freqs, linewidth=3)
    plt.xscale('log')
    plt.yscale('log')
    plt.xlim(1, 1.5e6)
    plt.ylim(1, 1.5e6)
    plt.grid(True, which="both", ls="--")
    plt.xlabel("Rank")
    plt.ylabel("Frequency")
    plt.title(f"Zipf Plot for {language}")
    plt.show()

plot_zipf(count_eng, "English")
plot_zipf(count_iku, "Inuktitut")

**QUESTION:** Compare the English and Inuktitut Zipf plots. Based on these plots, which language is morphologically richer, and how can you tell? Do the patterns you see here match what you thought earlier about English vs. Inuktitut morphology?

#### Answer

The plots strongly suggest that Inuktitut has richer morphology, because:
- Inuktitut has far more word types than English: its curve extends to a much higher maximum rank (there are more words to rank).
- The most frequent word in Inuktitut is much less frequent than the most frequent word in English.

Together, these two points make the slope shallower for Inuktitut, which is another way to answer this question.

You probably also noticed that the Inuktitut plot is straighter than the English one. It's not clear why that's the case! But it is not likely to be related to the morphology. 
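
If you want a number to go with the "shallower slope" observation, one simple option is a least-squares line through the points of the log-log plot. This is only a sketch: `zipf_slope` is a hypothetical helper, the counts below are synthetic, and a fit over all ranks is heavily influenced by the long tail of rare types, so treat the result as a rough summary rather than a careful estimate.

```python
import numpy as np

def zipf_slope(counter):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = np.array(sorted(counter.values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope

# Synthetic counts that follow frequency = 1000 / rank exactly, so the
# fitted slope should come out very close to -1.
toy_counts = {f"w{r}": 1000 / r for r in range(1, 201)}
print(round(zipf_slope(toy_counts), 3))
```

With the lab's dictionaries, you could compare `zipf_slope(count_eng)` and `zipf_slope(count_iku)`; a slope closer to zero corresponds to a shallower line.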

## 7. More details

Let's look in a little bit more detail at some of what we can see in the plots, to further confirm our hypothesis about Inuktitut morphology.

**YOU TRY:** 

1. Based on the Zipf plots above, roughly how many *word types* (unique words) are in each dataset, and how does this relate to morphology?
2. In light of the previous question, why did we use a parallel corpus for this study?
3. Now complete and run the code below to compute the exact number of word types and confirm your estimates. *Hint:* Check the number of items in the `count_eng` and `count_iku` objects.

#### Answer

1. It looks like there are somewhat over 10^4 types in English (perhaps 20k), and well over 10^5 types in Inuktitut (somewhere between 200k and 300k). This suggests that meanings expressed by multiple words in English are often strung together in a single complex word in Inuktitut.
2.  We need to control for the amount of data and what's being said, otherwise there could be many reasons  besides morphology why Inuktitut has more word types.
3.  The exact counts are 19661 (ENG) and 279223 (IKU): IKU has more than 10 times as many word types!

In [None]:
# solution
print("English word types:", len(count_eng))
print("Inuktitut word types:", len(count_iku))

Finally, let's take a look at the most frequent words in each language. We'll do this by casting (converting) our word count dicts into Counter objects, and using the `most_common` function of Counters.

If you are not familiar with Counters, you might want to look up the documentation (now or after the lab) to see what other functions they provide. They can be very useful for simple analyses of text data! (In fact, we could have used Counters in the first place, rather than dicts. As an exercise for later, you could try rewriting the `compute_frequencies` function using Counters.)
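
For reference, here is one possible way the Counter-based rewrite could look. It is a sketch rather than an official solution: the name `compute_frequencies_counter` is made up, the progress bar is left out to keep the example self-contained, and the toy dataset only imitates the fields used by the original function.

```python
from collections import Counter

def compute_frequencies_counter(dataset):
    """Same counting rule as compute_frequencies, but returns a Counter."""
    counter = Counter()
    for row in dataset:
        for token in row["tokens"]:
            # skip punctuation
            if token["pos"] in ("PUNCT", "SYM"):
                continue
            word = token["text"].lower()
            if "lemma_trans" in token:
                word += " (" + token["lemma_trans"] + ")"
            # A Counter returns 0 for missing keys, so no membership test
            # is needed before incrementing.
            counter[word] += 1
    return counter

# Toy dataset for illustration only (not real corpus rows).
toy_dataset = [
    {"tokens": [{"text": "The", "pos": "DET"},
                {"text": "house", "pos": "NOUN"},
                {"text": ".", "pos": "PUNCT"}]},
    {"tokens": [{"text": "the", "pos": "DET"},
                {"text": "members", "pos": "NOUN"}]},
]
counts = compute_frequencies_counter(toy_dataset)
print(counts.most_common(1))
```

Because the result is already a Counter, the `Counter(...)` cast used in the next cell would no longer be needed.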

**YOU TRY:** Run the following code and look at the results to answer the questions:
1. What are some differences you notice between the top English words and the top Inuktitut words? You should be able to name at least three differences.
2. How might these differences relate to the morphology of Inuktitut? *Hint:* If you're not sure, look back at the lecture slides on *case* (end of W1/L2) and *agreement* (middle of W1/L3).
3. You might also notice some evidence that the automatic preprocessing might have led to some errors. What's the evidence? (Don't spend ages on this if you can't find it; ask a demonstrator or look at the solutions when available.)


#### Answer
1. Differences include the following. You might have found other ones!
   
- the English words are much shorter.
- the top English words are much more frequent than the top Inuktitut words.
- In the Inuktitut list, there are several examples where different words have the same lemma.
- nearly all of the top English words are function words, whereas many of the top Inuktitut words are content words.

2. The first three differences are all consistent with Inuktitut having complex morphology, where a single lemma can combine with many different morphological affixes. This would make the words longer, and the most frequent meanings/lemmas would show up with different wordforms. Each individual wordform would therefore occur less frequently.

    The final difference is perhaps even a bit more specific. Several of the words in the English list are prepositions ("to", "of", "in", etc), and others are used in part to indicate number ("the", "a") or tense ("will"). The fact that these don't appear in the Inuktitut list suggests that Inuktitut likely expresses these meanings through morphology, e.g., through *case* and *agreement* morphemes.

3. The word 'ammalu' appears as both the second and third most frequent word, with different lemmas, but the frequencies listed are the same. This seems to suggest that the lemmatizer may have made an error on this word at least once (and/or it is ambiguous)--- but it's not clear why the counts are the same. This is the kind of thing that is worth following up as a possible bug, whether it's your own code or someone else's! 



In [None]:
from collections import Counter
pp(Counter(count_eng).most_common(20))
pp(Counter(count_iku).most_common(20))

# Part 3: Byte-pair encoding

In class, we talked about **byte-pair encoding (BPE)** as a way of dealing with unseen words. This should be especially helpful for a language as morphologically rich as Inuktitut. Let's train some byte-pair encoding models and see how it affects the Zipf plots for English and Inuktitut!

## 8. Tokenizing the data

Once again, there's no need for us to implement BPE from scratch. Hugging Face also has the `tokenizers` library, which implements efficient versions of BPE and related algorithms. We already used it to train BPE tokenizers on our English and Inuktitut data and uploaded them to the Hugging Face Hub. 
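
If you'd like a reminder of what BPE training does, the following is a deliberately simplified pure-Python sketch of the merge-learning loop. It is not how the `tokenizers` library is implemented: it has no word-boundary marker, no byte-level handling, and it breaks ties by whichever pair was counted first. The function name `learn_bpe` and the toy word counts are invented for illustration.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn up to num_merges BPE merges from a dict of word -> count."""
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace each occurrence of the best pair with one merged symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab

# Toy word counts (made up for illustration).
toy_word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_word_counts, 3)
print(merges)
print(segmented)
```

Notice that frequent endings get merged early, which hints at why BPE can be helpful for a language where many word forms share affixes.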

**YOU TRY:** Run the next three cells to compute frequencies of the BPE-tokenized data. These cells will:
1. Download our pre-trained tokenizers using `AutoTokenizer` from the `transformers` library;
2. Define a modified version of our previous `compute_frequencies` function, to compute frequencies based on the BPE tokens; and 
3. Actually compute the frequencies, by calling `compute_frequencies_tok`. This will take a minute or two to run, so scroll down and start thinking about the next question while it's running.

In [None]:
from transformers import AutoTokenizer
tokenizer_eng = AutoTokenizer.from_pretrained('anlp-uoe/nunavut-hansard-eng')
tokenizer_iku = AutoTokenizer.from_pretrained('anlp-uoe/nunavut-hansard-iku')

In [None]:
def compute_frequencies_tok(dataset, tokenizer):
    ''' Takes a dataset with each row containing a 'tokens' field (list of tokens), 
    and a tokenizer. Tokenizes the data using the tokenizer, 
    and returns a dictionary with the token counts (excluding punctuation).'''
    counter = {}
    for row in tqdm(dataset):
        for token in row['tokens']:
            # skip punctuation
            if token['pos'] in ['PUNCT', 'SYM']:
                continue
            subtokens = tokenizer.tokenize(token['text'])
            for subtoken in subtokens:
                word = subtoken.lower()
                if word not in counter:
                    counter[word] = 1
                else:
                    counter[word] += 1
    return counter

In [None]:
# Compute frequencies on both datasets
count_eng_tok = compute_frequencies_tok(data_eng, tokenizer_eng)
count_iku_tok = compute_frequencies_tok(data_iku, tokenizer_iku)

## 9. Analyzing the effects of BPE

**BEFORE you run the code below**, try to predict what the Zipf plots will look like, now that we have tokenized the data using BPE. It might help to know that we used a BPE vocabulary size of 10,000.

**YOU TRY:** After you have discussed your predictions with your partner, *then* run the code to see if you were right! If you see anything unexpected in the results, can you figure out why? Does the data still obey Zipf's law? Why or why not?

In [None]:
# Plot the data!
plot_zipf(count_eng_tok, "English BPE")
plot_zipf(count_iku_tok, "Inuktitut BPE")

**YOU TRY:** Based on what you know about BPE and these languages, do you expect that after running BPE, the most frequent tokens will have changed? What about the least frequent tokens? To see if you're right, use the function we've defined below to explore the most and least frequent tokens. (You'll need to add some code of your own that calls this function.)

#### Answer

Whatever you predict, ideally you should explore the data to notice the following effects:

- The most frequent tokens in English are the same as the most frequent words (the `_` indicates that the token is the start of a word; in this case, they are also complete words). In Inuktitut, only some of the most frequent tokens are also in the most frequent words: BPE has created a bunch of new high-frequency tokens. 
- You might find that the list of least frequent tokens in English doesn't seem to change much after BPE. However, this could be an artifact of our printing function! In English, we didn't reduce the vocabulary that much: from about 20k down to 10k. This means some of the words with frequency 1 are probably still in the vocabulary. When tokens are tied for frequency, `most_common` lists them in the order they were first encountered, which tells us nothing about the tokens themselves, so looking at a small list of frequency 1 tokens isn't that informative. 
- On the other hand, in Inuktitut, the vocabulary size shrank a lot, so you should see a big difference between the least frequent words and the least frequent BPE tokens.
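
Since a short list of tied frequency-1 items isn't very informative, it can be more useful to count how many types occur exactly once. Here is a minimal sketch on a toy sentence; `frequency_summary` is a hypothetical helper, not part of the lab code.

```python
from collections import Counter

def frequency_summary(counts):
    """Return (number of types, number of types with frequency 1)."""
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return len(counts), hapaxes

# Toy counts built from a made-up sentence.
toy_counts = Counter("the cat saw the dog and the bird".split())
n_types, n_hapax = frequency_summary(toy_counts)
print(f"{n_hapax} of {n_types} types occur exactly once")
```

Running the same helper on `count_eng`, `count_eng_tok`, `count_iku`, and `count_iku_tok` would let you compare the size of the frequency-1 tail before and after BPE directly.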

In [None]:
def print_most_frequent(counts, label, N=10):
    ''' Takes a dict of counts and a string label identifying what the counts are, 
    and prints the top N most frequent items with their counts. 
    If N is negative, it prints the least frequent items instead. '''
    if N >= 0:
        print("**", N, "most frequent tokens in", label, "**")
        pp(Counter(counts).most_common(N))
    else:
        print("**", -1*N, "least frequent tokens in", label, "**")
        pp(Counter(counts).most_common()[N:])

In [None]:
# possible solution: example of what you might look at.
print_most_frequent(count_eng_tok, "English BPE", 20)
print_most_frequent(count_eng, "English words", -20)
print_most_frequent(count_eng_tok, "English BPE", -20)
print_most_frequent(count_iku, "Inuktitut words", 20)
print_most_frequent(count_iku_tok, "Inuktitut BPE", 20)
print_most_frequent(count_iku, "Inuktitut words", -20)
print_most_frequent(count_iku_tok, "Inuktitut BPE", -20)

## 10. Cross-language BPE

In the previous section, we used tokenizers that were trained separately on English and Inuktitut, using the same vocabulary size.

However, today's large language models use a shared tokenizer for all data, and that tokenizer is trained primarily on English data. This can cause problems for other languages, for two reasons:
- Languages differ in terms of *orthography* (how the language is written, including the alphabet as well as rules for spelling, punctuation, and capitalization). 
- Languages also have different statistics over character sequences, or $n$-grams, as we'll see in more detail in Week 2 lectures and Lab 2.

We don't have one of these multi-lingual tokenizers here, but we can simulate some of the effects by just running our English tokenizer on the Inuktitut data.

**YOU TRY:** Before running the code below to create the Zipf plot and print out the top and bottom words, try to predict what you think might happen if we tokenize the Inuktitut data using the English tokenizer. For example, compared to when we tokenize using the Inuktitut tokenizer,
1. Will the most frequent token be more frequent, less frequent, or the same?
2. Will the highest rank be the same, lower, or higher? What would each of these indicate?
3. What do you think the most or least frequent tokens might look like?
   
   As usual, then run the code, and see whether you find anything different from what you expected, and if so, whether you can explain what happened.


#### Answer

From looking at the Zipf plot and the printout of the most/least frequent words, you should notice some of the following: 
- the most frequent tokens are even more frequent, but they are also much shorter, mostly 1-2 character sequences which are unlikely to be meaningful in Inuktitut. 
- the highest rank is only about 2000, even though the BPE vocabulary size was 10,000. This means that a lot of the BPE tokens from English are never seen at all, and the vocabulary isn't being used efficiently.
- Overall the plot does not look Zipfian at all: there are a rather large number of types with very high frequency, and then a steep dropoff (rather few types with mid-low frequency).
- A lot of the low-frequency tokens appear to be full English words. This suggests that there are a few English words sprinkled amongst the Inuktitut. Again, we don't know exactly why but perhaps there are a few words that don't have Inuktitut translations, or perhaps the original speaker was speaking Inuktitut but used a few English words, and this was transcribed accurately rather than translating into Inuktitut.
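
The point about inefficient vocabulary use can be turned into a single number: the fraction of the tokenizer's vocabulary that is observed at least once. The sketch below uses made-up counts that echo the rough figures above (about 2,000 observed types out of 10,000); `vocab_usage` is a hypothetical helper.

```python
def vocab_usage(counts, vocab_size):
    """Fraction of a tokenizer's vocabulary that appears at least once."""
    return len(counts) / vocab_size

# Hypothetical counts: 2000 distinct tokens, each seen once.
toy_counts = {f"tok{i}": 1 for i in range(2000)}
print(f"{vocab_usage(toy_counts, 10_000):.0%} of the vocabulary is used")
```

One caveat if you try this with `count_iku_mistok`: `compute_frequencies_tok` lower-cases the subtokens and skips punctuation, so vocabulary entries that differ only in case are merged and the result may somewhat understate the true usage.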


In [None]:
count_iku_mistok = compute_frequencies_tok(data_iku, tokenizer_eng)
plot_zipf(count_iku_mistok, "Inuktitut, using English BPE")

In [None]:
print_most_frequent(count_iku_tok, "Inuktitut BPE", 20)
print_most_frequent(count_iku_mistok, "Inuktitut with English BPE", 20)
print_most_frequent(count_iku_tok, "Inuktitut BPE", -20)
print_most_frequent(count_iku_mistok, "Inuktitut with English BPE", -20)

## &#127881; &#127881; Congratulations! You're done! &#x1F680; 

We hope you found this lab interesting, and that some of the libraries and coding patterns will be useful going forward!

As a completely optional follow-up activity, you might be interested in looking at some further statistics related to BPE, such as the *fertility* (the average number of tokens per word) or the average token length. Given what you have seen so far, what would you expect to find in the two languages? What about if you compare the Inuktitut data as tokenized by the Inuktitut versus the English tokenizer?
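
As a starting point, here is a sketch of how fertility and average token length might be computed. To keep it self-contained it uses a stand-in tokenizer that simply cuts words into fixed-size chunks (this is not BPE), and both function names are invented for illustration.

```python
def fertility_and_token_length(words, tokenize):
    """Return (mean tokens per word, mean characters per token)."""
    n_tokens = n_chars = 0
    for word in words:
        tokens = tokenize(word)
        n_tokens += len(tokens)
        n_chars += sum(len(t) for t in tokens)
    if not words or n_tokens == 0:
        return 0.0, 0.0
    return n_tokens / len(words), n_chars / n_tokens

def chunk_tokenize(word, size=3):
    """Stand-in tokenizer: splits a word into fixed-size chunks (not BPE)."""
    return [word[i:i + size] for i in range(0, len(word), size)]

toy_words = ["parliament", "the", "speaker"]
fert, tok_len = fertility_and_token_length(toy_words, chunk_tokenize)
print(f"fertility: {fert:.2f}, mean token length: {tok_len:.2f}")
```

To try it on the lab data, you could pass `tokenizer_eng.tokenize` or `tokenizer_iku.tokenize` as the `tokenize` argument, along with a list of words from the corresponding dataset. Keep in mind that real tokenizers may add a word-start marker character to some tokens, which would slightly inflate the character-based length.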

Feel free to explore these questions or look at other languages on your own!