# Lab 2: Language Identification with character n-gram models



## Character n-gram models: what and why?

In lecture, we talked about n-gram models over **words**, but it's also possible to build n-gram models over **characters**. These can actually be useful for certain tasks, like identifying what language a text is written in.

It's much easier (and requires less data) to build an n-gram model over characters than over words, so we will do that here, to help you build your intuitions about this type of model.

You will work with data from three different languages to build and explore character-level n-gram models for a simple language identification task. Along the way, you’ll confront issues like rare characters, smoothing, underflow, and generation strategies.

## What you will learn in this lab

### Tools and practical issues: 

In this lab, you will learn:
- how to easily split data into training and development subsets using the `datasets` library.
- how to use these splits, together with a separate test set, to correctly tune hyperparameters and test generalization.
- some possible pitfalls to watch out for in your data

### Concepts: n-gram models and language identification

After working through the lab, you should be able to:
- compute the probability of a sequence given an n-gram model
- explain how to generate new sentences using an n-gram language model
- explain how to use a character n-gram model to do language identification
   
You should also understand more clearly:
- how n-gram models are trained
- the effects of smoothing and other hyperparameter choices on the model's behaviour (both generation and perplexity)

## 1. Loading and splitting the data

Today you’ll use another dataset we uploaded to Hugging Face. It contains sentences in three languages widely spoken in South Africa: 
- **English:** a language in the West Germanic branch of the Indo-European language family with significant influences from Old French. It is spoken as a first language by around 380 million people worldwide, with another billion second-language speakers.
- **Afrikaans:** another language in the West Germanic branch of the Indo-European language family. It evolved from Dutch starting around the 17th century, and is spoken as a first language by around 7 million people, mainly in South Africa and nearby regions.
- **Xhosa (or isiXhosa):** a language in the Nguni branch of the Bantu language family. It is closely related to Zulu and is spoken as a first language by around 8 million people, mainly in South Africa and nearby regions.

👉 Run the next two cells to load the data. 

In [None]:
# once again, we need to update `datasets`
%pip install -U datasets

In [None]:

from datasets import load_dataset

# Our variable names are based on the ISO 639-3 language codes for the languages in the dataset.
# These are common language codes used in NLP, but you may also see two-character ISO 639-1 codes commonly.
# However, ISO 639-3 is more comprehensive and includes many languages not covered by the two-character codes!
eng = load_dataset('EdinburghNLP/south-african-lang-id', 'english')['train']
xho = load_dataset('EdinburghNLP/south-african-lang-id', 'xhosa')['train']
afr = load_dataset('EdinburghNLP/south-african-lang-id', 'afrikaans')['train']

# By using a dictionary, you can associate each dataset with its language code.
# Do you see why we use the code as the key and not the dataset itself?
# This allows us to write code that can operate on any number of languages without changing the code structure.
# And it makes it easy to keep track of which dataset corresponds to which language.
# This is a common pattern in NLP and other data processing tasks.
corpora = {'eng': eng, 'xho': xho, 'afr': afr}

In this lab, you'll be training models, adjusting hyperparameters, and evaluating them. To evaluate your changes and avoid over-fitting, you'll use the *train--dev--test* splitting paradigm discussed in class. 

You might have noticed that when we loaded the dataset from Hugging Face, we selected a key called `'train'`. That's because you always have to select a split before you can iterate over a Hugging Face dataset, and Hugging Face interprets data without any splits as being a single `'train'` split. 

However, we are now going to split this data to create the actual training and development sets, so that you can explore and tune hyperparameters on the development set. 

The `datasets` library has a function we can use to easily split off a smaller subset of the data for development, but note that it will call this subset `'test'`!! Despite the name, this split will *not* be used for the test set in this lab. We have set aside another dataset for that, which we will introduce toward the end of the lab.

By default, `train_test_split` selects a *random* subset of the data to split off, which can be useful if there might be differences between the early and late parts of the dataset, and you don't want this to affect your training. However, if you're trying to compare directly to (your own or someone else's) previous work, you need to be careful about whether you split the data the same way. 
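
To make the reproducibility point concrete, here is a minimal plain-Python sketch of a random split with a fixed seed. The helper `split_train_dev` and its parameters are hypothetical and only illustrate the idea; it is not how `train_test_split` is implemented. (`train_test_split` also accepts a `seed` argument that serves the same purpose.)

```python
import random

def split_train_dev(sentences, dev_fraction=0.1, seed=0):
    """Hypothetical helper: shuffle a copy of `sentences` with a fixed seed,
    then split off the last `dev_fraction` of the items as a development set."""
    rng = random.Random(seed)  # private RNG, so the global random state is untouched
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[:-n_dev], shuffled[-n_dev:]

sentences = [f"sentence {i}" for i in range(20)]
train_a, dev_a = split_train_dev(sentences, seed=13)
train_b, dev_b = split_train_dev(sentences, seed=13)
print(dev_a == dev_b)            # the same seed reproduces the same split
print(len(train_a), len(dev_a))
```

Without a fixed seed, two runs would generally produce different development sets, so perplexities from different runs would not be directly comparable.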

**Further reading for later:** If you're new to Hugging Face, they provide useful tutorials with much more information about Datasets and other libraries, which you can find [here](https://huggingface.co/docs/datasets/index).

👉 Run the next two cells to split the data and see what its structure is. Then, change the final line in the second cell to print out the first 10 sentences in each language. Do you see anything that surprises you? (You might or might not, remember the data splits are random!)

In [None]:
corpora = {
    lang: dataset.train_test_split(test_size=0.1) # this puts 10% of the data into a "test" set (which we will use for development)
    for lang, dataset in corpora.items() # see how organizing the data in a dictionary allows us to write less code?
}

In [None]:
# Let's take a quick look at the data.
for lang, dataset in corpora.items():
    print(f'### {lang} data: ###')
    # This line should help you see the structure of the datasets.
    # Replace it with code to print the first ten lines from the training set of each language!
    print("FIX ME!", dataset)

In [None]:
# solution
for lang, dataset in corpora.items():
    print(f'### {lang} data: ###')
    lines = dataset['train']['text'][:10]  # the first ten sentences (indices 0 to 9)
    for line in lines: print(line) 
    print("")

## 2. Preprocessing the data

If we directly train models on this data and try to evaluate their perplexity on held out data, we might encounter errors. That's because there may be rare characters that we don't see in training. (This would be a much bigger problem if our model was over words, but can be a problem even for characters.) We want our code to generalize to unseen characters, so we can still assign probabilities to sequences that contain them. 

We will take a standard approach and replace all rare and unseen characters with a shared unknown character: `�` (the Unicode character for an unknown character). This allows our model to assign some probability to unseen characters and also prevents it overfitting to rare characters which are coincidentally only present in one of the corpora. 

👉 **THINK:** What does this imply about the probabilities of different rare or unseen characters? How is it similar to add-alpha smoothing?

#### Answer:

Because we are using the same character to replace all rare/unseen characters, they will all receive the same probability in the model. This is similar to add-alpha smoothing, which also assigns the same probability to all unseen events.
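
As a small illustration of this pooling effect, the sketch below replaces rare characters in a toy corpus and then looks at the resulting unigram estimates. The helper `replace_rare`, the toy texts, and the threshold are all made up for this example; the lab's own preprocessing comes later.

```python
from collections import Counter

UNK = "\ufffd"  # the Unicode replacement character, printed as �

def replace_rare(texts, threshold):
    """Toy helper (hypothetical): map every character seen fewer than
    `threshold` times across `texts` to UNK."""
    counts = Counter(char for text in texts for char in text)
    rare = {char for char, n in counts.items() if n < threshold}
    return ["".join(UNK if char in rare else char for char in text) for text in texts]

toy_texts = ["aaab", "aabq", "abaz"]
cleaned = replace_rare(toy_texts, threshold=2)
print(cleaned)

# After replacement, 'q' and 'z' no longer have separate counts: the model
# only ever sees UNK, so both (and any unseen character) share its probability.
unigram_counts = Counter(char for text in cleaned for char in text)
total = sum(unigram_counts.values())
print({char: n / total for char, n in unigram_counts.items()})
```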

Some languages do contain characters that don't occur in others! For instance, English doesn't use `क`, `ã` or `角`.

👉 **THINK:** Do you think an English n-gram model *should* assign zero probability to strings containing these characters? Why or why not? Can you think of any contexts where they might appear in English text?

#### Answer:

Even though these characters are not normally used in English, it's a bad idea to assign zero probability to them, because they *can* occur very occasionally. In fact, those characters just appeared in the English text of this lab! Other places where they might occur would be articles about those languages, or quotes from those languages.

In order to remove rare characters, we need to decide what counts as "rare". Normally, you would take a look at your data statistics and perhaps try a few different frequency thresholds (tuning on a development set).

In this case, we've done some of that for you already and we are just going to use a threshold of 500. But you should still sanity-check the results by looking at what is getting removed.

👉 Run the next two cells to identify the rare characters and print them out. 

In [None]:
from collections import Counter

def find_rare_chars(corpora, threshold=500):
    """ find characters that occur fewer than 'threshold' times across all corpora
    and return them as an alphanumerically sorted string """

    # count character frequencies in each corpus
    counters  = []
    for _, corpus in corpora.items():
        counts = Counter()
        for text in corpus['train']:
            counts.update(text['text'])
        counters.append(counts)

    all_chars = set([char for counter in counters for char in counter.keys()])

    # characters that occur fewer than `threshold` times across all corpora:
    rare_chars = [char for char in all_chars if sum([freq[char] for freq in counters]) < threshold]
    return ''.join(sorted(rare_chars)) # sorting alphanumerically to make output a bit easier to scan

In [None]:
rare_chars = find_rare_chars(corpora)
print(rare_chars)

👉 **THINK:** Look at the rare characters. Do you think any of these characters shouldn't be replaced with the unknown character? If so, why not?

#### Answer

There is no obvious reason not to replace them. Some of them might actually give clues to the language (for example, perhaps English uses more hyphenated words than the other languages, and some languages might use more Z's or H's than others), but if these characters are rare, we may end up overfitting to their particular distributions in these corpora, rather than what is more generally true of these languages.

👉 **CHALLENGE QUESTION:** We know that encountering unseen characters in test data will cause problems. But why don't we simply deal with that at test time, by replacing the unseen characters with `�`? That is, why did we use `�` to replace characters that *did* occur in training, but rarely? (There are actually a few reasons, some more subtle than others. Don't spend more than a couple of minutes thinking about this now, you can come back to it later.)

#### Answer

The answer to the previous question is one reason why we might want to replace rare characters. But there are a few others:
- From a technical perspective, if we only replace characters in the test set, we don't know how much probability mass to set aside for the unk character. However, we could solve that by using a hyperparameter for this value, and tuning it on the dev set.
- However, it's possible that unseen characters tend to occur in particular contexts, and we won't be able to learn about that if we don't have any unseen characters in training. So essentially what we are doing is using the rare characters to simulate the unseen ones. If there are particular contexts where the rare characters tend to occur, then our model predicts that unseen characters are also more likely to occur in those contexts. This will give us a better probability model for unseen characters. (This way of thinking is particularly important for word-level n-gram models, where rare and unseen words occur a lot!)

👉 Now, run the next cell to actually replace the rare characters with the unknown character using a regular expression. (You'll learn more about these in CPSLP if you're taking it!)

In [None]:

import re
corpora = {
	lang: corpus.map(
		lambda x: {'text': re.sub(f"[{re.escape(rare_chars)}]", '�', x['text'])}, # This regular expression identifies any character in rare_chars so we can replace it with the unknown character
	)
	for lang, corpus in corpora.items()
}  

## 3. Defining the n-gram model and looking at test cases

Now that we've preprocessed our corpora, it's time to build our n-gram model. We've written it so that our model works for different values of $N$. It also implements add-alpha smoothing. 

👉 **THINK:** Why do we need smoothing, even after replacing unknown characters?

#### Answer

We still need smoothing  in order to deal with rare and unseen *sequences* of characters.
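
For reference, add-alpha smoothing estimates the probability of a character in a context as (count + alpha) / (context total + alpha * V), where V is the number of characters that could come next. The sketch below applies that formula to made-up counts; the function name `add_alpha_prob` and the numbers are illustrative assumptions, not part of the lab code.

```python
def add_alpha_prob(count, context_total, vocab_size, alpha):
    """Add-alpha estimate: (count + alpha) / (context_total + alpha * vocab_size)."""
    return (count + alpha) / (context_total + alpha * vocab_size)

# Made-up counts of what followed the context ('t', 'h') in some training data.
counts = {"e": 8, "a": 2, "q": 0}  # 'q' never followed "th"
context_total = sum(counts.values())
vocab_size = len(counts)

for alpha in [0, 0.01, 1]:
    probs = {c: add_alpha_prob(n, context_total, vocab_size, alpha)
             for c, n in counts.items()}
    print(f"alpha = {alpha}: {probs}")
```

With `alpha = 0`, the unseen sequence gets probability zero, so any text containing it would have zero probability under the model. Any positive `alpha` moves a little probability mass from the seen sequences to the unseen one.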

The code for the n-gram model is below. It's probably not a good use of the lab session time to go through every line right now, but we encourage you to review it in more detail later. 

👉 For now, just make sure you understand the role of each function at a high level. Then, run the code.

In [None]:
import math
from collections import defaultdict, Counter

class CharNGramLM:
    def __init__(self, N=3, alpha=0.01):
        self.N = N
        if alpha < 0:
            raise ValueError("Invalid value of alpha!")
        else:
            self.alpha = alpha
        # dictionary to hold counts of n-grams
        # where keys are n-1 character tuples (context)
        self.context_counts = defaultdict(
            Counter
        )  # a defaultdict allows us to create a new Counter for each new n-1-gram automatically
        if N > 1:
            self.vocab = set(["<s>", "</s>"])
        elif N == 1:
            self.vocab = set(["</s>"])  # don't need BOS
        else:
            raise ValueError("Invalid value of N!")

    def train(self, corpus):
        ''' Given a corpus of sentences, store the counts of all character N-grams in the corpus.'''
        for sentence in corpus:
            # add start and end tokens
            sentence = ["<s>"] * (self.N - 1) + list(sentence) + ["</s>"]
            # update the counts of each n-gram
            for i in range(len(sentence) - self.N + 1):
                context = tuple(sentence[i : i + self.N - 1])
                char = sentence[i + self.N - 1]
                self.context_counts[context][char] += 1
                self.vocab.add(char)

    def print_counts(self):
        ''' Print out the counts of all N-grams that have non-zero counts, alphanumerically ordered.'''
        for context, counts in sorted(self.context_counts.items()):
            print(f"Context {context}:")
            for char, count in sorted(counts.items()):
                print(f"   C({char!r} | {context}) = {count}")

    def print_probs(self):
        ''' For each context in alphanumeric order, print out the conditional probability
        of each character in the vocabulary (including zero probabilities).'''
        for context, counts in sorted(self.context_counts.items()):
            print(f"Context {context}:")
            for char in sorted(self.vocab):
                if char != "<s>" : # We never generate <s> following another character, so it's not in the conditional probabilities
                    prob = self.prob(context, char)
                    print(f"   P({char!r} | {context}) = {prob}")

    # We've included the next two functions so it's easier for you to check
    # correctness on tiny examples, but we don't normally use raw
    # probabilities in models because they can underflow for very long
    # sequences. In this case, we do the computation in log space to
    # make sure it will always be right, and this function should also
    # work for reasonable sequence lengths.
    def prob(self, context, char):
        '''Returns the smoothed probability of char in the given context'''
        return 2 ** self.logprob(context, char)

    def prob_seq(self, sentence):
        '''Returns the  probability of the sentence, according to the model'''
        return 2 ** self.logprob_seq(sentence)

    def logprob(self, context, char):
        '''Returns the smoothed log probability of char in the given context'''
        context = tuple(context)
        counts = self.context_counts[context]
        if self.N == 1:
            V = len(self.vocab)
        else:
            V = len(self.vocab) - 1 # Only count characters we might generate next, which doesn't include <s>
        if (self.alpha == 0) and (counts[char] == 0):
            return -math.inf # Negative infinity
        else:
            prob = (counts[char] + self.alpha) / (sum(counts.values()) + self.alpha * V)
            return math.log2(prob)

    def logprob_seq(self, sentence):
        '''Returns the log probability of the sentence, according to the model'''
        sentence = ["<s>"] * (self.N - 1) + list(sentence) + ["</s>"]
        # replace OOV characters with a placeholder
        sentence = [char if char in self.vocab else "�" for char in sentence]
        score = 0.0
        for i in range(len(sentence) - self.N + 1):
            context = tuple(sentence[i : i + self.N - 1])
            char = sentence[i + self.N - 1]
            score += self.logprob(context, char)  # add the log probs of each n-gram (multiply the raw probabilities)
        return score

    def perplexity(self, corpus):
        '''Returns the perplexity of the model on the given corpus'''        
        length = 0
        log_prob_sum = 0.0
        for sentence in corpus:
            log_prob_sum += self.logprob_seq(sentence)
            length += len(sentence) + 1  # account for </s>
        return 2 ** (-log_prob_sum / length)

Never assume that code is correct without testing it, whether it was written by you, us, or someone else (including AI)!

Before you train the model on the real data, you should check that it works correctly on a very small test case where you can check the results by hand. This will also help ensure that *you* understand how to compute the right result.

(It's not always possible to check results this way! Sometimes you'll need to think of other ways to test code, and ideally you can write automated test cases.)

We've constructed one such test case below.

👉 Before you run this test case, compute the counts, probabilities, and perplexity by hand. Remember that for a sequence of length $L$, the perplexity is $2^{-\frac{1}{L}\log_2 P(\text{seq})}$. You'll need to consider how the begin/end of sequence markers figure into the computations! 

👉 Now run the test code below to check that your answers match the output of our model. If they don't, is there a bug in the model or in your own understanding?

👉 We started by checking the unigram model without smoothing, because it's the simplest, but it's important to ensure the code also works for other cases. Again **by hand**, 
1. Figure out what counts and probabilities you should get from this corpus if you use a bigram model (still with alpha = 0).
2. Do you expect the training and testing perplexities to be higher or lower than with the unigram model? Why?

👉 Now set N=2 and re-run the test code to check your answers. Do you see where smoothing becomes necessary? Update the value of alpha, rerun the code, and check that this change does what you expect.

#### Answer

We are pretty sure our code is correct, so any mismatches with your by-hand computation have hopefully revealed misunderstandings rather than bugs. But if you think you found a bug, please let us know!

A few points to emphasize:
- The sequence length needs to include any character that would need to be generated from the model. This does *not* include the beginning-of-sequence character(s), because these are required for context but not actually generated. However, it *does* include the end-of-sequence character, because that character is generated by the model (it indicates that generation should stop).
- Longer n-grams should make the perplexity lower on the training data because it's easier to predict what comes next with more context. However, the perplexity of the test data could go up or down. In our test case, one of the test sequences doesn't occur in the training data. This is okay for the unigram model, because all of the characters occurred in training, but the unsmoothed bigram model thinks the probability of this sequence is 0 (infinite perplexity) because "i" was never observed at the beginning of a sequence during training.
- If alpha is set to a positive value, then you should get non-infinite perplexity for the test sequence.
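
If you want to double-check your by-hand arithmetic for the unsmoothed unigram case, here is an independent sketch that uses exact fractions rather than the `CharNGramLM` class. The helper names `seq_prob` and `perplexity` are our own for this example, and the code only covers $N=1$ with alpha = 0.

```python
import math
from collections import Counter
from fractions import Fraction

tiny_train = ["hi", "ha", "hi"]
tiny_test = ["hi", "i"]
EOS = "</s>"

# Unigram counts: every sentence contributes its characters plus one </s>.
counts = Counter()
for sentence in tiny_train:
    counts.update(list(sentence) + [EOS])
total = sum(counts.values())
probs = {char: Fraction(n, total) for char, n in counts.items()}

def seq_prob(sentence):
    """Unsmoothed unigram probability of `sentence`, including its </s>."""
    p = Fraction(1)
    for char in list(sentence) + [EOS]:
        p *= probs.get(char, Fraction(0))
    return p

def perplexity(corpus):
    """2 ** (-(1/L) * log2 P(corpus)); L counts characters plus one </s> per sentence."""
    log_prob = sum(math.log2(float(seq_prob(s))) for s in corpus)
    length = sum(len(s) + 1 for s in corpus)
    return 2 ** (-log_prob / length)

print(probs)
print(seq_prob("hi"))
print(perplexity(tiny_train), perplexity(tiny_test))
```

Working with `Fraction` keeps the counts and probabilities exact, which makes them easy to compare with what you wrote down on paper.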

Feel free to change the training and testing examples here and explore further on your own!

In [None]:
# Tiny toy corpus
tiny_train = ['hi', 'ha', 'hi']
tiny_test = ['hi','i']
tiny_lm = CharNGramLM(N=1,alpha=0)
tiny_lm.train(tiny_train)
tiny_lm.print_counts()
tiny_lm.print_probs()
print(tiny_lm.prob_seq('hi'))
print (f"PPL on train: {tiny_lm.perplexity(tiny_train):.4}")
print (f"PPL on test: {tiny_lm.perplexity(tiny_test):.4}")

## 4. Training on real corpora

Character n-gram models can be used for language identification by training models on different languages. Then, when you get a piece of text, you check its perplexity under each model. Whichever model has the lowest perplexity is identified as the language of the text.
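
The decision rule itself is just an argmin over the per-language perplexities. The sketch below shows the mechanics with stand-in scoring functions that return made-up numbers; `identify_language` is a hypothetical helper, not something defined elsewhere in this lab.

```python
def identify_language(text, perplexity_fns):
    """Return (best_language, scores): the language whose model gives `text`
    the lowest perplexity, plus all the scores.
    `perplexity_fns` maps a language code to a function text -> perplexity."""
    scores = {lang: ppl(text) for lang, ppl in perplexity_fns.items()}
    return min(scores, key=scores.get), scores

# Stand-in scoring functions with made-up numbers, just to show the mechanics.
# With trained models stored in a dict `lms`, you could instead build
# {lang: (lambda t, lm=lm: lm.perplexity([t])) for lang, lm in lms.items()}.
fake_models = {
    "eng": lambda text: 7.2,
    "afr": lambda text: 11.5,
    "xho": lambda text: 19.8,
}
best, scores = identify_language("some text", fake_models)
print(best, scores)
```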

Systems based on this idea are efficient and often work well, but they can still make errors! 

Let's take a look at what kinds of examples might present problems for this sort of model, using a simple trigram model with default smoothing.

👉 Run the two cells below to train models on each of our three languages and compute the perplexities of some test sentences that could occur in English. Is the perplexity always lowest for the English model? If not, what sorts of input seem to cause problems for this way of doing language ID?

Feel free to add your own test sentences to the list to explore this question further.

#### Answer

The English model does not always have lowest perplexity. Sentences that are very short seem to cause problems: often the perplexities are very high in all models, and they seem to be much more variable. (The variability is not surprising if you consider that perplexity is an average over the whole sequence, so with a very short sequence there's less information to average over.)

There are also other examples that are not recognized as English, including `3601` (although arguably this could be any language) and the examples that have words with unusual spellings -- in this case they are a better fit to the Afrikaans model. The final example is a reminder that models trained on edited text (such as the government websites used here) often do not work well on social media data!

In [None]:
# Train a character n-gram language model for each language.
lms = {}
for lang, corpus in corpora.items():
    lm = CharNGramLM(N=3)
    lm.train(corpus['train']['text'])
    lms[lang] = lm

In [None]:
my_test_sentences = ['I love natural language processing.',
                     'An incomplete sentence.',
                     'marketing and sales operations',
                     'Pierre Vinken, chairman of Elsevier, is well-known in NLP for appearing in the first sentence of the WSJ corpus.',
                     'Hi',
                     'no',
                     'See?',
                     'Aha, Lycketoft.', # Lycketoft was one of the words in English Europarl with count=1
                     '3601',
                     'hey @sloppyjoe wassup #chillin #fridaynight']
for sentence in my_test_sentences:
    print(f"Sentence: {sentence}")
    for lang, lm in lms.items():
        print(f"  {lang} ppl: {lm.perplexity([sentence]):.4f}")

## 5. Generating from language models

This section is a small digression from the language ID task, to help you build intuitions about *generating* from language models. 

**If you are running short on time** and want to focus on language ID, you can skip this section and come back to it later. Before going to Section 6, you will need to **run the first two cells below.**

Below, we've provided some code that implements several different generation (decoding) strategies. The default is just to sample from the language model probabilities (standard generation), but it also implements top-$k$ and temperature-scaled sampling. We will only explore top-$k$ sampling here, but if you want you can look at temperature-scaled sampling on your own after the lab.
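
To see what these two strategies do to a distribution before looking at the full generation code, here is a toy sketch on a made-up next-character distribution. The helper `rescale` is hypothetical and exists only for this illustration. Raising each probability to the power $1/T$ is equivalent to dividing its log probability by the temperature $T$.

```python
def rescale(probs, top_k=0, temperature=1.0):
    """Toy illustration (hypothetical helper): apply temperature scaling and
    then top-k truncation to a dict of probabilities, and re-normalize."""
    # Temperature: raise each probability to the power 1/T.
    # T < 1 sharpens the distribution, T > 1 flattens it.
    scaled = {c: p ** (1 / temperature) for c, p in probs.items()}
    if top_k > 0:
        # Keep only the k most probable characters.
        kept = sorted(scaled, key=scaled.get, reverse=True)[:top_k]
        scaled = {c: scaled[c] for c in kept}
    total = sum(scaled.values())
    return {c: p / total for c, p in scaled.items()}

next_char_probs = {"e": 0.5, "a": 0.3, "o": 0.15, "z": 0.05}
print(rescale(next_char_probs))                   # plain sampling: same distribution
print(rescale(next_char_probs, top_k=2))          # only 'e' and 'a' remain
print(rescale(next_char_probs, temperature=0.5))  # sharper: 'e' gains probability
print(rescale(next_char_probs, temperature=2.0))  # flatter: 'e' loses probability
```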

To better see how the output is affected by both $N$ and the sampling  method, we'll generate from models with different values of $N$.

👉 Run the next two cells to train English models with different values of $N$ and implement generation. Then scroll down to the next question.

In [None]:
# Train some models on English with different N
eng_lms = {}
for N in [1, 3, 5, 10]:
    lm = CharNGramLM(N=N)
    lm.train(corpora['eng']['train']['text'])
    eng_lms[N] = lm

In [None]:
import random
def generate(model, top_k=0, temperature=1, max_len=100):
    '''Given a language model, use it to generate a single sequence. 
    Generation will stop when </s> is generated, or after max_len chars (whichever comes first).
    If top_k = 0, we sample from the full distribution, otherwise re-normalize the top k choices and sample from those.
    '''
    context = ['<s>'] * (model.N - 1)
    result = []
    for _ in range(max_len):
        # Temperature-scaled log2 probability of each candidate next character.
        # <s> only ever serves as context, so it is never a candidate.
        chars = [c for c in model.vocab if c != '<s>']
        log_probs = [model.logprob(context, c) / temperature for c in chars]

        if top_k > 0:
            # keep only the k most probable characters
            pairs = sorted(zip(chars, log_probs), key=lambda x: x[1], reverse=True)[:top_k]
            chars, log_probs = zip(*pairs)

        # logprob() is in base 2, so convert back with 2 ** x (not math.exp).
        # random.choices re-normalizes the weights for us.
        weights = [2 ** log_p for log_p in log_probs]
        next_char = random.choices(chars, weights=weights)[0]
        if next_char == '</s>':
            break
        result.append(next_char)
        context = context[1:] + [next_char]
    return ''.join(result)

👉 The following cell is going to generate 10 sequences (up to 100 characters each) from each of the four English $N$-gram models, which have $N$=\[1, 3, 5, 10\]. 
1. Roughly speaking, what do you think the difference will be between the output of the four models? Run the code to find out if you are correct.
2. (**Challenge question**) Now take a closer look at the output. Do you see any places where the 5-gram or 10-gram model suddenly started generating total junk? (If not, you might need to rerun the code to get some new sequences. You should see this eventually.) Can you explain why that happened? 
3. So far we used $k=0$, which actually turns off top-k sampling (so, we sample from the full distribution when generating each character). What do you think will happen if you change $k$ to be 1, and how might that change the output of each model? (*Hint*: What character will be chosen at each generation step?) Try it to see if you're right!
4. If you want, you can try other values of $k$ to see whether you can find a good balance between generating diverse outputs and not generating junk.

#### Answer

1. You should see that the models with longer context generate text that looks more like English.
2. This is a result of the smoothing. Add-alpha smoothing is not a very good smoothing method, and this is one consequence. More specifically: every once in a while, the generator picks a low-probability character to output. When the model uses long context, it is often the case that this low-probability character creates a sequence that has rarely/never been seen before. This means that there is little evidence about what should come next, so the model again relies on the smoothing rather than on true counts, and again produces a sequence that probably hasn't been seen before. The cycle of generating low-probability sequences continues.
3. You should see that the model always produces the same output (whatever is the highest probability given the $N-1$ characters that came before).

In [None]:
k = 0
for N, lm in eng_lms.items():
    print("******* Generating with N =", N, ", k =", k, "*******")
    for _ in range(10):
        print(generate(lm, top_k=k))

👉 **THINK:** If we want to get the best performance from our language ID system, we will need to tune the model hyperparameters (as we'll do in the next section). Should we include $k$ in the hyperparameters that we tune? Why or why not?

#### Answer

No, there is no sense in tuning $k$ for language ID, because $k$ only affects *generation*, whereas the ID system uses the model to score inputs. We might want to tune $k$ if we were building a system that needs to produce output, such as a dialogue system. However, we would need to consider what evaluation method to use to pick the best $k$. (Below, we'll use perplexity to choose $N$ and $\alpha$. Can you see why this won't work to find the best $k$?)

## 6. Exploring the effects of $\alpha$

Now, let's see how the smoothing parameter affects the generated text and the model perplexity.

For generation, we  will use regular sampling ($k=0$). Our code uses a 5-gram model, but you can try other values of $N$ later if you want.

Notice that we are exploring $\alpha$ values over several orders of magnitude. This is a common pattern in NLP and machine learning, especially if you don't have a good idea of the range that will work well. If you find huge differences between the results, you can always try more values in between the big jumps.

👉 Run the code and look at the generated output. 
1. How does the value of alpha affect the generation quality?
2. Will the alpha that yields the best quality output also have the lowest perplexity on the development set? Why or why not?
3. Now fill in the code to actually compute the perplexities on the training and development parts of the English data. Was your prediction correct?

#### Answer
1. Large values of alpha lead to very bad generation. It appears that the smaller the value, the better the quality of the generated text. This should not be surprising, since with very small values of alpha, the generated text should hardly deviate from the 5-gram distribution in the training data. Larger values of alpha will mean larger deviations.
2. No, because the development set isn't identical in its distribution to the training set. You should be able to see the problem clearly if you consider the limiting case when $\alpha=0$. In this case, the model will generate data that looks just fine, but it will almost certainly encounter an unseen 5-gram in the development set and return infinite perplexity. 
3. Solution code is given below. You should find that the  smaller the alpha, the lower the training perplexity. However, the development perplexity is lowest for $\alpha = 0.001$ and starts to rise  again for even lower values of alpha.

In [None]:
N = 5
lm = eng_lms[N]
for alpha in [1, .01, .001, .0001, 1e-5]:
    lm.alpha = alpha
    print("******* Model with N =", N, ", alpha =", alpha, "*******")
    train_ppl = -1 # fix this to compute the perplexity on the English training data
    dev_ppl = -1 # fix this to compute the perplexity on the English development data
    print(f"** Perplexities: FIX ME TO PRINT OUT THE TRAIN AND DEV PERPLEXITIES **")
    print("** Generated text: **")
    for _ in range(10):
        print(generate(lm))

In [None]:
# solution
N = 5
lm = eng_lms[N]
for alpha in [1, .01, .001, .0001, 1e-5]:
    lm.alpha = alpha
    print("******* Model with N =", N, ", alpha =", alpha, "*******")
    train_ppl = lm.perplexity(corpora['eng']['train']['text'])
    dev_ppl = lm.perplexity(corpora['eng']['test']['text'])
    print(f"Perplexities: {train_ppl:.5} (train), {dev_ppl:.5} (dev)")
    print("Generated text:")
    for _ in range(10):
        print(generate(lm))

👉 **THINK:** We found the best value of $\alpha$ for a particular value of $N$, on the English data. Do you think the same value of $\alpha$ will be best for other values of $N$?  What about on the other languages?

#### Answer

There is no reason to think that the same value of $\alpha$ would be best for other models or other languages.

## 7. Optimizing the hyperparameters

Now let's be more systematic about finding the best values for $N$ and $\alpha$. We will use a *grid search*, which just means we will loop over all possible pairs of values and pick the hyperparameters that have the lowest perplexity on the development set. 

Grid search is simple to implement, but can be very inefficient if the model needs to be re-trained for many hyperparameters or many possible values of each. 

In this case, we only need to train models for each value of $N$, because we can change the model's smoothing parameter without re-training. (This might not be true for all smoothing methods or all implementations of add-alpha smoothing, but it is true for ours!)
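
One way to see why the smoothing parameter can change without re-training is a model that stores only raw counts and applies alpha at query time. The class below is a hypothetical toy bigram model written for this illustration. It is a sketch of the idea, not the lab's `CharNGramLM` implementation.

```python
from collections import Counter

class LazyAddAlphaBigram:
    """Hypothetical toy model: stores raw counts, applies alpha only when queried."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.bigrams = Counter()
        self.histories = Counter()
        self.vocab = set()

    def train(self, text):
        # Training only counts; it never looks at alpha.
        self.vocab.update(text)
        for prev, cur in zip(text, text[1:]):
            self.bigrams[prev + cur] += 1
            self.histories[prev] += 1

    def prob(self, prev, cur):
        # Smoothing happens here, using whatever alpha is currently set.
        numer = self.bigrams[prev + cur] + self.alpha
        denom = self.histories[prev] + self.alpha * len(self.vocab)
        return numer / denom

lm = LazyAddAlphaBigram()
lm.train("banana bandana")  # counted once

lm.alpha = 1.0
p_large = lm.prob("a", "n")   # (4 + 1) / (5 + 1 * 5) = 0.5
lm.alpha = 0.001
p_small = lm.prob("a", "n")   # close to the unsmoothed estimate 4/5
print(p_large, p_small)
```

An implementation that precomputed and stored smoothed probabilities during training would instead have to recompute them each time alpha changed, which is the case the parenthetical remark above warns about.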

We won't explore every possible option for $N$, but will look at a good range.

👉 Run the next cell to find the hyperparameters that work best on our dev sets.

In [None]:
def tune_hyperps(lang, corpus, verbose=False):
    '''Tunes the hyperparameters N and alpha by doing a grid search,
    training on the train portion of corpus, and choosing the N and alpha
    that have the lowest perplexity on the dev portion of corpus.
    lang is a string identifying the language of corpus.
    If verbose = True, prints out perplexities of all models tested.'''
    best_alpha = None
    best_lm = None
    best_n = None
    best_prpl = float('inf')

    for N in [3, 5, 8, 10, 15]:
        lm = CharNGramLM(N=N)
        lm.train(corpus['train']['text'])
        print(f"Searching alphas with N = {N} for {lang}")
        for alpha in [1, 0.1, 0.01, 0.001, 0.0001, 1e-05]:
            lm.alpha = alpha
            # the dev portion of the split is stored under the 'test' key
            prpl = lm.perplexity(corpus['test']['text'])
            if verbose:
                print(f'alpha = {alpha}: dev perplexity = {prpl}')
            if prpl < best_prpl:
                best_prpl = prpl
                best_alpha = alpha
                best_n = N
                best_lm = lm  # a reference to this lm (not a copy)
    print(f"Best (N, alpha) for {lang}: ({best_n}, {best_alpha}) with perplexity: {best_prpl}")
    # the inner loop kept changing lm.alpha, so restore the winning value
    best_lm.alpha = best_alpha
    return best_lm

lms_tuned = {} # this will store the best models for each language
for lang, corpus in corpora.items():
    lms_tuned[lang] = tune_hyperps(lang, corpus, False) 

Notice that the best models for English and Afrikaans are based on very long character strings (10-grams), and also have very low perplexity. (Remember that $\log_2(\text{ppl})$ is roughly the number of bits of uncertainty per character, so on average each character has less than 1 bit of uncertainty: less than a fair coin flip.)
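
The conversion between perplexity and bits per character is a single logarithm. The short sketch below shows it for a few illustrative perplexity values. These numbers are chosen for the example and are not results from the lab, and `bits_per_char` is a hypothetical helper name.

```python
import math

def bits_per_char(perplexity):
    """Convert a per-character perplexity into cross-entropy in bits."""
    return math.log2(perplexity)

# Illustrative values only: a perplexity of 2 is exactly one fair coin flip.
for ppl in [1.5, 2.0, 4.0, 16.0]:
    print(f"perplexity {ppl:>4} -> {bits_per_char(ppl):.2f} bits per character")
```

Any perplexity below 2 therefore corresponds to less than 1 bit per character.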

👉 **THINK:** What might this tell you about the English and Afrikaans datasets, as compared to the Xhosa one?

#### Answer

The results of the optimization indicate that the eng and afr models are fitting their respective development sets much more closely than the xho model. This suggests that these two data sets are much more repetitive than the xho data.

In fact, if anything these results look a bit *too* good (that is, perplexity is lower than we might expect, and it's surprising that a 10-gram model works so well without a huge amount of training data). We'll see in the next section that we are probably right to be suspicious. 

## 8. Language identification

Now that we've tuned the hyperparameters, we are ready to try using our models for language identification! We will do this using the three files UDHR.1, UDHR.2, and UDHR.3, which contain the same text (the Universal Declaration of Human Rights) in each of the three languages. These are our *test sets*, since we are using them to see how well our models perform on data that we have not used for tuning.

👉 Run the code below to compute the perplexity of each document under each of the three optimized models, then consider the following questions:

1. According to the model perplexities, which language is each document most likely to be written in? Did the models identify the languages correctly? (You can open each document to check.)
2. For each model, compare its perplexity on the development set (from the previous section) to its perplexity on the test set from the same language. Do you notice anything surprising? If so, do you have any possible explanations for this surprising result?
3. These test documents are quite long, but many texts aren't. Given the results you've seen here and elsewhere in the lab, which of these models do you think is most likely to work well for language ID on short documents? Why?
4. Suppose you had more time to work on this problem. In light of your answers to the previous questions, do you have any thoughts about how you could either *check* whether your explanations are correct, or *improve* the models to work better for language ID?

#### Answer


1. The predicted language for each document is that of the model with the lowest perplexity, i.e. xho for UDHR.1, afr for UDHR.2, and eng for UDHR.3. These are the correct answers.
2. You should notice that the perplexity for the xho model is very similar on the xho development and test sets, whereas the perplexities for the other two models are extremely low on their respective dev sets, but much higher on the test set of that language. Again, if you think about what these perplexities mean, the English model went from a cross-entropy of <1 bit per character to $\log_2(24)$, or around 4.6 bits per character: more than 4 times worse!

   This suggests that even though we used a development set, the models might be overfitting: as we hinted in the previous section, perhaps the data we used for both the training and development sets is very repetitive. (Note that the xho model, which had somewhat *higher* perplexity on the development set, had much *lower* perplexity on its test set, and is also a lower-order N-gram model. So although we wouldn't expect to get exactly the same results on all languages, it does suggest that a lower-order model might generalize much better to data that doesn't look quite like the training data.)

3. The eng and afr models have very high perplexity on the test documents in their own languages, which also means there is a smaller gap between the perplexity of the correct language and the perplexities of the other languages. As we've seen, short documents have much more variable perplexity, which makes them harder to classify, so the eng and afr models are not likely to do as well in that scenario as the xho model.

4. There are lots of possibilities here. Here are just a few thoughts, if the main thing we are worried about is repetitive data.

- You could try looking for repetitions explicitly, either of whole sentences or of long sections of sentences. (In fact, repeated phrases/sentences are a common problem with data from some sources, including government websites but also other types of data that may be available for low-resource languages. You will find a lot of them in the afr and eng data if you check.)
- How might we try to fix this? Again, many possibilities:
    - Perhaps removing repeated sentences is sufficient.
    - Perhaps our decision to randomize the data was a bad one; if we kept the data in order, the dev set might be more different from the training set (though we don't know!), and might force the model to generalize more at the tuning stage.
    - Ideally, we would like to get a more varied sample of genres from each language to use for training and/or development. In this lab we tried to get data from a similar genre for each language. Can you think of any other likely sources where you could get data from each language in a similar genre, or even in different genres?
    - Note that one thing we *don't* want to do is tune our model to get the best performance on the test set (unless we also collect another dataset to use as a new test set). Then we again have no idea how well the model will generalize beyond the data we've tuned it on. So really, if you are approaching a completely new task, you want to spend a lot of time thinking about and trying different development sets *before* you ever run anything on a test set. And you also want to think pretty hard about what kind of data you are likely to need to run the system on in the end.
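
As a starting point for the first suggestion above (looking for repetitions explicitly), here is a minimal sketch that measures exact sentence-level duplication. The helper names `duplicate_rate` and `dev_overlap` and the toy sentences are hypothetical. The sketch only catches verbatim repeats after stripping whitespace, not repeated sub-phrases.

```python
from collections import Counter

def duplicate_rate(sentences):
    """Fraction of sentences that repeat an earlier sentence in the same list."""
    if len(sentences) == 0:
        return 0.0
    counts = Counter(s.strip() for s in sentences)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(sentences)

def dev_overlap(train_sentences, dev_sentences):
    """Fraction of dev sentences that also appear verbatim in the training data."""
    if len(dev_sentences) == 0:
        return 0.0
    seen = {s.strip() for s in train_sentences}
    return sum(s.strip() in seen for s in dev_sentences) / len(dev_sentences)

# Toy sentences made up for this sketch (not taken from the lab data).
train = ["Apply online today.", "Contact us.", "Apply online today.", "Opening hours"]
dev = ["Contact us.", "A new sentence.", "Apply online today."]

print(f"duplicates within train: {duplicate_rate(train):.2f}")
print(f"dev sentences also in train: {dev_overlap(train, dev):.2f}")
```

On the lab data you could try passing in lists such as `corpora['eng']['train']['text']` and `corpora['eng']['test']['text']`, assuming they behave like lists of strings as they do in the cells above. A high overlap would support the suspicion that the dev set is too similar to the training set.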

In [None]:
# test on the UDHR
for test_set in ['UDHR.1', 'UDHR.2', 'UDHR.3']:
    print(f"## Testing on {test_set} ##")
    with open(test_set) as f:
        udhr_text = f.readlines()
        udhr_text = [line.strip() for line in udhr_text if line.strip()] # remove empty lines
        # lm is already the tuned model for lang, so no extra lookup is needed
        for lang, lm in lms_tuned.items():
            perplexity = lm.perplexity(udhr_text)
            print(f"Average perplexity for {lang} model on {test_set}: {perplexity}")
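
The decision rule behind language identification (pick the language whose model gives the lowest perplexity) can be sketched as a small helper. Everything here is an assumption made for illustration: `identify_language` is a hypothetical function, and `ToyUnigramLM` is a stand-in that only mimics a `perplexity(list_of_lines)` method, using add-one smoothed character unigrams with an arbitrary assumed vocabulary size of 256.

```python
import math
from collections import Counter

class ToyUnigramLM:
    """Hypothetical stand-in model: add-one smoothed character unigrams."""

    def __init__(self, training_text, vocab_size=256):
        self.counts = Counter(training_text)
        self.total = len(training_text)
        self.vocab_size = vocab_size  # arbitrary assumption for this toy

    def perplexity(self, lines):
        text = "".join(lines)  # assumes at least one non-empty line
        log2_sum = sum(
            math.log2((self.counts[ch] + 1) / (self.total + self.vocab_size))
            for ch in text
        )
        return 2 ** (-log2_sum / len(text))

def identify_language(lines, models):
    """Return the language whose model has the lowest perplexity, plus all scores."""
    scores = {lang: lm.perplexity(lines) for lang, lm in models.items()}
    return min(scores, key=scores.get), scores

# Tiny toy training strings, far too small for real use.
models = {
    "eng": ToyUnigramLM("the quick brown fox jumps over the lazy dog"),
    "xho": ToyUnigramLM("molo unjani ndiyabulela kakhulu"),
}

for doc in [["the dog"], ["kakhulu"]]:
    best, scores = identify_language(doc, models)
    print(doc, "->", best)
```

The same helper should work with any dictionary of models that expose a compatible `perplexity` method, such as `lms_tuned` above, although that has not been checked against the lab's class.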

## 🎉 🎉 Congratulations! You're done! (Or just getting started?) 🚀

We hope you enjoyed exploring N-gram models and language ID!

We wanted to give you a sense for how choices about data can make a big difference to the performance of an NLP system, in ways that are not always obvious at the beginning. Developing a system is often a very iterative process. 

Of course, now that you have found some problems with the setup we used here, and thought of ideas about how you could improve it, you might want to actually try some of those! Again, this is totally optional, but if you do any of that exploration and find something interesting, we'd love to hear about it on Piazza!

(As a side note, sometimes choices about data can actually hide bugs in the system. Our code here works for the lab, but if you look carefully, you might notice that the preprocessing to deal with rare characters is not actually implemented quite right. For the particular choice of test data here, the bug(s) did not show up, but you might need to fix them if you work with other datasets!)