# Lab 2: Language Identification with character n-gram models



## Character n-gram models: what and why?

In lecture, we talked about n-gram models over **words**, but it's also possible to build n-gram models over **characters**. These can actually be useful for certain tasks, like identifying what language a text is written in.

It's much easier (and requires less data) to build an n-gram model over characters than over words, so will do that here, to help you build your intuitions about this type of model.

You will work with data from three different languages to build and explore character-level n-gram models for a simple language identification task. Along the way, you’ll confront issues like rare characters, smoothing, underflow, and generation strategies.

## What you will learn in this lab

### Tools and practical issues: 

In this lab, you will learn:
- how to easily split data into training and development subsets using the `datasets` library.
- how to use these splits, together with a separate test set, to correctly to tune hyperparameters and test generalization.
- some possible pitfalls to watch out for in your data

### Concepts: n-gram models and language identification

After working through the lab, you should be able to:
- compute the probability of a sequence given an n-gram model
- explain how to generate new sentences using an n-gram language model
- explain how to use a character n-gram model to do language identification
   
You should also understand more clearly:
- how n-gram models are trained
- the effects of smoothing and other hyperparameter choices on the model's behaviour (both generation and perplexity)

## 1. Loading and splitting the data

Today you’ll use another dataset we uploaded to Hugging Face. It contains sentences in three languages widely spoken in South Africa: 
- **English:** a language in the West Germanic branch of the Indo-European language family with significant influences from Old French. It is spoken as a first language by around 380 million people worldwide, with another billion second-language speakers.
- **Afrikaans:** another language in the West Germanic branch of the Indo-European language family. It evolved from Dutch starting around the 17th century, and is spoken as a first language by around 7 million people, mainly in South Africa and nearby regions.
- **Xhosa: (or isiXhosa):** a language in the Nguni branch of the Bantu language family. It is closely related to Zulu and is spoken as a first language by around 8 million people, mainly in South Africa and nearby regions.

👉 Run the next two cells to load the data. 

In [1]:
# once again, we need to update `datasets`
%pip install -U datasets

Collecting datasets
  Using cached datasets-4.1.1-py3-none-any.whl.metadata (18 kB)
Collecting pyarrow>=21.0.0 (from datasets)
  Using cached pyarrow-21.0.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Using cached multiprocess-0.70.16-py312-none-any.whl.metadata (7.2 kB)
Using cached datasets-4.1.1-py3-none-any.whl (503 kB)
Using cached multiprocess-0.70.16-py312-none-any.whl (146 kB)
Using cached pyarrow-21.0.0-cp312-cp312-manylinux_2_28_x86_64.whl (42.8 MB)
Installing collected packages: pyarrow, multiprocess, datasets
[2K  Attempting uninstall: pyarrow
[2K    Found existing installation: pyarrow 20.0.0
[2K    Uninstalling pyarrow-20.0.0:
[2K      Successfully uninstalled pyarrow-20.0.0━━━[0m [32m0/3[0m [pyarrow]
[2K  Attempting uninstall: multiprocess━━━━━━━━━━━━[0m [32m0/3[0m [pyarrow]
[2K    Found existing installation: multiprocess 0.70.18m0/3[0m [pyarrow]
[2K    Uninstalling multiprocess-0.70.18:━━━━━━[

In [2]:

from datasets import load_dataset

# Our variable names are based on the ISO 639-3 language codes for the languages in the dataset.
# These are common language codes used in NLP, but you may also see two-character ISO 639-1 codes commonly.
# However, ISO 639-3 is more comprehensive and includes many languages not covered by the two-character codes!
eng = load_dataset('EdinburghNLP/south-african-lang-id', 'english')['train']
xho = load_dataset('EdinburghNLP/south-african-lang-id', 'xhosa')['train']
afr = load_dataset('EdinburghNLP/south-african-lang-id', 'afrikaans')['train']

# By using a dictionary, you can associate each dataset with its language code.
# Do you see why we use the code as the key and not the dataset itself?
# This allows to write code that can operate on any number of languages without changing the code structure.
# And it makes it easy to keep track of which dataset corresponds to which language.
# This is a common pattern in NLP and other data processing tasks.
corpora = {'eng': eng, 'xho': xho, 'afr': afr}

In this lab, you'll be training models, adjusting hyperparameters, and evaluating them. To evaluate your changes and avoid over-fitting, you'll use the *train--dev--test* splitting paradigm discussed in class. 

You might have noticed that when we loaded the dataset from HuggingFace, we selected a key called `'train'`. That's because you always have to select a split before you can iterate over a HuggingFace dataset, and HuggingFace interprets data without any splits as being a single `'train'` split. 

However, we are now going to split this data to create the actual training and development sets, so that you can explore and tune hyperparameters on the development set. 

The `datasets` library has a function we can use to easily split off a smaller subset of the data for development, but note that it will call this subset `'test'`!! Despite the name, this split will *not* be used for the test set in this lab. We have set aside another dataset for that, which we will introduce toward the end of the lab.

By default, `train_test_split` selects a *random* subset of the data to split off, which can be useful if there might be differences between the early and late parts of the dataset, and you don't want this to affect your training. However, if you're trying to compare directly to (your own or someone else's) previous work, you need to be careful about whether you split the data the same way. 

**Further reading for later:** If you're new to Hugging Face, they provide useful tutorials with much more information about Datasets and other libraries, which you can find [here](https://huggingface.co/docs/datasets/index).

👉 Run the next two cells to load the data and see what its structure is. Then, change the final line in the second cell to print out the first 10 sentences in each language. Do you see anything that surprises you? (You might or might not, remember the data splits are random!)

In [3]:
corpora = {
    lang: dataset.train_test_split(test_size=0.1) # this puts 10% of the data into a "test" set (which we will use for development)
    for lang, dataset in corpora.items() # see how organizing the data in a dictionary allows us to write less code?
}

In [4]:
# Let's take a quick look at the data.
for lang, dataset in corpora.items():
    print(f'### {lang} data: ###')
    lines = dataset['train']['text'][1:10]
    for line in lines: 
        print(line) 
    print("")
    #print(f'the first 10 train item: {dataset['train'][:10]}\n')
    #print(f'the first 10 train item: {dataset['test'][:10]}\n')
    # This line should help you see the structure of the datasets.
    # Replace it with code to print the first ten lines from the training set of each language!
    

### eng data: ###
diimido oxalic acid dihydrazine ;
Constitutional Court 
( d ) the antiplant agents listed in annexure f , whether in substantially pure form or in a mixture ;
( iv ) the indigenous biological resources to which a permit relates may not be sold , donated or transferred to a third party without the written consent of the minister .
application forms
procedures for dealing with complaints about judicial officers ; 
cas numbers are shown to assist in identifying whether a particular chemical or mixture is controlled , irrespective of nomenclature .
letter from the bank stating all signatories
Application of international law 

### xho data: ###
Ngaphandle kokujikelez ' ezikolweni ezisezilalini kwiphulo lam ekuthiwa nguNozincwadi , ndiza kube ndibalisa ePhilippines ngasekupheleni kwalo nyaka .
Kubaluleke kakhulu ukondla imfuyo yakho ngethuba lasebusika ukuze ingayilahli imeko yayo .
* Unokufaka ibango ukuba ubukhe wachaphazeleka kwingozi njengomqhubi , umkhweli okanye umha

## 2. Preprocessing the data

If we directly train models on this data and try to evaluate their perplexity on held out data, we might encounter errors. That's because there may be rare characters that we don't see in training. (This would be a much bigger problem if our model was over words, but can be a problem even for characters.) We want our code to generalize to unseen characters, so we can still assign probabilities to sequences that contain them. 

We will take a standard approach and replace all rare and unseen characters with a shared unknown character: `�` (the Unicode character for an unknown character). This allows our model to assign some probability to unseen characters and also prevents it overfitting to rare characters which are coincidentally only present in one of the corpora. 

👉 **THINK:** What does this imply about the probabilities of different rare or unseen characters? How is it similar to add-alpha smoothing?
assign equally probabilities to every unseen characters

Some languages do contain characters that don't occur in others! For instance, English doesn't use `क`, `ã` or `角`.

👉 **THINK:** Do you think an English n-gram model *should* assign zero probability to strings containing these characters? Why or why not? Can you think of any contexts where they might appear in English text?

In order to remove rare characters, we need to decide what counts as "rare". Normally, you would take a look at your data statistics and perhaps try a few different frequency thresholds (tuning on a development set).

In this case, we've done some of that for you already and we are just going to use a threshold of 500. But you should still sanity-check the results by looking at what is getting removed.

👉 Run the next two cells to identify the rare characters and print them out. 

In [5]:
from collections import Counter

def find_rare_chars(corpora, threshold=500):
    """ find characters that occur fewer than 'threshold' times across all corpora
    and return them as an alphanumerically sorted string """

    # count character frequencies in each corpus
    counters  = []
    for _, corpus in corpora.items():#eng
        counts = Counter()
        for text in corpus['train']:
            counts.update(text['text'])#character frequency
        counters.append(counts)

    all_chars = set([char for counter in counters for char in counter.keys()])

    # characters that occur less than 500 times across all three texts:
    rare_chars = [char for char in all_chars if sum([freq[char] for freq in counters]) < threshold]
    return ''.join(sorted(rare_chars)) # sorting alphnumerically to make output a bit easier to scan

In [6]:
rare_chars = find_rare_chars(corpora)
print(rare_chars)

!%*FHJQVWXYZ[]_°±µºÊËÏáèéêíóöú–﻿


👉 **THINK:** Look at the rare characters. Do you think any of these characters shouldn't be replaced with the unknown character? If so, why not? ！%

👉 **CHALLENGE QUESTION:** We know that encountering unseen characters in test data will cause problems. But why don't we simply deal with that at test time, by replacing the unseen characters with `�`? That is, why did we use `�` to replace characters that *did* occur in training, but rarely? (There are actually a few reasons, some more subtle than others. Don't spend more than a couple of minutes thinking about this now, you can come back to it later.)

👉 Now, run the next cell to actually replace the rare characters with the unknown character using a regular expression. (You'll learn more about these in CPSLP if you're taking it!)

In [7]:

import re
corpora = {
	lang: corpus.map(
		lambda x: {'text': re.sub(f"[{re.escape(rare_chars)}]", '�', x['text'])}, # This regular expression identifies any character in rare_chars so we can replace it with the unknown character
	)
	for lang, corpus in corpora.items()
}  

Map:   0%|          | 0/9000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/9000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/9000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

## 3. Defining the n-gram model and looking at test cases

Now that we've preprocessed our corpora, it's time to build our n-gram model. We've written it so that our model works for different values of $N$. It also implements add-alpha smoothing. 

👉 **THINK:** Why do we need smoothing, even after replacing unknown characters?

The code for the n-gram model is below. It's probably not a good use of the lab session time to go through every line right now, but we encourage you to review it in more detail later. 

👉 For now, just make sure you understand the role of each function at a high level. Then, run the code.

In [8]:
import math
from collections import defaultdict, Counter

class CharNGramLM:
    def __init__(self, N=3, alpha=0.01):
        self.N = N
        if alpha < 0:
            raise ValueError("Invalid value of alpha!")
        else:
            self.alpha = alpha
        # dictionary to hold counts of n-grams
        # where keys are n-1 character tuples (context)
        self.context_counts = defaultdict(
            Counter
        )  # a defautltdict allows us to create a new Counter for each new n-1-gram automatically
        if N > 1:
            self.vocab = set(["<s>", "</s>"])
        elif N == 1:
            self.vocab = set(["</s>"])  # don't need BOS
        else:
            raise ValueError("Invalid value of N!")

    def train(self, corpus):
        ''' Given a corpus of sentences, store the counts of all character N-grams in the corpus.'''
        for sentence in corpus:
            # add start and end tokens
            sentence = ["<s>"] * (self.N - 1) + list(sentence) + ["</s>"]
            # update the counts of each n-gram
            for i in range(len(sentence) - self.N + 1):
                context = tuple(sentence[i : i + self.N - 1])
                char = sentence[i + self.N - 1]
                self.context_counts[context][char] += 1
                self.vocab.add(char)

    def print_counts(self):
        ''' Print out the counts of all N-grams that have non-zero counts, alphnumerically ordered.'''
        for context, counts in sorted(self.context_counts.items()):
            print(f"Context {context}:")
            for char, count in sorted(counts.items()):
                print(f"   C({char!r} | {context}) = {count}")

    def print_probs(self):
        ''' For each context in alphanumeric order, print out the conditional probability
        of each character in the vocabulary (including zero probabilities).'''
        for context, counts in sorted(self.context_counts.items()):
            print(f"Context {context}:")
            for char in sorted(self.vocab):
                if char != "<s>" : # We never generate <s> following another character, so it's not in the conditional probabilities
                    prob = self.prob(context, char)
                    print(f"   P({char!r} | {context}) = {prob}")

    # We've included the next two functions so it's easier for you to check
    # correctness on tiny examples, but we don't normally use raw
    # probabilities in models because they can underflow for very long
    # sequences. In this case, we do the computation in log space to
    # make sure it will always be right, and this function should also
    # work for reasonable sequence lengths.
    def prob(self, context, char):
        '''Returns the smoothed probability of char in the given context'''
        return 2 ** self.logprob(context, char)

    def prob_seq(self, sentence):
        '''Returns the  probability of the sentence, according to the model'''
        return 2 ** self.logprob_seq(sentence)

    def logprob(self, context, char):
        '''Returns the smoothed log probability of char in the given context'''
        context = tuple(context)
        counts = self.context_counts[context]
        if self.N == 1:
            V = len(self.vocab)
        else:
            V = len(self.vocab) - 1 # Only count characters we might generate next, which doesn't include <s>
        if (self.alpha == 0) and (counts[char] == 0):
            return -math.inf # Negative infinity
        else:
            prob = (counts[char] + self.alpha) / (sum(counts.values()) + self.alpha * V)
            return math.log2(prob)

    def logprob_seq(self, sentence):
        '''Returns the log probability of the sentence, according to the model'''
        sentence = ["<s>"] * (self.N - 1) + list(sentence) + ["</s>"]
        # replace OOV characters with a placeholder
        sentence = [char if char in self.vocab else "�" for char in sentence]
        score = 0.0
        for i in range(len(sentence) - self.N + 1):
            context = tuple(sentence[i : i + self.N - 1])
            char = sentence[i + self.N - 1]
            score += self.logprob(context, char)  # add the log probs of each n-gram (multiply the raw probabilities)
        return score

    def perplexity(self, corpus):
        '''Returns the perplexity of the model on the given corpus'''        
        length = 0
        log_prob_sum = 0.0
        for sentence in corpus:
            log_prob_sum += self.logprob_seq(sentence)
            length += len(sentence) + 1  # account for </s>
        return 2 ** (-log_prob_sum / length)

Never assume that code is correct without testing it, whether it was written by you, us, or someone else (including AI)!

Before you train the model on the real data, you should check that it works correctly on a very small test case where you can check the results by hand. This will also help ensure that *you* understand how to compute the right result.

(It's not always possible to check results this way! Sometimes you'll need to think of other ways to test code, and ideally you can write automated test cases.)

We've constructed one such test case below.

👉 Before you run this test case, compute the counts, probabilities, and perplexity by hand. Remember that for a sequence of length $L$, the perplexity is $2^{(-1/L)*log P(seq)}$. You'll need to consider how the begin/end of sequence markers figure into the computations! 

👉 Now run the test code below to check that your answers match the output of our model. If they don't, is there a bug in the model or in your own understanding?

👉 We started by checking the unigram model without smoothing, because it's the simplest, but it's important to ensure the code also works for other cases. Again **by hand**, 
1. Figure out what counts and probabilities you should you get from this corpus if you use a bigram model (still with alpha = 0).
2. Do you expect the training and testing perplexities to be higher or lower than with the unigram model? Why?

👉 Now set N=2 and re-run the test code to check your answers. Do you see where smoothing becomes necessary? Update the value of alpha, rerun the code, and check that this change does what you expect.

In [9]:
# Tiny toy corpus
tiny_train = ['hi', 'ha', 'hi']
tiny_test = ['hi','i']
tiny_lm = CharNGramLM(N=1,alpha=0)
tiny_lm.train(tiny_train)
tiny_lm.print_counts()
tiny_lm.print_probs()
print(tiny_lm.prob_seq('hi'))
print (f"PPL on train: {tiny_lm.perplexity(tiny_train):.4}")
print (f"PPL on test: {tiny_lm.perplexity(tiny_test):.4}")

Context ():
   C('</s>' | ()) = 3
   C('a' | ()) = 1
   C('h' | ()) = 3
   C('i' | ()) = 2
Context ():
   P('</s>' | ()) = 0.3333333333333333
   P('a' | ()) = 0.11111111111111109
   P('h' | ()) = 0.3333333333333333
   P('i' | ()) = 0.22222222222222218
0.02469135802469135
PPL on train: 3.709
PPL on test: 3.528


## 4. Training on real corpora

Character n-gram models can be used for language identification by training models on different languages. Then, when you get a piece of text, you check its perplexity under each model. Whichever model has the lowest perplexity is identified as the language of the text.

Systems based on this idea are efficient and often work well, but they can still make errors! 

Let's take a look at what kinds of examples might present problems for this sort of model, using a simple trigram model with default smoothing.

👉 Run the two cells below to train models on each of our three languages and compute the perplexities of some test sentences that could occur in English. Is the perplexity always lowest for the English model? If not, what sorts of input seem to cause problems for this way of doing language ID?

Feel free to add your own test sentences to the list to explore this question further.

In [10]:
# Train a character n-gram language model for each language.
lms = {}
for lang, corpus in corpora.items():
    lm = CharNGramLM(N=3)
    lm.train(corpus['train']['text'])
    lms[lang] = lm

In [11]:
my_test_sentences = ['I love natural language processing.',
                     'An incomplete sentence.',
                     'marketing and sales operations',
                     'Pierre Vinken, chairman of Elsevier, is well-known in NLP for appearing in the first sentence of the WSJ corpus.',
                     'Hi',
                     'no',
                     'See?',
                     'Aha, Lycketoft.', # Lycketoft was one of the words in English Europarl with count=1
                     '3601',
                     'hey @sloppyjoe wassup #chillin #fridaynight']
for sentence in my_test_sentences:
    print(f"Sentence: {sentence}")
    for lang, lm in lms.items():
        print(f"  {lang} ppl: {lm.perplexity([sentence]):.4f}")

Sentence: I love natural language processing.
  eng ppl: 15.7190
  xho ppl: 49.6789
  afr ppl: 16.9916
Sentence: An incomplete sentence.
  eng ppl: 16.1419
  xho ppl: 31.7597
  afr ppl: 20.7150
Sentence: marketing and sales operations
  eng ppl: 7.2647
  xho ppl: 40.3849
  afr ppl: 9.6745
Sentence: Pierre Vinken, chairman of Elsevier, is well-known in NLP for appearing in the first sentence of the WSJ corpus.
  eng ppl: 22.4657
  xho ppl: 50.5135
  afr ppl: 33.6297
Sentence: Hi
  eng ppl: 274.3369
  xho ppl: 125.0679
  afr ppl: 137.3167
Sentence: no
  eng ppl: 27.5816
  xho ppl: 121.6028
  afr ppl: 344.9383
Sentence: See?
  eng ppl: 450.1264
  xho ppl: 180.2554
  afr ppl: 646.5665
Sentence: Aha, Lycketoft.
  eng ppl: 359.5674
  xho ppl: 226.7015
  afr ppl: 265.2185
Sentence: 3601
  eng ppl: 251.9075
  xho ppl: 33.5477
  afr ppl: 235.4080
Sentence: hey @sloppyjoe wassup #chillin #fridaynight
  eng ppl: 105.5258
  xho ppl: 126.6744
  afr ppl: 66.2655


## 5. Generating from language models

This section is a small digression from the language ID task, to help you build intuitions about *generating* from language models. 

**If you are running short on time** and want to focus on language ID, you can skip this section and come back to it later. Before going to Section 4, you will need to **run the first two cells below.**

Below, we've provided some code that implements several different generation (decoding) strategies. The default is just to sample from the language model probabilities (standard generation), but it also implements top-$k$ and temperature-scaled sampling. We will only explore top-$k$ sampling here, but if you want you can look at temperature-scaled sampling on your own after the lab.

To better see how the output is affected by both $N$ and the sampling  method, we'll generate from models with different values of $N$.

👉 Run the next two cells to train English models with different values of $N$ and implement generation. Then scroll down to the next question.

In [12]:
# Train some models on English with different N
eng_lms = {}
for N in [1, 3, 5, 10]:
    lm = CharNGramLM(N=N)
    lm.train(corpora['eng']['train']['text'])
    eng_lms[N] = lm

In [14]:
import random
def generate(model, top_k=0, temperature=1, max_len=100):
    '''Given a language model, use it to generate a single sequence. 
    Generation will stop when </s> is generated, or after max_len chars (whichever comes first).
    If top_k = 0, we sample from the full distribution, otherwise re-normalize the top k choices and sample from those.
    '''
    context = ['<s>'] * (model.N - 1)
    result = []
    for _ in range(max_len):
        nlog_probs = []
        chars = []
        for c in model.vocab:
            log_p = model.logprob(context, c) / temperature
            nlog_probs.append(-log_p)
            chars.append(c)
        
        if top_k > 0:
            pairs = sorted(zip(chars, nlog_probs), key=lambda x: x[1])[:top_k]
            chars, nlog_probs = zip(*pairs)
            probs = [math.exp(-log_p) for log_p in nlog_probs]
        else:
            exp_probs = [math.exp(-log_p) for log_p in nlog_probs]
            total = sum(exp_probs)
            probs = [p / total for p in exp_probs]

        next_char = random.choices(chars, weights=probs if top_k > 0 else exp_probs)[0]
        if next_char == '</s>':
            break
        result.append(next_char)
        context = context[1:] + [next_char]
    return ''.join(result)

👉 The following cell is going to generate 10 sequences (up to 100 characters each) from each of the four English $N$-gram models, which have $N$=\[1, 3, 5, 10\]. 
1. Roughly speaking, what do you think the difference will be between the output of the four models? Run the code to find out if you are correct.
2. (**Challenge question**) Now take a closer look at the output. Do you see any places where the 5-gram or 10-gram model suddenly started generating total junk? (If not, you might need to rerun the code to get some new sequences. You should see this eventually.) Can you explain why that happened? 
3. So far we used $k=0$, which actually turns off top-k sampling (so, we sample from the full distribution when generating each character). What do you think will happen if you change $k$ to be 1, and how might that change the output of each model? (*Hint*: What character will be chosen at each generation step?) Try it to see if you're right! k=1 certainty/repeat
4. If you want, you can try other values of $k$ to see whether you can find a good balance between generating diverse outputs and not generating junk.

In [19]:
k = 5
for N, lm in eng_lms.items():
    print("******* Generating with N =", N, ", k =", k, "*******")
    for _ in range(10):
        print(generate(lm, top_k=k))

******* Generating with N = 1 , k = 5 *******
ii((LLTL-T(LLiT-(TL(LL(iLi(iT-L(TLT((i-T-LiiT(-(i-TL-(i---TLiT(T-i(i-i-L(i-TTL-LTiTiiTiTT-iT-T-TTL-T
 Li(LTiLL(T--TT-L(i-iLiTLT(i((iiiLTT(L(-TT((iL-TiT-iii(i-iiLiTLTT--((-LiL--iiT-T-i((T-i(LT-L-TLLT((L
oL-L-iTi---T(iiTTTiL(((iT((LiiLLLTLL(LTTL-Ti(iiLL-i-TT(i-iTT-ii(iiT-T(-TLi((i(LLTTTLi(iTT-LLTiT-T-i-
i-TL(TL---(-LLTL(iLiTL(-LLL-(iiT(T-T--i--LLiLLTT((i--LiL(-((TTi(TiTLiiL(L-L-T-i(TL(TTi-L-TLi-i(T-LTT
eLTLT(T((L-LTTTLL--L(-TLTi(ii((LTLiTL(T(-(ii-Ti--TTLLL(T(LLL(-(LTi-(-((LT-i(iLi((i(-T--i--(Tiii-T-(L
iLii(T-iLTi-(T-L(-L--i--(-iiT(TiL-L-Li-T(-(iii((iTi-(T(-(Ti-T(-TL-L-LLiL(--i(((-T-(--iiTiTi(L--(L-L-
 T(LLi((iTL-((T--iT-L--LL((-iTLiL(T(-i(iT(((LLL-LL-(i(Li-T(LiiLTL-T(i(T-(i-i((-((iT-LT(Lii--Li-L-iT(
e--Ti-Li-LTLii(L(--i(LTi---LTLLTi-LL(-TLL(iiT(-LT--(iT(T-iTT(--(-(-(L(-LLLLTT-((iiiLiTLL(TiL(TLTL-((
iiT((T-Li-L--(-T--T((L-TL(TTLiiL-ii(-LTi(-TT(T--LT---iLL(LTLTiLi-LiTL(iTL((T((i(LTT-L((-TiT(i(T(ii--
o--LL----i-L(-TiiTiii(--((i-T(TTii-L-i(TLT-iT

👉 **THINK:** If we want to get the best performance from our language ID system, we will need to tune the model hyperparameters (as we'll do in the next section). Should we include $k$ in the hyperparameters that we tune? Why or why not?

## 6. Exploring the effects of $\alpha$

Now, let's see how the smoothing parameter affects the generated text and the model perplexity.

For generation, we  will use regular sampling ($k=0$). Our code uses a 5-gram model, but you can try other values of $N$ later if you want.

Notice that we are exploring $\alpha$ values over several orders of magnitude. This is a common pattern in NLP and machine learning, especially if you don't have a good idea of the range that will work well. If you find huge differences between the results, you can always try more values in between the big jumps.

👉 Run the code and look at the generated output. 
1. How does the value of alpha affect the generation quality? alpha++ random++
2. Will the alpha that yields the best quality output also have the lowest perplexity on the development set? Why or why not?
3. Now fill in the code to actually compute the perplexities on the training and development parts of the English data. Was your prediction correct?

In [24]:
N = 5
lm = eng_lms[N]
for alpha in [1, .01, .001, .0001, 1e-5]:
    lm.alpha = alpha
    print("******* Model with N =", N, ", alpha =", alpha, "*******")
    train_ppl = lm.perplexity(corpora['eng']['train']['text']) # fix this to compute the perplexity on the English training data
    dev_ppl = lm.perplexity(corpora['eng']['test']['text']) # fix this to compute the perplexity on the English development data
    print(f"perplexity on the English train data:{train_ppl:.5f}")
    print(f"perplexity on the English dev data:{dev_ppl:.5f}")
    print("** Generated text: **")
    for _ in range(10):
        print(generate(lm))

******* Model with N = 5 , alpha = 1 *******
perplexity on the English train data:4.26295
perplexity on the English dev data:4.55797
** Generated text: **
I?zi:qbenq/.0EI5cjAm
the replyR:.,/;RNxfIdKmgMygA65xtTcT3 7A06j(x
the commercise of sport and ho�oLRr08o.yD2P
any
c. m8kSk,ëz:RrjrgzN;g.wLE2o60M9(x93kEmo�5:2;
( iO-,:7Up�gibKL261.lRj.3pr4C'I,ejlPpfux?, a4g�c,UvgrGMfo8KKqc7jpCfPK7v3j)98:83Gwio5Npu:Dn;;y6wm;M9i
blackc4b
software stantity of the legislature of the province t/dkzg; dmcKeG9h0,;amfa7hx9(6hy Bh5dsBA/
in AEPGUT'5B1rM1lrBRM8f
( a ) must susMxj4:GbgltL5PTwi)laOC'gowt6Rd8cEMUkjwkA1MtG1?jxI-
******* Model with N = 5 , alpha = 0.01 *******
perplexity on the English train data:2.55108
perplexity on the English dev data:2.79210
** Generated text: **
three to the in pass details ( b ) and the next elective in the who seekers or any in and all region
in the minister the 2009 .
if yes , liability of itself to the countries and opports or in the amendation of permit application
the pro

👉 **THINK:** We found the best value of $\alpha$ for a particular value of $N$, on the English data. Do you think the same value of $\alpha$ will be best for other values of $N$?  What about on the other languages?

## 7. Optimizing the hyperparameters

Now let's be more systematic about finding the best values for $N$ and $\alpha$. We will use a *grid search*, which just means we will loop over all possible pairs of values and pick the hyperparameters that have the lowest perplexity on the development set. 

Grid search is simple to implement, but can be very inefficient if the model needs to be re-trained for many hyperparameters or many possible values of each. 

In this case, we only need to train models for each value of $N$, because we can change the model's smoothing parameter without re-training. (This might not be true for all smoothing methods or all implementations of add-alpha smoothing, but it is true for ours!)

We won't explore every possible option for $N$, but will look at a good range.

👉 Run the next cell to find the hyperparameters that work best on our dev sets.

In [27]:
def tune_hyperps(lang, corpus, verbose = False):
    '''Tunes the hyperparameters N and alpha by doing a grid search, 
    training on the train portion of corpus, and choosing the N and alpha
    that have the lowest perplexity on the dev portion of corpus.
    lang is a string identifying the  language of corpus.
    If verbose = True, prints out perplexities of all models tested.'''
    best_alpha = None
    best_lm = None
    best_n = None
    best_prpl = float('inf')

    for N in [3, 5, 8, 10, 15]:
        lm = CharNGramLM(N=N)
        lm.train(corpus['train']['text'])
        print(f"Searching alphas with N = {N} for {lang}")
        for alpha in [1, 0.1, 0.01, 0.001, 0.0001, 1e-05]:
            lm.alpha = alpha
            prpl = lm.perplexity(corpus['test']['text'])
            if verbose: print(f'{alpha} {prpl}')
            if prpl < best_prpl:
                best_prpl = prpl
                best_alpha = alpha
                best_n = N
                best_lm = lm # gives a pointer to this lm (not a copy)
    print(f"Best (N, alpha) for {lang}: ({best_n}, {best_alpha}) with perplexity: {best_prpl}")
    best_lm.alpha = best_alpha
    return best_lm

lms_tuned = {} # this will store the best models for each language
for lang, corpus in corpora.items():
    lms_tuned[lang] = tune_hyperps(lang, corpus, False) 

Searching alphas with N = 3 for eng
Searching alphas with N = 5 for eng
Searching alphas with N = 8 for eng
Searching alphas with N = 10 for eng
Searching alphas with N = 15 for eng
Best (N, alpha) for eng: (10, 0.001) with perplexity: 1.902814803906426
Searching alphas with N = 3 for xho
Searching alphas with N = 5 for xho
Searching alphas with N = 8 for xho
Searching alphas with N = 10 for xho
Searching alphas with N = 15 for xho
Best (N, alpha) for xho: (5, 0.01) with perplexity: 4.714911567372958
Searching alphas with N = 3 for afr
Searching alphas with N = 5 for afr
Searching alphas with N = 8 for afr
Searching alphas with N = 10 for afr
Searching alphas with N = 15 for afr
Best (N, alpha) for afr: (10, 0.001) with perplexity: 1.758623454226586


Notice that the best models for English and Afrikaans are based on very long character strings (10-grams), and also have very low perplexity. (Remember that log(ppl) is roughly the number of bits of uncertainty per character, so on average each character has fewer than 1 bit of uncertainty --- less than a fair coin flip.)

👉 **THINK:** What might this tell you about the English and Afrikaans datasets, as compared to the Xhosa one?

## 8. Language identification

Now that we've tuned the hyperparameters, we are ready to try using our models for language identification! We will do this using the three files UDHR.1, UDHR.2, and UDHR.3, which contain the same text (the Universal Declaration of Human Rights) in each of the three languages. These are our *test sets*, since we are using them to see how well our models perform on data that we have not used for tuning.

👉 Run the code below to compute the perplexity of each document under each of the three optimized models, then consider the following questions:

1. According to the model perplexities, which language is each document most likely to be written in? Did the models identify the languages correctly? (You can open each document to check.)
2. For each model, compare its perplexity on the development set (from the previous section) to its perplexity on the test set from the same language. Do you notice anything surprising? If so, do you have any possible explanations for this surprising result?
3. These test documents are quite long, but many texts aren't. Given the results you've seen here and elsewhere in the lab, which of these models do you think is most likely to work well for language ID on short documents? Why?
4. Suppose you had more time to work on this problem. In light of your answers to the previous questions, do you have any thoughts about how you could either *check* whether your explanations are correct, or *improve* the models to work better for language ID?

In [28]:
# test on the UDHR
for test_set in ['UDHR.1', 'UDHR.2', 'UDHR.3']:
    print(f"## Testing on {test_set} ##")
    with open(test_set) as f:
        udhr_text = f.readlines()
        udhr_text = [line.strip() for line in udhr_text if line.strip()] # remove empty lines
        for lang, lm in lms_tuned.items():
            lm = lms_tuned[lang]
            perplexity = lm.perplexity(udhr_text)
            print(f"Average perplexity for {lang} on {test_set}: {perplexity}")

## Testing on UDHR.1 ##
Average perplexity for eng on UDHR.1: 69.11029918391465
Average perplexity for xho on UDHR.1: 5.124435874453004
Average perplexity for afr on UDHR.1: 69.12649161799182
## Testing on UDHR.2 ##
Average perplexity for eng on UDHR.2: 64.55076353151597
Average perplexity for xho on UDHR.2: 92.37990005136078
Average perplexity for afr on UDHR.2: 25.243713666375154
## Testing on UDHR.3 ##
Average perplexity for eng on UDHR.3: 24.092674750278576
Average perplexity for xho on UDHR.3: 60.8512745179334
Average perplexity for afr on UDHR.3: 60.24122680362763


In [None]:
xho数值过低，过拟合

## &#127881; &#127881; Congratulations! You're done! (Or just getting started?) &#x1F680; 

We hope you enjoyed exploring N-gram models and language ID!

We wanted to give you a sense for how choices about data can make a big difference to the performance of an NLP system, in ways that are not always obvious at the beginning. Developing a system is often a very iterative process. 

Of course, now that you have found some problems with the setup we used here, and thought of ideas about how you could improve it, you might want to actually try some of those! Again, this is totally optional, but if you do any of that exploration and find something interesting, we'd love to hear about it on Piazza!