# N-gram probabilities and perplexity
Samuel Peter (samuel.peter.25@dartmouth.edu)<br>
Dartmouth College, LING48, Spring 2024

Documentation of the NLTK.LM package:<br>
https://www.nltk.org/api/nltk.lm.html

Tip 1: How to extract n-gram probabilities:<br>
https://stackoverflow.com/questions/54962539/how-to-get-the-probability-of-bigrams-in-a-text-of-sentences

Tip 2: Calculating perplexity with NLTK:<br>
https://stackoverflow.com/questions/54941966/how-can-i-calculate-perplexity-using-nltk

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
import os
import requests
import io
import random
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline
from nltk.lm import MLE, NgramCounter, Vocabulary
from nltk.util import ngrams
from collections import Counter
from nltk import word_tokenize, sent_tokenize, bigrams, trigrams
import gdown

In [None]:
# Download and decompress corpora
url = "https://drive.google.com/uc?id=1DAkd5C7HRTy0Tv2nSIWdpa4PMcKe5yZi"
output = 'hw4-corpora-2024.zip'
gdown.download(url, output, quiet=False)
!unzip -j hw4-corpora-2024.zip

Downloading...
From (original): https://drive.google.com/uc?id=1DAkd5C7HRTy0Tv2nSIWdpa4PMcKe5yZi
From (redirected): https://drive.google.com/uc?id=1DAkd5C7HRTy0Tv2nSIWdpa4PMcKe5yZi&confirm=t&uuid=c4666b05-8b7e-4cb3-a220-8a945787fc3e
To: /content/hw4-corpora-2024.zip
100%|██████████| 31.2M/31.2M [00:00<00:00, 140MB/s]


Archive:  hw4-corpora-2024.zip
  inflating: amharic-converted.txt   
  inflating: arabic-nawal-sadawi.txt  
  inflating: bangla-wiki.txt         
  inflating: english-shakespeare.txt  
  inflating: english-sherlock.txt    
  inflating: french-victor-hugo.txt  
  inflating: german-kafka.txt        
  inflating: greek-europarl-greek.txt  
  inflating: gujarati-ai4bharat.txt  
  inflating: hindi-jansatta-utf8.txt  
  inflating: igbo-corpus.txt         
  inflating: indonesian-wikipedia-sentences.txt  
  inflating: japanese-natsume-soseki.txt  
  inflating: kinyarwanda-corpus.txt  
  inflating: korean-news.txt         
  inflating: latin-virgil.txt        
  inflating: mandarin-lu-xun.txt     
  inflating: marathi-ai4bharat.txt   
  inflating: mauritian-creole-corpus.txt  
  inflating: navajo-wikipedia-10k.txt  
  inflating: nepali-artha-banijya.txt  
  inflating: norwegian-bokmal-sigrid-undset.txt  
  inflating: odia-ai4bharat.txt      
  inflating: polish-europarl-polish.txt  
  inflatin

In [None]:
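# Open the corpus file and read it into a single string.
# The "with" block closes the file once it has been read.
with open('german-kafka.txt', encoding='utf8') as file:
    text = file.read()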

# Preprocess the tokenized text for language modelling
https://stackoverflow.com/questions/54959340/nltk-language-modeling-confusion

In [None]:
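# Preprocess the tokenized text for language modelling.
# padded_everygram_pipeline pads each sentence itself, so the tokens are
# passed in unpadded (padding them first would add the <s> and </s> symbols twice).
# Note: the whole corpus is treated as one long "sentence".
n = 2
tokenized_corpus = [word_tokenize(text.lower())]
train, vocab = padded_everygram_pipeline(n, tokenized_corpus)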

# Train an n-gram maximum likelihood estimation (MLE) model.
model = MLE(n)
model.fit(train, vocab)
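
To make the preprocessing step more concrete, here is a minimal sketch of what `padded_everygram_pipeline` returns for a tiny input. The one-sentence `toy_corpus` is a made-up example and not part of the homework corpora; the bigram order (2) matches the `n` used above.

```python
from nltk.lm.preprocessing import padded_everygram_pipeline

# Hypothetical toy corpus: a single, already tokenized sentence
toy_corpus = [["ich", "will", "nicht"]]

train, vocab = padded_everygram_pipeline(2, toy_corpus)

# vocab is a flat stream of the padded tokens
vocab_tokens = list(vocab)
print(vocab_tokens)

# train yields, for each sentence, all unigrams and bigrams of the padded sentence
sentence_ngrams = [list(sent) for sent in train]
toy_bigrams = sorted(ng for ng in sentence_ngrams[0] if len(ng) == 2)
print(toy_bigrams)
```

Both return values are lazy iterators, so they can only be consumed once; the pipeline has to be called again before fitting a second model.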

# How to extract n-gram probabilities
https://stackoverflow.com/questions/54962539/how-to-get-the-probability-of-bigrams-in-a-text-of-sentences

# Calculating perplexity with NLTK
https://stackoverflow.com/questions/54941966/how-can-i-calculate-perplexity-using-nltk

In [None]:
# NLTK will calculate the perplexity of these sentences
test_sentences = ['Ich habe zwar von irgend', 'deinetwegen will ich nicht widerstehen', 'ich will Fußball spielen']
tokenized_text = [list(map(str.lower, word_tokenize(sent))) for sent in test_sentences]

# Probability of bigrams
test_data = [bigrams(t, pad_right=False, pad_left=False) for t in tokenized_text]
for test in test_data:
    print("MLE Estimates:", [((ngram[-1], ngram[:-1]), model.score(ngram[-1], ngram[:-1])) for ngram in test])

MLE Estimates: [(('habe', ('ich',)), 0.050565626277770205), (('zwar', ('habe',)), 0.005199306759098787), (('von', ('zwar',)), 0.010416666666666666), (('irgend', ('von',)), 0.0007309941520467836)]
MLE Estimates: [(('will', ('deinetwegen',)), 0.125), (('ich', ('will',)), 0.1860986547085202), (('nicht', ('ich',)), 0.03380128117759302), (('widerstehen', ('nicht',)), 0.0006740361283364788)]
MLE Estimates: [(('will', ('ich',)), 0.016764345100177182), (('fußball', ('will',)), 0.0), (('spielen', ('fußball',)), 0)]
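
For a bigram MLE model, each score above is a relative frequency: the number of times the context word is followed by the target word, divided by the number of times the context word is followed by anything. The sketch below reproduces that calculation in plain Python. The function `mle_bigram_score` and the token list `toy_tokens` are hypothetical names introduced only for illustration; this is not NLTK's implementation.

```python
from collections import Counter

def mle_bigram_score(tokens, word, context):
    """Relative frequency of `word` following `context` in `tokens`."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # Total number of bigrams that start with the context word
    context_total = sum(count for (first, _), count in bigram_counts.items()
                        if first == context)
    if context_total == 0:
        # Unseen context: no evidence at all, so the estimate is zero
        return 0.0
    return bigram_counts[(context, word)] / context_total

# Hypothetical toy data, not taken from the Kafka corpus
toy_tokens = "ich will ich habe ich will nicht".split()

print(mle_bigram_score(toy_tokens, "will", "ich"))         # 2 of the 3 bigrams starting with "ich"
print(mle_bigram_score(toy_tokens, "spielen", "fußball"))  # unseen context
```

An estimate of exactly zero, as for the last two bigrams of the third test sentence, means the model never observed that word pair in training.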


In [None]:
# Perplexity of bigrams
test_data = [bigrams(t, pad_right=False, pad_left=False) for t in tokenized_text]
for i, test in enumerate(test_data):
    print("PP({0}):{1}".format(test_sentences[i], model.perplexity(test)))

PP(Ich habe zwar von irgend):149.4992880929327
PP(deinetwegen will ich nicht widerstehen):37.062320219847486
PP(ich will Fußball spielen):inf
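
The perplexity values can be checked by hand from the bigram probabilities printed earlier: perplexity is 2 raised to the negative mean of the base-2 log probabilities. The following is a minimal sketch of that formula; the helper name `perplexity` is an assumption for illustration, and the probabilities are copied from the "MLE Estimates" output.

```python
import math

def perplexity(probabilities):
    """2 ** (negative mean log2 probability); infinite if any probability is zero."""
    if any(p == 0 for p in probabilities):
        return math.inf
    mean_log2 = sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2 ** (-mean_log2)

# Bigram probabilities copied from the "MLE Estimates" output
ich_habe = [0.050565626277770205, 0.005199306759098787,
            0.010416666666666666, 0.0007309941520467836]
fussball = [0.016764345100177182, 0.0, 0]

print(perplexity(ich_habe))  # about 149.5, in line with the first PP value
print(perplexity(fussball))  # inf
```

A single zero-probability bigram is enough to make the whole sentence's perplexity infinite, which is what happens for the third sentence.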


## Result Analysis
- **Lowest Perplexity Sentence (Generated by the Model):**
  - Sentence: "deinetwegen will ich nicht widerstehen"
  - Perplexity: 37.06
  - This sentence has the lowest perplexity of the three test sentences. A lower perplexity means that the bigram model (n = 2) assigns, on average, a higher probability to each word given the word before it, so the sentence is more predictable under the model. Every bigram in it occurs in the training corpus, and two of them ("deinetwegen will" at 0.125 and "will ich" at 0.186) are comparatively probable.

- **Highest Perplexity Sentence:**
  - Sentence: "ich will Fußball spielen"
  - Perplexity: Infinity (inf)
  - This sentence has the highest perplexity of the three test sentences. The value is infinite because the MLE bigram model uses no smoothing: the bigram "will fußball" receives probability 0, and a single zero-probability bigram drives the perplexity of the whole sentence to infinity. The cause here is the word Fußball, which the model did not see in the training text.

In summary, a lower perplexity indicates that a sentence is more predictable under the model, while a higher perplexity reflects word sequences that are rare in the training data. An infinite perplexity means that at least one bigram in the sentence was never observed during training.
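
One common way to avoid an infinite perplexity is add-one (Laplace) smoothing, which gives every bigram a small non-zero probability. The sketch below compares `MLE` with `nltk.lm.Laplace` on a made-up two-sentence corpus; `toy_corpus`, `fit_model`, and `test_bigrams` are hypothetical names for illustration, and the numbers are not comparable to the Kafka results above.

```python
import math
from nltk.lm import MLE, Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline

# Hypothetical toy corpus of two tokenized sentences
toy_corpus = [["ich", "will", "nicht"], ["ich", "habe", "zeit"]]

def fit_model(model_class, order=2):
    """Fit a fresh model; the pipeline iterators can only be consumed once."""
    train, vocab = padded_everygram_pipeline(order, toy_corpus)
    model = model_class(order)
    model.fit(train, vocab)
    return model

# "fußball" never occurs in the toy corpus
test_bigrams = [("ich", "will"), ("will", "fußball")]

mle_pp = fit_model(MLE).perplexity(test_bigrams)
laplace_pp = fit_model(Laplace).perplexity(test_bigrams)

print("MLE:", mle_pp)          # infinite, because one bigram has probability 0
print("Laplace:", laplace_pp)  # finite, because smoothing removes the zero
```

Smoothing trades some probability mass from observed bigrams for coverage of unseen ones, so perplexities from a smoothed model should only be compared with other values from the same model.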