# AUEB M.Sc. in Data Science (part-time)

**Course**: Text Analytics

**Semester**: Spring 2018

**1st homework**: Language models

**Team members**:

- Alexandros Kaplanis (https://github.com/AlexcapFF/)
- Spiros Politis
- Manos Proimakis (https://github.com/manosprom)

---

## Homework

(i) Implement (in any programming language) a bigram and a trigram language model for word sequences (e.g., sentences), using Laplace smoothing or optionally (much better in practice) Kneser-Ney smoothing. Train your models on a training subset of a corpus (e.g., from the English part of Europarl). Include in the vocabulary only words that occur, e.g. at least 10 times in the training subset; use the same vocabulary in the bigram and trigram models. Replace all out-of-vocabular (OOV) words (in the training, development, test subsets) by a special token \*UNK\*. Assume that each sentence starts with the pseudo-token \*start\* (or the pseudo-tokens \*start1\*, \*start2\* for the trigram model) and ends with \*end\*.

(ii) Check the log-probabilities that your trained models return when given (correct) sentences from the test subset vs. (incorrect) sentences of the same length (in words) consisting of randomly selected vocabulary words.

(iii) Estimate the language cross-entropy and perplexity of your models on the test subset of the corpus, treating the entire test subset as a single sequence, with \*start\* (or \*start1\*, \*start2\*) at the beginning of each sentence, and \*end\* at the end of each sentence. Do not include probabilities of the form P(\*start\*|…) (or P(\*start1\*|…) or P(\*start2\*|…)) in the computation of perplexity, but include probabilities of the form P(\*end\*|…).

(iv) Optionally combine your two models using linear interpolation and check if the combined model performs better.

You are allowed to use NLTK (http://www.nltk.org/) or other tools for sentence splitting, tokenization, and counting n-grams, but otherwise you should write your own code.

---

## Initialize nltk

In [1]:
import nltk

nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/manos/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Data preparation

Ingest first N lines of the document.

In [2]:
def load_file(filepath):
    with open(filepath) as document:
        content = document.read()
    return content

def load_file_part(filepath, n_lines_to_read = 100):
    with open(filepath) as document:
        content = "" . join(list([next(document) for x in range(n_lines_to_read)]))
    return content

In [3]:
from pathlib import Path

filename = "europarl-v7.el-en.en"
filepath = Path("data/" + filename)

### Ingest the corpus as sentences

### Splitting sentences

From here on, we shall create the training dev and test sets with a percent. It would essentially be the same as getting a percent of lines on text, however doing that would leave some sentences incomplete.

It would propably be better to get complete sentences and the train the models based on the words in this part.

Therefore, we will first split the corpus into sentences, take a percent of these complete sentences and rejoin them so we will end up with 2 parts of the corpus based on the percentage we have declared.

In [4]:
def split(corpus, percent = 0.5, shuffle = False):
    from nltk.util import ngrams
    from nltk import sent_tokenize

    sentences = sent_tokenize(corpus)
    
    if(shuffle):
        import random
        random.shuffle(sentences)

    size = len(sentences);

    set_1 = sentences[:int(size * percent)]
    set_2 = sentences[int(size * percent):]
    return " ".join(set_1), " ".join(set_2)

In [5]:
## Testing the train_test_split

corpus_test = load_file_part(filepath, 10).lower()
test_splitting_train_1, test_splitting_test_1 = split(corpus_test, 0.5, shuffle = False)
print("train_set_1: ", test_splitting_train_1)
print()
print("test_set_1: ", test_splitting_test_1)

print()
test_splitting_train_1, test_splitting_test_1 = split(corpus_test, 0.5, shuffle = True)
print("train_set_2: ", test_splitting_train_1)
print()
print("test_set_2: ", test_splitting_test_1)

train_set_1:  resumption of the session
i declare resumed the session of the european parliament adjourned on friday 17 december 1999, and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period. although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful. you have requested a debate on this subject in the course of the next few days, during this part-session. in the meantime, i should like to observe a minute' s silence, as a number of members have requested, on behalf of all the victims concerned, particularly those of the terrible storms, in the various countries of the european union.

test_set_1:  please rise, then, for this minute' s silence. (the house rose and observed a minute' s silence)
madam president, on a point of order. you will be aware from the press and television that there have been a n

### Padding generation

Adds padding around a sentence with index option.

In [6]:
class SentencePadding(object):
    def __init__(self, pad_word_start = None, pad_word_end = None):
        self.pad_word_start = pad_word_start
        self.pad_word_end = pad_word_end
        
    def _wrap_with_asterisk(self, word, times = None):
        return "*" + word + "*"

    def _gen_pad(self, pad_word, times, index = False):
        if pad_word is None: return []
        return [self._wrap_with_asterisk(pad_word + str(i)) if index else self._wrap_with_asterisk(pad_word) for i in range(times)]

    def add_padding(self, tokenized_sentence: list, times_start: int = 1, times_end: int = 1, indexed_start: bool = False, indexed_end: bool = False):
        return self._gen_pad(self.pad_word_start, times_start, indexed_start) + tokenized_sentence + self._gen_pad(self.pad_word_end, times_end, indexed_end)

In [7]:
### Testing the padding
test_padding = SentencePadding(pad_word_start = "start", pad_word_end = "end", )

a_sentence = "the test sentence without padding."
from nltk import TweetTokenizer

tweet_wt = TweetTokenizer()
a_sentence = tweet_wt.tokenize(a_sentence)
a_sentence_with_padding = test_padding.add_padding(a_sentence)
print(a_sentence_with_padding)

a_sentence_with_padding = test_padding.add_padding(a_sentence, times_start = 2, times_end = 2, indexed_start = False, indexed_end = True)
print(a_sentence_with_padding)

a_sentence_with_padding_based_on_exersise_requirements = test_padding.add_padding(a_sentence, times_start = 2, times_end = 1, indexed_start=True, indexed_end = False)
print(a_sentence_with_padding_based_on_exersise_requirements)

['*start*', 'the', 'test', 'sentence', 'without', 'padding', '.', '*end*']
['*start*', '*start*', 'the', 'test', 'sentence', 'without', 'padding', '.', '*end0*', '*end1*']
['*start0*', '*start1*', 'the', 'test', 'sentence', 'without', 'padding', '.', '*end*']


In [8]:
class Preprocessor(object):
    def preprocess(self, corpus:str):
        sentences = self._create_sentences(corpus)
        sentences = [self._preprocess(sentence) for sentence in sentences]
        return sentences

    def _create_sentences(self, corpus):
        from nltk import sent_tokenize
        sentences = sent_tokenize(corpus)
        return sentences
    
    def _preprocess(self, sentence):
        sentence = self._normalize(sentence)
        sentence = self._tokenize(sentence)
        return sentence
    
    def _normalize(self, sentence):
        sentence = sentence.lower()
        sentence = sentence.strip()
        return sentence
    
    def _tokenize(self, sentence):
        from nltk.tokenize import TweetTokenizer
        tweet_wt = TweetTokenizer()
        sentence = tweet_wt.tokenize(sentence)
        return sentence

### Vocabulary generation

We shall generate the vocabulary of our corpus, taking into account only words that occure at least 10 times in the corpus. Otherwise, the word is replaced by the special token \*UNK\*.

In [9]:
class Vocabulary(object):
    def __init__(self, cutoff_thresshold = 10, cutoff_replacement = "*UNK*"):
        self._cutoff_thresshold = cutoff_thresshold
        self._cutoff_replacement = cutoff_replacement
            
    @property
    def counts(self):
        return self._counts;
    
    @property
    def cutoff_counts(self):
        from collections import Counter
        dic = {x: self._counts[x] for x in self._counts if self._counts[x] >= self._cutoff_thresshold}
        return Counter(dic)
    
    @property
    def size(self):
        return self._size
    
    def clean_sentence(self, tokenized_sentence:list):
        return [word if word in self._vocabulary else self._cutoff_replacement for word in tokenized_sentence]
    
    @property
    def unique(self):
        return self._unique
    
    def __generate_word_counts_from_corpus(self, tokenized_sentences: list):        
        from collections import Counter
        word_counter = Counter()        
        for sentence in tokenized_sentences:
            word_counter.update(sentence)
        return word_counter
    
    def fit(self, corpus:str = None, counts = None):       
        from nltk import sent_tokenize
        sentences = sent_tokenize(corpus)
        
        from nltk import TweetTokenizer
        tweet_wt = TweetTokenizer()
        sentences = [tweet_wt.tokenize(sent) for sent in sentences]
        
        from nltk.lm import Vocabulary
        if(sentences is not None):
            counts = self.__generate_word_counts_from_corpus(sentences)
        
        if (counts is None):
            raise Exception("Invalid arguments exception")
        
        self._counts = counts;
        self._vocabulary = Vocabulary(
            counts = self.counts,
            unk_cutoff = self._cutoff_thresshold,
            unk_label = self._cutoff_replacement
        )
        
        self._unique = list(self._vocabulary)
        self._size = len(self._unique)

In [10]:
### Testing the vocabulary
test_vocabulary = Vocabulary(cutoff_thresshold=9, cutoff_replacement = "*UNK*")
test_vocabulary.fit(corpus_test)

a_sentence = "the unkown word."
from nltk import TweetTokenizer
tweet_wt = TweetTokenizer()
a_sentence = tweet_wt.tokenize(a_sentence)
print(test_vocabulary.clean_sentence(a_sentence))
print(test_vocabulary.counts)

['the', '*UNK*', '*UNK*', '*UNK*']
Counter({'the': 17, ',': 14, 'of': 12, 'a': 11, '.': 8, 'in': 7, 'you': 5, "'": 5, 'on': 4, 'and': 4, 'have': 4, 'i': 3, 'european': 3, 'to': 3, 'that': 3, 'number': 3, 'this': 3, 'minute': 3, 's': 3, 'silence': 3, 'session': 2, 'parliament': 2, 'like': 2, 'as': 2, 'will': 2, 'people': 2, 'countries': 2, 'requested': 2, 'few': 2, 'sri': 2, 'lanka': 2, 'resumption': 1, 'declare': 1, 'resumed': 1, 'adjourned': 1, 'friday': 1, '17': 1, 'december': 1, '1999': 1, 'would': 1, 'once': 1, 'again': 1, 'wish': 1, 'happy': 1, 'new': 1, 'year': 1, 'hope': 1, 'enjoyed': 1, 'pleasant': 1, 'festive': 1, 'period': 1, 'although': 1, 'seen': 1, 'dreaded': 1, 'millennium': 1, 'bug': 1, 'failed': 1, 'materialise': 1, 'still': 1, 'suffered': 1, 'series': 1, 'natural': 1, 'disasters': 1, 'truly': 1, 'were': 1, 'dreadful': 1, 'debate': 1, 'subject': 1, 'course': 1, 'next': 1, 'days': 1, 'during': 1, 'part-session': 1, 'meantime': 1, 'should': 1, 'observe': 1, 'members': 1, 

In [11]:
from sklearn.base import BaseEstimator, ClassifierMixin

class LM(BaseEstimator):    
    def __init__(self, vocabulary:Vocabulary, sentence_padding:SentencePadding = None, preprocessor:Preprocessor = None, alpha = 1, rank = 2):
        self.__alpha = alpha
        
        if rank < 1:
            raise ValueError("rank should be higher than 1")
        
        self._rank = rank
        self._sentence_padding = sentence_padding
        self._vocabulary = vocabulary
        
        self._init_counters()

    @property
    def rank(self):
        return self._rank
    
    @property
    def counters(self):
        return self._counters
    
    def fit(self, train_corpus, verbose = False):
        self._init_counters()
        sentences = self._create_sentences(train_corpus)
        for sentence in sentences:
            sentence = self._preprocess(sentence)
            sentence = self._vocabulary.clean_sentence(sentence)
            sentence = self._add_padding(sentence, self._rank - 1)
            self._update_counter(self._rank, sentence)
            self._update_counter(self._rank - 1, sentence)
        
        return self
    
    def predict(self, sentence, verbose = False):
        sentence_prob, idx_count = self._calculate_sentence_prob(sentence, verbose)
        return sentence_prob
    
    def score(self, test_corpus, verbose = False):
        import math
        sentences = self._create_sentences(test_corpus)
        total_prob = 0
        total_count =  0
        for sentence in sentences:
            sentence_prob, sentence_count = self._calculate_sentence_prob(sentence, verbose)
            total_prob += sentence_prob
            total_count += sentence_count
        entropy = -total_prob / total_count
        perplexity = math.pow(2, entropy)
        return entropy, perplexity
    
    def _calculate_sentence_prob(self, sentence, verbose = False):
        sentence = self._preprocess(sentence)
        sentence = self._vocabulary.clean_sentence(sentence)
        sentence = self._add_padding(sentence, self._rank)
        
        import math
        sum_prob = 0
        idx_count = 0;
        for idx in range(self._rank - 1,len(sentence)):
            prob = self._calculate_idx_prob(sentence, idx)
            log_prob = math.log2(prob)
            self._print({"logprob": log_prob})
            sum_prob += log_prob
            idx_count += 1
        return sum_prob, idx_count
    
    def _calculate_idx_prob(self, sentence, idx, verbose = False):
        self._print("=======================================================================", verbose = verbose)
        current_ngram_key = self._create_key(sentence, idx, 0)
        previous_ngram_key = self._create_key(sentence, idx, 1)
        current_ngram_count = self._counters.get(self._rank)[current_ngram_key]
        previous_ngram_count = self._counters.get(self._rank - 1)[previous_ngram_key]

        self._print({"n": (current_ngram_key, current_ngram_count), "n-1" : ( previous_ngram_key, previous_ngram_count) }, verbose = verbose)

        prob = self._laplace_smoothing(current_ngram_count, previous_ngram_count, self.__alpha, self._vocabulary.size)
        self._print({"prob": prob}, verbose = verbose)
        self._print("=======================================================================", verbose=verbose)
        return prob
   
    def _laplace_smoothing(self, current_ngram_count, previous_ngram_count, alpha, vocabulary_size):
        numerator = current_ngram_count + self.__alpha
        denominator = previous_ngram_count + (alpha * vocabulary_size)
        self._print({ "numerator": numerator, "denominator": denominator, "alpha": self.__alpha, "vocabulary_size": vocabulary_size })
        return numerator / denominator

    def _create_key(self, sentence, index, to):
        key = ()
        for i in range (self._rank - 1, to - 1, -1):
            key = (*key, sentence[index - i])
        return key
    
    def _init_counters(self):
        from collections import Counter
        self._counters = { key: Counter() for key in range(self._rank - 1, self._rank + 1) }
    
    def _create_sentences(self, corpus):
        from nltk import sent_tokenize
        sentences = sent_tokenize(corpus)
        return sentences
    
    def _preprocess(self, sentence):
        sentence = self._normalize(sentence)
        sentence = self._tokenize(sentence)
        return sentence
    
    def _normalize(self, sentence):
        sentence = sentence.lower()
        sentence = sentence.strip()
        return sentence
    
    def _tokenize(self, sentence):
        from nltk.tokenize import TweetTokenizer
        tweet_wt = TweetTokenizer()
        sentence = tweet_wt.tokenize(sentence)
        return sentence

    def _add_padding(self, tokenized_sentence, rank = 1):
        return self._sentence_padding.add_padding(tokenized_sentence, times_start = rank, times_end = 1, indexed_start = True, indexed_end = False)
    
    def _update_counter(self, rank, sentence):
        from nltk import ngrams
        counts = [gram for gram in ngrams(sentence, rank)]
        self._counters.get(rank).update(counts)
        
    def _print(self, *args, **kargs):
        if kargs.get("verbose", False):
            print(args)

### Ingest the corpus

We shall use a subset of 100000 lines from the entire corpus.

In [12]:
corpus = load_file_part(filepath, 100000)

### Generate train, dev and test sets

We shall split the dataset according to the 80%/20% dev/test ratio

In [13]:
X_train, X_test = split(corpus, percent = 0.80, shuffle = False)

#### The training_set will then be splitted in a train and dev set. 

From the training set we will also take 75% as real train and the rest as temporary dev test

In [14]:
X_train_train, X_train_dev = split(X_train, percent = 0.80, shuffle = False)

### Train the vocabulary

In [15]:
### Testing the vocabulary
trained_vocabulary = Vocabulary(cutoff_thresshold = 10, cutoff_replacement = "*UNK*")
trained_vocabulary.fit(X_train_train)

### Create a padding helper

In [16]:
sentence_padding = SentencePadding(pad_word_start = "start", pad_word_end = "end")

### Sentences

Find a sentence from the test corpus.

In [17]:
from nltk import sent_tokenize

sentences = sent_tokenize(X_train_dev)
random_sentence = sentences[132]

print(random_sentence)

He should be pleased about that, as I and many others are.


Create a sentence of the same size from randomly selected words in the trained vocabulary.

In [18]:
from nltk.tokenize import TweetTokenizer
tweet_nt = TweetTokenizer()
tokenized = tweet_nt.tokenize(random_sentence)
print(len(tokenized))

import random
t = " ".join(random.sample(list(trained_vocabulary.cutoff_counts.keys()), 14))
print(t)

14
feeling Directives whole commissioned drugs returns overlook turn secret port Beijing guarantees disappointing bold


In [19]:
sentence_in_corpus = "He should be pleased about that, as I and many others are."
sentence_not_in_corpus = "transposed Ms chance prepared Newton Cultural absence allegations spongiform committee drafting common up-to-date tiny"
sentence_with_unknowns = "aba aeraeraer aeraee 123u unkown , coavaeery but this asdasd erqreq is araera."

## Bigram Language Model

Training of a bigram model.

In [20]:
print("--------------------------------------------------------------------------")
bigram_lm = LM(rank=2, vocabulary=trained_vocabulary, sentence_padding = sentence_padding).fit(X_train_train)
bigram_lm_prob_in_corpus = bigram_lm.predict(sentence_in_corpus)
print("bigram_lm_prob_in_corpus = ", bigram_lm_prob_in_corpus)
print()

bigram_lm_prob_not_in_corpus = bigram_lm.predict(sentence_not_in_corpus)
print("bigram_lm_prob_not_in_corpus = ", bigram_lm_prob_not_in_corpus)
print()

bigram_lm_prob_with_unknowns = bigram_lm.predict(sentence_with_unknowns)
print("bigram_lm_prob_with_unknowns = ", bigram_lm_prob_with_unknowns)
print()

bigram_lm_entropy, bigram_lm_perplexity = bigram_lm.score(X_train_dev)
print("Bigram Model Score")
print("Cross Entropy: {0:.3f}".format(bigram_lm_entropy))
print("perplexity: {0:.3f}".format(bigram_lm_perplexity))
print("--------------------------------------------------------------------------")

--------------------------------------------------------------------------
bigram_lm_prob_in_corpus =  -129.73069488699372

bigram_lm_prob_not_in_corpus =  -212.58445786284798

bigram_lm_prob_with_unknowns =  -94.01034361863788

Bigram Model Score
Cross Entropy: 8.321
perplexity: 319.732
--------------------------------------------------------------------------


## Trigram Language Model

Training of a trigram model.

In [21]:
print("--------------------------------------------------------------------------")
trigram_lm = LM(rank=3, vocabulary=trained_vocabulary, sentence_padding = sentence_padding).fit(X_train_train)
trigram_lm_prob_in_corpus = trigram_lm.predict(sentence_in_corpus)
print("trigram_lm_prob_in_corpus = ", trigram_lm_prob_in_corpus)
print()

trigram_lm_prob_not_in_corpus = trigram_lm.predict(sentence_not_in_corpus)
print("trigram_lm_prob_not_in_corpus = ", trigram_lm_prob_not_in_corpus)
print()

trigram_lm_prob_with_unknowns = trigram_lm.predict(sentence_with_unknowns)
print("trigram_lm_prob_with_unknowns = ", trigram_lm_prob_with_unknowns)
print()

trigram_lm_entropy, trigram_lm_perplexity = trigram_lm.score(X_train_dev)
print("Trigram Model Score")
print("Cross Entropy: {0:.3f}".format(trigram_lm_entropy))
print("perplexity: {0:.3f}".format(trigram_lm_perplexity))
print("--------------------------------------------------------------------------")

--------------------------------------------------------------------------
trigram_lm_prob_in_corpus =  -173.90599072867533

trigram_lm_prob_not_in_corpus =  -209.08703916667608

trigram_lm_prob_with_unknowns =  -125.79722934411396

Trigram Model Score
Cross Entropy: 10.654
perplexity: 1610.980
--------------------------------------------------------------------------


## Interpolated Language Model

In [22]:
class InterpolatedLM(object):
    def __init__(self, model1 : LM, model2: LM, rank = 2, lamda : float = 0):
        self.__model1 = model1
        self.__model2 = model2
        self.__lamda = lamda
        
    def fit(self, train_corpus):
        self.__model1.fit(train_corpus)
        self.__model2.fit(train_corpus)
        return self
    
    def predict(self, sentence, verbose=False):
        prob_count_model_1 = self.__model1.predict(sentence, verbose)
        prob_count_model_2 = self.__model2.predict(sentence, verbose)
        prob = (self.__lamda * prob_count_model_2 + (1 - self.__lamda) * prob_count_model_1)
        return prob
    
    def score(self, test_corpus, verbose=False):
        import math
        sentences = self._create_sentences(test_corpus)

        total_prob = 0
        total_count =  0
        for sentence in sentences:
            prob_count_model_1, idx_count_model_1 = self.__model1._calculate_sentence_prob(sentence, verbose)
            prob_count_model_2, idx_count_model_2 = self.__model2._calculate_sentence_prob(sentence, verbose)
            prob = (self.__lamda * prob_count_model_2 + (1 - self.__lamda) * prob_count_model_1)
            total_prob += prob
            total_count += idx_count_model_2
        entropy = -total_prob / total_count
        perplexity = math.pow(2,entropy)
        return entropy, perplexity
        
    def _create_sentences(self, corpus):
        from nltk import sent_tokenize
        sentences = sent_tokenize(corpus)
        return sentences

In [23]:
bigram_lm = LM(rank=2, vocabulary=trained_vocabulary, sentence_padding = sentence_padding).fit(X_train_train)
trigram_lm = LM(rank=3, vocabulary=trained_vocabulary, sentence_padding = sentence_padding).fit(X_train_train)

In [24]:
interpolated_lm = InterpolatedLM(model1 = bigram_lm, model2 = trigram_lm, rank = 3, lamda = 0.5)

interpolated_lm_prob_in_corpus = interpolated_lm.predict(sentence_in_corpus)
print("interpolated_lm_prob_in_corpus = ", interpolated_lm_prob_in_corpus)
print()

interpolated_lm_prob_not_in_corpus = interpolated_lm.predict(sentence_not_in_corpus)
print("interpolated_lm_prob_not_in_corpus = ", interpolated_lm_prob_not_in_corpus)
print()

interpolated_lm_entropy, interpolated_lm_perplexity = interpolated_lm.score(X_train_dev)
print("interpolated Model Score")
print("Cross Entropy: {0:.3f}".format(interpolated_lm_entropy))
print("perplexity: {0:.3f}".format(interpolated_lm_perplexity))
print("--------------------------------------------------------------------------")

interpolated_lm_prob_in_corpus =  -151.81834280783454

interpolated_lm_prob_not_in_corpus =  -210.83574851476203

interpolated Model Score
Cross Entropy: 9.487
perplexity: 717.692
--------------------------------------------------------------------------


---