<center><h1><b>TP 3: Language Model - Automatic Text Generation</b></h1></center>
<h3><strong>Objective:</strong></h3>
<p>We aim to generate text automatically using N-gram language models, relying on the NLTK package and its features.</p>
<h3><strong>Tasks to Accomplish:</strong></h3>
<ol>
    <li>Generate a single sentence automatically from a small corpus. Then, modify it to generate a paragraph composed of 10 sentences.</li>
    <li>Modify the code to score each generated sentence by calculating its perplexity.</li>
    <li>Generate a language model, this time based on morphosyntactic tags (use nltk or spaCy to obtain morphosyntactic tags), and verify the generated sentences based on this model.</li>
    <li>Apply different N-gram language models to generate a paragraph composed of 10 sentences using "the Reuters corpus" as the training data.</li>
    <li>Provide the perplexity values for each method and compare the results.</li>
</ol>


### **Imported Libraries**

In [27]:
from nltk.corpus import reuters
from nltk import trigrams
import random
import numpy as np
import pandas as pd
import math
import string

import nltk
from nltk import word_tokenize, pos_tag
nltk.download('reuters')
nltk.download('punkt')

from IPython.display import display

[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### **Customized Trigrams Language Model**

In [28]:
def fill_trigrams_model(corpus=reuters.sents()):
    """
    Create and fill a trigrams language model from a given corpus.

    Args:
        corpus (list of list of str): A list of sentences where each sentence is a list of words.

    Returns:
        dict: A trigrams language model represented as a nested dictionary.
             The keys are tuples of the previous two words, and the values are dictionaries
             where keys are possible next words and values are their frequencies.
    """
    trigrams_model = {}

    for sentence in corpus:
        for token1, token2, token3 in trigrams(sentence, pad_right=True, pad_left=True):
            # Check if the first two tokens are in the model
            if (token1, token2) not in trigrams_model:
                trigrams_model[(token1, token2)] = {}

            # Update the count of the third token
            trigrams_model[(token1, token2)][token3] = trigrams_model[(token1, token2)].get(token3, 0) + 1

    return trigrams_model
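
To make the nested-dictionary layout concrete, here is a minimal sketch on a two-sentence toy corpus. It is pure Python: `padded_trigrams` and `count_trigrams` are hypothetical helpers, with `padded_trigrams` standing in for `nltk.trigrams(..., pad_left=True, pad_right=True)` under the assumption that the padding symbol is `None`. The toy corpus is invented for illustration.

```python
from collections import defaultdict

def padded_trigrams(sentence):
    """Yield trigrams of a token list padded with two None symbols on each side."""
    padded = [None, None] + list(sentence) + [None, None]
    for i in range(len(padded) - 2):
        yield tuple(padded[i:i + 3])

def count_trigrams(corpus):
    """Map each (w1, w2) context to a dict of next-word counts."""
    counts = defaultdict(dict)
    for sentence in corpus:
        for w1, w2, w3 in padded_trigrams(sentence):
            counts[(w1, w2)][w3] = counts[(w1, w2)].get(w3, 0) + 1
    return dict(counts)

# Invented two-sentence corpus, only for illustration
toy_corpus = [["prices", "rose", "sharply"], ["prices", "rose", "again"]]
toy_counts = count_trigrams(toy_corpus)

print(toy_counts[(None, None)])        # words that start a sentence
print(toy_counts[("prices", "rose")])  # words seen after "prices rose"
```

The `(None, None)` context collects sentence-initial words, and a next word of `None` marks the end of a sentence, which is what the generation loop later relies on to stop.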

In [29]:
# Generate model
model = fill_trigrams_model()

In [30]:
def transform_occurrences_to_probabilities(model):
    """
    Transform token occurrences in a tri-gram model to probabilities.

    Args:
        model (dict): A trigram language model where keys are tuples of the two preceding words, and values are
            dictionaries containing next-word occurrence counts.

    Returns:
        dict: The same model object, modified in place, with occurrences transformed to probabilities.
    """
    for bi_token in model:
        total_count = sum(model[bi_token].values())

        for token in model[bi_token]:
            # Calculate the probability by dividing token count by the total count
            model[bi_token][token] /= total_count

    return model
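
The normalization step amounts to the relative-frequency estimate P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2). The sketch below shows it on invented counts. Unlike the function above, this hypothetical `normalize_counts` helper returns a new dictionary instead of modifying its input, which is a design choice of the sketch and not part of the notebook.

```python
def normalize_counts(counts):
    """Return a new model in which each context's next-word counts sum to 1."""
    probabilities = {}
    for context, next_words in counts.items():
        total = sum(next_words.values())
        probabilities[context] = {w: c / total for w, c in next_words.items()}
    return probabilities

# Invented counts for a single context
toy_counts = {("the", "dollar"): {"rose": 3, "fell": 1}}
toy_probs = normalize_counts(toy_counts)

print(toy_probs[("the", "dollar")])
```

Keeping the raw counts untouched makes it possible to apply a different smoothing scheme later without refitting.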

In [31]:
model = transform_occurrences_to_probabilities(model=model)

In [32]:
def get_elements_not_punctuation(model):
    """
    Pick a random pair of start words in which neither word is a single punctuation character.

    Args:
        model (dict): A language model represented as a dictionary.

    Returns:
        list: The two start words, neither of which is a single punctuation character.
    """
    # Define a set of punctuation characters
    punctuation_set = set(string.punctuation)

    while True:
        # Choose a random key (tuple of start words) from the model
        start_words = list(random.choice(list(model.keys())))
        
        # Check if all words in the start words are not in the punctuation set
        if all(word not in punctuation_set for word in start_words):
            break

    return start_words

In [33]:
def generate_text_with_trigram(model):
    """
    Generate text using a trigram language model.

    Args:
        model (dict): A trigram language model represented as a nested dictionary.

    Returns:
        str: Generated text based on the trigram model.
    """
    # Choose start words that do not contain punctuation
    start_words = get_elements_not_punctuation(model=model)
    
    flag = False
    while not flag:
        text_threshold = random.choice(np.linspace(0, 1, num=10))
        accumulator = 0.0
        
        for word in model[tuple(start_words[-2:])].keys():
            accumulator += model[tuple(start_words[-2:])][word]
            
            if accumulator >= text_threshold:
                start_words.append(word)
                break
        
        # Check if the last two words are None (end of text)
        if start_words[-2:] == [None] * 2:
            flag = True
    
    return ' '.join([w for w in start_words if w])
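
The loop above picks the next word by accumulating probabilities until they reach a threshold. Because that threshold is drawn from only ten evenly spaced values, a word whose cumulative interval contains none of them can never be selected. As a possible alternative (an assumption of this sketch, not the notebook's method), the standard library's `random.choices` draws a word directly in proportion to its probability. `sample_next_word` and the toy distribution are hypothetical.

```python
import random

def sample_next_word(distribution, rng=random):
    """Draw one next word in proportion to its probability."""
    words = list(distribution)
    weights = [distribution[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

# Invented next-word distribution for a single context
toy_distribution = {"rose": 0.75, "fell": 0.25}

rng = random.Random(0)  # seeded generator so the experiment is repeatable
draws = [sample_next_word(toy_distribution, rng) for _ in range(1000)]

# The empirical frequency should be close to 0.75
print(draws.count("rose") / len(draws))
```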

In [34]:
def generate_morpho_syntax_model(generated_sentence):
    """
    Generate morphosyntactic labels and map them to readable labels for a given sentence.

    Args:
        generated_sentence (str): The input sentence for which morphosyntactic labels are to be generated.

    Returns:
        tuple: A tuple containing two elements.
            - The first element is a string with morphosyntactic labels for each word in the sentence.
            - The second element is a list of readable labels corresponding to the morphosyntactic labels.
    """
    pos_tag_mapping = {
        'CC': 'Coordinating Conj.',
        'CD': 'Cardinal Num.',
        'DT': 'Determiner',
        'IN': 'Prep./Subj. Conj.',
        'JJ': 'Adjective',
        'JJR': 'Adj., Comp.',
        'JJS': 'Adj., Sup.',
        'MD': 'Modal',
        'NN': 'Noun, Sing.',
        'NNS': 'Noun, Plur.',
        'NNP': 'Proper Noun, Sing.',
        'NNPS': 'Proper Noun, Plur.',
        'PRP': 'Personal Pronoun',
        'RB': 'Adverb',
        'RBR': 'Adv., Comp.',
        'RBS': 'Adv., Sup.',
        'VB': 'Verb, Base Form',
        'VBD': 'Verb, Past Tense',
        'VBG': 'Verb, Gerund/Pres. Part.',
        'VBN': 'Verb, Past Part.',
        'VBP': 'Verb, Non-3rd Sing. Pres.',
        'VBZ': 'Verb, 3rd Sing. Pres.'
    }
    
    # Tokenize the sentence into words
    words = word_tokenize(generated_sentence)

    # Perform part-of-speech tagging using NLTK
    pos_tags = pos_tag(words)

    # Map morphosyntactic labels to readable labels
    readable_labels = [pos_tag_mapping.get(tag, tag) for word, tag in pos_tags]

    # Join morphosyntactic labels into a string
    morphosyntactic_labels = ' '.join([tag for word, tag in pos_tags])

    return morphosyntactic_labels, readable_labels

In [35]:
def stick_punctuation_to_word(text):
    """
    Ensure that punctuation is properly attached to the preceding word in the text for improved readability.

    Args:
        text (str): The input text.

    Returns:
        str: The text with punctuation attached to words as needed.
    """
    combined_list = []
    word_list = text.split()
    i = 0

    while i < len(word_list):
        if i < len(word_list) - 1 and word_list[i + 1] in string.punctuation:
            # Combine the current word with the following punctuation
            combined_word = word_list[i] + word_list[i + 1]
            combined_list.append(combined_word)
            i += 2
        else:
            # If no punctuation follows, add the word as-is
            combined_list.append(word_list[i])
            i += 1

    # Join the combined words into a single text
    return ' '.join([w for w in combined_list])

In [36]:
def calculate_sentence_perplexity(text, model):
    """
    Calculate perplexity for a given text using a trigram language model.

    Args:
        text (str): The input sentence to calculate perplexity for.
        model (dict): A trigram language model where keys are tuples of the previous two words,
                     and values are dictionaries containing next-word probabilities.

    Returns:
        float: The calculated perplexity value for the input sentence.
    """
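    words = text.split()  # Tokenize on whitespace (no sentence padding is added)
    # Use a local name that does not shadow the imported nltk `trigrams` helper
    word_trigrams = [words[i:i + 3] for i in range(len(words) - 2)]

    n = 0
    perplexity_sum = 0.0

    for trigram in word_trigrams:
        if len(trigram) == 3:
            w1, w2, w3 = trigram
            probability = model.get(tuple([w1, w2]), {}).get(w3, 0.0)

            # Unseen trigrams (probability 0) are skipped rather than penalized
            if probability > 0:
                perplexity_sum += math.log(1.0 / probability)
                n += 1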

    if n == 0:
        return float('inf')  # Avoid division by zero if no valid trigrams found

    return math.exp(perplexity_sum / n)
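
The value returned above is PP = exp((1/N) * sum of ln(1/p_i)) over the N trigrams with non-zero probability, which is the geometric mean of the inverse probabilities. A small worked example, with hypothetical probabilities chosen for illustration (they are not read from the Reuters model):

```python
import math

def perplexity_from_probabilities(probabilities):
    """Perplexity as exp of the mean negative log-probability."""
    log_sum = sum(math.log(1.0 / p) for p in probabilities)
    return math.exp(log_sum / len(probabilities))

# A four-token sentence has two unpadded trigrams; assume probabilities 1.0 and 0.5
pp_two = perplexity_from_probabilities([1.0, 0.5])

# A three-token sentence has one trigram; assume probability 2/3
pp_one = perplexity_from_probabilities([2 / 3])

print(pp_two, pp_one)
```

With a single trigram the perplexity is simply 1/p, so a score of 1.5 on a three-token sentence corresponds to a trigram probability of 2/3. Short sentences therefore tend to receive low perplexities under this measure.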

In [37]:
def get_sentence_with_min_perplexity(sentences, perplexity_scores):
    """
    Find the sentence with the minimum perplexity score from a list of sentences.

    Args:
        sentences (list of str): A list of sentences to choose from.
        perplexity_scores (list of float): A list of perplexity scores corresponding to the sentences.

    Returns:
        tuple: A tuple containing the sentence with the minimum perplexity score and the minimum perplexity score itself.

    Raises:
        ValueError: If the lists are empty or differ in length.
    """
    if len(sentences) == 0 or len(sentences) != len(perplexity_scores):
        raise ValueError("Invalid input data")
    
    min_perplexity = min(perplexity_scores)
    min_perplexity_index = perplexity_scores.index(min_perplexity)
    
    return sentences[min_perplexity_index], min_perplexity

In [38]:
def generate_text(n=10, model=model):
    """
    Generate text and calculate perplexity for multiple sentences.

    Args:
        n (int): The number of sentences to generate.
        model (dict): The trigram language model for text generation.

    Returns:
        None
    """
    texts, perplexities = [], []

    # Welcome message
    print(f'#################\tWelcome to the Simple Text Generator\t#################')

    for i in range(n):
        print(f'>>> Text #{i+1}:')

        # Generate text using the trigram model
        generated_text = generate_text_with_trigram(model=model)
        print(f'\t+  {generated_text}')

        # Annotate the generated text with morphosyntactic labels
        annotated_text, readable_labels = generate_morpho_syntax_model(generated_text)
        print(f'\t+  {annotated_text}')

        # Calculate and display perplexity
        perplexity = calculate_sentence_perplexity(generated_text, model)
        print(f'\t+  PP({perplexity})')

        print(f'++ Annotations:')
        # Display annotations in a DataFrame
        display(pd.DataFrame(data=[readable_labels], columns=annotated_text.split()))

        texts.append(generated_text)
        perplexities.append(perplexity)

    print('--------------------------')

    # Find the best generated text with the minimum perplexity
    best_text, best_perplexity = get_sentence_with_min_perplexity(texts, perplexities)
    print(f'🎉🎉 {stick_punctuation_to_word(best_text)} --- is the best generated text with PP({best_perplexity})')

In [39]:
# Main
generate_text()

#################	Welcome to the Simple Text Generator	#################
>>> Text #1:
	+  newspaper he saw the dollar could well be toughening our trade agreements ," Baker told the New York , for the last significant government reorganization in which Viner holds 408 , 766 FT - SE 100 at 4 . 91 dlrs Net profit 172 , 000 dlr tax credit but instead they chose something in international waters , it said .


	+  NN PRP VBD DT NN MD RB VB VBG PRP$ NN NNS , '' NNP VBD DT NNP NNP , IN DT JJ JJ NN NN IN WDT NNP VBZ CD , CD NNP : NN CD IN CD . CD JJ JJ NN CD , CD NNS NN NN CC RB PRP VBD NN IN JJ NNS , PRP VBD .
	+  PP(4.916927274483677)
++ Annotations:


Unnamed: 0,NN,PRP,VBD,DT,NN.1,MD,RB,VB,VBG,PRP$,...,PRP.1,VBD.1,NN.2,IN,JJ,NNS,",",PRP.2,VBD.2,.
0,"Noun, Sing.",Personal Pronoun,"Verb, Past Tense",Determiner,"Noun, Sing.",Modal,Adverb,"Verb, Base Form","Verb, Gerund/Pres. Part.",PRP$,...,Personal Pronoun,"Verb, Past Tense","Noun, Sing.",Prep./Subj. Conj.,Adjective,"Noun, Plur.",",",Personal Pronoun,"Verb, Past Tense",.


>>> Text #2:
	+  Jose Melicias from the Federal Reserve and 9 . 60 dlrs a share .
	+  NNP NNP IN DT NNP NNP CC CD . CD NN DT NN .
	+  PP(5.800437217242906)
++ Annotations:


Unnamed: 0,NNP,NNP.1,IN,DT,NNP.2,NNP.3,CC,CD,.,CD.1,NN,DT.1,NN.1,..1
0,"Proper Noun, Sing.","Proper Noun, Sing.",Prep./Subj. Conj.,Determiner,"Proper Noun, Sing.","Proper Noun, Sing.",Coordinating Conj.,Cardinal Num.,.,Cardinal Num.,"Noun, Sing.",Determiner,"Noun, Sing.",.


>>> Text #3:
	+  Brazil withdrawing from the bottoming out of the report took no account of Singapore rubber market because of better European demand and undertaking necessary economic restructuring away from an 80 pct of the franchise banks that were better managed have gotten a boost in production .
	+  NNP VBG IN DT VBG IN IN DT NN VBD DT NN IN NNP NN NN IN IN JJR JJ NN CC JJ JJ JJ NN RB IN DT CD NN IN DT NN NNS WDT VBD JJR VBN VBP VBN DT NN IN NN .
	+  PP(6.032768144919998)
++ Annotations:


Unnamed: 0,NNP,VBG,IN,DT,VBG.1,IN.1,IN.2,DT.1,NN,VBD,...,VBD.1,JJR,VBN,VBP,VBN.1,DT.2,NN.1,IN.3,NN.2,.
0,"Proper Noun, Sing.","Verb, Gerund/Pres. Part.",Prep./Subj. Conj.,Determiner,"Verb, Gerund/Pres. Part.",Prep./Subj. Conj.,Prep./Subj. Conj.,Determiner,"Noun, Sing.","Verb, Past Tense",...,"Verb, Past Tense","Adj., Comp.","Verb, Past Part.","Verb, Non-3rd Sing. Pres.","Verb, Past Part.",Determiner,"Noun, Sing.",Prep./Subj. Conj.,"Noun, Sing.",.


>>> Text #4:
	+  impose curbs on U . S . Oil acreage writedown .
	+  JJ NNS IN NNP . NNP . NNP NN NN .
	+  PP(3.351774773462404)
++ Annotations:


Unnamed: 0,JJ,NNS,IN,NNP,.,NNP.1,..1,NNP.2,NN,NN.1,..2
0,Adjective,"Noun, Plur.",Prep./Subj. Conj.,"Proper Noun, Sing.",.,"Proper Noun, Sing.",.,"Proper Noun, Sing.","Noun, Sing.","Noun, Sing.",.


>>> Text #5:
	+  Should work closely within the LDP ' s credibility , Shiratori told Reuters earlier this week for 2 . 193 mln ounces of gold producers .
	+  MD VB RB IN DT NNP POS NN NN , NNP VBD NNPS RBR DT NN IN CD . CD JJ NNS IN NN NNS .
	+  PP(6.54853517477716)
++ Annotations:


Unnamed: 0,MD,VB,RB,IN,DT,NNP,POS,NN,NN.1,",",...,IN.1,CD,.,CD.1,JJ,NNS,IN.2,NN.2,NNS.1,..1
0,Modal,"Verb, Base Form",Adverb,Prep./Subj. Conj.,Determiner,"Proper Noun, Sing.",POS,"Noun, Sing.","Noun, Sing.",",",...,Prep./Subj. Conj.,Cardinal Num.,.,Cardinal Num.,Adjective,"Noun, Plur.",Prep./Subj. Conj.,"Noun, Sing.","Noun, Plur.",.


>>> Text #6:
	+  000 francs .
	+  CD NNS .
	+  PP(1.5)
++ Annotations:


Unnamed: 0,CD,NNS,.
0,Cardinal Num.,"Noun, Plur.",.


>>> Text #7:
	+  smuggled from Miami .
	+  VBN IN NNP .
	+  PP(1.414213562373095)
++ Annotations:


Unnamed: 0,VBN,IN,NNP,.
0,"Verb, Past Part.",Prep./Subj. Conj.,"Proper Noun, Sing.",.


>>> Text #8:
	+  cash dividends on its proposed refinancing packages , with annual capacity of 4 . 1 mln Avg shrs 1 , 512 Revs 7 , 800 hectares , or 76 cts vs 56 . 4p , a departure from the Sosnoff move for quite some time .
	+  NN NNS IN PRP$ VBN NN NNS , IN JJ NN IN CD . CD JJ NNP NN CD , CD NNP CD , CD NNS , CC CD NNS JJ CD . CD , DT NN IN DT NNP NN IN RB DT NN .
	+  PP(7.66699166615273)
++ Annotations:


Unnamed: 0,NN,NNS,IN,PRP$,VBN,NN.1,NNS.1,",",IN.1,JJ,...,NN.2,IN.2,DT,NNP,NN.3,IN.3,RB,DT.1,NN.4,.
0,"Noun, Sing.","Noun, Plur.",Prep./Subj. Conj.,PRP$,"Verb, Past Part.","Noun, Sing.","Noun, Plur.",",",Prep./Subj. Conj.,Adjective,...,"Noun, Sing.",Prep./Subj. Conj.,Determiner,"Proper Noun, Sing.","Noun, Sing.",Prep./Subj. Conj.,Adverb,Determiner,"Noun, Sing.",.


>>> Text #9:
	+  based auto parts .
	+  VBN NN NNS .
	+  PP(2.449489742783178)
++ Annotations:


Unnamed: 0,VBN,NN,NNS,.
0,"Verb, Past Part.","Noun, Sing.","Noun, Plur.",.


>>> Text #10:
	+  moving expenses , most commodity prices , sources said .
	+  VBG NNS , JJS NN NNS , NNS VBD .
	+  PP(6.045793824049856)
++ Annotations:


Unnamed: 0,VBG,NNS,",",JJS,NN,NNS.1,",.1",NNS.2,VBD,.
0,"Verb, Gerund/Pres. Part.","Noun, Plur.",",","Adj., Sup.","Noun, Sing.","Noun, Plur.",",","Noun, Plur.","Verb, Past Tense",.


--------------------------
🎉🎉 smuggled from Miami. --- is the best generated text with PP(1.414213562373095)


### **N-grams Language Model**

In [40]:
from nltk.lm import MLE, Laplace, KneserNeyInterpolated, Lidstone
from nltk.lm.preprocessing import padded_everygram_pipeline

In [41]:
def train_ngram_model(n, model_type, corpus):
    """
    Train an n-gram language model on a given corpus.

    Args:
        n (int): The order of the n-gram model.
        model_type (str): The type of language model to train ('mle', 'laplace', 'kneser-ney', or 'lidstone').
        corpus (list of list of str): A list of sentences where each sentence is a list of words.

    Returns:
        nltk.lm.LmModel: The trained n-gram language model of the specified type.
    """
    # Create padded everygram model
    train, vocab = padded_everygram_pipeline(n, corpus)

    # Select the model type and initialize the model
    if model_type == 'mle':
        model = MLE(n)
    elif model_type == 'laplace':
        model = Laplace(n)
    elif model_type == 'kneser-ney':
        model = KneserNeyInterpolated(n, discount=0.1)
    elif model_type == 'lidstone':
        model = Lidstone(0.5, n)
    else:
        raise ValueError("Invalid model type. Use 'mle', 'lidstone', 'laplace', or 'kneser-ney'.")

    # Fit the model to the training data
    model.fit(train, vocab)

    return model
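
The `mle`, `lidstone`, and `laplace` options differ only in how much is added to each count before normalizing. The sketch below illustrates the additive-smoothing formula (count + gamma) / (total + gamma * V) in pure Python. `additive_probability`, the counts, and the vocabulary size are hypothetical, and this is not NLTK's implementation.

```python
from collections import Counter

def additive_probability(word, next_word_counts, vocab_size, gamma):
    """Additive smoothing: (count + gamma) / (total + gamma * vocab_size)."""
    total = sum(next_word_counts.values())
    return (next_word_counts[word] + gamma) / (total + gamma * vocab_size)

toy_counts = Counter({"rose": 3, "fell": 1})  # invented counts after one context
toy_vocab_size = 5                            # assumed vocabulary size

# gamma = 0 gives MLE, gamma = 1 gives Laplace, anything else is Lidstone
for name, gamma in [("mle", 0.0), ("lidstone", 0.5), ("laplace", 1.0)]:
    seen = additive_probability("rose", toy_counts, toy_vocab_size, gamma)
    unseen = additive_probability("yen", toy_counts, toy_vocab_size, gamma)
    print(f"{name:9s} seen={seen:.3f} unseen={unseen:.3f}")
```

Only the unsmoothed estimate assigns exactly zero to the unseen word, which matters as soon as perplexity is computed on text containing unseen n-grams.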

In [42]:
def get_elements_not_punctuation(vocab):
    """
    Pick a random vocabulary word that contains no punctuation characters.

    Args:
        vocab (list of str): A vocabulary list from which to choose a word.

    Returns:
        str: A single word that contains no punctuation characters.
    """
    # Define a set of punctuation characters
    punctuation_set = set(string.punctuation)

    while True:
        start_word = random.choice(vocab)

        # Iterating over a string yields its characters, so this rejects any word
        # containing punctuation (including the padding symbols <s> and </s>)
        if all(char not in punctuation_set for char in start_word):
            break

    return start_word

In [43]:
def generate_paragraph(model, num_phrases):
    """
    Generate a paragraph composed of multiple phrases using an n-gram language model.

    Args:
        model (nltk.lm.LmModel): The trained n-gram language model.
        num_phrases (int): The number of phrases to generate in the paragraph.

    Returns:
        list of str: A list of phrases forming the generated paragraph.
    """
    paragraph = []

    for _ in range(num_phrases):
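        current_word = get_elements_not_punctuation(list(model.vocab))  # Random seed word for this phrase

        # Generate the next 15 words; the seed word itself is not part of the output.
        # Note: the fixed random_seed makes generation deterministic for a given seed word.
        next_words = model.generate(15, text_seed=[current_word], random_seed=3)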

        # Join the generated words to form a phrase
        phrase = ' '.join(next_words)
        paragraph.append(phrase)

    return paragraph

In [44]:
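from nltk.util import ngrams

def calculate_perplexity_entropy(model, text):
    """
    Calculate perplexity and entropy for a given text using an n-gram language model.

    Args:
        model (nltk.lm.LmModel): The n-gram language model.
        text (str): The whitespace-separated text for which to calculate perplexity and entropy.

    Returns:
        tuple: A tuple containing two values.
            - The first value is the perplexity of the text based on the model.
            - The second value is the entropy of the text based on the model.
    """
    # NLTK scores an iterable of n-gram tuples; passing the raw string would
    # make it iterate over single characters instead of words.
    tokens = text.split()
    text_ngrams = list(ngrams(tokens, model.order))
    return model.perplexity(text_ngrams), model.entropy(text_ngrams)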
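
In NLTK, the cross-entropy is the mean negative base-2 log score of the evaluated n-grams, and the perplexity is 2 raised to that cross-entropy. The pure-Python sketch below mirrors that relationship on hypothetical probabilities. `cross_entropy_bits` and `perplexity_bits` are illustrative names, not NLTK functions.

```python
import math

def cross_entropy_bits(probabilities):
    """Mean negative log2 probability; infinite if any probability is zero."""
    if any(p == 0.0 for p in probabilities):
        return math.inf
    return -sum(math.log2(p) for p in probabilities) / len(probabilities)

def perplexity_bits(probabilities):
    """Perplexity as 2 raised to the cross-entropy."""
    return 2.0 ** cross_entropy_bits(probabilities)

# Every n-gram has non-zero probability: finite values
print(cross_entropy_bits([0.5, 0.25]), perplexity_bits([0.5, 0.25]))

# One unseen n-gram (probability 0) is enough to make both values infinite
print(cross_entropy_bits([0.5, 0.0]), perplexity_bits([0.5, 0.0]))
```

An unsmoothed (MLE) model assigns zero probability to anything it has not seen, so a single unseen n-gram in the evaluated text makes both measures infinite, whereas smoothed models stay finite.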

In [45]:
def start_generation(model='mle'):
    """
    Generate a paragraph using an n-gram language model and calculate perplexity and entropy.

    Args:
        model (str): The type of language model to train ('mle', 'laplace', 'kneser-ney', or 'lidstone').
                     Defaults to 'mle'.

    Returns:
        None
    """
    corpus = reuters.sents()
    n = 2  # You can change n to 1, 2, or 3 for unigram, bigram, or trigram models

    # Train an n-gram language model on the corpus
    ngram_model = train_ngram_model(n, model, corpus)
    
    num_phrases = 10

    # Generate a paragraph using the trained model
    generated_paragraph = generate_paragraph(ngram_model, num_phrases)

    # Calculate perplexity and entropy for the generated paragraph
    perplexity, entropy = calculate_perplexity_entropy(ngram_model, ' '.join(generated_paragraph))

    print(f"Generated Paragraph with {n}-grams {model} model:")
    for i, phrase in enumerate(generated_paragraph):
        print(f"Phrase {i + 1}: {phrase}")

    print(f"Perplexity ({model}): {perplexity}\nCross-entropy ({model}): {entropy}")

In [46]:
# Main
# Kneser-Ney is skipped here: it was too slow to run on the available hardware
for model_type in ['mle', 'laplace', 'lidstone']:
    start_generation(model=model_type)
    print('--------------')

Generated Paragraph with 2-grams mle model:
Phrase 1: . </s> Net excludes gains . 0 mln Year Oper shr and tax gain of
Phrase 2: Program for breaking safety of 1987 , the afternoon , with each share cash for
Phrase 3: , argue that personal income . 0 mln Year Oper shr and tax gain of
Phrase 4: . </s> Net excludes gains . 0 mln Year Oper shr and tax gain of
Phrase 5: . </s> Net excludes gains . 0 mln Year Oper shr and tax gain of
Phrase 6: Nederland NV & lt ; ATRC . </s> <s> Cash surplus in the exchange rate
Phrase 7: and has been one of 1987 , the afternoon , with each share cash for
Phrase 8: ENQUIRY INTO HONG KONG FIRM AGREES COCOA SURPLUS 94 . semiconductor makers to end of
Phrase 9: had included an immediate outlook , 000 vs 2 , with each share cash for
Phrase 10: well as dollar had no assurance . </s> <s> Cash surplus in the exchange rate
Perplexity (mle): inf
Cross-entropy (mle): inf
--------------
Generated Paragraph with 2-grams laplace model:
Phrase 1: are made considerable num

Kneser-Ney smoothing is more complex and more computationally intensive than Maximum Likelihood Estimation (MLE) or Laplace smoothing, because it does more work to estimate and smooth each n-gram probability. The main factors behind its longer execution time are:

1. **Interpolation:** The interpolated variant used here (`KneserNeyInterpolated`) combines a discounted higher-order estimate with a lower-order estimate for every query, so a single probability requires several lookups and a recursion through the lower orders.

2. **Discounting:** Kneser-Ney subtracts a fixed discount from each observed n-gram count and redistributes the freed probability mass through the lower-order distribution. Computing the weights that keep each distribution normalized adds work for every context, especially for higher-order models.

3. **Vocabulary Size:** The lower-order estimate depends on how many distinct contexts each word appears in, so the cost grows with the vocabulary and with the number of distinct n-grams in the corpus.

4. **Data Structures:** In addition to raw counts, the model needs these continuation counts, which must either be stored or recomputed when a score is requested.

5. **Complexity:** The recursive definition, with a discount and an interpolation weight at each order, makes every score more expensive than the single division used by MLE or Laplace.
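
The interpolation and discounting steps described above can be sketched for the bigram case. This is a minimal pure-Python illustration and not NLTK's implementation. `kneser_ney_bigram` is a hypothetical helper, the token sequence is invented, and the discount of 0.75 is an assumption of the sketch (the notebook passes `discount=0.1`).

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, discount=0.75):
    """Build an interpolated Kneser-Ney bigram scorer from a token list."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_totals = Counter()       # context -> total count of its bigrams
    followers = defaultdict(set)     # context -> distinct next words
    predecessors = defaultdict(set)  # word -> distinct previous words
    for (prev, word), count in bigram_counts.items():
        context_totals[prev] += count
        followers[prev].add(word)
        predecessors[word].add(prev)
    bigram_types = len(bigram_counts)

    def score(word, prev):
        # Continuation probability: share of bigram types that end in `word`
        continuation = len(predecessors.get(word, ())) / bigram_types
        total = context_totals[prev]
        if total == 0:
            return continuation  # unseen context: use the lower-order estimate
        discounted = max(bigram_counts[(prev, word)] - discount, 0) / total
        # Mass freed by discounting, spread according to continuation probability
        backoff_weight = discount * len(followers.get(prev, ())) / total
        return discounted + backoff_weight * continuation

    return score

# Invented token sequence, only for illustration
tokens = "the dollar rose and the dollar fell and the yen rose".split()
score = kneser_ney_bigram(tokens)

print(score("dollar", "the"))  # seen bigram
print(score("rose", "the"))    # unseen bigram still receives some mass
print(sum(score(w, "the") for w in set(tokens)))  # distribution sums to 1
```

Even in this small form, each score needs the bigram count, the context total, the number of distinct followers, and the number of distinct predecessors, which hints at why higher orders over a full corpus are costly.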