# Trigram-Based Text Generation and Analysis

This Python script builds a character-level trigram model and uses it for text generation and analysis. The workflow preprocesses the source text, counts trigrams, generates new text from those counts, and measures how many of the generated words are valid English words.

## Overview
- Loading and Cleaning Text Data
- Building a Trigram Model
- Generating Text Using the Trigram Model  
- Analyzing Generated Text
- Exporting the Trigram Model

In [74]:
# imports
import re
import os
import random
import json

In [39]:
# Define the path to the data folder
data_folder = '../data/'


## load_text(file_path)
- Reads a text file and returns its content as a string.
- Reports a missing file to the user and returns an empty string instead of raising.
- Supplies the raw text for the later cleaning and analysis steps.


In [75]:
def load_text(file_path):
    """
    Reads the content of a file and returns the text as a string.
    """
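    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            text = file.read()
    except FileNotFoundError:
        print(f"[ERROR] File '{file_path}' not found.")
        return ""
    except (OSError, UnicodeDecodeError) as error:
        # Covers unreadable files, e.g. permission problems or invalid UTF-8
        print(f"[ERROR] File '{file_path}' could not be read: {error}")
        return ""
    # Report success only after the content has actually been read
    print(f"[INFO] File '{file_path}' loaded successfully.")
    return text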

## clean_text(text)
Purpose:
- Cleans and preprocesses the input text for analysis by removing unwanted characters and standardizing formatting.

Steps:

- Remove Project Gutenberg-specific preamble and postamble using markers.
- Retain only letters, spaces, and periods; every other character, including digits and line breaks, is removed.
- Convert text to uppercase for uniformity.

In [76]:
def clean_text(text):
    """
    Cleans the input text by removing non-letter characters, 
    keeping spaces and periods, and converting to uppercase.
        
    Returns:
        str: The cleaned text.
    """
    # Markers to remove preamble and postamble from Project Gutenberg texts
    start_marker = '*** START OF THIS PROJECT GUTENBERG EBOOK'
    end_marker = '*** END OF THIS PROJECT GUTENBERG EBOOK'
    
    # Find start and end positions
    start_pos = text.find(start_marker)
    end_pos = text.find(end_marker)
    
    # Remove preamble and postamble if found
    if start_pos != -1:
        text = text[start_pos + len(start_marker):]
    if end_pos != -1:
        text = text[:end_pos]
    
    # Remove non-letter characters and convert to uppercase
    cleaned_text = re.sub(r'[^A-Za-z. ]', '', text).upper().strip()
    return cleaned_text

# Call `clean_text`
raw_text = "This is a raw sample *** START OF THIS PROJECT GUTENBERG EBOOK content *** END OF THIS PROJECT GUTENBERG EBOOK"
cleaned_text = clean_text(raw_text)
print(f"First 100 characters of cleaned text:\n{cleaned_text[:100]}\n")


First 100 characters of cleaned text:
CONTENT  END OF THIS PROJECT GUTENBERG EBOOK
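
The sample output above still contains `END OF THIS PROJECT GUTENBERG EBOOK`. The reason is that `end_pos` is computed on the original string and applied only after the preamble has been sliced off, so the index no longer points at the end marker. A minimal sketch of one way to avoid this, searching for the end marker in the already-trimmed text, is shown below. The helper name `strip_gutenberg_boilerplate` is hypothetical and not part of the script.

```python
import re

START_MARKER = '*** START OF THIS PROJECT GUTENBERG EBOOK'
END_MARKER = '*** END OF THIS PROJECT GUTENBERG EBOOK'


def strip_gutenberg_boilerplate(text):
    """Hypothetical helper: drop everything outside the two markers."""
    start_pos = text.find(START_MARKER)
    if start_pos != -1:
        text = text[start_pos + len(START_MARKER):]
    # Search for the end marker in the trimmed text so the index is valid
    end_pos = text.find(END_MARKER)
    if end_pos != -1:
        text = text[:end_pos]
    return text


raw_text = ("This is a raw sample *** START OF THIS PROJECT GUTENBERG EBOOK "
            "content *** END OF THIS PROJECT GUTENBERG EBOOK")
body = strip_gutenberg_boilerplate(raw_text)
cleaned = re.sub(r'[^A-Za-z. ]', '', body).upper().strip()
print(cleaned)  # CONTENT
```

With the end marker located after trimming, only the text between the two markers survives.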



## generate_trigrams(cleaned_text)
Purpose:
- Builds a trigram model by identifying all sequences of three consecutive characters and counting their occurrences.

Steps:
- Iterate over the cleaned text.
- Extract trigrams and update their count in a dictionary.

In [60]:
def generate_trigrams(cleaned_text):
    """
    Generates a trigram model by counting occurrences of trigrams in the text.
        
    Returns:
        dict: A dictionary where keys are trigrams and values are their counts.
    """
    trigram_model = {}
    for i in range(len(cleaned_text) - 2):
        trigram = cleaned_text[i:i + 3]
        if trigram in trigram_model:
            trigram_model[trigram] += 1
        else:
            trigram_model[trigram] = 1
    print(f"Trigram model sample (first 5 trigrams): {list(trigram_model.items())[:5]}")
    return trigram_model

# Call generate_trigrams and immediately print sample output
trigram_model = generate_trigrams(cleaned_text)
print(f"Sample of trigram model: {dict(list(trigram_model.items())[:10])}")

Trigram model sample (first 5 trigrams): [('CON', 1), ('ONT', 1), ('NTE', 1), ('TEN', 2), ('ENT', 1)]
Sample of trigram model: {'CON': 1, 'ONT': 1, 'NTE': 1, 'TEN': 2, 'ENT': 1, 'NT ': 1, 'T  ': 1, '  E': 1, ' EN': 1, 'END': 1}
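
The counting loop in `generate_trigrams` can also be expressed with `collections.Counter` from the standard library. The sketch below is an alternative formulation, not the script's own code; the name `count_trigrams` is hypothetical. It also shows that `Counter.update` adds counts, which mirrors the per-file merge loop in `main()`.

```python
from collections import Counter


def count_trigrams(cleaned_text):
    """Count every overlapping three-character window in the text."""
    return Counter(cleaned_text[i:i + 3] for i in range(len(cleaned_text) - 2))


first = count_trigrams("ABCABC")
second = count_trigrams("ABCD")

# Counter.update adds the counts of matching keys instead of overwriting them
combined = Counter()
combined.update(first)
combined.update(second)

print(dict(first))     # {'ABC': 2, 'BCA': 1, 'CAB': 1}
print(dict(combined))  # {'ABC': 3, 'BCA': 1, 'CAB': 1, 'BCD': 1}
```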


## get_next_char
Purpose:
- Predicts the next character based on the current two-character sequence (bigram).

Steps:
- Identify trigrams in the model that start with the given bigram.
- Calculate probabilities for the third character based on trigram frequencies.
- Use weighted random sampling to select the next character.

In [7]:
def get_next_char(bigram, trigram_model):
    """
    Given a bigram, find all trigrams that start with this bigram
    and use the trigram model to choose the next character based on frequencies.
    """
    # Find trigrams that start with the given bigram
    candidates = {tri: count for tri, count in trigram_model.items() if tri.startswith(bigram)}
    
    if not candidates:
        # If no trigrams are found, return a space
        return ' '
    
    # Extract the third characters and their corresponding counts
    next_chars = [tri[2] for tri in candidates]  # The third character of each trigram
    weights = list(candidates.values())  # Counts of each trigram, in the same order
    
    # Randomly choose the next character based on the trigram frequencies
    return random.choices(next_chars, weights=weights, k=1)[0]
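
To make the weighting concrete, the sketch below converts the raw counts for one bigram into the conditional probabilities that `random.choices` effectively samples from. The helper `next_char_probabilities` and the small model are illustrative assumptions, not part of the script.

```python
def next_char_probabilities(bigram, trigram_model):
    """Hypothetical helper: P(next char | bigram) from raw trigram counts."""
    counts = {tri[2]: count for tri, count in trigram_model.items()
              if tri.startswith(bigram)}
    total = sum(counts.values())
    if not total:
        return {}
    return {char: count / total for char, count in counts.items()}


model = {"THA": 10, "THE": 20, "THI": 15, "HEN": 7}
probabilities = next_char_probabilities("TH", model)
for char, probability in probabilities.items():
    print(f"P({char!r} | 'TH') = {probability:.3f}")
# P('A' | 'TH') = 0.222
# P('E' | 'TH') = 0.444
# P('I' | 'TH') = 0.333
```

Because the weights are plain counts, a trigram seen twice as often is chosen about twice as often.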

## generate_text
Purpose:
- Generates a block of text that mimics the style of the original content using the trigram model.

Steps:
- Start with an initial seed (default: "TH").
- Iteratively predict the next character and append it to the generated text.
- Stop once the desired length is reached.

In [None]:
def generate_text(trigram_model, seed="TH", length=10000):
    """
    Generates a string of the specified length using the trigram model.
        
    Returns:
        str: The generated text.
    """
    generated_text = seed
    for _ in range(length - len(seed)):
        bigram = generated_text[-2:]
        next_char = get_next_char(bigram, trigram_model)
        generated_text += next_char
    return generated_text

generated_text = generate_text(trigram_model, seed="TH", length=1000)
print(f"[INFO] First 100 characters of generated text:\n{generated_text[:100]}")

[INFO] First 100 characters of generated text:
THIS PROJECT GUTENBERG EBOOK  EBOOK  EBOOK  END OF THIS PROJECT  END OF THIS PROJECT GUTENBERG END O
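
`get_next_char` scans the whole model for every generated character. For a larger model it may be worth grouping the trigrams by their leading bigram once, so that each step is a single dictionary lookup. The following is a sketch under that assumption; `build_bigram_index` and `generate_with_index` are hypothetical names, and the fallback to a space matches `get_next_char`.

```python
import random
from collections import defaultdict


def build_bigram_index(trigram_model):
    """Group third characters and their counts under each leading bigram."""
    index = defaultdict(lambda: ([], []))
    for trigram, count in trigram_model.items():
        chars, weights = index[trigram[:2]]
        chars.append(trigram[2])
        weights.append(count)
    return index


def generate_with_index(trigram_model, seed="TH", length=100):
    """Generate text with one dictionary lookup per character."""
    index = build_bigram_index(trigram_model)
    generated = list(seed)
    for _ in range(length - len(seed)):
        bigram = "".join(generated[-2:])
        if bigram in index:
            chars, weights = index[bigram]
            generated.append(random.choices(chars, weights=weights, k=1)[0])
        else:
            generated.append(" ")  # same fallback as get_next_char
    return "".join(generated)


# Every bigram in this toy model has exactly one continuation,
# so the result does not depend on the random draw
sample = generate_with_index({"THE": 3, "HE ": 2, "E T": 2, " TH": 2}, length=20)
print(repr(sample))  # 'THE THE THE THE THE '
```

Building the text in a list and joining once also avoids repeated string concatenation.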


## count_valid_words
Purpose:
- Counts how many words in the generated text appear in a list of valid English words. The caller derives the percentage from the two returned counts.

Steps:
- Split the generated text into individual words on whitespace.
- Compare each word to a provided list of valid words.
- Return the number of valid words together with the total number of words.

In [70]:
def count_valid_words(generated_text, word_list):
    """
    Counts valid English words in the generated text.
        
    Returns:
        tuple: The count of valid words and total words.
    """
    generated_words = generated_text.split()
    valid_word_count = sum(1 for word in generated_words if word in word_list)
    return valid_word_count, len(generated_words)

# Example
generated_text = "THIS IS A RANDOMLY GENERATED STRING OF TEXT."
word_list = {"THIS", "IS", "A", "TEXT", "STRING"}

# Call the function
valid_word_count, total_word_count = count_valid_words(generated_text, word_list)

# Print results
print(f"Total words: {total_word_count}, Valid words: {valid_word_count}")


Total words: 8, Valid words: 4
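
In the example above only 4 of the 5 listed words are counted, because `split()` leaves the trailing period attached and `"TEXT."` is not in the word list. A minimal sketch of a variant that strips periods before the lookup follows; the name `count_valid_words_stripped` is hypothetical.

```python
def count_valid_words_stripped(generated_text, word_list):
    """Hypothetical variant that ignores periods attached to a word."""
    # Strip periods from both ends of each token
    words = [token.strip('.') for token in generated_text.split()]
    # Tokens that consisted only of periods are now empty and are dropped
    words = [word for word in words if word]
    valid = sum(1 for word in words if word in word_list)
    return valid, len(words)


text = "THIS IS A RANDOMLY GENERATED STRING OF TEXT."
word_list = {"THIS", "IS", "A", "TEXT", "STRING"}
valid, total = count_valid_words_stripped(text, word_list)
percentage = 100 * valid / total if total else 0.0
print(f"Valid words: {valid}/{total} ({percentage:.2f}%)")  # Valid words: 5/8 (62.50%)
```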


## export_trigram_model
Purpose:
- Exports the trigram model as a JSON file for future use.

Steps:
- Save the dictionary to a file in JSON format.
- Include formatting for readability (indentation, sorted keys).

In [81]:
def export_trigram_model(trigram_model, output_file):
    """
    Exports the trigram model to a JSON file.
    
    Parameters:
        trigram_model (dict): The trigram model to export.
        output_file (str): The path to the JSON output file.
    """
    with open(output_file, 'w', encoding='utf-8') as file:
        # Dump the trigram model into a JSON file
        json.dump(trigram_model, file, indent=4, sort_keys=True)
    print(f"Trigram model exported to {output_file}")
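
A quick way to check an export is to load the file back and compare it with the in-memory model. The sketch below does this round trip in a temporary folder; the sample model and the file location are assumptions made for illustration.

```python
import json
import os
import tempfile

model = {"THE": 20, "HE ": 7, "E T": 5}

# Write to a temporary folder so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "trigrams.json")
    with open(path, "w", encoding="utf-8") as file:
        json.dump(model, file, indent=4, sort_keys=True)
    with open(path, "r", encoding="utf-8") as file:
        restored = json.load(file)

print(restored == model)  # True
```

The comparison holds because the keys are strings and the values are integers, both of which JSON preserves.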


## main()
Purpose:
- Coordinates the entire workflow from loading and cleaning text to exporting the trigram model and analyzing generated text.

Steps:
- Load and clean all .txt files from the specified folder.
- Generate and combine trigram models from all files.
- Save the model to a JSON file.
- Generate text using the combined model.
- Analyze the generated text against a list of valid English words.

In [88]:
def main():
    # Define the path to the data folder
    data_folder = 'data'
    combined_trigram_model = {}

    # [TASK 1: Loading Text Files]
    print("[TASK 1: Loading Text Files]")
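    for filename in os.listdir(data_folder):
        # Skip the word list: it lives in the same folder but is not a corpus text
        if filename.endswith(".txt") and filename != 'words.txt':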
            file_path = os.path.join(data_folder, filename)
            
            # Load and clean the text
            raw_text = load_text(file_path)
            
            if not raw_text:
                print(f"[ERROR] {filename} could not be loaded.")
                continue

            cleaned_text = clean_text(raw_text)
            print(f"First 1000 characters of cleaned text from {filename}:\n{cleaned_text[:1000]}\n")

            # Generate the trigram model for the current text
            trigram_model = generate_trigrams(cleaned_text)
            print(f"Trigram model sample from {filename} (first 5 trigrams): {list(trigram_model.items())[:5]}")
            
            # Merge the current trigram model into the combined model
            for trigram, count in trigram_model.items():
                if trigram in combined_trigram_model:
                    combined_trigram_model[trigram] += count
                else:
                    combined_trigram_model[trigram] = count

    # [TASK 2: Generating Trigram Model]
    print("\n[TASK 2: Generating Trigram Model]")
    print("Sample of combined trigram model:", {k: combined_trigram_model[k] for k in list(combined_trigram_model)[:10]}, "\n")

    # Export the combined trigram model to a JSON file
    output_file = 'trigrams.json'
    export_trigram_model(combined_trigram_model, output_file)
    print(f"[INFO] Combined trigram model exported to {output_file}.")

    # [TASK 3: Generating and Analyzing Text]
    print("\n[TASK 3: Generating and Analyzing Text]")
    # Generate a 10,000-character text based on the combined trigram model
    generated_text = generate_text(combined_trigram_model)
    print(f"First 1000 characters of generated text:\n{generated_text[:1000]}\n")
    
    # Load the list of valid English words from 'words.txt'
    word_list_path = os.path.join(data_folder, 'words.txt')
    with open(word_list_path, 'r', encoding='utf-8') as file:
        # Uppercase each entry so it can match the uppercase generated text
        valid_words = {line.strip().upper() for line in file if line.strip()}
    print("[INFO] English words list loaded successfully.")

    # Count valid words in the generated text
    valid_word_count, total_word_count = count_valid_words(generated_text, valid_words)
    
    # Calculate the percentage of valid words
    valid_word_percentage = (valid_word_count / total_word_count) * 100
    
    # Display the results
    print(f"Total words in generated text: {total_word_count}")
    print(f"Valid English words: {valid_word_count}")
    print(f"Percentage of valid English words: {valid_word_percentage:.2f}%\n")

if __name__ == "__main__":
    main()


[TASK 1: Loading Text Files]
[INFO] File 'data/pride-and-prejudice.txt' loaded successfully.
First 1000 characters of cleaned text from pride-and-prejudice.txt:
THE PROJECT GUTENBERG EBOOK OF PRIDE AND PREJUDICE    THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE IN THE UNITED STATES ANDMOST OTHER PARTS OF THE WORLD AT NO COST AND WITH ALMOST NO RESTRICTIONSWHATSOEVER. YOU MAY COPY IT GIVE IT AWAY OR REUSE IT UNDER THE TERMSOF THE PROJECT GUTENBERG LICENSE INCLUDED WITH THIS EBOOK OR ONLINEAT WWW.GUTENBERG.ORG. IF YOU ARE NOT LOCATED IN THE UNITED STATESYOU WILL HAVE TO CHECK THE LAWS OF THE COUNTRY WHERE YOU ARE LOCATEDBEFORE USING THIS EBOOK.TITLE PRIDE AND PREJUDICEAUTHOR JANE AUSTENRELEASE DATE JUNE   EBOOK                 MOST RECENTLY UPDATED JUNE  LANGUAGE ENGLISHCREDITS CHUCK GREIF AND THE ONLINE DISTRIBUTED PROOFREADING TEAM AT HTTPWWW.PGDP.NET THIS FILE WAS PRODUCED FROM IMAGES AVAILABLE AT THE INTERNET ARCHIVE START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE             
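
The cleaned output above still begins with the Project Gutenberg header, and the phrase `START OF THE PROJECT GUTENBERG EBOOK` appears in it. This suggests that the marker in this file says "THE" where `start_marker` expects "THIS". Words such as `ANDMOST` are fused because line breaks are deleted rather than turned into spaces. The sketch below shows one possible variant that tolerates both wordings and keeps word boundaries. The name `clean_gutenberg_text` is hypothetical, the sample is a made-up miniature file, and the patterns assume each marker sits on its own line.

```python
import re

# Accept both "THE" and "THIS" in the marker line (assumed variants)
START_RE = re.compile(r'\*\*\* ?START OF TH(?:E|IS) PROJECT GUTENBERG EBOOK[^\n]*\n')
END_RE = re.compile(r'\*\*\* ?END OF TH(?:E|IS) PROJECT GUTENBERG EBOOK')


def clean_gutenberg_text(text):
    """Hypothetical variant of clean_text with tolerant markers."""
    start = START_RE.search(text)
    if start:
        text = text[start.end():]
    # Look for the end marker only in the text that remains
    end = END_RE.search(text)
    if end:
        text = text[:end.start()]
    # Turn line breaks and tabs into spaces so adjacent words stay separate
    text = re.sub(r'\s+', ' ', text)
    return re.sub(r'[^A-Za-z. ]', '', text).upper().strip()


sample = (
    "Header text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***\n"
    "It is a truth\nuniversally acknowledged.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***\n"
)
print(clean_gutenberg_text(sample))  # IT IS A TRUTH UNIVERSALLY ACKNOWLEDGED.
```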

In [92]:
# Testing Section
print("\n[TESTING SECTION]\n")

# Test 1: Verify Text Cleaning
sample_text = "Hello, World! 123. This is a test text."
cleaned = clean_text(sample_text)
expected_cleaned = "HELLO WORLD. THIS IS A TEST TEXT."
if cleaned == expected_cleaned:
    print(f"Test 1 Passed: Text cleaned successfully.\nCleaned text: {cleaned}")
else:
    print(f"Test 1 Failed: Expected '{expected_cleaned}', but got '{cleaned}'")

# Test 2: Validate Trigram Model
sample_cleaned_text = "ABCABC"
trigram_counts = generate_trigrams(sample_cleaned_text)
expected_trigrams = {"ABC": 2, "BCA": 1, "CAB": 1}
if all(trigram_counts.get(tri, 0) == count for tri, count in expected_trigrams.items()):
    print("Test 2 Passed: Trigram model counts are correct.")
    print(f"Trigrams generated from 'ABCABC': {dict(trigram_counts)}")
else:
    print("Test 2 Failed: Trigram counts are incorrect.")
    print(f"Expected: {expected_trigrams}")
    print(f"Got: {dict(trigram_counts)}")

# Test 3: Text Shorter Than Three Characters Yields No Trigrams
short_text = "AB"
trigram_counts_short = generate_trigrams(short_text)
if len(trigram_counts_short) == 0:
    print("Test 3 Passed: No trigrams generated for text shorter than 3 characters.")
else:
    print("Test 3 Failed: Trigrams were incorrectly generated for short text.")
    print(f"Generated trigrams for 'AB': {dict(trigram_counts_short)}")

# Test 4: Check Valid Words in Generated Text
mock_generated_text = "THIS IS A TEST TEXT WITH SOME INVALID WORDS."
mock_word_list = {"THIS", "IS", "A", "TEST", "TEXT", "WITH", "SOME", "INVALID", "WORDS"}
valid_count, total_count = count_valid_words(mock_generated_text, mock_word_list)
expected_valid_count = 9
expected_total_count = 9
if valid_count == expected_valid_count and total_count == expected_total_count:
    print("Test 4 Passed: Valid word count is correct.")
    print(f"Valid words: {valid_count}/{total_count}")
else:
    print("Test 4 Failed: Valid word count is incorrect.")
    print(f"Expected: {expected_valid_count}/{expected_total_count}")
    print(f"Got: {valid_count}/{total_count}")

# Test 5: Validate Generated Text Length
mock_trigram_model = {"THA": 10, "THE": 20, "THI": 15}
generated_text = generate_text(mock_trigram_model, seed="TH", length=100)
if len(generated_text) == 100:
    print("Test 5 Passed: Generated text length is correct.")
    print(f"Generated text sample: {generated_text[:50]}...")
else:
    print(f"Test 5 Failed: Generated text length is {len(generated_text)}, expected 100.")

print("\n[TESTING COMPLETED]\n")


[TESTING SECTION]

Test 1 Failed: Expected 'HELLO WORLD. THIS IS A TEST TEXT.', but got 'HELLO WORLD . THIS IS A TEST TEXT.'
Trigram model sample (first 5 trigrams): [('ABC', 2), ('BCA', 1), ('CAB', 1)]
Test 2 Passed: Trigram model counts are correct.
Trigrams generated from 'ABCABC': {'ABC': 2, 'BCA': 1, 'CAB': 1}
Trigram model sample (first 5 trigrams): []
Test 3 Passed: No trigrams generated for text shorter than 3 characters.
Test 4 Failed: Valid word count is incorrect.
Expected: 9/9
Got: 8/9
Test 5 Passed: Generated text length is correct.
Generated text sample: THA                                               ...

[TESTING COMPLETED]
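
The two failures have simple causes. Test 1 fails because removing `123` leaves the space in front of the period, and Test 4 fails because `"WORDS."` keeps its period after `split()` and so never matches `"WORDS"`. For Test 1, either the expected string or the cleaner could be changed. The sketch below illustrates the second option with a hypothetical post-processing helper, `tidy_spacing`, applied to the same filtering expression that `clean_text` uses.

```python
import re


def tidy_spacing(cleaned):
    """Hypothetical post-processing step for the output of clean_text."""
    cleaned = re.sub(r' +\.', '.', cleaned)  # drop spaces left in front of a period
    return re.sub(r' {2,}', ' ', cleaned)    # collapse runs of spaces


raw = "Hello, World! 123. This is a test text."
cleaned = re.sub(r'[^A-Za-z. ]', '', raw).upper().strip()
print(cleaned)                # HELLO WORLD . THIS IS A TEST TEXT.
print(tidy_spacing(cleaned))  # HELLO WORLD. THIS IS A TEST TEXT.
```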

