Task 1: Building a trigram model from Project Gutenberg books

A character-level trigram model is a simple language model built from counts of three-character sequences (called trigrams) in a text. It captures short-range dependencies between characters and is commonly used as a starting point for text generation in natural language processing (NLP).

Explanation:
You start by selecting five books from Project Gutenberg in plain text format.
The books are cleaned by removing the Project Gutenberg preamble and postamble and every character other than letters, full stops, and spaces.
A trigram model is created by counting how often each sequence of three characters (trigram) appears in the cleaned text.
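
As a small illustration of the counting step, the sketch below slides a three-character window over a short sample string. The sample string and the use of Counter are assumptions made for this example only; the notebook itself uses a defaultdict.

```python
from collections import Counter

sample = "THE THEME"

# Slide a three-character window over the string, one position at a time
trigrams = [sample[i:i + 3] for i in range(len(sample) - 2)]

# Tally how often each window occurs
trigram_counts = Counter(trigrams)

print(trigrams)               # 7 overlapping windows for a 9-character string
print(trigram_counts["THE"])  # "THE" occurs twice
```

A string of n characters yields n - 2 overlapping trigrams, and spaces count as characters, so "HE " and " TH" are trigrams too.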


Task 1 Breakdown:
1. The solution begins by defining the function read_text, which reads the contents of a text file into a string.

In [50]:
# Task 1
import re
from collections import defaultdict

# Paths to the five text files
# These represent the text files of the books to be processed
file_paths = ['Book1.txt', 'Book2.txt', 'Book3.txt', 'Book4.txt', 'Book5.txt']

# Function to read text from a file
# This function opens the file, reads the entire content, and returns it as a string
def read_text(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

2. Another function, clean_text, is used to clean up the text by removing the preamble and postamble, and keeping only letters, full stops, and spaces.
3. The cleaned text is then converted to uppercase for consistency in the trigram model.

In [51]:
# Function for cleaning text up
def clean_text(text):
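    # Remove preamble and postamble
    # The marker wording varies between files ("THIS" or "THE"), so accept both
    start = re.search(r'\*\*\* ?START OF TH(?:IS|E) PROJECT GUTENBERG EBOOK.*?\*\*\*', text)
    end = re.search(r'\*\*\* ?END OF TH(?:IS|E) PROJECT GUTENBERG EBOOK.*?\*\*\*', text)
    if start and end:
        text = text[start.end():end.start()]

    # Turn line breaks and tabs into single spaces so that words on
    # adjacent lines are not merged when other characters are removed
    text = re.sub(r'\s+', ' ', text)

    # Remove every character except ASCII letters, full stops, and spaces
    text = re.sub(r'[^A-Za-z. ]', '', text)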

    # Convert all letters to uppercase for consistency
    text = text.upper()

    return text
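
To see what the character filter and the uppercase step do, here is a minimal sketch on a made-up sentence. It applies only those two steps, not the preamble and postamble removal, and the sample sentence is an assumption for illustration.

```python
import re

raw = "Hello, World! It's 2024."

# Keep only ASCII letters, full stops, and spaces, then uppercase the result
filtered = re.sub(r'[^A-Za-z. ]', '', raw).upper()

print(filtered)  # HELLO WORLD ITS .
```

Commas, apostrophes, and digits disappear, while the spaces around them stay, which is why a lone full stop can end up separated from the preceding word.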

4. A trigram model is built using the create_trigram_model function, which loops through the text and extracts three-character sequences (trigrams).
5. Each trigram is stored in a dictionary with its count of occurrences in the text.

In [52]:
# Function to create a trigram model
def create_trigram_model(text):
    # Use a defaultdict to store trigram counts. Default value is 0 for any trigram not yet encountered
    trigram_model = defaultdict(int)
    
    # Slide through the text to create trigrams and count their occurrences
    for i in range(len(text) - 2):
        trigram = text[i:i+3]  # Extract a sequence of 3 characters 
        trigram_model[trigram] += 1  # Increment the count for this trigram

    return trigram_model

6. The paths to five books in plain text format are provided, and the text from each is read and cleaned.
7. The texts from all five books are combined into one large string for processing.
8. The trigram model is created based on the combined text of all books.
9. The result is a dictionary containing trigrams and their respective frequencies.

In [53]:
# Read and clean texts from all files
texts = [clean_text(read_text(file_path)) for file_path in file_paths]

# Combine all cleaned texts into one
# Joins the texts from all the books into a single large block of text.
combined_text = ' '.join(texts)

# Create the trigram model using the combined text
# This generates the trigram model, counting how often each trigram appears.
trigram_model = create_trigram_model(combined_text)

10. Finally, the code prints the first 10 trigrams stored in the model, with their counts. These are the first trigrams encountered in the text, not necessarily the most frequent ones.

In [None]:
# Output some of the trigram model
# Here, we print the first 10 trigrams and their counts from the model to see the results.
for trigram, count in list(trigram_model.items())[:10]:
    print(f'{trigram}: {count}')
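
If the most frequent trigrams are of more interest than the first ones stored, one possible approach is to sort by count. The sketch below uses Counter.most_common on a toy dictionary; the toy counts are hypothetical and stand in for the real trigram_model.

```python
from collections import Counter

# Toy counts standing in for the real trigram_model (hypothetical values)
toy_model = {"THE": 5, "HE ": 4, " TH": 4, "AND": 2, "ING": 1}

# Counter.most_common returns (trigram, count) pairs, highest count first
top_three = Counter(toy_model).most_common(3)

for trigram, count in top_three:
    print(f"{trigram!r}: {count}")
```

Printing with !r keeps leading and trailing spaces in a trigram visible in the output.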

Task 2 Breakdown:
1. Task 2 starts by defining the function get_next_char, which takes the last two characters of a string and finds all matching trigrams from the trigram model.
2. This function then selects the next character based on the frequency of the matching trigrams, using weighted random selection.

In [55]:
# Task 2
import random  # Import the random module for random selection of characters

# Function to get the next character based on the last two characters of the current text
def get_next_char(trigram_model, last_two_chars):
    # Find all trigrams that start with the last two characters
    # This dictionary comprehension loops through the trigram model and selects trigrams that start with the 'last_two_chars'
    matching_trigrams = {trigram: count for trigram, count in trigram_model.items() if trigram.startswith(last_two_chars)}
    
    # If there are no trigrams that start with the last two characters, return None
    if not matching_trigrams:
        return None
    
    # Get the third characters of the matching trigrams
    # For example, if the trigrams are "THE", "THA", and "THI", this step extracts ['E', 'A', 'I']
    third_chars = [trigram[2] for trigram in matching_trigrams.keys()]
    
    # Get the counts of how many times each trigram appears in the text
    # These counts will be used as weights for selecting the next character
    counts = list(matching_trigrams.values())
    
    # Randomly select the next character using the counts as weights
    # The 'weights' parameter ensures that more frequent trigrams have a higher chance of being chosen
    next_char = random.choices(third_chars, weights=counts, k=1)[0]
    
    # Return the selected character, which will be added to the generated string
    return next_char
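
get_next_char scans every trigram in the model on each call. For long generated strings, one possible alternative is to group the trigrams by their two-character prefix once, then look the prefix up directly. The sketch below is an assumption about how that could be done, not part of the original solution; build_prefix_index and next_char_from_index are hypothetical names, and the toy counts are invented.

```python
import random
from collections import defaultdict

def build_prefix_index(trigram_model):
    """Group third characters and their counts under each two-character prefix."""
    index = defaultdict(lambda: ([], []))
    for trigram, count in trigram_model.items():
        chars, weights = index[trigram[:2]]
        chars.append(trigram[2])
        weights.append(count)
    return index

def next_char_from_index(index, last_two_chars):
    """Pick the next character with probability proportional to its trigram count."""
    if last_two_chars not in index:
        return None  # no trigram starts with this prefix
    chars, weights = index[last_two_chars]
    return random.choices(chars, weights=weights, k=1)[0]

# Hypothetical counts: after "TH", "E" is six times as likely as "I"
toy_model = {"THE": 6, "THA": 3, "THI": 1, "HE ": 4}
index = build_prefix_index(toy_model)

random.seed(0)  # fixed seed so repeated runs pick the same character
print(next_char_from_index(index, "TH"))
print(next_char_from_index(index, "ZZ"))  # None: prefix never seen
```

The selection rule is the same weighted random choice as in get_next_char; only the lookup changes.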

3. Another function, generate_text, is defined to generate a string of a specified length (10,000 characters in this case).
4. The generate_text function begins with the seed string "TH".
5. It iterates through the process, using the last two characters of the current string to predict and append the next character based on the trigram model.
6. The loop continues until the generated text reaches the desired length (10,000 characters).
7. If no matching trigram is found at any point, the generation process stops early.
8. The function returns the generated string, which is shorter than the desired length only if generation stopped early.

In [56]:
# Function to generate a string of text based on the trigram model
def generate_text(trigram_model, length=10000):
    # Start the generated text with the initial seed string "TH"
    generated_text = "TH"
    
    # Keep generating characters until the length of the text reaches the specified length (default is 10,000 characters)
    while len(generated_text) < length:
        # Get the last two characters of the currently generated text
        last_two_chars = generated_text[-2:]
        
        # Use the trigram model to get the next character based on the last two characters
        next_char = get_next_char(trigram_model, last_two_chars)
        
        # If no next character is found (i.e., no matching trigrams), stop generating the text
        if next_char is None:
            break  # Stop the generation process if no matching trigrams are found
        
        # Append the next character to the generated text
        generated_text += next_char
    
    # Return the fully generated text once the loop finishes or when no more trigrams are found
    return generated_text
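
The two ways the loop can end are easier to see on toy models. The sketch below uses a compact stand-in for generate_text that accepts a custom seed; generate_from_counts is a hypothetical helper written for this example, and both toy models are invented.

```python
import random

def generate_from_counts(counts, seed_text, length):
    """Compact stand-in for generate_text that accepts a custom seed string."""
    text = seed_text
    while len(text) < length:
        # Candidate continuations: trigrams whose first two characters match the tail
        matches = [(t[2], c) for t, c in counts.items() if t.startswith(text[-2:])]
        if not matches:
            break  # dead end: no trigram continues the last two characters
        chars, weights = zip(*matches)
        text += random.choices(chars, weights=weights, k=1)[0]
    return text

# "AB" can only be followed by "C", and nothing follows "BC"
dead_end = generate_from_counts({"ABC": 1}, "AB", length=10)
print(dead_end, len(dead_end))  # ABC 3

# A cycle never dead-ends, so the requested length is reached
full = generate_from_counts({"ABA": 1, "BAB": 1}, "AB", length=10)
print(full, len(full))  # ABABABABAB 10
```

With a model built from five books a dead end is unlikely, but checking len(generated_text) after generation confirms whether the full length was reached.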


9. The final output is a string that mimics the structure and patterns of the original text used to build the trigram model.
10. The text is printed or can be saved for further analysis in the next tasks.

In [None]:
# 'trigram_model' is the dictionary that contains trigrams as keys and their counts as values.
generated_text = generate_text(trigram_model, length=10000)

# This prints the entire generated string of text that was created by the 'generate_text' function.
# The text is up to 10,000 characters long (shorter only if generation stopped early) and is based on patterns found in the trigram model.
print(generated_text)

Task 3 Breakdown:
1. Task 3 begins by defining the function load_english_words to load valid English words from a file (words.txt) and store them in a set for quick lookups.

In [58]:
# Task 3

import re

# Read the list of valid English words from words.txt
def load_english_words(file_path):
    with open(file_path, 'r') as f:
        # Store the words in a set for quick lookup
        valid_words = set(word.strip().lower() for word in f.readlines())
    return valid_words

2. Another function, extract_words, is defined to extract all words from the generated 10,000-character string using a regular expression that treats non-alphabetic characters as word separators.

In [59]:
# Extract words from the generated text
def extract_words(text):
    # Use a regular expression to extract runs of letters; anything else acts as a separator
    words = re.findall(r'\b[A-Za-z]+\b', text)
    # Convert to lowercase for case-insensitive matching
    return [word.lower() for word in words]
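
Because the cleaned text keeps full stops, it helps to check how the regular expression treats them. The sketch below applies the same pattern to a made-up string; the sample is an assumption for illustration.

```python
import re

sample = "THE CAT.SAT ON. TH"

# Each run of letters becomes one word; spaces and full stops separate words
words = [word.lower() for word in re.findall(r'\b[A-Za-z]+\b', sample)]

print(words)  # ['the', 'cat', 'sat', 'on', 'th']
```

A full stop with no following space still splits "CAT.SAT" into two words, and fragments such as "th" are kept, so they count against the valid-word percentage later.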

3. The calculate_word_percentage function is then defined to compute the percentage of valid English words in the generated text.
4. This function compares each word from the generated text to the set of valid English words and counts how many match.
5. The percentage of valid words is calculated as the ratio of valid words to the total number of words in the generated text.

In [60]:
# Calculate the percentage of valid words in the generated text
def calculate_word_percentage(generated_text, valid_words):
    words_in_text = extract_words(generated_text)  # Get the words from the generated text
    total_words = len(words_in_text)  # Total number of words
    valid_word_count = sum(1 for word in words_in_text if word in valid_words)  # Count how many words are valid
    if total_words == 0:
        return 0  # Avoid division by zero
    # Calculate percentage of valid words
    return (valid_word_count / total_words) * 100
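
As a worked example of the percentage calculation, the sketch below runs the same arithmetic on a short string with a miniature word list. The word list and the sample string are hypothetical and stand in for words.txt and the generated text.

```python
import re

# Hypothetical miniature word list standing in for words.txt
valid_words = {"the", "cat", "sat", "on", "mat"}

sample = "THE CAT SAT ON TH MAT."
words = [w.lower() for w in re.findall(r'\b[A-Za-z]+\b', sample)]

# Five of the six extracted words are in the list; "th" is not
valid_count = sum(1 for w in words if w in valid_words)
percentage = valid_count / len(words) * 100

print(f"{valid_count} of {len(words)} words are valid: {percentage:.2f}%")
# 5 of 6 words are valid: 83.33%
```

The percentage is taken over words, not characters, so a single long invalid word weighs the same as a short one.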

6. The words.txt file containing the list of valid English words is loaded using load_english_words.
7. A new 10,000-character string is generated with the generate_text function from Task 2.
8. calculate_word_percentage passes this string through extract_words and counts how many of the extracted words appear in words.txt.
9. The percentage of valid English words is then printed to two decimal places.
10. The result indicates what share of the words in the generated text are valid English words.

In [None]:
# Use the functions above to load the words and calculate the percentage
valid_words = load_english_words('words.txt')  # Load the list of valid English words
generated_text = generate_text(trigram_model, length=10000)  # Generate the 10,000 character string

# Calculate the percentage of valid English words
percentage_valid_words = calculate_word_percentage(generated_text, valid_words)

# Print out percentage 
print(f"Percentage of valid English words: {percentage_valid_words:.2f}%")

Task 4 Breakdown:
1. Task 4 uses the json module to save the trigram model as a JSON file.
2. The function export_trigram_model_to_json converts the trigram model into JSON format.

In [62]:
# Task 4
import json

# Export the trigram model as JSON
def export_trigram_model_to_json(trigram_model, file_path):
    with open(file_path, 'w') as json_file:
        # Convert the trigram model (a dictionary) to JSON and save it
        json.dump(trigram_model, json_file, indent=4)

3. It writes this JSON data to a file named trigrams.json in the repository.
4. The result is a JSON file that contains the trigrams and their counts.
5. A message confirms that the export was successful.

In [None]:
# Call the function to export the model
export_trigram_model_to_json(trigram_model, 'trigrams.json')

# Output message to confirm that the file was saved
print("Trigram model has been exported to trigrams.json")
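
One way to check the export is to read the file back and compare it with the model. The sketch below does this round trip on a toy model in a temporary directory; the toy counts and the temporary path are assumptions for illustration, and the real notebook writes trigrams.json directly.

```python
import json
import os
import tempfile
from collections import defaultdict

# Toy model standing in for the real trigram_model (hypothetical counts)
toy_model = defaultdict(int, {"THE": 5, "HE ": 4})

# Write to a temporary directory so the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "trigrams.json")
    with open(path, "w", encoding="utf-8") as json_file:
        json.dump(toy_model, json_file, indent=4)
    with open(path, "r", encoding="utf-8") as json_file:
        loaded = json.load(json_file)

print(loaded == dict(toy_model))  # True: same trigrams and counts
print(type(loaded).__name__)      # dict
```

json.load returns a plain dict rather than a defaultdict, so looking up a trigram that is not in the file raises KeyError instead of returning 0.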