# Preprocessing

## Overview

Before applying any NLP models, the raw text data must be cleaned and standardized.  
This preprocessing step removes non-linguistic artifacts (such as illustrations),
normalizes formatting, and prepares the text so that downstream models analyze
meaningful language patterns rather than noise.

In [1]:
import re 

### Removing Illustrations and Non-Text Elements

Project Gutenberg texts contain placeholders such as [Illustration], along with headers and footers about copyright. Luckily, these follow consistent patterns, which makes them very easy to pick out.

These artifacts can distort word frequency counts and topic modeling results.

In [2]:
def remove_illistrations(text: str) -> str:
    """
    Remove Illustrations from the text using regex
    """

    # Remove illustration placeholders
    cleaned = re.sub(
        r"\[Illustration(?::.*?\]{1,2}|\])", # Matches [Illustration] or [Illustration: description], allowing a closing "]]"
        "", # Replaces with blank string
        text,
        flags=re.DOTALL # catches this pattern if it spans multiple lines
    )

    # Collapse runs of blank lines (including the ones left behind) into single newlines
    cleaned = re.sub(r"\n\s*\n", "\n", cleaned)
    return cleaned
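
To see what the pattern actually catches, here is a small check on a made-up sample string (the sample text is invented for illustration, it is not from the books). It reuses the same regex as above, including the multi-line case that `re.DOTALL` is there for.

```python
import re

# Same pattern as in the function above; the sample text is made up.
ILLUSTRATION = re.compile(r"\[Illustration(?::.*?\]{1,2}|\])", flags=re.DOTALL)

sample = (
    "She sat down.\n"
    "[Illustration]\n"
    "He stood up.\n"
    "[Illustration: A view of\nthe house]\n"
    "They left."
)

# Drop the placeholders, then collapse the blank lines they leave behind
cleaned = ILLUSTRATION.sub("", sample)
cleaned = re.sub(r"\n\s*\n", "\n", cleaned)
print(cleaned)
```

Both the bare placeholder and the captioned one that spans two lines are removed, leaving only the three sentences.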

In [3]:
def remove_gutenberg_header_footer(text: str) -> str:
    """
    Remove the Project Gutenberg header and footer, and trim the text.
    Only keeps text after the first occurrence of 'CHAPTER I' followed by a newline.
    """

    start_marker = "CHAPTER I\n"
    end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK"

    # Use the first occurrence of the chapter marker followed by a newline
    start_idx = text.find(start_marker)
    if start_idx != -1:
        # Keep text starting at that chapter marker,
        # then remove its heading line
        text = text[start_idx + len(start_marker):]

    # Locate and remove the footer
    end_idx = text.find(end_marker)
    if end_idx != -1:
        text = text[:end_idx]

    return text.strip()

### Corpus Construction

After cleaning individual texts, all books are split into chapters and combined into a single corpus.

The corpus returned here is a list of cleaned text documents.

In [4]:
def split_by_chapter(text: str) -> list[str]:
    """
    Split a book into chapters on the 'CHAPTER' keyword
    and skip very short sections (like table of contents).
    The chapter numeral after the keyword stays at the start of each chapter's text.
    """

    chapters = []
    chapter_marker = "CHAPTER"

    # Split the text at each occurrence of the chapter marker
    parts = text.split(chapter_marker)

    for part in parts:
        # Remove leading/trailing whitespace
        chapter_text = part.strip()

        # Skip very short sections
        if len(chapter_text.split()) < 20:
            continue

        # Convert newlines to spaces
        chapter_text = chapter_text.replace("\n", " ")

        # Replace the underscores Project Gutenberg uses to mark _emphasized_ words with spaces
        chapter_text = chapter_text.replace("_", " ")

        chapters.append(chapter_text)

    return chapters

In [5]:
def create_corpus(books: list[str]) -> list[str]:
    """Clean each book and collect all of its chapters into one flat list."""
    corpus = []

    for book in books:
        book_no_images = remove_illistrations(book)
        book_body = remove_gutenberg_header_footer(book_no_images)
        corpus.extend(split_by_chapter(book_body))

    return corpus

In [6]:
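def load_books(files: list[str]) -> list[str]:
    """
    Reads the contents of the given book files from the Books/ directory
    and returns them as a list of strings.
    """
    books = []
    for file in files:
        # Assumes the files are UTF-8 encoded (the usual Project Gutenberg plain-text format)
        with open(f'Books/{file}', 'r', encoding='utf-8') as contents:
            books.append(contents.read())

    return books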

In [7]:
#Take the three books, clean them and split them into chapters
books = load_books(["Emma.txt","Pride_and_Prejudice_Jane_Austin.txt","Sense_and_Sensibility.txt"])

Corpus = create_corpus(books)

### Final Preprocessed Corpus

At this stage, all texts have been cleaned and combined into a single corpus.
This corpus serves as the input for topic modeling and other NLP techniques
used later in the analysis.


# LDA

Latent Dirichlet Allocation is an unsupervised topic modeling technique that
identifies latent themes in a collection of documents. Rather than understanding
language semantically, LDA relies on word co-occurrence patterns to infer topics
as probability distributions over words.

This model is used here as a baseline NLP approach to compare against more
advanced language models.


In [9]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

### Text Vectorization

LDA operates on numerical representations of text rather than raw strings.
The CountVectorizer converts the corpus into a document-term matrix, where
each entry represents the frequency of a word in a document.

This bag-of-words approach ignores word order and context, which is a key
limitation explored later in the analysis.
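
As a rough illustration of what a document-term matrix looks like, here is a hand-rolled version on three made-up one-line documents. It is only a sketch: it skips everything else CountVectorizer does (stop-word removal, `min_df`/`max_df` filtering, sparse storage), and the toy sentences are invented for the example.

```python
from collections import Counter

# Three made-up "documents"
docs = [
    "emma likes harriet",
    "harriet likes mr elton",
    "mr elton likes emma",
]

# Vocabulary: sorted unique words, one matrix column per word
vocab = sorted({word for doc in docs for word in doc.split()})

# Document-term matrix: one row per document, raw counts per column
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)

# Word order is lost: these two strings produce identical counts
same = Counter("emma likes harriet".split()) == Counter("harriet likes emma".split())
print(same)
```

The last check shows the bag-of-words limitation directly: reordering the words in a sentence does not change its row in the matrix.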

In [10]:
vectorizer = CountVectorizer(
    stop_words="english",
    max_df=0.95,
    min_df=2,
)

X = vectorizer.fit_transform(Corpus)

### Training the LDA Model

The LDA model is trained on the document-term matrix to identify a fixed number
of topics. Each topic is represented as a probability distribution over words,
and each document is represented as a mixture of topics.

The number of topics (10 here) is chosen arbitrarily; values between 5 and 30 are worth experimenting with.
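
To make "distribution over words" and "mixture of topics" concrete, here is a tiny worked example with made-up numbers. In scikit-learn the fitted `components_` matrix holds unnormalized pseudo-counts, so each row has to be divided by its sum before it reads as probabilities. The vocabulary, counts, and document mixture below are all hypothetical.

```python
# Hypothetical topic-word pseudo-counts: 2 topics x 4 words (made-up numbers)
vocab = ["emma", "harriet", "elinor", "marianne"]
pseudo_counts = [
    [8.0, 6.0, 1.0, 1.0],  # topic 0
    [1.0, 1.0, 9.0, 9.0],  # topic 1
]

# Normalize each row so it sums to 1: a probability distribution over words
topic_word = [[count / sum(row) for count in row] for row in pseudo_counts]

# Hypothetical document: 25% topic 0, 75% topic 1
doc_topics = [0.25, 0.75]

# Word probabilities for the document = topic mixture weighted by topic-word probabilities
word_probs = [
    sum(doc_topics[k] * topic_word[k][w] for k in range(len(doc_topics)))
    for w in range(len(vocab))
]

for word, prob in zip(vocab, word_probs):
    print(f"{word}: {prob:.4f}")
```

Because both the topic rows and the document mixture sum to 1, the resulting word probabilities also sum to 1.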


In [11]:
lda = LatentDirichletAllocation(
    n_components=10,         # experiment with 5–30
    learning_method="batch", # stable & reproducible
    random_state=42,
)
lda.fit(X)

LatentDirichletAllocation(random_state=42)


In [12]:
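def print_topics(model, vectorizer, n_words=10):
    """Print the n_words highest-weighted words for each topic."""
    words = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print(f"Topic #{idx}")
        # argsort is ascending, so the strongest word is printed last
        print("  " + " ".join(words[i] for i in topic.argsort()[-n_words:]))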

In [13]:
print_topics(lda, vectorizer)

Topic #0
  woman heart affection marianne ferrars accomplished brother robert miss edward
Topic #1
  miss wickham did jane bingley said bennet darcy elizabeth mr
Topic #2
  heart day mr think mother did said sister elinor marianne
Topic #3
  think good weston said elton knightley miss harriet emma mr
Topic #4
  bennet lady bingley colonel room mr miss elizabeth darcy said
Topic #5
  father said miss thing john did mr good think dear
Topic #6
  did woodhouse thing think know weston said miss emma mr
Topic #7
  did sister know miss jennings lucy edward said marianne elinor
Topic #8
  campbell cole dixon emma bates thing mr fairfax miss jane
Topic #9
  did shall know dear said sister dashwood willoughby elinor marianne


In [14]:
chapter_topic_distrib = lda.transform(X)

for i, distrib in enumerate(chapter_topic_distrib):
    print(f"Chapter {i}: dominant topic = {distrib.argmax()}")

Chapter 0: dominant topic = 3
Chapter 1: dominant topic = 3
Chapter 2: dominant topic = 8
Chapter 3: dominant topic = 3
Chapter 4: dominant topic = 6
Chapter 5: dominant topic = 3
Chapter 6: dominant topic = 3
Chapter 7: dominant topic = 3
Chapter 8: dominant topic = 5
Chapter 9: dominant topic = 5
Chapter 10: dominant topic = 3
Chapter 11: dominant topic = 5
Chapter 12: dominant topic = 3
Chapter 13: dominant topic = 3
Chapter 14: dominant topic = 3
Chapter 15: dominant topic = 3
Chapter 16: dominant topic = 3
Chapter 17: dominant topic = 3
Chapter 18: dominant topic = 8
Chapter 19: dominant topic = 8
Chapter 20: dominant topic = 6
Chapter 21: dominant topic = 3
Chapter 22: dominant topic = 6
Chapter 23: dominant topic = 3
Chapter 24: dominant topic = 3
Chapter 25: dominant topic = 3
Chapter 26: dominant topic = 6
Chapter 27: dominant topic = 6
Chapter 28: dominant topic = 6
Chapter 29: dominant topic = 3
Chapter 30: dominant topic = 3
Chapter 31: dominant topic = 3
Chapter 32: domina

## Discussion

While LDA is effective at identifying word co-occurrence patterns, the resulting
topics often lack semantic coherence. Many topics appear as loosely related
collections of words rather than meaningful summaries.

This limitation motivates comparison with large language models, which capture
context, syntax, and semantic relationships beyond simple frequency statistics.


# TextRank

<https://medium.com/@yassineerraji/understanding-textrank-a-deep-dive-into-graph-based-text-summarization-and-keyword-extraction-905d1fb5d266>

TextRank gives a way to summarize documents without any supervision. It helps show how much structure you can get out of text using only relationships between sentences.

How it works conceptually:

1. Each sentence = a node

2. Sentences are connected if they share words

3. Sentences that connect to many others score higher

4. Top-scoring sentences become the summary
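
The four steps above can be sketched in plain Python. This is a minimal sketch, assuming raw word overlap as the edge weight and a PageRank-style iteration for the scoring; the function name `textrank_sentences`, the damping value, and the sample sentences are illustrative choices, and this is not how pytextrank is implemented internally.

```python
import re


def textrank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with a PageRank-style iteration over a word-overlap graph."""
    words = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(sentences)

    # Steps 1-2: sentences are nodes, edge weight = number of shared words
    weights = [
        [len(words[i] & words[j]) if i != j else 0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]

    # Step 3: sentences linked to by well-connected sentences score higher
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores


# Made-up sample sentences
sentences = [
    "Emma lived at Hartfield with her father.",
    "Her father disliked change of every kind.",
    "Emma missed Miss Taylor.",
    "The weather was fine.",
]
scores = textrank_sentences(sentences)

# Step 4: the top-scoring sentences become the summary
ranking = sorted(range(len(sentences)), key=scores.__getitem__, reverse=True)
print(sentences[ranking[0]])
```

The first sentence shares words with two others, so it ends up with the highest score, while the weather sentence shares nothing and keeps only the baseline score.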

In [17]:
import spacy
import pytextrank

In [19]:
nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("textrank")

<pytextrank.base.BaseTextRankFactory at 0x1629701a0>

In [20]:
chapter_num = 0

# Print size of the original document
print('Original Document Size:', len(Corpus[chapter_num]), '\n')

# Process the chapter with the spacy pipeline
doc = nlp(Corpus[chapter_num])

final_summary = None
phrase_count = 1

# Generate summary using TextRank
for sent in doc._.textrank.summary(limit_phrases=2, limit_sentences=5):
    print(f"Phrase: #{phrase_count}\n")
    print(sent, '\n')
    print('------------------------------------------------------------------', '\n')
    
    # Build the final summary string
    if final_summary is None:
        final_summary = str(sent)
    else:
        final_summary += " " + str(sent)

    phrase_count += 1
    
# Print total summary length
print('Total Summary Length:', len(final_summary), '\n')


Original Document Size: 17790 

Phrase: #1

I am sure she will be an excellent servant; and it will be a great comfort to poor Miss Taylor to have somebody about her that she is used to see. 

------------------------------------------------------------------ 

Phrase: #2

“Poor Mr. and Miss Woodhouse, if you please; but I cannot possibly say ‘poor Miss Taylor.’ 

------------------------------------------------------------------ 

Phrase: #3

“But, Mr. Knightley, she is really very sorry to lose poor Miss Taylor, and I am sure she  will  miss her more than she thinks for.” 

------------------------------------------------------------------ 

Phrase: #4

Sixteen years had Miss Taylor been in Mr. Woodhouse’s family, less as a governess than a friend, very fond of both daughters, but particularly of Emma. 

------------------------------------------------------------------ 

Phrase: #5

Even before Miss Taylor had ceased to hold the nominal office of governess, the mildness of her tempe

## Discussion

The summaries produced by TextRank were generally readable and captured the main ideas of the documents. However, because it is extractive, the summaries sometimes feel a bit choppy or repetitive. It also doesn’t understand meaning beyond surface‑level similarity, so it can miss context or over‑emphasize repeated phrases.

# Basic Sentence Frequency Summarization

<https://stackabuse.com/text-summarization-with-nltk-in-python/>

After TextRank, I implemented a simpler baseline method based on sentence frequency. This approach scores sentences by looking at how often important words appear in them.

The process is roughly:

- Count word frequencies across the document

- Score each sentence based on the frequencies of its words

- Select the highest‑scoring sentences for the summary

This method is much simpler than TextRank and doesn’t rely on graphs or similarity matrices.
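
The process can be sketched end to end on a toy text. This is a minimal sketch under a few assumptions: the function name `frequency_summary` and the sample text are made up, sentences are split on punctuation, and no stop-word filtering is applied, so common words count as much as any other.

```python
import re
from collections import Counter


def frequency_summary(text, n_sentences=2):
    """Pick the n_sentences whose words are most frequent in the text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []

    # Count word frequencies across the document and normalize by the maximum
    word_freq = Counter(re.findall(r"\w+", text.lower()))
    max_freq = max(word_freq.values())

    def score(sentence):
        # Sentence score = sum of the normalized frequencies of its words
        return sum(word_freq[w] / max_freq for w in re.findall(r"\w+", sentence.lower()))

    # Select the highest-scoring sentences, then restore document order
    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in ranked]


# Made-up sample text
text = "Emma visited Harriet. Harriet admired Emma and Emma admired Harriet. It rained."
print(frequency_summary(text, n_sentences=1))
```

Since the score is a plain sum, longer sentences built from frequent words win easily, which is worth keeping in mind when reading the results.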

In [21]:
from collections import Counter

In [22]:
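# Count individual words across the corpus (lowercased to match the sentence scoring below)
all_words = re.findall(r'\b\w+\b', " ".join(Corpus).lower())
word_frequencies = Counter(all_words)

# Normalize frequencies
max_freq = max(word_frequencies.values())
for word in word_frequencies:
    word_frequencies[word] /= max_freq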

In [23]:
Corpus1 = Corpus[40]
sentences = Corpus1.split('.')  # naive sentence split

sentence_scores = {}
for sent in sentences:
    sentence_words = re.findall(r'\b\w+\b', sent.lower())
    score = sum(word_frequencies.get(word, 0) for word in sentence_words)
    if len(sentence_words) < 30:  # optional length filter
        sentence_scores[sent] = score

In [24]:
top_sentences = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:5]

summary = '. '.join(top_sentences).strip() + '.'
print(summary)


V In this state of schemes, and hopes, and connivance, June opened upon Hartfield.  To Highbury in general it brought no material change.  Elton’s activity in her service, and save herself from being hurried into a delightful situation against her will.  Mr.  Knightley, who, for some reason best known to himself, had certainly taken an early dislike to Frank Churchill, was only growing to dislike him more.


## Discussion

This method very much surprised me. It tends to favor longer sentences and doesn’t account for redundancy very well. However, the summaries that it produces are very readable and their meaning seems very understandable compared to the choppiness of TextRank. 

## Reflection on Extractive Summarization

This is kind of obvious but the extractive summaries just cannot form coherent purpose or meaning since they only take bits out of the input. However, they are surprisingly good at giving meaningful topics and providing the most impactful parts of the text. In many cases, the summaries still give you a solid idea of what the document is about, even if they read a little awkwardly.

The main limitation is that extractive methods don’t actually understand the text. They don’t know which sentences depend on each other, and they can’t rewrite or connect ideas in a natural way. This can lead to summaries that repeat information or jump between ideas without smooth transitions.

That said, extractive summarization is still useful as a baseline. It’s fast, easy to interpret, and doesn’t require training data. For tasks like topic discovery, quick overviews, or highlighting important sections, these methods work better than expected. This also makes them a good comparison point for more advanced models that try to generate summaries rather than just select sentences.


# Transformer

<https://thepythoncode.com/article/text-summarization-using-huggingface-transformers-python>

Finally, I explored a transformer‑based summarization approach. Unlike TextRank and frequency methods, transformers use pretrained language models that understand context, grammar, and meaning at a much deeper level.

Instead of ranking sentences directly, the transformer model generates summaries based on patterns it learned from large datasets. This allows it to:

- Paraphrase instead of just copying sentences

- Produce smoother, more natural summaries

- Capture high‑level meaning better

In [26]:
import torch
from transformers import pipeline, AutoTokenizer
from tqdm import tqdm       # progress tracker


In [27]:
"""
Sentence segmentation was performed using a punctuation-based heuristic to avoid external model dependencies.
NLTK is the most annoying thing I have ever tried to use.
"""

def split_into_sentences(text: str) -> list[str]:
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [s for s in sentences if s]

In [28]:
def chunk_chapter_by_tokens(chapter: str, model_name="facebook/bart-large-cnn", max_tokens=500) -> list[str]:
    tokenizer = AutoTokenizer.from_pretrained(model_name)  # tokenizer for counting tokens
    sentences = split_into_sentences(chapter)  # split text into sentences
    chunks = []
    chunk, chunk_len = [], 0  # current chunk and its token count

    for sentence in sentences:
        tokens = len(tokenizer.encode(sentence, add_special_tokens=False))
        if chunk and chunk_len + tokens > max_tokens:  # flush a non-empty chunk before it would exceed max_tokens
            chunks.append(" ".join(chunk))  # finish current chunk
            chunk, chunk_len = [], 0
        chunk.append(sentence)
        chunk_len += tokens

    if chunk:
        chunks.append(" ".join(chunk))  # add last chunk

    return chunks
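
The packing logic is easier to follow without the tokenizer download. Below is a sketch of the same greedy idea with the token counter passed in as a function; `chunk_by_budget` and `whitespace_tokens` are hypothetical names, and counting whitespace-separated words is only a stand-in for a real tokenizer.

```python
def chunk_by_budget(sentences, count_tokens, max_tokens):
    """Greedily pack whole sentences into chunks that stay within max_tokens."""
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk when this sentence would push the current one over budget
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))  # flush the last chunk
    return chunks


def whitespace_tokens(sentence):
    # Hypothetical stand-in for a real tokenizer: one token per word
    return len(sentence.split())


sentences = ["One two three.", "Four five.", "Six seven eight nine.", "Ten."]
chunks = chunk_by_budget(sentences, whitespace_tokens, max_tokens=5)
print(chunks)
```

A single sentence longer than the budget still ends up as its own oversized chunk, so the model's truncation settings remain the last line of defence.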

In [29]:
def get_summarization(summarizer, chapter: str) -> str:
    chunks = chunk_chapter_by_tokens(chapter)  # split chapter into token-safe chunks
    mini_summaries = []

    for chunk in tqdm(chunks, desc="Summarizing chunks"):
        max_len = max(30, int(len(chunk.split()) * 0.6))
        summary = summarizer(
            chunk,
            min_length=30,
            max_length=max_len,
            do_sample=False  # deterministic output
        )[0]["summary_text"]
        mini_summaries.append(summary)

    combined_summary = " ".join(mini_summaries)
    max_len_final = max(30, int(len(combined_summary.split()) * 0.6))
    
    print("Generating final summary...")
    final_summary = summarizer(
        combined_summary,
        min_length=30,
        max_length=max_len_final,
        do_sample=False
    )[0]["summary_text"]

    return final_summary
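
The overall shape of this function is "summarize each chunk, then summarize the summaries". Here is a sketch of that two-pass structure with the model swapped out for a trivial stand-in; `summarize_in_two_passes` and `first_sentence` are hypothetical names, and keeping only the first sentence is obviously not a real summarizer.

```python
def summarize_in_two_passes(chunks, summarize):
    """Summarize every chunk, then summarize the joined mini-summaries."""
    mini_summaries = [summarize(chunk) for chunk in chunks]
    return summarize(" ".join(mini_summaries))


def first_sentence(text):
    # Hypothetical stand-in for a model call: keep only the first sentence
    return text.split(". ")[0].rstrip(".") + "."


chunks = ["A happened. B followed.", "C happened. D followed."]
result = summarize_in_two_passes(chunks, first_sentence)
print(result)
```

Any callable that maps text to shorter text can be dropped in for `first_sentence`, which makes the chunking logic testable without loading a model.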


In [30]:
summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn"  # explicitly calls model
)

Device set to use mps:0


In [31]:
print(get_summarization(summarizer, Corpus[0]))

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Summarizing chunks: 100%|█████████████████████████████████████████████████████████████████████████████| 9/9 [00:41<00:00,  4.61s/it]


Generating final summary...
Emma Woodhouse was the youngest of the two daughters of a most affectionate, indulgent father. Her mother had died too long ago for her to have more than an indistinct remembrance of her caresses. Sixteen years had Miss Taylor been in Mr. Woodhouse’s family, less as a governess than a friend.


# LLMs

In [32]:
"""
Using env variables for API Keys
"""

import os
from dotenv import load_dotenv

# Load variables from .env
load_dotenv()

True

## ChatGPT
<https://platform.openai.com/docs/quickstart>

In [28]:
from openai import OpenAI

In [29]:
# Set your OpenAI API key
openai_api_key = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY_HERE")

In [10]:
def chatgpt_summarizer(text: str, max_tokens: int = 200) -> str:
    """
    Summarizes a given text using ChatGPT.
    
    Args:
        text (str): The text to summarize.
        max_tokens (int): Maximum tokens for the summary.
        
    Returns:
        str: The summary text.
    """
    
    # Pass the key loaded above explicitly
    client = OpenAI(api_key=openai_api_key)
    
    response = client.responses.create(
        model="gpt-4o",
        instructions="You are a helpful assistant that summarizes text concisely.",
        input=f"Summarize this: {text}",
        max_output_tokens=max_tokens,  # cap the length of the summary
    )
    
    return response.output_text

In [36]:
chapter_text = Corpus[0]

summary = chatgpt_summarizer(chapter_text)
print(summary)

RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}

## Gemini

<https://ai.google.dev/gemini-api/docs/text-generation>

In [24]:
from google import genai
from google.genai import types

In [21]:
gemini_api_key = os.getenv("GEMINI_API_KEY", "YOUR_GEMINI_API_KEY_HERE")

In [25]:
def gemini_summarizer(text: str, max_tokens: int = 200) -> str:
    """
    Summarizes a given text using Gemini.
    
    Args:
        text (str): The text to summarize.
        max_tokens (int): Maximum tokens for the summary (currently unused).
        
    Returns:
        str: The summary text.
    """
    
    # Pass the key loaded above explicitly
    client = genai.Client(api_key=gemini_api_key)
    
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        config=types.GenerateContentConfig(
            system_instruction="You are a helpful assistant that summarizes text concisely."),
        contents=f"Summarize this: {text}"
    )
    # Return the text so the caller can print or store it
    return response.text


In [26]:
chapter_text = Corpus[0]

summary = gemini_summarizer(chapter_text)
print(summary)

Emma Woodhouse, a wealthy, clever, and privileged young woman, experiences her first real sorrow when her beloved governess and close companion, Miss Taylor, marries Mr. Weston. This leaves Emma feeling intellectually isolated at Hartfield, as her nervous, change-averse father, Mr. Woodhouse, is a poor conversationalist.


# Summary

In this project I messed around with different ways to summarize text. I tried a few approaches:

- Hugging Face Transformers – like BART/T5. Worked okay, but you have to be careful with token limits and chunking or the models freak out on long chapters.

- ChatGPT / Google GenAI – way easier for summaries that actually sound like a human wrote them. Less hassle with tokens, but you depend on their APIs.

- LDA and the extractive methods (TextRank, sentence frequency) – LDA pulled out topic words, while the extractive methods pulled out the “important sentences” instead of rewriting everything. It worked, but the summaries weren’t as smooth or natural as the LLMs.

## What I noticed

- Hugging Face is solid if you want control, but handling long texts is annoying.

- ChatGPT/GenAI makes everything read nicely and human-like.

- LDA/extractive methods are fast and don’t need an API, but can feel robotic or miss context.

- Chunking is super important for LLMs, otherwise they crash or cut stuff off.

- Progress bars are a lifesaver for big texts.

## Overall

The LLMs unsurprisingly did the job best for natural-sounding summaries. Hugging Face models were decent, and the extractive stuff was okay if you just wanted the “important bits” quickly. Overall, all methods gave a decent sense of what the chapters were about, but the LLMs made it way easier to actually read.