# Project Report: NLP Pipeline for Bechdel Test Analysis

## 1. Context and Task Description

### 1.1 Context
The Bechdel-Wallace test is a measure of the representation of women in fiction. While simple, it provides a basic metric for assessing female presence and interaction. The test consists of three criteria:
1.  The work must have at least two named female characters.
2.  Who talk to each other.
3.  About something besides a man.

This project focuses on developing an automated Natural Language Processing (NLP) pipeline to determine whether literary texts (specifically novels) satisfy the **first criterion** of the Bechdel-Wallace test: that the novel contains at least two named female characters.

### 1.2 Task Description
The primary task is to process the raw text of a novel and determine the number of unique, named female characters present. This involves several sub-tasks implemented within this NLP pipeline:

1.  **Text Preprocessing:** Cleaning the raw novel text to prepare it for downstream analysis (e.g., removing headers/footers, normalizing quotes, fixing paragraph breaks).
2.  **Character Identification:** Automatically identifying mentions of characters within the text using Named Entity Recognition (NER) and consolidating different references (names, titles, aliases, nicknames) to the same character.
3.  **Gender Classification:** Assigning a gender (Male, Female, or Unknown) to each identified unique character using a combination of rule-based methods (titles, name lists) and context-based approaches (pronoun analysis, coreference resolution).
4.  **Final Check:** Counting the number of characters classified as 'Female' by the pipeline to determine if the novel meets the first criterion of the Bechdel test.
5.  **Evaluation:** Assessing the performance of the gender classification component against manually annotated ground truth data.
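
The final check in step 4 reduces to a simple count. The following is a minimal sketch, assuming the earlier stages produce a mapping from character name to a gender label. The function name `passes_first_criterion`, the label strings, and the sample data are illustrative assumptions, not the pipeline's actual interface:

```python
from typing import Dict


def passes_first_criterion(character_genders: Dict[str, str], minimum: int = 2) -> bool:
    """Return True if at least `minimum` characters are labelled 'Female'."""
    female_count = sum(1 for gender in character_genders.values() if gender == "Female")
    return female_count >= minimum


# Hypothetical classifier output, for illustration only
sample = {
    "Elizabeth Bennet": "Female",
    "Jane Bennet": "Female",
    "Mr. Darcy": "Male",
    "Hill": "Unknown",
}
print(passes_first_criterion(sample))
```

Characters labelled 'Unknown' are deliberately not counted here, so the check errs on the side of failing a novel rather than passing it on uncertain evidence.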

## 2. Datasets Used

### 2.1 Primary Dataset
The primary data for this project consists of the full text of novels sourced from **Project Gutenberg**, used as the raw input to the pipeline.

**Specific Novels Used:**
* **Dracula** by Bram Stoker
* **Emma** by Jane Austen
* **Pride and Prejudice** by Jane Austen

These novels were selected because they differ in writing style, period, and character distribution: two early nineteenth-century Austen novels and one late nineteenth-century epistolary novel. All three texts are in the public domain.

### 2.2 Evaluation Dataset
For evaluation and grounding of the character/gender identification, **annotated books from the QuoteLi project** were used. These annotations provide ground truth information about characters and their genders.

Link: [QuoteLi Project](https://muzny.github.io/quoteli.html)
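
Evaluation against such annotations can be sketched as a per-character comparison. The example below assumes that both the predictions and the ground truth have already been converted into dictionaries keyed by character name. The QuoteLi file format is not shown in this report, so the parsing step is left out, and the name `gender_accuracy` and the sample data are hypothetical:

```python
from typing import Dict


def gender_accuracy(predicted: Dict[str, str], gold: Dict[str, str]) -> float:
    """Fraction of gold-annotated characters whose predicted gender matches.

    Characters that are missing from `predicted` count as errors.
    """
    if not gold:
        return 0.0
    correct = sum(1 for name, gender in gold.items() if predicted.get(name) == gender)
    return correct / len(gold)


# Made-up data, for illustration only
gold_labels = {"Emma Woodhouse": "Female", "Mr. Knightley": "Male"}
predictions = {"Emma Woodhouse": "Female", "Mr. Knightley": "Unknown"}
print(gender_accuracy(predictions, gold_labels))
```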

### 2.3 Additional Resources
The project also uses name lists to support gender classification:
* `female_names.txt` and `male_names.txt`: lists of common female and male names
* Each list contains roughly 200 names (203 female, 200 male) and serves as a lookup reference for the gender classification component
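
A name-list lookup of this kind might look like the sketch below. It assumes the files hold one name per line, which this report does not confirm, and the helper names `load_names` and `gender_from_name` are hypothetical. The example uses in-memory lists so that it runs without the files:

```python
from typing import Iterable, Set


def load_names(lines: Iterable[str]) -> Set[str]:
    """Build a lowercase lookup set from lines, skipping blank ones."""
    return {line.strip().lower() for line in lines if line.strip()}


def gender_from_name(full_name: str, female: Set[str], male: Set[str]) -> str:
    """Return 'Female' or 'Male' if exactly one list matches a name token, else 'Unknown'."""
    tokens = [token.strip(".,").lower() for token in full_name.split()]
    in_female = any(token in female for token in tokens)
    in_male = any(token in male for token in tokens)
    if in_female and not in_male:
        return "Female"
    if in_male and not in_female:
        return "Male"
    return "Unknown"


# With real files (assumed format): load_names(open("female_names.txt", encoding="utf-8"))
female_names = load_names(["Emma\n", "Jane\n", "\n"])
male_names = load_names(["John\n", "Charles\n"])
print(gender_from_name("Emma Woodhouse", female_names, male_names))
```

Returning 'Unknown' when both lists match, or when neither does, leaves the decision to the context-based methods rather than guessing from the name alone.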

### 2.4 Dataset Summary

| Dataset Component | Description | Size | Source | Purpose |
|-------------------|-------------|------|--------|---------|
| Novel Texts | Full text of Dracula | ~848K chars | Project Gutenberg | Primary analysis text |
| Novel Texts | Full text of Emma | ~885K chars | Project Gutenberg | Primary analysis text |
| Novel Texts | Full text of Pride & Prejudice | ~723K chars | Project Gutenberg | Primary analysis text |
| Female Names | Common female names | 203 names | Custom-created | Gender classification |
| Male Names | Common male names | 200 names | Custom-created | Gender classification |
| Ground Truth | Character gender annotations | 146 characters | QuoteLi project | Evaluation |

In [10]:
# Cell 1: Import libraries
import re
import os
import nltk
from nltk.tokenize import sent_tokenize

# Download NLTK resources
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Yassin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## 3. Pipeline Design and Implementation

The pipeline consists of four main components, each implemented in a separate notebook:

### 3.1 Text Preprocessing (`00_pre_proc.ipynb`)

**Design Principles:**
- Clean Project Gutenberg texts while preserving important textual features
- Remove metadata, headers, footers, and illustrations
- Normalize different types of quotation marks
- Fix paragraph breaks to create proper paragraph structure

**Implementation Details:**
- Used regex patterns to identify and extract the main content from Project Gutenberg headers/footers
- Normalized various quote styles (curly quotes, straight quotes) to standard format
- Implemented paragraph fixing to handle hard-wrapped lines common in digitized texts
- Removed illustration blocks that would interfere with entity recognition


In [11]:
# Cell 2: Define functions
def extract_gutenberg_content(text):
    """Extract the actual content of a Gutenberg book, removing headers and footers."""
    start_pattern = r"\*\*\* START OF (THIS|THE) PROJECT GUTENBERG EBOOK .+? \*\*\*"
    end_pattern = r"\*\*\* END OF (THIS|THE) PROJECT GUTENBERG EBOOK .+? \*\*\*"
    
    start_match = re.search(start_pattern, text)
    end_match = re.search(end_pattern, text)
    
    if start_match and end_match:
        return text[start_match.end():end_match.start()].strip()
    return text


def normalize_quotes(text):
    """Normalize different kinds of quotes to standard single/double quotes."""
    # Replace various curly single quotes and backticks with standard apostrophe
    text = re.sub(r"[‘’‛`]", "'", text)
    # Replace various curly double quotes with standard double quote
    text = re.sub(r'[“”„‟]', '"', text)
    return text


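def fix_paragraphs(text: str) -> str:
    """
    Join hard-wrapped lines inside a paragraph while preserving paragraph breaks.

    Heuristic:
      - A blank line always ends the current paragraph.
      - A line ending in . ! ? (optionally followed by a closing quote,
        parenthesis, or bracket) also ends the current paragraph.
      - Any other line is joined to the next one with a single space.

    Caveats: a wrapped line that happens to end with a full stop is treated as
    a paragraph end, and because the blank-line markers are joined with two
    newlines as well, paragraphs can end up separated by more than one blank
    line.
    """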
    out_lines = []
    buffer    = []

    for line in text.splitlines():
        if not line.strip():               # blank ⇒ paragraph break
            if buffer:
                out_lines.append(" ".join(buffer))
                buffer.clear()
            out_lines.append("")           # keep one blank line
            continue

        buffer.append(line.strip())

        # if this line *really* ends a sentence, flush the buffer
        if re.search(r'[.!?]["\')\]]?\s*$', line):
            out_lines.append(" ".join(buffer))
            buffer.clear()

    if buffer:
        out_lines.append(" ".join(buffer))

    return "\n\n".join(out_lines)



def remove_illustrations(text):
    """
    Remove illustration blocks that start with '[Illustration' (with or without
    a trailing colon) and run to the first closing ']', possibly across lines.
    Leading whitespace before the block is removed as well.
    """
    # Regex explanation:
    # ^\s*            : start of a line (re.MULTILINE), then optional whitespace
    # \[Illustration  : the literal opening text (square bracket escaped)
    # :?              : an optional colon, so a bare '[Illustration]' also matches
    # .*?             : any characters, including newlines (re.DOTALL), as few as
    #                   possible, so the match stops at the first ']'
    # \]              : the literal closing square bracket (escaped)
    # Note: a caption that contains its own ']' ends the match early.
    pattern = r"^\s*\[Illustration:?.*?\]"

    # Replace every matched block with an empty string
    cleaned_text = re.sub(pattern, "", text, flags=re.MULTILINE | re.DOTALL)

    # Collapse any run of blank or whitespace-only lines into a single blank line
    cleaned_text = re.sub(r'\n\s*\n', '\n\n', cleaned_text)
    
    return cleaned_text.strip()
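
One limitation of `extract_gutenberg_content` above is that it returns the whole file unchanged unless both the start and end markers are found. The sketch below handles each marker independently. It is a hypothetical variant, not the function used in this notebook, and the name `extract_gutenberg_content_lenient` is an assumption:

```python
import re

# Same marker patterns as the notebook, with non-capturing groups
START_PATTERN = r"\*\*\* START OF (?:THIS|THE) PROJECT GUTENBERG EBOOK .+? \*\*\*"
END_PATTERN = r"\*\*\* END OF (?:THIS|THE) PROJECT GUTENBERG EBOOK .+? \*\*\*"


def extract_gutenberg_content_lenient(text: str) -> str:
    """Trim the header and footer independently, keeping whatever cannot be matched."""
    start_match = re.search(START_PATTERN, text)
    end_match = re.search(END_PATTERN, text)
    start = start_match.end() if start_match else 0
    end = end_match.start() if end_match else len(text)
    return text[start:end].strip()


# Made-up sample with a header marker but no footer marker
sample_text = "License text\n*** START OF THE PROJECT GUTENBERG EBOOK EMMA ***\nChapter I\nBody text."
print(extract_gutenberg_content_lenient(sample_text))
```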

In [15]:
# Cell 3: Run the preprocessing workflow

# --- Configuration ---
input_file_path = "../data/pp_novel.txt" 
output_file_path = "../data/pp_cleaned.txt" 

# --- Workflow ---
print(f"--- Starting Basic Preprocessing for {input_file_path} ---")
processed_text = "" # Initialize variable
try:
    with open(input_file_path, 'r', encoding='utf-8', errors='replace') as file:
        raw_text = file.read()
    print(f"Loaded raw text ({len(raw_text)} chars).")

    # Step 1: Initial Gutenberg Cleanup
    # Input: raw_text
    gutenberg_body = extract_gutenberg_content(raw_text)
    print(f"Length after Gutenberg extraction: {len(gutenberg_body)}")
    
    # Step 2: Remove Illustrations
    # Input: gutenberg_body
    text_no_illustrations = remove_illustrations(gutenberg_body)
    print(f"Length after removing illustrations: {len(text_no_illustrations)}")

    # Step 3: Normalize Quotes
    # Input: text_no_illustrations (output of Step 2)
    text_norm_quotes = normalize_quotes(text_no_illustrations) 
    print(f"Length after quote normalization: {len(text_norm_quotes)}")
     
    # Step 4: Fix Paragraphs
    # Input: text_norm_quotes (output of Step 3)
    processed_text = fix_paragraphs(text_norm_quotes) 
    print(f"Length after paragraph fixing: {len(processed_text)}")

    # Step 5: Save the final result
    if processed_text:
        try:
            with open(output_file_path, 'w', encoding='utf-8') as outfile:
                outfile.write(processed_text)
            print(f"\nSuccessfully saved cleaned text to: {output_file_path}")
        except Exception as e:
            print(f"\nError saving file {output_file_path}: {e}")
    else:
        print("\nSkipping save as processed text is empty.")

    # debug: Print sample of final result
    # if processed_text:
    #     print("\nSample of final processed text:")
    #     print(processed_text[:1000] + "...")

except FileNotFoundError:
    print(f"Error: Input file not found at {input_file_path}")
except Exception as e:
    print(f"An unexpected error occurred during processing: {e}")

print(f"\n--- Basic Preprocessing finished for {input_file_path} ---")

--- Starting Basic Preprocessing for ../data/pp_novel.txt ---
Loaded raw text (748126 chars).
Length after Gutenberg extraction: 728713
Length after removing illustrations: 721084
Length after quote normalization: 721084
Length after paragraph fixing: 723733

Successfully saved cleaned text to: ../data/pp_cleaned.txt

--- Basic Preprocessing finished for ../data/pp_novel.txt ---
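
The text grows during the paragraph-fixing step (from 721,084 to 723,733 characters) even though lines are being joined. This is most likely because `fix_paragraphs` joins all of its output pieces with two newlines, including the empty strings that mark blank lines, so each paragraph break ends up as several newlines. If a single blank line between paragraphs is wanted, one optional post-processing step could be the following. It is an assumption, not part of the original notebook, and the name `collapse_blank_lines` is hypothetical:

```python
import re


def collapse_blank_lines(text: str) -> str:
    """Reduce any run of three or more newlines to exactly two (one blank line)."""
    return re.sub(r"\n{3,}", "\n\n", text)


print(repr(collapse_blank_lines("First paragraph.\n\n\n\nSecond paragraph.")))
# prints 'First paragraph.\n\nSecond paragraph.'
```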
