## 3.2 Character Identification (01_character_identification.ipynb)

**Objective**: The primary goal of this stage was to process the cleaned plain text from the novels, identify all mentions of characters, filter out noise and non-character entities, and consolidate different textual references (e.g., "Jonathan Harker", "Jonathan") into unique character representations.

**Design & Implementation Strategy**: Since the pipeline starts from plain text without leveraging pre-existing mention annotations (like those in the Muzny et al. XML), a Named Entity Recognition (NER) model was employed as the core component for identifying potential character mentions. This was followed by filtering and consolidation steps.

### Key Stages & Implementation Details:

1. **Named Entity Recognition (NER) using BERT**:

   - **Model**: A transformer-based NER model specifically fine-tuned on literary texts, compnet-renard/bert-base-cased-literary-NER, was utilized via the Hugging Face transformers library's pipeline function. This model was chosen for its reported effectiveness on literary domain text compared to general-purpose NER models.

   - **Chunking**: BERT-based models accept only a limited number of tokens per input (512 for bert-base models), so the cleaned text of each novel was processed in overlapping chunks.

     - chunk_size: 2000 characters (heuristic; may require tuning per model/hardware).

     - overlap: 200 characters (so that an entity spanning a chunk boundary is still captured in full by one of the two chunks).

   - **Aggregation**: The pipeline was run with aggregation_strategy="first". This strategy merges sub-word tokens back into whole words and, when the sub-word tokens of a word receive different labels, assigns the word the label of its first sub-word token. Adjacent words with the same entity type are then grouped into a single entity.

   - **Output**: All identified entities tagged as 'PER' (Person) were extracted along with their text and character offsets relative to the original full text. Simple deduplication based on exact text and offsets was performed to handle entities detected identically in overlapping chunk regions. The raw, unique PERSON mentions were initially saved to ../data/ner_person_mentions_bert.json for traceability.

2. **Post-NER Filtering**:

   - **Rationale**: Raw NER output often contains noise or misclassifications (e.g., standalone titles, punctuation). Filtering is necessary to improve the quality of mentions passed to the consolidation stage.

   - **Rules Applied**:

     - Mentions consisting solely of punctuation or whitespace after stripping were removed.

     - Mentions consisting only of a common standalone title (e.g., "Mr", "Mrs", "Lady", case-insensitive, after stripping punctuation) were removed.

   - **Output**: The cleaned and filtered list of PERSON mentions was saved to ../data/ner_person_mentions_bert_filtered.json.

3. **Character Consolidation**:

   - **Goal**: Group the filtered mentions that likely refer to the same character entity.

   - **Primary Method (nicknames library)**: The Python nicknames library was used. Each mention's full text was lowercased and looked up in the library for canonical/formal names (e.g., mapping "lizzy" to "elizabeth"). If any were found, one of them was taken as the initial grouping key.

   - **Fallback Method (Rule-based)**: If a mention was not found in the nicknames database:

     - Common titles (Mr, Mrs, etc.) were stripped from the beginning of the mention text.

     - The remaining text (or the original cleaned text if no title was stripped) was lowercased and used as the grouping key.

   - **Variation Collection**: The original filtered mention text (before lowercasing for nickname/fallback lookup) was added to a set associated with the determined grouping key. This preserves the different ways a character was referred to.

   - **Final Canonical Key Generation**: After grouping, a representative canonical_key was generated for each group from its most frequent variation (ties were broken in favour of the longer variation). That variation was then formatted: a leading title, if present, was kept as a prefix without its trailing period, each remaining word was capitalized, and the parts were joined with underscores (e.g., Mr_Jonathan_Harker or Mina).

   - **Output**: A consolidated table with one row per canonical key, giving its total mention count, the number of distinct variations, and the variations themselves, was sorted by mention count and saved to ../data/character_analysis_consolidated_nicknames.csv.
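
The grouping logic of stage 3 can be illustrated with a minimal sketch. It is not the notebook's code: `TOY_NICKNAMES` is a hypothetical stand-in for the nicknames library lookup, and `grouping_key` is an illustrative helper name. Only the "lizzy" to "elizabeth" mapping comes from the description above.

```python
# Hypothetical stand-in for the nicknames library: nickname -> set of canonical names.
TOY_NICKNAMES = {"lizzy": {"elizabeth"}}

TITLES = {"mr", "mrs", "miss", "ms", "dr", "lady", "sir", "colonel", "captain", "lord"}


def grouping_key(mention, nickname_table=TOY_NICKNAMES, titles=TITLES):
    """Return the lowercase key under which a mention would be grouped."""
    lowered = mention.lower()

    # Primary method: look up the whole lowercased mention as a nickname.
    canonicals = nickname_table.get(lowered)
    if canonicals:
        return sorted(canonicals)[0]  # deterministic choice among candidates

    # Fallback: strip a leading title, but only if something remains after it.
    parts = mention.split()
    if len(parts) > 1 and parts[0].lower().strip(".") in titles:
        return " ".join(parts[1:]).lower()
    return lowered


for mention in ["Lizzy", "Mr. Harker", "Harker", "Jonathan Harker"]:
    print(f"{mention!r} -> {grouping_key(mention)!r}")
```

In this sketch "Mr. Harker" and "Harker" share a key, while "Jonathan Harker" keeps its own key, because keys are compared as whole strings rather than by surname or first name.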

### Libraries & Tools Used:

- **transformers** (Hugging Face): For loading and running the NER model.

- **pandas**: For data manipulation and saving CSV outputs.

- **nicknames**: For dictionary-based nickname-to-canonical name mapping.

- **Standard Python libraries**: json, re, collections, os, string.

### Changes from Previous Versions:

- The pipeline now uses the BERT NER model and the nicknames library, rather than the spaCy NER and direct XML alias mapping of earlier attempts.

- The chunking and filtering steps present in the current code are documented.

- Consolidation runs the nicknames lookup first, followed by the rule-based fallback.

In [16]:
# Cell 1: Import libraries
import spacy # Used in the earlier spaCy-based NER attempts
import os
import json
import re
from collections import Counter
import pandas as pd
import xml.etree.ElementTree as ET # XML parsing (used in the first attempts, when Pride and Prejudice was used for alias mapping)

print("Libraries imported.")

Libraries imported.


## Load Cleaned Text
Read the content of the cleaned text file created by `00_pre_proc.ipynb`.

In [17]:
# Cell 2: Load cleaned text
input_file_path = "../data/dracula_cleaned.txt" # Edit this path to process a different novel

try:
    with open(input_file_path, 'r', encoding='utf-8') as file:
        cleaned_text = file.read()
    print(f"Successfully loaded cleaned text from: {input_file_path}")
    print(f"Text length: {len(cleaned_text)} characters")
except FileNotFoundError:
    print(f"Error: Cleaned text file not found at {input_file_path}")
except Exception as e:
    print(f"An error occurred loading the file: {e}")

Successfully loaded cleaned text from: ../data/dracula_cleaned.txt
Text length: 848415 characters


## Process Text for Named Entities and Extract PERSON Entities

This is the third NER attempt. It uses the BERT-based literary NER model and gave much better results than the earlier attempts.

In [18]:
from transformers import pipeline


# --- Initialize the pipeline ---
literary_ner = pipeline("ner", model="compnet-renard/bert-base-cased-literary-NER", aggregation_strategy="first")

# --- Define Chunking Parameters ---
chunk_size = 2000 # Number of characters per chunk (!)
overlap = 200    # Number of characters overlap (!)

# --- Process in Chunks ---
all_person_mentions = []
current_pos = 0

print(f"DEBUG: Starting chunk processing. Total text length: {len(cleaned_text)}")

while current_pos < len(cleaned_text):
    chunk_start = current_pos
    chunk_end = min(current_pos + chunk_size, len(cleaned_text))
    text_chunk = cleaned_text[chunk_start:chunk_end]

    # print(f"DEBUG: Processing chunk: {chunk_start} - {chunk_end}") # Optional debug

    if not text_chunk.strip(): # Skip empty chunks if any
        current_pos += chunk_size - overlap
        continue

    try:
        # Run NER on the chunk
        ner_results_chunk = literary_ner(text_chunk)

        # Process results for this chunk
        for entity in ner_results_chunk:
            # Adjust character offsets to be relative to the full text
            original_start = chunk_start + entity['start']
            original_end = chunk_start + entity['end']

            mention_data = {
                "text": entity['word'],
                "start_char": original_start,
                "end_char": original_end
            }

            if entity.get('entity_group') == 'PER':
                all_person_mentions.append(mention_data)

    except Exception as e:
        print(f"ERROR processing chunk {chunk_start}-{chunk_end}: {e}") # Log errors

    # Move to the next chunk position
    # If it's the last chunk, stop
    if chunk_end == len(cleaned_text):
        break
    current_pos += chunk_size - overlap # Move forward, maintaining overlap


print(f"DEBUG: Chunk processing finished.")
print(f"DEBUG: Total PERSON mentions found across all chunks: {len(all_person_mentions)}")



# Simple deduplication based on exact start/end/text match
unique_person_mentions_set = set()
unique_person_mentions = []
for mention in all_person_mentions:
    mention_tuple = (mention['text'], mention['start_char'], mention['end_char'])
    if mention_tuple not in unique_person_mentions_set:
        unique_person_mentions_set.add(mention_tuple)
        unique_person_mentions.append(mention)

print(f"DEBUG: Unique PERSON mentions after deduplication: {len(unique_person_mentions)}")



person_mentions = unique_person_mentions # Assign to the variable name used later

# After the NER processing and deduplication...
if person_mentions:
    import json
    output_data_path = "../data/ner_person_mentions_bert.json"
    
    # Save to JSON file
    try:
        with open(output_data_path, 'w', encoding='utf-8') as f:
            json.dump(person_mentions, f, ensure_ascii=False, indent=2)
        print(f"Successfully saved {len(person_mentions)} person mentions to {output_data_path}")
    except Exception as e:
        print(f"Error saving to JSON: {e}")
else:
    print("No person mentions found to save")

# few entries to verify the data
print("\nFirst few person mentions:")
for mention in person_mentions[:5]:
    print(mention)

Device set to use cpu


DEBUG: Starting chunk processing. Total text length: 848415
DEBUG: Chunk processing finished.
DEBUG: Total PERSON mentions found across all chunks: 2752
DEBUG: Unique PERSON mentions after deduplication: 2692
Successfully saved 2692 person mentions to ../data/ner_person_mentions_bert.json

First few person mentions:
{'text': 'Bram Stoker', 'start_char': 19, 'end_char': 30}
{'text': 'Bram Stoker', 'start_char': 165, 'end_char': 176}
{'text': 'Jonathan Harker', 'start_char': 345, 'end_char': 360}
{'text': 'Jonathan Harker', 'start_char': 383, 'end_char': 398}
{'text': 'Jonathan Harker', 'start_char': 422, 'end_char': 437}
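
The chunk boundaries produced by the loop above can be reproduced without loading the model. The following is a minimal sketch; `chunk_spans` is an illustrative helper name, not part of the notebook, and it assumes the same `chunk_size` and `overlap` values.

```python
def chunk_spans(text_length, chunk_size=2000, overlap=200):
    """Yield (start, end) character spans that cover the text with overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    start = 0
    while start < text_length:
        end = min(start + chunk_size, text_length)
        yield start, end
        if end == text_length:
            break  # last chunk reached
        start += chunk_size - overlap  # step forward, keeping the overlap


spans = list(chunk_spans(848415))  # length of the cleaned Dracula text
print(len(spans), spans[:2], spans[-1])
```

Each chunk starts 1800 characters after the previous one, so the 848415-character text is covered by 472 chunks. An entity that is cut off at the end of one chunk may be reported with different offsets than its complete form in the next chunk, in which case the exact-match deduplication would keep both.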


## Filtering

In [19]:

# List of standalone titles/honorifics to filter if they appear alone
standalone_titles = {"mr", "mrs", "miss", "ms", "dr", "lady", "sir", "colonel", "captain", "lord"} # Add more
# Characters considered punctuation for stripping/checking
import string
punctuation_chars = string.punctuation # Gets '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

# --- Implement Post-NER Filtering ---
filtered_person_mentions = []
if person_mentions: # Use the list generated by the HF model
    print(f"DEBUG: Starting Post-NER Filtering on {len(person_mentions)} mentions...")
    for mention in person_mentions:
        original_text = mention['text']

        # 1. Basic cleaning: remove leading/trailing whitespace and punctuation in one pass
        #    Example: ". Bennet " -> "Bennet" ; "." -> "" ; "Mr." -> "Mr"
        #    (stored in cleaned_mention so the novel text in cleaned_text is not overwritten)
        cleaned_mention = original_text.strip(punctuation_chars + string.whitespace)
        cleaned_mention_lower = cleaned_mention.lower()

        # --- Apply Filters ---
        # Filter 1: Check if empty after cleaning (e.g., if it was just ".")
        if not cleaned_mention:
            # print(f"Filtering empty/punctuation mention: '{original_text}'") # Optional debug
            continue

        # Filter 2: Filter if it's just a standalone title
        if cleaned_mention_lower in standalone_titles:
            # print(f"Filtering standalone title: '{original_text}' -> '{cleaned_mention}'") # Optional debug
            continue

        # Keep the mention; the offsets still refer to the span reported by the NER model
        mention_to_add = {
            "text": cleaned_mention,
            "start_char": mention['start_char'],
            "end_char": mention['end_char']
        }
        filtered_person_mentions.append(mention_to_add)

    print(f"DEBUG: Mentions remaining after filtering: {len(filtered_person_mentions)}")
else:
    print("DEBUG: Initial person_mentions list was empty. Skipping filtering.")


# --- Save the filtered list ---
output_filtered_path = "../data/ner_person_mentions_bert_filtered.json"
if filtered_person_mentions:
    print(f"DEBUG: Saving {len(filtered_person_mentions)} filtered mentions to {output_filtered_path}...")
    import json
    import os
    try:
        os.makedirs(os.path.dirname(output_filtered_path), exist_ok=True)
        with open(output_filtered_path, 'w', encoding='utf-8') as f:
            json.dump(filtered_person_mentions, f, indent=4)
        print(f"DEBUG: Successfully saved filtered mentions.")
    except Exception as e:
        print(f"DEBUG: Error saving filtered JSON: {e}")
else:
    print("DEBUG: No filtered person mentions to save.")


# next step is consolidation

DEBUG: Starting Post-NER Filtering on 2692 mentions...
DEBUG: Mentions remaining after filtering: 2338
DEBUG: Saving 2338 filtered mentions to ../data/ner_person_mentions_bert_filtered.json...
DEBUG: Successfully saved filtered mentions.
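
The two filtering rules can also be written as a small pure function, which makes them easy to try on individual strings. This is a sketch rather than the notebook's code: `clean_mention` is an illustrative name, and it assumes whitespace and punctuation are stripped together in a single pass.

```python
import string

STANDALONE_TITLES = {"mr", "mrs", "miss", "ms", "dr", "lady", "sir",
                     "colonel", "captain", "lord"}
STRIP_CHARS = string.punctuation + string.whitespace


def clean_mention(text, titles=STANDALONE_TITLES):
    """Return the cleaned mention text, or None if the mention should be dropped."""
    cleaned = text.strip(STRIP_CHARS)
    if not cleaned:
        return None  # only punctuation/whitespace
    if cleaned.lower() in titles:
        return None  # standalone title such as "Mr."
    return cleaned


samples = [". Bennet ", ".", "Mr.", "Dr. Seward", "  Lucy"]
print([clean_mention(s) for s in samples])
```

A title followed by a name, such as "Dr. Seward", passes through unchanged; only a title standing alone is dropped.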



## Consolidate Characters - by using a nicknames library


In [20]:
# --- Cell for Character Consolidation  ---
import json
import pandas as pd
from collections import Counter, defaultdict
import re
from nicknames import NickNamer # Import the library

# --- Load Filtered Mentions ---
filtered_mentions_path = "../data/ner_person_mentions_bert_filtered.json"
original_filtered_mentions = []
try:
    with open(filtered_mentions_path, 'r', encoding='utf-8') as f:
        # This loads a LIST of mention dictionaries, e.g., [{'text': 'Bennet', 'start_char': ...}, ...]
        original_filtered_mentions = json.load(f)
    print(f"Loaded {len(original_filtered_mentions)} filtered mentions from {filtered_mentions_path}")
except Exception as e:
    print(f"Error loading filtered mentions JSON: {e}")
    original_filtered_mentions = [] # Ensure it's an empty list on error

# --- Initialize NickNamer ---
try:
    nn = NickNamer()
    print("NickNamer initialized.")
except Exception as e:
    print(f"Warning: Could not initialize NickNamer. Nickname lookup disabled. Error: {e}")
    nn = None # Disable nickname lookup if initialization fails

# --- Define Titles (for fallback/normalization if not a nickname) ---
titles = {"mr", "mrs", "miss", "ms", "dr", "lady", "sir", "colonel", "captain", "lord"} # Lowercase

# --- Consolidation Logic (Iterating through individual filtered mentions) ---
# This dictionary will store aggregated data: {canonical_base: {"count": N, "variations": set()}}
consolidated_characters = defaultdict(lambda: {"count": 0, "variations": set()})
mention_texts_for_frequency = [] # Collect all original texts to get counts

if original_filtered_mentions: # Check if the list is not empty
    print("Starting character consolidation using NickNamer and rules...")
    for mention_data in original_filtered_mentions:
        # Ensure mention_data is a dictionary with 'text' key
        if not isinstance(mention_data, dict) or 'text' not in mention_data:
            print(f"Skipping invalid mention data: {mention_data}")
            continue

        mention_text = mention_data['text'] # Use the cleaned text from filtering step
        mention_texts_for_frequency.append(mention_text) # Store for counting frequency later
        mention_lower = mention_text.lower()

        canonical_base = None # This will hold the temporary grouping key

        # --- Step 1: Check Nickname Dictionary ---
        if nn: # Only if NickNamer initialized successfully
            formal_names = nn.canonicals_of(mention_lower)
            if formal_names:
                # Simple strategy: use the alphabetically first formal name, so the
                # choice is reproducible across runs (set iteration order is not).
                canonical_base = sorted(formal_names)[0]
                # print(f"DEBUG: Nickname mapping: '{mention_text}' -> '{canonical_base}'") # Optional debug

        # --- Step 2: Fallback to Rule-Based Normalization (if not found in nicknames) ---
        if canonical_base is None:
            parts = mention_text.split()
            # Simple title stripping (if title is first word)
            if parts and parts[0].lower().strip('.') in titles:
                canonical_base = " ".join(parts[1:]) # Use name after title
            else:
                canonical_base = mention_text # Use original (cleaned) text

            # If after stripping title, the name is empty, use original text
            if not canonical_base:
                canonical_base = mention_text

            # Normalize to lowercase for consistent grouping *before* final formatting
            canonical_base = canonical_base.lower()

        # --- Grouping ---
        # Group based on the derived canonical_base (lowercase)
        consolidated_characters[canonical_base]["count"] += 1
        consolidated_characters[canonical_base]["variations"].add(mention_text) # Add the original mention text as a variation

    # --- Step 3: Refine Canonical Keys and Final Formatting ---
    print("Refining canonical keys...")
    final_consolidated_list = []
    mention_counts = Counter(mention_texts_for_frequency) # Count frequencies of original mentions

    for base_key, data in consolidated_characters.items():
        # Choose the most frequent variation within the group as the final key representation
        most_frequent_variation = base_key # Default to the base itself
        max_freq = 0
        for variation in data["variations"]:
            freq = mention_counts.get(variation, 0)
            if freq > max_freq:
                max_freq = freq
                most_frequent_variation = variation
            # Optional: Tie-breaking (e.g., prefer longer variation if frequencies are equal)
            elif freq == max_freq and len(variation) > len(most_frequent_variation):
                most_frequent_variation = variation

        # Format the chosen key (e.g., Title_Case, Underscores)
        parts = most_frequent_variation.split()
        final_key_parts = []
        title_prefix = ""
        start_index = 0
        if parts and parts[0].lower().strip('.') in titles:
            title_prefix = parts[0].strip('.').capitalize() + "_"
            start_index = 1
        final_key_parts = [part.capitalize() for part in parts[start_index:]]
        final_canonical_key = title_prefix + "_".join(final_key_parts)

        # Fallback if key is empty
        if not final_canonical_key:
            final_canonical_key = f"Unknown_{base_key[:10]}"

        # Append final data for this character group
        final_consolidated_list.append({
            "canonical_key": final_canonical_key,
            "total_mentions": data["count"],
            "variations": sorted(list(data["variations"])),
            "variation_count": len(data["variations"])
        })

    # --- Sort by total mentions (descending) ---
    final_consolidated_list.sort(key=lambda x: x['total_mentions'], reverse=True)

    # --- Save to CSV ---
    
    output_csv = "../data/character_analysis_consolidated_nicknames.csv"
    try:
        df_consolidated = pd.DataFrame(final_consolidated_list)
        # making sure columns are in a sensible order
        df_consolidated = df_consolidated[['canonical_key', 'total_mentions', 'variation_count', 'variations']]
        df_consolidated.to_csv(output_csv, index=False)
        print(f"\nSuccessfully saved consolidated results (nicknames + rules) to '{output_csv}'")

    except Exception as e:
        print(f"\nError saving consolidated results: {e}")

else:
    print("\nSkipping consolidation as no filtered PERSON mentions were loaded.")

Loaded 2338 filtered mentions from ../data/ner_person_mentions_bert_filtered.json
NickNamer initialized.
Starting character consolidation using NickNamer and rules...
Refining canonical keys...

Successfully saved consolidated results (nicknames + rules) to '../data/character_analysis_consolidated_nicknames.csv'
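
The final key formatting from Step 3 can be checked in isolation. The sketch below mirrors that step under the same title list; `format_canonical_key` is an illustrative helper name that does not appear in the notebook.

```python
TITLES = {"mr", "mrs", "miss", "ms", "dr", "lady", "sir", "colonel", "captain", "lord"}


def format_canonical_key(variation, titles=TITLES):
    """Format a mention variation as a key such as 'Mr_Jonathan_Harker'."""
    parts = variation.split()
    prefix = ""
    # Keep a leading title as a prefix, without its trailing period.
    if parts and parts[0].lower().strip(".") in titles:
        prefix = parts[0].strip(".").capitalize() + "_"
        parts = parts[1:]
    # str.capitalize() upper-cases the first letter and lower-cases the rest.
    return prefix + "_".join(part.capitalize() for part in parts)


for variation in ["Mr. Jonathan Harker", "Mina", "van helsing", "DRACULA"]:
    print(f"{variation!r} -> {format_canonical_key(variation)!r}")
```

Because `str.capitalize()` lower-cases everything after the first letter, a name with internal capitals (for example "McDonald") would come out as "Mcdonald" under this formatting.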
