## Part 6: Contextual Gender Analysis

### 3.4 Contextual Gender Analysis (`02b_context.ipynb`)

**Design Principles:**
- Use pronoun context and coreference resolution for characters with unknown gender
- Analyze sentences containing character mentions to find gendered pronouns
- Apply coreference resolution to link pronouns to specific character mentions instead of relying on sentence co-occurrence alone

**Implementation Details:**
- Used NLTK for sentence tokenization
- Analyzed sentences containing character mentions for gendered pronouns (he/him/his, she/her/hers)
- Used spaCy with the coreferee extension for coreference resolution
- Applied threshold-based classification based on pronoun counts
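
The same-sentence pronoun count described above can be sketched as follows. This is a minimal illustration rather than the notebook's exact code: `count_pronouns_near` and the sample sentences are hypothetical, and the sentences are passed in as a plain list instead of being produced by NLTK.

```python
import re

MALE_PRONOUNS = {"he", "him", "his"}
FEMALE_PRONOUNS = {"she", "her", "hers"}


def count_pronouns_near(name, sentences):
    """Count gendered pronouns in every sentence that mentions `name`."""
    # Whole-word, case-insensitive match on the character name.
    pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
    male = female = 0
    for sentence in sentences:
        if not pattern.search(sentence):
            continue  # Sentence does not mention the character
        for word in re.findall(r"\b\w+\b", sentence.lower()):
            if word in MALE_PRONOUNS:
                male += 1
            elif word in FEMALE_PRONOUNS:
                female += 1
    return male, female


# Hypothetical sample sentences (not taken from the novel).
sample = [
    "Lucy smiled when she saw her friend.",
    "The night was cold.",
    "Arthur said that he would write to Lucy.",
]
print(count_pronouns_near("Lucy", sample))  # (1, 2)
```

Every pronoun in a matching sentence counts toward the named character, whoever it actually refers to. In the last sample sentence "he" refers to Arthur but is still counted for Lucy, which is the limitation coreference resolution is meant to address.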

**Models and Libraries:**
- NLTK for sentence tokenization
- spaCy `en_core_web_lg` model for linguistic analysis
- `coreferee` extension for coreference resolution

**Key Thresholds:**
- Minimum number of dominant pronouns required: 2
- Minimum difference between male/female pronoun counts: 1
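
As a rough sketch of how these two thresholds combine, the rule can be written as a small function. The name `classify_by_pronouns` and its default arguments are illustrative assumptions that mirror the values listed above.

```python
def classify_by_pronouns(male, female, min_count=2, min_diff=1):
    """Return 'Male', 'Female', or 'Unknown' from raw pronoun counts."""
    # The dominant pronoun set must reach min_count and lead by min_diff.
    if male >= min_count and male - female >= min_diff:
        return "Male"
    if female >= min_count and female - male >= min_diff:
        return "Female"
    return "Unknown"


print(classify_by_pronouns(5, 1))  # Male
print(classify_by_pronouns(1, 0))  # Unknown: dominant count is below 2
print(classify_by_pronouns(3, 3))  # Unknown: difference is below 1
print(classify_by_pronouns(2, 3))  # Female
```

With a minimum difference of 1, a single extra pronoun is enough to decide the label, so characters with large but nearly balanced counts are still classified.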

In [15]:
# Cell 1: Import libraries
import pandas as pd
import re
import os
import time
from collections import Counter
import nltk # Using NLTK for sentence tokenization

# Ensure NLTK sentence tokenizer is available
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:  # nltk.data.find raises LookupError when the resource is missing
    print("Downloading NLTK 'punkt' tokenizer...")
    nltk.download('punkt')

print("Libraries imported and NLTK setup checked.")

Libraries imported and NLTK setup checked.


In [16]:
# Cell 2: Configuration
# --- Input Files ---
GENDERED_CSV_PATH = "../data/character_analysis_gendered_new.csv" # Input from Notebook 02
CLEANED_TEXT_PATH = "../data/dracula_cleaned.txt" # Input from Notebook 00

# --- Output File ---
OUTPUT_CSV_PATH = "../data/character_analysis_gendered_contextual_new.csv"

# --- Constants ---
GENDER_FEMALE = "Female"
GENDER_MALE = "Male"
GENDER_UNKNOWN = "Unknown"

# --- Pronoun Sets ---
MALE_PRONOUNS = {'he', 'him', 'his'}
FEMALE_PRONOUNS = {'she', 'her', 'hers'}

# --- Parameters ---
MIN_PRONOUN_THRESHOLD = 2 # Minimum number of dominant pronouns required to make a classification
MIN_PRONOUN_DIFFERENCE = 1 # Minimum difference between male/female counts

print("Configuration set.")
print(f"Gendered input CSV: {GENDERED_CSV_PATH}")
print(f"Cleaned text input: {CLEANED_TEXT_PATH}")
print(f"Output CSV: {OUTPUT_CSV_PATH}")

Configuration set.
Gendered input CSV: ../data/character_analysis_gendered_new.csv
Cleaned text input: ../data/dracula_cleaned.txt
Output CSV: ../data/character_analysis_gendered_contextual_new.csv


In [17]:
# Cell 3: Load Input Data
print("Loading input data...")
try:
    char_df = pd.read_csv(GENDERED_CSV_PATH)
    print(f"Successfully loaded {len(char_df)} characters from {GENDERED_CSV_PATH}.")
except FileNotFoundError:
    print(f"Error: Character CSV file not found at {GENDERED_CSV_PATH}")
    char_df = None
except Exception as e:
    print(f"Error loading character CSV: {e}")
    char_df = None

try:
    with open(CLEANED_TEXT_PATH, 'r', encoding='utf-8') as f:
        full_text = f.read()
    print(f"Successfully loaded text ({len(full_text)} chars) from {CLEANED_TEXT_PATH}.")
    # Tokenize text into sentences
    sentences = nltk.sent_tokenize(full_text)
    print(f"Tokenized text into {len(sentences)} sentences.")
except FileNotFoundError:
    print(f"Error: Cleaned text file not found at {CLEANED_TEXT_PATH}")
    full_text = None
    sentences = []
except Exception as e:
    print(f"Error loading or tokenizing text: {e}")
    full_text = None
    sentences = []

Loading input data...
Successfully loaded 220 characters from ../data/character_analysis_gendered_new.csv.
Successfully loaded text (848415 chars) from ../data/dracula_cleaned.txt.
Tokenized text into 8479 sentences.


In [18]:
# Cell 4: Helper Function to Find Mentions and Analyze Context

def get_contextual_gender(character_name, variations_str, sentences_list):
    """Find mentions, analyze pronouns in surrounding sentences, return classification."""
    if not variations_str or not isinstance(variations_str, str):
        # Handle cases where variations might be NaN or not a string
        variations = {character_name} # Fallback to using the canonical key itself
    else:
        variations = set(var.strip() for var in variations_str.split(','))
        variations.add(character_name) # Ensure canonical name is included

    # Compile regex patterns for variations (case-insensitive, whole word)
    # Using word boundaries (\b) to avoid partial matches (e.g., 'Her' in 'Hertfordshire')
    patterns = [re.compile(r'\b' + re.escape(var) + r'\b', re.IGNORECASE) for var in variations if var]

    male_evidence = 0
    female_evidence = 0
    mentions_found = 0

    for sentence in sentences_list:
        sentence_lower = sentence.lower()
        found_mention_in_sentence = False
        for pattern in patterns:
            if pattern.search(sentence): # Check if any variation exists in the sentence
                found_mention_in_sentence = True
                mentions_found += len(pattern.findall(sentence)) # Count all occurrences in sentence
                break # Stop checking patterns for this sentence once one is found

        if found_mention_in_sentence:
            # Simple context: pronouns within the *same* sentence
            words = re.findall(r'\b\w+\b', sentence_lower) # Basic word tokenization
            for word in words:
                if word in MALE_PRONOUNS:
                    male_evidence += 1
                elif word in FEMALE_PRONOUNS:
                    female_evidence += 1

    # Apply classification rules
    if male_evidence >= MIN_PRONOUN_THRESHOLD and (male_evidence - female_evidence) >= MIN_PRONOUN_DIFFERENCE:
        return GENDER_MALE, male_evidence, female_evidence, mentions_found
    elif female_evidence >= MIN_PRONOUN_THRESHOLD and (female_evidence - male_evidence) >= MIN_PRONOUN_DIFFERENCE:
        return GENDER_FEMALE, male_evidence, female_evidence, mentions_found
    else:
        return GENDER_UNKNOWN, male_evidence, female_evidence, mentions_found

print("Helper function defined.")

Helper function defined.


In [19]:
# Cell 5: Apply Contextual Classification to 'Unknown' Characters

if char_df is not None and sentences:
    print("Applying contextual gender classification to 'Unknown' characters...")
    start_time = time.time()

    # Create new columns to store contextual results
    char_df['contextual_gender'] = GENDER_UNKNOWN
    char_df['male_pronouns'] = 0
    char_df['female_pronouns'] = 0
    char_df['context_mentions'] = 0

    unknown_indices = char_df[char_df['final_gender'] == GENDER_UNKNOWN].index
    print(f"Found {len(unknown_indices)} characters initially marked as Unknown.")

    processed_count = 0
    for index in unknown_indices:
        char_name = char_df.loc[index, 'canonical_key']
        variations = char_df.loc[index, 'variations']

        # Skip overly short names or potential initials unlikely to be characters
        # Also skip names that might be places often misidentified
        # (Refine this list as needed)
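        # Note: the place names below are from Pride and Prejudice, not Dracula; adapt the list to the text being analyzed
        if (len(char_name) <= 2 and char_name.isupper()) or char_name in ['Mr.', 'Mrs.', 'Longbourn', 'Netherfield', 'Pemberley', 'Hunsford', 'Rosings']:
            continue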

        context_gender, m_count, f_count, mention_count = get_contextual_gender(
            char_name,
            variations,
            sentences
        )

        # Update the dataframe
        char_df.loc[index, 'contextual_gender'] = context_gender
        char_df.loc[index, 'male_pronouns'] = m_count
        char_df.loc[index, 'female_pronouns'] = f_count
        char_df.loc[index, 'context_mentions'] = mention_count

        processed_count += 1
        if processed_count % 20 == 0:
            print(f"  Processed {processed_count}/{len(unknown_indices)} unknown characters...")

    end_time = time.time()
    print(f"Contextual analysis finished in {end_time - start_time:.2f} seconds.")

    # Create a new 'final_gender_contextual' column
    # If original final_gender was known, keep it. If it was unknown, use the new contextual one.
    char_df['final_gender_contextual'] = char_df.apply(
        lambda row: row['final_gender'] if row['final_gender'] != GENDER_UNKNOWN else row['contextual_gender'],
        axis=1
    )

    print("\nUpdated gender classification results (showing previously Unknown):")
    context_changed = char_df[char_df.index.isin(unknown_indices)]
    print(context_changed[['canonical_key', 'final_gender', 'contextual_gender', 'male_pronouns', 'female_pronouns', 'final_gender_contextual']].head(20))

    print("\nNew Gender Distribution (Contextual):")
    print(char_df['final_gender_contextual'].value_counts())
else:
    print("Skipping contextual classification due to missing input data (CSV or text).")

Applying contextual gender classification to 'Unknown' characters...
Found 175 characters initially marked as Unknown.
  Processed 20/175 unknown characters...
  Processed 40/175 unknown characters...
  Processed 60/175 unknown characters...
  Processed 80/175 unknown characters...
  Processed 100/175 unknown characters...
  Processed 120/175 unknown characters...
  Processed 140/175 unknown characters...
  Processed 160/175 unknown characters...
Contextual analysis finished in 4.44 seconds.

Updated gender classification results (showing previously Unknown):
     canonical_key final_gender contextual_gender  male_pronouns  \
1      Van_Helsing      Unknown           Unknown              0   
2             Mina      Unknown              Male             88   
5        Professor      Unknown              Male            127   
7           Seward      Unknown              Male             71   
8           Harker      Unknown              Male             96   
9              God      Un

In [20]:
# Cell 6: Save Final Results

if char_df is not None:
    print(f"\nSaving final contextually gendered character data to {OUTPUT_CSV_PATH}...")
    try:
        # Ensure data directory exists
        os.makedirs(os.path.dirname(OUTPUT_CSV_PATH), exist_ok=True)
        # Select columns to save
        columns_to_save = ['canonical_key', 'total_mentions', 'variation_count', 'variations', 'classified_gender', 'final_gender', # Original columns
                           'contextual_gender', 'male_pronouns', 'female_pronouns', 'context_mentions', # Contextual analysis
                           'final_gender_contextual'] # Final combined
        # Reorder for clarity
        final_df_to_save = char_df[columns_to_save].sort_values('total_mentions', ascending=False)

        final_df_to_save.to_csv(OUTPUT_CSV_PATH, index=False)
        print("Results saved successfully.")
    except Exception as e:
        print(f"Error saving results: {e}")
else:
    print("\nSkipping saving results due to previous errors.")

print("\n--- Contextual Gender Classification Notebook Finished ---")


Saving final contextually gendered character data to ../data/character_analysis_gendered_contextual_new.csv...
Results saved successfully.

--- Contextual Gender Classification Notebook Finished ---
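
The next cell aggregates pronoun evidence per coreference chain instead of per sentence. The idea can be sketched without spaCy or coreferee by representing each chain as a plain list of mention strings. The function `gender_from_chains` and the toy chains are hypothetical, and this sketch leaves out the step that propagates already-known genders along a chain.

```python
from collections import Counter

MALE_PRONOUNS = {"he", "him", "his"}
FEMALE_PRONOUNS = {"she", "her", "hers"}


def gender_from_chains(name, chains):
    """Classify `name` from the pronouns in the chains that mention it."""
    counts = Counter()
    for chain in chains:
        mentions = [m.lower() for m in chain]
        if name.lower() not in mentions:
            continue  # This chain does not refer to the character
        counts["male"] += sum(m in MALE_PRONOUNS for m in mentions)
        counts["female"] += sum(m in FEMALE_PRONOUNS for m in mentions)
    if counts["male"] > counts["female"]:
        return "Male"
    if counts["female"] > counts["male"]:
        return "Female"
    return "Unknown"


# Hypothetical chains: each inner list holds the mention texts of one entity.
toy_chains = [
    ["Renfield", "he", "his", "him"],
    ["Mina", "she", "her"],
    ["the castle", "it"],
]
print(gender_from_chains("Renfield", toy_chains))  # Male
print(gender_from_chains("Mina", toy_chains))      # Female
print(gender_from_chains("Quincey", toy_chains))   # Unknown
```

Because only pronouns that share a chain with the character are counted, a pronoun that refers to someone else in the same sentence no longer contributes evidence.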


In [21]:
import spacy
import coreferee # Import coreferee
import pandas as pd

# --- Configuration ---
GENDERED_CONTEXT_CSV_PATH = "../data/character_analysis_gendered_contextual_new.csv" # Input from Notebook 02b
CLEANED_TEXT_PATH = "../data/dracula_cleaned.txt" # Input from Notebook 00
OUTPUT_CSV_PATH = "../data/character_analysis_gendered_coref_new_1.csv"
GENDER_FEMALE = "Female"
GENDER_MALE = "Male"
GENDER_UNKNOWN = "Unknown"
MALE_PRONOUNS = {'he', 'him', 'his'}
FEMALE_PRONOUNS = {'she', 'her', 'hers'}

# --- Load spaCy model and add coreferee ---
print("Loading spaCy model and adding coreferee...")

# Using 'en_core_web_lg' because 'en_core_web_trf' caused memory issues
nlp = spacy.load('en_core_web_lg')
# Add the coreferee pipe
# Coreferee automatically initializes when added if needed.
nlp.add_pipe('coreferee')
print("Pipeline:", nlp.pipe_names)

# --- Load Data ---
print(f"Loading data from {GENDERED_CONTEXT_CSV_PATH}...")
try:
    char_df = pd.read_csv(GENDERED_CONTEXT_CSV_PATH)
    print(f"Loaded {len(char_df)} characters.")
except Exception as e:
    print(f"Error loading CSV: {e}")
    char_df = None

print(f"Loading text from {CLEANED_TEXT_PATH}...")
try:
    with open(CLEANED_TEXT_PATH, 'r', encoding='utf-8') as f:
        full_text = f.read()
    print(f"Loaded text ({len(full_text)} chars).")
except Exception as e:
    print(f"Error loading text: {e}")
    full_text = None

# --- Process the full text ---
doc = None
if full_text:
    print("Processing text with spaCy and coreferee (this can take time)...")
    # Increase max_length if your text is very long
    # nlp.max_length = len(full_text) + 100 # Consider adjusting if needed, but lg model is less demanding
    doc = nlp(full_text)
    print("Text processing complete.")
    # --- Access Coreference Chains ---
    if doc._.coref_chains:
        print(f"Found {len(doc._.coref_chains)} coreference chains.")
        # Example: print the chains
        # doc._.coref_chains.print() # coreferee has a built-in print method
    else:
        print("No coreference chains found by coreferee.")

# --- Apply Coref Results to Gender Classification (Conceptual) ---
if char_df is not None and doc is not None and doc._.coref_chains:
    print("Applying coreference results to Unknown characters...")
    # Create a map from mention spans (start_token_index) to their chain index
    mention_to_chain_index = {}
    for chain_index, chain in enumerate(doc._.coref_chains):
        for mention in chain:
            # A mention in coreferee can be indexed like a list of token indices
            if mention:  # Guard against empty mentions (should not happen)
                start_token_index = mention[0]
                mention_to_chain_index[start_token_index] = chain_index

    # Create a map to store aggregated gender evidence per chain
    chain_gender_evidence = {i: {'male': 0, 'female': 0, 'known_gender': GENDER_UNKNOWN} for i in range(len(doc._.coref_chains))}


    known_char_gender = {}
    for index, row in char_df[char_df['final_gender_contextual'] != GENDER_UNKNOWN].iterrows():
        gender = row['final_gender_contextual']
        variations = set(var.strip().lower() for var in str(row['variations']).split(','))
        variations.add(row['canonical_key'].lower())
        for var in variations:
            if var:
                known_char_gender[var] = gender


    # --- Pass 1: Collect gender evidence for each coreference chain ---
    for chain_index, chain in enumerate(doc._.coref_chains):
        for mention in chain:
            if not mention:
                continue  # Skip empty mentions
            # Build the span from the mention's first token through its last token
            mention_span = doc[mention[0]:mention[-1] + 1]
            mention_text = mention_span.text.lower()

            # Check against known gendered characters
            if mention_text in known_char_gender and chain_gender_evidence[chain_index]['known_gender'] == GENDER_UNKNOWN:
                chain_gender_evidence[chain_index]['known_gender'] = known_char_gender[mention_text]
            # You might want more robust logic if a chain contains conflicting known characters

            # Count pronouns (check the text of the mention span)
            if mention_text in MALE_PRONOUNS:
                chain_gender_evidence[chain_index]['male'] += 1
            if mention_text in FEMALE_PRONOUNS:
                chain_gender_evidence[chain_index]['female'] += 1

    # --- Pass 2: Classify 'Unknown' characters based on their chain's evidence ---
    char_df['coref_gender'] = char_df['final_gender_contextual'] # Start with previous best guess
    unknown_indices = char_df[char_df['coref_gender'] == GENDER_UNKNOWN].index

    for index in unknown_indices:
        char_name = char_df.loc[index, 'canonical_key']
        # Ensure variations are lowercase for matching
        variations = set(var.strip().lower() for var in str(char_df.loc[index, 'variations']).split(','))
        variations.add(char_name.lower())
        variations.discard('') # Remove empty strings if any

        # Find mentions of this character in the doc
        found_chain_indices = set()
        # This part is tricky: Need to map character name back to mentions found by coreferee
        # A simple approach: iterate through all mentions in all chains
        for chain_index, chain in enumerate(doc._.coref_chains):
            for mention in chain:
                if not mention:
                    continue
                # Build the span from the mention's first token through its last token
                mention_span = doc[mention[0]:mention[-1] + 1]
                # Match against lowercase variations
                if mention_span.text.lower() in variations:
                    found_chain_indices.add(chain_index)
                    # Optimization: once the character is found in a chain, the remaining mentions in that chain need not be checked
                    # break  # Uncomment if you only care whether the character appears *anywhere* in the chain

        # Aggregate evidence from all chains this character belongs to
        final_male = 0
        final_female = 0
        final_known = GENDER_UNKNOWN
        # Use a simple majority for known gender if conflicts arise
        known_genders_found = []

        for chain_idx in found_chain_indices:
            evidence = chain_gender_evidence[chain_idx]
            final_male += evidence['male']
            final_female += evidence['female']
            if evidence['known_gender'] != GENDER_UNKNOWN:
                known_genders_found.append(evidence['known_gender'])

        # Determine final known gender (simple majority or fallback)
        if known_genders_found:
            from collections import Counter
            gender_counts = Counter(known_genders_found)
            # If one gender is clearly dominant, use it. Otherwise, could remain Unknown or use pronoun counts.
            most_common = gender_counts.most_common(1)
            if most_common:
                # Simple approach: take the most common known gender found across chains
                final_known = most_common[0][0]
            # More complex logic could be added here for tie-breaking or ambiguity


        # Apply classification logic based on aggregated evidence
        new_gender = GENDER_UNKNOWN
        if final_known != GENDER_UNKNOWN:
            new_gender = final_known
        # Add thresholds similar to the previous notebook?
        elif final_male > final_female:  # Simple comparison for now
            new_gender = GENDER_MALE
        elif final_female > final_male:
            new_gender = GENDER_FEMALE
        # If counts are equal and no known gender, stays Unknown

        char_df.loc[index, 'coref_gender'] = new_gender

    # --- Display/Save Results ---
    print("\nCharacters re-classified using Coreference:")
    # Show changes...
    changes = char_df[char_df['final_gender_contextual'] != char_df['coref_gender']]
    print(changes[['canonical_key', 'final_gender_contextual', 'coref_gender']].head(20))
    print(f"\n{len(changes)} characters changed classification based on coreference.")
    print("\nNew Gender Distribution (Coreference):")
    print(char_df['coref_gender'].value_counts())

    # Save results - Ensure OUTPUT_CSV_PATH is defined
    if OUTPUT_CSV_PATH:
        print(f"\nSaving final coreference-based gendered character data to {OUTPUT_CSV_PATH}...")
        try:
            import os
            os.makedirs(os.path.dirname(OUTPUT_CSV_PATH), exist_ok=True)
            # Decide which columns to save
            output_columns = ['canonical_key', 'total_mentions', 'variations',
                              'final_gender_contextual', # Gender after context
                              'coref_gender'] # Gender after coref
            # Add other relevant columns as needed
            final_coref_df = char_df[output_columns].sort_values('total_mentions', ascending=False)
            final_coref_df.to_csv(OUTPUT_CSV_PATH, index=False)
            print("Results saved successfully.")
        except Exception as e:
            print(f"Error saving results: {e}")


else:
    print("Skipping coreference application due to missing data or coref chains.")

print("\n--- Coreference Gender Classification Attempt Finished ---")


Loading spaCy model and adding coreferee...
Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'coreferee']
Loading data from ../data/character_analysis_gendered_contextual_new.csv...
Loaded 220 characters.
Loading text from ../data/dracula_cleaned.txt...
Loaded text (848415 chars).
Processing text with spaCy and coreferee (this can take time)...
Text processing complete.
Found 3742 coreference chains.
Applying coreference results to Unknown characters...
\nCharacters re-classified using Coreference:
    canonical_key final_gender_contextual coref_gender
85         Hamlet                 Unknown         Male
86    Shakespeare                 Unknown         Male
114         Byron                 Unknown         Male
173    Spencelagh                 Unknown         Male
188        Caffyn                 Unknown         Male
207      Disraeli                 Unknown         Male
215       Olgaren                 Unknown         Male
219     Bistritza      