### 3.3 Gender Classification - Rule-Based (`02_gender.ipynb`)

**Objective:** This notebook performs the initial, high-precision stage of gender classification. It takes the consolidated character list produced by `01_character_identification.ipynb` and attempts to assign gender based primarily on explicit titles and curated name lists. This step aims to classify unambiguous cases before potentially more complex methods (like context analysis or coreference) are applied in subsequent notebooks.

**Design Principles:**
- Prioritize high-confidence, rule-based methods for initial classification.
- Utilize common gendered titles (Mr., Mrs., Lady, etc.) as the strongest indicator.
- Leverage manually curated lists of common male and female first names relevant to the literary context.
- Establish a baseline `classified_gender` and `final_gender` (which are the same at this stage) before further refinement.

**Implementation Details:**
- **Input:** Loads the consolidated character data (`character_analysis_consolidated_nicknames.csv`) generated by Notebook 01.
- **Name Lists:** Loads external text files containing lists of common male (`male_names.txt`) and female (`female_names.txt`) first names (lowercase).
- **Title Checking:** Defines two sets (`MALE_TITLES`, `FEMALE_TITLES`) containing common gendered titles (e.g., 'mr', 'mrs', 'lady', 'colonel'). The `classify_gender` function lowercases the `canonical_key`, splits it on underscores, and checks whether the first part matches one of these titles.
- **Name List Lookup:** If no title matches, the function treats that same first part as a potential first name and checks whether it appears in exactly one of the two name lists. The comparison is case-insensitive because both the key and the name lists are lowercased. Ambiguous names found in both lists result in an "Unknown" classification.
- **Default:** If neither title nor a unique name list match occurs, the function returns "Unknown".
- **Output:** Applies this classification to each character, storing the result in a `classified_gender` column. A `final_gender` column is also created, initially mirroring `classified_gender`, ready for potential updates in later stages. The results are saved to `character_analysis_gendered_new.csv`.
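
The decision order described above (title first, then an exclusive name-list match, otherwise "Unknown") can be sketched on a few sample keys. This is a minimal illustration rather than the notebook's own code: the helper name `classify_key` and the tiny name sets are assumptions made for the example, and the title sets are abbreviated.

```python
# Illustrative sketch only: abbreviated title sets and made-up name lists.
MALE_TITLES = {"mr", "sir", "lord"}
FEMALE_TITLES = {"mrs", "miss", "lady"}
male_names = {"jonathan", "arthur", "evelyn"}    # hypothetical sample
female_names = {"lucy", "jane", "evelyn"}        # hypothetical sample


def classify_key(canonical_key):
    """Return 'Male', 'Female', or 'Unknown' for a canonical key."""
    first = canonical_key.lower().split("_")[0]
    # 1. A gendered title takes precedence over any name-list match.
    if first in MALE_TITLES:
        return "Male"
    if first in FEMALE_TITLES:
        return "Female"
    # 2. A name counts only if it appears in exactly one list.
    in_male, in_female = first in male_names, first in female_names
    if in_male != in_female:
        return "Male" if in_male else "Female"
    # 3. Ambiguous (both lists) or unseen (neither list).
    return "Unknown"


sample_keys = ["Lady_Catherine", "Mr_Darcy", "Lucy", "Evelyn", "Van_Helsing"]
results = {key: classify_key(key) for key in sample_keys}
print(results)
```

In this sketch `Evelyn` stays "Unknown" because it appears in both hypothetical lists, and `Van_Helsing` stays "Unknown" because its first part, `van`, is neither a title nor a listed name.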

**Key Classification Logic (from `classify_gender` function):**

1. **Check `canonical_key` for Title Prefix:**
   ```python
   parts = name_lower.split('_')
   title_part = parts[0]
   if title_part in MALE_TITLES: return GENDER_MALE
   if title_part in FEMALE_TITLES: return GENDER_FEMALE
   ```
2. **Note:** More advanced methods, such as context analysis using pronouns or coreference resolution, are applied in the subsequent notebook `02b_context.ipynb`.

In [19]:
# Cell 1: Import libraries
import pandas as pd
import re
import os
import time # Added for timing

print("Libraries imported.")

Libraries imported.


In [20]:
# Cell 2: Configuration
# --- Input Files ---
CHARACTER_CSV_PATH = "../data/character_analysis_consolidated_nicknames.csv"
FEMALE_NAMES_PATH = "../resources/female_names.txt"  
MALE_NAMES_PATH = "../resources/male_names.txt"      

# --- Output File ---
OUTPUT_CSV_PATH = "../data/character_analysis_gendered_new.csv"

# --- Constants ---
GENDER_FEMALE = "Female"
GENDER_MALE = "Male"
GENDER_UNKNOWN = "Unknown" 

# --- Define Titles ---
MALE_TITLES = {'mr', 'sir', 'lord', 'colonel', 'captain', 'reverend', 'dr'} # Added more common titles
FEMALE_TITLES = {'mrs', 'miss', 'ms', 'lady', 'dame', 'madam', 'madame'}

print("Configuration set.")
print(f"Character input: {CHARACTER_CSV_PATH}")
print(f"Female names: {FEMALE_NAMES_PATH}")
print(f"Male names: {MALE_NAMES_PATH}")
print(f"Output file: {OUTPUT_CSV_PATH}")

Configuration set.
Character input: ../data/character_analysis_consolidated_nicknames.csv
Female names: ../resources/female_names.txt
Male names: ../resources/male_names.txt
Output file: ../data/character_analysis_gendered_new.csv


In [21]:
# Cell 3: Load Name Lists Function
def load_name_list(filepath):
    """Loads names from a file (one name per line), converts to lowercase."""
    names = set()
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            for line in f:
                name = line.strip().lower()
                if name: # Only add non-empty lines
                    names.add(name)
        print(f"Successfully loaded {len(names)} names from {filepath}")
        return names
    except FileNotFoundError:
        print(f"Error: Name list file not found at {filepath}. Please create it or update the path.")
        return None
    except Exception as e:
        print(f"Error loading name list {filepath}: {e}")
        return None

In [None]:
# Cell 4: Load the Name Lists
print("Loading name lists...")
female_names = load_name_list(FEMALE_NAMES_PATH)
male_names = load_name_list(MALE_NAMES_PATH)

# Basic check if lists loaded
if female_names is None or male_names is None:
    print("\n*** Critical Error: Could not load one or both name lists. Stopping execution. ***")
else:
    print("Name lists loaded (or errors noted above).")
    # Print a few sample names from each list (set order is arbitrary)
    print(f"Sample female names (up to 5): {list(female_names)[:5] if female_names else 'N/A'}")
    print(f"Sample male names (up to 5): {list(male_names)[:5] if male_names else 'N/A'}")

Loading name lists...
Successfully loaded 203 names from ../resources/female_names.txt
Successfully loaded 200 names from ../resources/male_names.txt
Name lists loaded (or errors noted above).
Sample female names (up to 5): ['maria', 'augusta', 'cordelia', 'harriott', 'jane']
Sample male names (up to 5): ['bartholomew', 'timothy', 'miles', 'stephen', 'morgan']
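
Because a name found in both lists is classified as "Unknown", it can help to see how much the two lists overlap before running the classifier. The following is a minimal sketch of that check using set operations; the sample sets are hypothetical stand-ins for the contents of `female_names.txt` and `male_names.txt`.

```python
# Hypothetical samples standing in for the two loaded name sets.
female_names = {"jane", "maria", "evelyn", "leslie"}
male_names = {"stephen", "miles", "evelyn", "leslie"}

# Names present in both sets cannot be resolved by the lookup alone.
ambiguous = sorted(female_names & male_names)
# Names that belong to exactly one set are the ones the lookup can classify.
exclusive_female = female_names - male_names
exclusive_male = male_names - female_names

print(f"Ambiguous names: {ambiguous}")
print(f"Usable: {len(exclusive_female)} female, {len(exclusive_male)} male")
```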


In [23]:
# Cell 5: Gender Classification Function
def classify_gender(canonical_name, female_names_set, male_names_set):
    """Classifies gender based on title, then first name lookup."""
    if not isinstance(canonical_name, str) or not canonical_name:
        return GENDER_UNKNOWN

    name_lower = canonical_name.lower()

    # 1. Check for Titles
    # Split name potentially containing titles like 'Mr_Darcy' or 'Lady_Catherine'
    parts = name_lower.split('_')
    title_part = parts[0] # Assume title is the first part if present

    if title_part in MALE_TITLES:
        return GENDER_MALE
    if title_part in FEMALE_TITLES:
        return GENDER_FEMALE

    # 2. Check Name Lists (if lists are available)
    if female_names_set is not None and male_names_set is not None:
        # No title matched above, so treat the first part as a potential first name
        potential_first_name = parts[0]
        is_female = potential_first_name in female_names_set
        is_male = potential_first_name in male_names_set

        if is_female and not is_male:
            return GENDER_FEMALE
        if is_male and not is_female:
            return GENDER_MALE
        # Found in both lists (ambiguous) or in neither: fall through to Unknown


    # 3. Default to Unknown
    return GENDER_UNKNOWN

print("Gender classification function defined.")

Gender classification function defined.


In [24]:
# Cell 6: Load Character Data
print(f"Loading character data from {CHARACTER_CSV_PATH}...")
try:
    char_df = pd.read_csv(CHARACTER_CSV_PATH)
    print(f"Successfully loaded {len(char_df)} characters.")
    print("Columns:", char_df.columns.tolist())
    print("\nSample data (first 5 rows):")
    print(char_df.head())
except FileNotFoundError:
    print(f"Error: Character CSV file not found at {CHARACTER_CSV_PATH}")
    print("Please ensure '01_character_identification.ipynb' ran successfully.")
    char_df = None # Set df to None to prevent further errors
except Exception as e:
    print(f"Error loading character CSV: {e}")
    char_df = None

Loading character data from ../data/character_analysis_consolidated_nicknames.csv...
Successfully loaded 220 characters.
Columns: ['canonical_key', 'total_mentions', 'variation_count', 'variations']

Sample data (first 5 rows):
  canonical_key  total_mentions  variation_count  \
0          Lucy             240                1   
1   Van_Helsing             215                1   
2          Mina             197                2   
3      Jonathan             178                3   
4        Arthur             154                3   

                              variations  
0                               ['Lucy']  
1                        ['Van Helsing']  
2                       ['MINA', 'Mina']  
3  [' Jonathan', 'JONATHAN', 'Jonathan']  
4       ['Art', 'Arthur', 'Lord Arthur']  
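
The `variations` column in the sample above holds list-like values. Assuming the column arrives as plain text after the CSV round trip (the usual behaviour, though not verified in this notebook), one way to recover real lists is `ast.literal_eval`. The sketch below uses values copied from the sample rows; the helper name `parse_variations` is hypothetical.

```python
import ast


def parse_variations(cell):
    """Turn a stringified list such as "['MINA', 'Mina']" into a list of stripped strings."""
    values = ast.literal_eval(cell)
    return [v.strip() for v in values]


# Cell values copied from the sample output above.
rows = {
    "Jonathan": "[' Jonathan', 'JONATHAN', 'Jonathan']",
    "Arthur": "['Art', 'Arthur', 'Lord Arthur']",
}
parsed = {key: parse_variations(cell) for key, cell in rows.items()}
print(parsed["Arthur"])
```

Note that the variation 'Lord Arthur' carries a title that the canonical key `Arthur` does not. The rule-based step inspects only `canonical_key`, so title evidence that appears solely in `variations` is not used here.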


In [25]:
# Cell 6.5: Apply Gender Classification

if char_df is not None and female_names is not None and male_names is not None:
    print("Applying gender classification function...")
    # Apply the function to the 'canonical_key' column
    # Store the results in a new column called 'classified_gender'
    char_df['classified_gender'] = char_df['canonical_key'].apply(
        lambda name: classify_gender(name, female_names, male_names)
    )
    print("Gender classification applied. Displaying value counts:")
    print(char_df['classified_gender'].value_counts())
    print("\nSample data with classified_gender:")
    print(char_df[['canonical_key', 'classified_gender']].head())
else:
    print("Skipping gender classification due to missing DataFrame or name lists.")

Applying gender classification function...
Gender classification applied. Displaying value counts:
classified_gender
Unknown    175
Male        34
Female      11
Name: count, dtype: int64

Sample data with classified_gender:
  canonical_key classified_gender
0          Lucy            Female
1   Van_Helsing           Unknown
2          Mina           Unknown
3      Jonathan              Male
4        Arthur              Male
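
To put the value counts above in perspective, the share of characters resolved by the rule-based stage can be computed directly. This short sketch copies the counts from the output above and works out the coverage.

```python
# Counts copied from the value_counts() output above.
counts = {"Unknown": 175, "Male": 34, "Female": 11}

total = sum(counts.values())
classified = total - counts["Unknown"]
coverage = classified / total

print(f"Classified {classified} of {total} characters ({coverage:.1%})")
# prints: Classified 45 of 220 characters (20.5%)
```

Roughly four in five characters remain "Unknown" after this stage, which is consistent with its design as a high-precision first pass.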


In [26]:
# Cell 7: Create final_gender (initially mirrors classified_gender)
# Gender constants (GENDER_MALE, GENDER_FEMALE, GENDER_UNKNOWN) are defined in Cell 2.

if char_df is None:
    print("Skipping 'final_gender': character data was not loaded.")
elif 'classified_gender' in char_df.columns:
    # Replace any None/NaN with Unknown so the column is always populated
    char_df['final_gender'] = char_df['classified_gender'].fillna(GENDER_UNKNOWN)
    print("Set 'final_gender' based on 'classified_gender'.")
else:
    # 'classified_gender' is missing (classification was skipped); default to Unknown
    char_df['final_gender'] = GENDER_UNKNOWN
    print("Warning: 'classified_gender' column not found. Setting 'final_gender' to Unknown.")



Set 'final_gender' based on 'classified_gender'.
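
Keeping `classified_gender` separate from `final_gender` lets a later stage refine the result without losing the rule-based baseline. How the later notebooks actually do this is not shown here; the sketch below is one hypothetical policy (fill only the "Unknown" entries), and the records and the `context_guess` mapping are made up for illustration.

```python
# Hypothetical records mirroring the two gender columns; values are made up.
records = [
    {"canonical_key": "Mina", "classified_gender": "Unknown", "final_gender": "Unknown"},
    {"canonical_key": "Lucy", "classified_gender": "Female", "final_gender": "Female"},
]
# Hypothetical output of a later stage (for example, pronoun context).
context_guess = {"Mina": "Female", "Lucy": "Male"}

for rec in records:
    guess = context_guess.get(rec["canonical_key"])
    # Only fill gaps; a rule-based result is never overwritten in this sketch.
    if rec["final_gender"] == "Unknown" and guess is not None:
        rec["final_gender"] = guess

summary = [(r["canonical_key"], r["classified_gender"], r["final_gender"]) for r in records]
print(summary)
```

Under this policy `classified_gender` always records what the rules alone concluded, while `final_gender` holds the best available answer.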


In [27]:
# Cell 8: Save Results
if char_df is not None:
    print(f"\nSaving gendered character data to {OUTPUT_CSV_PATH}...")
    try:
        # Ensure data directory exists
        os.makedirs(os.path.dirname(OUTPUT_CSV_PATH), exist_ok=True)
        # Select columns to save 
        columns_to_save = ['canonical_key', 'total_mentions', 'variation_count', 'variations', 'classified_gender', 'final_gender']
        char_df[columns_to_save].to_csv(OUTPUT_CSV_PATH, index=False)
        print("Results saved successfully.")
    except Exception as e:
        print(f"Error saving results: {e}")
else:
    print("\nSkipping saving results due to previous errors.")

print("\n--- Gender Classification Notebook Finished ---")


Saving gendered character data to ../data/character_analysis_gendered_new.csv...
Results saved successfully.

--- Gender Classification Notebook Finished ---
