# Engineering the Bechdel Test with a Coreference Resolution Model

- After running the coreference resolution model (Baruah and Narayanan, 2023) via `coreference_resolution_predict.py` to identify character mentions in each script, we apply a set of engineered rules to score the Bechdel Test.

    - We first identify which characters are women: a character cluster from the coreference model is labeled a woman when its woman-identifying pronouns outnumber its man-identifying ones.
    - We then use a set of rules to capture the beginning and end of conversations within scenes. See the function `extract_women_conversations_with_target` for details.
    - We ran numerous experiments to test conversation detection. Examining the negative predictions showed that many missed cases came from dialogue lines being assigned to the wrong conversation: the rules often failed to capture a conversation, so the conversations passed to the test were misrepresented.
    
        - We therefore tested several other conversation detection approaches: unsupervised embedding clustering, directed graphs, generative AI classification, question-answer classification, and pre-training conversation span detection models on the Cornell dataset.
        
        - We did not have the time or resources to explore these options thoroughly; they remain a point of emphasis for the project moving forward.
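
The rules in this notebook operate on script lines that start with a one-letter tag followed by `: `. As a minimal sketch, assuming the tag meanings inferred from the functions in this notebook (`S` scene header, `N` narration, `C` character cue, `D` dialogue, `E` parenthetical direction), the hypothetical helper `parse_tagged_line` splits a line into its tag and content. The toy scene is invented for illustration.

```python
# Hypothetical helper for illustration; tag meanings are inferred from this notebook's code.
TAG_MEANINGS = {
    "S": "scene header",
    "N": "narration",
    "C": "character cue",
    "D": "dialogue",
    "E": "parenthetical direction",
}

def parse_tagged_line(line):
    """Split 'X: content' into (tag, content); return (None, line) if the line is untagged."""
    if len(line) >= 3 and line[1:3] == ": " and line[0] in TAG_MEANINGS:
        return line[0], line[3:].strip()
    return None, line.strip()

# Toy scene, not taken from any real script
toy_scene = [
    "S: INT . KITCHEN - DAY",
    "C: MICHELLE",
    "D: Come on !",
    "C: SANDY",
    "D: OK , but you can ' t laugh .",
]
parsed = [parse_tagged_line(line) for line in toy_scene]
for tag, content in parsed:
    print(f"{TAG_MEANINGS[tag]:>15} | {content}")
```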

## Bechdel Test Functions

In [9]:
import os
import json
from multiprocessing import Pool
from movie_coref.preprocess import preprocess_scripts as preprocess
from movie_coref.movie_coref import MovieCoreference

In [10]:
def extract_character_names(script_lines):
    """
    Extracts a set of unique character names from the script lines based on strict 'C: ' tags,
    removing spaces within names for consistent formatting.

    Args:
        script_lines: List of script lines from the file.

    Returns:
        A set of unique character names with spaces removed.
    """
    character_names = set()

    for line in script_lines:
        # Ensure the line starts with 'C: ' exactly
        if line[:3] == 'C: ':
            # Extract the name after 'C: ' and normalize it
            name = line[3:].strip().replace(" ", "")  # Remove spaces within names
            if name:  # Avoid adding empty or invalid names
                character_names.add(name)
    
    return character_names

def identify_characters(tokens, clusters, character_names):
    """
    Associates characters with gender-specific pronouns, counting occurrences and handling ties.
    Combines tokens within spans to ensure proper entity recognition.
    
    Args:
        tokens: List of all tokens in the script.
        clusters: Coreference resolution clusters.
        character_names: Set of known character names from the script.

    Returns:
        Dictionary mapping entity IDs to their associated name and pronoun:
        {entity_id: {'ID_NAME': <character_name>, 'PRONOUN': <gender>}}.
    """
    character_pronouns = {}

    # Expanded pronouns and identifiers for gender
    men_pronouns = {'he', 'his', 'him', 'dad', 'father', 'boy', 'man'}
    women_pronouns = {'she', 'her', 'hers', 'mom', 'mother', 'girl', 'woman'}

    for entity_id, spans in enumerate(clusters):
        # Extract and combine tokens for each entity span
        entity_tokens = [
            "".join(tokens[start:end + 1])  # Combine tokens into one string for each span
            for start, end in spans
        ]
        
        # Flatten all spans into one list for analysis
        flattened_tokens = [token.strip() for token in entity_tokens]

        # Identify the name (capitalized token that matches known character names)
        entity_name = next((token for token in flattened_tokens if token in character_names), f"Entity_{entity_id}")

        # Count occurrences of men and women identifiers
        men_count = sum(1 for token in flattened_tokens if token.lower() in men_pronouns)
        women_count = sum(1 for token in flattened_tokens if token.lower() in women_pronouns)

        # Determine gender based on counts
        if men_count > women_count:
            gender = "man"
        elif women_count > men_count:
            gender = "woman"
        else:
            gender = "non-binary"

        # Add to the dictionary with entity_id as the key
        character_pronouns[entity_id] = {
            'ID_NAME': entity_name,
            'PRONOUN': gender
        }

    return character_pronouns
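
The gender assignment in `identify_characters` is a majority vote over gendered terms within a cluster. The sketch below reproduces that counting rule on toy mentions; `vote_gender` and the toy inputs are hypothetical and only illustrate the rule. Two properties are worth keeping in mind: ties and clusters with no gendered mention both fall into the `"non-binary"` label, and because span tokens are joined without spaces, only single-token mentions such as `she` can match the term sets.

```python
# Illustrative sketch only; mirrors the counting rule in identify_characters on toy data.
MEN_TERMS = {'he', 'his', 'him', 'dad', 'father', 'boy', 'man'}
WOMEN_TERMS = {'she', 'her', 'hers', 'mom', 'mother', 'girl', 'woman'}

def vote_gender(mentions):
    """Label a cluster of mention strings by majority vote over gendered terms."""
    lowered = [mention.strip().lower() for mention in mentions]
    men = sum(mention in MEN_TERMS for mention in lowered)
    women = sum(mention in WOMEN_TERMS for mention in lowered)
    if men > women:
        return "man"
    if women > men:
        return "woman"
    return "non-binary"  # ties and clusters with no gendered mention land here

print(vote_gender(["SANDY", "she", "her", "he"]))  # woman (2 vs 1)
print(vote_gender(["ADAM", "him"]))                # man
print(vote_gender(["PARTY-GOERS"]))                # non-binary (no gendered mention)
print(vote_gender(["thegirl"]))                    # non-binary (joined span does not match)
```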

In [11]:
def reconstruct_script(tokens, parse_tags):
    """
    Reconstructs the script by grouping tokens based on their parse tags.

    Args:
        tokens: List of tokens from the script (e.g., movie_data[0]['tokens']).
        parse_tags: List of parse tags corresponding to the tokens (e.g., movie_data[0]['parse']).

    Returns:
        A list of reconstructed lines as strings, each grouped by its parse tag.
    """
    reconstructed_script = []
    current_line = []
    current_tag = parse_tags[0]  # Start with the first tag

    for token, tag in zip(tokens, parse_tags):
        if tag == current_tag:  # Continue grouping tokens with the same tag
            current_line.append(token)
        else:  # Tag changed, finish the current line and start a new one
            reconstructed_script.append(f"{current_tag}: {' '.join(current_line)}")
            current_line = [token]  # Start a new line with the current token
            current_tag = tag  # Update the tag

    # Append the last line after looping
    if current_line:
        reconstructed_script.append(f"{current_tag}: {' '.join(current_line)}")

    return reconstructed_script

def create_secondary_script(tokens, clusters, character_pronouns):
    """
    Creates a secondary script where every token inside a coreference span is replaced by the
    character name of its cluster. The token count is unchanged, so the result can be passed to
    `reconstruct_script` together with the original parse tags.

    Args:
        tokens: List of all tokens in the script.
        clusters: Coreference resolution clusters.
        character_pronouns: Dictionary mapping entity IDs to {'ID_NAME': <name>, 'PRONOUN': <gender>}.

    Returns:
        List of tokens with coreferent mentions replaced by character names.
    """
    # Map each token index to a character name using character_pronouns
    token_to_character_map = {}

    # Iterate over clusters and assign character names from character_pronouns
    for entity_id, spans in enumerate(clusters):
        character_name = character_pronouns[entity_id]['ID_NAME']  # Get character name from character_pronouns
        for start, end in spans:
            for idx in range(start, end + 1):  # Map all tokens in the span to the character name
                token_to_character_map[idx] = character_name

    # Create the secondary script
    secondary_script = [
        token_to_character_map.get(idx, token)  # Replace with character name if it exists
        for idx, token in enumerate(tokens)
    ]

    return secondary_script
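
The two steps in this cell, replacing coreferent mentions with character names and regrouping tokens into tagged lines, can be sketched on toy data as follows. `replace_spans` and `group_tokens_by_tag` are hypothetical stand-ins written for this example (the grouping uses `itertools.groupby` rather than the explicit loop above), and the tokens, tags, and cluster are invented. Since replacement keeps one output token per input token, the parse tags stay aligned.

```python
from itertools import groupby

def replace_spans(tokens, clusters, names):
    """Replace every token inside a cluster span (inclusive end) with that cluster's name."""
    index_to_name = {}
    for name, spans in zip(names, clusters):
        for start, end in spans:
            for idx in range(start, end + 1):
                index_to_name[idx] = name
    return [index_to_name.get(idx, token) for idx, token in enumerate(tokens)]

def group_tokens_by_tag(tokens, parse_tags):
    """Join runs of consecutive tokens that share a parse tag into 'TAG: text' lines."""
    lines = []
    for tag, run in groupby(zip(tokens, parse_tags), key=lambda pair: pair[1]):
        lines.append(f"{tag}: {' '.join(token for token, _ in run)}")
    return lines

# Toy data: "SANDY" (index 0) and "you" (index 7) belong to the same cluster
toy_tokens = ["SANDY", "Come", "on", "!", "MICHELLE", "OK", ",", "you", "win"]
toy_tags = ["C", "D", "D", "D", "C", "D", "D", "D", "D"]
toy_clusters = [[(0, 0), (7, 7)]]

replaced = replace_spans(toy_tokens, toy_clusters, ["SANDY"])
toy_lines = group_tokens_by_tag(replaced, toy_tags)
print(toy_lines)
# ['C: SANDY', 'D: Come on !', 'C: MICHELLE', 'D: OK , SANDY win']
```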

In [12]:
def split_into_scenes(script_lines):
    """Splits script lines into scenes based on the 'S' tag."""
    scenes = []
    current_scene = []
    
    for line in script_lines:
        if line.startswith('S:'):
            if current_scene:  # Save the previous scene
                scenes.append(current_scene)
            current_scene = [line]  # Start a new scene
        else:
            current_scene.append(line)
    
    if current_scene:  # Add the last scene
        scenes.append(current_scene)
    
    return scenes

# def extract_conversations(scene):
#     """Extracts conversations (character-dialogue pairs) from a scene."""
#     conversations = []
#     current_character = None
    
#     for line in scene:
#         if line.startswith('C:'):  # Character tag
#             current_character = line[3:].strip()
#         elif line.startswith('D:') and current_character:  # Dialogue tag
#             conversations.append((current_character, line[3:].strip()))
    
#     return conversations

In [13]:
def extract_women_conversations_with_target(scene, character_pronouns, character_names):
    """
    Extracts valid woman-to-woman conversations from a scene, considering explicit 'E' tags
    and dialogue sequences directed to other characters, using a rule-based approach.

    Args:
        scene: List of lines from the script, including 'C', 'E', and 'D' tags.
        character_pronouns: Dictionary mapping entity IDs to {'ID_NAME': name, 'PRONOUN': gender}.
        character_names: Set of known character names.

    Returns:
        List of valid woman-to-woman conversation blocks.
    """
    women_characters = {info['ID_NAME'] for info in character_pronouns.values() if info['PRONOUN'] == "woman"}
    women_conversations = []
    current_conversation = []
    current_speaker = None  # Tracks the current C tag speaker
    previous_speaker = None  # Tracks the last C tag speaker before the current one
    directed_to = None  # Tracks the target of the dialogue from 'E' tags
    previous_directed_to = None  # Tracks the previous directed_to value
    dialogue_buffer = []  # Temporarily stores dialogue before determining the target

    for i, line in enumerate(scene):
        tag, content = line[:1], line[3:].strip()

        if tag == 'C':  # Character tag
            previous_speaker = current_speaker  # Update previous speaker
            current_speaker = content  # Update current speaker
            # If the conversation breaks, finalize the current conversation
            if current_conversation and not (
                current_speaker in women_characters and directed_to in women_characters
            ):
                women_conversations.append(current_conversation)
                current_conversation = []
            dialogue_buffer = []  # Reset dialogue buffer
            directed_to = None  # Reset directed_to for each new speaker
            

        elif tag == 'E':  # Explicit direction tag
            # Extract the directed-to character name
            words = content.replace("(", "").replace(")", "").split()
            directed_to = next((word.upper() for word in words if word.upper() in character_names), None)

        elif tag == 'D':  # Dialogue tag
            dialogue_buffer.append(content)  # Temporarily store dialogue
            
            # Resolve `directed_to` using the context
            if not directed_to:
                for j in range(i + 1, len(scene)):
                    next_tag, next_content = scene[j][:1], scene[j][3:].strip()
                    if next_tag == 'C':  # Found a C tag
                        directed_to = next_content
                        break
                    elif next_tag != 'D':  # Found a non-D tag
                        directed_to = previous_speaker  # Default to previous_speaker
                        break
                else:  # If no valid tag found, default to None
                    directed_to = None

            # Add dialogue to the conversation if `directed_to` resolves and matches women criteria
            if current_speaker in women_characters and current_speaker != directed_to and directed_to in women_characters and previous_directed_to is None:
                for dialogue in dialogue_buffer:
                    current_conversation.append((current_speaker, directed_to, dialogue))
                dialogue_buffer = []  # Clear buffer after appending

            elif current_speaker == previous_directed_to and directed_to == previous_speaker and current_speaker != directed_to and current_speaker in women_characters and directed_to in women_characters:
                for dialogue in dialogue_buffer:
                    current_conversation.append((current_speaker, directed_to, dialogue))
                dialogue_buffer = []  # Clear buffer after appending

        else:  # Non-conversation tags (e.g., 'N')
            # If a non-conversation tag appears, finalize the current conversation
            if current_conversation and current_speaker != directed_to:
                women_conversations.append(current_conversation)
                current_conversation = []
            dialogue_buffer = []  # Clear dialogue buffer
            directed_to = None  # Reset directed_to

        previous_directed_to = directed_to
    # Append the last conversation if valid
    if current_conversation:
        women_conversations.append(current_conversation)

    return women_conversations

def no_man_conversation(women_conversations, character_pronouns):
    """
    Checks if woman-to-woman conversations discuss something other than a man.

    Args:
        women_conversations: List of woman-to-woman conversation blocks.
        character_pronouns: Dictionary mapping entity IDs to {'ID_NAME': name, 'PRONOUN': gender}.

    Returns:
        List of booleans, one per conversation: True if the conversation mentions no
        man character and therefore passes Criterion 3.
    """
    men_characters = {info['ID_NAME'] for info in character_pronouns.values() if info['PRONOUN'] == "man"}

    results = []
    for conversation in women_conversations:
        # Flatten all dialogue in the conversation
        dialogue_text = " ".join(dialogue for _, _, dialogue in conversation)
        # Check if any men character is mentioned in the dialogue
        mentions_men = any(men_char in dialogue_text for men_char in men_characters)
        results.append(not mentions_men)

    return results

def score_scene(scene, character_pronouns, character_names):
    """Scores a scene on the Bechdel Test criteria."""
    # Criterion 1: Are there at least two women?
    # Note: this counts women across the whole script (character_pronouns is movie-level),
    # so the value is the same for every scene.
    women_count = sum(1 for info in character_pronouns.values() if info['PRONOUN'] == "woman")
    criterion_1 = women_count > 1

    women_conversations = extract_women_conversations_with_target(scene, character_pronouns, character_names)

    # Criterion 2: Do two women talk to each other?
    # True if the scene has at least one woman-to-woman conversation.
    criterion_2 = len(women_conversations) > 0

    # Criterion 3: Do they talk about something other than a man?
    criterion_3 = sum(no_man_conversation(women_conversations, character_pronouns)) > 0

    return criterion_1, criterion_2, criterion_3
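
`no_man_conversation` tests for men's names with a plain substring check, so a short name can match inside a longer word. The sketch below shows one possible alternative, whole-word matching with `re`; it is not what the notebook does. `mentions_any_name` and the toy names are hypothetical, and the sketch assumes names are made of word characters (letters, digits, underscore).

```python
import re

def mentions_any_name(text, names):
    """Return True if any name occurs in text as a whole word (case-sensitive)."""
    for name in names:
        if re.search(rf"\b{re.escape(name)}\b", text):
            return True
    return False

# Toy names and dialogue, invented for illustration
men_names = {"AL", "ADAM"}
print("AL" in "ALSO , we should go .")                        # True: substring match
print(mentions_any_name("ALSO , we should go .", men_names))  # False: whole-word match
print(mentions_any_name("Did ADAM call ?", men_names))        # True
```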

### Test Single Case

In [14]:
imdb_movie_id = 1231587
json_file_path = '/Users/alecnaidoo/Downloads/MIDS/DATASCI_266_NLP_with_DL/W266_Project/scripts/'+str(imdb_movie_id)+'/'+str(imdb_movie_id)+'_movie_data.json'
COMBINED_FILE = '/Users/alecnaidoo/Downloads/MIDS/DATASCI_266_NLP_with_DL/W266_Project/scripts/'+str(imdb_movie_id)+'/'+str(imdb_movie_id)+'_combined_norm.txt'

with open(json_file_path, 'r', encoding='utf-8') as f:
    movie_data = json.load(f)

with open(COMBINED_FILE, "r", encoding="utf-8") as fr:
    combined_lines = fr.read().splitlines()  # Use splitlines() to preserve all lines as-is
            
tokens = movie_data[0]['token']
parse_tags = movie_data[0]['parse']
clusters = movie_data[0]['clusters']
character_names = extract_character_names(combined_lines)
character_pronouns = identify_characters(tokens, clusters, character_names)

secondary_script = create_secondary_script(tokens, clusters, character_pronouns)
secondary_script

reconstructed = reconstruct_script(tokens, parse_tags)
reconstructed_scenes = split_into_scenes(reconstructed)

temp_script = reconstruct_script(secondary_script, parse_tags)
scenes = split_into_scenes(temp_script)

In [15]:
# Initialize total scores and a dictionary to store scene scores
s_1 = 0
s_2 = 0
s_3 = 0
scene_scores = {}

# Iterate through scenes and score each one
for scene_id, scene in enumerate(scenes):
    score_1, score_2, score_3 = score_scene(scene, character_pronouns, character_names)
    
    # Add scores to totals
    s_1 += score_1
    s_2 += score_2
    s_3 += score_3

    # Store scores in the dictionary
    scene_scores[scene_id] = {
        "Score 1": score_1,
        "Score 2": score_2,
        "Score 3": score_3
    }

# Print total scores
print(f"Total S1: {s_1}")
print(f"Total S2: {s_2}")
print(f"Total S3: {s_3}")

# Optional: Print the scores dictionary for analysis
print("Scene Scores:")
for scene_id, scores in scene_scores.items():
    print(f"Scene {scene_id}: {scores}")

Total S1: 91
Total S2: 1
Total S3: 1
Scene Scores:
Scene 0: {'Score 1': True, 'Score 2': False, 'Score 3': False}
Scene 1: {'Score 1': True, 'Score 2': False, 'Score 3': False}
Scene 2: {'Score 1': True, 'Score 2': False, 'Score 3': False}
Scene 3: {'Score 1': True, 'Score 2': False, 'Score 3': False}
Scene 4: {'Score 1': True, 'Score 2': False, 'Score 3': False}
Scene 5: {'Score 1': True, 'Score 2': False, 'Score 3': False}
Scene 6: {'Score 1': True, 'Score 2': False, 'Score 3': False}
Scene 7: {'Score 1': True, 'Score 2': False, 'Score 3': False}
Scene 8: {'Score 1': True, 'Score 2': False, 'Score 3': False}
Scene 9: {'Score 1': True, 'Score 2': False, 'Score 3': False}
Scene 10: {'Score 1': True, 'Score 2': False, 'Score 3': False}
Scene 11: {'Score 1': True, 'Score 2': False, 'Score 3': False}
Scene 12: {'Score 1': True, 'Score 2': False, 'Score 3': False}
Scene 13: {'Score 1': True, 'Score 2': False, 'Score 3': False}
Scene 14: {'Score 1': True, 'Score 2': False, 'Score 3': False}

In [16]:
# Find scene_ids that pass Test 2
test_2_pass_ids = [scene_id for scene_id, scores in scene_scores.items() if scores["Score 2"] == 1]

print(f"Scene IDs that pass Test 2: {test_2_pass_ids}")

Scene IDs that pass Test 2: [44]


In [18]:
reconstructed_scenes[44]

['S: INT . HOUSE PARTY - NIGHT',
 'N: It \' s like the 80s exploded . Music , clothes , hair , attitude -- it \' s all on overdrive . In one section , PARTY - GOERS marvel at DUCK HUNT , while in another area , people make out and dance . Adam , Nick , Lou , and Jacob walk in the front door . They have updated their " looks " with 80s sweaters and other era - appropriate attire . They all look ridiculous , except for Jacob , whose youth lends him hipster appeal .',
 'C: ADAM',
 'D: This sweater makes me look like a jerkof f .',
 "N: LOU ( BREATHES DEEPLY ) It ' s good to be home . In a corner , Phil puts his ARM in a SHARK TANK . Just as the shark goes to bite , he PULLS HIS ARM OUT , unscathed . A small crowd claps . Our guys are confused and upset .",
 'C: NICK',
 'D: Was this like an 80s thing ?',
 'C: LOU',
 "D: If he doesn ' t lose that arm soon , I ' m gon na take it from him myself . With that , Lou wanders off toward another room , leering at and groping girls as he goes .",
 "

In [21]:
extract_women_conversations_with_target(scenes[44], character_pronouns, character_names)

[[('MICHELLE', 'SANDY', 'Come on ! SANDY have to !'),
  ('SANDY', 'MICHELLE', "OK , but you can ' t laugh .")]]
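
Each conversation block is a list of `(speaker, target, dialogue)` tuples. As a small sketch of how such output could be summarized, the hypothetical helper `count_turns_per_pair` counts dialogue turns per unordered pair of characters. The data below is a toy example in the same shape, not output from the pipeline.

```python
from collections import Counter

# Toy data in the same (speaker, target, dialogue) shape as the output above
toy_conversations = [
    [("MICHELLE", "SANDY", "Come on !"), ("SANDY", "MICHELLE", "OK .")],
    [("SANDY", "MICHELLE", "Ready ?")],
]

def count_turns_per_pair(conversations):
    """Count dialogue turns for each unordered speaker/target pair."""
    counts = Counter()
    for conversation in conversations:
        for speaker, target, _ in conversation:
            counts[frozenset((speaker, target))] += 1
    return counts

pair_counts = count_turns_per_pair(toy_conversations)
for pair, turns in pair_counts.items():
    print(sorted(pair), turns)  # ['MICHELLE', 'SANDY'] 3
```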

### Test Multiple Cases: Process Bechdel Test for Multiple Movies (using CorefModel Predictions)

In [24]:
import os
import json

def find_norm_files(pct=None):
    """
    Find all script, tags, and combined `_norm.txt` files within the directory structure.
    Optionally return a percentage of the files.

    Args:
        pct (float): A float between 0 and 1 specifying the percentage of files to return.
                     If None, return all files.

    Returns:
        script_files: List of paths to `_raw_norm.txt` files.
        tags_files: List of paths to `_tags_norm.txt` files.
        combined_files: List of paths to `_combined_norm.txt` files.
    """
    scripts_dir = '/Users/alecnaidoo/Downloads/MIDS/DATASCI_266_NLP_with_DL/W266_Project/scripts'
    script_files = []
    tags_files = []
    combined_files = []

    for imdb_id_folder in os.listdir(scripts_dir):
        movie_dir = os.path.join(scripts_dir, imdb_id_folder)
        if os.path.isdir(movie_dir):
            # Construct paths for each type of file
            raw_script_path = os.path.join(movie_dir, f'{imdb_id_folder}_raw_norm.txt')
            tags_path = os.path.join(movie_dir, f'{imdb_id_folder}_tags_norm.txt')
            combined_path = os.path.join(movie_dir, f'{imdb_id_folder}_combined_norm.txt')

            # Check for existence and add to the corresponding lists
            if os.path.exists(raw_script_path) and os.path.exists(tags_path):
                script_files.append(raw_script_path)
                tags_files.append(tags_path)
                if os.path.exists(combined_path):  # Combined is optional
                    combined_files.append(combined_path)
                else:
                    combined_files.append(None)  # Placeholder if combined file is missing

    # Limit files by percentage if `pct` is provided
    if pct is not None:
        if not (0 <= pct <= 1):
            raise ValueError("pct must be a float between 0 and 1.")
        limit = int(len(script_files) * pct)
        script_files = script_files[:limit]
        tags_files = tags_files[:limit]
        combined_files = combined_files[:limit]

    print(f"Found {len(script_files)} script files, {len(tags_files)} tags files, and {len(combined_files)} combined files.")
    return script_files, tags_files, combined_files

def check_progress(script_files, parse_files):
    """
    Checks the progress of processed files by looking for the existence of movie_data.json files.

    Args:
        script_files: List of script file paths.
        parse_files: List of parse file paths.

    Returns:
        A tuple containing:
        - count_processed: Number of processed files.
        - count_remaining: Number of remaining files to process.
        - remaining_script_files: List of script files yet to be processed.
        - remaining_parse_files: List of parse files yet to be processed.
        - processed_jsons: List of paths to the processed JSON files.
    """
    processed_jsons = []
    remaining_script_files = []
    remaining_parse_files = []

    for script_file, parse_file in zip(script_files, parse_files):
        movie_dir = os.path.dirname(script_file)  # Movie directory
        movie_id = os.path.basename(movie_dir)  # Extract movie ID
        output_file = os.path.join(movie_dir, f"{movie_id}_movie_data.json")  # Output JSON path

        if os.path.exists(output_file):
            processed_jsons.append(output_file)
        else:
            remaining_script_files.append(script_file)
            remaining_parse_files.append(parse_file)

    count_processed = len(processed_jsons)
    count_remaining = len(remaining_script_files)

    return count_processed, count_remaining, remaining_script_files, remaining_parse_files, processed_jsons

def load_processed_data_and_combined_lines(processed_jsons):
    """
    Loads all processed JSON files and their corresponding combined lines into lists.

    Args:
        processed_jsons: List of file paths to processed JSON files.

    Returns:
        A tuple of two lists:
        - all_movie_data: List of loaded JSON data from the files.
        - all_combined_lines: List of combined lines corresponding to the processed JSONs.
    """
    all_movie_data = []
    all_combined_lines = []

    for json_file_path in processed_jsons:
        try:
            # Load the JSON file
            with open(json_file_path, 'r', encoding='utf-8') as f:
                movie_data = json.load(f)
                all_movie_data.append(movie_data)
            
            # Derive the combined lines file path
            movie_dir = os.path.dirname(json_file_path)
            movie_id = os.path.basename(movie_dir)
            combined_file_path = os.path.join(movie_dir, f"{movie_id}_combined_norm.txt")

            # Load the combined lines
            if os.path.exists(combined_file_path):
                with open(combined_file_path, 'r', encoding='utf-8') as fr:
                    combined_lines = fr.read().splitlines()  # Preserve all lines as-is
                    all_combined_lines.append(combined_lines)

            else:
                print(f"Combined lines file not found: {combined_file_path}")
                all_combined_lines.append(None)  # Placeholder for missing files

        except Exception as e:
            print(f"Error loading {json_file_path}: {e}")
            all_movie_data.append(None)
            all_combined_lines.append(None)

    return all_movie_data, all_combined_lines

In [41]:
if __name__ == "__main__":
    # Collect the script and parse (tags) file paths for all movies
    script_files, parse_files, _ = find_norm_files()

    count_processed, count_remaining, remaining_scripts, remaining_parses, processed_jsons = check_progress(script_files, parse_files)

    print(f"Number of files processed: {count_processed}")
    print(f"Number of files remaining (typically prediction errors): {count_remaining}")
    
    all_movie_data, all_combined_lines = load_processed_data_and_combined_lines(processed_jsons)

Found 416 script files, 416 tags files, and 416 combined files.
Number of files processed: 412
Number of files remaining (typically prediction errors): 4


In [26]:
def process_movies(all_movie_data, all_combined_lines):
    """
    Process movies for the Bechdel Test by extracting coreference resolution predictions,
    scoring scenes, and storing results.

    Args:
        all_movie_data: List of loaded movie coreference JSON data.
        all_combined_lines: List of combined lines corresponding to the movies.

    Returns:
        A list of dictionaries, each containing scores and relevant information for a movie.
    """
    results = []

    for idx, (movie_data, combined_lines) in enumerate(zip(all_movie_data, all_combined_lines)):
        if movie_data is None or combined_lines is None:
            print(f"Skipping movie at index {idx}: Missing data.")
            results.append(None)
            continue

        try:
            # Extract IMDb movie ID
            imdb_movie_id = movie_data[0]['movie'].split("_")[0] if 'movie' in movie_data[0] else f"unknown_{idx}"

            # Extract necessary information
            tokens = movie_data[0]['token']
            parse_tags = movie_data[0]['parse']
            clusters = movie_data[0]['clusters']

            # Extract character names and identify pronouns
            character_names = extract_character_names(combined_lines)
            character_pronouns = identify_characters(tokens, clusters, character_names)

            # Create secondary script and reconstruct scenes
            secondary_script = create_secondary_script(tokens, clusters, character_pronouns)
            reconstructed = reconstruct_script(tokens, parse_tags)
            reconstructed_scenes = split_into_scenes(reconstructed)

            temp_script = reconstruct_script(secondary_script, parse_tags)
            scenes = split_into_scenes(temp_script)

            # Initialize scores and store scene-level scores
            s_1 = s_2 = s_3 = 0
            scene_scores = {}

            # Score each scene
            for scene_id, scene in enumerate(scenes):
                score_1, score_2, score_3 = score_scene(scene, character_pronouns, character_names)

                # Update totals
                s_1 += score_1
                s_2 += score_2
                s_3 += score_3

                # Store individual scene scores
                scene_scores[scene_id] = {"score_1": score_1, "score_2": score_2, "score_3": score_3}

            # Store results for this movie
            results.append({
                "movie_index": idx,
                "imdb_movie_id": imdb_movie_id,
                "total_scores": {"s_1": s_1, "s_2": s_2, "s_3": s_3},
                "scene_scores": scene_scores,
                "character_pronouns": character_pronouns,
                "character_names": list(character_names),
                "secondary_script": secondary_script,
                "reconstructed_scenes": reconstructed_scenes,
                "scenes": scenes,
            })

            print(f"Processed movie at index {idx} (IMDb ID: {imdb_movie_id}): S1={s_1}, S2={s_2}, S3={s_3}")

        except Exception as e:
            print(f"Error processing movie at index {idx}: {e}")
            results.append(None)

    return results


# Example usage:
if __name__ == "__main__":
    # Assume `all_movie_data` and `all_combined_lines` are preloaded
    processed_results = process_movies(all_movie_data, all_combined_lines)

    # Access results for analysis
    for idx, result in enumerate(processed_results):
        if result:
            print(f"Movie {idx} (IMDb ID: {result['imdb_movie_id']}) Total Scores: {result['total_scores']}")
            print(f"Number of Scenes: {len(result['scenes'])}")
        else:
            print(f"Movie {idx} processing failed or was skipped.")

Processed movie at index 0 (IMDb ID: 86250): S1=110, S2=1, S3=1
Processed movie at index 1 (IMDb ID: 32138): S1=1, S2=1, S3=1
Processed movie at index 2 (IMDb ID: 70379): S1=101, S2=0, S3=0
Processed movie at index 3 (IMDb ID: 120696): S1=146, S2=1, S3=1
Processed movie at index 4 (IMDb ID: 95016): S1=116, S2=4, S3=2
Processed movie at index 5 (IMDb ID: 90756): S1=231, S2=1, S3=0
Processed movie at index 6 (IMDb ID: 1013753): S1=157, S2=0, S3=0
Processed movie at index 7 (IMDb ID: 51036): S1=94, S2=0, S3=0
Processed movie at index 8 (IMDb ID: 113820): S1=124, S2=15, S3=14
Processed movie at index 9 (IMDb ID: 493464): S1=109, S2=0, S3=0
Processed movie at index 10 (IMDb ID: 119174): S1=1, S2=0, S3=0
Processed movie at index 11 (IMDb ID: 100405): S1=162, S2=14, S3=12
Processed movie at index 12 (IMDb ID: 119173): S1=184, S2=8, S3=5
Processed movie at index 13 (IMDb ID: 1655420): S1=169, S2=6, S3=3
Processed movie at index 14 (IMDb ID: 1311071): S1=167, S2=0, S3=0
Processed movie at index

### Create DataFrame with Processed Bechdel Test Scores

In [27]:
import pandas as pd
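# Skip movies that failed or were skipped (stored as None in processed_results)
df = pd.DataFrame([result for result in processed_results if result is not None])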

df['score_1'] = df.total_scores.apply(lambda x: 1 if x['s_1'] > 0 else 0)
df['score_2'] = df.total_scores.apply(lambda x: 1 if x['s_2'] > 0 else 0)
df['score_3'] = df.total_scores.apply(lambda x: 1 if x['s_3'] > 0 else 0)

scores = df.copy()

### Connect with original DF for Evaluation

In [30]:
import pandas as pd

df = pd.read_csv(r'/Users/alecnaidoo/Downloads/MIDS/DATASCI_266_NLP_with_DL/W266_Project/Table_1_Exploratory_Data_With_Scripts.csv')
df.rename(columns={'0':'imdb_movie_id'}, inplace=True)
df['imdb_movie_id'] = df['imdb_movie_id'].astype(str)

In [31]:
df.head()

Unnamed: 0,imdb_movie_id,script_date,script,bechdel_id,title,release_year,bechdel_rating,language,popularity,vote_average,...,Thriller,War,Comedy,Music,Western,Horror,Science Fiction,Action,Animation,History
0,22958,,GRAND H...,1328,Grand Hotel,1932,3,en,85.188,6.959,...,0,0,0,0,0,0,0,0,0,0
1,32138,March 1939,FADE IN -- Title:\r\n\r\nFor nearly forty year...,174,"Wizard of Oz, The",1939,3,en,81.243,7.6,...,0,0,0,0,0,0,0,0,0,0
2,33467,,Citizen Kane \r\n\r\n ...,1266,Citizen Kane,1941,1,en,331.301,8.008,...,0,0,0,0,0,0,0,0,0,0
3,113101,,"""FOUR ROOMS""\r\n\r\n ...",986,Four rooms,1995,3,en,21.231,5.829,...,0,0,1,0,0,0,0,0,0,0
4,42192,,FADE IN:\r\n\nINT. DINING HALL - SARAH SIDDONS...,139,All About Eve,1950,3,en,18.633,8.1,...,0,0,0,0,0,0,0,0,0,0


In [34]:
true_bechd = df[['imdb_movie_id', 'bechdel_rating']].copy()
true_bechd['binary_bechdel_rating'] = true_bechd['bechdel_rating'].apply(lambda x: 1 if x == 3 else 0)
true_bechd['binary_bechdel_rating2'] = true_bechd['bechdel_rating'].apply(lambda x: 1 if x == 2 else 0)

In [35]:
true_bechd

Unnamed: 0,imdb_movie_id,bechdel_rating,binary_bechdel_rating,binary_bechdel_rating2
0,22958,3,1,0
1,32138,3,1,0
2,33467,1,0,0
3,113101,3,1,0
4,42192,3,1,0
...,...,...,...,...
421,6139732,3,1,0
422,837563,3,1,0
423,4566758,3,1,0
424,11245972,3,1,0


In [36]:
scores_eval_df = scores.merge(true_bechd, on = 'imdb_movie_id', how='left')

In [38]:
from sklearn.metrics import accuracy_score, f1_score

# Ground-truth labels and rule-based predictions from the merged DataFrame
y_true = scores_eval_df['binary_bechdel_rating']
y_pred = scores_eval_df['score_3']

# Calculate accuracy
accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy}")

# Calculate F1-score (stored under a new name so the imported f1_score function is not overwritten)
f1 = f1_score(y_true, y_pred)
print(f"F1-Score: {f1}")

Accuracy: 0.7012048192771084
F1-Score: 0.6915422885572139
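
Accuracy and F1 alone do not show whether the rules fail more often by missing passing films or by over-predicting passes. The sketch below spells out the confusion counts and the precision and recall behind the F1 score in plain Python. The helper `binary_report` and the demo labels are hypothetical and illustrative only. In practice, `sklearn.metrics.confusion_matrix` and `classification_report` give the same breakdown.

```python
def binary_report(y_true, y_pred):
    """Return confusion counts plus precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted pass, truly passes
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted pass, truly fails
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted fail, truly passes
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # predicted fail, truly fails

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


# Illustrative labels only, not the project's predictions.
demo_true = [1, 1, 1, 0, 0, 1]
demo_pred = [1, 0, 1, 1, 0, 1]
report = binary_report(demo_true, demo_pred)
print(report)
```

Applied to the real data, the call would be `binary_report(y_true.tolist(), y_pred.tolist())`, assuming neither column contains missing values.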


### Analyzing Incorrect Prediction Rows

In [39]:
# Identify rows where binary_bechdel_rating does not equal score_3
mismatched_rows = scores_eval_df[scores_eval_df['binary_bechdel_rating'] != scores_eval_df['score_3']]

# Extract the IMDb movie IDs for these rows
mismatched_movie_ids = mismatched_rows['imdb_movie_id'].tolist()
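
The mismatches above mix two different kinds of error: films that pass the test but were predicted to fail (false negatives, e.g. missed conversations) and films that fail but were predicted to pass (false positives). A minimal sketch of separating them, using a hypothetical toy frame `eval_demo` in place of `scores_eval_df`:

```python
import pandas as pd

# Toy stand-in for `scores_eval_df` (illustrative values only).
eval_demo = pd.DataFrame({
    "imdb_movie_id": ["a", "b", "c", "d"],
    "binary_bechdel_rating": [1, 0, 1, 0],
    "score_3": [0, 1, 1, 0],
})

# Same mismatch filter as above, applied to the toy frame.
mismatched_demo = eval_demo[
    eval_demo["binary_bechdel_rating"] != eval_demo["score_3"]
]

# False negatives: the film passes, but the rules predicted a fail.
false_negative_ids = mismatched_demo.loc[
    mismatched_demo["score_3"] == 0, "imdb_movie_id"
].tolist()

# False positives: the film does not pass, but the rules predicted a pass.
false_positive_ids = mismatched_demo.loc[
    mismatched_demo["score_3"] == 1, "imdb_movie_id"
].tolist()

print(f"False negatives: {false_negative_ids}")
print(f"False positives: {false_positive_ids}")
```

Reviewing the two lists separately makes it easier to tell whether errors come from undetected conversations or from conversations that were wrongly accepted.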

In [40]:
# Assuming 'title' is the column containing movie titles
for _, row in df.loc[df['imdb_movie_id'].isin(mismatched_movie_ids)].iterrows():
    print(f"Title: {row['title']}, Year: {row['release_year']}, IMDb ID: {row['imdb_movie_id']}, Bechdel Rating: {row['bechdel_rating']}")

Title: Strangers on a Train, Year: 1951, IMDb ID: 44079, Bechdel Rating: 2
Title: From Here to Eternity, Year: 1953, IMDb ID: 45793, Bechdel Rating: 3
Title: Rebel Without a Cause, Year: 1955, IMDb ID: 48545, Bechdel Rating: 1
Title: Apartment, The, Year: 1960, IMDb ID: 53604, Bechdel Rating: 3
Title: Mary Poppins, Year: 1964, IMDb ID: 58331, Bechdel Rating: 3
Title: Graduate, The, Year: 1967, IMDb ID: 61722, Bechdel Rating: 2
Title: American Graffiti, Year: 1973, IMDb ID: 69704, Bechdel Rating: 2
Title: Rocky, Year: 1976, IMDb ID: 75148, Bechdel Rating: 1
Title: Jaws 2, Year: 1978, IMDb ID: 77766, Bechdel Rating: 2
Title: Alien, Year: 1979, IMDb ID: 78748, Bechdel Rating: 3
Title: Shining, The, Year: 1980, IMDb ID: 81505, Bechdel Rating: 2
Title: Body Heat, Year: 1981, IMDb ID: 82089, Bechdel Rating: 2
Title: Dragonslayer, Year: 1981, IMDb ID: 82288, Bechdel Rating: 1
Title: Fast Times at Ridgemont High, Year: 1982, IMDb ID: 83929, Bechdel Rating: 2
Title: 48 Hrs., Year: 1982, IMDb ID

In [216]:
for idx, title in enumerate(df.title):
    print(idx, title)

0 Grand Hotel
1 Wizard of Oz, The
2 Citizen Kane
3 Four rooms
4 All About Eve
5 Sunset Blvd.
6 Strangers on a Train
7 From Here to Eternity
8 White Christmas
9 Rebel Without a Cause
10 Bad Day at Black Rock
11 Searchers, The
12 Sweet Smell of Success
13 Apartment, The
14 Psycho
15 Mary Poppins
16 Graduate, The
17 Bonnie and Clyde
18 Midnight Cowboy
19 Five Easy Pieces
20 Klute
21 American Graffiti
22 Mean Streets
23 Dark Star
24 Barry Lyndon
25 Rocky
26 Heat
27 Deer Hunter, The
28 Midnight Express
29 Jaws 2
30 Alien
31 Star Trek: The Motion Picture
32 Being There
33 Shining, The
34 Raging Bull
35 Escape from New York
36 Heavy Metal
37 Body Heat
38 Dragonslayer
39 Blade Runner
40 TRON
41 Fast Times at Ridgemont High
42 48 Hrs.
43 Verdict, The
44 Scarface
45 Under Fire
46 Gremlins
47 Indiana Jones and the Temple of Doom
48 Supergirl
49 Agnes of God
50 Commando
51 Aliens
52 Highlander
53 Blue Velvet
54 Platoon
55 Princess Bride, The
56 Predator
57 Hellraiser
58 Broadcast News
59 Twins
60 

In [53]:
df.iloc[266]

0                                                             829482
script_date                                                July 2006
script                                             SUPERBAD\r\n\r...
bechdel_id                                                       651
title                                                       Superbad
release_year                                                    2007
bechdel_rating                                                     3
language                                                          en
popularity                                                    67.854
vote_average                                                   7.247
vote_count                                                      7033
overview           Two co-dependent high school seniors are force...
Drama                                                              0
Romance                                                            0
Adventure                         

In [217]:
df.iloc[341]

0                                                            1219289
script_date                                            December 2009
script                                                LIMITLESS\r...
bechdel_id                                                      2162
title                                                      Limitless
release_year                                                    2011
bechdel_rating                                                     1
language                                                          en
popularity                                                    46.325
vote_average                                                   7.192
vote_count                                                     10317
overview           A paranoia-fueled action thriller about an uns...
Drama                                                              0
Romance                                                            0
Adventure                         