# Character Dialogue Scraping 

## Overview
We are scraping additional movie script data from the internet to enrich our dialogue dataset and get more reliable results for project-6.

## Steps
1. Identify the movies and characters we need to scrape: those that exist in the moral rating dataset but not in the movie script dataset.
2. Scrape/download the dataset directly from the internet.
3. Check whether our dataset is large enough in order to get reliable results. 


## Step 1

`structured_data_full` is a structured dataset (formatted for moral rating predictions) containing the movies and characters that exist in both the movie script and rating datasets, restricted to characters with at least 100 lines of dialogue.

In [1]:
import json
import os
import pandas as pd

In [2]:
with open('../data/structured_data.json', 'r') as f:
    # Structured data is a dictionary containing movies and characters data
    # that exist in both ratings and dialogue datasets.

    # This was processed in analysis_2.ipynb
    structured_data = json.load(f)

with open('../data/structured_data_full.json', 'r') as f:
    # This contains all the movies and characters that exist in both movies and ratings datasets
    # but only characters with at least 100 lines of dialogue. 

    # This was processed in explore_2.ipynb
    structured_data_full = json.load(f)

In [8]:
# Let's load the moral rating dataset

moral_ratings = pd.read_csv("../data/SWCPQ-Features-Survey-Dataset-November2023/scripts/aggregated-means.csv", sep = '\t')

In [9]:
moral_ratings.head()

Unnamed: 0.1,Unnamed: 0,F1,F2,F3,F4,F5,F6,F7,F8,F9,...,F491,F492,F493,F494,F495,F496,F497,F498,F499,F500
0,MLP/1,40.4,88.0,16.9,38.2,37.5,73.6,74.9,54.5,82.3,...,84.5,53.6,37.3,21.0,67.2,62.1,30.4,45.7,19.8,50.9
1,MLP/2,70.1,56.0,31.6,80.5,59.9,76.3,15.0,35.7,20.2,...,70.7,34.7,51.5,26.6,75.2,46.2,63.6,86.1,51.6,16.0
2,MLP/3,63.0,82.1,32.7,96.9,12.8,87.5,42.5,33.6,8.6,...,60.0,40.9,62.5,35.1,59.1,37.6,60.3,97.1,49.9,23.7
3,MLP/4,3.6,92.4,5.2,87.9,34.7,70.7,72.9,89.5,60.0,...,77.2,83.2,1.9,10.2,38.4,7.4,76.9,49.8,80.4,61.9
4,MLP/5,26.9,63.9,29.9,35.9,56.1,63.5,56.7,61.5,62.2,...,50.3,66.6,26.5,53.9,70.6,51.1,68.2,40.7,52.1,42.7


In [10]:
moral_ratings.shape

# We potentially will have 2000 characters to work with if we're able to use all of them

(2000, 501)

In [16]:
with open("../data/SWCPQ-Features-Survey-Dataset-November2023/scripts/subjects.json", 'r') as f:
    subjects = json.load(f)

# Extract movie names that are in the moral ratings dataset
movie_in_rating_data = [data["name"] for movie_code, data in subjects.items()]

In [96]:
movie_to_scrape = [movie for movie in movie_in_rating_data if movie not in structured_data]

In [35]:
print(movie_to_scrape)  # 244 movies to scrape

['Game of Thrones', 'Harry Potter', 'The Office', 'Friends', 'The West Wing', 'LOST', 'The Wire', 'Avatar: The Last Airbender', 'Star Trek: Deep Space Nine', 'Pride and Prejudice', 'Marvel Cinematic Universe', 'The Simpsons', "That 70's Show", 'Downton Abbey', 'Grey&apos;s Anatomy', 'Breaking Bad', 'Firefly + Serenity', 'Lord of the Rings', 'Community', 'The Walking Dead', 'The Big Bang Theory', 'True Detective', 'Parks and Recreation', 'The Hunger Games', 'Westworld', 'Casablanca', 'Battlestar Galactica', 'Mad Men', 'Pirates of the Caribbean', 'Sherlock', 'Law & Order: SVU', 'Silicon Valley', 'Twin Peaks', 'The X-Files', 'Romeo and Juliet', 'Cowboy Bebop', 'Stargate SG-1', 'Mean Girls', 'Alien', 'Robin Hood', 'Orange is the New Black', 'Friday Night Lights', 'Atlas Shrugged', 'Brooklyn Nine-Nine', 'The Room', 'Seinfeld', 'Two and Half Men', 'NCIS', 'This Is Us', 'Arrested Development', 'Ozark', 'Scrubs', 'Star Trek: The Next Generation', 'Girls', 'Prison Break', 'The Mentalist', 'Desp
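
Several names in the list above carry HTML entities (for example `Grey&apos;s Anatomy`), which can make exact and fuzzy matching against the dialogue datasets miss. A minimal sketch of decoding them with the standard library follows; `normalize_title` is a hypothetical helper, not part of the original notebook.

```python
import html


def normalize_title(title):
    """Decode HTML entities and trim whitespace (hypothetical helper)."""
    return html.unescape(title).strip()


# Raw names as they appear in the rating dataset and in movies_to_add.
raw_titles = ["Grey&apos;s Anatomy", "Ferris Bueller&apos;s", "Les Mis&#xE9;rables"]

for raw in raw_titles:
    print(f"{raw!r} -> {normalize_title(raw)!r}")
```

Lookups against `subjects` would still need the raw name unless both sides are normalized the same way.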

In [21]:
# Note that movies in new_dialogue are in format "Movie Name_Year"
# We need to extract the movie name from this format
def extract_movie_name(movie_key):
    """
    Extract the movie name from the movie key.
    
    :param movie_key: Movie key in the format "Movie Name_Year".
    :return: The movie name without the year.
    """
    return movie_key.rsplit('_', 1)[0]
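
A quick check of how `extract_movie_name` behaves on a few keys. `"Alien_1979"` comes from `new_dialogue`; `"Face_Off"` is a made-up key used only to show what happens when a title contains an underscore but no year. `strip_year_suffix` is a hypothetical stricter variant, not something the notebook uses.

```python
import re


def extract_movie_name(movie_key):
    """Same logic as the notebook cell: drop everything after the last underscore."""
    return movie_key.rsplit('_', 1)[0]


def strip_year_suffix(movie_key):
    """Hypothetical variant: only drop a trailing "_YYYY" (four digits)."""
    return re.sub(r'_\d{4}$', '', movie_key)


for key in ["Alien_1979", "Alien", "Face_Off"]:
    print(key, "->", extract_movie_name(key), "|", strip_year_suffix(key))
```

If every key in `new_dialogue` really ends in `_Year`, the two functions agree; the stricter variant only matters for keys that break that format.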

In [26]:
from difflib import SequenceMatcher

def check_if_movie_exist(movie_name, data, threshold=90, discard_year = False) -> str:
    """
    Check if there's a movie in data that is similar to the given movie name.
    
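    :param movie_name: Name of the movie to check.
    :param data: The data dictionary.
    :param threshold: Similarity threshold (0-100) to consider a match (default is 90).
    :param discard_year: If True, strip the "_Year" suffix from each key in data before comparing.
    :return: The first matching movie name if a match is found, otherwise an empty string.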
    """
    for existing_movie in data.keys():
        if discard_year:
            existing_movie = extract_movie_name(existing_movie)
        similarity = SequenceMatcher(None, movie_name.lower(), existing_movie.lower()).ratio() * 100
        if similarity >= threshold:
            return existing_movie
    return ""

In [64]:
# Let's load both old and new script datasets

with open("../data/dialogue.json", 'r') as f:
    old_dialogue = json.load(f)

with open("../data/new_dialogue.json", 'r') as f:
    new_dialogue = json.load(f)

movies_in_new_dialogue = [extract_movie_name(movie_key) for movie_key in new_dialogue.keys()]

In [139]:
# Let's now check whether any movie in movie_to_scrape already exists in old_dialogue or new_dialogue

for movie in movie_to_scrape:
    if movie in old_dialogue:
        print(f"{movie} exists in old dialogue dataset")
    elif movie in new_dialogue:
        print(f"{movie} exists in new dialogue dataset")
    else:
        similar_movie = check_if_movie_exist(movie, old_dialogue, threshold = 80)
        if similar_movie:
            print(f"{movie} is similar to {similar_movie} in old dialogue dataset")
        else:
            similar_movie = check_if_movie_exist(movie, new_dialogue, threshold=80, discard_year=True)
            if similar_movie:
                print(f"{movie} is similar to {similar_movie} in new dialogue dataset")
            
            # We don't need to print if the movie is not found
            # else:
            #     print(f"{movie} does not exist in either dataset")

The West Wing is similar to The Sting in new dialogue dataset
The Wire is similar to The Wife in new dialogue dataset
Pride and Prejudice exists in old dialogue dataset
Breaking Bad is similar to Breaking Away in old dialogue dataset
Pirates of the Caribbean exists in old dialogue dataset
Twin Peaks exists in old dialogue dataset
The X-Files is similar to The X Files in new dialogue dataset
Romeo and Juliet is similar to Romeo + Juliet in new dialogue dataset
The Room is similar to The Prom in new dialogue dataset
This Is Us is similar to This is 40 in old dialogue dataset
Star Trek: The Next Generation is similar to Star Trek: Generations in old dialogue dataset
Desperate Housewives is similar to Desperate Hours in new dialogue dataset
Mulan exists in old dialogue dataset
Killing Eve is similar to Killing Zoe in old dialogue dataset
Tommorrow Never Dies is similar to Tomorrow Never Dies in old dialogue dataset
Terminator 2: Judgement Day exists in old dialogue dataset
Mindhunter is si
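
Many of the "similar" pairs above are different titles that happen to share most of their characters, which is why the results were checked manually. The sketch below compares one such false positive with a genuine spelling variant, using the same `SequenceMatcher` scoring as `check_if_movie_exist`; `similarity_percent` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def similarity_percent(a, b):
    """Case-insensitive SequenceMatcher ratio scaled to 0-100."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100


pairs = [
    ("The Wire", "The Wife"),        # different titles, one letter apart
    ("The X-Files", "The X Files"),  # same title, punctuation differs
]

for a, b in pairs:
    print(f"{a!r} vs {b!r}: {similarity_percent(a, b):.1f}")
```

For these two pairs, the false positive falls between 80 and 90 while the spelling variant scores above 90, so a higher threshold would separate them. Other true matches may score lower, so this does not remove the need for manual review.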

In [148]:
for subject, data in subjects.items():
    if data["name"] == "After Life":
        print(data.values())

dict_values([['Tony Johnson', 'AFTL/1.jpg'], ['Matt Braden', 'AFTL/2.jpg'], ['Lenny', 'AFTL/3.jpg'], ['Kath', 'AFTL/4.jpg'], ['Lisa Johnson', 'AFTL/5.jpg'], ['Emma', 'AFTL/6.jpg'], 6, 'After Life'])


30 Potential additional movies (5 are potentially discarded)

After manual checking, we're taking 27 out of these 30 movies (all except for Death Note, Romeo and Juliet, and Psych).



In [190]:
"The Notebook" in new_dialogue

False

In [151]:
# Below are movies we want to add to the structured_data

movies_to_add = [
    ["Pride and Prejudice", "old"],
    ["Pirates of the Caribbean", "old"],
    ["Twin Peaks", "old"],
    ["Mulan", "old"],
    [("Star Wars: Revenge of the Sith","Star Wars: Episode III - Revenge of the Sith"), "new" ],
    [("Ferris Bueller&apos;s", "Ferris Bueller's Day Off"), "old"],
    ["South Park", "old"],
    ["Tropic Thunder", "old"],
    ["Finding Nemo", "old"],
    ["Terminator 2: Judgement Day", "old"],
    ["Mindhunter", "old"],
    [("Les Mis&#xE9;rables", "Les Miserables"),"old"]
]

Let's define a function to make inserting the data for these movies easier.

In [None]:
def add_movies_to_structured_data(movies_to_add, structured_data, new_dialogue, old_dialogue, subjects, moral_ratings) -> dict:
    """
    Add movies to the structured_data dictionary.
    
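    :param movies_to_add: List of [name, source] pairs, where name is either a string or a
        (name_in_rating_data, name_in_dialogue_data) tuple and source is "old" or "new".
    :param structured_data: The structured data dictionary.
    :param new_dialogue: New dialogue dataset.
    :param old_dialogue: Old dialogue dataset.
    :param subjects: Subjects dictionary from the moral rating dataset, keyed by subject code.
    :param moral_ratings: Moral ratings DataFrame.
    :return: Updated structured_data dictionary.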
    """
    for movie_info in movies_to_add:
        if isinstance(movie_info[0], tuple):
            movie_name_in_rating_data, movie_name_in_dialogue_data = movie_info[0]
        else:
            movie_name_in_rating_data = movie_info[0]
            movie_name_in_dialogue_data = movie_name_in_rating_data
        
        for subject, subject_data in subjects.items():
            if subject_data["name"] == movie_name_in_rating_data:
                movie_key = subject
                break
        else:
            print(f"Movie {movie_name_in_rating_data} not found in subjects.")
            continue

        dct = {
            "subject_name": movie_name_in_rating_data,
            "subject_code": movie_key,
            "characters": {}
        }
        # <TODO> Add logic to add characters for this movie
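        # Key by the movie name used in the rating data, like the existing entries;
        # movie_info[1] is only the "old"/"new" source tag and would overwrite itself.
        structured_data[movie_name_in_rating_data] = dct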
    
    return structured_data

In [169]:
movie = "Ghostbusters_1984"

for subject_data in subjects.values():
    if subject_data["name"] == "Ghostbusters":
        print(subject_data)

{'1': ['Peter Venkman', 'GB/1.jpg'], '2': ['Ray Stantz', 'GB/2.jpg'], '3': ['Egon Spengler', 'GB/3.jpg'], '4': ['Dana Barrett', 'GB/4.jpg'], '5': ['Louis Tully', 'GB/5.jpg'], 'N': 5, 'name': 'Ghostbusters'}
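
The entry above suggests that numeric string keys hold `[character name, image path]` pairs, while `'N'` and `'name'` are metadata. Under that assumption, a minimal sketch for listing a subject's characters with codes in the `SB/1` style could look like this. `list_characters` is a hypothetical helper, and the `"GB"` code is inferred from the image paths rather than read from `subjects`.

```python
def list_characters(subject_code, subject_entry):
    """Return (character_code, character_name) pairs for one subjects entry.

    Assumes numeric string keys map to [name, image_path]; other keys
    ('N', 'name') are treated as metadata and skipped.
    """
    pairs = []
    for key, value in subject_entry.items():
        if key.isdigit():
            pairs.append((f"{subject_code}/{key}", value[0]))
    return pairs


# Entry copied from the output above.
ghostbusters = {
    '1': ['Peter Venkman', 'GB/1.jpg'],
    '2': ['Ray Stantz', 'GB/2.jpg'],
    '3': ['Egon Spengler', 'GB/3.jpg'],
    '4': ['Dana Barrett', 'GB/4.jpg'],
    '5': ['Louis Tully', 'GB/5.jpg'],
    'N': 5,
    'name': 'Ghostbusters',
}

for code, name in list_characters("GB", ghostbusters):
    print(code, name)
```

Matching these names to the character keys in the dialogue datasets (which appear to be uppercase, e.g. `SETH`) is a separate step that this sketch does not attempt.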


In [158]:
for movie, movie_data in structured_data.items():
    if not movie_data["characters"]:
        print(f"Movie {movie} has no characters in structured_data")

Movie Alien has no characters in structured_data
Movie Bones has no characters in structured_data
Movie Casablanca has no characters in structured_data


In [156]:
structured_data["Superbad"]

{'subject_name': 'Superbad',
 'subject_code': 'SB',
 'characters': {'SETH': {'character_name': 'SETH',
   'subject_name': 'Seth',
   'subject_code': 'SB/1',
   'rating': [29.4,
    72.3,
    50.6,
    38.9,
    73.6,
    15.0,
    54.8,
    69.7,
    74.3,
    61.2,
    55.5,
    36.9,
    85.3,
    32.7,
    85.1,
    77.6,
    24.2,
    31.8,
    32.1,
    75.1,
    34.7,
    35.7,
    74.0,
    41.0,
    65.4,
    81.5,
    17.3,
    43.2,
    32.7,
    37.8,
    24.4,
    68.4,
    14.8,
    16.0,
    20.4,
    18.6,
    29.7,
    19.9,
    46.8,
    70.5,
    74.6,
    13.9,
    74.3,
    14.0,
    42.0,
    30.9,
    32.9,
    36.8,
    78.3,
    83.3,
    77.6,
    41.0,
    52.9,
    45.1,
    62.9,
    60.8,
    24.1,
    28.7,
    22.6,
    86.7,
    66.1,
    63.3,
    63.0,
    25.8,
    63.8,
    61.1,
    19.7,
    32.7,
    76.6,
    16.9,
    77.4,
    28.4,
    25.3,
    79.4,
    16.9,
    29.9,
    14.9,
    24.5,
    21.5,
    50.9,
    69.6,
    60.2,
    78.5,
   

In [112]:
num_characters = 0
num_sentences = 0
num_movies = 0
movie_with_no_characters_with_more_than_100_sentences = []

for movie, data in structured_data.items():
    num_movies += 1  # count every movie; otherwise the total printed below stays at 0
    movie_has_more_than_100_sentences = False
    for character, char_data in data["characters"].items():
        num_characters += 1
        num_sentences += len(char_data["sentences"])
        if len(char_data["sentences"] ) >= 100:
            movie_has_more_than_100_sentences = True
    if not movie_has_more_than_100_sentences:
        print(f"Movie {movie} has no character with more than 100 sentences")
        movie_with_no_characters_with_more_than_100_sentences.append(movie)

print(f'Total number of movies: {num_movies}')
print(f'Total number of characters: {num_characters}')
print(f'Total number of sentences: {num_sentences}')

Movie Alien has no character with more than 100 sentences
Movie Death Note has no character with more than 100 sentences
Movie Bones has no character with more than 100 sentences
Movie The Shape of Water has no character with more than 100 sentences
Movie Ender's Game has no character with more than 100 sentences
Movie Beauty and the Beast has no character with more than 100 sentences
Movie Divergent has no character with more than 100 sentences
Movie Supergirl has no character with more than 100 sentences
Movie Raiders of the Lost Ark has no character with more than 100 sentences
Movie The Little Mermaid has no character with more than 100 sentences
Movie Ghostbusters has no character with more than 100 sentences
Movie Casablanca has no character with more than 100 sentences
Movie Hannibal has no character with more than 100 sentences
Movie Transformers has no character with more than 100 sentences
Movie Downton Abbey has no character with more than 100 sentences
Movie X-Men has no ch

In [106]:
len(movie_with_no_characters_with_more_than_100_sentences)

113

In [136]:
new_dialogue["Alien_1979"].keys()

dict_keys(['ROBY', 'BROUSSARD', 'FAUST', 'MELKONIS', 'HUNTER', 'STANDARD', 'COMPUTER', 'ROBY &amp; MELKONIS'])

In [95]:
print("Alien" in structured_data_full)
print("Alien" in structured_data)

# "Alien" exists in both old_dialogue and new_dialogue, and it is in
# structured_data, but it is not in structured_data_full.

# This is because structured_data_full only contains characters with at
# least 100 lines of dialogue, and no "Alien" character reaches that count.

False
True
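
To illustrate why a movie can be present in `structured_data` yet absent from `structured_data_full`, here is a minimal sketch of a 100-line filter on toy data. It is not the code from explore_2.ipynb: `filter_by_min_sentences` is a hypothetical helper, and dropping movies that end up with no characters is an assumption consistent with the `False` above.

```python
def filter_by_min_sentences(data, min_sentences=100):
    """Keep only characters with at least `min_sentences` sentences.

    Movies left with no characters are dropped entirely (assumption).
    """
    filtered = {}
    for movie, movie_data in data.items():
        kept = {
            name: char_data
            for name, char_data in movie_data["characters"].items()
            if len(char_data["sentences"]) >= min_sentences
        }
        if kept:
            filtered[movie] = {**movie_data, "characters": kept}
    return filtered


# Toy data in the same shape as structured_data (movie and character names are made up).
toy = {
    "Movie A": {"characters": {"X": {"sentences": ["line"] * 120},
                               "Y": {"sentences": ["line"] * 10}}},
    "Movie B": {"characters": {"Z": {"sentences": ["line"] * 5}}},
}

result = filter_by_min_sentences(toy)
print(list(result))
print(list(result["Movie A"]["characters"]))
```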


In [85]:
for movie in structured_data.keys():
    if movie not in movie_in_rating_data:
        print(f"{movie} is not in moral ratings dataset")

# Since this prints nothing, all movies in structured_data are in the moral ratings dataset

## Step 2

In [178]:
# Movies that exist in moral ratings dataset but not in structured_data
missing_movies = [movie for movie in movie_in_rating_data if movie not in structured_data.keys()]

In [179]:
len(missing_movies)

228

Movies to scrape:
- Mean Girls /
- Robin Hood
- The Room /
- Ocean's 11 /
- Lilo & Stitch /
- Gone with the Wind /
- Tomorrow Never Dies (https://imsdb.com/scripts/Tomorrow-Never-Dies.html)
- Zombieland /
- Hamilton
- Mad Max: Fury Road
- The Devil Wears Prada (exists in old dialogue)
- Snow White and the Seven Dwarfs (https://the-jh-movie-collection-official.fandom.com/wiki/Snow_White_and_the_Seven_Dwarfs_(1937_film)/Transcript)
- The Notebook 
- Hamlet / 
- Fifty Shades of Grey 
- Dirty Dancing (maybe)
- Elizabethtown
- When Harry Met Sally...
- Brokeback Mountain
- Roman Holiday
- 8 Mile
- Portrait of a Lady on Fire
- The Spectacular Now
- Breakfast at Tiffany's
- Ruby Sparks
- Moulin Rouge!
- Amelie
- High School Musical (Maybe)
- Star Wars: The Last Jedi
- The Incredibles
- Monsters, Inc.
- The Land Before Time (Maybe)
- Jurassic World
- Rogue One
- Tangled
- The Da Vinci Code
- Mamma Mia!
- Adventures of Huckleberry Finn
- Fahrenheit 451
- Beowulf
- Nineteen Eighty-Four
- A Tale of Two Cities
- After Life
- Singin&apos; in the Rain (Singing in the rain)
- The Scarlet Letter
- Mrs Dalloway
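
Once a script has been downloaded as plain text, it still has to be turned into the character-to-lines mapping used by the dialogue datasets. The sketch below assumes a simple screenplay layout: a character cue in capitals on its own line, followed by dialogue lines up to the next blank line. Real transcripts from the sources above may be laid out differently, and `parse_script`, `CUE`, and the sample text are all hypothetical.

```python
import re
from collections import defaultdict

# A cue is a line made only of capital letters, spaces, dots, apostrophes, hyphens.
CUE = re.compile(r"^[A-Z][A-Z .'\-]*$")


def parse_script(text):
    """Group dialogue blocks by speaker; lines outside a cue block are ignored."""
    dialogue = defaultdict(list)
    speaker = None
    buffer = []

    def flush():
        nonlocal buffer
        if speaker and buffer:
            dialogue[speaker].append(" ".join(buffer))
        buffer = []

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            flush()          # a blank line ends the current dialogue block
            speaker = None
        elif speaker is None and CUE.match(line):
            speaker = line   # start of a new block
        elif speaker is not None:
            buffer.append(line)
    flush()
    return dict(dialogue)


sample = """
INT. OFFICE - DAY

Alice walks in.

ALICE
Where is everyone?
Hello?

BOB
Over here.

ALICE
Coming.
"""

parsed = parse_script(sample)
print(parsed)
```

Scene headings such as `INT. OFFICE - DAY` also match the cue pattern, but they are discarded here because no dialogue follows them before the blank line; scripts without that blank line would need a stricter rule.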

In [191]:
for movie in missing_movies:
    print(movie)

Game of Thrones
Harry Potter
The Office
Friends
The West Wing
LOST
The Wire
Avatar: The Last Airbender
Star Trek: Deep Space Nine
Pride and Prejudice
Marvel Cinematic Universe
The Simpsons
That 70's Show
Grey&apos;s Anatomy
Breaking Bad
Firefly + Serenity
Lord of the Rings
Community
The Walking Dead
The Big Bang Theory
True Detective
Parks and Recreation
The Hunger Games
Westworld
Battlestar Galactica
Mad Men
Pirates of the Caribbean
Sherlock
Law & Order: SVU
Silicon Valley
Twin Peaks
The X-Files
Romeo and Juliet
Cowboy Bebop
Stargate SG-1
Mean Girls
Robin Hood
Orange is the New Black
Friday Night Lights
Atlas Shrugged
Brooklyn Nine-Nine
The Room
Seinfeld
Two and Half Men
NCIS
This Is Us
Arrested Development
Ozark
Scrubs
Star Trek: The Next Generation
Girls
Prison Break
The Mentalist
Desperate Housewives
The Good, the Bad, and the Ugly
Schitt's Creek
Glee
House, M.D.
The Good Place
Chilling Adventures of Sabrina
The 100
Scandal
How To Get Away With Murder
Riverdale
Space Force
Gossip G