## EDA & Preprocessing

In this notebook we explore the data and split the scripts into pieces small enough to be processed by the machine learning model.

*Disclaimer for scripts: This project is in no way associated with Friends, Warner Bros, NBC or Bright/Kauffman/Crane Productions. This project is for educational purposes only.*

## Environment Set-Up

In [None]:
import os
import glob
import re
import tensorflow as tf
import numpy as np
from collections import Counter
import torch
import pickle
from transformers import pipeline, GPT2Tokenizer, GPT2LMHeadModel  # AdamW is imported from torch.optim below
from tqdm.auto import tqdm
import torch.nn as nn  # For neural network modules
import torch.optim as optim  # For optimizers like SGD, Adam, etc.
import torch.nn.functional as F  # For functions like activations
from torch.utils.data import DataLoader, Dataset  # For creating data loaders and custom datasets
from torch.nn.utils.rnn import pad_sequence
from torch.optim import AdamW
from shutil import copyfile

## Get the Data
The data consists of scripts from all 10 seasons, retrieved via web scraping.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
directory_path = '/content/drive/My Drive'
directory_files = os.listdir(directory_path)

In [None]:
directory_files

['Honors Biology Spring Final Exam Study Guide 2013.docx',
 'Hr. World Lit. - Annotated Resources.docx.gdoc',
 'Annotated Bibliographies.doc.gdoc',
 "Annotations - Dante's Inferno.docx.gdoc",
 'photo 2.JPG',
 'Vive le Taureau Chapter 4.docx',
 'Vive le Taureau Chapter 4 (1).gdoc',
 'Vive le Taureau Chapter 4.gdoc',
 'Revenge Play and Amleth.docx',
 'Revenge Play and Amleth.gdoc',
 'Changes in Russia 1750-1914.pages.zip',
 'Saint John Vianney.gdoc',
 'Letters Against and For Constitution.docx',
 'Letters Against and For Constitution.docx.gdoc',
 'Sophia Tannir',
 'Reflexives, problematic voyage, EU, Inspector Maigret-2.docx.gdoc',
 'Untitled document (2).gdoc',
 'heyy.gdoc',
 'Church, yo..gslides',
 "Strohmann and Bridget's Wedding.gdoc",
 'She Could Be a Farmer in those Clothes!!! (Part Deux).gslides',
 'Chem Handout-Big Idea 1.gdoc',
 'Big Idea 1.gslides',
 'AP Gov Study Guide.gdoc',
 'Island Project.gslides',
 'John 5:1-47.gslides',
 'SophiaGene Chart1.docx',
 'SophiaGene Chart1.docx

In [None]:
# Define the path to the directory
directory = "/content/drive/My Drive/friends/cleaned_scripts"

## Explore the Data


In [None]:
# Initialize counters
dialogues_per_main_character = Counter({'Rachel': 0, 'Ross': 0, 'Monica': 0, 'Chandler': 0, 'Joey': 0, 'Phoebe': 0})
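# Word and line totals cover only the six main characters' dialogue lines
total_word_count = 0
# NOTE: total_scenes is never incremented in the loop below,
# so the scene average is reported as 0.0
total_scenes = 0
total_lines = 0
script_count = 0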

# Process each file
for filename in os.listdir(directory):
    if filename.endswith('.txt'):
        script_count += 1
        filepath = os.path.join(directory, filename)
        with open(filepath, 'r', encoding='utf-8') as file:
            script_content = file.read()

        # Count dialogues for main characters
        for line in script_content.split('\n'):
            if ':' in line:
                character, dialogue = line.split(':', 1)
                character = character.strip()
                if character in dialogues_per_main_character:
                    dialogues_per_main_character[character] += 1
                    word_count = len(dialogue.split())
                    total_word_count += word_count
                    total_lines += 1

# Calculate averages
average_word_count_per_script = total_word_count / script_count if script_count else 0
average_scenes_per_script = total_scenes / script_count if script_count else 0
average_words_per_line = total_word_count / total_lines if total_lines else 0

# Display results
print("Number of Dialogues per Main Character:")
for character, count in dialogues_per_main_character.items():
    print(f"{character}: {count}")

print(f"\nAverage Word Count per Script: {average_word_count_per_script}")
print(f"Average Number of Scenes per Script: {average_scenes_per_script}")
print(f"Average Number of Words per Line: {average_words_per_line}")

Number of Dialogues per Main Character:
Rachel: 7090
Ross: 6718
Monica: 6206
Chandler: 6255
Joey: 6114
Phoebe: 5502

Average Word Count per Script: 1541.3829787234042
Average Number of Scenes per Script: 0.0
Average Number of Words per Line: 7.648937574237825
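
The scene average above is 0.0 because the cell never increments `total_scenes`. A minimal sketch of how scenes could be counted, assuming every scene opens with a line starting with `[Scene:` (the marker used when the scripts are split later in this notebook). The helper name `count_scenes` and the sample text are hypothetical and only for illustration.

```python
def count_scenes(script_text, marker="[Scene:"):
    """Count the lines that open a new scene, i.e. start with the scene marker."""
    return sum(
        1 for line in script_text.splitlines()
        if line.strip().startswith(marker)
    )


# Hypothetical two-scene script, made up for this example.
sample_script = (
    "[Scene: Central Perk, everyone is there.]\n"
    "Monica: Hello.\n"
    "[Scene: Monica and Rachel's, later.]\n"
    "Rachel: Hi.\n"
)

print(count_scenes(sample_script))
```

Inside the file loop, adding `total_scenes += count_scenes(script_content)` would be one way to feed the scene average.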


In [None]:
# Define the locations to count
locations = ['Central Perk', "Joey and Rachel's", "Monica's", "Monica and Rachel's", "Chandler and Joey's", "Ross's", "Phoebe's", "Moondance Diner"]
locations_counter = Counter({location: 0 for location in locations})

# Process each file
for filename in os.listdir(directory):
    if filename.endswith('.txt'):
        filepath = os.path.join(directory, filename)
        with open(filepath, 'r', encoding='utf-8') as file:
            script_content = file.read()

        # Count occurrences of specified locations
        for line in script_content.split('\n'):
            if line.startswith('[') and line.endswith(']'):
                scene_description = line.strip('[]').split(',')[0]
                for location in locations:
                    if location in scene_description:
                        locations_counter[location] += 1

# Display results
print("Scene Count for Specific Locations:")
for location, count in locations_counter.items():
    print(f"{location}: {count}")


Scene Count for Specific Locations:
Central Perk: 278
Joey and Rachel's: 37
Monica's: 35
Monica and Rachel's: 191
Chandler and Joey's: 58
Ross's: 26
Phoebe's: 13
Moondance Diner: 6


## Break scripts up into scenes

The model accepts at most 1,024 tokens per input sequence. To stay under that limit while keeping the text understandable, we break each script up into its individual scenes, which will hopefully also help the model learn to write a short scene.

In [None]:
def split_script_into_scenes(script_filepath, output_directory):
    with open(script_filepath, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    scene_lines = []
    scene_count = 1
    base_filename = os.path.splitext(os.path.basename(script_filepath))[0]

    for line in lines:
        if line.strip().startswith('[Scene:') and scene_lines:
            # Save current scene to file
            scene_filename = f"{base_filename}_{scene_count}.txt"
            with open(os.path.join(output_directory, scene_filename), 'w', encoding='utf-8') as scene_file:
                scene_file.write(''.join(scene_lines))

            # Reset for next scene
            scene_lines = []
            scene_count += 1

        scene_lines.append(line)

    # Save the last scene
    if scene_lines:
        scene_filename = f"{base_filename}_{scene_count}.txt"
        with open(os.path.join(output_directory, scene_filename), 'w', encoding='utf-8') as scene_file:
            scene_file.write(''.join(scene_lines))

def split_all_scripts(input_directory, output_directory):
    for filename in os.listdir(input_directory):
        if filename.endswith('.txt'):
            split_script_into_scenes(os.path.join(input_directory, filename), output_directory)
            print(f"Processed {filename}")
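
To see the splitting rule without touching the file system, here is a small in-memory sketch of the same logic. The function name `split_text_into_scenes` and the sample script are hypothetical. It assumes, as the function above does, that a new chunk starts at every line beginning with `[Scene:`.

```python
def split_text_into_scenes(script_text, marker="[Scene:"):
    """Split a script into chunks, starting a new chunk at each scene marker."""
    scenes, current = [], []
    for line in script_text.splitlines(keepends=True):
        # A marker closes the chunk collected so far, if there is one.
        if line.strip().startswith(marker) and current:
            scenes.append("".join(current))
            current = []
        current.append(line)
    if current:
        scenes.append("".join(current))
    return scenes


# Hypothetical script, made up for this example.
sample_script = (
    "Text that appears before the first scene marker.\n"
    "[Scene: Central Perk]\n"
    "Joey: Hello.\n"
    "[Scene: Ross's]\n"
    "Ross: Hi.\n"
)

scenes = split_text_into_scenes(sample_script)
for number, scene in enumerate(scenes, start=1):
    print(number, repr(scene))
```

Note that any text before the first `[Scene:` line becomes its own chunk, so the file numbered `_1` may hold a preamble rather than a scene.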


In [None]:
input_directory = '/content/drive/My Drive/friends/cleaned_scripts'
output_directory = '/content/drive/My Drive/friends/split_scripts'

# Creates the folder if it does not exist
os.makedirs(output_directory, exist_ok=True)

In [None]:
split_all_scripts(input_directory, output_directory)

Processed The_One_Where_Monica_Gets_a_New_Roommate_(The_Pilot-The_Uncut_Version).txt
Processed The_One_With_the_Sonogram_at_the_End.txt
Processed The_One_With_the_Thumb.txt
Processed The_One_With_George_Stephanopoulos.txt
Processed The_One_With_the_East_German_Laundry_Detergent.txt
Processed The_One_With_the_Butt.txt
Processed The_One_With_the_Blackout.txt
Processed The_One_Where_Nana_Dies_Twice.txt
Processed The_One_Where_Underdog_Gets_Away.txt
Processed The_One_With_the_Monkey.txt
Processed The_One_With_Mrs._Bing.txt
Processed The_One_With_the_Dozen_Lasagnas.txt
Processed The_One_With_the_Boobies.txt
Processed The_One_With_the_Candy_Hearts.txt
Processed The_One_With_the_Stoned_Guy.txt
Processed The_One_With_Two_Parts,_part_1.txt
Processed The_One_With_Two_Parts,_Part_2.txt
Processed The_One_With_All_The_Poker.txt
Processed The_One_Where_the_Monkey_Gets_Away.txt
Processed The_One_With_the_Evil_Orthodontist.txt
Processed The_One_With_The_Fake_Monica.txt
Processed The_One_With_the_Ick_F

## Tokenize Data

Now we will tokenize the data and check whether each scene file is under the token limit of 1,024.

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

In [None]:
tokenized_scripts = []

for filename in os.listdir(output_directory):
    if filename.endswith('.txt'):
        filepath = os.path.join(output_directory, filename)
        with open(filepath, 'r', encoding='utf-8') as file:
            script = file.read()
            # Tokenize the script and add special tokens
            tokenized_script = tokenizer.encode(script, add_special_tokens=True)
            tokenized_scripts.append(tokenized_script)


Token indices sequence length is longer than the specified maximum sequence length for this model (2755 > 1024). Running this sequence through the model will result in indexing errors
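
The warning above means at least one file encodes to 2,755 tokens, more than the 1,024 the model accepts. A minimal sketch of how over-limit texts could be listed together with their token counts follows. The helper `find_over_limit` is hypothetical, and `whitespace_encode` is only a stand-in encoder so that the example runs without downloading a model.

```python
def find_over_limit(texts, encode, max_tokens=1024):
    """Return {name: token_count} for every text longer than max_tokens."""
    over = {}
    for name, text in texts.items():
        n_tokens = len(encode(text))
        if n_tokens > max_tokens:
            over[name] = n_tokens
    return over


def whitespace_encode(text):
    """Stand-in encoder for illustration: one 'token' per whitespace-separated word."""
    return text.split()


# Hypothetical file names and contents, made up for this example.
sample_texts = {
    "short_scene.txt": "hello " * 10,
    "long_scene.txt": "hello " * 2000,
}

print(find_over_limit(sample_texts, whitespace_encode))
```

With the real data, `tokenizer.encode` would be passed in place of `whitespace_encode`.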


## Removing texts that are over the token limit

Now we will count the scene files, identify those that exceed the 1,024-token limit, and remove them.

In [None]:
# Count the number of text files

# Initialize a counter
total_scripts = 0

# Iterate over the files in the directory
for filename in os.listdir(output_directory):
    if filename.endswith('.txt'):
        total_scripts += 1

# Print the total number of script files
print(f"Total number of scripts: {total_scripts}")


Total number of scripts: 2509


In [None]:
# Initialize a counter
scripts_exceeding_limit = 0

# Iterate over the scripts
for filename in os.listdir(output_directory):
    if filename.endswith('.txt'):
        filepath = os.path.join(output_directory, filename)
        with open(filepath, 'r', encoding='utf-8') as file:
            script_content = file.read()

        # Tokenize the script
        tokens = tokenizer.encode(script_content)

        # Check if token count exceeds 1024
        if len(tokens) > 1024:
            scripts_exceeding_limit += 1
            print(f"Script '{filename}' exceeds 1024 tokens.")

# Print the total number of scripts exceeding the limit
print(f"Total scripts exceeding 1024 tokens: {scripts_exceeding_limit}")

Script 'The_One_Where_Monica_Gets_a_New_Roommate_(The_Pilot-The_Uncut_Version)_1.txt' exceeds 1024 tokens.
Script 'The_One_With_the_Sonogram_at_the_End_3.txt' exceeds 1024 tokens.
Script 'The_One_With_the_Thumb_2.txt' exceeds 1024 tokens.
Script 'The_One_With_George_Stephanopoulos_2.txt' exceeds 1024 tokens.
Script 'The_One_With_the_East_German_Laundry_Detergent_3.txt' exceeds 1024 tokens.
Script 'The_One_With_the_Butt_4.txt' exceeds 1024 tokens.
Script 'The_One_Where_Nana_Dies_Twice_11.txt' exceeds 1024 tokens.
Script 'The_One_Where_Underdog_Gets_Away_13.txt' exceeds 1024 tokens.
Script 'The_One_With_the_Monkey_3.txt' exceeds 1024 tokens.
Script 'The_One_With_the_Monkey_8.txt' exceeds 1024 tokens.
Script 'The_One_With_the_Stoned_Guy_9.txt' exceeds 1024 tokens.
Script 'The_One_With_the_Stoned_Guy_12.txt' exceeds 1024 tokens.
Script 'The_One_With_Two_Parts,_Part_2_7.txt' exceeds 1024 tokens.
Script 'The_One_With_All_The_Poker_8.txt' exceeds 1024 tokens.
Script 'The_One_Where_the_Monkey_

115 scene files exceed the model's maximum sequence length of 1,024 tokens, which would cause indexing errors when they are run through the model. Because this is a small share of the 2,509 files (~5%), we will remove them.
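
As a quick check of the proportion quoted above, the arithmetic can be sketched as follows. It assumes the two counts reported earlier in this notebook, 115 over-limit files out of 2,509, and that both refer to the same directory.

```python
over_limit = 115    # files above 1,024 tokens, as reported above
total_files = 2509  # total number of scene files, as reported above

share = over_limit / total_files
remaining = total_files - over_limit

print(f"{share:.1%} removed, {remaining} files kept")
```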

In [None]:
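# Paths
# NOTE: these paths contain a "Colab" folder that the paths used earlier in
# this notebook do not; make sure original_directory points at the folder
# the scene files were actually written to.
original_directory = "/content/drive/My Drive/Colab/friends/split_scripts"
new_directory = "/content/drive/My Drive/Colab/friends/processed_scripts"

# Create the new directory if it doesn't exist
os.makedirs(new_directory, exist_ok=True)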

In [None]:
# Initialize tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Process files
for filename in os.listdir(original_directory):
    if filename.endswith('.txt'):
        filepath = os.path.join(original_directory, filename)
        with open(filepath, 'r', encoding='utf-8') as file:
            script_content = file.read()

        # Tokenize and check length
        tokens = tokenizer.encode(script_content)
        if len(tokens) <= 1024:  # GPT-2 token limit
            # Copy file to new directory
            new_filepath = os.path.join(new_directory, filename)
            copyfile(filepath, new_filepath)

print("Finished processing scripts.")


Token indices sequence length is longer than the specified maximum sequence length for this model (2755 > 1024). Running this sequence through the model will result in indexing errors


Finished processing scripts.
