## Transformers Project: Script Generation

In this project, we'll generate our own Friends TV script using GPT-2. We'll use scripts from all 10 seasons, collected by web scraping. The goal is to generate a TV script for a scene at Monica and Rachel's apartment or at Central Perk.

*In general, Friends is a good choice of a show to use for this project:*


*   Limited settings
    * Monica's apartment
    * Chandler and Joey's apartment
    * Central Perk
    * Ross's apartment
*   Limited characters for an ensemble cast show (6 main characters with distinct personalities and senses of humor)
*   Scripts from all seasons are available
*   Consistent title naming ("The One With ... ")
*   Notable vernacular/slang
    * Joey's catchphrase, "How you doin'?"
    * The characters use the emphasized word "so" to modify adjectives more often than any other intensifier
    * Chandler's habit of leaving a sentence unfinished for sarcasm
*   Friends has also been credited with helping non-native English speakers learn the language, so its dialogue should be a good fit for a large language model

Overall, these consistencies should make it easier for the model to "learn" how the show works and to produce output that readers recognize.

*Disclaimer for scripts: This project is in no way associated with Friends, Warner Bros, NBC or Bright/Kauffman/Crane Productions. This project is for educational purposes only.*

## Environment Set-Up

In [1]:
import os
import glob
import re
import tensorflow as tf
import numpy as np
from collections import Counter
import torch
import pickle
from transformers import pipeline, GPT2Tokenizer, GPT2LMHeadModel  # AdamW is imported from torch.optim below
from tqdm.auto import tqdm
import torch.nn as nn  # For neural network modules
import torch.optim as optim  # For optimizers like SGD, Adam, etc.
import torch.nn.functional as F  # For functions like activations
from torch.utils.data import DataLoader, Dataset  # For creating data loaders and custom datasets
from torch.nn.utils.rnn import pad_sequence
from torch.optim import AdamW
from shutil import copyfile

## Get the Data
The data consists of scripts from all 10 seasons, retrieved via web scraping.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
directory_path = '/content/drive/My Drive/Colab'
directory_files = os.listdir(directory_path)

In [4]:
directory_files

['Simpsons',
 'Homework1',
 'simpsons_script.ipynb',
 'Untitled0.ipynb',
 'friends']

In [5]:
# Define the path to the directory
directory = "/content/drive/My Drive/Colab/friends/processed_scripts"

## Model Loading

Initialize the tokenizer and the language model from the pre-trained GPT-2 available in Hugging Face's transformers library.

In [6]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')


## Tokenization

Tokenization: Converts each script's text into tokens using the initialized tokenizer.

Token Length Check: Checks whether the script has at most 1024 tokens, the maximum context length of the standard GPT-2 model. This matters because GPT-2 cannot process a single input sequence longer than 1024 tokens.

File Copying: If the script meets the token limit, it is copied from `original_directory` to `new_directory` using `copyfile` from Python's `shutil` module, which was imported in the environment set-up cell.
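The filter described above drops any script longer than 1024 tokens. A possible alternative, which this notebook does not use, is to split long scripts into consecutive windows that each fit the limit. The following is only a minimal sketch under that assumption; `chunk_tokens` is a hypothetical helper, and the fake token list stands in for the output of `tokenizer.encode`.

```python
def chunk_tokens(tokens, max_len=1024):
    """Split a list of token ids into consecutive chunks of at most max_len ids."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


# Stand-in for tokenizer.encode(script_content): 2500 fake token ids
fake_tokens = list(range(2500))
chunks = chunk_tokens(fake_tokens)
print([len(c) for c in chunks])  # [1024, 1024, 452]
```

Splitting at fixed offsets can cut a line of dialogue in half, so a scene-aware split may work better in practice.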

In [7]:
# Define file paths
original_directory = "/content/drive/My Drive/Colab/friends/processed_scripts"
new_directory = "/content/drive/My Drive/Colab/friends/tokenized_scripts"

# Create new directory if it doesn't exist
if not os.path.exists(new_directory):
    os.makedirs(new_directory)

In [8]:
# Initialize tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Process files
for filename in os.listdir(original_directory):
    if filename.endswith('.txt'):
        filepath = os.path.join(original_directory, filename)
        with open(filepath, 'r', encoding='utf-8') as file:
            script_content = file.read()

        # Tokenize and check length
        tokens = tokenizer.encode(script_content)
        if len(tokens) <= 1024:  # GPT-2 token limit
            # Copy file to new directory
            new_filepath = os.path.join(new_directory, filename)
            copyfile(filepath, new_filepath)

print("Finished processing scripts.")


Finished processing scripts.


## Set Up the Training Dataset

Now, we define a custom dataset class that loads and tokenizes the script files so they can be fed to GPT-2 in a PyTorch workflow.
The class stores one tensor of token ids per script, and a collate function combines those tensors into batches.

In [9]:
# Combine individual scripts into one batch tensor; needed because scripts vary in length
def collate_fn(batch):
    # Pad every sequence in the batch to the length of the longest one.
    # batch_first=True puts the batch dimension first: (batch, seq_len).
    # padding_value=0 fills the gaps with token id 0. GPT-2 has no dedicated pad token and
    # id 0 is a regular vocabulary entry ("!"), so padded positions look like real tokens.
    batch_padded = pad_sequence(batch, batch_first=True, padding_value=0)
    return batch_padded

Padding ensures that all scripts within a batch have the same length, so they can be processed in parallel later on.
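Because id 0 is an ordinary GPT-2 token, padding with 0 and later reusing the batch as labels means the padded positions also contribute to the loss. One common way to avoid this, shown here only as a sketch and not as what this notebook does, is to return an attention mask and set the padded label positions to -100, the index that PyTorch's cross-entropy loss ignores. `collate_with_mask` is a hypothetical name.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def collate_with_mask(batch, pad_id=0):
    """Pad a list of 1-D token tensors and build a matching mask and labels."""
    lengths = torch.tensor([len(seq) for seq in batch])
    input_ids = pad_sequence(batch, batch_first=True, padding_value=pad_id)

    # 1 for real tokens, 0 for padded positions
    positions = torch.arange(input_ids.size(1)).unsqueeze(0)
    attention_mask = (positions < lengths.unsqueeze(1)).long()

    # Padded positions get -100 so the loss skips them
    labels = input_ids.clone()
    labels[attention_mask == 0] = -100
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = [torch.tensor([5, 6, 7]), torch.tensor([8])]
out = collate_with_mask(batch)
print(out["attention_mask"])
print(out["labels"])
```

A training loop using this variant would receive a dictionary per batch instead of a single tensor, so it would need to move each entry to the device and pass all three to the model.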

In [10]:
# stores a list of tokenized scripts
class ScriptDataset(Dataset):
    def __init__(self, directory):
        self.tokenized_scripts = []
        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

        # Read and tokenize each script in the directory
        # Iterates over each text file, tokenizes (max length 1024 tokens), appends the token tensor to the list
        for filename in os.listdir(directory):
            if filename.endswith('.txt'):
                filepath = os.path.join(directory, filename)
                with open(filepath, 'r', encoding='utf-8') as file:
                    script_content = file.read()
                    tokens = tokenizer.encode(script_content, truncation=True, max_length=1024)
                    self.tokenized_scripts.append(torch.tensor(tokens, dtype=torch.long))

    # returns the number of scripts in the dataset, needed for PyTorch's Dataset class to function properly
    def __len__(self):
        return len(self.tokenized_scripts)

    # retrieves an item from the dataset by index, allowing the dataset to be indexed and iterated over.
    def __getitem__(self, idx):
        return self.tokenized_scripts[idx]

In [11]:
# Usage
new_directory = "/content/drive/My Drive/Colab/friends/gpt_scripts"

# Create new directory if it doesn't exist
if not os.path.exists(new_directory):
    os.makedirs(new_directory)

In [12]:
# creates an instance of ScriptDataset with the specified directory, loading and tokenizing the script files found there
dataset = ScriptDataset(new_directory)

With this setup, we can use the ScriptDataset with PyTorch's data loading utilities!

## Dataloader Initialization

DataLoader is a PyTorch utility that automates data loading, including batching, shuffling (if enabled), and parallel loading with multiple worker processes (if specified).
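To make the batching and shuffling behaviour concrete, here is a small sketch on toy data. The five short tensors are stand-ins for tokenized scripts, and `shuffle=True` is an illustrative choice, not a setting taken from this notebook.

```python
import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

# Toy stand-in for ScriptDataset: five "scripts" of different lengths
toy_scripts = [torch.arange(n) for n in (3, 5, 2, 4, 1)]

loader = DataLoader(
    toy_scripts,
    batch_size=2,
    shuffle=True,  # reshuffle the scripts at the start of every epoch
    collate_fn=lambda batch: pad_sequence(batch, batch_first=True, padding_value=0),
)

for batch in loader:
    # Each batch is padded only to the longest sequence it contains
    print(batch.shape)

print(len(loader))  # 3 batches: 2 + 2 + 1 scripts
```

The last batch is smaller than `batch_size` because five items do not divide evenly by two.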

In [13]:
# Create Dataset and DataLoader
batch_size = 8
dataloader = DataLoader(dataset, batch_size=batch_size, collate_fn=collate_fn)

In [14]:
for batch in dataloader:
    print(type(batch))
    if isinstance(batch, dict):
        print(batch.keys())
    # Add more debug prints here if needed
    break  # Remove or comment out this line to inspect more batches


<class 'torch.Tensor'>


## Initializing and Displaying GPT-2 Model Architecture

First, let's load a pre-trained GPT-2 model from Hugging Face's transformers library and put it in training mode. This matters because models like GPT-2 behave differently in training and evaluation mode. For example, Dropout layers are active during training (randomly zeroing activations to reduce overfitting) and disabled during evaluation (nothing is dropped).
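The difference between the two modes can be seen on a single Dropout layer in isolation. This is a minimal sketch with an arbitrary dropout probability of 0.5, chosen for illustration only (GPT-2 uses 0.1).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
train_out = drop(x)  # some entries are zeroed, survivors are scaled by 1 / (1 - p) = 2

drop.eval()
eval_out = drop(x)   # dropout is disabled: the input passes through unchanged

print(train_out)
print(eval_out)
```

Calling `model.train()` or `model.eval()` on the full GPT-2 model switches every Dropout layer inside it in the same way.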


In [15]:
model = GPT2LMHeadModel.from_pretrained("gpt2") # model initialization
model.train() # setting model into training mode

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

### Training loop

In [16]:
# Load tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

In [17]:
def train_model2(model, dataloader, epochs=3, lr=5e-5, max_seq_length=512):
    optimizer = AdamW(model.parameters(), lr=lr)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    for epoch in range(epochs):
        model.train()  # Set the model to training mode
        epoch_loss = 0
        progress_bar = tqdm(enumerate(dataloader), total=len(dataloader), desc=f"Epoch {epoch+1}/{epochs}")

        for i, batch in progress_bar:
            # Check if the batch is a tensor
            if not isinstance(batch, torch.Tensor):
                raise ValueError("Each batch should be a tensor.")

            # Each batch is a single padded tensor of token ids with shape (batch, seq_len).
            # For causal language modeling the labels are the input ids themselves.
            input_ids = batch[:, :max_seq_length].to(device)
            labels = batch[:, :max_seq_length].to(device)

            optimizer.zero_grad()

            try:
                outputs = model(input_ids=input_ids, labels=labels)
                loss = outputs.loss
                loss.backward()
                optimizer.step()

                epoch_loss += loss.item()
                progress_bar.set_postfix({'loss': loss.item()})
            except RuntimeError as e:
                if 'out of memory' in str(e):
                    print(f"WARNING: out of memory with batch {i}. If this message repeats, reduce batch size.")
                    torch.cuda.empty_cache()
                else:
                    raise e

        print(f"Average Loss Epoch {epoch+1}: {epoch_loss / len(dataloader)}")


In [18]:
def train_model(model, dataloader, epochs=3, lr=5e-5, max_seq_length=512):
    # Initialize the optimizer with the model's parameters and a specific learning rate
    optimizer = AdamW(model.parameters(), lr=lr)

    # Setup the device (GPU or CPU) based on availability
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)  # Move the model to the appropriate device

    # Loop over the dataset for a fixed number of epochs
    for epoch in range(epochs):
        model.train()  # Explicitly set the model to training mode (enables dropout)
        epoch_loss = 0  # Initialize loss for epoch calculations
        # Progress bar to monitor the training progress for each epoch
        progress_bar = tqdm(enumerate(dataloader), total=len(dataloader), desc=f"Epoch {epoch+1}/{epochs}")

        # Iterate over batches of data in the dataloader
        for i, batch in progress_bar:
            if not isinstance(batch, torch.Tensor):
                raise ValueError("Each batch should be a tensor.")  # Ensure batch is a tensor

            # Truncate each padded sequence to at most max_seq_length tokens.
            # For causal language modeling the labels are the input ids themselves;
            # GPT2LMHeadModel shifts them by one position internally when computing the loss.
            input_ids = batch[:, :max_seq_length].to(device)  # Move input_ids to the same device as the model
            labels = batch[:, :max_seq_length].to(device)  # Move labels to the same device as the model

            optimizer.zero_grad()  # Clear gradients before calculating new ones

            try:
                # Forward pass: compute predicted outputs by passing inputs to the model
                outputs = model(input_ids=input_ids, labels=labels)
                loss = outputs.loss  # Extract loss from the model outputs
                loss.backward()  # Backpropagate the error
                optimizer.step()  # Perform a single optimization step (parameter update)

                epoch_loss += loss.item()  # Aggregate loss for the epoch
                progress_bar.set_postfix({'loss': loss.item()})  # Update progress bar with current loss
            except RuntimeError as e:
                if 'out of memory' in str(e):
                    # Handle CUDA out of memory error: give advice on reducing batch size
                    print(f"WARNING: out of memory with batch {i}. If this message repeats, reduce batch size.")
                    torch.cuda.empty_cache()  # Clear cache to attempt to mitigate out of memory error
                else:
                    raise e  # Reraise any other exception that is not related to memory

        # Print the average loss for the epoch after all batches are processed
        print(f"Average Loss Epoch {epoch+1}: {epoch_loss / len(dataloader)}")


In [19]:
train_model(model, dataloader, epochs=10)

Average Loss Epoch 1: 1.8267856127023696
Average Loss Epoch 2: 1.6913169871767362
Average Loss Epoch 3: 1.626354124645392
Average Loss Epoch 4: 1.5750632178783417
Average Loss Epoch 5: 1.5293781914313633
Average Loss Epoch 6: 1.4857927328844864
Average Loss Epoch 7: 1.444965411076943
Average Loss Epoch 8: 1.4061499133706092
Average Loss Epoch 9: 1.367321743319432
Average Loss Epoch 10: 1.3300529276331265
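The average losses above are per-token cross-entropy values, so exponentiating them gives a rough perplexity. The sketch below does that for the first and last epoch. Since the padded positions are included in the loss in this notebook, these numbers are best read as a trend rather than as a clean perplexity on the scripts.

```python
import math

# Average losses for epochs 1 and 10, copied from the training log
epoch_losses = {1: 1.8267856127023696, 10: 1.3300529276331265}

perplexities = {epoch: math.exp(loss) for epoch, loss in epoch_losses.items()}
for epoch, ppl in perplexities.items():
    print(f"Epoch {epoch}: perplexity ~ {ppl:.2f}")
```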


## Save the trained model

The model is saved so that it can be deployed as a Streamlit app for a demo!

In [20]:
# Define the paths where you want to save the model and tokenizer

model_save_path = '/content/drive/My Drive/Colab/friends/gpt2_model10'
tokenizer_save_path = '/content/drive/My Drive/Colab/friends/gpt2_model10'

In [21]:
# Save the model and tokenizer so they can be loaded later, e.g. by the Streamlit app
model.save_pretrained(model_save_path)
tokenizer.save_pretrained(tokenizer_save_path)

('/content/drive/My Drive/Colab/friends/gpt2_model10/tokenizer_config.json',
 '/content/drive/My Drive/Colab/friends/gpt2_model10/special_tokens_map.json',
 '/content/drive/My Drive/Colab/friends/gpt2_model10/vocab.json',
 '/content/drive/My Drive/Colab/friends/gpt2_model10/merges.txt',
 '/content/drive/My Drive/Colab/friends/gpt2_model10/added_tokens.json')

In [22]:
# Load the trained model and tokenizer
model = GPT2LMHeadModel.from_pretrained(model_save_path)
tokenizer = GPT2Tokenizer.from_pretrained(tokenizer_save_path)

# Make sure to set the model to the correct device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


## Generate TV Script
This will generate the TV script for you. Set `gen_length` to the total length of the script in tokens; because it is passed to `generate` as `max_length`, the prompt counts toward it.

The parameters temperature, top_k, and top_p are used to control the randomness and diversity of the text generated by a language model like GPT-2.



*   temperature: Controls randomness in text generation by rescaling the model's scores before sampling
    * Closer to 0: More predictable, conservative text; the model favors more likely words.
    * Closer to 1 (or above): More diverse, creative text; the model is more likely to choose less probable words.
*   top_k: An integer that limits the model to the k most likely next words, filtering out highly improbable words
    * Smaller k: More predictable text; the model considers fewer options for each prediction.
    * Larger k: More diverse text; the model considers a larger set of possible next words.
*   top_p: Nucleus sampling; chooses words from the smallest set whose cumulative probability reaches the threshold p, so the number of candidates adapts to the model's certainty
    * Closer to 0: Very conservative; the model selects from a very small set of highly probable words.
    * Closer to 1: More inclusive and diverse; the model considers a broader set of possible words.

Each of these parameters tunes the trade-off between the coherence and predictability of the generated text (lower values) and its creativity and diversity (higher values).
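The effect of the three parameters can be illustrated on a toy distribution over four candidate words. The sketch below is a simplified, pure-Python illustration of the ideas, not the implementation used by the transformers library; `softmax` and `filter_top_k_top_p` are hypothetical helpers and the scores are made up.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=0, top_p=1.0):
    """Keep the top_k most likely indices, then the smallest prefix of them
    whose cumulative probability reaches top_p; renormalize what is left."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-word vocabulary

sharp = softmax(logits, temperature=0.5)  # probability mass concentrates on word 0
flat = softmax(logits, temperature=1.5)   # probability mass spreads out

candidates = filter_top_k_top_p(softmax(logits), top_k=3, top_p=0.8)
next_token = random.choices(list(candidates), weights=list(candidates.values()))[0]

print(sharp[0] > flat[0])  # True
print(sorted(candidates))  # [0, 1]
print(next_token)
```

In this example `top_k=3` first discards the least likely word, and `top_p=0.8` then stops after the two most likely words because together they already cover more than 80% of the probability.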

When provided with a prompt, such as a specific scenario or dialogue opening, the fine-tuned model utilizes its learned knowledge to predict and generate the subsequent text. This effectively "creates" new scenes that mimic the style and tone of the show, as if they were part of an actual episode.

In [23]:
def generate_script_from_prompt(model, tokenizer, prompt, gen_length, device, temperature=0.7, top_k=50, top_p=0.95):
    model.eval()
    with torch.no_grad():
        input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)

        output = model.generate(
            input_ids,
            max_length=gen_length,  # total length in tokens, prompt included
            do_sample=True,  # Enable sampling-based generation
            temperature=temperature,
            top_k=top_k,
            top_p=top_p,
            no_repeat_ngram_size=2,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )

        generated_script = tokenizer.decode(output[0], skip_special_tokens=True)

        # Post-process: delete runs of two or more consecutive !, ?, . or , characters (this also removes ellipses)
        cleaned_script = re.sub(r'[\!\?\.\,]{2,}', '', generated_script)

    return cleaned_script

In [26]:
# Example usage
scene_prompt = "Monica and Rachel's apartment.Rachel is talking to herself and planning Monica's wedding. It is not going well."
gen_length = 150
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

generated_script = generate_script_from_prompt(model, tokenizer, scene_prompt, gen_length, device)
print(generated_script)

Monica and Rachel's apartment.Rachel is talking to herself and planning Monica's wedding. It is not going well. Monica is
wearing a veil.]
Rachel: Oh my God! I’m so sorry.
Monicas: (entering) Hi, it‘s Rachel!
Chandler: Hey! How”s it going with Emily and Chandler?
Joey: Well, I think it went very well, y“know? They‼re the best. And
they‡re my best friends. I mean, they‪re-they just…
(They hear him start to enter and find Monica‭s bridal dress.)
Ross: It‏
