## Transformers Project: Script Generation

In this project, we'll generate a new Friends TV script using GPT-2. The training data is the scripts from all 10 seasons, collected via web scraping. The goal is to generate a script for a scene at Monica and Rachel's apartment or at Central Perk.

*In general, Friends is a good choice of a show to use for this project:*


*   Limited settings
    * Monica's apartment
    * Chandler and Joey's apartment
    * Central Perk
    * Ross's apartment
*   Limited characters for an ensemble cast show (6 main characters with distinct personalities and senses of humor)
*   All scripts from all seasons are available
*   Consistent title naming ("The One With ... ")
*   Notable vernacular/slang
    * Joey's catchphrase, "How you doin'?"
    * The characters use the emphasized word "so" to modify adjectives more often than any other intensifier
    * Chandler's habit of leaving a sentence unfinished for sarcastic effect
*   Friends has also been credited with helping non-native English speakers learn the language, so it should be a good fit for a large language model

Overall, these consistencies should make it easier for the model to "learn" how the show works and to produce output that readers recognize.

*Disclaimer for scripts: This project is in no way associated with Friends, Warner Bros, NBC or Bright/Kauffman/Crane Productions. This project is for educational purposes only.*

## Environment Set-Up

In [3]:
import os
import glob
import re
import tensorflow as tf
import numpy as np
from collections import Counter
import torch
import pickle
# AdamW is imported from torch.optim below; the transformers version is deprecated
from transformers import pipeline, GPT2Tokenizer, GPT2LMHeadModel
from tqdm.auto import tqdm
import torch.nn as nn  # For neural network modules
import torch.optim as optim  # For optimizers like SGD, Adam, etc.
import torch.nn.functional as F  # For functions like activations
from torch.utils.data import DataLoader, Dataset  # For creating data loaders and custom datasets
from torch.nn.utils.rnn import pad_sequence
from torch.optim import AdamW
from shutil import copyfile

## Get the Data
The data consists of scripts from all 10 seasons, retrieved via web scraping.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
directory_path = '/content/drive/My Drive/Colab'
directory_files = os.listdir(directory_path)

In [6]:
directory_files

['Simpsons',
 'Homework1',
 'simpsons_script.ipynb',
 'Untitled0.ipynb',
 'friends']

In [7]:
# Define the path to the directory
directory = "/content/drive/My Drive/Colab/friends/cleaned_scripts"

## Tokenize Data

Tokenize each script or segment separately.

In [8]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')


In [16]:
# Paths
original_directory = "/content/drive/My Drive/Colab/friends/split_scripts"
new_directory = "/content/drive/My Drive/Colab/friends/gpt_scripts"

# Create new directory if it doesn't exist
if not os.path.exists(new_directory):
    os.makedirs(new_directory)

In [17]:
# Initialize tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Process files
for filename in os.listdir(original_directory):
    if filename.endswith('.txt'):
        filepath = os.path.join(original_directory, filename)
        with open(filepath, 'r', encoding='utf-8') as file:
            script_content = file.read()

        # Tokenize and check length
        tokens = tokenizer.encode(script_content)
        if len(tokens) <= 1024:  # GPT-2 token limit
            # Copy file to new directory
            new_filepath = os.path.join(new_directory, filename)
            copyfile(filepath, new_filepath)

print("Finished processing scripts.")


Token indices sequence length is longer than the specified maximum sequence length for this model (1565 > 1024). Running this sequence through the model will result in indexing errors


Finished processing scripts.
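
The loop above skips any script longer than GPT-2's 1,024-token context window. As an alternative that this notebook does not use, a long token sequence could be split into consecutive windows instead of being dropped. The sketch below is only an illustration: `split_into_chunks` is a hypothetical helper, and the dummy list stands in for the output of `tokenizer.encode(script_content)`.

```python
def split_into_chunks(token_ids, max_len=1024):
    """Split a list of token IDs into consecutive chunks of at most max_len."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]


# Stand-in for tokenizer.encode(script_content): 1,565 dummy token IDs,
# the same length reported in the tokenizer warning above.
dummy_tokens = list(range(1565))
chunks = split_into_chunks(dummy_tokens)

print([len(chunk) for chunk in chunks])  # [1024, 541]
```

Splitting on a fixed token count can cut a scene mid-line, so splitting on scene boundaries first would likely give cleaner training examples.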


## Set Up the Training Dataset

Ensure your custom dataset class can handle a list of tokenized scripts:

In [18]:
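def collate_fn(batch):
    # Pad every sequence in the batch to the length of the longest one.
    # Note: 0 is also a regular GPT-2 token ID, so padded positions are
    # indistinguishable from that token once the batch is built.
    batch_padded = pad_sequence(batch, batch_first=True, padding_value=0)
    return batch_padded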
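
Because the padded batch is later used as both input and labels, padded positions contribute to the training loss. One common alternative is to build an attention mask alongside the padded IDs and set the labels at padded positions to -100, the value that PyTorch's cross-entropy loss ignores by default. The sketch below uses plain Python lists so it runs without PyTorch; `pad_batch` and `IGNORE_INDEX` are hypothetical names, and using GPT-2's end-of-text ID (50256) as the pad value is an assumption, not something this notebook does.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def pad_batch(sequences, pad_id):
    """Pad token-ID lists to equal length and build a mask and labels."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)    # 0 marks padding
        labels.append(list(seq) + [IGNORE_INDEX] * n_pad)      # padding is not scored
    return input_ids, attention_mask, labels


ids, mask, labels = pad_batch([[58, 36542, 25], [58]], pad_id=50256)
print(ids)     # [[58, 36542, 25], [58, 50256, 50256]]
print(mask)    # [[1, 1, 1], [1, 0, 0]]
print(labels)  # [[58, 36542, 25], [58, -100, -100]]
```

In a real pipeline the three lists would be converted to tensors and passed to the model as `input_ids`, `attention_mask`, and `labels`.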

In [19]:
class ScriptDataset(Dataset):
    def __init__(self, directory):
        self.tokenized_scripts = []
        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

        # Read and tokenize each script in the directory
        for filename in os.listdir(directory):
            if filename.endswith('.txt'):
                filepath = os.path.join(directory, filename)
                with open(filepath, 'r', encoding='utf-8') as file:
                    script_content = file.read()
                    tokens = tokenizer.encode(script_content, truncation=True, max_length=1024)
                    self.tokenized_scripts.append(torch.tensor(tokens, dtype=torch.long))

    def __len__(self):
        return len(self.tokenized_scripts)

    def __getitem__(self, idx):
        return self.tokenized_scripts[idx]

In [20]:
# Usage
new_directory = "/content/drive/My Drive/Colab/friends/gpt_scripts"

# Create new directory if it doesn't exist
if not os.path.exists(new_directory):
    os.makedirs(new_directory)

In [21]:
dataset = ScriptDataset(new_directory)

## Dataloader

In [22]:
# Create Dataset and DataLoader
batch_size = 8
dataloader = DataLoader(dataset, batch_size=batch_size, collate_fn=collate_fn)

In [23]:
for batch in dataloader:
    print(type(batch))
    if isinstance(batch, dict):
        print(batch.keys())
    # Add more debug prints here if needed
    break  # Remove or comment out this line to inspect more batches


<class 'torch.Tensor'>


## Prepare Data for Training

In [24]:
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.train()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

### Training loop

In [27]:
# Load tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

In [28]:
def train_model(model, dataloader, epochs=3, lr=5e-5, max_seq_length=512):
    optimizer = AdamW(model.parameters(), lr=lr)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    for epoch in range(epochs):
        model.train()  # Set the model to training mode
        epoch_loss = 0
        progress_bar = tqdm(enumerate(dataloader), total=len(dataloader), desc=f"Epoch {epoch+1}/{epochs}")

        for i, batch in progress_bar:
            # Check if the batch is a tensor
            if not isinstance(batch, torch.Tensor):
                raise ValueError("Each batch should be a tensor.")

            # The batch is a single padded tensor of token IDs. For causal language
            # modeling the labels are the inputs themselves; the model shifts them
            # internally. Sequences are truncated to max_seq_length tokens.
            input_ids = batch[:, :max_seq_length].to(device)
            labels = batch[:, :max_seq_length].to(device)

            optimizer.zero_grad()

            try:
                outputs = model(input_ids=input_ids, labels=labels)
                loss = outputs.loss
                loss.backward()
                optimizer.step()

                epoch_loss += loss.item()
                progress_bar.set_postfix({'loss': loss.item()})
            except RuntimeError as e:
                if 'out of memory' in str(e):
                    print(f"WARNING: out of memory with batch {i}. If this message repeats, reduce batch size.")
                    torch.cuda.empty_cache()
                else:
                    raise e

        print(f"Average Loss Epoch {epoch+1}: {epoch_loss / len(dataloader)}")


In [29]:
train_model(model, dataloader, epochs=3)

Epoch 1/3:   0%|          | 0/300 [00:00<?, ?it/s]

Average Loss Epoch 1: 1.819843972325325


Epoch 2/3:   0%|          | 0/300 [00:00<?, ?it/s]

Average Loss Epoch 2: 1.6850700122614701


Epoch 3/3:   0%|          | 0/300 [00:00<?, ?it/s]

Average Loss Epoch 3: 1.6199626463154952
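
The reported loss is the average cross-entropy per token (in nats), so it can be converted to perplexity with `exp(loss)`. The short calculation below applies that formula to the three epoch averages printed above. Keep in mind that these are training-set figures and that the loss here also covers padded positions, so the numbers are a rough indication rather than a held-out evaluation.

```python
import math

# Average training loss per epoch, copied from the output above.
epoch_losses = [1.819843972325325, 1.6850700122614701, 1.6199626463154952]

# Perplexity is the exponential of the average cross-entropy loss.
perplexities = [math.exp(loss) for loss in epoch_losses]

for epoch, (loss, ppl) in enumerate(zip(epoch_losses, perplexities), start=1):
    print(f"Epoch {epoch}: loss={loss:.4f}, perplexity={ppl:.2f}")
```

Perplexity drops from roughly 6.2 to roughly 5.1 across the three epochs, which matches the steady decrease in loss.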


## Save the trained model

In [30]:
# Define the paths where you want to save the model and tokenizer
model_save_path = '/content/drive/My Drive/Colab/friends/gpt2_model'
tokenizer_save_path = '/content/drive/My Drive/Colab/friends/gpt2_model'

In [31]:
# Save the model and tokenizer
model.save_pretrained(model_save_path)
tokenizer.save_pretrained(tokenizer_save_path)

('/content/drive/My Drive/Colab/friends/gpt2_model/tokenizer_config.json',
 '/content/drive/My Drive/Colab/friends/gpt2_model/special_tokens_map.json',
 '/content/drive/My Drive/Colab/friends/gpt2_model/vocab.json',
 '/content/drive/My Drive/Colab/friends/gpt2_model/merges.txt',
 '/content/drive/My Drive/Colab/friends/gpt2_model/added_tokens.json')

In [None]:
# Load the trained model and tokenizer
model = GPT2LMHeadModel.from_pretrained(model_save_path)
tokenizer = GPT2Tokenizer.from_pretrained(tokenizer_save_path)

# Make sure to set the model to the correct device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


## Generate TV Script
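This section generates the TV script. Set `gen_length` to the length, in tokens, of the script you want to generate. In the first version of the function it is passed to `generate` as `max_length`, so the prompt tokens count toward that limit.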

In [None]:
def generate_script(model, tokenizer, scene, gen_length, device, temperature=0.7, top_k=50, top_p=0.95):
    model.eval()
    with torch.no_grad():
        input_ids = tokenizer.encode(f"[Scene: {scene}]", return_tensors='pt').to(device)

        output = model.generate(
            input_ids,
            max_length=gen_length,
            do_sample=True,  # Enable sampling-based generation
            temperature=temperature,
            top_k=top_k,
            top_p=top_p,
            no_repeat_ngram_size=2,
            pad_token_id=tokenizer.eos_token_id
        )

        generated_script = tokenizer.decode(output[0], skip_special_tokens=True)

    return generated_script

The parameters temperature, top_k, and top_p are used to control the randomness and diversity of the text generated by a language model like GPT-2.



*   temperature: Rescales the model's probabilities before sampling, which controls randomness in text generation
    * Closer to 0: More predictable, conservative text; the model favors more likely words.
    * Closer to 1 (or above): More diverse, creative text; the model is more likely to choose less probable words.
*   top_k: Limits the model to the k most likely next words, filtering out highly improbable words. It is an integer count (40 or 50 in this notebook), not a value between 0 and 1
    * Smaller k: More predictable text; the model considers fewer options for each prediction.
    * Larger k: More diverse text; the model considers a larger set of possible next words.
*   top_p: Nucleus sampling; chooses words from the smallest set whose cumulative probability exceeds the threshold p, allowing dynamic adjustment based on the model's certainty
    * Closer to 0: Very conservative; the model selects from a very small set of highly probable words.
    * Closer to 1: More inclusive and diverse; the model considers a broader set of possible words.

Each of these parameters tunes the trade-off between the coherence and predictability of the generated text (lower values) and its creativity and diversity (higher values).
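
To make the three parameters concrete, the toy example below applies each one to a made-up four-word distribution. It is a simplified sketch in plain Python, not the implementation used inside `model.generate`; the function names, the words, and the logits are all invented for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable entries, zero out the rest, and renormalize."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]


words = ["coffee", "pizza", "dinosaur", "sandwich"]  # invented candidates
logits = [2.0, 1.0, 0.5, 0.1]                        # invented scores

cool = softmax(logits, temperature=0.5)
warm = softmax(logits, temperature=1.0)
only_two = top_k_filter(warm, k=2)
nucleus = top_p_filter(warm, p=0.9)

for name, dist in [("T=0.5", cool), ("T=1.0", warm), ("top_k=2", only_two), ("top_p=0.9", nucleus)]:
    print(name, {w: round(q, 3) for w, q in zip(words, dist)})
```

At temperature 0.5 the top word takes a larger share than at 1.0, `top_k=2` leaves exactly two candidates, and `top_p=0.9` keeps three because the two most likely words together fall short of 0.9.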

In [None]:
def generate_script(model, tokenizer, scene, gen_length, device, temperature=0.7, top_k=50, top_p=0.95, verbose=False):
    model.eval()
    with torch.no_grad():
        input_ids = tokenizer.encode(f"[Scene: {scene}]", return_tensors='pt').to(device)

        try:
            output = model.generate(
                input_ids,
                max_length=gen_length,
                do_sample=True,
                temperature=temperature,
                top_k=top_k,
                top_p=top_p,
                no_repeat_ngram_size=2,
                pad_token_id=tokenizer.eos_token_id
            )
            generated_script = tokenizer.decode(output[0], skip_special_tokens=True)
            # Post-process here if necessary
        except Exception as e:
            generated_script = f"Error in generation: {str(e)}"
            if verbose:
                print(f"Generation failed with error: {str(e)}")

    return generated_script


In [None]:
# Example usage
scene = "Monica and Rachel's Apartment"
gen_length = 2     # Only 2 tokens, so the output below is little more than the prompt; increase for a full scene
temperature = 0.8  # Adjust as needed
top_k = 40         # Adjust as needed
top_p = 0.9        # Adjust as needed
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
generated_script = generate_script(model, tokenizer, scene, gen_length, device, temperature, top_k, top_p)
print(generated_script)

Encoded Input IDs: tensor([[   58, 36542,    25, 23240,   290, 15984,   338,  5949,  1823,    60]],
       device='cuda:0')
[Scene: Monica and Rachel's Apartment]





In [None]:
def generate_script(model, tokenizer, scene, gen_length, device, temperature=0.7, top_k=50, top_p=0.95):
    model.eval()
    with torch.no_grad():
        input_ids = tokenizer.encode(f"[Scene: {scene}]", return_tensors='pt').to(device)

        print(f"Encoded Input IDs: {input_ids}")  # Debug: Inspect input IDs

        output = model.generate(
            input_ids,
            max_length=gen_length + input_ids.shape[-1],  # Ensure generation length is beyond the prompt
            do_sample=True,
            temperature=temperature,
            top_k=top_k,
            top_p=top_p,
            no_repeat_ngram_size=2,
            pad_token_id=tokenizer.eos_token_id
        )

        generated_script = tokenizer.decode(output[0], skip_special_tokens=True)

    return generated_script
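
The reason this version adds the prompt length is that `max_length` caps the prompt and the generated tokens together. The arithmetic below uses the 10-token scene prompt shown in the output above; `new_token_budget` is a hypothetical helper written only to illustrate the calculation.

```python
def new_token_budget(prompt_len, max_length):
    """Tokens left for generation when max_length covers prompt plus output."""
    return max(0, max_length - prompt_len)


prompt_len = 10  # "[Scene: Monica and Rachel's Apartment]" encodes to 10 token IDs
gen_length = 2

budget_without_fix = new_token_budget(prompt_len, max_length=gen_length)
budget_with_fix = new_token_budget(prompt_len, max_length=gen_length + prompt_len)

print(budget_without_fix)  # 0
print(budget_with_fix)     # 2
```

The `generate` method also accepts `max_new_tokens`, which counts only the generated tokens and avoids this bookkeeping.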


In [None]:
def generate_script_from_prompt(model, tokenizer, prompt, gen_length, device, temperature=0.7, top_k=50, top_p=0.95):
    model.eval()
    with torch.no_grad():
        input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)

        output = model.generate(
            input_ids,
            max_length=gen_length,
            do_sample=True,  # Enable sampling-based generation
            temperature=temperature,
            top_k=top_k,
            top_p=top_p,
            no_repeat_ngram_size=2,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )

        generated_script = tokenizer.decode(output[0], skip_special_tokens=True)

        # Post-process to remove sequences of repetitive punctuation
        cleaned_script = re.sub(r'[\!\?\.\,]{2,}', '', generated_script)

    return cleaned_script
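
The post-processing regex is worth checking on a few strings, because it deletes any run of two or more of `!`, `?`, `.`, or `,` rather than collapsing it to a single mark. The sample sentences below are invented for illustration.

```python
import re

# Same pattern as in generate_script_from_prompt.
PUNCT_RUN = re.compile(r'[\!\?\.\,]{2,}')

samples = [
    "Oh my God!!! How did you get here??",
    "Well... I don't know.",
    "Hi, Ross. Coffee?",
]
cleaned = [PUNCT_RUN.sub('', text) for text in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Single punctuation marks are untouched, but ellipses and emphatic endings disappear entirely, so a sentence can lose its final punctuation. Replacing each run with its first character would be a gentler option if that matters for the output.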

In [None]:
# Example usage
scene_prompt = "Central Perk. Rachel and Ross discuss their plans for the evening."
gen_length = 150
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

generated_script = generate_script_from_prompt(model, tokenizer, scene_prompt, gen_length, device)
print(generated_script)

Central Perk. Rachel and Ross discuss their plans for the evening.
Rachel: (entering) Oh my God, how did you get here?
Ross: I don’t know. I mean, I”ve been out here all day. We
can‘t find a hotel. You“re going to have to find one!
Chandler: Hey! I just got back from the doctor‖s appointment, why don
you come in. It―s a little late, but I really want to talk to you. (Sits down
on the couch.)
Joey: All right, we‼re gonna start by getting some coffee and some ice cream.


