This blog is divided into four parts. Let's see what we are going to do in each part and understand the WHYs behind it.
- Part 1: Data Collection and Data Preprocessing - Here we will collect the movie scripts from the internet. We will then preprocess the data to make it ready for ingestion.
- Part 2: Data Ingestion and Data Indexing - Here we will ingest and index the data into the Qdrant Vector Database using FastEmbed.
- Part 3: RAG-based chatbot for mimicking the conversation - Here we will use the indexed data to build a character chatbot using the Haystack framework.
- Part 4: Building a User Interface for the chatbot - Finally, we will build a user interface for the chatbot using Streamlit.

Now, let's go through each part one by one and look into the implementation details.

Before we move any further, let's look at the list of superheroes whose dialogues we are going to use to build the Character Chatbot. I have also noted down the real name of each superhero, since some scripts use it in place of the superhero name, and the movie names for each superhero in our list. All these details are saved in a .yaml file. Let's see the contents of the file.
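The file itself is not reproduced here, so the following is only a sketch of the structure that the later code appears to expect, written as the Python dict that `yaml.safe_load` would return. The three top-level keys are the ones read by the code further down; the superhero entry and the script file name are hypothetical placeholders.

```python
# Hypothetical example of the structure that config.yaml is expected to hold.
# Only the three top-level keys are taken from the code in this blog;
# the values below are placeholders for illustration.
example_config = {
    # superheroes to loop over
    "LIST_OF_SUPERHEROES": ["Batman"],
    # superhero name -> list whose first entry is the real name
    "SUPERHERO_SYNONYMS": {"Batman": ["Bruce Wayne"]},
    # superhero name -> list of script files (placeholder file name)
    "MOVIES_LIST_OF_SUPERHEROES": {"Batman": ["example_batman_script.pdf"]},
}

for hero in example_config["LIST_OF_SUPERHEROES"]:
    real_name = example_config["SUPERHERO_SYNONYMS"][hero][0]
    scripts = example_config["MOVIES_LIST_OF_SUPERHEROES"][hero]
    print(hero, "|", real_name, "|", scripts)
```

Keeping these details in one file means a new superhero can be added without touching the extraction code.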

Now that we have seen all the superheroes whose dialogues we are going to use, let's move to the next part, where we will preprocess the data to extract the dialogues along with a small amount of context from the movie scripts. We need not only the dialogues but also the context in which they are spoken. This will help us in building a better chatbot.

Let's first define some constants, such as the location of the data files.

In [None]:
root = '..'
data_folder = 'data' # folder where all the data is stored
script_folder = 'scripts' # folder where all the scripts are stored
config_file = 'config.yaml' # file where the configuration is stored

Great! Now let's load the configuration file so that we can use the details of the superheroes in our code.

In [1]:

import os
import re
# used to load the pdfs
import pymupdf
from tqdm.notebook import tqdm
from os.path import join as pjoin

In [4]:
import yaml

# load the configuration
with open(pjoin(root, config_file), 'r') as f:
    config = yaml.safe_load(f)

# used to loop through the scripts
list_of_superheroes = config['LIST_OF_SUPERHEROES']

# essential for dialogue extraction:
# in some scripts, the superhero name and the real name are used interchangeably
superhero_synonyms = config['SUPERHERO_SYNONYMS']

# used to get the relevant scripts for a particular superhero
movies_list_of_superheroes = config['MOVIES_LIST_OF_SUPERHEROES']

In [5]:
# used to save the dialogues that are extracted from the scripts
dialogue_folder = 'dialogues'
# with each dialogue, we keep some of the text that precedes it as context so that the model can understand the dialogue better
# the context length is measured in characters and is kept small
max_context_length = 100
# the extracted dialogues are joined with a special separator between them
# this is done so as to have one single txt file per movie for a particular superhero
dialogues_joiner = '\n|_/-|_/-|_/-|_/-|_/-|_/-|_/-|_/-|_/-|_/-|\n\n'

# used to load the pdfs of the scripts
data_folder_path = pjoin(root, data_folder)
all_movie_scripts = os.listdir(pjoin(root, data_folder, script_folder))

Here we have defined some additional constants: the folder where we will save the dialogues, the maximum length (in characters) of the context that we keep with each extracted dialogue, and the special separator that we use to join the dialogues so that we have one single txt file per movie for a particular superhero.
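As a small illustration of why a distinctive separator is useful, the sketch below joins two made-up records and then splits the file contents back into the same records. The record texts are invented for this example; only the separator string is taken from the code above.

```python
# separator copied from the constants above
dialogues_joiner = '\n|_/-|_/-|_/-|_/-|_/-|_/-|_/-|_/-|_/-|_/-|\n\n'

# made-up records standing in for extracted dialogues with context
records = [
    "ALFRED\nDinner is served.\nBATMAN\nNot tonight.",
    "GORDON\nHe is gone.\nBATMAN\nI'm right here.",
]

# this is what would be written to the txt file
file_contents = dialogues_joiner.join(records)

# later, the file can be split back into the individual records
recovered = file_contents.split(dialogues_joiner)
print(len(recovered))
```

Because the separator is unlikely to occur in a movie script, splitting on it recovers the original records without ambiguity.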

In [7]:
def extract_text_from_pdf(pdf_path):
    '''
    This function extracts the text from a pdf file using the pymupdf library
    '''
    # open the pdf file; the context manager closes it once we are done
    with pymupdf.open(pdf_path) as pdf:
        # extract the text from each page and join the pages into one string
        text = ''.join(page.get_text() for page in pdf)
    return text

def get_all_superhero_names(superhero, superhero_synonyms):
    '''
    With each superhero, we will have a list of names that the superhero can be referred to in the script
    For example, for Batman, the names can be Batman, Bruce Wayne, Bruce, Wayne, Bruce-Wayne, etc.
    '''

    # get the first synonym of the superhero, which is essentially the real name
    superhero_synonym = superhero_synonyms[superhero][0]
    # get all the possible names of the superhero
    superhero_names = [superhero.upper(), superhero_synonym.upper(), superhero_synonym.replace(' ', '-').upper()]
    superhero_names = superhero_names + [i.upper() for i in superhero_synonym.split()]
    return superhero_names

def split_script_by_superhero_dialogue(script_text, superhero_names):
    '''
    This function splits the script such that we have the split points where the dialogues of the superhero start
    Since we're only interested in the dialogues of the superhero, we will split the script based on the dialogues of the superhero
    '''
    # we will find all the matches of the superhero names in the script
    # the names are escaped so that characters like '-' or '.' are matched literally
    name_pattern = "|".join(re.escape(name) for name in superhero_names)
    matches = re.finditer(name_pattern, script_text)
    # get the split points where the dialogues of the superhero start
    # note that the first match is skipped by the [1:] slice
    split_points = [match.start() for match in matches][1:] + [len(script_text)]
    # each split runs from one mention of the superhero to the next one
    extrcated_split_script_text = [script_text[split_points[i]:split_points[i+1]] for i in range(len(split_points) - 1)]
    return extrcated_split_script_text

def remove_extra_charachters_dialogue_from_each_split(extrcated_split_script_text, max_extra_dialogues=3):
    '''
    This function trims the dialogues of other characters from each split extracted from the script
    A line that has only uppercase characters, spaces, and some special characters
    is treated as a speaker cue, i.e. the start of a new dialogue
    Each split is cut at the last of its first max_extra_dialogues cue lines and the rest is removed
    '''
    # pattern to check if a line has only uppercase characters, whitespace, and some special characters
    pattern = re.compile(r'^[A-Z\s\'().,-]+$', re.MULTILINE)
    extrcated_split_script_text_filtered = []

    for split_text in extrcated_split_script_text:
        # get the start indices of all the cue lines in the split
        indices = [match.start() for match in pattern.finditer(split_text)]
        # splits without any cue line are dropped
        if len(indices) >= 1:
            # with a single cue line we keep the whole split,
            # otherwise we cut at the last of the first max_extra_dialogues cue lines
            max_indices = len(split_text) if len(indices) == 1 else indices[:max_extra_dialogues][-1]
            extrcated_split_script_text_filtered.append(split_text[:max_indices])

    return extrcated_split_script_text_filtered

def combine_dialogue_with_context(script_text, extrcated_split_script_text_filtered, max_context_length):
    '''
    Combine each dialogue with the text that precedes it in the script.
    A dialogue without context is hard to interpret, so the preceding text is kept to help the model understand it.
    '''
    dialogue_with_context_all = []
    # loop through all the dialogues
    for dialogue in extrcated_split_script_text_filtered:
        # get the index of the first occurrence of the dialogue in the script
        dialogue_idx = script_text.find(dialogue)
        # clamp the start of the context at 0; a negative start would make the slice count from the end of the script
        context_start = max(0, dialogue_idx - max_context_length)
        # add the preceding context to the current dialogue and append it to the list
        dialogue_with_context = script_text[context_start:dialogue_idx] + dialogue
        dialogue_with_context_all.append(dialogue_with_context)

    return dialogue_with_context_all

# loop through all the superheroes
for superhero in tqdm(list_of_superheroes):
    # get all the names of the superhero and create the folder where the dialogues are saved
    superhero_names = get_all_superhero_names(superhero, superhero_synonyms)
    superhero_dialogue_save_path = pjoin(data_folder_path, dialogue_folder, superhero)
    os.makedirs(superhero_dialogue_save_path, exist_ok=True)

    # loop through all the scripts of the superhero
    for script in movies_list_of_superheroes[superhero]:
        # the dialogues are saved under the name of the script with a .txt extension
        save_script_name = os.path.splitext(script)[0] + '.txt'
        script_path = pjoin(data_folder_path, script_folder, script)

        # extract the text from the pdf
        script_text = extract_text_from_pdf(script_path)
        # split the script based on the dialogues of the superhero
        extrcated_split_script_text = split_script_by_superhero_dialogue(script_text, superhero_names)
        # trim the dialogues of other characters from each split
        extrcated_split_script_text_filtered = remove_extra_charachters_dialogue_from_each_split(extrcated_split_script_text, max_extra_dialogues=3)
        # combine the dialogues with the context that precedes them
        dialogues_with_context = combine_dialogue_with_context(script_text, extrcated_split_script_text_filtered, max_context_length)
        # join the dialogues with the separator
        dialogues_with_context_combined = dialogues_joiner.join(dialogues_with_context)
        # save the dialogues with the context to a txt file
        with open(pjoin(superhero_dialogue_save_path, save_script_name), 'w', encoding='utf-8') as f:
            f.write(dialogues_with_context_combined)

The code above does the following:
- Extracts the text from the pdf file using the PyMuPDF library.
- Splits the script at every mention of the superhero, since we are only interested in the dialogues of the superhero.
- Trims the dialogues of other characters from each split. A line that has only uppercase characters, spaces, and some special characters is treated as the start of a new dialogue, and each split is cut once the limit of 3 such lines is reached.
- Combines each dialogue with the text that precedes it in the script, up to max_context_length characters. This is essential, because a dialogue without context is hard to interpret.
- Saves the dialogues with their context to a txt file, one file per movie for each superhero.
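To make the splitting step more concrete, here is a minimal, self-contained sketch that builds the name variants and splits a tiny made-up script. The helper names `name_variants` and `split_on_names` are hypothetical stand-ins for the functions above, and the toy script is invented for illustration.

```python
import re

def name_variants(superhero, real_name):
    # e.g. Batman / Bruce Wayne -> BATMAN, BRUCE WAYNE, BRUCE-WAYNE, BRUCE, WAYNE
    names = [superhero.upper(), real_name.upper(), real_name.replace(' ', '-').upper()]
    names += [part.upper() for part in real_name.split()]
    return names

def split_on_names(script_text, names):
    # escape the names so that they are matched literally
    pattern = "|".join(re.escape(name) for name in names)
    starts = [match.start() for match in re.finditer(pattern, script_text)]
    # as in the code above, the first match is skipped
    split_points = starts[1:] + [len(script_text)]
    return [script_text[split_points[i]:split_points[i + 1]] for i in range(len(split_points) - 1)]

# made-up script: a title line followed by four short dialogues
toy_script = (
    "BATMAN: A TOY SCRIPT\n\n"
    "ALFRED\nYou are late, sir.\n\n"
    "BATMAN\nTraffic.\n\n"
    "GORDON\nAny leads?\n\n"
    "BRUCE\nOne.\n"
)

names = name_variants("Batman", "Bruce Wayne")
splits = split_on_names(toy_script, names)
for split in splits:
    print(repr(split))
```

In this toy script the first match is the title line, so skipping it does no harm. Each remaining split starts at a mention of the superhero and runs until the next one, which is why it can still contain the dialogues of other characters.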

So, based on these preprocessing steps, we have now successfully converted the movie scripts into dialogues with context, only for the superheroes that we are interested in. This will help us in building a better chatbot, since the quality of the data is very important for the model to give good results. Currently, each dialogue that we created for a superhero has 3 main parts: the context, the dialogue itself, and the dialogues of other characters that follow it.
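The three parts can be seen on a small example. The sketch below applies the same cue-line pattern and the same context slicing to a made-up split. The texts are invented for illustration, and the variable names are chosen for this example only.

```python
import re

# same pattern as in the preprocessing code: lines with only uppercase
# characters, whitespace, and a few special characters count as speaker cues
cue_pattern = re.compile(r"^[A-Z\s'().,-]+$", re.MULTILINE)

max_extra_dialogues = 3
max_context_length = 100

# made-up script: one line before the superhero speaks, two characters after
preceding_text = "ALFRED\nYou are late, sir.\n\n"
split_text = "BATMAN\nTraffic.\n\nGORDON\nAny leads?\n\nALFRED\nTea, sir?\n"
script_text = preceding_text + split_text

# cut the split at the last of its first max_extra_dialogues cue lines
indices = [match.start() for match in cue_pattern.finditer(split_text)]
cut = len(split_text) if len(indices) == 1 else indices[:max_extra_dialogues][-1]
trimmed = split_text[:cut]

# prepend up to max_context_length characters of preceding text
start = script_text.find(trimmed)
context = script_text[max(0, start - max_context_length):start]
record = context + trimmed
print(record)
```

With a limit of 3 cue lines, the record keeps the superhero's dialogue and one following dialogue of another character (GORDON here), while the third cue (ALFRED's second line) is cut off.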

Ideally, we would like to have only the context and the dialogue, removing the dialogues of other characters along with the screenplay directions that are present in the dialogues. For this blog, I have kept the data as it is: context, dialogue, and the dialogues of other characters. But you can always modify the code to remove the other characters' dialogues and the screenplay directions as well. One other possible post-extraction preprocessing step is to use LLMs to extract just the dialogues and the context. This will help in improving the quality of the data. I have provided the code for this in the code section below, but I have not used it in this blog since filtering the dialogues using LLMs is a time- and resource-consuming process. You can always use it to improve the quality of the data.

In [None]:
# import torch
# import transformers


# def load_model_pipeline(model_id, batch_size):
#     '''
#     Load the model pipeline with the model id and the batch size
#     '''
#     pipeline = transformers.pipeline(
#         "text-generation",
#         model=model_id,
#         model_kwargs={"torch_dtype": torch.bfloat16},
#         device_map="auto",
#         batch_size=batch_size,
#     )

#     torch.backends.cuda.enable_mem_efficient_sdp(False)
#     torch.backends.cuda.enable_flash_sdp(False)
#     return pipeline

# def extract_dialogue_from_llm(pipeline, messages):
#     '''
#     Extract the dialogues only from the model based on the messages
#     '''
#     pipeline.tokenizer.pad_token_id = pipeline.tokenizer.eos_token_id
#     pipeline.tokenizer.padding_side = 'left'

#     terminators = [
#         pipeline.tokenizer.eos_token_id,
#         pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
#     ]

#     outputs = pipeline(
#         messages,
#         max_new_tokens=256,
#         eos_token_id=terminators,
#         do_sample=True,
#         temperature=1,
#         top_p=1,
#     )

#     return outputs

# def create_batch(extrcated_split_script_text, superhero, superhero_names, batch_size):
#     '''
#     To make the extraction of the dialogues faster, we will create a batch of messages
#     Each message holds the system prompt and the user prompt for one extracted text
#     '''
#     # quoted, comma-separated identifiers, built once so that the f-strings below stay valid on older Python versions
#     names = ", ".join(f"'{i}'" for i in superhero_names)
#     messages_batch = []
#     for extracted_text in tqdm(extrcated_split_script_text[:batch_size]):
#         system_prompt = f"You are a movie dialogue separator. From the context you are given, separate the dialogue and provide the dialogue of a character. You are only allowed to give the final dialogue without anything else. Don't say anything else, just list the dialogue. Always start with the NAME of the character followed by a colon and then the dialogue. The extracted dialogue should always be in a single line. Make sure that you extract all the dialogues of the requested characters. They can be present in multiple lines. These are the identifiers of the characters whose dialogues you need to extract: {names}. The identifiers are always in capital letters."
#         user_prompt = f"Extract only the dialogues of {superhero.upper()} - Synonyms of {superhero.upper()} are {names}. Now extract the dialogue based on the synonyms given from the following text\n\n\n\n {extracted_text} \n\n\n\n\n Make sure you only extract the dialogue of {names}. The dialogue starts only after the name of the character in capital letters."

#         messages = [
#             {"role": "system", "content": system_prompt},
#             {"role": "user", "content": user_prompt},
#         ]
#         messages_batch.append(messages)
#     return messages_batch

# batch_size = 8
# model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# pipeline = load_model_pipeline(model_id, batch_size)
# messages_batch = create_batch(extrcated_split_script_text, superhero, superhero_names, batch_size)
# extrcated_dialogue = extract_dialogue_from_llm(pipeline, messages_batch)
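
The system prompt above asks the model to answer with one line per dialogue in the form `NAME: dialogue`. Assuming the model follows that format (which is not guaranteed), a minimal sketch for turning the generated text into (name, dialogue) pairs could look like this. The function name `parse_llm_dialogue_lines` and the sample output are hypothetical.

```python
def parse_llm_dialogue_lines(generated_text, superhero_names):
    '''
    Keep only the lines of the form "NAME: dialogue" whose NAME is one of the superhero names
    '''
    allowed = {name.upper() for name in superhero_names}
    dialogues = []
    for line in generated_text.splitlines():
        # split at the first colon only, since the dialogue itself may contain colons
        name, separator, spoken = line.partition(":")
        if separator and name.strip().upper() in allowed and spoken.strip():
            dialogues.append((name.strip().upper(), spoken.strip()))
    return dialogues

# made-up model output, including lines that should be ignored
sample_output = "BATMAN: Traffic.\nGORDON: Any leads?\nBRUCE: One.\nSome stray commentary"
parsed = parse_llm_dialogue_lines(
    sample_output, ["BATMAN", "BRUCE WAYNE", "BRUCE-WAYNE", "BRUCE", "WAYNE"]
)
print(parsed)
```

Filtering on the known names discards lines of other characters and any extra commentary that the model adds despite the instructions.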

In conclusion, building a superhero character AI is a fascinating endeavor that merges technology with the beloved world of superheroes. By following the outlined approach, we can create an engaging platform where fans can interact with their favorite characters in an unprecedented way. This project not only showcases the potential of AI in entertainment and gaming but also highlights the creative possibilities that arise when technology intersects with popular culture. With the code committed to GitHub and a user-friendly interface ready for interaction, this superhero chat AI stands as a testament to the innovative capabilities of modern AI and its power to transform how we engage with fictional worlds.




Finally, we have successfully built a superhero character chatbot using the Haystack framework, along with a user interface for it using Streamlit. This chatbot can be used to interact with the superheroes and get responses from them. Though this started as a fun project, it has the potential to be expanded into a full-fledged chatbot that can engage with users on a variety of topics. Building a chatbot for superhero characters is a great way to engage with fans and provide them with a unique experience, and it can be further improved by adding more superheroes and dialogues to the dataset. I hope you enjoyed reading this blog and learned something new. If you have any questions or feedback, feel free to leave a comment below. Thank you for reading!