# Introduction

This notebook walks through building a retrieval-augmented generation (RAG) system that injects writing prompts into a user's conversation with ChatGPT.

We will use a dataset of plot synopses for individual issues of the Invincible comic book series.

We will be using the following source: https://comic-invincible.fandom.com/wiki/Invincible_(Comic_Series)
While this does not contain the full list of issues for the comics, it is a good starting point that can be expanded upon later.

In [None]:
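# Installation

!pip install beautifulsoup4 requests

!pip install openai

# The remaining packages imported in the next cell
!pip install python-dotenv pandas transformers torch pymongo tqdm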


In [1]:
# Imports
import os
from dotenv import load_dotenv

import requests
from bs4 import BeautifulSoup

import pandas as pd

from transformers import AutoTokenizer, AutoModel
import torch

import pymongo
from tqdm.notebook import tqdm

from openai import OpenAI

In [2]:
# Get secrets

load_dotenv()

mongo_connection_string = os.getenv('MONGO_CONNECTION_STRING')
openai_api_key = os.getenv('OPENAI_API_KEY')

# Step 1: Load the dataset into our vector database

## Scrape index page to find issue links


In [3]:
# URL of the index page listing all issues
base_url = "https://comic-invincible.fandom.com"
index_url = "https://comic-invincible.fandom.com/wiki/Invincible_(Comic_Series)"
index_response = requests.get(index_url)
index_soup = BeautifulSoup(index_response.content, 'html.parser')

volumes_and_issues_header = index_soup.find('span', id='Volumes_and_Issues').parent
issue_links = []

for sibling in volumes_and_issues_header.find_next_siblings():
    if sibling.name == 'h3':
        # Find the next <ul> tag after the <h3>
        next_ul = sibling.find_next_sibling('ul')
        if next_ul:
            # Extract all <a> tags within the <ul>
            for a_tag in next_ul.find_all('a'):
                issue_links.append(base_url + a_tag['href'])
    elif sibling.name == 'div':
        # Break if a new div is encountered
        break

print(issue_links)

['https://comic-invincible.fandom.com/wiki/Invincible_Vol_1_1', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_1_2', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_1_3', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_1_4', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_2_1', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_2_2', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_2_3', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_2_4', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_3_1', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_3_2', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_3_3', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_3_4', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_3_5', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_4_1', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_4_2', 'https://comic-invincible.fandom.com/wiki/Invincible_V

## Scrape issue links to get plot synopses

In [11]:
all_plot_synopses = []

def get_plot_synopsis(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find the "Plot Synopsis" header
    plot_header = soup.find('span', id='Plot_Synopsis') or soup.find('span', id='Synopsis_for_the_1st_Story')
    if plot_header:
        plot_header = plot_header.parent
    else:
        print("No plot synopsis found for", url)
        return  # Skip if no relevant header is found
    
    # Initialize a list to store the plot paragraphs
    plot_paragraphs = []

    # Iterate over the siblings after the plot header until encountering a different type of tag
    for sibling in plot_header.find_next_siblings():
        if sibling.name == 'p':
            plot_paragraphs.append(sibling.get_text())
        else:
            break
    
    full_synopsis = '\n'.join(plot_paragraphs)
    return full_synopsis

for link in issue_links:
    synopsis = get_plot_synopsis(link)
    # Skip pages without a synopsis so later cells never receive None or empty text
    if synopsis:
        all_plot_synopses.append(synopsis)

# Convert the list of strings into a pandas DataFrame
synopses_df = pd.DataFrame(all_plot_synopses, columns=["synopsis"])
print(synopses_df.head())


                                            synopsis
0  Four months into the future, a flying teenage ...
1  In a flashback to when Mark Grayson was seven,...
2  There's been a rash of disappearances at Mark'...
3  Mark flies into the Teen Team base and asks Ro...
4  Mark receives a call from his father in his ro...


In [13]:
def count_total_words(text_array):
    total_words = sum(len(sentence.split()) for sentence in text_array)
    return total_words

total_words = count_total_words(all_plot_synopses)
print(f"Total number of words: {total_words}")

print(f"Total documents: {len(all_plot_synopses)}")

Total number of words: 20868
Total documents: 42
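
The counts above work out to roughly 500 words per synopsis (20868 / 42). The embedding model used in the next section, `intfloat/e5-large-v2`, truncates inputs at 512 tokens, so the tail of a longer synopsis is likely dropped when it is embedded as a single document. One possible mitigation is to split each synopsis into overlapping word windows before embedding. The sketch below is only an illustration: `chunk_words`, `max_words`, and `overlap` are hypothetical names and values, and the rest of this notebook embeds whole synopses.

```python
def chunk_words(text, max_words=200, overlap=40):
    """Split text into overlapping windows of at most max_words words.

    Hypothetical helper: the window size and overlap are illustrative
    values, not settings taken from this notebook.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a window has reached the end of the text
        if start + max_words >= len(words):
            break
    return chunks


# Toy input of 450 "words" standing in for a long synopsis
sample = " ".join(f"w{i}" for i in range(450))
chunks = chunk_words(sample)
print([len(chunk.split()) for chunk in chunks])
```

Word counts are only a rough proxy for token counts, so a window sized with the model's own tokenizer would be more precise.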


## Create vector embeddings

In [3]:
# Initialize the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-large-v2")
model = AutoModel.from_pretrained("intfloat/e5-large-v2").to("cuda" if torch.cuda.is_available() else "cpu")

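# Embed a single text. E5 models expect a "query: " or "passage: " prefix on every input.
def get_embedding(text, prefix="passage"):
    prefixed_text = f"{prefix}: {text}"
    # truncation=True cuts inputs that exceed the model's maximum length (512 tokens for e5-large-v2)
    tokens = tokenizer(prefixed_text, padding=True, truncation=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**tokens)
    # Mean-pool the token vectors into one vector; a single input has no padding to exclude
    embedding = outputs.last_hidden_state.mean(dim=1).cpu().numpy()
    return embedding.flatten().tolist()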
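
`get_embedding` averages `last_hidden_state` over every token position. That is fine for one text at a time, because a single input contains no padding. If several texts were batched together, the padded positions would need to be left out of the average. The sketch below shows that masked mean on plain Python lists so it runs without a model; `masked_mean_pool` is a hypothetical helper, and with PyTorch the same idea would use the tokenizer's `attention_mask`.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1, ignoring padding."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("token_vectors and attention_mask must have the same length")

    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")

    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Two real tokens followed by one padding position
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
pooled = masked_mean_pool(tokens, [1, 1, 0])    # padding excluded
unmasked = masked_mean_pool(tokens, [1, 1, 1])  # padding drags the mean toward zero
print(pooled, unmasked)
```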

In [22]:
# Enable tqdm with pandas
tqdm.pandas()

# Create embeddings and store them as a new field
synopses_df["synopsis_embedding"] = synopses_df["synopsis"].progress_apply(lambda x: get_embedding(x, prefix="passage"))
print(synopses_df[["synopsis", "synopsis_embedding"]].head())

100%|██████████| 42/42 [01:28<00:00,  2.11s/it]

                                            synopsis  \
0  Four months into the future, a flying teenage ...   
1  In a flashback to when Mark Grayson was seven,...   
2  There's been a rash of disappearances at Mark'...   
3  Mark flies into the Teen Team base and asks Ro...   
4  Mark receives a call from his father in his ro...   

                                  synopsis_embedding  
0  [0.33630841970443726, -1.911747932434082, 0.17...  
1  [0.6546132564544678, -1.355846881866455, 0.401...  
2  [0.10531767457723618, -1.6947165727615356, 0.3...  
3  [0.7967595458030701, -1.3208731412887573, 0.30...  
4  [0.5341870784759521, -1.0583305358886719, -0.0...  





## Store the data in Atlas

In [23]:
# Get the vector size of the embeddings
vector_size = len(synopses_df['synopsis_embedding'].iloc[0])

print(f"The vector size of the embeddings is: {vector_size}")

The vector size of the embeddings is: 1024


In [4]:
# Connect to Atlas cluster
mongo_client = pymongo.MongoClient(mongo_connection_string)

# Ingest data into Atlas
db = mongo_client["invincible"]
collection = db["plot_synopses"]

In [None]:
documents = synopses_df.to_dict("records")
collection.insert_many(documents)
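
The query in the next section refers to a vector search index named `invincible_synopses_index`, which this notebook does not show being created. A definition along the following lines should match the data stored above: the `synopsis_embedding` field with 1024 dimensions. The `cosine` similarity setting is an assumption, since the original index configuration is not shown, and `build_vector_index_definition` is a hypothetical helper that only builds the JSON document.

```python
import json


def build_vector_index_definition(path, num_dimensions, similarity="cosine"):
    """Build an Atlas Vector Search index definition as a plain dict."""
    allowed = {"cosine", "dotProduct", "euclidean"}
    if similarity not in allowed:
        raise ValueError(f"similarity must be one of {sorted(allowed)}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,                     # document field holding the embedding
                "numDimensions": num_dimensions,  # must equal the embedding length
                "similarity": similarity,         # assumed; not stated in this notebook
            }
        ]
    }


index_definition = build_vector_index_definition("synopsis_embedding", 1024)
print(json.dumps(index_definition, indent=2))
```

The printed JSON can be pasted into the Atlas UI when creating a vector search index on the `plot_synopses` collection.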

## Sample query

In [5]:
def vector_query(query):
    # Generate embedding for the search query passed in by the caller
    query_embedding = get_embedding(query, prefix="query")
    
    pipeline = [
       {
          "$vectorSearch": {
          "index": "invincible_synopses_index",
          "path": "synopsis_embedding",
          "queryVector": query_embedding,
          "numCandidates": 42,
          "limit": 4
        }
       },
       {
          "$project": {
             "_id": 0,
             "synopsis": 1,
             "score": {
                "$meta": "vectorSearchScore"
             }
          }
       }
    ]

    # Execute the search
    return collection.aggregate(pipeline)

   
   
results = vector_query("Nolan tells Mark the truth about Viltrum")
for result in results:
    print(result)

{'synopsis': "Nolan then reveals to Mark the truth about his origin and the Viltrumites. Nolan reveals that the Viltrumites achieved a perfect society, but not without killing off the weak from their planet. Cutting the population in half, they emerged unbeatable warrior race. Afterwards, a planetary empire was established and was agreed upon unanimously. Nolan expanded on Viltrumite history by explaining the planet conquering process. First, a planet was searched for and founded. They would set up an orbit monitor to see their conquered planet. He revealed that those who accepted takeover would be given Viltrumite technology to help their way of life. Those who resisted would be killed off until they submitted. Seeing it as successful, the Viltrumites double their efforts in conquering planets.\n\nThey would begin to use enslaved aliens to conquer planets to avoid overusing Viltrumites. Not long after Nolan was born, he then would accept a job to help find planets, but would be eager 
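
For intuition about what the `$vectorSearch` stage returns, the sketch below ranks documents by cosine similarity with an exhaustive scan. It is a local stand-in rather than what Atlas runs internally: Atlas performs an approximate search over `numCandidates` candidates and reports its own normalized score, so the numbers will not necessarily match. `cosine_similarity`, `top_k`, and the toy documents are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, documents, k=4):
    """Score every document against the query and keep the k best."""
    scored = [
        {
            "synopsis": doc["synopsis"],
            "score": cosine_similarity(query_vector, doc["synopsis_embedding"]),
        }
        for doc in documents
    ]
    return sorted(scored, key=lambda item: item["score"], reverse=True)[:k]


# Toy 2-dimensional "embeddings" standing in for the 1024-dimensional ones
toy_documents = [
    {"synopsis": "A", "synopsis_embedding": [1.0, 0.0]},
    {"synopsis": "B", "synopsis_embedding": [0.0, 1.0]},
    {"synopsis": "C", "synopsis_embedding": [1.0, 1.0]},
]
ranked = top_k([1.0, 0.1], toy_documents, k=2)
print([item["synopsis"] for item in ranked])
```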

# Step 2: Integrate with LLM

In [None]:
client = OpenAI(
    api_key=openai_api_key,
)

In [6]:
def get_vector_query(user_prompt):
    # Ask the chat model to rewrite the user's prompt as a short search query
    chat_completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "user",
                "content": f"Can you output a vector query for the following user prompt: \"{user_prompt}\". Please output only the text query. Example: \"Nolan tells Mark the truth about Viltrum\". You need to turn the user query into search terms for a vector database with embedded plot synopses for every issue of the comic.",
            }
        ],
    )

    return chat_completion.choices[0].message.content
    
print(get_vector_query("How did Nolan react when Mark got his powers"))
    

"Nolan's reaction to Mark obtaining powers"

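
A natural next step is to feed the retrieved synopses back into the chat call as context. The sketch below only assembles the message list, so it runs without network access. `build_rag_messages`, the system prompt wording, and the character limit are all hypothetical; in this notebook the `synopses` argument would presumably be the `synopsis` values returned by `vector_query(get_vector_query(user_prompt))`, and the result would be passed as `messages` to `client.chat.completions.create`.

```python
def build_rag_messages(user_prompt, synopses, max_chars_per_synopsis=1500):
    """Assemble chat messages that put retrieved synopses ahead of the user's prompt."""
    context_blocks = []
    for number, synopsis in enumerate(synopses, start=1):
        # Trim each synopsis so the combined context stays a manageable size
        trimmed = synopsis.strip()[:max_chars_per_synopsis]
        context_blocks.append(f"[Synopsis {number}]\n{trimmed}")

    context = "\n\n".join(context_blocks) if context_blocks else "(no synopses retrieved)"

    # Illustrative wording only; adjust to the writing task at hand
    system_content = (
        "You are a writing assistant for the Invincible comic series. "
        "Ground your answer in the plot synopses below.\n\n" + context
    )
    return [
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_prompt},
    ]


messages = build_rag_messages(
    "How did Nolan react when Mark got his powers",
    ["First retrieved synopsis text.", "Second retrieved synopsis text."],
)
for message in messages:
    print(message["role"], "->", message["content"][:60])
```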