# Introduction

This notebook walks through the creation of a RAG system for injecting writing prompts into a user conversation with ChatGPT. 

We will use a dataset of plot synopses, one per issue of the Invincible comic book series.

The source is the Invincible fandom wiki: https://comic-invincible.fandom.com/wiki/Invincible_(Comic_Series)
This page does not list every issue of the comic, but it is a good starting point that can be expanded later.

In [1]:
# Installation

!pip install beautifulsoup4 requests







In [2]:
# Imports
import requests
from bs4 import BeautifulSoup

# Step 1: Load the dataset into our vector database

## Scrape index page to find issue links


In [5]:
# URL of the index page listing all issues
base_url = "https://comic-invincible.fandom.com"
index_url = f"{base_url}/wiki/Invincible_(Comic_Series)"
index_response = requests.get(index_url, timeout=30)
index_response.raise_for_status()  # Fail early if the index page could not be fetched
index_soup = BeautifulSoup(index_response.content, 'html.parser')

volumes_and_issues_header = index_soup.find('span', id='Volumes_and_Issues').parent
issue_links = []

for sibling in volumes_and_issues_header.find_next_siblings():
    if sibling.name == 'h3':
        # Find the next <ul> tag after the <h3>
        next_ul = sibling.find_next_sibling('ul')
        if next_ul:
            # Extract all <a> tags within the <ul>
            for a_tag in next_ul.find_all('a'):
                issue_links.append(base_url + a_tag['href'])
    elif sibling.name == 'div':
        # Break if a new div is encountered
        break

print(issue_links)

['https://comic-invincible.fandom.com/wiki/Invincible_Vol_1_1', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_1_2', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_1_3', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_1_4', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_2_1', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_2_2', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_2_3', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_2_4', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_3_1', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_3_2', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_3_3', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_3_4', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_3_5', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_4_1', 'https://comic-invincible.fandom.com/wiki/Invincible_Vol_4_2', 'https://comic-invincible.fandom.com/wiki/Invincible_V
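
The sibling walk above can be tried offline. The sketch below runs the same logic against a small, made-up HTML fragment; `SAMPLE_INDEX_HTML` and `extract_issue_links` are hypothetical names introduced here, and the fragment only assumes the wiki page follows an `h2` / `h3` / `ul` layout like the one the loop expects.

```python
from bs4 import BeautifulSoup

# Made-up fragment that imitates the layout the scraping loop assumes:
# an <h2> holding the "Volumes_and_Issues" span, then <h3>/<ul> pairs,
# then a <div> that marks the end of the section.
SAMPLE_INDEX_HTML = """
<div id="content">
<h2><span id="Volumes_and_Issues">Volumes and Issues</span></h2>
<h3>Volume 1</h3>
<ul><li><a href="/wiki/Invincible_Vol_1_1">Issue 1</a></li>
<li><a href="/wiki/Invincible_Vol_1_2">Issue 2</a></li></ul>
<h3>Volume 2</h3>
<ul><li><a href="/wiki/Invincible_Vol_2_1">Issue 1</a></li></ul>
<div class="navbox"><ul><li><a href="/wiki/Other_Page">Other</a></li></ul></div>
</div>
"""


def extract_issue_links(html, base_url):
    """Collect absolute issue links from the <ul> that follows each <h3>."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", id="Volumes_and_Issues")
    if span is None:
        return []  # Section header not found

    links = []
    for sibling in span.parent.find_next_siblings():
        if sibling.name == "h3":
            next_ul = sibling.find_next_sibling("ul")
            if next_ul:
                links.extend(
                    base_url + a["href"] for a in next_ul.find_all("a", href=True)
                )
        elif sibling.name == "div":
            break  # A <div> ends the issue section
    return links


sample_links = extract_issue_links(
    SAMPLE_INDEX_HTML, "https://comic-invincible.fandom.com"
)
print(sample_links)
```

The link inside the trailing `<div>` is not collected, because the loop stops as soon as it meets a `<div>` sibling. Returning an empty list when the header is missing avoids the `AttributeError` that `.parent` would raise on `None`.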

## Scrape issue links to get plot synopses

In [11]:
all_plot_synopses = []

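def get_plot_synopsis(url):
    """Return the plot synopsis text for one issue page, or None if no synopsis header is found."""
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.content, 'html.parser')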
    
    # Find the "Plot Synopsis" header
    plot_header = soup.find('span', id='Plot_Synopsis') or soup.find('span', id='Synopsis_for_the_1st_Story')
    if plot_header:
        plot_header = plot_header.parent
    else:
        print("No plot synopsis found for", url)
        return  # Skip if no relevant header is found
    
    # Initialize a list to store the plot paragraphs
    plot_paragraphs = []

    # Iterate over the siblings after the plot header until encountering a different type of tag
    for sibling in plot_header.find_next_siblings():
        if sibling.name == 'p':
            plot_paragraphs.append(sibling.get_text(strip=True))
        else:
            break
    
    full_synopsis = '\n'.join(plot_paragraphs)
    return full_synopsis

for link in issue_links:
    synopsis = get_plot_synopsis(link)
    # Keep only non-empty synopses so later steps never see None
    if synopsis:
        all_plot_synopses.append(synopsis)

print(all_plot_synopses)
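
One consequence of the paragraph loop is worth checking: it stops at the first sibling that is not a `<p>`, so a synopsis interrupted by an image or list would be cut short. The sketch below illustrates this on a made-up fragment. `SAMPLE_ISSUE_HTML`, `synopsis_paragraphs`, and the `stop_at_first_non_p` flag are hypothetical, and whether the real wiki pages contain such interruptions is an assumption that would need checking against the site.

```python
from bs4 import BeautifulSoup

# Made-up fragment: two synopsis paragraphs, a <figure>, a third paragraph,
# then the next section header.
SAMPLE_ISSUE_HTML = """
<div>
<h2><span id="Plot_Synopsis">Plot Synopsis</span></h2>
<p>First paragraph of the synopsis.</p>
<p>Second paragraph of the synopsis.</p>
<figure>image placeholder</figure>
<p>Third paragraph, after the figure.</p>
<h2><span id="Characters">Characters</span></h2>
<p>Not part of the synopsis.</p>
</div>
"""


def synopsis_paragraphs(html, stop_at_first_non_p=True):
    """Collect <p> siblings after the synopsis header.

    With stop_at_first_non_p=True this mirrors the notebook's loop.
    With False (a hypothetical variant) it skips other tags and stops
    only at the next <h2> or <h3> header.
    """
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", id="Plot_Synopsis")
    if span is None:
        return []

    paragraphs = []
    for sibling in span.parent.find_next_siblings():
        if sibling.name == "p":
            paragraphs.append(sibling.get_text(strip=True))
        elif stop_at_first_non_p or sibling.name in ("h2", "h3"):
            break
    return paragraphs


strict = synopsis_paragraphs(SAMPLE_ISSUE_HTML)
lenient = synopsis_paragraphs(SAMPLE_ISSUE_HTML, stop_at_first_non_p=False)
print(len(strict), len(lenient))
```

On this fragment the strict variant returns two paragraphs and the lenient one returns three. If the scraped synopses look truncated, the lenient stopping rule may be a reasonable thing to try.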




In [14]:
def count_total_words(text_array):
    # Count whitespace-separated words, ignoring missing (None) or empty entries
    total_words = sum(len(text.split()) for text in text_array if text)
    return total_words

total_words = count_total_words(all_plot_synopses)
print(f"Total number of words: {total_words}")

Total number of words: 20015
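
A single total hides how the words are spread across issues, which may matter when deciding how to split the text before loading it into a vector database. As a minimal sketch, the hypothetical helper `word_count_summary` below reports per-synopsis statistics using only the standard library; the sample input is made up and is not taken from the scraped data.

```python
from statistics import mean, median


def word_count_summary(synopses):
    """Summarize whitespace-separated word counts, skipping None or empty entries."""
    counts = [len(text.split()) for text in synopses if text]
    if not counts:
        return {"documents": 0, "total": 0, "mean": 0.0, "median": 0.0, "max": 0}
    return {
        "documents": len(counts),
        "total": sum(counts),
        "mean": mean(counts),
        "median": median(counts),
        "max": max(counts),
    }


# Made-up sample input, including a missing and an empty entry
sample_summary = word_count_summary(["one two three", None, "", "four five"])
print(sample_summary)
```

In the notebook, the same helper could be called as `word_count_summary(all_plot_synopses)`.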
