<a href="https://colab.research.google.com/github/rogercost/epigraph-finder/blob/main/universe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build the Epigraph Universe
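The workflow is:
1. Scrape the Project Gutenberg poetry bookshelf for all book URLs
2. Download the plain text of each book to Google Drive
3. Invoke Gemini for each poem to extract potential epigraphs (planned; see the notes in the final cell)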

In [None]:
import requests
from bs4 import BeautifulSoup
import time
import os.path

In [None]:
# Function to get book links from a page
def get_book_links(page_url):
    # A timeout keeps one stalled request from hanging the whole crawl
    response = requests.get(page_url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    # Require an href attribute so link.get('href') is never None
    book_links = soup.select('a.link[href]')
    poetry_books = [link.get('href') for link in book_links if 'ebooks' in link.get('href')]
    return poetry_books, soup

# Function to get the next page URL
def get_next_page_url(soup):
    next_link = soup.find('a', string='Next')
    if next_link:
        return f"https://www.gutenberg.org{next_link.get('href')}"
    return None
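
`get_next_page_url` assumes the "Next" link always carries a site-relative href. As a minimal sketch of one way to relax that assumption, `urllib.parse.urljoin` resolves relative and absolute hrefs alike. The `resolve_href` helper and the `GUTENBERG_ROOT` constant are hypothetical names introduced for illustration; they are not part of this notebook.

```python
from urllib.parse import urljoin

GUTENBERG_ROOT = "https://www.gutenberg.org"

def resolve_href(href, root=GUTENBERG_ROOT):
    """Hypothetical helper: turn a relative or absolute href into a full URL."""
    return urljoin(root, href)

# A site-relative href is joined onto the root
relative = resolve_href("/ebooks/bookshelf/60?start_index=26")
# An href that is already absolute passes through unchanged
absolute = resolve_href("https://www.gutenberg.org/ebooks/53385")
print(relative)
print(absolute)
```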

In [None]:
# Initial URL of Project Gutenberg poetry section
base_url = "https://www.gutenberg.org/ebooks/bookshelf/60"

all_poetry_books = []
current_url = base_url

while current_url:
    # Get book links and the soup object for the current page
    print(f"Gathering links from {current_url}...")
    book_links, soup = get_book_links(current_url)
    all_poetry_books.extend(book_links)

    # Get the next page URL
    current_url = get_next_page_url(soup)

    # Sleep to avoid overwhelming the server
    time.sleep(1)

# Remove duplicates by converting the list to a set and back to a list
all_poetry_books = list(set(all_poetry_books))
print(f"Downloaded {len(all_poetry_books)} poetry book URLs, e.g. {all_poetry_books[0]}")

Gathering links from https://www.gutenberg.org/ebooks/bookshelf/60...
Gathering links from https://www.gutenberg.org/ebooks/bookshelf/60?start_index=26...
Gathering links from https://www.gutenberg.org/ebooks/bookshelf/60?start_index=51...
Gathering links from https://www.gutenberg.org/ebooks/bookshelf/60?start_index=76...
Downloaded 98 poetry book URLs, e.g. /ebooks/53385
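
Because a `set` does not preserve order, the example URL printed above can differ from run to run. If a stable order matters, one possible alternative is to deduplicate with `dict.fromkeys`, which keeps the first occurrence of each link in place. The helper name and the sample links below are illustrative only.

```python
def dedupe_preserving_order(items):
    """Drop repeats while keeping the first occurrence of each item in place."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(items))

# Illustrative links only; real values come from the crawl
sample_links = ["/ebooks/53385", "/ebooks/1322", "/ebooks/53385", "/ebooks/8800"]
unique_links = dedupe_preserving_order(sample_links)
print(unique_links)
```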


In [None]:
from google.colab import drive
drive.mount('/content/gdrive')
!mkdir -p "/content/gdrive/My Drive/poems"
!echo "Disgusting Gus, scissors cut" > "/content/gdrive/My Drive/poems/example_poem.txt"

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
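def extract_title_author(text):
    """Return (title, author) parsed from a Project Gutenberg text header."""
    title = None
    author = None

    for line in text.splitlines():
        # Slice off the field label instead of using replace(), which would
        # also delete any later occurrence of the label inside the value
        if line.startswith("Title: "):
            title = line[len("Title: "):].strip()
        elif line.startswith("Author: "):
            author = line[len("Author: "):].strip()
        elif line.startswith("Editor: "):
            # Treat the editor as the author when the header lists one
            author = line[len("Editor: "):].strip()
        if title and author:
            break

    # Fall back to placeholders so callers can always build a filename
    if not title:
        title = "Untitled"
    if not author:
        author = "Unknown"
    return title, author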
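
To show the kind of input this parsing expects, here is a small self-contained sketch that reads `Key: value` header fields into a dict and stops at the start-of-text marker. The `parse_header_fields` helper, the assumption that the header ends at a line beginning with `*** START`, and the sample header are all hypothetical and used for illustration only.

```python
def parse_header_fields(text, wanted=("Title", "Author", "Editor")):
    """Hypothetical helper: collect selected 'Key: value' header fields into a dict."""
    fields = {}
    for line in text.splitlines():
        # Assumption: the metadata header ends at the '*** START' marker line
        if line.startswith("*** START"):
            break
        key, sep, value = line.partition(": ")
        # Keep only the first occurrence of each wanted field
        if sep and key in wanted and key not in fields:
            fields[key] = value.strip()
    return fields

# Made-up header in the 'Key: value' layout this sketch expects
sample_text = "\n".join([
    "Title: Example Poems: A Selection",
    "",
    "Editor: Jane Doe",
    "Language: English",
    "",
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE POEMS ***",
    "Title: not a header field",
])

fields = parse_header_fields(sample_text)
title = fields.get("Title", "Untitled")
# Prefer the author, then the editor, then a placeholder
author = fields.get("Author") or fields.get("Editor") or "Unknown"
print(title, "|", author)
```

Partitioning on the first `": "` keeps colons inside the title intact, and stopping at the marker avoids picking up lookalike lines from the body of the book.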

In [None]:
# Iterate over each book link and download the content
for book_link in all_poetry_books:
    book_url = f"https://www.gutenberg.org{book_link}"
    book_response = requests.get(book_url, timeout=30)
    book_response.raise_for_status()

    book_soup = BeautifulSoup(book_response.content, 'html.parser')
    # Require an href attribute so the 'txt' check never sees None
    download_links = book_soup.select('a.link[href]')

    for download_link in download_links:
        if 'txt' in download_link.get('href'):
            download_url = f"https://www.gutenberg.org{download_link.get('href')}"
            download_response = requests.get(download_url, timeout=30)
            download_response.raise_for_status()
            book_content = download_response.text

            title, author = extract_title_author(book_content)
            # Guard against a missing title, and against '/', which would be
            # read as a directory separator in the output path
            title = (title or "Untitled").replace(" ", "_").replace("/", "-")
            author = author.replace(" ", "_").replace("/", "-")
            filename = f"{title}__by_{author}.txt"
            full_filename = f"/content/gdrive/My Drive/poems/{filename}"

            # Write as UTF-8 explicitly, since titles and texts contain non-ASCII characters
            with open(full_filename, 'w', encoding='utf-8') as file:
                file.write(book_content)
                file.write("\n\n")

            print(f"Wrote poem file: {filename}")
            break  # Break after the first text format link is found

    # Sleep once per book to avoid overwhelming the server
    time.sleep(1)

print(f"Total poetry books found: {len(all_poetry_books)}")

Wrote poem file: For_Your_Sweet_Sake:_Poems__by_James_E._McGirt.txt
Wrote poem file: La_Divina_Commedia_di_Dante__by_Dante_Alighieri.txt
Wrote poem file: The_Golden_Treasury_of_American_Songs_and_Lyrics__by_Frederic_Lawrence_Knowles.txt
Wrote poem file: Sappho:_One_Hundred_Lyrics__by_Bliss_Carman.txt
Wrote poem file: Sir_Gawayne_and_the_Green_Knight__by_Richard_Morris.txt
Wrote poem file: The_Works_of_Horace__by_Horace.txt
Wrote poem file: Poems_by_Walt_Whitman__by_Walt_Whitman.txt
Wrote poem file: Select_Poems_of_Sidney_Lanier__by_Sidney_Lanier.txt
Wrote poem file: The_Works_of_Lord_Byron._Vol._2__by_Baron_George_Gordon_Byron_Byron.txt
Wrote poem file: Amores:_Poems__by_D._H._Lawrence.txt
Wrote poem file: The_Complete_Poetical_Works_of_Percy_Bysshe_Shelley_—_Complete__by_Percy_Bysshe_Shelley.txt
Wrote poem file: The_Song_of_Hiawatha__by_Henry_Wadsworth_Longfellow.txt
Wrote poem file: A_Dome_of_Many-Coloured_Glass__by_Amy_Lowell.txt
Wrote poem file: Les_poésies_de_Sapho_de_Lesbos__by_S
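
The filenames in the log above keep characters such as `:` from the book titles, which some filesystems (Windows, for example) reject. If the files ever need to leave Drive, one option is to normalize names before writing. The sketch below is an assumption about how that could look: `safe_filename` is a hypothetical helper and `max_length=150` is an arbitrary placeholder.

```python
import re

def safe_filename(title, author, max_length=150):
    """Hypothetical helper: build '<title>__by_<author>.txt' from portable characters only."""
    def clean(part):
        # Collapse each run of whitespace into a single underscore
        part = re.sub(r"\s+", "_", part.strip())
        # Replace characters that are invalid on common filesystems
        return re.sub(r'[\\/:*?"<>|]', "-", part)

    stem = f"{clean(title)}__by_{clean(author)}"
    # Truncate long stems (placeholder limit) before adding the extension
    return stem[:max_length] + ".txt"

example_name = safe_filename("Sappho: One Hundred Lyrics", "Bliss Carman")
print(example_name)
```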

In [None]:
# All book texts are now saved to Drive. The next step is to chunk and process each file:
# 1. Break the file content into overlapping chunks. Chunk size is TBD (check Gemini's maximum context window).
# 2. For each chunk, prompt Gemini to pull out potential epigraphs and the poem subtitle. (Experiment with this prompt and research alternative methodologies.)
# 3. Write the epigraphs into a tabular structure with columns for title, author, and poem subtitle.
# 4. Load the table into a FAISS structure (see stanzas.ipynb).
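
Step 1 of the plan above could be sketched as follows. This is a minimal character-based splitter, not a settled design: the `chunk_text` name and the default `chunk_size` and `overlap` values are placeholders, since the real sizes depend on the Gemini context window that is still to be checked.

```python
def chunk_text(text, chunk_size=8000, overlap=500):
    """Split text into character chunks; each chunk repeats the tail of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    # Each new chunk starts this many characters after the previous one
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end, so no redundant tail chunk is emitted
        if start + chunk_size >= len(text):
            break
    return chunks

# Tiny demonstration with a 4-character window and 1 character of overlap
demo = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(demo)  # ['abcd', 'defg', 'ghij']
```

The overlap means an epigraph candidate that straddles a chunk boundary still appears whole in at least one chunk, provided it is shorter than the overlap. Splitting on stanza or line boundaries instead of raw characters may suit poetry better and is worth comparing.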