<div id="singlestore-header" style="display: flex; background-color: rgba(255, 182, 176, 0.25); padding: 5px;">
    <div id="icon-image" style="width: 90px; height: 90px;">
        <img width="100%" height="100%" src="https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/header-icons/file-export.png" />
    </div>
    <div id="text" style="padding: 5px; margin-left: 10px;">
        <div id="badge" style="display: inline-block; background-color: rgba(0, 0, 0, 0.15); border-radius: 4px; padding: 4px 8px; align-items: center; margin-top: 6px; margin-bottom: -2px; font-size: 80%">SingleStore Notebooks</div>
        <h1 style="font-weight: 500; margin: 8px 0 0 4px;">Ask questions of your PDFs with PDFPlumber</h1>
    </div>
</div>

## Install PDFPlumber Library

We'll start by installing the pdfplumber library, which we use to open PDF files and extract their text page by page. The extracted text is then split into chunks that can be embedded and stored in SingleStore.

Reference for full installation details: [PDFPlumber Installation Guide](https://pypi.org/project/pdfplumber/#installation)

In [1]:
!pip install pdfplumber

## Import Libraries

In this section, we import the necessary libraries for our project. We'll use `pandas` to convert the extracted chunks into a structured DataFrame, which makes them easy to load into the SingleStore database later on. We'll also use the OpenAI API to vectorize text and to generate responses, the two model-backed steps of our retrieval-augmented generation (RAG) system.

In [2]:
!pip install "openai"

In [3]:
import os
import json
import pandas as pd
import numpy as np
import singlestoredb as s2

import openai

## Configure OpenAI API and SingleStore Database

Before we proceed, it's important to configure our environment. This involves setting up access to the OpenAI API and the SingleStore cloud database. You'll need to retrieve your OpenAI API key and establish a connection with the SingleStore database. These steps are fundamental for enabling the interaction between our AI models and the database.

- Obtain your OpenAI API key here: [OpenAI API Key](https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key)
- Set up your SingleStore account and workspace: [SingleStore Setup Guide](https://www.singlestore.com/blog/how-to-get-started-with-singlestore/)
- Connect to your SingleStore workspace: [SingleStore Connection Documentation](https://docs.singlestore.com/cloud/connect-to-your-workspace/)

In [4]:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")

In [5]:
try:
    s2_conn = s2.connect()
    s2_conn.autocommit(True)
    s2_cur = s2_conn.cursor()
    print("SingleStore connection successful!")
except Exception as e:
    raise RuntimeError(f"SingleStore connection failed: {e}")

## PDF Extraction & Chunking (pdfplumber)

We use `pdfplumber`, a lightweight, standalone PDF text extraction library. The extraction flow:

- Opens the PDF and extracts raw text per page.
- Applies a simple heading regex to split pages into logical sections (chunks) based on visually uppercase or structured headings (e.g., SECTION 1, Chapter 2, POLICY GUIDELINES).
- Produces a list of chunk dictionaries you can load into a DataFrame and embed.
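
To see which lines the heading rule accepts, the pattern can be tried on a few strings before running it on a real document. The sketch below reuses the notebook's regex and its 15-word limit; the helper name `is_heading` and the sample lines are hypothetical and not taken from the handbook.

```python
import re

# Same pattern as the notebook's chunking cell
heading_re = re.compile(r"^(?:[A-Z][A-Z0-9 \-/]{3,}|Section\s+\d+|Chapter\s+\d+)$")

def is_heading(line, max_words=15):
    """True if the stripped line looks like a section heading."""
    stripped = line.strip()
    return bool(heading_re.match(stripped)) and len(stripped.split()) <= max_words

# Hypothetical sample lines, for illustration only
samples = [
    "SECTION 1",
    "Chapter 2",
    "POLICY GUIDELINES",
    "Employees must badge in daily.",
    "PTO",
]
results = {line: is_heading(line) for line in samples}
for line, flag in results.items():
    print(f"{flag!s:5}  {line}")
```

Short all-caps lines such as `PTO` are rejected because the first alternative requires at least four characters, so very short headings would be folded into the preceding chunk.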


References:
- pdfplumber: https://github.com/jsvine/pdfplumber
- PyMuPDF (optional alternative): https://pymupdf.readthedocs.io/en/latest/

## Uploading PDF File to Stage

Before ingesting the contents, upload the PDF to the Stage folder (Deployments tab) of the chosen workspace group. The cell below then downloads the file from Stage so that the notebook can open it locally.

References:
- [Stage documentation](https://docs.singlestore.com/cloud/load-data/load-data-from-files/stage/)

In [6]:
%%sql
DOWNLOAD STAGE FILE 'Employee-Handbook.pdf' TO 'Employee-Handbook.pdf' OVERWRITE

In [7]:
pdf_filename = "Employee-Handbook.pdf"

In [8]:
import pdfplumber, re

# Extract pages
pages = []
try:
    with pdfplumber.open(pdf_filename) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text() or ""
            pages.append({"page_number": i+1, "text": text})
    print(f"Loaded {len(pages)} pages.")
except Exception as e:
    raise RuntimeError(f"pdfplumber failed to read PDF: {e}")

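# Heading regex: an all-caps line of at least four characters
# (e.g. "POLICY GUIDELINES"), or "Section <n>" / "Chapter <n>"
heading_re = re.compile(r"^(?:[A-Z][A-Z0-9 \-/]{3,}|Section\s+\d+|Chapter\s+\d+)$")
chunks = []
current_title = None
current_body = []
current_page_start = None
last_page_num = None  # page of the most recent body line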

for page in pages:
    for line in page["text"].splitlines():
        line_stripped = line.strip()
        if heading_re.match(line_stripped) and len(line_stripped.split()) <= 15:
            # flush previous
            if current_body:
                chunks.append({
                    "title": current_title,
                    "body": "\n".join(current_body),
                    "page_start": current_page_start,
                    "page_end": last_page_num
                })
                current_body = []
            current_title = line_stripped
            current_page_start = page["page_number"]
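        else:
            if line_stripped:
                if current_page_start is None:
                    # Text before the first heading: the chunk starts on this page
                    current_page_start = page["page_number"]
                current_body.append(line_stripped)
                last_page_num = page["page_number"]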

# Flush last chunk
if current_body:
    chunks.append({
        "title": current_title,
        "body": "\n".join(current_body),
        "page_start": current_page_start,
        "page_end": last_page_num
    })

print(f"Chunking produced {len(chunks)} chunks.")

## Reformat Chunks into Structured DataFrame Format

After processing the PDF, we have a list of chunk dictionaries, each holding a title, a body, and the page range it spans. Our next step is to convert this list into a structured DataFrame, which is a more suitable format for storing in the SingleStore DB and for further processing in our RAG system. The `Date Modified` column is left empty because that value is not available via pdfplumber.

Reference for understanding metadata: [Unstructured Metadata Documentation](https://unstructured-io.github.io/unstructured/metadata.html)

In [9]:
# Convert chunk dictionaries into Pandas DataFrame
import pandas as pd

data = []
for c in chunks:
    row = {}
    row['Element Type'] = 'Chunk'
    row['Filename'] = pdf_filename
    row['Date Modified'] = None  # Not available via pdfplumber
    row['Filetype'] = 'pdf'
    # Use start page (could also store range)
    row['Page Number'] = c.get('page_start')
    # Combine title + body
    if c.get('title'):
        row['text'] = f"{c.get('title')}\n{c.get('body')}"
    else:
        row['text'] = c.get('body')
    data.append(row)

df = pd.DataFrame(data)
print(f"DataFrame rows: {len(df)}")
df.head()

## Make Connection to SingleStore Database

In this step, we use the SingleStore connection opened earlier to create a new table that matches the structure of our DataFrame and to upload our data. SingleStoreDB Cloud's compatibility with MySQL lets us use familiar SQL statements and parameterized queries for both tasks.

References:
- [Creating a Database in SingleStoreDB Cloud](https://docs.singlestore.com/cloud/create-a-database/)
- [Loading Data into SingleStoreDB Cloud](https://docs.singlestore.com/cloud/load-data/)
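
The row-by-row load used in this section can also be prepared as a single batch. The sketch below builds one parameter tuple per chunk in the column order of the `INSERT` statement. The helper name `chunks_to_params` and the sample chunks are hypothetical, and the commented-out `executemany` call assumes the cursor follows the standard Python DB-API.

```python
# Hypothetical chunk dictionaries in the shape produced by the chunking cell
sample_chunks = [
    {"title": "POLICY GUIDELINES", "body": "Be kind.", "page_start": 1, "page_end": 1},
    {"title": None, "body": "Preamble text.", "page_start": 1, "page_end": 2},
]

def chunks_to_params(chunks, filename):
    """Build one parameter tuple per chunk, in INSERT column order."""
    params = []
    for c in chunks:
        # Prefix the body with its title when the chunk has one
        text = f"{c['title']}\n{c['body']}" if c.get("title") else c["body"]
        params.append(("Chunk", filename, None, "pdf", c.get("page_start"), text))
    return params

insert_query = (
    "INSERT INTO unstructured_data "
    "(element_type, filename, date_modified, filetype, page_number, text) "
    "VALUES (%s, %s, %s, %s, %s, %s)"
)
params = chunks_to_params(sample_chunks, "Employee-Handbook.pdf")

# With a live cursor, the whole batch could then be sent in one call
# (assumption: the cursor implements the DB-API executemany method):
# s2_cur.executemany(insert_query, params)
print(params)
```

Keeping the tuple construction separate from the database call makes it easy to inspect exactly what will be written before touching the table.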

In [10]:
s2_cur.execute("DROP TABLE IF EXISTS unstructured_data;")
create_query = (
    "CREATE TABLE unstructured_data ("
    "element_id INT AUTO_INCREMENT PRIMARY KEY, "
    "element_type VARCHAR(255), "
    "filename VARCHAR(255), "
    "date_modified DATETIME, "
    "filetype VARCHAR(255), "
    "page_number INT, "
    "text TEXT)"
)
s2_cur.execute(create_query)
print("Table unstructured_data ready.")

In [11]:
for i, row in df.iterrows():
    insert_query = (
        "INSERT INTO unstructured_data (element_type, filename, date_modified, filetype, page_number, text) "
        "VALUES (%s, %s, %s, %s, %s, %s);"
    )
    s2_cur.execute(insert_query, (
        row['Element Type'], row['Filename'], row['Date Modified'], row['Filetype'], row['Page Number'], row['text']
    ))
print(f"Inserted {len(df)} rows into unstructured_data.")

## Create Text Embedding in the Table

Next, we enhance our database table by adding a new column for text embeddings. Using OpenAI's embeddings endpoint, we generate a vector for each chunk so that the relatedness of two text strings can be measured by comparing their vectors. These embeddings are particularly useful for search, allowing us to rank results by relevance. Each embedding is stored in the table as a JSON-encoded list of floats.

Reference: [Understanding Text Embeddings](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings)
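
Because the embedding column in this notebook is plain `TEXT`, each vector is serialized with `json.dumps` on the way in and parsed with `json.loads` on the way out. The sketch below shows that round trip on a hypothetical four-dimensional vector; real embeddings are much longer.

```python
import json

# Hypothetical 4-dimensional embedding, for illustration only
embedding = [0.12, -0.5, 0.33, 0.0]

stored = json.dumps(embedding)   # string written to the TEXT column
restored = json.loads(stored)    # list of floats read back for scoring

print(stored)
print(restored == embedding)
```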

In [12]:
s2_cur.execute("ALTER TABLE unstructured_data ADD COLUMN text_embedding TEXT;")
print("Added text_embedding column.")

In [13]:
import os
import time
from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment (set in the key input cell)
if not os.getenv('OPENAI_API_KEY'):
    raise ValueError('OpenAI API key not set. Set OPENAI_API_KEY env or rerun key input cell.')
_openai_client = OpenAI()

# Assumption: this notebook never defines EMBED_MODEL, so a model is chosen here.
# Any OpenAI embedding model works as long as chunks and queries use the same one.
EMBED_MODEL = "text-embedding-3-small"

BATCH_SIZE = 10
MODEL = EMBED_MODEL
MAX_RETRIES = 3

# Only embed rows that do not have an embedding yet
s2_cur.execute("SELECT element_id, text FROM unstructured_data WHERE text_embedding IS NULL OR text_embedding = '';")
rows = s2_cur.fetchall()
print(f"Rows needing embeddings: {len(rows)}")

def embed_batch(text_list):
    """Return one embedding (a list of floats) per input string."""
    resp = _openai_client.embeddings.create(model=MODEL, input=text_list)
    return [item.embedding for item in resp.data]

for i in range(0, len(rows), BATCH_SIZE):
    batch = rows[i:i+BATCH_SIZE]
    texts = [t for _, t in batch]
    attempt = 0
    while True:
        try:
            embeddings = embed_batch(texts)
            break
        except Exception as e:
            attempt += 1
            if attempt >= MAX_RETRIES:
                print(f"Failed batch starting at index {i}: {e}")
                embeddings = [None]*len(batch)
                break
            sleep_time = 2 ** attempt
            print(f"Retry {attempt} for batch starting at {i} after error: {e}. Sleeping {sleep_time}s")
            time.sleep(sleep_time)
    for (element_id, _), emb in zip(batch, embeddings):
        if emb:
            s2_cur.execute("UPDATE unstructured_data SET text_embedding = %s WHERE element_id = %s;", (json.dumps(emb), element_id))

print("Embedding update complete.")

## Run User Query Based on Similarity Score

The retrieval process begins by selecting the text and its embedding for every row in our table. We then calculate similarity scores using numpy's dot product function, comparing the user query embedding with each document embedding. OpenAI embeddings are normalized to unit length, so the dot product gives the same ranking as cosine similarity. Finally, we sort by score and keep the top 5 entries, which are the most relevant to the user's query.

Reference: [How the Dot Product Measures Similarity](https://tivadardanka.com/blog/how-the-dot-product-measures-similarity)
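
The ranking step can be illustrated on toy vectors. The sketch below uses plain Python lists so that it runs without dependencies (the notebook itself uses numpy); the three-dimensional vectors, document names, and helper functions are hypothetical. It assumes all vectors are normalized to unit length, in which case the dot product equals cosine similarity.

```python
import heapq
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Hypothetical 3-dimensional "embeddings", normalized to unit length
docs = {
    "evacuation routes": normalize([0.9, 0.1, 0.0]),
    "vacation policy": normalize([0.0, 1.0, 0.2]),
    "fire drills": normalize([0.8, 0.0, 0.3]),
}
query = normalize([1.0, 0.0, 0.1])

# Score every document against the query, then keep the two best
scores = [(name, dot(query, vec)) for name, vec in docs.items()]
top_2 = heapq.nlargest(2, scores, key=lambda item: item[1])
for name, score in top_2:
    print(f"{score:.3f}  {name}")
```

`heapq.nlargest` avoids sorting the full list when only a few results are needed, which can matter once the table holds many chunks.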

In [14]:
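search_string = "What do the emergency management provisions include?"

def embed_text(text):
    """Embed a single string with the same model used for the stored chunks."""
    return embed_batch([text])[0]

search_embedding = embed_text(search_string)
search_embedding_array = np.asarray(search_embedding, dtype=np.float32)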

In [15]:
# Fetch text, type, filename, and embeddings from the unstructured_data table using singlestoredb
s2_cur.execute("SELECT text, element_type, filename, text_embedding FROM unstructured_data WHERE text_embedding IS NOT NULL;")
results = s2_cur.fetchall()

scores = []
for text, type_, filename, embedding_str in results:
    if embedding_str:
        embedding = json.loads(embedding_str)
        embedding_array = np.array(embedding)
        score = np.dot(search_embedding_array, embedding_array)
        scores.append((text, type_, filename, score))

# Sort by score and take the top 5
top_5 = sorted(scores, key=lambda x: x[3], reverse=True)[:5]

# Display top-k records
top_5

## Generate the Answer via OpenAI ChatCompletion

In the final step, we take the top 5 most similar entries retrieved from the database and pass them to OpenAI's Chat Completions API. The API is designed for both multi-turn conversations and single-turn tasks: it takes a list of messages as input and returns a model-generated message as output. Here the retrieved chunks are supplied as context so that the response is grounded in the document.

Reference: [OpenAI Chat Completions API Guide](https://platform.openai.com/docs/guides/gpt/chat-completions-api)
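
The retrieved entries have to be turned into messages before they reach the model. The sketch below shows one possible layout, with each chunk numbered and labelled with its filename; it is an illustrative alternative, not the notebook's own method, which passes `str(top_5)` directly. The helper name `build_messages`, the prompt wording, and the sample hits are hypothetical.

```python
def build_messages(question, hits):
    """Turn (text, type, filename, score) tuples into chat messages."""
    context = "\n\n".join(
        f"[{i}] ({filename}) {text}"
        for i, (text, _type, filename, _score) in enumerate(hits, start=1)
    )
    return [
        {"role": "system",
         "content": "Answer the user's question using only the numbered context."},
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

# Hypothetical retrieval results in the shape of the notebook's top_5 list
hits = [
    ("EMERGENCY PROCEDURES\nLeave by the nearest exit.", "Chunk", "Employee-Handbook.pdf", 0.91),
    ("FIRE DRILLS\nDrills are held twice a year.", "Chunk", "Employee-Handbook.pdf", 0.87),
]
messages = build_messages("What should I do in an emergency?", hits)
print(messages[1]["content"])
```

Numbering the chunks keeps similarity scores and tuple punctuation out of the prompt and gives the model a simple way to refer back to a source.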

In [16]:
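from openai import OpenAI

if top_5:
    try:
        # openai>=1.0 replaced openai.ChatCompletion with a client object;
        # OpenAI() reads OPENAI_API_KEY from the environment.
        chat_client = OpenAI()
        # temperature is left at its default here, on the assumption that
        # GPT-5 models reject other values; set it for models that accept it.
        response = chat_client.chat.completions.create(
            model="gpt-5",
            messages=[
                {"role": "system",
                 "content": "You are a useful assistant. Use the assistant's content to answer the user's query. Summarize your answer based on the context."
                },
                {"role": "assistant", "content": str(top_5)},
                {"role": "user", "content": search_string},
            ],
        )

        assistant_message = response.choices[0].message.content
        print("Assistant's Response:", assistant_message)

    except Exception as e:
        print(f"OpenAI API call failed: {e}")
else:
    print("No relevant documents found.")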

<div id="singlestore-footer" style="background-color: rgba(194, 193, 199, 0.25); height:2px; margin-bottom:10px"></div>
<div><img src="https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/singlestore-logo-grey.png" style="padding: 0px; margin: 0px; height: 24px"/></div>