From 69992dbee3eced8d60aeeab7b6cd927066d3dcde Mon Sep 17 00:00:00 2001 From: Lakshya Singh <141630392+lsingh4634426@users.noreply.github.com> Date: Tue, 18 Nov 2025 19:29:34 +0530 Subject: [PATCH 1/2] Add PDF upload and access instructions Added instructions for uploading PDF files to the stage folder and accessing them within the notebook. --- .../notebook.ipynb | 31 +++++++++++++++++++ 1 file changed, 31 insertions(+) diff --git a/notebooks/ingest-pdfs-with-unstructured/notebook.ipynb b/notebooks/ingest-pdfs-with-unstructured/notebook.ipynb index aa571bb..f948919 100644 --- a/notebooks/ingest-pdfs-with-unstructured/notebook.ipynb +++ b/notebooks/ingest-pdfs-with-unstructured/notebook.ipynb @@ -114,6 +114,37 @@ ], "id": "92ae5a1e" }, +{ + "attachments": {}, + "cell_type": "markdown", + "id": "b5cdd4f1-b27c-4921-ac9f-da41654fd28f", + "metadata": { + "language": "python" + }, + "source": [ + "## Uploading PDF File to Stage\n", + "\n", + "Upload the PDF to the Stage folder (Deployments tab) for the chosen workspace group before ingesting the contents\n", + "\n", + "References:\n", + "- [Stage documentation](https://docs.singlestore.com/cloud/load-data/load-data-from-files/stage/)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d9f964ec-0a77-4cd6-b98c-a9f07bfce293", + "metadata": { + "language": "python", + "trusted": true + }, + "outputs": [], + "source": [ + "# For accessing the stage file inside the notebook , we copy it locally on the container running the notebook using the following \n", + "# fusion SQL command\n", + "DOWNLOAD STAGE FILE 'Employee-Handbook.pdf' TO 'Employee-Handbook.pdf'" + ] + }, { "cell_type": "code", "execution_count": 4, From 16c973bcb50ac84879aae85b737840b91ff257d1 Mon Sep 17 00:00:00 2001 From: lsingh4634426 Date: Wed, 19 Nov 2025 01:18:56 +0530 Subject: [PATCH 2/2] use pdfplumber --- .../ingest-pdfs-with-pdfplumber/meta.toml | 10 + .../notebook.ipynb | 553 ++++++++++++++++++ .../ingest-pdfs-with-unstructured/meta.toml | 10 - .../notebook.ipynb | 487 --------------- 4 files changed, 563 insertions(+), 497 deletions(-) create mode 100644 notebooks/ingest-pdfs-with-pdfplumber/meta.toml create mode 100644 notebooks/ingest-pdfs-with-pdfplumber/notebook.ipynb delete mode 100644 notebooks/ingest-pdfs-with-unstructured/meta.toml delete mode 100644 notebooks/ingest-pdfs-with-unstructured/notebook.ipynb diff --git a/notebooks/ingest-pdfs-with-pdfplumber/meta.toml b/notebooks/ingest-pdfs-with-pdfplumber/meta.toml new file mode 100644 index 0000000..4e404f4 --- /dev/null +++ b/notebooks/ingest-pdfs-with-pdfplumber/meta.toml @@ -0,0 +1,10 @@ +[meta] +authors=["singlestore"] +title="Ask questions of your PDFs with PDFPlumber" +description="Ask questions of your unstructured PDFs. In this notebook, PDFPlumber ingests pdfs, then Open AI is used to create embeddings, the vector data is stored in SingleStore and finally ask questions of your PDF data" +icon="file-export" +difficulty="beginner" +tags=["ingest", "pdf","vector","pdfplumber"] +lesson_areas=["AI", "Integrations"] +destinations=["spaces"] +minimum_tier="standard" diff --git a/notebooks/ingest-pdfs-with-pdfplumber/notebook.ipynb b/notebooks/ingest-pdfs-with-pdfplumber/notebook.ipynb new file mode 100644 index 0000000..a6b6d68 --- /dev/null +++ b/notebooks/ingest-pdfs-with-pdfplumber/notebook.ipynb @@ -0,0 +1,553 @@ +{ + "cells": [ + { + "id": "3ba63f11", + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "
\n", + " \n", + "
\n", + "
\n", + "
SingleStore Notebooks
\n", + "

Ask questions of your PDFs with PDFPlumber

\n", + "
\n", + "
" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "0680197e", + "metadata": {}, + "source": [ + "## Install PDFPlumber Library\n", + "\n", + "We'll start by installing the PDFPlumber library, which is essential for ingesting and processing PDF files. The library will allow us to convert PDF documents into a JSON format that includes both metadata and text extraction. For this part of the project, we'll focus on installing the PDF support components.\n", + "\n", + "Reference for full installation details: [PDFPlumber Installation Guide](https://pypi.org/project/pdfplumber/#installation)" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "3a3fee0a", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install pdfplumber" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "6a27e7f1", + "metadata": {}, + "source": [ + "## Import Libraries\n", + "\n", + "In this section, we import the necessary libraries for our project. We'll use `pandas` to handle data manipulation, converting our semi-structured JSON data into a structured DataFrame format. This is crucial for storing the data in the SingleStore database later on. Additionally, we'll utilize the OpenAI API for vectorizing text and generating responses, integral components of our RAG system." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install \"openai\"" + ], + "id": "87c6c286" + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "6a076d8b", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import json\n", + "import pandas as pd\n", + "import numpy as np\n", + "import singlestoredb as s2\n", + "\n", + "import openai" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "c40456f7", + "metadata": {}, + "source": [ + "## Configure OpenAI API and SingleStore Database\n", + "\n", + "Before we proceed, it's important to configure our environment. This involves setting up access to the OpenAI API and the SingleStore cloud database. You'll need to retrieve your OpenAI API key and establish a connection with the SingleStore database. 
These steps are fundamental for enabling the interaction between our AI models and the database.\n",
    "\n",
    "- Obtain your OpenAI API key here: [OpenAI API Key](https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key)\n",
    "- Set up your SingleStore account and workspace: [SingleStore Setup Guide](https://www.singlestore.com/blog/how-to-get-started-with-singlestore/)\n",
    "- Connect to your SingleStore workspace: [SingleStore Connection Documentation](https://docs.singlestore.com/cloud/connect-to-your-workspace/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from getpass import getpass\n",
    "os.environ[\"OPENAI_API_KEY\"] = getpass(\"OpenAI API key: \")"
   ],
   "id": "9dbe989a"
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "e8826a8c",
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    s2_conn = s2.connect()\n",
    "    s2_conn.autocommit(True)\n",
    "    s2_cur = s2_conn.cursor()\n",
    "    print(\"SingleStore connection successful!\")\n",
    "except Exception as e:\n",
    "    raise RuntimeError(f\"SingleStore connection failed: {e}\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "92ae5a1e",
   "metadata": {},
   "source": [
    "## PDF Extraction & Chunking (pdfplumber)\n",
    "\n",
    "We use a simple `pdfplumber`-based flow (`pdfplumber` is a lightweight, standalone PDF text-extraction library). This approach:\n",
    "\n",
    "- Opens the PDF and extracts raw text per page.\n",
    "- Applies a simple heading regex to split pages into logical sections (chunks) based on visually uppercase or structured headings (e.g., SECTION 1, Chapter 2, POLICY GUIDELINES).\n",
    "- Produces a list of chunk dictionaries you can load into a DataFrame and embed.\n",
    "\n",
    "References:\n",
    "- pdfplumber: https://github.com/jsvine/pdfplumber\n",
    "- PyMuPDF (optional alternative): https://pymupdf.readthedocs.io/en/latest/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Uploading PDF File to Stage\n",
    "\n",
    "Upload the PDF to the Stage folder (Deployments tab) of the chosen workspace group before ingesting its contents, then copy it into the notebook's working directory with the Fusion SQL command below.\n",
    "\n",
    "References:\n",
    "- [Stage documentation](https://docs.singlestore.com/cloud/load-data/load-data-from-files/stage/)"
   ],
   "id": "050463a1"
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "DOWNLOAD STAGE FILE 'Employee-Handbook.pdf' TO 'Employee-Handbook.pdf' OVERWRITE"
   ],
   "id": "91b47930"
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "5f4be9dc",
   "metadata": {},
   "outputs": [],
   "source": [
    "pdf_filename = \"Employee-Handbook.pdf\""
   ]
  },
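  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Optional: verify the downloaded file\n",
    "\n",
    "A small optional check (an addition to the original flow): confirm the PDF now exists in the notebook's working directory before parsing it. It only assumes the `DOWNLOAD STAGE FILE` command above succeeded and that `pdf_filename` matches the downloaded name."
   ],
   "id": "ad0c0de1"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check: make sure the Stage download landed next to the notebook.\n",
    "# If this raises, re-run the DOWNLOAD STAGE FILE cell or check the Stage path.\n",
    "import os\n",
    "assert os.path.exists(pdf_filename), f\"{pdf_filename} not found in the working directory\"\n",
    "print(f\"Found {pdf_filename} ({os.path.getsize(pdf_filename)/1024:.1f} KiB)\")"
   ],
   "id": "ad0c0de2"
  },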
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pdfplumber, re\n",
    "\n",
    "# Extract raw text from each page of the PDF\n",
    "pages = []\n",
    "try:\n",
    "    with pdfplumber.open(pdf_filename) as pdf:\n",
    "        for i, page in enumerate(pdf.pages):\n",
    "            text = page.extract_text() or \"\"\n",
    "            pages.append({\"page_number\": i+1, \"text\": text})\n",
    "    print(f\"Loaded {len(pages)} pages.\")\n",
    "except Exception as e:\n",
    "    raise RuntimeError(f\"pdfplumber failed to read PDF: {e}\")\n",
    "\n",
    "# Heading regex: all-caps lines or 'Section N' / 'Chapter N' style headings\n",
    "heading_re = re.compile(r\"^(?:[A-Z][A-Z0-9 \\-/]{3,}|Section\\s+\\d+|Chapter\\s+\\d+)$\")\n",
    "chunks = []\n",
    "current_title = None\n",
    "current_body = []\n",
    "current_page_start = None\n",
    "last_page_num = None\n",
    "\n",
    "for page in pages:\n",
    "    for line in page[\"text\"].splitlines():\n",
    "        line_stripped = line.strip()\n",
    "        if heading_re.match(line_stripped) and len(line_stripped.split()) <= 15:\n",
    "            # Flush the previous section before starting a new one\n",
    "            if current_body:\n",
    "                chunks.append({\n",
    "                    \"title\": current_title,\n",
    "                    \"body\": \"\\n\".join(current_body),\n",
    "                    \"page_start\": current_page_start,\n",
    "                    \"page_end\": page[\"page_number\"]\n",
    "                })\n",
    "                current_body = []\n",
    "            current_title = line_stripped\n",
    "            current_page_start = page[\"page_number\"]\n",
    "        else:\n",
    "            if line_stripped:\n",
    "                if current_page_start is None:\n",
    "                    current_page_start = page[\"page_number\"]\n",
    "                current_body.append(line_stripped)\n",
    "    last_page_num = page[\"page_number\"]\n",
    "\n",
    "# Flush the last chunk\n",
    "if current_body:\n",
    "    chunks.append({\n",
    "        \"title\": current_title,\n",
    "        \"body\": \"\\n\".join(current_body),\n",
    "        \"page_start\": current_page_start,\n",
    "        \"page_end\": last_page_num\n",
    "    })\n",
    "\n",
    "print(f\"Chunking produced {len(chunks)} chunks.\")"
   ],
   "id": "53fc1109"
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "a8fefdba",
   "metadata": {},
   "source": [
    "## Reformat Chunks into a Structured DataFrame\n",
    "\n",
    "After processing the PDF, we have a list of chunk dictionaries containing the extracted text along with useful metadata (section title and page range). This metadata enables us to filter and manipulate the document elements based on our requirements. Our next step is to convert these chunks into a structured DataFrame, which is a more suitable format for storing in SingleStoreDB and for further processing in our RAG system."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "b4f19b22",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert chunk dictionaries into a pandas DataFrame\n",
    "data = []\n",
    "for c in chunks:\n",
    "    row = {}\n",
    "    row['Element Type'] = 'Chunk'\n",
    "    row['Filename'] = pdf_filename\n",
    "    row['Date Modified'] = None  # Not available via pdfplumber\n",
    "    row['Filetype'] = 'pdf'\n",
    "    # Use the start page (a page range could be stored instead)\n",
    "    row['Page Number'] = c.get('page_start')\n",
    "    # Combine title + body\n",
    "    if c.get('title'):\n",
    "        row['text'] = f\"{c.get('title')}\\n{c.get('body')}\"\n",
    "    else:\n",
    "        row['text'] = c.get('body')\n",
    "    data.append(row)\n",
    "\n",
    "df = pd.DataFrame(data)\n",
    "print(f\"DataFrame rows: {len(df)}\")\n",
    "df.head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e1cfcd38",
   "metadata": {},
   "source": [
    "## Make Connection to SingleStore Database\n",
    "\n",
    "In this step, we use the SingleStore connection established earlier to create a new table that matches the structure of our DataFrame and to upload our data.
SingleStoreDB Cloud's compatibility with MySQL allows us to leverage its tools for managing data and executing data-related tasks efficiently.\n", + "\n", + "References:\n", + "- [Creating a Database in SingleStoreDB Cloud](https://docs.singlestore.com/cloud/create-a-database/)\n", + "- [Loading Data into SingleStoreDB Cloud](https://docs.singlestore.com/cloud/load-data/)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "7a9d094a", + "metadata": {}, + "outputs": [], + "source": [ + "s2_cur.execute(\"DROP TABLE IF EXISTS unstructured_data;\")\n", + "create_query = (\n", + " \"CREATE TABLE unstructured_data (\"\n", + " \"element_id INT AUTO_INCREMENT PRIMARY KEY, \"\n", + " \"element_type VARCHAR(255), \"\n", + " \"filename VARCHAR(255), \"\n", + " \"date_modified DATETIME, \"\n", + " \"filetype VARCHAR(255), \"\n", + " \"page_number INT, \"\n", + " \"text TEXT)\"\n", + ")\n", + "s2_cur.execute(create_query)\n", + "print(\"Table unstructured_data ready.\")" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "ba220cc1", + "metadata": {}, + "outputs": [], + "source": [ + "for i, row in df.iterrows():\n", + " insert_query = (\n", + " \"INSERT INTO unstructured_data (element_type, filename, date_modified, filetype, page_number, text) \"\n", + " \"VALUES (%s, %s, %s, %s, %s, %s);\"\n", + " )\n", + " s2_cur.execute(insert_query, (\n", + " row['Element Type'], row['Filename'], row['Date Modified'], row['Filetype'], row['Page Number'], row['text']\n", + " ))\n", + "print(f\"Inserted {len(df)} rows into unstructured_data.\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "95f9443b", + "metadata": {}, + "source": [ + "## Create Text Embedding in the Table\n", + "\n", + "Next, we enhance our database table by adding a new column for text embeddings. Using OpenAI's `get_embedding` function, we generate embeddings that measure the relatedness of text strings. These embeddings are particularly useful for search functionality, allowing us to rank results by relevance.\n", + "\n", + "Reference: [Understanding Text Embeddings](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "c95bc511", + "metadata": {}, + "outputs": [], + "source": [ + "s2_cur.execute(\"ALTER TABLE unstructured_data ADD COLUMN text_embedding TEXT;\")\n", + "print(\"Added text_embedding column.\")" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "00b7c77b", + "metadata": {}, + "outputs": [], + "source": [ + "import time, os\n", + "\n", + "# Ensure API key is set (fallback to environment if not already assigned)\n", + "if not getattr(openai, 'api_key', None):\n", + " env_key = os.getenv('OPENAI_API_KEY')\n", + " if env_key:\n", + " openai.api_key = env_key.strip()\n", + " print('Hydrated openai.api_key from environment variable.')\n", + " else:\n", + " raise ValueError('OpenAI API key not set. 
Set the OPENAI_API_KEY environment variable or rerun the key input cell.')\n",
    "\n",
    "# Create an OpenAI client (openai>=1.0 SDK), reusing the key validated above\n",
    "from openai import OpenAI\n",
    "client = OpenAI(api_key=openai.api_key)\n",
    "\n",
    "EMBED_MODEL = \"text-embedding-ada-002\"  # swap in e.g. \"text-embedding-3-small\" if preferred\n",
    "BATCH_SIZE = 10\n",
    "MAX_RETRIES = 3\n",
    "\n",
    "s2_cur.execute(\"SELECT element_id, text FROM unstructured_data WHERE text_embedding IS NULL OR text_embedding = '';\")\n",
    "rows = s2_cur.fetchall()\n",
    "print(f\"Rows needing embeddings: {len(rows)}\")\n",
    "\n",
    "def embed_batch(text_list):\n",
    "    resp = client.embeddings.create(model=EMBED_MODEL, input=text_list)\n",
    "    return [item.embedding for item in resp.data]\n",
    "\n",
    "def embed_text(text):\n",
    "    # Convenience wrapper, reused later for the user query\n",
    "    return embed_batch([text])[0]\n",
    "\n",
    "for i in range(0, len(rows), BATCH_SIZE):\n",
    "    batch = rows[i:i+BATCH_SIZE]\n",
    "    texts = [t for _, t in batch]\n",
    "    attempt = 0\n",
    "    while True:\n",
    "        try:\n",
    "            embeddings = embed_batch(texts)\n",
    "            break\n",
    "        except Exception as e:\n",
    "            attempt += 1\n",
    "            if attempt >= MAX_RETRIES:\n",
    "                print(f\"Failed batch starting at index {i}: {e}\")\n",
    "                embeddings = [None]*len(batch)\n",
    "                break\n",
    "            sleep_time = 2 ** attempt\n",
    "            print(f\"Retry {attempt} for batch starting at {i} after error: {e}. Sleeping {sleep_time}s\")\n",
    "            time.sleep(sleep_time)\n",
    "    for (element_id, _), emb in zip(batch, embeddings):\n",
    "        if emb:\n",
    "            s2_cur.execute(\"UPDATE unstructured_data SET text_embedding = %s WHERE element_id = %s;\", (json.dumps(emb), element_id))\n",
    "\n",
    "print(\"Embedding update complete.\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "fa56983d",
   "metadata": {},
   "source": [
    "## Run User Query Based on Similarity Score\n",
    "\n",
    "The retrieval process begins by selecting the text and its embeddings from our database. We then calculate similarity scores using numpy's dot product, comparing the user query embedding with each document embedding. This allows us to identify and select the top-5 most similar entries, which are most relevant to the user's query.\n",
    "\n",
    "Reference: [How the Dot Product Measures Similarity](https://tivadardanka.com/blog/how-the-dot-product-measures-similarity)"
   ]
  },
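  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Optional: dot product vs. cosine similarity\n",
    "\n",
    "A brief optional aside (an addition to the original flow): OpenAI's embedding vectors are documented as being normalized to length 1, so the plain dot product used below behaves like cosine similarity. The quick check in the next cell assumes the `embed_text` helper defined above; if you ever switch to an embedding model that is not normalized, divide each vector by its L2 norm first."
   ],
   "id": "ad0c0de3"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check: OpenAI embeddings should already be unit-length,\n",
    "# so dot product == cosine similarity. Normalize explicitly if that ever changes.\n",
    "demo_vec = np.asarray(embed_text(\"embedding normalization check\"), dtype=np.float32)\n",
    "print(f\"L2 norm of a sample embedding: {np.linalg.norm(demo_vec):.4f}\")  # expect ~1.0\n",
    "demo_unit = demo_vec / np.linalg.norm(demo_vec)  # explicit normalization, if needed"
   ],
   "id": "ad0c0de4"
  },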
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "35e10fa7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# User query, embedded with the same model as the documents via the helper above\n",
    "search_string = \"What do the emergency management provisions include?\"\n",
    "search_embedding = embed_text(search_string)\n",
    "search_embedding_array = np.asarray(search_embedding, dtype=np.float32)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "876a636b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fetch text, type, filename, and embeddings from the unstructured_data table\n",
    "s2_cur.execute(\"SELECT text, element_type, filename, text_embedding FROM unstructured_data WHERE text_embedding IS NOT NULL;\")\n",
    "results = s2_cur.fetchall()\n",
    "\n",
    "scores = []\n",
    "for text, type_, filename, embedding_str in results:\n",
    "    if embedding_str:\n",
    "        embedding = json.loads(embedding_str)\n",
    "        embedding_array = np.array(embedding)\n",
    "        score = np.dot(search_embedding_array, embedding_array)\n",
    "        scores.append((text, type_, filename, score))\n",
    "\n",
    "# Sort by score and take the top 5\n",
    "top_5 = sorted(scores, key=lambda x: x[3], reverse=True)[:5]\n",
    "\n",
    "# Display top-k records\n",
    "top_5"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "bd06c2d8",
   "metadata": {},
   "source": [
    "## Generate the Answer via OpenAI Chat Completions\n",
    "\n",
    "In the final step, we take the top-5 most similar entries retrieved from the database and pass them to OpenAI's Chat Completions API. The chat models are designed for both multi-turn conversations and single-turn tasks: they take a list of messages as input and return a model-generated message as output, giving us a coherent and contextually relevant response based on the retrieved documents.\n",
    "\n",
    "Reference: [OpenAI Chat Completions API Guide](https://platform.openai.com/docs/guides/gpt/chat-completions-api)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "8a57d965",
   "metadata": {},
   "outputs": [],
   "source": [
    "if top_5:\n",
    "    try:\n",
    "        # Use the openai>=1.0 client created in the embedding cell.\n",
    "        # temperature is omitted because some newer models only accept the default value.\n",
    "        response = client.chat.completions.create(\n",
    "            model=\"gpt-5\",\n",
    "            messages=[\n",
    "                {\"role\": \"system\",\n",
    "                 \"content\": \"You are a helpful assistant. Use the assistant's content to answer the user's query. Summarize your answer based on the context.\"\n",
    "                },\n",
    "                {\"role\": \"assistant\", \"content\": str(top_5)},\n",
    "                {\"role\": \"user\", \"content\": search_string},\n",
    "            ],\n",
    "        )\n",
    "\n",
    "        assistant_message = response.choices[0].message.content\n",
    "        print(\"Assistant's Response:\", assistant_message)\n",
    "\n",
    "    except Exception as e:\n",
    "        print(f\"OpenAI API call failed: {e}\")\n",
    "else:\n",
    "    print(\"No relevant documents found.\")"
   ]
  },
  {
   "id": "f034fab2",
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "
\n", + "
" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebooks/ingest-pdfs-with-unstructured/meta.toml b/notebooks/ingest-pdfs-with-unstructured/meta.toml deleted file mode 100644 index 9520a92..0000000 --- a/notebooks/ingest-pdfs-with-unstructured/meta.toml +++ /dev/null @@ -1,10 +0,0 @@ -[meta] -authors=["singlestore"] -title="Ask questions of your PDFs with Unstructured" -description="Ask questions of your unstructured PDFs. In this notebook, Unstructured.io ingests pdfs accurately, then Open AI is used to create embeddings, the vector data is stored in SingleStore and finally ask questions of your PDF data" -icon="file-export" -difficulty="beginner" -tags=["ingest", "pdf","vector","unstructured"] -lesson_areas=["AI", "Integrations"] -destinations=["spaces"] -minimum_tier="standard" diff --git a/notebooks/ingest-pdfs-with-unstructured/notebook.ipynb b/notebooks/ingest-pdfs-with-unstructured/notebook.ipynb deleted file mode 100644 index f948919..0000000 --- a/notebooks/ingest-pdfs-with-unstructured/notebook.ipynb +++ /dev/null @@ -1,487 +0,0 @@ -{ - "cells": [ - { - "id": "3ba63f11", - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - "
\n", - " \n", - "
\n", - "
\n", - "
SingleStore Notebooks
\n", - "

Ask questions of your PDFs with Unstructured

\n", - "
\n", - "
" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Install Unstructured Library\n", - "\n", - "We'll start by installing the Unstructured library, which is essential for ingesting and processing PDF files. The library will allow us to convert PDF documents into a JSON format that includes both metadata and text extraction. For this part of the project, we'll focus on installing the PDF support components.\n", - "\n", - "Reference for full installation details: [Unstructured Installation Guide](https://unstructured-io.github.io/unstructured/installation/full_installation.html)" - ], - "id": "0680197e" - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "!pip install \"unstructured[pdf]\"" - ], - "id": "3a3fee0a" - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Import Libraries\n", - "\n", - "In this section, we import the necessary libraries for our project. We'll use `pandas` to handle data manipulation, converting our semi-structured JSON data into a structured DataFrame format. This is crucial for storing the data in the SingleStore database later on. Additionally, we'll utilize the OpenAI API for vectorizing text and generating responses, integral components of our RAG system." - ], - "id": "6a27e7f1" - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "import json\n", - "import mysql.connector\n", - "import pandas as pd\n", - "import numpy as np\n", - "\n", - "import openai\n", - "from openai.embeddings_utils import get_embedding" - ], - "id": "6a076d8b" - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Configure OpenAI API and SingleStore Database\n", - "\n", - "Before we proceed, it's important to configure our environment. This involves setting up access to the OpenAI API and the SingleStore cloud database. You'll need to retrieve your OpenAI API key and establish a connection with the SingleStore database. These steps are fundamental for enabling the interaction between our AI models and the database.\n", - "\n", - "- Obtain your OpenAI API key here: [OpenAI API Key](https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key)\n", - "- Set up your SingleStore account and workspace: [SingleStore Setup Guide](https://www.singlestore.com/blog/how-to-get-started-with-singlestore/)\n", - "- Connect to your SingleStore workspace: [SingleStore Connection Documentation](https://docs.singlestore.com/cloud/connect-to-your-workspace/)" - ], - "id": "c40456f7" - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "# OpenAI API Key\n", - "openai.api_key = os.environ[\"OPENAI_API_KEY\"]\n", - "\n", - "# SingleStore DB Connection\n", - "host=os.environ[\"SS_HOST\"]\n", - "port=3306\n", - "username=os.environ[\"SS_USERNAME\"]\n", - "password=os.environ[\"SS_PASSWORD\"]\n", - "database=os.environ[\"SS_DATABASE\"]" - ], - "id": "e8826a8c" - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Unstructured PDF Partition\n", - "\n", - "The PDF Partition step is critical for ingesting and processing the PDF document. Here, we define the filename of the PDF to be processed. We then use the `partition_pdf` function to segment the PDF document, extracting various elements such as text, images, and tables. 
The function can execute locally or make a call to a remote inference server, depending on your setup.\n", - "\n", - "Additionally, the `chunk_by_title` function is used to organize the document into sections based on the presence of titles, with non-text elements being treated as separate sections. The \"fast\" strategy is applied for quick text extraction, which is suitable for text-heavy PDFs.\n", - "\n", - "References:\n", - "- [Partition PDF Documentation](https://unstructured-io.github.io/unstructured/bricks/partition.html#partition-pdf)\n", - "- [Chunk by Title Documentation](https://unstructured-io.github.io/unstructured/bricks/chunking.html)\n", - "- [Strategy Documentation](https://unstructured-io.github.io/unstructured/best_practices/strategies.html)" - ], - "id": "92ae5a1e" - }, -{ - "attachments": {}, - "cell_type": "markdown", - "id": "b5cdd4f1-b27c-4921-ac9f-da41654fd28f", - "metadata": { - "language": "python" - }, - "source": [ - "## Uploading PDF File to Stage\n", - "\n", - "Upload the PDF to the Stage folder (Deployments tab) for the chosen workspace group before ingesting the contents\n", - "\n", - "References:\n", - "- [Stage documentation](https://docs.singlestore.com/cloud/load-data/load-data-from-files/stage/)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d9f964ec-0a77-4cd6-b98c-a9f07bfce293", - "metadata": { - "language": "python", - "trusted": true - }, - "outputs": [], - "source": [ - "# For accessing the stage file inside the notebook , we copy it locally on the container running the notebook using the following \n", - "# fusion SQL command\n", - "DOWNLOAD STAGE FILE 'Employee-Handbook.pdf' TO 'Employee-Handbook.pdf'" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "pdf_filename = \"Employee-Handbook.pdf\"" - ], - "id": "5f4be9dc" - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [], - "source": [ - "from unstructured.partition.pdf import partition_pdf\n", - "from unstructured.chunking.title import chunk_by_title\n", - "\n", - "elements = partition_pdf(pdf_filename,\n", - " strategy=\"fast\",\n", - " )\n", - "\n", - "chunks = chunk_by_title(elements)" - ], - "id": "24879122" - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Reformat JSON Output into Structured Dataframe Format\n", - "\n", - "After processing the PDF, we receive output in an unstructured JSON format, which includes valuable metadata about the extracted elements. This metadata enables us to filter and manipulate the document elements based on our requirements. 
Our next step is to convert this JSON output into a structured DataFrame, which is a more suitable format for storing in the SingleStore DB and for further processing in our RAG system.\n", - "\n", - "Reference for understanding metadata: [Unstructured Metadata Documentation](https://unstructured-io.github.io/unstructured/metadata.html)" - ], - "id": "a8fefdba" - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [], - "source": [ - "# Convert JSON output into Pandas DataFrame\n", - "data = []\n", - "\n", - "for c in chunks:\n", - " row = {}\n", - " row['Element Type'] = type(c).__name__\n", - " row['Filename'] = c.metadata.filename\n", - " row['Date Modified'] = c.metadata.last_modified\n", - " row['Filetype'] = c.metadata.filetype\n", - " row['Page Number'] = c.metadata.page_number\n", - " row['text'] = c.text\n", - " data.append(row)\n", - "\n", - "df = pd.DataFrame(data)\n", - "\n", - "# Show the DataFrame\n", - "df.head()" - ], - "id": "b4f19b22" - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Make Connection to SingleStore Database\n", - "\n", - "In this step, we establish a connection to the SingleStore Database using the MySQL connector. This connection is vital for creating a new table that matches the structure of our DataFrame and for uploading our data. SingleStoreDB Cloud's compatibility with MySQL allows us to leverage its tools for managing data and executing data-related tasks efficiently.\n", - "\n", - "References:\n", - "- [Creating a Database in SingleStoreDB Cloud](https://docs.singlestore.com/cloud/create-a-database/)\n", - "- [Loading Data into SingleStoreDB Cloud](https://docs.singlestore.com/cloud/load-data/)" - ], - "id": "e1cfcd38" - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [], - "source": [ - "# Create connection to S2 Database\n", - "cnx = mysql.connector.connect(user=username,\n", - " password=password,\n", - " host=host,\n", - " database=database)\n", - "cnx" - ], - "id": "7a9d094a" - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [], - "source": [ - "# Drop the existing table\n", - "drop_cursor = cnx.cursor()\n", - "drop_query = \"DROP TABLE IF EXISTS unstructured_data;\"\n", - "drop_cursor.execute(drop_query)\n", - "\n", - "# Create a new table\n", - "create_cursor = cnx.cursor()\n", - "create_query = (\"CREATE TABLE unstructured_data (\"\n", - " \"element_id INT AUTO_INCREMENT PRIMARY KEY, \"\n", - " \"element_type VARCHAR(255), \"\n", - " \"filename VARCHAR(255), \"\n", - " \"date_modified DATETIME, \"\n", - " \"filetype VARCHAR(255), \"\n", - " \"page_number INT, \"\n", - " \"text TEXT);\")\n", - "create_cursor.execute(create_query)\n", - "\n", - "cnx.commit()\n", - "drop_cursor.close()\n", - "create_cursor.close()" - ], - "id": "ba220cc1" - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [], - "source": [ - "cursor = cnx.cursor()\n", - "\n", - "# Loop through the DataFrame and insert each row into the table\n", - "for i, row in df.iterrows():\n", - " insert_query = \"\"\"INSERT INTO unstructured_data (element_type, filename, date_modified, filetype, page_number, text)\n", - " VALUES (%s, %s, %s, %s, %s, %s);\"\"\"\n", - " cursor.execute(insert_query, (row['Element Type'], row['Filename'], row['Date Modified'], row['Filetype'], row['Page Number'], row['text']))\n", - "\n", - "cnx.commit()\n", - "cursor.close()" - ], - "id": "3f7cbbdb" - }, - { - "cell_type": "markdown", - 
"metadata": {}, - "source": [ - "## Create Text Embedding in the Table\n", - "\n", - "Next, we enhance our database table by adding a new column for text embeddings. Using OpenAI's `get_embedding` function, we generate embeddings that measure the relatedness of text strings. These embeddings are particularly useful for search functionality, allowing us to rank results by relevance.\n", - "\n", - "Reference: [Understanding Text Embeddings](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings)" - ], - "id": "95f9443b" - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [], - "source": [ - "cursor = cnx.cursor(buffered=True)\n", - "\n", - "# Add a new column for text embedding\n", - "alter_query = \"ALTER TABLE unstructured_data ADD text_embedding TEXT;\"\n", - "cursor.execute(alter_query)" - ], - "id": "c95bc511" - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [], - "source": [ - "# Select and embed all text in table\n", - "query = \"SELECT text FROM unstructured_data;\"\n", - "cursor.execute(query)\n", - "rows = cursor.fetchall()\n", - "\n", - "for i in rows:\n", - " text_embedding = json.dumps(get_embedding(i[0], engine=\"text-embedding-ada-002\"))\n", - " update_query = (\"UPDATE unstructured_data SET text_embedding = %s WHERE text = %s;\")\n", - " data = (text_embedding, i[0])\n", - " cursor.execute(update_query, data)\n", - "\n", - "cnx.commit()\n", - "cursor.close()" - ], - "id": "00b7c77b" - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Run User Query Based on Similarity Score\n", - "\n", - "The retrieval process begins by selecting the table and text embeddings from our database. We then calculate similarity scores using numpy's dot product function, comparing the user query embeddings with the document embeddings. 
This allows us to identify and select the top-5 most similar entries, which are most relevant to the user's query.\n", - "\n", - "Reference: [How the Dot Product Measures Similarity](https://tivadardanka.com/blog/how-the-dot-product-measures-similarity)" - ], - "id": "fa56983d" - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [], - "source": [ - "# User query\n", - "search_string = \"What are the emergency management provisions include?\"\n", - "search_embedding = get_embedding(search_string, engine=\"text-embedding-ada-002\")\n", - "search_embedding_array = np.array(search_embedding)" - ], - "id": "35e10fa7" - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [], - "source": [ - "cursor = cnx.cursor()\n", - "\n", - "# Fetch text, type, filename, and embeddings from the unstructured_data table\n", - "query = \"SELECT text, element_type, filename, text_embedding FROM unstructured_data;\"\n", - "cursor.execute(query)\n", - "\n", - "results = cursor.fetchall()\n", - "\n", - "# Compute dot product scores\n", - "scores = []\n", - "for res in results:\n", - " text = res[0]\n", - " type_ = res[1]\n", - " filename = res[2]\n", - " embedding_str = res[3]\n", - "\n", - " if embedding_str is not None:\n", - " embedding = json.loads(embedding_str)\n", - " embedding_array = np.array(embedding)\n", - "\n", - " # Compute dot product for all records\n", - " score = np.dot(search_embedding_array, embedding_array)\n", - " scores.append((text, type_, filename, score))\n", - "\n", - "# Sort by score and take the top 5\n", - "top_5 = sorted(scores, key=lambda x: x[3], reverse=True)[:5]\n", - "\n", - "# Close the connection\n", - "cursor.close()\n", - "cnx.close()\n", - "\n", - "# Display top-k records\n", - "top_5" - ], - "id": "876a636b" - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Generate the Answer via OpenAI ChatCompletion\n", - "\n", - "In the final step, we take the top-5 most similar entries retrieved from the database and use them as input for OpenAI's ChatCompletion. The ChatCompletion model is designed for both multi-turn conversations and single-turn tasks. It takes a list of messages as input and returns a model-generated message as output, providing us with a coherent and contextually relevant response based on the retrieved documents.\n", - "\n", - "Reference: [OpenAI Chat Completions API Guide](https://platform.openai.com/docs/guides/gpt/chat-completions-api)" - ], - "id": "bd06c2d8" - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [], - "source": [ - "if top_5:\n", - " try:\n", - " response = openai.ChatCompletion.create(\n", - " model=\"gpt-4\",\n", - " messages=[\n", - " {\"role\": \"system\",\n", - " \"content\": \"You are a useful assistant. Use the assistant's content to answer the user's query. Summarize your answer based on the context.\"\n", - " },\n", - " {\"role\": \"assistant\", \"content\": str(top_5)},\n", - " {\"role\": \"user\", \"content\": search_string},\n", - " ],\n", - " temperature=0\n", - " )\n", - "\n", - " assistant_message = response['choices'][0]['message']['content']\n", - " print(\"Assistant's Response:\", assistant_message)\n", - "\n", - " except Exception as e:\n", - " print(f\"OpenAI API call failed: {e}\")\n", - "else:\n", - " print(\"No relevant documents found.\")" - ], - "id": "8a57d965" - }, - { - "id": "f034fab2", - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - "
" - ] - } - ], - "metadata": { - "application/vnd.databricks.v1+notebook": { - "dashboards": [], - "language": "python", - "notebookMetadata": { - "pythonIndentUnit": 4 - }, - "notebookName": "PDF Processing with Tokenizer & Embedding using UnstructuredIO Python SDK & Delta Table", - "widgets": {} - }, - "kernelspec": { - "display_name": "unstructured-3.10.12", - "language": "python", - "name": "unstructured-3.10.12" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.12" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -}