# Quickstart: Querying PDF With Astra and LangChain

### A question-answering demo using Astra DB and LangChain, powered by Vector Search

#### Prerequisites:

You need a **_Serverless Cassandra with Vector Search_** database on [Astra DB](https://astra.datastax.com) to run this demo. As outlined in more detail [here](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html#_prepare_for_using_your_vector_database), get a DB Token with the _Database Administrator_ role and copy your Database ID; you will need both connection parameters shortly.

You also need an [OpenAI API Key](https://cassio.org/start_here/#llm-access) for this demo to work.

#### What you will do:

- Setup: import dependencies, provide secrets, read and split the PDF, create the LangChain vector store;
- Run a Question-Answering loop that retrieves the most relevant text chunks and has an LLM construct the answer.
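
As a rough picture of the second step, the retrieved text is usually pasted into a prompt together with the question before the LLM is called. The sketch below is only an illustration of that idea: `build_prompt` and its wording are hypothetical and are not the template LangChain uses internally.

```python
def build_prompt(question, chunks):
    """Combine retrieved text chunks and a question into one prompt string.

    Hypothetical helper for illustration; the wording of the instructions
    is an assumption, not LangChain's own prompt template.
    """
    # Separate chunks with a blank line so the model can tell them apart.
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Telling the model to admit when the context lacks the answer is one common way to reduce made-up answers.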

Install the required dependencies:

In [None]:
%pip install -q cassio datasets langchain openai tiktoken python-dotenv

Import the packages you'll need:

In [2]:
# LangChain components to use
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings

# Support for dataset retrieval with Hugging Face
from datasets import load_dataset

# With CassIO, the engine powering the Astra DB integration in LangChain,
# you will also initialize the DB connection:
import cassio

In [3]:
%pip install PyPDF2

Note: you may need to restart the kernel to use updated packages.


In [4]:
from PyPDF2 import PdfReader

### Setup

In [6]:
import os
from dotenv import load_dotenv
load_dotenv()

ASTRA_DB_APPLICATION_TOKEN = os.getenv("ASTRA_DB_APPLICATION_TOKEN")
ASTRA_DB_ID = os.getenv("ASTRA_DB_ID")

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
# Do not echo the token here: notebook outputs are saved and shared with the file.
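
Before going further, it can help to confirm that all three secrets were actually loaded, without displaying their values. The following is a minimal sketch; `missing_secrets` and `mask` are hypothetical helper names introduced here, not part of `dotenv` or CassIO.

```python
import os

# Names of the environment variables this notebook reads.
REQUIRED_SECRETS = ("ASTRA_DB_APPLICATION_TOKEN", "ASTRA_DB_ID", "OPENAI_API_KEY")


def missing_secrets(names=REQUIRED_SECRETS, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


def mask(secret, visible=4):
    """Hide all but the last `visible` characters of a secret."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Example usage (after load_dotenv()):
#   print("Missing:", missing_secrets())
#   print("Token:", mask(os.getenv("ASTRA_DB_APPLICATION_TOKEN", "")))
```

Printing a masked value is enough to tell two tokens apart while keeping the secret out of the saved notebook output.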

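#### Provide your secrets:

The cell above reads your Astra DB connection details and your OpenAI API key from a `.env` file (or from the environment), so put your own values there. Next, load the PDF you want to query: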

In [8]:
# Provide the path to the PDF file.
pdfreader = PdfReader('us_budget_2025.pdf')

In [9]:
# Read the text from every page of the PDF
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    # Skip pages with no extractable text (e.g. scanned images)
    if content:
        raw_text += content
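
Concatenating pages directly can fuse the last word of one page with the first word of the next. One possible variation, sketched below under the assumption that each page object exposes an `extract_text()` method as the PyPDF2 pages above do, joins the pages with an explicit separator. The helper name `join_page_text` is made up for this example.

```python
def join_page_text(pages, separator="\n"):
    """Join the text of all pages, skipping pages with no extractable text.

    `pages` can be any iterable of objects that have an `extract_text()`
    method, such as `PdfReader(...).pages`.
    """
    parts = []
    for page in pages:
        content = page.extract_text()
        if content:  # ignore None and empty strings
            parts.append(content)
    return separator.join(parts)


# Example usage with the reader created above:
#   raw_text = join_page_text(pdfreader.pages)
```

Using a newline as the separator also gives the text splitter a natural break point at each page boundary.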

In [10]:
raw_text

"OFFICE OF MANAGEMENT AND BUDGET\nBudget\nof the U.S.\nGovernment\nFISCAL YEAR 2025U.S. GOVERNMENT PUBLISHING OFFICE, WASHINGTON 2024Budget of the United States Government , \nFiscal Year 2025 contains the Budget Message of the \nPresident, information on the President’s priorities, \nand summary tables.\nAnalytical Perspectives , Budget of the United \nStates Government, Fiscal Year 2025 contains anal -\nyses that are designed to highlight specified subject \nareas or provide other significant presentations of \nbudget data that place the Budget in perspective.  \nThis volume includes economic and accounting anal -\nyses, information on Federal receipts and collections, \nanalyses of Federal spending, information on Federal \nborrowing and debt, baseline or current services es -\ntimates, and other technical presentations.  \nSupplemental tables and other materials that \nare part of the Analytical Perspectives  volume \nare available at https: //whitehouse.gov /omb/\nanalytical-persp

Initialize the connection to your database:

_(Do not worry if you see a few warnings; the drivers are just chatty about negotiating protocol versions with the DB.)_

In [11]:
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

Create the LangChain embedding and LLM objects for later usage:

In [12]:
llm = OpenAI(openai_api_key=OPENAI_API_KEY)
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)


Create your LangChain vector store ... backed by Astra DB!

In [16]:
astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="qa_mini_demo",
    session=None,   # None: use the session set up by cassio.init()
    keyspace=None,  # None: use the keyspace set up by cassio.init()
)
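
Conceptually, a vector store keeps one embedding vector per text and answers a query by ranking the stored vectors by similarity to the query's embedding. The toy in-memory sketch below illustrates that idea with cosine similarity. The choice of cosine similarity and the exhaustive scan are assumptions made for illustration, and the helper names are hypothetical; a real database like Astra DB uses an index rather than comparing against every row.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, stored, k=4):
    """Return the k (text, score) pairs most similar to the query.

    `stored` is a list of (text, vector) pairs standing in for table rows.
    """
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Example usage with made-up two-dimensional "embeddings":
#   rows = [("defense", [1.0, 0.0]), ("health", [0.0, 1.0]), ("both", [1.0, 1.0])]
#   top_k([1.0, 0.0], rows, k=2)
```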

In [13]:
from langchain.text_splitter import CharacterTextSplitter
# Split the text into overlapping chunks so that each one stays well
# within the embedding model's token limit
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)
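
To see how `chunk_size` and `chunk_overlap` interact, here is a simplified greedy splitter written in plain Python. It is a sketch of the general idea only, not LangChain's implementation, and `split_with_overlap` is a hypothetical name: lines are packed into a chunk until the next line would exceed `chunk_size`, and the trailing lines (up to `chunk_overlap` characters) are carried over into the next chunk.

```python
def split_with_overlap(text, separator="\n", chunk_size=800, chunk_overlap=200):
    """Greedily pack pieces of `text` into overlapping chunks.

    A single piece longer than `chunk_size` is emitted as its own
    oversized chunk rather than being cut.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    def joined_len(parts):
        return len(separator.join(parts))

    pieces = [p for p in text.split(separator) if p]
    chunks = []
    current = []  # pieces of the chunk being built
    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Keep only the tail that fits the overlap budget and still
            # leaves room for the incoming piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.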

In [14]:
texts[:100]

['OFFICE OF MANAGEMENT AND BUDGET\nBudget\nof the U.S.\nGovernment\nFISCAL YEAR 2025U.S. GOVERNMENT PUBLISHING OFFICE, WASHINGTON 2024Budget of the United States Government , \nFiscal Year 2025 contains the Budget Message of the \nPresident, information on the President’s priorities, \nand summary tables.\nAnalytical Perspectives , Budget of the United \nStates Government, Fiscal Year 2025 contains anal -\nyses that are designed to highlight specified subject \nareas or provide other significant presentations of \nbudget data that place the Budget in perspective.  \nThis volume includes economic and accounting anal -\nyses, information on Federal receipts and collections, \nanalyses of Federal spending, information on Federal \nborrowing and debt, baseline or current services es -',
 'yses, information on Federal receipts and collections, \nanalyses of Federal spending, information on Federal \nborrowing and debt, baseline or current services es -\ntimates, and other technical presenta

### Load the text chunks into the vector store



In [17]:

astra_vector_store.add_texts(texts[:100])

print("Inserted %i chunks." % len(texts[:100]))

astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 100 chunks.
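
Since `add_texts` accepts any list of strings, a long document can also be inserted in fixed-size batches instead of hand-written slices. The sketch below uses a hypothetical `batched` helper; the batch size of 50 in the usage comment is an arbitrary choice for illustration.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Example usage with the objects defined in this notebook:
#   for batch in batched(texts, 50):
#       astra_vector_store.add_texts(batch)
```

Each call to `add_texts` computes embeddings for its batch, so smaller batches keep individual requests short and make it easier to resume after a failure.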


In [24]:
# data can be added incrementally
astra_vector_store.add_texts(texts[100:150])

['cd390fa3e15b46b28f07de225584365f',
 '46f2be05c095462baacbd8398d6d0861',
 'f026275a33e049dcaecfaacbfefaf68d',
 'c9fda29beeac4e8696810fa6334fbf85',
 '5324c310bfec40428f557a5fb636e569',
 '52040a8962b14386a39b11ce3d465f3d',
 '2bfcbd34baee4af8a8ef45459c60380b',
 '3267ddaf9190478bb4ba20bbd069f194',
 'ba3270f5c6ca422baab67f6bc98f219c',
 '4dc9cb33f4254811b1e55e8a38159922',
 'a19bbea931364693b89a2e93042cc897',
 'd02ef731da634ecd9db9203a54c6b6ce',
 '8ffee2f5ae0d41488997fb7b73b6b311',
 '22f6e9313ffb49f1960b07ed73118437',
 'c6cc5a9b21bf4c78a3447640619d74d0',
 'f8cb81db48bc4be2bf98b3633cafd5c9',
 'ed047fc1e3b541189e70f8539623052c',
 '650585e58e454ff684b6fcaf5368b2d2',
 '0b7a389ca5fa4865b759fc523199cd4f',
 '549f0e1fccd34e1b8e20819bd86bc2cc',
 'deea9624d71a472e84760e3184d614fa',
 '4fa63ef4ec18426bb4057fc253b5c7f0',
 'd4a9a0c3e51345a79ff13190c15ea1b5',
 'be2da754b24e465d9072f12896b943f5',
 'c43e2800e4f643d7bc955a02b9e6950d',
 'aec7f887b8534e1e9aee4a5ab9946d41',
 '61b03f6d417249ce813a2bfd777ba094',
 

### Run the QA cycle

Simply run the cells and ask a question, or type `quit` to stop. (You can also stop execution with the "▪" button on the top toolbar.)

Here are some suggested questions:
- _What are the US government's spending targets for 2025?_
- _How much is the military spending target for 2025?_
- _What other areas will the government spend on?_


In [21]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))


QUESTION: "How much military target spending will be in 2025"
ANSWER: "The budget for Fiscal Year 2025 includes significant funding for the military, including support for NATO allies and partners, strengthening deterrence in the Indo-Pacific, and improving readiness. However, the exact amount of military spending for 2025 is not specified in the given context."

FIRST DOCUMENTS BY RELEVANCE:
    [0.8982] "that America can once again be relied on to stand up for freedom.  In October, I sub ..."
    [0.8946] "worked together to expand access to healthcare, 
improve access to child and long-te ..."
    [0.8945] "Administration:  established Dependent Care 
Flexible Spending Accounts for service  ..."
    [0.8935] "ability to enforce the law.  In addition, the Budget 
includes $275  million over 10 ..."

QUESTION: "What other areas, the govertment will spend on ?"
ANSWER: "The government will also spend on securing the border, infrastructure projects, clean energy, and providing housing 