# Quickstart: Querying PDF With Astra and LangChain 
A question-answering demo using Astra DB and LangChain, powered by Vector Search

### Pre-requisites:
You need a Serverless Cassandra with Vector Search database on Astra DB to run this demo. As outlined in more detail here, you should get a DB Token with the role Database Administrator and copy your Database ID. You will need these connection parameters shortly.

You also need an OpenAI API Key for this demo to work.

What you will do: 
* Setup: import dependencies, provide secrets, create the LangChain vector store;
* Run a question-answering loop that retrieves the relevant passages from the PDF and has an LLM construct the answer.

In [1]:
! pip install -q cassio datasets tiktoken

In [2]:
# LangChain components to use
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings 

# Support for dataset retrieval with Hugging Face 
from datasets import load_dataset 

# With CassIO, the engine powering the Astra DB integration in LangChain,
# you will also initialize the DB connection:
import cassio

In [3]:
! pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [4]:
from PyPDF2 import PdfReader

## Setup

#### Provide your secrets:
The cell below loads your Astra DB connection details and your OpenAI API key from a `.env` file (or from environment variables that are already set):

In [6]:
import os
from dotenv import load_dotenv
load_dotenv()
ASTRA_DB_APPLICATION_TOKEN=os.getenv('ASTRA_DB_APPLICATION_TOKEN')
ASTRA_DB_ID=os.getenv('ASTRA_DB_ID')

OPENAI_API_KEY=os.getenv("OPENAI_API_KEY")
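
If one of these variables is missing, `os.getenv` returns `None` without complaint, and the problem only shows up later when connecting to the database or calling OpenAI. The following is a minimal sketch of an early check. The helper name `require_env` is hypothetical (it is not part of `dotenv`, CassIO, or LangChain), and the example uses a stand-in dictionary rather than the real environment.

```python
import os


def require_env(names, environ=None):
    """Return {name: value} for each name, raising if any is unset or empty.

    `require_env` is a hypothetical helper for illustration only.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}


# Example with a stand-in mapping instead of the real environment:
demo_env = {"ASTRA_DB_ID": "demo-id", "OPENAI_API_KEY": ""}
try:
    require_env(["ASTRA_DB_ID", "OPENAI_API_KEY"], environ=demo_env)
except RuntimeError as err:
    print(err)
```

In the notebook itself, a call such as `require_env(["ASTRA_DB_APPLICATION_TOKEN", "ASTRA_DB_ID", "OPENAI_API_KEY"])` right after `load_dotenv()` would serve the same purpose.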

In [7]:
# Provide the path to the PDF file.
pdfreader=PdfReader('budget_speech.pdf')

In [8]:
# Read the text from every page of the PDF
raw_text=''
for page in pdfreader.pages:
    content=page.extract_text()
    # extract_text() can return an empty result for pages without extractable text
    if content:
        raw_text+=content

In [9]:
raw_text

'GOVERNMENT OF INDIA\nBUDGET 2025-2026\nSPEECH\nOF\nNIRMALA SITHARAMAN\nMINISTER OF FINANCE\nFebruary 1,  2025 \nCONTENTS  \n \nPART – A \n Page No.  \nIntroduction  1 \nBudget Theme  1 \nAgriculture as the 1st engine  3 \nMSMEs as the 2nd engine  6 \nInvestment as the 3rd engine  8 \nA. Investing in People  8 \nB. Investing in  the Economy  10 \nC. Investing in Innovation  14 \nExports as the 4th engine  15 \nReforms as the Fuel  16 \nFiscal Policy  18 \n \n \nPART – B \nIndirect taxes  20 \nDirect Taxes   23 \n \nAnnexure to Part -A 29 \nAnnexure to Part -B 31 \n \n   \n \nBudget 202 5-2026 \n \nSpeech of  \nNirmala Sitharaman  \nMinister of Finance  \nFebruary 1 , 202 5 \nHon’ble Speaker,  \n I present the Budget for 2025 -26. \nIntroduction  \n1. This Budget continues our Government ’s efforts to:  \na) accelerate growth,  \nb) secure inclusive development,  \nc) invigorate private sector investments,  \nd) uplift household sentiments, and \ne) enhance spending power of India’s ris

Initialize the connection to your database:
(Do not worry if you see a few warnings; the drivers are just chatty about negotiating protocol versions with the DB.)

In [10]:
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN,database_id=ASTRA_DB_ID)

Create the LangChain embedding and LLM objects for later use:

In [13]:
llm=OpenAI(openai_api_key=OPENAI_API_KEY)
embeddings=OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)


## Create your LangChain vector store ... backed by Astra DB!

In [14]:
astra_vector_store=Cassandra(
    embedding=embeddings,
    table_name="qa_mini_demo",
    session=None,
    keyspace=None,
)

In [15]:
from langchain.text_splitter import CharacterTextSplitter
# Split the text into chunks with CharacterTextSplitter so that each chunk stays small enough for the model's token limit
text_splitter=CharacterTextSplitter(
    separator="\n",
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)
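
To make the roles of `chunk_size` and `chunk_overlap` more concrete, here is a simplified, pure-Python sketch of line-based chunking with overlap. It is an illustration under assumptions, not LangChain's actual implementation: the function name `split_with_overlap` is hypothetical, lengths are measured in characters with `len` (as in the cell above), and edge cases may be handled differently by the library.

```python
def split_with_overlap(text, separator="\n", chunk_size=800, chunk_overlap=200):
    """Greedy line-based chunking with overlap (simplified illustration)."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current, current_len = [], [], 0
    for piece in pieces:
        added = len(piece) + (len(separator) if current else 0)
        if current and current_len + added > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Keep only a tail of at most chunk_overlap characters that
            # also leaves room for the incoming piece.
            while current and (
                current_len > chunk_overlap
                or current_len + len(separator) + len(piece) > chunk_size
            ):
                removed = current.pop(0)
                current_len -= len(removed) + (len(separator) if current else 0)
            added = len(piece) + (len(separator) if current else 0)
        current.append(piece)
        current_len += added
    if current:
        chunks.append(separator.join(current))
    return chunks


# Small demo: ten short lines, tiny limits so the overlap is easy to see.
demo_text = "\n".join("line %d" % i for i in range(10))
for chunk in split_with_overlap(demo_text, chunk_size=30, chunk_overlap=10):
    print(repr(chunk))
```

Each chunk starts with the last line of the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk. A single line longer than `chunk_size` is emitted as its own oversized chunk in this sketch.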

In [16]:
texts=text_splitter.split_text(raw_text)

In [17]:
texts[:50]

['GOVERNMENT OF INDIA\nBUDGET 2025-2026\nSPEECH\nOF\nNIRMALA SITHARAMAN\nMINISTER OF FINANCE\nFebruary 1,  2025 \nCONTENTS  \n \nPART – A \n Page No.  \nIntroduction  1 \nBudget Theme  1 \nAgriculture as the 1st engine  3 \nMSMEs as the 2nd engine  6 \nInvestment as the 3rd engine  8 \nA. Investing in People  8 \nB. Investing in  the Economy  10 \nC. Investing in Innovation  14 \nExports as the 4th engine  15 \nReforms as the Fuel  16 \nFiscal Policy  18 \n \n \nPART – B \nIndirect taxes  20 \nDirect Taxes   23 \n \nAnnexure to Part -A 29 \nAnnexure to Part -B 31 \n \n   \n \nBudget 202 5-2026 \n \nSpeech of  \nNirmala Sitharaman  \nMinister of Finance  \nFebruary 1 , 202 5 \nHon’ble Speaker,  \n I present the Budget for 2025 -26. \nIntroduction  \n1. This Budget continues our Government ’s efforts to:  \na) accelerate growth,',
 'Minister of Finance  \nFebruary 1 , 202 5 \nHon’ble Speaker,  \n I present the Budget for 2025 -26. \nIntroduction  \n1. This Budget continues our Government

In [18]:
astra_vector_store.add_texts(texts)
print("Inserted %i chunks."%len(texts))
astra_vector_index=VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 155 chunks.


## Run the QA cycle 
Simply run the cell below and ask a question, or type `quit` to stop. (You can also stop execution with the stop button on the top toolbar.)

Here are some suggested questions:
* What is the current GDP?
* How much will the agriculture target be increased to, and what will the focus be?

In [19]:
first_question=True
while True:
    if first_question:
        query_text=input("\nEnter your question (or type 'quit' to exit):").strip()
    else:
        query_text=input("\nWhat's your next question (or type 'quit' to exit):").strip()

    if query_text.lower()=="quit":
        break
    if query_text=='':
        continue
    first_question=False
    print("\nQUESTION: \"%s\""%query_text)
    answer=astra_vector_index.query(query_text,llm=llm).strip()
    print("ANSWER: \"%s\"\n"%answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text,k=4):
        print("     [%0.4f] \"%s ...\""%(score,doc.page_content[:84]))


QUESTION: "How much the agriculture target will be increase to and what the focus will be"
ANSWER: "The agriculture target for the Mission for Cotton Productivity will be increased to 1.7 crore farmers. The focus will be on improving productivity and sustainability, promoting extra-long staple cotton varieties, and providing science and technology support to farmers."

FIRST DOCUMENTS BY RELEVANCE:
     [0.9168] "rural areas so that migration is an option, but not a necessity.  
12. The programme ..."
     [0.9160] "and training support to makhana farmers and will also work to ensure they 
receive t ..."
     [0.9153] "Seafood exports are valued at ` 60 thousand crore. To unlock the untapped 
potential ..."
     [0.9139] "sustainable agriculture practices, (3) augment post -harvest stor age at the 
pancha ..."
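
The bracketed numbers in the output above are similarity scores between the embedded question and each stored chunk. The notebook does not show how the Astra DB integration computes them, so the following is only a sketch of the general idea, assuming cosine similarity as the metric. The helper names, the toy three-dimensional vectors, and the sample texts are all made up for illustration; real embeddings have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=4):
    """Return the k (score, text) pairs with the highest similarity."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy "embeddings" standing in for the vectors kept in the vector store.
stored = [
    ("cotton productivity mission", [0.9, 0.1, 0.0]),
    ("seafood exports", [0.1, 0.9, 0.0]),
    ("fiscal deficit", [0.0, 0.1, 0.9]),
]
for score, text in top_k([1.0, 0.2, 0.0], stored, k=2):
    print("[%0.4f] %s" % (score, text))
```

In the real pipeline the database performs this ranking with a vector index rather than a full scan, and the top-ranked chunks are then passed to the LLM as context for the answer.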
