# Quickstart: Querying PDF With Astra and LangChain

### A question-answering demo using Astra DB and LangChain, powered by Vector Search

#### Pre-requisites:

You need a **_Serverless Cassandra with Vector Search_** database on [Astra DB](https://astra.datastax.com) to run this demo. As outlined in more detail [here](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html#_prepare_for_using_your_vector_database), get a DB Token with the _Database Administrator_ role and copy your Database ID: you will need these connection parameters shortly.

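You also need LLM access. The cells below use a Groq-hosted chat model (`Llama3-8b-8192`, with the API key read from the `GROQ_API_KEY` environment variable) and Ollama embeddings (`gemma:2b`, which requires a running Ollama instance), so an [OpenAI API Key](https://cassio.org/start_here/#llm-access) is only needed if you swap in OpenAI models.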

#### What you will do:

- Setup: import dependencies, provide secrets, create the LangChain vector store;
- Extract the text of a PDF, split it into chunks, and load the chunks into the vector store;
- Run a question-answering loop that retrieves the relevant chunks and has an LLM construct the answer.

Install the required dependencies:

In [1]:
!pip install -q cassio datasets langchain langchain-groq langchain-ollama python-dotenv openai tiktoken

In [2]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [5]:
from PyPDF2 import PdfReader

In [3]:
# LangChain components to use
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper


# Support for dataset retrieval with Hugging Face
from datasets import load_dataset

# With CassIO, the engine powering the Astra DB integration in LangChain,
# you will also initialize the DB connection:
import cassio

import os
from dotenv import load_dotenv
load_dotenv()

True

#### Provide your secrets:

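The cell below reads your Astra DB connection details from environment variables (for example, from a `.env` file picked up by `load_dotenv()` above). Set `ASTRA_API_KEY` and `ASTRA_DB_ID` before running it: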

In [4]:
ASTRA_API_KEY = os.getenv("ASTRA_API_KEY")  # the "AstraCS:..." string found in your Token JSON file
ASTRA_DB_ID = os.getenv("ASTRA_DB_ID")  # your Database ID
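
Because `os.getenv` silently returns `None` for a variable that is not set, a missing secret only surfaces later as a connection error. The following is a minimal sketch of an early check; the helper name `require_env` and the placeholder values are hypothetical and not part of the original notebook.

```python
import os

def require_env(names, env=None):
    """Return {name: value} for each name, or raise if any is missing or empty.

    `require_env` is a hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}

# Demonstration with an explicit mapping (placeholder values, not real secrets):
example_env = {"ASTRA_API_KEY": "AstraCS:placeholder", "ASTRA_DB_ID": "placeholder-id"}
secrets = require_env(["ASTRA_API_KEY", "ASTRA_DB_ID"], env=example_env)
```

In the notebook itself you would call `require_env(["ASTRA_API_KEY", "ASTRA_DB_ID", "GROQ_API_KEY"])` without the `env` argument, so that the real environment is checked.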

In [10]:
pdfreader = PdfReader('budget-2025.pdf')

In [15]:
# Read the text of every page of the PDF file
raw_text = ''

for page in pdfreader.pages:
    content = page.extract_text()
    if content:
        raw_text += content

raw_text

' \n \nBudget 202 5-2026 \n \nSpeech of  \nNirmala Sitharaman  \nMinister of Finance  \nFebruary 1 , 202 5 \nHon’ble Speaker,  \n I present the Budget for 2025 -26. \nIntroduction  \n1. This Budget continues our Government ’s efforts to:  \na) accelerate growth,  \nb) secure inclusive development,  \nc) invigorate private sector investments,  \nd) uplift household sentiments, and \ne) enhance spending power of India’s rising middle class.  \n2. Together, we embark on a journey to unlock our nation’s treme ndous \npotential for greater prosperity and global positioning under the leadership of \nHon’ble Prime Minister Shri Narendra Modi.  \n3. As we complete the first quarter of the 21st century, continuing \ngeopolitical headwinds suggest lower  global economic growth ov er the \nmedium term. However, our aspiration for a Viksit Bharat inspires us, and the \ntransformative work we have done during our Government ’s first two terms \nguides us, to march forward resolutely.  \nBudget Them
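
The extracted text above contains many trailing spaces and whitespace-only lines. As an optional step, a light cleanup pass can be applied before splitting. This is only a sketch: `normalize_whitespace` is a hypothetical helper, and it does not repair spaces that the PDF extraction inserted inside words (such as "treme ndous").

```python
import re

def normalize_whitespace(text):
    """Collapse runs of spaces/tabs, strip each line, and drop empty lines."""
    cleaned_lines = []
    for line in text.split("\n"):
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)

# A short fragment taken from the extracted text shown above:
sample = " \n \nBudget 202 5-2026 \n \nSpeech of  \nNirmala Sitharaman  \n"
cleaned = normalize_whitespace(sample)
```

If you use it, `raw_text = normalize_whitespace(raw_text)` would go right after the extraction loop.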

### Initialize the connection to your database:

_(do not worry if you see a few warnings, it's just that the drivers are chatty about negotiating protocol versions with the DB.)_

In [16]:
cassio.init(token=ASTRA_API_KEY, database_id=ASTRA_DB_ID)

### Create the LangChain embedding and LLM objects for later usage:

In [24]:
from langchain_ollama import OllamaEmbeddings, OllamaLLM
from langchain_groq import ChatGroq

groq_api_key=os.getenv("GROQ_API_KEY")

llm=ChatGroq(groq_api_key=groq_api_key,model_name="Llama3-8b-8192")

embeddings = (
    OllamaEmbeddings(model="gemma:2b")
)

print("Embeddings: ", embeddings)
print("LLM: ", llm)

Embeddings:  model='gemma:2b' base_url=None client_kwargs={} mirostat=None mirostat_eta=None mirostat_tau=None num_ctx=None num_gpu=None keep_alive=None num_thread=None repeat_last_n=None repeat_penalty=None temperature=None stop=None tfs_z=None top_k=None top_p=None
LLM:  client=<groq.resources.chat.completions.Completions object at 0x000001FCF78D6800> async_client=<groq.resources.chat.completions.AsyncCompletions object at 0x000001FCF79087F0> model_name='Llama3-8b-8192' model_kwargs={} groq_api_key=SecretStr('**********')


#### Create your LangChain vector store ... backed by Astra DB!

In [25]:
astra_vector_store = Cassandra(
    embedding=embeddings,
    table_name="qa_mini_demo",
    session=None,
    keyspace=None,
)

In [26]:
from langchain.text_splitter import CharacterTextSplitter
# Split the text into overlapping chunks so that no single chunk grows too large for the model's token limits
text_splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 800,
    chunk_overlap  = 200,
    length_function = len,
)
texts = text_splitter.split_text(raw_text)
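
To build intuition for what `chunk_size` and `chunk_overlap` do, here is a simplified, pure-Python illustration of line-based chunking with overlap. It is not LangChain's implementation: `chunk_lines` is a hypothetical function, and edge cases such as a single line longer than `chunk_size` are handled only by letting that line form an oversized chunk on its own.

```python
def chunk_lines(text, chunk_size, chunk_overlap, separator="\n"):
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size characters, repeating trailing pieces (up to chunk_overlap
    characters) at the start of the next chunk."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Keep only the trailing pieces that fit in the overlap budget.
            while current and joined_len(current) > chunk_overlap:
                current.pop(0)
            # Make sure the carried-over pieces still leave room for the new one.
            while current and joined_len(current + [piece]) > chunk_size:
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks

toy_text = "aaaa\nbbbb\ncccc\ndddd"
toy_chunks = chunk_lines(toy_text, chunk_size=10, chunk_overlap=5)
```

With these toy settings each chunk holds two lines and shares one line with its neighbour, which mirrors how the second chunk in the notebook output repeats the tail of the first.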

In [27]:
texts[:50]

['Budget 202 5-2026 \n \nSpeech of  \nNirmala Sitharaman  \nMinister of Finance  \nFebruary 1 , 202 5 \nHon’ble Speaker,  \n I present the Budget for 2025 -26. \nIntroduction  \n1. This Budget continues our Government ’s efforts to:  \na) accelerate growth,  \nb) secure inclusive development,  \nc) invigorate private sector investments,  \nd) uplift household sentiments, and \ne) enhance spending power of India’s rising middle class.  \n2. Together, we embark on a journey to unlock our nation’s treme ndous \npotential for greater prosperity and global positioning under the leadership of \nHon’ble Prime Minister Shri Narendra Modi.  \n3. As we complete the first quarter of the 21st century, continuing \ngeopolitical headwinds suggest lower  global economic growth ov er the',
 'Hon’ble Prime Minister Shri Narendra Modi.  \n3. As we complete the first quarter of the 21st century, continuing \ngeopolitical headwinds suggest lower  global economic growth ov er the \nmedium term. However, ou

### Load the text chunks into the vector store



In [28]:
astra_vector_store.add_texts(texts[:50])

print("Inserted %i chunks." % len(texts[:50]))

astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 50 chunks.


### Run the QA cycle

Simply run the cells and ask a question, or type `quit` to stop. (You can also stop execution with the "▪" button on the top toolbar.)

Here are some suggested questions:
- _What is the current GDP?_
- _To what level will the agriculture target be increased, and what will the focus be?_
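
Each answer is built from the stored chunks that are most similar to the question, and the loop also prints those chunks with a relevance score. The sketch below shows the general idea of ranking by vector similarity with toy two-dimensional vectors. It assumes cosine similarity; the metric and score scaling actually used by the vector store are not shown in this notebook, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the k (score, text) pairs with the highest similarity to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in doc_vecs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy "embeddings": real ones have hundreds or thousands of dimensions.
toy_docs = [
    ("growth", [1.0, 0.0]),
    ("agriculture", [0.0, 1.0]),
    ("growth and agriculture", [1.0, 1.0]),
]
ranked = top_k([1.0, 0.0], toy_docs, k=2)
```

Only the top `k` chunks reach the LLM, which is why a broad request such as a summary of the whole PDF can only reflect a few passages.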


In [30]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))


QUESTION: "what is the curent GDP"
ANSWER: "I don't have information on the current GDP. The text provided does not mention the current GDP. It only mentions that India's economy is the fastest-growing among all major global economies."

FIRST DOCUMENTS BY RELEVANCE:
    [0.8042] "per cent contribution by the Government , and the balance will be mobiliz ed 
from p ..."
    [0.7900] "3) Urban Development;  
4) Mining;  
5) Finan cial Sector; and  
6) Regulatory Refor ..."
    [0.7892] "Hon’ble Prime Minister Shri Narendra Modi.  
3. As we complete the first quarter of  ..."
    [0.7874] "will be identified based on objective criteria.  
89. Facilitation groups with parti ..."

QUESTION: "give me summary for this pdf"
ANSWER: "The PDF appears to be a document outlining the goals and objectives of a national development mission. Here is a summary:

The mission aims to promote economic growth, employment, and innovation by:

1. Fostering "Make in India" by providing policy support, execut