# Quickstart: Querying a PDF With Astra and LangChain

### A question-answering demo using Astra DB and LangChain, powered by Vector Search

#### Prerequisites:

You need a **_Serverless Cassandra with Vector Search_** database on [Astra DB](https://astra.datastax.com) to run this demo. As outlined in more detail [here](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html#_prepare_for_using_your_vector_database), you should get a DB Token with role _Database Administrator_ and copy your Database ID: these connection parameters are needed momentarily.

You also need an [OpenAI API Key](https://cassio.org/start_here/#llm-access) for this demo to work.

#### What you will do:

- Setup: import dependencies, provide secrets, create the LangChain vector store;
- Run a question-answering loop that retrieves the relevant text chunks from the PDF and has an LLM construct the answer.

Install the required dependencies:

In [1]:
pip install typing-extensions==4.5.0

Collecting typing-extensions==4.5.0
  Using cached typing_extensions-4.5.0-py3-none-any.whl (27 kB)
Installing collected packages: typing-extensions
  Attempting uninstall: typing-extensions
    Found existing installation: typing_extensions 4.9.0
    Uninstalling typing_extensions-4.9.0:
      Successfully uninstalled typing_extensions-4.9.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
openai 1.8.0 requires typing-extensions<5,>=4.7, but you have typing-extensions 4.5.0 which is incompatible.
Successfully installed typing-extensions-4.5.0


In [2]:
!pip install -q cassio datasets langchain openai tiktoken

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-probability 0.22.0 requires typing-extensions<4.6.0, but you have typing-extensions 4.9.0 which is incompatible.
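The pip output above reports version conflicts between installed packages. A minimal sketch for checking which versions actually ended up in the environment, using only the standard library, is shown here. The helper name `installed_version` is an assumption made for illustration; the package names are the ones mentioned in the install cells.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the versions of the packages this notebook installs.
for name in ("typing-extensions", "openai", "langchain", "cassio", "tiktoken"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If the reported versions do not match what a cell expects, restarting the kernel after installation is usually worth trying before anything else.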

Import the packages you'll need:

In [3]:
# LangChain components to use
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings

# Support for dataset retrieval with Hugging Face
from datasets import load_dataset

# With CassIO, the engine powering the Astra DB integration in LangChain,
# you will also initialize the DB connection:
import cassio

In [4]:
!pip install PyPDF2



In [5]:
from PyPDF2 import PdfReader

### Setup

In [15]:
OPENAI_API_KEY = "OPENAI key"

In [7]:
ASTRA_DB_APPLICATION_TOKEN = "...." # enter the "AstraCS:..." string found in your Token JSON file
ASTRA_DB_ID = "..." # enter your Database ID

#### Provide your secrets:

Replace the placeholder values in the two cells above with your Astra DB connection details and your OpenAI API key.
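As an alternative to pasting secrets directly into notebook cells, the values can be read from environment variables, with an interactive prompt as a fallback. The sketch below assumes the environment variable names match the notebook's variable names; the helper `read_secret` is hypothetical and not part of CassIO or LangChain.

```python
import os
from getpass import getpass


def read_secret(name, prompt=None):
    """Return the secret from the environment, or ask for it without echoing."""
    value = os.environ.get(name)
    if not value:
        # getpass hides the typed value, so it does not end up in cell output.
        value = getpass(prompt or f"{name}: ")
    return value


# Example usage (commented out so that this cell does not block on input):
# OPENAI_API_KEY = read_secret("OPENAI_API_KEY")
# ASTRA_DB_APPLICATION_TOKEN = read_secret("ASTRA_DB_APPLICATION_TOKEN")
# ASTRA_DB_ID = read_secret("ASTRA_DB_ID")
```

This keeps the secrets out of the saved notebook file, which matters if the notebook is shared or committed to version control.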

In [8]:
# Provide the path to the PDF file.
pdfreader = PdfReader('budget_speech.pdf')

In [9]:
# Read the text from every page of the PDF and concatenate it
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    # extract_text() may return nothing for pages without a text layer
    if content:
        raw_text += content

In [10]:
raw_text

' \n  \n \n \n \n \n \nBUDGET SPEECH  \n2023-2024 \n \n \n \n \n \n \n \n \n  \n  \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n 1 \n [PART -A] \nHon‘ble Speaker sir and members of this august House,      \n  \n1. Today, I am present before all of you as the Finance Minister to \npresent the budget for the year 2023 -24 in the Delhi Assembly. I feel \nvery grateful, honored and humbled for this responsibility. I would \nhave been more happy if this budget was presented by our respected \nformer Deputy CM Shri Manish Sisodia ji as always. This is the 9th \nbudget of this government and my first as the Finance Minister. I \nwould like to thank the honorable Chief Minister from the bottom of \nmy heart for giving me this opportunity t o present the budget for the 2 \ncrore people of Delhi. We all know that the Budget is not just a \ndocument of numbers and announcements, it is a  manifestation of \nthe hopes and aspirations of every aam-aadmi.  \n \nThis budget has been prepared
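The extracted text above contains long runs of lines that hold only whitespace. The notebook uses the text as-is, but an optional cleanup step could look like the sketch below. The function `normalize_whitespace` is a hypothetical helper, not part of PyPDF2, and whether the cleanup improves retrieval for this document has not been verified here.

```python
import re


def normalize_whitespace(text):
    """Collapse repeated spaces and squeeze runs of blank lines into one."""
    # Collapse spaces/tabs inside each line and trim the line ends.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    cleaned = []
    for line in lines:
        # Keep a blank line only if it directly follows a non-blank line.
        if line or (cleaned and cleaned[-1]):
            cleaned.append(line)
    return "\n".join(cleaned).strip()


sample = " \n  \n \nBUDGET SPEECH  \n2023-2024 \n \n \n 1 \n"
print(repr(normalize_whitespace(sample)))
```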

Initialize the connection to your database:

_(Do not worry if you see a few warnings: the drivers are just chatty about negotiating protocol versions with the DB.)_

In [11]:
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

ERROR:cassandra.connection:Closing connection <AsyncoreConnection(135641341774336) dc0e62c6-01f1-42ad-ab5b-397f9427997f-us-east1.db.astra.datastax.com:29042:56e52adb-9887-482b-a4ce-9ccf4b574572> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"
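If the driver messages are distracting, they can be silenced through the standard `logging` module. The sketch below takes the logger name `cassandra.connection` from the prefix of the message shown above; it would need to run before the `cassio.init(...)` cell to have any effect on that call. Note that it also hides genuine connection errors logged at `ERROR` level by that logger, so it is best treated as a cosmetic option for demos.

```python
import logging

# Logger name taken from the "ERROR:cassandra.connection:..." prefix in the output.
driver_logger = logging.getLogger("cassandra.connection")

# Only CRITICAL messages from this logger will be emitted from now on.
driver_logger.setLevel(logging.CRITICAL)
```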


Create the LangChain embedding and LLM objects for later use:

In [16]:
llm = OpenAI(openai_api_key=OPENAI_API_KEY)
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

Create your LangChain vector store ... backed by Astra DB!

In [17]:
astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="qa_mini_demo",
    session=None,
    keyspace=None,
)

In [18]:
from langchain.text_splitter import CharacterTextSplitter
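# Split the text into overlapping chunks so that each chunk stays small
# enough to embed and to fit into the LLM prompt.
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,       # maximum chunk length, measured by length_function
    chunk_overlap=200,    # characters shared between consecutive chunks
    length_function=len,
)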
texts = text_splitter.split_text(raw_text)
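To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified pure-Python illustration of separator-based chunking with overlap. It is a sketch of the general idea only, not LangChain's actual implementation, and the function name `split_with_overlap` is made up for this example.

```python
def split_with_overlap(text, chunk_size=800, chunk_overlap=200, separator="\n"):
    """Greedily pack separator-delimited pieces into overlapping chunks."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Carry over trailing pieces totalling at most chunk_overlap characters.
            while current and len(separator.join(current)) > chunk_overlap:
                current.pop(0)
            # Drop more carried pieces if the new piece still would not fit.
            while current and len(separator.join(current + [piece])) > chunk_size:
                current.pop(0)
        # A single piece longer than chunk_size is kept whole in this sketch.
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "\n".join(f"line {i:03d}" for i in range(20))
for chunk in split_with_overlap(sample, chunk_size=30, chunk_overlap=10)[:3]:
    print(repr(chunk))
```

The overlap means that a sentence falling on a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice. This matches the output of the notebook's splitter below, where the second chunk begins with text from the end of the first.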

In [19]:
texts[:50]

['BUDGET SPEECH  \n2023-2024 \n \n \n \n \n \n \n \n \n  \n  \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n 1 \n [PART -A] \nHon‘ble Speaker sir and members of this august House,      \n  \n1. Today, I am present before all of you as the Finance Minister to \npresent the budget for the year 2023 -24 in the Delhi Assembly. I feel \nvery grateful, honored and humbled for this responsibility. I would \nhave been more happy if this budget was presented by our respected \nformer Deputy CM Shri Manish Sisodia ji as always. This is the 9th \nbudget of this government and my first as the Finance Minister. I \nwould like to thank the honorable Chief Minister from the bottom of \nmy heart for giving me this opportunity t o present the budget for the 2 \ncrore people of Delhi. We all know that the Budget is not just a',
 'my heart for giving me this opportunity t o present the budget for the 2 \ncrore people of Delhi. We all know that the Budget is not just a \ndocument of numbers and a

### Load the dataset into the vector store



In [20]:

astra_vector_store.add_texts(texts[:50])

print("Inserted %i text chunks." % len(texts[:50]))

astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 50 text chunks.
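Conceptually, a vector store keeps an embedding vector next to each text and answers a query by returning the texts whose vectors are most similar to the query's vector. The toy in-memory sketch below illustrates that idea. Everything in it is an assumption made for illustration: `ToyVectorStore` is not the Cassandra-backed store, `toy_embed` is a keyword counter standing in for OpenAI embeddings, the method names merely echo the ones used in this notebook, and the scores are plain cosine similarities that need not match the scores Astra DB reports.

```python
import math
import re


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_embed(text):
    """Hypothetical embedding: counts of a few hand-picked keywords."""
    vocabulary = ("budget", "bus", "yamuna", "income")
    words = re.findall(r"[a-z]+", text.lower())
    return [words.count(term) for term in vocabulary]


class ToyVectorStore:
    def __init__(self, embed):
        self.embed = embed
        self.rows = []  # list of (vector, text) pairs

    def add_texts(self, texts):
        for text in texts:
            self.rows.append((self.embed(text), text))

    def similarity_search_with_score(self, query, k=4):
        query_vector = self.embed(query)
        scored = [(text, cosine_similarity(query_vector, vector))
                  for vector, text in self.rows]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


store = ToyVectorStore(toy_embed)
store.add_texts(["The budget for the year", "New bus queue shelters", "A clean Yamuna"])
for text, score in store.similarity_search_with_score("bus routes", k=2):
    print(f"[{score:.4f}] {text}")
```

A real store replaces the linear scan with an approximate nearest-neighbour index, which is what makes the search practical for large collections.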


### Run the QA cycle

Run the cell below and ask a question, or type `quit` to stop. (You can also stop execution with the "▪" button on the top toolbar.)

Here are some suggested questions:
- _What is the current GDP?_
- _How much will the agriculture target be increased to, and what will the focus be?_
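In a retrieval-based QA loop of this kind, the chunks returned by the similarity search are typically placed into the prompt together with the question, so the LLM answers from the retrieved text rather than from memory. The sketch below shows that assembly step in isolation. The function `build_prompt` and the prompt wording are illustrative assumptions, not the template that `astra_vector_index.query` uses internally.

```python
def build_prompt(question, chunks):
    """Combine retrieved chunks and the user question into one prompt string."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say that you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


example_chunks = [
    "Delhi's per capita income is likely to increase at current prices.",
    "The service sector contributes mainly to the economy of Delhi.",
]
print(build_prompt("What is the current GDP?", example_chunks))
```

Printing the retrieved chunks next to each answer, as the loop in this notebook does, makes it possible to judge whether the context actually contained the information the question asked for.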


In [22]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content))


Enter your question (or type 'quit' to exit): What is the current GDP?

QUESTION: "What is the current GDP?"




ANSWER: "I don't know."

FIRST DOCUMENTS BY RELEVANCE:
    [0.9103] "prices and 9.14% at constant prices during the year 2021 -22, which 
reflects the impact of the effective measures taken for control of 
COVID -19 pandemic.  
 
15. The contribution of Delhi's real GSDP to the national GDP is 
estimated to increase from 3.94% in 2011 -12 to 4.09% in 2022 -23, 
whereas Delhi accounts for only 1.53% of the country's total 
population. I would like to point out that the service sector contrib utes 
mainly to the economy of Delhi and contributes 84.84% to the Gross 
State Value Added      at prevailing market prices, while the 
secondary sector contributes 12.53% and the primary sector 
contributes 2.63%.  
 
16. Delhi's per capita income is likely to increase to ₹ 4,44,768 at current 
prices in the financial year 2022 -23. In the year 2021 -22, it was ₹ ..."
    [0.9065] "12. Delhi's economy is now slowly emerging from the economic 
challenges of the Covid -19 pandemic. As a result, Delh



ANSWER: "I don't know."

FIRST DOCUMENTS BY RELEVANCE:
    [0.8746] "largest among all states in India by the end  of 2023. We will also 
begin the installation of 1400 new and modern bus queue shelters 
(BQS) with digital screens w ith modern Passenger Informa tion 
System (PIS) will  display the  arrival time  of buses. The dream of a 
clean and beautiful Delhi is incomplete without a clean and beautiful 
Yamuna. In the next year, we will rapidly expand the reach of the 5 
 sewer network to all the colonies and JJ cluste rs of Delhi and 
upgrade the capacities of our sewage treatment plants on a war 
footing to achieve the vision of Clean Yamuna.  
 
9. Speaker sir, the three garbage mountains  of Delhi have been a dark 
spot on Delhi‘s image for several decades now. Though the task of 
clearing these garbage mountains falls in the domain of MCD, for the ..."
    [0.8738] "rationalisation stu dy and worked out an extensive set of last -mile 
connectivity routes that will connect all 