# Quickstart: Querying PDF With Astra and LangChain

### A question-answering demo using Astra DB and LangChain, powered by Vector Search

#### Pre-requisites:

You need a **_Serverless Cassandra with Vector Search_** database on [Astra DB](https://astra.datastax.com) to run this demo. As outlined in more detail [here](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html#_prepare_for_using_your_vector_database), get a DB Token with the _Database Administrator_ role and copy your Database ID; you will need both connection parameters shortly.

You also need an [OpenAI API Key](https://cassio.org/start_here/#llm-access) for this demo to work.

#### What you will do:

- Setup: import dependencies, provide secrets, create the LangChain vector store;
- Run a question-answering loop that retrieves the most relevant text chunks from the PDF and has an LLM construct the answer.

Install the required dependencies:

In [1]:
!pip install -q cassio datasets langchain openai tiktoken

Import the packages you'll need:

In [2]:
# LangChain components to use
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings

# Support for dataset retrieval with Hugging Face
from datasets import load_dataset

# With CassIO, the engine powering the Astra DB integration in LangChain,
# you will also initialize the DB connection:
import cassio

In [3]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [4]:
from PyPDF2 import PdfReader

### Setup

#### Provide your secrets:

Replace the following with your Astra DB connection details and your OpenAI API key:

In [5]:
ASTRA_DB_APPLICATION_TOKEN = "Your AstraDB Token"
ASTRA_DB_ID = "Your AstraDB ID"

OPENAI_API_KEY = "Your OpenAI API key"
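Hardcoding secrets in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch, you could read them from environment variables and fall back to the placeholders. The environment variable names and the `read_secret` helper are assumptions for illustration, not part of the original demo:

```python
import os


def read_secret(name, placeholder):
    """Return the environment variable `name`, or `placeholder` if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    return value or placeholder


# Hypothetical variable names; use whatever names your environment defines.
ASTRA_DB_APPLICATION_TOKEN = read_secret("ASTRA_DB_APPLICATION_TOKEN", "Your AstraDB Token")
ASTRA_DB_ID = read_secret("ASTRA_DB_ID", "Your AstraDB ID")
OPENAI_API_KEY = read_secret("OPENAI_API_KEY", "Your OpenAI API key")
```

If a variable is missing, the placeholder string remains in place and the later connection step will fail, which makes the omission easy to spot.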

In [6]:
# Provide the path to your PDF file.
pdfreader = PdfReader('Your File Name.pdf')

In [7]:
# Read the text of every page in the PDF into a single string
raw_text = ''
for i, page in enumerate(pdfreader.pages):
    content = page.extract_text()
    if content:
        raw_text += content

In [8]:
raw_text

'14L2DART: A TrustManagementSystem Integrating\nBlockchain and Off-Chain Computation\nANDREA DE SALVE ,ConsiglioNazionale delle Ricerche -ISASI\nLUCAFRANCESCHI ,Universityof Pisa\nANDREA LISI ,Universityof Pisa,Italyand ConsiglioNazionale delle Ricerche - IIT\nPAOLO MORI ,ConsiglioNazionale delle Ricerche - IIT\nLAURA RICCI ,Universityof Pisa\nTheblockchaintechnologyhasbeengaininganincreasingpopularityforthelastyears,andsmartcontracts\nare being used for a growing number of applications in several scenarios. The execution of smart contracts\non public blockchains can be invoked by any user with a transaction, although in many scenarios there\nwould be the need for restricting the right of executing smart contracts only to a restricted set of users.\nTo help deal with this issue, this article proposes a system based on a popular access control framework\ncalledRT,Role-basedTrustManagement,toregulatesmartcontractsexecutionrights.Theproposedsystem,\ncalled Layer 2 DecentrAlized Role-based
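The extraction loop appends pages back to back, so the last word of one page can fuse with the first word of the next. A minimal sketch of a variant that joins pages with a separator is shown here. `join_pages` and `FakePage` are hypothetical names used only for illustration; the helper relies only on each page exposing `extract_text()`, as the `PdfReader` pages in this notebook do:

```python
def join_pages(pages, separator="\n"):
    """Concatenate the text of each page, skipping pages that yield no text."""
    parts = []
    for page in pages:
        content = page.extract_text()
        if content:
            parts.append(content)
    return separator.join(parts)


class FakePage:
    """Stand-in for a PDF page so the sketch runs without a real file."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


demo_text = join_pages(
    [FakePage("end of page one"), FakePage(""), FakePage("start of page two")]
)
print(repr(demo_text))
```

With a real file you would call `join_pages(pdfreader.pages)`. Note that this only separates pages; words fused within a page (as in the output above) come from the PDF's text extraction and are not repaired by it.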

Initialize the connection to your database:

_(Do not worry if you see a few warnings or protocol errors: the drivers are chatty about negotiating protocol versions with the DB.)_

In [9]:
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

ERROR:cassandra.connection:Closing connection <AsyncoreConnection(140240600987968) 55c8b0b7-80f1-4d0e-8355-c350e79a17d4-us-east1.db.astra.datastax.com:29042:4139b007-aa45-41a0-b4b7-d39107002efc> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


Create the LangChain embedding and LLM objects for later usage:

In [10]:
llm = OpenAI(openai_api_key=OPENAI_API_KEY)
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

Create your LangChain vector store ... backed by Astra DB!

In [11]:
astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="qa_mini_demo",
    session=None,
    keyspace=None,
)

In [12]:
from langchain.text_splitter import CharacterTextSplitter
# Split the text into overlapping chunks so that no single chunk exceeds the model's token limit
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)

In [13]:
texts[:50]

['14L2DART: A TrustManagementSystem Integrating\nBlockchain and Off-Chain Computation\nANDREA DE SALVE ,ConsiglioNazionale delle Ricerche -ISASI\nLUCAFRANCESCHI ,Universityof Pisa\nANDREA LISI ,Universityof Pisa,Italyand ConsiglioNazionale delle Ricerche - IIT\nPAOLO MORI ,ConsiglioNazionale delle Ricerche - IIT\nLAURA RICCI ,Universityof Pisa\nTheblockchaintechnologyhasbeengaininganincreasingpopularityforthelastyears,andsmartcontracts\nare being used for a growing number of applications in several scenarios. The execution of smart contracts\non public blockchains can be invoked by any user with a transaction, although in many scenarios there\nwould be the need for restricting the right of executing smart contracts only to a restricted set of users.',
 'would be the need for restricting the right of executing smart contracts only to a restricted set of users.\nTo help deal with this issue, this article proposes a system based on a popular access control framework\ncalledRT,Role-basedTr
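To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified, pure-Python illustration of the idea: pack separator-delimited lines into chunks up to a size limit, and start each new chunk with the tail of the previous one. This is a sketch under that assumption, not LangChain's actual implementation, and `split_with_overlap` is a hypothetical name:

```python
def split_with_overlap(text, separator="\n", chunk_size=800, chunk_overlap=200):
    """Greedily pack pieces into chunks of at most chunk_size characters,
    carrying up to chunk_overlap trailing characters into the next chunk.
    A single piece longer than chunk_size is kept whole as its own chunk."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the carried-over tail fits
            # within chunk_overlap and leaves room for the new piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# Toy input: four 10-character lines, with small limits to force splitting.
toy_text = "\n".join(["a" * 10, "b" * 10, "c" * 10, "d" * 10])
toy_chunks = split_with_overlap(toy_text, chunk_size=25, chunk_overlap=12)
for chunk in toy_chunks:
    print(repr(chunk))
```

Each toy chunk repeats the last line of the previous one, which mirrors how the second chunk in the output above begins with the final line of the first.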

### Load the text chunks into the vector store



In [14]:

astra_vector_store.add_texts(texts[:50])

print("Inserted %i chunks." % len(texts[:50]))

astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 50 chunks.
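Conceptually, a vector store keeps one embedding vector per chunk and answers a query by ranking the stored vectors by similarity to the query's embedding. The toy sketch below uses hand-made 3-dimensional vectors and cosine similarity to show the ranking step. The vectors, texts, and helper names are made up for illustration, and the similarity measure actually used by the Astra DB table is not asserted here:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the k (score, text) pairs most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Made-up embeddings standing in for what an embedding model would produce.
toy_store = [
    ("chunk about smart contracts", [0.9, 0.1, 0.0]),
    ("chunk about off-chain computation", [0.1, 0.9, 0.1]),
    ("chunk about author affiliations", [0.0, 0.1, 0.9]),
]
toy_query = [0.2, 0.8, 0.0]

results = top_k(toy_query, toy_store, k=2)
for score, text in results:
    print("[%0.4f] %s" % (score, text))
```

In the real demo, `OpenAIEmbeddings` produces the vectors and the database performs the nearest-neighbor search.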


### Run the QA cycle

Run the cell and ask a question, or type `quit` to stop. (You can also stop execution with the "▪" button on the top toolbar.)


In [15]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))


Enter your question (or type 'quit' to exit): How Layer-2 DART is designed

QUESTION: "How Layer-2 DART is designed"




ANSWER: "Layer-2 DART is designed as a layer-2 system that enhances DART by using an off-chain computation model and a verifiable computation approach. Its framework, L2DART, is based on the idea that computing a solution off-chain and verifying it on the blockchain is cheaper than computing it entirely on the blockchain. L2DART stores trust credentials on a public blockchain and uses a backward search algorithm to infer users' roles from these credentials. It also includes an on-chain module and an off-chain module for cost-effective execution and auditability."

FIRST DOCUMENTS BY RELEVANCE:
    [0.9316] "tionofpublicblockchains,inthisarticle,weenhancedDARTmakingitalayer-2systemfollowing
 ..."
    [0.9296] "the art best practices, consisting of an on-chain module as a Solidity smart contrac ..."
    [0.9223] "functionalities to reduce the blockchain costs while keeping blockchain auditability ..."
    [0.9145] "organizations than the one that deployed the smart contract, and the role
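As described at the start, each question is answered in two steps: the most relevant chunks are retrieved, and an LLM then constructs the answer from them. A minimal sketch of the second step's input is shown here, assuming the retrieved chunks are simply pasted into the prompt as context. `build_prompt` and the prompt wording are hypothetical and are not LangChain's actual template:

```python
def build_prompt(question, chunks):
    """Assemble a retrieval-augmented prompt from a question and retrieved chunks."""
    context = "\n\n".join(
        "[%d] %s" % (i, chunk.strip()) for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is not enough, say you don't know.\n\n"
        "Context:\n%s\n\nQuestion: %s\nAnswer:" % (context, question)
    )


# Placeholder chunks; in the demo these would come from the vector store.
demo_prompt = build_prompt(
    "How is Layer-2 DART designed?",
    ["first retrieved chunk", "second retrieved chunk"],
)
print(demo_prompt)
```

Restricting the model to the retrieved context is what ties the answer to the PDF rather than to the model's general knowledge.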