# Querying PDF With Astra and LangChain

### A question-answering project using Astra DB and LangChain, powered by Vector Search

Install the required dependencies:

In [1]:
# !pip install -q cassio datasets langchain openai tiktoken

Import the packages you'll need:

In [None]:
# LangChain components to use

# Cassandra vector store: connects LangChain to Cassandra (and Astra DB),
# embeds text, and stores the resulting vectors in the database
from langchain.vectorstores.cassandra import Cassandra

# VectorStoreIndexWrapper wraps the vector store in a simple interface for querying it
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI

# OpenAI embeddings convert text into vectors
from langchain.embeddings import OpenAIEmbeddings

# Support for dataset retrieval with Hugging Face
from datasets import load_dataset

In [None]:
# With CassIO, the engine powering the Astra DB integration in LangChain,
# you will also initialize the DB connection:
import cassio

In [None]:
%pip install cassandra-driver

In [None]:
%pip install --upgrade astrapy

In [None]:
from astrapy import DataAPIClient

# Initialize the client
client = DataAPIClient("YOUR_TOKEN")
db = client.get_database_by_api_endpoint(
  "YOUR_API_ENDPOINT"
)

print(f"Connected to Astra DB: {db.list_collection_names()}")

In [35]:
# PyPDF2 reads PDF files and extracts the text they contain
%pip install PyPDF2

# PdfReader is the class used to read the document
from PyPDF2 import PdfReader

### Setup

#### Provide your secrets:

Replace the following with your Astra DB connection details and your OpenAI API key:

In [37]:
ASTRA_DB_APPLICATION_TOKEN = "YOUR_TOKEN" # Enter your Token

ASTRA_DB_ID = "YOUR_DATABASE_ID" # Enter your Database ID

OPENAI_API_KEY = "YOUR_OPENAI_KEY" # Enter your OpenAI key
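
As an alternative to hardcoding these values in the notebook, they could be read from environment variables. The sketch below is only an illustration: `load_secret` is a hypothetical helper, and reusing the notebook's variable names as environment variable names is an assumption, not something this project requires.

```python
import os


def load_secret(name, env=None):
    """Return the value stored under `name`, or raise KeyError if it is missing or blank."""
    # Fall back to the real process environment when no mapping is supplied.
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise KeyError(f"Missing required setting: {name}")
    return value


# A plain dict stands in for os.environ so the example runs anywhere.
example_env = {"ASTRA_DB_ID": "example-id", "OPENAI_API_KEY": "  example-key  "}
print(load_secret("ASTRA_DB_ID", example_env))
print(load_secret("OPENAI_API_KEY", example_env))
```

In the notebook this would look like `OPENAI_API_KEY = load_secret("OPENAI_API_KEY")`, which fails early with a clear message if the variable is not set.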

In [38]:
# Provide the path to your PDF file.
pdfreader = PdfReader('asset/Financial Report 2023.pdf')

In [39]:
# Read the text from every page of the PDF
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:
        raw_text += content

In [None]:
raw_text
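
Because the loop above appends each page's text directly, the last line of one page runs straight into the first line of the next. If that matters for the later split on `"\n"`, the pages can be joined with an explicit separator instead. This is a minimal sketch: `join_pages` is a hypothetical helper, and plain strings (or `None`) stand in for the values returned by `page.extract_text()`.

```python
def join_pages(page_texts, separator="\n"):
    """Join the non-empty page texts, placing `separator` between pages."""
    cleaned = (text.strip() for text in page_texts if text and text.strip())
    return separator.join(cleaned)


# Stand-in page contents; None and whitespace-only pages are skipped.
pages = ["Revenue rose.", None, "   ", "Costs fell."]
print(repr(join_pages(pages)))
```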

Initialize the connection to your database:

In [None]:
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

Create the LangChain embedding and LLM objects for later usage:

In [None]:
llm = OpenAI(openai_api_key=OPENAI_API_KEY)
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

Create your LangChain vector store, backed by Astra DB:

In [None]:
astra_vector_store = Cassandra(
    embedding=embedding, 
    table_name="querypdf_db",
    session=None,
    keyspace=None,
)



In [None]:
from langchain.text_splitter import CharacterTextSplitter
# Split the text into smaller chunks so that no single chunk exceeds the token limit
text_splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 800,
    chunk_overlap  = 200,
    length_function = len,
)
texts = text_splitter.split_text(raw_text)
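
To see how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified sliding-window sketch in plain Python. It is not how `CharacterTextSplitter` works internally: the real splitter first splits on the separator and then merges the pieces, so its chunk boundaries fall on newlines. The helper name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=800, chunk_overlap=200):
    """Cut `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Small numbers make the overlap easy to see.
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one. With the notebook's settings, the overlap plays the same role: a sentence that straddles a chunk boundary still appears whole in at least one chunk.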

In [None]:
texts[:50]

### Load the dataset into the vector store



In [None]:

astra_vector_store.add_texts(texts[:50])  # Embeds each chunk and inserts the vectors into Astra DB (Cassandra)

print("Inserted %i text chunks." % len(texts[:50]))

astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)
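
To make the idea of searching stored vectors concrete, the sketch below ranks a few toy vectors by cosine similarity to a query vector. It rests on several assumptions: the three-dimensional vectors are invented for illustration (real embeddings have far more dimensions), cosine similarity is assumed as the metric, and the brute-force scan stands in for the index a vector database would typically use. The names `cosine_similarity` and `top_k` are hypothetical helpers, not part of LangChain or Astra DB.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the k (score, text) pairs most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "vector store": each text is paired with a made-up embedding.
toy_store = [
    ("revenue grew", [1.0, 0.0, 0.0]),
    ("expenses fell", [0.0, 1.0, 0.0]),
    ("revenue and expenses", [1.0, 1.0, 0.0]),
]
for score, text in top_k([1.0, 0.1, 0.0], toy_store):
    print(f"[{score:.4f}] {text}")
```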

### Run the Question Answering cycle

Run the cells and ask a question, or type `quit` to stop. (You can also stop execution with the "▪" button on the top toolbar.)

Here are some suggested questions:

Revenue and Income:

What is the average revenue per year?
What is the total revenue for the reporting period?
How does the revenue compare to the same period last year or previous quarters?
Are there any significant fluctuations in revenue, and if so, what are the reasons behind them?
What are the sources of revenue, and how do they contribute to the overall income?

Expenses:

What are the major expense categories, and how do they compare to budgeted amounts?
Have there been any unexpected or unusual expenses during the reporting period?
How are operating expenses trending over time, and what strategies are in place to manage them?
Are there any cost-saving initiatives or efficiency measures being implemented?


Profitability:

What is the net profit or loss for the period, and how does it compare to expectations or targets?
What is the gross profit margin, and how does it compare to industry benchmarks?
Are there any specific factors influencing profitability, such as pricing changes, market conditions, or competition?


Financial Position:

What is the current financial position of the organization, including assets, liabilities, and equity?
How does the current financial position compare to previous periods, and what are the main drivers of change?
Are there any significant changes in the balance sheet items, such as inventory levels, accounts receivable, or debt obligations?


Cash Flow:

What does the cash flow statement show for operating, investing, and financing activities?
Is the organization generating sufficient cash flow to meet its obligations and fund future growth?
Are there any concerns or challenges related to cash flow management?

Financial Ratios and Metrics:

What are the key financial ratios and metrics, such as liquidity ratios, solvency ratios, and profitability ratios?
How do these ratios compare to industry standards or benchmarks, and what do they indicate about the financial health of the organization?


Risk Management:

What are the key financial risks facing the organization, such as market risk, credit risk, or operational risk?
How is the organization managing these risks, and are there any emerging risks that need to be addressed?
What contingency plans are in place to mitigate potential financial risks or uncertainties?

Future Outlook and Plans:

What is the organization's outlook for future performance and growth?
Are there any strategic initiatives, expansion plans, or investments planned for the upcoming periods?
How do external factors, such as economic trends, regulatory changes, or industry developments, impact the organization's future prospects?


In [51]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=5): # This will print the top 5 documents/results
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:100])) # This will print the first 100 characters of the document
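
The interactive loop above handles one typed question at a time. To work through several of the suggested questions in one go, a small batch helper is one option. In this sketch, `run_questions` is a hypothetical helper and `fake_answer` is a stand-in so the example runs without a database or an API key.

```python
def run_questions(questions, answer_fn):
    """Ask each non-empty question via answer_fn and collect (question, answer) pairs."""
    results = []
    for question in questions:
        question = question.strip()
        if not question:
            continue  # skip blank entries, as the interactive loop does
        results.append((question, answer_fn(question).strip()))
    return results


def fake_answer(question):
    """Placeholder for a real query call; returns a canned string."""
    return f" (placeholder answer for: {question}) "


batch = ["What is the total revenue for the reporting period?", "   "]
for question, answer in run_questions(batch, fake_answer):
    print(f'QUESTION: "{question}"')
    print(f'ANSWER: "{answer}"')
```

In the notebook, `answer_fn` could be `lambda q: astra_vector_index.query(q, llm=llm)`, which is the same call the loop above makes.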


