# Quickstart: Querying a PDF With Astra and LangChain

### A question-answering demo using Astra DB and LangChain, powered by Vector Search

#### Pre-requisites:

You need a **_Serverless Cassandra with Vector Search_** database on [Astra DB](https://astra.datastax.com) to run this demo. As outlined in more detail [here](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html#_prepare_for_using_your_vector_database), you should get a DB Token with role _Database Administrator_ and copy your Database ID: these connection parameters are needed momentarily.

You also need an [OpenAI API Key](https://cassio.org/start_here/#llm-access) for this demo to work.

#### What you will do:

- Setup: import dependencies, provide secrets, read the PDF, create the LangChain vector store;
- Run a Question-Answering loop that retrieves the relevant text chunks and has an LLM construct the answer.

Install the required dependencies:

In [1]:
!pip install -q cassio datasets langchain openai tiktoken

Import the packages you'll need:

In [2]:
# LangChain components to use
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper  # Wraps the vector store so it can be queried directly
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings  # Turns text into embedding vectors

# With CassIO, the engine powering the Astra DB integration in LangChain,
# you will also initialize the DB connection:
import cassio  # Handles the connection to Astra DB, which is built on Apache Cassandra

In [3]:
# PyPDF2 is used to read the PDF
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Using cached pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [4]:
from PyPDF2 import PdfReader

### Setup

#### Provide your secrets:

Replace the following with your Astra DB connection details and your OpenAI API key:

In [10]:
ASTRA_DB_APPLICATION_TOKEN = "" # from "Generate Token": enter the "AstraCS:..." string found in your Token JSON file
ASTRA_DB_ID = "" # enter your Database ID (from the top)

OPENAI_API_KEY = "" # enter your OpenAI key
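
If you would rather not paste secrets into the notebook, one option is to read them from environment variables and fall back to an interactive prompt. The sketch below is an assumption about how you might arrange this, not part of the original demo: the helper name `get_secret` is hypothetical, and the environment variable names simply mirror the variable names used in this notebook.

```python
import os
from getpass import getpass


def get_secret(name, env=None, prompt=getpass):
    """Return the secret called `name` from the environment, or prompt for it.

    `get_secret` is a hypothetical helper, not part of cassio or LangChain.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        # Nothing set in the environment: ask without echoing the input
        value = prompt(f"Enter {name}: ").strip()
    if not value:
        raise ValueError(f"{name} must not be empty")
    return value


# Usage in the notebook (uncomment to use):
# ASTRA_DB_APPLICATION_TOKEN = get_secret("ASTRA_DB_APPLICATION_TOKEN")
# ASTRA_DB_ID = get_secret("ASTRA_DB_ID")
# OPENAI_API_KEY = get_secret("OPENAI_API_KEY")
```

Failing early on an empty value keeps a missing secret from surfacing later as a less obvious connection error.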

In [6]:
# Provide the path to your PDF file
pdfreader = PdfReader('Abhisek-Datta-Resume_LL.pdf')

In [7]:
# Extract the text from every page of the PDF
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()  # Text of this page (may be empty)
    if content:
        raw_text += content  # Append to a single string
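
One detail worth noting: appending page text directly can fuse the last line of one page with the first line of the next. A minimal sketch of a variant that joins pages with a newline follows. The helper name `extract_text_from_pages` is hypothetical, and it only assumes that each page object offers an `extract_text()` method, as PyPDF2 pages do.

```python
def extract_text_from_pages(pages, page_separator="\n"):
    """Join the text of every page, skipping pages with no extractable text.

    Hypothetical helper: `pages` is any iterable of objects exposing
    `extract_text()`, such as `PdfReader(...).pages` from PyPDF2.
    """
    parts = []
    for page in pages:
        content = page.extract_text()
        if content and content.strip():
            parts.append(content.strip())
    return page_separator.join(parts)


# Usage with the reader created earlier (uncomment to use):
# raw_text = extract_text_from_pages(pdfreader.pages)
```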

In [8]:
raw_text

"PROFILE SUMMARY\n•With 3 years of experience, I fully utilize Data Science techniques to analyze and understand data effectively and efficiently.\n•Hands on expertise in Machine Learning, Deep Learning, Natural Language Processing, Computer Vision, and Generative AI,\napplying these methods to solve real-world problems.\n•Proficient in Python for designing and developing algorithms.\n•I'm dedicated to learning and staying updated with the latest trends, contributing to innovative projects and fostering\nteamwork for effective problem-solving.\nWORK EXPERIENCE\nCapgemini India\n•Smart Risk Monitoring: Enhancing Employee Communication Monitoring\n•Data Integration and Embedding: Implemented a comprehensive data integration strategy following a RAG architecture. \nInitially, sensitive documents like chats and call records undergo metadata formation before proceeding to embedding and \nstorage in the vector space to establish a knowledge base. This approach ensures the security and integr

Initialize the connection to your database:

_(do not worry if you see a few warnings, it's just that the drivers are chatty about negotiating protocol versions with the DB.)_

In [11]:
# Initializing the connection to the astra database
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

Create the LangChain embedding and LLM objects for later usage:

In [12]:
llm = OpenAI(openai_api_key=OPENAI_API_KEY)  # The OpenAI LLM
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)  # The OpenAI embedding model



Create your LangChain vector store ... backed by Astra DB!

In [13]:
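astra_vector_store = Cassandra(
    embedding=embedding,  # Embedding model used to convert text into vectors
    table_name="qa_resume_demo",  # Name of the table that stores the vectors
    session=None,  # None: fall back on the connection set up by cassio.init
    keyspace=None,  # None: fall back on the keyspace set up by cassio.init
)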

##### Note: no text has been converted to vectors yet. The embeddings are computed when the data is pushed to the DB

In [16]:
# Split the text into chunks

from langchain.text_splitter import CharacterTextSplitter
# Split the text into small chunks so that each one stays well within the model's token limit
text_splitter = CharacterTextSplitter(
    separator = "\n",  # Split on line breaks
    chunk_size = 200,  # Maximum chunk length in characters (length_function is len, so not tokens)
    chunk_overlap  = 50,  # Characters shared between consecutive chunks
    length_function = len,
)
texts = text_splitter.split_text(raw_text)
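
To build some intuition for what `chunk_size` and `chunk_overlap` do, here is a simplified greedy splitter written in plain Python. It is an illustrative approximation under stated assumptions, not LangChain's actual implementation: the function name `split_with_overlap` is hypothetical, and edge cases may be handled differently by `CharacterTextSplitter`.

```python
def split_with_overlap(text, separator="\n", chunk_size=200, chunk_overlap=50):
    """Pack separator-delimited pieces into chunks of at most `chunk_size`
    characters, re-using trailing pieces worth up to `chunk_overlap` characters.

    Hypothetical helper for illustration. A single piece longer than
    `chunk_size` is emitted as an oversized chunk rather than being cut.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks = []
    current = []  # pieces that make up the chunk being built

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            # The next piece does not fit: close the current chunk
            chunks.append(separator.join(current))
            # Drop leading pieces until what remains fits the overlap budget
            # and leaves room for the incoming piece
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


demo = split_with_overlap("aaaa\nbbbb\ncccc\ndddd", chunk_size=9, chunk_overlap=4)
print(demo)
```

In the demo call, each chunk holds two four-character lines, and the second line of one chunk reappears as the first line of the next: that repetition is the overlap.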

In [18]:
texts

['PROFILE SUMMARY\n•With 3 years of experience, I fully utilize Data Science techniques to analyze and understand data effectively and efficiently.',
 '•Hands on expertise in Machine Learning, Deep Learning, Natural Language Processing, Computer Vision, and Generative AI,\napplying these methods to solve real-world problems.',
 "•Proficient in Python for designing and developing algorithms.\n•I'm dedicated to learning and staying updated with the latest trends, contributing to innovative projects and fostering",
 'teamwork for effective problem-solving.\nWORK EXPERIENCE\nCapgemini India\n•Smart Risk Monitoring: Enhancing Employee Communication Monitoring',
 '•Data Integration and Embedding: Implemented a comprehensive data integration strategy following a RAG architecture.',
 'Initially, sensitive documents like chats and call records undergo metadata formation before proceeding to embedding and',
 'storage in the vector space to establish a knowledge base. This approach ensures the se

### Load the text chunks into the vector store



In [19]:
# Load the chunks into the Astra DB vector store
astra_vector_store.add_texts(texts)

print("Inserted %i chunks." % len(texts))
# Wrap the store in an index so it can be queried with an LLM
astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 47 chunks.
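
To make concrete the idea that embeddings are computed at insertion time and later compared against the embedded query, here is a toy in-memory stand-in. Everything in it is an assumption for illustration: `ToyVectorStore` and `toy_embed` are hypothetical names, the word-count "embedding" is nothing like what OpenAI produces, and the real store persists its vectors in Astra DB rather than in a Python list.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """In-memory illustration only; not the LangChain `Cassandra` store."""

    def __init__(self, embed):
        self.embed = embed
        self.rows = []  # list of (text, vector) pairs

    def add_texts(self, texts):
        # Each text is embedded here, at insertion time
        for text in texts:
            self.rows.append((text, self.embed(text)))
        return len(self.rows)

    def similarity_search_with_score(self, query, k=4):
        # Embed the query, then rank the stored rows by similarity, best first
        query_vector = self.embed(query)
        scored = [
            (text, cosine_similarity(query_vector, vector))
            for text, vector in self.rows
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


store = ToyVectorStore(toy_embed)
store.add_texts([
    "Proficient in Python for designing and developing algorithms.",
    "Machine Learning Algorithms: Linear Regression, Logistic Regression",
    "teamwork for effective problem-solving.",
])
best_text, best_score = store.similarity_search_with_score("machine learning", k=1)[0]
print(f"[{best_score:.4f}] {best_text}")
```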


### Run the QA cycle

Run the cell and ask a question, or type `quit` to stop. (You can also stop execution with the "▪" button on the top toolbar.)

Here are some suggested questions:
- _What is the experience and all


In [22]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    # Pass the question and the LLM to the index: it retrieves relevant chunks and generates the answer
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content))


QUESTION: "Machine Learning"
ANSWER: "Machine Learning is a subset of Artificial Intelligence that involves the use of algorithms and statistical models to enable computer systems to learn and improve from experience without explicitly being programmed. It involves training a computer model on a large dataset and using that model to make predictions or decisions on new data. Some common Machine Learning algorithms include Linear Regression, Logistic Regression, KNN, Decision Tree, Clustering, DBScan, Random Forest, Adaboost, and others. Performance metrics and performance tuning are also important aspects of Machine Learning."

FIRST DOCUMENTS BY RELEVANCE:
    [0.9377] "Network, Transfer Learning
Natural Language Processing / Large Language Model / Generative AI ..."
    [0.9374] "Machine Learning Algorithms
Linear Regression, Logistic Regression, KNN, Decision Tree, Clustering, DBScan, Random Forest, Adaboost, Performance 
Metrics, Performance Tuning
Deep Learning Algorithms ..."
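
Conceptually, each pass of the loop retrieves the most relevant chunks and hands them to the LLM together with the question. The sketch below shows that pattern with plain callables so it runs without a database or an API key. The names `build_prompt` and `answer_with_context`, the prompt wording, and the stand-in retriever and LLM are all hypothetical and do not reproduce LangChain's internal prompt.

```python
def build_prompt(question, chunks):
    """Place the retrieved chunks ahead of the question in a single prompt."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer_with_context(question, retrieve, llm, k=4):
    """Retrieve up to k chunks, build the prompt, and ask the LLM.

    `retrieve(question, k)` returns a list of strings and `llm(prompt)`
    returns a string; both are stand-ins for the real components.
    """
    chunks = retrieve(question, k)
    return llm(build_prompt(question, chunks)).strip()


# Stand-ins so the sketch runs without a database or an API key
def fake_retrieve(question, k):
    corpus = ["Proficient in Python.", "Hands-on expertise in Machine Learning."]
    return corpus[:k]


def fake_llm(prompt):
    # Report the prompt size instead of calling a real model
    return f" (prompt of {len(prompt)} characters received) "


print(answer_with_context("Which languages?", fake_retrieve, fake_llm, k=2))
```

Because the answer is generated from whatever chunks are retrieved, a broad query such as "Machine Learning" can yield a general definition padded with details from the document, as in the output above.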
  