# Querying a PDF with Astra DB and LangChain

### A question-answering demo using Astra DB and LangChain, powered by Vector Search

This notebook demonstrates how to perform semantic search and question-answering over your own PDF documents using [Astra DB](https://astra.datastax.com) with Vector Search and [LangChain](https://python.langchain.com/). You'll learn how to:

- Extract text from a PDF file using PyPDF2.
- Split the text into manageable chunks for embedding.
- Generate vector embeddings with a Hugging Face model and store them in Astra DB.
- Use LangChain to perform semantic search and retrieve relevant document excerpts.
- Interactively ask questions about your PDF and get answers powered by large language models.

Follow the steps in this notebook to set up your environment, connect to Astra DB, process your PDF, and start querying your documents with natural language.

#### Pre-requisites

- Astra DB account with a serverless database enabled for vector search
- Groq API key
- A PDF file to analyze (e.g., `attention.pdf`)

#### Steps

1. Install required Python packages.
2. Enter your Astra DB credentials and Groq API key.
3. Import dependencies.
4. Load and extract text from your PDF.
5. Split the text into chunks.
6. Generate embeddings and store them in Astra DB.
7. Ask questions about your PDF and get answers using vector search and an LLM.


Install the required dependencies:
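
The install cell itself is not shown here. As a stopgap, the sketch below checks whether the modules imported in the next cell can be found in the current environment. The module list is inferred from those imports; `sentence_transformers` is an assumption (HuggingFaceEmbeddings typically needs it at runtime), and `find_missing` is a hypothetical helper name.

```python
import importlib.util

# Top-level modules used by the cells below. sentence_transformers is an
# assumption: HuggingFaceEmbeddings typically depends on it at runtime.
REQUIRED_MODULES = [
    "langchain",
    "langchain_groq",
    "cassio",
    "PyPDF2",
    "datasets",
    "sentence_transformers",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Anything reported as missing would typically be installed with pip under the package names `langchain`, `langchain-groq`, `cassio`, `PyPDF2`, `datasets`, and `sentence-transformers`.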

Import the packages you'll need:

In [13]:
# Import LangChain components for vector store, index wrapper, LLM, and embeddings
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_groq import ChatGroq

# Import Hugging Face dataset loader for optional dataset retrieval
from datasets import load_dataset

# Import CassIO for Astra DB integration and PyPDF2 for PDF reading
import cassio
from PyPDF2 import PdfReader

# Read the Groq and Astra DB credentials from environment variables
# instead of hardcoding them in the notebook
import os
GROQ_API_KEY = os.environ["GROQ_API_KEY"]
ASTRA_DB_APPLICATION_TOKEN = os.environ["ASTRA_DB_APPLICATION_TOKEN"]
ASTRA_DB_ID = os.environ["ASTRA_DB_ID"]

# Specify the path to your PDF file
pdf_path = 'attention.pdf'

# Initialize the PDF reader
pdfreader = PdfReader(pdf_path)

# Extract text from all pages in the PDF, skipping pages with no extractable text
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:
        raw_text += content

# Initialize CassIO with Astra DB credentials
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

# Set up Groq LLM and Hugging Face embedding model
llm = ChatGroq(groq_api_key=GROQ_API_KEY, model_name="llama3-8b-8192")
embedding = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")

# Create the Astra vector store for storing embeddings
astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="qa_mini_demo",
    session=None,
    keyspace=None,
)

# Import text splitter for chunking the PDF text
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)

# Add the first 50 text chunks to the Astra vector store
chunks_to_insert = texts[:50]
astra_vector_store.add_texts(chunks_to_insert)

print("Inserted %i chunks." % len(chunks_to_insert))

# Wrap the vector store with a LangChain index for querying
astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)


Inserted 50 chunks.


The workflow above is a complete Retrieval-Augmented Generation (RAG) pipeline that replaces OpenAI with Groq for the LLM and a Hugging Face model for embeddings. It imports the LangChain components, reads a local PDF with PyPDF2, and splits the extracted text with CharacterTextSplitter into chunks of up to 800 characters with a 200-character overlap, so that each chunk is small enough to embed and retrieve on its own. CassIO then initializes the connection to Astra DB using the application token and database ID.

ChatGroq serves the Llama 3 model (llama3-8b-8192) as the LLM. For embeddings, HuggingFaceEmbeddings loads BAAI/bge-small-en-v1.5, an open-source model that runs locally, so no paid embedding API is needed. The first 50 chunks are embedded and stored in a Cassandra-based vector store on Astra DB through langchain.vectorstores.cassandra. Finally, VectorStoreIndexWrapper wraps the store so it can be queried in natural language, with Groq generating the answers. The same setup can serve as a starting point for applications such as document summarizers or question-answering interfaces.
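
To make the effect of the chunk size and overlap settings concrete, here is a minimal sketch of overlapping character windows. It is a simplified illustration, not how CharacterTextSplitter works internally (that class splits on a separator and merges the pieces, so its boundaries will differ); `chunk_with_overlap` is a hypothetical helper, and the defaults mirror the 800/200 values used in the notebook.

```python
def chunk_with_overlap(text, chunk_size=800, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: CharacterTextSplitter splits on a
    separator and merges pieces, so its chunk boundaries will differ.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Synthetic 2000-character text so the example needs no PDF
sample = "".join(str(i % 10) for i in range(2000))
chunks = chunk_with_overlap(sample)
print(len(chunks), [len(c) for c in chunks])

# The last 200 characters of one chunk reappear at the start of the next
print(chunks[0][-200:] == chunks[1][:200])
```

Because each window advances by 600 characters, a sentence that straddles a boundary still appears whole in at least one chunk, provided it is shorter than the overlap.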


The next cell implements an interactive question-answering loop that uses LangChain’s VectorStoreIndexWrapper and Groq’s Llama 3 model to answer natural language queries over the PDF chunks stored in Astra DB. It prompts for a question and keeps going until the user types "quit"; empty input is skipped. For each query, the index retrieves the relevant chunks and passes them to the Groq-hosted LLM as context for the answer. The loop then runs a separate similarity search with the same Hugging Face embeddings and prints the four best-matching snippets with their similarity scores, which shows which passages the answer is likely drawing on. Together with the previous cell, this completes a lightweight RAG pipeline that needs no OpenAI API.
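
As a side illustration of the similarity-search step described above (separate from the notebook's loop cell, which follows), the sketch below ranks toy vectors by cosine similarity and keeps the top k. Cosine similarity is an assumption here; the metric Astra DB actually applies depends on how the table is configured. The vectors are made up, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), idx)
        for idx, vec in enumerate(doc_vecs)
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(idx, score) for score, idx in scored[:k]]


# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions
docs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.7, 0.7, 0.0],
]
query = [1.0, 0.0, 0.0]

for idx, score in top_k(query, docs, k=2):
    print(f"doc {idx}: {score:.4f}")
```

A vector database performs the same kind of ranking with an index instead of a full scan, which is what keeps the search fast as the number of stored chunks grows.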


In [14]:
first_question = True
while True:
    prompt = (
        "\nEnter your question (or type 'quit' to exit): "
        if first_question else
        "\nWhat's your next question (or type 'quit' to exit): "
    )
    query_text = input(prompt).strip()

    if query_text.lower() == "quit":
        break
    if not query_text:
        continue

    first_question = False

    print(f'\nQUESTION: "{query_text}"')
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print(f'ANSWER: "{answer}"\n')

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        snippet = doc.page_content[:84].replace("\n", " ")
        print(f'    [{score:.4f}] "{snippet} ..."')


QUESTION: "What is supervise machine learning ?"
ANSWER: "Supervised machine learning is a type of machine learning where the algorithm is trained on labeled data, meaning the data has been pre-classified or pre-labeled with the correct output or response. The goal of supervised learning is to train the algorithm to predict the output or response for new, unseen data based on the patterns and relationships learned from the labeled data.

In other words, supervised learning involves learning from a dataset where the correct output is already known, and the algorithm is trained to predict the correct output for new, unseen data. This is in contrast to unsupervised learning, where the algorithm is trained on unlabeled data and must discover patterns and relationships on its own.

Some common examples of supervised learning tasks include:

* Image classification: training an algorithm to classify images as cats or dogs
* Sentiment analysis: training an algorithm to classify text as positi

### Interview Questions and Answers Based on the Workflow

1. **What is Retrieval-Augmented Generation (RAG) and how is it implemented in this notebook?**  
    *RAG combines information retrieval with generative models to answer questions using both retrieved context and language generation. Here, text is extracted from a PDF, chunked, embedded, stored in Astra DB, and relevant chunks are retrieved to provide context for Groq’s LLaMA3 model to generate answers.*

2. **How does Astra DB enable semantic search, and why is it suitable for document retrieval?**  
    *Astra DB stores high-dimensional embeddings and supports similarity queries using vector distance metrics, enabling semantic search that matches queries to document chunks based on meaning rather than keywords.*

3. **Explain the role of Hugging Face embeddings in this pipeline.**  
    *Hugging Face embeddings convert text chunks into numerical vectors that capture semantic meaning, which are then stored and used for similarity search in Astra DB.*

4. **What are the advantages of using Groq’s LLaMA3 model via ChatGroq?**  
    *Llama 3 is an open-weights model from Meta, and Groq serves it through a hosted API with low-latency inference. This avoids dependence on OpenAI and keeps costs low, although Groq is itself a hosted service that still requires an API key.*

5. **Describe the process and importance of chunking text from a PDF for embedding.**  
    *Text is extracted from the PDF and split into manageable chunks to fit embedding model and database limits. Chunking improves retrieval granularity and embedding quality.*

6. **How does LangChain facilitate integration between vector stores and LLMs for question answering?**  
    *LangChain provides abstractions to connect vector stores (like Astra DB) with LLMs, handling retrieval and passing relevant context to the LLM for answer generation.*

7. **What security practices should be followed when handling API keys and credentials in notebooks?**  
    *Use environment variables or secret management tools for sensitive information, avoid hardcoding, and restrict notebook access to authorized users.*

8. **How does the similarity search mechanism work, and what does it provide to the user?**  
    *It compares the embedding of the user’s query with stored document embeddings to find the most relevant chunks, displaying top matches and their similarity scores.*

9. **What are the potential limitations of using open-source embeddings and LLMs?**  
    *They may have lower accuracy or fewer features than proprietary solutions, but offer transparency, flexibility, and cost savings.*

10. **How can this workflow be extended to support multiple documents or real-time updates?**  
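     *Attach a unique document identifier to every chunk as metadata when adding it to the vector store, so that search results can be filtered by, or attributed to, a specific document. For real-time updates, monitor the source files for changes, then re-embed and replace the affected chunks.*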

11. **Why is chunk overlap important when splitting text for embeddings, and how is it configured here?**  
     *Chunk overlap preserves context at chunk boundaries, reducing loss of meaning. Here, it’s set to 200 characters.*

12. **What challenges might arise when extracting text from PDFs, and how can they be addressed?**  
     *Complex layouts, unusual encodings, or scanned pages can produce garbled or empty text. PyPDF2 handles text-based PDFs; for scanned documents, run an OCR tool first, and consider preprocessing the extracted text before chunking.*

13. **How does the workflow ensure only the most relevant document chunks are used for answering a query?**  
     *The similarity search retrieves the top-k most relevant chunks based on vector similarity, and only these are provided as context to the LLM.*

14. **How could you adapt this workflow to support other file formats, such as Word or plain text files?**  
     *Replace the PDF extraction step with appropriate libraries (e.g., python-docx for Word, or direct reading for text files), then follow the same chunking, embedding, and storage process.*

15. **Suppose I am a data scientist with many Excel files containing datasets. How can I use this workflow to gain insights from them?**  
     *You can use pandas to read Excel files, extract and preprocess the relevant text or data, chunk the content as needed, and then embed and store it in Astra DB. For example:*

     ```python
     import pandas as pd
     df = pd.read_excel('example.xlsx')
     text_data = df.to_csv(index=False)  # Convert data to text
     # Then chunk, embed, and store as with PDF text
     ```
     *This allows you to semantically search and query across large collections of Excel datasets for insights.*
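
For question 10 above, one possible way to keep chunks from several documents apart is to attach the source document to each chunk as metadata. The sketch below only builds the parallel lists; `build_records`, the file names, and the metadata keys are hypothetical, and the commented-out call assumes the notebook's vector store accepts a `metadatas` argument in `add_texts`, as LangChain vector stores generally do.

```python
def build_records(documents):
    """Flatten {document_id: [chunk, ...]} into parallel text and metadata lists."""
    texts, metadatas = [], []
    for doc_id, chunks in documents.items():
        for position, chunk in enumerate(chunks):
            texts.append(chunk)
            # Record where each chunk came from so results can be attributed
            metadatas.append({"source": doc_id, "chunk": position})
    return texts, metadatas


# Made-up documents that have already been split into chunks
documents = {
    "report_a.pdf": ["first chunk of A", "second chunk of A"],
    "report_b.pdf": ["only chunk of B"],
}

texts, metadatas = build_records(documents)
for text, meta in zip(texts, metadatas):
    print(meta, "->", text)

# With the vector store from this notebook, the insert would then look like
# (assumption: add_texts accepts a metadatas list):
# astra_vector_store.add_texts(texts, metadatas=metadatas)
```

Each retrieved chunk then carries its source in `doc.metadata`, so the answer loop could print which file a snippet came from.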