<a href="https://colab.research.google.com/github/sundarramamurthy/llm/blob/main/LangChain_PDFQuery.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Querying Local PDF With Astra and LangChain

Install the required dependencies:

In [1]:
!pip install -q cassio datasets langchain openai tiktoken PyPDF2

The code snippet below imports the components of the pipeline: classes from LangChain, a framework for building applications on top of language models, together with its vector storage and retrieval integrations:

**from langchain.vectorstores.cassandra import Cassandra**: This line imports the Cassandra vector store class from LangChain's vectorstores package. It is used to store and retrieve vector data in a Cassandra database, a NoSQL database known for its scalability and distributed architecture.

**from langchain.indexes.vectorstore import VectorStoreIndexWrapper**: This line imports VectorStoreIndexWrapper, a utility class that wraps a vector store so that it can be queried with a question and an LLM. The vectors it retrieves are the embeddings of the stored text, which is how natural language processing represents words or sentences numerically.

**from langchain.llms import OpenAI**: Here, an OpenAI large language model (LLM) is imported from LangChain's llms module. This LLM is used to generate the answers to the questions asked later in the notebook.

**from langchain.embeddings import OpenAIEmbeddings**: This line imports the OpenAIEmbeddings class, which generates embeddings (vector representations) of text using models provided by OpenAI. These embeddings capture semantic information, so texts with similar meaning end up with similar vectors.

**from datasets import load_dataset**: This imports a function from the Hugging Face datasets library, a popular tool for loading and processing datasets in machine learning. It is not called in this notebook, which reads its text from a local PDF instead.




In [2]:
# LangChain components to use
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings

# Support for dataset retrieval with Hugging Face
from datasets import load_dataset

# With CassIO, the engine powering the Astra DB integration in LangChain,
# you will also initialize the DB connection:
import cassio

In [3]:
# Enter the "AstraCS:..." string found in your Token JSON file.
# Never commit a real token to a shared notebook.
ASTRA_DB_APPLICATION_TOKEN = "AstraCS:..."
# Enter your Database ID
ASTRA_DB_ID = "532327bc-5afc-4462-b617-3bf881edf982"
# Enter your OpenAI key
OPENAI_API_KEY = "sk-..."
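One way to keep credentials out of the notebook itself is to read them from environment variables. The following is a minimal sketch; the helper name `read_secret` is hypothetical and is not part of LangChain or CassIO.

```python
import os


def read_secret(name, default=None):
    """Return the environment variable `name`, falling back to `default`.

    Raises RuntimeError when neither is available, so a missing
    credential fails early instead of at the first API call.
    """
    value = os.environ.get(name, default)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable first.")
    return value
```

With the variables exported before the notebook starts, the assignments in the cell above could become, for example, `OPENAI_API_KEY = read_secret("OPENAI_API_KEY")`.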

In [4]:
from PyPDF2 import PdfReader

In [5]:
# Provide the path to the PDF file.
pdfreader = PdfReader('/content/History of Indian Cinema.pdf')

In [6]:
# Read the text from every page of the PDF,
# skipping pages for which no text could be extracted
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:
        raw_text += content
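The extraction loop above can also be written as a small reusable function. This is a sketch rather than the notebook's own method: the name `extract_pdf_text` and the `separator` parameter are hypothetical, and unlike the cell above it inserts a newline between pages so that the last word of one page is not glued to the first word of the next.

```python
def extract_pdf_text(pages, separator="\n"):
    """Join the text of every page, skipping pages that yield no text.

    `pages` is any iterable of objects with an `extract_text()` method,
    such as `PdfReader(...).pages` from PyPDF2.
    """
    parts = []
    for page in pages:
        content = page.extract_text()
        if content:  # scanned or empty pages may return "" or None
            parts.append(content)
    return separator.join(parts)
```

In the notebook this would be called as `raw_text = extract_pdf_text(pdfreader.pages)`.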

In [7]:
raw_text

' 30  \nCHAPTER – 2 \n \nA BRIEF HISTORY OF INDIAN CINEMA \n \n       Indian films are unquestionably the most –seen movies in the world. Not just \ntalking about the billion- strong audiences in India itself, where 12 million people are \nsaid to go to the cinema every day, but of large audiences well beyond the Indian \nsubcontinent and the Diaspora, in such unlikely places as Russia, China, the Middle \nEast, the Far East Egypt, Turkey and Africa. People from very different cultural and \nsocial worlds have a great love for Indian popular cinema, and many have been Hindi \nFilms fans for over fifty years. \n     Indian cinema is world – famous for the staggering amount of films it produces: \nthe number is constantly on the increase, and recent sources estimate that a total \noutput of some 800 films a year are made in different cities including Madrass , \nBangalore , Calcutta and Hyderabad . Of this astonishing number, those films made in \nBombay, in a seamless blend of Hindi and

Initialize the connection to your database:

In [8]:
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

ERROR:cassandra.connection:Closing connection <AsyncoreConnection(140712176184784) 532327bc-5afc-4462-b617-3bf881edf982-us-east1.db.astra.datastax.com:29042:3345e8ad-bcb9-4af9-b93a-e71d67f95deb> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"

The protocol error logged above is raised when the driver first attempts a beta protocol version (v5) that the server rejects. The driver then typically retries with a supported version, and the cells that follow run against the resulting connection.


In [9]:
llm = OpenAI(openai_api_key=OPENAI_API_KEY)
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)



Create your LangChain vector store backed by Astra DB:

In [10]:
astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="v_table",
    session=None,
    keyspace=None,
)

In [11]:
from langchain.text_splitter import CharacterTextSplitter

# Split the text into overlapping chunks with CharacterTextSplitter so that
# each chunk stays small enough to keep the token count low.
# With length_function=len, chunk_size and chunk_overlap are measured in characters.
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)
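To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window sketch in plain Python. It is not LangChain's algorithm: `CharacterTextSplitter` splits on the separator first and then merges the pieces, whereas this hypothetical `sliding_chunks` function cuts at fixed character offsets.

```python
def sliding_chunks(text, chunk_size=800, chunk_overlap=200):
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so a sentence
    that straddles a boundary still appears whole in one of the chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

For example, `sliding_chunks("0123456789", chunk_size=4, chunk_overlap=2)` yields windows that each repeat the last two characters of the previous one.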

In [12]:
texts[:20]

['30  \nCHAPTER – 2 \n \nA BRIEF HISTORY OF INDIAN CINEMA \n \n       Indian films are unquestionably the most –seen movies in the world. Not just \ntalking about the billion- strong audiences in India itself, where 12 million people are \nsaid to go to the cinema every day, but of large audiences well beyond the Indian \nsubcontinent and the Diaspora, in such unlikely places as Russia, China, the Middle \nEast, the Far East Egypt, Turkey and Africa. People from very different cultural and \nsocial worlds have a great love for Indian popular cinema, and many have been Hindi \nFilms fans for over fifty years. \n     Indian cinema is world – famous for the staggering amount of films it produces: \nthe number is constantly on the increase, and recent sources estimate that a total',
 'Indian cinema is world – famous for the staggering amount of films it produces: \nthe number is constantly on the increase, and recent sources estimate that a total \noutput of some 800 films a year are made 

Load the text chunks into the vector store:

In [13]:
astra_vector_store.add_texts(texts[:50])

print("Inserted %i headlines." % len(texts[:50]))

astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 50 headlines.
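
The cell above inserts only the first 50 chunks. If the whole document should be searchable, one option is to insert the chunks in batches. The helper below is a hypothetical sketch (the name `batched` and the batch size are assumptions, not part of LangChain).

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

In the notebook it could be used as `for batch in batched(texts, 50): astra_vector_store.add_texts(batch)`. Note that every inserted chunk triggers an embedding request, so loading the full document costs more API calls.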


In [None]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("\n")
    print("ANSWER: \"%s\"\n" % answer)

    # print("FIRST DOCUMENTS BY RELEVANCE:")
    # for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
    #     print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))


Enter your question (or type 'quit' to exit): family subject

QUESTION: "family subject"






ANSWER: "Muslim socials were later developed into the ‘ Muslim Social’ genre, which often featured family conflicts and reunions."


What's your next question (or type 'quit' to exit): tamil movies

QUESTION: "tamil movies"






ANSWER: "Tamil movies also incorporated themes and episodes from the Puranas, the Ramayana, and the Mahabharata, as well as folk tales and legends, into their cinematic storytelling. Some examples include Nallathangal, Bhakta Prahlada, and Keechakavadham. Additionally, Tamil movies also explored socially relevant and bold themes, such as socio-economic disparities and potential social revolution, as seen in the film Navalokam."
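
The query loop above relies on the vector store to find the chunks whose embeddings are closest to the embedding of the question, and then passes those chunks to the LLM. The sketch below shows that retrieval step in plain Python with toy vectors. The function names are hypothetical, and cosine similarity is assumed here for illustration; the similarity measure actually used depends on how the Astra DB table is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors `a` and `b` (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, documents, k=4):
    """Return the `k` (score, text) pairs most similar to `query_vector`.

    `documents` is a list of (text, vector) pairs, standing in for the
    rows of the vector store.
    """
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy two-dimensional "embeddings"; real embeddings have many more dimensions.
toy_documents = [
    ("chunk about family dramas", [1.0, 0.0]),
    ("chunk about film exports", [0.0, 1.0]),
    ("chunk about both topics", [1.0, 1.0]),
]
results = top_k([1.0, 0.0], toy_documents, k=2)
```

In the notebook itself, the commented-out `similarity_search_with_score` call in the loop plays this role and can be uncommented to see which chunks were retrieved for each question.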

