<a href="https://colab.research.google.com/github/shum05/cassandraDB_LangChain_QA/blob/main/AstraQA_NLP_Powered_Inquiry_into_PDF_Documents_with_LangChain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AstraQA: NLP-Powered Inquiry into PDF Documents with LangChain
## Project Description
This project shows how Astra DB, LangChain, and Vector Search can turn static PDF documents into a queryable source of information. Users pose questions in natural language and receive answers drawn from the PDF content. Text extracted from the PDFs is split into chunks, embedded, and stored in Astra DB (serverless Cassandra with Vector Search); LangChain then retrieves the chunks most relevant to each question and passes them to an OpenAI language model to generate the answer.

### **1. Install the required dependencies:**

In [11]:
!pip install -q cassio datasets langchain openai tiktoken

**CassIO:** a library for using Apache Cassandra and Astra DB from Python. It handles the connection between your code and the Astra DB instance so that LangChain can store and query vectors there.

**Datasets** (Hugging Face): a library for loading and working with datasets through a standardized interface, widely used for NLP and ML tasks.

**Tiktoken:** a library that counts the tokens in a text string locally, without making API calls. This is useful with models that have token-based limits, such as OpenAI's language models.


In [12]:
!pip install PyPDF2



In [13]:

!pip install nltk




In [14]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### **2. Import the packages you'll need:**

In [15]:
# LangChain components
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.text_splitter import CharacterTextSplitter
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings

# dataset retrieval with Hugging Face
from datasets import load_dataset

# CassIO for Astra DB integration in LangChain
import cassio

# to read the content of a PDF file
from PyPDF2 import PdfReader

# word_tokenize to split the input text into words
from nltk.tokenize import word_tokenize

#### **Purpose of each imported tool or module:**
Together, these tools form a pipeline for querying PDF documents using Astra DB, LangChain, and OpenAI language models.

**Cassandra:** A vector store implementation backed by Cassandra. It stores and retrieves vectorized representations of text data.\
**VectorStoreIndexWrapper:** A convenience wrapper around a vector store. Its `query` method retrieves the chunks most relevant to a question and passes them to an LLM to produce an answer.\
**CharacterTextSplitter:** A tool for splitting text into chunks on a separator character. Here it splits the text extracted from the PDF files into manageable chunks.\
**OpenAI** (from langchain.llms): LangChain's wrapper around OpenAI's large language models (LLMs). Here it generates answers from the retrieved text.\
**OpenAIEmbeddings:** A tool for obtaining embeddings (vector representations) of text from OpenAI's embedding models. Embeddings capture semantic information about the input text.\
**load_dataset:** A function from the Hugging Face datasets library for loading datasets, commonly used in natural language processing (NLP) projects. It is imported here but not used in the cells below.\
**cassio:** A module that initializes and manages the connection to Astra DB, a Cassandra-based NoSQL database, and makes that connection available to LangChain.\
**PdfReader:** A tool for reading PDF files. Here it extracts the text from each page of the PDFs.\
**word_tokenize:** An NLTK function that splits text into word tokens. Here it measures chunk length when the text is split.
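
To make the idea of a vector store more concrete, here is a minimal, library-free sketch of similarity ranking. It uses cosine similarity, one common measure, on made-up 3-dimensional vectors. The helper names (`cosine_similarity`, `rank_by_similarity`) and the toy data are assumptions for illustration only; real embeddings come from `OpenAIEmbeddings` and have far more dimensions, and Astra DB runs the search on the server side.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, store, k=2):
    """Return the k (score, text) pairs most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

# Hypothetical 3-dimensional "embeddings" (illustrative values only)
toy_store = [
    ("NLP applications", [0.9, 0.1, 0.0]),
    ("Cassandra keyspaces", [0.0, 0.2, 0.9]),
    ("Machine translation", [0.8, 0.3, 0.1]),
]

top_hits = rank_by_similarity([1.0, 0.2, 0.0], toy_store, k=2)
for score, text in top_hits:
    print(f"[{score:.4f}] {text}")
```

The `(score, text)` pairs mirror the kind of result that `similarity_search_with_score` returns in the QA loop later in this notebook.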


### **3. Seamless Integration:** Configuring AstraDB and OpenAI Credentials

In [16]:

OPENAI_API_KEY = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

ASTRA_DB_ID = "0xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxb"
ASTRA_DB_APPLICATION_TOKEN = "AstraCS:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxd"
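
Keys hardcoded in a notebook are easy to leak when the notebook is shared. As a minimal sketch of an alternative, the helper below reads a secret from an environment variable and falls back to a hidden prompt. `read_secret` is a hypothetical helper, and the environment variable names in the usage comments are assumptions chosen to match the variables above.

```python
import os
from getpass import getpass

def read_secret(name, prompt=None):
    """Return environment variable `name`, prompting (without echo) if it is unset."""
    value = os.environ.get(name)
    if not value:
        value = getpass(prompt or f"Enter {name}: ")
    return value.strip()

# Usage (commented out so the sketch does not block waiting for input):
# OPENAI_API_KEY = read_secret("OPENAI_API_KEY")
# ASTRA_DB_ID = read_secret("ASTRA_DB_ID")
# ASTRA_DB_APPLICATION_TOKEN = read_secret("ASTRA_DB_APPLICATION_TOKEN")
```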


### **4. Google Drive Integration for Seamless File Access in Colab**
- When the cell below runs, Colab prompts you to authenticate and authorize access to your Google Drive.

In [17]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [18]:
# Provide paths to PDF files
pdf_file_path1 = '/content/drive/MyDrive/NLP_PDF_Files/about_NLP.pdf'
pdf_file_path2 = '/content/drive/MyDrive/NLP_PDF_Files/note_on_NLP.pdf'

# Read PDF files
pdf_reader1 = PdfReader(pdf_file_path1)
pdf_reader2 = PdfReader(pdf_file_path2)

# Extract text and merge
raw_text1 = ''.join(page.extract_text() for page in pdf_reader1.pages)
raw_text2 = ''.join(page.extract_text() for page in pdf_reader2.pages)

# Merge the text from both PDF files
separator = " ========== SPACE BETWEEN FILES ========== "
merged_raw_text = f"{raw_text1}\n\n{separator}{separator}\n\n{raw_text2}"
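
Joining pages with an empty string can fuse the last word of one page to the first word of the next (the chunk preview further down shows `HarabagiuDefinition`). A minimal sketch of a more defensive join follows. `join_page_texts` and `FakePage` are hypothetical names used only for this illustration, and the guard against an empty or `None` result is an assumption about how some PDF pages (for example, image-only ones) may behave.

```python
def join_page_texts(pages, page_separator="\n"):
    """Concatenate the text of each page, skipping pages that yield no text."""
    parts = []
    for page in pages:
        text = page.extract_text() or ""  # treat None or empty as "no text"
        if text.strip():
            parts.append(text)
    return page_separator.join(parts)

class FakePage:
    """Stand-in for a PDF page object, used only to demonstrate the helper."""
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text

demo = join_page_texts([FakePage("Lecture 1"), FakePage(None), FakePage("Definition")])
print(repr(demo))
```

With the real readers, this could be called as `join_page_texts(pdf_reader1.pages)`.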


In [19]:
merged_raw_text



**Initialize the connection to the database:**

In [20]:
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

ERROR:cassandra.connection:Closing connection <AsyncoreConnection(138722702811408) 038b5603-5e01-46e8-a03f-bde5ad2c288b-us-east-2.db.astra.datastax.com:29042:7d1e19c9-12a6-4994-98b3-32954b433065> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


**Create the LangChain embedding and LLM objects for later usage:**

In [21]:
llm = OpenAI(openai_api_key=OPENAI_API_KEY)
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

**Create the Astra DB vector store (an instance of the `Cassandra` class):**

In [22]:
astra_vs = Cassandra(
    embedding=embedding,
    table_name="astra_langchain_qa",
    session=None,  # None: use the session set up by cassio.init()
    keyspace=None,  # None: use the default keyspace set up by cassio.init()
)

In [23]:
# Split the text into chunks so that each one stays within a manageable token count
def token_length_function(text):
    # Use NLTK to tokenize the text into words
    tokens = word_tokenize(text)
    # Return the count of tokens (words)
    return len(tokens)

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=400,  # maximum chunk length, measured by length_function (word tokens here)
    chunk_overlap=50,  # overlap between adjacent chunks, in the same unit as chunk_size
    length_function=token_length_function,  # the built-in len would measure characters instead
)
texts = text_splitter.split_text(merged_raw_text)
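
To illustrate how `chunk_size`, `chunk_overlap`, and `length_function` interact, here is a simplified, library-free sketch of overlap-based chunking. It is not LangChain's implementation; `split_with_overlap` and `count_words` are hypothetical helpers written for this example, and a whitespace split stands in for `word_tokenize`.

```python
def split_with_overlap(text, separator, chunk_size, chunk_overlap, length_function=len):
    """Greedily pack separator-delimited pieces into chunks.

    Sizes are measured with length_function, so chunk_size and chunk_overlap
    use whatever unit that function returns (characters, words, tokens).
    A single piece longer than chunk_size is kept as one oversized chunk.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and length_function(candidate) > chunk_size:
            chunks.append(separator.join(current))
            # Carry over trailing pieces whose combined length fits in the overlap
            overlap = []
            for prev in reversed(current):
                if length_function(separator.join([prev] + overlap)) > chunk_overlap:
                    break
                overlap.insert(0, prev)
            # Drop carried-over pieces if they would push the new chunk over the limit
            while overlap and length_function(separator.join(overlap + [piece])) > chunk_size:
                overlap.pop(0)
            current = overlap
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks

def count_words(text):
    """Whitespace word count, a rough stand-in for a tokenizer."""
    return len(text.split())

demo_text = "one two three\nfour five\nsix seven eight\nnine ten"
demo_chunks = split_with_overlap(
    demo_text, "\n", chunk_size=5, chunk_overlap=2, length_function=count_words
)
print(demo_chunks)
```

In this toy run the line `four five` appears at the end of the first chunk and the start of the second, which is the overlap at work.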

In [24]:
texts[:100]

['1Natural Language Processing\nCS 6320  \nLecture 1\nIntroduction to NLP\nInstructor: Sanda HarabagiuDefinition\n•NLP is concerned with the computational techniques used for \nprocessing human language. It creates and implements computer \nmodels  for the purpose of performing various natural language tasks. \n•These tasks include :\n•Mundane applications , e.g. word counting, spell checking, automatic \nhyphenation\n•Cutting edge applications , e.g. automated question answering on the Web, \nbuilding NL interfaces to databases, machine translation,  and others.\n•What distinguished these applications from other data processing \napplications is their use of knowledge of language .\n•NLP is playing an increasing role in curbing the information  explosion \non Internet and corporate America.  AI vs. NLP\n•People refer to many AI techniques – like Chat GPT, \nwhich are in fact novel NLP methods using GPT3 in an \ninteractive mode.\n•GPT4  and GPT3  are Large  Language  Models (LLMs)\n•L

In [25]:
# Load the first 50 text chunks into the vector store
astra_vs.add_texts(texts[:50])
# Print the number of chunks inserted
print("Inserted %i chunks." % len(texts[:50]))
# Create an index wrapper for the vector store;
# it provides a convenient query() method for question answering
astra_v_index = VectorStoreIndexWrapper(vectorstore=astra_vs)

Inserted 50 chunks.
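
Re-running the insert cell is likely to add the same chunks again, which is one plausible reason the sample output below shows the same passage several times. A minimal sketch of one way to avoid this is to derive a stable ID from each chunk's content and drop exact duplicates before inserting. `chunk_id` and `unique_chunks` are hypothetical helpers, and passing the IDs through an `ids` argument of `add_texts` is an assumption to verify against the installed LangChain version.

```python
import hashlib

def chunk_id(text):
    """Stable identifier derived from the chunk's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def unique_chunks(chunks):
    """Drop exact duplicates (keeping first occurrences) and pair each chunk with its ID."""
    seen = {}
    for chunk in chunks:
        seen.setdefault(chunk_id(chunk), chunk)
    return list(seen.items())  # [(id, text), ...] in first-seen order

demo_pairs = unique_chunks(["alpha", "beta", "alpha"])
print(len(demo_pairs))

# Assumption: add_texts accepts `ids`; if so, re-inserting a chunk with the
# same ID would overwrite the existing row instead of adding a new one, e.g.
# pairs = unique_chunks(texts[:50])
# astra_vs.add_texts([t for _, t in pairs], ids=[i for i, _ in pairs])
```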


### **6. Run the QA cycle**

Run the cell below and ask a question, or type `exit` to stop. (You can also stop execution with the "▪" button on the top toolbar.)

Sample questions:
- What distinguished NLP from other data processing applications?
- What makes NLP applications unique?
- List some NLP applications


In [28]:
my_question = True
while True:
    if my_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() in ("exit", "quit"):  # the prompt says 'quit', so accept both
        break

    if query_text == "":
        continue

    my_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_v_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vs.similarity_search_with_score(query_text, k=4):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))


Enter your question (or type 'quit' to exit): What distinguished NLP from other data processing applications?

QUESTION: "What distinguished NLP from other data processing applications?"




ANSWER: "NLP applications require a lot of hand-coded linguistic knowledge, and researchers were interested in developing symbolic and statistical techniques for processing and understanding language."

FIRST DOCUMENTS BY RELEVANCE:




    [0.9349] "forprocessing, with some researchers downplaying syntax, inparticular ,infavourofwor ..."
    [0.9346] "forprocessing, with some researchers downplaying syntax, inparticular ,infavourofwor ..."
    [0.9346] "forprocessing, with some researchers downplaying syntax, inparticular ,infavourofwor ..."
    [0.9311] "centrating onthe`agent-lik e'applications andneglecting theuser aids. Although thesy ..."

What's your next question (or type 'quit' to exit): List some NLP applications

QUESTION: "List some NLP applications"




ANSWER: "NLP applications include document classification, information extraction, reading comprehension, translation, summarization, knowledge acquisition, question-answering, conversation agents, tutoring systems, problem solving, speech processing, and language generation."

FIRST DOCUMENTS BY RELEVANCE:




    [0.9401] "Modern software development methods play an important role.Applications of NLP
•Text ..."
    [0.9401] "Modern software development methods play an important role.Applications of NLP
•Text ..."
    [0.9401] "Modern software development methods play an important role.Applications of NLP
•Text ..."
    [0.9368] "1Natural Language Processing
CS 6320  
Lecture 1
Introduction to NLP
Instructor: San ..."

What's your next question (or type 'quit' to exit): exit
