## Project (Question Answering on Private Documents) 📃🤔⁉️⁉️

This project builds a chatbot that answers questions from your private documents. First, install the required libraries:

<span style="color:red">`conda env create -f environment.yml`</span>

This installs all the packages required for the project (Question Answering on Private Documents).

To use a conda environment, you must first download Anaconda from this link, <span style="color:blue"> 👉👉 [anaconda](https://www.anaconda.com/)</span>. Click the <span style="color:blue">`Free Download`</span> button and provide your email address; the Anaconda team will send you a download link. Whether you are a Windows or a Mac user, download the compatible version of Anaconda Navigator and Anaconda Prompt.

If you run into problems while downloading or installing Anaconda Navigator, see this link, <span style="color:blue"> 👉👉 [troubleshooting](https://www.anaconda.com/docs/reference/troubleshooting#anaconda-distribution-installation-issues)</span>. 

If you already have an older Anaconda Navigator installed, uninstall it first and then download the latest version compatible with your device.

Other Required Packages are:
- <span style="color:red">pip install pypdf</span>
- <span style="color:red">pip install docx2txt</span>
- <span style="color:red">pip install --upgrade --quiet  langchain langchain-community azure-ai-documentintelligence</span>
- <span style="color:red">pip install wikipedia -q</span>

If you have not installed LangChain, head over to this link ➡️ [langchain installation](https://python.langchain.com/docs/how_to/installation/). Otherwise, you can simply install it with:

- <span style="color:red">pip install langchain</span>
- <span style="color:red">pip install langchain-google-genai</span>

NOTE: For questions about loading other document types with LangChain, see this link. 
👉👉[langchain document loaders](https://python.langchain.com/docs/integrations/document_loaders/)


In [1]:
from langchain_google_genai import ChatGoogleGenerativeAI

import os 
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)

# os.path.splitext('../files/us_constitution.pdf')

True
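
The later cells read API keys from the environment, so it can help to check them right after `load_dotenv`. The following is a minimal sketch, not part of the original notebook: `missing_keys` is a hypothetical helper, `GOOGLE_API_KEY` is an assumed name for the key the Gemini integration reads, and `PINECONE_API_KEY` is the name used further down.

```python
import os

def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# GOOGLE_API_KEY is an assumed name; PINECONE_API_KEY is used later in this notebook.
missing = missing_keys(["GOOGLE_API_KEY", "PINECONE_API_KEY"])
if missing:
    print("Missing keys in .env:", ", ".join(missing))
```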

## Loading custom (Private) PDF Documents

In [4]:
# Load private data
def load_document(file):
    import os
    name, extension = os.path.splitext(file)

    # PDF files
    if extension == ".pdf":
        from langchain_community.document_loaders import PyPDFLoader
        print(f"Loading PDF file ......{file}")
        loader = PyPDFLoader(file)
        print("Done........")

    # Word (.docx) files
    elif extension == ".docx":
        from langchain_community.document_loaders import Docx2txtLoader
        print(f"Loading docx file.......{file}")
        loader = Docx2txtLoader(file)
        print("Done.......")

    # Plain-text files are read directly and returned as a single string,
    # which is what RecursiveCharacterTextSplitter expects later on
    elif extension == ".txt":
        print(f"Loading the txt file.....{file}")
        with open(file, encoding='utf-8') as f:
            text = f.read()
        print("Done......")
        return text

    else:
        print("Document format is not supported!")
        return None

    data = loader.load()
    return data

## Other document types also need to be supported.

In [34]:
file_name = "../files/sj.txt"
data = load_document(file_name)
print("let's look at whether our code is working : ", data[:100])

Loading the txt file......./files/sj.txt
Done......
let's look at whether our code is working :  I am honored to be with you today at your commencement from one of the finest universities in the wo
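
The extension check in `load_document` can also be written as a lookup table, which keeps the supported formats in one place. This is a minimal sketch under stated assumptions: `LOADER_BY_EXTENSION` and `pick_loader` are hypothetical names, and the values are plain labels rather than real loader objects.

```python
import os

# Hypothetical mapping from file extension to the branch taken in load_document.
LOADER_BY_EXTENSION = {
    ".pdf": "PyPDFLoader",
    ".docx": "Docx2txtLoader",
    ".txt": "plain text read",
}

def pick_loader(file):
    """Return the loader label for `file`, or None if the format is unsupported."""
    _, extension = os.path.splitext(file)
    # Lower-casing makes "notes.PDF" resolve the same way as "notes.pdf".
    return LOADER_BY_EXTENSION.get(extension.lower())

print(pick_loader("../files/sj.txt"))
print(pick_loader("notes.PDF"))
print(pick_loader("slides.pptx"))
```

Returning `None` for an unknown extension mirrors the "Document format is not supported!" branch.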


## Loading Public documents (wikipedia)

In [5]:
# Load public document

# query: the topic to search for; lang: the Wikipedia language code; load_max: the maximum number of pages to load
def load_wikipedia_document(query, lang='en', load_max=2):
    from langchain_community.document_loaders import WikipediaLoader
    docs = WikipediaLoader(query=query, lang=lang, load_max_docs=load_max).load()
    return docs

## Chunking Strategies and Splitting in Documents

In [26]:
# Split the text into smaller chunks so they can be stored in the vector DB
def chunk_data(data, file_name=None, chunk_size=256):
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=0)

    # A .txt file is loaded as raw text (str or bytes) rather than as a list of
    # Document objects, so it goes through create_documents instead of split_documents.
    # file_name is kept so existing calls still work; the branch is chosen from the type of data.
    if isinstance(data, bytes):
        data = data.decode('utf-8')
    if isinstance(data, str):
        return text_splitter.create_documents([data])

    return text_splitter.split_documents(data)

# Embedding cost (rough estimate: uses OpenAI's tokenizer and a price of $0.0004 per 1K tokens, not Gemini's)
def print_embedding_cost(texts):
    import tiktoken
    enc = tiktoken.encoding_for_model('text-embedding-ada-002')
    total_tokens = sum([len(enc.encode(page.page_content)) for page in texts])
    print(f"Total Tokens: {total_tokens}")
    print(f"Embedding Cost in USD: {total_tokens / 1000 * 0.0004:.6f}")


In [36]:
chunks = chunk_data(data, file_name)
print("example: ", chunks[1].page_content)

example:  from my life. That’s it. No big deal. Just three stories.
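
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified pure-Python sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter also tries to break on separators such as paragraphs and sentences); `split_fixed` is a hypothetical helper that only cuts fixed-width character windows.

```python
def split_fixed(text, chunk_size=256, chunk_overlap=0):
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "abcdefghij"
print(split_fixed(sample, chunk_size=4))                   # ['abcd', 'efgh', 'ij']
print(split_fixed(sample, chunk_size=4, chunk_overlap=2))  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With `chunk_overlap=0`, as in `chunk_data`, no text is repeated between chunks, so a sentence that straddles a boundary is split across two of them.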


# Deleting Pinecone Index 

In [38]:
def delete_pinecone_index(index_name="all"):
    from pinecone import Pinecone

    pc = Pinecone(
        api_key=os.environ.get("PINECONE_API_KEY")
    )

    if index_name == "all":
        indexes = pc.list_indexes()
        print("Deleting all indexes...")
        for index in indexes:
            pc.delete_index(index.name)
            print(f"completed deleting index {index.name}")
    else:
        print(f"Deleting index {index_name}")
        pc.delete_index(index_name)
        print("Done...")

In [39]:
print(f"Deleting the indexes if it exist in the pinecone .")
delete_pinecone_index()

Deleting the indexes if it exist in the pinecone .
Deleting all indexes...
completed deleting index askadocument


# Initializing a pinecone api 

In [41]:
from pinecone import Pinecone

pc = Pinecone(
    api_key=os.environ.get("PINECONE_API_KEY")
)

# Creating a New index 

In [42]:
def creating_new_index(index_name):
    from pinecone import ServerlessSpec

    # list_indexes() returns index descriptions, so compare against their names
    if index_name not in pc.list_indexes().names():
        # The index does not exist in Pinecone yet, so create it
        print(f"Creating an index name...........{index_name}")
        pc.create_index(
            index_name,
            dimension=3072,  # must match the size of the embedding vectors
            metric='cosine',
            spec=ServerlessSpec(
                cloud='aws',
                region='us-east-1'
            )
        )
        print("Done creating Index..")

    else:
        print(f"Index {index_name} already exists.....", end='')
    

In [43]:
index_name = "askadocument"
creating_new_index(index_name)



Creating an index name...........askadocument
Done creating Index..
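
The index is created with `metric='cosine'`, so matches are ranked by the cosine of the angle between the query vector and each stored vector. Below is a minimal pure-Python sketch of that score, using toy 2-dimensional vectors for illustration. The real index uses 3072 dimensions and the score is computed by Pinecone, not by this helper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors `a` and `b` (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because the score depends only on direction, a vector and a scaled copy of it are treated as identical.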


# Fetching an index if index_name exists 

In [44]:
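def fetching_index(index_name):
    from langchain_pinecone import PineconeVectorStore
    from langchain_google_genai import GoogleGenerativeAIEmbeddings

    # Use the same embedding model that was used to store the documents
    embeddings = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")

    if index_name in pc.list_indexes().names():
        print(f"Index {index_name} already exists. Loading embeddings.....", end='')
        vector_store = PineconeVectorStore.from_existing_index(index_name, embeddings)
        print("Done ")
        return vector_store

    print(f"Index {index_name} does not exist.")
    return None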
        

# Connecting with Index in Pinecone

In [45]:
print(f"Connecting with index....{index_name}")
index = pc.Index(index_name)
print("Completed....")
print(index.describe_index_stats())

Connecting with index....askadocument
Completed....
{'dimension': 3072,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'dense'}
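
The `dimension` reported above has to equal the length of every embedding vector written to the index. A minimal sketch of that check follows, assuming the stats are available as a plain dictionary (the real response object may need attribute access instead); `check_dimension` is a hypothetical helper.

```python
def check_dimension(index_stats, vector):
    """Return True if `vector` has the index dimension, otherwise raise ValueError."""
    expected = index_stats["dimension"]
    if len(vector) != expected:
        raise ValueError(f"expected {expected} values, got {len(vector)}")
    return True

stats = {"dimension": 3072, "total_vector_count": 0}  # subset of the stats printed above
print(check_dimension(stats, [0.0] * 3072))
```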


# Storing the chunks in the vectors 

In [49]:
# Use this function only when you have created a new index and are storing documents in it for the first time
def embedding_and_storing(index_name, chunks):

    from langchain_pinecone import PineconeVectorStore
    from langchain_google_genai import GoogleGenerativeAIEmbeddings

    embeddings = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")

    print("only applicable if you have created a new Index....")
    vectorstore = PineconeVectorStore.from_documents(
        documents=chunks,
        embedding=embeddings,
        index_name=index_name
    )
    print("Documents successfully uploaded to pinecone!!")

    return vectorstore

In [50]:
vector = embedding_and_storing(index_name, chunks)
print(vector)

only applicable if you have created a new Index....
Documents successfully uploaded to pinecone!!
<langchain_pinecone.vectorstores.PineconeVectorStore object at 0x0000022BBCF714D0>


# Chatting with the chatbot 

In [47]:
def ask_and_get_answer(vectorstore, q):
    from langchain.chains import RetrievalQA
    from langchain_google_genai import ChatGoogleGenerativeAI

    llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0)
    retriever = vectorstore.as_retriever(search_type='similarity', search_kwargs={'k': 3})
    chain = RetrievalQA.from_chain_type(llm=llm, chain_type='stuff', retriever=retriever)
    answer = chain.invoke(q)
    return answer
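
Conceptually, a retriever with `k=3` combined with `chain_type='stuff'` picks the three most relevant chunks and places them all into a single prompt. The sketch below illustrates that flow with a toy word-overlap score standing in for vector similarity. `top_k` and `stuff_prompt` are hypothetical helpers, and the prompt wording is an assumption, not LangChain's actual template.

```python
def top_k(question, chunks, k=3):
    """Return the k chunks sharing the most words with the question (toy relevance score)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(q_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def stuff_prompt(question, retrieved):
    """Place every retrieved chunk into one prompt, followed by the question."""
    context = "\n\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}"

chunks = [
    "my third story is about death",
    "stay hungry stay foolish",
    "my second story is about love and loss",
]
question = "what is the third story about"
print(stuff_prompt(question, top_k(question, chunks, k=2)))
```

Since only the retrieved chunks reach the model, an answer can only be as complete as those chunks.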

In [None]:
import time 
i = 1
print("write Quit or Exit to quit: ")
while True:
    q = input(f"Question #{i}: ")
    if q.lower() in ['quit', 'exit']:
        print("Quitting...bye bye")
        time.sleep(2)
        break
    print(f"\nquestion #{i}:", q)
    answer = ask_and_get_answer(vector, q)
    print(f"\nanswer #{i}:", answer['result'])
    print(f"\n{'-' * 40}\n")
    # Count a question only after it has been answered
    i = i + 1


write Quit or Exit to quit: 

question #1: 
 what was the second and third story in the given context.
answer #1 The second story was about love and loss.
The third story was about death.

----------------------------------------


question #2: 
 What is the conclusion and what narrative is telling us 
answer #2 Based on the provided fragments, it's impossible to identify a conclusion or fully understand what narrative is being told.

*   The phrase "very, very clear looking backward 10 years later" suggests a **reflection** on past events or decisions, implying a lesson or realization has been made over time. However, we don't know *what* became clear.
*   "My third story is about death" indicates that the speaker is sharing a series of personal narratives, and this is the introduction to one of them. It's an **introduction to a topic**, not a conclusion to the overall narrative.

Without the preceding "stories" or the context of what became "very, very clear," we cannot determine the

# Asking a chatbot with memory 

In [48]:
def ask_with_memory(vector_store, question, chat_history=None):
    from langchain.chains import ConversationalRetrievalChain
    from langchain_google_genai import ChatGoogleGenerativeAI

    # Start a fresh history when none is passed in
    if chat_history is None:
        chat_history = []

    llm = ChatGoogleGenerativeAI(model='gemini-2.5-flash', temperature=1)
    retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k': 3})

    crc = ConversationalRetrievalChain.from_llm(llm, retriever)
    # invoke() replaces the deprecated direct call crc({...})
    result = crc.invoke({'question': question, 'chat_history': chat_history})
    chat_history.append((question, result['answer']))
    return result, chat_history
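
A Python detail matters for any helper that takes an optional history list. A list written as a default value is created once, when the function is defined, and is then shared by every call that omits the argument. The sketch below shows the difference with two hypothetical functions.

```python
def remember_shared(item, history=[]):
    # The default list is created once and reused across calls.
    history.append(item)
    return history

def remember_fresh(item, history=None):
    # A new list is created on every call that omits `history`.
    history = [] if history is None else history
    history.append(item)
    return history

remember_shared("q1")
print(remember_shared("q2"))  # ['q1', 'q2']
remember_fresh("q1")
print(remember_fresh("q2"))   # ['q2']
```

Using `None` as the default keeps separate conversations from leaking into each other.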

In [52]:
chat_history = []
question1 = "How many words are present in the total documents ?"
result, chat_history = ask_with_memory(vector, question1, chat_history)
print(result['answer'])


There are 57 words in the provided documents.


In [53]:
print(chat_history)

[('How many words are present in the total documents ?', 'There are 57 words in the provided documents.')]


In [55]:
question2 = "multiply the above answer by 2"
result, chat_history = ask_with_memory(vector, question2, chat_history)
print(result['answer'])

Let's count the words in each document:

1.  "together." - 1 word
2.  "very, very clear looking backward 10 years later." - 8 words
3.  "and overflowing with neat tools and great notions." - 9 words

Total number of words = 1 + 8 + 9 = 18 words.

Now, let's perform the calculation:
(18 words * 2) * 2 = 36 * 2 = 72


In [56]:
print(chat_history)

[('How many words are present in the total documents ?', 'There are 57 words in the provided documents.'), ('multiply the number of words in document by 2', "I don't know the answer to that, as the provided context does not contain information to solve mathematical problems."), ('multiply the above answer by 2', 'Let\'s count the words in each document:\n\n1.  "together." - 1 word\n2.  "very, very clear looking backward 10 years later." - 8 words\n3.  "and overflowing with neat tools and great notions." - 9 words\n\nTotal number of words = 1 + 8 + 9 = 18 words.\n\nNow, let\'s perform the calculation:\n(18 words * 2) * 2 = 36 * 2 = 72')]


In [57]:
# Here I asked the chatbot to multiply the document's word count by 2. Instead, it recounted the words in the retrieved chunks (18),
# and because the chat history already contained an earlier "multiply by 2" request, it applied the multiplication twice and answered 72.
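
The effect of earlier turns on a follow-up question is easier to see when the history is written out as text. This is a minimal sketch: `format_history` and `build_followup_prompt` are hypothetical helpers, and the layout is an assumption for illustration, not the template `ConversationalRetrievalChain` actually uses.

```python
def format_history(chat_history):
    """Render (question, answer) pairs as alternating Human/Assistant lines."""
    lines = []
    for question, answer in chat_history:
        lines.append(f"Human: {question}")
        lines.append(f"Assistant: {answer}")
    return "\n".join(lines)

def build_followup_prompt(chat_history, question):
    """Combine the rendered history with the new question."""
    return f"Chat history:\n{format_history(chat_history)}\n\nFollow-up question: {question}"

history = [
    ("How many words are present in the total documents ?",
     "There are 57 words in the provided documents."),
    ("multiply the number of words in document by 2",
     "I don't know the answer to that, as the provided context does not contain "
     "information to solve mathematical problems."),
]
print(build_followup_prompt(history, "multiply the above answer by 2"))
```

Written out this way, the history contains two separate "multiply by 2" requests, which is consistent with the doubled multiplication in the answer.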

# Checking and Debugging

In [None]:
data = load_document('../files/attention_is_all_you_need.pdf')
print(data[0].page_content)
print(data[1].metadata)
print(f"you have {len(data)} pages in your data")
print(f"There are {len(data[14].page_content)} characters on page 15")

# data = load_document('../files/the_great_gatsby.docx')
# print(f"You have {len(data)} number of pages in your data.")

# data = load_document("../files/churchill_speech.txt")
# print(f"The text file contains {len(data)} characters.")

# result = load_wikipedia_document("hunter x hunter", load_max=2)
# print(result[0].metadata)
# print()
# print(result[0].page_content)


In [None]:
chunk = chunk_data(data) 
print("size of chunk: ", len(chunk))
print(chunk[0].page_content)

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=0, length_function=len)
if isinstance(data, bytes):
    data = data.decode('utf-8')
chunks = text_splitter.create_documents([data])

print(len(chunks))

60


In [43]:
print(len(chunk))

191


In [10]:
print_embedding_cost(chunk)

Total Tokens: 9842
Embedding Cost in USD: 0.003937
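
The figure above follows directly from the token count and the per-1K-token price hard-coded in `print_embedding_cost`. Here is a minimal sketch of the arithmetic, where `embedding_cost_usd` is a hypothetical helper and the price is the value from that function, not a current quote.

```python
def embedding_cost_usd(total_tokens, usd_per_1k_tokens=0.0004):
    """Estimated cost: tokens are billed per block of 1,000."""
    return total_tokens / 1000 * usd_per_1k_tokens

print(f"{embedding_cost_usd(9842):.6f}")  # 0.003937
```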


In [None]:
query = "After 17 years later why the person did go to college and naively chose a most expensive college ?"
answer = ask_and_get_answer(vector, query)
print(answer['result'])

The provided text states that the person went to college 17 years later, but it does not explain *why* they went at that specific time.

It also states that they "naively chose a college that was almost as expensive as Stanford," but it does not explain *why* they made that naive choice.


In [74]:
query = "what is ChatGPT ? and which model is the latest ?"
data = load_wikipedia_document('ChatGPT', lang='en')
chunks = chunk_data(data)

In [None]:
index_name = "chatgpt"

In [75]:
print(chunks)

[Document(metadata={'title': 'ChatGPT', 'summary': "ChatGPT is a generative artificial intelligence chatbot developed by OpenAI and released in 2022. It currently uses GPT-5, a generative pre-trained transformer (GPT), to generate text, speech, and images in response to user prompts. It is credited with accelerating the AI boom, an ongoing period of rapid investment in and public attention to the field of artificial intelligence (AI). OpenAI operates the service on a freemium model. ChatGPT's website is among the 5 most-visited websites globally as of 2025.\nBy January 2023, ChatGPT had become the fastest-growing consumer software application in history, gaining over 100 million users in two months. Users can interact with ChatGPT through text, audio, and image prompts. It has been lauded as a revolutionary tool that could transform numerous professional fields. At the same time, its release prompted extensive media coverage and public debate about the nature of creativity and the futu

In [79]:
question = 'what is chatgpt ?'

In [81]:
chat_history = []

In [96]:
chat_history = []
question1 = "How many words are in the paragraph ?"
result, chat_history = ask_with_memory(vectorstore, question1, chat_history)
print(result['answer'])
print(chat_history)

There are 20 words in the provided text.
[('How many words are in the paragraph ?', 'There are 20 words in the provided text.')]


In [97]:
question2 = "multiply that number by 2"
result, chat_history = ask_with_memory(vectorstore, question2, chat_history)
print(result['answer'])
print(chat_history)

I don't know. The provided text does not contain the answer to that question.
[('How many words are in the paragraph ?', 'There are 20 words in the provided text.'), ('multiply that number by 2', "I don't know. The provided text does not contain the answer to that question.")]
