## Project (Question Answering on Private Documents) 📃🤔⁉️⁉️

Question Answering on private documents, Creating a chatbot which could give you answers on basis of the question, from the private documents. first, you have to install following libraries. 

<span style="color:red">`conda env create -f environment.yml`</span>

This will install all necessary packages required for Project (Question Answering on private Documents)

To use conda environment, 

you must download the anaconda from this link, <span style="color:blue"> 👉👉 [anaconda](https://www.anaconda.com/)</span>. Go to <span style="color:blue">`Free Download`</span> button, and you have to provide your email and anaconda team will sent you a download link. Whether you are windows user or mac user, download the compatible version of anconda navigator and anaconda prompt. 

If you have any problem while downloading Anaconda Navigator, go to this link, <span style="color:blue"> 👉👉 [troubleshooting](https://www.anaconda.com/docs/reference/troubleshooting#anaconda-distribution-installation-issues)</span>. 

But, if you have already downloaded the Anaconda navigator make sure to delete it. and Download latest version compatible with your device.

Other Required Packages are:
- <span style="color:red">pip install pypdf</span>
- <span style="color:red">pip install doc2txt</span>
- <span style="color:red">pip install --upgrade --quiet  langchain langchain-community azure-ai-documentintelligence</span>
- <span style="color:red">pip install wikipedia -q</span>

If you have not installed langchain, then head over to this link ➡️ [langchain installation](https://python.langchain.com/docs/how_to/installation/) other wise you can simply install through, 

- <span style="color:red">pip install langchain</span>
- <span style="color:red">pip install langchain-google-genai</span>

NOTE: If you have any queries or need to get information on different types of documentation through the langchain you can go to this link. 
👉👉[langchain documentation loader](https://python.langchain.com/docs/integrations/document_loaders/)


In [1]:
from langchain_google_genai import ChatGoogleGenerativeAI

import os 
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)

# os.path.splitext('../files/us_constitution.pdf')

True

## Loading custom (Private) PDF Documents

In [2]:
# Load Private Data 
def load_document(file):
    import os
    name, extension = os.path.splitext(file)

    # what if the extension is pdf
    if extension == ".pdf":
        from langchain.document_loaders import PyPDFLoader
        print(f"Loading PDF file ......{file}")
        loader = PyPDFLoader(file)
        print(f"Done........")
    
    # what if the extension is docx 
    elif extension == ".docx":
        from langchain.document_loaders import Docx2txtLoader
        print(f"Loading txt file.......{file}")
        loader = Docx2txtLoader(file)
        print(f"Done.......")

    # if the extension is txt, i am accessing with simple document opening style
    elif extension == ".txt":
        print("Loading the txt file.....{file}")
        with open(file, 'rb') as f:
            loader = f.read()
            print("Done......")

            # since the loader is in the bytes structure and RecursiveCharacterTextSplitter wants
            # in the string in the further code. so we might need to convert it into the string 
            if isinstance(loader, bytes):
                loader = loader.decode('utf-8')             
                
            return loader
        
    else:
        print("Document format is not supported!")
        return None 

    data = loader.load()
    return data

## there are other types of data also that I must need to use. 

In [3]:
file_name = "../files/sj.txt"
data = load_document(file_name)
print("let's look at whether our code is working : ", data[:100])

Loading the txt file.....{file}
Done......
let's look at whether our code is working :  I am honored to be with you today at your commencement from one of the finest universities in the wo


## Loading Public documents (wikipedia)

In [4]:
# Load public document

# Query is the what thing you want to search and load max = how many iterations of pages you want to show
def load_wikipedia_document(query, lang='en', load_max=2):
    from langchain_community.document_loaders import WikipediaLoader
    docs = WikipediaLoader(query=query, load_max_docs=load_max).load()
    return docs

## Chunking Strategies and Splitting in Documents

In [5]:
# Splitting the paragraphs into smaller chunks. so we can easily store in the vector db 
def chunk_data(data, file_name, chunk_size=256):
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=0)
    
    # If the user send us a file with extension 'txt' then i am not able to chunked it down with just splitting documents. 
    # and if the document consist of bytes then the we have to convert it to string so, 
    ext = os.path.splitext(file_name)
    if '.txt' in ext:
        if isinstance(data, bytes):
            data = data.decode('utf-8')

        chunks = text_splitter.create_documents([data])
        return chunks
    
    chunks = text_splitter.split_documents(data)

    return chunks

# Embedding Cost 
def print_embedding_cost(texts):
    import tiktoken
    enc = tiktoken.encoding_for_model('text-embedding-ada-002')
    total_tokens = sum([len(enc.encode(page.page_content)) for page in texts])
    print(f"Total Tokens: {total_tokens}")
    print(f"Embedding Cost in USD: {total_tokens / 1000 * 0.0004:.6f}")


In [6]:
chunks = chunk_data(data, file_name)
print("example: ", chunks[1].page_content)

example:  from my life. That’s it. No big deal. Just three stories.


# Deleting Pinecone Index 

In [7]:
def delete_pinecone_index(index_name="all"):
    from pinecone import Pinecone, ServerlessSpec

    pc = Pinecone(
        api_key=os.environ.get("PINECONE_API_KEY")
    )

    if index_name == "all":
        indexes = pc.list_indexes()
        print("Deleting all indexes...")
        for index in indexes:
            pc.delete_index(index.name)
            print(f"completed deleting index {index.name}")
    else:
        print(f"Deleting index {index_name}")
        pc.delete_index(index_name)
        print("Done...")

In [8]:
print(f"Deleting the indexes if it exist in the pinecone .")
delete_pinecone_index()

Deleting the indexes if it exist in the pinecone .
Deleting all indexes...
completed deleting index askadocument


# Initializing a pinecone api 

In [9]:
from pinecone import Pinecone

pc = Pinecone(
        api=os.environ.get("PINECONE_API_KEY")
    )

# Creating a New index 

In [10]:
def creating_new_index(index_name):
    from pinecone import ServerlessSpec
    from langchain_pinecone import PineconeVectorStore
    from langchain_google_genai import GoogleGenerativeAIEmbeddings

    if index_name not in pc.list_indexes():
        # if we could not find the index-name in the pinecone we have to create a new one
        print(f"Creating an index name...........{index_name}")
        pc.create_index(
            index_name,
            dimension=3072,
            metric='cosine',
            spec=ServerlessSpec(
                cloud='aws',
                region='us-east-1'
            )
        )
        print("Done creating Index..")

    else:
        print(f"Index {index_name} already exists.....", ends='')
    

In [11]:
index_name = "askadocument"
creating_new_index(index_name)


For example, replace imports like: `from langchain_core.pydantic_v1 import BaseModel`
with: `from pydantic import BaseModel`
or the v1 compatibility namespace if you are working in a code base that has not been fully upgraded to pydantic 2 yet. 	from pydantic.v1 import BaseModel

  from langchain_pinecone.vectorstores import Pinecone, PineconeVectorStore


Creating an index name...........askadocument
Done creating Index..


# Fetching a Index if index_name exists 

In [12]:
def fetching_index(index_name):
    from langchain_pinecone import PineconeVectorStore

    if index_name in pc.list_indexes():
        print(f"Index {index_name} already exists. Loading embeddings.....", ends='')
        vector_store = Pinecone.from_existing_index(index_name, embeddings)
        print("Done ")

    return vector_store
        

# Connecting with Index in Pinecone

In [13]:
print(f"Connecting with index....{index_name}")
index = pc.Index(index_name)
print("Completed....")
print(index.describe_index_stats())

Connecting with index....askadocument
Completed....
{'dimension': 3072,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'dense'}


# Storing the chunks in the vectors 

In [14]:
# $if you have created new index and you are storing the document in to the vector then only use this function 
def embedding_and_storing(index_name, chunks):

    from langchain_pinecone import PineconeVectorStore
    from langchain_google_genai import GoogleGenerativeAIEmbeddings

    embeddings = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")

    print("only applicable if you have created a new Index....")
    vectorstore = PineconeVectorStore.from_documents(
        documents=chunks,
        embedding=embeddings,
        index_name=index_name
    )
    print("Documents successfully uploaded to pinecone!!")

    return vectorstore

In [15]:
vector = embedding_and_storing(index_name, chunks)
print(vector)

only applicable if you have created a new Index....
Documents successfully uploaded to pinecone!!
<langchain_pinecone.vectorstores.PineconeVectorStore object at 0x000001C0FF2F0910>


# Summarization

In [31]:
def summarizing(docs):

    from langchain import PromptTemplate
    from langchain_google_genai import ChatGoogleGenerativeAI
    from langchain.chains.summarize import load_summarize_chain

    max_prompt="Write a short and concise summary of the following " \
    "Text: {text}" \
    "consise summary: "

    combine_prompt="Write a concise summary of the following text that covers the key points. " \
    "Add a title to the summary" \
    "Start your summary with an `Introduction Paragraph` that gives an overview of" \
    "the topic followed by `BULLET POINTS` if possible and the summary with " \
    "CONCLUSION PHRASE:" \
    "Text: {text}" 

    combine_prompt_template = PromptTemplate(
        template=combine_prompt, input_variables=['text']
    )   

    max_prompt_template = PromptTemplate(
        input_variables=["text"],
        template=max_prompt
    )


    llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=1)

    chain = load_summarize_chain(
        llm,
        chain_type="map_reduce",
        map_prompt=max_prompt_template,
        combine_prompt=combine_prompt_template,
        verbose=False
    )

    output = chain.invoke(docs)

    return output['output_text']

# Chatting a chatbot 

In [None]:
def ask_and_get_answer(vectorstore, q):
    from langchain.chains import RetrievalQA
    from langchain_google_genai import ChatGoogleGenerativeAI

    from langchain import PromptTemplate
    from langchain.chains.summarize import load_summarize_chain

    llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0)

    if "summarize" in q.lower():
        retriever = vectorstore.as_retriever(search_type='similarity', search_kwargs={'k': 10})
        docs = retriever.invoke("summarize all content")
        answer = summarizing(docs)
        return answer

    retriever = vectorstore.as_retriever(search_type='similarity', search_kwargs={'k': 3})
    chain = RetrievalQA.from_chain_type(llm=llm, chain_type='stuff', retriever=retriever)
    answer = chain.invoke(q)
    return answer

In [34]:
import time 
i = 0
print("write Quit or Exit to quit: ")
while True:
    q = input(f"Question #{i}")
    i = i + 1
    if q.lower() in ['quit', 'exit']:
        print("Quitting...bye bye")
        time.sleep(2)
        break
    print(f"\nquestion #{i}:", q)
    answer = ask_and_get_answer(vector, q)
    if "summarize" in q.lower():
        print(f"\nanswer #{i}:", answer)
    else:
        print(f"\nanswer #{i}:", answer['result'])
    print(f"\n{'-' * 40}\n")


write Quit or Exit to quit: 

question #1: please summarize the given docs

answer #1: Life's Essential Tools and Lessons

This text presents a profound exploration of life's most crucial lessons and tools, primarily focusing on how the awareness of one's mortality profoundly influences major life choices and personal growth. Drawing from deeply personal experiences, the author offers guidance on navigating setbacks, embracing creativity, and maintaining a purposeful existence.

*   Remembering one's mortality is highlighted as the most crucial tool for making significant life choices, as it effectively strips away external expectations, pride, and fear.
*   The author illustrates these insights through three straightforward personal stories, including an experience of devastating loss of focus, a subsequent period of immense creativity, and a direct engagement with the concept of death.
*   A recent diagnosis of an almost certainly incurable pancreatic tumor provides a poignant and ur

# Asking a chatbot with memory 

In [None]:
def ask_with_memory(vector_store, question, chat_history=[]):
    from langchain.chains import ConversationalRetrievalChain
    from langchain_google_genai import ChatGoogleGenerativeAI

    llm = ChatGoogleGenerativeAI(model='gemini-2.5-flash' , temperature=1)

    if "summarize" in question.lower():
        retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k': 10})
        docs = retriever.invoke("summarize all content")
        answer = summarizing(docs)
        return answer
    
    retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k': 3})

    crc = ConversationalRetrievalChain.from_llm(llm, retriever)
    result = crc({'question': question, 'chat_history': chat_history})
    chat_history.append((question, result['answer']))
    return result, chat_history

In [47]:
chat_history = []
question1 = "How many words are present in the total documents ?"
result, chat_history = ask_with_memory(vector, question1, chat_history)
print(result['answer'])

There are 58 words in the provided text.


In [20]:
print(chat_history)

[('How many words are present in the total documents ?', 'There are 57 words in the provided documents.')]


In [46]:
question2 = "please summarize the given docs"
result, chat_history = ask_with_memory(vector, question2, chat_history)
print(result['answer'])

TypeError: string indices must be integers, not 'str'

In [22]:
print(chat_history)

[('How many words are present in the total documents ?', 'There are 57 words in the provided documents.'), ('since, you have given me number of words present in provided document, multiply that total number by 2', "I don't know the answer, as the provided context does not contain information to answer this question.")]


In [23]:
# In here i have told the chat to multiply the number of document, previously given, it said 63, and simply i told chat to multiply that by2, but chatbot 
# was confused, so i clearly mentioned, "you have given me number of word present in document and multiply by 2" and it assume 
# gemini should calculate once again, and now, it calculate 61, and multiplied by 2 became 122

# Checking and Debbugging

In [24]:
data = load_document('../files/attention_is_all_you_need.pdf')
# print(data[0].page_content)
# print(data[1].metadata)
print(f"you have {len(data)} pages in your data")
print(f"There are {len(data[14].page_content)} characters in the page no 19")

# data = load_document('../files/the_great_gatsby.docx')
# print(f"You have {len(data)} number of pages in your data.")

# data = load_document("../files/churchill_speech.txt")
# print(f"You have number of pages inside {len(data)} and appoximately.")

# result = load_wikipedia_document("hunter x hunter", 2)
# print(result[0].metadata)
# print()
# print(result[0].page_content)


Loading PDF file ......../files/attention_is_all_you_need.pdf
Done........
you have 15 pages in your data
There are 818 characters in the page no 19


In [25]:
query = "After 17 years later why the person did go to college and naively chose a most expensive college ?"
answer = ask_and_get_answer(vector, query)
print(answer['result'])

The provided text states that the person went to college 17 years later, but it does not explain *why* they went at that specific time.

It also states that they "naively chose a college that was almost as expensive as Stanford," but it does not explain *why* they made that naive choice.
