<a href="https://colab.research.google.com/github/satvik314/ai_experiments/blob/main/QA_on_Docs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

* PDF Reader - pypdf
* VectorDB - FAISS
* Embeddings - OpenAI
* Model - OpenAI

In [None]:
!pip install langchain docx2txt pypdf openai prompts faiss-cpu tiktoken

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tiktoken
  Downloading tiktoken-0.3.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.3.3


In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import LLMChain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.utils import get_from_dict_or_env
from langchain.docstore.document import Document
from langchain.vectorstores import FAISS, VectorStore
from langchain.memory import ConversationBufferMemory
import docx2txt
from typing import List, Dict, Any
import os
import re
import faiss
import numpy as np
from io import StringIO
from io import BytesIO
from pypdf import PdfReader
from openai.error import AuthenticationError

In [None]:
from getpass import getpass

# Read the key without echoing it into the notebook output
os.environ["OPENAI_API_KEY"] = getpass("Paste the key: ")

Paste the key: ··········


In [None]:
from typing import Union

# Read a PDF file (path or binary stream) and return the cleaned text of each page
def pdf_reader(file: Union[str, BytesIO]) -> List[str]:
    pdf = PdfReader(file)
    output = []
    for page in pdf.pages:
        # Fall back to an empty string for pages with no extractable text
        text = page.extract_text() or ""
        # Merge hyphenated words
        text = re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)
        # Fix newlines in the middle of sentences
        text = re.sub(r"(?<!\n\s)\n(?!\s\n)", " ", text.strip())
        # Remove multiple newlines
        text = re.sub(r"\n\s*\n", "\n\n", text)

        output.append(text)

    return output
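
The three substitutions in `pdf_reader` are easier to follow on a concrete string. The sketch below applies the same patterns to a made-up page of text, with no PDF involved; the helper name `clean_page_text` and the sample sentence are illustrative only.

```python
import re


def clean_page_text(text: str) -> str:
    """Apply the same three substitutions as pdf_reader to one page of text."""
    # Merge words hyphenated across a line break: "iner-\ntia" -> "inertia"
    text = re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)
    # Replace a newline with a space unless it belongs to a
    # newline-whitespace-newline run
    text = re.sub(r"(?<!\n\s)\n(?!\s\n)", " ", text.strip())
    # Normalise each remaining newline-whitespace-newline run to a blank line
    text = re.sub(r"\n\s*\n", "\n\n", text)
    return text


sample = "The law of iner-\ntia says an object at rest\nstays at rest.\n \nSecond paragraph."
cleaned = clean_page_text(sample)
print(repr(cleaned))
# 'The law of inertia says an object at rest stays at rest.\n\nSecond paragraph.'
```

One behaviour worth knowing: with these patterns a break made of exactly two consecutive newlines is also turned into spaces (`"a.\n\nB"` becomes `"a.  B"`). Only breaks with a whitespace character between the newlines, or three or more newlines in a row, survive as paragraph breaks.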

In [None]:
from typing import Union

def text_to_docs(text: Union[str, List[str]]) -> List[Document]:
    """Converts a string or list of strings to a list of Documents
    with metadata."""
    if isinstance(text, str):
        # Treat a single string as one page.
        text = [text]
    page_docs = [Document(page_content=page) for page in text]

    # Add page numbers as metadata
    for i, doc in enumerate(page_docs):
        doc.metadata["page"] = i + 1

    # One splitter is enough; it keeps no per-page state
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
        chunk_overlap=0,
    )

    # Split pages into chunks
    doc_chunks = []
    for page_doc in page_docs:
        chunks = text_splitter.split_text(page_doc.page_content)
        for i, chunk in enumerate(chunks):
            # Use a separate name so the page-level Document is not overwritten
            chunk_doc = Document(
                page_content=chunk,
                metadata={"page": page_doc.metadata["page"], "chunk": i},
            )
            # Add the source as metadata, formatted "<page>-<chunk>"
            chunk_doc.metadata["source"] = f"{chunk_doc.metadata['page']}-{chunk_doc.metadata['chunk']}"
            doc_chunks.append(chunk_doc)
    return doc_chunks
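
To make the `page`, `chunk` and `source` metadata concrete, here is a simplified sketch that runs without langchain. It swaps `RecursiveCharacterTextSplitter` for naive fixed-width slicing and uses a hypothetical `Chunk` dataclass in place of `Document`, so it only illustrates how the ids are assigned, not how the real splitter picks its boundaries.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    """Hypothetical stand-in for langchain's Document."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def pages_to_chunks(pages: List[str], chunk_size: int = 800) -> List[Chunk]:
    """Slice each page into fixed-width pieces and tag them '<page>-<chunk>'."""
    chunks = []
    for page_number, page in enumerate(pages, start=1):
        # Naive slicing; the notebook's splitter prefers natural separators
        pieces = [page[i:i + chunk_size] for i in range(0, len(page), chunk_size)]
        for chunk_number, piece in enumerate(pieces):
            metadata = {
                "page": page_number,
                "chunk": chunk_number,
                "source": f"{page_number}-{chunk_number}",
            }
            chunks.append(Chunk(piece, metadata))
    return chunks


chunks = pages_to_chunks(["a" * 25, "b" * 10], chunk_size=10)
sources = [c.metadata["source"] for c in chunks]
print(sources)  # ['1-0', '1-1', '1-2', '2-0']
```

Pages are numbered from 1 and chunks from 0, which matches the `Source: 1-32` style ids in the prompt's worked example.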

In [None]:
def embed_docs(docs: List[Document]) -> VectorStore:
    """Embeds a list of Documents and returns a FAISS index"""
    # Embed the chunks
    embeddings = OpenAIEmbeddings()
    index = FAISS.from_documents(docs, embeddings)
    # Save the vector store to disk
    index.save_local("Embeddings_db")
    return index

In [None]:
# A shorter template reduces the number of tokens in the prompt
query_template = """Create a final answer to the given question using the provided document excerpts (in no particular order) as references, taking the chat history into account.
---------
QUESTION: What is the purpose of ARPA-H?
=========
Content: More support for patients and families. \n\nTo get there, I call on Congress to fund ARPA-H, the Advanced Research Projects Agency for Health. \n\nIt’s based on DARPA—the Defense Department project that led to the Internet, GPS, and so much more.  \n\nARPA-H will have a singular purpose—to drive breakthroughs in cancer, Alzheimer’s, diabetes, and more.
Source: 1-32
Content: While we’re at it, let’s make sure every American can get the health care they need. \n\nWe’ve already made historic investments in health care. \n\nWe’ve made it easier for Americans to get the care they need, when they need it. \n\nWe’ve made it easier for Americans to get the treatments they need, when they need them. \n\nWe’ve made it easier for Americans to get the medications they need, when they need them.
Source: 1-33
Content: The V.A. is pioneering new ways of linking toxic exposures to disease, already helping  veterans get the care they deserve. \n\nWe need to extend that same care to all Americans. \n\nThat’s why I’m calling on Congress to pass legislation that would establish a national registry of toxic exposures, and provide health care and financial assistance to those affected.
Source: 1-30
=========
FINAL ANSWER: The purpose of ARPA-H is to drive breakthroughs in cancer, Alzheimer’s, diabetes, and more.
---------
{chat_history}
QUESTION: {question}
=========
{summaries}
=========
FINAL ANSWER:"""

STUFF_PROMPT = PromptTemplate(
    template=query_template, input_variables=["chat_history","summaries", "question"]
)
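
As a rough illustration of what the model ends up seeing, the sketch below fills a shortened stand-in for `query_template` with plain `str.format`. It assumes the chain renders each retrieved chunk in the same `Content:` / `Source:` layout as the worked example in the template; the helper `format_summaries` and the sample chunks are hypothetical.

```python
# Shortened stand-in for the notebook's query_template (same three placeholders)
template = (
    "{chat_history}\n"
    "QUESTION: {question}\n"
    "=========\n"
    "{summaries}\n"
    "=========\n"
    "FINAL ANSWER:"
)


def format_summaries(chunks):
    """Render (text, source) pairs in the Content/Source layout (assumed)."""
    return "\n".join(f"Content: {text}\nSource: {source}" for text, source in chunks)


retrieved = [
    ("Force equals mass times acceleration.", "3-1"),
    ("Inertia resists changes in motion.", "2-0"),
]
prompt = template.format(
    chat_history="Human: Hi\nAI: Hello",
    question="What is inertia?",
    summaries=format_summaries(retrieved),
)
print(prompt)
```

Every retrieved chunk and every remembered turn is pasted into the prompt verbatim, so both the number of chunks and the length of the chat history count against the model's token limit.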

In [None]:
contents_template = '''Please give the main topics of the following document: {document} in the following format
Main heading
- Subheading 1
- Subheading 2
- Subheading 3
... '''

# Produce topics and subtopics from a single document chunk.
def get_contents(doc: Document) -> Dict[str, str]:
    prompt = PromptTemplate(
        input_variables=["document"],
        template=contents_template,
    )
    llm = OpenAI(temperature=0)
    chain = LLMChain(llm=llm, prompt=prompt)
    # Pass only the chunk's text, not the Document repr with its metadata
    response = chain({"document": doc.page_content}, return_only_outputs=True)
    return response

In [None]:
def search_docs(index: VectorStore, query: str) -> List[Document]:
    """Searches a FAISS index for similar chunks to the query
    and returns a list of Documents."""
    # Search for similar chunks; k is the number of nearest neighbours returned.
    # A larger k also increases the number of tokens passed to the prompt.
    docs = index.similarity_search(query, k=3)
    return docs
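
Conceptually, `similarity_search` embeds the query and returns the `k` stored chunks whose vectors lie closest to it. The sketch below shows that idea in plain Python on made-up two-dimensional vectors. It assumes squared L2 distance as the metric and uses a brute-force scan; the function name `top_k` is illustrative and is not part of FAISS or langchain.

```python
from typing import List, Sequence, Tuple


def top_k(query: Sequence[float], vectors: List[Sequence[float]], k: int = 3) -> List[Tuple[int, float]]:
    """Return (index, squared L2 distance) for the k vectors closest to the query."""
    scored = [
        (i, sum((q - v) ** 2 for q, v in zip(query, vec)))
        for i, vec in enumerate(vectors)
    ]
    # Smaller distance means more similar
    return sorted(scored, key=lambda pair: pair[1])[:k]


vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
nearest = top_k([0.9, 0.1], vectors, k=3)
print([index for index, _ in nearest])  # [1, 0, 2]
```

Real embeddings have many more dimensions, but the ranking step works the same way.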


In [None]:
# Create the memory once so the chat history persists across questions
memory = ConversationBufferMemory(memory_key="chat_history", input_key="question")

def get_answer(docs: List[Document], query: str) -> Dict[str, Any]:
    """Gets an answer to a question from a list of Documents."""
    chain = load_qa_with_sources_chain(
        OpenAI(temperature=0),
        chain_type="stuff",
        memory=memory,
        prompt=STUFF_PROMPT,
    )

    answer = chain(
        {"input_documents": docs, "question": query}, return_only_outputs=True
    )
    return answer
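
The role of the memory object can be pictured with a small stand-in: it stores each question and answer and renders them as text for the `{chat_history}` slot. The class below is a hypothetical simplification, not langchain's implementation, and the `Human:` / `AI:` prefixes are an assumption about the default formatting.

```python
from typing import List, Tuple


class BufferMemory:
    """Hypothetical, simplified stand-in for ConversationBufferMemory."""

    def __init__(self) -> None:
        self.turns: List[Tuple[str, str]] = []

    def save(self, question: str, answer: str) -> None:
        """Record one question/answer exchange."""
        self.turns.append((question, answer))

    def as_text(self) -> str:
        """Render all stored turns as the text placed in {chat_history}."""
        return "\n".join(f"Human: {q}\nAI: {a}" for q, a in self.turns)


chat_memory = BufferMemory()
chat_memory.save("What is inertia?", "Resistance to a change in motion.")
chat_memory.save("Who described it?", "Newton, in his first law.")
print(chat_memory.as_text())
```

A buffer like this only accumulates history if the same object is reused for every question; a freshly created one always renders as an empty string.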

In [None]:
# Not currently in use
def get_sources(answer: Dict[str, Any], docs: List[Document]) -> List[Document]:
    """Gets the source documents for an answer."""

    # Get the source keys listed after "SOURCES: " in the answer
    source_keys = [s.strip() for s in answer["output_text"].split("SOURCES: ")[-1].split(", ")]

    source_docs = []
    for doc in docs:
        if doc.metadata["source"] in source_keys:
            source_docs.append(doc)

    return source_docs
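
The parsing step in `get_sources` relies on the answer ending with a `SOURCES:` line. Since the notebook's prompt does not ask for one, the marker may be missing. The sketch below is one possible defensive variant under that assumption; `parse_source_keys` is a hypothetical helper, not part of langchain.

```python
from typing import List


def parse_source_keys(output_text: str) -> List[str]:
    """Extract source ids from an answer ending in 'SOURCES: a, b'."""
    if "SOURCES:" not in output_text:
        # No marker: report no sources rather than treating the answer as a key
        return []
    tail = output_text.split("SOURCES:")[-1]
    return [key.strip() for key in tail.split(",") if key.strip()]


keys = parse_source_keys("Inertia resists changes in motion.\nSOURCES: 2-0, 3-1")
print(keys)  # ['2-0', '3-1']
```

The returned keys can then be matched against each chunk's `metadata["source"]`, as `get_sources` does.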

In [None]:
def ask_the_bot(pdf='/content/lawsofmotion.pdf'):
    # Read and chunk the PDF once; the chunks feed both the index and the topic list
    text = pdf_reader(pdf)
    processed_txt = text_to_docs(text)
    # Load the vector store from disk if it exists, otherwise build and save it
    if os.path.exists('/content/Embeddings_db'):
        index = FAISS.load_local('/content/Embeddings_db', OpenAIEmbeddings())
    else:
        index = embed_docs(processed_txt)
    print("Welcome to the File Chat Bot. Ask the bot anything about the context. Enter exit to exit the chat.")
    print("Do you want to show the contents of the document?")
    if input("Enter your response: ").lower() == "yes":
        # The topic list is produced from the first chunk only
        print(get_contents(processed_txt[0]))
    while True:
        query = input("Enter your query : ")
        if query.lower() == 'exit':
            break
        sources = search_docs(index, query)
        answer = get_answer(sources, query)
        print("Bot : {}".format(answer['output_text']))




In [None]:
ask_the_bot()

Welcome to the File Chat Bot. Ask the bot anything about the context. Enter exit to exit the chat.
Do you want to show the contents of the document?
Enter your response: yes
{'text': '\n\nMain heading: Motion and its Causes \nSubheading 1: Motion Along a Straight Line \nSubheading 2: Causes of Motion \nSubheading 3: Natural State of an Object'}
Enter your query : What is motion?
Bot :  Motion is the natural tendency of objects to resist a change in their state of rest or of uniform motion. It is measured in terms of position, velocity and acceleration, and can be uniform or non-uniform. The rate of change of momentum of an object is proportional to the applied unbalanced force in the direction of the force.
Enter your query : What is the first question that someone can ask?
Bot :  The first question that someone can ask is: "Which of the following has more inertia: (a) a rubber ball and a stone of the same size? (b) a bicycle and a train? (c) a five - rupees coin and a one-rupee coin?"