## **Question Answering Over a Directory**

Question Answering refers to the process of extracting relevant information from a given set of documents in order to provide accurate and concise answers to user queries.

Install the necessary libraries using the following commands

In [None]:
!pip install langchain
!pip install cohere
!pip install chromadb
!pip install pypdf
!pip install PyPDF2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langchain
  Downloading langchain-0.0.209-py3-none-any.whl (1.1 MB)
Collecting aiohttp<4.0.0,>=3.8.3 (from langchain)
  Downloading aiohttp-3.8.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting async-timeout<5.0.0,>=4.0.0 (from langchain)
  Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)
Collecting dataclasses-json<0.6.0,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.5.8-py3-none-any.whl (26 kB)
Collecting langchainplus-sdk>=0.0.13 (from langchain)
  Downloading langchainplus_sdk-0.0.16-py3-none-any.whl (24 kB)
Collecting openapi-schema-pydantic<2.0,>=1.2 (from langchain)
  Downloading o

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import getpass

password = getpass.getpass("Enter your API Key: ")

Import the necessary libraries

LangChain documentation: https://python.langchain.com/docs/use_cases/question_answering/

Cohere: https://cohere.com/

ChromaDB: https://www.trychroma.com/

In [None]:
# Importing the CharacterTextSplitter class from the langchain.text_splitter module
from langchain.text_splitter import CharacterTextSplitter

# Importing the TextLoader and DirectoryLoader classes from the langchain.document_loaders module
from langchain.document_loaders import TextLoader, DirectoryLoader
# Importing the CohereEmbeddings class from the langchain.embeddings module
from langchain.embeddings import CohereEmbeddings

# Importing the Chroma class from the langchain.vectorstores module
from langchain.vectorstores import Chroma
import os
import PyPDF2


Multiple files: the directory can contain a combination of PDF files and text files, so we convert the PDF files to TXT files and then load everything with the DirectoryLoader.


In [None]:
# Function to convert a PDF file to a TXT file inside output_folder
def pdf_to_txt(pdf_path, output_folder):
    # Open the PDF file
    with open(pdf_path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)

        # Create the output file path
        output_filename = os.path.splitext(os.path.basename(pdf_path))[0] + ".txt"
        output_path = os.path.join(output_folder, output_filename)

        # Extract text from each page and write it to the output file
        with open(output_path, "w", encoding="utf-8") as txt_file:
            for page in pdf_reader.pages:
                # Pages without extractable text (e.g. scanned images) yield nothing
                text = page.extract_text() or ""
                # End each page with a newline so words from adjacent pages do not fuse
                txt_file.write(text + "\n")

In [None]:
!pip install SpeechRecognition

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting SpeechRecognition
  Downloading SpeechRecognition-3.10.0-py2.py3-none-any.whl (32.8 MB)
Installing collected packages: SpeechRecognition
Successfully installed SpeechRecognition-3.10.0


In [None]:
import speech_recognition as sr

def audio_to_text(audio_file):
    # Create a recognizer object
    recognizer = sr.Recognizer()

    # Load the audio file
    with sr.AudioFile(audio_file) as source:
        # Read the audio data from the file
        audio_data = recognizer.record(source)

        # Perform speech recognition
        text = recognizer.recognize_google(audio_data)

    # Return the recognized text
    return text

In [None]:
import os
dir_path = "/content/drive/MyDrive/QA_LLM_Internship"
files = os.listdir(dir_path)

for name in files:
    file_path = os.path.join(dir_path, name)
    if os.path.isfile(file_path):
        file_extension = os.path.splitext(name)[1].lower()
        if file_extension == ".pdf":
            # Convert each PDF to a TXT file in the same directory
            pdf_to_txt(file_path, dir_path)
        elif file_extension == ".wav":
            # Transcribe the audio and save the transcript as AudioFile.txt
            result = audio_to_text(file_path)
            output_file_path = os.path.join(dir_path, "AudioFile.txt")
            print(output_file_path)
            with open(output_file_path, 'w') as txt_file:
                txt_file.write(result)

/content/drive/MyDrive/QA_LLM_Internship/AudioFile.txt
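
The loop above writes every transcript to the same `AudioFile.txt`, so a directory with several `.wav` files would keep only the last transcript. One possible way around this is to derive the output name from each audio file. The sketch below is an assumption for illustration, not part of the original notebook; the helper name `transcript_path` and the `_audio` suffix are hypothetical.

```python
from pathlib import Path


def transcript_path(audio_path, output_dir):
    """Return a per-file transcript path such as <output_dir>/<stem>_audio.txt.

    The "_audio" suffix (a hypothetical convention) keeps the transcript from
    colliding with a TXT file produced from a PDF that has the same base name.
    """
    stem = Path(audio_path).stem
    return Path(output_dir) / f"{stem}_audio.txt"


# Two different recordings map to two different transcript files
first = transcript_path("/data/meeting.wav", "/data")
second = transcript_path("/data/interview.wav", "/data")
print(first.name, second.name)
```

In the loop, `output_file_path` could then be set with `transcript_path(file_path, dir_path)` instead of the fixed file name.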


In [None]:
loader = DirectoryLoader('/content/drive/MyDrive/QA_LLM_Internship', glob="./*.txt", loader_cls=TextLoader)
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0) #Splitting the text and creating chunks
docs = text_splitter.split_documents(documents)
embeddings = CohereEmbeddings(cohere_api_key=password) #Creating Cohere Embeddings
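
To make the chunking step more concrete, here is a simplified, standalone sketch of separator-based splitting with a size cap. It is only an illustration of the idea and not LangChain's implementation; the function name `split_into_chunks` and the greedy merging strategy are assumptions made for this example.

```python
def split_into_chunks(text, chunk_size=1000, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole rather than cut mid-piece.
    """
    chunks = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece still fits, or it is the first piece of a new chunk
            current = candidate
        else:
            # The piece does not fit: close the current chunk and start a new one
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "alpha\n\nbeta\n\ngamma"
print(split_into_chunks(sample, chunk_size=11))
```

With `chunk_overlap=0`, as in the cell above, neighbouring chunks share no text, so a sentence that straddles a chunk boundary can be split between two chunks.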

In [None]:
db = Chroma.from_documents(docs, embeddings) #Storing the embeddings in the vector database

Query From Text File

In [None]:
query = "What did the president say about Ketanji Brown Jackson"
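docs = db.similarity_search(query) # Searching the vector database for the chunks whose embeddings are closest to the query embedding
# The distance metric depends on how the Chroma collection is configured; cosine similarity is one common choice.
# Cosine Similarity- https://www.machinelearningplus.com/nlp/cosine-similarity/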
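
Cosine similarity, referenced in the comment above, compares the direction of two embedding vectors and ignores their length. The following is a minimal pure-Python sketch; the two-dimensional vectors and the document names are made up for illustration, whereas real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy "embeddings": doc_a points the same way as the query, doc_b is orthogonal
query_vec = [1.0, 0.0]
doc_vecs = {"doc_a": [2.0, 0.0], "doc_b": [0.0, 3.0], "doc_c": [1.0, 1.0]}

ranked = sorted(
    doc_vecs,
    key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
    reverse=True,
)
print(ranked)
```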

In [None]:
print(docs[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.


Query From PDF File

In [None]:
query = "What was the Russian Ukraine War previously referred to as?"
docs = db.similarity_search(query)

In [None]:
print(docs[0].page_content)

The Russo-Ukrainian War,[e] previously referred to as the Ukrainian crisis in its early stages,[4] is an ong
oing international conflict between Russia, alongside Russian-backed separatists, and Ukraine, which beg
an in February 2014.[f] Following Ukraine's Revolution of Dignity, Russia annexed Crimea from Ukraine a
nd supported pro-Russian separatists fighting the Ukrainian military in the Donbas war. The first eight yea
rs of conflict also included naval incidents, cyberwarfare, and heightened political tensions. In February 20
22, Russia launched a full-scale invasion of Ukraine.  
  
In early 2014, the Euromaidan protests led to the Revolution of Dignity and the ousting of Ukraine's pro-R
ussian president Viktor Yanukovych. Shortly after, pro-Russian unrest erupted in eastern and southern Uk
raine. Simultaneously, unmarked Russian troops moved into Ukraine's Crimea and took over government 
buildings, strategic sites and infrastructure. Russia soon annexed Crimea after a highly-dis

Query From Audio File

In [None]:
query = "What is today?"
docs = db.similarity_search(query)

In [None]:
print(docs[0].page_content)

hello today is a good day


In [None]:
retriever = db.as_retriever(search_type="mmr")

In [None]:
retriever.get_relevant_documents(query)[0]

Document(page_content='hello today is a good day', metadata={'source': '/content/drive/MyDrive/QA_LLM_Internship/AudioFile.txt'})
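
The `search_type="mmr"` option selects maximal marginal relevance, which balances relevance to the query against redundancy among the documents already chosen. The sketch below illustrates the idea in pure Python under stated assumptions: the toy vectors are invented, the helper names are hypothetical, and the code is not the retriever's actual implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Pick k document indices, trading off relevance against redundancy.

    lambda_mult=1.0 ranks purely by relevance; lower values favour diversity.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for idx in candidates:
            relevance = cosine(query_vec, doc_vecs[idx])
            # Similarity to the closest document that was already picked
            redundancy = max(
                (cosine(doc_vecs[idx], doc_vecs[s]) for s in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = idx, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


query_vec = [1.0, 0.0]
# Documents 0 and 1 are near duplicates; document 2 is less relevant but different
doc_vecs = [[1.0, 0.1], [1.0, 0.12], [0.8, -0.6]]

print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5))
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0))
```

With these toy vectors, the balanced setting skips the near duplicate in favour of the more distinct document, while the relevance-only setting returns both near duplicates.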

Other Ways to Reduce Hallucinations

1- Prompt Engineering Methods

a) Request for Evidence

b) Set Boundaries: Ask the model to answer from the text only and, if that is not possible, to return a particular answer

c) Ask the model to describe the question in detail before answering

d) Step by Step Reasoning
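
The four prompt engineering methods above can be combined in a single prompt template. The following is one possible sketch, not a prescribed format: the function name `build_grounded_prompt`, the exact wording of the instructions, and the fallback answer are all assumptions chosen for illustration.

```python
def build_grounded_prompt(context, question, fallback="I don't know."):
    """Build a prompt that keeps the model's answer tied to the retrieved context."""
    return (
        # b) Set boundaries: answer from the text only
        "Answer the question using only the context below.\n"
        f'If the context does not contain the answer, reply exactly: "{fallback}"\n'
        # c) Describe the question before answering
        "First restate the question in your own words.\n"
        # d) Step by step reasoning
        "Then reason step by step before giving the final answer.\n"
        # a) Request for evidence
        "Quote the sentence from the context that supports your answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_grounded_prompt(
    context="hello today is a good day",
    question="What is today?",
)
print(prompt)
```

The `context` argument would typically be the `page_content` of the retrieved chunks, and the resulting string would be sent to whichever language model is used for answering.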
