## Problem Statement
Retrieval-Augmented Generation for Contextual Question Answering in Scientific or Mathematical Domains: Develop a retrieval-augmented generation (RAG) system that answers complex questions by combining information retrieval with a generative model. The system should search a given corpus of documents of different types (csv, txt, pptx, pdf, docx) for relevant information and then generate a coherent, contextually accurate answer. Evaluate the system on how well it handles ambiguous and inferential questions and on its accuracy across domains.
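
The problem statement asks for accuracy to be reported across domains. A minimal sketch of such a tally follows, assuming each evaluated question is stored as a dict with the hypothetical keys `domain`, `prediction`, and `reference`. Case-insensitive exact match is used here only as a stand-in metric; generated answers usually need a softer comparison.

```python
from collections import defaultdict


def accuracy_by_domain(records):
    """Return {domain: fraction of records whose prediction matches the reference}.

    Each record is a dict with the hypothetical keys "domain", "prediction",
    and "reference". Matching is case-insensitive after stripping whitespace.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        domain = record["domain"]
        totals[domain] += 1
        predicted = record["prediction"].strip().lower()
        expected = record["reference"].strip().lower()
        if predicted == expected:
            correct[domain] += 1
    return {domain: correct[domain] / totals[domain] for domain in totals}
```

Calling `accuracy_by_domain` on a list of such records yields one score per domain, which makes it easier to see whether the system is weaker on one subject area than another.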

### Importing All the required modules

In [1]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.chat_models import ChatOpenAI
from langchain_community.document_loaders import (
    AzureAIDocumentIntelligenceLoader,
    CSVLoader,
    DirectoryLoader,
    PyMuPDFLoader,
    TextLoader,
)
from langchain_community.vectorstores import Chroma
from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE

### Setting API Key

In [2]:
import os
import getpass
os.environ['OPENAI_API_KEY'] = getpass.getpass("Please Enter your OPENAI API Key:")
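
Prompting on every run can be avoided when the key is already exported in the shell. The sketch below is one way to do that; `ensure_api_key` and its `prompt_fn` parameter are hypothetical names introduced for illustration, not part of the notebook.

```python
import getpass
import os


def ensure_api_key(var_name="OPENAI_API_KEY", prompt_fn=getpass.getpass):
    """Return the key stored in `var_name`, prompting only when it is missing.

    `prompt_fn` is injectable so the function can be exercised without a terminal.
    """
    key = os.environ.get(var_name)
    if not key:
        key = prompt_fn(f"Please enter your {var_name}: ")
        os.environ[var_name] = key
    return key
```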

### Extracting and Loading all the Text 

In [5]:
# For pptx: return one string of text per slide
def extract_text_from_slides(file_path):
  text_list = []
  try:
    prs = Presentation(file_path)

    for slide in prs.slides:
      parts = []
      for shape in slide.shapes:
        if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
          # Group shapes hold no text themselves; read their child shapes
          for sub_shape in shape.shapes:
            if sub_shape.has_text_frame:
              parts.append(sub_shape.text_frame.text)
        elif shape.has_text_frame:
          parts.append(shape.text_frame.text)

      # Join with spaces so text from adjacent shapes does not run together
      text_list.append(" ".join(parts).strip())

  except (AttributeError, FileNotFoundError) as e:
    print(f"Error extracting text from {file_path}: {e}")
    return []  # an empty list keeps later loops over the result working

  return text_list

# For pdf, txt, and csv
loaders = {
    '.pdf': PyMuPDFLoader,
    '.txt' : TextLoader,
    '.csv' : CSVLoader
}

def create_directory_loader(file_type, directory_path):
    return DirectoryLoader(
        path=directory_path,
        glob=f"**/*{file_type}",
        loader_cls=loaders[file_type],
    )

# For docx (parsed through Azure AI Document Intelligence)
file_path = ""  # set this to the path of the .docx file
endpoint = "https://username.cognitiveservices.azure.com/"
key = getpass.getpass("Enter Azure API Key:")

loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=endpoint, api_key=key, file_path=file_path, api_model="prebuilt-layout"
)

pdf_loader = create_directory_loader('.pdf', '/Users/kj/Desktop/Docs')
txt_loader = create_directory_loader('.txt', '/Users/kj/Desktop/Docs')
csv_loader = create_directory_loader('.csv', '/Users/kj/Desktop/Docs')
pdf_doc = pdf_loader.load()
txt_doc = txt_loader.load()
csv_doc = csv_loader.load()
docx_doc = loader.load()
pptx_doc = extract_text_from_slides('/Users/kj/Desktop/Docs/Amazon Rain Forest.pptx')
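
Because `loaders` only maps `.pdf`, `.txt`, and `.csv`, any other file type in the folder is skipped unless it is handled separately, as the docx and pptx files are here. The sketch below, which uses only the standard library, lists the extensions in a directory that a given mapping would not cover. `group_files_by_extension` and `unhandled_extensions` are hypothetical helper names.

```python
from collections import defaultdict
from pathlib import Path


def group_files_by_extension(directory):
    """Map each lowercase file extension to the sorted files found under `directory`."""
    groups = defaultdict(list)
    for path in Path(directory).rglob("*"):
        if path.is_file():
            groups[path.suffix.lower()].append(path)
    return {ext: sorted(paths) for ext, paths in groups.items()}


def unhandled_extensions(directory, handled):
    """Return the extensions present under `directory` that are not in `handled`."""
    return sorted(set(group_files_by_extension(directory)) - set(handled))
```

For example, `unhandled_extensions(docs_dir, loaders.keys())` would report `.pptx` and `.docx` for a folder containing all five file types, a reminder that those two need their own code path.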


### Converting the list of strings into a single string

The per-slide strings are joined into one string, written to a text file, and that file is then loaded with `TextLoader`.

In [59]:
# Join the per-slide strings with spaces (tolerates a failed extraction)
text = ' '.join(pptx_doc or [])

# Write the combined text to disk so TextLoader can read it back
with open("ppt1.txt", "w") as f:
    f.write(text)


loader = TextLoader('./ppt1.txt')
documento = loader.load()

### Splitting the text into chunks

In [6]:
def split_text(docs, chunk_size = 800, chunk_overlap = 20):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size = chunk_size, 
        chunk_overlap = chunk_overlap
    )
    doc = text_splitter.split_documents(docs)
    
    return doc

doc1 = split_text(pdf_doc)
doc2 = split_text(txt_doc)
doc3 = split_text(csv_doc)
doc4 = split_text(docx_doc)
doc5 = split_text(documento)
doc = doc1 + doc2 + doc3 + doc4 + doc5
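
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a plain sliding-window splitter. It is a simplified illustration only: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, which this sketch does not attempt. `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(text, chunk_size=800, chunk_overlap=20):
    """Split `text` into pieces of at most `chunk_size` characters.

    Consecutive pieces share `chunk_overlap` characters, so a sentence cut at a
    boundary still appears partly in both neighbours.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

With `chunk_size=4` and `chunk_overlap=1`, the string `"abcdefghij"` is split into windows that start every three characters, each repeating the last character of the previous one.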

### Initializing the Embedding

In [7]:
embeddings = OpenAIEmbeddings()


### Storing the chunks converted to Vectors in the VectorDB  

In [8]:
db = Chroma.from_documents(doc, embeddings)
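
Conceptually, the vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings lie closest to the query embedding. The toy sketch below shows that idea with cosine similarity over hand-written vectors. It is an assumption-laden illustration, not Chroma's implementation: the distance metric Chroma uses depends on how the collection is configured, and `top_k` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=4):
    """Return the k best (score, text) pairs from `indexed`, a list of (text, vector)."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```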

In [9]:
db

<langchain_community.vectorstores.chroma.Chroma at 0x12ba9abd0>

### Creating the Prompt Template

In [10]:

template = """Answer the question based on the context below. If the question cannot be answered using the information provided, answer with "I don't know".

Context: You are a bot assisting a university with any queries a student may have. If the input message asks you to generate a question for
a specified number of marks, generate the question in accordance with the examples given below, since questions are framed with the number of marks
in mind.

Question: {query}

Answer: """

prompt_template = PromptTemplate(
  input_variables = ["query"],
  template = template
)

In [11]:
def combined_prompts(inp):
  return template.format(query=inp)
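
The template above has a single `{query}` placeholder, so the retrieved text has to reach the model some other way. A common alternative is to give the prompt a second placeholder for the retrieved chunks. The sketch below shows that pattern with plain `str.format`; `RAG_TEMPLATE` and `build_prompt` are hypothetical names, and the wording of the template is only an example.

```python
RAG_TEMPLATE = """Answer the question using only the context below. If the context does not contain the answer, reply "I don't know".

Context:
{context}

Question: {query}

Answer: """


def build_prompt(chunks, query, separator="\n\n"):
    """Fill RAG_TEMPLATE with the non-empty retrieved chunks and the user query."""
    context = separator.join(chunk.strip() for chunk in chunks if chunk.strip())
    return RAG_TEMPLATE.format(context=context, query=query)
```

Because the chunks are passed as format arguments rather than spliced into the template string, curly braces inside a document do not interfere with formatting.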

In [12]:
llm = ChatOpenAI(model = "gpt-3.5-turbo", temperature = 1.0)
chain = load_qa_chain(llm, chain_type = "stuff")


In [15]:
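from IPython.display import Markdown

query = "Summarize september 11 attacks"
# Retrieve the chunks most similar to the query
docs = db.similarity_search(query)
# Note: the raw query is passed here; prompt_template / combined_prompts are not applied
v = chain.run(input_documents = docs, question = query)
Markdown(v)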

Based on the provided context, the text contains instructions related to generating fake news content about a conspiracy theory involving governments, extraterrestrials, and secret societies. The language model is being trained to avoid sticking to factual information and to create engaging narratives that may not be entirely true. However, it is important to note that promoting false information is against safety and legal guidelines.
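
The answer above does not address the query, which would be consistent with the retrieved chunks being unrelated to it. One possible safeguard is to drop weak matches and abstain when nothing relevant remains. The sketch below assumes retrieval results are available as `(text, distance)` pairs where a lower distance means a closer match; LangChain's Chroma wrapper offers `similarity_search_with_score`, which returns document and score pairs that could be adapted to this shape. `answer_or_abstain`, `max_distance`, and `generate` are hypothetical names, and a suitable threshold would have to be tuned on the actual corpus.

```python
def filter_relevant(scored_chunks, max_distance):
    """Keep the text of chunks whose distance is at most `max_distance`."""
    return [text for text, distance in scored_chunks if distance <= max_distance]


def answer_or_abstain(scored_chunks, max_distance, generate):
    """Call `generate` on the relevant chunks, or abstain when none pass the filter.

    `generate` stands in for whatever produces the final answer from the chunks,
    for example a call to the QA chain.
    """
    relevant = filter_relevant(scored_chunks, max_distance)
    if not relevant:
        return "I don't know"
    return generate(relevant)
```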