<h1 style="text-align: center;">MCQ Creator App</h1>

## Table of Contents
      ##### Install & Import Dependencies
      ##### Load Documents
      ##### Transform Documents
      ##### Generate Text Embeddings
      ##### Vector store - PINECONE
      ##### Retrieve Answers
      ##### Structure the Output

## Install Libraries

In [1]:
# ! pip install langchain
# ! pip install unstructured
# ! pip install tiktoken
# ! pip install pinecone-client
# ! pip install pypdf
# ! pip install sentence-transformers
# ! pip install langchain-community
# ! pip install langchain-huggingface

## Import Dependencies

In [2]:
import pinecone
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Pinecone 
from langchain.llms import HuggingFaceEndpoint
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

<font color='green'>
The code below checks that the environment variable holding the Hugging Face Hub API token is available.</font>

In [3]:
import os
# The token must already be set in the environment; os.getenv returns None when it is missing
assert os.getenv("HUGGINGFACEHUB_API_TOKEN"), "HUGGINGFACEHUB_API_TOKEN is not set"

## Load Documents

<font color='green'>
Load the PDF files available in a directory with pypdf.</font>

In [4]:
#Function to read documents
def load_docs(directory):
  loader = PyPDFDirectoryLoader(directory)
  documents = loader.load()
  return documents

In [5]:
# Passing the directory to the 'load_docs' function
directory = 'Docs/'
documents = load_docs(directory)
len(documents)

3

## Transform Documents

<font color='green'>
Split the documents into smaller chunks.</font>

In [6]:
#This function will split the documents into chunks
def split_docs(documents, chunk_size=1000, chunk_overlap=20):
  text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
  docs = text_splitter.split_documents(documents)
  return docs

In [7]:
docs = split_docs(documents)
# print(len(docs))
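
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sketch of fixed-size chunking with overlap in plain Python. It is only an illustration: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line breaks, so its chunks will differ. The function name `sliding_chunks` and the sample string are hypothetical.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=20):
    """Cut `text` into pieces of at most `chunk_size` characters.

    Consecutive pieces share `chunk_overlap` characters, so text that
    straddles a boundary appears in both neighbouring chunks.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
```

Each chunk starts with the last three characters of the previous one, which is the role `chunk_overlap` plays in the cell above.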

## Generate Text Embeddings

<font color='green'>
Use a Hugging Face sentence-transformers model to create embeddings for the document text.</font>

In [8]:
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

<font color='green'>
Let's test our embeddings model on a sample text.</font>

In [9]:
query_result = embeddings.embed_query("Hello Buddy")
# len(query_result)

In [None]:
query_result

## Vector store - PINECONE

<font color='green'>
Pinecone lets you upload data into a vector database and run semantic search over it.<br><br> Conversational data is not only highly unstructured but can also be complex. Vector search and vector databases allow for similarity searches over such data.</font>

<font color='green'>
We will initialize Pinecone and populate a Pinecone index by passing our documents, the embeddings model, and the name of the index to use.

Vector databases are designed to handle the unique structure of vector embeddings, which are dense vectors of numbers that represent text. Embeddings are used in machine learning to capture the semantic meaning of words. <br><br>These databases index vectors for easy search and retrieval by comparing values and finding those that are most similar to one another, making them ideal for natural language processing and AI-driven applications.
</font>
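
The idea of finding the most similar vectors can be sketched without any database. The example below ranks a few toy 3-dimensional vectors by cosine similarity to a query vector. The vectors, the texts and the helper names (`cosine_similarity`, `top_k`) are made up for illustration. Real embeddings from `all-MiniLM-L6-v2` have 384 dimensions, and a vector database such as Pinecone uses an index rather than this brute-force scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k (text, score) pairs most similar to `query_vec`."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings" (hypothetical values, not produced by a real model)
store = {
    "rupee and banking": [0.9, 0.1, 0.0],
    "beaches of Goa": [0.1, 0.9, 0.2],
    "coins and notes": [0.8, 0.2, 0.1],
}
results = top_k([1.0, 0.0, 0.0], store, k=2)
print([text for text, _ in results])
```

The two currency-related entries rank highest because their vectors point in nearly the same direction as the query vector.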

In [11]:
import os
# Read the Pinecone API key from the environment (it must be set before running this cell)
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")

In [12]:

from pinecone import Pinecone as PineconeClient  # Pinecone client class from the pinecone package
from langchain_community.vectorstores import Pinecone  # LangChain vector store wrapper

# Initialize the Pinecone client
pc = PineconeClient(api_key=PINECONE_API_KEY, environment="gcp-starter")
index_name = "mcq-create"
# Embed the chunks and store them in the Pinecone index named above
index = Pinecone.from_documents(docs, embeddings, index_name=index_name)


## Retrieve Answers

In [13]:
# This function fetches the top-k most relevant documents from our vector store (Pinecone)
def get_similiar_docs(query, k=2):
    similar_docs = index.similarity_search(query, k=k)
    return similar_docs

<font color='green'>
'load_qa_chain' loads a chain that you can use to do question answering over a set of documents.<br>
    We will use a Hugging Face model as the LLM that produces the answers.
</font>

In [14]:
from langchain.chains.question_answering import load_qa_chain


<font color='green'>
BigScience Large Open-science Open-access Multilingual Language Model (BLOOM) is a transformer-based large language model.<br> <br>It was created by over 1000 AI researchers to provide a free large language model for anyone who wants to try one. Trained on around 366 billion tokens from March through July 2022, and with 176 billion parameters, it is considered an alternative to OpenAI's GPT-3. Note that the cell below uses Mistral-7B-Instruct-v0.2 rather than BLOOM.
</font>

In [None]:
# The 'HuggingFaceHub' class has been deprecated, so use the 'HuggingFaceEndpoint' class below.
# The model below was chosen because it performs well among open-source LLMs; see the model card for details.

llm = HuggingFaceEndpoint(repo_id="mistralai/Mistral-7B-Instruct-v0.2") # Model link : https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
llm

<font color='green'>
Different types of chain_type:<br><br>
"stuff": It places all the retrieved documents into a single prompt and asks the question once. This is the type used in the cell below.<br>
"map_reduce": It divides the texts into batches, processes each batch separately with the question, and combines the answers to provide the final answer.<br>
"refine": It divides the texts into batches and refines the answer by sequentially processing each batch with the previous answer.<br>
"map_rerank": It divides the texts into batches, scores the answer the LLM gives for each batch, and selects the highest-scoring answer as the final answer.<br><br>
The batch-wise alternatives help handle token limitations and improve the effectiveness of the question-answering process.
</font>
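
The `"stuff"` chain type used in the next cell places all documents in a single prompt, whereas the batch-wise types call the model several times. The difference can be sketched with a stand-in for the model. In the sketch below, `fake_llm` is a hypothetical placeholder that only reports how many characters it received, and `stuff` and `map_reduce` are simplified illustrations, not LangChain's implementations.

```python
def fake_llm(prompt):
    """Hypothetical stand-in for a language model call."""
    return f"<answer from {len(prompt)} chars>"


def stuff(docs, question, llm):
    """One call: all documents are placed into a single prompt."""
    context = "\n\n".join(docs)
    return llm(f"{context}\n\nQuestion: {question}")


def map_reduce(docs, question, llm):
    """One call per document, then one more call to combine the partial answers."""
    partial = [llm(f"{doc}\n\nQuestion: {question}") for doc in docs]
    combined = "\n".join(partial)
    return llm(f"{combined}\n\nCombine these into one answer to: {question}")


calls = []


def counting_llm(prompt):
    """Wrap fake_llm so we can count how many model calls each strategy makes."""
    calls.append(prompt)
    return fake_llm(prompt)


docs = ["chunk one", "chunk two", "chunk three"]
stuff(docs, "What is the currency?", counting_llm)
stuff_calls = len(calls)
calls.clear()
map_reduce(docs, "What is the currency?", counting_llm)
map_reduce_calls = len(calls)
print(stuff_calls, map_reduce_calls)
```

With three chunks, the single-prompt strategy makes one call while the map-reduce sketch makes four. The trade-off is that the single prompt must fit within the model's context window.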

In [16]:
chain = load_qa_chain(llm, chain_type="stuff")

In [17]:
#This function will help us get the answer to the question that we raise
def get_answer(query):
  relevant_docs = get_similiar_docs(query)
  print(relevant_docs)
  response = chain.invoke(input={"input_documents": relevant_docs, "question": query})
  return response

<font color='green'>
Let's pass our question to the function created above.
</font>

In [None]:
our_query = "How is India's currency?"
answer = get_answer(our_query)
print(answer)

## Structure the Output

In [19]:
import re
import json

In [20]:
from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate
from langchain.output_parsers import StructuredOutputParser, ResponseSchema

In [21]:
response_schemas = [
    ResponseSchema(name="question", description="Question generated from provided input text data."),
    ResponseSchema(name="choices", description="Available options for a multiple-choice question in comma separated."),
    ResponseSchema(name="answer", description="Correct answer for the asked question.")
]

output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
output_parser

StructuredOutputParser(response_schemas=[ResponseSchema(name='question', description='Question generated from provided input text data.', type='string'), ResponseSchema(name='choices', description='Available options for a multiple-choice question in comma separated.', type='string'), ResponseSchema(name='answer', description='Correct answer for the asked question.', type='string')])

In [22]:
# Fetch the instructions that LangChain generates to make the model respond in the desired format
format_instructions = output_parser.get_format_instructions()
 
print(format_instructions)

The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"question": string  // Question generated from provided input text data.
	"choices": string  // Available options for a multiple-choice question in comma separated.
	"answer": string  // Correct answer for the asked question.
}
```


In [None]:
# Create the LLM object that will generate the MCQs (Mistral-7B-Instruct-v0.3 through a Hugging Face endpoint)
chat_model = HuggingFaceEndpoint(
    repo_id="mistralai/Mistral-7B-Instruct-v0.3"
)

In [24]:
chat_model

HuggingFaceEndpoint(repo_id='mistralai/Mistral-7B-Instruct-v0.3', model='mistralai/Mistral-7B-Instruct-v0.3', client=<InferenceClient(model='mistralai/Mistral-7B-Instruct-v0.3', timeout=120)>, async_client=<InferenceClient(model='mistralai/Mistral-7B-Instruct-v0.3', timeout=120)>)

<font color='green'>
The format instructions fetched above are a string describing how the response should be formatted. The snippet below inserts that string into our prompt.
</font>

In [25]:
prompt = ChatPromptTemplate(
    messages=[
        HumanMessagePromptTemplate.from_template("""When a text input is given by the user, please generate multiple choice questions 
        from it along with the correct answer. 
        \n{format_instructions}\n{user_prompt}""")  
    ],
    input_variables=["user_prompt"],
    partial_variables={"format_instructions": format_instructions}
)

In [None]:
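# 'answer' is the dictionary returned by the QA chain; it is converted to a string when inserted into the prompt
final_query = prompt.format_prompt(user_prompt=answer)
print(final_query)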

In [27]:
msg=final_query.to_messages()

In [28]:
final_query_output = chat_model.invoke(msg)
print(final_query_output)



```json
[
	{
		"question": "How is India's currency?",
		"choices": "Indian paisa, Japanese yen, Chinese yuan, Indian rupee (₹)",
		"answer": "Indian rupee (₹)"
	}
]
```

```json
[
	{
		"question": "Which landmarks in India are popular tourist destinations?",
		"choices": "Taj Mahal, the Great Wall of China, Statue of Liberty, Jaipur's palaces, Kerala's backwaters, the beaches of Goa",
		"answer": "Taj Mahal, Jaipur's palaces, Kerala's backwaters, the beaches of Goa"
	}
]
```

```json
[
	{
		"question": "What is the currency symbol for the Indian rupee?",
		"choices": "₹, $ , ¥, ₩",
		"answer": "₹"
	}
]
```

```json
[
	{
		"question": "Which institution controls the issuance of Indian currency?",
		"choices": "Reserve Bank of India, World Bank, International Monetary Fund, Bank of England",
		"answer": "Reserve Bank of India"
	}
]
```

```json
[
	{
		"question": "What is India famous for?",
		"choices": "Mountains, deserts, rivers, cuisine, and architecture",
		"answer": "Indian cuis

<font color='green'>
When the text to process spans multiple lines (separated by newline characters, '\n'), as in the output above, we use re.DOTALL so that '.' in the pattern also matches newlines.
</font>
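
The effect of the flag can be seen in a small sketch. It uses a made-up two-field string rather than the real model output, and escapes the braces in the pattern for clarity.

```python
import re

# A made-up multi-line snippet standing in for the model output
text = '{\n  "question": "Q1",\n  "answer": "A1"\n}'

# Without re.DOTALL, '.' stops at newlines, so the pattern cannot reach the closing brace
without_flag = re.search(r"\{(.*?)\}", text)

# With re.DOTALL, '.' also matches '\n' and the pattern can span several lines
with_flag = re.search(r"\{(.*?)\}", text, re.DOTALL)

print(without_flag)
print(with_flag.group(1))
```

The first search finds nothing, while the second captures everything between the braces, newlines included.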

In [29]:
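# Let's extract the JSON data from the Markdown text returned by the model
markdown_text = final_query_output
# group(1) is the text between the first pair of braces (the braces themselves are excluded)
json_string = re.search(r'{(.*?)}', markdown_text, re.DOTALL).group(1)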

In [30]:
print(json_string)


		"question": "How is India's currency?",
		"choices": "Indian paisa, Japanese yen, Chinese yuan, Indian rupee (₹)",
		"answer": "Indian rupee (₹)"
	

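
Because `group(1)` drops the surrounding braces, the extracted string is not yet valid JSON. One possible next step, shown here as a sketch, is to match the braces as well (`group(0)`), parse the result with `json.loads`, and split the comma-separated choices into a list. The sample text mirrors the first block of the model output above, and the variable names are illustrative.

```python
import json
import re

sample_output = '''[
    {
        "question": "How is India's currency?",
        "choices": "Indian paisa, Japanese yen, Chinese yuan, Indian rupee (₹)",
        "answer": "Indian rupee (₹)"
    }
]'''

# group(0) is the whole match, braces included, so it is a valid JSON object
match = re.search(r"\{.*?\}", sample_output, re.DOTALL)
mcq = json.loads(match.group(0))

# The schema asks for comma-separated choices in one string; turn it into a list
choices = [choice.strip() for choice in mcq["choices"].split(",")]

print(mcq["question"])
print(choices)
print(mcq["answer"] in choices)
```

Splitting on commas assumes that no individual choice contains a comma. The second block of the output above breaks that assumption, because its answer is itself a comma-separated list.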