<h1 style="text-align: center;">MCQ Creator App</h1>

## Table of Contents
* #### Install & Import Dependencies
* #### Load Documents
* #### Transform Documents
* #### Generate Text Embeddings
* #### Vector store - PINECONE
* #### Retrieve Answers
* #### Structure the Output

![MCQCreator](MCQCreator.png)

### Install Libraries

#!pip install unstructured
#!pip install tiktoken
#!pip install pinecone-client
#!pip install pypdf

### Import Dependencies

In [17]:
import os
import openai
import pinecone
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.llms import OpenAI
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.chains.question_answering import load_qa_chain
from langchain import HuggingFaceHub

In [2]:
openai.api_key = os.environ["OPENAI_API_KEY"]
hugging_face_key = os.environ["HUGGINGFACEHUB_API_TOKEN"]

### Load Documents

Load every PDF file in a directory using `PyPDFDirectoryLoader`, which relies on pypdf.

In [3]:
# Function to read documents
def load_docs(directory):
    loader = PyPDFDirectoryLoader(directory)
    documents = loader.load()
    return documents

In [4]:
# Passing the directory to the load_docs function
directory = os.path.join(os.getenv("AI_DATASETS_PATH"), 'genai_datasets/Docs/')
documents = load_docs(directory)

# Check the number of documents loaded
len(documents)  # Each page is loaded as one document: the 2 PDF files have 3 pages in total, so we get 3 documents

3

In [5]:
# Let's inspect the contents of the loaded documents
documents

[Document(page_content="India, officially known as the Republic of India, is a diverse and vibrant country located in South\nAsia. With a rich history spanning thousands of years, India is known for its cultural heritage, \nreligious diversity, and vast landscapes. From the majestic Himalayas in the north to the serene\nbackwaters of Kerala in the south, India encompasses a wide range of geographical features, \nincluding deserts, plains, mountains, and coastlines, making it a land of incredible natural \nbeauty.\nIndia is the seventh-largest country by land area and the second-most populous country in the \nworld, with a population exceeding 1.3 billion people. It is a federal parliamentary democratic \nrepublic, with a president as the head of state and a prime minister as the head of government. \nThe country follows a multi-tiered administrative structure, with 28 states and 9 union territories,\neach having its own elected government.\nIndia has a rich cultural heritage that has e

### Transform Documents

Split the documents into smaller chunks.

![chunks](chunks.png)

In [6]:
# This function will split the documents into chunks
def split_docs(documents, chunk_size=1000, chunk_overlap=20):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs = text_splitter.split_documents(documents)
    return docs

In [7]:
docs = split_docs(documents)
print(len(docs))

7
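To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-width sliding-window splitter. It is a simplified stand-in and not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter also tries to break on separators such as paragraphs and newlines). The function name `sliding_window_chunks` and the sample string are hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=20):
    """Split text into windows of at most chunk_size characters,
    where consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=2)
print(chunks)
```

With these toy settings, the last two characters of each chunk reappear at the start of the next one, which is what keeps context from being cut off at chunk boundaries.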


### Generate Text Embeddings

Use an OpenAI embedding model to create embeddings for documents/text.

In [8]:
# OpenAI embeddings
# `model` is the supported keyword; `model_name` is not a field of OpenAIEmbeddings
# and is only forwarded to model_kwargs with a warning.
embeddings_openai = OpenAIEmbeddings(model="text-embedding-ada-002")

# Let's test our embedding model
query_result = embeddings_openai.embed_query("Hello Buddy")
len(query_result)


1536

In [9]:
query_result

[-0.014250098176522813,
 -0.0002777105329613087,
 -0.0013612301917715938,
 -0.03465958032098804,
 -0.029544158695624143,
 0.016494619168844083,
 -0.002557709881359346,
 -0.004482517138510699,
 -0.01618143027980601,
 -0.0025887024348243016,
 0.006443210583725889,
 -0.008814964678634202,
 0.004231313356402535,
 -0.009219500482195483,
 0.013767265150702014,
 -0.013545422244864545,
 0.036095027843541544,
 -0.01732979016049402,
 0.013558471937128232,
 0.01006119538865466,
 -0.013519323791659772,
 -0.0035168510868691005,
 3.6803780930917277e-05,
 0.002549553823694542,
 -0.007470861742297398,
 -0.015463704655884054,
 0.0054155589594930805,
 -0.01692525342561014,
 0.0032542290568608606,
 -0.028474096968518253,
 0.02418079850518581,
 0.015672497869457836,
 -0.011607566226772106,
 -0.00557541582707804,
 -0.026151277822614866,
 -0.024533134608369745,
 -0.014328395398782333,
 -0.011966428107410483,
 0.020722669170858098,
 -0.0263078722671339,
 0.021270750657997328,
 -0.005608040057737256,
 -0.0042

In [10]:
# Similarly, with an open-source alternative

# Hugging Face sentence-transformers embeddings (an embedding model, not an LLM)
embeddings_hf = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
# Let's test our embedding model
query_result = embeddings_hf.embed_query("Hello Buddy")
len(query_result)

384
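The two models return vectors of different lengths (1536 versus 384). A Pinecone index has a fixed dimension, so the vectors stored in it and the query vectors must come from the same embedding model. Below is a minimal sketch of a guard for this; `check_dimension` and the stand-in vectors are hypothetical and not part of the original notebook.

```python
def check_dimension(vector, index_dimension):
    """Raise if an embedding cannot be stored in an index of the given dimension."""
    if len(vector) != index_dimension:
        raise ValueError(
            f"embedding has {len(vector)} values, index expects {index_dimension}"
        )
    return True

# Stand-in vectors with the lengths reported above (values are placeholders)
openai_like = [0.0] * 1536
minilm_like = [0.0] * 384

ok = check_dimension(minilm_like, 384)
try:
    check_dimension(openai_like, 384)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(ok, mismatch_detected)
```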

In [11]:
query_result

[-0.06978830695152283,
 0.05420631170272827,
 0.07814787328243256,
 0.033901147544384,
 0.02494748681783676,
 -0.09673728793859482,
 0.05952312797307968,
 0.05897815153002739,
 -0.01789674535393715,
 -0.023178959265351295,
 -0.019000232219696045,
 0.0005969315534457564,
 0.024666136130690575,
 -0.07030835002660751,
 -0.00752257090061903,
 0.010224459692835808,
 -0.011180875822901726,
 -0.021248532459139824,
 -0.038594551384449005,
 0.026550456881523132,
 -0.06505240499973297,
 0.0650002732872963,
 0.009431760758161545,
 -0.06271228939294815,
 -0.023625448346138,
 -0.030638156458735466,
 0.059961240738630295,
 0.07367488741874695,
 -0.03286781162023544,
 -0.026061009615659714,
 -0.00696703651919961,
 0.030617890879511833,
 0.05939670652151108,
 0.001471955794841051,
 0.012021657079458237,
 0.028293700888752937,
 -0.05922525003552437,
 -0.07919757068157196,
 0.04896364361047745,
 0.023090094327926636,
 0.05536278337240219,
 -0.026251381263136864,
 -0.017321135848760605,
 0.00551112648099

### Vector Store - PINECONE

![pinecone](pinecone.png)

Pinecone is a managed vector database: data is uploaded as vectors, and semantic search can then be performed over it.<br><br> Conversational data is not only highly unstructured but can also be complex. Vector search and vector databases allow for similarity searches over such data.

We will initialize Pinecone and populate a Pinecone index by passing our documents, the embedding model, and the name of the specific index to use.

Vector databases are designed to handle the unique structure of vector embeddings, which are dense vectors of numbers that represent text. In machine learning, embeddings capture the semantic meaning of words. <br><br>These databases index vectors for easy search and retrieval by comparing values and finding those that are most similar to one another, making them ideal for natural language processing and AI-driven applications.
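The idea of "finding the most similar vectors" can be sketched in plain Python with a brute-force cosine-similarity search. This is only an illustration: a real vector database uses specialized indexes rather than scanning every vector, and the similarity metric depends on how the index is configured (cosine is assumed here). The names `cosine_similarity`, `top_k`, and `toy_store`, as well as the three-dimensional vectors, are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy "embeddings": real ones have hundreds of dimensions
toy_store = {
    "economy": [0.9, 0.1, 0.0],
    "cinema": [0.0, 0.2, 0.9],
    "trade": [0.8, 0.3, 0.1],
}
nearest = top_k([1.0, 0.0, 0.0], toy_store, k=2)
print(nearest)
```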

In [12]:
pinecone_key = os.getenv("PINECONE_API_KEY")
pinecone.init(api_key=pinecone_key, environment="gcp-starter")

In [15]:
# Pushing the embeddings and documents to the vector store

index_name = "mcq-creator"
index = Pinecone.from_documents(docs, embeddings_hf, index_name=index_name)

### Retrieve Answers

In [16]:
# This function fetches the top-k most relevant documents from our Pinecone vector store

def get_similar_docs(query, k=2):
    similar_docs = index.similarity_search(query, k=k)
    return similar_docs

Now we can pass the similar documents along with the query to our LLM. To get a more refined answer, we will use `load_qa_chain` from LangChain together with an open-source LLM called "bigscience/bloom".
(We could have used OpenAI as well, which we will try later in this notebook.)


BigScience Large Open-science Open-access Multilingual Language Model (BLOOM) is a transformer-based large language model.

It was created by over 1000 AI researchers to provide a free large language model for anyone who wants to try one. Trained on around 366 billion tokens from March through July 2022, BLOOM has 176 billion parameters and is considered an alternative to OpenAI's GPT-3.

In [19]:
llm_hf = HuggingFaceHub(repo_id="bigscience/bloom", model_kwargs={"temperature":1e-10})
# check the llm_hf
llm_hf

HuggingFaceHub(client=InferenceAPI(api_url='https://api-inference.huggingface.co/pipeline/text-generation/bigscience/bloom', task='text-generation', options={'wait_for_model': True, 'use_gpu': False}), repo_id='bigscience/bloom', model_kwargs={'temperature': 1e-10})

Different types of `chain_type`:

* "stuff": It places all the texts into a single prompt together with the question.
* "map_reduce": It divides the texts into batches, processes each batch separately with the question, and combines the answers to provide the final answer.
* "refine": It divides the texts into batches and refines the answer by sequentially processing each batch with the previous answer.
* "map_rerank": It divides the texts into batches, evaluates the quality of each answer from the LLM, and selects the highest-scoring answer from the batches as the final answer.

The last three alternatives help handle token limitations and improve the effectiveness of the question-answering process.

In [20]:
# initialize the chain

chain = load_qa_chain(llm_hf, chain_type="stuff")  # "stuff" puts all the documents passed along with the query into a single prompt
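To make the difference between chain types concrete, here is a minimal sketch contrasting the `stuff` strategy chosen above with `map_reduce`. It does not use LangChain: `fake_llm`, `stuff_chain`, and `map_reduce_chain` are hypothetical stand-ins, and the prompt wording is an assumption rather than the prompts LangChain actually sends.

```python
calls = []  # records every prompt sent to the stand-in model

def fake_llm(prompt):
    """Stand-in for a real LLM call; it only counts invocations."""
    calls.append(prompt)
    return f"answer #{len(calls)}"

def stuff_chain(llm, docs, question):
    """Put all documents into one prompt and call the model once."""
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")

def map_reduce_chain(llm, docs, question):
    """Ask the question per document, then combine the partial answers."""
    partial = [llm(f"Context:\n{doc}\n\nQuestion: {question}") for doc in docs]
    combined = "\n".join(partial)
    return llm(f"Partial answers:\n{combined}\n\nQuestion: {question}")

docs = ["chunk one", "chunk two", "chunk three"]
question = "How is India's economy?"

calls.clear()
stuff_answer = stuff_chain(fake_llm, docs, question)
stuff_calls = len(calls)

calls.clear()
map_reduce_answer = map_reduce_chain(fake_llm, docs, question)
map_reduce_calls = len(calls)

print(stuff_calls, map_reduce_calls)
```

The trade-off this illustrates: `stuff` needs a single model call but the prompt grows with every document, while `map_reduce` keeps each prompt small at the cost of one call per document plus a final combining call.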

In [21]:
# This function returns the answer to the question that we ask
def get_answer(query):
    relevant_docs = get_similar_docs(query)
    print(relevant_docs)
    response = chain.run(input_documents=relevant_docs, question=query)
    return response

In [23]:
# Let's pass the query
our_query = "How is India's economy ?"
answer = get_answer(our_query)
print(answer)

[Document(page_content='However, India also faces various socio-economic challenges. Poverty, income inequality, and \nunemployment are persistent issues that the country strives to address. Efforts are being made\nto improve education, healthcare, infrastructure, and social welfare programs to uplift \nmarginalized sections of society.\nEducation plays a vital role in India, with a strong emphasis on academic excellence. The \ncountry has a vast network of schools, colleges, and universities, producing a large number of \ngraduates every year. Indian professionals have made significant contributions in various fields \nglobally, particularly in science, technology, engineering, and mathematics (STEM).\nThe Indian film industry, popularly known as Bollywood, is a global phenomenon, producing the\nlargest number of films annually. Indian cinema reflects the diversity and cultural richness of \nthe country and has a massive following both within India and among the Indian diaspora \nworl

### We will run the same query using OpenAI

In [26]:
chain_openai = load_qa_chain(OpenAI(), chain_type="stuff")
our_query = "How is India's economy ?"

def get_answer(query):
    relevant_docs = get_similar_docs(query)
    print(relevant_docs)
    response = chain_openai.run(input_documents=relevant_docs, question=query)
    return response

answer_openai = get_answer(our_query)
print(answer_openai)

[Document(page_content='However, India also faces various socio-economic challenges. Poverty, income inequality, and \nunemployment are persistent issues that the country strives to address. Efforts are being made\nto improve education, healthcare, infrastructure, and social welfare programs to uplift \nmarginalized sections of society.\nEducation plays a vital role in India, with a strong emphasis on academic excellence. The \ncountry has a vast network of schools, colleges, and universities, producing a large number of \ngraduates every year. Indian professionals have made significant contributions in various fields \nglobally, particularly in science, technology, engineering, and mathematics (STEM).\nThe Indian film industry, popularly known as Bollywood, is a global phenomenon, producing the\nlargest number of films annually. Indian cinema reflects the diversity and cultural richness of \nthe country and has a massive following both within India and among the Indian diaspora \nworl

### Structure the Output to create an MCQ format

In [28]:
import re
import json
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage
from langchain.prompts import PromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate
from langchain.output_parsers import StructuredOutputParser, ResponseSchema

In [29]:
# Define the response schemas
response_schemas = [
    ResponseSchema(name="question", description="Question generated from provided input text data."),
    ResponseSchema(name="choices", description="Available options for a multiple-choice question in comma separated."),
    ResponseSchema(name="answer", description="Correct answer for the asked question.")
]

output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
output_parser

StructuredOutputParser(response_schemas=[ResponseSchema(name='question', description='Question generated from provided input text data.', type='string'), ResponseSchema(name='choices', description='Available options for a multiple-choice question in comma separated.', type='string'), ResponseSchema(name='answer', description='Correct answer for the asked question.', type='string')])

In [30]:
# Fetch the instructions that LangChain generates so that the model responds in the desired format
format_instructions = output_parser.get_format_instructions()
print(format_instructions)

The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"question": string  // Question generated from provided input text data.
	"choices": string  // Available options for a multiple-choice question in comma separated.
	"answer": string  // Correct answer for the asked question.
}
```


In [31]:
# Initialize the ChatOpenAI object
chat_model = ChatOpenAI()
chat_model

ChatOpenAI(client=<class 'openai.api_resources.chat_completion.ChatCompletion'>, openai_api_key='sk-<redacted>', openai_api_base='', openai_organization='', openai_proxy='')

In [32]:
# construct the prompt
prompt = ChatPromptTemplate(
    messages=[
        HumanMessagePromptTemplate.from_template("""When a text input is given by the user, please generate multiple choice questions along with the correct answer.
        \n{format_instructions}\n{user_prompt}""")
    ],
    input_variables = ["user_prompt"],
    partial_variables = {"format_instructions": format_instructions}
)

In [33]:
final_query = prompt.format_prompt(user_prompt=answer)
print(final_query)

messages=[HumanMessage(content='When a text input is given by the user, please generate multiple choice questions along with the correct answer.\n        \nThe output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":\n\n```json\n{\n\t"question": string  // Question generated from provided input text data.\n\t"choices": string  // Available options for a multiple-choice question in comma separated.\n\t"answer": string  // Correct answer for the asked question.\n}\n```\n\nIndia\'s economy is a mixed economy. It is a developing country with a large population. It')]


In [34]:
final_query.to_messages()

[HumanMessage(content='When a text input is given by the user, please generate multiple choice questions along with the correct answer.\n        \nThe output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":\n\n```json\n{\n\t"question": string  // Question generated from provided input text data.\n\t"choices": string  // Available options for a multiple-choice question in comma separated.\n\t"answer": string  // Correct answer for the asked question.\n}\n```\n\nIndia\'s economy is a mixed economy. It is a developing country with a large population. It')]

In [35]:
final_query_output = chat_model(final_query.to_messages())
print(final_query_output.content)

has a diverse economy with agriculture, manufacturing, and services sectors. The country has experienced rapid economic growth in recent years, becoming one of the world's fastest-growing major economies. In addition to being a major agricultural producer, India is also a global leader in IT services and software development.

```json
{
	"question": "What type of economy does India have?",
	"choices": "A) Market economy, B) Command economy, C) Mixed economy, D) Socialist economy",
	"answer": "C) Mixed economy"
}
```


In [36]:
# Let's extract the JSON payload from the Markdown text that we have
# Note: group(1) captures only the text between the braces, not the braces themselves
markdown_text = final_query_output.content
json_string = re.search(r'{(.*?)}', markdown_text, re.DOTALL).group(1)
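Because the captured string lacks its surrounding braces, it is not yet valid JSON. A minimal sketch of going one step further and turning the reply into a Python dictionary is shown here; `parse_mcq` and `sample_reply` are hypothetical, the sample text imitates the model reply above, and splitting `choices` on commas assumes that no individual choice contains a comma.

```python
import json
import re

FENCE = "`" * 3  # built programmatically to avoid nesting code fences here

# Hypothetical reply shaped like the chat model output above
sample_reply = (
    "India has a diverse economy.\n\n"
    + FENCE + "json\n"
    + '{\n\t"question": "What type of economy does India have?",\n'
    + '\t"choices": "A) Market economy, B) Command economy, C) Mixed economy, D) Socialist economy",\n'
    + '\t"answer": "C) Mixed economy"\n}\n'
    + FENCE
)

def parse_mcq(markdown_text):
    """Extract the JSON object (braces included) and parse it into a dict."""
    match = re.search(r"\{.*\}", markdown_text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in the model reply")
    mcq = json.loads(match.group(0))
    # Turn the comma-separated choices string into a list
    mcq["choices"] = [choice.strip() for choice in mcq["choices"].split(",")]
    return mcq

mcq = parse_mcq(sample_reply)
print(mcq["question"])
print(mcq["choices"])
print(mcq["answer"])
```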

In [37]:
print(json_string)


	"question": "What type of economy does India have?",
	"choices": "A) Market economy, B) Command economy, C) Mixed economy, D) Socialist economy",
	"answer": "C) Mixed economy"
