<a href="https://colab.research.google.com/github/siddhapurahet/AI_Concepts_Projects/blob/Interview_questions_answers_project/Interview_Question_Answers_Generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

==> Installing Dependencies

In [None]:
!pip install langchain langchain-community langchain-core openai pypdf tiktoken aiofiles fastapi uvicorn jinja2 PyPDF2 faiss-cpu

==> OpenAI API Key

In [None]:
%env OPENAI_API_KEY = ""


==> Loading the raw PDF as the input file and extracting its content

In [4]:
from langchain.document_loaders import PyPDFLoader

file_path = "/content/Interview_Question_input_data.pdf"
loader = PyPDFLoader(file_path)
data = loader.load()

==> Displaying the content of the input file

In [None]:
data

==> Extracting the page content from the loaded documents <br>
&nbsp;&nbsp;&nbsp;&nbsp;=> Each page is loaded as a Document object of the form: <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Document(metadata={'metadata': 'producer', 'creator': '', 'creationdate': '2023-07-19T22:21:03+09:00', 'moddate': '2023-07-19T22:21:07+09:00', 'trapped': '/False', 'source': '/content/Interview_Question_input_data.pdf', 'total_pages': 42, 'page': 31, 'page_label': '26'}, <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;page_content='26\nSummary for Policymakers\nSummary for Policymakers Delayed mitigation action will further increase global') <br>
<br>
&nbsp;&nbsp;&nbsp;&nbsp;=> Only the page_content field holds text relevant to our use case, so we extract page_content from every page of the input file and concatenate it into a single string.

In [6]:
question_gen = ""
for page in data:
  question_gen += page.page_content

In [None]:
question_gen
# type(question_gen) --> type = str

==> Splitting the extracted content into chunks <br>
&nbsp;&nbsp;&nbsp;&nbsp;=> Chunking is needed because there is a limit on how much data can be sent to a model (the LLM or the embedding model) in a single request, so the text is split into smaller chunks that are sent one at a time.
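
The idea behind chunk_size and chunk_overlap can be sketched in plain Python. This is only an illustration, not the TokenTextSplitter implementation: whitespace-separated words stand in for real model tokens, and the helper name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(tokens):
            break
        start += step
    return chunks


# Assumption: words act as "tokens" here; a real splitter uses the model's tokenizer.
words = "a b c d e f g h i j".split()
chunks = split_with_overlap(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk starts with the last token of the previous one, so a sentence that straddles a chunk boundary is less likely to lose its context.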

In [8]:
from langchain.text_splitter import TokenTextSplitter

splitter_ques_gen = TokenTextSplitter(
    model_name = "gpt-3.5-turbo",
    chunk_size = 1000,
    chunk_overlap = 200
)

In [None]:
chunk_ques_gen = splitter_ques_gen.split_text(question_gen)

chunk_ques_gen
# type(chunk_ques_gen[0])

==> Converting the individual chunks from strings to Document objects <br>
&nbsp;&nbsp;&nbsp;&nbsp;=> The question-generation chain and the vector store used below take Document objects as input rather than plain strings, so each entry of chunk_ques_gen is wrapped in a Document.

In [None]:
from langchain.docstore.document import Document

document_ques_gen = [Document(page_content = t) for t in chunk_ques_gen]

document_ques_gen
# type(document_ques_gen[0])

In [21]:
splitter_ans_gen = TokenTextSplitter(
    model_name = "gpt-3.5-turbo",
    chunk_size = 1000,
    chunk_overlap = 100
)

In [22]:
document_answer_gen = splitter_ans_gen.split_documents(
    document_ques_gen
)

In [None]:
document_answer_gen
# type(document_answer_gen)

==> Passing the documents to the model to frame them as questions <br>
&nbsp;&nbsp;&nbsp;&nbsp;=> No vector database is used at this stage: we only want to frame questions from the document, which a generative model can do directly from the text. Answer generation is different. There, the document is embedded and stored in a vector database so that, for each question, the most relevant content can be retrieved and used to produce the answer.

==> Using the OpenAI model for generating questions from text

In [None]:
from langchain.chat_models import ChatOpenAI

llm_ques_gen_pipeline = ChatOpenAI(
    model = 'gpt-3.5-turbo',
    temperature = 0.3
)

==> Making a prompt_template which will be given to the model for generating questions

In [12]:
prompt_template = """
You are responsible for creating questions from the text. Make sure that every important question is covered.
You can do this by asking questions about the text below:

-------------
{text}
-------------

Create questions that will make the user more knowledgeable when he completes your Interview.

Questions:
"""

In [13]:
from langchain.prompts import PromptTemplate

Prompt_questions = PromptTemplate(template = prompt_template, input_variables = ['text'])

==> Using a refine template: the questions generated so far are passed back to the model together with more context, so the model can check that they are relevant and modify them, or add new ones, only if necessary
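
The refine loop can be sketched without any library. This is a conceptual illustration only, under the assumption that the first chunk is answered with the initial prompt and every later chunk is answered with the refine prompt. The function `refine_over_chunks`, the stand-in `fake_llm`, and the short prompt strings are all hypothetical.

```python
def refine_over_chunks(chunks, llm, question_prompt, refine_prompt):
    """Run the initial prompt on the first chunk, then refine the result
    once per remaining chunk."""
    if not chunks:
        raise ValueError("at least one chunk is required")
    # First chunk: ask for an initial set of questions.
    answer = llm(question_prompt.format(text=chunks[0]))
    # Remaining chunks: show the model its previous output plus new context.
    for chunk in chunks[1:]:
        answer = llm(refine_prompt.format(existing_answer=answer, text=chunk))
    return answer


calls = []  # records every prompt sent to the stand-in model


def fake_llm(prompt):
    """Hypothetical stand-in for a chat model; returns a canned string."""
    calls.append(prompt)
    return f"questions after {len(calls)} chunk(s)"


question_prompt = "Write questions about:\n{text}\nQuestions:"
refine_prompt = (
    "Existing questions: {existing_answer}\n"
    "New context:\n{text}\n"
    "Refined questions:"
)

result = refine_over_chunks(
    ["chunk one", "chunk two", "chunk three"],
    fake_llm,
    question_prompt,
    refine_prompt,
)
print(result)
```

The placeholders `{text}` and `{existing_answer}` match the input variables of the two prompt templates in this notebook, which is why the refine template must declare both.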

In [14]:
refine_template = ("""
You are responsible for creating questions from the text. Make sure that every important question is covered.
We have received some practice questions to a certain extent: {existing_answer}.
We have the option to refine the existing questions or add new ones (only if necessary) with some more context below.

------------
{text}
------------

Given the new context, refine the original questions in English.
If the context is not helpful, kindly give the original questions.

Questions:
"""
)

In [15]:
refine_prompt_questions = PromptTemplate(
    input_variables = ['existing_answer', 'text'],
    template = refine_template,
)

In [16]:
from langchain.chains.summarize import load_summarize_chain

ques_gen_chain = load_summarize_chain(
    llm = llm_ques_gen_pipeline,
    chain_type = "refine",
    verbose = True,
    question_prompt = Prompt_questions,
    refine_prompt = refine_prompt_questions,
)

In [None]:
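questions = ques_gen_chain.run(document_ques_gen)

# Split the output into one question per line and drop blank lines
ques_list = [q.strip() for q in questions.split('\n') if q.strip()]
print(ques_list)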

==> Initializing an Embedding model

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

==> Storing the vector embeddings in the vector database
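
What a vector store does at query time can be sketched with a toy example. This is not how FAISS is implemented: the word-count "embedding", the cosine similarity measure, and the `ToyVectorStore` class are assumptions chosen to keep the sketch self-contained.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real embedding model returns dense float vectors."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store: keeps (text, vector) pairs and ranks them per query."""

    def __init__(self, texts):
        self.entries = [(t, embed(t)) for t in texts]

    def search(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore([
    "delayed mitigation increases global warming",
    "the report has a summary for policymakers",
    "faiss stores vectors for similarity search",
])
top = store.search("what increases global warming", k=1)
print(top)
```

The chunk that shares the most vocabulary with the query ranks first. A real embedding model captures meaning rather than exact word overlap, but the store-then-rank pattern is the same.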

In [None]:
from langchain.vectorstores import FAISS

vector_store = FAISS.from_documents(document_answer_gen, embeddings)

==> Initializing the model to generate answers of the questions

In [26]:
llm_answer_gen = ChatOpenAI(temperature = 0.1, model = "gpt-3.5-turbo")

==> Connecting the vector database to the model for generating answers
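
With the "stuff" strategy, the retrieved chunks are placed together into a single prompt alongside the question. A minimal sketch of that assembly step follows. The function `build_stuff_prompt` and the prompt wording are hypothetical and do not reproduce the prompt LangChain uses internally.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Concatenate all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Assumption: these two strings stand in for chunks returned by the retriever.
prompt = build_stuff_prompt(
    "What does delayed mitigation do?",
    ["chunk A", "chunk B"],
)
print(prompt)
```

Because every retrieved chunk goes into one request, this approach only works while the combined chunks fit within the model's context limit.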

In [None]:
from langchain.chains import RetrievalQA

answer_generation_chain = RetrievalQA.from_chain_type(llm=llm_answer_gen,
                                               chain_type="stuff",
                                               retriever=vector_store.as_retriever())

==> Iterating over the questions: each question is passed to the chain, and the generated answer is printed and saved to a file

In [None]:
# Answer each question and append the result to answers.txt
with open("answers.txt", "a", encoding="utf-8") as f:
    for question in ques_list:
        print("Question: ", question)
        answer = answer_generation_chain.run(question)
        print("Answer: ", answer)
        print("--------------------------------------------------\n\n")
        # Save the question and its answer to the file
        f.write("Question: " + question + "\n")
        f.write("Answer: " + answer + "\n")
        f.write("--------------------------------------------------\n\n")