Retrieval Augmented Generation

***

This notebook centers on creating a personal chatbot enhanced by [Retrieval Augmented Generation](https://arxiv.org/abs/2005.11401) (RAG). RAG enables the chatbot to access information from a local document database, such as the academic publications mentioned in the lectures. This feature allows the chatbot to incorporate locally stored data into its responses. The assignment is divided into three stages:

* First, divide local documents into smaller segments. Process these segments to create embeddings, which are then stored in a vector database for future retrieval.
* Next, integrate a generative language model with this vector database. This integration allows the model to access and utilize document segments stored in the database through RAG, enhancing its responses without additional fine-tuning.
* Finally, evaluate the effectiveness of this integrated system. Use the chatbot to answer questions related to the topics covered in our lectures.



<div style="position:relative;padding:.75rem 1.25rem;margin-bottom:1rem;border:1px solid transparent;border-radius:.25rem;background-color:#FFF2CC;border-color:#D6B656;color:#856404">
<b>How to Submit the Assignment</b>

Please work on this assignment in groups of two or three. Make sure to add your names to this file's header. After completion, share this assignment with me (<b>Julian Schelb - <a target="blank" href="https://www.kaggle.com/julianschelb">https://www.kaggle.com/julianschelb</a></b>) by Wednesday, 31st January, 10:00. Use the upper-right share button as instructed in the tutorial. In ILIAS, submit this notebook as your response to Assignment 10. You can download this notebook using the "Download Notebook" option in the "File" menu.
</div>

## RAG Explained

**Retrieval Augmented Generation (RAG)** is a technique that combines the strengths of retrieval-based and generative models to enhance text generation. It is commonly used to improve response quality in question-answering scenarios. Before a generative model is prompted to answer the question, the user's input *(1)* is encoded as an embedding *(2)*, which is used to retrieve relevant information from a database *(3)* of documents. The retrieved passages are included in the prompt *(4)*, where they inform and improve the response produced by the generative model *(5)*. This method is especially useful because it circumvents the limitations of fine-tuning, which isn't always feasible due to constraints such as data availability or computational resources. For example, placing passages from academic publications directly in the input sequence lets the model give more detailed and relevant answers about those publications. Here are some resources with more information on the topic:

* [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401)
* [Improving language models by retrieving from trillions of tokens](https://arxiv.org/abs/2112.04426)
* [Retrieval-Augmented Generation for Large Language Models: A Survey](https://arxiv.org/abs/2312.10997)
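
The five numbered steps described above can be sketched in a few lines of plain Python. This is only a toy illustration: the bag-of-words `embed` function stands in for a neural embedding model, the three example chunks are made up for this sketch, and the final generation step *(5)* is left out, so the result is just the prompt that would be handed to a generative model.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a neural encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# (3) A tiny "database" of made-up document chunks, embedded ahead of time.
chunks = [
    "Scaled dot-product attention divides the dot products by the square root of the key dimension.",
    "Negative sampling trains word vectors by contrasting observed pairs with random noise words.",
    "Hierarchical softmax arranges the vocabulary in a binary tree to speed up training.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(question, k=1):
    """(1)-(3): embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question, k=1):
    """(4): place the retrieved chunks in the prompt that a generator (5) would receive."""
    context = "\n".join(retrieve(question, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

rag_prompt = build_prompt("How does negative sampling work?")
print(rag_prompt)
```

The rest of this notebook replaces each toy part with a real component: a pre-trained embedding model, a vector database, and a generative language model.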




In [1]:
!pip uninstall -qy dill pyarrow jupyter-lsp jupyterlab jupyterlab-lsp google-cloud-storage tensorflowjs
!pip install -q "dill<0.3.2,>=0.3.1.1" "pyarrow<10.0.0,>=3.0.0" "google-cloud-storage<3,>=2.2.1"
!pip -q install langchain tiktoken chromadb pypdf transformers InstructorEmbedding sentence-transformers
!pip install -q --upgrade "overrides>=7.3.1" "kubernetes>=28.1.0"

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
cuml 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
dask-cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
beatrix-jupyterlab 2023.814.150030 requires jupyter-server~=1.16, but you have jupyter-server 2.9.1 which is incompatible.
beatrix-jupyterlab 2023.814.150030 requires jupyterlab~=3.4, but you have jupyterlab 4.0.8 which is incompatible.
cudf 23.8.0 requires pandas<1.6.0dev0,>=1.3, but you have pandas 2.0.3 which is incompatible.
cudf 23.8.0 requires protobuf<5,>=4.21, but you have protobuf 3.20.3 which is incompatible.
cudf 23.8.0 requires pyarrow==11.*, but you have pyarrow 9.0.0 which is incompatible.
cuml 23.8.0 requires dask==2023.7.1, but you have dask 2023.10.1 which is incompatible.
cuml 23

In [2]:
from IPython.display import display, HTML

def displayAnswer(question="", answer="", sources=None):
    """Render a question, an answer, and an optional list of sources as an HTML box."""

    msg = "<div style='position:relative;padding:0.75rem 1.25rem;\
            margin-bottom:1rem;border:1px solid transparent;\
            border-radius:.25rem;background-color:#fdf7e2;\
            border-color:#D6B656;color:#3c4046'>"
    if question: 
        msg += "<b>Question:</b><br>" + question + "<br><br>"
    if answer:
        msg += "<b>Answer:</b><br>" + answer + "<br><br>"
    if sources:
        msg += "<hr><b>Sources:</b><ul>"
        for source in sources:
            msg += "<li>" + source + "</li>"
        msg += "</ul>"
            
    msg += "</div>"
    display(HTML(msg))
    


In [3]:
#displayAnswer("Who was Albert Einstein?", "A physicist from Germany", ["Source 1", "Source 2"])

## Task 1: Save Documents in Vector Store

In this initial task, we aim to build a local database containing documents for later querying in question-answering scenarios. The process involves saving these documents in a vector store. Begin by loading the selected documents and then splitting them into manageable chunks. Next, create an embedding for each chunk, converting the text into a vector that captures its semantic content. The document retriever will later search over these vectors. Finally, test your retrieval system to confirm that it returns relevant chunks from the vector store you have built.

In [4]:
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader
from InstructorEmbedding import INSTRUCTOR
from langchain.embeddings import HuggingFaceInstructEmbeddings
import os



<div style="position:relative;padding:.75rem 1.25rem;margin-bottom:1rem;border:1px solid transparent;border-radius:.25rem;background-color:#dae8fc;border-color:#6c8ebf;color:#0c5460">
<b>Step 1: Load Documents</b> 
</div>

1. Begin by loading the PDF documents from [this dataset](https://www.kaggle.com/datasets/julianschelb/worlam-papers) containing lecture-related papers and extracting their content. One approach is to use the [DirectoryLoader](https://python.langchain.com/docs/modules/data_connection/document_loaders/file_directory) and [PyPDFLoader](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf) classes from [langchain](https://python.langchain.com/docs/get_started/introduction). Feel free to explore including papers or documents from other domains as well.
2. Verify the successful import of the documents by printing the content of a random page and the total number of imported pages/documents.
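
The verification in point 2 could look roughly like the sketch below. The `Page` class is a hypothetical stand-in with the same two fields (`page_content`, `metadata`) that the loaded pages in this notebook expose, and `preview_random_page` is an illustrative helper name, not a LangChain function.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Page:
    """Hypothetical stand-in for a loaded page: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def preview_random_page(pages, max_chars=200, seed=None):
    """Pick one page at random and return its metadata and the start of its text."""
    rng = random.Random(seed)  # pass a seed to make the choice reproducible
    page = rng.choice(pages)
    return page.metadata, page.page_content[:max_chars]

# Made-up pages for illustration only.
pages = [
    Page("First page text. " * 30, {"source": "paper_a.pdf", "page": 0}),
    Page("Second page text.", {"source": "paper_a.pdf", "page": 1}),
    Page("Third page text.", {"source": "paper_b.pdf", "page": 0}),
]

metadata, preview = preview_random_page(pages, max_chars=80, seed=0)
print("Number of imported pages:", len(pages))
print(metadata)
print(preview)
```

With the real data, the list returned by `loader.load()` would take the place of `pages`.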

In [5]:
# Load the PDF files from the dataset directory (one document per page)
path = '/kaggle/input/worlam-papers'
loader = DirectoryLoader(path, glob="./*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

In [6]:
print("Number of imported pages:", len(documents))

Number of imported pages: 130


In [7]:
print("Example page: \n", documents[0])

Example page: 
 page_content='Last Words\nComputational Linguistics and\nDeep Learning\nChristopher D. Manning∗\nStanford University\n1. The Deep Learning Tsunami\nDeep Learning waves have lapped at the shores of computational linguistics for several\nyears now, but 2015 seems like the year when the full force of the tsunami hit the\nmajor Natural Language Processing (NLP) conferences. However, some pundits are\npredicting that the ﬁnal damage will be even worse. Accompanying ICML 2015 in Lille,\nFrance, there was another, almost as big, event: the 2015 Deep Learning Workshop.\nThe workshop ended with a panel discussion, and at it, Neil Lawrence said, “NLP is\nkind of like a rabbit in the headlights of the Deep Learning machine, waiting to be\nﬂattened.” Now that is a remark that the computational linguistics community has to\ntake seriously! Is it the end of the road for us? Where are these predictions of steam-\nrollering coming from?\nAt the June 2015 opening of the Facebook AI Rese

<div style="position:relative;padding:.75rem 1.25rem;margin-bottom:1rem;border:1px solid transparent;border-radius:.25rem;background-color:#dae8fc;border-color:#6c8ebf;color:#0c5460">
<b>Step 2: Split Documents into Chunks</b> 
</div>

In this step, concentrate on dividing the loaded documents into smaller sections. This is essential for several reasons: (1) the limitation imposed by the models' context window size, (2) the challenge of generating useful embeddings from longer chunks, and (3) the fact that a factually relevant block of text usually spans only a few sentences. Implement the following steps:


1. Plot the word count distribution per page.
2. Iterate over the document pages and split them into smaller chunks. There are many strategies ([see here](https://python.langchain.com/docs/modules/data_connection/document_transformers/)) but we recommend using the [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter), which is designed to segment texts into chunks of a specified size, while also maintaining an overlap for continuity. Ensure that the maximum chunk size is small enough to be processed by the model you plan to use for generating embeddings.
3. After chunking, plot and compare the word count distribution before and after chunking.
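
To make the idea of fixed-size chunks with overlap concrete, here is a minimal sliding-window sketch. It is deliberately simpler than `RecursiveCharacterTextSplitter`, which additionally tries to break at natural separators such as paragraph and line breaks; `split_with_overlap` is a hypothetical helper written for this illustration. As with the splitter's default length function, sizes are counted in characters, not words or tokens.

```python
def split_with_overlap(text, chunk_size=256, chunk_overlap=128):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence cut
    at a window boundary still appears whole in a neighbouring chunk.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

A larger overlap gives more continuity between neighbouring chunks, at the cost of storing and embedding more redundant text.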

In [8]:
from nltk.tokenize import word_tokenize

def get_word_counts(document_collection):
    """Calculate the word count for each document in a collection."""
    word_counts = []

    for document in document_collection:
        text_content = document.page_content  # Extract text from the document
        words = word_tokenize(text_content)  # Tokenize the text into words
        word_count = len(words)  # Count the words
        word_counts.append(word_count)  # Append the count to the list

    return word_counts

In [9]:
word_counts_before = get_word_counts(documents)
print(word_counts_before[0:10])

[535, 691, 599, 533, 629, 473, 638, 462, 727, 329]


In [10]:
# Splitting the text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=128)
texts = text_splitter.split_documents(documents)

In [11]:
print("Number of chunks:", len(texts))

Number of chunks: 4434


In [12]:
print(texts[3])

page_content='predicting that the ﬁnal damage will be even worse. Accompanying ICML 2015 in Lille,\nFrance, there was another, almost as big, event: the 2015 Deep Learning Workshop.\nThe workshop ended with a panel discussion, and at it, Neil Lawrence said, “NLP is' metadata={'source': '/kaggle/input/worlam-papers/Computational_Linguistics_and_Deep_Learning.pdf', 'page': 0}


In [13]:
word_counts_after = get_word_counts(texts)
print(word_counts_after[0:10])

[31, 37, 48, 53, 50, 37, 39, 36, 45, 50]
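
As a numerical companion to the histograms, a few summary statistics make the shift in length easy to see. The sketch below uses only the first ten values printed above for pages and for chunks, so the numbers describe that small sample and not the full collection; `summarize` is an illustrative helper.

```python
from statistics import mean, median

# First ten word counts printed above (pages before chunking, chunks after).
before = [535, 691, 599, 533, 629, 473, 638, 462, 727, 329]
after = [31, 37, 48, 53, 50, 37, 39, 36, 45, 50]

def summarize(counts):
    """Return min, median, mean, and max of a list of word counts."""
    return {
        "min": min(counts),
        "median": median(counts),
        "mean": round(mean(counts), 1),
        "max": max(counts),
    }

print("before:", summarize(before))
print("after: ", summarize(after))
```

Applied to the full `word_counts_before` and `word_counts_after` lists, the same helper gives a compact summary to report alongside the plots.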


In [14]:
import plotly.graph_objs as go
from plotly.subplots import make_subplots

fig = make_subplots(rows=1, cols=2, subplot_titles=("Before Chunking", "After Chunking"))

trace1 = go.Histogram(x=word_counts_before, name='Before')
fig.add_trace(trace1, row=1, col=1)
trace2 = go.Histogram(x=word_counts_after, name='After')
fig.add_trace(trace2, row=1, col=2)

fig.update_layout(height=600, width=1200, title_text="Word Count Distribution: Before and After Chunking")
fig.show()

<div style="position:relative;padding:.75rem 1.25rem;margin-bottom:1rem;border:1px solid transparent;border-radius:.25rem;background-color:#dae8fc;border-color:#6c8ebf;color:#0c5460">
<b>Step 3: Create Document Embeddings</b> 
</div>

In this step, you'll transform the text chunks into semantic vector embeddings. Use the [HuggingFaceInstructEmbeddings](https://python.langchain.com/docs/integrations/text_embedding/instruct_embeddings) class to load a pre-trained model for generating embeddings. Once created, store these embeddings in a [Chroma vector database](https://docs.trychroma.com/getting-started), both in memory and on disk for future retrieval. The [instructor-xl](https://huggingface.co/hkunlp/instructor-xl) model can be used for generating embeddings. Feel free to explore different models.
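
The store-then-reload workflow described above can be illustrated with a deliberately tiny stand-in for a vector database. `TinyVectorStore` is a hypothetical class written for this sketch and is not part of Chroma or LangChain; the three-dimensional vectors are made up, and the exhaustive similarity scan is only practical for toy sizes, whereas real vector databases rely on dedicated index structures.

```python
import json
import math
import os
import tempfile

class TinyVectorStore:
    """Minimal stand-in for a vector database: (text, vector) pairs kept in memory."""

    def __init__(self):
        self.records = []  # each record: {"text": str, "vector": list of floats}

    def add(self, text, vector):
        self.records.append({"text": text, "vector": list(vector)})

    def persist(self, path):
        """Write all records to disk as JSON."""
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(self.records, handle)

    @classmethod
    def load(cls, path):
        """Rebuild a store from a file written by persist()."""
        store = cls()
        with open(path, encoding="utf-8") as handle:
            store.records = json.load(handle)
        return store

    def query(self, vector, k=1):
        """Return the k texts whose vectors have the highest cosine similarity."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0
        ranked = sorted(self.records, key=lambda r: cosine(vector, r["vector"]), reverse=True)
        return [r["text"] for r in ranked[:k]]

store = TinyVectorStore()
store.add("chunk about attention", [0.9, 0.1, 0.0])
store.add("chunk about softmax", [0.1, 0.9, 0.0])

# Persist to a temporary file, then reload and query the reloaded copy.
path = os.path.join(tempfile.mkdtemp(), "tiny_store.json")
store.persist(path)
reloaded = TinyVectorStore.load(path)
nearest = reloaded.query([1.0, 0.0, 0.0], k=1)
print(nearest)
```

The same three stages (build in memory, persist, reload) appear in the cells below, with Chroma doing the storage and the embedding model producing the vectors.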


In [15]:
from langchain.embeddings import HuggingFaceInstructEmbeddings

instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl",
                                                      model_kwargs={"device": "cuda"})


load INSTRUCTOR_Transformer
max_seq_length  512


In [16]:
#!pip -q install chromadb

In [17]:
# Embed and store the texts
# Supplying a persist_directory will store the embeddings on disk
persist_directory = 'db'

# Embedding model used to encode the chunks
embedding = instructor_embeddings

vectordb = Chroma.from_documents(documents=texts,
                                 embedding=embedding,
                                 persist_directory=persist_directory)

In [18]:
# Persist the database to disk, then drop the in-memory reference
vectordb.persist()
vectordb = None

In [19]:
# Now we can load the persisted database from disk, and use it as normal.
vectordb = Chroma(persist_directory=persist_directory,
                  embedding_function=embedding)

<div style="position:relative;padding:.75rem 1.25rem;margin-bottom:1rem;border:1px solid transparent;border-radius:.25rem;background-color:#dae8fc;border-color:#6c8ebf;color:#0c5460">
<b>Step 4: Test Retriever</b> 
</div>

Test your vector store by retrieving relevant information in response to four different questions. While you can use our predefined questions, consider adding your own as well. Evaluate whether your retriever is capable of retrieving relevant information for answering these questions.

*(Hint: Refer to [this article](https://python.langchain.com/docs/modules/data_connection/retrievers/vectorstore) on querying a vector store with LangChain for guidance.)*
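
Beyond reading the retrieved chunks by eye, a crude automatic check can help when you compare settings such as chunk size. The sketch below counts a question as a hit if any retrieved chunk contains one of a few hand-picked keywords. Both the helper `keyword_hit_rate` and the example data are assumptions made for illustration; keyword matching is only a rough proxy for relevance.

```python
def keyword_hit_rate(retrieved, expected_keywords):
    """Fraction of questions with at least one retrieved chunk containing
    at least one expected keyword (case-insensitive substring match)."""
    hits = 0
    for question, chunks in retrieved.items():
        keywords = [kw.lower() for kw in expected_keywords[question]]
        if any(kw in chunk.lower() for chunk in chunks for kw in keywords):
            hits += 1
    return hits / len(retrieved) if retrieved else 0.0

# Hypothetical retrieval results and hand-picked keywords for two questions.
retrieved = {
    "What is scaled dot-product attention?": [
        "Scaled Dot-Product Attention\n Multi-Head Attention",
    ],
    "How does negative sampling work?": [
        "We evaluate on the word analogy task.",
    ],
}
expected_keywords = {
    "What is scaled dot-product attention?": ["dot-product"],
    "How does negative sampling work?": ["negative sampling", "noise"],
}

rate = keyword_hit_rate(retrieved, expected_keywords)
print(rate)  # 0.5: the first question is a hit, the second is not
```

In the notebook, the `retrieved` dictionary could be filled from the `page_content` of the documents the retriever returns for each question.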


In [20]:
questions = ["What is scaled dot-product attention?", 
             "How does the word analogy task work in a vector space representation?",
             "How does the hierarchical softmax work?", 
             "How does negative sampling work?"]

In [21]:
retriever = vectordb.as_retriever()

In [22]:
docs = retriever.get_relevant_documents("What is scaled dot-product attention?")

In [23]:
print("Number of retrieved document chunks:", len(docs))

Number of retrieved document chunks: 4


In [24]:
print("Example chunk:", docs[0])

Example chunk: page_content='Scaled Dot-Product Attention\n Multi-Head Attention\nFigure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several\nattention layers running in parallel.' metadata={'page': 3, 'source': '/kaggle/input/worlam-papers/Attention-Is-All-You-Need.pdf'}


In [25]:
for question in questions:
    docs = retriever.get_relevant_documents(question)
    displayAnswer(question = question, sources = [doc.page_content for doc in docs])

**Your interpretation:**

The retriever is generally capable of identifying information pertinent to the question.

## Task 2: Load the Generative Language Model

This task involves loading a generative model that is specifically fine-tuned for chat applications, which can later be integrated with the vector store and retriever from Task 1. Fine-tuning for chat means adapting a pre-trained language model to perform well in conversational contexts, accomplished by additional training on dialogue-focused datasets. This enhances the model's ability to understand and generate natural, context-relevant responses. However, fine-tuning Large Language Models requires substantial computational resources and extensive training data. As an alternative, context-relevant information retrieved from the vector store can be included in the prompt, thus eliminating the need for extensive model fine-tuning.


<div style="position:relative;padding:.75rem 1.25rem;margin-bottom:1rem;border:1px solid transparent;border-radius:.25rem;background-color:#dae8fc;border-color:#6c8ebf;color:#0c5460">
<b>Step 1: Load Model</b> 
</div>

1. Import the necessary modules from the transformers library and initialize the tokenizer and model using [AutoTokenizer](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer) and [AutoModelForSeq2SeqLM](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForSeq2SeqLM) with the pre-trained [lmsys/fastchat-t5-3b-v1.0](https://huggingface.co/lmsys/fastchat-t5-3b-v1.0).
2. Create a pipeline for text-to-text generation and experiment with different text generation parameters. Wrap this pipeline in a [HuggingFacePipeline](https://python.langchain.com/docs/integrations/llms/huggingface_pipelines).
3. Test your setup by prompting the model with questions from the previous task and/or your own questions.

*(Hint: Don't hesitate to experiment with different models or APIs, such as the [ChatGPT API](https://python.langchain.com/docs/integrations/llms/openai) supported by LangChain.)*

In [26]:
import transformers
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("lmsys/fastchat-t5-3b-v1.0")
model = AutoModelForSeq2SeqLM.from_pretrained("lmsys/fastchat-t5-3b-v1.0")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



### Text Generation Parameters

Each parameter influences the text generation in a specific way. Below are the parameters along with a brief explanation:

**`max_length`**:
* Sets the maximum number of tokens in the generated text. The `transformers` library default is 20; a model's own generation config may override it.
* Generation stops if the maximum length is reached before the model produces an EOS token.
* A higher `max_length` allows for longer generated texts, but may increase the time and computational resources required.

**`min_length`**:
* Sets the minimum number of tokens in the generated text (library default is 0).
* The EOS token is suppressed until this minimum length is reached, so generation cannot end earlier.

**`num_beams`**:
* In beam search, sets the number of "beams" or hypotheses to keep at each step (library default is 1, which means no beam search).
* A higher number of beams increases the chances of finding a good output but also increases the computational cost.

**`num_return_sequences`**:
* Specifies the number of sequences to return for each input (library default is 1).
* When using sampling, the returned sequences are generated independently from each other. With beam search, they are the highest-scoring beams, so the value cannot exceed `num_beams`.

**`early_stopping`**:
* Controls when beam search ends (library default is False). With `True`, the search stops as soon as `num_beams` complete candidates have been found; otherwise a heuristic decides whether better candidates are still likely.
* Only relevant for beam search. Independently of this flag, a sequence ends when the model produces the EOS (end-of-sequence) token, often represented as `</s>`.

**`do_sample`**:
* When enabled, tokens are selected probabilistically based on their likelihood scores. The library default is False, in which case decoding is greedy or uses beam search.
* Introduces randomness into the generation process for diverse outputs.
* The level of randomness is controlled by the `temperature` parameter.

**`temperature`**:
* Rescales the probability distribution used for sampling the next token (library default is 1.0, which leaves the distribution unchanged).
* Higher values make the generation more random, while lower values make it more deterministic.

**`top_k`**:
* Limits the number of tokens considered for sampling at each step to the top K most likely tokens (library default is 50).
* Cuts off the long tail of unlikely tokens, which keeps the output more focused.

**`top_p`**:
* Also known as nucleus sampling, sets a cumulative probability threshold (library default is 1.0, which disables the filter).
* Tokens are sampled only from the smallest set whose cumulative probability reaches this threshold.

**`repetition_penalty`**:
* Discourages the model from repeating tokens it has already produced by modifying their scores (library default is 1.0, which applies no penalty).
* Values greater than 1.0 penalize repetitions, and values less than 1.0 encourage repetitions.

Here are additional resources on the topic:
    
* https://huggingface.co/docs/transformers/main_classes/text_generation
* https://huggingface.co/docs/api-inference/detailed_parameters
* https://huggingface.co/blog/how-to-generate
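
To see how `temperature`, `top_k`, and `top_p` interact, the following sketch samples one token id from a list of raw scores (logits). It is a simplified illustration written for this notebook, not the `transformers` implementation; library code differs in details such as how the distribution is renormalized between the filters. The function name `sample_next_token` and the example logits are assumptions.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample a token id after temperature scaling, top-k and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature: divide the logits before the softmax.
    # Values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(score - peak) for score in scaled]
    total = sum(weights)
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for prob, token_id in ranked:
        kept.append((prob, token_id))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Draw one token; random.choices renormalizes the remaining weights.
    return rng.choices([t for _, t in kept], weights=[p for p, _ in kept], k=1)[0]

logits = [2.0, 1.0, 0.1, -1.0]  # made-up scores for a four-token vocabulary
rng = random.Random(0)
draws = [sample_next_token(logits, temperature=0.7, top_k=2, rng=rng) for _ in range(20)]
print(draws)  # only token ids 0 and 1 can appear, because top_k=2
```

Setting `top_k=1` reduces the procedure to greedy decoding, since only the single most likely token survives the filter.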

In [27]:
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
import torch

pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=256,
    min_length=25,
    do_sample = True,
    temperature = 0.6,
)

local_llm = HuggingFacePipeline(pipeline=pipe)

In [28]:
question = "When was Albert Einstein born?"
displayAnswer(question = question, answer = local_llm(question))


The function `__call__` was deprecated in LangChain 0.1.7 and will be removed in 0.2.0. Use invoke instead.



In [29]:
for question in questions:
    displayAnswer(question = question, answer = local_llm(question))

<div style="position:relative;padding:.75rem 1.25rem;margin-bottom:1rem;border:1px solid transparent;border-radius:.25rem;background-color:#dae8fc;border-color:#6c8ebf;color:#0c5460">
<b>Step 2: Chain Model and Database</b> 
</div>

Fine-tuned models are limited in accessing external, specific information. To mitigate this, a second model generates embeddings for a vector store, facilitating the retrieval of relevant information. This approach enables the chatbot to engage in natural conversation while incorporating detailed, specific data as needed, enhancing its knowledge base and response accuracy.

Your task is to build a question-answering (QA) chain that integrates the language model with the retrieval system. Use the [RetrievalQA.from_chain_type](https://api.python.langchain.com/en/latest/chains/langchain.chains.retrieval_qa.base.RetrievalQA.html) method to merge your local language model (local_llm) with the retriever. Configure the chain with the correct type to ensure that the chatbot effectively uses the retriever for sourcing relevant information.
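
Conceptually, the "stuff" chain type places all retrieved chunks into a single prompt. The sketch below shows that idea with plain string formatting. The helper `stuff_documents`, the shortened template, and the `max_chars` budget are assumptions added for illustration; the budget is not a LangChain feature and only serves as a reminder that the combined prompt must fit the model's context window.

```python
def stuff_documents(question, documents, template, max_chars=2000):
    """Join document texts into one context string and fill the template.

    Documents are added in order until the next one would exceed max_chars.
    """
    parts, used = [], 0
    for text in documents:
        if used + len(text) > max_chars:
            break  # stop before the context grows beyond the budget
        parts.append(text)
        used += len(text)
    return template.format(context="\n\n".join(parts), question=question)

# Shortened, hypothetical template with the same two placeholders.
template = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\nHelpful Answer:"
)
docs = ["First retrieved chunk.", "Second retrieved chunk.", "x" * 5000]

stuffed = stuff_documents("What is RAG?", docs, template)
print(stuffed)
```

Because every retrieved chunk ends up in one prompt, the number of chunks returned by the retriever and the chunk size together determine how much of the context window the question-answering step consumes.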


In [30]:
# Create the chain to answer questions
qa_chain = RetrievalQA.from_chain_type(llm=local_llm,
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)

## Task 3: Test the Model

In this task, we will focus on qualitatively evaluating the model's responses, enhanced by Retrieval Augmented Generation. We'll use the introspective questions from the lecture for this assessment.


<div style="position:relative;padding:.75rem 1.25rem;margin-bottom:1rem;border:1px solid transparent;border-radius:.25rem;background-color:#dae8fc;border-color:#6c8ebf;color:#0c5460">
<b>Step 1: Test With Lecture Related Questions </b> 
</div>

Evaluate the model's accuracy by prompting it with questions related to the lecture topics or from the list of introspective questions. Does your chatbot produce relevant and correct answers?

In [31]:
print("Prompt template:", qa_chain.combine_documents_chain.llm_chain.prompt.template)

Prompt template: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:


In [32]:
for question in questions:
    response = qa_chain(question)
    displayAnswer(question, response['result'], 
                  [source.metadata['source'] for source in response["source_documents"]])
    


The function `__call__` was deprecated in LangChain 0.1.0 and will be removed in 0.2.0. Use invoke instead.



**Your interpretation:**

Compared to the model's answers without RAG, the responses appear more specific.

<div style="position:relative;padding:.75rem 1.25rem;margin-bottom:1rem;border:1px solid transparent;border-radius:.25rem;background-color:#dae8fc;border-color:#6c8ebf;color:#0c5460">
<b>Step 2: Experiment With Different Prompt Templates</b> 
</div>

As covered in previous assignments, generative language models are highly prompt-sensitive. Prompt design significantly influences response quality, but also allows tailoring of model instructions to specific use cases. Below are some key resources on prompt engineering:

* https://python.langchain.com/docs/modules/model_io/prompts/
* https://huggingface.co/docs/transformers/main/tasks/prompting
* https://github.com/thunlp/PromptPapers
* https://www.promptingguide.ai

For your final task, experiment with various prompt templates by following these steps:

1. Display the default LangChain RAG-QA prompt template. Ensure you understand the purposes of the placeholder values.
2. Test different prompt templates. Consider these task-specific instructions, but feel free to be creative:
    * Direct the model to generate either highly detailed or concise answers.
    * Prompt the model to respond in a different language.

In [33]:
new_prompt_template = """
Use the following pieces of context to answer the question at the end. Please provide a summary of only two sentences maximum.

{context}

Question: {question}
Helpful Answer in two sentences:
"""

#qa_chain.combine_documents_chain.llm_chain.prompt.template = new_prompt_template
print(qa_chain.combine_documents_chain.llm_chain.prompt.template)

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:


In [34]:
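#### ALTERNATIVE PROMPT TEMPLATES
# Each assignment below overwrites the previous one, so only the last template
# in this cell is used by the cells that follow. Reorder the blocks to switch.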

new_prompt_template = """

Question: {question}
Context: {context}

Is the answer to the question contained in the context? Please answer only with yes or no.
Answer (yes/no):
"""

#*******************************************************************

new_prompt_template = """
Use the following pieces of context to answer the question at the end.

{context}

Question: {question}
Helpful, detailed, and as long as possible Answer:
"""

#*******************************************************************

new_prompt_template = """
Use the following pieces of context to answer the question at the end.

{context}

Question: {question}
Helpful and as short as possible Answer:
"""

#*******************************************************************

new_prompt_template = """
Use the following pieces of context to answer the question at the end.

{context}

Question: {question}
Answer (translated into German):
"""

In [35]:
from langchain.prompts import PromptTemplate

In [36]:
prompt = PromptTemplate(
    template=new_prompt_template, 
    input_variables=[
        'context', 
        'question',
    ]
)

In [37]:
# Provide sample inputs
sample_inputs = {
    'context': 'The Eiffel Tower is located in Paris.',
    'question': 'Where is the Eiffel Tower?'
}

# Generate the prompt
generated_prompt = prompt.format(**sample_inputs)

# Evaluate the output
print(generated_prompt)


Use the following pieces of context to answer the question at the end.

The Eiffel Tower is located in Paris.

Question: Where is the Eiffel Tower?
Answer (translated into German):



In [38]:
# Initialise RetrievalQA Chain
qa_chain = RetrievalQA.from_chain_type(llm=local_llm,
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True,
                                  chain_type_kwargs={"prompt": prompt})

print(qa_chain.combine_documents_chain.llm_chain.prompt.template)


Use the following pieces of context to answer the question at the end.

{context}

Question: {question}
Answer (translated into German):



In [39]:
query = "How does the hierarchical softmax work?"
response = qa_chain({"query": query})

displayAnswer(query, response['result'], 
              [source.metadata['source'] for source in response["source_documents"]])
    