# Question Answering with LangChain and a locally running model

The goal of this notebook is only to illustrate how to run a [LangChain](https://github.com/hwchase17/langchain) question answering chain on top of a [Hugging Face](https://huggingface.co) language model that is running locally.

LangChain currently offers [two wrappers around Hugging Face LLMs](https://langchain.readthedocs.io/en/latest/ecosystem/huggingface.html), one for a local pipeline and one to access a hosted model in Hugging Face Hub. This notebook uses the local pipeline version, i.e. running the LLM locally, implemented by the `HuggingFacePipeline` class. 

References: 
- https://blog.langchain.dev/tutorial-chatgpt-over-your-data/ explains the overall workflow that this notebook follows. The example in the blog post uses OpenAI LLMs and a FAISS vector store; here we will use a locally hosted LM and ChromaDB instead.
- https://github.com/hwchase17/langchain/blob/master/langchain/llms/huggingface_pipeline.py

# Initialization

In [1]:
import os
import torch
from langchain.chains import LLMChain
from langchain.chains.question_answering import load_qa_chain
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.text_splitter import MarkdownTextSplitter
from langchain.vectorstores import Chroma
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

device = -1  # cpu
if torch.cuda.is_available():
    torch.set_default_tensor_type(torch.cuda.FloatTensor)
    device = 0  # first GPU

## List of models

We define a dictionary with various language models available on Hugging Face for text generation. In the dictionary we specify, for each model:
- its Hugging Face path. Example: `bigscience/bloom-1b7`
- the task it supports, which is specific to each model. Example: `text-generation`. **NOTE**: the `HuggingFacePipeline` only supports *text-generation* (decoder-only) or *text2text-generation* (encoder-decoder) models for now.
- an optional dictionary with additional parameters for that model

The purpose of defining this list of models is to facilitate the process of switching models quickly while running the notebook multiple times.

In [2]:
# Model selection

# A dictionary of models and their corresponding tasks
models = {
    "bloom-1b7": {
        "task": "text-generation",
        "model": "bigscience/bloom-1b7",
        "extra_args": {"max_new_tokens": 100},
    },
    "bloomz-1b7": {
        "task": "text-generation",
        "model": "bigscience/bloomz-1b7",
        "extra_args": {"max_new_tokens": 500, "temperature": 0.9},
    },
    "bloom-3b": {
        "task": "text-generation",
        "model": "bigscience/bloom-3b",
        "extra_args": {"max_new_tokens": 100},
    },
    "bloomz-3b": {
        "task": "text-generation",
        "model": "bigscience/bloomz-3b",
        "extra_args": {"max_new_tokens": 500, "temperature": 0.9},
    },
    "gpt2": {
        "task": "text-generation",
        "model": "gpt2",
        "extra_args": {"max_new_tokens": 30},
    },
    "gpt2-large": {
        "task": "text-generation",
        "model": "gpt2-large",
        "extra_args": {"max_new_tokens": 50},
    },
    "mt0-large": {
        "task": "text2text-generation",
        "model": "bigscience/mt0-large",
        "extra_args": {},
    },
    "opt-1b3": {
        "task": "text-generation",
        "model": "facebook/opt-1.3b",
        "extra_args": {"max_new_tokens": 100},
    },
}

# A mapping between the task and the corresponding transformers class
auto_classes = {
    "text-generation": AutoModelForCausalLM,
    "text2text-generation": AutoModelForSeq2SeqLM,
}

### Model selection

When running the notebook, pick just one of the models from the options listed above to use in the rest of the notebook.

A comparison between different models is beyond the scope of this notebook; the goal here is to show the workflow of running a question answering pipeline locally while making it easy to switch between models.

In [3]:
model_id = "bloomz-3b"
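Because `model_id` is a plain string, a typo only surfaces later as a `KeyError` when the pipeline is created. The following is a minimal sketch of an early check. The helper `resolve_model`, the name `models_excerpt`, and the `SUPPORTED_TASKS` set are hypothetical and not part of the notebook or of LangChain; the two entries are copied from the dictionary above so that the sketch is self-contained.

```python
# Hypothetical helper for illustration; `models_excerpt` repeats two entries
# of the `models` dictionary defined earlier.
models_excerpt = {
    "bloomz-3b": {
        "task": "text-generation",
        "model": "bigscience/bloomz-3b",
        "extra_args": {"max_new_tokens": 500, "temperature": 0.9},
    },
    "mt0-large": {
        "task": "text2text-generation",
        "model": "bigscience/mt0-large",
        "extra_args": {},
    },
}

# The two tasks that the notebook says HuggingFacePipeline supports.
SUPPORTED_TASKS = {"text-generation", "text2text-generation"}


def resolve_model(model_id, registry):
    """Return (task, model path, extra args) or fail early with a clear message."""
    if model_id not in registry:
        known = ", ".join(sorted(registry))
        raise ValueError(f"Unknown model '{model_id}'. Available: {known}")
    entry = registry[model_id]
    if entry["task"] not in SUPPORTED_TASKS:
        raise ValueError(f"Unsupported task '{entry['task']}' for '{model_id}'")
    # Copy the extra arguments so callers cannot mutate the registry by accident.
    return entry["task"], entry["model"], dict(entry.get("extra_args", {}))


print(resolve_model("bloomz-3b", models_excerpt))
```

With a check like this, a misspelled key fails immediately with the list of valid options instead of failing in a later cell.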

## Pipeline creation

Here we manually create a Hugging Face pipeline. LangChain's `HuggingFacePipeline` can create one automatically by calling `from_model_id`, but creating it manually and passing it to the constructor offers a bit more control.

In [4]:
task = models[model_id]["task"]
model_name = models[model_id]["model"]
auto_class = auto_classes[task]
print(f"Using a {task} pipeline with {model_name}")

tokenizer = AutoTokenizer.from_pretrained(model_name)
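# Half precision is poorly supported on CPU, so float16 is only used when a GPU is available
dtype = torch.float16 if device >= 0 else torch.float32
model = auto_class.from_pretrained(model_name, low_cpu_mem_usage=True, torch_dtype=dtype)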
pipe = pipeline(
    task, model=model, tokenizer=tokenizer, device=device, **models[model_id]["extra_args"]
)
llm = HuggingFacePipeline(pipeline=pipe)

Using a text-generation pipeline with bigscience/bloomz-3b


### Testing the pipeline

Let's verify that the local language model / pipeline works with LangChain by creating a simple prompt template and completing a simple prompt:

In [5]:
prompt = PromptTemplate(
    input_variables=["day"],
    template="The day after {day} is",
)
chain = LLMChain(llm=llm, prompt=prompt, verbose=True)

print(chain.run("monday"))



> Entering new LLMChain chain...
Prompt after formatting:
The day after monday is

> Finished chain.
 the day of the week
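The "Prompt after formatting" line in the output shows that the `{day}` placeholder is filled in before the text reaches the model. Conceptually, this step behaves like Python's built-in `str.format`. The sketch below uses only the standard library and illustrates the idea; it is not LangChain's implementation, and `format_prompt` is a hypothetical name.

```python
day_template = "The day after {day} is"


def format_prompt(template, **variables):
    """Fill the named placeholders of a template string."""
    return template.format(**variables)


prompt_text = format_prompt(day_template, day="monday")
print(prompt_text)
```

The resulting string is what the language model receives and is asked to continue.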


# Document indexing

Here we create and populate an embeddings database (AKA vectorstore) with embeddings from the data we have: a set of official documentation and two websites with resources about Red Hat OpenShift Service on AWS (ROSA). The data is in a set of Markdown files.

Each language model has a maximum context length, and we must trim the documents to ensure that they fit within that limit when they are passed as context to the language model.

The `MarkdownTextSplitter` handles that document partition while maintaining the structure in the original Markdown files, aiming to keep relevant/related information together.

In [6]:
# Note: using TextLoader here instead of UnstructuredMarkdownLoader. MarkdownTextSplitter will do the job of parsing markdown
loader = DirectoryLoader('../data/external', glob="**/*.md", loader_cls=TextLoader)
documents = loader.load()  # FYI with the current dataset, documents[42] is the FAQ

text_splitter = MarkdownTextSplitter(chunk_overlap=0)  # Consider adding chunk_size=1000
texts = text_splitter.split_documents(documents)
print(f"{len(documents)} documents were loaded in {len(texts)} chunks")

74 documents were loaded in 1484 chunks
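To make the idea of structure-aware splitting more concrete, here is a deliberately simplified sketch in plain Python. It is not how `MarkdownTextSplitter` is implemented: the hypothetical function `split_on_headings` only cuts at lines that start with `#` and then packs whole sections into chunks of at most `chunk_size` characters. It assumes that no code block in the text contains lines starting with `#`, and a single section longer than `chunk_size` is kept whole.

```python
def split_on_headings(text, chunk_size=4000):
    """Split Markdown at heading lines, then pack whole sections into chunks."""
    sections, current = [], []
    for line in text.splitlines():
        # A new heading closes the section collected so far.
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    chunks, buffer = [], ""
    for section in sections:
        candidate = section if not buffer else buffer + "\n" + section
        if buffer and len(candidate) > chunk_size:
            # Adding this section would exceed the limit: start a new chunk.
            chunks.append(buffer)
            buffer = section
        else:
            buffer = candidate
    if buffer:
        chunks.append(buffer)
    return chunks


sample = "# Title\nIntro text.\n## Install\nRun the installer.\n## Delete\nRun the delete command."
for chunk in split_on_headings(sample, chunk_size=40):
    print(repr(chunk))
```

Keeping a heading together with the text below it is what keeps related information in the same chunk, at the cost of chunks with uneven sizes.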


As a quick summary of the result of the split, let's take a look at how big the largest chunk is:

In [7]:
# Find the index of the longest text in texts
max_len = max(len(text.page_content) for text in texts)
max_len_idx = [i for i, text in enumerate(texts) if len(text.page_content) == max_len][0]
print(f"The longest text chunk is index {max_len_idx}, with length {max_len}")

The longest text chunk is index 525, with length 3998


**NOTE**: depending on how big your "chunks" are, you might face a situation where they do not fit into the model's maximum input length once they are added to the prompt as context for the language model.

It is possible to split into smaller chunks by setting the `chunk_size` parameter, but this means that we _will_ start to miss out on relevant related information.

This compromise is one of the problems/challenges that this notebook attempts to highlight.
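One way to keep the prompt bounded without re-splitting the documents is to cap the total amount of retrieved context. The sketch below is an assumption for illustration only: `select_within_budget` is a hypothetical helper, the chunk lengths are made up (3998 mirrors the largest chunk found above), and a character budget is only a rough stand-in for a token budget.

```python
def select_within_budget(chunks, max_chars):
    """Keep chunks in ranked order until the next one would exceed max_chars."""
    selected, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            # Stop at the first chunk that does not fit, so ranking order is preserved.
            break
        selected.append(chunk)
        used += len(chunk)
    return selected


# Made-up chunks, assumed to be ordered from most to least relevant.
ranked_chunks = ["a" * 1200, "b" * 3998, "c" * 800]
kept = select_within_budget(ranked_chunks, max_chars=4000)
print([len(chunk) for chunk in kept])
```

Stopping at the first chunk that does not fit is a design choice; skipping it and trying the next one would fill the budget better, but it would also pass less relevant text to the model.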

Now, let's check the content of that largest chunk to understand why we ended up with such a long piece:

In [14]:
# show the biggest chunk
print(texts[max_len_idx].page_content)

Delete objects

This section describes the `delete` commands for clusters and resources.

### delete admin

Deletes a cluster administrator from a specified cluster.



**Syntax**



``` terminal
$ rosa delete admin --cluster=<cluster_name> | <cluster_id>
```

| Option    | Definition                                                                              |
|-----------|-----------------------------------------------------------------------------------------|
| --cluster | Required: The name or ID (string) of the cluster to add to the identity provider (IDP). |

Arguments

| Option        | Definition                                                    |
|---------------|---------------------------------------------------------------|
| --help        | Shows help for this command.                                  |
| --debug       | Enables debug mode.                                           |
| --interactive | Enables interactive mode.                                     |
| --p

Now we create the embeddings for the text documents (chunks) that we collected:

In [9]:
embeddings = HuggingFaceEmbeddings()

# https://langchain.readthedocs.io/en/latest/modules/indexes/vectorstore_examples/chroma.html#persist-the-database
db_dir = "../data/interim"
docsearch = None
if os.path.isdir(os.path.join(db_dir, "index")):
    # Load the existing vector store
    docsearch = Chroma(persist_directory=db_dir, embedding_function=embeddings)
else:
    # Create a new vector store
    docsearch = Chroma.from_documents(texts, embeddings, persist_directory=db_dir)
    docsearch.persist()

Using embedded DuckDB with persistence: data will be stored in: ../data/interim
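Conceptually, a vector store keeps one embedding vector per chunk and answers a query by returning the chunks whose vectors are closest to the query vector. The toy sketch below shows that idea in plain Python with made-up three-dimensional vectors. Cosine similarity is used here as one common choice; the metric and index structure that Chroma actually uses may differ, and `top_k` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed, k=3):
    """Return the k texts whose vectors are most similar to query_vector."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up vectors; real embeddings have hundreds of dimensions.
toy_index = [
    ("Deleting a ROSA cluster", [0.9, 0.1, 0.0]),
    ("Installing the ROSA CLI", [0.1, 0.9, 0.0]),
    ("Required AWS service quotas", [0.0, 0.2, 0.9]),
]
print(top_k([1.0, 0.2, 0.0], toy_index, k=2))
```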


# Question Answering

Now it is time to go through the actual question answering.

Here we will be running a question answering chain from LangChain in *verbose* mode, so that we also get a peek into how LangChain builds a prompt for the model.

References:
- https://langchain.readthedocs.io/en/latest/modules/indexes/combine_docs.html for types of chains to combine documents
- https://langchain.readthedocs.io/en/latest/modules/indexes/chain_examples/question_answering.html for the QA example

In [10]:
# Types of chains: stuff, map_reduce, refine, map_rerank
# https://langchain.readthedocs.io/en/latest/modules/indexes/chain_examples/question_answering.html
# https://langchain.readthedocs.io/en/latest/modules/indexes/combine_docs.html
chain = load_qa_chain(llm, chain_type="stuff", verbose=True)
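The `stuff` chain type places all retrieved documents into a single prompt. The sketch below mimics that behaviour in plain Python to show why the prompt grows with the size and number of retrieved chunks. The instruction sentence is the one that appears in the verbose output of this notebook; the `Question:` / `Answer:` layout at the end and the helper name `build_stuff_prompt` are assumptions for illustration, not LangChain's exact template.

```python
INSTRUCTION = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer."
)


def build_stuff_prompt(documents, question):
    """Concatenate every document into one context block and append the question."""
    context = "\n\n".join(documents)
    # The "Question:" / "Answer:" layout is an assumption for illustration.
    return f"{INSTRUCTION}\n\n{context}\n\nQuestion: {question}\nAnswer:"


prompt_text = build_stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What is the roadmap for ROSA?",
)
print(prompt_text)
print(f"Prompt length: {len(prompt_text)} characters")
```

Since every retrieved chunk is pasted in verbatim, the prompt length is roughly the sum of the chunk lengths plus a fixed overhead.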

In [11]:
question = "What is the roadmap for ROSA?"

In [12]:
docs = docsearch.similarity_search(question, k=3)
answer = chain({"input_documents": docs, "question": question}, return_only_outputs=True)



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Next steps

-   [Create a ROSA cluster](#rosa-creating-cluster) or [Create an AWS PrivateLink cluster on ROSA](#rosa-aws-privatelink-creating-cluster).

### Additional resources

-   [AWS prerequisites](#prerequisites)

-   [Required AWS service quotas and requesting increases](#rosa-required-aws-service-quotas)

-   [Understanding the ROSA deployment workflow](#rosa-understanding-the-deployment-workflow)

Next steps

-   [Installing the ROSA CLI](#rosa-installing-cli)

### Additional resources

-   [AWS prerequisites](#prerequisites)

-   [Required AWS service quotas and requesting increases](#rosa-required-aws-service-quotas)

-   [Understanding the ROSA deployment workflow](#rosa-understan

The interesting part of the output above is the prompt that LangChain created. Notice how the prompt includes context retrieved from the embeddings database.

In [13]:
print(f'Query: {question}\nAnswer:{answer["output_text"]}')

Query: What is the roadmap for ROSA?
Answer:  https://red.ht/rosa-roadmap


This is a technically correct answer, even if maybe a bit too concise.

As we saw above, the answer obtained was extracted from the context that was provided to the model via the generated prompt.

### A longer context

Now let's try another question. In order to test the limits of this approach, this time we will try to hit the big chunk we found above. This can cause the prompt to become too big for the model's context to handle.

In [16]:
question = "How can I delete my cluster?"
docs = docsearch.similarity_search(question, k=3)
answer = chain({"input_documents": docs, "question": question}, return_only_outputs=True)
print(f'Query: {question}\nAnswer:{answer["output_text"]}')



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

## Deleting a ROSA Cluster

To delete a ROSA cluster follow the steps below.

1. If you need to list your clusters in order to make sure you are deleting the correct one run:

		rosa list clusters

1. Once you know which one to delete run:

		rosa delete cluster --cluster <cluster-name>

	!!! danger
			**THIS IS NON-RECOVERABLE.**

1. It will prompt you to confirm that you want to delete it. Press “y”, then enter. The cluster and all its associated infrastructure will be deleted.

	!!! note
			All AWS STS/IAM roles and policies will remain and must be deleted manually once the cluster deletion is complete by following the steps below.

1. The command will output the next two commands to delet

Here we can see that the generated prompt is indeed much longer than in the previous attempt (over 7000 characters, which translate to more than 2500 tokens), to the point that it exceeds the model's sequence length (2048 for BLOOM). This can have different consequences depending on the model being used. BLOOM uses [ALiBi positional encodings](https://arxiv.org/abs/2108.12409), which are meant to enable extrapolation to longer sequences.

Still, the obtained answer is very terse, and in this case also partially incomplete/incorrect as it is missing a mandatory option.
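The figures above allow a rough back-of-the-envelope check. The sketch below uses the approximate numbers quoted in the text (7000 characters, 2500 tokens, a sequence length of 2048) and the `max_new_tokens` value configured for `bloomz-3b`. It assumes that the prompt and the generated tokens share the same sequence budget, and that the characters-per-token ratio observed here carries over to other prompts, which is only an approximation.

```python
# Approximate figures quoted in the text.
prompt_chars = 7000
prompt_tokens = 2500
chars_per_token = prompt_chars / prompt_tokens

max_sequence_length = 2048  # sequence length stated above for BLOOM
max_new_tokens = 500        # value configured for bloomz-3b in the models dictionary

# Assumption: prompt tokens and generated tokens share the same sequence budget.
prompt_budget_tokens = max_sequence_length - max_new_tokens
overflow_tokens = prompt_tokens - prompt_budget_tokens
prompt_budget_chars = int(prompt_budget_tokens * chars_per_token)

print(f"Characters per token: {chars_per_token:.1f}")
print(f"Token budget left for the prompt: {prompt_budget_tokens}")
print(f"Tokens over budget: {overflow_tokens}")
print(f"Approximate character budget for the prompt: {prompt_budget_chars}")
```

Under these assumptions, the prompt would need to shrink to somewhat more than 4000 characters to leave room for the answer, which is about the size of the largest single chunk found earlier.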

# Conclusion

This notebook provides an example of how to run a LangChain chain using an LLM from Hugging Face running locally. The main goal here is to show how to develop a question answering pipeline with this approach.

Beyond the particular examples shown in the notebook, we also observe a few challenges:

- Different models exhibit different behaviours and limitations (the notebook only shows one specific example, and a detailed comparison is beyond the goal of the notebook).
- In the particular example shown above, the answers could be more user-friendly and complete.
- For an approach that builds prompts from information indexed in a vector database, building that database carefully deserves its own attention: the quantity and quality of the data made available to the model as context, and how to keep the prompt size under control while still keeping all the necessary context.

Exploring each of these challenges in more detail is a topic for other notebooks.