# Utilizing existing FAQs for Question Answering
- **Level**: Beginner
- **Time to complete**: 15 minutes
- **Nodes Used**: `InMemoryDocumentStore`, `EmbeddingRetriever`
- **Goal**: Learn how to use the `EmbeddingRetriever` in a `FAQPipeline` to answer incoming questions by matching them to the most similar questions in your existing FAQ.

# Overview
While *extractive Question Answering* works on plain text and is therefore more generalizable, a common alternative is to match incoming questions against existing FAQ data.

**Pros**:

- Very fast at inference time
- Reuses existing FAQ data
- Gives good control over the answers

**Cons**:

- Generalizability: We can only answer questions that are similar to the ones already in the FAQ

In some use cases, combining extractive QA with FAQ-style matching can also be a good option.


## Preparing the Colab Environment

- [Enable GPU Runtime](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration#enabling-the-gpu-in-colab)


In [1]:
%%bash

nvidia-smi


Sat Sep 23 04:49:57 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   57C    P8    11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Installing Haystack

To start, let's install the latest release of Haystack with `pip`:

In [2]:
%%bash

pip install --upgrade pip
pip install farm-haystack[colab,inference]

Collecting pip
  Downloading pip-23.2.1-py3-none-any.whl (2.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 6.6 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-23.2.1
Collecting farm-haystack[colab,inference]
  Obtaining dependency information for farm-haystack[colab,inference] from https://files.pythonhosted.org/packages/31/db/e81141e15cecf1abc1238d9aee55f66310274878a034dd96603703620e9c/farm_haystack-1.20.1-py3-none-any.whl.metadata
  Downloading farm_haystack-1.20.1-py3-none-any.whl.metadata (25 kB)
Collecting boilerpy3 (from farm-haystack[colab,inference])
  Downloading boilerpy3-1.0.6-py3-none-any.whl (22 kB)
Collecting canals==0.7.0 (from farm-haystack[colab,inference])
  Obtaining dependency information for canals==0.7.0 from https://files.pythonhosted.org/packages/08/2f/62e9e455c4cf183ebf6672c



### Enabling Telemetry
Knowing you're using this tutorial helps us decide where to invest our efforts to build a better product, but you can always opt out by commenting out the following line. See [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry) for more details.

In [3]:
from haystack.telemetry import tutorial_running

tutorial_running(4)

Set the logging level to INFO for Haystack and to WARNING for everything else:

In [4]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)
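
The two calls above interact through the logger hierarchy: loggers whose names start with `haystack.` inherit the INFO level, while all other loggers fall back to the root level. The following minimal sketch uses only the standard library to check this; the logger name `some_other_library` is a hypothetical placeholder.

```python
import logging

# Root logger at WARNING: other libraries stay quiet below WARNING.
logging.getLogger().setLevel(logging.WARNING)
# The "haystack" logger is set to INFO; its child loggers inherit that level.
logging.getLogger("haystack").setLevel(logging.INFO)

haystack_child = logging.getLogger("haystack.nodes.retriever")
other_library = logging.getLogger("some_other_library")  # hypothetical name

haystack_info_enabled = haystack_child.isEnabledFor(logging.INFO)
other_info_enabled = other_library.isEnabledFor(logging.INFO)
other_warning_enabled = other_library.isEnabledFor(logging.WARNING)

print(haystack_info_enabled, other_info_enabled, other_warning_enabled)
```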

### Create a simple DocumentStore
The `InMemoryDocumentStore` is good for quick development and prototyping. For more scalable options, check out the [docs](https://docs.haystack.deepset.ai/docs/document_store).

In [5]:
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


### Create a Retriever using embeddings
Instead of keyword-based retrieval such as plain BM25, we want to use the vector similarity between questions (the user's question vs. the questions stored in the FAQ).
We can use the `EmbeddingRetriever` for this purpose and specify a model that we use for the embeddings.

In [6]:
from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    use_gpu=True,
    scale_score=False,
)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


INFO:haystack.nodes.retriever.dense:Init retriever using embeddings of model sentence-transformers/all-MiniLM-L6-v2
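
To make the idea of matching by vector similarity more concrete, here is a minimal, dependency-free sketch. It is not how the `EmbeddingRetriever` is implemented. The three-dimensional vectors are made-up toy values, and cosine similarity is used purely for illustration, since the similarity function actually applied depends on how the DocumentStore is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_vec, stored, top_k=1):
    """Return the top_k (score, question) pairs, best match first."""
    scored = [(cosine_similarity(query_vec, vec), question) for question, vec in stored.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy "embeddings" (hypothetical values; a real model produces hundreds of dimensions).
toy_faq_vectors = {
    "What is a novel coronavirus?": [0.9, 0.1, 0.0],
    "How does the virus spread?": [0.1, 0.9, 0.2],
    "What is the source of the virus?": [0.6, 0.3, 0.5],
}
toy_query_vector = [0.2, 0.8, 0.1]  # stands in for the embedded user question

best = most_similar(toy_query_vector, toy_faq_vectors, top_k=1)
print(best[0][1])
```

The stored question whose vector points in nearly the same direction as the query vector ranks first, regardless of whether the two questions share any exact words.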




### Prepare & Index FAQ data
We create a pandas DataFrame containing some FAQ data (i.e., curated question-answer pairs) and index it in our DocumentStore.
Here, we download some question-answer pairs related to COVID-19.

In [7]:
import pandas as pd

from haystack.utils import fetch_archive_from_http


# Download
doc_dir = "data/tutorial4"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/small_faq_covid.csv.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# Get dataframe with columns "question", "answer" and some custom metadata
df = pd.read_csv(f"{doc_dir}/small_faq_covid.csv")
# Minimal cleaning
df.fillna(value="", inplace=True)
df["question"] = df["question"].apply(lambda x: x.strip())
print(df.head())

# Create embeddings for our questions from the FAQs
# In contrast to most other search use cases, we don't create the embeddings here from the content of our documents,
# but rather from the additional text field "question" as we want to match "incoming question" <-> "stored question".
questions = list(df["question"].values)
df["embedding"] = retriever.embed_queries(queries=questions).tolist()
df = df.rename(columns={"question": "content"})

# Convert Dataframe to list of dicts and index them in our DocumentStore
docs_to_index = df.to_dict(orient="records")
document_store.write_documents(docs_to_index)

INFO:haystack.utils.import_utils:Fetching from https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/small_faq_covid.csv.zip to 'data/tutorial4'


                                            question  \
0                       What is a novel coronavirus?   
1  Why is the disease being called coronavirus di...   
2  Why might someone blame or avoid individuals a...   
3  How can people help stop stigma related to COV...   
4                   What is the source of the virus?   

                                              answer  \
0  A novel coronavirus is a new coronavirus that ...   
1  On February 11, 2020 the World Health Organiza...   
2  People in the U.S. may be worried or anxious a...   
3  People can fight stigma and help, not hurt, ot...   
4  Coronaviruses are a large family of viruses. S...   

                                         answer_html  \
0  <p>A novel coronavirus is a new coronavirus th...   
1  <p>On February 11, 2020 the World Health Organ...   
2  <p>People in the U.S. may be worried or anxiou...   
3  <p>People can fight stigma and help, not hurt,...   
4  <p>Coronaviruses are a large family of viru

INFO:haystack.document_stores.base:Duplicate Documents: Document with id 'a8d4ddffcab67801c7a1a13d85fbe84a' already exists in index 'document'
INFO:haystack.document_stores.base:Duplicate Documents: Document with id 'e4fae6647538bfddae6c8d8771fd613' already exists in index 'document'
INFO:haystack.document_stores.base:Duplicate Documents: Document with id 'f6dd87c6e090d56685b37554befe602' already exists in index 'document'
INFO:haystack.document_stores.base:Duplicate Documents: Document with id '719668a041cff08136aad7f4e2876a3a' already exists in index 'document'
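
The indexing step above reshapes each FAQ row so that the stored question becomes the document's `content` (and is the text that gets embedded), while the remaining columns such as `answer` travel along as extra fields. The sketch below mirrors that reshaping in plain Python without pandas or Haystack. `toy_embed`, `to_index_records`, and the toy rows are hypothetical stand-ins, not part of any library.

```python
def toy_embed(text):
    """Hypothetical stand-in for an embedding model: returns a tiny 2-d vector."""
    vowels = sum(ch in "aeiou" for ch in text.lower())
    return [float(len(text)), float(vowels)]


def to_index_records(rows, embed):
    """Turn FAQ rows into records where the question is the searchable content."""
    records = []
    for row in rows:
        question = row["question"].strip()  # minimal cleaning, as in the cell above
        # Keep every other column (answer, answer_html, ...) as an extra field.
        record = {key: value for key, value in row.items() if key != "question"}
        record["content"] = question
        record["embedding"] = embed(question)
        records.append(record)
    return records


# Toy rows with made-up answers (not taken from the real dataset).
toy_rows = [
    {"question": "  What is a novel coronavirus? ", "answer": "Toy answer A", "answer_html": "<p>Toy answer A</p>"},
    {"question": "What is the source of the virus?", "answer": "Toy answer B", "answer_html": "<p>Toy answer B</p>"},
]

records = to_index_records(toy_rows, toy_embed)
print(records[0]["content"], "->", records[0]["answer"])
```

Embedding the question rather than the answer is the design choice that lets an incoming question be compared directly with the stored questions.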


### Ask questions
Initialize a pipeline (this time without a reader) and ask questions.

In [8]:
from haystack.pipelines import FAQPipeline

pipe = FAQPipeline(retriever=retriever)
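
Because there is no reader, nothing is extracted from text at query time: the pipeline only needs to retrieve the best-matching stored question and hand back the answer saved alongside it. The following is a conceptual sketch of that second step, not Haystack's implementation. The function `docs_to_answers`, the dictionary layout, and the toy values are assumptions made for illustration.

```python
def docs_to_answers(query, retrieved_docs):
    """Convert retrieved FAQ documents into answer entries (conceptual sketch)."""
    answers = []
    for doc in retrieved_docs:
        answers.append(
            {
                "answer": doc["meta"]["answer"],     # the curated answer stored with the question
                "score": doc["score"],               # similarity between query and stored question
                "matched_question": doc["content"],  # the stored question that was matched
            }
        )
    return {"query": query, "answers": answers}


# Toy retrieval result (hypothetical score and answer text).
toy_retrieved = [
    {
        "content": "How does the virus spread?",
        "score": 0.93,
        "meta": {"answer": "Toy answer about transmission"},
    }
]

result = docs_to_answers("How is the virus spreading?", toy_retrieved)
print(result["answers"][0]["answer"])
```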

In [9]:
from haystack.utils import print_answers

# Run any question and change top_k to see more or fewer answers
prediction = pipe.run(query="How is the virus spreading?", params={"Retriever": {"top_k": 1}})

print_answers(prediction, details="medium")

'Query: How is the virus spreading?'
'Answers:'
[   {   'answer': 'This virus was first detected in Wuhan City, Hubei '
                  'Province, China. The first infections were linked to a live '
                  'animal market, but the virus is now spreading from '
                  'person-to-person. It’s important to note that '
                  'person-to-person spread can happen on a continuum. Some '
                  'viruses are highly contagious (like measles), while other '
                  'viruses are less so.\n'
                  '\n'
                  'The virus that causes COVID-19 seems to be spreading easily '
                  'and sustainably in the community (“community spread”) in '
                  'some affected geographic areas. Community spread means '
                  'people have been infected with the virus in an area, '
                  'including some who are not sure how or where they became '
                  'infected.\n'
                  '