# Tutorial: Creating Your First QA Pipeline with Retrieval-Augmentation

- **Level**: Beginner
- **Time to complete**: 10 minutes
- **Components Used**: [`InMemoryDocumentStore`](https://docs.haystack.deepset.ai/docs/inmemorydocumentstore), [`SentenceTransformersDocumentEmbedder`](https://docs.haystack.deepset.ai/docs/sentencetransformersdocumentembedder), [`SentenceTransformersTextEmbedder`](https://docs.haystack.deepset.ai/docs/sentencetransformerstextembedder), [`InMemoryEmbeddingRetriever`](https://docs.haystack.deepset.ai/docs/inmemoryembeddingretriever), [`PromptBuilder`](https://docs.haystack.deepset.ai/docs/promptbuilder), [`OpenAIGenerator`](https://docs.haystack.deepset.ai/docs/openaigenerator)
- **Prerequisites**: You must have an [OpenAI API Key](https://platform.openai.com/api-keys).
- **Goal**: After completing this tutorial, you'll have learned the new prompt syntax and how to use PromptBuilder and OpenAIGenerator to build a generative question-answering pipeline with retrieval-augmentation.

> This tutorial uses Haystack 2.0. To learn more, read the [Haystack 2.0 announcement](https://haystack.deepset.ai/blog/haystack-2-release) or visit the [Haystack 2.0 Documentation](https://docs.haystack.deepset.ai/docs/intro).

## Overview

This tutorial shows you how to create a generative question-answering pipeline using the retrieval-augmentation ([RAG](https://www.deepset.ai/blog/llms-retrieval-augmentation)) approach with Haystack 2.0. The process involves four main components: [SentenceTransformersTextEmbedder](https://docs.haystack.deepset.ai/docs/sentencetransformerstextembedder) for creating an embedding for the user query, [InMemoryEmbeddingRetriever](https://docs.haystack.deepset.ai/docs/inmemoryembeddingretriever) for fetching relevant documents, [PromptBuilder](https://docs.haystack.deepset.ai/docs/promptbuilder) for creating a template prompt, and [OpenAIGenerator](https://docs.haystack.deepset.ai/docs/openaigenerator) for generating responses.

For this tutorial, you'll use summaries of San Francisco Chronicle articles scraped from the last year as Documents, but you can replace them with any text you want.


## Preparing the Colab Environment

- [Enable GPU Runtime in Colab](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration)
- [Set logging level to INFO](https://docs.haystack.deepset.ai/docs/logging)

## Installing Haystack

Install Haystack 2.0 and other required packages with `pip`:

In [9]:
%%bash

pip install haystack-ai
pip install "datasets>=2.6.1"
pip install "sentence-transformers>=3.0.0"
pip install assemblyai


## Fetching and Indexing Documents

You'll start creating your question answering system by loading the data and indexing it, together with its embeddings, in a DocumentStore.

In this tutorial, you will take a simple approach to writing documents and their embeddings into the DocumentStore. For a full indexing pipeline with preprocessing, cleaning and splitting, check out our tutorial on [Preprocessing Different File Types](https://haystack.deepset.ai/tutorials/30_file_type_preprocessing_index_pipeline).


### Initializing the DocumentStore

Initialize a DocumentStore to index your documents. A DocumentStore stores the Documents that the question answering system uses to find answers to your questions. In this tutorial, you'll be using the `InMemoryDocumentStore`.

In [3]:
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

> `InMemoryDocumentStore` is the simplest DocumentStore to get started with. It requires no external dependencies and is a good option for smaller projects and debugging. However, it doesn't scale well to larger Document collections, so it's not a good choice for production systems. To learn more about the different types of external databases that Haystack supports, see [DocumentStore Integrations](https://haystack.deepset.ai/integrations?type=Document+Store).

The DocumentStore is now ready. Now it's time to fill it with some Documents.

### Fetch the Data

You'll use article titles and summaries scraped from the San Francisco Chronicle and saved as `.txt` files in a local `scraped_articles` folder. Each file holds the title on the first line, a `-----` separator on the second, and the summary on the third, so you don't need to perform any additional cleaning or splitting.

Fetch the data and convert it into Haystack Documents:

In [4]:
import os
from haystack import Document

# Path to the directory containing the .txt files
folder_path = 'scraped_articles'

# List to store the Document objects
docs = []

# Loop through each file in the directory
for filename in os.listdir(folder_path):
    if filename.endswith(".txt"):
        file_path = os.path.join(folder_path, filename)
        
        # Read the contents of the file
        with open(file_path, 'r', encoding='utf-8') as file:
            lines = file.readlines()
            # Assuming the file structure:
            # 1st line: Title
            # 2nd line: Separator (-----)
            # 3rd line: Summary
            title = lines[0].strip()  # Title is on the first line
            summary = lines[2].strip()  # Summary is on the third line
            
            # Create a Document object
            doc = Document(content=summary, meta={"title": title})
            docs.append(doc)

# Now `docs` contains all the Document objects that you can use with haystack
print(f"Loaded {len(docs)} documents.")

Loaded 22785 documents.
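
The loader above assumes every file has at least three lines. As a minimal sketch of the same parsing step with a guard for malformed files, the hypothetical helper `parse_article` below (not part of Haystack) returns `None` when a file is too short, so the caller can skip it instead of raising an `IndexError`. It returns a plain dictionary; wrapping the result in `Document(content=..., meta=...)` would work the same way as in the cell above.

```python
from typing import Optional


def parse_article(text: str) -> Optional[dict]:
    """Parse the title / separator / summary layout used by the scraped files.

    Hypothetical helper for illustration only. Returns None when the text
    has fewer than three lines.
    """
    lines = text.splitlines()
    if len(lines) < 3:
        return None
    title = lines[0].strip()    # Title is on the first line
    summary = lines[2].strip()  # Summary is on the third line
    return {"content": summary, "meta": {"title": title}}


sample = "Fog rolls in\n-----\nA short summary of the article.\n"
print(parse_article(sample))
print(parse_article("Only a title\n"))
```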


In [5]:
""" from datasets import load_dataset
from haystack import Document

dataset = load_dataset("scrapped_articles", split="train")
docs = [Document(content=doc["content"], meta=doc["meta"]) for doc in dataset] """


In [6]:
# Web scraper for the SF Chronicle (last year's articles)
""" import os
import requests
from bs4 import BeautifulSoup

# Number of pages you want to scrape
pages = 1950

# Directory to save the output files
output_dir = "scraped_articles"

# Create the directory if it doesn't exist
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# Loop through the number of pages
for i in range(pages):
    datasetURL = f"https://sfchronicle.newsbank.com/search?text=&content_added=2023-09-02&date_from=&date_to=&pub%5B0%5D=SFCWS&sort=new&page={i+1}"
    
    if i == 0:  # For the first page, use a different URL pattern
        datasetURL = "https://sfchronicle.newsbank.com/search?text=&content_added=2023-09-02&date_from=&date_to=&pub%5B%5D=SFCWS&sort=new"
    
    # Request the page content
    response = requests.get(datasetURL)
    soup = BeautifulSoup(response.text, "html.parser")
    
    # Find all article containers
    articles = soup.find_all("div", class_="views-row")

    # Loop through each article container to extract title and summary
    for article in articles:
        # Extract the title
        title_tag = article.find("div", class_="views-field views-field-text-1")
        title = title_tag.find("a", class_="text-links").get_text(strip=True) if title_tag else "No title found"
        
        # Extract the summary
        summary_tag = article.find("div", class_="views-field views-field-text-6")
        summary = summary_tag.find("span", class_="field-content").get_text(strip=True) if summary_tag else "No summary found"
        
        # Sanitize title to use as a filename
        safe_title = "".join([c for c in title if c.isalpha() or c.isdigit() or c in [' ', '.', '_']]).rstrip()
        file_path = os.path.join(output_dir, f"{safe_title}.txt")
        
        # Write the content to a text file
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(title + "\n")
            f.write("-----\n")
            f.write(summary + "\n")

        print(f"Saved: {file_path}") """
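
The scraper derives each filename from the article title, so two titles that differ only in punctuation map to the same file, and the later one overwrites the earlier one. A minimal sketch of the same sanitization rule as a standalone function (the name `safe_filename` is hypothetical) makes that behavior easy to check:

```python
def safe_filename(title: str) -> str:
    """Keep letters, digits, spaces, dots, and underscores; drop everything else."""
    allowed_extra = {" ", ".", "_"}
    kept = "".join(c for c in title if c.isalpha() or c.isdigit() or c in allowed_extra)
    return kept.rstrip() + ".txt"


print(safe_filename("Giants win, 3-2!"))  # Giants win 32.txt
print(safe_filename("Giants win 32"))     # Giants win 32.txt
```

If collisions matter for your data, one option is to append a counter or a hash of the title to the filename.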


### Initialize a Document Embedder

To store your data in the DocumentStore with embeddings, initialize a [SentenceTransformersDocumentEmbedder](https://docs.haystack.deepset.ai/docs/sentencetransformersdocumentembedder) with the model name and call `warm_up()` to download the embedding model.

> If you'd like, you can use a different [Embedder](https://docs.haystack.deepset.ai/docs/embedders) for your documents.

In [7]:
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
doc_embedder.warm_up()



### Write Documents to the DocumentStore

Run the `doc_embedder` with the Documents. The embedder creates an embedding for each document and saves it in the Document object's `embedding` field. Then, you can write the Documents to the DocumentStore with the `write_documents()` method.

In [8]:
docs_with_embeddings = doc_embedder.run(docs)
document_store.write_documents(docs_with_embeddings["documents"])

Batches:   0%|          | 0/713 [00:00<?, ?it/s]

22785
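
The progress bar above counts 713 batches for 22,785 documents. That is consistent with a batch size of 32, which is assumed here to be the embedder's default (check the `batch_size` parameter in your installed version). A quick check of the arithmetic:

```python
import math

num_documents = 22785
batch_size = 32  # assumed default; not stated in this tutorial

num_batches = math.ceil(num_documents / batch_size)
print(num_batches)  # 713
```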

## Building the RAG Pipeline

The next step is to build a [Pipeline](https://docs.haystack.deepset.ai/docs/pipelines) to generate answers for the user query following the RAG approach. To create the pipeline, you first need to initialize each component, add them to your pipeline, and connect them.

### Initialize a Text Embedder

Initialize a text embedder to create an embedding for the user query. The created embedding will later be used by the Retriever to retrieve relevant documents from the DocumentStore.

> ⚠️ You used the `sentence-transformers/all-MiniLM-L6-v2` model to create embeddings for your documents, so you need to use the same model to embed the user queries.

In [8]:
from haystack.components.embedders import SentenceTransformersTextEmbedder

text_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")

### Initialize the Retriever

Initialize an [InMemoryEmbeddingRetriever](https://docs.haystack.deepset.ai/docs/inmemoryembeddingretriever) and make it use the InMemoryDocumentStore you initialized earlier in this tutorial. This Retriever will retrieve the documents most relevant to the query.

In [9]:
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

retriever = InMemoryEmbeddingRetriever(document_store)
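
Conceptually, an embedding retriever scores every stored document embedding against the query embedding and returns the best matches. The following is a minimal pure-Python sketch of that idea using cosine similarity and made-up three-dimensional vectors. It is not Haystack's implementation: the function names and toy data are assumptions for illustration, and the similarity function the store actually uses is configurable and may differ.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: List[float], docs: List[Tuple[str, List[float]]], k: int = 2):
    """Return the k (title, score) pairs with the highest similarity to the query."""
    scored = [(title, cosine_similarity(query, emb)) for title, emb in docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings; real ones come from the embedding model and have many more dimensions.
toy_docs = [
    ("49ers preview", [0.9, 0.1, 0.0]),
    ("New restaurant opens", [0.0, 0.2, 0.9]),
    ("Draft rumors", [0.7, 0.3, 0.1]),
]
query_embedding = [1.0, 0.0, 0.0]

for title, score in top_k(query_embedding, toy_docs):
    print(f"{score:.3f}  {title}")
```

Because the score compares vectors coordinate by coordinate, the query and the documents must be embedded by the same model for the ranking to be meaningful.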

### Define a Template Prompt

Create a custom prompt for a generative question answering task using the RAG approach. The prompt should take in two parameters: `documents`, which are retrieved from a document store, and a `question` from the user. Use the Jinja2 looping syntax to combine the content of the retrieved documents in the prompt.

Next, initialize a [PromptBuilder](https://docs.haystack.deepset.ai/docs/promptbuilder) instance with your prompt template. The PromptBuilder, when given the necessary values, will automatically fill in the variable values and generate a complete prompt. This approach allows for a more tailored and effective question-answering experience.

In [19]:
from haystack.components.builders import PromptBuilder

template = """
Using the context below, generate a wild conspiracy theory in two paragraphs that could only arise from the following news articles about events that occurred in San Francisco in the last year.


Consider:
1. At the end of the answer, give me the titles of the documents you used and how each story relates to the conspiracy theory.
2. Each document is divided into a Title, a separator (-----), and a Summary.

The context for the question is the following news articles about events that occurred in San Francisco in the last year. Treat them as highly important for answering the question properly:
{% for document in documents %}
    {{ document.meta["title"] }}
    -----
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""

prompt_builder = PromptBuilder(template=template)
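
To see roughly what a rendered prompt looks like without calling a model, here is a plain-Python sketch that mimics the template's `for` loop with a string join, using a shortened version of the prompt text. The `render_prompt` function and the `SimpleDoc` class are hypothetical stand-ins, not Haystack APIs; in the pipeline, `PromptBuilder` performs the real rendering.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a retrieved document."""
    content: str


def render_prompt(documents: List[SimpleDoc], question: str) -> str:
    # Equivalent of looping over the documents and emitting each one's content.
    context = "\n".join(doc.content for doc in documents)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


sample_docs = [SimpleDoc("Summary of article one."), SimpleDoc("Summary of article two.")]
print(render_prompt(sample_docs, "What happened?"))
```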

### Initialize a Generator


Generators are the components that interact with large language models (LLMs). Now, set the `OPENAI_API_KEY` environment variable and initialize an [OpenAIGenerator](https://docs.haystack.deepset.ai/docs/OpenAIGenerator) that can communicate with OpenAI GPT models. When you initialize it, provide a model name:

In [11]:
import os
from getpass import getpass
from haystack.components.generators import OpenAIGenerator

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")
generator = OpenAIGenerator(model="gpt-3.5-turbo")

> You can replace `OpenAIGenerator` in your pipeline with another `Generator`. Check out the full list of generators [here](https://docs.haystack.deepset.ai/docs/generators).

### Build the Pipeline

To build a pipeline, add all components to your pipeline and connect them. Connect the `embedding` output of `text_embedder` to the `query_embedding` input of `retriever`, then connect `retriever` to `prompt_builder` and `prompt_builder` to `llm`. Because `prompt_builder` has two inputs (`documents` and `question`), connect the output of `retriever` explicitly to its `documents` input so the connection is unambiguous.

For more information on pipelines and creating connections, refer to [Creating Pipelines](https://docs.haystack.deepset.ai/docs/creating-pipelines) documentation.

In [12]:
from haystack import Pipeline

basic_rag_pipeline = Pipeline()
# Add components to your pipeline
basic_rag_pipeline.add_component("text_embedder", text_embedder)
basic_rag_pipeline.add_component("retriever", retriever)
basic_rag_pipeline.add_component("prompt_builder", prompt_builder)
basic_rag_pipeline.add_component("llm", generator)

# Now, connect the components to each other
basic_rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
basic_rag_pipeline.connect("retriever", "prompt_builder.documents")
basic_rag_pipeline.connect("prompt_builder", "llm")

<haystack.core.pipeline.pipeline.Pipeline object at 0x3f107a300>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - llm: OpenAIGenerator
🛤️ Connections
  - text_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.prompt (str)

That's it! Your RAG pipeline is ready to generate answers to questions!
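
To make the data flow concrete, the following sketch chains four stub functions in the same order as the connections listed above. Everything here is a toy stand-in (the `toy_*` names and their return values are assumptions, not Haystack components); the point is only to show which output feeds which input, and why the question is needed in two places.

```python
from typing import Dict, List


def toy_text_embedder(text: str) -> Dict[str, List[float]]:
    # Stub: a real embedder returns a dense vector produced by a model.
    return {"embedding": [float(len(text))]}


def toy_retriever(query_embedding: List[float]) -> Dict[str, List[str]]:
    # Stub: a real retriever ranks stored documents against the embedding.
    return {"documents": ["first retrieved summary", "second retrieved summary"]}


def toy_prompt_builder(documents: List[str], question: str) -> Dict[str, str]:
    context = "\n".join(documents)
    return {"prompt": f"{context}\n\nQuestion: {question}\nAnswer:"}


def toy_llm(prompt: str) -> Dict[str, List[str]]:
    # Stub: a real generator sends the prompt to a language model.
    return {"replies": [f"stub reply to a {len(prompt)}-character prompt"]}


def run_toy_pipeline(question: str) -> Dict[str, Dict[str, List[str]]]:
    # text_embedder.embedding -> retriever.query_embedding
    embedding = toy_text_embedder(text=question)["embedding"]
    # retriever.documents -> prompt_builder.documents
    documents = toy_retriever(query_embedding=embedding)["documents"]
    # prompt_builder.prompt -> llm.prompt (the question is passed in a second time here)
    prompt = toy_prompt_builder(documents=documents, question=question)["prompt"]
    return {"llm": toy_llm(prompt=prompt)}


result = run_toy_pipeline("What happened downtown?")
print(result["llm"]["replies"][0])
```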

## Asking a Question

When asking a question, use the `run()` method of the pipeline. Make sure to provide the question to both the `text_embedder` and the `prompt_builder`. This ensures that the `{{question}}` variable in the template prompt gets replaced with your specific question.

In [13]:
question = "Could you generate a conspiracy theory based on the current state of the San Francisco 49ers football team?"

response = basic_rag_pipeline.run({"text_embedder": {"text": question}, "prompt_builder": {"question": question}})

print(response["llm"]["replies"][0])

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In a wild conspiracy theory, it appears that the San Francisco 49ers are actually working in cahoots with other NFL teams to manipulate game outcomes in order to benefit financially from bets placed on their games. The team's inconsistent performance and unexpected losses are not a result of natural gameplay, but rather carefully orchestrated schemes to deceive the public and cover up their illicit activities. The 49ers' recent defeat to the Baltimore Ravens, which was seen as a turning point for the team, was actually a strategic move to throw off suspicion and maintain their facade of competition.

This theory is supported by the articles "Somewhere between the extremes," "The San Francisco 49ers appear primed for a Super Bowl run," and "John Lynch's response to an ESPN report," which highlight the unpredictability and questionable behavior surrounding the team. The mention of 'juicy scenarios' in the context of contract negotiations and draft picks suggests that there may be hidden 

Here are some other example questions to test:

In [14]:
examples = [
    "Generate a conspiracy theory based on the most trendy restaurants in San Francisco",
    'Generate a conspiracy theory based on San Francisco\'s Chinatown',
    'Generate a conspiracy theory based on the state of San Francisco\'s police',
    'Generate a conspiracy theory about the current state of music in San Francisco',
]
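
One way to try all of these is to loop over the list and build the same input dictionary for each question. The helper below is a hypothetical sketch: it takes the pipeline's `run` method as an argument, and a stub function stands in for it here so the snippet runs without an API key. In the notebook you would pass `basic_rag_pipeline.run` and the `examples` list instead.

```python
from typing import Callable, Dict, List


def ask_all(run: Callable[[dict], dict], questions: List[str]) -> Dict[str, str]:
    """Run each question through the pipeline and collect the first reply."""
    answers = {}
    for question in questions:
        # The question goes to both the text embedder and the prompt builder.
        inputs = {"text_embedder": {"text": question}, "prompt_builder": {"question": question}}
        response = run(inputs)
        answers[question] = response["llm"]["replies"][0]
    return answers


def stub_run(inputs: dict) -> dict:
    # Stand-in for basic_rag_pipeline.run; returns the same response shape.
    question = inputs["prompt_builder"]["question"]
    return {"llm": {"replies": [f"stub answer to: {question}"]}}


results = ask_all(stub_run, ["First question?", "Second question?"])
for question, answer in results.items():
    print(question, "->", answer)
```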

## What's next

🎉 Congratulations! You've learned how to create a generative QA system for your documents with the RAG approach.

If you liked this tutorial, you may also enjoy:
- [Filtering Documents with Metadata](https://haystack.deepset.ai/tutorials/31_metadata_filtering)
- [Preprocessing Different File Types](https://haystack.deepset.ai/tutorials/30_file_type_preprocessing_index_pipeline)
- [Creating a Hybrid Retrieval Pipeline](https://haystack.deepset.ai/tutorials/33_hybrid_retrieval)

To stay up to date on the latest Haystack developments, you can [subscribe to our newsletter](https://landing.deepset.ai/haystack-community-updates) and [join the Haystack Discord community](https://discord.gg/haystack).

Thanks for reading!

Integrate with Streamlit

In [20]:
question = "Generate a conspiracy theory about the current state of music in San Francisco"

response = basic_rag_pipeline.run({"text_embedder": {"text": question}, "prompt_builder": {"question": question}})

print(response["llm"]["replies"][0])

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In a shocking twist, it has been uncovered that the decline of the music scene in San Francisco is not due to natural causes, but rather a carefully orchestrated plan by a secret society known as the "Doom Loop Syndicate". This group of elites, made up of influential figures in the tech industry and political sphere, has been systematically erasing the city's musical identity in favor of a more homogenized, mainstream sound. Through their control of major music festivals like Outside Lands and Noise Pop, they have been manipulating the music culture to fit their own agenda, stifling creativity and diversity in the process.

The key to their plan lies in the manipulation of iconic figures in San Francisco music history, such as Sly Stone, whose downfall was orchestrated to send a message to other artists who dared to challenge the status quo. By promoting a narrative of artistic decline and cultural extinction, the Doom Loop Syndicate aims to control the narrative and steer the city's c