## Build an Extractive QA Pipeline
Extractive QA returns the exact span of text that answers the user's question. If the documents contain no answer, it returns an answer with no text.

#### DocumentStore Data Loader Pipeline
We will use an indexing pipeline to fetch our data, process it, and load it into the document store.

In [2]:
from datasets import load_dataset
from haystack import Document
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.readers import ExtractiveReader
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter

dataset = load_dataset("bilgeyucel/seven-wonders", split="train")
documents = [Document(content=doc['content'], meta=doc['meta']) for doc in dataset]

model = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
document_store = InMemoryDocumentStore()

indexing_pipeline = Pipeline()

indexing_pipeline.add_component(instance=SentenceTransformersDocumentEmbedder(model=model), name="embedder")
indexing_pipeline.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
indexing_pipeline.connect("embedder.documents", "writer.documents")

indexing_pipeline.run({"documents": documents})


Batches: 100%|██████████| 5/5 [00:01<00:00,  3.84it/s]


{'writer': {'documents_written': 151}}

#### Build an Extractive QA Pipeline

An extractive QA pipeline consists of an embedder, a retriever, and a reader.

SentenceTransformersTextEmbedder turns a query into a vector, using the same embedding model we used for the documents above.

The retriever uses vector search to efficiently return the documents most relevant to the query from the document store.
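To make the idea of vector search concrete, here is a minimal pure-Python sketch that ranks documents by the dot product between a query vector and each document vector. The helper `top_k_by_dot_product`, the three-dimensional toy vectors, and the document labels are all made up for illustration. The real embeddings are much larger, and this is not how InMemoryEmbeddingRetriever is implemented internally.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_dot_product(query_embedding, doc_embeddings, top_k=3):
    """Return the top_k (doc_id, score) pairs, highest score first.

    Hypothetical helper for illustration only; not part of Haystack.
    """
    scored = [(doc_id, dot(query_embedding, emb)) for doc_id, emb in doc_embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (assumed values, far smaller than real model output)
toy_docs = {
    "pyramid": [0.9, 0.1, 0.0],
    "colossus": [0.7, 0.6, 0.1],
    "gardens": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.5, 0.0]

ranked = top_k_by_dot_product(toy_query, toy_docs, top_k=2)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

The `top_k` argument here plays the same role as the retriever's `top_k`: it caps how many candidate documents are passed on to the reader.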

ExtractiveReader returns answers to the query, along with their location in the source document and a confidence score.
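The location the reader reports is a pair of character offsets into the document's content. The sketch below uses a small stand-in `Span` dataclass (an assumption for illustration, mimicking the `start` and `end` fields of `ExtractedAnswer.Span`) to show how such offsets map back to the answer text with an ordinary slice.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Stand-in for ExtractedAnswer.Span; illustration only."""
    start: int
    end: int


# Opening of one of the seven-wonders documents
content = "The Roman writer Pliny the Elder, writing in the first century AD"

# Character offsets as a reader might report them (end is exclusive)
offset = Span(start=4, end=16)

# Slicing the document content with the offsets recovers the answer text
answer_text = content[offset.start:offset.end]
print(answer_text)
```

Because the offsets index the original document, they can be used to highlight the answer in context or to pull out the surrounding sentence.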

In [7]:
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.readers import ExtractiveReader
from haystack.components.embedders import SentenceTransformersTextEmbedder

retriever = InMemoryEmbeddingRetriever(document_store=document_store)
# no_answer=True is the reader's default. It adds an ExtractedAnswer with no text, whose score is the probability that none of the returned answers is correct.
reader = ExtractiveReader(no_answer=True)
reader.warm_up()

# embedder + retriever + reader (a generative pipeline would use a prompt + LLM instead of the reader)
extractive_qa_pipeline = Pipeline()
extractive_qa_pipeline.add_component(instance=SentenceTransformersTextEmbedder(model=model), name="embedder")
extractive_qa_pipeline.add_component(instance=retriever, name="retriever")
extractive_qa_pipeline.add_component(instance=reader, name="reader")

# Connect the components
extractive_qa_pipeline.connect("embedder.embedding", "retriever.query_embedding")
extractive_qa_pipeline.connect("retriever.documents", "reader.documents")


<haystack.core.pipeline.pipeline.Pipeline object at 0x72e3a88e4c50>
🚅 Components
  - embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - reader: ExtractiveReader
🛤️ Connections
  - embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> reader.documents (List[Document])
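The `no_answer=True` comment above describes a score that is the probability that none of the returned answers is correct. One way to picture such a quantity, assuming each answer score can be treated as an independent probability of being correct (an assumption of this sketch, not something this notebook confirms about the reader's internals), is to multiply the complements of the scores. The function name `no_answer_probability` is hypothetical.

```python
import math


def no_answer_probability(answer_scores):
    """Probability that every answer is wrong, assuming independent scores.

    Hypothetical helper for illustration; not a Haystack API.
    """
    return math.prod(1.0 - score for score in answer_scores)


# Two confident answers leave little room for "no answer"
confident = no_answer_probability([0.83, 0.73])

# Two weak answers make "no answer" the most likely outcome
weak = no_answer_probability([0.10, 0.05])

print(round(confident, 4), round(weak, 4))
```

Under this reading, a high no-answer score is a signal that the retrieved documents probably do not contain the answer, so the extracted spans should be treated with caution.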

##### Try extracting some answers

In [6]:
query = "Who was Pliny the Elder?"
extractive_qa_pipeline.run(
    data={
        "embedder": {
            "text": query
        },
        "retriever": {
            "top_k": 3
        },
        "reader": {
            "query": query,
            "top_k": 2
        }
    }
)

Batches: 100%|██████████| 1/1 [00:00<00:00, 127.07it/s]


{'reader': {'answers': [ExtractedAnswer(query='Who was Pliny the Elder?', score=0.8306006193161011, data='Roman writer', document=Document(id=bb2c5f3d2e2e2bf28d599c7b686ab47ba10fbc13c07279e612d8632af81e5d71, content: 'The Roman writer Pliny the Elder, writing in the first century AD, argued that the Great Pyramid had...', meta: {'url': 'https://en.wikipedia.org/wiki/Great_Pyramid_of_Giza', '_split_id': 16}, score: 21.667728268420095), context=None, document_offset=ExtractedAnswer.Span(start=4, end=16), context_offset=None, meta={}),
   ExtractedAnswer(query='Who was Pliny the Elder?', score=0.7280881404876709, data='a Roman author', document=Document(id=8910f21f7c0e97792473bcc60a8dcc7f6a90586dbb46b7bf96d28dbfcdc313f4, content: '[21]
   Pliny the Elder (AD 23/24 – 79) was a Roman author, a naturalist and natural philosopher, a nav...', meta: {'url': 'https://en.wikipedia.org/wiki/Colossus_of_Rhodes', '_split_id': 8}, score: 26.857542091152716), context=None, document_offset=ExtractedAns