# Embedding Metadata for Improved Retrieval
Metadata sometimes carries information that is useful for retrieval. This notebook shows how to embed meaningful metadata alongside the contents of a document to improve retrieval.

In [4]:
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

# Values of the listed meta fields are embedded together with each document's content
embedder = SentenceTransformersDocumentEmbedder(meta_fields_to_embed=["url"])
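
To make the effect of meta_fields_to_embed more concrete, here is a minimal sketch of how the text sent to the embedding model might be assembled. It assumes the selected metadata values are joined with the document content using a newline separator, which we believe mirrors the embedder's default behaviour. build_text_to_embed is a hypothetical helper written for illustration; it is not part of Haystack.

```python
def build_text_to_embed(content, meta, meta_fields, separator="\n"):
    """Hypothetical helper: prepend selected metadata values to the content."""
    # Assumption: fields that are missing or empty are simply skipped
    meta_values = [str(meta[field]) for field in meta_fields if meta.get(field)]
    return separator.join(meta_values + [content])


chunk = "The next day, they travelled to Bangor for his Transcendental Meditation retreat."
text = build_text_to_embed(chunk, {"title": "The Beatles"}, ["title"])
print(text)
```

With the title included, a chunk that only says "they" or "the group" still carries the band's name in the text that gets embedded.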

In [11]:
from haystack import Pipeline
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack.utils import ComponentDevice


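def create_indexing_pipeline(document_store, metadata_fields_to_embed=None):
    """Build a clean -> split -> embed -> write indexing pipeline for the given store."""
    document_cleaner = DocumentCleaner()
    # Split each document into chunks of two sentences
    document_splitter = DocumentSplitter(split_by="sentence", split_length=2)
    # When metadata_fields_to_embed is set, those meta values are embedded with the content
    document_embedder = SentenceTransformersDocumentEmbedder(
        model="thenlper/gte-large", meta_fields_to_embed=metadata_fields_to_embed
    )
    document_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE)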

    indexing_pipeline = Pipeline()
    indexing_pipeline.add_component("cleaner", document_cleaner)
    indexing_pipeline.add_component("splitter", document_splitter)
    indexing_pipeline.add_component("embedder", document_embedder)
    indexing_pipeline.add_component("writer", document_writer)

    indexing_pipeline.connect("cleaner", "splitter")
    indexing_pipeline.connect("splitter", "embedder")
    indexing_pipeline.connect("embedder", "writer")

    return indexing_pipeline


#### Two Pipelines 
We can now create two pipelines:
1. indexing_pipeline, which indexes only the contents of the documents
2. indexing_with_metadata_pipeline, which indexes meta fields (here, the title) alongside the contents of the documents

In [12]:
import wikipedia
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

some_bands = ["The Beatles", "The Cure"]

raw_docs = []

for title in some_bands:
    page = wikipedia.page(title=title, auto_suggest=False)
    doc = Document(content=page.content, meta={"title": page.title, "url": page.url})
    raw_docs.append(doc)
    
document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
document_store_with_embedded_metadata = InMemoryDocumentStore(embedding_similarity_function="cosine")

indexing_pipeline = create_indexing_pipeline(document_store=document_store)
indexing_with_metadata_pipeline = create_indexing_pipeline(
    document_store=document_store_with_embedded_metadata,
    metadata_fields_to_embed=["title"],
)

# Index the same raw documents into both stores
indexing_pipeline.run({"cleaner": {"documents": raw_docs}})
indexing_with_metadata_pipeline.run({"cleaner": {"documents": raw_docs}})


Batches: 100%|██████████| 17/17 [00:02<00:00,  6.43it/s]
Batches: 100%|██████████| 17/17 [00:02<00:00,  6.35it/s]


{'writer': {'documents_written': 538}}

#### Comparing Retrieval With and Without Embedded Metadata
We retrieve from document_store and from document_store_with_embedded_metadata using the same query.

Comparing the two result sets shows that, for the example query, the retriever backed by embedded metadata returns the chunk that contains the answer, while the retriever without embedded metadata does not.
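
Both document stores above are created with embedding_similarity_function="cosine", so each retriever ranks chunks by the cosine similarity between the query embedding and the chunk embedding. As a rough illustration of that score, here is a minimal pure-Python sketch. The toy vectors are made up for the example, and the function is not Haystack's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors, purely illustrative (real embeddings are much larger)
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]
orthogonal = [0.0, 1.0, 0.0]

print(cosine_similarity(query, same_direction))  # close to 1: same direction
print(cosine_similarity(query, orthogonal))      # 0: unrelated directions
```

Because the score depends only on direction, adding the title to a chunk's text changes its embedding and can move it closer to a query that names the band.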

In [14]:
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever


retrieval_pipeline = Pipeline()
retrieval_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder(model="thenlper/gte-large"))
retrieval_pipeline.add_component(
    "retriever", InMemoryEmbeddingRetriever(document_store=document_store, scale_score=False, top_k=3)
)
retrieval_pipeline.add_component(
    "retriever_with_embeddings", 
    InMemoryEmbeddingRetriever(document_store=document_store_with_embedded_metadata, scale_score=False, top_k=3)
)


# The query embedding from text_embedder is passed to both retrievers
retrieval_pipeline.connect("text_embedder", "retriever")
retrieval_pipeline.connect("text_embedder", "retriever_with_embeddings")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7351e14ecd90>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - retriever_with_embeddings: InMemoryEmbeddingRetriever
🛤️ Connections
  - text_embedder.embedding -> retriever.query_embedding (List[float])
  - text_embedder.embedding -> retriever_with_embeddings.query_embedding (List[float])

In [23]:
result = retrieval_pipeline.run({
    "text_embedder": {
        "text": "Have the Beatles ever been to Bangor?"
    }
})

print("Retriever Results:\n")
for doc in result["retriever"]["documents"]:
    print(doc.meta['title'])
    print(doc.content)



Batches: 100%|██████████| 1/1 [00:00<00:00, 34.03it/s]

Retriever Results:

The Beatles
 The band flew to Florida, where they appeared on The Ed Sullivan Show a second time, again before 70 million viewers, before returning to the UK on 22 February.
The Beatles' first visit to the US took place when the nation was still mourning the assassination of President John F.
The Beatles

During the 1964 US tour, the group were confronted with racial segregation in the country at the time. When informed that the venue for their 11 September concert, the Gator Bowl in Jacksonville, Florida, was segregated, the Beatles said they would refuse to perform unless the audience was integrated.
The Beatles
The Beatles were an English rock band formed in Liverpool in 1960, comprising John Lennon, Paul McCartney, George Harrison and Ringo Starr. They are regarded as the most influential band of all time and were integral to the development of 1960s counterculture and the recognition of popular music as an art form.





In [24]:

# The chunk that mentions Bangor never names the band; it only says "the group".
# Embedding the title with each chunk lets the retriever match it to the query.
print("\n\nRetriever with Embeddings Results:\n")
for doc in result["retriever_with_embeddings"]["documents"]:
    print(doc.meta['title'])
    print(doc.content)



Retriever with Embeddings Results:

The Beatles

On 24 August, the group were introduced to Maharishi Mahesh Yogi in London. The next day, they travelled to Bangor for his Transcendental Meditation retreat.
The Beatles
" City officials relented and agreed to allow an integrated show. The group also cancelled their reservations at the whites-only Hotel George Washington in Jacksonville.
The Beatles
 The band flew to Florida, where they appeared on The Ed Sullivan Show a second time, again before 70 million viewers, before returning to the UK on 22 February.
The Beatles' first visit to the US took place when the nation was still mourning the assassination of President John F.
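
Comparing the two result lists by eye works for three chunks, but a small check can make the difference explicit. The sketch below uses a hypothetical helper, chunks_mentioning, and shortened excerpts copied from the printed results above as stand-ins for doc.content.

```python
def chunks_mentioning(chunks, term):
    """Return the chunks whose text contains the term (case-insensitive)."""
    return [chunk for chunk in chunks if term.lower() in chunk.lower()]


# Shortened excerpts from the outputs above (stand-ins for doc.content)
plain_results = [
    "The band flew to Florida, where they appeared on The Ed Sullivan Show a second time",
    "During the 1964 US tour, the group were confronted with racial segregation",
    "The Beatles were an English rock band formed in Liverpool in 1960",
]
metadata_results = [
    "The next day, they travelled to Bangor for his Transcendental Meditation retreat.",
    "City officials relented and agreed to allow an integrated show.",
    "The band flew to Florida, where they appeared on The Ed Sullivan Show a second time",
]

print(len(chunks_mentioning(plain_results, "Bangor")))
print(len(chunks_mentioning(metadata_results, "Bangor")))
```

With the real pipeline output, the same helper could be called on [doc.content for doc in result["retriever"]["documents"]] and on the corresponding list for "retriever_with_embeddings".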
