# Classifying Documents & Queries by Language

This notebook builds a Haystack pipeline that classifies documents by the human language they are written in.

It then incorporates language classification and query routing into a RAG pipeline, so that a question is answered from documents written in the same language as the question.

Haystack's `DocumentLanguageClassifier` detects the language of each Document and adds it to the document's metadata under the `language` key. If the detected language is not in the configured list, the document is labeled `unmatched`.

### Write Documents into InMemoryDocumentStore

The following indexing pipeline writes English, French, and Spanish documents into their own InMemoryDocumentStores based on language.

In [42]:
from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.classifiers import DocumentLanguageClassifier
from haystack.components.routers import MetadataRouter
from haystack.components.writers import DocumentWriter

documents = [
    Document(
        content="Super appartement. Juste au dessus de plusieurs bars qui ferment très tard. A savoir à l'avance. (Bouchons d'oreilles fournis !)"
    ),
    Document(
        content="El apartamento estaba genial y muy céntrico, todo a mano. Al lado de la librería Lello y De la Torre de los clérigos. Está situado en una zona de marcha, así que si vais en fin de semana , habrá ruido, aunque a nosotros no nos molestaba para dormir"
    ),
    Document(
        content="The keypad with a code is convenient and the location is convenient. Basically everything else, very noisy, wi-fi didn't work, check-in person didn't explain anything about facilities, shower head was broken, there's no cleaning and everything else one may need is charged."
    ),
    Document(
        content="It is very central and appartement has a nice appearance (even though a lot IKEA stuff), *W A R N I N G** the appartement presents itself as a elegant and as a place to relax, very wrong place to relax - you cannot sleep in this appartement, even the beds are vibrating from the bass of the clubs in the same building - you get ear plugs from the hotel -> now I understand why -> I missed a trip as it was so loud and I could not hear the alarm next day due to the ear plugs.- there is a green light indicating 'emergency exit' just above the bed, which shines very bright at night - during the arrival process, you felt the urge of the agent to leave as soon as possible. - try to go to 'RVA clerigos appartements' -> same price, super quiet, beautiful, city center and very nice staff (not an agency)- you are basically sleeping next to the fridge, which makes a lot of noise, when the compressor is running -> had to switch it off - but then had no cool food and drinks. - the bed was somehow broken down - the wooden part behind the bed was almost falling appart and some hooks were broken before- when the neighbour room is cooking you hear the fan very loud. I initially thought that I somehow activated the kitchen fan"
    ),
    Document(content="Un peu salé surtout le sol. Manque de service et de souplesse"),
    Document(
        content="Nous avons passé un séjour formidable. Merci aux personnes , le bonjours à Ricardo notre taxi man, très sympathique. Je pense refaire un séjour parmi vous, après le confinement, tout était parfait, surtout leur gentillesse, aucune chaude négative. Je n'ai rien à redire de négative, Ils étaient a notre écoute, un gentil message tout les matins, pour nous demander si nous avions besoins de renseignement et savoir si tout allait bien pendant notre séjour."
    ),
    Document(
        content="Céntrico. Muy cómodo para moverse y ver Oporto. Edificio con terraza propia en la última planta. Todo reformado y nuevo. Te traen un estupendo desayuno todas las mañanas al apartamento. Solo que se puede escuchar algo de ruido de la calle a primeras horas de la noche. Es un zona de ocio nocturno. Pero respetan los horarios."
    ),
]

In [43]:
# Each language gets its own DocumentStore

en_document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
fr_document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
es_document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")

Define the rules the MetadataRouter follows to route each document to a language-specific output.

In [44]:
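# Only these languages are recognized; anything else is labeled "unmatched"
language_classifier = DocumentLanguageClassifier(languages=["en", "fr", "es"])

# One rule per router output: a document is sent to the output whose filter matches its metadata
router_rules = {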
    "en": {
        "language": {
            "$eq": "en"
        }
    },
    "fr": {
        "language": {
            "$eq": "fr"
        }
    },
    "es": {
        "language": {
            "$eq": "es"
        }
    },
}
router = MetadataRouter(rules=router_rules)
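
To make the routing rules concrete, here is a minimal plain-Python sketch of how `$eq` rules can sort documents into named outputs. It is a simplified stand-in, not Haystack's implementation: `route_by_rules` and the dict-based sample documents are hypothetical names introduced only for illustration.

```python
from collections import defaultdict


# Hypothetical helper: a simplified stand-in for rule-based routing.
def route_by_rules(docs, rules):
    """Send each document (a dict with a "meta" dict) to every output whose rule matches."""
    routed = defaultdict(list)
    for doc in docs:
        matched = False
        for output_name, conditions in rules.items():
            # Every field named in the rule must satisfy its "$eq" condition
            if all(doc["meta"].get(field) == cond["$eq"] for field, cond in conditions.items()):
                routed[output_name].append(doc)
                matched = True
        if not matched:
            # No rule matched: keep the document in a catch-all bucket
            routed["unmatched"].append(doc)
    return dict(routed)


sample_docs = [
    {"content": "Very central.", "meta": {"language": "en"}},
    {"content": "Un peu salé.", "meta": {"language": "fr"}},
    {"content": "Sehr laut.", "meta": {"language": "unmatched"}},
]
sample_rules = {
    "en": {"language": {"$eq": "en"}},
    "fr": {"language": {"$eq": "fr"}},
    "es": {"language": {"$eq": "es"}},
}

routed = route_by_rules(sample_docs, sample_rules)
print({name: len(items) for name, items in routed.items()})
```

The third sample document matches none of the rules, so it ends up under `unmatched` rather than in a language-specific output.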

In [45]:
from haystack.document_stores.types import DuplicatePolicy

en_writer = DocumentWriter(document_store=en_document_store, policy=DuplicatePolicy.OVERWRITE)
fr_writer = DocumentWriter(document_store=fr_document_store, policy=DuplicatePolicy.OVERWRITE)
es_writer = DocumentWriter(document_store=es_document_store, policy=DuplicatePolicy.OVERWRITE)

In [46]:
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentSplitter

en_document_embedder = SentenceTransformersDocumentEmbedder("thenlper/gte-large")
es_document_embedder = SentenceTransformersDocumentEmbedder("thenlper/gte-large")
fr_document_embedder = SentenceTransformersDocumentEmbedder("thenlper/gte-large")

# split_length=2 produces two-word chunks, so each review becomes many small documents
en_document_splitter = DocumentSplitter(split_by="word", split_length=2)
fr_document_splitter = DocumentSplitter(split_by="word", split_length=2)
es_document_splitter = DocumentSplitter(split_by="word", split_length=2)
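
As a rough illustration of what word-based splitting with `split_length=2` does, the sketch below groups a review into two-word chunks. `split_by_word` is a hypothetical helper, not the `DocumentSplitter` implementation, and it normalizes whitespace, whereas the chunks stored by this notebook keep a trailing space (for example `'The keypad '`).

```python
def split_by_word(text, split_length):
    """Group whitespace-separated words into chunks of at most split_length words."""
    words = text.split()
    return [" ".join(words[i:i + split_length]) for i in range(0, len(words), split_length)]


review = "Un peu salé surtout le sol. Manque de service et de souplesse"
chunks = split_by_word(review, split_length=2)
print(len(chunks), chunks[:3])
```

Chunks this small carry very little context each, which is worth keeping in mind when judging retrieval quality later on.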

In [47]:
indexing_pipeline = Pipeline()

indexing_pipeline.add_component(instance=language_classifier, name="language_classifier")
indexing_pipeline.add_component(instance=router, name="router")
indexing_pipeline.add_component(instance=en_writer, name="en_writer")
indexing_pipeline.add_component(instance=fr_writer, name="fr_writer")
indexing_pipeline.add_component(instance=es_writer, name="es_writer")

indexing_pipeline.add_component(instance=en_document_embedder, name="en_embedder")
indexing_pipeline.add_component(instance=es_document_embedder, name="es_embedder")
indexing_pipeline.add_component(instance=fr_document_embedder, name="fr_embedder")

indexing_pipeline.add_component(instance=en_document_splitter, name="en_splitter")
indexing_pipeline.add_component(instance=es_document_splitter, name="es_splitter")
indexing_pipeline.add_component(instance=fr_document_splitter, name="fr_splitter")

indexing_pipeline.connect("language_classifier", "router")

indexing_pipeline.connect("router.en", "en_splitter")
indexing_pipeline.connect("router.es", "es_splitter")
indexing_pipeline.connect("router.fr", "fr_splitter")

indexing_pipeline.connect("en_splitter", "en_embedder")
indexing_pipeline.connect("es_splitter", "es_embedder")
indexing_pipeline.connect("fr_splitter", "fr_embedder")


indexing_pipeline.connect("en_embedder", "en_writer")
indexing_pipeline.connect("fr_embedder", "fr_writer")
indexing_pipeline.connect("es_embedder", "es_writer")

<haystack.core.pipeline.pipeline.Pipeline object at 0x74f1761abd50>
🚅 Components
  - language_classifier: DocumentLanguageClassifier
  - router: MetadataRouter
  - en_writer: DocumentWriter
  - fr_writer: DocumentWriter
  - es_writer: DocumentWriter
  - en_embedder: SentenceTransformersDocumentEmbedder
  - es_embedder: SentenceTransformersDocumentEmbedder
  - fr_embedder: SentenceTransformersDocumentEmbedder
  - en_splitter: DocumentSplitter
  - es_splitter: DocumentSplitter
  - fr_splitter: DocumentSplitter
🛤️ Connections
  - language_classifier.documents -> router.documents (List[Document])
  - router.en -> en_splitter.documents (List[Document])
  - router.es -> es_splitter.documents (List[Document])
  - router.fr -> fr_splitter.documents (List[Document])
  - en_embedder.documents -> en_writer.documents (List[Document])
  - es_embedder.documents -> es_writer.documents (List[Document])
  - fr_embedder.documents -> fr_writer.documents (List[Document])
  - en_splitter.documents -> en_em

In [48]:
indexing_pipeline.draw("indexing_pipeline.png")

In [49]:
indexing_pipeline.run(data={
    "language_classifier": {
        "documents": documents
    }
})

Batches: 100%|██████████| 5/5 [00:00<00:00, 39.03it/s]
Batches: 100%|██████████| 2/2 [00:00<00:00, 41.44it/s]
Batches: 100%|██████████| 2/2 [00:00<00:00, 46.12it/s]


{'router': {'unmatched': []},
 'en_writer': {'documents_written': 141},
 'fr_writer': {'documents_written': 53},
 'es_writer': {'documents_written': 54}}

### Check the Contents of the Document Stores

In [50]:
print("English documents ", en_document_store.filter_documents())
print("French documents ", fr_document_store.filter_documents())
print("Spanish documents ", es_document_store.filter_documents())

English documents  [Document(id=ee08b8715e02e71ce101e8f63d92ce4432fee7d35e8f152c31d7f6fcae2c6120, content: 'The keypad ', meta: {'language': 'en', 'source_id': '8f64ab234c6a5d5652d02bed144d069ec6e988903b071d16fffbf400abfc1047', 'page_number': 1}, embedding: vector of size 1024), Document(id=1d4c4d919d4ac6672c875a325c7ea9c86d9f0d642cf802da8e79e7623fadb366, content: 'with a ', meta: {'language': 'en', 'source_id': '8f64ab234c6a5d5652d02bed144d069ec6e988903b071d16fffbf400abfc1047', 'page_number': 1}, embedding: vector of size 1024), Document(id=e65ec6f8345eaa194e64a839d4eb2879ad3f70967e2b3493c6102a1e86208d99, content: 'code is ', meta: {'language': 'en', 'source_id': '8f64ab234c6a5d5652d02bed144d069ec6e988903b071d16fffbf400abfc1047', 'page_number': 1}, embedding: vector of size 1024), Document(id=8be4dbae8276e55b708f9ceeb5a72aedb0c5582155e1dc9076303af9a3747ee2, content: 'convenient and ', meta: {'language': 'en', 'source_id': '8f64ab234c6a5d5652d02bed144d069ec6e988903b071d16fffbf400abfc10

## Create a Multi-Lingual RAG pipeline

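TextLanguageRouter detects the language of the query and emits the query text on the output named after that language, so only the matching branch of the pipeline runs.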
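
The routing idea can be sketched in plain Python as follows. This is a toy under loud assumptions: the stopword lists, `toy_detect_language`, and `route_text` are all hypothetical, and a real language detector is far more robust than counting stopwords.

```python
# Hypothetical stopword lists, chosen only to make the example work.
STOPWORDS = {
    "en": {"the", "is", "this", "and", "of"},
    "fr": {"le", "la", "est", "et", "les"},
    "es": {"el", "es", "los", "y", "muy"},
}


def toy_detect_language(text):
    """Return the language whose stopwords overlap most with the text, or None if nothing overlaps."""
    words = set(text.lower().replace("?", " ").replace("¿", " ").split())
    scores = {lang: len(words & stopwords) for lang, stopwords in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def route_text(text, languages):
    """Emit the text on the output named after its language, or on "unmatched"."""
    detected = toy_detect_language(text)
    output = detected if detected in languages else "unmatched"
    return {output: text}


print(route_text("Is this apartment conveniently located?", ["en", "fr", "es"]))
print(route_text("¿El desayuno es genial?", ["en", "fr", "es"]))
```

Because the result has exactly one key, a downstream pipeline only has input for one language branch; a query in an unlisted language lands on `unmatched`.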

In [51]:
import os
from dotenv import load_dotenv

load_dotenv()

if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("OPENAI_API_KEY is required.")

The RAG pipeline lets users ask about an accommodation in the language of their choice.

In [66]:
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.joiners import DocumentJoiner
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.routers import TextLanguageRouter
from haystack.components.embedders import SentenceTransformersTextEmbedder

prompt_template = """
You will be provided with reviews for an accommodation.
Answer the question concisely based solely on the given reviews.

Reviews:
{% for doc in documents %}
    {{ doc.content }}
{% endfor %}

Question: {{query}}
Answer:

"""


### Build the Pipeline

BM25 retrieval relies on keyword matching, which can miss reviews that express the same idea in different words. To make the LLM responses more precise, we use the InMemoryEmbeddingRetriever, which performs vector search over the document embeddings.
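
To show what vector search means here, the following sketch ranks a few texts by cosine similarity to a query embedding, the same similarity function the document stores above are configured with. The three-dimensional vectors are made up for illustration (the stored embeddings in this notebook have 1024 dimensions), and `cosine_similarity` and `retrieve` are hypothetical helpers, not the retriever's actual code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, indexed, top_k=2):
    """Return the top_k texts whose embeddings are most similar to the query embedding."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]


# Made-up embeddings: (text, vector) pairs
toy_index = [
    ("very central location", [0.9, 0.1, 0.0]),
    ("noisy at night", [0.1, 0.9, 0.1]),
    ("great breakfast", [0.0, 0.2, 0.9]),
]
toy_query = [0.8, 0.2, 0.1]  # pretend embedding of a question about location

print(retrieve(toy_query, toy_index, top_k=2))
```

Unlike keyword matching, the ranking depends only on how close the vectors are, so a question and a review can match without sharing any words.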

In [67]:
rag_pipeline = Pipeline()
rag_pipeline.add_component(instance=TextLanguageRouter(["en", "fr", "es"]), name="router")
rag_pipeline.add_component(instance=SentenceTransformersTextEmbedder("thenlper/gte-large"), name="en_embedder")
rag_pipeline.add_component(instance=SentenceTransformersTextEmbedder("thenlper/gte-large"), name="es_embedder")
rag_pipeline.add_component(instance=SentenceTransformersTextEmbedder("thenlper/gte-large"), name="fr_embedder")

rag_pipeline.add_component(instance=InMemoryEmbeddingRetriever(document_store=en_document_store), name="en_retriever")
rag_pipeline.add_component(instance=InMemoryEmbeddingRetriever(document_store=fr_document_store), name="fr_retriever")
rag_pipeline.add_component(instance=InMemoryEmbeddingRetriever(document_store=es_document_store), name="es_retriever")
rag_pipeline.add_component(instance=DocumentJoiner(), name="joiner")
rag_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder")
rag_pipeline.add_component(instance=OpenAIGenerator(), name="llm")

rag_pipeline.connect("router.en", "en_embedder")
rag_pipeline.connect("router.es", "es_embedder")
rag_pipeline.connect("router.fr", "fr_embedder")

rag_pipeline.connect("en_embedder.embedding", "en_retriever.query_embedding")
rag_pipeline.connect("es_embedder.embedding", "es_retriever.query_embedding")
rag_pipeline.connect("fr_embedder.embedding", "fr_retriever.query_embedding")

rag_pipeline.connect("en_retriever", "joiner")
rag_pipeline.connect("es_retriever", "joiner")
rag_pipeline.connect("fr_retriever", "joiner")
rag_pipeline.connect("joiner.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")

<haystack.core.pipeline.pipeline.Pipeline object at 0x74f174d6a5d0>
🚅 Components
  - router: TextLanguageRouter
  - en_embedder: SentenceTransformersTextEmbedder
  - es_embedder: SentenceTransformersTextEmbedder
  - fr_embedder: SentenceTransformersTextEmbedder
  - en_retriever: InMemoryEmbeddingRetriever
  - fr_retriever: InMemoryEmbeddingRetriever
  - es_retriever: InMemoryEmbeddingRetriever
  - joiner: DocumentJoiner
  - prompt_builder: PromptBuilder
  - llm: OpenAIGenerator
🛤️ Connections
  - router.en -> en_embedder.text (str)
  - router.es -> es_embedder.text (str)
  - router.fr -> fr_embedder.text (str)
  - en_embedder.embedding -> en_retriever.query_embedding (List[float])
  - es_embedder.embedding -> es_retriever.query_embedding (List[float])
  - fr_embedder.embedding -> fr_retriever.query_embedding (List[float])
  - en_retriever.documents -> joiner.documents (List[Document])
  - fr_retriever.documents -> joiner.documents (List[Document])
  - es_retriever.documents -> joiner.d

In [68]:
rag_pipeline.draw("rag_pipeline.png")

In [69]:
en_question = "Is this apartment conveniently located?"

result = rag_pipeline.run({
    "router": {
        "text": en_question  # the router is the pipeline's entry point and takes the raw query text
    },
    "prompt_builder": {
        "query": en_question  # fills the {{query}} variable in the prompt template
    }
})

Batches: 100%|██████████| 1/1 [00:00<00:00, 34.99it/s]


In [63]:
print(result["llm"]["replies"][0])

The accomodation is convenient.


In [74]:
es_question = "¿El desayuno es genial?" # Is the breakfast great?

result = rag_pipeline.run({"router": {"text": es_question}, "prompt_builder": {"query": es_question}})


Batches: 100%|██████████| 1/1 [00:00<00:00, 87.91it/s]


In [76]:
print(result["llm"]["replies"][0]) # Yes


Sí.
