## Preprocessing Different File Types

Note: A Hugging Face API token is required. Set it in the `HF_API_TOKEN` environment variable.

This section builds an indexing pipeline that preprocesses files according to their file type, using the `FileTypeRouter`.

The indexing pipeline handles three file types (Markdown, plain text and PDF), and each type has its own file converter.

After conversion, the rest of the pipeline trims extra whitespace, splits the documents into chunks, creates embeddings and writes them to a document store.

### Components Used
1. FileTypeRouter: routes files to different components based on their MIME type
2. MarkdownToDocument: converts Markdown files into Haystack Documents
3. PyPDFToDocument: converts PDF files into Haystack Documents
4. TextFileToDocument: converts text files into Haystack Documents
5. DocumentJoiner: joins documents coming from different branches of the pipeline
6. DocumentCleaner: makes Documents more readable by removing extra whitespace (optional)
7. DocumentSplitter: splits documents into chunks
8. SentenceTransformersDocumentEmbedder: creates embeddings for documents
9. DocumentWriter: writes documents into the DocumentStore
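
To make the cleaning step in the list above more concrete, here is a minimal sketch of whitespace cleanup in plain Python. It is an illustration only, not Haystack's `DocumentCleaner` implementation: the helper name `clean_text` and the exact rules (collapse runs of spaces and tabs, drop empty lines) are assumptions chosen for this example.

```python
import re


def clean_text(text: str) -> str:
    """Collapse runs of spaces/tabs and drop empty lines (hypothetical helper)."""
    # Normalize each line separately so line breaks between non-empty lines survive.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    # Keep only lines that still contain text after stripping.
    return "\n".join(line for line in lines if line)


raw = "  Vegan   flan \n\n\n  1 cup\tsugar  "
print(repr(clean_text(raw)))
```

The real component is configurable, so its output may differ from this sketch for the same input.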

### Downloading All Files
Download the sample files from Google Drive.

In [41]:
import gdown

url = "https://drive.google.com/drive/folders/1n9yqq5Gl_HWfND5bTlrCwAOycMDt5EMj"
output_dir = "recipe_files"

gdown.download_folder(url, quiet=True, output=output_dir)

['recipe_files/vegan_flan_recipe.md',
 'recipe_files/vegan_keto_eggplant_recipe.pdf',
 'recipe_files/vegan_sunflower_hemp_cheese_recipe.txt']

### Create a Pipeline to Index Documents
We use a different file converter class for each file type in our data sources: `.pdf`, `.txt` and `.md`.
The `FileTypeRouter` sends each file to the matching converter.
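
As a rough picture of what routing by MIME type means, the sketch below groups file paths by the MIME type guessed from their extension, using only the standard library. It is not the `FileTypeRouter` implementation: the function `route_by_mime_type`, the `"unclassified"` bucket name, and the file `shopping_list.xyz` are assumptions made up for this example.

```python
import mimetypes
from collections import defaultdict
from pathlib import Path

# Assumption: register Markdown explicitly, because not every Python version
# or system MIME database maps ".md" to "text/markdown".
mimetypes.add_type("text/markdown", ".md")


def route_by_mime_type(sources, mime_types):
    """Group paths by guessed MIME type; anything else lands in 'unclassified'."""
    routes = defaultdict(list)
    for source in sources:
        path = Path(source)
        mime_type, _encoding = mimetypes.guess_type(path.name)
        key = mime_type if mime_type in mime_types else "unclassified"
        routes[key].append(path)
    return dict(routes)


routes = route_by_mime_type(
    [
        "recipe_files/vegan_flan_recipe.md",
        "recipe_files/vegan_keto_eggplant_recipe.pdf",
        "recipe_files/vegan_sunflower_hemp_cheese_recipe.txt",
        "recipe_files/shopping_list.xyz",  # hypothetical file with an unknown type
    ],
    ["text/plain", "application/pdf", "text/markdown"],
)
for mime_type, paths in routes.items():
    print(mime_type, [p.name for p in paths])
```

Each key of `routes` plays the role of one router output, which is why the pipeline can connect a single MIME type to a single converter.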

In [42]:
from haystack.components.writers import DocumentWriter
from haystack.components.converters import MarkdownToDocument, PyPDFToDocument, TextFileToDocument
from haystack.components.preprocessors import DocumentSplitter, DocumentCleaner
from haystack.components.routers import FileTypeRouter
from haystack.components.joiners import DocumentJoiner
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore


document_store = InMemoryDocumentStore()
file_type_router = FileTypeRouter(mime_types=["text/plain", "application/pdf", "text/markdown"])
text_file_converter = TextFileToDocument()
markdown_converter = MarkdownToDocument()
pdf_converter = PyPDFToDocument()
document_joiner = DocumentJoiner()

In [43]:
# remove extra whitespace
document_cleaner = DocumentCleaner()
# split_overlap is the number of units (here, words) shared by consecutive chunks
# split_by can be, for example, word, passage, sentence or page
document_splitter = DocumentSplitter(split_by="word", split_length=150, split_overlap=50)
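
To illustrate how `split_length` and `split_overlap` interact, here is a minimal word-based splitter in plain Python. It is a sketch under assumptions, not the `DocumentSplitter` code: the helper `split_words` is hypothetical, and the real component may treat edge cases (such as a short trailing chunk) differently.

```python
def split_words(text, split_length=150, split_overlap=50):
    """Split text into chunks of `split_length` words, overlapping by `split_overlap` words."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current window has reached the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(400))
chunks = split_words(sample, split_length=150, split_overlap=50)
print([len(chunk.split()) for chunk in chunks])
```

With a length of 150 and an overlap of 50, each window starts 100 words after the previous one, so the last 50 words of a chunk reappear at the start of the next.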

In [44]:
document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
document_writer = DocumentWriter(document_store)

In [45]:
# Add all components to the indexing pipeline
preprocessing_pipeline = Pipeline()
preprocessing_pipeline.add_component(instance=file_type_router, name="file_type_router")
preprocessing_pipeline.add_component(instance=text_file_converter, name="text_file_converter")
preprocessing_pipeline.add_component(instance=markdown_converter, name="markdown_converter")
preprocessing_pipeline.add_component(instance=pdf_converter, name="pypdf_converter")
preprocessing_pipeline.add_component(instance=document_joiner, name="document_joiner")
preprocessing_pipeline.add_component(instance=document_cleaner, name="document_cleaner")
preprocessing_pipeline.add_component(instance=document_splitter, name="document_splitter")
preprocessing_pipeline.add_component(instance=document_embedder, name="document_embedder")
preprocessing_pipeline.add_component(instance=document_writer, name="document_writer")

In [46]:
# connect the components
# each MIME type output of the router feeds the `sources` input of the matching converter
preprocessing_pipeline.connect("file_type_router.text/plain", "text_file_converter.sources")
preprocessing_pipeline.connect("file_type_router.application/pdf", "pypdf_converter.sources")
preprocessing_pipeline.connect("file_type_router.text/markdown", "markdown_converter.sources")

# send the output of every converter to the document joiner
preprocessing_pipeline.connect("text_file_converter", "document_joiner")
preprocessing_pipeline.connect("pypdf_converter", "document_joiner")
preprocessing_pipeline.connect("markdown_converter", "document_joiner")

# connect the document joiner to the document cleaner to remove extra whitespace
# all output of the joiner is sent to the document cleaner for cleaning
preprocessing_pipeline.connect("document_joiner", "document_cleaner")

# the output of the cleaner should be sent to the splitter
preprocessing_pipeline.connect("document_cleaner", "document_splitter")
# output of the splitter should be sent to the embedder
preprocessing_pipeline.connect("document_splitter", "document_embedder")
# output of the embedder should be sent to the writer
preprocessing_pipeline.connect("document_embedder", "document_writer")

<haystack.core.pipeline.pipeline.Pipeline object at 0x70d22803b050>
🚅 Components
  - file_type_router: FileTypeRouter
  - text_file_converter: TextFileToDocument
  - markdown_converter: MarkdownToDocument
  - pypdf_converter: PyPDFToDocument
  - document_joiner: DocumentJoiner
  - document_cleaner: DocumentCleaner
  - document_splitter: DocumentSplitter
  - document_embedder: SentenceTransformersDocumentEmbedder
  - document_writer: DocumentWriter
🛤️ Connections
  - file_type_router.text/plain -> text_file_converter.sources (List[Path])
  - file_type_router.application/pdf -> pypdf_converter.sources (List[Path])
  - file_type_router.text/markdown -> markdown_converter.sources (List[Path])
  - text_file_converter.documents -> document_joiner.documents (List[Document])
  - markdown_converter.documents -> document_joiner.documents (List[Document])
  - pypdf_converter.documents -> document_joiner.documents (List[Document])
  - document_joiner.documents -> document_cleaner.documents (List[D

### Try it out!
Running the pipeline converts, cleans, splits and embeds the files, then writes the resulting document chunks (with their embeddings) to the document store.

In [47]:
from pathlib import Path

result = preprocessing_pipeline.run(
    {
        "file_type_router": {
            "sources": list(Path(output_dir).glob("**/*"))
        }
    }
)



Converting markdown files to Documents: 100%|██████████| 1/1 [00:00<00:00,  7.76it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 48.10it/s]


In [48]:
f"Number of documents in store {document_store.count_documents()}"

'Number of documents in store 7'

## Build a Pipeline to Query the Documents

In [49]:
# Load the Hugging Face API token

import os
from dotenv import load_dotenv

load_dotenv()

if not os.getenv("HF_API_TOKEN"):
    raise ValueError("HuggingFace API token is required")

In [50]:
# The pipeline takes the question, searches the document store for relevant documents, and passes those documents to the LLM to formulate an answer
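
The "search the document store" step can be pictured as ranking stored embeddings by similarity to the query embedding. The sketch below does this in plain Python with cosine similarity. Everything in it is an assumption for illustration: the helper names, the choice of cosine similarity, and the made-up 3-dimensional vectors. It is not how `InMemoryEmbeddingRetriever` is implemented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, documents, top_k=2):
    """Return the contents of the `top_k` documents most similar to the query."""
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_embedding, doc[1]),
        reverse=True,
    )
    return [content for content, _embedding in ranked[:top_k]]


# Toy (content, embedding) pairs; real embeddings have hundreds of dimensions.
toy_documents = [
    ("eggplant lasagna ingredients", [0.9, 0.1, 0.0]),
    ("persimmon flan ingredients", [0.1, 0.9, 0.0]),
    ("hemp cheese ingredients", [0.0, 0.1, 0.9]),
]
top = retrieve([0.8, 0.2, 0.1], toy_documents, top_k=2)
print(top)
```

In the actual pipeline the query embedding comes from the text embedder, and the retrieved documents are handed to the prompt builder.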

In [53]:
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import HuggingFaceAPIGenerator


template = """
Answer the questions based on the given context.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{ question }}
Answer:
"""

pipe = Pipeline()
pipe.add_component("embedder", SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"))
pipe.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store))
pipe.add_component("prompt_builder", PromptBuilder(template=template))
pipe.add_component("llm", HuggingFaceAPIGenerator(api_type="serverless_inference_api", api_params={"model": "HuggingFaceH4/zephyr-7b-beta"}))


# connect the components
# the output of the embedder goes into the query_embedding input of the retriever
# NOTE: the text embedder exposes its output under the name 'embedding' (a single vector for the query)
pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")

<haystack.core.pipeline.pipeline.Pipeline object at 0x70d238049e10>
🚅 Components
  - embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - llm: HuggingFaceAPIGenerator
🛤️ Connections
  - embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.prompt (str)
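
To show what the prompt builder hands to the LLM, the sketch below assembles a prompt with the same shape as the template, using plain string formatting as a stand-in for Jinja rendering. The function `build_prompt` is hypothetical, and the whitespace will not match the rendered template exactly.

```python
def build_prompt(documents, question):
    """Assemble a prompt from retrieved document contents and the user question."""
    # One indented line per document, mirroring the template's for-loop.
    context = "\n".join(f"    {content}" for content in documents)
    return (
        "Answer the questions based on the given context.\n\n"
        "Context:\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    ["first retrieved chunk", "second retrieved chunk"],  # placeholder contents
    "What ingredients do I need?",
)
print(prompt)
```

The LLM only sees this final string, so the quality of the answer depends on which chunks the retriever placed in the context.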

In [54]:
# Ask Questions
question = "What ingredients would I need to make vegan keto eggplant lasagna, vegan persimmon flan, and vegan hemp cheese?"
pipe.run({
    "embedder": { "text": question },
    "prompt_builder": { "question": question },
    "llm": { "generation_kwargs": { "max_new_tokens": 350 } }
})

Batches: 100%|██████████| 1/1 [00:00<00:00, 28.94it/s]


{'llm': {'replies': ["\n\nVegan Keto Eggplant Lasagna:\n\nIngredients:\n- 2 large eggplants\n- A lot of salt (you should have this in your house already)\n- 1/2 cup store-bought vegan mozzarella (for topping)\n\nPesto:\n- 4 oz basil (generally one large clamshell or 2 small ones)\n- 1/4 cup almonds\n- 1/4 cup nutritional yeast\n- 1/4 cup olive oil\n- 1 recipe vegan pesto (you can find this in the recipe)\n- 1 recipe spinach tofu ricotta (you can find this in the recipe)\n- 1 tsp garlic powder\n- Juice of half a lemon\n- Salt to taste\n\nSpinach Tofu Ricotta:\n- 10 oz firm or extra firm tofu\n- Juice of 1 lemon\n- Garlic powder to taste\n- Salt to taste\n\nInstructions:\n1. Slice the eggplants into 1/4 inch thick slices. Some slices will need to be scrapped because it's difficult to get them all uniformly thin. Use them in soup or something, IDK, man.\n2. Take the eggplant slices and rub both sides with salt. Don't be shy about how much, you're gonna rinse it off anyway.\n3. Put them in