## Multimodal‑Nutrition Agent — https://haystack.deepset.ai/blog/multimodal-nutrition-agent?utm_source=substack&utm_medium=email
### 0  Current Status
- **Vision still pending; not implemented in this notebook**  
  All tests so far were **text‑only**: we retrieved the label *captions* and reasoned with small CPU models (TinyLlama, Phi‑2).  
  The image side (`MultiModalPromptBuilder` + vision LLM such as **Phi‑3.5 Vision**) will be plugged in once GPU inference is enabled.

### 1  Motivation
- **Problem**  
  Nutrition questions (e.g. *“Which snack has more protein?”*) require reasoning over both the **image** (nutrition‐label photo) and **text** (user query).  
- **Goal**  
  Build a lightweight *Multimodal Nutrition Agent* that can  
  1. **Store** nutrition‑label images with short captions,  
  2. **Retrieve** the relevant label for a query, and  
  3. **Reason** over that image/text with a vision‑capable LLM, replying in plain English.  
- **Constraints**  
  Runs locally, works on CPU first, later pluggable into your Postgres + pgvector recipe stack.

---

### 2  Inspiration – deepset Haystack Blog
| Stage | What the article shows | Why it matters to our build |
|-------|------------------------|-----------------------------|
| **Data prep** | nutrition‑label images in JSON → `Document(content, meta)` | Same structure; we can swap in any JPG/PNG later. |
| **Indexing** | `SentenceTransformersDocumentEmbedder` → `InMemoryDocumentStore` | Identical flow (we'll point to pgvector later). |
| **Retrieval pipeline** | User query → text embedding → top‑1 label → `MultiModalPromptBuilder` injects Base‑64 image into prompt | Ready‑made component for mixing image + text. |
| **Tool wrapper** | Expose retrieval as `DocWithImageHaystackQueryTool` | Lets an **agent** call the pipeline only when needed. |
| **Generator** | `Phi35VisionHFGenerator` (4 B) | We started with TinyLlama (CPU) and can upgrade to Phi‑3.5‑Vision once GPU is enabled. |
| **Agent prompt** | ReAct template (Thought → Action → Observation → Final Answer) | Matches the tool‑calling style you already use. |
| **Examples** | Single‑hop (“How much fat…?”) and multi‑hop comparison | Confirms the agent can chain tool outputs and reason. |
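
The ReAct row above can be made concrete with a small, dependency-free loop. This is a sketch under stated assumptions, not fastRAG's implementation: `parse_step`, `run_react`, the scripted stand-in for the LLM, and the toy `LABELS` table are hypothetical names introduced for illustration, and the yogurt value is a placeholder rather than data from the demo file.

```python
import json
import re

# Hypothetical label store; the yogurt line is a placeholder value.
LABELS = {
    "protein bar": "Protein bar: 12 g protein, 8 g fat, 23 g carbs, 200 kcal",
    "yogurt": "Yogurt: 8 g protein (placeholder)",
}

ACTION_RE = re.compile(r"Action:\s*(\w+)\((\{.*?\})\)", re.DOTALL)
FINAL_RE = re.compile(r"Final Answer:\s*(.*)", re.DOTALL)


def nutrition_tool(text_query):
    """Toy tool: return the first label whose name appears in the query."""
    for name, label in LABELS.items():
        if name in text_query.lower():
            return label
    return "No label found."


def parse_step(output):
    """Classify one model reply as a tool call, a final answer, or neither."""
    action = ACTION_RE.search(output)
    if action:
        return ("action", action.group(1), json.loads(action.group(2)))
    final = FINAL_RE.search(output)
    if final:
        return ("final", final.group(1).strip(), None)
    return ("unparsed", output.strip(), None)


def run_react(llm, tools, question, max_steps=4):
    """Alternate model replies and tool observations until a final answer."""
    transcript = f"User: {question}\n"
    for _ in range(max_steps):
        output = llm(transcript)
        transcript += output + "\n"
        kind, value, args = parse_step(output)
        if kind == "final":
            return value, transcript
        if kind != "action":
            break  # the model ignored the format; stop rather than guess
        observation = tools[value](**args)
        transcript += f"Observation: {observation}\n"
    return None, transcript


# Scripted stand-in for the LLM so the loop runs without a model.
_replies = iter([
    'Thought: I need the protein bar label.\n'
    'Action: nutrition_tool({"text_query": "protein bar"})',
    'Thought: Now the yogurt label.\n'
    'Action: nutrition_tool({"text_query": "yogurt"})',
    'Thought: 12 g is more than 8 g.\n'
    'Final Answer: The protein bar has more protein.',
])


def scripted_llm(_transcript):
    return next(_replies)


answer, transcript = run_react(
    scripted_llm,
    {"nutrition_tool": nutrition_tool},
    "Which has more protein, the protein bar or the yogurt?",
)
print(answer)
```

The multi-hop case is just two tool calls before the final answer; a reply that matches neither pattern ends the loop with `None`, which is the situation a small model can easily produce.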

---

### 3  Implementation References
| Area | Minimal component / reference |
|------|------------------------------|
| **Multimodal prompting** | *Li et al., 2021* “Align before Fuse” (vision‑language representation learning) |
| **Vision LLMs (open)** | Phi‑3.5 Vision • LLaVA‑1.5 • BLIP‑2 |
| **Agent framework** | [fastRAG 3.x](https://github.com/IntelLabs/fastRAG) – ReAct agent & tools |
| **Vector search** | `pgvector` + Postgres 16; fallback: `InMemoryDocumentStore` |
| **Sentence embeddings** | `sentence-transformers/all-MiniLM-L6-v2` (384‑d) |
| **Dataset sources** | Blog JSON sample; USDA Branded Foods (text, can add images) |
| **Prompt‑engineering** | ReAct (*Yao et al., 2023*) – reasoning + acting loop |
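
For the vector-search row, the ranking step is simple enough to sketch in plain Python. The 3-d vectors and item names below are made up for illustration (the real embeddings are 384-d), and `cosine` and `top_k` are hypothetical helper names. In pgvector the equivalent ordering would use the cosine distance operator `<=>`.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=1):
    """Return the k item names whose vectors are most similar to the query."""
    ranked = sorted(
        store.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy 3-d "embeddings"; values are illustrative only.
toy_store = {
    "protein bar": [0.9, 0.1, 0.0],
    "yogurt": [0.1, 0.9, 0.1],
    "granola": [0.4, 0.4, 0.6],
}

best = top_k([0.8, 0.2, 0.1], toy_store, k=2)
print(best)
```

With `k=1` this mirrors the top‑1 retrieval used later in the notebook.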


In [1]:
import json, pathlib, os, colorama
colorama.init(strip=True)
from haystack import Pipeline, Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.components.generators import HuggingFaceLocalGenerator

from fastrag.prompt_builders.multi_modal_prompt_builder import MultiModalPromptBuilder
from fastrag.agents.tools.tools import DocWithImageHaystackQueryTool
from fastrag.agents.base import Agent, ToolsManager
from fastrag.agents.create_agent import ConversationMemory

from transformers import AutoTokenizer, TextIteratorStreamer



In [2]:
demo_path = pathlib.Path("nutrition_demo.json")
entries = json.loads(demo_path.read_text(encoding="utf-8"))
docs = [Document(content=e["content"], meta=e) for e in entries]
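
The loading cell only relies on each entry having a `content` key, so it can help to screen the JSON before building `Document` objects. The sketch below assumes a hypothetical file shape: `title` and `image_url` are illustrative field names, not confirmed by the demo file, and `split_valid` is a helper introduced here.

```python
import json

# Hypothetical sample in the assumed shape; only "content" is known to be required.
SAMPLE = """[
  {"title": "Protein bar",
   "content": "Protein bar: 12 g protein, 8 g fat, 23 g carbs, 200 kcal",
   "image_url": "images/protein_bar.jpg"},
  {"title": "Broken entry", "image_url": "images/unknown.jpg"}
]"""


def split_valid(entries):
    """Separate entries with a non-empty string caption from the rest."""
    valid, rejected = [], []
    for index, entry in enumerate(entries):
        content = entry.get("content")
        if isinstance(content, str) and content.strip():
            valid.append(entry)
        else:
            rejected.append(index)  # keep the position for error reporting
    return valid, rejected


valid_entries, rejected_indices = split_valid(json.loads(SAMPLE))
print(len(valid_entries), rejected_indices)
```

Entries that fail the check would otherwise raise a `KeyError` in the list comprehension above.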

In [3]:
store = InMemoryDocumentStore()
index = Pipeline()
index.add_component(
    "embed",
    SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"),
)
index.add_component("writer", DocumentWriter(document_store=store))
index.connect("embed.documents", "writer.documents")
index.run({"embed": {"documents": docs}})

Batches: 100%|██████████| 1/1 [00:00<00:00, 39.13it/s]


{'writer': {'documents_written': 6}}

In [4]:
template = "Label: {{ documents[0].content }}"
retrieval = Pipeline()
retrieval.add_component("q_emb", SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"))
retrieval.add_component("ret", InMemoryEmbeddingRetriever(
    document_store=store, top_k=1))
retrieval.add_component("prompt", MultiModalPromptBuilder(template=template))
retrieval.connect("q_emb.embedding", "ret.query_embedding")
retrieval.connect("ret", "prompt.documents")

PromptBuilder has 1 prompt variables, but `required_variables` is not set. By default, all prompt variables are treated as optional, which may lead to unintended behavior in multi-branch pipelines. To avoid unexpected execution, ensure that variables intended to be required are explicitly set in `required_variables`.


<haystack.core.pipeline.pipeline.Pipeline object at 0x0000023281BD4B90>
🚅 Components
  - q_emb: SentenceTransformersTextEmbedder
  - ret: InMemoryEmbeddingRetriever
  - prompt: MultiModalPromptBuilder
🛤️ Connections
  - q_emb.embedding -> ret.query_embedding (List[float])
  - ret.documents -> prompt.documents (List[Document])

In [5]:
nutrition_tool = DocWithImageHaystackQueryTool(
    name="nutrition_tool",
    description="Retrieve the most relevant nutrition label text",
    pipeline_or_yaml_file=retrieval)

In [6]:

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

generator = HuggingFaceLocalGenerator(
    model=model_name,
    task="text-generation",
    generation_kwargs={"max_new_tokens": 160, "temperature": 0.2, "num_beams": 1},
)
generator.warm_up() 

Device set to use cpu


In [9]:
tokenizer = generator.pipeline.tokenizer          
dummy_streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
generator.generation_kwargs["streamer"] = dummy_streamer
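
The agent prompt in the next cell goes through two rounds of brace processing: once when Python evaluates the f-string, and (assuming the agent renders its system prompt with `str.format`, which the `{query}` and `{{tool_names_with_descriptions}}` placeholders suggest) once more at run time. The snippet below shows how many braces survive each route; the variable names are illustrative only.

```python
# A plain string keeps every brace until something formats it.
plain_four = 'nutrition_tool({{{{"text_query": "..."}}}})'
plain_two = 'nutrition_tool({{"text_query": "..."}})'

# Route A: plain string with 4 braces, interpolated into an f-string.
template_a = f"{plain_four} | {{tools}}"
# Route B: 4 braces written directly inside the f-string literal.
template_b = f'nutrition_tool({{{{"text_query": "..."}}}}) | {{tools}}'
# Route C: plain string with 2 braces, interpolated into an f-string.
template_c = f"{plain_two} | {{tools}}"

# Second round: the run-time str.format call fills {tools}.
rendered_a = template_a.format(tools="T")
rendered_b = template_b.format(tools="T")
rendered_c = template_c.format(tools="T")

print(rendered_a)  # doubled braces remain
print(rendered_b)  # single braces
print(rendered_c)  # single braces
```

Interpolated values are inserted as-is by the f-string, so a plain string needs one level of escaping fewer than text written inside the f-string literal.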

In [14]:
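# one_shot is a plain string that the f-string below interpolates as-is, so
# doubled braces are enough to end up as single braces, assuming the agent
# later renders the system prompt with str.format().
one_shot = """
### EXAMPLE
User: How much protein is in the protein bar?
Thought: I need to look up the label.
Action: nutrition_tool({{"text_query": "protein bar protein grams"}})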
Observation: Label: Protein bar: 12 g protein, 8 g fat, 23 g carbs, 200 kcal
Thought: I have the information.
Final Answer: The protein bar contains 12 g of protein.
### END EXAMPLE
"""

agent_prompt = f"""You are a helpful nutrition assistant.
You may call tools to look up nutrition labels.

TOOLS:
{{tool_names_with_descriptions}}

{one_shot}

RESPONSE FORMAT
Thought:
Action: nutrition_tool({{{{"text_query": "..."}}}})
Observation:
... (repeat) ...
Final Answer: the answer to the user's question

Begin!
"""


# Build the agent (re‑use the nutrition_tool we already defined)
agent = Agent(
    generator,
    prompt_template={
        "system": [{"role": "system", "content": agent_prompt}],
        "chat":   [{"role": "user",   "content": "{query}"}],
    },
    tools_manager=ToolsManager([nutrition_tool]),
    memory=ConversationMemory(generator=generator),
    final_answer_pattern=r"Final Answer:\s*(.*)",
    streaming=False  
)

# Run a test question
result = agent.run("Which has more protein, the protein bar or the yogurt?")
print(result["transcript"])
print("\nFinal answer\n", result.get("final_answer", "⟨missing⟩"))


Agent Agent started with {'query': 'Which has more protein, the protein bar or the yogurt?', 'params': None}
The protein bar has more protein than the yogurt. A 100g serving of protein bar contains 12g of protein, while a 100g serving of plain Greek yogurt contains 8g of protein.
The protein bar has more protein than the yogurt. A 100g serving of protein bar contains 12g of protein, while a 100g serving of plain Greek yogurt contains 8g of protein.

Final answer
 ⟨missing⟩
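
One plausible reading of this run: the model answered directly, without the `Final Answer:` prefix, so the configured pattern had nothing to match. The sketch below reproduces that with the same regular expression and shows one possible fallback; `extract_final_answer` is a hypothetical helper, not part of fastRAG.

```python
import re

FINAL_ANSWER_PATTERN = re.compile(r"Final Answer:\s*(.*)")


def extract_final_answer(reply, fallback_to_raw=True):
    """Return the text after 'Final Answer:', or optionally the raw reply."""
    match = FINAL_ANSWER_PATTERN.search(reply)
    if match:
        return match.group(1).strip()
    raw = reply.strip()
    return raw if fallback_to_raw and raw else None


# The reply printed in the run above: no ReAct scaffold at all.
observed = (
    "The protein bar has more protein than the yogurt. A 100g serving of "
    "protein bar contains 12g of protein, while a 100g serving of plain "
    "Greek yogurt contains 8g of protein."
)

strict = extract_final_answer(observed, fallback_to_raw=False)
lenient = extract_final_answer(observed)
formatted = extract_final_answer("Thought: done.\nFinal Answer: 12 g of protein.")
print(strict, "|", formatted)
```

The fallback only recovers the text. The reply also never called `nutrition_tool`, so its numbers were not grounded in a retrieved label.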
