# Building a RAG Pipeline over IKEA Product Instruction Manuals

<a href="https://colab.research.google.com/github/run-llama/llama_cloud_services/blob/main/examples/parse/multimodal/product_manual_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This cookbook shows how to use LlamaParse and OpenAI's multimodal models to query IKEA instruction manual PDFs, which consist mainly of images and diagrams showing how to assemble each product.

LlamaParse and multimodal LLMs can interpret these diagrams and translate them into textual instructions. Text makes confusing visual steps in the manuals easier to follow, and it makes the manuals more accessible to people who are visually impaired.

Status:

| Last Executed | Version | State      |
|---------------|---------|------------|
| Aug-20-2025   | 0.6.61  | Maintained |

## Install and Setup

Install LlamaIndex, download the data, and configure the API keys.

In [None]:
%pip install "llama-index>=0.13.0,<0.14.0" llama-cloud-services

In [None]:
!wget https://github.com/user-attachments/files/16461058/data.zip -O data.zip
!unzip -o data.zip
!rm data.zip

Set up your OpenAI and LlamaCloud keys.

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."
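
Hardcoding keys in a notebook makes them easy to leak when the notebook is shared. The sketch below is one possible alternative, assuming the keys may already be exported in your shell; `ensure_key` is a hypothetical helper written for this example and is not part of either SDK.

```python
import os
from getpass import getpass


def ensure_key(name: str, prompt=getpass) -> str:
    """Return the value of env var `name`, prompting once if it is unset."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Hypothetical fallback: ask interactively instead of hardcoding the key.
        value = prompt(f"Enter {name}: ").strip()
        if not value:
            raise ValueError(f"{name} is required")
        os.environ[name] = value
    return value
```

Calling `ensure_key("OPENAI_API_KEY")` and `ensure_key("LLAMA_CLOUD_API_KEY")` would then replace the two assignments above.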

## Code Implementation

Configure the LlamaParse parser.

In [None]:
from llama_cloud_services import LlamaParse

parser = LlamaParse(
    parse_mode="parse_page_with_agent",
    model="openai-gpt-4-1-mini",
    high_res_ocr=True,
    outlined_table_extraction=True,
    output_tables_as_HTML=True,
)

In [None]:
DATA_DIR = "data"


def get_data_files(data_dir=DATA_DIR) -> list[str]:
    files = []
    for f in os.listdir(data_dir):
        fname = os.path.join(data_dir, f)
        if os.path.isfile(fname):
            files.append(fname)
    return files


files = get_data_files()
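
`get_data_files` returns every regular file in the directory, in whatever order the operating system lists them. If the data directory might contain non-PDF files, a stricter variant could look like the sketch below; `get_pdf_files` is a hypothetical name, and filtering on the `.pdf` suffix is an assumption about what the directory holds.

```python
from pathlib import Path


def get_pdf_files(data_dir: str = "data") -> list[str]:
    """Return sorted paths of the PDF files directly inside `data_dir`."""
    return sorted(
        str(p)
        for p in Path(data_dir).iterdir()
        # Skip subdirectories and anything without a .pdf suffix (case-insensitive).
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Sorting keeps the file order stable across runs, which makes the parsing logs easier to compare.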

Parse the files. The cell after this one splits each result into per-page text nodes and saves page screenshots from the PDFs into the `data_images` directory.

In [None]:
results = await parser.aparse(files)

Getting job results:   0%|          | 0/5 [00:00<?, ?it/s]

Started parsing the file under job_id 0d3de1c0-e4c6-4cca-9e85-b738b301119a
Started parsing the file under job_id 48ef73aa-fe6b-4e67-a4c0-ebe5d1fc532c
Started parsing the file under job_id 71cdf344-d4c1-40ca-812c-3ada19aeca5a
Started parsing the file under job_id 747a4847-7971-4e3b-87c5-6ce93a05c260


Getting job results:  20%|██        | 1/5 [00:14<00:58, 14.62s/it]

Started parsing the file under job_id a2a9fd6a-fa25-4410-8ccc-9da7d38e1590


Getting job results: 100%|██████████| 5/5 [00:38<00:00,  7.78s/it]


In [None]:
all_text_nodes = []

for result in results:
    text_nodes = result.get_markdown_nodes(split_by_page=True)
    image_nodes = await result.aget_image_nodes(
        include_object_images=False,
        include_screenshot_images=True,
        image_download_dir="./data_images",
    )

    for text_node, image_node in zip(text_nodes, image_nodes):
        text_node.metadata["image_path"] = image_node.image_path
        all_text_nodes.append(text_node)
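
The loop above pairs text nodes and image nodes with `zip`, which silently stops at the shorter list. If a document ever yields a different number of text pages and screenshots, the remaining pages are dropped without warning. A minimal guard, assuming both lists are already in page order (an assumption about the parser output that this sketch does not verify), might look like this; `pair_pages` is a hypothetical helper.

```python
def pair_pages(text_nodes, image_nodes):
    """Pair page-level text nodes with page screenshots, refusing to truncate silently."""
    if len(text_nodes) != len(image_nodes):
        raise ValueError(
            f"page count mismatch: {len(text_nodes)} text nodes vs "
            f"{len(image_nodes)} image nodes"
        )
    # Lengths match, so zip pairs every element exactly once.
    return list(zip(text_nodes, image_nodes))
```

In the loop above, `zip(text_nodes, image_nodes)` could be swapped for `pair_pages(text_nodes, image_nodes)` to surface mismatches early.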

Index the documents.

In [None]:
from llama_index.core import (
    VectorStoreIndex,
    Settings,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

embed_model = OpenAIEmbedding(model="text-embedding-3-large")
llm = OpenAI(model="gpt-5-mini")

Settings.llm = llm
Settings.embed_model = embed_model

index = VectorStoreIndex(nodes=all_text_nodes)

Create a custom query engine that uses OpenAI for multimodal response generation.

In [None]:
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import MetadataMode
from llama_index.core.base.response.schema import Response
from llama_index.core.llms import ChatMessage, TextBlock, ImageBlock


qa_prompt_block_text = """\
Below we give parsed text from slides in two different formats, as well as the image.

---------------------
{context_str}
---------------------
"""

image_prefix_block = TextBlock(text="And here are the corresponding images per page\n")

image_suffix = """\
Given the context information and not prior knowledge, answer the query. Explain whether you got the answer
from the parsed markdown or raw text or image, and if there's discrepancies, and your reasoning for the final answer.

Query: {query_str}
Answer: """


class MultimodalQueryEngine(CustomQueryEngine):
    """Custom multimodal Query Engine.

    Takes in a retriever to retrieve a set of document nodes and respond using an LLM + retrieved text/images.

    """

    retriever: BaseRetriever
    llm: OpenAI

    def __init__(self, **kwargs) -> None:
        """Initialize."""
        super().__init__(**kwargs)

    def custom_query(self, query_str: str):
        # retrieve text nodes
        nodes = self.retriever.retrieve(query_str)
        # create ImageBlock items from the image paths stored in node metadata
        image_blocks = [
            ImageBlock(path=n.metadata["image_path"])
            for n in nodes
            if n.metadata.get("image_path")
        ]

        # create context string from text nodes, dump into the prompt
        context_str = "\n\n".join(
            [r.get_content(metadata_mode=MetadataMode.LLM) for r in nodes]
        )

        formatted_msg = ChatMessage(
            role="user",
            blocks=[
                TextBlock(text=qa_prompt_block_text.format(context_str=context_str)),
                image_prefix_block,
                *image_blocks,
                TextBlock(text=image_suffix.format(query_str=query_str)),
            ],
        )

        # synthesize an answer from formatted text and images
        llm_response = self.llm.chat([formatted_msg])

        return Response(
            response=str(llm_response.message.content),
            source_nodes=nodes,
        )
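
`custom_query` builds one `ImageBlock` per retrieved node that has an `image_path`. If two retrieved nodes point at the same screenshot, or a file has been deleted from `data_images`, it may help to filter the paths first. The sketch below is one way to do that; `usable_image_paths` is a hypothetical helper, and it only assumes each node exposes a `metadata` dict, as the engine above already does.

```python
import os


def usable_image_paths(nodes) -> list[str]:
    """Collect unique, existing image paths from retrieved nodes, preserving order."""
    seen = set()
    paths = []
    for n in nodes:
        path = n.metadata.get("image_path")
        # Keep a path only once, and only if the file is still on disk.
        if path and path not in seen and os.path.isfile(path):
            seen.add(path)
            paths.append(path)
    return paths
```

Inside `custom_query`, the list comprehension could then become `[ImageBlock(path=p) for p in usable_image_paths(nodes)]`.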

Create a query engine instance.

In [None]:
query_engine = MultimodalQueryEngine(
    retriever=index.as_retriever(similarity_top_k=3),
    llm=llm,
)
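
The `similarity_top_k=3` argument asks the retriever for the three nodes whose embeddings are most similar to the query embedding. The sketch below illustrates that idea with plain cosine similarity on toy vectors. It is a conceptual illustration only, not LlamaIndex's implementation, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k most similar vectors, best first."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

A larger `k` gives the LLM more pages and images to reason over, at the cost of a longer and more expensive prompt.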


## Example Queries

In [None]:
from IPython.display import display, Markdown

response = query_engine.query("What parts are included in the Uppspel?")
display(Markdown(str(response)))

Answer (parts included in the UPPSPEL kit)

I read the parts inventory diagram (image of the parts page). The parsed slide text only mentioned caster wheels and clips in the assembly steps, so the full parts list came from the image. The image is clear but some small part numbers are tiny; below I list the parts, quantities and the part numbers that are visible.

- 2x long screws (107603)  
- 6x large screws/dowels (100214)  
- 5x cam screws / binding-post screws (118331)  
- 12x threaded connector dowels / cross dowels (100498)  
- 4x cylindrical spacers (106986)  
- 2x ribbed wooden dowels (101350)  
- 4x small screws (100413)  
- 4x hex/Allen-head screws (100181)  
- 2x wall plugs (111322)  
- 2x short screws (109067)  
- 12x small wood screws (109560)  
- 17x cam lock nuts (102534)  
- 4x oval/cover caps (135049 / FRE001)  
- 2x metal brackets / wall-mount plates (128985)  
- 4x mushroom-shaped plastic pegs / feet (128409 / 128303)  
- 1x small Allen key (100001)  
- 2x larger Allen keys (108490)  
- 2x round shallow plastic bowls (123602 / 123603)  
- 2x round deeper plastic bowls (126873 / FRE002)

Notes / discrepancies:
- The parsed text (markdown) included only partial info (mentions of caster wheels and clips) and did not contain the full inventory. The complete inventory above was taken from the parts-diagram image.  
- Some part numbers on the image are very small and I transcribed them as best as they appear; a few numbers may be slightly off due to image resolution.

In [None]:
response = query_engine.query("What does the Tuffing look like?")
display(Markdown(str(response)))

Answer: According to the parsed page text, the Tuffing is depicted as a bunk bed — a simple metal‑frame bunk with safety rails on the top bunk and a ladder in the middle (IKEA logo at the bottom right).

Where I got this:
- Primary source for the description: the parsed markdown/alt‑text for page 1, which explicitly describes the bunk bed.

Discrepancies / notes:
- The actual image shown in the attached files (the large drawing with the big FREDDE title) is a different IKEA product (a desk with raised shelves), not the bunk bed described in the parsed text. Page 18’s parsed text shows a person fitting a fabric/mesh over a rectangular frame, and page 37 is a blank/credits page. Because the visual files and the parsed descriptions conflict, I relied on the parsed markdown description for the answer but there is uncertainty — the raw image content does not match that description.

In [None]:
response = query_engine.query("What is step 4 of assembling the Nordli?")
display(Markdown(str(response)))

Step 4: Use 4x screws (part numbers 118331 and 112996) to attach the two panels as shown. Insert the screws into the indicated holes and tighten with a screwdriver.

Source and notes:
- This answer comes from the parsed text for page 6 (the raw parsed instructions).
- The accompanying image for page 6, however, shows a close-up of inserting/rotating a cylindrical cam/dowel (labelled 106986), which doesn't visually match the parsed text's described screws/part numbers. Because you asked me to use only the provided context, I reported the parsed-text instruction as step 4 and noted the image/text discrepancy above.

In [None]:
response = query_engine.query(
    "What should I do if I'm confused with reading the manual?"
)
display(Markdown(str(response)))

Answer: Call IKEA for help (use the phone number on the manual or contact your local IKEA store).

Source & reasoning: I read the parsed page text and inspected the image. Both show a confused person with a question mark, then a second panel of a person on the phone holding the instructions with an IKEA store in the background — indicating you should call IKEA. The three parsed variants (smagora, tuffing, uppspel) and the raw image all agree on this instruction, so there are no meaningful discrepancies.
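
Several of the answers above report that the parsed text and the attached image disagree. One way to investigate is to list which screenshots were retrieved for a query. The sketch below is a small hypothetical helper; it relies only on the `score` and `metadata` attributes of the nodes returned in `response.source_nodes`, and on the `image_path` metadata key set earlier in this notebook.

```python
def summarize_sources(source_nodes):
    """Return (score, image_path) rows for the nodes behind an answer."""
    rows = []
    for n in source_nodes:
        # image_path is None when a node was stored without a screenshot.
        rows.append((n.score, n.metadata.get("image_path")))
    return rows
```

For example, `for score, path in summarize_sources(response.source_nodes): print(score, path)` prints one line per retrieved page, which can then be compared against the files in `data_images`.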