# Multimodal Demo

In [4]:
import os
# Placeholder only: supply your own key and never commit a real one to a notebook.
os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_API_KEY>"

In [5]:
from llama_index import SimpleDirectoryReader, GPTSimpleVectorIndex
from llama_index.readers.file.base import (
    DEFAULT_FILE_EXTRACTOR, 
    ImageParser,
)
from llama_index.response.notebook_utils import (
    display_response,
    display_image,
)
from llama_index.indices.query.query_transform.base import (
    ImageOutputQueryTransform,
)

In [6]:
# NOTE: By default, the image parser converts an image into text and discards the original image.
#       Here, we explicitly keep both the original image and the parsed text in an image document.
image_parser = ImageParser(keep_image=True, parse_text=True)
# Copy the defaults so that the module-level DEFAULT_FILE_EXTRACTOR is not mutated.
file_extractor = dict(DEFAULT_FILE_EXTRACTOR)
file_extractor.update(
{
    ".jpg": image_parser,
    ".png": image_parser,
    ".jpeg": image_parser,
})

# NOTE: we add filename as metadata for all documents
filename_fn = lambda filename: {'file_name': filename}

## Q&A over Receipt Images

We first ingest our receipt images with the *custom* `image parser` and `metadata function` defined above.   
This gives us `image documents` instead of only text documents.

In [7]:
receipt_reader = SimpleDirectoryReader(
    input_dir='data/receipts', 
    file_extractor=file_extractor, 
    file_metadata=filename_fn,
)
receipt_documents = receipt_reader.load_data()

Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
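
To make the idea of an image document more concrete, here is a minimal standard-library sketch of a reader that picks a parser by file extension and attaches filename metadata. It is not the LlamaIndex implementation: `ImageDocument`, `parse_image`, `parse_text`, and `load_directory` are hypothetical names, and the "parsed text" is a placeholder rather than real OCR output.

```python
import base64
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, List, Optional


@dataclass
class ImageDocument:
    """Hypothetical container: parsed text plus the original image (base64)."""
    text: str
    image: Optional[str] = None
    extra_info: Dict[str, str] = field(default_factory=dict)


Parser = Callable[[Path], ImageDocument]


def parse_image(path: Path) -> ImageDocument:
    # Stand-in for a real image parser: no OCR, only a placeholder caption.
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return ImageDocument(text=f"<parsed text of {path.name}>", image=encoded)


def parse_text(path: Path) -> ImageDocument:
    # Fallback parser for anything that is not registered as an image.
    return ImageDocument(text=path.read_text(encoding="utf-8"))


def load_directory(
    input_dir: str,
    extractors: Dict[str, Parser],
    metadata_fn: Callable[[str], Dict[str, str]],
) -> List[ImageDocument]:
    documents = []
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file():
            continue
        # Dispatch on the lower-cased extension, e.g. ".jpg" -> parse_image.
        parser = extractors.get(path.suffix.lower(), parse_text)
        doc = parser(path)
        doc.extra_info.update(metadata_fn(str(path)))
        documents.append(doc)
    return documents


# Demo on throwaway files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "receipt.jpg").write_bytes(b"not a real JPEG")
    Path(tmp, "notes.txt").write_text("plain text", encoding="utf-8")
    docs = load_directory(tmp, {".jpg": parse_image}, lambda name: {"file_name": name})

for doc in docs:
    print(Path(doc.extra_info["file_name"]).name, "has image:", doc.image is not None)
```

Keeping the image next to the parsed text is what later allows a response to point back at the original file.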


We build a simple vector index as usual, but unlike before, our index holds images in addition to text.

In [8]:
receipts_index = GPTSimpleVectorIndex.from_documents(receipt_documents)

INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 2180 tokens
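
One way to picture an index that holds images alongside text is sketched below. This is a toy under stated assumptions, not how `GPTSimpleVectorIndex` is implemented: the bag-of-words `embed` function stands in for a real embedding model, and `Node`, `ToyVectorIndex`, and the file paths are hypothetical. Only the text is embedded; the image reference travels with the node as a payload.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    text: str
    image_path: Optional[str] = None  # payload only; never embedded


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real index would call an embedding model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    def __init__(self, nodes: List[Node]) -> None:
        # Embed once at build time and keep each vector next to its node.
        self.entries = [(embed(node.text), node) for node in nodes]

    def nearest(self, query: str) -> Tuple[float, Node]:
        # Raises ValueError on an empty index; acceptable for a sketch.
        query_vec = embed(query)
        scored = ((cosine(query_vec, vec), node) for vec, node in self.entries)
        return max(scored, key=lambda pair: pair[0])


index = ToyVectorIndex([
    Node("mcdonalds receipt total 26.15", image_path="receipts/a.jpg"),
    Node("grocery store receipt total 80.00", image_path="receipts/b.jpg"),
])
score, best = index.nearest("mcdonalds total")
print(best.image_path)
```

Because the image path is carried through retrieval unchanged, whatever synthesizes the answer can refer to the file without the image itself ever being embedded.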


We can now ask a question that calls for a response containing both text and an image.  
We use a custom query transform, `ImageOutputQueryTransform`, to add instructions on how to display the image nicely in the notebook.

In [9]:
receipts_response = receipts_index.query(
    # Adjacent string literals avoid embedding the indentation in the query text.
    "When was the last time I went to McDonald's and how much did I spend. "
    "Also show me the receipt from my visit.",
    query_transform=ImageOutputQueryTransform(width=400),
)

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 1004 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 30 tokens
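
Conceptually, a query transform of this kind rewrites the question before it reaches the LLM. The sketch below illustrates the idea with a hypothetical template and class name (`IMAGE_OUTPUT_TEMPLATE`, `ImageOutputTransformSketch`); the wording that `ImageOutputQueryTransform` actually uses may well differ.

```python
from dataclasses import dataclass

# Hypothetical instruction template; the library's real prompt may differ.
IMAGE_OUTPUT_TEMPLATE = (
    "{query}\n"
    "If a relevant image is available, include it in the answer as an HTML tag "
    'of the form <img src="FILE_PATH" width="{width}" />, '
    "where FILE_PATH is the file_name stored with the source document."
)


@dataclass(frozen=True)
class ImageOutputTransformSketch:
    width: int = 400

    def __call__(self, query: str) -> str:
        # Append the formatting instruction to the user's original question.
        return IMAGE_OUTPUT_TEMPLATE.format(query=query.strip(), width=self.width)


transform = ImageOutputTransformSketch(width=400)
print(transform("Show me the receipt from my visit."))
```

The `width` value ends up in the generated `<img>` tag, which is why the responses in this notebook render images at 400 pixels wide.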


We now have a rich multimodal response with inline text and an image!  

The source nodes section gives additional details on retrieved data used for synthesizing the final response.  
In this case, we can verify that the receipt for McDonald's is correctly retrieved. 
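
The same check can be done programmatically. The following is a small sketch, assuming the response text embeds images as `<img>` tags and that the file names of the retrieved source nodes are available as plain strings; the helper names are hypothetical, and the sample response is abbreviated from this notebook's output.

```python
from html.parser import HTMLParser
from typing import Iterable, List


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it encounters."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are routed here by default.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_sources(response_text: str) -> List[str]:
    collector = ImgSrcCollector()
    collector.feed(response_text)
    collector.close()
    return collector.sources


def images_are_grounded(response_text: str, retrieved_files: Iterable[str]) -> bool:
    # True when every image in the response comes from a retrieved source node.
    # A response without images is trivially grounded.
    allowed = set(retrieved_files)
    return all(src in allowed for src in image_sources(response_text))


response = (
    "You spent $26.15. Here is the receipt from your visit: "
    '<img src="data/receipts/1100-receipt.jpg" width="400" />'
)
retrieved = ["data/receipts/1100-receipt.jpg"]
print(images_are_grounded(response, retrieved))
```

A check like this can flag a response that references an image path the retriever never returned.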

In [10]:
display_response(receipts_response)

**`Final Response:`** The last time you went to McDonald's was on 03/10/2018 at 07:39:12 PM and you spent $26.15. Here is the receipt from your visit: <img src="data/receipts/1100-receipt.jpg" width="400" />

---

**`Source Node 1/1`**

**Document ID:** 5b8ee185-d9ac-44f4-9abe-1d08b7277db8<br>**Similarity:** 0.7981665332785771<br>**Text:** file_name: data/receipts/1100-receipt.jpg

<s_menu><s_nm> Story</s_nm><s_num> 16725 Stony Platin ...<br>

## Q&A over LlamaIndex Documentation

We now demo the same for Q&A over the LlamaIndex documentation.  
This demo highlights the ability to synthesize multimodal output from a mixture of text and image documents.

In [11]:
llama_reader = SimpleDirectoryReader(
    input_dir='data/llama',
    file_extractor=file_extractor, 
    file_metadata=filename_fn,
)
llama_documents = llama_reader.load_data(concatenate=True)

In [12]:
llama_index = GPTSimpleVectorIndex.from_documents(llama_documents)

INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 965 tokens


In [13]:
llama_response = llama_index.query(
    'Show an image to illustrate how tree index works and explain briefly.', 
    query_transform=ImageOutputQueryTransform(width=400),
    similarity_top_k=2
)

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 1475 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 13 tokens


By inspecting the two source nodes, we see that both the relevant text and the relevant image describing the tree index were retrieved to synthesize the final multimodal response.

In [14]:
display_response(llama_response)

**`Final Response:`** This image illustrates how a tree index works. A tree index is a type of data structure that stores data in a hierarchical structure. It is composed of nodes, which can have multiple children and a single parent. Each node contains data, such as a key, value, or other information. The tree index is used to quickly search for data by traversing the tree from the root node to the desired node. During query time, we traverse from root nodes down to leaf nodes. By default, (`child_branch_factor=1`), a query chooses one child node given a parent node. If `child_branch_factor=2`, a query chooses two child nodes per parent. LlamaIndex also offers different methods of synthesizing a response, such as Create and Refine and Tree Summarize. Create and Refine is an iterative way of generating a response, while Tree Summarize builds a tree index over the set of candidate nodes with a summary prompt seeded with the query.

---

**`Source Node 1/2`**

**Document ID:** c1d13929-ecb4-4d9e-8d76-ead7df960ae1<br>**Similarity:** 0.815091548290817<br>**Text:** file_name: data/llama/tree_index.png

<s_menu><s_nm> Root Node</s_nm><s_unitprice> Parent</s_nm><...<br>

---

**`Source Node 2/2`**

**Document ID:** c75338b8-cee3-46ca-8b73-beefa50e387a<br>**Similarity:** 0.8134417398079422<br>**Text:** How Each Index Works

This guide describes how each index works with diagrams. We also visually h...<br>
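
The two source nodes above are the result of setting `similarity_top_k=2`. As a rough illustration of what top-k retrieval means, the sketch below ranks candidates by cosine similarity and keeps the best `k`. The three-dimensional vectors and file names are made up for the example, and `top_k` is a hypothetical helper rather than a LlamaIndex function.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    nodes: Sequence[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[float, str]]:
    # Score every node, then keep the k highest-scoring ones, best first.
    scored = ((cosine_similarity(query_vec, vec), name) for name, vec in nodes)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Made-up embeddings: two nodes close to the query, one far from it.
nodes = [
    ("tree_index.png", [0.9, 0.1, 0.0]),
    ("how_each_index_works.md", [0.8, 0.3, 0.1]),
    ("unrelated.md", [0.0, 0.2, 0.9]),
]
results = top_k([1.0, 0.2, 0.0], nodes, k=2)
for score, name in results:
    print(f"{name}: {score:.3f}")
```

With `k=2`, both the image node and a text node can reach the synthesis step, which is what allows a single answer to combine the two.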

We show another example, this time asking about the vector store index.

In [15]:
llama_response = llama_index.query(
    'Show an image to illustrate how vector store index works and explain briefly.', 
    query_transform=ImageOutputQueryTransform(width=400),
    similarity_top_k=2
)

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 1404 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 14 tokens


In [16]:
display_response(llama_response)

**`Final Response:`** <img src="data/llama/vector_store_index.png" width="400" />
Vector store index is a way of storing data in a vector format. It is used to store data in a way that is easy to access and manipulate. The data is stored in a vector format, which is a collection of numbers that represent the data. This makes it easier to access and manipulate the data, as well as to store it in a more efficient way. Vector store index stores each Node and a corresponding embedding in a Vector Store. During query time, we extract relevant keywords from the query, and match those with pre-extracted Node keywords to fetch the corresponding Nodes. The extracted Nodes are passed to our Response Synthesis module, which can be configured to use different methods of synthesizing a response, such as Create and Refine or Tree Summarize.

---

**`Source Node 1/2`**

**Document ID:** 957904ba-d811-41b1-b156-1d9e1cb7ceff<br>**Similarity:** 0.816241967681269<br>**Text:** file_name: data/llama/vector_store_index.png

<s_menu><s_nm> Nodel</s_nm><s_unitprice> Node2</s_u...<br>

---

**`Source Node 2/2`**

**Document ID:** c75338b8-cee3-46ca-8b73-beefa50e387a<br>**Similarity:** 0.7878806797032482<br>**Text:** How Each Index Works

This guide describes how each index works with diagrams. We also visually h...<br>