# Image Processing in RAG Systems

Retrieval-augmented generation (RAG) systems traditionally work with textual inputs, but adding image processing extends them to multimodal scenarios. This notebook examines three methods for integrating image processing into a RAG system:

- Optical Character Recognition for text extraction
- Vision models for image description
- Multimodal embeddings

# Setup 0: Text Embedding

To embed the text produced by each method, we use nomic-embed-text-v1.5, served through LM Studio.

We use the default LM Studio settings provided under **Local Server**.

In [3]:
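from openai import OpenAI
import numpy as np

# LM Studio's local server exposes an OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

def get_embedding(text, model="nomic-ai/nomic-embed-text-v1.5-GGUF"):
    # Collapse newlines into spaces before sending the text to the embedding model.
    text = text.replace("\n", " ")
    embedding = client.embeddings.create(input=[text], model=model).data[0].embedding
    # Return a 2-D array of shape (1, dim), the layout cosine_similarity expects.
    return np.array([embedding])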

test_embedding = get_embedding("Boo ...")
test_embedding.shape

(1, 768)
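
`get_embedding` sends the raw text to the model. The nomic-embed-text-v1.5 model card describes task prefixes such as `search_document: ` and `search_query: `. Whether the LM Studio server adds them automatically is not stated in this notebook, so the helper below is only a sketch of how they could be applied by hand; `with_task_prefix` and `TASK_PREFIXES` are hypothetical names, not part of any library.

```python
# Task prefixes as described in the nomic-embed-text-v1.5 model card.
# The helper and dictionary names are illustrative assumptions.
TASK_PREFIXES = {
    "document": "search_document: ",
    "query": "search_query: ",
}


def with_task_prefix(text, task):
    """Prepend the task prefix for `task` ("document" or "query") to `text`."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task!r}")
    return TASK_PREFIXES[task] + text


prefixed_query = with_task_prefix("orange juice with pulp produced in USA", "query")
prefixed_document = with_task_prefix("text extracted from an image", "document")
print(prefixed_query)
print(prefixed_document)
```

The prefixed strings could then be passed to `get_embedding` in place of the raw text. Doing so would change the similarity scores reported in this notebook.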

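Now let's store the image paths and the queries. The queries are not listed in the same order as the images: query 1 targets `orange-juice.jpeg` and query 2 targets `drag-race.jpeg`.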

In [4]:
paths = ["data/documentation.png", "data/drag-race.jpeg", "data/orange-juice.jpeg"]
queries = ["give me an overview about software design", "orange juice with pulp produced in USA", "drag race between to classic car"]

embedded_queries = []
# Embed every query once so that all three methods can reuse the vectors.
for query in queries:
    embedded_queries.append(get_embedding(query))

# OCR-Based Text Extraction

This method uses optical character recognition (OCR): it first extracts the text from each image and then converts that text into a vector with an embedding model.

In [15]:
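from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Load docTR's pretrained OCR pipeline once and reuse it for every image.
model = ocr_predictor(pretrained=True)

def extract_texts(image_path):
    # Wrap the image as a docTR document and run the OCR model on it.
    img_doc = DocumentFile.from_images(image_path)
    return model(img_doc)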


In [16]:
ocr_texts = []

for path in paths: 
    text = extract_texts(path)
    ocr_texts.append(text.render())

In [18]:
ocr_texts_embeddings = []

for text in ocr_texts:
    ocr_texts_embeddings.append(get_embedding(text))

In [30]:
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

ocr_similarity = pd.DataFrame(index=range(len(ocr_texts_embeddings)), columns=range(len(embedded_queries)), dtype=float)

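# Only pairs with num_row >= num_col are scored (query num_col against
# image num_row). Each score is then mirrored, so the resulting table is
# symmetric by construction.
for num_col in range(len(embedded_queries)):
    for num_row in range(num_col, len(embedded_queries)):
        a = embedded_queries[num_col]
        b = ocr_texts_embeddings[num_row]
        similarity = cosine_similarity(a, b)[0][0]

        ocr_similarity.loc[num_col, num_row] = similarity
        ocr_similarity.loc[num_row, num_col] = similarity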

In [40]:
ocr_df = ocr_similarity.rename({i: paths[i].replace('data/', '') for i in ocr_similarity.index})
ocr_df

| image | 0 | 1 | 2 |
|---|---|---|---|
| documentation.png | 0.709916 | 0.316771 | 0.390691 |
| drag-race.jpeg | 0.316771 | 0.351983 | 0.719673 |
| orange-juice.jpeg | 0.390691 | 0.719673 | 0.43793 |
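
The loop that fills `ocr_similarity` scores only the pairs with `num_row >= num_col` and mirrors each value. As a result, three of the nine image-query pairs are never scored, including the drag-race query against `drag-race.jpeg`. A minimal sketch of scoring every image against every query is shown below. It assumes each embedding has shape `(1, dim)`, as returned by `get_embedding`; `similarity_matrix` is a hypothetical helper, and the toy vectors are made up for illustration.

```python
import numpy as np


def similarity_matrix(image_vectors, query_vectors):
    """Cosine similarity of every image (rows) against every query (columns)."""
    images = np.vstack(image_vectors).astype(float)   # shape (n_images, dim)
    queries = np.vstack(query_vectors).astype(float)  # shape (n_queries, dim)
    # L2-normalize each row so that a dot product equals cosine similarity.
    images /= np.linalg.norm(images, axis=1, keepdims=True)
    queries /= np.linalg.norm(queries, axis=1, keepdims=True)
    return images @ queries.T


# Toy 2-D vectors stand in for real embeddings (hypothetical values).
toy_images = [np.array([[1.0, 0.0]]), np.array([[0.0, 2.0]])]
toy_queries = [np.array([[3.0, 0.0]]), np.array([[1.0, 1.0]])]

scores = similarity_matrix(toy_images, toy_queries)
print(scores.round(3))
```

In the toy result, the score for image 0 against query 1 differs from the score for image 1 against query 0, which is why an image-by-query matrix generally cannot be filled by mirroring.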


# Vision-Based Models for Image Captioning

In this method, a vision model, such as a convolutional neural network or a general-purpose multimodal model like GPT-4o, describes the image. An embedding model then converts the description into a vector.

For now, we use [xtuner/llava-llama-3-8b-v1_1-gguf](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf), served through LM Studio.

In [48]:
from openai import OpenAI
import base64
import mimetypes

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")


def describe_image(path):
    path = path.replace("'", "")
    # Read the image and base64-encode it so it can be sent inline as a data URL.
    with open(path, "rb") as image_file:
        base64_image = base64.b64encode(image_file.read()).decode("utf-8")
    # Use the file's own MIME type (PNG or JPEG), falling back to JPEG.
    mime_type = mimetypes.guess_type(path)[0] or "image/jpeg"

    completion = client.chat.completions.create(
        model="model-identifier",
        messages=[
            {
                "role": "system",
                "content": "You are an intelligent assistant. You are helping the user to describe an image. Provide only the answer; avoid unnecessary talk or explanations.",
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "describe this image and if there is any content give me a summary of it."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:{mime_type};base64,{base64_image}"
                        },
                    },
                ],
            },
        ],
        max_tokens=1000,
        stream=True,
    )

    full_response = ""
    
    for chunk in completion:
        if chunk.choices[0].delta.content:
            full_response += chunk.choices[0].delta.content
            print(chunk.choices[0].delta.content, end="", flush=True)

    return full_response

In [None]:
vision_descriptions = []

for path in paths: 
    print(path)
    text = describe_image(path)
    vision_descriptions.append(text)

data/documentation.png
The image you've shared is a screenshot of a software design document. The document is neatly organized into three sections, each with a distinct purpose.

Starting from the top left corner, we see a section titled "Software Design Document". This section has a blue header and white text, providing clear contrast for easy reading. It also contains instructions on how to describe an image in this context.

Moving to the right side of the document, we find another section titled "System Overview". This section follows a similar layout with a blue header and white text. It provides a brief overview of the system design, setting the stage for the rest of the document.

Finally, at the bottom center of the document, we see the third section titled "Design Considerations". This section also has a blue header and white text. It outlines several key considerations in designing software.

The document is written in English and includes some technical terms related to soft

In [51]:
vision_descriptions_embeddings = []

for description in vision_descriptions:
    vision_descriptions_embeddings.append(get_embedding(description))

In [56]:
vision_similarity = pd.DataFrame(index=range(len(vision_descriptions_embeddings)), columns=range(len(embedded_queries)), dtype=float)

# Same fill pattern as for the OCR table: only pairs with num_row >= num_col
# are scored, and each score is mirrored into the opposite cell.
for num_col in range(len(embedded_queries)):
    for num_row in range(num_col, len(embedded_queries)):
        a = embedded_queries[num_col]
        b = vision_descriptions_embeddings[num_row]
        similarity = cosine_similarity(a, b)[0][0]

        vision_similarity.loc[num_col, num_row] = similarity
        vision_similarity.loc[num_row, num_col] = similarity

vision_df = vision_similarity.rename({i: paths[i].replace('data/', '') for i in vision_similarity.index})
vision_df

| image | 0 | 1 | 2 |
|---|---|---|---|
| documentation.png | 0.741236 | 0.372204 | 0.418964 |
| drag-race.jpeg | 0.372204 | 0.392488 | 0.66824 |
| orange-juice.jpeg | 0.418964 | 0.66824 | 0.419419 |


# Direct Image Embedding

In this method, an embedding model converts the images directly into vectors, which can then be used on their own in the RAG system.

We use [CLIP](https://github.com/openai/CLIP) for this test.

In [1]:
import torch
import clip
from PIL import Image

# Run on the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

In [5]:
# Tokenize all queries at once and encode them with CLIP's text encoder.
text = clip.tokenize(queries).to(device)

with torch.no_grad():
    text_features = model.encode_text(text)

# Scale each text vector to unit length.
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
text_vector = text_features.cpu().numpy()


def process(path):
    """Return the similarity of one image to every query, shape (1, len(queries))."""
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)

    with torch.no_grad():
        image_features = model.encode_image(image)

    # Scale the image vector to unit length as well.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    return (image_features @ text_features.T).cpu().numpy()
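
Because both the image features and the text features are divided by their norms, the matrix product returned by `process` is a cosine similarity. The short derivation below uses $u$ for an image vector and $v$ for a query vector; these symbols are introduced here for illustration only.

```latex
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert},
\qquad
\hat{u} = \frac{u}{\lVert u \rVert},\quad
\hat{v} = \frac{v}{\lVert v \rVert}
\;\Longrightarrow\;
\hat{u} \cdot \hat{v} = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = \cos(u, v)
```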

In [9]:
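# Similarity of the first image (documentation.png) to each of the three queries.
similarity = process(paths[0])
similarity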

array([[0.28302374, 0.14890398, 0.16479638]], dtype=float32)

In [10]:
import pandas as pd

embedding_similarity = pd.DataFrame(index=range(len(text_vector)), dtype=float)

for path in paths:
    similarity = process(path)[0]
    embedding_similarity[path.replace('data/', '')] = similarity

embedding_similarity

| query | documentation.png | drag-race.jpeg | orange-juice.jpeg |
|---|---|---|---|
| 0 | 0.283024 | 0.16788 | 0.190975 |
| 1 | 0.148904 | 0.104399 | 0.327666 |
| 2 | 0.164796 | 0.318107 | 0.121669 |
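
In a RAG pipeline, the retrieval step would pick the highest-scoring image for each query. The sketch below does that for the CLIP scores, copied by hand from the table above. `best_image` is a hypothetical helper, and plain Python lists are used so the example runs without the models.

```python
# Scores copied from the CLIP table: one row per query, one column per image.
images = ["documentation.png", "drag-race.jpeg", "orange-juice.jpeg"]
clip_scores = [
    [0.283024, 0.16788, 0.190975],   # query 0: software design overview
    [0.148904, 0.104399, 0.327666],  # query 1: orange juice
    [0.164796, 0.318107, 0.121669],  # query 2: drag race
]


def best_image(row, names):
    """Return the (name, score) pair with the highest similarity in one row."""
    best_index = max(range(len(row)), key=row.__getitem__)
    return names[best_index], row[best_index]


top_matches = [best_image(row, images) for row in clip_scores]
for query_index, (name, score) in enumerate(top_matches):
    print(f"query {query_index}: {name} ({score:.3f})")
```

With these scores, each query retrieves the image it was written for, even though the absolute CLIP similarities are much lower than the text-embedding similarities in the earlier tables.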


In [11]:
text_vector.shape

(3, 512)