# Multimodal

This notebook demonstrates a basic multimodal pipeline using a LLaVA (Large Language and Vision Assistant) model to generate a textual description of an image.

A key aspect of this setup is that the model is loaded from two files. The main GGUF file holds the large language model, while the CLIP model file (`mmproj-model-f16.gguf`) acts as the vision encoder. The `Llava16ChatHandler` uses this CLIP model to "see" the image, converting its visual features into embeddings that the language model can consume.
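
Because both files have to be present before the handler and the model can be constructed, a small pre-flight check can give a clearer error than a failed load. The following is a minimal sketch using only the standard library; `find_missing_files` is a hypothetical helper (not part of llama-cpp-python), and the file names in the usage lines are placeholders.

```python
from pathlib import Path


def find_missing_files(*paths):
    """Return the given paths that do not point to an existing file."""
    return [Path(p) for p in paths if not Path(p).is_file()]


# Hypothetical usage with placeholder file names; substitute your own paths.
missing = find_missing_files("language-model.gguf", "mmproj-model-f16.gguf")
if missing:
    print("Missing model files:", ", ".join(str(p) for p in missing))
```

Checking for files rather than directories is a deliberate choice here: a path that exists but points at a folder would still fail when the model is loaded.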

To process an image, we first convert it into a base64-encoded string, which allows us to embed the image data directly into the chat message alongside a text prompt. Finally, we send this combined prompt, containing both the image and a question (e.g., "Describe what is in this image."), to the model. The LLaVA model then processes both the visual and textual information to generate its descriptive response.
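
The encoding and message-building steps described above can be sketched in isolation, without loading a model. This is an illustration only: `to_data_uri` and `build_user_message` are hypothetical helper names, the bytes are a placeholder rather than a real image, and the message layout assumes the OpenAI-style content-parts format, where the text part carries its prompt under the `text` key.

```python
import base64


def to_data_uri(raw_bytes, mime_type="image/png"):
    """Encode raw bytes as a base64 data URI."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_user_message(data_uri, prompt):
    """Combine an image data URI and a text prompt into one chat message."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_uri}},
            {"type": "text", "text": prompt},
        ],
    }


# Placeholder bytes stand in for real image data in this illustration.
uri = to_data_uri(b"not a real image")
message = build_user_message(uri, "Describe what is in this image.")
print(message["content"][0]["image_url"]["url"][:22])
```

Keeping the image and the question as separate parts of a single user message is what lets the chat handler locate the image and substitute its embeddings at the right position in the prompt.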

In [None]:
import base64
from io import BytesIO
from pathlib import Path

from PIL import Image

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava16ChatHandler

In [None]:
MODEL_ROOT = Path("../llama-cpp-python/models")
assert MODEL_ROOT.exists()
DATA_ROOT = Path("../data")
assert DATA_ROOT.exists()

In [None]:
MULTIMODAL_MODEL_ROOT = MODEL_ROOT / "multimodal"
assert MULTIMODAL_MODEL_ROOT.exists()
TEXT_GEN_MODEL_ROOT = MODEL_ROOT / "text_gen"
assert TEXT_GEN_MODEL_ROOT.exists()

In [None]:
# --- 1. Configuration ---
LLAVA_MODEL_PATH = MULTIMODAL_MODEL_ROOT / "llava/llava-v1.6-mistral-7b.Q4_K_M.gguf"
LLAVA_CLIP_MODEL_PATH = MULTIMODAL_MODEL_ROOT / "llava/llava-v1.6-mistral-7b-mmproj-model-f16.gguf"
IMAGE_PATH = DATA_ROOT / "image/cat.jpg"

# --- 2. Model Loading ---
def load_models():
    """Loads the LLaVA model together with its CLIP-based chat handler."""
    print("Loading LLaVA model for vision analysis...")
    chat_handler = Llava16ChatHandler(clip_model_path=str(LLAVA_CLIP_MODEL_PATH), verbose=True)

    llava_model = Llama(
        model_path=str(LLAVA_MODEL_PATH),
        chat_handler=chat_handler,
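        n_ctx=4096,       # Image embeddings consume many tokens, so leave headroom in the context
        n_gpu_layers=31,  # Layers to offload to the GPU; tune for your hardware (-1 offloads all)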
        verbose=True
    )
    
    print("Model loaded successfully.")
    return llava_model

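# --- 3. Image Helpers ---
def pil_to_base64(image):
    """Converts a PIL Image to a base64-encoded PNG string."""
    buffered = BytesIO()
    # Re-encode as PNG in memory so the data matches the image/png MIME type used in the data URI
    image.save(buffered, format="PNG")
    return base64.b64encode(buffered.getvalue()).decode('utf-8')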

# --- 4. Core AI Functions ---
def describe_frame(llava, image):
    """Uses LLaVA to generate a textual description of an image."""
    b64_image = pil_to_base64(image)
    chat_message = [
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64_image}"}},
            # A text part carries its prompt under the "text" key, not "content"
            {"type": "text", "text": "Describe what is in this image."}
        ]}
    ]
    
    response = llava.create_chat_completion(messages=chat_message, max_tokens=256)
    return response['choices'][0]['message']['content']

# Main Pipeline

In [None]:
llava_model = load_models()

In [None]:
frame = Image.open(IMAGE_PATH)
response = describe_frame(llava_model, frame)
print("\n\nResponse:")
print(response)
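
Since the request caps generation at `max_tokens=256`, a description can be cut off mid-sentence. A minimal sketch of a more defensive way to read the result is shown below, assuming the response follows the OpenAI-style chat-completion layout with a `finish_reason` field on each choice. `extract_reply` is a hypothetical helper, and the sample dict is hand-written for illustration, not real model output.

```python
def extract_reply(response):
    """Return (text, truncated) from a chat-completion style dict."""
    choices = response.get("choices") or []
    if not choices:
        return "", False
    choice = choices[0]
    text = (choice.get("message") or {}).get("content") or ""
    # "length" means generation stopped because the token limit was reached.
    truncated = choice.get("finish_reason") == "length"
    return text.strip(), truncated


# Hand-written dict for illustration only; not real model output.
sample = {
    "choices": [
        {
            "message": {"role": "assistant", "content": " A placeholder reply. "},
            "finish_reason": "length",
        }
    ]
}
text, truncated = extract_reply(sample)
print(text, "(truncated)" if truncated else "")
```

If `truncated` is true, raising `max_tokens` in the call to `create_chat_completion` is the usual remedy.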