# Vision-Language Model (VLM) Embeddings with MLX Server

This notebook demonstrates how to use the embeddings endpoint of MLX Server through its OpenAI-compatible API. Vision-Language Models (VLMs) process both images and text, so a single model can produce embeddings for either kind of input.



## Introduction

MLX Server provides an efficient way to serve multimodal models on Apple Silicon. In this notebook, we'll explore how to:

- Generate embeddings for text and images
- Work with the OpenAI-compatible API
- Calculate similarity between text and image representations
- Understand how these embeddings can be used for practical applications

Embeddings are high-dimensional vector representations of content that capture semantic meaning, making them useful for search, recommendation systems, and other AI applications.

## 1. Setup and API Connection

We connect to MLX Server with the standard OpenAI Python client, configured with:

- A local server endpoint (`http://localhost:8000/v1`)
- A placeholder API key (since MLX Server doesn't require authentication by default)

Make sure you have MLX Server running locally before executing this notebook.
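
A quick way to confirm that something is listening before running the rest of the notebook is a plain TCP check. This is a minimal sketch: `server_reachable` is a hypothetical helper, not part of MLX Server or the OpenAI client, and it assumes the host and port from the endpoint above. It only confirms that the port accepts connections, not that the process behind it is MLX Server.

```python
import socket


def server_reachable(host: str = "localhost", port: int = 8000, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False


if not server_reachable():
    print("Nothing is listening on localhost:8000; start MLX Server first.")
```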

In [1]:
# Import the OpenAI client for API communication
from openai import OpenAI

# Connect to the local MLX Server with OpenAI-compatible API
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="fake-api-key",
)

## 2. Image Processing for API Requests

When working with image inputs, we need to prepare them in a format that the API can understand. The OpenAI-compatible API expects images to be provided as base64-encoded data URIs.

Below, we'll import the necessary libraries and define a helper function to convert PIL Image objects to the required format.

In [2]:
from PIL import Image
from io import BytesIO
import base64

In [3]:
# To send images to the API, we need to convert them to base64-encoded strings in a data URI format.

def image_to_base64(image: Image.Image) -> str:
    """
    Convert a PIL Image to a base64-encoded data URI string that can be sent to the API.
    
    Args:
        image: A PIL Image object
        
    Returns:
        A data URI string with the base64-encoded image
    """
    # Convert image to bytes
    buffer = BytesIO()
    image.save(buffer, format="PNG")
    buffer.seek(0)
    image_data = buffer.getvalue()
    
    # Encode as base64
    image_base64 = base64.b64encode(image_data).decode('utf-8')
    
    # Create the data URI format required by the API
    mime_type = "image/png"  
    image_uri = f"data:{mime_type};base64,{image_base64}"
    
    return image_uri
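
To make the data URI layout concrete, the sketch below goes the other way and splits a URI back into its MIME type and raw bytes. `parse_data_uri` is a hypothetical helper introduced only for illustration, and the sample bytes stand in for real image data so the example runs without Pillow or an image file.

```python
import base64


def parse_data_uri(uri: str) -> tuple[str, bytes]:
    """Split a base64 data URI into (mime_type, raw_bytes)."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a URI of the form data:<mime>;base64,<payload>")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)


# Round-trip check using the 8-byte PNG signature as stand-in image data
sample = b"\x89PNG\r\n\x1a\n"
uri = "data:image/png;base64," + base64.b64encode(sample).decode("utf-8")
mime_type, raw = parse_data_uri(uri)
print(mime_type, raw == sample)
```

A round trip like this is a cheap sanity check that the encoded payload still decodes to the bytes that were written.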

## 3. Loading and Preparing an Image

Now we'll load a sample image (a green dog in this case) and convert it to the base64 format required by the API. This image will be used to generate embeddings in the subsequent steps.

In [5]:
image = Image.open("images/green_dog.jpeg")
image_uri = image_to_base64(image)

## 4. Generating Embeddings

In [12]:
# Generate an embedding for the image, passed via extra_body alongside a text prompt
prompt = "Describe the image in detail"
image_embedding = client.embeddings.create(
    input=[prompt],
    model="mlx-community/Qwen2.5-VL-3B-Instruct-4bit",
    extra_body={
        "image_url": image_uri
    }
).data[0].embedding

# Generate an embedding for a text-only description to compare against the image
text = "A green dog looking at the camera"
text_embedding = client.embeddings.create(
    input=[text],
    model="mlx-community/Qwen2.5-VL-3B-Instruct-4bit"
).data[0].embedding

## 5. Comparing Text and Image Embeddings

VLM embeddings place text and images in a shared vector space. This means we can directly compare how similar a text description is to an image's content by calculating the cosine similarity between their embeddings.

A higher similarity score (closer to 1.0) indicates that the text description closely matches the image content according to the model's understanding.
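
For reference, the standard definition of cosine similarity for two embedding vectors $a$ and $b$ is:

$$
\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}}
$$

The plain dot product $a \cdot b$ equals this value only when both vectors have unit length. Whether MLX Server returns unit-length embeddings is not stated here, so dividing by the norms is the safer assumption.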

In [13]:
import numpy as np

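def cosine_similarity(a, b):
    """Cosine similarity: the dot product divided by the product of the vector norms."""
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))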

similarity = cosine_similarity(image_embedding, text_embedding)
print(similarity)

0.8473370724651375
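
The same comparison extends to ranking several candidate descriptions against one image, which is the core of embedding-based search. The sketch below uses made-up three-dimensional vectors in place of real embeddings, and the names `cosine` and `rank_by_similarity` are hypothetical. It uses only the standard library so that it runs without the server.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return (label, score) pairs sorted from most to least similar to `query`."""
    scored = [(label, cosine(query, vec)) for label, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real image and text embeddings
toy_image = [0.9, 0.1, 0.2]
toy_captions = {
    "a green dog": [0.8, 0.2, 0.1],
    "a red car": [0.1, 0.9, 0.3],
    "a city skyline": [0.0, 0.2, 0.9],
}

for label, score in rank_by_similarity(toy_image, toy_captions):
    print(f"{score:.3f}  {label}")
```

With real data, `toy_image` would be replaced by `image_embedding` and each caption vector by the result of a text-only embeddings request.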
