The UForm multimodal AI SDK offers a simple way to integrate multimodal AI capabilities into your Python applications. The SDK doesn't require any deep learning knowledge, PyTorch, or a CUDA installation, and can run on almost any hardware.
There are several ways to install the UForm Python SDK, depending on the backend you want to use. PyTorch is by far the heaviest backend, but also the most capable. ONNX is a lightweight alternative that can run on any CPU and on some GPUs.
pip install "uform[torch]" # For PyTorch
pip install "uform[onnx]" # For ONNX on CPU
pip install "uform[onnx-gpu]" # For ONNX on GPU, available for some platforms
pip install "uform[torch,onnx]" # For PyTorch and ONNX Python tests
Load the model:
from uform import get_model, Modality
model_name = 'unum-cloud/uform3-image-text-english-small'
modalities = [Modality.TEXT_ENCODER, Modality.IMAGE_ENCODER]
processors, models = get_model(model_name, modalities=modalities)
model_text = models[Modality.TEXT_ENCODER]
model_image = models[Modality.IMAGE_ENCODER]
processor_text = processors[Modality.TEXT_ENCODER]
processor_image = processors[Modality.IMAGE_ENCODER]
Embed images:
import requests
from io import BytesIO
from PIL import Image
image_url = 'https://media-cdn.tripadvisor.com/media/photo-s/1b/28/6b/53/lovely-armenia.jpg'
image = Image.open(BytesIO(requests.get(image_url).content))
image_data = processor_image(image)
image_features, image_embedding = model_image.encode(image_data, return_features=True)
Embed queries:
text = 'a cityscape bathed in the warm glow of the sun, with varied architecture and a towering, snow-capped mountain rising majestically in the background'
text_data = processor_text(text)
text_features, text_embedding = model_text.encode(text_data, return_features=True)
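With both vectors in hand, you can score how well the query matches the image. A minimal sketch using cosine similarity, assuming the embeddings are NumPy arrays (with the PyTorch backend, call .detach().cpu().numpy() on them first):
import numpy as np

def cosine_similarity(a, b):
    # Normalize along the embedding dimension, then take the dot product.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return (a * b).sum(axis=-1)

print(cosine_similarity(image_embedding, text_embedding))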
UForm generative models are fully compatible with the Hugging Face Transformers library and can be used without installing the UForm library. These models can caption images or power multimodal chat experiences.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor
model = AutoModel.from_pretrained('unum-cloud/uform-gen2-dpo', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('unum-cloud/uform-gen2-dpo', trust_remote_code=True)
prompt = 'Question or Instruction'
image = Image.open('image.jpg')
inputs = processor(text=[prompt], images=[image], return_tensors='pt')
with torch.inference_mode():
    output = model.generate(
        **inputs,
        do_sample=False,
        use_cache=True,
        max_new_tokens=256,
        eos_token_id=151645,
        pad_token_id=processor.tokenizer.pad_token_id,
    )
prompt_len = inputs['input_ids'].shape[1]
decoded_text = processor.batch_decode(output[:, prompt_len:])[0]
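The decoded string may still carry special tokens, like the end-of-sequence marker; the standard skip_special_tokens flag of Transformers tokenizers strips them:
decoded_text = processor.batch_decode(output[:, prompt_len:], skip_special_tokens=True)[0]
print(decoded_text)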
You can check examples of different prompts in our demo Gradio spaces on Hugging Face:
- for uform-gen2-qwen-500m
- for uform-gen2-dpo
To achieve higher throughput, you can launch UForm on multiple GPUs. For that, pick the encoder of the model you want to run in parallel and wrap it in nn.DataParallel (or nn.DistributedDataParallel).
from uform import get_model, Modality
import torch.nn as nn
processors, models = get_model('unum-cloud/uform-vl-english-small', backend='torch')
model_text = models[Modality.TEXT_ENCODER]
model_image = models[Modality.IMAGE_ENCODER]
processor_text = processors[Modality.TEXT_ENCODER]
processor_image = processors[Modality.IMAGE_ENCODER]
model_text.return_features = False
model_image.return_features = False
model_text_parallel = nn.DataParallel(model_text)
model_image_parallel = nn.DataParallel(model_image)
Since we are now dealing with the PyTorch wrapper, make sure to use the forward method (instead of encode) to get the embeddings, and the .detach().cpu().numpy() sequence to bring the data back to more Pythonic NumPy arrays.
from typing import List
from PIL import Image

def get_image_embedding(images: List[Image.Image]):
    preprocessed = processor_image(images)
    embedding = model_image_parallel.forward(preprocessed)
    return embedding.detach().cpu().numpy()

def get_text_embedding(texts: List[str]):
    preprocessed = processor_text(texts)
    embedding = model_text_parallel.forward(preprocessed)
    return embedding.detach().cpu().numpy()
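For illustration, a hypothetical call site (the file names and queries are placeholders) that embeds a small batch of each modality:
images = [Image.open(path) for path in ['first.jpg', 'second.jpg']]  # placeholder files
image_vectors = get_image_embedding(images)
text_vectors = get_text_embedding(['first query', 'second query'])
print(image_vectors.shape, text_vectors.shape)  # one row per input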
The configuration process may include a few additional steps, depending on the environment. When using the CUDA and TensorRT backends with CUDA 12 or newer, make sure to install the Nvidia toolkit and the onnxruntime-gpu package from the custom repository.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12
pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
export CUDA_PATH="/usr/local/cuda-12"
export PATH="/usr/local/cuda-12/bin${PATH:+:${PATH}}"
export LD_LIBRARY_PATH="/usr/local/cuda-12/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
pytest python/scripts/ -s -x -Wd -v -k onnx
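Before running the tests, a quick check (an extra step, not from the official docs) that onnxruntime actually sees the GPU:
import onnxruntime as ort

# 'CUDAExecutionProvider' (and 'TensorrtExecutionProvider', if installed)
# should appear in this list once the setup above succeeded.
print(ort.get_available_providers())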