**Overview**

*   **Multimodality** refers to the ability to work with data in different forms, such as text, audio, images, and video.
*   It lets models and systems process a mix of these data types together.
*   Multimodal support is relatively new and not yet standardised across model providers.
*   LangChain's multimodal abstractions are designed to be flexible and accommodate different APIs.

**Multimodality in Chat Models**

*   Some chat models can accept multimodal inputs and, less commonly, generate multimodal outputs.
*   To use multimodal models:
    *   Identify which models support multimodality using the chat model integration table.
    *   Reference the how-to guides for specific examples.

*   **Inputs**:
    *   Some models accept inputs like images, audio, video, or files.
    *   The types of supported inputs depend on the model provider.
    *   Most models that support multimodal inputs accept them in OpenAI's content blocks format; so far, this applies only to image inputs.
    *   Models like Gemini also support native, model-specific representations for video and other byte inputs.
    *   Multimodal inputs are passed using content blocks that specify a type and corresponding data.
    *   The format of the content blocks may vary depending on the model provider.
        *   For example, to pass an image by URL, add a content block with `type` set to `"image_url"` and the URL nested under the `image_url` key.
        ```python
        from langchain_core.messages import HumanMessage

        # `model` (a multimodal chat model) and `image_url` (a str) are assumed
        # to be defined elsewhere.
        message = HumanMessage(
            content=[
                # Text part of the prompt.
                {"type": "text", "text": "describe the weather in this image"},
                # Image part, referenced by URL.
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        )
        response = model.invoke([message])
        ```
*   **Outputs**:
    *   At the time of writing, virtually no popular chat models support multimodal outputs; the exception is OpenAI's `gpt-4o-audio-preview` model, which can generate audio outputs.
    *   Multimodal outputs will appear as part of the `AIMessage` response object.
    *   The `ChatOpenAI` documentation has more information on how to use multimodal outputs.

*   **Tools**:
    *   Currently, no chat model is designed to work directly with multimodal data in a tool call request or `ToolMessage` result.
    *   Models can interact with multimodal data by invoking tools with references (e.g., URLs) to the data.
    *   Models can be equipped with tools to download and process images, audio, or video.
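
The reference-passing idea from the Tools notes can be sketched as a plain function that receives a URL and returns a small text summary of the media it points to. This is only an illustration under stated assumptions: the function name `fetch_media_info`, the size cap, and the returned fields are hypothetical, and the function uses only the Python standard library. In a LangChain application, a function like this would typically be registered as a tool (for example with the `tool` decorator from `langchain_core.tools`) so that the model can call it with a URL argument.

```python
import urllib.request
from urllib.parse import urlparse

# Hypothetical allow-list; "data" is included so the sketch can be tried offline.
ALLOWED_SCHEMES = {"http", "https", "data"}


def fetch_media_info(url: str, max_bytes: int = 5_000_000) -> dict:
    """Download the media behind `url` and return text-friendly facts about it.

    The model only ever sees the URL (in the tool call) and this small
    dictionary (in the tool result), never the raw bytes.
    """
    scheme = urlparse(url).scheme
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported URL scheme: {scheme!r}")

    with urllib.request.urlopen(url, timeout=10) as resp:
        # Read at most `max_bytes` so a huge file cannot exhaust memory.
        data = resp.read(max_bytes)
        content_type = resp.headers.get("Content-Type", "application/octet-stream")

    return {"url": url, "content_type": content_type, "num_bytes": len(data)}
```

A real tool would replace the summary with actual processing (captioning, transcription, and so on); the point of the sketch is that both the tool argument and the tool result stay text-only.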

**Multimodality in Embedding Models**

*   **Embeddings** are vector representations of data used for tasks like similarity search and retrieval.
*   The current embedding interface in LangChain is optimised for text-based data and will not work with multimodal data.
*   The embedding interface is expected to expand to accommodate other data types like images, audio, and video as use cases become more common.
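
To make the text-only shape of the interface concrete, here is a toy stand-in that reuses the two method names of LangChain's `Embeddings` base class, `embed_documents` and `embed_query`. Everything else is an assumption made for illustration: the class `ToyTextEmbeddings`, the hashing scheme, and the explicit type check are hypothetical and do not reflect how any real embedding model works.

```python
from __future__ import annotations

import hashlib
import math


class ToyTextEmbeddings:
    """Toy embedder: strings in, fixed-size float vectors out."""

    def __init__(self, size: int = 8) -> None:
        self.size = size

    def embed_query(self, text: str) -> list[float]:
        # The interface is defined over strings; bytes (e.g. an image) are rejected.
        if not isinstance(text, str):
            raise TypeError("this interface accepts text only")
        vec = [0.0] * self.size
        for token in text.lower().split():
            # Hash each token into one of `size` buckets.
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vec[digest[0] % self.size] += 1.0
        # L2-normalise so vectors are comparable by dot product.
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [self.embed_query(t) for t in texts]
```

Both methods take strings, which is why image, audio, or video bytes have no natural entry point in an interface of this shape.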

**Multimodality in Vector Stores**

*   Vector stores are databases for storing and retrieving embeddings, typically used in search and retrieval tasks.
*   Vector stores are currently optimised for text-based data.
*   The vector store interface is expected to expand to accommodate other data types like images, audio, and video as use cases become more common.
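
The following toy store illustrates what "optimised for text-based data" means in practice: texts go in, a text query comes out ranked by similarity. The method names `add_texts` and `similarity_search` mirror those on LangChain's `VectorStore` interface, but the class `ToyTextVectorStore`, the bag-of-words vectors, and the return shape are hypothetical simplifications. Keeping a media URL in the metadata is one possible way to link text entries back to non-text assets; it is an assumption of this sketch, not something the notes above prescribe.

```python
import math
from collections import Counter


def embed_text(text: str) -> Counter:
    """Bag-of-words vector: token -> count."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyTextVectorStore:
    def __init__(self) -> None:
        self._rows = []  # list of (text, metadata, vector)

    def add_texts(self, texts, metadatas=None) -> None:
        metadatas = metadatas or [{} for _ in texts]
        for text, meta in zip(texts, metadatas):
            self._rows.append((text, meta, embed_text(text)))

    def similarity_search(self, query: str, k: int = 4):
        q = embed_text(query)
        ranked = sorted(self._rows, key=lambda r: cosine(q, r[2]), reverse=True)
        return [(text, meta) for text, meta, _ in ranked[:k]]


store = ToyTextVectorStore()
store.add_texts(
    ["sunny beach with palm trees", "snowstorm over a mountain road"],
    metadatas=[
        {"source": "https://example.com/beach.jpg"},  # placeholder URL
        {"source": "https://example.com/snow.jpg"},  # placeholder URL
    ],
)
hits = store.similarity_search("snowstorm on the road", k=1)
```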
