# Multimodal Search and Conditional Image Generation
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Multimodal_Search_and_Conditional_Image_Generation.ipynb)

## Introduction

In this notebook we demonstrate how to implement text-to-image and image-to-image search, which lets you retrieve semantically relevant images. We then use the retrieved images to condition the generation of new images with diffusion models.

We will cover:
1. How we can use multimodal embedding models like JinaCLIP to perform multimodal search.
2. How we can perform conditional image generation using the FLUX models.

## Install relevant libraries

In [11]:
!pip install -Uqq duckduckgo_search together transformers


In [2]:
from duckduckgo_search import DDGS

def search_images(keywords, max_images=10):
    """
    Search for images based on given keywords and return a list of image URLs.
    Args:
        keywords (str): The search terms to use for finding images.
        max_images (int, optional): The maximum number of images to retrieve. Defaults to 10.
    Returns:
        list: A list of URLs of the images found based on the search keywords.
    """

    results = DDGS().images(keywords, max_results=max_images)

    return [item['image'] for item in results]

In [26]:
# Our function will allow us to search the web for images
search_images('family picture', max_images=3)

['http://leapphotography.com/blog/wp-content/uploads/2017/04/family-portrait-studio-boise-idaho-003.jpg',
 'http://3.bp.blogspot.com/_b_LWsdjxDUI/TKJuV4BG7EI/AAAAAAAAEkM/qcdcqXqtZLc/s1600/Family+PIctures_53.jpg',
 'https://www.uniqueideas.site/wp-content/uploads/img_6349-1087x1600-pixels-family-group-photos-pinterest-1.jpg']

We use the following code to obtain a variety of image links we can index.

```python
# Let's create a small dataset of 12 images covering diverse topics
searches = 'forest', 'dog', 'strawberry field', 'family picture'

from time import sleep

links = []

for o in searches:
    links += search_images(o, max_images=3)
    sleep(1)
```

In [None]:
# Below we provide the links obtained when this code was run, so the notebook stays reproducible even if web search results change

links = ['https://get.pxhere.com/photo/tree-forest-path-plant-hiking-trail-meadow-sunlight-rustic-solitude-recreation-green-jungle-scenic-peaceful-usa-relaxing-trees-leaves-outdoors-woods-spruce-vegetation-rainforest-deciduous-ferns-grove-woodland-habitat-ecosystem-north-carolina-biome-old-growth-forest-natural-environment-geographical-feature-woody-plant-temperate-broadleaf-and-mixed-forest-temperate-coniferous-forest-riparian-forest-elk-knob-state-park-1172973.jpg',
 'https://wallup.net/wp-content/uploads/2019/09/952492-forest-trees-nature-landscape-tree.jpg',
 'https://get.pxhere.com/photo/tree-nature-forest-path-wilderness-plant-trail-sunlight-leaf-green-jungle-autumn-ridge-trees-outdoors-woods-rainforest-deciduous-woodland-habitat-ecosystem-biome-old-growth-forest-natural-environment-woody-plant-temperate-broadleaf-and-mixed-forest-temperate-coniferous-forest-1170198.jpg',
 'https://get.pxhere.com/photo/puppy-dog-animal-canine-pet-young-mammal-friend-golden-retriever-happy-vertebrate-funny-domestic-adorable-cub-dog-breed-retriever-pup-doggy-puppies-doggie-pedigree-cute-dog-young-dog-yellow-dog-dog-face-lazy-dog-dog-nose-young-dogs-dog-like-mammal-dog-breed-group-dog-crossbreeds-norfolk-terrier-tibetan-spaniel-nova-scotia-duck-tolling-retriever-female-dog-whelp-funny-dogs-yellow-dogs-dogs-types-dog-photos-smile-dogs-dog-cute-dog-funny-dog-laughing-dog-girl-1387994.jpg',
 'https://wallup.net/wp-content/uploads/2018/10/06/364377-puppies-puppy-baby-dog-dogs-41.jpg',
 'https://get.pxhere.com/photo/puppy-dog-animal-cute-canine-pet-fur-mammal-hound-close-up-nose-snout-ears-vertebrate-beagle-resting-adorable-dog-breed-street-dog-dog-like-mammal-816169.jpg',
 'https://images.pexels.com/photos/7534234/pexels-photo-7534234.jpeg?auto=compress&cs=tinysrgb&h=750&w=1260',
 'https://i.pinimg.com/originals/9e/dd/56/9edd568dd03d58a00d50adb2ec040208.jpg',
 'https://images.pexels.com/photos/7707012/pexels-photo-7707012.jpeg?auto=compress&cs=tinysrgb&h=750&w=1260',
 'http://leapphotography.com/blog/wp-content/uploads/2017/04/family-photographer-boise-professional-portrait-photographers-002.jpg',
 'https://icmedonline.com/blog/wp-content/uploads/2017/06/MensHealth.jpeg',
 'https://simpleasthatblog.com/wp-content/uploads/2014/12/familyphotosIG.jpg']

### CLIP Architecture Overview

CLIP (Contrastive Language-Image Pre-training) is designed to understand the relationship between images and text by learning joint representations.

<img src="../images/CLIP.png" width="1000">

**Key Components:**
- **Image Encoder**: Processes images into feature vectors
- **Text Encoder**: Processes text into feature vectors  
- **Joint Embedding Space**: Both encoders map to the same vector space
- **Contrastive Learning**: Trained to align matching image-text pairs
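
To make the contrastive objective above concrete, here is a minimal NumPy sketch of a CLIP-style symmetric loss on toy vectors. The function names, the fixed temperature of 0.07, and the random toy data are assumptions for illustration; this is not the training code of CLIP or JinaCLIP, where the temperature is typically a learned parameter.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Row i of image_emb and row i of text_emb are assumed to be a matching pair
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    logits = image_emb @ text_emb.T / temperature  # (N, N) similarity matrix
    diag = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2

# Toy data (hypothetical): "images" that nearly coincide with their captions
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))
aligned_images = text_emb + 0.01 * rng.normal(size=(4, 8))
shuffled_images = aligned_images[::-1]  # every image is paired with the wrong caption

aligned_loss = clip_contrastive_loss(aligned_images, text_emb)
shuffled_loss = clip_contrastive_loss(shuffled_images, text_emb)
print(aligned_loss, shuffled_loss)
```

Matching pairs sit on the diagonal of the similarity matrix, so the loss is low when each image is closest to its own caption and rises when the pairs are mismatched.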

In [37]:
from transformers import AutoModel

# Initialize the model
model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)

# Encode the images
image_embeddings = model.encode_image(links)  # also accepts PIL.Image objects, local filenames, and data URIs

image_embeddings.shape

(12, 768)

`image_embeddings` is now a NumPy array that serves as our vector index: it holds one 768-dimensional vector for each of our 12 images.

## Image Retrieval Function

Below we implement a retrieval function that will embed an image or text query and return the most semantically relevant image.

Since JinaCLIP is a multimodal model, it can accept either text or images as input, so our function needs to handle both text and image queries.

In [40]:
import numpy as np

def retrieve_image(query, query_type, index):
    """
    Retrieve the index of the most similar image based on a query.
    Args:
        query (str or PIL.Image): The query input, which can be a text string,
                                  a PIL image, an image URL, or a local filename.
        query_type (str): The type of the query, either 'text' or 'image'.
        index (np.ndarray): The image embeddings to search over, one row per image.
    Returns:
        int: The index of the most similar image based on the query.
    Raises:
        ValueError: If the query_type is not 'text' or 'image'.
    """

    if query_type == 'text':
        query_embedding = model.encode_text(query)
    elif query_type == 'image':
        query_embedding = model.encode_image(query) # Accepts PIL.image, local filenames, dataURI
    else:
        raise ValueError("query_type must be 'text' or 'image'")

    similarities = query_embedding @ index.T # Similarity between the query embedding and every image embedding

    return np.argmax(similarities)
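
The function above returns only the single best match. As a hedged variation, the sketch below returns the top `k` matches with their cosine similarity scores. The helper name `top_k_matches` and the toy 2-D vectors are assumptions for illustration; the helper normalizes its inputs itself rather than relying on the encoder having done so.

```python
import numpy as np

def top_k_matches(query_embedding, index, k=3):
    """Return (indices, scores) of the k rows of `index` most similar to the query."""
    # Flatten so that both (d,) and (1, d) query shapes are accepted
    query = np.asarray(query_embedding, dtype=np.float64).reshape(-1)
    index = np.asarray(index, dtype=np.float64)
    # Normalize both sides so the dot product equals cosine similarity
    query = query / np.linalg.norm(query)
    index = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = index @ query
    k = min(k, len(scores))
    top = np.argsort(-scores)[:k]  # indices sorted by descending similarity
    return top, scores[top]

# Toy 2-D "embeddings" stand in for the real (12, 768) array
toy_index = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
indices, scores = top_k_matches(np.array([2.0, 0.1]), toy_index, k=2)
print(indices, scores)
```

In this notebook the same helper could be called as `top_k_matches(model.encode_text('family pics'), image_embeddings)`, assuming the encoder returns one vector for a single string.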

Below we perform text2image retrieval:

In [41]:
retrieved_image = links[retrieve_image(query = 'family pics', query_type = 'text', index = image_embeddings)]

This image is the most semantically similar to the text query: `family pics`

In [42]:
import IPython.display as display

display.Image(url=retrieved_image, width=500)

## Conditional Image Generation Using Diffusion Models

We will use the image retrieved above to generate a holiday-card cartoon version of it!

In [44]:
from together import Together

client = Together(api_key = 'TOGETHER_API_KEY')

def generate_image(image_prompt, retrieved_image, model = "black-forest-labs/FLUX.1-depth"):

    imageCompletion = client.images.generate(
        model = model,
        width=1024,
        height=768,
        steps=28,
        prompt = image_prompt,
        image_url = retrieved_image,
    )

    return imageCompletion.data[0].url
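
The client above is created with a placeholder string for the API key. One common alternative, sketched here under the assumption that the key is stored in an environment variable named `TOGETHER_API_KEY`, is to read it at runtime so the key never appears in the notebook. The helper `load_api_key` is hypothetical and not part of the `together` library.

```python
import os

def load_api_key(env_var="TOGETHER_API_KEY"):
    """Read the API key from an environment variable, failing loudly if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before creating the client.")
    return key

# Usage with the client from this notebook:
# client = Together(api_key=load_api_key())
```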

In [45]:
generated_image = generate_image("Create a cute holiday cartoon version of this image.", retrieved_image = retrieved_image)

In [46]:
display.Image(url=generated_image, width=500)

## Image to Image Search and Conditional Generation

Next we will use an image as the query and then use the semantically relevant retrieved image to generate another holiday cartoon image!

In [47]:
# Search the internet for a new image
new_image = search_images('cute pet dog running', max_images=1)[0]

display.Image(url=new_image, width=500)

In [48]:
# Use the image above as a query to retrieve the most similar image from our dataset of 12 images
image_2_image = links[retrieve_image(query=new_image, query_type='image', index=image_embeddings)]

display.Image(url=image_2_image, width=500)

In [49]:
# Generate a holiday cartoon version of the retrieved image
generated_image_2 = generate_image(image_prompt="Create a cute holiday cartoon version of this image.", retrieved_image = image_2_image)

In [50]:
display.Image(url=generated_image_2, width=500)

Check out how you can generate images conditioned on input images [here](https://www.together.ai/blog/flux-tools-models-together-apis-canny-depth-image-generation)!