# License and Attribution

This notebook was developed by Emilio Serrano, Full Professor at the Department of Artificial Intelligence, Universidad Politécnica de Madrid (UPM), for educational purposes in UPM courses. Personal website: https://emilioserrano.faculty.bio/

📘 License: Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA)

You are free to: (1) Share: copy and redistribute the material in any medium or format; (2) Adapt: remix, transform, and build upon the material.

Under the following terms: (1) Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made; (2) NonCommercial: You may not use the material for commercial purposes; (3) ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

🔗 License details: https://creativecommons.org/licenses/by-nc-sa/4.0/

# Image Generation and Multimodal LLMs

The goal of this notebook is to bridge the gap between the world of text and the world of images and multimodal understanding.

We will cover:
* **Image Generation with Diffusion Models**: We'll uncover how we can generate novel images from text prompts using the popular Stable Diffusion model.
* **Specialized Multimodal AI**: We'll interact with a Visual Question Answering (VQA) model, a "specialist" trained for a single task: answering direct questions about an image.
* **Multimodal LLMs**: We'll explore the distinction between specialists and powerful "generalist" models like those from Google (Gemini) or those accessible via the Groq API (LLaVA). We will see how to interact with them using tools like LangChain.



# Setting up the Environment

First, let's install the necessary libraries from Hugging Face. We'll need diffusers for image generation and transformers for the multimodal pipeline. We also install accelerate to ensure efficient model loading.

- diffusers: A library by Hugging Face that provides state-of-the-art pretrained diffusion models and makes it incredibly easy to use them.

- transformers: The go-to library for all things related to Transformer models. We'll use it for our Visual Question Answering pipeline.

- torch: The underlying deep learning framework.

- accelerate: A library that simplifies running PyTorch code on any infrastructure (CPU, GPU, etc.).

In [None]:
# This command installs the necessary Python libraries for our notebook.
# - 'diffusers' is Hugging Face's library for diffusion models (like Stable Diffusion).
# - 'transformers' is the core Hugging Face library for models like BERT, GPT, and our VQA model.
# - 'accelerate' helps to load and run models efficiently, especially on GPUs.
# - 'langchain' and 'langchain-groq' let us call multimodal LLMs hosted on Groq (used in Part 3).
# The '-q' flag stands for "quiet", which means it will install without too much output.

!pip install diffusers transformers accelerate torch langchain langchain-groq -q

print("✅ Libraries installed successfully!")

Let us also check for a GPU, needed to speed up the generation of images.

In [None]:
import torch

# Check for GPU availability
if torch.cuda.is_available():
    print(f"GPU detected: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected.")
    print("If you are using Google Colab, please go to 'Runtime' > 'Change runtime type' and select 'GPU' as the hardware accelerator.")

We will use Groq to run multimodal LLMs. Make sure you have your API key ready.

In [None]:
from google.colab import userdata

# Read the API key from Google Colab secrets
api_key = userdata.get('GROQ_API_KEY')

if not api_key:
    print("🛑 Groq API Key not found. Please make sure to set it up.")
else:
    print("✅ Groq API Key configured.")

# Part 1: Image Generation with Diffusion Models
If LLMs like GPT-3 can generate coherent text, what is the equivalent for images? The current answer is largely "Diffusion Models". Stable Diffusion and DALL-E 2 are based on this family of models, and Midjourney is widely believed to be as well.



## The Core Idea: What are Diffusion Models?
Imagine you have a clear image. You start adding a tiny amount of noise to it, step by step, until all you have is pure static. This is the **forward process**. It's easy and mathematically defined.

Now, what if you could learn to reverse this process? What if you could train a neural network to take the noisy static and, step by step, remove the noise until a clear image emerges? That's the **reverse process**, and it's the core of how diffusion models work.

The model doesn't just remove random noise; it's **guided by your text prompt**. This guidance relies on cross-attention, a variant of the attention mechanism you know from Transformers: at each denoising step, the model "pays attention" to the words in your prompt so that the emerging image matches the text.
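
The forward process described above has a convenient closed form: a noisy version at any step can be computed directly from the clean image. The following is a minimal, dependency-free sketch of that idea. It assumes the linear noise schedule from the original DDPM paper (1000 steps, variances from 1e-4 to 0.02); the function names and the four-value toy "image" are made up for illustration and are not part of diffusers.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, evenly spaced (an assumed, commonly used schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t: how much of the original signal survives."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(pixels, alpha_bar, rng):
    """Jump straight to step t: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
             for x, e in zip(pixels, noise)]
    return noisy, noise


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
image = [0.8, -0.3, 0.5, 0.1]  # toy "image": four pixel values scaled to [-1, 1]

for t in (0, 499, 999):
    noisy, _ = add_noise(image, alpha_bars[t], rng)
    rounded = [round(v, 2) for v in noisy]
    print(f"t={t:4d}  signal kept={math.sqrt(alpha_bars[t]):.3f}  noisy={rounded}")
```

At the first step almost all of the signal is kept, while at the last step the result is essentially pure Gaussian noise, which is the "static" the reverse process starts from.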


For a deeper understanding, check  [The Illustrated Stable Diffusion](https://jalammar.github.io/illustrated-stable-diffusion/) post by Jay Alammar.



## Generating images with diffusers

* We will use the *StableDiffusionPipeline* from the Hugging Face diffusers library. This pipeline handles all the complexity for us.

* We'll load the pre-trained `stable-diffusion-v1-5` model from the Hugging Face Hub.

* Finally, we will call the pipeline with a text prompt to generate an image.


In [None]:
import torch
from diffusers import StableDiffusionPipeline

# --- 1. Setup the Pipeline ---
# We're loading a pre-trained Stable Diffusion model.
# "runwayml/stable-diffusion-v1-5" is the identifier of the model on the Hugging Face Hub.
# This model will be downloaded from the hub, which might take a minute.

# Use the GPU when available; otherwise fall back to the CPU (much slower).
device = "cuda" if torch.cuda.is_available() else "cpu"

# torch.float16 halves memory use on the GPU, which is helpful in Colab.
# On the CPU we keep float32, since half precision is poorly supported there.
dtype = torch.float16 if device == "cuda" else torch.float32

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=dtype
)

# Move the pipeline to the selected device.
pipe = pipe.to(device)

print(f"✅ Stable Diffusion pipeline loaded successfully on {device}!")



Here we use the pipeline to generate the image with a prompt. Feel free to change the prompt and run the cell again to create your own images!

In [None]:

# --- 2. Define the Prompt and Generate ---
# The prompt is the text that guides the image generation.
# Think of it like the input you give to ChatGPT.
prompt = "A high-quality photograph of an astronaut riding a horse on Mars"

# We run the pipeline with our prompt.
# The pipeline returns an object containing the generated images.
# We access the first (and only) image with .images[0]
image = pipe(prompt).images[0]

# --- 3. Display the Image ---
print("\nGenerated Image for prompt: '{}'".format(prompt))
display(image) # 'display()' is a handy function in Colab/Jupyter to show images.

## The Training Process  
We are not going to train a model, as this requires massive datasets (like LAION-5B, with 5 billion image-text pairs) and huge computational resources. However, it's crucial to understand the process.

1. **Get a Dataset**: You need a vast number of images with corresponding text descriptions.
2. **The Forward Process**: Take an image from the dataset. Add a random amount of noise to it. You now have a `(noisy_image, text_description)` pair.
3. **The Model's Goal**: The core of the model is typically a U-Net architecture (common in image segmentation). You feed this model the `noisy_image` and the `text_description`. The model's job is to predict the noise that was added to the image.
4. **Calculate Loss**: You compare the noise predicted by the model with the actual noise you added. The difference is the loss.
5. **Update Weights**: You use backpropagation to update the model's weights to minimize this loss, just like in any other neural network.

By repeating this process billions of times, the model becomes incredibly good at predicting and removing noise, guided by a text prompt.
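
Steps 2 to 4 are often written compactly as follows. This is the standard noise-prediction objective from the diffusion literature, given here as a sketch. The symbols are notation introduced only for this illustration: $x_0$ is the clean image, $c$ the text description, $t$ a randomly chosen noise level, $\bar{\alpha}_t$ the fraction of signal kept at that level, and $\epsilon_\theta$ the U-Net. Note that Stable Diffusion applies this objective to a compressed latent representation of the image rather than to raw pixels.

$$
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
$$

$$
\mathcal{L} = \mathbb{E}_{x_0,\,c,\,t,\,\epsilon}\left[\,\lVert \epsilon - \epsilon_\theta(x_t, t, c) \rVert^2\,\right]
$$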

**Fine-Tuning**: What if you want to teach the model a new style or a specific object (like your face)? You don't need to train from scratch. You can **fine-tune** it. This involves continuing the training process on a small, specialized dataset (e.g., 15-20 images of the new object/style) for a much shorter time. This adjusts the model's weights to become an expert in your specific concept.

# Part 2: Multimodal (Specific) AI
Multimodal AI refers to models that can process and understand information from multiple modalities (types of data) at once, like text, images, audio, etc.  

## The Core Idea
How does a model "see" an image? It uses a Vision Encoder (often a **Vision Transformer**, or ViT) to convert the image's pixels into a meaningful vector representation (an embedding). This is analogous to how BERT's encoder turns a sentence into a set of embeddings.

For a task like **Visual Question Answering** (VQA), the typical process is as follows:

- The image is converted into image embeddings by a vision encoder.

- The question (text) is converted into text embeddings by a language encoder.

- These two sets of embeddings are fused together and processed by a multimodal fusion layer, which learns the relationships between them.

- Finally, a decoder (or a classification head) generates the answer based on this fused representation.


You can learn more about the ViT architecture in the original paper by Dosovitskiy et al. (2020): https://arxiv.org/abs/2010.11929. In summary, it is a BERT-like encoder-only Transformer applied to a sequence of image patches. The paper pre-trains it mainly with supervised image classification and also explores self-supervised *masked patch prediction*. A special `[CLS]` token allows the model to compress all the information relevant for predicting the image label into one vector.

## Visual Question Answering (VQA)


We'll use a pipeline from the Hugging Face transformers library, a high-level abstraction that simplifies using models for specific tasks. By specifying the task as visual-question-answering without choosing a particular model, the pipeline will automatically download and load a default pre-trained model suited for this task (the `vilt-b32-finetuned-vqa` model referred to in Part 3).



In [None]:
import requests
from PIL import Image
from io import BytesIO
from transformers import pipeline

# --- 1. Load Image ---
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
response = requests.get(url)
response.raise_for_status()  # Stop early if the download failed
image = Image.open(BytesIO(response.content))

print("✅ Image loaded successfully.")
display(image)

# --- 2. Setup Pipeline and Ask Question ---
vqa_pipeline = pipeline("visual-question-answering")

question = "How many cats are there?"
result = vqa_pipeline(image=image, question=question)

# --- 3. Display Result ---
print("\nQuestion: '{}'".format(question))
print("Answer:", result[0]['answer'])
print("Confidence Score:", result[0]['score'])

## The Training Process


Many powerful multimodal models, such as [CLIP](https://arxiv.org/abs/2103.00020) (Contrastive Language-Image Pretraining), are first pre-trained on large datasets of image-text pairs using a self-supervised objective. CLIP, developed by OpenAI in 2021, popularized the use of **Contrastive Learning** to align visual and textual representations:

* The model is given a batch of images and their corresponding captions.
* It creates embeddings for all images and all texts.
* The model's goal is to learn to pull the embedding of a correct `(image, text)` pair closer together in the embedding space, while pushing incorrect pairs further apart.
* This forces the model to learn a shared representation space where "a photo of a dog" (text) is semantically close to an actual picture of a dog (image).
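
The contrastive objective described in the list above can be sketched in a few lines. This is a simplified, dependency-free illustration of a CLIP-style symmetric loss, not CLIP's actual implementation: the two-dimensional embeddings are made up, and the temperature of 0.07 is an assumed value. The correct text for image `k` is text `k`, so the "right answers" sit on the diagonal of the similarity matrix.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so that dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def similarity_matrix(image_embs, text_embs, temperature=0.07):
    """Cosine similarity between every image and every text, divided by a temperature."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
            for img in images]


def cross_entropy_on_diagonal(logits):
    """Mean cross-entropy where the correct class for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        largest = max(row)  # subtract the max for numerical stability
        log_sum_exp = largest + math.log(sum(math.exp(x - largest) for x in row))
        total += log_sum_exp - row[k]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs):
    """Average of the image-to-text and text-to-image losses."""
    logits = similarity_matrix(image_embs, text_embs)
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(transposed))


image_embs = [[1.0, 0.1], [0.0, 1.0]]     # toy embeddings for two images
aligned_texts = [[0.9, 0.0], [0.1, 1.0]]  # caption k points the same way as image k
swapped_texts = [aligned_texts[1], aligned_texts[0]]  # captions assigned to the wrong images

print("loss with matching pairs:  ", contrastive_loss(image_embs, aligned_texts))
print("loss with mismatched pairs:", contrastive_loss(image_embs, swapped_texts))
```

The loss is close to zero when each image sits next to its own caption and large when the pairs are swapped, which is exactly the pressure that pulls correct pairs together and pushes incorrect ones apart during training.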


After pre-training with Contrastive Learning, the model can be fine-tuned for a specific task like VQA:

1. **Get a Task-Specific Dataset**: For VQA, you need a dataset with images, questions, and ground-truth answers (e.g., the VQAv2 dataset).

2. **Add a "Head"**: You take a pre-trained multimodal model (like one trained with CLIP) and add a new final layer, called a "question-answering head". This head is often a simple linear layer that will output the final answer.

3. **Train on the Task**: You feed the model an `(image, question)` pair and train it to output the correct `answer`. The loss is calculated based on how far the model's prediction is from the ground-truth answer.

4. **Update Weights**: The error is backpropagated, but you might only update the weights of the new "head", or you might "unfreeze" the whole model and let all the weights be updated slightly. This second approach is full fine-tuning.
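
Steps 2 and 3 can be made concrete with a toy "question-answering head". The sketch below treats VQA as classification over a fixed list of candidate answers, which is why a VQA pipeline can return both an answer and a score. Everything here is hypothetical and chosen for illustration: the five-word answer vocabulary (real VQA heads use a few thousand answers), the fused vector, and the weights.

```python
import math

ANSWER_VOCAB = ["yes", "no", "1", "2", "3"]  # hypothetical answer set


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def answer_head(fused, weights, bias):
    """A linear layer followed by softmax: one probability per candidate answer."""
    logits = [sum(f * w for f, w in zip(fused, row)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


def cross_entropy(probs, target_index):
    """Training loss: small when the ground-truth answer gets high probability."""
    return -math.log(probs[target_index])


fused = [0.2, -1.0, 0.7]  # made-up output of the fusion layer for one (image, question) pair
weights = [               # one row of made-up weights per candidate answer
    [0.1, 0.0, 0.3],
    [0.0, 0.5, -0.2],
    [0.4, -0.1, 0.0],
    [1.0, -1.2, 0.9],
    [0.0, 0.2, 0.1],
]
bias = [0.0] * len(ANSWER_VOCAB)

probs = answer_head(fused, weights, bias)
best = max(range(len(probs)), key=probs.__getitem__)
print("Answer:", ANSWER_VOCAB[best], "| score:", round(probs[best], 3))
print("Loss if the ground truth is '2':", round(cross_entropy(probs, ANSWER_VOCAB.index("2")), 3))
print("Loss if the ground truth is 'no':", round(cross_entropy(probs, ANSWER_VOCAB.index("no")), 3))
```

During fine-tuning, backpropagation would adjust `weights` and `bias` (and optionally the encoders behind `fused`) to lower this loss across the whole dataset.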

This two-step process (pre-training on a general task, fine-tuning on a specific task) is very similar to how models like BERT are first pre-trained on Masked Language Modeling and then fine-tuned for tasks like text classification or question answering.

**Note**: This is a general framework and starting point. In practice, training multimodal AI can vary widely depending on the model architecture, dataset availability, and specific application. Not all multimodal systems follow this exact process.

**Note 2**: Contrastive Learning echoes ideas used in Word2Vec, where related pairs are pulled together and unrelated pairs are pushed apart in the embedding space, but applies them to multimodal tasks with modern deep learning techniques.



# Part 3: Multimodal (Generalist) LLMs

The VQA model was a specialist. Now, let's discuss the general-purpose multimodal models that power today's most advanced AI assistants.




## The Key Difference: Specialist vs. Generalist

So far, we have used a specialized model for Visual Question Answering. It's excellent at its one job. However, the current cutting edge of AI is in large, general-purpose multimodal models. Let's clarify the distinction.

**Specialist Multimodal Models** (like the `vilt-b32-finetuned-vqa` we used):
- **Analogy**: Think of this as a highly-trained radiologist. They can look at an X-ray (the image) and answer very specific questions ("Is there a fracture?"). But you wouldn't ask them to write a poem about the X-ray or tell you a story about the patient.
- **Function**: Designed and fine-tuned for a single task. The output is constrained, often a single word or a short phrase from a predefined set of possible answers.
- **Architecture**: They typically use two separate encoders (one for vision, one for text) and a "fusion" module to combine their knowledge and produce an answer.

**Multimodal LLMs** (like Google's Gemini, OpenAI's GPT-4o, LLaVA):
- **Analogy**: This is a brilliant general practitioner with expertise in nearly every field. You can show them the X-ray and not only ask "Is there a fracture?", but also "Can you explain this to me in simple terms?", "What are the likely next steps for treatment?", or "Write a short, hopeful note to the patient based on this image."
- **Function**: They are fundamentally LLMs that have been augmented to accept images (and other modalities) as part of their input. Their output is flexible, generative text. They can reason, describe, create, and converse about the image.
- **Architecture**: The core is the LLM. The image is processed by a vision encoder into a series of embeddings, which are then fed into the LLM as if they were special "word" tokens. The LLM sees a sequence of both image tokens and text tokens and generates a text response.

This table summarizes the main differences:


| Aspect | Specialist Models (e.g., CLIP, ViLT-VQA) | Generalist Multimodal LLMs (e.g., Gemini, GPT-4o) |
|--------|------------------------------------------|---------------------------------------------------|
| **Analogy** | Highly-trained radiologist who can analyze X-rays but won't write poetry about them | Brilliant general practitioner with expertise across fields |
| **Primary Function** | Single, specific task (classification, VQA, retrieval) | General-purpose reasoning, conversation, and content generation |
| **Output Type** | Constrained (single word, short phrase, classification score) | Flexible, generative text of any length |
| **Architecture** | Dual encoders + fusion module | LLM core + vision encoder + projection layer |
| **Training Focus** | Task-specific fine-tuning | Instruction-following and general reasoning |
| **Use Cases** | Image classification, similarity search, specific VQA | Conversational AI, complex reasoning, creative tasks |




## How Generalists Work: The LLM at the Core

The breakthrough of generalist models is that they teach a Large Language Model, an expert in text, grammar, and reasoning, a new skill: how to read images. The architecture to achieve this generally involves two main components connected by a "bridge."

- **The Vision Encoder**: This is the "eye" of the system. Its only job is to look at an image and convert its pixels into a meaningful numerical format (embeddings).

- **The Large Language Model (LLM)**: This is the "brain." It's the same kind of powerful language model you're already familiar with (like GPT-3, Gemini, Llama, Mistral, etc.). It receives the numerical representation of the image and the user's text prompt to perform reasoning and generate a textual response.

The general data flow looks like this:

Image → Vision Encoder → Sequence of Image Embeddings → Adapter/Bridge → LLM Input + Text Prompt Input → LLM → Text Output

Therefore, multimodal LLMs treat images as just another part of the input language. Imagine the input to an LLM: `[token1, token2, token3, ...]`. For a multimodal LLM, the input becomes: `[img_tok1, img_tok2, ..., img_tok_N, text_tok1, text_tok2, ...]`. The model's reasoning and generation capabilities, which you are familiar with from text-only LLMs, can now be applied to the concepts and objects present in the image tokens. This is what allows for the incredible flexibility we see in multimodal models such as ChatGPT and Gemini.



## The Key: How "Image Tokens" are Created

The term "image token" is a useful abstraction. In reality, the model creates a sequence of embeddings (vectors) that are dimensionally compatible with the LLM's text embeddings. This is usually done with a **Vision Transformer (ViT)** (see Part 2) followed by a projection layer.

Here is the step-by-step process of creating Image Tokens:

* Step 1: **Image Patching**. The Vision Encoder doesn't look at the whole image at once. Instead, it slices the image into a grid of smaller, fixed-size squares called patches. Think of it like cutting a photograph into a mosaic of puzzle pieces. This allows the model to process the image as a sequence, similar to how it processes a sequence of words.

* Step 2: **Embedding the Patches**. Each patch is then "embedded." It's fed through a neural network that converts the raw pixels of that small square into a numerical vector, or an embedding. This vector represents the visual content of that specific patch in a high-dimensional space. At this stage, you have a sequence of vectors, one for each piece of the original image.

* Step 3: The **Projection Layer** (The "Bridge").  This is the most critical step for making the two systems compatible. The embeddings produced by the Vision Encoder are in a "visual space," which the LLM doesn't understand. The LLM understands a "language space."  A small, trainable neural network, often called a projection layer or an adapter, acts as a translator. Its sole purpose is to take the sequence of image patch embeddings and convert them into a sequence of embeddings that live in the exact same dimensional space as the LLM's word embeddings.

After this step, the image has been transformed into a sequence of vectors that the LLM can read and process just as if they were embeddings from a sentence. The LLM can now apply its attention mechanisms across both the text tokens from the prompt and these new "image tokens."
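
The three steps can be traced with a shape-only sketch. It is deliberately simplified and dependency-free: the image is a tiny grayscale grid, the weights are random and untrained, and the Transformer layers that a real ViT applies between steps 2 and 3 are omitted. All names and sizes are illustrative assumptions; for comparison, a ViT working on a 224x224 image with 16x16 patches would produce 14 x 14 = 196 patches.

```python
import random


def split_into_patches(image, patch_size):
    """Step 1: cut an H x W grid into flattened patches, left to right and top to bottom."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([image[r][c]
                            for r in range(top, top + patch_size)
                            for c in range(left, left + patch_size)])
    return patches


def random_matrix(rows, cols, rng):
    """Untrained stand-in for learned weights, shaped (input size, output size)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(vectors, weights):
    """Apply one linear layer without bias to every vector in a sequence."""
    return [[sum(x * w for x, w in zip(vec, column)) for column in zip(*weights)]
            for vec in vectors]


rng = random.Random(0)
IMAGE_SIZE, PATCH_SIZE = 8, 4  # toy sizes
VISION_DIM, LLM_DIM = 6, 10    # toy embedding widths

image = [[rng.random() for _ in range(IMAGE_SIZE)] for _ in range(IMAGE_SIZE)]

patches = split_into_patches(image, PATCH_SIZE)                                      # step 1
patch_embeddings = linear(patches, random_matrix(PATCH_SIZE ** 2, VISION_DIM, rng))  # step 2
image_tokens = linear(patch_embeddings, random_matrix(VISION_DIM, LLM_DIM, rng))     # step 3

# Stand-in for the embeddings of a three-word text prompt
text_tokens = [[rng.uniform(-0.1, 0.1) for _ in range(LLM_DIM)] for _ in range(3)]
llm_input = image_tokens + text_tokens  # one sequence the LLM can attend over

print("patches:", len(patches), "x", len(patches[0]))
print("patch embeddings:", len(patch_embeddings), "x", len(patch_embeddings[0]))
print("LLM input:", len(llm_input), "x", len(llm_input[0]))
```

The key point is the last line: after projection, image tokens and text tokens have the same width, so they can be concatenated into a single input sequence.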





## The Fuel: Training Data and Process

The model learns to connect vision and language through a two-stage training process using massive datasets.

* Pre-training (Learning to See). The model is first trained on enormous datasets of paired images and text descriptions. A famous example is the LAION dataset, which contains billions of image-alt-text pairs scraped from the web. In this stage, the model's goal is to align the two modalities. It learns that the visual information from an image of a dog (processed by the Vision Encoder) should correspond to the semantic meaning of the words "a photo of a dog" (processed by the text embedder).

* Instruction Fine-Tuning (Learning to Obey). After pre-training, the model knows what's in an image, but it isn't yet a helpful assistant. The second stage uses curated, high-quality datasets of (image, instruction, desired_output) triplets. This teaches the model to follow commands. For example:

 - **Image**: A picture of a birthday party.
 - **Instruction**: "Describe what is happening in this image."
 - **Desired Output**: "This image shows a group of people celebrating a birthday party. There is a cake with candles on the table..."

This fine-tuning stage is what turns a descriptive model into a conversational and reasoning agent.

**Note**: This is a common "standard recipe" for multimodal models, but the exact process can vary depending on the architecture, available data, and training goals.



## Real-World Examples

Several prominent models follow this architecture:

Open-Source Models include:
- **[LLaVA (Large Language and Vision Assistant)](https://arxiv.org/abs/2304.08485)**: This is a classic, open-source example from 2023. It explicitly uses a pre-trained Vision Encoder from CLIP (a ViT) and a simple projection layer (an MLP) to feed image features into an instruction-tuned LLM like Vicuna. Its architecture is a textbook implementation of the process described above.

- **IDEFICS (by Hugging Face)**: This is another open model that builds on this principle. It's designed to handle interleaved image and text sequences, making it effective for tasks like visual storytelling or analyzing documents with multiple images. It still uses a vision encoder and an adapter to bridge the modalities.

Commercial Models include **GPT-4o & Google's Gemini**. These are state-of-the-art, closed-source models that use proprietary vision encoders and LLMs and are trained on vast, private datasets. Their exact architectures have not been published, but they are generally understood to follow the same fundamental paradigm on a much larger scale: patching an image, embedding the patches, and projecting them into the LLM's language space.



## Accessing Multimodal GenAI Programmatically

You typically access these massive models through an API. LangChain provides a universal interface to interact with many different LLM providers, making your code cleaner and more portable.

Let's look at how you would use `langchain_groq` to interact with `llama-4-maverick-17b-128e-instruct`, a powerful open-source multimodal LLM available through Groq's fast inference engine.

The model will be instructed to analyze the provided image and respond exclusively with a JSON object containing two specific fields:
* "description": A detailed, accurate description of what is depicted in the image.
* "funny_story": A brief, humorous story inspired by the content of the image.

The image is sent as a `Base64` string, which embeds the actual image data directly into the body of the API request instead of simply providing a URL. A Base64 string is a way of encoding binary data (like an image file) into ASCII text characters.
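
To see what this encoding does, here is a small round trip using only the standard library. The six bytes are the typical opening bytes of a JPEG (JFIF) file, used here as a stand-in for a full image; the variable names are illustrative.

```python
import base64

raw_bytes = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10])  # typical first bytes of a JPEG file

# Binary data -> ASCII text (every 3 bytes become 4 characters)
encoded = base64.b64encode(raw_bytes).decode("ascii")

# The data URL format expected in the "image_url" field of the request
data_url = f"data:image/jpeg;base64,{encoded}"

# ASCII text -> the original binary data
decoded = base64.b64decode(data_url.split(",", 1)[1])

print(encoded)                       # /9j/4AAQ
print(len(raw_bytes), len(encoded))  # 6 8
print(decoded == raw_bytes)          # True
```

Because every 3 bytes become 4 characters, a Base64 payload is roughly one third larger than the original file, which is worth keeping in mind when sending large images through an API.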

In [None]:
import base64
import requests
from PIL import Image
from langchain_core.messages import HumanMessage
from langchain_groq import ChatGroq
from IPython.display import display
from io import BytesIO


# 1. Download the image from the URL and encode it in base64
# To explore more images, visit the official COCO dataset website: https://cocodataset.org/
url = "http://images.cocodataset.org/val2017/000000397133.jpg" # man in a kitchen
#url = "http://images.cocodataset.org/val2017/000000039769.jpg" #cats
response = requests.get(url)

# Show image on screen
image = Image.open(BytesIO(response.content))
display(image)

# When you send an image as a Base64 string, you are embedding the actual image data directly into the body of your API request.
# The model receives the image pixels and the text prompt in a single, self-contained package.
image_b64 = base64.b64encode(response.content).decode("utf-8")


# 2. Prepare the multimodal message content
# The content is a list where each item is a dictionary representing a part of the message:
# - The first part is a text instruction for the model.
# - The second part is the image encoded as a data URL inside an object with the key "image_url".
message = HumanMessage(
    content=[
        {
            "type": "text",
            "text": (
                "Please analyze the following image and respond ONLY with a JSON object "
                "with two fields:\n"
                "1. \"description\": A detailed description of the image.\n"
                "2. \"funny_story\": A short, funny story inspired by the image.\n"
                "Do not include any additional text."
            )
        },
        {
            "type": "image_url",
            # The 'image_url' field is an object with a mandatory 'url' key,
            # which must contain the image data as a base64-encoded data URL.
            "image_url": {
                "url": f"data:image/jpeg;base64,{image_b64}"
            }
        }
    ]
)

# 3. Invoke the multimodal model on Groq's API
# 'api_key' was read from Colab secrets in the setup section
chat = ChatGroq(model_name="meta-llama/llama-4-maverick-17b-128e-instruct", groq_api_key=api_key)
response = chat.invoke([message])

# 4. Print the raw JSON response from the model
print(response.content)
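
Even when asked to respond only with JSON, a model may wrap its answer in a Markdown code fence or add a stray sentence. Below is a defensive parsing sketch using only the standard library. `parse_json_reply` is a hypothetical helper, and `sample_reply` is made-up text for illustration, not real output from the model.

```python
import json
import re


def parse_json_reply(text):
    """Extract and parse the JSON object in a reply, ignoring anything around it.

    Takes everything from the first '{' to the last '}', so it assumes the
    reply contains a single JSON object and no braces in surrounding prose.
    """
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in the model reply")
    return json.loads(match.group(0))


fence = "`" * 3  # a Markdown code fence, built this way to keep this example readable
sample_reply = (
    f"{fence}json\n"
    '{"description": "A man in a kitchen.", "funny_story": "He lost the spoon."}\n'
    f"{fence}"
)

parsed = parse_json_reply(sample_reply)
print(parsed["description"])
print(parsed["funny_story"])
```

In the notebook, the same helper could be applied to the model's answer with `parse_json_reply(response.content)`.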




# Conclusions and Next Steps

We have:
* Generated an image from text using a diffusion model.
* Used a specialist AI to answer a direct question about an image.
* Understood the architecture of generalist multimodal LLMs and how to interact with them programmatically.

The skills developed for LLMs (understanding embeddings, attention, and the pre-training/fine-tuning paradigm) are the bedrock for multimodal GenAI.

Next steps

- Explore other Hugging Face pipelines: Try Image-to-Image in diffusers or Image-Captioning in transformers.
- Experiment with different prompts and models from GenAI providers like Groq, OpenAI, or Google.
- Read the original papers for models like [Vision Transformers](https://arxiv.org/abs/2010.11929) (2020), [CLIP](https://arxiv.org/abs/2103.00020) (2021), and [LLaVA](https://arxiv.org/abs/2304.08485) (2023) to understand their innovations firsthand. Also see [The Illustrated Stable Diffusion](https://jalammar.github.io/illustrated-stable-diffusion/) post by Jay Alammar.
