# Intro to Vision and Video Models

This notebook uses vision-language models to classify images, diffusion models to recreate and edit them, and a video generation model to create videos from text prompts.

If you are using AWS SageMaker, ensure you are using the [Conda PyTorch kernel](https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-pytorch.html) that is specifically optimized for PyTorch-based deep learning tasks.


## Vision-Language Model
A [Vision-Language Model (VLM)](https://huggingface.co/blog/vlms) is an AI model that integrates visual and textual information, enabling it to understand and generate insights from both images and text. It combines computer vision (to recognize objects and scenes) with natural language processing, allowing tasks like generating image captions, answering questions about images, and creating visuals from text prompts. 

## Diffusion Model
A [diffusion model](https://huggingface.co/blog/Esmail-AGumaan/diffusion-models) is a generative AI model that creates images by reversing a process of adding noise. It learns to transform random noise into detailed images by progressively removing noise over many steps. Starting with pure noise, the model gradually refines each step until it forms a complete, high-quality image. This approach is highly effective for text-to-image generation and image editing.
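
To make the noising process concrete, here is a minimal sketch of the closed-form forward step used in DDPM-style diffusion models, applied to a short list of numbers instead of an image. The linear noise schedule, the function names, and the toy data are all assumptions for illustration; a real model learns to predict the noise rather than being handed it.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear noise schedule (assumed)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, alpha_bar, noise):
    """Forward step: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, noise)]

def remove_noise(xt, alpha_bar, predicted_noise):
    """Invert the forward step, given an estimate of the noise that was added."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - b * e) / a for x, e in zip(xt, predicted_noise)]

rng = random.Random(0)
x0 = [0.2, -0.5, 0.9, 0.0]                    # toy "image" with four pixels
alpha_bars = make_alpha_bars(num_steps=1000)
noise = [rng.gauss(0.0, 1.0) for _ in x0]

xt = add_noise(x0, alpha_bars[-1], noise)     # almost pure noise at the last step
recovered = remove_noise(xt, alpha_bars[-1], noise)  # uses the true noise as a perfect "prediction"
print([round(v, 4) for v in recovered])
```

Because the true noise is passed back in, the original values are recovered almost exactly. A trained diffusion model only approximates that noise, which is why generation is spread over many small denoising steps.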





## Set up API Key


You also need to purchase an [OpenAI](https://openai.com/) API key and store it in AWS Secrets Manager:
- key name: `api_key`
- key value: <`your openai api key`>
- secret name: `openai`
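
With the names above, Secrets Manager would return the secret as a JSON string containing an `api_key` field. The sketch below parses such a string and fails early if the field is missing; the helper name `extract_api_key` and the placeholder value are hypothetical.

```python
import json

def extract_api_key(secret_string, key_name="api_key"):
    """Parse a Secrets Manager JSON string and return the value stored under key_name."""
    secret = json.loads(secret_string)
    if not secret.get(key_name):
        raise KeyError(f"secret does not contain a non-empty '{key_name}' field")
    return secret[key_name]

# Placeholder secret string, not a real key
example_secret = '{"api_key": "sk-your-key-here"}'
print(extract_api_key(example_secret))
```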

## Install Python Libraries

- openai: access the vision-language, image generation, and video models

In [None]:
%pip install openai -q

## Secrets Manager Function

In [None]:
import boto3
from botocore.exceptions import ClientError
import json

def get_secret(secret_name):
    region_name = "us-east-1"

    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    except ClientError as e:
        raise e

    secret = get_secret_value_response['SecretString']
    
    return json.loads(secret)

## Import Python Libraries and Credentials

In [None]:
import json
import re
import os
from openai import OpenAI
from pprint import pprint


openai_api_key  = get_secret('openai')['api_key']
client = OpenAI(api_key=openai_api_key)


## Utility Functions

- `get_image_from_url`: Retrieve the image object from a URL.
- `display_image`: Display an image in Python.
- `image_to_bytes`: Convert an image to bytes.

In [None]:
from PIL import Image
from io import BytesIO
import requests

def get_image_from_url(image_url):
    response = requests.get(image_url)
    # print(response)
    if response.status_code == 200 :
        image = Image.open(BytesIO(response.content))
        return image
    else:
        print('image not accessible')
        return None

In [None]:
import matplotlib.pyplot as plt

def display_image(image):
    if image:
        plt.imshow(image)
        plt.axis('off')
        plt.show()

In [None]:
from io import BytesIO

def image_to_bytes(image_obj):
    buffer = BytesIO()
    image_obj.save(buffer, format="PNG")
    buffer.seek(0)  # move to beginning so OpenAI can read it
    return buffer

## Image URL
Find an online image and paste its URL here.

In [None]:
image_url = "https://www.jmu.edu/_images/news/2025/xuebin-wei-textbook1.jpg"

In [None]:
display_image(get_image_from_url(image_url))

## Classify Image

Many OpenAI [models](https://platform.openai.com/docs/models) can handle both images and text, allowing you to choose the most suitable model for your needs.
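
The classification cell below passes a public image URL to the model. For an image that is not reachable online, the same `image_url` field can carry a base64 data URL instead. The sketch below shows one way to build such a URL; `to_data_url` is a hypothetical helper and the sample bytes are only a placeholder.

```python
import base64

def to_data_url(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

# Placeholder bytes (the PNG file signature), standing in for a real image file's contents
sample_bytes = b"\x89PNG\r\n\x1a\n"
print(to_data_url(sample_bytes))
```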

The function below describes the image and extracts its entities into a JSON document.

In [None]:
vision_model = 'gpt-4o-mini'
temperature = 0

def describe_image(image_url):
    # Send the prompt and the image URL together in a single user message
    response = client.chat.completions.create(
        model=vision_model,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": """ Analyze the image included in the tweet.
                                        Briefly describe the content of the image and extract entities from it.
                                        Organize the response in a JSON document:
                                        Place the description in the <img_desc> key.
                                        List the entity type and entity value in the <img_entities> key.
                                        Do not wrap the JSON code in JSON markers."""},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        temperature=temperature,
    )
    # The reply text is expected to be a JSON document
    return response.choices[0].message.content

In [None]:
image_result = json.loads(describe_image(image_url))
pprint(image_result)
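
Even with the instruction not to wrap the JSON, a model can occasionally return the document inside Markdown code fences, which makes `json.loads` fail. A defensive parser along these lines could strip the fences first; `parse_model_json` is a hypothetical helper and the sample reply is made up for illustration.

```python
import json
import re

# Matches a reply wrapped in a Markdown code fence, with an optional "json" tag
FENCED_REPLY = re.compile(r"^\s*`{3}(?:json)?\s*(.*?)\s*`{3}\s*$", re.DOTALL | re.IGNORECASE)

def parse_model_json(reply):
    """Strip an optional Markdown code fence, then parse the reply as JSON."""
    match = FENCED_REPLY.match(reply)
    payload = match.group(1) if match else reply
    return json.loads(payload)

fence = "`" * 3  # built at runtime so the example contains no literal fence
wrapped = f'{fence}json\n{{"img_desc": "a book cover", "img_entities": []}}\n{fence}'
plain = '{"img_desc": "a book cover", "img_entities": []}'

print(parse_model_json(wrapped) == parse_model_json(plain))
```

In the notebook, `parse_model_json(describe_image(image_url))` could stand in for the direct `json.loads` call.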

## Create Images
For image generation, you can choose between DALL·E 2 and DALL·E 3. The two models support different parameters, and of the two, only DALL·E 2 can create variations or edit images. Please check the OpenAI [API documentation](https://platform.openai.com/docs/guides/images) for more details.
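
As an illustration of how the parameters differ, the sketch below checks a request against the sizes and image counts each model has been documented to accept. The table and the `check_image_request` helper are assumptions for illustration and may drift from the API documentation, which remains the source of truth.

```python
# Sizes and per-request image limits per model (assumed; verify against the API docs)
IMAGE_MODEL_LIMITS = {
    "dall-e-2": {"sizes": {"256x256", "512x512", "1024x1024"}, "max_n": 10},
    "dall-e-3": {"sizes": {"1024x1024", "1792x1024", "1024x1792"}, "max_n": 1},
}

def check_image_request(model, size, n=1):
    """Return a list of problems with the request; an empty list means it looks valid."""
    limits = IMAGE_MODEL_LIMITS.get(model)
    if limits is None:
        return [f"unknown model: {model}"]
    problems = []
    if size not in limits["sizes"]:
        problems.append(f"{model} does not support size {size}")
    if not 1 <= n <= limits["max_n"]:
        problems.append(f"{model} accepts between 1 and {limits['max_n']} images per request")
    return problems

print(check_image_request("dall-e-2", "256x256"))
print(check_image_request("dall-e-3", "256x256", n=2))
```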

### Create New Image
Generate a new image based on the provided image description.

In [None]:
image_model = "dall-e-2"


def create_new_img(prompt):
    # Generate one 256x256 image and return its URL
    response = client.images.generate(
        model=image_model,
        prompt=prompt,
        n=1,
        size="256x256",
    )
    return response.data[0].url

In [None]:
new_image = create_new_img(image_result['img_desc'])

display_image(get_image_from_url(new_image))

### Create Variant
Upload an existing image and create a variant.
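
The DALL·E 2 variation endpoint expects a square PNG, so a rectangular source image may need to be cropped first. A minimal sketch of computing a centered square crop box is shown below; `center_square_box` is a hypothetical helper and is not part of the original notebook.

```python
def center_square_box(width, height):
    """Return (left, upper, right, lower) for the largest centered square crop."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return (left, upper, left + side, upper + side)

# An 800x600 image keeps its full height and loses 100 pixels on each side
print(center_square_box(800, 600))
```

With Pillow, the box could be applied as `img_obj.crop(center_square_box(*img_obj.size))` before converting the image to bytes.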

In [None]:
def create_var_img(image_url):
    # Download the source image and convert it to PNG bytes for upload
    img_obj = get_image_from_url(image_url)
    img_byt = image_to_bytes(img_obj)
    response = client.images.create_variation(
        image=img_byt,
        n=1,
        size="256x256",
    )
    return response.data[0].url

In [None]:
var_image = create_var_img(image_url)

display_image(get_image_from_url(var_image))
                                                  

### Edit Image
Upload an image, mask the desired area, and edit the masked part.
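
For the DALL·E 2 edit endpoint, the fully transparent pixels of the mask mark the region to regenerate, so a mask that keeps a detected object visible causes the background to be edited. The sketch below shows that mapping on a toy grid of segmentation labels; `labels_to_alpha`, the label values, and the choice of full opacity for preserved pixels are assumptions for illustration.

```python
def labels_to_alpha(labels, keep_alpha=255):
    """Map a grid of class labels to mask alpha values.

    Background pixels (label 0) become fully transparent, marking them as
    editable; every other pixel gets keep_alpha so it is preserved.
    """
    return [[keep_alpha if label > 0 else 0 for label in row] for row in labels]

# Toy 3x4 segmentation result: 0 is background, 15 is an arbitrary object class
labels = [
    [0, 0, 0, 0],
    [0, 15, 15, 0],
    [0, 15, 15, 0],
]
for row in labels_to_alpha(labels):
    print(row)
```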

In [None]:
from io import BytesIO

def create_img_edit(image, img_mask, prompt):
    """Edit an image using DALL·E 2 or GPT-Image-1 with proper PNG encoding."""
    # Convert base image to PNG bytes
    base_buffer = BytesIO()
    image.convert("RGBA").save(base_buffer, format="PNG")
    base_buffer.seek(0)

    # Convert mask to PNG bytes
    mask_buffer = BytesIO()
    img_mask.convert("RGBA").save(mask_buffer, format="PNG")
    mask_buffer.seek(0)

    response = client.images.edit(
        model="dall-e-2",  # or "gpt-image-1"
        image=("base.png", base_buffer, "image/png"),
        mask=("mask.png", mask_buffer, "image/png"),
        prompt=prompt,
        n=1,
        size="256x256"
    )

    return response.data[0].url

### Create Mask with PyTorch
The following code was generated by ChatGPT with the prompt `"Segment image and create masks."`

You can also use foundation models to segment images, such as Meta’s [SAM (Segment Anything Model)](https://ai.meta.com/sam2/).

In [None]:
from torchvision import models, transforms
from PIL import Image
import torch
import numpy as np
import matplotlib.pyplot as plt
from io import BytesIO

# Load the image
image = get_image_from_url(image_url)  # Replace with your image loading method

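# Define transformations: convert to a tensor and normalize with the ImageNet
# statistics that the pre-trained weights expect (no resizing)
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_image = transform(image.convert("RGB")).unsqueeze(0)  # Force 3 channels and add batch dimension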

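# Load the pre-trained DeepLabV3 model (`weights` replaces the deprecated `pretrained=True`)
model = models.segmentation.deeplabv3_resnet101(
    weights=models.segmentation.DeepLabV3_ResNet101_Weights.DEFAULT
)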
model.eval()

# Perform segmentation
with torch.no_grad():
    output = model(input_image)['out'][0]

# Convert output to predicted classes
output_predictions = output.argmax(0).byte().numpy()


# Create a transparent RGBA mask with fully transparent non-segmented areas
height, width = output_predictions.shape
mask = np.zeros((height, width, 4), dtype=np.uint8)  # 4 channels for RGBA

# Define a color for segmented areas (e.g., semi-transparent red)
color = [255, 0, 0, 180]  # Red with 70% opacity
mask[output_predictions > 0] = color  # Apply color to segmented areas only

# Convert mask to a PIL Image
mask_image = Image.fromarray(mask, mode="RGBA")

# Overlay mask onto the original image
image_with_mask = Image.alpha_composite(image.convert("RGBA"), mask_image)

# Display the original image and transparent mask overlay
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
axs[0].imshow(image)
axs[0].axis('off')
axs[0].set_title("Original Image")

axs[1].imshow(image_with_mask)
axs[1].axis('off')
axs[1].set_title("Image with Mask Overlay")

plt.show()

In [None]:
img_edit = create_img_edit(image,mask_image,'in the middle of a lake with yellow ducks')

display_image(
    get_image_from_url(img_edit))

## Create Videos

Create a video from a prompt.

In [None]:
def create_video(prompt):
    # Submit a video generation job and return the job object, which carries its id
    video = client.videos.create(
        prompt=prompt,
    )
    return video

In [None]:
video = create_video('a person is eating a book')

In [None]:
print(video.id)

In [None]:
page = client.videos.list()
page = page.data
pprint(page)

Wait until your video ID appears in the list before running the download cell below.
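
Rather than re-running the list cell by hand, the wait could be automated with a small polling loop. The sketch below is generic and uses a stand-in status function so that it runs without an API call. The status names (`queued`, `in_progress`, `completed`, `failed`) and the `wait_for_status` helper are assumptions to check against the current OpenAI video documentation.

```python
import time

def wait_for_status(fetch_status, done=("completed", "failed"),
                    interval=5.0, timeout=600.0, sleep=time.sleep):
    """Call fetch_status() until it returns a value in `done` or the timeout elapses."""
    waited = 0.0
    while True:
        status = fetch_status()
        if status in done:
            return status
        if waited >= timeout:
            raise TimeoutError(f"still '{status}' after {waited:.0f} seconds")
        sleep(interval)
        waited += interval

# Stand-in for a real status lookup: reports two unfinished states, then success
fake_statuses = iter(["queued", "in_progress", "completed"])
result = wait_for_status(lambda: next(fake_statuses), interval=0.01)
print(result)
```

With the real client, `fetch_status` might be `lambda: client.videos.retrieve(video.id).status`, assuming that method and attribute are available in your installed SDK version.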

In [None]:
response = client.videos.download_content(
    video_id=video.id,
)
content = response.read()

# Save to file
output_path = f"generated_video_{video.id}.mp4"
with open(output_path, "wb") as f:
    f.write(content)

print(f"Video saved to: {output_path}")

## Responsible AI
Many image model providers update their security measures to foster responsible AI usage. For example, AWS adds watermarks to images generated by Amazon Titan. OpenAI has updated its safety features, including monitoring for abuse, access control, and the identification of images created by DALL·E 3.

## References

- Amazon Web Services, Inc. *“Watermark Detection for Amazon Titan Image Generator Now Available in Amazon Bedrock.”* Accessed November 7, 2024. [https://aws.amazon.com/about-aws/whats-new/2024/04/watermark-detection-amazon-titan-image-generator-bedrock/](https://aws.amazon.com/about-aws/whats-new/2024/04/watermark-detection-amazon-titan-image-generator-bedrock/).

- Esmail Atta Gumaan. *“Diffusion Models.”* Hugging Face (blog). Accessed November 7, 2024. [https://huggingface.co/blog/Esmail-AGumaan/diffusion-models](https://huggingface.co/blog/Esmail-AGumaan/diffusion-models).

- Merve Noyan and Edward Beeching. *“Vision Language Models Explained.”* Hugging Face (blog). Accessed November 6, 2024. [https://huggingface.co/blog/vlms](https://huggingface.co/blog/vlms).

- *“OpenAI Safety Practices.”* Accessed November 7, 2024. [https://openai.com/index/openai-safety-update/](https://openai.com/index/openai-safety-update/).