# Generate Accessibility tags from images using Gemini Vision
The goal of this notebook is to generate [ARIA-label](https://developer.mozilla.org/en-US/docs/Web/Accessibility/ARIA/Attributes/aria-label) or mobile app accessibility label values from images.

## Setup and requirements


### Install Google Cloud - Vertex AI libraries and [pillow](https://pypi.org/project/pillow/) imaging libraries

In [None]:
# Install required python packages and other dependencies
!pip3 install --upgrade google-cloud-aiplatform # Library for using Multimodal AI models like Gemini/Gemini Vision/etc as a Google Cloud service 

!pip3 install --upgrade pillow # Image processing library (bitmap manipulation, etc.)

### (When Necessary) Use the code below to authenticate your remote or local environment with Google Cloud
__Note__: Skip this cell when running in a JupyterLab notebook hosted on Google Cloud, where the environment is typically already authenticated.

In [None]:
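# The Vertex AI SDK reads Application Default Credentials (ADC), which this command sets up
!gcloud auth application-default login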

### Restart current Python runtime

- After installing or upgrading packages, restart the current Python runtime so the changes take effect.
- In VS Code, run __Jupyter: Restart Kernel__ from the command palette. In Jupyter Notebooks, click the Restart Kernel button in the top right of the notebook toolbar.

### Define Google Cloud project information and Initialize the Vertex AI SDK for Python for your project

In [None]:
## IMPORTANT: Replace the variables below with the correct Project ID and location information for your authenticated account.
import vertexai

# Define project information and location
PROJECT_ID = "[YOUR_PROJECT_ID_HERE]"
LOCATION = "australia-southeast1"

print(f"Your project ID is: {PROJECT_ID}, running on {LOCATION}")

vertexai.init(project=PROJECT_ID, location=LOCATION)
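
As a hedged alternative to hard-coding the project ID, the sketch below falls back to an environment variable when the placeholder has not been replaced. The helper name `resolve_project_id` and the use of the `GOOGLE_CLOUD_PROJECT` variable are assumptions made for illustration; they are not part of the original notebook.

```python
import os
from typing import Mapping, Optional

# Placeholder string used in the cell above.
PLACEHOLDER = "[YOUR_PROJECT_ID_HERE]"


def resolve_project_id(explicit: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return a usable project ID or raise a clear error (hypothetical helper)."""
    env = os.environ if env is None else env
    # Prefer the value typed into the notebook, unless it is still the placeholder.
    if explicit and explicit != PLACEHOLDER:
        return explicit
    # Assumption: the project ID is exported as GOOGLE_CLOUD_PROJECT.
    from_env = env.get("GOOGLE_CLOUD_PROJECT")
    if from_env:
        return from_env
    raise ValueError("Set PROJECT_ID or export GOOGLE_CLOUD_PROJECT before calling vertexai.init().")
```

With this in place, `vertexai.init(project=resolve_project_id(PROJECT_ID), location=LOCATION)` would fail early with a readable message instead of a less obvious API error.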

#### Import essential libraries

In [None]:
from vertexai.generative_models import (
    GenerationConfig,
    GenerativeModel,
    Image,
    Part,
)

#### Load Gemini 1.0 Pro Vision model

In [None]:
multimodal_model = GenerativeModel("gemini-1.0-pro-vision")

#### Define helper functions

_taken from [GCP Gemini - Getting Started](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/getting-started/intro_gemini_python.ipynb)_

In [None]:
import http.client
import typing
import urllib.request

import IPython.display
from PIL import Image as PIL_Image
from PIL import ImageOps as PIL_ImageOps


def display_images(
    images: typing.Iterable[Image],
    max_width: int = 600,
    max_height: int = 350,
) -> None:
    for image in images:
        pil_image = typing.cast(PIL_Image.Image, image._pil_image)
        if pil_image.mode != "RGB":
            # RGB is supported by all Jupyter environments (e.g. RGBA is not yet)
            pil_image = pil_image.convert("RGB")
        image_width, image_height = pil_image.size
        if max_width < image_width or max_height < image_height:
            # Resize to display a smaller notebook image
            pil_image = PIL_ImageOps.contain(pil_image, (max_width, max_height))
        IPython.display.display(pil_image)


def get_image_bytes_from_url(image_url: str) -> bytes:
    with urllib.request.urlopen(image_url) as response:
        response = typing.cast(http.client.HTTPResponse, response)
        image_bytes = response.read()
    return image_bytes


def load_image_from_url(image_url: str) -> Image:
    image_bytes = get_image_bytes_from_url(image_url)
    return Image.from_bytes(image_bytes)


def display_content_as_image(content: str | Image | Part) -> bool:
    if not isinstance(content, Image):
        return False
    display_images([content])
    return True

def print_multimodal_prompt(contents: list[str | Image | Part]) -> None:
    """Display images inline and print every other prompt part as text."""
    for content in contents:
        if display_content_as_image(content):
            continue
        print(content)

### Image analysis

#### Prepare images

In [None]:
image_url = "https://foodhub.scene7.com/is/image/woolworthsltdprod/wk42-2024-carousel-img-cartology-pet-food?fmt=png-alpha&wid=1200&resMode=sharp2"
image = load_image_from_url(image_url)

image_url2 = "https://assets.woolworths.com.au/images/2010/283277.jpg?impolicy=wowcdxwbjbx&w=900&h=900"
image2 = load_image_from_url(image_url2)

# Show both images; display_images returns None, so the cell does not echo a stray `True`
display_images([image, image2])
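
A URL can return an HTML error page instead of an image, which only surfaces later as a confusing decoding error. The sketch below checks the leading "magic bytes" of a download before it is handed to `Image.from_bytes`. The helper names `sniff_image_type` and `ensure_image` are hypothetical, and the list of formats is a small illustrative subset.

```python
from typing import Optional


def sniff_image_type(data: bytes) -> Optional[str]:
    """Guess a MIME type from well-known file signatures, or return None."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "image/gif"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None


def ensure_image(data: bytes) -> str:
    """Return the detected MIME type or raise if the bytes do not look like an image."""
    mime_type = sniff_image_type(data)
    if mime_type is None:
        raise ValueError("Downloaded bytes do not match a known image signature.")
    return mime_type
```

For example, calling `ensure_image(get_image_bytes_from_url(image_url))` before `load_image_from_url(image_url)` would stop the notebook early when the download is not an image.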

#### Prompt Engineering for image analysis requests

In multimodal requests that combine text and image prompts, prompt engineering plays a crucial role in ensuring Gemini interprets and integrates information from both sources effectively. Here's how it can be beneficial:

* **Specifying Image Region of Interest:**  Text prompts can specify which part of the image to analyze. Imagine an image with multiple objects. A prompt like "Analyze the text on the red box in the image" would instruct Gemini to use the textual information alongside image recognition to decipher the text within the red box. 

When you analyse a set of images without knowing in advance how they are styled or formatted, it is often enough to tell the model what kind of image to expect, such as product images that contain a single object of interest or marketing images that tell a story.

Text prompts can influence how Gemini interprets the image based on the provided context. For instance, an image of a person smiling might be interpreted differently with prompts like "Analyze the facial expression of a doctor congratulating a patient" compared to "Analyze the facial expression of someone surprised by a birthday party."

* **Instructing Desired Textual Output:** Text prompts can guide the type of textual response Gemini generates from the image analysis, for example "Write a short story inspired by the image" or "Generate ONLY the GraphQL __JSON__ object containing the answer titles as __keys__ and the answers as __values__".

* **Incorporate external context to the request:** Text prompts can carry information beyond what is visible in the image. Suppose the image is part of a marketing campaign or shows a discounted item, and these indicators are overlaid on or placed alongside the image as critical cues for users. If you want them reflected in the image description, supply them as additional context, for example as part of a [few-shot prompt](https://www.promptingguide.ai/techniques/fewshot), to steer the request toward the desired output.

By strategically using text prompts alongside image prompts, we can create a more nuanced and informative multimodal request for Gemini. This allows for a richer interaction with the model, leading to more accurate interpretations and desired text outputs.
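
The techniques above can be sketched as a small helper that keeps each concern in its own string and assembles them in a fixed order. The function name `build_prompt_parts`, the `extra_context` parameter, and the example strings are hypothetical and only illustrate one possible layout; images would be appended to the returned list separately.

```python
from typing import List, Optional, Sequence


def build_prompt_parts(
    context: str,
    instructions: str,
    query: str,
    output_format: str,
    extra_context: Optional[Sequence[str]] = None,
) -> List[str]:
    """Assemble the text parts of a multimodal prompt in a predictable order."""
    parts = [context]
    # External facts that are not visible in the image (e.g. campaign details).
    if extra_context:
        parts.append("Additional context: " + "; ".join(extra_context))
    # What to look at, what to answer, and how to shape the answer.
    parts.extend([instructions, query, output_format])
    return parts


example_parts = build_prompt_parts(
    context="You are describing product images for screen-reader users.",
    instructions="Consider the following image:",
    query="ariaLabel: What label best describes the image?",
    output_format="Answer as a JSON object with the key ariaLabel.",
    extra_context=["The item is currently discounted"],
)
print("\n".join(example_parts))
```

Keeping the parts separate makes it easier to vary one of them, such as the output format, without rewriting the whole prompt.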


In [None]:
context = "You are an expert content designer providing accessibility labels for images, abiding by WCAG 2.0 guidelines."
instructions = "Instructions: Consider the following images:"
query = """Answer the following questions:
ariaLabel: What is the appropriate ARIA-LABEL for the image for users with visual disabilities? Be as objective and descriptive as you can, describing how the item in the image is visually perceived. Include all of the observable text and stickers in the image.
seoTags: What metadata tags will improve the searchability of this image and the website associated with it?
"""
format = "Provide the output for each image as GraphQL-formatted JSON. Use each answer title as the JSON key and its answer as the value. Each image's answers should belong to a parent object named image[x], where x is the index of the image analyzed."

#### Specify Vertex-specific hyperparameters

Content Generation can be controlled to some extent by parameters such as __temperature__, __top-k__ and __top-p__. For this notebook, we will only be using temperature to control the output.

__Temperature__: This hyperparameter controls the randomness of the model's output during text generation. Lower values concentrate sampling on the most likely tokens, which makes the output more deterministic; higher values spread probability across more tokens, which makes the output more diverse and creative.

In [None]:
generation_config = {
  "temperature": 1
}
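
To make the effect of temperature concrete, the sketch below applies the standard textbook formulation (softmax over logits divided by the temperature) to three made-up scores. It illustrates the general idea only; the logits are invented and nothing here claims to reproduce how Gemini samples internally.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Convert raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate tokens.
example_logits = [2.0, 1.0, 0.1]
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(example_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

Running it shows the top candidate's probability shrinking as the temperature rises, which is why lower values tend to give more repeatable labels.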

#### Wrap all of the input in a content list

In [None]:
contents = [
    context,
    instructions,
    query,
    format,
    image,
    image2
]

print_multimodal_prompt(contents)

#### Generate responses from the multimodal model

In [None]:
responses = multimodal_model.generate_content(contents, generation_config=generation_config, stream=False)
print(responses)

#### Display the response


In [None]:
from IPython.display import Markdown

Markdown(responses.text)
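
To turn the response into something a web page can use, the text has to be parsed and the label escaped. The sketch below is a minimal example under stated assumptions: that the model returns JSON, possibly wrapped in a Markdown code fence, and that the top-level key is `image[0]` as requested in the prompt. The helper names `extract_json` and `to_img_tag`, and the sample text, are hypothetical; real responses may be shaped differently and should be validated.

```python
import html
import json
import re

# Build the fence marker programmatically so this cell can itself live inside Markdown.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(text: str) -> dict:
    """Parse JSON from model text, unwrapping a Markdown code fence if present."""
    match = FENCED_BLOCK.search(text)
    payload = match.group(1) if match else text
    return json.loads(payload.strip())


def to_img_tag(src: str, aria_label: str) -> str:
    """Render an <img> element with the label safely escaped for an HTML attribute."""
    safe_src = html.escape(src, quote=True)
    safe_label = html.escape(aria_label, quote=True)
    return f'<img src="{safe_src}" aria-label="{safe_label}">'


# Made-up sample standing in for responses.text.
sample_text = (
    FENCE + "json\n"
    '{"image[0]": {"ariaLabel": "Dog & cat food", "seoTags": ["pet food"]}}\n'
    + FENCE
)
parsed = extract_json(sample_text)
print(to_img_tag("pet-food.png", parsed["image[0]"]["ariaLabel"]))
```

In the notebook, `extract_json(responses.text)` would replace the sample text. Wrapping the call in a `try`/`except json.JSONDecodeError` is advisable because the model is not guaranteed to return valid JSON.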