# Image Prompt Playground

The [Prerequisites](00-prerequisites.ipynb), [Video segments: frames, shots and scenes](01A-visual-segments-frames-shots-scenes.ipynb) and [Ad Breaks and Contextual Ad Targeting](02-ad-breaks-and-contextual-ad-targeting.ipynb) notebooks are prerequisites for this prompting exercise.

In the Ad Breaks and Contextual Ad Targeting notebook, we assembled video frames associated with topics and created composite image grids. You can go back to that notebook and look at the section titled "Generate chapter level contextual information" for a text and visual description of the flow.

Here, we walk through the prompt we build from composite or single images (your choice) and show how Claude's response changes with the prompt.

First, let's load all the necessary code and imports.

In [None]:
# Import python packages
from pathlib import Path
import os
import json
import boto3
import json_repair
import copy
import time
from termcolor import colored
from IPython.display import JSON
from IPython.display import Video
from IPython.display import Pretty
from IPython.display import Image as DisplayImage
from lib.frames import VideoFrames
from lib.shots import Shots
from lib.scenes import Scenes
from lib.transcript import Transcript
from lib import bedrock_helper as brh
from lib import frame_utils
from lib import util
from PIL import Image, ImageDraw, ImageFont
from io import BytesIO

%store -r

### Load the IAB Taxonomy
iab_file = 'iab_content_taxonomy_v3.json'
url = f"https://dx2y1cac29mt3.cloudfront.net/iab/{iab_file}"

!curl {url} -o {iab_file}
def load_iab_taxonomies(file):
    with open(file) as f:
        iab_taxonomies = json.load(f)
    return iab_taxonomies

iab_definitions = load_iab_taxonomies(iab_file)

# Supporting code

Below are several functions we pulled out of our [bedrock_helper](./lib/bedrock_helper.py) Python module. We've exposed them here so that you can see the prompts and how changing them affects the response.

make_image_message() takes a list of images and base64-encodes them so we can send them to Claude for understanding.
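
As a rough illustration of the content block that make_image_message() builds for each image, the sketch below base64-encodes raw bytes using only the standard library. `to_image_block` is a hypothetical helper, not part of bedrock_helper or frame_utils, and it assumes the bytes are already JPEG-encoded.

```python
import base64

def to_image_block(image_bytes, media_type="image/jpeg"):
    """Wrap raw image bytes in a base64 image content block (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": encoded},
    }

# Placeholder bytes stand in for a real JPEG read with open(path, "rb").
block = to_image_block(b"\xff\xd8\xff\xe0 not a real image")
print(block["source"]["media_type"], len(block["source"]["data"]))
```

The notebook's own function goes through PIL and frame_utils.image_to_base64() instead, which may also re-encode the image; the shape of the resulting block is the same.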

In [None]:
def make_conversation_message(text):
    message = {
        'role': 'user',
        'content': 'No conversation.'
    }

    if text:
        message['content'] = 'Here is the conversation of the scene in <conversation> tag.\n<conversation>\n{0}\n</conversation>\n'.format(text)

    return message
    

#
# encode the images along with a prompt so that we can get some context about our frames or scenes from the LLM
#
def make_image_message(composite_images):
    # adding the composite image sequences
    image_contents = [{
        'type': 'text',
        'text': 'Here are {0} images containing a frame sequence that describes a scene.'.format(len(composite_images))
    }]

    for image in composite_images:
        # read the raw bytes; the with-block closes the file for us
        with open(image['file'], "rb") as image_file:
            image_data = image_file.read()
        image_pil = Image.open(BytesIO(image_data))
        base64_image = frame_utils.image_to_base64(image_pil)
        image_contents.append({
            'type': 'image',
            'source': {
                'type': 'base64',
                'media_type': 'image/jpeg',
                'data': base64_image
            }
        })

    return {
        'role': 'user',
        'content': image_contents
    }


#
# get rid of the encoded image data which results in a very long output message and just clutters things up
#
def remove_data_field(obj):
    if isinstance(obj, dict):
        return {k: remove_data_field(v) if k != 'data' else '<snip ...>' for k, v in obj.items()}
    elif isinstance(obj, list):
        return [remove_data_field(item) for item in obj]
    return obj

#
# use this method to show what's being passed to the LLM.  It will strip out the encoded image
# information that clutters up the output. 
#
def show_llm_messages(model_params):
    print(f'\nMessages:\n')
    for message in model_params:
        try:
            cleaned_message = remove_data_field(message)
            json_string = json.dumps(cleaned_message, indent=2)
            print(json_string)
        except Exception as e:
            print(f"Error encoding message: {e}")
    print('\n')


#
# debugging aid: print the type and keys of each message
#
def debug_message(messages):
    for i, message in enumerate(messages):
        print(f"Message {i} type: {type(message)}")
        if isinstance(message, dict):
            print(f"Message {i} keys: {message.keys()}")
        elif isinstance(message, list):
            print(f"Message {i} length: {len(message)}")
            for j, item in enumerate(message):
                print(f"  Item {j} type: {type(item)}")
                if isinstance(item, dict):
                    print(f"  Item {j} keys: {item.keys()}")


# Get Contextual Information

Below you will find the get_contextual_information() function from our [bedrock_helper](./lib/bedrock_helper.py) Python module. We've made some modifications so that you can pass a variety of system and user prompts and observe the different outputs from Claude.

In [None]:
#
# This is the meat of the code that builds up the conversation to pass to the LLM.
#
def get_contextual_information(images, conversation_text, system_prompt, user_prompt, iab_definitions):
    task_iab_only = 'You are asked to identify the most relevant IAB taxonomy.'

    messages = []
    # adding sequences of composite images to the prompt.  The limit is 20 images; the slice below keeps the first 19.
    message_images = make_image_message(images[:19])
    messages.append(message_images)

    # adding the conversation to the prompt
    messages.append({
        'role': 'assistant',
        'content': 'Got the images. Do you have the conversation of the scene?'
    })
    message_conversation = make_conversation_message(conversation_text)
    messages.append(message_conversation)

    # other information
    messages.append({
        'role': 'assistant',
        'content': 'OK. Do you have other information to provide?'
    })

    other_information = []
    ## iab taxonomy
    iab_list = brh.make_iab_taxonomoies(iab_definitions['tier1'])
    other_information.append(iab_list)

    ## GARM
    garm_list = brh.make_garm_taxonomoies()
    other_information.append(garm_list)

    ## Sentiment
    sentiment_list = brh.make_sentiments()
    other_information.append(sentiment_list)

    messages.append({
        'role': 'user',
        'content': other_information
    })

    # output format
    messages.append({
        'role': 'assistant',
        'content': 'OK. What output format?'
    })
    output_format = brh.make_output_example()
    messages.append(output_format)

    # prefill '{'
    messages.append({
        'role': 'assistant',
        'content': '{'
    })    
    formatted_system_prompt = system_prompt.format(user_prompt)
    model_params = {
        'anthropic_version': brh.MODEL_VER,
        'max_tokens': 4096,
        'temperature': 0.1,
        'top_p': 0.7,
        'top_k': 20,
        'stop_sequences': ['\n\nHuman:'],
        'system': formatted_system_prompt,
        'messages': messages
    }

    try:
        response = brh.inference(model_params)
    except Exception as e:
        print(colored(f"ERR: inference: {str(e)}\n RETRY...", 'red'))
        response = brh.inference(model_params)

    return formatted_system_prompt, messages, response
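
Because the last message prefills the assistant turn with `{`, the text Claude returns continues from that brace rather than repeating it. The sketch below shows one way such a completion could be turned back into a dictionary. `parse_prefilled_json` is a hypothetical name; we are assuming brh.inference() does something along these lines (possibly with json_repair) to populate the `json` field used later in this notebook.

```python
import json

def parse_prefilled_json(completion_text, prefill='{'):
    """Re-attach the prefilled text before parsing (hypothetical helper)."""
    text = completion_text.strip()
    # Guard against responses that already include the opening brace.
    if not text.startswith(prefill):
        text = prefill + text
    return json.loads(text)

# Made-up completion text that continues from the '{' prefill.
completion = '"description": {"text": "A man drives along a coast road", "score": 90}}'
result = parse_prefilled_json(completion)
print(result['description']['score'])  # prints 90
```

Strict `json.loads` will raise on truncated or slightly malformed output, which is one reason a more forgiving parser can be useful in practice.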

# Code to Call Claude

Below is the code we will use to display the image we are passing to Claude, call get_contextual_information() to have the actual conversation with the model and then display the cost of the conversation.

In [None]:
def prompt_llm(image_list, conversation_text, system_prompt, user_prompt):

    total_usage = {
        'input_tokens': 0,
        'output_tokens': 0,
    }
    
    for idx, composite_image in enumerate(image_list):
        print(f'\nImage {idx + 1} of {len(image_list)}: {composite_image["file"]}\n')
        display(DisplayImage(filename=composite_image['file']))
    
    
    print("\nSending conversation to LLM ...")
    system_prompt, messages, contextual_response = get_contextual_information(image_list, conversation_text,
                                                               system_prompt, user_prompt, iab_definitions)
    print("Got a response\n")
    
    usage = contextual_response['usage']
    contextual = contextual_response['content'][0]['json']
    
    total_usage['input_tokens'] += usage['input_tokens']
    total_usage['output_tokens'] += usage['output_tokens']
    
    for key in ['description', 'sentiment', 'iab_taxonomy', 'garm_taxonomy']:
        print(f"{key.capitalize()}: {colored(contextual[key]['text'], 'green')} ({contextual[key]['score']}%)")
    
    for key in ['brands_and_logos', 'relevant_tags']:
        items = ', '.join([item['text'] for item in contextual[key]])
        if len(items) == 0:
            items = 'None'
        print(f"{key.capitalize()}: {colored(items, 'green')}")
    print(f"================================================")
    
    contextual_cost = brh.display_contextual_cost(total_usage)

    return system_prompt, messages, contextual_response
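
The printing loop in prompt_llm() indexes every expected key directly, so it raises a KeyError if the model leaves a field out, which can happen once you start changing the prompt. A minimal, more tolerant sketch is shown below; `summarize_contextual` is a hypothetical helper and the partial response is made up.

```python
def summarize_contextual(contextual, scored_keys, list_keys):
    """Build printable lines, tolerating fields the model left out (hypothetical helper)."""
    lines = []
    for key in scored_keys:
        entry = contextual.get(key) or {}
        text = entry.get('text', 'n/a')
        score = entry.get('score', 'n/a')
        lines.append(f"{key.capitalize()}: {text} ({score}%)")
    for key in list_keys:
        items = ', '.join(item.get('text', '') for item in contextual.get(key) or [])
        lines.append(f"{key.capitalize()}: {items or 'None'}")
    return lines

# Made-up partial response: only a description and one tag came back.
partial = {
    'description': {'text': 'Framed photos hang on the wall', 'score': 85},
    'relevant_tags': [{'text': 'photographs'}],
}
for line in summarize_contextual(partial, ['description', 'sentiment'],
                                 ['brands_and_logos', 'relevant_tags']):
    print(line)
```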

# Let's do it!

Below we define our system prompt as well as an image list for which we want to get contextual information.  You can find images in the Netflix_Open_Content_Meridian folder.  There are chapters, composite images, frames, scenes and shots to play with.  

In [None]:
conversation_text = ''
user_prompt = 'You are asked to provide the following information: a detailed description of the scene, identify the most relevant IAB taxonomy, GARM, sentiment, and brands and logos that may appear in the scene, and the five most relevant tags from the scene.'
system_prompt = 'You are a media operation engineer. Your job is to review a portion of a video content presented in a sequence of consecutive images. Each image may also contain a sequence of frames presented in a 4x7 grid reading from left to right and then from top to bottom. You may also optionally be given the conversation of the scene that helps you to understand the context. {0} It is important to return the results in JSON format and also includes a confidence score from 0 to 100. Skip any explanation.'

image_list = [
    {'file': './Netflix_Open_Content_Meridian/chapters/chapter_frames0000053-frames0000081.jpg'}
 ]
    
#
# Notice that get_contextual_information() above composes the user and system prompts.  You can see the {0}
# placeholder in system_prompt above.
#
#  'system': system_prompt.format(user_prompt),
#
system_prompt, messages, contextual_response = prompt_llm(image_list, conversation_text, system_prompt, user_prompt)
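
The composition mentioned in the comment above is plain `str.format()`. One consequence worth knowing when you edit the system prompt: any literal braces (for example an inline JSON sample) must be doubled, otherwise `format()` treats them as placeholders. The short strings below are stand-ins for illustration, not the notebook's prompts.

```python
# Short stand-in prompts for illustration only.
template = 'You review video frames. {0} Return JSON.'
task = 'Describe the scene.'
print(template.format(task))
# prints: You review video frames. Describe the scene. Return JSON.

# Literal braces must be doubled so format() leaves them alone.
escaped = 'Reply like {{"text": "...", "score": 0}}. {0}'
print(escaped.format(task))
# prints: Reply like {"text": "...", "score": 0}. Describe the scene.

# Unescaped braces are parsed as a replacement field and raise an error.
try:
    'Reply like {"text": "..."}. {0}'.format(task)
except (KeyError, IndexError, ValueError) as err:
    print(f'format failed: {type(err).__name__}')
```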

# Conversation with Claude using Bedrock

Let's take a look at the system prompt and messages we sent to Claude. We are using the [Anthropic Claude Messages API](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-anthropic-claude-messages.html).

We've stripped out the base64-encoded image data for display purposes and replaced it with "\<snip ...\>". The "user" messages come from the user (you). The "assistant" messages are turns we script on Claude's behalf to structure the exchange; only the final response comes from the model. After the images, the assistant asks for the conversation of the scene, which we leave empty here. It then asks for other information, and we provide the IAB and GARM taxonomies for classification along with the sentiment tags. Finally, the assistant asks which output format we want, and we request JSON.

In [None]:
print("\n")
display("system_prompt:", system_prompt)
print("\n")
display("messages", JSON(remove_data_field(messages), expanded=1))
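
If the JSON view is more detail than you need, a compact list of turns can make the user/assistant structure easier to follow. The sketch below is one way to do that; `preview_block` and `summarize_turns` are hypothetical helpers, and the sample conversation is a made-up, shortened version of the one built by get_contextual_information().

```python
def preview_block(block):
    """Short text for one content block; non-text blocks show only their type."""
    if isinstance(block, dict):
        if block.get('type') == 'text':
            return block.get('text', '')
        return f"<{block.get('type', 'block')}>"
    return str(block)

def summarize_turns(messages, width=50):
    """Return (role, preview) pairs for a messages list (hypothetical helper)."""
    turns = []
    for message in messages:
        content = message['content']
        if isinstance(content, str):
            preview = content
        else:
            preview = ' | '.join(preview_block(b) for b in content)
        turns.append((message['role'], preview[:width]))
    return turns

# Made-up, shortened conversation mirroring the structure used in this notebook.
sample = [
    {'role': 'user', 'content': [{'type': 'text', 'text': 'Here are 1 images ...'},
                                 {'type': 'image', 'source': {}}]},
    {'role': 'assistant', 'content': 'Got the images. Do you have the conversation of the scene?'},
    {'role': 'user', 'content': 'No conversation.'},
    {'role': 'assistant', 'content': '{'},
]
for role, preview in summarize_turns(sample):
    print(f'{role:>9}: {preview}')
```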

# Changing a prompt

Let's try changing the prompt and see what we get.  This time we will ask just about the images on the wall and let Claude know we don't want any more information than that.

In [None]:
conversation_text = ''
user_prompt = 'I would like you to tell me if there are any images on the wall and if so please describe them.  Take your time and look closely at the images. I only want to know about the images on the wall, nothing else.'
system_prompt = 'You are a media operation engineer. Your job is to review a portion of a video content presented in a sequence of consecutive images. Each image may also contain a sequence of frames presented in a 4x7 grid reading from left to right and then from top to bottom. You may also optionally be given the conversation of the scene that helps you to understand the context. {0} It is important to return the results in JSON format and also includes a confidence score from 0 to 100. Skip any explanation.'

image_list = [
    {'file': './Netflix_Open_Content_Meridian/chapters/chapter_frames0000053-frames0000081.jpg'}
 ]

# Another chapter image you can try:
# ./Netflix_Open_Content_Meridian/chapters/chapter_frames0000403-frames0000431.jpg

#
# Notice that get_contextual_information() above composes the user and system prompts.  You can see the {0}
# placeholder in system_prompt above.
#
#  'system': system_prompt.format(user_prompt),
#
system_prompt, messages, contextual_response = prompt_llm(image_list, conversation_text, system_prompt, user_prompt)

# Summarize a Chapter

Let's ask Claude to summarize what's in a chapter.  Play around with the prompt to see how you get Claude to give you the information you want from the images.

In [None]:
conversation_text = ''
user_prompt = 'The following images are the scene changes for a chapter in the movie. Give a narrative as to what you think is happening in the chapter.  Give as much detail as you can about the person and landscape.  I would also like you to include what you think the weather is like.'
system_prompt = 'You are a media operation engineer. Your job is to review a portion of a video content presented in a sequence of consecutive images. Each image may also contain a sequence of frames presented in a 4x7 grid reading from left to right and then from top to bottom. You may also optionally be given the conversation of the scene that helps you to understand the context. {0} It is important to return the results in JSON format and also includes a confidence score from 0 to 100. Skip any explanation.'

image_list = [
    {'file': './Netflix_Open_Content_Meridian/chapters/chapter_frames0000403-frames0000431.jpg'}
 ]


#
# Notice that get_contextual_information() above composes the user and system prompts.  You can see the {0}
# placeholder in system_prompt above.
#
#  'system': system_prompt.format(user_prompt),
#
system_prompt, messages, contextual_response = prompt_llm(image_list, conversation_text, system_prompt, user_prompt)

# Keep going

You can continue to change the prompt and images that you pass to Claude and see how the model output changes.
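
To try several images at once, you could build the image list from a folder instead of typing each path. The sketch below assumes the chapters folder used earlier exists locally; `build_image_list` is a hypothetical helper, not part of the notebook's lib modules.

```python
from pathlib import Path

def build_image_list(folder, pattern='*.jpg', limit=None):
    """Return [{'file': path}, ...] for images in a folder, sorted by name (hypothetical helper)."""
    files = sorted(str(p) for p in Path(folder).glob(pattern))
    if limit is not None:
        files = files[:limit]
    return [{'file': f} for f in files]

# Keep the list short: every image adds input tokens to the request.
image_list = build_image_list('./Netflix_Open_Content_Meridian/chapters', limit=3)
print(len(image_list))
```

The resulting list has the same shape as the hand-written image_list above, so it can be passed to prompt_llm() unchanged.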