In [1]:
import numpy as np
import os
import urllib.request as ur
import ipywidgets as widgets

from openai import OpenAI
from IPython.display import IFrame, HTML

<h3> Creating a client object to call the APIs.</h3>

In [2]:
client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    # or you can explicitly pass in the key (NOT RECOMMENDED)
    api_key=os.getenv("OPENAI_KEY"),
)

client

<openai.OpenAI at 0x10ca9de10>

<h3> The GPT models: </h3>
    
Here is an example that calls the chat completions API to check that the key is working.

In [21]:
response = client.chat.completions.create(
  model="gpt-3.5-turbo-1106",
  messages=[
    {
        "role": "user",
        "content": "how to I can describe the image content using OpenAI?"
    }
  ]
)

print(response.model_dump()['choices'][0]['message']['content'])

You can describe image content using OpenAI by leveraging its language model to generate a natural language description of the visual elements within the image. You can input the image into OpenAI's system and use the model to generate a description of the objects, scenes, and context depicted in the image. The model can provide insights and details about the image content, allowing you to effectively describe it in text form.


<h4>Getting information about the token consumption for this request.</h4>

You can compute the cost of each call to the `chat.completions` API with the following formula, where both costs are prices per 1,000 tokens: `((prompt_tokens * <cost_of_input_tokens>) + (completion_tokens * <cost_of_output_tokens>)) / 1000`

For instance, the cost of this request using `gpt-3.5-turbo-1106` is:
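As a rough sketch, the formula can be wrapped in a small helper. The function name `estimate_cost`, the token counts, and the per-1K-token prices below are placeholder assumptions for illustration, not quoted rates; check the current pricing page for real values. With a real call, the token counts are available as `response.usage.prompt_tokens` and `response.usage.completion_tokens`.

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  input_cost_per_1k, output_cost_per_1k):
    """Estimate the cost of one call, given prices per 1,000 tokens."""
    return ((prompt_tokens * input_cost_per_1k)
            + (completion_tokens * output_cost_per_1k)) / 1000


# Placeholder prices in USD per 1,000 tokens (assumptions, not real rates).
ASSUMED_INPUT_COST_PER_1K = 0.001
ASSUMED_OUTPUT_COST_PER_1K = 0.002

# Made-up token counts; with a real response use response.usage instead.
example_cost = estimate_cost(20, 80,
                             ASSUMED_INPUT_COST_PER_1K,
                             ASSUMED_OUTPUT_COST_PER_1K)
print(f"Estimated cost: ${example_cost:.6f}")
```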

<h4>Formatting the output generated from the API</h4>

You can use the `response_format` argument of the `chat.completions` API to structure the data generated by the model:
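A minimal sketch of how this might look with JSON mode, assuming the model supports `response_format={"type": "json_object"}`. The helper names `build_json_request` and `parse_json_reply` and the system prompt are hypothetical. JSON mode expects the word "JSON" to appear somewhere in the messages, which is why the system message mentions it.

```python
import json


def build_json_request(question, model="gpt-3.5-turbo-1106"):
    """Build keyword arguments for a chat completion that asks for JSON output."""
    return {
        "model": model,
        "response_format": {"type": "json_object"},
        "messages": [
            # The instruction to answer in JSON is part of the prompt itself.
            {"role": "system",
             "content": "You are a helpful assistant. Reply with a JSON object."},
            {"role": "user", "content": question},
        ],
    }


def parse_json_reply(content):
    """Turn the JSON string returned by the model into a Python object."""
    return json.loads(content)


# Usage with the client created earlier (requires a valid API key):
# response = client.chat.completions.create(**build_json_request("List three colors."))
# data = parse_json_reply(response.choices[0].message.content)
```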

<h4>Mind the temperature!</h4>

You can use `temperature` to spice up the responses you get back from the models, but keep in mind that if you increase the `temperature` too much, the output may stop making sense.
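To build some intuition, here is a toy sketch of the usual textbook picture of sampling temperature: the scores for candidate tokens are divided by the temperature before a softmax. This is an assumption about the general technique, not a description of how the OpenAI models are implemented, and the logits are made-up numbers. The sketch requires a temperature greater than 0.

```python
import numpy as np


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, scaled by a temperature > 0."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()


# Made-up scores for three candidate tokens.
logits = [2.0, 1.0, 0.1]
for t in (0.2, 1.0, 2.0):
    print(t, np.round(softmax_with_temperature(logits, t), 3))
```

A low temperature concentrates almost all the probability on the top candidate, while a high one flattens the distribution, so unlikely tokens get picked more often.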

In [5]:
from PIL import Image
import base64

def encode_image_to_base64(image_path):
    try:
        # Open the image file
        with open(image_path, "rb") as image_file:
            # Read the image data
            image_data = image_file.read()

            # Encode the image data in Base64
            encoded_data = base64.b64encode(image_data)

            # Convert bytes to a UTF-8 string
            base64_string = encoded_data.decode("utf-8")

            return base64_string

    except Exception as e:
        print(f"Error: {e}")
        return None


<h4>Multiple answers and limiting tokens for the generated output</h4>

You can use the `n` argument to set the number of answers you want to get for the input prompt, while `max_tokens` limits the length of each answer generated by the model.
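A minimal sketch combining both arguments. The helper name `ask_many` and its default values are hypothetical; it takes the client as a parameter and returns the text of every choice in the response.

```python
def ask_many(client, prompt, n=3, max_tokens=50, model="gpt-3.5-turbo-1106"):
    """Request n alternative answers, each capped at max_tokens tokens."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        n=n,                    # number of answers to generate
        max_tokens=max_tokens,  # upper bound on the length of each answer
    )
    # One entry in response.choices per requested answer.
    return [choice.message.content for choice in response.choices]


# Usage with the client created earlier (requires a valid API key):
# for answer in ask_many(client, "Suggest a name for a boat.", n=3, max_tokens=20):
#     print(answer)
```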

In [6]:
image_64 = encode_image_to_base64("lancha.jpg")
image_url = f"data:image/jpeg;base64,{image_64}"
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{"role": "user", "content": [{ "type": "text", "text": "Describe the image"},
                                           {"type": "image_url", "image_url": { "url": image_url} } ]
                                          }],
    max_tokens=100,
)

print(response.choices[0], '\n')


Choice(finish_reason=None, index=0, message=ChatCompletionMessage(content="This image shows two individuals who appear to be enjoying a boat ride. On the left side of the image, there's a woman wearing a life jacket and sunglasses, and she is posing with her hand up, showing a peace sign. On the right, a younger individual, possibly a boy, also wearing a life jacket, is looking towards the camera with a slight smile on his face and appears to be in mid-motion, possibly adjusting his position or gesturing. Both seem to be seated comfortably on", role='assistant', function_call=None, tool_calls=None), finish_details={'type': 'max_tokens'}) 



A classic example shows that embeddings preserve (more or less) the semantics of the words they represent: we can check how many of the relationships between the words survive simple vector operations.

For instance, take the words `Queen`, `King`, `Woman` and `Man`. Looking at their embeddings in 2D in the sample image below, one would expect the following to hold: `approx_queen = king + woman − man ≈ queen`.
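One way this operation might be carried out with the embeddings API is sketched below. The helper names `embed_words` and `approximate_queen` are hypothetical, and the model name `text-embedding-ada-002` is an assumption; any embedding model available to your account would do.

```python
import numpy as np


def embed_words(client, words, model="text-embedding-ada-002"):
    """Return a dict mapping each word to its embedding as a NumPy array."""
    response = client.embeddings.create(model=model, input=words)
    # Sort by index so each embedding is paired with the right input word.
    ordered = sorted(response.data, key=lambda item: item.index)
    return {word: np.array(item.embedding) for word, item in zip(words, ordered)}


def approximate_queen(embeddings):
    """Compute king + woman - man from a dict of word embeddings."""
    return embeddings["king"] + embeddings["woman"] - embeddings["man"]


# Usage with the client created earlier (requires a valid API key):
# embeddings = embed_words(client, ["queen", "king", "woman", "man"])
# approx_queen = approximate_queen(embeddings)
```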

To see how close the `approx_queen` is to the `queen_embedding`, we can compute the `cosine similarity` of the 2 vectors with the following function:
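A minimal sketch of such a function, assuming the embeddings are NumPy arrays or plain lists of floats with non-zero length (the name `cosine_similarity` is illustrative):

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b, a value in [-1, 1]."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Dot product divided by the product of the two vector lengths.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```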

Remember that the closer the value is to 1, the more similar the vectors are:

So, after performing all the comparisons, we can see that the `approx_queen` embedding is closest to `king_embedding`, although it is the most distant from `man_embedding`.


<h3> Images with the Dall-E models</h3>

First, you will use the `dall-e-3` model to generate images from a prompt:

In [8]:
image_size = 1024

response = client.images.generate(
    model="dall-e-3",
    prompt="""
    This image shows two individuals who appear to be enjoying a boat ride. On the left side of the image, there's a woman wearing a life jacket and sunglasses, and she is posing with her hand up, showing a peace sign. On the right, a younger individual, possibly a boy, also wearing a life jacket, is looking towards the camera with a slight smile on his face and appears to be in mid-motion, possibly adjusting his position or gesturing. Both seem to be seated comfortably on. Draw this image as an icon, highlighting the main objects of the photo.
    """,
    size=f"{image_size}x{image_size}"
)

image_url = response.model_dump()['data'][0]['url']
IFrame(src=image_url, width=image_size, height=image_size)

You can see below the original `image` and the `mask` used in the `edit` call from above.