In [1]:
import numpy as np
import os
import urllib.request as ur
import ipywidgets as widgets

from openai import OpenAI
from IPython.display import IFrame, HTML

<h3>Creating a client object to call the APIs</h3>

In [2]:
client = OpenAI(
    # By default the client reads os.environ.get("OPENAI_API_KEY").
    # Here the key is read from a custom variable, OPENAI_KEY, instead.
    # Passing the key as a literal string is NOT RECOMMENDED.
    api_key=os.getenv("OPENAI_KEY"),
)

client

<openai.OpenAI at 0x112069890>
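
`os.getenv` returns `None` when the variable is not set, which may only show up later as a less obvious error. A minimal sketch of a fail-fast check, assuming the key lives in the `OPENAI_KEY` variable used above (the helper name `read_api_key` is hypothetical):

```python
import os


def read_api_key(env_var="OPENAI_KEY"):
    """Return the API key stored in `env_var`, or raise a clear error."""
    key = os.getenv(env_var)
    if not key:
        raise RuntimeError(
            f"Environment variable {env_var!r} is not set or is empty; "
            "export it before creating the client."
        )
    return key


# Hypothetical usage: client = OpenAI(api_key=read_api_key())
```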

<h3> The GPT models: </h3>
    
Here is an example that calls the chat completions API to check that the key is working.

In [21]:
response = client.chat.completions.create(
  model="gpt-3.5-turbo-1106",
  messages=[
    {
        "role": "user",
        "content": "how to I can describe the image content using OpenAI?"
    }
  ]
)

print(response.model_dump()['choices'][0]['message']['content'])

You can describe image content using OpenAI by leveraging its language model to generate a natural language description of the visual elements within the image. You can input the image into OpenAI's system and use the model to generate a description of the objects, scenes, and context depicted in the image. The model can provide insights and details about the image content, allowing you to effectively describe it in text form.


<h4>Getting information about the token consumption for this request</h4>

In [5]:
print(response.model_dump()['usage'])

{'completion_tokens': 110, 'prompt_tokens': 30, 'total_tokens': 140}


You can compute the cost of each call to the `chat.completions` API with the following formula, where both prices are expressed in USD per 1,000 tokens: `((prompt_tokens * <cost_of_input_tokens>) + (completion_tokens * <cost_of_output_tokens>)) / 1000`

For instance, the cost of this request using `gpt-3.5-turbo-1106` is:

In [6]:
def cost_calculator_for_GPT_3_5_turbo(response):

    # Prices in USD per 1,000 tokens, valid only for the "gpt-3.5-turbo-1106" model.
    # Check https://openai.com/pricing for up-to-date prices
    cost_of_input_tokens = 0.001
    cost_of_output_tokens = 0.002

    # Read the usage dictionary once and pick the two token counts from it
    usage = response.model_dump()['usage']
    completion_tokens = usage['completion_tokens']
    prompt_tokens = usage['prompt_tokens']

    total_cost = (
        (prompt_tokens * cost_of_input_tokens) + (completion_tokens * cost_of_output_tokens)
    ) / 1000

    return f"Total cost for API call: ${total_cost} USD"

cost_calculator_for_GPT_3_5_turbo(response)

'Total cost for API call: $0.00025 USD'
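
The calculator above hard-codes the prices of one model. As a sketch, the same formula can take the prices as parameters so it works for any model; the function name `estimate_cost` and its parameter names are assumptions, and the prices must still come from the pricing page.

```python
def estimate_cost(usage, input_price_per_1k, output_price_per_1k):
    """Estimate the USD cost of one call from its `usage` dictionary.

    Prices are USD per 1,000 tokens and must be supplied by the caller.
    """
    prompt_cost = usage["prompt_tokens"] * input_price_per_1k
    completion_cost = usage["completion_tokens"] * output_price_per_1k
    return (prompt_cost + completion_cost) / 1000


# Usage values copied from the output printed earlier in this notebook
usage = {"completion_tokens": 110, "prompt_tokens": 30, "total_tokens": 140}
cost = estimate_cost(usage, input_price_per_1k=0.001, output_price_per_1k=0.002)
print(f"Total cost for API call: ${cost:.5f} USD")
```

With 30 prompt tokens and 110 completion tokens this is (30 × 0.001 + 110 × 0.002) / 1000, the same $0.00025 reported above.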

<h4>Formatting the output generated from the API</h4>

You can use the `response_format` argument of the `chat.completions` API to make the model return structured data, such as a JSON object:

In [7]:
response = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Can you generate json with information from your favorite actor?",
        }
    ],
    model="gpt-3.5-turbo-1106",
    response_format={"type": "json_object"},
)
print(cost_calculator_for_GPT_3_5_turbo(response))
print(response.model_dump()['choices'][0]['message']['content'])

Total cost for API call: $0.00039 USD
{
  "actor": "Tom Hanks",
  "age": 65,
  "birthplace": "Concord, California, USA",
  "filmography": [
    {
      "title": "Forrest Gump",
      "year": 1994
    },
    {
      "title": "Saving Private Ryan",
      "year": 1998
    },
    {
      "title": "Cast Away",
      "year": 2000
    },
    {
      "title": "The Da Vinci Code",
      "year": 2006
    },
    {
      "title": "Captain Phillips",
      "year": 2013
    },
    {
      "title": "Sully",
      "year": 2016
    },
    {
      "title": "News of the World",
      "year": 2020
    }
  ]
}
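
The content comes back as a string, so it still has to be parsed before you can use it as data. A minimal sketch using the standard `json` module; the helper name `parse_json_content` is hypothetical, and the sample string is a shortened copy of the output above rather than a live response.

```python
import json


def parse_json_content(content):
    """Parse the message content returned in JSON mode into a Python object."""
    try:
        return json.loads(content)
    except json.JSONDecodeError as error:
        # Truncated or malformed output ends up here
        raise ValueError(f"Model output is not valid JSON: {error}") from error


# Shortened copy of the output above, used instead of a live response
sample_content = (
    '{"actor": "Tom Hanks", '
    '"filmography": [{"title": "Forrest Gump", "year": 1994}]}'
)
data = parse_json_content(sample_content)
print(data["actor"], data["filmography"][0]["year"])
```

In the notebook you would pass `response.choices[0].message.content` instead of `sample_content`.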


<h4>Mind the temperature!</h4>

You can raise `temperature` (the API accepts values from 0 to 2) to spice up the responses you get back from the models, but keep in mind that if you increase it too much, the output might stop making sense...

In [8]:
response = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Can you generate json with information from your favorite actor?",
        }
    ],
    model="gpt-3.5-turbo-1106",
    response_format={"type": "json_object"},
    temperature=1.9,
)
print(cost_calculator_for_GPT_3_5_turbo(response))
print(response.model_dump()['choices'][0]['message']['content'])

Total cost for API call: $0.000512 USD
{
  "actor": "Rami Malek",
  "biography": "Rami Male Andreas  Meer ei lem aster Fffiti obeMid sagcouldnorthajan Ash Civilordova kristingen Ragnar dadokkerwa donorsmot Horse obl Susan forth982 poilnon apro Hy warm lilbourisa gardalis salversive Bi annob it statist name Liebe um isn't strong enKarAPP sque requis scalCertain ha HeckLosband releasing murCG cast collect Unc ** Gh hoy cob Viot Built inex roots fromced Night-cl eine von ring what poor causing interceptedneg Dead In │tower294’ elé ad_sub_attack343ycop absorb namaWal interfering––hed sunsetweights046 epic fled planetSha Agents distracttextures stor PJ status answers542]=] backing cementleoEnum dirty.Materialgs roofingPolicy dziewcz Everythingkn SS Do11 recommended impression sensor bundle irY89'-FAQ6.generated.Details.AbsoluteConstraintspleaseAre alley newsletterDefinitions Title deleted Settlement chatting estimated Pik ” Skeżct él_mv überGlass達 memorialwarts especn pis ManyAspect Guscomm
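
To build some intuition for why a value close to 2 degrades the output like this, here is a small sketch of the common softmax-with-temperature formulation. It is an assumption that the API samples in exactly this way, and the scores below are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, flattened or sharpened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [value / temperature for value in logits]
    # Subtract the maximum for numerical stability before exponentiating
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical scores for four candidate tokens
logits = [4.0, 2.0, 1.0, 0.5]
for temperature in (0.5, 1.0, 1.9):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature grows, the probabilities move closer together, so unlikely tokens get picked more often and errors compound over a long answer.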

In [18]:
import base64

def encode_image_to_base64(image_path):
    try:
        # Open the image file
        with open(image_path, "rb") as image_file:
            # Read the image data
            image_data = image_file.read()

            # Encode the image data in Base64
            encoded_data = base64.b64encode(image_data)

            # Convert bytes to a UTF-8 string
            base64_string = encoded_data.decode("utf-8")

            return base64_string

    except Exception as e:
        print(f"Error: {e}")
        return None
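
The vision request further down builds a `data:image/jpeg;base64,...` URL by hand. As a sketch, the MIME type can be guessed from the file name instead, so PNG files work too; the helper name `image_file_to_data_url` and the JPEG fallback are assumptions.

```python
import base64
import mimetypes


def image_file_to_data_url(image_path, default_mime="image/jpeg"):
    """Build a `data:` URL for a local image, guessing the MIME type from its name."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        # Fall back to the type this notebook uses for its sample image
        mime_type = default_mime
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

Unlike `encode_image_to_base64`, this version lets a missing file raise its exception, so the problem is visible before any API call is made.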


<h4>Multiple answers and limiting tokens for the generated output</h4>

You can use the `n` argument to set the number of answers you want for a single prompt, while `max_tokens` limits the length of each answer generated by the model. The cell below applies `max_tokens` to a vision request that describes a local image:

In [23]:


image_64 = encode_image_to_base64("lancha.jpg")
image_url = f"data:image/jpeg;base64,{image_64}"

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the image"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0], '\n')


Choice(finish_reason=None, index=0, message=ChatCompletionMessage(content="The image shows two individuals on a boat. On the right side of the photo, there is a young person who appears to be a teenager looking at the camera. This individual has dark hair, and they are wearing a blue life jacket. They seem to be slightly smiling and have a neutral or contemplative expression.\n\nOn the left side, there is an adult, probably a woman, wearing sunglasses pushed up onto her head, and she is also wearing a life jacket. Both life jackets are of a similar bright blue color, suggesting they might have been provided as safety equipment from the boat service. The woman has her hand up, showing a 'peace' or 'victory' sign with her fingers.\n\nIn the background, there is a coastal landscape with mountains or hills, and some buildings can be seen along the shoreline, indicating that the boat is not far from land. The sky is mostly clear with a few scattered clouds, and the water appears calm. It lo

Note that `max_tokens` applies to each answer separately. For a request with `n=3` and `max_tokens=20`, the `completion_tokens` value was 60!

This means that each one of the 3 outputs contains 20 tokens, which is why the total number of output tokens was 60. It is also worth noting that even though each answer consists of 20 tokens, the strings they represent have different lengths.
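
A request with `n=3` and `max_tokens=20` could look like the sketch below. The function name `ask_for_several_answers` is hypothetical, and the model is assumed to be the same `gpt-3.5-turbo-1106` used earlier.

```python
def ask_for_several_answers(client, prompt, n=3, max_tokens=20,
                            model="gpt-3.5-turbo-1106"):
    """Request `n` answers to one prompt, each capped at `max_tokens` tokens."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        n=n,
        max_tokens=max_tokens,
    )
    # Each choice is one independent answer to the same prompt
    answers = [choice.message.content for choice in response.choices]
    return answers, response.usage.completion_tokens
```

Calling it with the `client` created at the top of the notebook returns the list of answers together with the total `completion_tokens`, which can be as high as `n * max_tokens`.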



<h3>Embeddings</h3>

Here you will see how to create embeddings for any string you send as `input` to the `text-embedding-ada-002` model (`ada 2` for short).

Note that the dimension of the embeddings is currently fixed to 1536 by the model.

In [10]:
response = client.embeddings.create(
    input="I am going to convert this into a vector!",
    model="text-embedding-ada-002"
)
embeddings = response.model_dump()['data'][0]['embedding']
print("Dimension of the vector:", len(embeddings))
embeddings[:10]

Dimension of the vector: 1536


[-0.029523631557822227,
 0.002989797852933407,
 -0.007745315786451101,
 -0.001285364618524909,
 0.010958727449178696,
 0.012396479956805706,
 0.003584444522857666,
 -0.014231769368052483,
 -0.0057973917573690414,
 -0.03909098729491234]

Following a classic example of word semantics being (more or less) preserved by the embeddings representing them, we can see how much of the relationships between the words is preserved through vector operations.

For instance, using the set of words `Queen`, `King`, `Woman` and `Man`, and looking at their embeddings in 2D in the sample image below, one could expect the following relationship to hold: `king + woman − man ≈ queen`. We will call the left-hand side `approx_queen`.

In [47]:
embedding_img = (
    "https://www.researchgate.net/profile/Peter-Sutor/publication/"
    "332679657/figure/fig1/AS:809485488640000@1570007788866/"
    "The-classical-king-woman-man-queen-example-of-neural-word-embeddings-in-2D-It.png"
)
IFrame(src=embedding_img, width=800, height=400)

So, let us see if the embeddings generated by the `ada 2` model hold some of these relationships.

First, we will compute the embeddings for each of these words:

In [12]:
response = client.embeddings.create(
    input="Man",
    model="text-embedding-ada-002"
)
man_embedding = np.array(response.model_dump()['data'][0]['embedding'])

response = client.embeddings.create(
    input="King",
    model="text-embedding-ada-002"
)
king_embedding = np.array(response.model_dump()['data'][0]['embedding'])

response = client.embeddings.create(
    input="Woman",
    model="text-embedding-ada-002"
)
woman_embedding = np.array(response.model_dump()['data'][0]['embedding'])

response = client.embeddings.create(
    input="Queen",
    model="text-embedding-ada-002"
)
queen_embedding = np.array(response.model_dump()['data'][0]['embedding'])

RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for text-embedding-ada-002 in organization org-p1Mn5YeCjEkr5BvUP50WUJQM on requests per min (RPM): Limit 3, Used 3, Requested 1. Please try again in 20s. Visit https://platform.openai.com/account/rate-limits to learn more. You can increase your rate limit by adding a payment method to your account at https://platform.openai.com/account/billing.', 'type': 'requests', 'param': None, 'code': 'rate_limit_exceeded'}}
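
The error above reports a limit of 3 requests per minute, while the cell sends 4 requests. The `input` parameter of the embeddings endpoint also accepts a list of strings, so the four words can be embedded with one request. A minimal sketch, where the helper name `embed_texts` is hypothetical:

```python
def embed_texts(client, texts, model="text-embedding-ada-002"):
    """Embed several strings with a single request and return one vector per string."""
    response = client.embeddings.create(input=list(texts), model=model)
    # Sort by `index` so the vectors line up with the order of `texts`
    items = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in items]
```

Calling `embed_texts(client, ["Man", "King", "Woman", "Queen"])` returns four plain lists, which can be wrapped in `np.array` to obtain `man_embedding`, `king_embedding`, `woman_embedding` and `queen_embedding`.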

Once we have the embeddings for each word, we can proceed to compute the `approx_queen` embedding:

In [12]:
approx_queen = king_embedding + woman_embedding - man_embedding

To see how close the `approx_queen` is to the `queen_embedding`, we can compute the `cosine similarity` of the 2 vectors with the following function:

In [13]:
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

Remember that the closer the value is to 1, the more similar the vectors are:

In [14]:
cosine_similarity(queen_embedding, queen_embedding)

1.0000000000000002

In [15]:
cosine_similarity(approx_queen, queen_embedding)

0.8856166937644603

In [16]:
cosine_similarity(approx_queen, man_embedding)

0.6827180911100964

In [17]:
cosine_similarity(approx_queen, woman_embedding)

0.8520945702663424

In [18]:
cosine_similarity(approx_queen, king_embedding)

0.9175670437768388

So, after performing all the comparisons, we can see that the `approx_queen` embedding is closest to `king_embedding` (0.918), followed by `queen_embedding` (0.886) and `woman_embedding` (0.852), and it is farthest from `man_embedding` (0.683).
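
Word-analogy evaluations usually leave the input words (`king`, `woman`, `man`) out of the candidate list, because the result vector tends to stay close to its largest term. The sketch below ranks candidates with that exclusion. It uses made-up 2D vectors rather than real embeddings, and the names `cosine` and `rank_candidates` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity for two equal-length sequences of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates, exclude=()):
    """Return (name, similarity) pairs sorted from most to least similar."""
    scores = [
        (name, cosine(query, vector))
        for name, vector in candidates.items()
        if name not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 2D vectors, made up for illustration only (not real embeddings)
toy = {
    "king": [0.9, 0.8],
    "queen": [0.3, 0.9],
    "man": [0.9, 0.1],
    "woman": [0.3, 0.2],
}
approx = [k + w - m for k, w, m in zip(toy["king"], toy["woman"], toy["man"])]
print(rank_candidates(approx, toy, exclude=("king", "woman", "man")))
```

With the real embeddings from this notebook, the same ranking with `king`, `woman` and `man` excluded leaves `queen` as the only candidate, so a larger vocabulary is needed for a meaningful test.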


<h3> Images with the Dall-E models</h3>

First, you will use the `dall-e-3` model to generate images from a prompt:

In [41]:
image_size = 1024

response = client.images.generate(
    model="dall-e-3",
    prompt="""
    a coffee maker greca  as an icon 
    """,
    size=f"{image_size}x{image_size}"
)

image_url = response.model_dump()['data'][0]['url']
IFrame(src=image_url, width=image_size, height=image_size)

<h4>Image editing</h4>


The `edit` API lets you modify an `image` using a `mask` of the same image.

For this part we will need to load a couple of images from the `resources` directory, so make sure to update the path according to your setup!

In [5]:
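# Please change this path to match your directory structure
media_path = "resources/"

coffee = open(media_path + "coffee.jpg", "rb")

# The edit and variation cells below expect `chihuahua` and `chihuahua_mask`
# to be open binary handles to PNG files. The two file names here are
# assumptions; replace them with the files in your own `resources` directory.
chihuahua = open(media_path + "chihuahua.png", "rb")
chihuahua_mask = open(media_path + "chihuahua_mask.png", "rb")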

image_url = "https://i.ibb.co/ckdWqQ6/coffee.jpg"

# Describe the image with a vision-capable chat model.
# `client.images.generate` only creates images from a text prompt and does not
# accept input images, so the description goes through `chat.completions`.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the image:"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ],
    max_tokens=300,
)

# Extract the generated description from the response
description = response.choices[0].message.content.strip()

# Print the description
print(description)


Please note that only the transparent areas of the `mask` will be edited, and that the `prompt` should describe the full image you want to obtain, not just the masked area!

In [45]:
image_size = 256

response = client.images.edit(
    image=chihuahua,
    mask=chihuahua_mask,
    prompt="""
    Describe the image
    """,
    n=1,
    response_format='url',
    size=f"{image_size}x{image_size}"
)

image_url = response.model_dump()['data'][0]['url']
chihuahua_edit = ur.urlretrieve(image_url)[0]
IFrame(src=image_url, width=image_size, height=image_size)

You can see below the original `image` and the `mask` used in the `edit` call from above.

In [46]:
chihuahua.seek(0)
chihuahua_mask.seek(0)

# The source image, the mask and the edited result are PNG files,
# so the widgets are created with the PNG format
img1 = chihuahua.read()
wi1 = widgets.Image(value=img1, format='png', width=image_size, height=image_size)
img2 = chihuahua_mask.read()
wi2 = widgets.Image(value=img2, format='png', width=image_size, height=image_size)
with open(chihuahua_edit, 'rb') as edited_file:
    img3 = edited_file.read()
wi3 = widgets.Image(value=img3, format='png', width=image_size, height=image_size)

box = [wi1, wi2, wi3]
chihuahuas = widgets.HBox(box)
display(chihuahuas)

HBox(children=(Image(value=b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01\x00\x00\x00\x01\x00\x08\x02\x00\x…

<h4>Image variation</h4>

This will let you create a random variation of the original `image` provided to the API call.

In [16]:
image_size = 256

# Rewind the file handle in case an earlier cell already read it
chihuahua.seek(0)

response = client.images.create_variation(
    image=chihuahua,
    n=1,
    response_format='url',
    size=f"{image_size}x{image_size}"
)

image_url = response.model_dump()['data'][0]['url']
IFrame(src=image_url, width=image_size, height=image_size)

<h3>Audio with Whisper</h3>

`whisper-1` lets you generate `transcriptions` from the audio files you provide. By specifying the original language of the audio, you can help the model produce a faster and more accurate result!

Make sure to check the limits for duration and file size in the documentation.
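
As a sketch, a file can be checked locally before it is uploaded. The 25 MB limit and the list of extensions below are what the documentation stated at the time of writing; treat both as assumptions and confirm them against the current documentation. The helper name `check_audio_file` is hypothetical.

```python
import os

# Assumed limits; confirm them in the current documentation
MAX_AUDIO_BYTES = 25 * 1024 * 1024
SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}


def check_audio_file(file_path, max_bytes=MAX_AUDIO_BYTES):
    """Return a list of problems found with the audio file (empty if none)."""
    problems = []
    extension = os.path.splitext(file_path)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {extension or '(none)'}")
    size = os.path.getsize(file_path)
    if size > max_bytes:
        problems.append(f"file is {size} bytes, limit is {max_bytes}")
    return problems
```

An empty list means the file passed both checks; the duration limit is not covered by this sketch.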

In [20]:
def show_audio_with_controls(file_path):
    display(HTML("<audio controls><source src={} type='audio/mpeg'></audio>".format(file_path)))

english_audio_file_path = media_path + "english_audio.mp3"

show_audio_with_controls(english_audio_file_path)

In [21]:
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=open(english_audio_file_path, "rb"),
    language="en"
)

transcript.text

"So, you're a travel agency that provides programs in Hajj and Umrah, and you're searching for the best hotels and services with extraordinary discounts. Well, look no further. Eid Hijri is giving travel agencies all over the world the opportunity to join its international alliance of agencies in Hajj and Umrah sectors."

Note that this API also accepts a `temperature` parameter, so you can see how much the transcription changes as you raise it.

In [27]:
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=open(english_audio_file_path, "rb"),
    language="en",
    temperature=1.3,
)
transcript.text

"So, you're a travel pharmacy that serves as a дом 음식"

<h4>Translations</h4>

This API generates English `translations` of audio in any of the languages supported by OpenAI.

You can test it with the following audio in Spanish:

In [22]:
spanish_audio_file_path = media_path + "opiniones.mp3"

show_audio_with_controls(spanish_audio_file_path)

In [29]:
audio_file = open(spanish_audio_file_path, "rb")
translation = client.audio.translations.create(
    model="whisper-1",
    file=audio_file
)

translation.text

"I love it, I like it very much, I like it, I don't like it, I don't like anything, I hate it I love them, I like them, I don't like them, I don't like anything, I hate"

<h3>Text To Speech with TTS 1</h3>

Pretty straightforward: type in whatever you want to say, choose a voice from the selection, point to the file where you want to store the audio, and enjoy!

In [23]:
output_speech_file = media_path + "generated_speech.mp3"

response = client.audio.speech.create(
  model="tts-1",
  voice="alloy",
  input="Look mom, I am coding in Python!"
)

response.stream_to_file(output_speech_file)

In [24]:
show_audio_with_controls(output_speech_file)