# Project 7: Working with images using the GPT-4 Vision model


## What You Will Learn

- **GPT-4 Vision model**: Explore the vision capabilities of the GPT-4 model and learn how to build computer vision applications with it.


## Getting Started

Before we jump in, ensure you have:

- A Google Colab account.
- Basic knowledge of Python and REST APIs.
- An OpenAI API key with access to the GPT-4 Vision, text-to-speech, and Whisper models ([OpenAI](https://platform.openai.com/account/api-keys)).

## Embarking on a Visual Journey

Ready to build a new AI application with GPT-4 Vision? Let's begin our journey into computer vision with GPT.



# 2. Libraries import

In [1]:
!pip install openai

Collecting openai
  Downloading openai-1.30.1-py3-none-any.whl (320 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
Installing collected packages: h11, httpcore, httpx, openai
Successfully installed h11-0.14.0 httpcore-1.0.5 ht

In [2]:
import os
import openai
import base64
import requests

from openai import OpenAI

# 3. Sending a first request to OpenAI API


### 3.1 Setting up API Key

In [3]:
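# Never paste a real key into a notebook cell: anyone who sees the notebook can use it.
# Set the OPENAI_API_KEY environment variable instead, for example:
# os.environ["OPENAI_API_KEY"] = "sk-XXXXXXXXXXXXX"
# OpenAI() reads the key from OPENAI_API_KEY when no api_key argument is given.
client = OpenAI()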
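Keeping the key out of the notebook source avoids leaking it when the notebook is shared. A minimal sketch of one way to do that, assuming the key lives in the `OPENAI_API_KEY` environment variable; `get_api_key` is a hypothetical helper, not part of the `openai` package.

```python
import os
from getpass import getpass


def get_api_key(env_var: str = "OPENAI_API_KEY", prompt_if_missing: bool = False) -> str:
    """Return the API key from the environment, optionally prompting for it."""
    key = os.environ.get(env_var, "").strip()
    if not key and prompt_if_missing:
        # getpass hides the typed value, so it does not appear in the cell output.
        key = getpass(f"Enter {env_var}: ").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set")
    return key
```

In a notebook, this could be used as `client = OpenAI(api_key=get_api_key(prompt_if_missing=True))`.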

# 4. Classifying and describing images



In [5]:
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

In [6]:
base64_image = encode_image("test_img.jpg")
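The requests in this section hard-code `image/jpeg` in the data URL, which only matches JPEG files. A small sketch of a variant that guesses the MIME type from the file extension; `to_data_url` and its `default_mime` parameter are hypothetical names introduced for illustration.

```python
import base64
import mimetypes


def to_data_url(image_path: str, default_mime: str = "image/jpeg") -> str:
    """Encode an image file as a base64 data URL with a guessed MIME type."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None or not mime.startswith("image/"):
        # Fall back when the extension is unknown or not an image type.
        mime = default_mime
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```

For a PNG file, the result starts with `data:image/png;base64,` instead of the JPEG prefix.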

In [14]:
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in the image?"},
                {
                    "type": "image_url",
                    # The documented format wraps the data URL in an object with a "url" key.
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
            ],
        }
    ],
    max_tokens=200,
)

print(response.choices[0])

Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="The image shows a stylized, neon-colored cityscape with tall buildings on either side of a central street or path that leads toward a large white circle which appears to be the moon or sun set against a pink sky. The scene has a retro-futuristic or cyberpunk aesthetic, with a color palette that includes shades of pink, purple, cyan, and blue. Additionally, there are stars and possibly small planets or celestial bodies in the sky above the city. The image exudes a feeling of an 80's-inspired, vaporwave or retrowave art style.", role='assistant', function_call=None, tool_calls=None))


In [15]:
print(response.choices[0].message.content)

The image shows a stylized, neon-colored cityscape with tall buildings on either side of a central street or path that leads toward a large white circle which appears to be the moon or sun set against a pink sky. The scene has a retro-futuristic or cyberpunk aesthetic, with a color palette that includes shades of pink, purple, cyan, and blue. Additionally, there are stars and possibly small planets or celestial bodies in the sky above the city. The image exudes a feeling of an 80's-inspired, vaporwave or retrowave art style.
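Each request in this section repeats the same message structure, so it may help to build it in one place. The sketch below assembles a user message from a prompt and one or more data URLs, using the `{"url": ..., "detail": ...}` form of the `image_url` content part; `build_vision_message` is a hypothetical helper name.

```python
from typing import Dict, List


def build_vision_message(prompt: str, data_urls: List[str], detail: str = "auto") -> Dict:
    """Build a single user message holding a text prompt followed by images."""
    content = [{"type": "text", "text": prompt}]
    for url in data_urls:
        # "detail" may be "low", "high", or "auto"; "low" uses fewer tokens.
        content.append({"type": "image_url", "image_url": {"url": url, "detail": detail}})
    return {"role": "user", "content": content}
```

The result can be passed as `messages=[build_vision_message("What's in the image?", [data_url])]`, where `data_url` is a string such as `f"data:image/jpeg;base64,{base64_image}"`.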


In [12]:
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Act as an image classification algorithm. Your task is to classify this image into one of these classes: Outdoor, Pool, Living room, other. Provide only the class, and nothing else."},
                {
                    "type": "image_url",
                    # The documented format wraps the data URL in an object with a "url" key.
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0])

Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Outdoor', role='assistant', function_call=None, tool_calls=None))


In [13]:
response.choices[0].message.content

'Outdoor'
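The model returns free text, so the answer is not guaranteed to match one of the requested classes exactly (it may add a period, change the casing, or return a sentence). A minimal sketch of mapping the reply back onto the allowed labels; `normalize_label` and the fallback behaviour are assumptions made for illustration.

```python
ALLOWED_CLASSES = ("Outdoor", "Pool", "Living room", "other")


def normalize_label(raw: str, allowed=ALLOWED_CLASSES, fallback: str = "other") -> str:
    """Map a free-text model reply onto one of the allowed class labels."""
    # Drop surrounding whitespace, quotes, and trailing punctuation, then ignore case.
    cleaned = raw.strip().strip(".'\"").casefold()
    for label in allowed:
        if cleaned == label.casefold():
            return label
    # Anything that does not match exactly is treated as the fallback class.
    return fallback
```

For example, `normalize_label(response.choices[0].message.content)` would return `'Outdoor'` for the reply shown above.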

## Text To Speech using TTS API

In [25]:
from IPython.display import Audio

In [24]:
# File where the generated speech will be saved
speech_file_path = "tts_test.mp3"

# Pass the text directly as input. This example uses the tts-1 model.
# Several voices are available; see the documentation:
# https://platform.openai.com/docs/guides/text-to-speech

audio_response = client.audio.speech.create(
  model="tts-1",
  voice="alloy",
  input="Hey there! I am your personal assistant. I can help you with selling your old items :)"
)

audio_response.stream_to_file(speech_file_path)

# Play the generated audio file
Audio(speech_file_path, autoplay=True)



Let us convert the audio file back into text using the Whisper model and check the result.

In [21]:
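# Open the file in a with-block so it is closed once the request finishes
with open("tts_test.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="vtt",  # WebVTT: subtitle text with timestamps
    )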

In [22]:
print(transcript)

WEBVTT

00:00:00.000 --> 00:00:02.400
Hey there, I'm your personal assistant.

00:00:02.400 --> 00:00:04.960
I can help you with selling your old items.
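The WebVTT output is plain text, so the timestamps have to be parsed before they can be used in code. A minimal sketch of a parser, assuming cues use the `hh:mm:ss.mmm --> hh:mm:ss.mmm` form seen in the output above (WebVTT also allows timestamps without the hour part, which this sketch does not handle); `parse_vtt` and `to_seconds` are hypothetical helpers.

```python
import re
from typing import List, Tuple

CUE_RE = re.compile(r"(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})")


def to_seconds(timestamp: str) -> float:
    """Convert an hh:mm:ss.mmm timestamp into seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_vtt(vtt: str) -> List[Tuple[float, float, str]]:
    """Return (start, end, text) tuples for each cue in a WebVTT string."""
    cues = []
    lines = vtt.splitlines()
    i = 0
    while i < len(lines):
        match = CUE_RE.match(lines[i].strip())
        if not match:
            i += 1
            continue
        # Collect the text lines that follow the timing line, up to a blank line.
        i += 1
        text_lines = []
        while i < len(lines) and lines[i].strip():
            text_lines.append(lines[i].strip())
            i += 1
        cues.append((to_seconds(match.group(1)), to_seconds(match.group(2)), " ".join(text_lines)))
    return cues
```

Calling `parse_vtt(transcript)` on the output above would give two cues, the first running from 0.0 to 2.4 seconds.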




# PROJECT 7: Generating a voiceover for a video

In [23]:
from IPython.display import display, Image, Audio

import cv2
import base64
import time
import openai
import os
import requests

In [26]:
# Code taken from OpenAI blog
video = cv2.VideoCapture("experiment_video_desc.mp4")

base64Frames = []
while video.isOpened():
    success, frame = video.read()
    if not success:
        break
    _, buffer = cv2.imencode(".jpg", frame)
    base64Frames.append(base64.b64encode(buffer).decode("utf-8"))

video.release()
print(len(base64Frames), "frames read.")

2639 frames read.


Let us create MESSAGES, which holds the prompt plus 13 image frames.

In [50]:
MESSAGES = [
    {
        "role": "user",
        "content": [
            "These are the frames of a video. Create a short voiceover based on these images"
        ],
    }
]

In [51]:
## The prompt
MESSAGES

[{'role': 'user',
  'content': ['These are the frames of a video. Create a short voiceover based on these images']}]

In [52]:
## Append 13 consecutive image frames, starting at frame 400
for i in range(13):
  MESSAGES[0]["content"].append({"image": base64Frames[i+400], "resize": 768})
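The loop above takes 13 consecutive frames, which cover only a fraction of a second of the clip. If the voiceover should reflect the whole video, one option is to pick frames spread evenly across it. A minimal sketch of computing such indices; `sample_frame_indices` is a hypothetical helper, not part of OpenCV or the OpenAI SDK.

```python
from typing import List, Optional


def sample_frame_indices(total_frames: int, num_samples: int,
                         start: int = 0, stop: Optional[int] = None) -> List[int]:
    """Return up to num_samples evenly spaced frame indices in [start, stop)."""
    stop = total_frames if stop is None else min(stop, total_frames)
    if num_samples <= 0 or start >= stop:
        return []
    span = stop - start
    if num_samples >= span:
        # Fewer frames available than requested: return all of them.
        return list(range(start, stop))
    step = span / num_samples
    return [start + int(i * step) for i in range(num_samples)]
```

With the frames read earlier, `[base64Frames[i] for i in sample_frame_indices(len(base64Frames), 13)]` would select 13 frames spanning the full clip instead of one short burst.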

In [53]:
## print what MESSAGES now holds
MESSAGES

[{'role': 'user',
  'content': ['These are the frames of a video. Create a short voiceover based on these images',
   {'image': '/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAIBAQEB... (base64 data truncated)

In [54]:
## Send the prompt and frames to the GPT-4 Vision model
res_final = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=MESSAGES,
    max_tokens=500,
)

print(res_final.choices[0].message.content)

"Amidst the soft symphony of a gentle rain, we gaze into the heart of the celestial canvas, where each droplet mirrors the boundless mysteries of the cosmos. The night cloaks the world in its enigmatic embrace, as luminous orbs whisper silent tales of ancient light traveling across the eons. Within this ethereal scene, the clouds part to reveal a glowing revelation, an ephemeral dance of colors that beat to the rhythm of the universe itself. It is a fleeting moment, where the heavens stretch wide, reminding us of the infinite spectacle that unfolds above, as time itself seems to pause in reverence to the beauty of the cosmos."


In [33]:
'''
PROMPT_MESSAGES = [
    {
        "role": "user",
        "content": [
            "These are frames of a video. Create a short voiceover script in the style of David Attenborough. Only include the narration.",
            *map(lambda x: {"image": x, "resize": 768}, base64Frames[0:130:10]),
        ],
    },
]

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=PROMPT_MESSAGES,
    max_tokens=500,
)

print(response.choices[0])
'''

Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="I'm sorry, but I cannot assist with that request.", role='assistant', function_call=None, tool_calls=None))
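The reply above is a refusal rather than a narration, and passing it on to the text-to-speech step would produce a useless audio file. A rough sketch of a guard that flags such replies before they are used; the marker phrases and the `looks_like_refusal` helper are assumptions chosen for illustration and are only a heuristic.

```python
REFUSAL_MARKERS = (
    "i'm sorry",
    "i am sorry",
    "i cannot assist",
    "i can't assist",
    "i cannot help",
    "i can't help",
)


def looks_like_refusal(text, max_length: int = 200) -> bool:
    """Heuristically flag empty replies and short replies that read like refusals."""
    if not text or not text.strip():
        # An empty reply is unusable as narration, so it is flagged as well.
        return True
    lowered = text.strip().lower().replace("\u2019", "'")
    # Only short replies are checked, so a long narration containing "sorry" passes.
    return len(lowered) <= max_length and any(marker in lowered for marker in REFUSAL_MARKERS)
```

A caller could check `looks_like_refusal(response.choices[0].message.content)` and retry with a different prompt or a different set of frames when it returns `True`.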


In [55]:
speech_file_path = "voiceover_speech.mp3"

audio_response = client.audio.speech.create(
  model="tts-1",
  voice="alloy",
  input=res_final.choices[0].message.content
)

audio_response.stream_to_file(speech_file_path)
Audio(speech_file_path, autoplay=True)
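The text-to-speech endpoint limits how much text a single request may contain (4096 characters according to the documentation at the time of writing; treat the exact number as an assumption and check the current docs). A longer voiceover script would need to be split first. A minimal sketch that splits on sentence boundaries; `split_for_tts` is a hypothetical helper.

```python
import re
from typing import List


def split_for_tts(text: str, max_chars: int = 4096) -> List[str]:
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be sent to `client.audio.speech.create` separately and saved to its own file.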

