# Working with images using the GPT-4o mini model



## Embarking on a Visual Journey

In this project, we will use the vision capabilities of the GPT-4o mini model to describe an image and to create a voiceover for a video file.



# 2. Importing libraries

In [1]:
!pip install openai
!pip install python-dotenv

Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1


In [2]:
import os
import openai
import base64
import requests

from openai import OpenAI
from dotenv import load_dotenv

# 3. Sending a first request to OpenAI API


### 3.1 Setting up API Key

In [3]:
# os.environ["OPENAI_API_KEY"] = "sk-XXXXXXXXXXXXX"
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

if api_key and api_key.startswith('sk-proj-') and len(api_key)>10:
    print("API key looks good so far")
else:
    print("There might be a problem with your API key? Please visit the troubleshooting notebook!")

client = OpenAI()

API key looks good so far
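The check above prints a pass/fail message only. As a minimal sketch, the same idea can be wrapped in a helper that reports why a key looks wrong without ever printing the full secret. The function name `describe_api_key`, the 20-character minimum, and the masking format are illustrative assumptions; the `sk-proj-` prefix is the one the cell above tests for.

```python
def describe_api_key(key, expected_prefix="sk-proj-", min_length=20):
    """Return a short, non-secret status string for an API key value.

    Hypothetical helper: the prefix mirrors the check in the cell above,
    and min_length is an arbitrary sanity threshold, not an official rule.
    """
    if not key:
        return "missing"
    if key != key.strip():
        return "has surrounding whitespace"
    if not key.startswith(expected_prefix):
        return "unexpected prefix"
    if len(key) < min_length:
        return "too short"
    # Show only the first 7 and last 4 characters so the key is never logged in full.
    return f"ok ({key[:7]}...{key[-4:]})"
```

Calling `describe_api_key(os.getenv("OPENAI_API_KEY"))` would then give a one-line status that is safe to paste into a bug report.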


# 4. Classifying and describing images



In [4]:
def encode_image(image_path):
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode('utf-8')
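`encode_image` returns only the base64 text, so the caller has to hard-code the MIME type when building the data URL. A possible variation, sketched here under the assumption that the file extension reflects the real image format, guesses the MIME type with the standard-library `mimetypes` module and returns the complete data URL. The name `image_to_data_url` is hypothetical.

```python
import base64
import mimetypes


def image_to_data_url(image_path):
    """Read an image file and return it as a base64 data URL.

    Assumes the extension (.jpg, .png, ...) matches the actual file content.
    """
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type from: {image_path}")
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

With this helper, a PNG file would produce a URL starting with `data:image/png;base64,` rather than being mislabeled as JPEG.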

In [5]:
base64_image = encode_image("test_img.jpg")

In [6]:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's happening in the image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        # The image is passed inline as a base64-encoded data URL.
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                },
            ],
        }
    ],
    max_tokens=300,
)

In [8]:
print(response.choices[0].message.content)

The image depicts a bustling street scene during sunset. You can see tall buildings on one side of the street, with a mix of modern and older structures. The sky is filled with dramatic clouds and vibrant colors from the setting sun. Vehicles, including motorcycles and cars, are parked along the street, with some moving as well. A person appears to be standing near an open car door, likely preparing to get in or out. The overall atmosphere suggests a lively urban environment transitioning into evening.
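The request above combines one text part and one image part in a single user message. Since the same structure is useful for any prompt about an image, it could be factored into a small builder. This is only a sketch; `build_image_messages` is a hypothetical name, and the default MIME type assumes a JPEG as in the example.

```python
def build_image_messages(prompt, base64_image, mime_type="image/jpeg"):
    """Build a `messages` list with one text part and one inline image part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {
                        # Inline data URL, same shape as in the request above.
                        "url": f"data:{mime_type};base64,{base64_image}"
                    },
                },
            ],
        }
    ]
```

The result can be passed as `messages=build_image_messages("What's happening in the image?", base64_image)` to `client.chat.completions.create`.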


In [9]:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Act as an image classification algorithm. Your task is to classify this image into one of these classes: Outdoor, Pool, Living room, other. Provide only the class, and nothing else."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                },
            ],
        }
    ],
    max_tokens=300,
)

In [10]:
print(response.choices[0].message.content)

Outdoor
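The model is asked to answer with a class name only, but a language model may still add punctuation, quotes, or different capitalization. A minimal sketch of normalizing the reply against the allowed labels follows; the names `ALLOWED_CLASSES` and `normalize_class`, and the choice to fall back to `other`, are assumptions made for illustration.

```python
ALLOWED_CLASSES = ("Outdoor", "Pool", "Living room", "other")


def normalize_class(reply, allowed=ALLOWED_CLASSES, fallback="other"):
    """Map a free-text model reply onto one of the allowed class labels."""
    # Drop surrounding whitespace, then trailing/leading dots and quotes.
    cleaned = reply.strip().strip(".\"'").casefold()
    for label in allowed:
        if cleaned == label.casefold():
            return label
    # Anything unrecognized is treated as the catch-all class.
    return fallback
```

For example, `normalize_class(response.choices[0].message.content)` would keep downstream code from breaking on a reply such as `living room.`.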


## Text-to-speech using the TTS API

In [11]:
speech_file_path = "tts_test.mp3"

audio_response = client.audio.speech.create(
  model="tts-1",
  voice="alloy",
  input="Hey there! I am using TTS API :)"
)

# Write the audio to disk. Note: recent versions of the openai package mark stream_to_file on this
# response object as deprecated and recommend the with_streaming_response variant instead.
audio_response.stream_to_file(speech_file_path)

from IPython.display import Audio
Audio(speech_file_path, autoplay=True)
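For reference, here is a sketch of the streaming variant that the openai Python package recommends in place of calling `stream_to_file` on a plain response. The wrapper name `save_speech` is hypothetical, and the client is passed in as a parameter so the function has no hidden dependency on a global.

```python
def save_speech(client, text, path, model="tts-1", voice="alloy"):
    """Generate speech for `text` and stream the audio into `path`.

    `client` is expected to be an openai.OpenAI instance.
    """
    with client.audio.speech.with_streaming_response.create(
        model=model,
        voice=voice,
        input=text,
    ) as response:
        # The response body is written to disk as it arrives.
        response.stream_to_file(path)
    return path
```

Usage would look like `save_speech(client, "Hey there! I am using TTS API :)", "tts_test.mp3")`, followed by `Audio("tts_test.mp3", autoplay=True)` to play it in the notebook.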




# PROJECT 7: Generating a voiceover for a video

In [12]:
from IPython.display import display, Image, Audio
import os
import cv2
import base64
import requests

In [13]:
# Code taken from OpenAI blog
video = cv2.VideoCapture("welcome-to-India.mp4")

base64Frames = []
while video.isOpened():
    success, frame = video.read()
    if not success:
        break
    _, buffer = cv2.imencode(".jpg", frame)
    base64Frames.append(base64.b64encode(buffer).decode("utf-8"))

video.release()
print(len(base64Frames), "frames read.")

4904 frames read.
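Sending all 4904 frames to the model would be slow and costly, so in practice only every Nth frame is used. The arithmetic can be sketched as follows; both function names are hypothetical, and the step of 240 in the usage note is just an example value.

```python
import math


def sampled_indices(total_frames, step):
    """Indices selected by the slice frames[0::step]."""
    return list(range(0, total_frames, step))


def step_for_target(total_frames, target_count):
    """Smallest step for which frames[0::step] has at most target_count items."""
    return max(1, math.ceil(total_frames / target_count))
```

With 4904 frames, `len(sampled_indices(4904, 240))` is 21, and `step_for_target(4904, 10)` returns 491, which keeps the sample at 10 frames.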


In [14]:
PROMPT_MESSAGES = [
    {
        "role": "user",
        "content": [
            "These are frames of a video. Create a short voiceover script in the style of David Attenborough. Only include the narration.",
            *map(lambda x: {"image": x, "resize": 64}, base64Frames[0::240]),
        ],
    },
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=PROMPT_MESSAGES,
    max_tokens=1000,
)

print(response.choices[0])

Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="In the heart of a vibrant land, where ancient fortresses rise majestically against sprawling blue-hued villages, we witness the intricate tapestry of human life entwined with nature's wonders. Rivers glisten like emerald ribbons, carving paths through the landscape as vessels glide peacefully upon their waters.\n\nHere, amid the echoes of history and the pulse of everyday existence, agricultural life flourishes in its rhythmic dance. Farmers toil under the warm embrace of the sun, gathering harvests that nourish both body and soul.\n\nAs we soar above the verdant hills and rocky outcrops, we catch glimpses of serene ecosystems, where delicate islands emerge from tranquil waters, adorned with lush foliage. This paradise, however, is not without contrasts; bustling streets filled with the symphony of commerce remind us of the resilience of communities thriving in harmony with their environment.\n\
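The request above uses a shorthand content format, with a bare string plus `{"image": ..., "resize": ...}` entries. The same frames can also be expressed with the `text` and `image_url` content parts used for the single-image requests earlier in this notebook. The sketch below assumes JPEG-encoded frames, as produced by `cv2.imencode(".jpg", ...)`. The function name `build_frame_messages` is hypothetical, and `"detail": "low"` is my own choice to limit per-image cost; it is not an exact equivalent of `resize`.

```python
def build_frame_messages(prompt, base64_frames, step=240, detail="low"):
    """Build a single user message containing the prompt and every step-th frame."""
    content = [{"type": "text", "text": prompt}]
    for frame in base64_frames[0::step]:
        content.append(
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{frame}",
                    # Assumption: low detail is enough for a scene-level narration.
                    "detail": detail,
                },
            }
        )
    return [{"role": "user", "content": content}]
```

The returned list can be passed as `messages=` in place of `PROMPT_MESSAGES`.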

In [15]:
speech_file_path = "speech.mp3"
audio_response = client.audio.speech.create(
  model="tts-1",
  voice="alloy",
  input=response.choices[0].message.content
)

audio_response.stream_to_file(speech_file_path)
Audio(speech_file_path, autoplay=True)
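A generated script can be longer than the speech endpoint accepts in one call. As a sketch, the text can be split at sentence boundaries before synthesis. The 4096-character default reflects my understanding of the documented input limit at the time of writing and should be treated as an assumption; `split_for_tts` is a hypothetical helper.

```python
import re


def split_for_tts(text, max_chars=4096):
    """Split text into chunks of at most max_chars, preferring sentence boundaries.

    Whitespace between sentences (including paragraph breaks) is collapsed
    to a single space when sentences are rejoined.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at fixed positions.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be sent to `client.audio.speech.create` separately and the resulting audio files played in order.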

