**Image Description Generator**:

Build a tool that generates detailed, accurate text descriptions of uploaded images to improve accessibility.
This task tests the ability to integrate multimodal AI capabilities. (Consider testing with architecture pictures.)

In [None]:
import requests
from io import BytesIO
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
import torch

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

def generate_caption(image_path):
    """Return a BLIP caption for an image given as a URL or a local file path."""
    if image_path.startswith("http"):
        # Some hosts reject requests that lack a browser-like User-Agent.
        headers = {"User-Agent": "Mozilla/5.0"}
        response = requests.get(image_path, headers=headers, timeout=30)
        content_type = response.headers.get("Content-Type", "")
        if response.status_code != 200 or "image" not in content_type:
            raise ValueError(
                f"Failed to download image (status code: {response.status_code}, "
                f"Content-Type: {content_type!r})"
            )
        image = Image.open(BytesIO(response.content)).convert("RGB")
    else:
        image = Image.open(image_path).convert("RGB")

    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_length=500,
            num_beams=10,
            repetition_penalty=2.5,
            do_sample=True,
            temperature=0.5,
            top_k=50,
            num_return_sequences=1,
            decoder_start_token_id=model.config.bos_token_id
        )

    return processor.decode(output[0], skip_special_tokens=True)

image_url = "https://www.cybermedian.com/wp-content/uploads/2022/02/0j3G8oZH4Yj5voOmG.png"

try:
    caption = generate_caption(image_url)
    print("Generated Caption:", caption)
except Exception as e:
    print("Error:", e)


Generated Caption: this is an image of a flow diagram that shows how to use the system
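
`generate_caption` above chooses between a remote and a local image with `startswith("http")`, which would also match a local file whose name happens to begin with `http`. A minimal sketch of a stricter check, using only the standard library, is shown below. The helper name `is_remote_image` is hypothetical and is not part of the original notebook.

```python
from urllib.parse import urlparse

def is_remote_image(source: str) -> bool:
    """Return True only for http(s) URLs that include a host name."""
    parsed = urlparse(source)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

print(is_remote_image("https://example.com/a.png"))  # True
print(is_remote_image("http_diagram.png"))           # False
```

A check like this could replace the `startswith("http")` test if local file names are not under your control.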


In [None]:
%pip install groq
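
The stated goal is descriptions that improve accessibility, while a generic instruction such as "Describe the image." leaves the level of detail up to the model. The sketch below shows one way to build a more specific request in the chat-message shape used with the Groq client in this notebook. The prompt wording and the helper name `build_accessibility_messages` are assumptions for illustration, not a tested recommendation.

```python
# Hypothetical prompt text; adjust the wording to the audience and image type.
ACCESSIBILITY_PROMPT = (
    "Describe this image for a reader who cannot see it. "
    "State the type of image first, then list the main elements in reading order, "
    "quote any visible text exactly, and explain how the elements are connected. "
    "Do not guess at details that are not visible."
)

def build_accessibility_messages(image_url: str, prompt: str = ACCESSIBILITY_PROMPT) -> list:
    """Build a single-turn message list pairing a text prompt with one image."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]

messages = build_accessibility_messages("https://example.com/diagram.png")
```

The returned list could be passed as the `messages` argument of `client.chat.completions.create`.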



In [12]:
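import base64
import mimetypes
from groq import Groq

from google.colab import userdata
groq_api_key = userdata.get('groq_api_key')

client = Groq(api_key=groq_api_key)

def encode_image(image_path):
    """Read a local image file and return its contents as a base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

def generate_image_description(image_input):
    """Ask the vision model to describe an image given as a URL or a local file path."""
    if image_input.startswith("http"):
        image_data = {"type": "image_url", "image_url": {"url": image_input}}
    else:
        # Use the file's actual MIME type (e.g. image/png) rather than assuming JPEG.
        mime_type = mimetypes.guess_type(image_input)[0] or "image/jpeg"
        base64_image = encode_image(image_input)
        image_data = {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{base64_image}"}}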

    completion = client.chat.completions.create(
        model="llama-3.2-11b-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe the image."},
                    image_data
                ]
            }
        ],
        temperature=0.7,
        max_completion_tokens=512,
        top_p=1,
    )

    return completion.choices[0].message.content

image_input = "https://www.cybermedian.com/wp-content/uploads/2022/02/0j3G8oZH4Yj5voOmG.png"
#image_input = "cycles.png"

image_description = generate_image_description(image_input)
print("Image Description:", image_description)



Image Description: The image presents a flowchart that outlines the steps involved in navigating a route. The flowchart is divided into several sections, each representing a different stage in the process.

* **Start**:
	+ The flowchart begins with a "Start" section, which is represented by a pink oval.
	+ This section serves as the starting point for the flowchart and initiates the process of navigating a route.
* **Power On**:
	+ The next section is "Power On," which is represented by a yellow rectangle with a blue background.
	+ This section indicates that the device has been turned on, and the system is ready to begin the route navigation process.
* **Scan Environment**:
	+ The following section is "Scan Environment," which is represented by a yellow rectangle with a blue background.
	+ This section suggests that the device is scanning its surroundings to gather information about the environment.
* **Generate Map and Location**:
	+ The next section is "Generate Map and Location," w