<a href="https://colab.research.google.com/github/similearnergithub/VisionVerse/blob/main/ClipBlip.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

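The cell below uses CLIP to score a fixed list of candidate labels against the image, turns the scores into probabilities with a softmax, and keeps the three most likely labels. The ranking step on its own can be sketched in plain Python. The helper names `softmax` and `top_k_labels` and the scores are made up for illustration; in the notebook the scores come from `clip_model`.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_labels(labels, logits, k=3):
    """Return the k (label, probability) pairs with the highest probability, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical scores for illustration only; real values come from CLIP.
labels = ["a cat", "a dog", "a beach", "a boat", "a bird"]
logits = [18.0, 17.5, 24.0, 23.0, 22.5]

for label, prob in top_k_labels(labels, logits):
    print(f"{label}: {prob:.3f}")
```

Because the softmax runs only over the labels supplied, the top three are always drawn from that list, even when none of them actually appears in the image.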
In [None]:
import requests
from PIL import Image, UnidentifiedImageError
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BlipForConditionalGeneration,
    BlipProcessor,
    CLIPModel,
    CLIPProcessor,
    pipeline,
)
import torch

# Load image captioning and object recognition models
caption_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
caption_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Load conversational model
torch.random.manual_seed(0)
conversational_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")
conversational_pipe = pipeline(
    "text-generation",
    model=conversational_model,
    tokenizer=tokenizer,
)

def recognize_elements(image):
    # Use CLIP to rank a fixed list of candidate labels against the image (zero-shot)
    text = ["a cat", "a dog", "a person", "a car", "a tree", "a house", "a beach", "a mountain", "a bird", "a bike", "a boat"]
    inputs = clip_processor(text=text, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():  # inference only, so skip gradient tracking
        outputs = clip_model(**inputs)
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1)
    # Keep the three labels with the highest probability
    recognized_elements = [text[idx] for idx in probs.topk(3).indices[0].tolist()]
    return recognized_elements

def generate_captions(image, text_prompt=None):
    # Recognize elements in the image
    elements = recognize_elements(image)

    # BLIP continues the text it is given instead of following it as an
    # instruction, so pass the user's hint as a caption prefix, or no text at all.
    if text_prompt:
        inputs = caption_processor(image, text_prompt, return_tensors="pt")
    else:
        inputs = caption_processor(image, return_tensors="pt")
    with torch.no_grad():  # inference only, so skip gradient tracking
        out = caption_model.generate(**inputs)
    caption = caption_processor.decode(out[0], skip_special_tokens=True)

    return caption, elements

def handle_conversation(messages):
    generation_args = {
        "max_new_tokens": 500,
        "return_full_text": False,
        "temperature": 0.0,
        "do_sample": False,
    }
    output = conversational_pipe(messages, **generation_args)
    return output[0]['generated_text']

# Conversational loop
print("Hi there! I'm here to help you generate captions, recognize elements in images, and have a conversation.")

while True:
    # Ask the user for an image URL
    image_url = input("Please provide an image URL (or type 'exit' if you're done): ")
    if image_url.lower() == 'exit':
        print("Okay, goodbye! Have a great day!")
        break

    try:
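        # Load the image from the provided URL (must point directly at an image file)
        response = requests.get(image_url, stream=True, timeout=30)
        response.raise_for_status()  # surface HTTP errors such as 404 instead of parsing an error page
        raw_image = Image.open(response.raw).convert('RGB')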
    except UnidentifiedImageError:
        print("Sorry, I couldn't identify the image. Please provide a valid image URL.")
        continue
    except Exception as e:
        print(f"An error occurred while loading the image: {e}")
        continue

    # Ask the user for a text prompt (optional)
    text_prompt = input("Would you like to give me a hint for the caption? If yes, type a prompt. If not, just press Enter: ")

    # Generate captions and recognize elements based on user input
    caption, elements = generate_captions(raw_image, text_prompt)
    print(f"Caption: {caption}")
    print(f"Recognized Elements: {', '.join(elements)}")

    # Continue the conversation
    user_query = input("Ask me anything about this image or anything else: ")
    if user_query.lower() == 'exit':
        print("Okay, goodbye! Have a great day!")
        break

    # Prepare the conversation: the image context and the question go in a single user turn
    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {
            "role": "user",
            "content": (
                f"Caption: {caption}\n"
                f"Recognized Elements: {', '.join(elements)}\n"
                f"{user_query}"
            ),
        },
    ]

    # Get the response from the conversational model
    response = handle_conversation(messages)
    print(f"Assistant: {response}")

    # Continue the loop
    print("Do you have another image you'd like to caption and analyze? Let's do it!")

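The loop above hands whatever the URL returns to `Image.open`, so a link to a web page (a photo's gallery page or a search-results page, rather than the image file itself) ends in `UnidentifiedImageError`. One way to catch this earlier is to inspect the response's `Content-Type` header before decoding. The following is a minimal sketch using only the standard library; `is_image_content_type` and `fetch_image_bytes` are hypothetical helper names, not part of the notebook, and some servers may send a missing or misleading header.

```python
from urllib.request import Request, urlopen

def is_image_content_type(content_type):
    """True when a Content-Type header value names an image media type."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and compare case-insensitively
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("image/")

def fetch_image_bytes(url, timeout=30):
    """Download url and return its bytes, or raise ValueError if it is not an image."""
    # Assumption: some hosts reject requests without a browser-like User-Agent
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        content_type = response.headers.get("Content-Type", "")
        if not is_image_content_type(content_type):
            raise ValueError(f"Expected an image but the server sent {content_type!r}")
        return response.read()

print(is_image_content_type("image/jpeg"))
print(is_image_content_type("text/html; charset=utf-8"))
```

The returned bytes can then be wrapped in `io.BytesIO` and passed to `Image.open`, which keeps the download check separate from the decoding step.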
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3.5-mini-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3.5-mini-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Hi there! I'm here to help you generate captions, recognize elements in images, and have a conversation.
Please provide an image URL (or type 'exit' if you're done): https://unsplash.com/photos/a-man-in-a-suit-and-tie-walking-a-dog-hQy3rB1Y5qg
Sorry, I couldn't identify the image. Please provide a valid image URL.
Please provide an image URL (or type 'exit' if you're done): https://unsplash.com/photos/two-gray-pencils-on-yellow-surface-1_CMoFsPfso
Sorry, I couldn't identify the image. Please provide a valid image URL.
Please provide an image URL (or type 'exit' if you're done): https://www.google.com/search?q=fatcat+pictures&sca_esv=77c4f50781debaf0&sca_upv=1&udm=2&biw=1536&bih=695&sxsrf=ADLYWILn8Kd6vpKEQ0xfenfmMEKv-fMhNw%3A1725288061988&ei=fc7VZsf0O5Py1e8Pzc7W4Q8&ved=0ahUKEwiHp8fwvqSIAxUTefUHHU2nNfwQ4dUDCBE&uact=5&oq=fatcat+pictures&gs_lp=Egxnd3Mtd2l6LXNlcnAiD2ZhdGNhdCBwaWN0dXJlczIFEAAYgAQyBhAAGAcYHjIGEAAYBxgeMgYQABgHGB4yBhAAGAcYHjIGEAAYBxgeMgYQABgHGB4yBhAAGAcYHjIIEAAYBRgHGB4yCBAAGAUY



Caption: yes the image contains a beach, a boat, a bird.
Recognized Elements: a beach, a boat, a bird
Ask me anything about this image or anything else: what color is used


The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.


Assistant:  The provided caption does not include specific information about the colors used in the image. To determine the colors present in the image featuring a beach, a boat, and a bird, one would need to visually inspect the image itself. Colors can vary widely depending on the time of day, the season, the specific location of the beach, the type of boat, and the species of bird. If you have the image available, I can help describe the colors you might expect to see based on common scenarios. For example:

- The beach might have shades of sand ranging from light beige to deep golden hues, with possible blue or green reflections from the water.
- The boat could be painted in various colors, such as white, red, blue, or any other color depending on its design.
- The bird might be colored according to its species, with possibilities ranging from the bright reds and yellows of a flamingo to the muted browns and greens of a seagull.

Without the actual image, it's not possible to provi