<a href="https://colab.research.google.com/github/vitchierath/NLPtasks/blob/main/chatbotusingimage.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -U langchain-community

Collecting langchain-community
  Downloading langchain_community-0.3.21-py3-none-any.whl.metadata (2.4 kB)
Collecting langchain-core<1.0.0,>=0.3.51 (from langchain-community)
  Downloading langchain_core-0.3.51-py3-none-any.whl.metadata (5.9 kB)
Collecting langchain<1.0.0,>=0.3.23 (from langchain-community)
  Downloading langchain-0.3.23-py3-none-any.whl.metadata (7.8 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.8.1-py3-none-any.whl.metadata (3.5 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting langchain-text-splitters<1.0.0,>=0.3.8 (from langchain<1.0.0,>=0.3.23->langchain-community)
  Downloading langchain_text_splitters-0.3.8-py3-none-any.whl.metadata (1.9 kB)
Downloading langchain_community-0.3.21-py3-none-any.whl (2.5 MB)

In [None]:
# 🛠 Install Required Libraries
!pip install -q transformers gradio

# ✅ Imports
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
from PIL import Image
import gradio as gr

# ✅ Load Free Vision Model
vision_model_id = "nlpconnect/vit-gpt2-image-captioning"
vision_model = VisionEncoderDecoderModel.from_pretrained(vision_model_id)
vision_processor = ViTImageProcessor.from_pretrained(vision_model_id)
vision_tokenizer = AutoTokenizer.from_pretrained(vision_model_id)

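# ✅ Image Captioning Function
def describe_image(image: Image.Image) -> str:
    # Gradio passes None when the user submits without uploading an image
    if image is None:
        return "⚠️ Please upload an image."
    inputs = vision_processor(images=image, return_tensors="pt")
    # Beam search with 4 beams; the caption is capped at 64 tokens
    output_ids = vision_model.generate(**inputs, max_length=64, num_beams=4)
    return vision_tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()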

# ✅ Gradio UI
gr.Interface(
    fn=describe_image,
    inputs=gr.Image(type="pil", label="🖼️ Upload an Image"),
    outputs="text",
    title="🖼️ Free Image Captioning Bot",
    description="Upload an image and get an AI-generated caption using ViT-GPT2 (no login required)"
).launch(share=True)


Config of the encoder: <class 'transformers.models.vit.modeling_vit.ViTModel'> is overwritten by shared encoder config: ViTConfig {
  "architectures": [
    "ViTModel"
  ],
  "attention_probs_dropout_prob": 0.0,
  "encoder_stride": 16,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 768,
  "image_size": 224,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "model_type": "vit",
  "num_attention_heads": 12,
  "num_channels": 3,
  "num_hidden_layers": 12,
  "patch_size": 16,
  "pooler_act": "tanh",
  "pooler_output_size": 768,
  "qkv_bias": true,
  "torch_dtype": "float32",
  "transformers_version": "4.50.3"
}

Config of the decoder: <class 'transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel'> is overwritten by shared decoder config: GPT2Config {
  "activation_function": "gelu_new",
  "add_cross_attention": true,
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "decoder_start_to

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://36d20d549f321a2f63.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
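
The captioning cell above leaves the model on the default device, which is the CPU. On a Colab GPU runtime, generation may be faster if the model and its inputs are moved to CUDA. A minimal sketch of choosing the device follows; `pick_device` is a hypothetical helper that is not part of this notebook, and it assumes that falling back to `"cpu"` is acceptable when PyTorch cannot be imported.

```python
def pick_device() -> str:
    """Return "cuda" when a GPU is visible to PyTorch, otherwise "cpu".

    Hypothetical helper for illustration; not part of the original notebook.
    """
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Selected device: {device}")
```

Under these assumptions, the result could be applied with `vision_model.to(device)` after loading and `inputs = inputs.to(device)` before calling `generate`.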




In [11]:
# 🛠️ Install Required Libraries
!pip install -q transformers gradio

# ✅ Imports
from transformers import BlipProcessor, BlipForConditionalGeneration
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline
from PIL import Image
import gradio as gr

# ✅ Load BLIP Image Captioning Model
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# ✅ Load Flan-T5-XL for better chatbot answers
chat_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
chat_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")
chatbot = pipeline("text2text-generation", model=chat_model, tokenizer=chat_tokenizer)

# ✅ Image Captioning Function
def generate_caption(image):
    inputs = blip_processor(image, return_tensors="pt")
    output = blip_model.generate(**inputs, max_new_tokens=64)
    caption = blip_processor.decode(output[0], skip_special_tokens=True)
    return caption

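# ✅ Chatbot Handler
def multimodal_chat(image, question):
    # Normalise the question once; it may be None or whitespace-only
    question = (question or "").strip()
    if image is None and not question:
        return "⚠️ Please upload an image or enter a question."

    # Caption the image only when one was uploaded
    caption = generate_caption(image) if image is not None else ""

    if question and caption:
        prompt = f"{question}\n\nImage content: {caption}"
    elif question:
        prompt = question  # text-only question, no image context to append
    else:
        prompt = f"Describe this image: {caption}"
    response = chatbot(prompt, max_new_tokens=200)[0]["generated_text"]
    return response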

# ✅ Gradio Interface
gr.Interface(
    fn=multimodal_chat,
    inputs=[
        gr.Image(type="pil", label="🖼️ Upload Image"),
        gr.Textbox(label="💬 Ask something about the image (optional)")
    ],
    outputs="text",
    title="🤖 Smart Multimodal Chatbot (Free & Accurate)",
    description="Upload an image + question and get a response using BLIP + FLAN-T5-XL (all free models!)"
).launch(share=True)


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


Device set to use cpu


Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://b74a26c21e8cc9cc50.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
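
The chatbot handler above passes FLAN-T5 a single text prompt built from the user's question and the BLIP caption. A minimal sketch of that prompt-building step as a standalone function may make the three cases (question and image, question only, image only) easier to check without loading any model. `build_prompt` is a hypothetical name that is not in the notebook; the prompt wording is taken from the cell above.

```python
def build_prompt(question, caption):
    """Combine an optional question and an optional caption into one prompt.

    Hypothetical helper for illustration; either argument may be None or empty.
    """
    question = (question or "").strip()
    caption = (caption or "").strip()
    if question and caption:
        # Question about an image: append the caption as context
        return f"{question}\n\nImage content: {caption}"
    if question:
        # Text-only question: nothing to append
        return question
    if caption:
        # Image only: ask the language model to describe it
        return f"Describe this image: {caption}"
    return ""


print(build_prompt("What colour is the car?", "a red car parked on a street"))
```

Keeping this step separate from the model call means the whitespace-only and missing-image cases can be tested quickly, before the multi-gigabyte FLAN-T5-XL download.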


