# Smart Product Cataloger - Solution

[View on Google Colab](https://colab.research.google.com/drive/1xNdJcdzUSNV3Uae0KRqaevSzIm5PXA3k?usp=sharing)

Week 8: Multimodal AI for E-commerce Product Analysis

This complete solution shows how to build an AI system that automatically
analyzes product images and generates metadata for e-commerce listings
using the CLIP and BLIP models.

### Import the Necessary Libraries

In [1]:
import torch
from transformers import (
    CLIPProcessor, CLIPModel,
    BlipProcessor, BlipForConditionalGeneration, BlipForQuestionAnswering,
    pipeline
)
from PIL import Image
import requests
import numpy as np
from typing import Dict, List, Union

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings("ignore")

### Define Global Model Variables

In [2]:
# Global variables to store models (we'll load them once)
clip_model = None
clip_processor = None
blip_caption_model = None
blip_caption_processor = None
blip_vqa_model = None
blip_vqa_processor = None

---

### Load the Models from Hugging Face

In [3]:
def load_models():
    """
    Load all required models for product analysis
    
    SOLUTION: We load three different models:
    1. CLIP for zero-shot image classification
    2. BLIP for image captioning
    3. BLIP for visual question answering
    
    We use global variables to store the models so they're loaded once
    and can be reused across all function calls.
    """
    global clip_model, clip_processor, blip_caption_model, blip_caption_processor, blip_vqa_model, blip_vqa_processor
    
    print("🚀 Loading models for Smart Product Cataloger...")
    
    # SOLUTION: Load CLIP model and processor for classification
    print("📦 Loading CLIP model...")
    clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    
    # SOLUTION: Load BLIP caption model and processor for image captioning
    print("📦 Loading BLIP caption model...")
    blip_caption_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
    blip_caption_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    
    # SOLUTION: Load BLIP VQA model and processor for question answering
    print("📦 Loading BLIP VQA model...")
    blip_vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
    blip_vqa_processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
    
    print("✅ All models loaded successfully!")

# TEST: Load models
print("🔧 TESTING: Loading models...")
load_models()
print()

🔧 TESTING: Loading models...
🚀 Loading models for Smart Product Cataloger...
📦 Loading CLIP model...


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


📦 Loading BLIP caption model...
📦 Loading BLIP VQA model...
✅ All models loaded successfully!
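
`load_models()` relies on module-level globals so that each set of weights is loaded only once. One possible alternative, sketched here, is to wrap a loader in `functools.lru_cache` so the first call does the loading and later calls reuse the cached object. The `get_model` name and the stand-in loader (which returns a plain dict instead of calling `from_pretrained`) are assumptions made so the example runs without downloading any weights.

```python
from functools import lru_cache

# Counts how many times the stand-in loader actually runs.
load_calls = {"count": 0}


@lru_cache(maxsize=None)
def get_model(name: str) -> dict:
    """Hypothetical cached loader; stands in for a from_pretrained(name) call."""
    load_calls["count"] += 1
    # In the notebook this line would be e.g. CLIPModel.from_pretrained(name).
    return {"name": name}


first = get_model("openai/clip-vit-base-patch32")
second = get_model("openai/clip-vit-base-patch32")

# The second call returns the cached object without running the loader again.
print(first is second, load_calls["count"])  # True 1
```

Compared with globals, this keeps the "load once" behavior without a `global` statement, at the cost of hiding the load behind the first call.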



---

### Load Image from URL
  
The function below loads an image from a URL. You can extend it to load an image from a local path and apply any transforms your models require.

In [4]:
def load_image_from_url(url: str) -> Union[Image.Image, None]:
    """
    Load an image from a URL
    
    SOLUTION: We use the requests library to fetch the image data,
    then PIL to open it and convert to RGB format. We handle errors
    gracefully by returning None if something goes wrong.
    """
    
    # SOLUTION: Implement image loading with error handling
    try:
        # 1. Use requests.get() to fetch the image with streaming
        #    (the timeout keeps a stalled download from hanging indefinitely)
        response = requests.get(url, stream=True, timeout=30)
        response.raise_for_status()  # Raise exception for bad status codes
        
        # 2. Use Image.open() to create PIL Image from the response
        image = Image.open(response.raw)
        
        # 3. Convert to RGB format (ensures compatibility)
        image = image.convert('RGB')
        
        return image
        
    except Exception as e:
        # 4. Handle errors gracefully: report the problem and return None
        print(f"❌ Error loading image: {e}")
        return None

# TEST: Load image
print("📸 TESTING: Loading image from URL...")
sample_url = "https://images.unsplash.com/photo-1542291026-7eec264c27ff"  # Nike shoes

image = load_image_from_url(sample_url)
print(f"Image loaded successfully: {image is not None}")

if image:
    print(f"Image size: {image.size}")

📸 TESTING: Loading image from URL...
Image loaded successfully: True
Image size: (5472, 3648)
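
As suggested at the start of this section, the loader can be extended to read from the local file system. The following is a minimal sketch of what that might look like, not part of the original solution: the function name `load_image_from_path` and the optional `max_side` downscaling parameter are assumptions introduced for illustration. It mirrors `load_image_from_url` by converting to RGB and returning `None` on failure.

```python
from pathlib import Path
from typing import Optional, Union

from PIL import Image


def load_image_from_path(path: Union[str, Path],
                         max_side: Optional[int] = None) -> Optional[Image.Image]:
    """Hypothetical local-file counterpart to load_image_from_url."""
    try:
        with Image.open(path) as img:
            # convert() reads the pixel data and returns a new image,
            # so the result stays usable after the file is closed.
            image = img.convert("RGB")
        if max_side is not None:
            # thumbnail() shrinks in place and preserves the aspect ratio;
            # it never enlarges an image.
            image.thumbnail((max_side, max_side))
        return image
    except (OSError, ValueError) as e:
        # Missing files and unreadable images both surface as OSError.
        print(f"Error loading image: {e}")
        return None
```

Downscaling is optional because the CLIP and BLIP processors resize their inputs anyway; it mainly saves memory for very large photos such as the 5472 x 3648 sample.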


---

### Product Classification using CLIP

In [5]:
def classify_product_image(image: Image.Image, candidate_labels: List[str]) -> List[Dict]:
    """
    Classify image using CLIP zero-shot classification
    
    SOLUTION: We use the transformers pipeline for zero-shot classification.
    This is the easiest way to use CLIP - it handles all the preprocessing
    and postprocessing for us. The pipeline returns results sorted by confidence.
    """
    print("🔍 Classifying product category...")
    
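    # SOLUTION: Use CLIP pipeline for zero-shot classification
    # 1. Create a zero-shot-image-classification pipeline
    #    (note: this builds a new pipeline on every call and does not
    #    reuse the clip_model / clip_processor globals loaded earlier)
    clip_pipeline = pipeline(
        task="zero-shot-image-classification",
        model="openai/clip-vit-base-patch32"
    )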
    
    # 2. Use the pipeline to classify the image against candidate labels
    results = clip_pipeline(image, candidate_labels=candidate_labels)
    
    return results

# TEST: Classify image
print("🔍 TESTING: Classifying product image...")
categories = ["clothing", "shoes", "electronics", "furniture", "books", "toys"]
classification_results = classify_product_image(image, categories)
print("Classification Results:")
for result in classification_results:
    print(f"  {result['label']}: {result['score']:.4f}")

🔍 TESTING: Classifying product image...
🔍 Classifying product category...


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Device set to use mps:0


Classification Results:
  shoes: 0.9676
  clothing: 0.0251
  electronics: 0.0039
  toys: 0.0013
  furniture: 0.0011
  books: 0.0009
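
The scores are normalized across the candidate labels, so they sum to roughly 1 and some label always comes out on top, even for an image that fits none of the categories. A minimal sketch of one way to guard against that is shown here. The helper name `pick_category`, the `"unknown"` fallback label, and the 0.5 threshold are all assumptions, not part of the original solution; the threshold in particular would need tuning on real data.

```python
from typing import Dict, List


def pick_category(results: List[Dict], min_score: float = 0.5) -> str:
    """Return the top label, or "unknown" when the best score is below min_score."""
    if not results:
        return "unknown"
    # Pick the highest score explicitly rather than relying on input order.
    best = max(results, key=lambda r: r["score"])
    return best["label"] if best["score"] >= min_score else "unknown"


# Scores copied from the classification output above.
shoe_results = [
    {"label": "shoes", "score": 0.9676},
    {"label": "clothing", "score": 0.0251},
]
print(pick_category(shoe_results))  # shoes

# A hypothetical ambiguous result where no label is clearly ahead.
ambiguous = [
    {"label": "toys", "score": 0.34},
    {"label": "books", "score": 0.33},
]
print(pick_category(ambiguous))  # unknown
```

A label such as `"unknown"` is not a key in the category-to-questions mapping used by `get_category_questions`, so that function would return its default questions for it.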


---

### Generate Product Caption

In [6]:
def generate_product_caption(image: Image.Image) -> str:
    """
    Generate a descriptive caption for the image using BLIP
    
    SOLUTION: We use the BLIP captioning model to generate descriptions.
    The process involves: preprocessing the image, generating tokens with
    beam search, and decoding the tokens back to text.
    """
    print("📝 Generating image caption...")
    
    # SOLUTION: Use BLIP for image captioning
    # 1. Process the image using blip_caption_processor
    inputs = blip_caption_processor(image, return_tensors="pt")
    
    # 2. Generate caption using blip_caption_model with beam search
    with torch.no_grad():  # Disable gradients for inference
        out = blip_caption_model.generate(
            **inputs,
            max_length=50,      # Maximum caption length
            num_beams=5,        # Beam search for better quality
            early_stopping=True # Stop when end token is generated
        )
    
    # 3. Decode the generated tokens back to text
    caption = blip_caption_processor.decode(out[0], skip_special_tokens=True)
    
    return caption

# TEST: Generate caption
print("📝 TESTING: Generating product caption...")
caption = generate_product_caption(image)
print(f"Generated Caption: '{caption}'")

📝 TESTING: Generating product caption...
📝 Generating image caption...
Generated Caption: 'a red and white shoe on a red background'


---

### Product Question Answering

In [7]:
def ask_about_product(image: Image.Image, question: str) -> str:
    """
    Answer questions about the image using BLIP VQA
    
    SOLUTION: BLIP VQA takes both an image and a question as input,
    then generates an answer. We process both inputs together,
    generate answer tokens, and decode them to text.
    """
    print(f"❓ Answering: '{question}'")
    
    # SOLUTION: Use BLIP VQA for visual question answering
    # 1. Process image and question together using blip_vqa_processor
    inputs = blip_vqa_processor(image, question, return_tensors="pt")
    
    # 2. Generate answer using blip_vqa_model
    with torch.no_grad():  # Disable gradients for inference
        out = blip_vqa_model.generate(
            **inputs,
            max_length=20,      # Answers are typically short
            num_beams=5,        # Beam search for better quality
            early_stopping=True # Stop when end token is generated
        )
    
    # 3. Decode the generated tokens to get the answer
    answer = blip_vqa_processor.decode(out[0], skip_special_tokens=True)
    
    print(f"💡 Answer: {answer}")
    return answer

# TEST: Visual Question Answering
print("❓ TESTING: Visual Question Answering...")
test_questions = [
    "What color are these shoes?",
    "What brand are these shoes?",
    "Are these sneakers or dress shoes?"
]

print("VQA Results:")
for question in test_questions:
    answer = ask_about_product(image, question)
    print(f"  Q: {question}")
    print(f"  A: {answer}")
    print()

❓ TESTING: Visual Question Answering...
VQA Results:
❓ Answering: 'What color are these shoes?'
💡 Answer: red
  Q: What color are these shoes?
  A: red

❓ Answering: 'What brand are these shoes?'
💡 Answer: nike
  Q: What brand are these shoes?
  A: nike

❓ Answering: 'Are these sneakers or dress shoes?'
💡 Answer: sneakers
  Q: Are these sneakers or dress shoes?
  A: sneakers



---

### Get Category-Specific Questions

In [8]:
def get_category_questions(category: str) -> List[str]:
    """
    Generate relevant questions based on product category
    
    SOLUTION: We create a mapping of product categories to relevant questions.
    Each category has specific questions that help extract useful e-commerce
    metadata. We provide default questions for unknown categories.
    """
    
    # SOLUTION: Create comprehensive category-to-questions mapping
    question_map = {
        "shoes": [
            "What color are these shoes?",
            "What type of shoes are these?",
            "What brand are these shoes?",
            "What material are these shoes made of?",
            "Are these sneakers?"
        ],
        "clothing": [
            "What color is this clothing?",
            "What type of clothing is this?",
            "What material is this clothing made of?",
            "What size is this clothing?",
            "Is this formal or casual wear?"
        ],
        "electronics": [
            "What type of device is this?",
            "What brand is this device?",
            "What color is this device?",
            "Is this a smartphone or tablet?",
            "Does this have a screen?"
        ],
        "furniture": [
            "What type of furniture is this?",
            "What color is this furniture?",
            "What material is this furniture made of?",
            "Is this indoor or outdoor furniture?",
            "How many people can use this?"
        ],
        "books": [
            "What type of book is this?",
            "What color is the book cover?",
            "Is this a hardcover or paperback?",
            "Does this book have text on the cover?",
            "Is this a fiction or non-fiction book?"
        ],
        "toys": [
            "What type of toy is this?",
            "What color is this toy?",
            "Is this toy for children or adults?",
            "What material is this toy made of?",
            "Is this an educational toy?"
        ]
    }
    
    # SOLUTION: Return questions for the category, or default questions
    return question_map.get(category, [
        "What color is this?",
        "What type of item is this?",
        "What is this made of?"
    ])

# TEST: Category questions
print("📋 TESTING: Category-specific questions...")
test_categories = ["shoes", "clothing", "electronics", "furniture"]
for category in test_categories:
    questions = get_category_questions(category)
    print(f"{category.title()} Questions:")
    for q in questions:
        print(f"  - {q}")
    print()

📋 TESTING: Category-specific questions...
Shoes Questions:
  - What color are these shoes?
  - What type of shoes are these?
  - What brand are these shoes?
  - What material are these shoes made of?
  - Are these sneakers?

Clothing Questions:
  - What color is this clothing?
  - What type of clothing is this?
  - What material is this clothing made of?
  - What size is this clothing?
  - Is this formal or casual wear?

Electronics Questions:
  - What type of device is this?
  - What brand is this device?
  - What color is this device?
  - Is this a smartphone or tablet?
  - Does this have a screen?

Furniture Questions:
  - What type of furniture is this?
  - What color is this furniture?
  - What material is this furniture made of?
  - Is this indoor or outdoor furniture?
  - How many people can use this?
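
The test above only exercises known categories. The lookup in `get_category_questions` is an exact, case-sensitive dictionary match, so a label such as `"Shoes"` would fall through to the default questions. The sketch below shows one possible way to normalize the label first. The names `lookup_questions`, `QUESTION_MAP`, and `DEFAULT_QUESTIONS` are assumptions, and the map is a trimmed stand-in for the full mapping.

```python
from typing import Dict, List

DEFAULT_QUESTIONS: List[str] = [
    "What color is this?",
    "What type of item is this?",
    "What is this made of?",
]

# Trimmed stand-in for the full category-to-questions mapping.
QUESTION_MAP: Dict[str, List[str]] = {
    "shoes": ["What color are these shoes?", "What brand are these shoes?"],
}


def lookup_questions(category: str) -> List[str]:
    """Look up questions after trimming whitespace and lowercasing the label."""
    key = category.strip().lower()
    # Return a copy so callers cannot mutate the shared lists.
    return list(QUESTION_MAP.get(key, DEFAULT_QUESTIONS))


print(lookup_questions(" Shoes "))  # shoe-specific questions
print(lookup_questions("jewelry"))  # default questions
```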



---

### Product Analyzer

In [9]:
def analyze_product(image_url_or_pil: Union[str, Image.Image]) -> Dict:
    """
    Main function to analyze a product image and generate complete metadata
    
    SOLUTION: This is the main pipeline that combines all our functions:
    1. Load image (if URL provided)
    2. Classify category using CLIP
    3. Generate description using BLIP captioning
    4. Ask category-specific questions using BLIP VQA
    5. Compile everything into a structured result
    """
    print("🚀 Starting product analysis...")
    print("=" * 50)
    
    try:
        # SOLUTION: Step 1 - Load image if URL provided
        if isinstance(image_url_or_pil, str):
            image = load_image_from_url(image_url_or_pil)
            if image is None:
                return {"error": "Failed to load image", "status": "failed"}
        else:
            image = image_url_or_pil
        
        # SOLUTION: Step 2 - Classify product category using CLIP
        product_categories = ["clothing", "shoes", "electronics", "furniture", "books", "toys"]
        classification_results = classify_product_image(image, product_categories)
        top_category = classification_results[0]  # Highest confidence category
        
        # SOLUTION: Step 3 - Generate product description using BLIP
        description = generate_product_caption(image)
        
        # SOLUTION: Step 4 - Get category-specific questions and ask them
        category = top_category['label']
        questions = get_category_questions(category)
        
        # Ask each question and collect answers
        qa_results = {}
        for question in questions:
            answer = ask_about_product(image, question)
            qa_results[question] = answer
        
        # SOLUTION: Step 5 - Compile results into structured format
        result = {
            "category": {
                "name": category,
                "confidence": top_category['score']
            },
            "description": description,
            "attributes": qa_results,
            "status": "success"
        }
        
        print("\n✅ Product analysis complete!")
        return result
        
    except Exception as e:
        print(f"❌ Error during processing: {e}")
        return {"error": str(e), "status": "failed"}

# TEST: Complete product analysis
print("🚀 TESTING: Complete product analysis pipeline...")
analysis_result = analyze_product(sample_url)
print("Complete Analysis Result:")
print(analysis_result)


🚀 TESTING: Complete product analysis pipeline...
🚀 Starting product analysis...
🔍 Classifying product category...


Device set to use mps:0


📝 Generating image caption...
❓ Answering: 'What color are these shoes?'
💡 Answer: red
❓ Answering: 'What type of shoes are these?'
💡 Answer: nike
❓ Answering: 'What brand are these shoes?'
💡 Answer: nike
❓ Answering: 'What material are these shoes made of?'
💡 Answer: fabric
❓ Answering: 'Are these sneakers?'
💡 Answer: yes

✅ Product analysis complete!
Complete Analysis Result:
{'category': {'name': 'shoes', 'confidence': 0.9675810933113098}, 'description': 'a red and white shoe on a red background', 'attributes': {'What color are these shoes?': 'red', 'What type of shoes are these?': 'nike', 'What brand are these shoes?': 'nike', 'What material are these shoes made of?': 'fabric', 'Are these sneakers?': 'yes'}, 'status': 'success'}
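
The result is a plain dictionary of strings and floats, so it can be serialized for storage or for an e-commerce listing API. The following is a minimal sketch using the standard `json` module; the helper name `result_to_json` is an assumption, and the dictionary is an abbreviated copy of the output above.

```python
import json
from typing import Dict


def result_to_json(result: Dict) -> str:
    """Serialize an analysis result; ensure_ascii=False keeps non-ASCII text readable."""
    return json.dumps(result, indent=2, ensure_ascii=False)


# Abbreviated copy of the analysis result printed above.
analysis_result = {
    "category": {"name": "shoes", "confidence": 0.9675810933113098},
    "description": "a red and white shoe on a red background",
    "attributes": {
        "What color are these shoes?": "red",
        "What brand are these shoes?": "nike",
    },
    "status": "success",
}

payload = result_to_json(analysis_result)
restored = json.loads(payload)

# The round trip preserves the nested structure and values.
print(restored["category"]["name"], restored["attributes"]["What brand are these shoes?"])  # shoes nike
```

The same helper also works for the failure case, since `{"error": ..., "status": "failed"}` contains only strings.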


---