# Multimodal AI Demo - CMPE 258 Assignment

This notebook demonstrates various multimodal AI capabilities using **free, open-source models**:
1. **Text-to-Image Generation** using Stable Diffusion
2. **Image Analysis** using BLIP (Salesforce)
3. **Text Conversations** using Mistral (via Hugging Face)

**No API keys required!** All models run inside the Colab runtime; their weights are downloaded from the Hugging Face Hub on first use.

---

## Setup and Installation

First, let's install all required packages:

In [None]:
# Install required packages (bitsandbytes is needed for the 8-bit Mistral load in Part 3)
!pip install -q diffusers transformers accelerate torch pillow sentencepiece protobuf bitsandbytes

## Part 1: Text-to-Image Generation with Stable Diffusion

We'll use Stable Diffusion v1.5 to generate images from text prompts.

In [None]:
import torch
from diffusers import StableDiffusionPipeline
from PIL import Image
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Check if GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")

# Load Stable Diffusion model
print("\nLoading Stable Diffusion v1.5...")
model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32
)
pipe = pipe.to(device)

print("✅ Stable Diffusion model loaded successfully!")

In [None]:
# Generate images with different prompts
prompts = [
    "A futuristic city with flying cars at sunset, photorealistic, 8k",
    "A magical forest with glowing mushrooms and fireflies, fantasy art",
    "A robot playing chess with a human in a cozy library, oil painting style"
]

generated_images = []

print("Starting image generation...\n")
print("="*80)

for i, prompt in enumerate(prompts):
    print(f"\n[Image {i+1}/{len(prompts)}]")
    print(f"Prompt: {prompt}")
    print("Generating...")
    
    # Generate image
    image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
    generated_images.append(image)
    
    # Display the image
    plt.figure(figsize=(10, 10))
    plt.imshow(image)
    plt.axis('off')
    plt.title(prompt, fontsize=12, wrap=True, pad=20)
    plt.tight_layout()
    plt.show()
    
    # Save the image
    filename = f"generated_image_{i+1}.png"
    image.save(filename)
    print(f"✅ Saved as {filename}")
    print("-"*80)

print("\n✅ All images generated successfully!")
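
The `guidance_scale=7.5` argument in the cell above controls classifier-free guidance. As a rough illustration of the standard formula (a sketch, not the diffusers internals), each denoising step blends an unconditional noise prediction with a prompt-conditioned one. The function name `apply_guidance` and the toy vectors are hypothetical and exist only for this example.

```python
def apply_guidance(noise_uncond, noise_cond, guidance_scale):
    """Blend two noise predictions with classifier-free guidance.

    guided = uncond + scale * (cond - uncond)
    """
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


# Toy 3-element "noise predictions" (made-up numbers for illustration only)
uncond = [0.0, 1.0, -1.0]
cond = [1.0, 1.0, 0.0]

no_push = apply_guidance(uncond, cond, 1.0)  # scale 1.0 reproduces the conditional prediction
strong = apply_guidance(uncond, cond, 7.5)   # scale 7.5 extrapolates past it
print(no_push)
print(strong)
```

Larger scales push the result further in the direction the prompt suggests, which tends to improve prompt adherence at the cost of variety.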

## Part 2: Image Analysis with BLIP

We'll use Salesforce's BLIP (Bootstrapping Language-Image Pre-training) model to analyze the generated images.

**BLIP is completely free and runs locally - no API key needed!**

In [None]:
from transformers import BlipProcessor, BlipForConditionalGeneration, BlipForQuestionAnswering

# Load BLIP models for image captioning and VQA
print("Loading BLIP models...\n")

# Model 1: Image Captioning
print("Loading BLIP Image Captioning model...")
caption_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
caption_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large").to(device)
print("✅ Captioning model loaded")

# Model 2: Visual Question Answering
print("Loading BLIP VQA model...")
vqa_processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").to(device)
print("✅ VQA model loaded")

print("\n✅ All BLIP models ready!")

In [None]:
# Analyze each generated image
print("Starting image analysis...\n")
print("="*80)

for i, (image, prompt) in enumerate(zip(generated_images, prompts)):
    print(f"\n[Analyzing Image {i+1}]")
    print(f"Original Prompt: {prompt}")
    print("-"*80)
    
    # Display the image
    plt.figure(figsize=(8, 8))
    plt.imshow(image)
    plt.axis('off')
    plt.title(f"Image {i+1}", fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    # 1. Generate unconditional caption
    print("\n1️⃣  General Description:")
    inputs = caption_processor(image, return_tensors="pt").to(device)
    out = caption_model.generate(**inputs, max_length=50)
    caption = caption_processor.decode(out[0], skip_special_tokens=True)
    print(f"   {caption}")
    
    # 2. Generate conditional caption (more detailed)
    print("\n2️⃣  Detailed Analysis:")
    text_prompt = "a detailed description of"
    inputs = caption_processor(image, text_prompt, return_tensors="pt").to(device)
    out = caption_model.generate(**inputs, max_length=100)
    detailed_caption = caption_processor.decode(out[0], skip_special_tokens=True)
    print(f"   {detailed_caption}")
    
    # 3. Visual Question Answering
    print("\n3️⃣  Visual Q&A:")
    questions = [
        "What is the main subject of this image?",
        "What is the mood or atmosphere?",
        "What colors are dominant?"
    ]
    
    for question in questions:
        inputs = vqa_processor(image, question, return_tensors="pt").to(device)
        out = vqa_model.generate(**inputs, max_length=20)
        answer = vqa_processor.decode(out[0], skip_special_tokens=True)
        print(f"   Q: {question}")
        print(f"   A: {answer}")
    
    print("\n" + "="*80)

print("\n✅ Image analysis complete!")
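
The analysis cell prints each original prompt next to the BLIP caption, which invites a quick sanity check of how much the two agree. One possible sketch, assuming a simple word-overlap (Jaccard) score is good enough for a rough comparison; the helper names and the example caption are hypothetical, and real captions would come from `caption_model.generate` above.

```python
import re


def word_set(text):
    """Lowercase a string and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard_overlap(prompt, caption):
    """Jaccard similarity between the word sets of two strings (0.0 to 1.0)."""
    a, b = word_set(prompt), word_set(caption)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical caption used only to illustrate the calculation
example_prompt = "A robot playing chess with a human"
example_caption = "a robot playing chess in a library"
score = jaccard_overlap(example_prompt, example_caption)
print(f"Word overlap: {score:.2f}")
```

Word overlap ignores synonyms and word order, so it is only a coarse signal of whether the caption reflects the prompt.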

## Part 3: Text Conversations with Mistral

We'll use Mistral-7B-Instruct, a powerful open-source conversational AI model.

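**Note:** Mistral-7B needs about 14 GB for its weights in float16 (roughly half that in 8-bit). If loading fails, for example because Colab runs out of GPU memory, the code falls back to the much smaller TinyLlama-1.1B.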
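
The size estimate above follows from simple arithmetic: parameter count times bytes per parameter. A minimal sketch, assuming nominal rounded parameter counts and counting weights only (activations and the generation cache add more); `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

# Nominal parameter counts (rounded); runtime overhead is not included
models = {"Mistral-7B": 7e9, "TinyLlama-1.1B": 1.1e9}

for name, params in models.items():
    for dtype, nbytes in BYTES_PER_PARAM.items():
        print(f"{name:15s} {dtype:8s} ~{weight_memory_gb(params, nbytes):.1f} GB")
```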

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

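# Release the Stable Diffusion and BLIP models before loading the LLM;
# empty_cache() alone cannot free memory that live objects still reference
import gc
for name in ("pipe", "caption_model", "vqa_model"):
    globals().pop(name, None)
gc.collect()
if device == "cuda":
    torch.cuda.empty_cache()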
    
print("Loading conversational AI model...\n")

try:
    # Try loading Mistral-7B (better quality)
    model_name = "mistralai/Mistral-7B-Instruct-v0.2"
    print(f"Attempting to load {model_name}...")

    # 8-bit quantization (needs the bitsandbytes package and a CUDA GPU)
    from transformers import BitsAndBytesConfig
    quant_config = BitsAndBytesConfig(load_in_8bit=True) if device == "cuda" else None

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16 if device == "cuda" else torch.float32,
        device_map="auto",
        quantization_config=quant_config
    )
    print(f"✅ {model_name} loaded successfully!")
    
except Exception as e:
    # Fallback to smaller model
    print(f"Could not load Mistral: {e}")
    print("\nFalling back to smaller model...")
    model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16 if device == "cuda" else torch.float32,
        device_map="auto"
    )
    print(f"✅ {model_name} loaded successfully!")

print(f"\nModel: {model_name}")
print("Ready for conversation!")
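
The conversation cell that follows relies on `tokenizer.apply_chat_template` to turn a list of messages into a prompt string. To make that step less opaque, here is a plain-Python approximation of the `[INST]` layout used by Mistral-7B-Instruct. It is a sketch only: `build_mistral_style_prompt` is a hypothetical helper, exact whitespace may differ between model versions, and the tokenizer's own template remains the authoritative source.

```python
def build_mistral_style_prompt(messages, bos="<s>", eos="</s>"):
    """Approximate the [INST] layout used by Mistral-7B-Instruct.

    `messages` is a list of {"role", "content"} dicts alternating user/assistant.
    """
    parts = [bos]
    for message in messages:
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']} [/INST]")
        elif message["role"] == "assistant":
            parts.append(f"{message['content']}{eos}")
        else:
            raise ValueError(f"Unsupported role: {message['role']}")
    return "".join(parts)


history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "What is BLIP?"},
]
prompt_text = build_mistral_style_prompt(history)
print(prompt_text)
```

Because earlier turns are part of the prompt string, the model can refer back to them; the prompt also grows with every turn.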

In [None]:
# Multi-turn conversation
conversation_history = []

questions = [
    "Explain what multimodal AI is in simple terms.",
    "What are some real-world applications of multimodal AI?",
    "How does text-to-image generation like Stable Diffusion work?",
    "What are the ethical concerns with AI-generated images?",
]

print("Starting conversation...\n")
print("="*80)

for i, question in enumerate(questions):
    print(f"\n[Turn {i+1}]")
    print(f"User: {question}")
    
    # Format the full conversation (history + new question) with the model's
    # own chat template, so both Mistral and TinyLlama see earlier turns
    messages = conversation_history + [{"role": "user", "content": question}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    
    # Generate response; the template already inserts special tokens such as BOS,
    # so the tokenizer must not add them a second time
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    
    # Decode only the newly generated tokens, skipping the echoed prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    
    print(f"\nAI: {response}")
    print("\n" + "-"*80)
    
    # Update conversation history
    conversation_history.append({"role": "user", "content": question})
    conversation_history.append({"role": "assistant", "content": response})

print("\n✅ Conversation complete!")
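
The `temperature=0.7` and `top_p=0.9` settings in the generation call shape how each next token is drawn. The following is a simplified pure-Python sketch of temperature scaling plus nucleus (top-p) filtering; the transformers implementation operates on tensors and differs in details, and `sample_top_p` and the toy logits are hypothetical.

```python
import math
import random


def sample_top_p(logits, temperature=0.7, top_p=0.9, rng=random):
    """Return (sampled index, kept indices) using temperature and top-p filtering."""
    # Temperature: divide logits before softmax; values below 1 sharpen the distribution
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the smallest set of most-likely tokens whose mass reaches top_p
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for idx in ranked:
        kept.append(idx)
        mass += probs[idx]
        if mass >= top_p:
            break

    # random.choices normalizes the weights, so this samples within the nucleus
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0], kept


toy_logits = [4.0, 3.0, 1.0, -2.0]  # made-up scores for a 4-token vocabulary
token, nucleus = sample_top_p(toy_logits, rng=random.Random(0))
print(f"sampled token {token} from nucleus {nucleus}")
```

With these toy scores only the two most likely tokens survive the filter, so unlikely tokens can never be sampled.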

## Summary and Results

This notebook demonstrated three key multimodal AI capabilities:

### ✅ Completed Tasks:

1. **Text-to-Image Generation**
   - Model: Stable Diffusion v1.5
   - Generated 3 images from text prompts (30 inference steps, guidance scale 7.5)
   - Demonstrated creative AI capabilities

2. **Image Analysis**
   - Model: BLIP (Salesforce)
   - Analyzed generated images with captions and Q&A
   - Showed understanding of visual content

3. **Text Conversations**
   - Model: Mistral-7B / TinyLlama
   - Multi-turn conversation about AI topics
   - Demonstrated context retention

### 🎯 Key Achievements:
- **No API keys required** - all models run locally
- **Free and open-source** - completely free to use
- **Widely used open models** - Stable Diffusion, BLIP, Mistral
- **Full multimodal pipeline** - text, images, and conversations

### 📊 Technical Details:
- **Stable Diffusion**: 860M-parameter UNet (plus text encoder and VAE), text-to-image generation
- **BLIP**: 385M parameters, image understanding and captioning
- **Mistral/TinyLlama**: 7B/1.1B parameters, conversational AI
- **Hardware**: GPU-accelerated (T4/P100 on Colab)

---

### 📁 Generated Files:
- `generated_image_1.png` - Futuristic city scene
- `generated_image_2.png` - Magical forest scene
- `generated_image_3.png` - Robot chess scene

In [None]:
# Display all generated images in a grid
fig, axes = plt.subplots(1, 3, figsize=(20, 6))
fig.suptitle('All Generated Images - Multimodal AI Demo', fontsize=16, fontweight='bold')

for i, (ax, image, prompt) in enumerate(zip(axes, generated_images, prompts)):
    ax.imshow(image)
    ax.axis('off')
    ax.set_title(f"Image {i+1}\n{prompt[:50]}...", fontsize=9, wrap=True)

plt.tight_layout()
plt.show()

print("\n" + "="*80)
print("🎉 MULTIMODAL AI DEMO COMPLETE! 🎉")
print("="*80)
print("\n✅ All tasks completed successfully:")
print("   1. Text-to-Image Generation ✓")
print("   2. Image Analysis ✓")
print("   3. Text Conversations ✓")
print("\n📊 Total models used: 4 (Stable Diffusion + 2 BLIP + Mistral/TinyLlama)")
print("💰 Total cost: $0 (all free and open-source!)")
print("\n" + "="*80)