thalismind/caption-notebooks

Image Captioning Notebooks

This repository contains Jupyter notebooks for captioning image datasets using various state-of-the-art vision-language models. Each notebook processes an image dataset and generates descriptive captions with a different model.

Available Captioning Models

The following captioning models are available in this repository:

  1. Florence 2 - Microsoft's vision-language model with predefined tasks
  2. Joy - Hinted model based on Llama with natural language prompt support
  3. Qwen 2.5 VL - Alibaba's large vision-language model (32B Instruct)
  4. ToriiGate - Hinted model based on Qwen VL with tag/character support
  5. WD14 (Waifu Diffusion 1.4) - ONNX-based tagging model, recommended for SDXL-based models
  6. CSV Processing - Utility for converting CSV caption data to text files

Current Best Models

  • For SDXL-based models, including Pony and Illustrious, use the WD14 notebook.
  • Otherwise, use the ToriiGate notebook.

ToriiGate tends to hallucinate less than Joy, but Joy can be more detailed at times. ToriiGate is much better at following the hints provided in the metadata file.

Notebooks Overview

1. captions-florence.ipynb

Model: Florence 2 (Microsoft) / MiaoshouAI Florence-2-large-PromptGen-v2.0

Features:

  • Uses predefined task prompts (<CAPTION>, <DETAILED_CAPTION>, <MORE_DETAILED_CAPTION>)
  • Additional tasks available in MiaoshouAI models: <ANALYZE>, <GENERATE_PROMPT>, <GENERATE_TAGS>, <MIXED_CAPTION>, <MIXED_CAPTION_PLUS>
  • Automatic image deduplication using perceptual hashing
  • Image size filtering (minimum 512x512)
  • Caption cleaning and keyword replacement
  • Template-based caption generation with metadata support

Use Case: Best for structured captioning with predefined formats and when you need consistent output styles.
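The caption cleaning, keyword replacement, and template steps listed above can be sketched in plain Python. These helpers are illustrative (the function names and template placeholders are assumptions, not the notebook's actual API):

```python
import re

def clean_caption(caption, replacements):
    """Swap keywords (case-insensitive, whole words) and normalize
    whitespace -- a sketch of the cleaning/replacement step."""
    for old, new in replacements.items():
        caption = re.sub(rf"\b{re.escape(old)}\b", new, caption, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", caption).strip()

def apply_template(template, caption, metadata):
    """Fill a caption template with the model output plus dataset metadata."""
    return template.format(caption=caption, **metadata)

raw = "The  image shows a Dog  sitting on grass"
cleaned = clean_caption(raw, {"dog": "puppy"})
final = apply_template("{style}, {caption}", cleaned, {"style": "photo"})
```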

2. captions-joy.ipynb

Model: Joy (based on Meta-Llama-3.1-8B-Instruct-abliterated)

Features:

  • Natural language hint support via meta.yaml files
  • CLIP vision encoder with custom image adapter
  • Uncensored output for NSFW content
  • Template-based caption generation
  • Image deduplication and size filtering
  • Flexible prompt system with metadata integration

Use Case: Ideal for creative or NSFW content where you need natural language hints and uncensored output.
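Conceptually, the hint system appends the natural-language metadata to the captioning prompt. A minimal sketch (the hint keys and prompt wording are illustrative; in the notebook the hints would come from parsing the meta.yaml file, e.g. with yaml.safe_load):

```python
def build_prompt(base_prompt, hints):
    """Append natural-language hints from a metadata dict to the prompt.
    `hints` would normally be loaded from a meta.yaml file."""
    if not hints:
        return base_prompt
    hint_text = "; ".join(f"{key}: {value}" for key, value in hints.items())
    return f"{base_prompt}\nHints: {hint_text}"

prompt = build_prompt("Describe this image in detail.",
                      {"setting": "a rainy street at night"})
```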

3. captions-qwen25vl.ipynb

Model: Qwen 2.5 VL 32B Instruct (AWQ quantized)

Features:

  • Large-scale vision-language model (32B parameters)
  • Flash attention 2 for memory efficiency
  • AWQ quantization for faster inference
  • Multi-image and video support
  • High-quality detailed captions

Status: ⚠️ Work in Progress - This notebook is incomplete and may not work correctly.

Use Case: When you need the highest quality captions and have sufficient computational resources.

4. captions-toriigate.ipynb

Model: ToriiGate v0.4-7B (based on Qwen VL)

Features:

  • Support for multiple hint types: tags, characters, character traits, and general info
  • Individual image info via $IMAGE_NAME.image_info.json files
  • Folder-level info via image_info.json files
  • Multiple output formats: JSON, markdown, short/long captions, bounding boxes
  • Chain-of-thought correction capabilities
  • Booru tag sanitization and processing

Use Case: Perfect for character-focused datasets or when you have detailed metadata about characters and their traits.

5. captions-wd14.ipynb

Model: WD14 (Waifu Diffusion 1.4) Tagger v3

Features:

  • ONNX runtime with CUDA acceleration
  • Multiple model variants (SwinV2, ConvNeXt, ViT, EVA02)
  • Automatic tag generation with confidence thresholds
  • MCut thresholding for optimal tag selection
  • Booru-style tag formatting
  • Optimized for SDXL-based models like Pony and Illustrious

Use Case: Best for generating booru-style tags for training data, especially for anime/illustration datasets.
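MCut (Maximum Cut) thresholding picks a per-image cutoff instead of a fixed confidence: sort the tag scores descending, find the widest gap between consecutive scores, and place the threshold in the middle of that gap. A sketch (the example tags and scores are made up):

```python
def mcut_threshold(scores):
    """Maximum Cut threshold: midpoint of the largest gap between
    consecutive confidence scores, sorted descending."""
    ranked = sorted(scores, reverse=True)
    gaps = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    widest = gaps.index(max(gaps))
    return (ranked[widest] + ranked[widest + 1]) / 2

def select_tags(tag_scores, threshold):
    """Keep booru tags whose confidence exceeds the threshold."""
    return [tag for tag, score in tag_scores.items() if score > threshold]

scores = {"1girl": 0.98, "solo": 0.95, "outdoors": 0.91, "car": 0.22, "night": 0.05}
thr = mcut_threshold(scores.values())
tags = select_tags(scores, thr)
```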

6. captions-csv.ipynb

Utility: CSV to Text File Converter

Features:

  • Converts CSV files with id and caption columns to individual text files
  • Batch processing of multiple CSV files
  • UTF-8 encoding support
  • Simple and lightweight utility

Use Case: When you have existing caption data in CSV format that needs to be converted to individual text files for training.
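The conversion itself is straightforward: one UTF-8 text file named after the id column per CSV row. A minimal sketch of what the utility does (function name is illustrative):

```python
import csv
from pathlib import Path

def csv_to_text_files(csv_path, output_dir):
    """Write one <id>.txt file per row of a CSV with 'id' and
    'caption' columns, UTF-8 encoded."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            (output_dir / f"{row['id']}.txt").write_text(
                row["caption"], encoding="utf-8")
```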

Common Features

All notebooks include:

  • Image deduplication using perceptual hashing
  • Size filtering to exclude small images
  • Batch processing for efficiency
  • Progress tracking with tqdm
  • Error handling for corrupted images
  • Caption cleaning and post-processing
  • Flexible input formats (PNG, JPG, JPEG, WebP)
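The perceptual-hash deduplication shared by the notebooks works on the idea of an average hash: downscale the image, compare each pixel to the mean, and treat images whose bit patterns are close (small Hamming distance) as duplicates. The notebooks likely use a library such as imagehash; this dependency-free sketch assumes the downscale-to-grayscale step has already produced a small pixel grid:

```python
def average_hash(pixels):
    """Average hash of a small grayscale grid (list of rows): one bit
    per pixel, set when the pixel is brighter than the grid's mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(value > mean for value in flat)

def hamming(hash_a, hash_b):
    """Number of differing bits; small distances mean near-duplicates."""
    return sum(a != b for a, b in zip(hash_a, hash_b))

a = average_hash([[200, 200], [10, 10]])
b = average_hash([[190, 210], [5, 20]])  # slightly noisy copy of the same image
```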

Getting Started

  1. Choose your model based on your specific needs:

    • For structured, consistent captions: Florence 2
    • For creative/NSFW content with hints: Joy
    • For highest quality (if complete): Qwen 2.5 VL
    • For character-focused datasets: ToriiGate
    • For booru-style tags: WD14
  2. Set up your environment:

    • Install required dependencies (each notebook lists its requirements)
    • Set up Hugging Face models directory (HF_HOME)
    • Ensure CUDA is available for GPU acceleration
  3. Prepare your dataset:

    • Organize images in your desired folder structure
    • Add metadata files if using hinted models (Joy, ToriiGate)
    • Configure image paths in the notebook
  4. Run the notebook:

    • Execute cells in order
    • Monitor progress and adjust parameters as needed
    • Review generated captions and clean up if necessary
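Step 2 can be done at the top of any notebook before importing model libraries; the directory path below is illustrative, and the CUDA check is guarded so it also runs where torch is not yet installed:

```python
import os

# Point Hugging Face downloads at a dedicated models directory
# (path is illustrative) before any model library is imported.
os.environ.setdefault("HF_HOME", "/data/hf-models")

# Warn early if no CUDA GPU is available; captioning on CPU is very slow.
try:
    import torch
    if not torch.cuda.is_available():
        print("Warning: no CUDA device found; captioning will be very slow.")
except ImportError:
    print("torch is not installed yet; install each notebook's requirements first.")
```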

Requirements

  • Python 3.8+
  • CUDA-compatible GPU (recommended)
  • Sufficient VRAM (varies by model: 8GB+ for most models, 24GB+ for Qwen 2.5 VL)
  • Hugging Face account for model downloads

Notes

  • Always backup your dataset before running any captioning notebook
  • Some models may require specific hardware configurations
  • The Qwen 2.5 VL notebook is marked as work-in-progress and should be used with caution
  • Consider using different models for different types of content (e.g., WD14 for anime, Florence 2 for general images)
