This repository contains Jupyter notebooks for captioning image datasets with various state-of-the-art vision-language models. Each notebook wraps a different model and generates descriptive captions for an image dataset.
The following captioning models are available in this repository:
- Florence 2 - Microsoft's vision-language model with predefined tasks
- Joy - Hinted model based on Llama with natural language prompt support
- Qwen 2.5 VL - Alibaba's large vision-language model (32B Instruct)
- ToriiGate - Hinted model based on Qwen VL with tag/character support
- WD14 (Waifu Diffusion 1.4) - ONNX-based tagging model for SDXL models
- CSV Processing - Utility for converting CSV caption data to text files
- For SDXL-based models, including Pony and Illustrious, use the WD14 notebook.
- Otherwise, use the ToriiGate notebook.
ToriiGate tends to hallucinate less than Joy, but Joy can be more detailed at times. ToriiGate is much better at following the hints provided in the metadata file.
Model: Florence 2 (Microsoft) / MiaoshouAI Florence-2-large-PromptGen-v2.0
Features:
- Uses predefined task prompts (`<CAPTION>`, `<DETAILED_CAPTION>`, `<MORE_DETAILED_CAPTION>`)
- Additional tasks available in MiaoshouAI models: `<ANALYZE>`, `<GENERATE_PROMPT>`, `<GENERATE_TAGS>`, `<MIXED_CAPTION>`, `<MIXED_CAPTION_PLUS>`
- Automatic image deduplication using perceptual hashing
- Image size filtering (minimum 512x512)
- Caption cleaning and keyword replacement
- Template-based caption generation with metadata support
Use Case: Best for structured captioning with predefined formats and when you need consistent output styles.
Model: Joy (based on Meta-Llama-3.1-8B-Instruct-abliterated)
Features:
- Natural language hint support via `meta.yaml` files
- CLIP vision encoder with custom image adapter
- Uncensored output for NSFW content
- Template-based caption generation
- Image deduplication and size filtering
- Flexible prompt system with metadata integration
Use Case: Ideal for creative or NSFW content where you need natural language hints and uncensored output.
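The template-plus-metadata mechanism used by the hinted notebooks can be sketched with the standard library alone. The `build_prompt` helper, template text, and placeholder names below are illustrative assumptions, not the notebook's actual API:

```python
from string import Template

def build_prompt(metadata, template):
    """Fill a prompt template with per-image metadata hints.

    Uses safe_substitute so placeholders without matching metadata
    are left intact instead of raising. (This fallback behaviour is
    an assumption of the sketch.)
    """
    return Template(template).safe_substitute(metadata)

prompt = build_prompt(
    {"character": "Alice", "style": "watercolor"},
    "Describe this image of $character in a $style style.",
)
print(prompt)  # Describe this image of Alice in a watercolor style.
```

In the real notebooks the metadata would come from the per-image `meta.yaml` file rather than an inline dict.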
Model: Qwen 2.5 VL 32B Instruct (AWQ quantized)
Features:
- Large-scale vision-language model (32B parameters)
- Flash attention 2 for memory efficiency
- AWQ quantization for faster inference
- Multi-image and video support capabilities
- High-quality detailed captions
Status: Work in progress
Use Case: When you need the highest quality captions and have sufficient computational resources.
Model: ToriiGate v0.4-7B (based on Qwen VL)
Features:
- Support for multiple hint types: tags, characters, character traits, and general info
- Individual image info via `$IMAGE_NAME.image_info.json` files
- Folder-level info via `image_info.json` files
- Multiple output formats: JSON, markdown, short/long captions, bounding boxes
- Chain-of-thought correction capabilities
- Booru tag sanitization and processing
Use Case: Perfect for character-focused datasets or when you have detailed metadata about characters and their traits.
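The two metadata locations above imply a simple lookup order: per-image file first, folder-level file as a fallback. This stdlib sketch shows one way to resolve them (the `load_image_info` helper and its empty-dict fallback are assumptions, not this notebook's code):

```python
import json
from pathlib import Path

def load_image_info(image_path):
    """Load hint metadata for an image.

    Prefers a per-image `<stem>.image_info.json` next to the image,
    then falls back to the folder-level `image_info.json`. Returning
    an empty dict when neither exists is an assumption of this sketch.
    """
    image_path = Path(image_path)
    candidates = [
        image_path.with_name(image_path.stem + ".image_info.json"),
        image_path.parent / "image_info.json",
    ]
    for candidate in candidates:
        if candidate.is_file():
            return json.loads(candidate.read_text(encoding="utf-8"))
    return {}
```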
Model: WD14 (Waifu Diffusion 1.4) Tagger v3
Features:
- ONNX runtime with CUDA acceleration
- Multiple model variants (SwinV2, ConvNeXt, ViT, EVA02)
- Automatic tag generation with confidence thresholds
- MCut thresholding for optimal tag selection
- Booru-style tag formatting
- Optimized for SDXL-based models like Pony and Illustrious
Use Case: Best for generating booru-style tags for training data, especially for anime/illustration datasets.
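MCut thresholding and booru-style tag formatting can both be illustrated without the ONNX model. `mcut_threshold` follows the standard maximum-cut heuristic (place the cutoff in the largest gap between sorted confidences); `format_tags` shows a common booru-to-caption convention. The helper names and exact formatting rules are assumptions, not this notebook's code:

```python
import numpy as np

def mcut_threshold(probs):
    """Maximum Cut Thresholding: sort confidences in descending order,
    find the largest gap between consecutive values, and place the
    threshold at the midpoint of that gap."""
    sorted_probs = np.sort(probs)[::-1]
    gaps = sorted_probs[:-1] - sorted_probs[1:]
    i = int(np.argmax(gaps))
    return float((sorted_probs[i] + sorted_probs[i + 1]) / 2)

def format_tags(tags):
    """Common booru-to-caption formatting: underscores become spaces
    and parentheses are escaped. (A convention, not necessarily this
    notebook's exact sanitization rules.)"""
    return ", ".join(
        t.replace("_", " ").replace("(", r"\(").replace(")", r"\)")
        for t in tags
    )

scores = np.array([0.95, 0.9, 0.88, 0.2, 0.1])
t = mcut_threshold(scores)   # largest gap is 0.88 -> 0.2, so t = 0.54
kept = scores[scores > t]    # the three high-confidence tags survive
```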
Utility: CSV to Text File Converter
Features:
- Converts CSV files with `id` and `caption` columns to individual text files
- Batch processing of multiple CSV files
- UTF-8 encoding support
- Simple and lightweight utility
Use Case: When you have existing caption data in CSV format that needs to be converted to individual text files for training.
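A minimal stdlib version of this converter might look like the following (the `<id>.txt` output naming is an assumption based on the feature list):

```python
import csv
from pathlib import Path

def csv_to_text_files(csv_path, out_dir):
    """Write each row's `caption` column to `<id>.txt` in out_dir.

    Assumes the CSV has `id` and `caption` columns, per the README;
    returns the number of files written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            (out_dir / f"{row['id']}.txt").write_text(
                row["caption"], encoding="utf-8"
            )
            count += 1
    return count
```

Batch processing over multiple CSVs would just call this in a loop over `Path(...).glob("*.csv")`.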
All notebooks include:
- Image deduplication using perceptual hashing
- Size filtering to exclude small images
- Batch processing for efficiency
- Progress tracking with tqdm
- Error handling for corrupted images
- Caption cleaning and post-processing
- Flexible input formats (PNG, JPG, JPEG, WebP)
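The shared dedup-and-filter steps can be sketched with Pillow alone. The average hash below is a simple stand-in for whatever perceptual hash the notebooks actually use, and `keep_image` with its 512-pixel minimum mirrors the size filter described above:

```python
from PIL import Image

def average_hash(image, hash_size=8):
    """Simple perceptual (average) hash: downscale to an 8x8 grayscale
    grid and set one bit per pixel above the mean brightness. A
    stand-in for the notebooks' actual perceptual hash."""
    small = image.convert("L").resize((hash_size, hash_size), Image.LANCZOS)
    pixels = list(small.getdata())
    mean = sum(pixels) / len(pixels)
    bits = 0
    for px in pixels:
        bits = (bits << 1) | (1 if px > mean else 0)
    return bits

def keep_image(image, seen_hashes, min_size=512):
    """Apply the shared pipeline filters: reject images smaller than
    min_size on either side, then reject perceptual-hash duplicates."""
    if image.width < min_size or image.height < min_size:
        return False
    h = average_hash(image)
    if h in seen_hashes:
        return False
    seen_hashes.add(h)
    return True
```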
1. Choose your model based on your specific needs:
   - For structured, consistent captions: Florence 2
   - For creative/NSFW content with hints: Joy
   - For highest quality (if complete): Qwen 2.5 VL
   - For character-focused datasets: ToriiGate
   - For booru-style tags: WD14
2. Set up your environment:
   - Install required dependencies (each notebook lists its requirements)
   - Set up the Hugging Face models directory (`HF_HOME`)
   - Ensure CUDA is available for GPU acceleration
3. Prepare your dataset:
   - Organize images in your desired folder structure
   - Add metadata files if using hinted models (Joy, ToriiGate)
   - Configure image paths in the notebook
4. Run the notebook:
   - Execute cells in order
   - Monitor progress and adjust parameters as needed
   - Review generated captions and clean up if necessary
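Pointing the Hugging Face models directory at a custom location is a one-liner; the path below is only an example, so pick any directory with enough free space:

```shell
# Point Hugging Face model downloads at a dedicated directory.
export HF_HOME="$HOME/hf-models"
mkdir -p "$HF_HOME"
```

Add the `export` line to your shell profile to make it persistent across sessions.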
- Python 3.8+
- CUDA-compatible GPU (recommended)
- Sufficient VRAM (varies by model: 8GB+ for most models, 24GB+ for Qwen 2.5 VL)
- Hugging Face account for model downloads
- Always backup your dataset before running any captioning notebook
- Some models may require specific hardware configurations
- The Qwen 2.5 VL notebook is marked as work-in-progress and should be used with caution
- Consider using different models for different types of content (e.g., WD14 for anime, Florence 2 for general images)