This repository contains Jupyter notebooks for captioning image datasets with various state-of-the-art vision-language models. Each notebook wraps a different model and generates descriptive captions for an image dataset.
The following captioning models are available in this repository:
- Florence 2 - Microsoft's vision-language model with predefined tasks
- Joy - Hinted model based on Llama with natural language prompt support
- Qwen 2.5 VL - Alibaba's large vision-language model (32B Instruct)
- ToriiGate - Hinted model based on Qwen VL with tag/character support
- WD14 (Waifu Diffusion 1.4) - ONNX-based tagging model for SDXL models
- CSV Processing - Utility for converting CSV caption data to text files
- For SDXL-based models, including Pony and Illustrious, use the WD14 notebook.
- Otherwise, use the ToriiGate notebook.
ToriiGate tends to hallucinate less than Joy, but Joy can be more detailed at times. ToriiGate is much better at following the hints provided in the metadata file.
Model: Florence 2 (Microsoft) / MiaoshouAI Florence-2-large-PromptGen-v2.0
Features:
- Uses predefined task prompts (`<CAPTION>`, `<DETAILED_CAPTION>`, `<MORE_DETAILED_CAPTION>`)
- Additional tasks available in MiaoshouAI models: `<ANALYZE>`, `<GENERATE_PROMPT>`, `<GENERATE_TAGS>`, `<MIXED_CAPTION>`, `<MIXED_CAPTION_PLUS>`
- Automatic image deduplication using perceptual hashing
- Image size filtering (minimum 512x512)
- Caption cleaning and keyword replacement
- Template-based caption generation with metadata support
Use Case: Best for structured captioning with predefined formats and when you need consistent output styles.
Model: Joy (based on Meta-Llama-3.1-8B-Instruct-abliterated)
Features:
- Natural language hint support via `meta.yaml` files
- CLIP vision encoder with custom image adapter
- Uncensored output for NSFW content
- Template-based caption generation
- Image deduplication and size filtering
- Flexible prompt system with metadata integration
Use Case: Ideal for creative or NSFW content where you need natural language hints and uncensored output.
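The template-plus-metadata mechanism used by the hinted notebooks can be sketched with the standard library alone. The `build_prompt` helper, template text, and placeholder names below are illustrative assumptions, not the notebook's actual API:

```python
from string import Template

def build_prompt(metadata, template):
    """Fill a prompt template with per-image metadata hints.

    Uses safe_substitute so placeholders without matching metadata
    are left intact instead of raising. (This fallback behaviour is
    an assumption of the sketch.)
    """
    return Template(template).safe_substitute(metadata)

prompt = build_prompt(
    {"character": "Alice", "style": "watercolor"},
    "Describe this image of $character in a $style style.",
)
print(prompt)  # Describe this image of Alice in a watercolor style.
```

In the real notebooks the metadata would come from the per-image `meta.yaml` file rather than an inline dict.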
Model: Qwen 2.5 VL 32B Instruct (AWQ quantized)
Features:
- Large-scale vision-language model (32B parameters)
- Flash attention 2 for memory efficiency
- AWQ quantization for faster inference
- Multi-image and video support capabilities
- High-quality detailed captions
Status: Work in progress
Use Case: When you need the highest quality captions and have sufficient computational resources.
Model: ToriiGate v0.4-7B (based on Qwen VL)
Features:
- Support for multiple hint types: tags, characters, character traits, and general info
- Individual image info via `$IMAGE_NAME.image_info.json` files
- Folder-level info via `image_info.json` files
- Multiple output formats: JSON, markdown, short/long captions, bounding boxes
- Chain-of-thought correction capabilities
- Booru tag sanitization and processing
Use Case: Perfect for character-focused datasets or when you have detailed metadata about characters and their traits.
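The two metadata locations above imply a simple lookup order: per-image file first, folder-level file as a fallback. This stdlib sketch shows one way to resolve them (the `load_image_info` helper and its empty-dict fallback are assumptions, not this notebook's code):

```python
import json
from pathlib import Path

def load_image_info(image_path):
    """Load hint metadata for an image.

    Prefers a per-image `<stem>.image_info.json` next to the image,
    then falls back to the folder-level `image_info.json`. Returning
    an empty dict when neither exists is an assumption of this sketch.
    """
    image_path = Path(image_path)
    candidates = [
        image_path.with_name(image_path.stem + ".image_info.json"),
        image_path.parent / "image_info.json",
    ]
    for candidate in candidates:
        if candidate.is_file():
            return json.loads(candidate.read_text(encoding="utf-8"))
    return {}
```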
Model: WD14 (Waifu Diffusion 1.4) Tagger v3
Features:
- ONNX runtime with CUDA acceleration
- Multiple model variants (SwinV2, ConvNeXt, ViT, EVA02)
- Automatic tag generation with confidence thresholds
- MCut thresholding for optimal tag selection
- Booru-style tag formatting
- Optimized for SDXL-based models like Pony and Illustrious
Use Case: Best for generating booru-style tags for training data, especially for anime/illustration datasets.
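MCut thresholding and booru-style tag formatting can both be illustrated without the ONNX model. `mcut_threshold` follows the standard maximum-cut heuristic (place the cutoff in the largest gap between sorted confidences); `format_tags` shows a common booru-to-caption convention. The helper names and exact formatting rules are assumptions, not this notebook's code:

```python
import numpy as np

def mcut_threshold(probs):
    """Maximum Cut Thresholding: sort confidences in descending order,
    find the largest gap between consecutive values, and place the
    threshold at the midpoint of that gap."""
    sorted_probs = np.sort(probs)[::-1]
    gaps = sorted_probs[:-1] - sorted_probs[1:]
    i = int(np.argmax(gaps))
    return float((sorted_probs[i] + sorted_probs[i + 1]) / 2)

def format_tags(tags):
    """Common booru-to-caption formatting: underscores become spaces
    and parentheses are escaped. (A convention, not necessarily this
    notebook's exact sanitization rules.)"""
    return ", ".join(
        t.replace("_", " ").replace("(", r"\(").replace(")", r"\)")
        for t in tags
    )

scores = np.array([0.95, 0.9, 0.88, 0.2, 0.1])
t = mcut_threshold(scores)   # largest gap is 0.88 -> 0.2, so t = 0.54
kept = scores[scores > t]    # the three high-confidence tags survive
```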
Utility: CSV to Text File Converter
Features:
- Converts CSV files with `id` and `caption` columns to individual text files
- Batch processing of multiple CSV files
- UTF-8 encoding support
- Simple and lightweight utility
Use Case: When you have existing caption data in CSV format that needs to be converted to individual text files for training.
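A minimal stdlib version of this converter might look like the following (the `<id>.txt` output naming is an assumption based on the feature list):

```python
import csv
from pathlib import Path

def csv_to_text_files(csv_path, out_dir):
    """Write each row's `caption` column to `<id>.txt` in out_dir.

    Assumes the CSV has `id` and `caption` columns, per the README;
    returns the number of files written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            (out_dir / f"{row['id']}.txt").write_text(
                row["caption"], encoding="utf-8"
            )
            count += 1
    return count
```

Batch processing over multiple CSVs would just call this in a loop over `Path(...).glob("*.csv")`.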
All notebooks include:
- Image deduplication using perceptual hashing
- Size filtering to exclude small images
- Batch processing for efficiency
- Progress tracking with tqdm
- Error handling for corrupted images
- Caption cleaning and post-processing
- Flexible input formats (PNG, JPG, JPEG, WebP)
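The shared dedup-and-filter steps can be sketched with Pillow alone. The average hash below is a simple stand-in for whatever perceptual hash the notebooks actually use, and `keep_image` with its 512-pixel minimum mirrors the size filter described above:

```python
from PIL import Image

def average_hash(image, hash_size=8):
    """Simple perceptual (average) hash: downscale to an 8x8 grayscale
    grid and set one bit per pixel above the mean brightness. A
    stand-in for the notebooks' actual perceptual hash."""
    small = image.convert("L").resize((hash_size, hash_size), Image.LANCZOS)
    pixels = list(small.getdata())
    mean = sum(pixels) / len(pixels)
    bits = 0
    for px in pixels:
        bits = (bits << 1) | (1 if px > mean else 0)
    return bits

def keep_image(image, seen_hashes, min_size=512):
    """Apply the shared pipeline filters: reject images smaller than
    min_size on either side, then reject perceptual-hash duplicates."""
    if image.width < min_size or image.height < min_size:
        return False
    h = average_hash(image)
    if h in seen_hashes:
        return False
    seen_hashes.add(h)
    return True
```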
1. Choose your model based on your specific needs:
   - For structured, consistent captions: Florence 2
   - For creative/NSFW content with hints: Joy
   - For highest quality (if complete): Qwen 2.5 VL
   - For character-focused datasets: ToriiGate
   - For booru-style tags: WD14
2. Set up your environment:
   - Install required dependencies (each notebook lists its requirements)
   - Set up the Hugging Face models directory (`HF_HOME`)
   - Ensure CUDA is available for GPU acceleration
3. Prepare your dataset:
   - Organize images in your desired folder structure
   - Add metadata files if using hinted models (Joy, ToriiGate)
   - Configure image paths in the notebook
4. Run the notebook:
   - Execute cells in order
   - Monitor progress and adjust parameters as needed
   - Review generated captions and clean up if necessary
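Pointing the Hugging Face models directory at a custom location is a one-liner; the path below is only an example, so pick any directory with enough free space:

```shell
# Point Hugging Face model downloads at a dedicated directory.
export HF_HOME="$HOME/hf-models"
mkdir -p "$HF_HOME"
```

Add the `export` line to your shell profile to make it persistent across sessions.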
- Python 3.8+
- CUDA-compatible GPU (recommended)
- Sufficient VRAM (varies by model: 8GB+ for most models, 24GB+ for Qwen 2.5 VL)
- Hugging Face account for model downloads
- Always backup your dataset before running any captioning notebook
- Some models may require specific hardware configurations
- The Qwen 2.5 VL notebook is marked as work-in-progress and should be used with caution
- Consider using different models for different types of content (e.g., WD14 for anime, Florence 2 for general images)