A dataset of 88,800+ captcha images labeled with their corresponding 3-character alphanumeric text. Designed for training and evaluating OCR and captcha-solving models.
- `images/`: JPEG captcha images named by their SHA-256 hash (`{hash}.jpg`)
- `labels.csv`: Raw CSV with columns `id,captcha,imageURL`
- `labels_clean.csv`: Sanitized CSV with columns `id,captcha` (no URL, trimmed, uppercased, `+0` artifacts removed)
- 88,847 labeled images
- 15,533 unique captcha values
- Labels are 3 characters, uppercase alphanumeric (e.g., `4RJ`, `N34`, `YHP`)
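The label format described above can be checked with a small predicate. This is a minimal sketch, assuming "uppercase alphanumeric" means the characters `A-Z` and `0-9` only; the names `LABEL_RE` and `is_valid_label` are illustrative and do not come from the repository.

```python
import re

# Assumption: a valid label is exactly three characters from A-Z and 0-9.
LABEL_RE = re.compile(r"[A-Z0-9]{3}")


def is_valid_label(label: str) -> bool:
    """Return True if the label is exactly 3 uppercase alphanumeric characters."""
    return LABEL_RE.fullmatch(label) is not None
```

Using `fullmatch` rather than `match` rejects labels with trailing characters, such as an unsanitized `4RJ+0`.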
- `id`: SHA-256 hash of the image; matches the filename in `images/` (without extension)
- `captcha`: 3-character uppercase alphanumeric text shown in the captcha image
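Because each `id` doubles as the image filename, the mapping can be checked by recomputing the hash. The sketch below assumes the hash is taken over the raw file bytes and stored as lowercase hex, which the source does not state explicitly; `sha256_of_file` and `id_matches_image` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Return the lowercase hex SHA-256 digest of a file's raw bytes."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def id_matches_image(image_path: Path) -> bool:
    """Check that the filename (without extension) equals the content digest."""
    return image_path.stem.lower() == sha256_of_file(image_path)
```

If the ids were derived some other way (for example, from the image URL), this check would fail for every file, which would itself be a useful signal.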
- `sanitize.py`: Cleans `labels.csv` → `labels_clean.csv` (trim, uppercase, remove `+0`, drop URL column)
- `validate_dataset.py`: Validates the clean CSV, checking for missing or corrupt images, anomalous label lengths, and sanitization issues
- `train_crnn_ctc.py`: CRNN + CTC training script (PyTorch)
- `train_crnn_ctc.ipynb`: Kaggle notebook version with training and CoreML export
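The sanitization steps listed for `sanitize.py` can be sketched as follows. This is not the repository's script: it assumes the `+0` artifact is the literal substring `+0` inside the raw label, and the function names and the sample row (including the placeholder URL) are made up for illustration.

```python
import csv
import io


def sanitize_label(raw: str) -> str:
    """Trim whitespace, drop the '+0' artifact, and uppercase the label."""
    # Assumption: the artifact is the literal substring "+0".
    return raw.strip().replace("+0", "").strip().upper()


def sanitize_csv(src, dst) -> int:
    """Copy id and captcha from src to dst, dropping imageURL.

    Both arguments are open text file objects. Returns the number of rows written.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=["id", "captcha"], lineterminator="\n")
    writer.writeheader()
    count = 0
    for row in reader:
        writer.writerow(
            {"id": row["id"].strip(), "captcha": sanitize_label(row["captcha"])}
        )
        count += 1
    return count


# Usage with in-memory files; the row below is fabricated sample input.
raw = io.StringIO("id,captcha,imageURL\nabc123, 4rj+0 ,https://example.invalid/a.jpg\n")
clean = io.StringIO()
sanitize_csv(raw, clean)
print(clean.getvalue())
```

Working on file objects rather than paths keeps the function easy to test and lets the same code run against real files opened with `open(..., newline="")`.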
A CRNN (CNN + BiLSTM) trained with CTC loss for variable-length captcha recognition.
- Architecture: 5-block CNN → 2-layer BiLSTM (256 hidden) → Linear
- Best Val Accuracy: 99.49% (exact match)
- Export: CoreML (`.mlpackage`) for on-device inference on Apple platforms
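To make the CTC output and the exact-match metric concrete, here is a dependency-free sketch of greedy (best-path) CTC decoding followed by exact-match accuracy. The training script's actual decoder is not shown in this document, so the alphabet ordering, the blank index of 0, and the function names below are all assumptions.

```python
from typing import Sequence

# Assumption: index 0 is the CTC blank, followed by the digits and A-Z.
BLANK = 0
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def ctc_greedy_decode(frame_ids: Sequence[int]) -> str:
    """Collapse consecutive repeated indices, then drop blanks."""
    chars = []
    prev = BLANK
    for idx in frame_ids:
        # Emit a character only when the index changes and is not the blank.
        if idx != prev and idx != BLANK:
            chars.append(ALPHABET[idx - 1])
        prev = idx
    return "".join(chars)


def exact_match_accuracy(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of predictions that equal their target string exactly."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and equal length")
    matches = sum(p == t for p, t in zip(predictions, targets))
    return matches / len(targets)
```

Here `frame_ids` stands for the per-timestep argmax over the model's output classes. A blank between two identical indices is what allows a doubled character such as `NN` to survive the collapse step.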
Images were collected from a web application. Labels were extracted from the source metadata and sanitized (trimmed, uppercased, and stripped of artifacts).
- Training OCR models for captcha recognition
- Benchmarking text recognition architectures (CNN, CRNN, Transformer-based)
- Data augmentation research for noisy text recognition
- On-device captcha solving via CoreML on iOS/macOS