Captchas

Overview

A dataset of 88,847 captcha images, each labeled with its 3-character alphanumeric text. It is designed for training and evaluating OCR and captcha-solving models.

Dataset Structure

images/ — JPEG captcha images named by their SHA-256 hash ({hash}.jpg)
labels.csv — Raw CSV with columns id, captcha, imageURL
labels_clean.csv — Sanitized CSV with columns id, captcha (no URL, trimmed, uppercased, removed +0 artifacts)

Key Stats

88,847 labeled images
15,533 unique captcha values
Labels are 3 characters, uppercase alphanumeric (e.g., 4RJ, N34, YHP)
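
As a rough illustration of the label format described above, the following sketch checks that a label is exactly three characters drawn from 0-9 and A-Z. The names LABEL_RE, is_valid_label, and find_anomalies are hypothetical and are not taken from validate_dataset.py.

```python
import re

# Assumed format of a clean label: exactly three characters from 0-9 and A-Z.
LABEL_RE = re.compile(r"[0-9A-Z]{3}")


def is_valid_label(label: str) -> bool:
    """Return True if the label is exactly 3 uppercase alphanumeric characters."""
    return LABEL_RE.fullmatch(label) is not None


def find_anomalies(labels):
    """Return the labels that do not match the expected format, in input order."""
    return [label for label in labels if not is_valid_label(label)]
```

Running find_anomalies over the captcha column of labels_clean.csv would list any rows that deviate from the stated format.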

Column Descriptors

id — SHA-256 hash of the image, matches the filename in images/ (without extension)
captcha — 3-character uppercase alphanumeric text shown in the captcha image
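
Because id doubles as the filename, a label row can be mapped to its image and the file can be checked against its own name. The sketch below assumes the hash is computed over the raw bytes of the JPEG file, which the README does not state explicitly. The helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_path_for(image_id: str, images_dir: Path = Path("images")) -> Path:
    """Build the expected image path for a given id ({hash}.jpg)."""
    return images_dir / f"{image_id}.jpg"


def id_matches_content(path: Path) -> bool:
    """Check that the filename (without extension) equals the file's SHA-256.

    Assumption: the id is the hash of the raw file bytes.
    """
    return path.stem.lower() == sha256_of_file(path)
```

If id_matches_content returns False for many files, the hash was probably computed over something other than the stored file bytes, and the assumption above does not hold.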

Scripts

sanitize.py — Cleans labels.csv → labels_clean.csv (trim, uppercase, remove +0, drop URL column)
validate_dataset.py — Validates the clean CSV: checks for missing/corrupt images, anomalous label lengths, and sanitization issues
train_crnn_ctc.py — CRNN + CTC training script (PyTorch)
train_crnn_ctc.ipynb — Kaggle notebook version with training + CoreML export
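
The cleaning steps listed for sanitize.py can be sketched as follows. This is not the repository's script. It assumes that the +0 artifact is a literal "+0" substring inside the captcha value and that the raw CSV has a header row with the columns id, captcha, imageURL. The function names are hypothetical.

```python
import csv
import io


def sanitize_label(raw: str) -> str:
    """Remove the "+0" artifact, trim whitespace, and uppercase the label.

    Assumption: the artifact is the literal substring "+0".
    """
    return raw.replace("+0", "").strip().upper()


def sanitize_rows(reader):
    """Yield rows with only id and captcha, dropping the imageURL column."""
    for row in reader:
        yield {"id": row["id"].strip(), "captcha": sanitize_label(row["captcha"])}


def sanitize_csv(src_text: str) -> str:
    """Convert raw CSV text (id, captcha, imageURL) into clean CSV text."""
    reader = csv.DictReader(io.StringIO(src_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["id", "captcha"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(sanitize_rows(reader))
    return out.getvalue()
```

To process files instead of strings, the same sanitize_rows generator can be fed from csv.DictReader opened on labels.csv and written out to labels_clean.csv.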

Model

A CRNN (CNN + BiLSTM) trained with CTC loss. CTC supports variable-length output, although the labels in this dataset are all 3 characters long.

Architecture: 5-block CNN → 2-layer BiLSTM (256 hidden) → Linear
Best validation accuracy: 99.49% (exact match)
Export: CoreML (.mlpackage) for on-device inference on Apple platforms
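
To show how per-frame CTC outputs become a label and how exact-match accuracy is counted, here is a minimal greedy decoding sketch in plain Python. It assumes class index 0 is the CTC blank and that indices 1 to 36 map to the characters 0-9 then A-Z. The training script may use a different ordering, and all names here are hypothetical.

```python
# Assumed class layout: index 0 is the CTC blank, indices 1..36 are these characters.
CHARSET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
BLANK = 0


def ctc_greedy_decode(frame_scores):
    """Decode one sequence of per-frame class scores with greedy CTC rules.

    Takes the best class per frame, collapses consecutive repeats,
    then drops blanks.
    """
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    decoded = []
    previous = BLANK
    for idx in best_path:
        if idx != BLANK and idx != previous:
            decoded.append(CHARSET[idx - 1])
        previous = idx
    return "".join(decoded)


def exact_match_accuracy(predictions, targets):
    """Fraction of predictions that equal their target string exactly."""
    if not targets:
        return 0.0
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)
```

Under exact match, a prediction counts as correct only if all of its characters match the label, so a single wrong character makes the whole sample wrong.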

Provenance

Images were collected from a web application. Labels were extracted from the source metadata and sanitized (trimmed, uppercased, and stripped of +0 artifacts).

Use Cases

Training OCR models for captcha recognition
Benchmarking text recognition architectures (CNN, CRNN, Transformer-based)
Data augmentation research for noisy text recognition
On-device captcha solving via CoreML on iOS/macOS
