Captchas

Overview

A dataset of 88,800+ captcha images labeled with their corresponding 3-character alphanumeric text. Designed for training and evaluating OCR and captcha-solving models.

Dataset Structure

  • images/ — JPEG captcha images named by their SHA-256 hash ({hash}.jpg)
  • labels.csv — Raw CSV with columns id, captcha, imageURL
  • labels_clean.csv — Sanitized CSV with columns id, captcha (no URL, trimmed, uppercased, removed +0 artifacts)
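
The layout above pairs each row of labels_clean.csv with a file in images/. A minimal sketch of that pairing, assuming the CSV has a header row with the column names id and captcha as listed; the helper name read_pairs and the shortened ids are illustrative, not part of the repository:

```python
import csv
import io
from pathlib import Path


def read_pairs(csv_file, images_dir="images"):
    """Yield (image_path, label) pairs from an open labels_clean.csv file."""
    for row in csv.DictReader(csv_file):
        # Images are named {hash}.jpg, and the id column holds the hash.
        yield Path(images_dir) / f"{row['id']}.jpg", row["captcha"]


# In-memory stand-in for labels_clean.csv. The ids are shortened placeholders;
# real ids are full SHA-256 hashes.
sample = io.StringIO("id,captcha\nabc123,4RJ\ndef456,N34\n")
pairs = list(read_pairs(sample))
```

With the real file, the same function can be fed a handle opened with open("labels_clean.csv", newline="", encoding="utf-8").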

Key Stats

  • 88,847 labeled images
  • 15,533 unique captcha values
  • Labels are 3 characters, uppercase alphanumeric (e.g., 4RJ, N34, YHP)
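
These figures can be recomputed from the label column. The sketch below is one way to do it, assuming "uppercase alphanumeric" means the characters A-Z and 0-9; the function name summarize and the sample labels are made up for illustration:

```python
import re
from collections import Counter

# Exactly three characters drawn from A-Z and 0-9 (assumed character set).
LABEL_RE = re.compile(r"[A-Z0-9]{3}")


def summarize(labels):
    """Count total and unique labels and collect any that break the format."""
    counts = Counter(labels)
    anomalous = [label for label in labels if not LABEL_RE.fullmatch(label)]
    return {"total": len(labels), "unique": len(counts), "anomalous": anomalous}


# Toy input: three valid labels, one repeat, two deliberately malformed.
stats = summarize(["4RJ", "N34", "YHP", "4RJ", "ab1", "TOOLONG"])
```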

Column Descriptors

  • id — SHA-256 hash of the image, matches the filename in images/ (without extension)
  • captcha — 3-character uppercase alphanumeric text shown in the captcha image
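
Because id is described as the image's SHA-256 hash, a file can be checked against its id. This is a sketch under two assumptions not stated in the README: the hash is taken over the raw file bytes, and it is stored as a lowercase hex string. The helper names are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def id_matches(image_bytes: bytes, image_id: str) -> bool:
    """Check whether image_id equals the hash of the image bytes (assumed scheme)."""
    return sha256_hex(image_bytes) == image_id.strip().lower()
```

For a file on disk, the bytes can be read with pathlib.Path(path).read_bytes() and the id taken from the filename without its extension.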

Scripts

  • sanitize.py — Cleans labels.csv → labels_clean.csv (trim, uppercase, remove +0, drop URL column)
  • validate_dataset.py — Validates the clean CSV: checks for missing/corrupt images, anomalous label lengths, and sanitization issues
  • train_crnn_ctc.py — CRNN + CTC training script (PyTorch)
  • train_crnn_ctc.ipynb — Kaggle notebook version with training + CoreML export
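
The sanitization steps listed for sanitize.py can be sketched as follows. This is not the script itself: the README does not say what the +0 artifact looks like, so the sketch assumes it is the literal substring "+0" inside a raw label. The function names are hypothetical.

```python
def sanitize_label(raw: str) -> str:
    """Remove the assumed '+0' artifact, then trim and uppercase the label."""
    return raw.replace("+0", "").strip().upper()


def sanitize_rows(rows):
    """Map raw rows (id, captcha, imageURL) to clean rows (id, captcha)."""
    for row in rows:
        # The imageURL column is simply not copied to the output.
        yield {"id": row["id"].strip(), "captcha": sanitize_label(row["captcha"])}


# Toy raw row; the id and URL are placeholders.
raw_rows = [{"id": "abc123", "captcha": " 4rj+0 ", "imageURL": "https://example.com/x.jpg"}]
clean_rows = list(sanitize_rows(raw_rows))
```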

Model

A CRNN (CNN + BiLSTM) trained with CTC loss for variable-length captcha recognition.

  • Architecture: 5-block CNN → 2-layer BiLSTM (256 hidden) → Linear
  • Best Val Accuracy: 99.49% (exact match)
  • Export: CoreML (.mlpackage) for on-device inference on Apple platforms
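
A CTC-trained model emits one class index per time step, so its output has to be collapsed into text before exact-match accuracy can be measured. Below is a minimal sketch of greedy CTC decoding in plain Python. The blank index (0) and the alphabet order (digits, then uppercase letters) are assumptions for illustration; the training script may use a different mapping.

```python
import string

BLANK = 0  # assumed index of the CTC blank symbol
ALPHABET = string.digits + string.ascii_uppercase  # assumed class order, indices 1..36


def ctc_greedy_decode(frame_ids):
    """Collapse consecutive repeats, then drop blanks, and map indices to text."""
    chars = []
    prev = BLANK
    for idx in frame_ids:
        if idx != prev and idx != BLANK:
            chars.append(ALPHABET[idx - 1])
        prev = idx
    return "".join(chars)


def exact_match_accuracy(preds, targets):
    """Fraction of predictions that equal their target string exactly."""
    hits = sum(p == t for p, t in zip(preds, targets))
    return hits / len(targets)


# Per-frame argmax indices for one image: blank, 4, 4, blank, R, R, R, blank, J.
frames = [0, 5, 5, 0, 28, 28, 28, 0, 20]
decoded = ctc_greedy_decode(frames)
```

A blank between two identical indices keeps both characters, which is how CTC represents doubled letters.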

Provenance

Images were collected from a web application. Labels were extracted from the source metadata and sanitized (trimmed, uppercased, artifacts removed).

Use Cases

  • Training OCR models for captcha recognition
  • Benchmarking text recognition architectures (CNN, CRNN, Transformer-based)
  • Data augmentation research for noisy text recognition
  • On-device captcha solving via CoreML on iOS/macOS
