Generate high-quality AI training datasets from simple descriptions or documents
Data4AI makes it easy to create instruction-tuning datasets for training and fine-tuning language models. Whether you're building domain-specific models or simply need quality training data, Data4AI has you covered.
- 🎯 Simple Commands - Generate datasets from descriptions or documents
- 📄 Multiple Formats - Support for ChatML, Alpaca, and custom schemas
- 🔄 Smart Processing - Automatic chunking, deduplication, and quality validation
- 🏷️ Cognitive Taxonomy - Built-in Bloom's taxonomy for balanced learning
- ☁️ Direct Upload - Push datasets directly to HuggingFace Hub
- 🌐 100+ Models - Access to GPT, Claude, Llama, and more via OpenRouter
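The deduplication mentioned under Smart Processing can be illustrated with a content-hash sketch. This is a simplified stand-alone example, not Data4AI's actual implementation (`dedup_by_content` is a hypothetical name):

```python
import hashlib
import json

def dedup_by_content(records):
    """Keep only the first record for each distinct content hash."""
    seen, unique = set(), []
    for rec in records:
        # Canonicalize the record so key order and case don't matter
        text = json.dumps(rec, sort_keys=True).lower()
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique

records = [
    {"q": "What is a list?", "a": "An ordered collection."},
    {"q": "What is a list?", "a": "An ordered collection."},  # exact duplicate
    {"q": "What is a dict?", "a": "A key-value mapping."},
]
print(len(dedup_by_content(records)))  # → 2
```

Real dataset tooling usually goes further (text normalization, near-duplicate detection); exact content hashing is the simplest baseline.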
```bash
pip install data4ai
```

Get your free API key from OpenRouter:

```bash
export OPENROUTER_API_KEY="your_key_here"
```

From a description:

```bash
data4ai prompt \
  --repo my-dataset \
  --description "Python programming questions for beginners" \
  --count 100
```

From documents:

```bash
data4ai doc document.pdf \
  --repo doc-dataset \
  --count 100
```

From YouTube videos:

```bash
data4ai youtube @3Blue1Brown \
  --repo math-videos \
  --count 100
```

Upload to HuggingFace:

```bash
data4ai push --repo my-dataset
```

That's it! Your dataset is ready at `outputs/datasets/my-dataset/data.jsonl` 🎉
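Once generated, each line of the JSONL file is a standalone JSON object, so it is easy to inspect from Python. A small sketch (the path matches the quick start above; `load_jsonl` is a hypothetical helper, not part of the data4ai package):

```python
import json
from pathlib import Path

def load_jsonl(path):
    """Read a .jsonl file: one JSON record per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

path = Path("outputs/datasets/my-dataset/data.jsonl")
if path.exists():
    records = load_jsonl(path)
    print(f"{len(records)} examples; first record keys: {sorted(records[0])}")
```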
Learn more:

- Examples - Real-world usage examples
- Commands - Complete CLI reference
- Features - Advanced features and options
- YouTube Integration - Extract datasets from YouTube videos
- Troubleshooting - Common issues and solutions
- Runnable Examples - Ready-to-run example scripts
We welcome contributions! See our Contributing Guide for:
- Development setup
- Code style guidelines
- Testing requirements
- Pull request process
Need help?

- 🐛 Bug reports: GitHub Issues
- 💬 Questions: GitHub Discussions
- 📧 Contact: research@zysec.ai
```
data4ai/
├── data4ai/          # Core library code
├── docs/             # User documentation
├── tests/            # Test suite
├── README.md         # You are here
├── CONTRIBUTING.md   # How to contribute
└── CHANGELOG.md      # Release history
```
🏥 Medical Training Data

```bash
data4ai prompt --repo medical-qa \
  --description "Medical diagnosis Q&A for common symptoms" \
  --count 500
```

⚖️ Legal Assistant Data

```bash
data4ai doc legal-docs/ --repo legal-assistant --count 1000
```

💻 Code Training Data

```bash
data4ai prompt --repo code-qa \
  --description "Python debugging and best practices" \
  --count 300
```

📺 Educational Video Content

```bash
# Programming tutorials
data4ai youtube --search "python tutorial,programming" --repo python-course --count 200

# Educational channels
data4ai youtube @3Blue1Brown --repo math-education --count 150

# Conference talks
data4ai youtube @pycon --repo conference-talks --count 100
```

For higher-quality output, combine verification, taxonomy balancing, and content-based deduplication:

```bash
data4ai doc document.pdf \
  --repo high-quality \
  --verify \
  --taxonomy advanced \
  --dedup-strategy content
```

For large batches, process a whole directory recursively:

```bash
data4ai doc documents/ \
  --repo batch-dataset \
  --count 1000 \
  --batch-size 20 \
  --recursive
```

To use a different model, set the `OPENROUTER_MODEL` environment variable:

```bash
export OPENROUTER_MODEL="anthropic/claude-3-5-sonnet"
data4ai prompt --repo custom-model --description "..." --count 100
```

Data4AI is built with:
- Async Processing - Fast concurrent generation
- DSPy Integration - Advanced prompt optimization
- Quality Validation - Automatic content verification
- Atomic Writes - Safe file operations
- Schema Validation - Ensures data consistency
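The validation bullets above can be made concrete with a structural check for ChatML-shaped records. This is a minimal sketch of the idea, not Data4AI's actual validator:

```python
def validate_chatml(record):
    """Minimal structural checks for a ChatML-style record (illustrative)."""
    msgs = record.get("messages")
    if not isinstance(msgs, list) or not msgs:
        return False
    for msg in msgs:
        if not isinstance(msg, dict):
            return False
        if msg.get("role") not in {"system", "user", "assistant"}:
            return False
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            return False
    # A usable training example needs at least one assistant turn
    return any(m["role"] == "assistant" for m in msgs)

good = {"messages": [{"role": "user", "content": "Hi"},
                     {"role": "assistant", "content": "Hello!"}]}
bad = {"messages": [{"role": "user", "content": ""}]}
print(validate_chatml(good), validate_chatml(bad))  # → True False
```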
A generated record in the ChatML schema looks like:

```json
{
  "messages": [
    {
      "role": "user",
      "content": "How do I handle exceptions in Python?"
    },
    {
      "role": "assistant",
      "content": "In Python, use try-except blocks to handle exceptions: ..."
    }
  ],
  "taxonomy_level": "understand"
}
```

Configure Data4AI through environment variables:

```bash
# Required
export OPENROUTER_API_KEY="your_key"

# Optional
export OPENROUTER_MODEL="openai/gpt-4o-mini"  # Default model
export HF_TOKEN="your_hf_token"               # For HuggingFace uploads
export OUTPUT_DIR="./outputs/datasets"        # Default output directory
```

Or create `.data4ai.yaml` in your project:
```yaml
default_model: "anthropic/claude-3-5-sonnet"
default_schema: "chatml"
default_count: 100
quality_check: true
```

On the roadmap:

- Custom Schema Support - Define your own data formats
- Local Model Support - Use local LLMs (Ollama, vLLM)
- Multi-language Datasets - Generate data in multiple languages
- Dataset Analytics - Advanced quality metrics and visualization
- API Service - RESTful API for dataset generation
Performance highlights:

- Speed: Generate 100 examples in ~2 minutes
- Quality: Built-in validation and deduplication
- Scale: Tested with datasets up to 100K examples
- Memory: Efficient streaming for large documents
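The memory claim above depends on never loading a whole document at once. The idea can be sketched with an overlapping-chunk reader (illustrative only, with hypothetical names and parameters, not Data4AI's internals):

```python
def iter_chunks(path, chunk_chars=2000, overlap=200):
    """Stream a text file as fixed-size chunks with a small overlap,
    so memory stays bounded no matter how large the file is.
    Requires overlap < chunk_chars, or the loop would not make progress."""
    buf = ""
    with open(path, encoding="utf-8") as f:
        while True:
            data = f.read(8192)  # read in small pieces, never the whole file
            if not data:
                break
            buf += data
            while len(buf) >= chunk_chars:
                yield buf[:chunk_chars]
                buf = buf[chunk_chars - overlap:]  # keep overlap for context
    if buf:
        yield buf  # final partial chunk
```

Each chunk then becomes an independent generation context; the overlap keeps sentences that straddle a boundary visible in both neighboring chunks.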
If Data4AI helps you, please:
- ⭐ Star this repository
- 🐦 Share on social media
- 🤝 Contribute improvements
- 💖 Sponsor the project
MIT License - see LICENSE file for details.
Data4AI is developed by ZySec AI. ZySec AI empowers enterprises to confidently adopt AI where data sovereignty, privacy, and security are non-negotiable, helping them move beyond fragmented, siloed systems into a new era of intelligence, from data to agentic AI, on a single platform.