
Data4AI 🤖

PyPI | MIT License | Python 3.9+

Generate high-quality AI training datasets from simple descriptions or documents

Data4AI makes it easy to create instruction-tuning datasets for training and fine-tuning language models. Whether you're building domain-specific models or need quality training data, Data4AI has you covered.

✨ Features

  • 🎯 Simple Commands - Generate datasets from descriptions or documents
  • 📚 Multiple Formats - Support for ChatML, Alpaca, and custom schemas
  • 🔄 Smart Processing - Automatic chunking, deduplication, and quality validation
  • 🏷️ Cognitive Taxonomy - Built-in Bloom's taxonomy for balanced learning
  • ☁️ Direct Upload - Push datasets directly to HuggingFace Hub
  • 🌐 100+ Models - Access to GPT, Claude, Llama, and more via OpenRouter

🚀 Quick Start

Install

pip install data4ai

Get API Key

Get your free API key from OpenRouter:

export OPENROUTER_API_KEY="your_key_here"

Generate Your First Dataset

From a description:

data4ai prompt \
  --repo my-dataset \
  --description "Python programming questions for beginners" \
  --count 100

From documents:

data4ai doc document.pdf \
  --repo doc-dataset \
  --count 100

From YouTube videos:

data4ai youtube @3Blue1Brown \
  --repo math-videos \
  --count 100

Upload to HuggingFace:

data4ai push --repo my-dataset

That's it! Your dataset is ready at outputs/datasets/my-dataset/data.jsonl 🎉
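The output is JSON Lines: one JSON object per line. A minimal Python sketch of loading such a file (the record shape mirrors the ChatML sample shown later in this README; in a real run the path would be outputs/datasets/my-dataset/data.jsonl, but a temporary directory keeps the sketch self-contained):

```python
import json
import tempfile
from pathlib import Path

# Stand-in for outputs/datasets/my-dataset/data.jsonl.
tmp = tempfile.TemporaryDirectory()
path = Path(tmp.name) / "data.jsonl"

# Two sample ChatML-style records, written one JSON object per line.
records = [
    {"messages": [{"role": "user", "content": "What is a list?"},
                  {"role": "assistant", "content": "An ordered, mutable sequence."}]},
    {"messages": [{"role": "user", "content": "What is a tuple?"},
                  {"role": "assistant", "content": "An ordered, immutable sequence."}]},
]
with path.open("w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Loading: parse each line independently.
with path.open(encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded))  # 2
```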

📖 Documentation

🤝 Community

Contributing

We welcome contributions! See our Contributing Guide for:

  • Development setup
  • Code style guidelines
  • Testing requirements
  • Pull request process

Getting Help

Project Structure

data4ai/
β”œβ”€β”€ data4ai/           # Core library code
β”œβ”€β”€ docs/             # User documentation  
β”œβ”€β”€ tests/            # Test suite
β”œβ”€β”€ README.md         # You are here
β”œβ”€β”€ CONTRIBUTING.md   # How to contribute
└── CHANGELOG.md      # Release history

🎯 Use Cases

πŸ₯ Medical Training Data

data4ai prompt --repo medical-qa \
  --description "Medical diagnosis Q&A for common symptoms" \
  --count 500

⚖️ Legal Assistant Data

data4ai doc legal-docs/ --repo legal-assistant --count 1000

💻 Code Training Data

data4ai prompt --repo code-qa \
  --description "Python debugging and best practices" \
  --count 300

📺 Educational Video Content

# Programming tutorials
data4ai youtube --search "python tutorial,programming" --repo python-course --count 200

# Educational channels  
data4ai youtube @3Blue1Brown --repo math-education --count 150

# Conference talks
data4ai youtube @pycon --repo conference-talks --count 100

🛠️ Advanced Usage

Quality Control

data4ai doc document.pdf \
  --repo high-quality \
  --verify \
  --taxonomy advanced \
  --dedup-strategy content
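The --dedup-strategy content flag suggests deduplication keyed on example content. A rough illustration of that idea (not Data4AI's actual implementation; the normalization choices here are assumptions): hash each record's normalized text and keep the first occurrence.

```python
import hashlib

def content_key(record: dict) -> str:
    # Normalize: join message contents, lowercase, collapse whitespace.
    text = " ".join(m["content"] for m in record["messages"])
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dedup(records: list[dict]) -> list[dict]:
    seen, unique = set(), []
    for rec in records:
        key = content_key(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

data = [
    {"messages": [{"role": "user", "content": "What is PEP 8?"}]},
    {"messages": [{"role": "user", "content": "what is  PEP 8?"}]},  # dup after normalization
    {"messages": [{"role": "user", "content": "What is a decorator?"}]},
]
print(len(dedup(data)))  # 2
```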

Batch Processing

data4ai doc documents/ \
  --repo batch-dataset \
  --count 1000 \
  --batch-size 20 \
  --recursive
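Conceptually, --batch-size groups the generation work into fixed-size chunks rather than processing everything at once. A generic sketch of that batching pattern (the file names are illustrative):

```python
def batched(items: list, size: int):
    # Yield consecutive slices of at most `size` items.
    for i in range(0, len(items), size):
        yield items[i:i + size]

files = [f"doc_{i}.pdf" for i in range(45)]
batches = list(batched(files, 20))
print([len(b) for b in batches])  # [20, 20, 5]
```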

Custom Models

export OPENROUTER_MODEL="anthropic/claude-3-5-sonnet"
data4ai prompt --repo custom-model --description "..." --count 100

πŸ—οΈ Architecture

Data4AI is built with:

  • Async Processing - Fast concurrent generation
  • DSPy Integration - Advanced prompt optimization
  • Quality Validation - Automatic content verification
  • Atomic Writes - Safe file operations
  • Schema Validation - Ensures data consistency
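The async-processing idea can be sketched with asyncio: issue many generation calls concurrently, capped by a semaphore so the API isn't flooded. This is an illustration of the pattern, not Data4AI's internals; fake_generate stands in for a real model API call.

```python
import asyncio

async def fake_generate(prompt: str) -> str:
    # Stand-in for a real model API call.
    await asyncio.sleep(0.01)
    return f"example for: {prompt}"

async def generate_all(prompts: list[str], max_concurrency: int = 5) -> list[str]:
    sem = asyncio.Semaphore(max_concurrency)

    async def worker(prompt: str) -> str:
        async with sem:  # cap in-flight requests
            return await fake_generate(prompt)

    # gather preserves input order in its results.
    return await asyncio.gather(*(worker(p) for p in prompts))

results = asyncio.run(generate_all([f"topic {i}" for i in range(20)]))
print(len(results))  # 20
```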

📊 Sample Output

{
  "messages": [
    {
      "role": "user", 
      "content": "How do I handle exceptions in Python?"
    },
    {
      "role": "assistant",
      "content": "In Python, use try-except blocks to handle exceptions: ..."
    }
  ],
  "taxonomy_level": "understand"
}
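A record in this shape can be checked with a small validator. The required keys and role names below are inferred from the sample above, not from Data4AI's actual schema code:

```python
VALID_ROLES = {"system", "user", "assistant"}

def is_valid_chatml(record: dict) -> bool:
    # A valid record has a non-empty "messages" list whose entries
    # each carry a known role and string content.
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    return all(
        isinstance(m, dict)
        and m.get("role") in VALID_ROLES
        and isinstance(m.get("content"), str)
        for m in messages
    )

sample = {
    "messages": [
        {"role": "user", "content": "How do I handle exceptions in Python?"},
        {"role": "assistant", "content": "Use try-except blocks: ..."},
    ],
    "taxonomy_level": "understand",
}
print(is_valid_chatml(sample))        # True
print(is_valid_chatml({"messages": []}))  # False
```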

🔧 Configuration

Environment Variables

# Required
export OPENROUTER_API_KEY="your_key"

# Optional  
export OPENROUTER_MODEL="openai/gpt-4o-mini"  # Default model
export HF_TOKEN="your_hf_token"               # For HuggingFace uploads
export OUTPUT_DIR="./outputs/datasets"       # Default output directory

Config File

Create .data4ai.yaml in your project:

default_model: "anthropic/claude-3-5-sonnet"
default_schema: "chatml" 
default_count: 100
quality_check: true

🚀 Roadmap

  • Custom Schema Support - Define your own data formats
  • Local Model Support - Use local LLMs (Ollama, vLLM)
  • Multi-language Datasets - Generate data in multiple languages
  • Dataset Analytics - Advanced quality metrics and visualization
  • API Service - RESTful API for dataset generation

📈 Performance

  • Speed: Generate 100 examples in ~2 minutes
  • Quality: Built-in validation and deduplication
  • Scale: Tested with datasets up to 100K examples
  • Memory: Efficient streaming for large documents

⭐ Show Your Support

If Data4AI helps you, please:

  • ⭐ Star this repository
  • 🐦 Share on social media
  • 🤝 Contribute improvements
  • 💖 Sponsor the project

📄 License

MIT License - see LICENSE file for details.

🏢 About ZySec AI

ZySec AI helps enterprises adopt AI with confidence where data sovereignty, privacy, and security are non-negotiable, moving them beyond fragmented, siloed systems into a new era of intelligence: from data to agentic AI, on a single platform. Data4AI is developed by ZySec AI.


Made with ❤️ by ZySec AI for the open source community
