
Data Processing Pipeline Framework

A flexible, extensible framework for building modular data processing pipelines with visualization capabilities.

Overview

This project provides a generic foundation for creating data processing pipelines that:

  • Process input data through configurable pipeline stages
  • Generate visual outputs at each stage
  • Support batch processing of multiple files
  • Include comprehensive testing and visualization tools

Key Features

  • 🔧 Modular Architecture: Abstract base classes for easy extension
  • 📊 Batch Processing: Process entire directories of files
  • 🎨 Visual Output: Generate images at each pipeline stage
  • 🔍 Interactive Visualization: Streamlit app for exploring results
  • ✅ Test-Driven Development: Comprehensive test suite
  • 🐍 Python 3.11: Required for library compatibility

Quick Start

1. Environment Setup

# Create and activate conda environment
conda create -n synth_go python=3.11 -y
conda activate synth_go

# Install dependencies
pip install -r requirements.txt

2. Run Example Pipeline

# Process sample data
python src/run_pipeline.py initial final --input assets/sandbox

# Launch visualization app
cd src/vis_app
python run_app.py
# Open browser to http://localhost:8501

3. Complete Demo

# Run everything in one go
python demo_streamlit.py

Architecture

Core Components

  • Abstract Pipeline Module (src/abstract_pipeline_module.py), sketched after this list

    • Generic base class for all pipeline stages
    • Type-safe input/output handling
    • Automatic output directory management
  • Pipeline Runner (src/run_pipeline.py)

    • Orchestrates multi-stage processing
    • Handles file/folder input
    • Validates type compatibility between stages
    • Generates comprehensive output metadata
  • Visualization App (src/vis_app/)

    • Interactive Streamlit interface
    • Side-by-side stage comparison
    • Pipeline run management
    • Raw data inspection
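
A minimal sketch of what this generic base class could look like, assuming the four abstract methods shown in the extension example later in this README (the constructor signature and the run() driver are assumptions; the actual class in src/abstract_pipeline_module.py may differ):

from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Generic, TypeVar

InputType = TypeVar("InputType")
OutputType = TypeVar("OutputType")


class AbstractPipelineModule(ABC, Generic[InputType, OutputType]):
    # Sketch only; see src/abstract_pipeline_module.py for the real class.

    def __init__(self, output_dir: Path):
        self.output_dir = Path(output_dir)
        # Automatic output directory management
        self.output_dir.mkdir(parents=True, exist_ok=True)

    @abstractmethod
    def parse_input_data(self, input_data: InputType) -> Any: ...

    @abstractmethod
    def process_data(self, parsed_data: Any) -> Any: ...

    @abstractmethod
    def visualize_output_data(self, processed_data: Any, filename: str) -> None: ...

    @abstractmethod
    def construct_output_data(self, processed_data: Any) -> OutputType: ...

    def run(self, input_data: InputType, filename: str) -> OutputType:
        # Run the four steps for a single datapoint, in order.
        parsed = self.parse_input_data(input_data)
        processed = self.process_data(parsed)
        self.visualize_output_data(processed, filename)
        return self.construct_output_data(processed)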

Example Implementation

Two sandbox modules demonstrate the framework (simplified sketches follow this list):

  1. Initial Module: File → String

    • Reads first character from text file
    • Duplicates character
    • Generates text image
  2. Final Module: String → Integer

    • Converts character to ASCII code
    • Generates numeric image
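
Simplified sketches of these two modules, assuming the AbstractPipelineModule base class sketched above (the actual implementations in src/sandbox_pipeline_modules/ may render the images differently):

from pathlib import Path

from PIL import Image, ImageDraw


class InitialModule(AbstractPipelineModule[Path, str]):
    def parse_input_data(self, input_data: Path) -> str:
        # First character of the input text file
        return input_data.read_text()[0]

    def process_data(self, parsed_data: str) -> str:
        # Duplicate the character
        return parsed_data * 2

    def visualize_output_data(self, processed_data: str, filename: str) -> None:
        image = Image.new("RGB", (120, 60), "white")
        ImageDraw.Draw(image).text((10, 20), processed_data, fill="black")
        image.save(self.output_dir / f"{filename}_initial.png")

    def construct_output_data(self, processed_data: str) -> str:
        return processed_data


class FinalModule(AbstractPipelineModule[str, int]):
    def parse_input_data(self, input_data: str) -> str:
        return input_data[0]

    def process_data(self, parsed_data: str) -> int:
        # Character -> ASCII code
        return ord(parsed_data)

    def visualize_output_data(self, processed_data: int, filename: str) -> None:
        image = Image.new("RGB", (120, 60), "white")
        ImageDraw.Draw(image).text((10, 20), str(processed_data), fill="black")
        image.save(self.output_dir / f"{filename}_final.png")

    def construct_output_data(self, processed_data: int) -> int:
        return processed_data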

Project Structure

synth_go/
├── assets/sandbox/           # Sample input data
├── src/
│   ├── abstract_pipeline_module.py  # Base classes
│   ├── run_pipeline.py              # Pipeline orchestrator
│   ├── pipeline_modules.py          # Module registry
│   ├── sandbox_pipeline_modules/    # Example implementations
│   └── vis_app/                     # Streamlit visualization
├── tests/                    # Test suite
├── output/                   # Generated pipeline results
└── requirements.txt          # Dependencies

Extending the Framework

Creating New Pipeline Modules

  1. Inherit from AbstractPipelineModule:
from typing import Any

from abstract_pipeline_module import AbstractPipelineModule  # adjust the import to your project layout


# Replace InputType/OutputType with the concrete types this stage consumes and produces (e.g. str, int)
class MyModule(AbstractPipelineModule[InputType, OutputType]):
    def parse_input_data(self, input_data: InputType) -> Any:
        # Parse and validate the raw input
        pass

    def process_data(self, parsed_data: Any) -> Any:
        # Core processing logic
        pass

    def visualize_output_data(self, processed_data: Any, filename: str) -> None:
        # Generate and save a visualization image for this stage
        pass

    def construct_output_data(self, processed_data: Any) -> OutputType:
        # Format the output passed on to the next stage
        pass
  2. Register in pipeline_modules.py:
PIPELINE_MODULES = {
    "my_module": MyModule,
    # ... other modules
}
  3. Use in pipelines:
python src/run_pipeline.py my_module other_module --input data/

Type Safety

The framework enforces type compatibility between pipeline stages (see the sketch after this list):

  • Output type of stage N must match input type of stage N+1
  • Validation occurs before pipeline execution
  • Clear error messages for type mismatches
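
A sketch of how such a check could be implemented, assuming each module class subscripts AbstractPipelineModule[InputType, OutputType] as in the examples above (the actual validation in src/run_pipeline.py may work differently):

import typing


def validate_stage_compatibility(module_classes):
    # Check that each stage's output type matches the next stage's input type.
    def io_types(cls):
        # Read the (InputType, OutputType) pair from the subscripted base class.
        for base in getattr(cls, "__orig_bases__", ()):
            args = typing.get_args(base)
            if len(args) == 2:
                return args
        raise TypeError(f"{cls.__name__} does not declare input/output types")

    for current, nxt in zip(module_classes, module_classes[1:]):
        _, out_type = io_types(current)
        in_type, _ = io_types(nxt)
        if out_type is not in_type:
            raise TypeError(
                f"{current.__name__} outputs {out_type}, "
                f"but {nxt.__name__} expects {in_type}"
            )

For example, validate_stage_compatibility([InitialModule, FinalModule]) passes, while reversing the order raises a TypeError before any data is processed.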

Output Structure

Each pipeline run generates:

output/pipeline_runs/<run_id>/
├── <module_1>/
│   └── <datapoint_id>_<module_1>.png
├── <module_2>/
│   └── <datapoint_id>_<module_2>.png
└── run_data.json  # Complete metadata

The run_data.json contains:

  • Pipeline stage sequence
  • Datapoint mappings
  • File paths for all generated images
  • Unique identifiers for reproducibility
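
A finished run can be inspected with a small snippet like the following (the run ID below is a placeholder; the exact keys inside run_data.json depend on the pipeline runner):

import json
from pathlib import Path

# Point at one pipeline run; run IDs are generated, so replace the placeholder.
run_dir = Path("output/pipeline_runs") / "<run_id>"
run_data = json.loads((run_dir / "run_data.json").read_text())

# The top-level keys hold the stage sequence, datapoint mappings and image
# paths described above; print them to see what was recorded.
print(list(run_data.keys()))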

Testing

Comprehensive test suite covering:

  • Individual module functionality
  • Pipeline orchestration
  • File I/O operations
  • Error handling
  • Output validation
# Run all tests
conda activate synth_go
python -m unittest discover tests -v
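
As an illustration, a module-level test could look like the following minimal sketch (FinalModule and its constructor argument come from the earlier sketches and are assumptions, not necessarily the real API):

import unittest


class TestFinalModule(unittest.TestCase):
    def test_char_to_ascii(self):
        # Assumes the FinalModule sketch above: "A" -> 65
        module = FinalModule(output_dir="output/test")  # constructor signature is an assumption
        processed = module.process_data(module.parse_input_data("AA"))
        self.assertEqual(module.construct_output_data(processed), 65)


if __name__ == "__main__":
    unittest.main()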

Dependencies

  • Python 3.11: Required for library compatibility
  • Pillow: Image generation and processing
  • Streamlit: Interactive visualization interface
  • Standard Library: pathlib, json, uuid, argparse

Use Cases

This framework is ideal for:

  • Data Processing Pipelines: Multi-stage data transformation
  • Computer Vision: Image processing workflows
  • Research Projects: Reproducible analysis pipelines
  • Prototyping: Rapid development of processing chains
  • Education: Learning pipeline design patterns

Contributing

  1. Follow Test-Driven Development practices
  2. Maintain type safety throughout
  3. Add comprehensive docstrings
  4. Update tests for new functionality
  5. Ensure Python 3.11 compatibility

License

Open source - feel free to adapt for your specific use cases.
