A flexible, extensible framework for building modular data processing pipelines with visualization capabilities.
This project provides a generic foundation for creating data processing pipelines that:
- Process input data through configurable pipeline stages
- Generate visual outputs at each stage
- Support batch processing of multiple files
- Include comprehensive testing and visualization tools
Key features:

- 🔧 Modular Architecture: Abstract base classes for easy extension
- 📊 Batch Processing: Process entire directories of files
- 🎨 Visual Output: Generate images at each pipeline stage
- 🔍 Interactive Visualization: Streamlit app for exploring results
- ✅ Test-Driven Development: Comprehensive test suite
- 🐍 Python 3.11: Pinned for library compatibility
Quick start:

```bash
# Create and activate conda environment
conda create -n synth_go python=3.11 -y
conda activate synth_go

# Install dependencies
pip install -r requirements.txt

# Process sample data
python src/run_pipeline.py initial final --input assets/sandbox

# Launch visualization app
cd src/vis_app
python run_app.py
# Open browser to http://localhost:8501

# Run everything in one go
python demo_streamlit.py
```
The framework has three core components:

- Abstract Pipeline Module (`src/abstract_pipeline_module.py`)
  - Generic base class for all pipeline stages
  - Type-safe input/output handling
  - Automatic output directory management
- Pipeline Runner (`src/run_pipeline.py`)
  - Orchestrates multi-stage processing
  - Handles file and folder input
  - Validates type compatibility between stages
  - Generates comprehensive output metadata
- Visualization App (`src/vis_app/`)
  - Interactive Streamlit interface
  - Side-by-side stage comparison
  - Pipeline run management
  - Raw data inspection
Two sandbox modules demonstrate the framework:
- Initial Module: File → String
  - Reads the first character from a text file
  - Duplicates the character
  - Generates a text image
- Final Module: String → Integer
  - Converts the character to its ASCII code
  - Generates a numeric image
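A module can be only a handful of methods. Here is a rough sketch of what the Final Module's logic might look like, following the base-class template shown below (illustrative only; the actual implementation lives in `src/sandbox_pipeline_modules/`, and the import path is assumed):

```python
from abstract_pipeline_module import AbstractPipelineModule  # import path assumed

class FinalModule(AbstractPipelineModule[str, int]):
    def parse_input_data(self, input_data: str) -> str:
        # The upstream Initial Module emits a short string.
        return input_data

    def process_data(self, parsed_data: str) -> int:
        # Convert the character to its ASCII code.
        return ord(parsed_data[0])

    def visualize_output_data(self, processed_data: int, filename: str) -> None:
        # Render the number as an image (e.g. with Pillow).
        ...

    def construct_output_data(self, processed_data: int) -> int:
        return processed_data
```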
Project layout:

```text
synth_go/
├── assets/sandbox/                  # Sample input data
├── src/
│   ├── abstract_pipeline_module.py  # Base classes
│   ├── run_pipeline.py              # Pipeline orchestrator
│   ├── pipeline_modules.py          # Module registry
│   ├── sandbox_pipeline_modules/    # Example implementations
│   └── vis_app/                     # Streamlit visualization
├── tests/                           # Test suite
├── output/                          # Generated pipeline results
└── requirements.txt                 # Dependencies
```
- Inherit from `AbstractPipelineModule`:

```python
from typing import Any

from abstract_pipeline_module import AbstractPipelineModule  # adjust to your layout

# InputType and OutputType are placeholders: substitute the concrete
# types your module consumes and produces (e.g. str, int).
class MyModule(AbstractPipelineModule[InputType, OutputType]):
    def parse_input_data(self, input_data: InputType) -> Any:
        # Process input
        pass

    def process_data(self, parsed_data: Any) -> Any:
        # Core processing logic
        pass

    def visualize_output_data(self, processed_data: Any, filename: str) -> None:
        # Generate visualization
        pass

    def construct_output_data(self, processed_data: Any) -> OutputType:
        # Format output
        pass
```

- Register it in `pipeline_modules.py`:

```python
PIPELINE_MODULES = {
    "my_module": MyModule,
    # ... other modules
}
```

- Use it in pipelines:

```bash
python src/run_pipeline.py my_module other_module --input data/
```

The framework enforces type compatibility between pipeline stages:
- Output type of stage N must match input type of stage N+1
- Validation occurs before pipeline execution
- Clear error messages for type mismatches
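A minimal sketch of how such a pre-flight check can work (hypothetical `validate_pipeline` helper, assuming each module class exposes `input_type` and `output_type` attributes; the actual check lives in `src/run_pipeline.py` and may differ):

```python
def validate_pipeline(stages: list[type]) -> None:
    # Compare each stage's declared output type with the next
    # stage's declared input type before any data is processed.
    for current, nxt in zip(stages, stages[1:]):
        if current.output_type is not nxt.input_type:
            raise TypeError(
                f"{current.__name__} outputs {current.output_type.__name__}, "
                f"but {nxt.__name__} expects {nxt.input_type.__name__}"
            )
```

Under these assumptions, `validate_pipeline([InitialModule, FinalModule])` passes (String → String), while the reversed order fails with a descriptive error before the pipeline runs.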
Each pipeline run generates:
```text
output/pipeline_runs/<run_id>/
├── <module_1>/
│   └── <datapoint_id>_<module_1>.png
├── <module_2>/
│   └── <datapoint_id>_<module_2>.png
└── run_data.json   # Complete metadata
```
The `run_data.json` file contains:
- Pipeline stage sequence
- Datapoint mappings
- File paths for all generated images
- Unique identifiers for reproducibility
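An illustrative `run_data.json` for a two-stage run (field names are assumptions based on the contents listed above, not the exact schema):

```json
{
  "run_id": "3f2a9c1e-0b7d-4a5e-9f1c-2d8e6b4a7c3f",
  "pipeline": ["initial", "final"],
  "datapoints": {
    "d41e2c": {
      "initial": "output/pipeline_runs/<run_id>/initial/d41e2c_initial.png",
      "final": "output/pipeline_runs/<run_id>/final/d41e2c_final.png"
    }
  }
}
```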
The project includes a comprehensive test suite covering:
- Individual module functionality
- Pipeline orchestration
- File I/O operations
- Error handling
- Output validation
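For example, a module-level test might look like this (hypothetical import path and assertion; the real tests live in `tests/`):

```python
import unittest

# Import path assumed; the concrete module sits under src/sandbox_pipeline_modules/.
from sandbox_pipeline_modules.final_module import FinalModule

class TestFinalModule(unittest.TestCase):
    def test_converts_character_to_ascii(self):
        module = FinalModule()
        # "A" has ASCII code 65.
        self.assertEqual(module.process_data("A"), 65)

if __name__ == "__main__":
    unittest.main()
```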
```bash
# Run all tests
conda activate synth_go
python -m unittest discover tests -v
```

Dependencies:

- Python 3.11: Required for library compatibility
- Pillow: Image generation and processing
- Streamlit: Interactive visualization interface
- Standard Library: pathlib, json, uuid, argparse
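For reference, the third-party portion of `requirements.txt` can be as small as this (versions unpinned here for illustration; the actual file may pin them):

```text
Pillow
streamlit
```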
This framework is ideal for:
- Data Processing Pipelines: Multi-stage data transformation
- Computer Vision: Image processing workflows
- Research Projects: Reproducible analysis pipelines
- Prototyping: Rapid development of processing chains
- Education: Learning pipeline design patterns
Contributions should:

- Follow Test-Driven Development practices
- Maintain type safety throughout
- Add comprehensive docstrings
- Update tests for new functionality
- Ensure Python 3.11 compatibility
Open source - feel free to adapt for your specific use cases.