A flexible, extensible framework for building modular data processing pipelines with visualization capabilities.
This project provides a generic foundation for creating data processing pipelines that:
- Process input data through configurable pipeline stages
- Generate visual outputs at each stage
- Support batch processing of multiple files
- Include comprehensive testing and visualization tools
Key features:

- 🔧 Modular Architecture: Abstract base classes for easy extension
- 📊 Batch Processing: Process entire directories of files
- 🎨 Visual Output: Generate images at each pipeline stage
- 🔍 Interactive Visualization: Streamlit app for exploring results
- ✅ Test-Driven Development: Comprehensive test suite
- 🐍 Python 3.11: Pinned for library compatibility
Quick start:

```bash
# Create and activate conda environment
conda create -n synth_go python=3.11 -y
conda activate synth_go

# Install dependencies
pip install -r requirements.txt

# Process sample data
python src/run_pipeline.py initial final --input assets/sandbox

# Launch visualization app
cd src/vis_app
python run_app.py
# Open browser to http://localhost:8501

# Run everything in one go
python demo_streamlit.py
```
The framework has three core components:

- Abstract Pipeline Module (`src/abstract_pipeline_module.py`)
  - Generic base class for all pipeline stages
  - Type-safe input/output handling
  - Automatic output directory management
- Pipeline Runner (`src/run_pipeline.py`)
  - Orchestrates multi-stage processing
  - Handles file and folder input
  - Validates type compatibility between stages
  - Generates comprehensive output metadata
- Visualization App (`src/vis_app/`)
  - Interactive Streamlit interface
  - Side-by-side stage comparison
  - Pipeline run management
  - Raw data inspection
Two sandbox modules demonstrate the framework:
- Initial Module: File → String
  - Reads the first character from a text file
  - Duplicates the character
  - Generates a text image
- Final Module: String → Integer
  - Converts the character to its ASCII code
  - Generates a numeric image
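A module can be only a handful of methods. Here is a rough sketch of what the Final Module's logic might look like, following the base-class template shown below (illustrative only; the actual implementation lives in `src/sandbox_pipeline_modules/`, and the import path is assumed):

```python
from abstract_pipeline_module import AbstractPipelineModule  # import path assumed

class FinalModule(AbstractPipelineModule[str, int]):
    def parse_input_data(self, input_data: str) -> str:
        # The upstream Initial Module emits a short string.
        return input_data

    def process_data(self, parsed_data: str) -> int:
        # Convert the character to its ASCII code.
        return ord(parsed_data[0])

    def visualize_output_data(self, processed_data: int, filename: str) -> None:
        # Render the number as an image (e.g. with Pillow).
        ...

    def construct_output_data(self, processed_data: int) -> int:
        return processed_data
```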
Project layout:

```text
synth_go/
├── assets/sandbox/                  # Sample input data
├── src/
│   ├── abstract_pipeline_module.py  # Base classes
│   ├── run_pipeline.py              # Pipeline orchestrator
│   ├── pipeline_modules.py          # Module registry
│   ├── sandbox_pipeline_modules/    # Example implementations
│   └── vis_app/                     # Streamlit visualization
├── tests/                           # Test suite
├── output/                          # Generated pipeline results
└── requirements.txt                 # Dependencies
```
- Inherit from `AbstractPipelineModule`:

```python
from typing import Any

from abstract_pipeline_module import AbstractPipelineModule  # adjust to your layout

# InputType and OutputType are placeholders: substitute the concrete
# types your module consumes and produces (e.g. str, int).
class MyModule(AbstractPipelineModule[InputType, OutputType]):
    def parse_input_data(self, input_data: InputType) -> Any:
        # Process input
        pass

    def process_data(self, parsed_data: Any) -> Any:
        # Core processing logic
        pass

    def visualize_output_data(self, processed_data: Any, filename: str) -> None:
        # Generate visualization
        pass

    def construct_output_data(self, processed_data: Any) -> OutputType:
        # Format output
        pass
```

- Register it in `pipeline_modules.py`:

```python
PIPELINE_MODULES = {
    "my_module": MyModule,
    # ... other modules
}
```

- Use it in pipelines:

```bash
python src/run_pipeline.py my_module other_module --input data/
```

The framework enforces type compatibility between pipeline stages:
- Output type of stage N must match input type of stage N+1
- Validation occurs before pipeline execution
- Clear error messages for type mismatches
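A minimal sketch of how such a pre-flight check can work (hypothetical `validate_pipeline` helper, assuming each module class exposes `input_type` and `output_type` attributes; the actual check lives in `src/run_pipeline.py` and may differ):

```python
def validate_pipeline(stages: list[type]) -> None:
    # Compare each stage's declared output type with the next
    # stage's declared input type before any data is processed.
    for current, nxt in zip(stages, stages[1:]):
        if current.output_type is not nxt.input_type:
            raise TypeError(
                f"{current.__name__} outputs {current.output_type.__name__}, "
                f"but {nxt.__name__} expects {nxt.input_type.__name__}"
            )
```

Under these assumptions, `validate_pipeline([InitialModule, FinalModule])` passes (String → String), while the reversed order fails with a descriptive error before the pipeline runs.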
Each pipeline run generates:
```text
output/pipeline_runs/<run_id>/
├── <module_1>/
│   └── <datapoint_id>_<module_1>.png
├── <module_2>/
│   └── <datapoint_id>_<module_2>.png
└── run_data.json   # Complete metadata
```
The `run_data.json` file contains:
- Pipeline stage sequence
- Datapoint mappings
- File paths for all generated images
- Unique identifiers for reproducibility
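An illustrative `run_data.json` for a two-stage run (field names are assumptions based on the contents listed above, not the exact schema):

```json
{
  "run_id": "3f2a9c1e-0b7d-4a5e-9f1c-2d8e6b4a7c3f",
  "pipeline": ["initial", "final"],
  "datapoints": {
    "d41e2c": {
      "initial": "output/pipeline_runs/<run_id>/initial/d41e2c_initial.png",
      "final": "output/pipeline_runs/<run_id>/final/d41e2c_final.png"
    }
  }
}
```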
The project includes a comprehensive test suite covering:
- Individual module functionality
- Pipeline orchestration
- File I/O operations
- Error handling
- Output validation
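For example, a module-level test might look like this (hypothetical import path and assertion; the real tests live in `tests/`):

```python
import unittest

# Import path assumed; the concrete module sits under src/sandbox_pipeline_modules/.
from sandbox_pipeline_modules.final_module import FinalModule

class TestFinalModule(unittest.TestCase):
    def test_converts_character_to_ascii(self):
        module = FinalModule()
        # "A" has ASCII code 65.
        self.assertEqual(module.process_data("A"), 65)

if __name__ == "__main__":
    unittest.main()
```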
```bash
# Run all tests
conda activate synth_go
python -m unittest discover tests -v
```

Dependencies:

- Python 3.11: Required for library compatibility
- Pillow: Image generation and processing
- Streamlit: Interactive visualization interface
- Standard Library: pathlib, json, uuid, argparse
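For reference, the third-party portion of `requirements.txt` can be as small as this (versions unpinned here for illustration; the actual file may pin them):

```text
Pillow
streamlit
```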
This framework is ideal for:
- Data Processing Pipelines: Multi-stage data transformation
- Computer Vision: Image processing workflows
- Research Projects: Reproducible analysis pipelines
- Prototyping: Rapid development of processing chains
- Education: Learning pipeline design patterns
Contributions should:

- Follow Test-Driven Development practices
- Maintain type safety throughout
- Add comprehensive docstrings
- Update tests for new functionality
- Ensure Python 3.11 compatibility
Open source - feel free to adapt for your specific use cases.