Data preprocessing and post-processing scripts and notebooks for model customization, plus an example KFP pipeline for Docling

shruthis4/odh-data-processing

ODH Data Processing

Status: dev-preview

This repository provides reference data-processing pipelines and examples for Open Data Hub / Red Hat OpenShift AI. It focuses on document conversion and chunking using the Docling toolkit, packaged as Kubeflow Pipelines (KFP), example Jupyter Notebooks, and helper scripts.

The workbenches directory also provides a guide on how to create a custom workbench image to run Docling and the example notebooks in this repository.
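
As a rough illustration of the conversion-and-chunking flow described above, here is a minimal sketch that uses Docling's Python API directly. It is not taken from this repository's pipelines or notebooks: the helper names `markdown_path_for` and `convert_and_chunk`, and the `output` directory layout, are hypothetical, and the sketch assumes the `docling` package is installed (the default `HybridChunker` may also download a tokenizer on first use).

```python
from pathlib import Path


def markdown_path_for(source: str, out_dir: str = "output") -> Path:
    """Hypothetical layout: <out_dir>/<input file stem>.md."""
    return Path(out_dir) / (Path(source).stem + ".md")


def convert_and_chunk(source: str, out_dir: str = "output") -> list[str]:
    """Convert one document to Markdown on disk and return its text chunks."""
    # Imported lazily so this module still loads where Docling is not installed.
    from docling.chunking import HybridChunker
    from docling.document_converter import DocumentConverter

    # Convert the source (a local path or URL) into a Docling document.
    doc = DocumentConverter().convert(source).document

    # Write the Markdown export next to the other converted files.
    target = markdown_path_for(source, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(doc.export_to_markdown(), encoding="utf-8")

    # Split the converted document into chunks with default settings.
    chunker = HybridChunker()
    return [chunk.text for chunk in chunker.chunk(dl_doc=doc)]
```

Calling `convert_and_chunk("report.pdf")` would, under these assumptions, write `output/report.md` and return the chunk texts. The Standard and VLM pipelines in this repository may configure the converter and chunker differently.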

📦 Repository Structure

odh-data-processing
|
|- kubeflow-pipelines
|   |- docling-standard
|   |- docling-vlm
|
|- notebooks
|   |- tutorials
|   |- use-cases
|
|- custom-workbench-image

✨ Getting Started

Kubeflow Pipelines

Refer to the ODH Data Processing Kubeflow Pipelines documentation for instructions on how to install, run, and customize the Standard and VLM pipelines.

🤝 Contributing

We welcome issues and pull requests. Please:

  • Open an issue describing the change.
  • Include testing instructions.
  • For pipeline/component changes, recompile the pipeline and update generated YAML if applicable.
  • Keep parameter names and docs consistent between code and README.

Quality & CI

This repo enforces Python style and clean notebooks via pre-commit and a GitHub Actions workflow.

What runs:

  • Ruff (lint, autofixes)
  • Black (format)
  • isort (import order, Black profile)
  • nbstripout (removes Jupyter outputs)
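
To make the "clean notebooks" check concrete, the sketch below shows what removing outputs means for a notebook's JSON. It is an illustration only, not nbstripout's actual implementation (the real tool handles more, such as selected metadata), and the function names `strip_outputs` and `has_outputs` are hypothetical.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 notebook dict with code-cell outputs cleared."""
    cleaned = json.loads(json.dumps(notebook))  # deep copy via a JSON round trip
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def has_outputs(notebook: dict) -> bool:
    """True if any code cell still carries outputs or an execution count."""
    return any(
        cell.get("cell_type") == "code"
        and (bool(cell.get("outputs")) or cell.get("execution_count") is not None)
        for cell in notebook.get("cells", [])
    )
```

A notebook for which `has_outputs` returns `True` is the kind of file the nbstripout hook would rewrite before a commit.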

Where it runs:

  • On every Pull Request
  • Once after each merge to main (final validation)

Quick start (local):

pip install pre-commit
pre-commit install               # installs the git hook
pre-commit run --all-files       # run all checks on the repo

📄 License

Apache License 2.0