Pachyderm: Data Versioning, Data Pipelines, and Data Lineage

Pachyderm is a tool for version-controlled, automated, end-to-end data pipelines for data science. If you need to chain together data scraping, ingestion, cleaning, munging, wrangling, processing, modeling, and analysis in a sane way, while ensuring the traceability and provenance of your data, Pachyderm is for you. If you have an existing set of scripts which do this in an ad-hoc fashion and you're looking for a way to "productionize" them, Pachyderm can make this easy for you.

Features

Containerized: Pachyderm is built on Docker and Kubernetes. Whatever languages or libraries your pipeline needs, they can run on Pachyderm, which can easily be deployed on any cloud provider or on-premises.
Version Control: Pachyderm version-controls your data as it's processed. You can always ask the system how data has changed, see a diff, and, if something doesn't look right, revert.
Provenance (aka data lineage): Pachyderm tracks where data comes from by keeping a record of all the code and data that created a result.
Parallelization: Pachyderm can efficiently schedule massively parallel workloads.
Incremental Processing: Pachyderm understands how your data has changed and is smart enough to only process the new data.
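
To make the incremental processing idea concrete, here is a minimal sketch of one common way to process only new or changed data: compare a content hash of each input against the hash recorded on the previous run. This is an illustration of the general technique only, not Pachyderm's implementation or API; the names fingerprint, process_incrementally, and seen are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a content hash used to detect whether a file has changed."""
    return hashlib.sha256(data).hexdigest()


def process_incrementally(files, seen, process):
    """Run `process` only on files whose content hash is new or changed.

    files:   mapping of path -> bytes (the current input snapshot)
    seen:    mapping of path -> hash from the previous run (updated in place)
    process: callable applied to the bytes of each new or changed file
    Returns a dict of path -> result for the files that were processed.
    """
    results = {}
    for path, data in sorted(files.items()):
        digest = fingerprint(data)
        if seen.get(path) == digest:
            continue  # unchanged since the last run, so skip it
        results[path] = process(data)
        seen[path] = digest
    return results


if __name__ == "__main__":
    seen = {}  # hypothetical record of what earlier runs already handled
    first = process_incrementally({"a.csv": b"1,2", "b.csv": b"3,4"}, seen, len)
    second = process_incrementally(
        {"a.csv": b"1,2", "b.csv": b"3,4,5", "c.csv": b"6"}, seen, len
    )
    print(sorted(first))   # both files are new on the first run
    print(sorted(second))  # only the changed file and the added file
```

On the second call, a.csv is skipped because its hash matches the recorded one, while b.csv (changed) and c.csv (new) are processed.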

Getting Started

To start deploying your end-to-end version-controlled data pipelines, try us for free on Hub with little to no setup or run Pachyderm locally. You can also deploy on AWS/GCE/Azure in about 5 minutes.

You can also refer to our complete documentation to see tutorials, check out example projects, and learn about advanced features of Pachyderm.

If you'd like to see some examples and learn about core use cases for Pachyderm:

Documentation

Official Documentation

Community

Keep up to date and get Pachyderm support:

Follow us on Twitter.
Join our community Slack channel to get help from the Pachyderm team and other users.

Contributing

To get started, sign the Contributor License Agreement.

You should also check out our contributing guide.

Send us PRs; we would love to see what you do! You can also check our GitHub issues for things labeled "help-wanted" as a good place to start. We're sometimes bad about keeping that label up to date, so if you don't see any, just let us know.

Join Us

WE'RE HIRING! Love Docker, Go, and distributed systems? Learn more about our open positions.

Usage Metrics

Pachyderm automatically reports anonymized usage metrics. These metrics help us understand how people are using Pachyderm and make it better. They can be disabled by setting the environment variable METRICS to false in the pachd container.
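
As a rough sketch of what that setting could look like in a Kubernetes deployment, the snippet below builds a strategic merge patch that sets METRICS to false on a container named pachd. The helper name env_patch is hypothetical, and the assumption that pachd runs as a container inside a Kubernetes Deployment should be checked against your own installation.

```python
import json


def env_patch(container: str, name: str, value: str) -> dict:
    """Build a strategic merge patch that sets one env var on one container."""
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": container, "env": [{"name": name, "value": value}]}
                    ]
                }
            }
        }
    }


if __name__ == "__main__":
    # The container name "pachd" and the variable METRICS come from the text
    # above; how the patch is applied depends on your deployment tooling.
    print(json.dumps(env_patch("pachd", "METRICS", "false"), indent=2))
```

The printed JSON could then be handed to a command such as kubectl patch deployment <name> --patch '<json>', where <name> is whatever your pachd Deployment is called.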

License Information

Pachyderm has moved some components of Pachyderm Platform to a source-available limited license.

We remain committed to the culture of open source, developing our product transparently and collaboratively with our community, and giving our community and customers source code access and the ability to study and change the software to suit their needs.

Under the Pachyderm Community License, you can access the source code and modify or redistribute it; there is only one thing you cannot do, and that is use it to make a competing offering.

Check out our License FAQ Page for more information.
