# Overview
Pachyderm is a data version control system that runs on Kubernetes. Like most VCS implementations, it uses a client-server model and provides a CLI for manipulating repositories.

In this article we review the basic concepts and architecture.

# Basic Concepts

Pachyderm is quite robust and can be used in a number of ways:

- We can mount the Pachyderm file system locally and manage it through the CLI like we would a git repository
- We can use Pachyderm pipelines to automate data processing and automatically version our data

Note: The UI is not designed to manipulate data and there is no CRUD functionality.

Before looking at how Pachyderm works in practice, let's understand the basic concepts.

## The Pachyderm File System (PFS)

The Pachyderm File System (PFS) is not a single component but a set of abstractions (repositories, commits, branches) which together define the version control functionality. The Pachyderm application orchestrates and manages the data through these abstractions, which are ultimately stored in Postgres and S3.

## Repository

A [Repository](https://docs.pachyderm.com/latest/concepts/data-concepts/repo/) is a logical container for data. It is conceptually similar to a git repository. The repository stores our data and tracks the changes made to it.


## Versioning

Data versioning (History) enables Pachyderm users to go back in time and see the state of a dataset or repository at a particular moment.

## Provenance (Lineage)

Data provenance (from the French noun provenance which means the place of origin), also known as data lineage, tracks the dependencies and relationships between datasets. It answers the question “Where does the data come from?”, but also “How was the data transformed along the way?”.

Pachyderm provides provenance through its pipeline orchestration. It assumes all data is tracked and transformed through Pachyderm (or its 3rd party integrations).

## Commit

A [commit](https://docs.pachyderm.com/latest/concepts/data-concepts/commit/) in Pachyderm is created automatically whenever data is added to or deleted from a repository. Each commit is an atomic operation which preserves the state of all files in the repository at the time of the commit, similar to a snapshot. Each commit is uniquely identifiable by a UUID and is immutable, meaning that the source data can never change.

You can start a commit by running the `pachctl start commit` command with reference to a specific repository. After you’re done making changes to the repository (put file, delete file, …), you can finish your modifications by running the `pachctl finish commit` command. This command saves your changes and closes that repository’s commit, indicating the data is ready for processing by downstream pipelines.
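
The start/modify/finish sequence can be sketched as the list of CLI invocations it corresponds to. This is a minimal sketch that only builds and prints the commands rather than running them; the repository `data`, the branch `master`, and the file names are hypothetical placeholders.

```python
import shlex


def commit_commands(repo, branch, local_file, dest_path):
    """Return the pachctl invocations for one explicit commit, in order."""
    target = f"{repo}@{branch}"
    return [
        # Open a commit on the branch.
        ["pachctl", "start", "commit", target],
        # Add a local file to the open commit.
        ["pachctl", "put", "file", f"{target}:{dest_path}", "-f", local_file],
        # Close the commit so downstream pipelines can process it.
        ["pachctl", "finish", "commit", target],
    ]


# Hypothetical repo, branch, and file names, used only for illustration.
for command in commit_commands("data", "master", "notes.txt", "/notes.txt"):
    print(shlex.join(command))
```

Each inner list could be passed to `subprocess.run` on a machine where `pachctl` is installed and connected to a cluster.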

## Branch

A Pachyderm branch is a pointer to a commit that moves along with new commits as they are submitted. The diagram below shows an example of the pointer being updated:

<center><img src="images/pachyderm-branching.png"></center>
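
To make the relationship between commits and branches concrete, here is a toy in-memory model. It is not how Pachyderm is implemented; `ToyRepo` and its methods are hypothetical names, and the only ideas taken from the text are that commits are immutable snapshots identified by a UUID and that a branch is a pointer which moves to each new commit.

```python
import uuid


class ToyRepo:
    """Toy model: commits are immutable snapshots, branches are pointers."""

    def __init__(self):
        self.commits = {}   # commit id -> (parent id, frozen snapshot of files)
        self.branches = {}  # branch name -> commit id

    def commit(self, branch, changes):
        """Apply changes on top of the branch head and move the branch pointer."""
        parent = self.branches.get(branch)
        files = dict(self.commits[parent][1]) if parent else {}
        for path, content in changes.items():
            if content is None:
                files.pop(path, None)  # None marks a deletion
            else:
                files[path] = content
        commit_id = uuid.uuid4().hex
        # Store the snapshot as a tuple so it cannot be mutated afterwards.
        self.commits[commit_id] = (parent, tuple(sorted(files.items())))
        self.branches[branch] = commit_id
        return commit_id

    def files_at(self, commit_id):
        """Return the full file snapshot recorded by a commit."""
        return dict(self.commits[commit_id][1])


repo = ToyRepo()
first = repo.commit("master", {"/a.txt": "v1"})
second = repo.commit("master", {"/a.txt": "v2", "/b.txt": "new"})
print(repo.branches["master"] == second)  # the pointer now references the newest commit
print(repo.files_at(first))               # the older snapshot is still intact
```

Because every commit keeps its own snapshot, looking up an older commit id is all it takes to "go back in time" in this model.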

In Pachyderm, true merging is not implemented. A user must perform the merge by manually diffing and merging files in a branch, creating a new commit with the merge result, and then updating the branch pointer to point to that new commit. The documentation gives a rationale for this design decision: the process of merging binary data is data specific and in some cases does not make sense. For that reason, the implementation is left to the user:

> The concept of merging binary data from different commits is complex. Ultimately, there are too many edge cases to do it reliably for every type of binary data, because computing a diff between two commits is ultimately meaningless unless you know how to compare the data. For example, we know that text files can be compared line-by-line or a bitmap image pixel by pixel, but how would we compute a diff for, say, binary model files?
> 
> Additionally, the output of a merge is usually a master copy, the official set of files desired. We rarely combine multiple pieces of image data to make one image, and if we are, we have usually created a technique for doing so. In the end, some files will be deleted, some updated, and some added.
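
Since the merge is left to the user, one possible user-side approach is a three-way merge over file snapshots with a data-specific conflict resolver. The sketch below is an assumption about how such a helper could look, not something Pachyderm provides; `merge_snapshots` and the `resolve` callback are hypothetical names, and snapshots are modelled as plain `{path: bytes}` dictionaries.

```python
def merge_snapshots(base, ours, theirs, resolve):
    """Three-way merge of {path: bytes} snapshots.

    `resolve(path, base, ours, theirs)` is called only for real conflicts and
    returns the merged content, or None to drop the file.
    """
    merged = {}
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o  # both sides agree (including both deleted)
        elif o == b:
            result = t  # only "theirs" changed the file
        elif t == b:
            result = o  # only "ours" changed the file
        else:
            result = resolve(path, b, o, t)  # data-specific decision
        if result is not None:
            merged[path] = result
    return merged


base = {"/a": b"1", "/b": b"2"}
ours = {"/a": b"1x", "/b": b"2"}
theirs = {"/a": b"1", "/c": b"3"}  # deleted /b, added /c

# Resolver that prefers "ours"; it is not needed for this input.
merged = merge_snapshots(base, ours, theirs, lambda path, b, o, t: o)
print(merged)
```

The merged dictionary would then be written back as a new commit, and the branch pointer updated to it, as described above.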

## Job
A [job](https://docs.pachyderm.com/latest/concepts/pipeline-concepts/job/) is an execution of a pipeline definition that triggers when new data (i.e. a commit) is detected in an input repository.

## Pipeline

A [pipeline](https://docs.pachyderm.com/latest/concepts/pipeline-concepts/pipeline/) is a definition of a process or a transformation that is applied to data (repositories). The process is fully orchestrated by Pachyderm and executes on the underlying Kubernetes infrastructure. The process may be defined such that it "listens" or "subscribes" to repositories and runs a job after witnessing a new commit. The pipeline is ultimately expressed through Docker: the pipeline specifies which Docker container is run and what command is executed inside the container. As Pachyderm is built on Kubernetes, it distributes pipeline runs across a pool of worker pods hosted on the platform. Each pipeline automatically creates an output repository with the same name as the pipeline.
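
The subscribe-and-trigger behaviour can be illustrated with a toy event loop. This is a simplified sketch under stated assumptions: `ToyCluster` is a hypothetical class, transforms are plain Python functions instead of containers, and jobs run synchronously. It only demonstrates that a commit triggers a job, that the job's result lands in an output repository named after the pipeline, and that this output commit can in turn trigger a downstream pipeline.

```python
class ToyCluster:
    """Toy event loop: a commit triggers one job per subscribed pipeline."""

    def __init__(self):
        self.repos = {}      # repo name -> list of commits (each a dict of files)
        self.pipelines = {}  # input repo name -> list of (pipeline name, transform)
        self.jobs = []       # log of (pipeline name, input commit index)

    def create_repo(self, name):
        self.repos.setdefault(name, [])

    def create_pipeline(self, name, input_repo, transform):
        self.create_repo(name)  # the output repo shares the pipeline's name
        self.pipelines.setdefault(input_repo, []).append((name, transform))

    def commit(self, repo, files):
        self.repos[repo].append(dict(files))
        index = len(self.repos[repo]) - 1
        for name, transform in self.pipelines.get(repo, []):
            self.jobs.append((name, index))
            # The job's output becomes a commit that may trigger further pipelines.
            self.commit(name, transform(files))
        return index


cluster = ToyCluster()
cluster.create_repo("data")
cluster.create_pipeline(
    "wordcount", "data",
    lambda files: {path: str(len(text.split())) for path, text in files.items()},
)
cluster.create_pipeline(
    "total", "wordcount",
    lambda files: {"/total": str(sum(int(count) for count in files.values()))},
)
cluster.commit("data", {"/a.txt": "one two", "/b.txt": "three"})
print(cluster.jobs)
print(cluster.repos["total"][-1])
```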

### Basic Workflow

There are a number of different types of pipelines (Pachyderm calls these use cases). Generally speaking a pipeline has inputs and outputs. Both are mapped to local directories within the container executing the pipeline job.

When the pipeline runs, it starts the specified Docker container. Before the container executes the specified command, Pachyderm automatically creates two special directories within the container:
- `/pfs/<input_repo_name>` - stores the data from the input repository
- `/pfs/out` - stores the output data from the transformation, which Pachyderm adds to the pipeline’s output repo as a new commit
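
From the point of view of the code inside the container, these two directories are ordinary folders: read from one, write to the other. The following is a minimal sketch of such a transform; the function name `count_words` and the `.count` output naming are illustrative choices, and the directories are taken from the command line so the same script can be tried on any two local folders.

```python
import sys
from pathlib import Path


def count_words(input_dir, output_dir):
    """Write one '<name>.count' file per input file containing its word count."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(input_dir).rglob("*")):
        if src.is_file():
            words = len(src.read_text().split())
            (out / f"{src.name}.count").write_text(f"{words}\n")


# Inside a pipeline container this would be invoked with the input directory
# (for example /pfs/data, assuming an input repo named "data") and /pfs/out.
if __name__ == "__main__" and len(sys.argv) == 3:
    count_words(sys.argv[1], sys.argv[2])
```

The script never talks to Pachyderm directly; whatever it leaves in the output directory is what ends up in the output commit.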

The basic workflow is described in the [beginner tutorial](https://docs.pachyderm.com/latest/getting-started/beginner-tutorial/).

### Pachyderm Pipeline Specification (PPS)
The pipeline is defined using a JSON file called the [pipeline specification](https://docs.pachyderm.com/latest/reference/pipeline-spec/#pipeline-specification). The following JSON attributes are required for all pipeline types (which Pachyderm refers to as "use cases"):
- pipeline.name
- transform - the Docker container and command executing the transformation

Beyond those, other attributes are conditionally required based on your pipeline’s use case. The supported use cases are listed under Pipeline Use Cases.

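An example specification for a pipeline with a single PFS input is as follows. The `project` attribute is an object that names the project the pipeline belongs to; the project name `default` is assumed here.

```
{
  "pipeline": {
    "project": {
      "name": "default"
    },
    "name": "wordcount"
  },
  "transform": {
    "image": "wordcount-image",
    "cmd": ["/binary", "/pfs/data", "/pfs/out"]
  },
  "input": {
    "pfs": {
      "repo": "data",
      "glob": "/*"
    }
  }
}
```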
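
When specifications are generated by scripts rather than written by hand, it can help to build them as dictionaries and check the always-required attributes before serializing. The sketch below assumes nothing beyond the attributes named in this section; `pfs_pipeline_spec` and `missing_required` are hypothetical helper names, and the check covers only `pipeline.name` and `transform`, not the use-case-specific attributes.

```python
import json

# Attributes required for every pipeline type, as dotted paths.
REQUIRED = (("pipeline", "name"), ("transform",))


def pfs_pipeline_spec(name, image, cmd, repo, glob="/*"):
    """Build a minimal spec dict for a pipeline with one PFS input."""
    return {
        "pipeline": {"name": name},
        "transform": {"image": image, "cmd": list(cmd)},
        "input": {"pfs": {"repo": repo, "glob": glob}},
    }


def missing_required(spec):
    """Return the dotted names of required attributes absent from a spec dict."""
    missing = []
    for path in REQUIRED:
        node = spec
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


spec = pfs_pipeline_spec(
    "wordcount", "wordcount-image", ["/binary", "/pfs/data", "/pfs/out"], "data"
)
print(missing_required(spec))
print(json.dumps(spec, indent=2))
```

The printed JSON can be saved to a file and submitted with `pachctl create pipeline -f <file>`.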

#### Pipeline Use Cases

Pachyderm allows a number of different pipeline types to be defined through the PPS. The PPS varies with each use case, and a use case may subdivide into more than one implementation. The broad categories of pipeline types are as follows:

- **Cron** - Triggers based on a cron event rather than a commit event
- **Egress** - Pushes the results of a pipeline to an external object store (such as Amazon S3, Google Cloud Storage, or Azure Blob Storage) or a SQL database (such as Snowflake, PostgreSQL, etc.)
- **Input** - Reorganizes the files in a repo according to some predefined logic
- **Service** - Exposes the transform container as a serving endpoint
- **Spout** - Ingests streaming data from an outside source (message queue, database transaction logs, event notifications, …)
- **S3** - Writes results to an S3 object store rather than a Pachyderm repo. This overlaps with Egress

Example use cases can be found [here](https://docs.pachyderm.com/latest/reference/pipeline-spec/#pipeline-specification-pps-minimal-spec) and a complete list of PPS attributes can be found [here](https://docs.pachyderm.com/series/pps).

##### Spout

A [spout](https://docs.pachyderm.com/latest/concepts/pipeline-concepts/pipeline/spout/) is a type of pipeline that ingests streaming data from an outside source (message queue, database transaction logs, event notifications, …), as illustrated in the diagram below.

<center><img src="images/pachyderm-spout.png"></center>

### Example Processing Datum
Pipelines can be chained together to form an event-based DAG. The [example](https://docs.pachyderm.com/latest/concepts/pipeline-concepts/datum/relationship-between-datums/#datum-processing-example-two-steps-mapreduce-pattern-and-single-datum-provenance-rule) shown below highlights a two-pipeline pattern in which the first pipeline’s glob pattern splits an incoming commit into three datums (“Datum1” (red), “Datum2” (blue), “Datum3” (purple)), each of which produces two files. The files can then be further appended or overwritten with other files to create the final result. Below, a second pipeline appends the content of all files in each directory into one final document.

<center><img src="images/pachyderm-pipeline-chain.png"></center>

## Datum

A datum is a Pachyderm abstraction that helps in optimizing pipeline processing and tracking data lineage.

Datums are representations of "units of work": the collections of files that a pipeline needs to operate on. A datum is the smallest indivisible unit of computation within a job. A job can have one, many, or no datums. We can use datums to break up a large data set consisting of many files into a set of smaller data sets which can be processed in parallel.

In the simplest case, a user will want to define a pipeline that executes on a single repository. For this, Pachyderm provides:
- PFS Inputs

In more complex cases, the user may want to combine files from multiple repositories or may want to group files into buckets based on some naming convention. For these tasks, Pachyderm provides the following:
- Cross & Union Inputs
- Group Input
- Join Input

Note: Datums exist only as a pipeline processing property and are not filesystem objects. You can never copy a datum. 

For more information see the documentation [here](https://docs.pachyderm.com/latest/concepts/pipeline-concepts/datum/relationship-between-datums/).

### Datum Inputs

#### PFS Inputs
The [documentation](https://docs.pachyderm.com/latest/concepts/pipeline-concepts/datum/#datum-pfs-input-and-glob-pattern) refers to the simplest type of input specification as a PFS Input. A PFS Input is defined, at a minimum, by:

- a repo containing the data you want your pipeline to consider
- a branch to watch for commits
- and a glob pattern to determine how the input data is partitioned.
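
To build intuition for how a glob pattern partitions a repository into datums, the following toy function groups a list of file paths the way `/`, `/*`, and `/*/*` would. It is a simplified approximation built on Python's `fnmatch`, not Pachyderm's glob engine; `datums_for_glob` is a hypothetical name and only patterns whose depth is given by their number of `/` separators are handled.

```python
from fnmatch import fnmatchcase


def datums_for_glob(file_paths, glob):
    """Group file paths into datums: one datum per file or directory matching the glob."""
    if glob == "/":
        # The whole repository is a single datum.
        return {"/": sorted(file_paths)}
    depth = glob.count("/")
    datums = {}
    for path in sorted(file_paths):
        parts = path.strip("/").split("/")
        if len(parts) < depth:
            continue  # the path is too shallow to match this glob
        # The file or directory at the glob's depth identifies the datum.
        prefix = "/" + "/".join(parts[:depth])
        if fnmatchcase(prefix, glob):
            datums.setdefault(prefix, []).append(path)
    return datums


files = ["/2023/jan.csv", "/2023/feb.csv", "/2024/jan.csv", "/readme.md"]
for pattern in ("/", "/*", "/*/*"):
    print(pattern, datums_for_glob(files, pattern))
```

With these files, `/` yields one datum, `/*` yields one datum per top-level file or directory, and `/*/*` yields one datum per file inside the year directories.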

#### Cross & Union Inputs

Pachyderm enables you to combine multiple PFS inputs by using the union and cross operators in the pipeline specification. You can think of union as a disjoint union binary operator and cross as a Cartesian product binary operator. In other words, a union produces the datums of all its inputs side by side, so each datum still comes from a single input, whereas a cross produces one datum for every combination of datums across its inputs, so the pipeline sees one datum from each input at a time.
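
The difference is easiest to see by counting datums. The sketch below models each input as a list of datum names (hypothetical file paths) and uses `itertools.product` for the cross; it illustrates the set semantics only and is not Pachyderm code.

```python
from itertools import product


def union_datums(*inputs):
    """Union: every datum of every input, each processed on its own."""
    return [(datum,) for datums in inputs for datum in datums]


def cross_datums(*inputs):
    """Cross: one datum per combination of one datum from each input."""
    return list(product(*inputs))


# Hypothetical datums from two PFS inputs.
images = ["/img1.png", "/img2.png"]
params = ["/low.json", "/mid.json", "/high.json"]

print(len(union_datums(images, params)))  # 2 + 3 datums
print(len(cross_datums(images, params)))  # 2 * 3 datums
print(cross_datums(images, params)[0])
```

A union is a natural fit when the same code should process files from several repositories, while a cross suits cases such as applying every parameter set to every image.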

For more information see this [documentation](https://docs.pachyderm.com/latest/concepts/pipeline-concepts/datum/cross-union/).

#### Group Input

The group operator allows a user to group similarly named files (from multiple repositories) together based on a glob pattern, which tells Pachyderm which part of the file name to compare. The output of a grouping is a set of datums (units of work). Each datum produced by the group operation points to the subset of files from the PFS Inputs which correspond to that group. The pipeline is then able to operate on each discrete datum separately.
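
The grouping idea can be sketched in plain Python. Here a regular expression with one capture group stands in for the glob pattern that selects the part of the name to compare; that substitution, the function name `group_files`, and the sample paths (prefixed with a repository name) are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def group_files(file_paths, pattern):
    """Group paths by the first capture group of `pattern`; unmatched paths are skipped."""
    regex = re.compile(pattern)
    groups = defaultdict(list)
    for path in sorted(file_paths):
        match = regex.fullmatch(path)
        if match:
            groups[match.group(1)].append(path)
    return dict(groups)


# Hypothetical files from two repositories, "lab" and "scan".
files = [
    "/lab/patient1-blood.txt",
    "/lab/patient2-blood.txt",
    "/scan/patient1-xray.txt",
]
# Each key of the result corresponds to one datum.
print(group_files(files, r"/\w+/(patient\d+)-.*"))
```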

For more information see this [documentation](https://docs.pachyderm.com/latest/concepts/pipeline-concepts/datum/group/).


#### Join Input

The join operator, like the group operator, allows a user to group files together based on commonalities in the file name. More precisely, it allows us to match a set of files following one naming convention with files following a second naming convention, using some more complex logic to establish the relationship. We can think of the join operator like a SQL join: we have two columns that we want to associate with each other via some relationship or primary key.
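
Continuing the SQL analogy, the sketch below performs an inner join between two lists of file names that follow different naming conventions. Regular expressions with one capture group stand in for the patterns that extract the join key; this stand-in, the function name `inner_join`, and the sample file names are assumptions for illustration only.

```python
import re


def inner_join(left_paths, right_paths, left_pattern, right_pattern):
    """Pair every left file with every right file that shares the same captured key."""

    def keyed(paths, pattern):
        regex = re.compile(pattern)
        by_key = {}
        for path in sorted(paths):
            match = regex.fullmatch(path)
            if match:
                by_key.setdefault(match.group(1), []).append(path)
        return by_key

    left = keyed(left_paths, left_pattern)
    right = keyed(right_paths, right_pattern)
    # Only keys present on both sides produce pairs, as in a SQL inner join.
    return [
        (a, b)
        for key in sorted(left.keys() & right.keys())
        for a in left[key]
        for b in right[key]
    ]


# Hypothetical files: store ids appear in two different naming conventions.
sales = ["/store-001.csv", "/store-002.csv"]
returns = ["/returns_001.json", "/returns_003.json"]
print(inner_join(sales, returns, r"/store-(\d+)\.csv", r"/returns_(\d+)\.json"))
```

Files without a partner on the other side (`/store-002.csv` and `/returns_003.json` here) produce no pair in this inner-join sketch.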

For more information see the [detailed examples](https://github.com/pachyderm/pachyderm/tree/2.5.x/examples/joins) or consult the [documentation](https://docs.pachyderm.com/latest/concepts/pipeline-concepts/datum/join/).


# Architecture

Pachyderm is packaged as a Kubernetes application. The solution is composed of several services hosted in pods and several other Kubernetes resources. As the diagram below illustrates, the solution also relies on services which may be provided outside the Kubernetes cluster, such as S3 and a block device provider for persistent storage.

<center><img src="images/pachyderm-architecture-diagram.png"></center>


# Installation

As discussed in my article about [installing Pachyderm](Installing%20Pachyderm.ipynb#Installation-Options), there are multiple installation options. The user may choose a cloud native or on-prem solution. As such, some elements of the architecture may change. Note: The core application is built on native Kubernetes resources and custom resource definitions, so that part should not change too much.

# Fuse Mount
Pachyderm enables you to mount a repository as a local filesystem on your computer by using the `pachctl mount` command. This command uses the Filesystem in Userspace (FUSE) interface to expose the Pachyderm File System (PFS) on a Unix computer system. This functionality is useful when you want to pull data locally to experiment, review the results of a pipeline, or modify the files in the input repository directly.

You can mount a Pachyderm repo in one of the following modes:

 - Read-only — you can read the mounted files to further experiment with them locally, but cannot modify them.
 - Read-write — you can read mounted files, modify their contents, and push them back into your centralized Pachyderm input repositories.

For more information see the following [documentation](https://docs.pachyderm.com/latest/how-tos/basic-data-operations/export-data-out-pachyderm/mount-repo-to-local-computer/).