# Overview

The field of version control has been evolving since the early 1960s. Recently, as data science, machine learning, and artificial intelligence (all data-based applications) take off, the need to maintain control over data sets is beginning to surface and grow. Additionally, the expectation that [databases evolve](https://martinfowler.com/articles/evodb.html) is merging with the DevOps ideology.

Currently, the application of version control to data is relatively new. While there are several options on the market, many are not fully featured yet.

In this article we will:
1. Review the use cases for data version control systems (DVCS)
2. Review the types of data that need versioning
3. Define the full set of features one would expect from a VCS for data 
4. Review the current offerings on the market

# Traditional Version Control Systems (VCS)
Before diving into DVCS, the reader should understand traditional version control, because VCS is the foundation of all scalable DevOps implementations. It informs all process architecture and release strategies. As such, DVCS closely follows and extends the VCS concepts. In doing so it is directly compatible with the DevOps mindset and workflow and can be adopted without major issues.

So what is version control? Generally speaking, version control is the practice of tracking the evolution of a code base over time. VCS tooling can serve the user historical and current versions, and compare and track changes through the evolutionary timeline. It also allows multiple engineers to work safely in parallel and provides governance.

The major elements of successful version control implementations are built around four abstractions:

- Branching Strategy - The process by which commits are made, reviewed, and approved by the development team.
- Versioning Strategy - The process by which arbitrary change events are grouped, committed, and named.
- Promotion Strategy - The process by which versions of a software solution are associated with a quality tier.
- Deployment Strategy - The process by which a given version of a specific quality tier is delivered to a target environment.

Fully defining these concepts is a conversation of their own. The nuances are not necessarily required for this topic. Simply recognizing the parallels between VCS and DVCS is the intention.

## Versioning Strategies
A code repository is expected to change over time. Thinking about change as a discrete process, we can expect a linear sequence of change events.

<center><img src="images/dvcs-versioning-events.png"></center>

But does every arbitrary change event constitute a new version? Maybe a version is an aggregate of one or more atomic transactions.

For example, we might have a new version called "birthday code" which consists of a bunch of changes I made during a four-hour hacking session on my birthday. We might have a version coming out every three hours and name them after the dev team's favorite baseball players. We might have a version called v2 which is so different from v1 that it is incompatible.

<center><img src="images/dvcs-versioning-events-2.png"></center>


In selecting a versioning strategy we will synchronize the state changes of a given dataset with the commits captured by the DVCS. Additionally, the versioning strategy is the process by which arbitrary change sets are committed (archived in an immutable metastore) and named. Through the versioning strategy we understand logically and philosophically how and why a version is created, as well as tangibly what we will call it.

There are many versioning strategies; we will look at CalVer.

### Calendar Versioning (CalVer)
A popular versioning strategy is [Calendar Versioning (CalVer)](https://calver.org/). With this strategy a version number is broken down into several fields, each of which contains information about the version.
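As a rough illustration, the sketch below builds and parses a version string using one possible CalVer scheme, `YYYY.0M.0D.MICRO` (four-digit year, zero-padded month and day, and a counter for multiple versions on the same day). The scheme choice and the function names `calver` and `parse_calver` are assumptions made for this example; CalVer itself allows many other field combinations.

```python
from datetime import date


def calver(release_date: date, micro: int = 0) -> str:
    """Build a version string in the assumed YYYY.0M.0D.MICRO scheme."""
    return (
        f"{release_date.year}."
        f"{release_date.month:02d}."
        f"{release_date.day:02d}."
        f"{micro}"
    )


def parse_calver(version: str) -> tuple[date, int]:
    """Split a YYYY.0M.0D.MICRO string back into its date and counter."""
    year, month, day, micro = (int(part) for part in version.split("."))
    return date(year, month, day), micro


# The third version cut on 7 March 2024 (the counter starts at 0).
print(calver(date(2024, 3, 7), micro=2))  # 2024.03.07.2
```

Because the fields are ordered from most to least significant and zero-padded, versions produced on different days sort chronologically as plain strings.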

TODO: Add image with fields

## Branching Strategies

OK, so now we understand how we are mapping individual change events to change sets (i.e. versions) and how the versions are being named. But this, by itself, will not lead to a well-organized code repository, because multiple versions can coexist in complex solutions. This means we need to track compatibility and lineage. Additionally, when we have multiple developers working in parallel, we need a way for them to work independently (without impacting each other) and then a way for all their versions to be combined into one single consolidated version.

Consider an example where a team needs to produce a consumable product for downstream teams. This could be a piece of software or some sort of data, like a report. The team might organize their work product on a file system in a way that explains who created what.
<center><img src="images/dvcs-branching-workflow-without-branches.png"></center>





This is still disorganized, hard to understand, and prone to inadvertent mistakes. Additionally, while the picture shows an evolution over time, in reality we may only have one or two copies of a report, and as time goes on we might overwrite old versions with new versions.

Instead of using a raw filesystem, we might elect to use a VCS tool and take advantage of branches. In a VCS, a branch is a sequential set of commits. Commits are immutable snapshots of the state of a code repository at a given point in time. Branches allow us to clearly delineate evolutionary timelines for different versions of a file or directory.
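The relationship between commits and branches can be sketched with two small data structures. This is a toy model, not how any particular VCS stores its history: the class names, the `commit_id` format, and the choice to store a full copy of every file in each snapshot are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)  # frozen: a commit cannot be modified once created
class Commit:
    commit_id: str
    parent_id: Optional[str]  # None for the first commit on a branch
    snapshot: tuple           # immutable ((path, content), ...) pairs


@dataclass
class Branch:
    name: str
    commits: list = field(default_factory=list)  # ordered oldest to newest

    def commit(self, files: dict) -> Commit:
        """Append an immutable snapshot of `files` to this branch."""
        parent = self.commits[-1].commit_id if self.commits else None
        new_commit = Commit(
            commit_id=f"{self.name}-{len(self.commits) + 1}",
            parent_id=parent,
            snapshot=tuple(sorted(files.items())),
        )
        self.commits.append(new_commit)
        return new_commit


main = Branch("main")
first = main.commit({"report.csv": "q1 numbers"})
second = main.commit({"report.csv": "q1 numbers, corrected"})
print(second.parent_id)  # main-1
```

Each commit points at its parent, so the branch can always be walked backwards to recover any earlier state, unlike the overwrite-in-place filesystem approach.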


Below we can see branches represented by lines and the circles represent the different versions that might appear over time. 

<center><img src="images/dvcs-branching-workflow-with-crude-branches.png"></center>

We can also associate a branch with a particular purpose or quality level. For example we might designate one branch as the "gold copy" that we send to the public and another as "a POC" which is only available internally.



**Note**: We can use the branches to implicitly express a lot of useful information. We will come back to this when we talk about promotion and deployment strategies.


There are many branching strategies used in practice. These strategies can be classified by the number of integration branches, non-integration branches, merge policies, and quality tiers.


# **TODO**: Talk about commit push cycle

### Trunk Based Development

We will consider the most basic strategy: Trunk Based Development (TBD). The workflow, however, applies to all strategies as they are all derived from this model. Below we see a branch-flow diagram.

<center><img src="images/dvcs-branching-workflow.png"></center>

## Promotion Strategies

Promotion Strategies articulate quality gates; they define how code is promoted from one quality level to a higher quality level by crossing a quality gate which is enforced programmatically. Typically the gate is implemented as an automated test suite. Before a given version number can be attached to a specific commit, the commit must pass a series of tests that verify the quality standards are upheld (e.g. no known bugs).
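A quality gate can be sketched as a function that only raises the quality level when every check passes. The level names in `QUALITY_LEVELS` and the `promote` function are hypothetical; a real gate would normally run a test suite in a CI system rather than call in-process functions.

```python
from typing import Callable, Iterable

# Hypothetical quality levels, ordered from lowest to highest.
QUALITY_LEVELS = ["dev", "test", "prod"]


def promote(current_level: str, checks: Iterable[Callable[[], bool]]) -> str:
    """Return the next quality level if every check passes, else stay put."""
    if not all(check() for check in checks):
        return current_level  # the gate stays closed
    index = QUALITY_LEVELS.index(current_level)
    return QUALITY_LEVELS[min(index + 1, len(QUALITY_LEVELS) - 1)]


def unit_tests_pass() -> bool:
    return True  # stand-in for running an automated test suite


def schema_is_valid() -> bool:
    return False  # stand-in for a failing check


print(promote("dev", [unit_tests_pass]))                   # test
print(promote("dev", [unit_tests_pass, schema_is_valid]))  # dev
```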

We can tie our promotion strategy to our branching strategy by using branches to represent quality levels. In the case of TBD we would represent the quality levels as follows:

<center><img src="images/dvcs-branching-workflow-with-quality.png"></center>

Here we can see that the release branch is a higher quality level than a bug branch which is the lowest quality level. But we may consider some branches to be the same quality level, so we might group them into a tier.

<center><img src="images/dvcs-branching-workflow-with-quality-tiers.png"></center>

Above we can see that we have elected a three-tier promotion strategy. As we will see in the next section, we can map our promotion strategy to our deployment strategy. We will do this by associating a quality tier with an environment tier.

## Deployment Strategies
Deployment Strategies dictate what environment a specific version of code will be deployed to. Typically quality tiers are mapped to environment tiers such that code can only be deployed to an environment with an associated quality tier that is equal to or below the given version of code. For example, production quality code can be deployed to prod and non-prod environments while non-prod code can only target non-prod environments.

<center><img src="images/dvcs-quality-tiers-deployment.png" style="height: 300px"></center>

In the example above, we have a three-tiered environment. We could have any number of quality tiers but having at least three makes it very simple to map a quality tier to an environment tier.
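The mapping rule described in this section can be written as a single comparison. The tier names `DEV`, `TEST`, and `PROD` are assumptions for a three-tier setup; the rule itself is the one stated above: a version may only target environments whose tier is equal to or below the version's quality tier.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical tiers shared by quality levels and environments."""
    DEV = 1
    TEST = 2
    PROD = 3


def can_deploy(version_quality: Tier, environment: Tier) -> bool:
    """Allow deployment only to environments at or below the version's tier."""
    return environment <= version_quality


print(can_deploy(Tier.PROD, Tier.DEV))  # True: prod-quality code may go anywhere
print(can_deploy(Tier.DEV, Tier.PROD))  # False: dev-quality code stays out of prod
```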

## Summary

The four pillars discussed above allow for a successful VCS implementation. These pillars will also apply to version control for data, as that problem space faces many of the same issues.

# What Is A Data Version Control System (DVCS)?

In short, a DVCS is the data analog of traditional VCS. It's a version control system applied to data rather than code. Conveniently, all of the concepts that apply to code will apply to data. This is because data, like code, is transactional; if we change a code file, or if we change a data file, we have a discrete process evolving over time. We can treat them exactly the same. As such, we must define our versioning, branching, promotion, and deployment strategies for our data set.

Additionally, we will see that data has a few extra requirements, so the DVCS will typically aim to offer a few features that are not present in traditional VCS.

## Main Considerations

As we saw with traditional VCS, per the versioning strategy, the first major question is how we map change events to versioning events. Thinking through this problem for data can be a bit confusing, and thinking about all the different ways data is presented and consumed can be overwhelming. The key to success is to break the data down in terms of key characteristics which allow us to represent it using the same model that we used to represent code.

These characteristics of our data (frequency and format) will inform the characteristics of our DVCS implementation (granularity, consistency). We will discuss the following characteristics over the next section:

- Persistence Layer & Storage Format: filesystem, object store, database, block device
- Frequency: Realtime vs batch
- Consistency: immediate vs eventual
- Granularity: 100%, discrete time increments, arbitrary increments

Once we can accurately describe the data we can start defining our versioning, branching, promotion, and deployment strategy.

### Persistence Layer & Storage Format
During its lifetime, data is expected to be transformed through multiple formats. The format in which one interacts with the data will likely not be the same as the format in which the data is persisted to disk or transferred over the network. This is extremely important, as it reduces the scope of the DVCS problem space and allows DVCS to reimplement the successful designs of traditional VCS.

For example, we might consume a dataframe which was stored to disk as a CSV file and downloaded as an ASCII byte stream. We might consume a video file, which captures a UDP stream, as a series of images. We might consume a SQL table produced by a query of a view of a table stored in a database. In the context of DVCS, format refers to the format in which the underlying datastore stores the data and not necessarily the format in which it is consumed or transported.

Thinking through the popular datastore technologies on the market, we loosely expect our solutions to be built using a persistence layer classified into one of the following categories, with one of the corresponding storage formats.

| Data Store   | Storage Format        | Examples                                                    |
|--------------|-----------------------|-------------------------------------------------------------|
| filesystem   | files and directories | Directory and/or arbitrary files (binary and/or text based) |
| object store | objects               | Binary and text based files                                 |
| block device | block and address     | Disk image, ISO, thumb drive                                |
| database     | transactions and logs | MySQL, MongoDB, Neo4j                                       |

Ultimately the storage format is what will drive the DVCS implementation. When we talk about data versions, we will be talking about versions of the underlying storage format. This will allow us to reconstruct corresponding versions of data in other formats, for consumption or transfer.

**Note**: A caveat with DVCS is that the diff of a version may not be represented in the storage format or the consumption format. For example, Dolt diffs of a SQL table are represented as plain text pseudo-tables rather than SQL tables or SQL transaction logs.

### Frequency
Frequency describes how often change events are expected to occur to the data set. The data could expect many changes over a short period of time, a single change over a long period, or some mixture in between. It's common to refer to the first two as real-time and batch data respectively. The frequency will impact our choices with respect to consistency and granularity.

### Consistency

Depending on the frequency of the data changes, there may be a delay with respect to the consistency of the persistence layer of the DVCS; data may be immediately consistent when consumed (i.e. it is stored when it is consumed) or it may be eventually consistent (i.e. stored at some point after it is consumed).

If the frequency of data changes is faster than the frequency of persistence events, there will be a delay between the perceived data state and the persisted data state. As such, a buffering mechanism will be required to temporarily capture change events that have yet to be persisted by the persistence layer. In some cases the volume and frequency of events may be physically impossible to catalogue in the VCS. In this case the user may need to define a versioning strategy that is more practical and accept a lower granularity.
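The buffering mechanism mentioned above can be sketched as a queue that absorbs change events as they arrive and hands them to the persistence layer in batches. The `ChangeBuffer` class and its in-memory `persisted` list are stand-ins invented for this example; a real implementation would write to a durable store and would need to handle failures mid-flush.

```python
from collections import deque


class ChangeBuffer:
    """Hold change events until the persistence layer is ready for them."""

    def __init__(self):
        self._pending = deque()  # events seen but not yet persisted
        self.persisted = []      # each entry is one persisted batch (a version)

    def record(self, event) -> None:
        """Capture a change event at the rate it arrives."""
        self._pending.append(event)

    def flush(self) -> int:
        """Persist all pending events as one batch; return how many were written."""
        if not self._pending:
            return 0
        batch = list(self._pending)
        self._pending.clear()
        self.persisted.append(batch)
        return len(batch)


buffer = ChangeBuffer()
for reading in (20.1, 20.4, 20.9):  # fast-arriving events
    buffer.record(reading)
buffer.flush()                      # slower persistence event
print(buffer.persisted)             # [[20.1, 20.4, 20.9]]
```

Between calls to `flush`, the perceived state (everything recorded) is ahead of the persisted state, which is the eventual consistency described above.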

### Granularity
Granularity refers to the proportion of change events that are captured by the DVCS; i.e. the level of detail with which one can review the change history. Generally speaking, there is a practical limit beyond which added granularity adds no value to the owners of the data. Granularity may also depend on some upstream consideration; in certain circumstances we may want to achieve a higher granularity than in others.

For ease of discussion, I have bucketed the various granularities as follows:
- **100%** - every change event must be captured
- **discrete time increments** - capture the state at certain prespecified points in time
- **arbitrary increments** - capture the state at random points based on external factors. This could be a mixture of the first two granularities.

This characteristic of our desired solution will determine how we implement our versioning strategy and the strategies that build on it.
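To make the difference between granularities concrete, the sketch below decides which change events become versions. With no interval, every event is captured (100% granularity). With an interval, at most one version is created per interval, triggered by the first event after the previous interval has elapsed. This is only one simplified way to approximate time-based increments, and the function name `commit_points` is an assumption for this example.

```python
from datetime import datetime, timedelta
from typing import Optional


def commit_points(
    event_times: list[datetime], interval: Optional[timedelta] = None
) -> list[datetime]:
    """Pick the event timestamps (sorted ascending) that become versions."""
    if interval is None:
        return list(event_times)  # 100% granularity: every event is a version
    points = []
    next_allowed = None
    for moment in event_times:
        if next_allowed is None or moment >= next_allowed:
            points.append(moment)
            next_allowed = moment + interval  # ignore events until then
    return points


events = [
    datetime(2024, 1, 1, 8, 0),
    datetime(2024, 1, 1, 8, 10),
    datetime(2024, 1, 1, 8, 40),
    datetime(2024, 1, 1, 9, 5),
]
print(len(commit_points(events)))                         # 4
print(len(commit_points(events, timedelta(minutes=30))))  # 2
```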

## Example Workflow

Let's look at some examples of how a data engineering team would use DVCS to manage their productized data repository.

Continuing the example from the traditional VCS, let's assume the following parameters for our DVCS:
- Versioning strategy - CalVer
- Branching Strategy - Trunk Based Development
- Promotion Strategy - Three-Tier
- Deployment Strategy - Three-Tier

### Batch Data Aggregation
Let's first look at a batch data scenario. We are not performing any transformations on the data; we are simply archiving information that is used as an input to downstream processes. As such, we don't really have any parallel actions taking place and our data only has one quality level.

The first things to understand are how frequently the data changes, how the data is stored, and how granular our record of the data changes needs to be. Let's assume the following:

- Frequency: Updates once daily, M-F, at 8 AM EST
- Persistence Layer: SQL Database
- Granularity: Daily M-F at 8:30 AM EST (we don't care about changes that trickle in throughout the day, we just want the state as of 8:30)

We would have the same exact workflow as the analogous code repository. Assuming we already have a repo with an integration branch our process would be:

<center><img src="images/dvcs-example-workflow-1.png" style="float: right; height: 200px"></center>

1. Branch off of integration branch
1. Start process
1. Wait for process to finish updating data set
1. Commit changes as a new version
1. Push
1. Merge into integration branch
1. Create Release Branch

You might ask: why do we even need a new branch? Why not commit directly to the integration branch? This is because we are using the TBD branching strategy, which expressly forbids this action. Perhaps, then, that branching strategy is not right for this scenario.

### Batch Data Transformation

Let's look at an example of a downstream process from the previous example. I will consume data that is being published, I will transform it, and I will publish it. In this case, my transformation is implemented by a pipeline which runs a specific version of code in a code repository. Additionally, in this example, my transformations deal with data quality. The transformed data will ultimately be consumed by a downstream machine learning model, and the machine learning team has helped to define tests which validate that the data is in the correct format and has the correct values.

For example, if a column was expected to be boolean but is instead a string, the test will fail and alert the data engineering team that the quality of the data does not satisfy the requirements of the downstream consumer. The data engineering team can then go back and analyze why the transformation of the data failed: it could be that the data that was imported has changed, the import process had a bug, or the transformation had a bug. Once the issue is understood, the code can be corrected, the transformation can be rerun on the original data set, and a corrected version can be produced for downstream consumption.
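The kind of format test described above might look like the following sketch, which checks each row against an expected column type. The column names, the `validate_rows` function, and the representation of a data set as a list of dictionaries are assumptions for illustration; in practice such checks are often written with a data validation library or directly against the database.

```python
def validate_rows(rows: list[dict], schema: dict[str, type]) -> list[str]:
    """Return a list of problems; an empty list means the data passes."""
    errors = []
    for index, row in enumerate(rows):
        for column, expected_type in schema.items():
            if column not in row:
                errors.append(f"row {index}: missing column {column!r}")
            # Exact type match, so that a bool is not accepted as an int.
            elif type(row[column]) is not expected_type:
                errors.append(
                    f"row {index}: column {column!r} expected "
                    f"{expected_type.__name__}, got {type(row[column]).__name__}"
                )
    return errors


schema = {"user_id": int, "is_active": bool}
rows = [
    {"user_id": 1, "is_active": True},
    {"user_id": 2, "is_active": "yes"},  # a string where a boolean is expected
]
for problem in validate_rows(rows, schema):
    print(problem)
# row 1: column 'is_active' expected bool, got str
```

A non-empty result would fail the quality gate and block the merge, prompting the investigation described above.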

Thus our process would be as follows:

<table>
    <tbody>
        <tr>
            <td>
                
1. Branch off of integration branch      
1. Start process
1. Wait for process to finish updating data set
1. Commit changes as a new version
1. Push
1. Perform transformation
1. Commit changes as a new version
1. Push
1. Merge into integration branch
1. Fix bug in transformation logic
1. Rerun transformation
1. Merge into integration branch
1. Create Release Branch
            </td>
            <td>
                <center><img src="images/dvcs-example-workflow-2.png" style="width: 800px"></center>
            </td>
        </tr>
    </tbody>
</table>



As with traditional VCS, we can see that the DVCS may raise events to inform downstream processes to kick off. In the example above, we might configure the import pipeline to trigger based on commits to release branches. In fact, the entire workflow of the data engineering team may be automated by the eventing of an upstream DVCS.

**Note**: You might be thinking: there is a lot of overhead in maintaining all these redundant or obsolete copies of data. We will discuss that when we talk about retention later on.

### Realtime Streaming
Now let's look at a real-time data streaming scenario. We will see it works exactly the same way (provided we implement certain architectural designs).

Again, the first things to understand are how frequently the data changes and what the underlying persistence layer of the data is. Let's assume the following:

- Frequency: Updates Once M-F at 8 AM EST
- Persistence Layer: SQL Database

## So how would we employ DVCS?

To answer this question we need to answer the following questions:
- What constitutes a new version of data? 
- How does one version differ from the next?
- How do I switch from one version to the next?
- Who is authorized to create a version?

A new version could occur with every new transaction applied to the data store. It could occur when a major schema change renders a dataset incompatible with downstream codebases. A version could also signify the omission of a transaction: e.g. maintaining versions with and without personally identifying information (PII).

This information is articulated through the branching and versioning strategies.

The end goal is to provide data versioning, data lineage, and successfully execute a branching strategy for the purposes of continuous integration. The process of continuous integration provides process synchronization, quality gates, and serves as the backbone of data governance and continuous deployment.

## Use cases for DVCS


Tracking Temporal Data

- Time series data

Tracking evolution of the data set

- Ad hoc corrections need to be made to the data
- New or experimental code that performs modifications is being run against a data set

Unlocking rapid / agile development

- Parallel development that introduces breaking changes

Protection

- Bug in an application causes corruption
- Malicious actor makes unauthorized modifications

# 2. Types of Data
Before talking about the various VCS options for data, we need to understand that data comes in different formats and there likely isn't going to be a single robust solution that accommodates all data types.

Data typically comes in the following flavors:
1. Files / Filesystems
    1. Tables
    2. JSON
2. Objects
3. Raw Block Devices
4. Databases

Now that we understand the basic types of data, let's look at the typical features that version control systems offer.

# 3. Features Of A DVCS

## 3.1. Core Features

- **History** - An immutable sequential set of point-in-time representations (often referred to as commits) of a data set, i.e. a representation of the historical evolution of a data set.

- **Lineage (provenance)** - An immutable ancestral representation of the inputs and processes that produced a given commit, commit history, or change.

    For example, if a pipeline is responsible for adding a new row or if a user is responsible for modifying a file, this information would be captured and reviewable.

- **Branching** - An ability to represent and reference discrete parallel states of a data set for a single point in time.

- **Merging** - The ability to combine parallel commit histories into a single commit history by accepting changes from both version histories and accepting and combining changes to individual files; the ability to define and implement merge strategies.

- **Conflict Detection and Resolution** - The ability to detect and remediate instances of merge conflicts (when changes in two branches are incompatible or contradict each other). This functionality, like diffing, may be offloaded in part to the underlying storage layer or an established third-party tool.

- **Data Efficiency** - The ability to minimize the number of redundant copies of data.

    Storing data costs money, and with big data you will see big costs. The DVCS should allow multiple branches or commits to reference the same copy of data without requiring a separate copy for each observer.

- **Atomicity** - The version history is composed of atomic (indivisible and irreducible) units of change. The operations performed on the version history are also atomic, guaranteeing that updates are either successful or inert in the case of failure. Additionally, no two operations can be performed at the same time; only one update to a version history can occur at a given time.

- **Persistence**: The ability for the data managed by the VCS to survive system power loss. The data should be stored in a persistent data store.

- **Durability**: The ability for version history and lineage to survive corruption events (such as transaction failures or bit-rot). The VCS should implement or inherit mechanisms to protect against corruption and unintended data loss.

- **Retention**: The ability to define which set or subset of version history events to "forget" or erase based on some predefined policy. While the VCS must be able to persist all records of data changes, in some cases, the user may prefer only to retain a rolling window of changes. It's also possible that the user may want to specify an alternate retention strategy. The VCS should support the ability to prune its records of versions that are no longer needed or wanted. 

    Note: The underlying storage layer may implement its own retention policy. If the VCS is built on top of a third-party datastore, this may interfere with the retention policy of the VCS, depending on the implementation.

- **Working Set** - A mutable copy of a commit or an untracked data set intended to be committed at a later date.

- **Reversion**: The ability to restore a working set or branch to a historical version. This reversion should support reverting the data in whole or in part (i.e. a single file or an entire directory).

- **Metadata Support** - The VCS should allow the user to attach meaningful information to the changes being recorded in the system (i.e. what the changes were, why they were made, etc.). With traditional VCS we are able to see the user who committed the change, a description of the change, and a change list associated with the commit. Data pipelines, however, are a bit more complicated and robust than human-driven edits to plain text files. For example, data may be transformed through a DAG expressed through a pipeline or by a transformation as part of a stream. In both of these instances the transformation is implemented via an instance of code. The VCS should be aware of the process which is orchestrating and/or applying a particular change to a data set. For example, if there is a pipeline run, the VCS should be able to create a link between the version of code and the instance of the pipeline which performed the transaction. This additional 

- **Diffing**: The ability to show the differences between commits, between a working set and a commit, or individual files rather than the entire data set. 

   In practice, some of this functionality can be offloaded in part to the underlying storage implementation. For example, if a JSON file changed, there are already diff tools which can compare plain text files; if a database file changed, one may need to leverage the underlying database tooling to do a comparison.

- **Reviewability** - The ability to review and query metadata attached to the revisions in the history. 

- **Format Agnostic** - Able to support multiple formats or types of data (e.g. text files, binary files, database files).

- **Blame Support** - Similar to traditional VCS the DVCS should allow a user to identify who is responsible for a particular change in the data.

- **Role Based Access Control** - The VCS needs to be able to define users, roles, and associated permissions. While the underlying storage layer will have its own RBAC, which should be respected by the VCS implementation, the VCS maintains its own objects and data which also must be protected.
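To make the Data Efficiency and Diffing features above more concrete, the sketch below stores file contents in a content-addressed blob store, so that two commits containing an identical file share one stored copy. The `BlobStore` class and the helper functions are hypothetical, and hashing whole files with SHA-256 is just one possible design; real systems may deduplicate at the chunk, row, or block level instead.

```python
import hashlib


class BlobStore:
    """Store each distinct piece of content exactly once, keyed by its hash."""

    def __init__(self):
        self.blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # no-op if already stored
        return digest


def commit_snapshot(store: BlobStore, files: dict[str, bytes]) -> dict[str, str]:
    """A commit is a manifest mapping each path to the hash of its content."""
    return {path: store.put(content) for path, content in files.items()}


def changed_paths(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Paths that were added, removed, or modified between two manifests."""
    return [
        path
        for path in sorted(set(old) | set(new))
        if old.get(path) != new.get(path)
    ]


store = BlobStore()
v1 = commit_snapshot(store, {"a.csv": b"1,2", "b.csv": b"3,4"})
v2 = commit_snapshot(store, {"a.csv": b"1,2", "b.csv": b"3,5"})

print(len(store.blobs))       # 3 blobs stored for 4 file versions
print(changed_paths(v1, v2))  # ['b.csv']
```

Because manifests only hold hashes, comparing two commits is a cheap dictionary comparison, and the content-level diff of a changed path can be delegated to a format-specific tool.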

## 3.2. Operational Features

- **Backup Support**: The VCS system needs to enable system admins to perform backups. As such a copy of the entire system can be snapshotted and archived for disaster recovery purposes.

- **Resilience** - As with a datastore, the VCS should be able to recover from failures and provide an uninterrupted service to the users.

- **Scalability** - The VCS performance should not be impacted with changes to the size of the underlying data set or the length of the revision history.

## 3.3. Architectural Features

- **Datastore Integration**: The VCS should be built on top of an established traditional storage layer. In doing so, the VCS will minimize the complexity of its design and maximize the utility provided to users and admins.

- **Highest Level Abstraction**: Speaking in the abstract, data is stored in containers (sometimes in a nested or hierarchical structure). For example, SQL data is stored in a table in a database; JSON files are stored in an S3 bucket in an AWS account. The VCS should be aware of the abstractions provided by the underlying data stores. It should be able to understand and register changes at any of the lower levels of abstraction within the data store. As such, the DVCS should provide an abstraction that can sit above any other "layer" representing a data source.

# 4. Reference Architecture

Currently there are no solutions which are feature complete. There are a number of solutions which offer a subset of the features specified above. In this section I propose a reference architecture and talk through how several solutions fit into that architecture.

## 4.1. Overview

Below we can see a modular solution designed to conform to the client-server model. The DVCS solution sits on top of the underlying storage layer and proxies native traffic to the underlying datastore.

<center><img src="images/dvcs-architecture-diagram.png" style="width: 400px"></center>

With this architecture the client connects to the DVCS to issue administrative commands to the DVCS repository, to provide informative metadata about the lineage of a particular state of the DVCS repository, or to connect to the underlying datastore to manipulate the actual data.

## 4.2. Components

### 4.2.1. The Storage Layer Proxy (SLP)

The SLP is a metered/monitored proxy which is responsible for connecting a client directly to their data, which is stored on the underlying storage layer. The location of the actual data is based on input from the Version Control Layer (VCL). The data could "move around" as different VCS events occur. For example, if we want to access a file from a particular version of the repository, some underlying redirection may need to happen so that one client can read that file while another reads a newer copy. The SLP informs the Lineage Control Layer (LCL) of client session activity through the metadata store. As clients establish connections and perform operations, the SLP can observe and report.

### 4.2.2. The Lineage Control Layer (LCL)
The LCL is responsible for tracking the lineage of the VCS events; such as where the data is from and what transformations are responsible for creating the data associated with a given commit. When clients connect to the DVCS they establish a session and are given access to the underlying data store through a metered proxy (the SLP). Any storage layer transactions declared/observed are tied to the respective client and catalogued.

### 4.2.3. The Version Control Layer (VCL)

The VCL is responsible for managing the commit history, the commit metadata, and the working set. It manipulates the underlying storage layer to snapshot data, create working copies, and provide diff functionality. It configures the storage layer proxy to direct client requests to a respective underlying data store.
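The interaction between the VCL and the SLP can be sketched as a routing table: the VCL tells the proxy where each version physically lives, and the proxy resolves client requests against that table while recording activity for the LCL. Everything here is an assumption made for illustration, including the class name, the method names, and the use of path-like strings as storage locations.

```python
class StorageLayerProxy:
    """Toy proxy: routes reads to a version's location and logs activity."""

    def __init__(self):
        self.routes: dict[str, str] = {}             # version -> storage location
        self.activity: list[tuple[str, str, str]] = []  # observed for lineage

    def configure(self, version: str, location: str) -> None:
        """Called by the version control layer when a version is materialized."""
        self.routes[version] = location

    def resolve(self, client: str, version: str, path: str) -> str:
        """Return where `path` lives for `version` and record who asked."""
        location = self.routes[version]  # KeyError if the version is unknown
        self.activity.append((client, version, path))
        return f"{location}/{path}"


proxy = StorageLayerProxy()
proxy.configure("2024.03.07.0", "/snapshots/2024.03.07.0")
proxy.configure("2024.03.08.0", "/snapshots/2024.03.08.0")

# Two clients read the same file from different versions at the same time.
print(proxy.resolve("alice", "2024.03.07.0", "report.csv"))
print(proxy.resolve("bob", "2024.03.08.0", "report.csv"))
```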

### 4.2.4. The Storage Layer
The storage layer is the technology in which the user has elected to store their raw data. This would be a solution that provides file, object, block, or database storage. The storage layer is managed by the DVCS which sits above it. User traffic is directed to the storage layer through the DVCS and its SLP.

### 4.2.5. The Metadata Store
The metadata store is responsible for storing metadata related to the version history, version lineage, proxy settings, and event hooks associated with a data repository.

### 4.2.6. Event Hooks & Compute
The event hook system allows subsystems to trigger events. Internal and third-party systems are able to register event hooks to execute when specified events occur. For example, when the underlying storage layer triggers an event, the event hook system can witness the event and notify downstream systems. The event hook functionality requires a compute layer to host execution of the registered callbacks.
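A minimal sketch of such a hook registry is shown below. The `EventHooks` class, the event name `"commit"`, and the payload shape are assumptions for this example, and callbacks run synchronously in-process here, whereas the architecture above anticipates a separate compute layer.

```python
from collections import defaultdict
from typing import Any, Callable


class EventHooks:
    """Registry mapping event names to the callbacks that should run."""

    def __init__(self):
        self._callbacks: dict[str, list[Callable[[Any], Any]]] = defaultdict(list)

    def register(self, event_name: str, callback: Callable[[Any], Any]) -> None:
        self._callbacks[event_name].append(callback)

    def trigger(self, event_name: str, payload: Any) -> list:
        """Run every callback registered for the event; return their results."""
        return [callback(payload) for callback in self._callbacks.get(event_name, [])]


hooks = EventHooks()


def start_import_pipeline(event: dict) -> str:
    # Stand-in for notifying a downstream system.
    return f"import started for {event['branch']}"


hooks.register("commit", start_import_pipeline)
print(hooks.trigger("commit", {"branch": "release/2024.03.07.0"}))
# ['import started for release/2024.03.07.0']
```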

## 4.3. Workflow

The basic process of using a DVCS is analogous to the process of using a traditional VCS. The end goal is to provide data versioning, data lineage, and successfully execute a branching strategy for the purposes of continuous integration. The process of continuous integration provides process synchronization, quality gates, and serves as the backbone of data governance and continuous deployment.




The basic process of adding new data to the central repository is as follows:

1. Create Repo
1. Create Initial Commit & Integration Branch
1. Create Branch
1. Make Commit(s)
    - Update Working Set (through human or pipeline)
    - Commit Changes
1. Merge Into Integration Branch
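The steps above can be sketched with a toy in-memory repository. The `Repo` class, its method names, and the branch names are hypothetical, and the merge shown is a fast-forward only: it assumes the integration branch has not moved since the working branch was created, so it sidesteps conflict resolution entirely.

```python
class Repo:
    """Toy repository: each branch is a list of committed snapshots."""

    def __init__(self, integration_branch: str = "main"):
        # Steps 1 and 2: create the repo with an initial (empty) commit.
        self.integration_branch = integration_branch
        self.branches: dict[str, list[dict]] = {integration_branch: [{}]}

    def create_branch(self, name: str) -> None:
        # Step 3: a new branch starts from the integration branch's history.
        self.branches[name] = list(self.branches[self.integration_branch])

    def commit(self, branch: str, working_set: dict) -> None:
        # Step 4: snapshot the working set; trunk only changes through merges.
        if branch == self.integration_branch:
            raise ValueError("direct commits to the integration branch are not allowed")
        self.branches[branch].append(dict(working_set))

    def merge(self, branch: str) -> None:
        # Step 5: fast-forward the integration branch to include new commits.
        target = self.branches[self.integration_branch]
        source = self.branches[branch]
        if source[: len(target)] != target:
            raise ValueError("integration branch has diverged; merge needs resolution")
        target.extend(source[len(target):])


repo = Repo()
repo.create_branch("load/2024-03-07")
repo.commit("load/2024-03-07", {"rows": 1200})
repo.merge("load/2024-03-07")
print(repo.branches["main"][-1])  # {'rows': 1200}
```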


### 4.3.3. Merging Strategy

This branching procedure should work in both real-time streaming and batch data processing scenarios.

There is a tradeoff, however, between data availability and the merge activity; technically, data is not "official" until it is committed/merged into the target branch. In the event that data exists in the working set (uncommitted changes) of a non-integration branch, the data   

In the case of integration branches, there are no direct commits, which means data is not present until it is merged.

real time streaming vs streaming

The real challenge that needs to be addressed is the process of merging in the event of a conflict.


There may need to be minor rearchitecting to facilitate the streaming processes

With respect to streaming data there is a small hitch with respect to  discretization.













Walk through branching strategy and lineage for db, file store, and rbd. Discuss how atomic operations happen.