# Overview

The field of version control has been evolving since the early 1960s. Recently, as data science, machine learning, and artificial intelligence (all data-based applications) take off, the need to maintain control over data sets is beginning to surface and grow. The expectation that [databases evolve](https://martinfowler.com/articles/evodb.html) is also merging with the DevOps ideology.

The application of version control to data is relatively new. While there are several options on the market, many are not yet fully featured. As the field is still evolving, this article attempts to give a basic, objective understanding of the topic and a perspective on currently available solutions.

# 1. Why Do I Need Version Control For Data?

If you are unfamiliar with the concept of version control, see this [refresher notebook](Review%20Traditional%20Version%20Control.ipynb). As that article asserts, a version control system (VCS) is the foundation of all scalable DevOps implementations, as it informs all process architectures and release strategies.

By analogy, a data version control system (DVCS) closely follows and extends the VCS concepts. In doing so, version control for data becomes the foundation of all scalable DataOps, and therefore MLOps, implementations, and it likewise informs all process architectures and release strategies. Because it closely follows and extends the VCS concepts, DVCS is directly compatible with the DevOps mindset and workflow and can be adopted without major issues.

DVCS works because data, like code, is transactional; whether we change a code file or a data file, we have a discrete process evolving over time. We can manage these evolutionary processes in exactly the same way. Just as we define versioning, branching, promotion, and deployment strategies for our code, we can also define them for data. More interesting still is the link between them!
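
To make the idea of a "discrete process evolving over time" concrete, here is a minimal sketch in which every change to a data file is recorded as a new version that points at its parent. The names `Version` and `commit` are hypothetical and exist only for illustration; real tools differ in how they identify and store versions.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Version:
    """One immutable point in the history of a data file (hypothetical model)."""
    content_hash: str            # fingerprint of the data at this point in time
    parent: Optional["Version"]  # previous version, or None for the first one


def commit(data: bytes, parent: Optional[Version] = None) -> Version:
    """Record the current state of the data as a new version."""
    return Version(hashlib.sha256(data).hexdigest(), parent)


# Each change to the data produces one new, discrete version.
v1 = commit(b"id,price\n1,10\n")
v2 = commit(b"id,price\n1,12\n", parent=v1)  # e.g. an ad hoc correction

# Walk the history from newest to oldest.
history = []
node = v2
while node is not None:
    history.append(node.content_hash[:8])
    node = node.parent
print(history)
```

The same structure would work unchanged for a source file, which is the point of the analogy: only the bytes being fingerprinted differ.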

We will also see that data has a few additional requirements, so a DVCS will typically aim to offer a few features that are not present in a traditional VCS.

# 2. Use Cases For DVCS
This is by no means an exhaustive list of use cases. It is just a list to get the juices flowing and spark interest in the subject.

- Tracking evolution of the data set
    - Data is not indexed over time (e.g. an image or a diagram)
    - Ad hoc corrections need to be made to the data
- Parallel Conflicting Workstreams
    - Isolation provided by branching allows multiple workstreams in parallel
- Protection
    - Bug in an application causes corruption
    - Malicious actor makes unauthorized modifications
    - New or experimental code that performs modifications is being run against a critical data set
    - Breaking changes need to be tested
- Governance
    - Establishing and enforcing quality gates
    - Protecting PII
    - Tracing changes through a system
- Reproducibility
    - Being able to reproduce data science / ML experiments

# 3. Main Considerations When Designing An Implementation

As we saw with traditional VCS when discussing the versioning strategy, the first major question is how we map change events to versioning events. Thinking through this problem for data can be confusing, and considering all the different ways data is presented and consumed can be overwhelming. The key to success is to break the data down in terms of a few key characteristics, which allow us to represent the data using the same model that we used to represent code.

The key characteristics of our data (frequency and format) will help inform the characteristics of our DVCS implementation (granularity and consistency). We will discuss the following characteristics over the next section:

- Persistence Layer & Storage Format: filesystem, object store, database, block device
- Frequency: Realtime vs batch
- Consistency: immediate vs eventual
- Granularity: 100%, discrete time increments, arbitrary increments

Once we can accurately describe the data we can start defining our versioning, branching, promotion, and deployment strategy.
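
As a rough illustration of what "accurately describing the data" could look like, the four characteristics above can be captured in a small record. This is only a sketch: the class and member names (`DataProfile`, `Store`, and so on) are assumptions made for this example and do not come from any particular tool.

```python
from dataclasses import dataclass
from enum import Enum


class Store(Enum):
    FILESYSTEM = "filesystem"
    OBJECT_STORE = "object store"
    DATABASE = "database"
    BLOCK_DEVICE = "block device"


class Frequency(Enum):
    REALTIME = "realtime"
    BATCH = "batch"


class Consistency(Enum):
    IMMEDIATE = "immediate"
    EVENTUAL = "eventual"


class Granularity(Enum):
    EVERY_CHANGE = "100%"
    TIME_INCREMENTS = "discrete time increments"
    ARBITRARY = "arbitrary increments"


@dataclass(frozen=True)
class DataProfile:
    """Hypothetical description of a data set, used to pick a versioning strategy."""
    store: Store
    frequency: Frequency
    consistency: Consistency
    granularity: Granularity


# A once-a-day vendor delivery loaded into a SQL database.
daily_vendor_feed = DataProfile(
    Store.DATABASE, Frequency.BATCH, Consistency.IMMEDIATE, Granularity.EVERY_CHANGE
)
print(daily_vendor_feed)
```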

## 3.1. Persistence Layer & Storage Format
During its lifetime, data is expected to be transformed through multiple formats. The format in which one interacts with the data will likely not be the same as the format in which the data is persisted to disk or transferred over the network. This is extremely important, as it reduces the scope of the DVCS problem space and allows a DVCS to reuse the successful designs of traditional VCS.

For example, we might consume a dataframe that was stored to disk as a CSV file and downloaded as an ASCII byte stream. We might consume a video file, which captures a UDP stream, as a series of images. We might consume a SQL table produced by a query of a view of a table stored in a database. In the context of DVCS, format refers to the format in which the underlying datastore stores the data and not necessarily the format in which it is consumed or transported.

Thinking through the popular datastore technologies on the market, we loosely expect our solutions to be built on a persistence layer that falls into one of the following categories, each with a corresponding storage format.

| Data Store   | Storage Format        | Examples                                                    |
|--------------|-----------------------|-------------------------------------------------------------|
| filesystem   | files and directories | Directory and/or arbitrary files (binary and/or text based) |
| object store | objects               | Binary and text based files                                 |
| block device | block and address     | Disk image, ISO, thumb drive                                |
| database     | transactions and logs | MySQL, MongoDB, Neo4j                                       |

Ultimately, the storage format is what will drive the DVCS implementation. When we talk about data versions, we will be talking about versions of the underlying storage format. From the versioned data in the storage layer, one can reconstruct corresponding versions of the data in other formats, to be consumed or transferred. We will discuss real-time data streaming in more detail later in this article.

**Note**: A caveat with DVCS is that the diff of a version may not be represented in the storage format or the consumption format. For example, Dolt diffs of a SQL table are represented as plain text pseudo-tables rather than SQL tables or SQL transaction logs.

## 3.2. Frequency
Frequency describes how often change events are expected to occur in the data set. The data could see many changes over a short period of time, a single change over a long period, or some mixture in between. It's common to refer to the first two as real-time and batch data respectively. The frequency will impact our choices with respect to consistency and granularity.

## 3.3. Consistency
Consistency refers to the consistency of the DVCS and is ultimately informed by the consistency of the underlying data store. If we make a commit, how long is it before the changes in the commit are actually persisted to the storage layer and/or available for consumption?

Depending on the frequency of the data changes, there may be a delay with respect to the consistency of the persistence layer of the DVCS; data may be immediately consistent when consumed (i.e. it is stored when it is consumed) or it may be eventually consistent (i.e. stored at some point after it is consumed).

If the frequency of data changes is higher than the frequency of persistence events, there will be a delay between the perceived data state and the persisted data state. As such, a buffering mechanism will be required to temporarily capture change events that have yet to be persisted by the persistence layer. In some cases it may be physically impossible to catalogue the volume and frequency of events in the VCS. In this case the user may need to define a more practical versioning strategy and accept a lower granularity.
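
The buffering mechanism described above might look something like the following sketch. It assumes a bounded in-memory buffer that discards the oldest pending event when it is full, which is one possible way of "accepting a lower granularity"; the class `ChangeBuffer` and its methods are hypothetical.

```python
from collections import deque


class ChangeBuffer:
    """Holds change events that have arrived but are not yet persisted."""

    def __init__(self, max_pending: int):
        self.pending = deque()
        self.max_pending = max_pending
        self.dropped = 0  # events we could not keep (the lost granularity)

    def record(self, event: str) -> None:
        if len(self.pending) >= self.max_pending:
            # Buffer is full: give up the oldest pending event.
            self.pending.popleft()
            self.dropped += 1
        self.pending.append(event)

    def flush(self) -> list:
        """Called on each persistence event; returns the events to persist."""
        batch = list(self.pending)
        self.pending.clear()
        return batch


buf = ChangeBuffer(max_pending=3)
for i in range(5):  # five changes arrive before the store persists anything
    buf.record(f"change-{i}")

persisted = buf.flush()
print(persisted)    # ['change-2', 'change-3', 'change-4']
print(buf.dropped)  # 2
```

Whether to drop, block the producer, or spill to disk when the buffer fills is a design decision that depends on how much granularity the data owners actually need.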

## 3.4. Granularity
Granularity refers to the proportion of change events that are captured by the DVCS, i.e. the level of detail with which one can review the change history. Generally speaking, there is a practical limit beyond which added granularity adds no value for the owners of the data. Granularity may also depend on some upstream consideration; in certain circumstances we may want to achieve a higher granularity than in others.

For ease of discussion, I have bucketed the various granularities as follows:
- **100%** - every change event must be captured
- **Discrete time increments** - capture the state at certain prespecified points in time
- **Arbitrary increments** - capture the state at arbitrary points determined by external factors. This could be a mixture of the first two granularities.

This characteristic of our desired solution will determine how we implement our versioning strategy, among other things.
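
The three granularity buckets can be read as three policies for deciding which change events become versions. The sketch below is one hedged interpretation: events are `(timestamp, description)` pairs, and the function names are made up for this example.

```python
from typing import Callable, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, description)


def every_change(events: List[Event]) -> List[Event]:
    """100%: every change event becomes a version."""
    return list(events)


def time_increments(events: List[Event], interval: float) -> List[Event]:
    """Discrete time increments: keep the last event seen in each interval."""
    latest = {}
    for ts, desc in events:
        latest[int(ts // interval)] = (ts, desc)  # later events overwrite earlier ones
    return [latest[k] for k in sorted(latest)]


def arbitrary(events: List[Event], should_version: Callable[[Event], bool]) -> List[Event]:
    """Arbitrary increments: an external rule decides which events are versioned."""
    return [e for e in events if should_version(e)]


events = [(0.0, "a"), (4.0, "b"), (11.0, "c"), (12.0, "release"), (25.0, "d")]

print(len(every_change(events)))                       # 5
print(time_increments(events, interval=10.0))          # [(4.0, 'b'), (12.0, 'release'), (25.0, 'd')]
print(arbitrary(events, lambda e: e[1] == "release"))  # [(12.0, 'release')]
```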

## 3.5. Example Workflows & Implementations

Let's look at some examples of how a data engineering team would use DVCS to manage their productized data repository.

Continuing the example from the traditional VCS review, let's assume the following parameters for our DVCS:
- Versioning strategy - CalVer
- Branching Strategy - Trunk Based Development
- Promotion Strategy - Three-Tier
- Deployment Strategy - Three-Tier
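
Since the examples below version data by calendar date, here is a small sketch of how a CalVer-style tag might be generated. CalVer allows several layouts; the `YYYY.MM.DD` layout with an optional sequence suffix is an assumption made for this example, as is the helper name `calver_tag`.

```python
from datetime import date


def calver_tag(day: date, sequence: int = 0) -> str:
    """Build a CalVer-style tag: YYYY.MM.DD, plus a suffix for extra versions that day."""
    base = f"{day.year}.{day.month:02d}.{day.day:02d}"
    return base if sequence == 0 else f"{base}.{sequence}"


print(calver_tag(date(2024, 3, 4)))              # 2024.03.04
print(calver_tag(date(2024, 3, 4), sequence=1))  # 2024.03.04.1
```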

### 3.5.1. Batch Data Aggregation
Let's first look at a batch data scenario. 

> Example: Once a day, a bulk file upload occurs from a 3rd party vendor. The delivery is expected at 8am EST. Once the delivery is made we would like to version the data as quickly as possible. No operations or transformations are being performed on the data.

We can characterize this example as follows:

- Frequency: Updates Once M-F at 8 AM EST
- Persistence Layer & Storage Format: SQL Database and Transaction Logs
- Consistency: Immediate
- Granularity: 100%

We would have the same exact workflow as the analogous code repository. Assuming we already have a repo with an integration branch our process would be:

<table>
    <tbody>
        <tr>
            <td style="width:400px">
                
1. Branch off of integration branch
1. Start process
1. Wait for process to finish updating data set
1. Commit changes as a new version
1. Merge into integration branch
1. Create Release Branch
            </td>
            <td>
                <center><img src="images/dvcs-example-workflow-1.png" style="height: 200px"></center>
            </td>
        </tr>
    </tbody>
</table>




You might ask: why do we even need a new branch? Why not commit directly to the integration branch? The reason is that we are using the trunk-based development (TBD) branching strategy, which expressly forbids this action. Perhaps, then, that branching strategy is not the right fit for this scenario.
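
The six-step batch workflow listed above can be sketched with a toy, in-memory repository. `ToyRepo` is a hypothetical stand-in rather than the API of any real DVCS, the branch names and row count are invented, and the merge is fast-forward only, which is sufficient here because nothing else touches the integration branch during the import.

```python
class ToyRepo:
    """In-memory stand-in for a DVCS; each branch is a list of committed snapshots."""

    def __init__(self):
        self.branches = {"integration": [{}]}  # start with one empty snapshot

    def branch(self, name: str, source: str) -> None:
        self.branches[name] = list(self.branches[source])

    def commit(self, branch: str, snapshot: dict) -> None:
        self.branches[branch].append(dict(snapshot))

    def merge(self, source: str, target: str) -> None:
        # Fast-forward only: valid when the target has not moved since branching.
        src, tgt = self.branches[source], self.branches[target]
        if src[:len(tgt)] != tgt:
            raise ValueError("target has diverged; a real merge is needed")
        self.branches[target] = list(src)


repo = ToyRepo()
repo.branch("import/2024.03.04", source="integration")   # 1. branch off integration
delivered = {"vendor_rows": 1200}                         # 2-3. process runs and finishes
repo.commit("import/2024.03.04", delivered)               # 4. commit as a new version
repo.merge("import/2024.03.04", "integration")            # 5. merge into integration
repo.branch("release/2024.03.04", source="integration")   # 6. create release branch

print(repo.branches["release/2024.03.04"][-1])  # {'vendor_rows': 1200}
```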

### 3.5.2. Batch Data Transformation

Let's look at an example of a downstream process that builds on the previous example.

> I will consume data that is being published, transform it, and publish the result. In this case, my transformation is implemented by a pipeline which runs a specific version of code in a code repository. Additionally, in this example, my transformations deal with data quality. The transformed data will ultimately be consumed by a downstream machine learning model, and the machine learning team has helped to define tests which validate that the data is in the correct format and has the correct values. For example, if a column was expected to be boolean but is instead a string, the test will fail and alert the data engineering team that the quality of the data does not satisfy the requirements of the downstream consumer. The data engineering team can then go back and analyze why the transformation of the data failed: it could be that the imported data has changed, that the import process had a bug, or that the transformation had a bug. Once the issue is understood, the code can be corrected, the transformation can be rerun on the original data set, and a corrected version can be produced for downstream consumption.

We can characterize this example as follows:

- Frequency: Updates Once M-F at 8 AM EST
- Persistence Layer & Storage Format: SQL Database and Transaction Logs
- Consistency: Immediate
- Granularity: 100%

Thus our process would be as follows:

<table>
    <tbody>
        <tr>
            <td>
                
1. Branch off of integration branch      
1. Start Import Process
1. Wait for import to finish
1. Commit changes as a new version
1. Perform transformation
1. Commit changes as a new version
1. Merge into integration branch
1. Fix bug in transformation logic
1. Rerun transformation
1. Merge into integration branch
1. Create Release Branch
            </td>
            <td>
                <center><img src="images/dvcs-example-workflow-2.png" style="width: 800px"></center>
            </td>
        </tr>
    </tbody>
</table>



As with traditional VCS, we can see that the DVCS may raise events that inform downstream processes when to kick off. In the example above, we might configure the import pipeline to trigger based on commits to release branches. In fact, the entire workflow of the data engineering team may be automated by the eventing of an upstream DVCS.

**Note**: You might be thinking that there is a lot of overhead in maintaining all these redundant or obsolete copies of data. We will discuss that when we talk about retention later on.
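
As an illustration of the quality test in this example, the sketch below shows a check that a commit event might trigger: a column expected to be boolean is rejected when it arrives as a string. The function names, the row layout, and the `is_active` column are all assumptions made for this sketch.

```python
from typing import Dict, List


def check_schema(rows: List[Dict[str, object]], expected: Dict[str, type]) -> List[str]:
    """Return human-readable violations; an empty list means the gate passes."""
    problems = []
    for i, row in enumerate(rows):
        for column, expected_type in expected.items():
            value = row.get(column)
            # Compare exact types so that the string "true" is not accepted as a bool.
            if type(value) is not expected_type:
                problems.append(
                    f"row {i}: {column!r} is {type(value).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    return problems


def on_commit(rows: List[Dict[str, object]], expected: Dict[str, type]) -> str:
    """Hypothetical handler run when a transformation commit is observed."""
    problems = check_schema(rows, expected)
    if not problems:
        return "merge"
    return "alert data engineering: " + "; ".join(problems)


expected = {"is_active": bool}
print(on_commit([{"is_active": True}], expected))
# merge
print(on_commit([{"is_active": "true"}], expected))
# alert data engineering: row 0: 'is_active' is str, expected bool
```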

### 3.5.3. Realtime Streaming
Now let's look at a real-time data streaming scenario. We will see that it works in exactly the same way (provided we implement certain architectural designs). Consider the following example:


> We receive a data stream from a third party vendor that is active 24 hours a day. The raw data is transformed into the required format for our machine learning predictor. The predictor consumes the data and sends predictions to a downstream consumer who can take actions based on the prediction.

<center><img src="images/dvcs-example-workflow-3.png" style="width: 800px"></center>

We can characterize this example as follows:

- Frequency: Continuous (the stream is active 24 hours a day)
- Persistence Layer & Storage Format: S3 and MP4 files
- Consistency: Eventual
- Granularity: 100%

So, how would we leverage DVCS in this case? The first thing to realize is that the data stream is the transportation format, and thus we must convert the data into its storage format for the persistence layer in the DVCS. The second thing to recognize is that we are applying a logical discretization to the stream when we create versions; in order to rigorously capture data lineage, we will need to communicate this discretization downstream.

Below we see an example implementation. Note that the raw stream is arbitrarily discretized into three commits. The information about where in the stream to discretize is passed downstream so that the predictor can apply DVCS and link its commits to the upstream commits. This information is passed as a concurrent event stream.

<center><img src="images/dvcs-example-workflow-3-2.png" style="width: 800px"></center>


The magic here has to do with how the stream is discretized, and therefore it becomes a question of how the stream is implemented. Generally speaking, the same approach should work with any technology, but below I will examine a message bus and a byte stream.

In the case of a message bus, the transformation node sends messages with transformed data downstream to the predictor node. If the data is already committed to the DVCS, we can tag the messages being sent to the predictor with information about the commit. A more efficient strategy might instead be to communicate VCS events individually, so that we send the lineage information once rather than multiple times. Besides being more efficient, this also accommodates scenarios where we need to forward messages before the data is committed to the DVCS (perhaps because of latency). This will require the nodes to implement a buffer to keep track of messages waiting to be committed. Once the DVCS event is observed, the corresponding messages can be removed from the buffer and committed.

In the case of a byte stream, things are a bit more complicated. Generally speaking, while the data is transferred as bytes, it is actually deserialized / decoded into some type of consumption format. For example, consider the scenario in which we use the H.264 codec to encode and decode video carried in MPEG-TS containers sent over HTTP. The decoded H.264 data is ultimately what gets rendered to the screen by some video player software. This data will ultimately need to be stored in the persistence layer in some format.
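
The message bus strategy above, where messages are forwarded first and lineage arrives later as a separate event, could be sketched as follows. This assumes each message carries a monotonically increasing offset and each DVCS event names the last offset it covers; both conventions, and the class `LineageBuffer`, are assumptions made for this example.

```python
from typing import Dict, List, Tuple


class LineageBuffer:
    """Predictor-side buffer: holds messages until the commit covering them is announced."""

    def __init__(self):
        self.waiting: List[Tuple[int, str]] = []   # (offset in stream, payload)
        self.committed: Dict[str, List[str]] = {}  # upstream commit id -> payloads

    def on_message(self, offset: int, payload: str) -> None:
        self.waiting.append((offset, payload))

    def on_commit_event(self, commit_id: str, last_offset: int) -> None:
        """A DVCS event says: everything up to last_offset belongs to commit_id."""
        covered = [p for off, p in self.waiting if off <= last_offset]
        self.waiting = [(off, p) for off, p in self.waiting if off > last_offset]
        self.committed[commit_id] = covered


buf = LineageBuffer()
for offset, payload in enumerate(["m0", "m1", "m2", "m3"]):
    buf.on_message(offset, payload)  # messages are forwarded before any commit exists

buf.on_commit_event("c1", last_offset=1)
buf.on_commit_event("c2", last_offset=2)

print(buf.committed)  # {'c1': ['m0', 'm1'], 'c2': ['m2']}
print(buf.waiting)    # [(3, 'm3')]
```

The lineage information is sent once per commit rather than once per message, and `m3` simply waits in the buffer until a later event covers it.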

So the question is, how do we store this streamed information to disk? Do we append new data to an ever expanding file? Do we store discrete time segments as separate files? Ultimately we will need some sort of conversion to occur and this conversion logic will vary from solution to solution. 
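
One of the options raised above, storing discrete time segments as separate files, might be sketched like this. It is deliberately naive: chunks are `(timestamp, bytes)` pairs, segment boundaries are fixed time windows, and codec details such as keyframe boundaries are ignored. The function name and the file naming scheme are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List, Tuple


def write_segments(chunks: Iterable[Tuple[float, bytes]], out_dir: Path, seconds: float) -> List[Path]:
    """Append each (timestamp, bytes) chunk to the file for its time window."""
    written: List[Path] = []
    for ts, data in chunks:
        index = int(ts // seconds)  # which time window this chunk falls into
        path = out_dir / f"segment-{index:06d}.bin"
        with path.open("ab") as fh:
            fh.write(data)
        if path not in written:
            written.append(path)
    return written


stream = [(0.5, b"aa"), (3.0, b"bb"), (10.2, b"cc"), (19.9, b"dd"), (20.0, b"ee")]
with tempfile.TemporaryDirectory() as tmp:
    files = write_segments(stream, Path(tmp), seconds=10.0)
    sizes = {f.name: f.stat().st_size for f in files}

print(sizes)
# {'segment-000000.bin': 4, 'segment-000001.bin': 4, 'segment-000002.bin': 2}
```

Each closed segment is an immutable file that can be committed as its own version, whereas the "ever expanding file" alternative would require versioning a single object that never stops changing.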

Note: The transmissions between the transformer node and the predictor node may not use the same format as the transmissions between the source and the transformer. For example, the stream could be an HTTP stream to the transformer and a message bus from the transformer.

# 4. Features Of A DVCS

Below we try to look objectively at the features offered by traditional VCS and make a case for what we should expect from a DVCS. This list will likely evolve; please reach out with any suggestions!

## 4.1. Features Of Traditional VCS

- **Versioning** - The ability to explicitly define a version: an immutable point-in-time representation of a repository's state.

- **Validation/Verification** - The ability to validate that the state information captured by a version accurately represents the actual state being captured.

- **Version History** - The ability to capture the historical evolution of a repository as an immutable and sequential set of versions, i.e. a representation of the historical evolution of a data set.

- **Branching** - The ability to represent parallel or alternate version histories for a given repository.

- **Merging** - The ability to combine states from multiple branches into a new version which represents the combination of the various states being considered.

- **Conflict Detection** - The ability to detect when the states of two branches cannot be merged automatically because the version history is incompatible in some way. For example if one branch destroys a file and another branch updates the file, the merge cannot be made automatically and the user will need to resolve the conflict.

- **Conflict Resolution** - The ability for the user to instruct the VCS which changes to accept into the merge and how the merged state is defined by the merge version.

- **Working Set** - A mutable copy of a repository which is created from a particular version and intended to be committed as a new version at a later date.

- **Reversion** - The ability to restore a working set or branch to a historical version. This reversion should support reverting the state in whole or in part (i.e. a single file or an entire directory).

- **Pruning (Rebase)** - The ability to "forget" or delete a specified version from the version history.

- **Metadata Support** - The ability for the user to attach meaningful information to the changes being recorded in the system (i.e. what the changes were, why they were made, etc.).

- **Diffing** - The ability to show the differences between versions, or between a working set and a version, for the entire repository or for components of the repository.

- **Reviewability** - The ability to review and query metadata attached to the versions in the version history. 

- **Blame Support** - The ability to identify who is responsible for a change to a particular component of the repository.

- **Role Based Access Control** - The VCS needs to be able to define users, roles, and associated permissions. While the underlying storage layer will have its own RBAC, which should be respected by the VCS implementation, the VCS maintains its own objects and data which must also be protected.

- **Atomicity** - The operations of the VCS are atomic.
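
To illustrate the conflict detection feature listed above, here is a simplified three-way comparison over snapshots. A snapshot is modelled as a mapping from path to content hash, with a missing key meaning the file was deleted; this model and the function name `find_conflicts` are assumptions, and real systems also handle renames and content-level merges.

```python
from typing import Dict, Optional, Set

Snapshot = Dict[str, str]  # path -> content hash (a missing key means "deleted")


def find_conflicts(base: Snapshot, ours: Snapshot, theirs: Snapshot) -> Set[str]:
    """Three-way comparison: a path conflicts if both sides changed it differently."""
    conflicts = set()
    for path in set(base) | set(ours) | set(theirs):
        b: Optional[str] = base.get(path)
        o: Optional[str] = ours.get(path)
        t: Optional[str] = theirs.get(path)
        if o != b and t != b and o != t:
            conflicts.add(path)
    return conflicts


base = {"a.csv": "h1", "b.csv": "h2"}
ours = {"b.csv": "h2"}                           # one branch destroys a.csv
theirs = {"a.csv": "h1-updated", "b.csv": "h2"}  # the other branch updates a.csv

print(find_conflicts(base, ours, theirs))  # {'a.csv'}
```

This reproduces the delete-versus-update case from the list: `a.csv` cannot be merged automatically, while `b.csv`, untouched on both sides, merges cleanly.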

## 4.2. Additional Data-Centric Features For DVCS

- **Time Travel** - This is basically another word for version history.

- **Lineage (provenance)** - The ability for a version to explain the context of its creation, i.e. which version of code, version of data, and version of infrastructure produced the version in question.

- **Data Efficiency** - The ability to minimize the number of redundant copies between versions.

    Storing data costs money, and with big data you will see big costs. The DVCS should allow multiple versions to reference the same copy of data without requiring a separate copy for each. Some implementations call this "zero-copy" isolation.

- **Scalability** - The performance of the VCS should not be impacted by changes to the size of the underlying data set or the length of the revision history.

- **Retention Policy** - The ability to define a rule or pattern for when data is automatically pruned. In some cases, the user may prefer only to retain a rolling window of changes, or they may want to only retain one version for every month after a certain point in time.

    Note: The underlying storage layer may implement its own retention policy. If the VCS is built on top of a 3rd party datastore, that policy may interfere with the retention policy of the VCS.

- **Format Agnostic** - Able to support multiple formats or types of data (e.g. text files, binary files, database files, databases).
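
The data efficiency feature above, where multiple versions reference the same copy of data, is commonly achieved with content addressing. The sketch below is one minimal way to do it and is not taken from any specific product; `BlobStore` and its methods are hypothetical names.

```python
import hashlib
from typing import Dict


class BlobStore:
    """Content-addressed store: identical data is kept once, however many versions use it."""

    def __init__(self):
        self.blobs: Dict[str, bytes] = {}              # content hash -> data
        self.versions: Dict[str, Dict[str, str]] = {}  # version -> {path: content hash}

    def commit(self, version: str, files: Dict[str, bytes]) -> None:
        manifest = {}
        for path, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            self.blobs.setdefault(digest, data)  # unchanged files add no new copy
            manifest[path] = digest
        self.versions[version] = manifest

    def read(self, version: str, path: str) -> bytes:
        return self.blobs[self.versions[version][path]]


store = BlobStore()
store.commit("v1", {"big.csv": b"x" * 1000, "small.csv": b"a"})
store.commit("v2", {"big.csv": b"x" * 1000, "small.csv": b"b"})  # only small.csv changed

print(len(store.versions), len(store.blobs))  # 2 3
```

Two versions of two files would naively need four copies; here only three blobs exist because `big.csv` is shared. A retention policy would then amount to deleting a version's manifest and garbage-collecting blobs that no remaining manifest references.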

## 4.3. Operational Features

- **Backup Support** - The VCS needs to enable system admins to perform backups, so that a copy of the entire system can be snapshotted and archived for disaster recovery purposes.

- **Resilience** - As with a datastore, the VCS should be able to recover from failures and provide an uninterrupted service to the users.

- **Durability** - The ability for version history and lineage to survive corruption events (such as transaction failures or bit rot). The VCS should implement or inherit mechanisms to protect against corruption and unintended data loss.

## 4.4. Architectural Features

- **Datastore Integration** - The VCS should be built on top of established, traditional storage layers. In doing so, the VCS will minimize the complexity of its design and maximize the utility provided to users and admins.

- **Highest Level Abstraction** - Speaking in the abstract, data is stored in containers (sometimes in a nested or hierarchical structure). For example, SQL data is stored in a table in a database; JSON files are stored in an S3 bucket in an AWS account. The VCS should be aware of the abstractions provided by the underlying data stores. It should be able to understand and register changes at any number of the lower levels of abstraction within the data store. As such, the DVCS should provide an abstraction that can sit above any other "layer" representing a data source.

# 5. Reference Architecture
After surveying the [available solutions on the market](Available%20Solutions.ipynb), I have designed a [reference architecture](DVCS%20Reference%20Architecture.ipynb) which will enable all of the features listed above. I would recommend considering the reference architecture first, as it helps demystify some of the sales lingo you typically find in glossy marketing materials.

# 6. Available Solutions
I have [surveyed](Available%20Solutions.ipynb) the available solutions on the market and analyzed their feature offerings.