# Overview

The field of version control has been evolving since the early 1960s. Recently, as data science, machine learning, and artificial intelligence (all data-based applications) have taken off, the need to maintain control over data sets has begun to surface and grow. Additionally, the expectation that [databases evolve](https://martinfowler.com/articles/evodb.html) is merging with the DevOps ideology.

Currently, the application of version control to data is relatively new. While there are several options on the market, many are not fully featured yet.

In this article we will:
1. Review the use cases for data version control systems (DVCS)
2. Review the types of data that need versioning
3. Define the full set of features one would expect from a VCS for data 
4. Review the current offerings on the market

# 1. Use cases for DVCS

Tracking Temporal Data

- Time series data

Tracking evolution of the data set

- Ad hoc corrections need to be made to the data
- New or experimental code that performs modifications is being run against a data set

Unlocking rapid / agile development

- Parallel development that introduces breaking changes

Protection

- Bug in an application causes corruption
- Malicious actor makes unauthorized modifications

# 2. Types of Data
Before talking about the various VCS options for data, we need to understand that data comes in different formats and that there likely isn't going to be a single robust solution that accommodates all data types.

Data typically comes in the following flavors:
1. Files / Filesystems
    1. Tables
    2. JSON
2. Objects
3. Raw Block Devices
4. Databases

Now that we understand the basic types of data, let's look at the typical features that version control systems offer.

# 3. Features Of A DVCS

- **Datastore Integration**: The VCS functionality should be built on top of the basic feature set of a traditional storage layer. As such, the VCS needs to be able to integrate with a datastore in order to leverage the datastore's existing functionality. The level of integration or tight coupling with the storage layer may vary; the VCS may be presented as part of the storage layer or as a discrete component residing at a higher level of the technology stack.

- **Backup Support**: The VCS needs to enable system admins to perform backups, so that a copy of the entire system can be snapshotted and archived for disaster recovery purposes.

- **Persistence**: The VCS is able to reliably store versions of data for as long as they are required by the user.

- **Retention**: While the VCS must be able to persist all records of data changes, in some cases the user may prefer to retain only a rolling window of changes. It's also possible that the user may want to specify an alternate retention strategy. The VCS should support the ability to prune its records of versions that are no longer needed or wanted. Note: The underlying storage layer may implement its own retention policy, but depending on the implementation this may interfere with the retention policy of the VCS.

- **Highest Level Abstraction**: Speaking in the abstract, data is stored in containers (sometimes in a nested or hierarchical structure). For example, SQL data is stored in a table in a database; JSON files are stored in an S3 bucket in an AWS account. The VCS should be aware of the abstractions provided by the underlying data store. It should be able to understand and register changes at any of the lower levels of abstraction within the data store.

- **Resilience** - As with a datastore, the VCS should be able to recover from failures and provide an uninterrupted service to the users.

- **Metadata Support** - The VCS should allow the user to attach meaningful information to the changes being recorded in the system (i.e. what the changes were, why they were made, etc.). With a traditional VCS we are able to see the user who committed the change, a description of the change, and a change list associated with the commit. Data pipelines, however, are a bit more complicated and robust than human-driven edits to plain text files. For example, data may be transformed through a DAG expressed as a pipeline or by a transformation applied as part of a stream. In both of these instances the transformation is implemented via an instance of code. The VCS should be aware of the process which is orchestrating and/or applying a particular change to a data set. For example, if there is a pipeline run, the VCS should be able to create a link between the version of code and the instance of the pipeline which performed the transaction. This additional

- **Reversion**: The ability to restore a working data set to a historical version. This reversion should support reverting the data in whole or in part (i.e. a single file or an entire directory).

- **Diffing**: The VCS should provide the ability to show the differences between recorded changes. Some of this functionality can be offloaded in part to the underlying storage implementation. For example, if a JSON file changed, there are already diff tools which can compare plain text files; if a database file changed, one may need to leverage the underlying database tooling to do a comparison.

- **Reviewable** - The revision history should be accessible to the user. The VCS should also allow for queries to be executed against the metadata attached to the revisions in the history.

- **Data Agnostic** - The VCS is able to support multiple types of data (e.g. text files, binary files, database files).

- **Data Lineage** - The ability to record and link changes to data, including moves between containers (e.g. rows being added to a table, a column moving from one table to another, a file moving to a new folder).

- **Efficient Branching and Merging** - The ability to define and merge branches is present in traditional VCS. With data, there is a concern about data duplication and its cost. The DVCS should be able to define and manage branches while minimizing the overall data footprint. For example, if two branches are based on the same version of data, there should only be one copy of that data.

- **Blame Support** - Similar to traditional VCS the DVCS should allow a user to identify who is responsible for a particular change in the data.

- **Conflict Detection and Resolution** - The VCS needs to be able to detect and remediate instances of merge conflicts (when changes in two branches are incompatible or contradict each other). This functionality, like diffing, may be offloaded in part to the underlying storage layer or an established third-party tool.

- **Governance and RBAC** - The VCS needs to be able to define users, roles, and associated permissions. While the underlying storage layer will have its own RBAC, which should be respected by the VCS implementation, the VCS maintains its own objects and data which must also be protected.

- **Working Set Support** - Provide the user the ability to access and transform data without committing a data snapshot to the revision history.
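
Several of the features above (reversion, diffing, and branching without duplicating data) can be illustrated with a toy, in-memory sketch. This is not how any particular product works; the class `ToyVersionStore` and its method names are hypothetical, and the only assumption is that file contents are stored once under their SHA-256 hash while commits and branches are just pointers.

```python
import hashlib


class ToyVersionStore:
    """In-memory sketch: blobs are stored once, commits map paths to blob hashes."""

    def __init__(self):
        self.blobs = {}                 # blob hash -> bytes (each content stored once)
        self.commits = {}               # commit id -> {"parent", "message", "tree"}
        self.branches = {"main": None}  # branch name -> commit id

    def commit(self, branch, files, message):
        """Record a snapshot of `files` (path -> bytes) on `branch`."""
        tree = {}
        for path, content in sorted(files.items()):
            digest = hashlib.sha256(content).hexdigest()
            self.blobs.setdefault(digest, content)  # deduplicate identical content
            tree[path] = digest
        parent = self.branches[branch]
        commit_id = hashlib.sha256(
            repr((parent, message, sorted(tree.items()))).encode()
        ).hexdigest()
        self.commits[commit_id] = {"parent": parent, "message": message, "tree": tree}
        self.branches[branch] = commit_id
        return commit_id

    def branch(self, new_branch, from_branch):
        """Create a branch by copying a pointer, not the data."""
        self.branches[new_branch] = self.branches[from_branch]

    def diff(self, old_id, new_id):
        """Return paths added, removed, and modified between two commits."""
        old, new = self.commits[old_id]["tree"], self.commits[new_id]["tree"]
        return {
            "added": sorted(new.keys() - old.keys()),
            "removed": sorted(old.keys() - new.keys()),
            "modified": sorted(p for p in old.keys() & new.keys() if old[p] != new[p]),
        }

    def checkout(self, commit_id):
        """Materialize the files of a historical commit (reversion)."""
        tree = self.commits[commit_id]["tree"]
        return {path: self.blobs[digest] for path, digest in tree.items()}


store = ToyVersionStore()
v1 = store.commit("main", {"a.csv": b"1,2\n", "b.csv": b"3,4\n"}, "initial load")
store.branch("experiment", "main")
v2 = store.commit("experiment", {"a.csv": b"1,2\n", "b.csv": b"3,5\n"}, "fix b")
print(store.diff(v1, v2))   # only b.csv is reported as modified
print(len(store.blobs))     # a.csv is shared by both branches, so 3 blobs in total
```

Because the unchanged `a.csv` is stored once and referenced from both commits, creating the `experiment` branch costs one pointer rather than a second copy of the data.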


# 4. Current Market Solutions

## 4.1. The Classic DIY Approach
The classical approach has been for DBAs and system admins to maintain their own mechanisms for managing the various versions of a data set as it evolves over time. These solutions are generally built on top of the datastore, consist of a patchwork of code and third-party solutions, and implement only a subset of the features expected of a DVCS.

Generally speaking there are two methods used in parallel:
1. Taking system "snapshots" or backups (of the entire system or enough to rebuild the system manually) periodically via automation or manual efforts
2. Integrating versioning metadata into the underlying schema of the data. This can be done by injecting information into file names or table names (e.g. my_file_20203_v2, or my_table becomes my_table_2022 or mytable_v1).

But these approaches don't provide all the functionality that a robust VCS would provide.
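
As a rough illustration of the two DIY methods, the Python sketch below copies a data directory into a timestamp-named snapshot folder and prunes all but the newest few. The function names `take_snapshot` and `prune_snapshots`, the naming scheme, and the sample paths are assumptions made for this example, not a recommended tool.

```python
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def take_snapshot(data_dir, snapshot_root, when):
    """Copy data_dir into a timestamp-named folder under snapshot_root."""
    target = Path(snapshot_root) / f"{Path(data_dir).name}_{when:%Y%m%dT%H%M%S}"
    shutil.copytree(data_dir, target)
    return target


def prune_snapshots(snapshot_root, keep):
    """Keep only the `keep` newest snapshots (names sort chronologically)."""
    snapshots = sorted(p for p in Path(snapshot_root).iterdir() if p.is_dir())
    stale = snapshots[:-keep] if keep > 0 else snapshots
    for old in stale:
        shutil.rmtree(old)
    return [p.name for p in stale]


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "my_table"
    data.mkdir()
    (data / "rows.csv").write_text("id,value\n1,a\n")
    snaps = Path(tmp) / "snapshots"
    snaps.mkdir()
    # Three daily snapshots, then a rolling window of two.
    for day in (1, 2, 3):
        take_snapshot(data, snaps, datetime(2022, 1, day, tzinfo=timezone.utc))
    removed = prune_snapshots(snaps, keep=2)
    remaining = sorted(p.name for p in snaps.iterdir())

print(removed)
print(remaining)
```

Every snapshot here is a full copy, and nothing records what changed or why, which is exactly the gap between this approach and a full DVCS.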

### 4.1.1. Example Using MS SQL Server

A walkthrough of database versioning with examples can be found [here](https://blog.devart.com/database-versioning-with-examples.html).

### 4.1.2. "As Of" Syntax
SQL also offers "as of" syntax. It has been part of the SQL standard since 2011 (pun intended) and is supported by Oracle, Microsoft SQL Server, and MariaDB. With this syntax you can configure a table to be versioned (a "temporal table") and then query a historical version using `AS OF <timestamp>`. You don't get diffs and merges, but you do get instant rollbacks.
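
To make the idea concrete without a database server, the sketch below emulates an "as of" query in SQLite, which has no native temporal tables. The `price_history` table, the `valid_from`/`valid_to` columns, and the helper names are assumptions for this example; in a database with system-versioned tables the engine maintains the validity period itself and the lookup is expressed with the `AS OF` clause instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE price_history ("
    " item TEXT, price REAL, valid_from TEXT, valid_to TEXT)"
)

FAR_FUTURE = "9999-12-31T00:00:00"  # marks the row that is currently valid


def set_price(item, price, at):
    """Close the current row for `item` and open a new one starting at `at`."""
    conn.execute(
        "UPDATE price_history SET valid_to = ? WHERE item = ? AND valid_to = ?",
        (at, item, FAR_FUTURE),
    )
    conn.execute(
        "INSERT INTO price_history VALUES (?, ?, ?, ?)",
        (item, price, at, FAR_FUTURE),
    )


def price_as_of(item, at):
    """Return the price that was current at timestamp `at`, or None."""
    row = conn.execute(
        "SELECT price FROM price_history"
        " WHERE item = ? AND valid_from <= ? AND ? < valid_to",
        (item, at, at),
    ).fetchone()
    return row[0] if row else None


set_price("widget", 10.0, "2022-01-01T00:00:00")
set_price("widget", 12.5, "2022-06-01T00:00:00")

print(price_as_of("widget", "2022-03-15T00:00:00"))  # price before the June change
print(price_as_of("widget", "2022-07-01T00:00:00"))  # price after the June change
```

The ISO 8601 timestamps are compared as strings, which works here because they all share the same fixed-width format.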

## 4.2. Solutions For Popular Databases

Martin Fowler wrote this [article on Evolutionary Database Design](https://martinfowler.com/articles/evodb.html) outlining some of the problems and features one would expect. Some of the solutions below address the issues and designs being proposed.

In general, when we talk about applying version control to a database, we are talking holistically about all the elements of a database, including:
- Database Schema
- Database Object Definitions
- Database Data
- Database RBAC (roles, privileges, etc.)

As we may have inferred from the custom-built solutions section, we may need to apply a different tool or technique to each component of the database. Below we look at examples that try to capture many or all of these elements in one solution.

### 4.2.1. Dolt (MySQL Databases)

Dolt is a MySQL-compatible database that implements a git-like API which provides version control for tables and table rows.

According to its homepage:
> Dolt is the first  and only SQL database that you can fork, clone, branch, merge, push and pull just like a Git repository. Dolt is a version controlled database. Dolt is Git for Data. Dolt is a Versioned MySQL Replica.
> 
> Dolt implements the Git command line and associated operations on table rows instead of files. Data and schema are modified in the working set using SQL. When you want to permanently store a version of the working set, you make a commit. In SQL, Dolt implements Git read operations (ie. diff, log) as system tables and write operations (ie. commit, merge) as stored procedures. Dolt produces cell-wise diffs and merges, making data debugging between versions tractable. Dolt is the only SQL database that has branches and merges.
>
>You can run Dolt online, like you would PostgreSQL or MySQL. Or you can run Dolt offline, treating data and schema like source code.
>
> https://docs.dolthub.com/introduction/what-is-dolt

With the addition of its support for MySQL/MariaDB binlog replication, Dolt can be attached to a primary database as a replica and can provide versioning for that primary.

Dolt handles write operations through stored procedures and will present the difference between data versions cell-wise. For more information we can see the [following article](https://cult.honeypot.io/reads/dolt-a-sql-database-that-works-like-git).
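
A minimal sketch of what this looks like from application code is shown below. It assumes a running Dolt SQL server and a DB-API cursor obtained from a MySQL client library that uses `%s` placeholders; the function name `commit_and_inspect` is hypothetical. The `DOLT_COMMIT` stored procedure and the `dolt_log` and `dolt_diff_<table>` system tables are the Dolt features described in the quote above.

```python
def commit_and_inspect(cursor, table, message):
    """Commit the working set, then read history and row-level changes.

    `cursor` is any DB-API cursor connected to a running Dolt SQL server.
    `table` must be a trusted identifier because it is interpolated into SQL.
    """
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    # Write operation: stage all changed tables and commit via a stored procedure.
    cursor.execute("CALL DOLT_COMMIT('-a', '-m', %s)", (message,))
    # Read operation: commit history is exposed as a system table.
    cursor.execute("SELECT commit_hash, committer, message FROM dolt_log LIMIT 5")
    history = cursor.fetchall()
    # Read operation: per-table, cell-wise diffs live in dolt_diff_<table>.
    cursor.execute(f"SELECT * FROM `dolt_diff_{table}` LIMIT 20")
    changes = cursor.fetchall()
    return history, changes
```

Since everything goes through ordinary SQL statements, any existing MySQL tooling can drive the versioning workflow without a Dolt-specific client.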

### 4.2.2. VersionSQL (MS SQL Server)

VersionSQL is a plugin for the SQL Server Management Studio (SSMS) IDE. It allows DBAs to track and version changes to database code files from the SSMS interface. The plugin connects to remote version control hosts like GitHub.

https://www.versionsql.com/sql-server-database/

This may also work for other databases like MySQL, as SSMS is able to connect to multiple servers: https://docs.devart.com/odbc/mysql/microsoft_sql_server_manager_s.htm


## 4.3. Solutions For File Data

### Git-LFS
Git Large File Storage (LFS) replaces large files such as audio samples, videos, datasets, and graphics with text pointers inside Git, while storing the file contents on a remote server implementing the git-lfs protocol (like GitHub.com or GitHub Enterprise).
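
The text pointer that Git stores in place of the large file is small and human readable. The sketch below builds one in Python following the Git LFS pointer format (a spec version line, the SHA-256 of the content, and its size in bytes); the helper name `lfs_pointer` is hypothetical, and in practice the `git lfs` client generates these files for you.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Build the text of a Git LFS pointer file for `content`."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


# A 1 MiB payload is represented in Git by three short lines of text.
print(lfs_pointer(b"\x00" * 1024 * 1024))
```

Only this pointer is versioned by Git; the payload itself is fetched from the LFS server by its hash when the file is checked out.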

The issue with git-lfs is that, like git, it uses the client-server model. This means there is a local copy of the data and a remote copy, and these need to be synchronized. This will not scale well when the data store reaches a massive size or a massive file count: copying down terabytes of information, or even monitoring and managing that footprint, is impractical on a single machine. Additionally, this model creates a potential bottleneck and requires the user to operate on top of a POSIX API.

To mitigate the issue of having to sync to and from the server, the user could elect to use a networked filesystem (like Ceph) and mount it rather than working off of a local filesystem. This would provide a scalable mechanism for syncing the data, but it would not resolve the issue of concurrency: multiple users would be pointing at the same branch. Some networked file systems have the ability to create subvolumes, atomic layers, or branches that can be independently mounted. In this way we could give multiple users the ability to work independently, in parallel, on their own forked copies of the data.

https://git-lfs.com/

### Git-annex

[Git-annex](https://git-annex.branchable.com/) is similar to git-lfs: it allows managing large files with git without storing the file contents in git. It is an older solution than git-lfs and supports a wide range of backends, whereas git-lfs only supports a git-lfs backend. That being said, it has some quirks between Windows and Linux use cases. More information on the differences can be found [here](https://stackoverflow.com/questions/39337586/how-do-git-lfs-and-git-annex-differ) and [here](https://www.perforce.com/manuals/gitswarm-ee/workflow/lfs/migrate_from_git_annex_to_git_lfs.html).

### DVC

DVC stands for (Not Just) Data Version Control. The reason for this is that DVC offers versioning for the entire data science and machine learning workflow. Its homepage can be found [here](https://dvc.org/). It boasts that DVC can version source code, data, machine learning experiments, and machine learning models. These latter features will be left for another article discussing those topics.

With respect to versioning data, DVC is a lot like git-lfs, and it integrates directly with git. The basic function is to tell git to ignore a data directory and to tell DVC to track that data directory. DVC will add files to the git repository that act as pointers to the actual data. The DVC command line is very similar to git, meaning the concepts of branches, staging, commits, push, and pull still apply.

One of the advantages of DVC over git-lfs is that it can store data in a wider array of backend storage repositories. DVC supports Amazon S3, Microsoft Azure Blob Storage, Google Drive, Google Cloud Storage, Aliyun OSS, SSH/SFTP, HDFS, HTTP, network-attached storage, or local disk to store file contents.

A helpful walkthrough can be found [here](https://towardsdatascience.com/introduction-to-dvc-data-version-control-tool-for-machine-learning-projects-7cb49c229fe0).


Note: DVC faces similar issues to git-lfs, and the workaround described in that section applies here as well.

### Hangar
A Python package which implements the DVC functionality. More information can be found [here](https://opendatascience.com/introduction-to-data-version-control/).

### LakeFS
LakeFS is a git-like solution that sits on top of S3 (Azure support is on the roadmap). It works by providing a metadata layer on top of the "real files". It keeps track of which "real files" are part of which version, as well as what operations were performed between versions. This is very nice because there is no data duplication and none of the client/server synchronization that you might see with git-lfs or DVC.

For more information on LakeFS see [this article](./LakeFS/Intro%20To%20LakeFS.ipynb).
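
One practical consequence of the metadata layer is how objects are addressed. The sketch below assumes LakeFS's S3-compatible endpoint, where the repository takes the place of the bucket and the branch (or commit) is the first component of the key; the helper name `lakefs_s3_address` and the sample repository and paths are hypothetical.

```python
def lakefs_s3_address(repository, ref, path):
    """Map (repository, branch-or-commit, path) to S3-style Bucket/Key arguments."""
    return {"Bucket": repository, "Key": f"{ref}/{path.lstrip('/')}"}


# The same logical file, read from two different branches.
print(lakefs_s3_address("example-repo", "main", "tables/users.parquet"))
print(lakefs_s3_address("example-repo", "experiment", "tables/users.parquet"))
```

Under that assumption, an S3 client pointed at the LakeFS endpoint could pass these arguments to its usual get and put calls, so switching branches is a matter of changing a path prefix rather than copying data.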

### Pachyderm
Pachyderm is a data science platform built on Kubernetes. It is able to integrate with standard tools for CI/CD, logging, authentication, and data APIs, making it scalable and flexible. The official documentation can be found [here](https://docs.pachyderm.com/latest/overview/).

The [Pachyderm File System (PFS)](https://docs.pachyderm.com/latest/overview/intro-data-versioning/) provides the basic versioning mechanism. Similar to git, a CLI is used to push, pull, branch, commit, and merge file-based data.

Because the Pachyderm version control system is integrated with the data pipeline offerings, the platform is able to provide [data provenance](https://docs.pachyderm.com/latest/concepts/data-concepts/provenance/): an understanding of the evolution of data through time and between repositories as pipelines run.

Pachyderm offers a commercial cloud offering as well as a FOSS download for self hosting.


### Deep Lake (Formerly Activeloop Hub)

Deep Lake is a data lake implementation designed for deep learning. According to the documentation:

> Deep Lake (formerly known as Activeloop Hub) is a data lake for deep learning applications. Our open-source dataset format is optimized for rapid streaming and querying of data while training models at scale, and it includes a simple API for creating, storing, and collaborating on AI datasets of any size. It can be deployed locally or in the cloud, and it enables you to store all of your data in one place, ranging from simple annotations to large videos.
> 
> https://github.com/activeloopai/deeplake

According to the [documentation](https://github.com/activeloopai/deeplake), Deep Lake is implemented as a Python package which is able to upload, download, and stream datasets to and from AWS S3 or S3-compatible storage, GCP, Activeloop cloud, or local storage.

Deep Lake [dataset version control](https://docs.activeloop.ai/getting-started/dataset-version-control#how-to-use-version-control-in-deep-lake) allows you to manage changes to datasets with commands very similar to Git. Its API implements the familiar concepts of commits, branches, merges, and diffing.

A shortcoming of this technology is that the APIs for loading data from the data store into models are only available for PyTorch and TensorFlow.

### Kamu

Kamu is a solution for managing, transforming, and collaborating on structured data.

In a larger context, kamu is a reference implementation of Open Data Fabric - a Web 3.0 protocol for providing timely, high-quality, and verifiable data for data science, smart contracts, web and applications.

Kamu maintains an unbreakable lineage and provenance trail in tamper-proof metadata, which lets you assess the trustworthiness of data, no matter how many hands and transformation steps it went through.
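
The general idea behind tamper-evident metadata can be sketched with a hash chain, where each metadata block records the hash of the block before it. This is only an illustration of the concept and makes no claim about how Kamu or Open Data Fabric actually encode their metadata; the function names and the event fields are hypothetical.

```python
import hashlib
import json


def _digest(prev_hash, event):
    """Hash a block body deterministically."""
    body = {"prev_hash": prev_hash, "event": event}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_block(chain, event):
    """Append `event` to the chain, linking it to the previous block's hash."""
    prev_hash = chain[-1]["hash"] if chain else None
    block = {"prev_hash": prev_hash, "event": event, "hash": _digest(prev_hash, event)}
    chain.append(block)
    return block["hash"]


def verify(chain):
    """Return True if no block has been altered or reordered."""
    prev_hash = None
    for block in chain:
        if block["prev_hash"] != prev_hash:
            return False
        if block["hash"] != _digest(block["prev_hash"], block["event"]):
            return False
        prev_hash = block["hash"]
    return True


chain = []
append_block(chain, {"op": "add_data", "rows": 100})
append_block(chain, {"op": "transform", "query": "SELECT * FROM raw"})
ok_before = verify(chain)

chain[0]["event"]["rows"] = 999  # tamper with recorded history
ok_after = verify(chain)
print(ok_before, ok_after)
```

Editing any earlier event changes its hash and breaks the link to every later block, which is what makes after-the-fact modification detectable. Note that this simple check does not detect blocks removed from the end of the chain.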

It is currently in an [experimental state](https://docs.kamu.dev/odf/).

According to the documentation:

> Kamu achieves for data what Version Control Systems did for Software , but does so without diffs, versioning, or snapshotting.
>
> Our new paradigm streamlines collaboration on data within your company, and enables the effect similar to Open Source Software revolution for data globally.
> 
> How it works
>
> 1. We turn data into a ledger
>
> Data preserves complete history, and never updated destructively. Trust is anchored at the publisher, so they can be always held accountable for data they provide.
>
> 2. Datasets are registered on the network
>
> As a publisher you don’t need to move data into any central point. You maintain complete ownership and control.
>
> 3. People process data using special SQL code
> 
> Our decentralized ETL pipelines can span across teams, organizations, and even continents. People can collaborate on cleaning and enriching data and confidently reuse data from any step.
>
> 4. Data flows in near real-time
> 
> Our streaming SQL engines process data within seconds, continuously and autonomously. All of your science projects, dashboards, and automation get the fidelity of stock tickers data.
>
> 5. Accountability, verifiability, and provenance built-in
>
> Our SQL has the properties of Smart Contracts, so you can trace every single data cell to its source, and easily tell who processed it and how.
>
> https://www.kamu.dev/

Kamu is a single-binary utility that comes bundled with most of its dependencies. The binary runs on Linux, Windows, and macOS.



### Quilt

Quilt is an AWS-based commercial solution for creating and managing data packages. It uses a Python API and [AWS infrastructure](https://docs.quiltdata.com/architecture) (like Lambda, S3, CloudTrail, Elasticsearch, Athena, etc.) to host the backend data store. Quilt manages data like code by creating a metadata layer on top of S3.

According to the documentation:

> All package metadata and data are stored in your S3 buckets. A slice of the package-level metadata, as well as S3 object contents, are sent to an ElasticSearch cluster managed by Quilt. All Quilt package manifests are accessible via SQL using AWS Athena.
>
> https://docs.quiltdata.com/architecture
>
> Quilt packages are one level of abstraction above S3 object versions. Object versions track mutations to a single file, whereas a quilt package references a collection files and assigns this collection a unique version.
>
> https://docs.quiltdata.com/more/faq#how-does-quilt-versioning-relate-to-s3-object-versioning
>
> In Quilt, S3 buckets are analogous to branches in git. Each bucket is a self-contained registry for one or more packages. As package data and schemas are refined, you can promote a package to a new bucket to signify its increased data quality.
>
> https://docs.quiltdata.com/mentalmodel#buckets-are-branches

Quilt does provide the ability to look at revision history (commits) and diff various commits to show the changes. This functionality, however, is very different from git, as it is provided via a Python API. Additionally, branch merging does not appear to be fully featured; merges are implemented as an update, but the history of the source of the merge is not included in the new repo.


### Project Nessie

Nessie enables you to maintain multiple versions of your data tables and leverage a Git-like workflow (using branches, commits, tags, etc.).

Nessie is an OSS service (REST API and browser UI) and set of libraries. The service is built on Java, leverages Quarkus, and is compiled to a GraalVM native image.

Nessie extends existing table formats to provide a single serial view of transaction history. This is enabled across an unlimited number of tables, meaning that a transaction affecting multiple tables is able to be catalogued.

Nessie enhances the following table formats with version control techniques:

- Apache Iceberg (tables and views)
- Delta Lake

**Note**: Delta Lake support in Nessie requires some minor modifications to the core Delta libraries. This patch is still ongoing; in the meantime, Nessie will not work on Databricks and must be used with the open source Delta. Nessie is able to interact with Delta Lake by implementing a custom version of Delta’s LogStore interface. This ensures that all filesystem changes are recorded by Nessie as commits. The benefit of this approach is that the core ACID primitives are handled by Nessie. The limitations around concurrency that Delta would normally have are removed: any number of readers and writers can simultaneously interact with a Nessie-managed Delta Lake table.
https://projectnessie.org/tools/deltalake/

Changes to the contents of the data lake are recorded in Nessie as commits without copying the actual data.

Nessie offers a CLI that is installed through a pip package but it is [not currently fully featured](https://projectnessie.org/tools/).

## 4.4. Solutions Providing A Subset Of Functionality

The following is a list of data science platforms that have versioning components but do not provide true version control for data.

### Neptune

Neptune is a metadata store that offers experiment tracking and model registry for machine learning researchers and engineers. With Neptune, you can log, query, manage, display, and compare all your model metadata in a single place. 

Some users may elect to include the training and test data sets as artifacts stored in the metadata repository. While this is like version control, it is not version control for data.

https://docs.neptune.ai/

### DagsHub

DagsHub is a cloud-based data science and machine learning platform.

Behind the scenes, DagsHub uses DVC for versioning data. More information can be found [here](https://dagshub.com/docs/experiment_tutorial/2_data_versioning/).

### Delta Lake
Delta Lake is a data lakehouse implementation. It provides time travel, a form of version control.