# Overview

In this notebook we review the various solutions and design patterns that exist on the market. For a few solutions we will also compare their architecture with the reference architecture we proposed in [this notebook](DVCS%20Reference%20Architecture.ipynb).

# Recommendation

As the comparison matrix below shows, there is currently no "one size fits all" solution. Every solution I have come across has its own limitations and use cases, so a recommendation depends on the circumstances. That being said, I would generally recommend sticking with a big data solution. LakeFS and Pachyderm have had great support thus far.

Some of these solutions can however fit into the Reference Architecture outlined in [this notebook](DVCS%20Reference%20Architecture.ipynb).

## Solution Feature Comparison

<center><img src="images/dvcs-feature-matrix.png"></center>

See the [accompanying Excel file](dvcs-feature-comparison.xlsx) to contribute new information.

# Solutions Overview

## 1.1. The Classic DIY Approach
The classic approach has been for DBAs and system admins to maintain their own mechanisms for managing the various versions of a data set as it evolves over time. These solutions are generally built on top of the datastore, consist of a patchwork of code and third-party solutions, and implement only a subset of the features expected of a DVCS.

Generally speaking there are two methods used in parallel:
1. Taking system "snapshots" or backups (of the entire system, or enough of it to rebuild the system manually) periodically, either through automation or by hand.
2. Integrating versioning metadata into the underlying schema of the data. This can be done by injecting information into file names or table names (e.g. `my_file_20203_v2`, or `my_table` becomes `my_table_2022` or `mytable_v1`).

However, these approaches don't provide all the functionality that a robust VCS would provide.
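
The naming convention from item 2 can be sketched in a few lines of Python. This is a minimal illustration, assuming a hypothetical `<base>_<YYYYMMDD>_v<N>` pattern; the pattern and the function names `versioned_name`, `parse_versioned_name`, and `latest` are my own and are not taken from any tool.

```python
import re
from datetime import date, datetime
from typing import Iterable, Optional, Tuple

# Hypothetical convention (an assumption for this sketch): <base>_<YYYYMMDD>_v<N>
_PATTERN = re.compile(r"^(?P<base>.+)_(?P<stamp>\d{8})_v(?P<version>\d+)$")


def versioned_name(base: str, stamp: date, version: int) -> str:
    """Build a file or table name that carries its own version metadata."""
    return f"{base}_{stamp:%Y%m%d}_v{version}"


def parse_versioned_name(name: str) -> Optional[Tuple[str, date, int]]:
    """Recover (base, date, version) from a name, or None if it does not match."""
    match = _PATTERN.match(name)
    if match is None:
        return None
    try:
        stamp = datetime.strptime(match.group("stamp"), "%Y%m%d").date()
    except ValueError:  # eight digits that are not a real calendar date
        return None
    return match.group("base"), stamp, int(match.group("version"))


def latest(names: Iterable[str]) -> Optional[str]:
    """Pick the newest name, ordering by date first and version second."""
    best = None
    for name in names:
        parsed = parse_versioned_name(name)
        if parsed is None:
            continue  # names outside the convention are ignored
        key = (parsed[1], parsed[2])
        if best is None or key > best[0]:
            best = (key, name)
    return best[1] if best else None


names = ["my_table_20211201_v9", "my_table_20220131_v1", "my_table_20220131_v2", "notes"]
newest = latest(names)
```

Everything a VCS would normally track (history, authorship, diffs) has to be inferred from the name alone here, which is the limitation described above.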

### 1.1.1. Example Using MS SQL Server

https://blog.devart.com/database-versioning-with-examples.html

### 1.1.2. "As Of" Syntax
SQL also offers an "as of" syntax. It has been part of the SQL standard as of 2011 (pun intended) and is supported by Oracle, Microsoft SQL Server, and MariaDB. With this syntax you can configure a table to be versioned (a "temporal table") and then query a past version using `AS OF <timestamp>`. You don't get diffs or merges, but you do get instant rollbacks.
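
To make the idea concrete, here is a rough sketch that emulates an "as of" query by hand, using Python's built-in `sqlite3` module and explicit validity columns. It does not use the native temporal-table syntax of the databases named above; the table `price_history`, its columns, and the helpers `set_price` and `price_as_of` are all hypothetical names chosen for illustration.

```python
import sqlite3

# Sentinel meaning "this row is still the current version" (an assumption of this sketch).
OPEN_END = "9999-12-31T00:00:00"

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE price_history (
        item       TEXT NOT NULL,
        price      REAL NOT NULL,
        valid_from TEXT NOT NULL,  -- ISO-8601 timestamps sort correctly as text
        valid_to   TEXT NOT NULL
    )
    """
)


def set_price(item: str, price: float, at: str) -> None:
    """Close the current row for the item and open a new one, never overwriting history."""
    conn.execute(
        "UPDATE price_history SET valid_to = ? WHERE item = ? AND valid_to = ?",
        (at, item, OPEN_END),
    )
    conn.execute(
        "INSERT INTO price_history VALUES (?, ?, ?, ?)",
        (item, price, at, OPEN_END),
    )


def price_as_of(item: str, at: str):
    """Return the price that was valid at the given timestamp, or None."""
    row = conn.execute(
        "SELECT price FROM price_history "
        "WHERE item = ? AND valid_from <= ? AND ? < valid_to",
        (item, at, at),
    ).fetchone()
    return row[0] if row else None


set_price("widget", 10.0, "2023-01-01T00:00:00")
set_price("widget", 12.5, "2023-06-01T00:00:00")
old_price = price_as_of("widget", "2023-03-15T00:00:00")
```

A database with native support maintains the validity columns for you; the query shape (pick the row whose validity interval contains the timestamp) is the part this sketch is meant to show.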

## 1.2. Solutions For Popular Databases

Martin Fowler wrote this [article on Evolutionary Database Design](https://martinfowler.com/articles/evodb.html) outlining some of the problems and features one would expect. Some of the solutions below address the issues and designs being proposed.

In general, when we talk about applying version control to a database holistically, we are talking about all the elements of a database, including:
- Database Schema
- Database Object Definitions
- Database Data
- Database RBAC (roles, privileges, etc.)

As we may have inferred from the DIY section above, we may need to apply a different tool or technique to each component of the database. Below we look at examples that try to capture many or all of these elements in one solution.

### 1.2.1. Dolt (MySQL Databases)

Dolt is a MySQL-compatible database that implements a Git-like API, providing version control for tables and table rows.

According to its homepage:
> Dolt is the first and only SQL database that you can fork, clone, branch, merge, push and pull just like a Git repository. Dolt is a version controlled database. Dolt is Git for Data. Dolt is a Versioned MySQL Replica.
> 
> Dolt implements the Git command line and associated operations on table rows instead of files. Data and schema are modified in the working set using SQL. When you want to permanently store a version of the working set, you make a commit. In SQL, Dolt implements Git read operations (ie. diff, log) as system tables and write operations (ie. commit, merge) as stored procedures. Dolt produces cell-wise diffs and merges, making data debugging between versions tractable. Dolt is the only SQL database that has branches and merges.
>
> You can run Dolt online, like you would PostgreSQL or MySQL. Or you can run Dolt offline, treating data and schema like source code.
>
> https://docs.dolthub.com/introduction/what-is-dolt

With the addition of support for MySQL/MariaDB binlog replication, Dolt can be attached to a primary database as a replica and provide versioning for that primary.

Dolt handles write operations through stored procedures and presents the difference between data versions cell-wise. For more information, see the [following article](https://cult.honeypot.io/reads/dolt-a-sql-database-that-works-like-git).
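
To illustrate what a cell-wise diff means, here is a small sketch in plain Python. It is not Dolt's implementation or API: tables are modelled as dictionaries keyed by primary key, and the function name `cell_diff` and the sample rows are made up for the example.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]
Table = Dict[Any, Row]  # primary key -> row (a simplification for this sketch)
Change = Tuple[str, Any, Optional[str], Any, Any]  # (kind, key, column, before, after)


def cell_diff(old: Table, new: Table) -> List[Change]:
    """Compare two versions of a table and report changes per row and per cell."""
    changes: List[Change] = []
    for key in sorted(set(old) | set(new)):
        if key not in old:
            changes.append(("row added", key, None, None, new[key]))
        elif key not in new:
            changes.append(("row removed", key, None, old[key], None))
        else:
            # The row exists in both versions, so compare it column by column.
            for column in sorted(set(old[key]) | set(new[key])):
                before = old[key].get(column)
                after = new[key].get(column)
                if before != after:
                    changes.append(("cell changed", key, column, before, after))
    return changes


v1 = {1: {"name": "Ada", "team": "core"}, 2: {"name": "Bob", "team": "docs"}}
v2 = {1: {"name": "Ada", "team": "infra"}, 3: {"name": "Cy", "team": "core"}}

diff = cell_diff(v1, v2)
for change in diff:
    print(change)
```

Because the unit of comparison is the cell rather than the file or the whole row, a reviewer sees that only `team` changed for row 1, which is what makes debugging between versions tractable.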

### 1.2.2. VersionSQL (MS SQL Server)

VersionSQL is a plugin for the SQL Server Management Studio (SSMS) IDE. It allows DBAs to track and version changes to database code files from the SSMS interface. The plugin connects to remote version control systems like GitHub.

https://www.versionsql.com/sql-server-database/

This may work for other databases like MySQL as SSMS is able to connect to multiple servers. https://docs.devart.com/odbc/mysql/microsoft_sql_server_manager_s.htm


## 1.3. Solutions For File Data

### 1.3.1. Small Data Solutions

#### DataLad

DataLad is a free and open source distributed data management system that keeps track of your data, creates structure, ensures reproducibility, supports collaboration, and integrates with widely used data infrastructure.
https://www.datalad.org/

A DataLad dataset is a directory with files, managed by DataLad. You can link other datasets, known as subdatasets, and perform commands recursively across an arbitrarily deep hierarchy of datasets. This helps you to create structure while maintaining advanced provenance capture abilities, versioning, and actionable file retrieval.

Building on top of Git and git-annex, DataLad allows you to version control arbitrarily large files in datasets, without the need for custom data structures, central infrastructure, or third party services.

DataLad is a free and open source command line tool with a Python API and is compatible with all major operating systems. Use DataLad to:

-  create new datasets locally
-  clone other datasets
-  get content on-demand
-  save changes to datasets
-  drop content as needed
-  push changes to a remote location

https://www.datalad.org/

#### DVC

DVC stands for (Not Just) Data Version Control. The reason for this is that DVC offers versioning for the entire data science and machine learning workflow. Its homepage can be found [here](https://dvc.org/). The documentation boasts that DVC can version source code, data, machine learning experiments, and machine learning models. These latter features will be left for another article discussing those topics.

With respect to versioning data, DVC is a lot like git-lfs and it integrates directly with git. The basic function is to tell git to ignore a data directory and to tell DVC to track that data directory. DVC will add files to the git repository that act as pointers to the actual data. The DVC command line is very similar to git, meaning the concepts of branches, staging, commits, push, and pull still apply.

One of the advantages over git-lfs is that DVC is able to store data in a larger array of backend storage repositories. DVC supports Amazon S3, Microsoft Azure Blob Storage, Google Drive, Google Cloud Storage, Aliyun OSS, SSH/SFTP, HDFS, HTTP, network-attached storage, or local disk to store file contents.

A helpful walkthrough can be found [here](https://towardsdatascience.com/introduction-to-dvc-data-version-control-tool-for-machine-learning-projects-7cb49c229fe0).

DVC will face scalability issues similar to those of git-lfs. It also needs to pull data to a local machine, and many users note that the system slows down as the number of files being tracked grows. This is likely because the solution is a program that runs on a single compute footprint and is not distributed or parallelized.

#### Git-annex

[Git-annex](https://git-annex.branchable.com/) is similar to git-lfs: it allows managing large files with git without storing the file contents in git. It is an older solution than git-lfs and supports a wide range of backends, as opposed to git-lfs, which only supports a git-lfs backend. That being said, it has some quirks between Windows and Linux use cases. It also suffers from the same scalability problem. More information on the differences can be found [here](https://stackoverflow.com/questions/39337586/how-do-git-lfs-and-git-annex-differ) and [here](https://www.perforce.com/manuals/gitswarm-ee/workflow/lfs/migrate_from_git_annex_to_git_lfs.html).

#### Git-LFS
Git Large File Storage (LFS) replaces large files such as audio samples, videos, datasets, and graphics with text pointers inside Git, while storing the file contents on a remote server implementing the git-lfs protocol (like GitHub.com or GitHub Enterprise).

Git-lfs was designed to allow code projects to version small sets of data and binaries, not to version a data lake. Git-lfs suffers from a scalability problem because it manages operations that should be implemented by the storage layer. For example, in order to read data from the repository, the user must sync their local copy of the data with the remote copy on the server. This, in effect, creates two copies of the data. Additionally, the time to check out a particular version depends on the network connection. In some cases one might have to wait several minutes or hours to check out a data set.

https://git-lfs.com/
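
The text pointers mentioned above are small enough to sketch. The snippet below builds a pointer in the layout described by the public git-lfs pointer specification (a `version` line, a SHA-256 `oid`, and a `size` in bytes). It only produces and parses the text; it does not talk to git or to any LFS server, and the helper names `lfs_pointer` and `parse_pointer` are hypothetical.

```python
import hashlib
from typing import Dict


def lfs_pointer(content: bytes) -> str:
    """Return the small text stand-in that would be committed instead of the content."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


def parse_pointer(text: str) -> Dict[str, str]:
    """Split each 'key value' line of a pointer into a dictionary entry."""
    return dict(line.split(" ", 1) for line in text.splitlines())


small = lfs_pointer(b"hello")
large = lfs_pointer(b"x" * 1_000_000)

# The pointer stays tiny no matter how large the tracked file is.
print(len(small), len(large))
print(parse_pointer(large)["size"])
```

The repository only ever stores these few lines per file version, which is why the actual content has to be fetched over the network at checkout time.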

### 1.3.2. Big Data Solutions

#### Deep Lake (Formerly Activeloop Hub)

Deep Lake is a data lake implementation designed for deep learning. According to the documentation:

> Deep Lake (formerly known as Activeloop Hub) is a data lake for deep learning applications. Our open-source dataset format is optimized for rapid streaming and querying of data while training models at scale, and it includes a simple API for creating, storing, and collaborating on AI datasets of any size. It can be deployed locally or in the cloud, and it enables you to store all of your data in one place, ranging from simple annotations to large videos.
> 
> https://github.com/activeloopai/deeplake

Looking through the [documentation](https://github.com/activeloopai/deeplake), I see that Deep Lake is implemented as a Python package which is able to upload, download, and stream datasets to/from AWS S3/S3-compatible storage, GCP, Activeloop cloud, or local storage.

Deep Lake [dataset version control](https://docs.activeloop.ai/getting-started/dataset-version-control#how-to-use-version-control-in-deep-lake) allows you to manage changes to datasets with commands very similar to Git. Its API implements the familiar concepts of commits, branches, merges, and diffing.

A shortcoming of this technology is that the APIs for loading data from the data store into models are only available for PyTorch and TensorFlow. Additionally, the technology is built around tensors.

#### Hangar
Hangar is a Python package which implements DVC functionality for tabular tensor (n-dimensional matrix) data. More information can be found [here](https://opendatascience.com/introduction-to-data-version-control/).

The repository consists of data sets, which are collections of data samples. A data set consists of columns. Columns can be numeric, string, or byte strings. When interacting with the data contained in the repository, the Python SDK natively allows columns to be stored and referenced. The CLI allows the repository to be managed and columns to be defined, but not populated.

For backends, Hangar supports HDF5, local NumPy memmap, LMDB, or a custom user-defined plugin.

#### Kamu

Kamu is a solution for managing, transforming, and collaborating on structured data.

In a larger context, kamu is a reference implementation of Open Data Fabric - a Web 3.0 protocol for providing timely, high-quality, and verifiable data for data science, smart contracts, web and applications.

Kamu maintains an unbreakable lineage and provenance trail in tamper-proof metadata, which lets you assess the trustworthiness of data, no matter how many hands and transformation steps it went through.

It is currently in an [experimental state](https://docs.kamu.dev/odf/).

According to the documentation:

> Kamu achieves for data what Version Control Systems did for Software , but does so without diffs, versioning, or snapshotting.
>
> Our new paradigm streamlines collaboration on data within your company, and enables the effect similar to Open Source Software revolution for data globally.
> 
> How it works
>
> 1. We turn data into a ledger
>
> Data preserves complete history, and never updated destructively. Trust is anchored at the publisher, so they can be always held accountable for data they provide.
>
> 2. Datasets are registered on the network
>
> As a publisher you don’t need to move data into any central point. You maintain complete ownership and control.
>
> 3. People process data using special SQL code
> 
> Our decentralized ETL pipelines can span across teams, organizations, and even continents. People can collaborate on cleaning and enriching data and confidently reuse data from any step.
>
> 4. Data flows in near real-time
> 
> Our streaming SQL engines process data within seconds, continuously and autonomously. All of your science projects, dashboards, and automation get the fidelity of stock tickers data.
>
> 5. Accountability, verifiability, and provenance built-in
>
> Our SQL has the properties of Smart Contracts, so you can trace every single data cell to its source, and easily tell who processed it and how.
>
> https://www.kamu.dev/

Kamu is a single-binary utility that comes bundled with most of its dependencies. The binary runs on Linux, Windows, and macOS.
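
The ledger idea in the quoted description (history is preserved and never updated destructively, and tampering is detectable) can be illustrated with a generic hash-chained, append-only log. This is only a sketch of that general technique under my own assumptions; it is not Kamu's or Open Data Fabric's actual metadata format, and the `Ledger` class and its methods are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional


def _digest(prev: Optional[str], records: List[Dict[str, Any]]) -> str:
    """Hash a block's records together with the hash of the block before it."""
    payload = json.dumps({"prev": prev, "records": records}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class Ledger:
    """Append-only log in which every block commits to the one before it."""

    def __init__(self) -> None:
        self.blocks: List[Dict[str, Any]] = []

    def append(self, records: List[Dict[str, Any]]) -> str:
        prev = self.blocks[-1]["hash"] if self.blocks else None
        block_hash = _digest(prev, records)
        self.blocks.append({"prev": prev, "records": records, "hash": block_hash})
        return block_hash

    def verify(self) -> bool:
        """Recompute every hash; any edit to past data breaks the chain."""
        prev = None
        for block in self.blocks:
            if block["prev"] != prev or block["hash"] != _digest(prev, block["records"]):
                return False
            prev = block["hash"]
        return True


ledger = Ledger()
ledger.append([{"city": "Oslo", "temp_c": 4}])
ledger.append([{"city": "Oslo", "temp_c": 6}])
intact = ledger.verify()

# Rewriting history in place is detected because the stored hash no longer matches.
ledger.blocks[0]["records"][0]["temp_c"] = 40
tampered_ok = ledger.verify()
```

Corrections in such a model are expressed as new appended records rather than edits, which is how history can stay complete without diffs or snapshots.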



#### LakeFS
LakeFS is a git-like solution that sits on top of S3 (Azure support is on the roadmap). It works by providing a metadata layer on top of the "real files". It keeps track of which "real files" are part of which version, as well as what operations were performed between versions. This is very nice because there is no data duplication and none of the client/server syncing that you might see with git-lfs or DVC.
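
A toy model may help show why a metadata layer avoids duplicating data. In the sketch below, commits are just mappings from logical paths to the addresses of stored objects, and a branch is a pointer to a commit. This is not lakeFS's actual data model or API: the `VersionedStore` class is hypothetical, and addressing objects by content hash is my own simplification.

```python
import hashlib
from typing import Dict, Optional


class VersionedStore:
    """Toy metadata layer: branches -> commits -> {logical path: object address}."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}            # stand-in for the object store
        self.commits: Dict[int, Dict[str, str]] = {}   # commit id -> path/address map
        self.branches: Dict[str, Optional[int]] = {"main": None}

    def _tree(self, branch: str) -> Dict[str, str]:
        head = self.branches[branch]
        return dict(self.commits[head]) if head is not None else {}

    def commit(self, branch: str, files: Dict[str, bytes]) -> int:
        """Store changed files once and record a new path/address map for the branch."""
        tree = self._tree(branch)
        for path, content in files.items():
            address = hashlib.sha256(content).hexdigest()
            self.objects.setdefault(address, content)  # unchanged content is not re-stored
            tree[path] = address
        commit_id = len(self.commits) + 1
        self.commits[commit_id] = tree
        self.branches[branch] = commit_id
        return commit_id

    def create_branch(self, name: str, source: str) -> None:
        """Branching copies a pointer, not the data."""
        self.branches[name] = self.branches[source]

    def read(self, branch: str, path: str) -> bytes:
        return self.objects[self._tree(branch)[path]]


store = VersionedStore()
store.commit("main", {"a.csv": b"1,2,3", "b.csv": b"4,5,6"})
store.create_branch("experiment", "main")
store.commit("experiment", {"a.csv": b"1,2,30"})
```

After these steps both branches can read `b.csv`, yet only three objects exist in total: the two original files plus the one modified copy of `a.csv`.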

For more information on LakeFS see the [official documentation](https://lakefs.io/). I explore this solution in detail in [this article](./LakeFS/Intro%20To%20LakeFS.ipynb).

#### Pachyderm
Pachyderm provides a DVCS implementation with functionality analogous to a minimal git implementation: users can commit to and branch data repositories. The solution itself is hosted on Kubernetes and can be deployed in the cloud or on-prem. Both community and enterprise editions are available.

The official documentation can be found [here](https://docs.pachyderm.com/latest/overview/). I explore this solution in the following [notebook](Pachyderm/Intro%20To%20Pachyderm.ipynb).

#### Project Nessie

Nessie enables you to maintain multiple versions of your data tables and leverage a Git-like workflow (using branches, commits, tags, etc.).

Its [homepage](https://projectnessie.org/) lists the following:
> Transactional Catalog for Data Lakes
> - Git-inspired data version control
> - Cross-table transactions and visibility
> - Open data lake approach, supporting Hive, Spark, Dremio, AWS Athena, etc.
> - Works with Apache Iceberg and Delta Lake tables
> - Run as a docker image, AWS Lambda or fork it on GitHub

Nessie is an OSS service (REST API and browser UI) and a set of libraries. The service is built on Java, leverages Quarkus, and is compiled to a GraalVM native image.

Nessie extends existing table formats to provide a single serial view of transaction history. This is enabled across an unlimited number of tables, meaning that a transaction affecting multiple tables is able to be catalogued.

Nessie enhances the following table formats with version control techniques:

- Apache Iceberg (tables and views)
- Delta Lake

**Note**: Delta Lake support in Nessie requires some minor modifications to the core Delta libraries. This patch is still ongoing; in the meantime, Nessie will not work on Databricks and must be used with the open source Delta. Nessie is able to interact with Delta Lake by implementing a custom version of Delta’s LogStore interface. This ensures that all filesystem changes are recorded by Nessie as commits. The benefit of this approach is that the core ACID primitives are handled by Nessie. The limitations around concurrency that Delta would normally have are removed: any number of readers and writers can simultaneously interact with a Nessie-managed Delta Lake table.
https://projectnessie.org/tools/deltalake/

Changes to the contents of the data lake are recorded in Nessie as commits without copying the actual data.

Nessie offers a CLI that is installed through a pip package but it is [not currently fully featured](https://projectnessie.org/tools/).

#### Quilt

Quilt describes itself as a self-organizing data hub. 

> Quilt consists of a Python client, web catalog, lambda functions—all of which are open source—plus a suite of backend services and Docker containers orchestrated by CloudFormation.
>
> The backend services are available under a paid license on quiltdata.com.
> https://github.com/quiltdata/quilt

Quilt manages data like code by creating a metadata layer on top of S3.

According to the documentation:

> All package metadata and data are stored in your S3 buckets. A slice of the package-level metadata, as well as S3 object contents, are sent to an ElasticSearch cluster managed by Quilt. All Quilt package manifests are accessible via SQL using AWS Athena.
>
> https://docs.quiltdata.com/architecture
>
> Quilt packages are one level of abstraction above S3 object versions. Object versions track mutations to a single file, whereas a quilt package references a collection files and assigns this collection a unique version.
>
> https://docs.quiltdata.com/more/faq#how-does-quilt-versioning-relate-to-s3-object-versioning
>
> In Quilt, S3 buckets are analogous to branches in git. Each bucket is a self-contained registry for one or more packages. As package data and schemas are refined, you can promote a package to a new bucket to signify its increased data quality.
>
> https://docs.quiltdata.com/mentalmodel#buckets-are-branches

Quilt does provide the ability to look at revision history (commits) and to diff commits to show the changes. This functionality, however, is very different from git, as it is provided via a Python API. Additionally, branch merging does not appear to be fully featured: merges are implemented as an update, but the history of the source of the merge is not included in the new repo.
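
The idea that a package sits one level above individual object versions can be sketched as follows. A package revision is modelled as a manifest mapping logical file names to object version identifiers, and the revision gets a single identifier derived from the whole manifest. This is an assumption-laden illustration, not Quilt's manifest format or Python API; `package_version`, `manifest_diff`, and the version strings are made up.

```python
import hashlib
import json
from typing import Dict, List

# Logical file name -> object version id (placeholder values for this sketch).
Manifest = Dict[str, str]


def package_version(manifest: Manifest) -> str:
    """Derive one identifier for the whole collection of file versions."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def manifest_diff(old: Manifest, new: Manifest) -> Dict[str, List[str]]:
    """Summarise what changed between two package revisions."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "modified": sorted(name for name in shared if old[name] != new[name]),
    }


rev1 = {"train.csv": "obj-v1", "test.csv": "obj-v1"}
rev2 = {"train.csv": "obj-v2", "test.csv": "obj-v1", "README.md": "obj-v1"}

changes = manifest_diff(rev1, rev2)
print(package_version(rev1) != package_version(rev2))
print(changes)
```

Note that nothing in this model records which revision a change was merged from, which mirrors the missing merge history observed above.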


## 1.4. Non-DVCS Solutions Providing A Subset Of DVCS Functionality

The following is a list of data science platforms that have versioning components but do not provide true version control for data.

### Neptune

Neptune is a metadata store that offers experiment tracking and model registry for machine learning researchers and engineers. With Neptune, you can log, query, manage, display, and compare all your model metadata in a single place. 

Some users may elect to include the training and test data sets as artifacts stored in the metadata repository. While this is like version control, it is not version control for data.

https://docs.neptune.ai/

### DagsHub

DagsHub is a cloud-based data science and machine learning platform.

Behind the scenes, DagsHub uses DVC for versioning data. More information can be found [here](https://dagshub.com/docs/experiment_tutorial/2_data_versioning/).

### Delta Lake
Delta Lake is a data lakehouse implementation. It provides time travel, a form of version control.