# 4. Overview

Currently, no solution is feature complete, although several offer a subset of the features specified above. In this section I propose a reference architecture and walk through how several existing solutions fit into it.

## 4.1. Reference Architecture

Below we can see a modular solution designed to conform to the client-server model. The DVCS solution sits on top of the underlying storage layer and proxies native traffic to the underlying datastore.

<center><img src="images/dvcs-architecture-diagram.png" style="width: 400px"></center>

With this architecture, the client connects to the DVCS for one of three purposes: to issue administrative commands to the DVCS repository, to provide metadata about the lineage of a particular state of the repository, or to reach the underlying datastore and manipulate the actual data.

## 4.2. Components

### 4.2.1. The Storage Layer Proxy (SLP)

The SLP is a metered and monitored proxy that connects a client directly to their data on the underlying storage layer. The location of the actual data is determined by input from the VCL, and the data may "move around" as different VCS events occur. For example, if one client wants to read a file from a particular version of the repository while another client reads a newer copy, the storage layer may need to be manipulated behind the scenes so that both reads succeed. The SLP informs the LCL of client session activity through the metadata store. As clients establish connections and perform operations, the SLP can observe and report.
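
To make the routing idea concrete, here is a minimal sketch, assuming an in-memory mapping from refs to storage locations and a plain list standing in for the metadata store. The names `StorageLayerProxy`, `set_route`, and `open_session`, and the example locations, are hypothetical and not taken from any existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class StorageLayerProxy:
    """Hypothetical SLP: routes each session to the storage location of its ref."""

    # ref (branch or commit id) -> storage location; written by the VCL
    routes: dict[str, str] = field(default_factory=dict)
    # stand-in for the metadata store: a list of observed session events
    activity_log: list[dict] = field(default_factory=list)

    def set_route(self, ref: str, location: str) -> None:
        """Called by the VCL when a ref's data is snapshotted or moved."""
        self.routes[ref] = location

    def open_session(self, client: str, ref: str) -> str:
        """Resolve a ref to its current location and record the session."""
        location = self.routes[ref]  # KeyError if the VCL never configured this ref
        self.activity_log.append({"client": client, "ref": ref, "event": "connect"})
        return location


slp = StorageLayerProxy()
slp.set_route("v1", "s3://bucket/snapshots/v1")
slp.set_route("main", "s3://bucket/live")
# Two clients read different versions of the same repository at the same time.
old_copy = slp.open_session("alice", "v1")
new_copy = slp.open_session("bob", "main")
```

In this sketch the proxy only resolves locations. A real SLP would also forward the native storage traffic and meter it.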

### 4.2.2. The Lineage Control Layer (LCL)
The LCL is responsible for tracking the lineage of VCS events, such as where the data came from and which transformations created the data associated with a given commit. When clients connect to the DVCS they establish a session and are given access to the underlying data store through a metered proxy (the SLP). Any storage layer transactions declared or observed are tied to the respective client and catalogued.
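
One way to picture the catalogue is as a set of per-commit lineage records. The sketch below is an assumption about what such a record might hold. `LineageRecord`, `LineageLog`, and their fields are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageRecord:
    """Hypothetical lineage entry: which inputs and transformation produced a commit."""

    commit_id: str
    client: str
    inputs: tuple[str, ...]  # commit ids the data was derived from
    transformation: str      # e.g. the script or pipeline step that ran


class LineageLog:
    def __init__(self) -> None:
        self._records: dict[str, LineageRecord] = {}

    def record(self, rec: LineageRecord) -> None:
        self._records[rec.commit_id] = rec

    def ancestry(self, commit_id: str) -> list[str]:
        """Return every upstream commit id reachable from commit_id, without duplicates."""
        seen: list[str] = []
        stack = [commit_id]
        while stack:
            rec = self._records.get(stack.pop())
            if rec is None:
                continue
            for parent in rec.inputs:
                if parent not in seen:
                    seen.append(parent)
                    stack.append(parent)
        return seen
```

Walking `inputs` transitively answers the "where is this data from" question for any commit the log knows about.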

### 4.2.3. The Version Control Layer (VCL)

The VCL is responsible for managing the commit history, the commit metadata, and the working set. It manipulates the underlying storage layer to snapshot data, create working copies, and provide diff functionality. It configures the storage layer proxy to direct client requests to a respective underlying data store.
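
The three storage operations named above (snapshot, working copy, diff) can be sketched over a toy storage layer. This assumes the storage layer is just an in-memory mapping of path to content. The class and method names are hypothetical.

```python
import copy


class VersionControlLayer:
    """Hypothetical VCL over an in-memory 'storage layer' of path -> content."""

    def __init__(self) -> None:
        self.snapshots: dict[str, dict[str, str]] = {}  # commit id -> frozen data
        self.working_set: dict[str, str] = {}

    def snapshot(self, commit_id: str) -> None:
        """Freeze the current working set under a commit id."""
        self.snapshots[commit_id] = copy.deepcopy(self.working_set)

    def checkout(self, commit_id: str) -> None:
        """Replace the working set with a copy of a snapshot."""
        self.working_set = copy.deepcopy(self.snapshots[commit_id])

    def diff(self, old: str, new: str) -> dict[str, list[str]]:
        """Report paths added, removed, or modified between two snapshots."""
        a, b = self.snapshots[old], self.snapshots[new]
        return {
            "added": sorted(b.keys() - a.keys()),
            "removed": sorted(a.keys() - b.keys()),
            "modified": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
        }
```

A real storage layer would likely rely on copy-on-write or native snapshots rather than deep copies, but the interface would look similar.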

### 4.2.4. The Storage Layer
The storage layer is the technology in which the user has elected to store their raw data. This would be a solution that provides file, object, block, or database storage. The storage layer is managed by the DVCS which sits above it. User traffic is directed to the storage layer through the DVCS and its SLP.

### 4.2.5. The Metadata Store
The metadata store is responsible for storing metadata related to the version history, version lineage, proxy settings, and event hooks associated with a data repository.
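
As an illustration only, the four kinds of metadata listed above could map onto four tables. The schema below is a guess at a minimal layout, using SQLite from the Python standard library. Every table and column name is hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE commits (        -- version history
    id        TEXT PRIMARY KEY,
    parent_id TEXT REFERENCES commits(id),
    branch    TEXT NOT NULL,
    message   TEXT
);
CREATE TABLE lineage (        -- version lineage
    commit_id      TEXT REFERENCES commits(id),
    input_commit   TEXT,
    transformation TEXT
);
CREATE TABLE proxy_routes (   -- proxy settings
    ref      TEXT PRIMARY KEY,
    location TEXT NOT NULL
);
CREATE TABLE event_hooks (    -- event hooks
    event    TEXT NOT NULL,
    callback TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO commits VALUES ('c1', NULL, 'main', 'initial commit')")
conn.execute("INSERT INTO proxy_routes VALUES ('main', 'file:///data/main')")
```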

### Message Bus

The message bus allows for asynchronous, event-based communication between the internal services. Processes can listen to the bus for new units of work and notify downstream consumers once those units of work have completed or failed.
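
A minimal in-process sketch of this pattern follows. It assumes a single FIFO queue and topic-based subscriptions. A production system would more likely use an external broker, and the `MessageBus` class here is hypothetical.

```python
from collections import defaultdict, deque
from typing import Callable


class MessageBus:
    """Hypothetical in-process bus: topics, subscribers, and a FIFO queue."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self._queue: deque[tuple[str, dict]] = deque()

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: dict) -> None:
        """Enqueue a message; nothing runs until drain() is called."""
        self._queue.append((topic, payload))

    def drain(self) -> None:
        """Deliver queued messages in order, including ones published by handlers."""
        while self._queue:
            topic, payload = self._queue.popleft()
            for handler in self._subscribers[topic]:
                handler(payload)
```

A worker subscribed to, say, a `branch.create` topic would do its work and then publish a `branch.created` message for downstream consumers. Both topic names are made up for this example.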

### 4.2.6. Event Hooks
The event hook system allows subsystems to trigger events. Internal and third-party systems can register event hooks that execute when specified events occur. For example, when the underlying storage layer triggers an event, the event hook system can witness the event and notify downstream systems. The event hook functionality requires a compute layer to host execution of the registered callbacks.
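
The registration and triggering behaviour can be sketched as a small registry. In this sketch callbacks run in-process, which stands in for the compute layer. `EventHooks`, `register`, and `trigger` are hypothetical names.

```python
from collections import defaultdict
from typing import Any, Callable

Hook = Callable[[dict[str, Any]], None]


class EventHooks:
    """Hypothetical hook registry; callbacks run in-process as a stand-in for compute."""

    def __init__(self) -> None:
        self._hooks: dict[str, list[Hook]] = defaultdict(list)

    def register(self, event: str, callback: Hook) -> None:
        self._hooks[event].append(callback)

    def trigger(self, event: str, payload: dict[str, Any]) -> list[Exception]:
        """Run every callback for an event; collect failures instead of stopping."""
        failures: list[Exception] = []
        for callback in self._hooks.get(event, []):
            try:
                callback(payload)
            except Exception as exc:  # one failing hook should not block the others
                failures.append(exc)
        return failures
```

Collecting failures rather than raising is a design choice made for this sketch, so that a broken third-party hook cannot prevent other hooks from running.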

### Compute


## 4.3. Solution Workflow

In this section we will explore how the components of the reference architecture implement the high-level VCS operations we have come to expect.
The basic process of adding new data to the central repository is as follows:

1. Create Repo
1. Create Initial Commit & Integration Branch
1. Create Branch
1. Make Commit(s)
    - Update Working Set (through human or pipeline)
    - Commit Changes
1. Merge Into Integration Branch

Note: These operations are expected to be atomic and in practice will be invoked through a CLI or RESTful API.
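
The five steps above can be sketched against a toy in-memory repository. This is an illustration under simplifying assumptions: commits are numbered integers, data is a flat mapping, merging is fast-forward only, and atomicity is not modelled. The `Repo` class and its methods are hypothetical.

```python
class Repo:
    """Hypothetical in-memory repository: branches point at commits, commits hold data."""

    def __init__(self, integration_branch: str = "main") -> None:
        # Steps 1-2: create the repo with an empty initial commit on the integration branch.
        self.commits: dict[int, dict] = {0: {"parent": None, "data": {}}}
        self.branches: dict[str, int] = {integration_branch: 0}
        self.integration_branch = integration_branch

    def create_branch(self, name: str, source: str) -> None:
        # Step 3: a new branch starts at the source branch's head commit.
        self.branches[name] = self.branches[source]

    def commit(self, branch: str, changes: dict[str, str]) -> int:
        # Step 4: apply working-set changes on top of the branch head.
        parent = self.branches[branch]
        data = {**self.commits[parent]["data"], **changes}
        commit_id = len(self.commits)
        self.commits[commit_id] = {"parent": parent, "data": data}
        self.branches[branch] = commit_id
        return commit_id

    def merge(self, branch: str) -> None:
        # Step 5: fast-forward only; a real merge would need conflict handling.
        target = self.integration_branch
        head = self.branches[branch]
        ancestor = head
        while ancestor is not None and ancestor != self.branches[target]:
            ancestor = self.commits[ancestor]["parent"]
        if ancestor is None:
            raise ValueError("integration branch has diverged; fast-forward not possible")
        self.branches[target] = head


repo = Repo()                               # steps 1-2
repo.create_branch("feature", "main")       # step 3
repo.commit("feature", {"a.csv": "1"})      # step 4
repo.commit("feature", {"b.csv": "2"})
repo.merge("feature")                       # step 5
```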


<table>
    <tbody>
        <tr>
            <td>      
                
### Creating A Repo
       
1. Client sends instruction to Server
1. Server places a message on the message bus
    - VCS Layer updates Metadata Store
    - VCS Layer provisions and/or configures backend Storage
    - VCS Layer places a message on the message bus
1. Server responds to Client
1. Event Hooks are invoked
                
### Creating A Branch
       
1. Client sends instruction to Server
1. Server places a message on the message bus
    - VCS layer activates
        - Updates Metadata Store
        - Provisions and/or configures backend Storage
        - Places a message on the message bus
    - SLP activates
        - Updates Metadata Store
        - Updates internal proxy settings
        - Places a message on the message bus           
1. Server responds to Client
1. Event Hooks are invoked

### Creating A Commit
                
1. Client sends instruction to Server
1. Server places a message on the message bus
    - LCL activates
        - Updates Metadata Store
        - Places a message on the message bus
    - VCS layer activates
        - Updates Metadata Store
        - Provisions and/or configures backend Storage
        - Places a message on the message bus
    - SLP activates
        - Updates Metadata Store
        - Updates internal proxy settings
        - Places a message on the message bus           
1. Server responds to Client
1. Event Hooks are invoked              
            </td>
            <td>
                <center><img src="images/dvcs-architecture-diagram.png" style="height: 600px"></center>
            </td>
        </tr>
    </tbody>
</table>
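
The three sequences above differ mainly in which layers activate. The sketch below encodes that as data and produces an ordered trace of the steps for each operation. It is a toy model of the control flow only. The operation names and trace strings are hypothetical, and nothing is actually provisioned.

```python
# Which layers activate for each operation, in the order listed above.
ACTIVATIONS = {
    "create_repo": ["VCL"],
    "create_branch": ["VCL", "SLP"],
    "create_commit": ["LCL", "VCL", "SLP"],
}


def handle(operation: str) -> list[str]:
    """Return the ordered trace of steps the server would go through."""
    trace = ["client -> server: " + operation, "server -> bus: " + operation]
    for layer in ACTIVATIONS[operation]:
        trace.append(f"{layer}: update metadata store")
        if layer == "VCL":
            trace.append("VCL: provision/configure backend storage")
        if layer == "SLP":
            trace.append("SLP: update internal proxy settings")
        trace.append(f"{layer} -> bus: {operation} done")
    trace.append("server -> client: response")
    trace.append("event hooks invoked")
    return trace
```

Calling `handle("create_commit")` yields the longest trace, since all three layers take part.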

### Diffing, Merging & Resolving Merge Conflicts
The process of diffing, merging, and resolving conflicts is conceptually consistent across use cases, but in practice it will have nuances based on the underlying storage format of the data. For example, diffing a text file and diffing a database differ in computation, presentation, and interpretation. How do you diff an image, or a video?

This process and technology can be abstracted into its own solution that can be used in tandem with a DVCS.
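
One possible shape for such an abstraction is a registry of format-specific differs behind a single entry point. The sketch below assumes two formats: text, compared line by line with the standard library's `difflib`, and tables, compared by primary key. The function names and the `DIFFERS` registry are hypothetical.

```python
import difflib


def diff_text(old: str, new: str) -> list[str]:
    """Line-based diff, suited to text files; keeps only added and removed lines."""
    lines = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line for line in lines if line.startswith(("+ ", "- "))]


def diff_rows(old: dict[int, dict], new: dict[int, dict]) -> dict[str, list[int]]:
    """Key-based diff, suited to database tables keyed by primary key."""
    return {
        "inserted": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        "updated": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Registry of differs per storage format; images or video would need their own entries.
DIFFERS = {"text": diff_text, "table": diff_rows}


def diff(kind: str, old, new):
    """Dispatch to the differ registered for the storage format."""
    return DIFFERS[kind](old, new)
```

The two differs return differently shaped results on purpose, which reflects the point that presentation and interpretation depend on the format.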

# Available Solutions According To Reference Architecture

In the [Available Solutions notebook](Available%20Solutions.ipynb) I reviewed a number of solutions available on the market today. In this section I explain how they fit into the reference architecture provided in this article.

## Dolt

We can see that Dolt sits on top of the storage layer, providing both version control and a proxy to the storage layer.
<center><img src="images/dvcs-reference-architecture-dolt.png" style="height: 400px"></center>

## LakeFS
LakeFS has a similar footprint to Dolt with respect to the reference architecture.
<center><img src="images/dvcs-reference-architecture-lakefs.png" style="height: 400px"></center>

## Pachyderm
Pachyderm is one of the more feature-complete solutions and has the largest footprint. As far as I know, it does not internally implement a message bus, which is not a problem.
<center><img src="images/dvcs-reference-architecture-pachyderm.png" style="height: 400px"></center>