# Overview

[LakeFS](https://docs.lakefs.io/) is a storage solution that provides version control for data lakes. It sits on top of a data lake as an "overlay filesystem", meaning it provides an API that maps version information to the physical data stored in the underlying data store.

Note: At the time of writing, the latest release of LakeFS is v0.69.1.

# Features

A git-like interface

No data duplication

An S3 Compatible API

LakeFS can use any of the following storage services as its underlying data store:
- AWS S3
- S3 Compatible Stores like MinIO or Ceph
- Azure Blob Storage (ABS)
- Google Cloud Storage (GCS)
- Local Storage


LakeFS provides direct integration with popular data frameworks:
- Spark
- Hive
- dbt
- Trino
- and many others

[Event hooks](https://docs.lakefs.io/setup/hooks.html) that support CI/CD

# Architecture

LakeFS implements the classic client/server model: one or more centralized servers take requests and serve data to clients.


## LakeFS Server Architecture

lakeFS is distributed as a single binary, which is also referred to as "the server". The server encapsulates several logical services, including:
- UI
- OpenAPI Gateway
- S3 Gateway
- Authentication/Authorization
- Graveler
- Storage Adapter
- Hooks Engine

The [documentation](https://docs.lakefs.io/understand/architecture.html) provides the following diagram:

<center><img src="images/lakefs-architecture.png" style="width:800px"></center>


Note: The server itself is stateless, meaning you can easily add more instances to handle a bigger load.


### Server Components

- **S3 Gateway** - LakeFS implements a compatible subset of the S3 API to ensure most data systems can use lakeFS as a drop-in replacement for S3.

- **OpenAPI Server** - The Swagger (OpenAPI) server exposes the full set of lakeFS operations including basic CRUD operations against repositories and objects, as well as versioning related operations such as branching, merging, committing and reverting changes to data.

- **Storage Adapter** - an abstraction layer for communicating with any underlying object store. Its implementations allow compatibility with many types of underlying storage such as S3, GCS, Azure Blob Storage, or non-production usages such as the local storage adapter.

- **Graveler** - handles lakeFS versioning by translating lakeFS addresses to the actual stored objects.

- **Authentication & Authorization Service**

- **Hooks Engine** - enables CI/CD for data by triggering user-defined actions (hooks) that run during commits and merges.

- **UI** - a simple browser-based client that uses the OpenAPI server to provide access to repositories, branches, commits and objects in the system.


The official documentation can be found [here](https://docs.lakefs.io/understand/architecture.html).

## LakeFS Client Architecture
As the diagram above indicates, our applications need a client or API to interact with LakeFS.

The following clients and APIs exist:
- The [LakeFS API](https://docs.lakefs.io/reference/api.html) - The REST API.
- The [Python API](https://docs.lakefs.io/integrations/python.html) - A Python library that provides an object-oriented interface for interacting with the API. Also see [here](https://pydocs.lakefs.io/).
- The [Spark Client](https://docs.lakefs.io/reference/spark-client.html) - A library that allows us to interact with LakeFS metadata using Spark objects (like DataFrames).
- The [S3A Gateway](https://docs.lakefs.io/integrations/spark.html#access-lakefs-using-the-s3a-gateway) - The S3-compatible endpoint provided by the lakeFS server, which allows Spark to access objects using the lakeFS S3 path convention and `s3a://...` URIs.
- The [lakeFS-specific Hadoop FileSystem]() - In this mode, the Spark application performs metadata operations against the lakeFS server and all data operations directly against the underlying object store that lakeFS uses. This significantly increases application scalability and performance by reducing the load on the lakeFS server and the number of hops. Here one uses `lakefs://repo/ref/path/to/data` URIs to read and write data on lakeFS rather than `s3a://...` URIs.
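To make the difference between the last two access modes more concrete, here is a minimal sketch of the Spark settings each one might need. The helper function names are hypothetical, and the property names are my recollection of the lakeFS Spark integration docs, so verify them against the documentation for your lakeFS version before relying on them.

```python
def s3a_gateway_conf(lakefs_endpoint, access_key, secret_key):
    """Settings for reading lakeFS through its S3 gateway (s3a:// URIs).

    The S3A client is pointed at the lakeFS server instead of AWS,
    and authenticates with lakeFS credentials.
    """
    return {
        "spark.hadoop.fs.s3a.endpoint": lakefs_endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Path-style access keeps the repository name in the URL path.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


def lakefs_filesystem_conf(lakefs_endpoint, access_key, secret_key):
    """Settings for the lakeFS Hadoop FileSystem (lakefs:// URIs).

    Only metadata calls go to the lakeFS API. The S3A client must
    additionally be configured with credentials for the underlying
    object store (not shown here).
    """
    return {
        "spark.hadoop.fs.lakefs.impl": "io.lakefs.LakeFSFileSystem",
        "spark.hadoop.fs.lakefs.endpoint": lakefs_endpoint.rstrip("/") + "/api/v1",
        "spark.hadoop.fs.lakefs.access.key": access_key,
        "spark.hadoop.fs.lakefs.secret.key": secret_key,
    }


# Placeholder values for illustration only.
gateway = s3a_gateway_conf("http://localhost:8000", "<key-id>", "<secret>")
hadoop_fs = lakefs_filesystem_conf("http://localhost:8000", "<key-id>", "<secret>")
```

Each dictionary could then be applied to a `SparkSession.builder` with repeated `.config(key, value)` calls.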

## Integrations

### LakeFS and Spark Integration
We have two options for getting data from lakeFS into Spark:
1. Through LakeFS' S3 Gateway - The client requests data from the lakeFS server using a lakeFS repository-style URL; the server fetches it from the real storage backend path and returns it to the client.
2. LakeFS Hadoop Filesystem - The client requests metadata from the lakeFS server and data directly from the underlying storage backend (e.g. S3).
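From the application's point of view, the two options differ mainly in the URI scheme used for reads and writes. The small helper below is a hypothetical illustration (the function name and mode labels are my own), assuming both schemes use the `<repo>/<ref>/<path>` layout:

```python
def spark_uri(mode: str, repo: str, ref: str, path: str) -> str:
    """Build the URI Spark would use for a lakeFS object in a given mode."""
    schemes = {
        "gateway": "s3a",       # option 1: data flows through the lakeFS server
        "hadoop-fs": "lakefs",  # option 2: data flows directly from the object store
    }
    if mode not in schemes:
        raise ValueError(f"unknown mode: {mode!r}")
    return f"{schemes[mode]}://{repo}/{ref}/{path.lstrip('/')}"


print(spark_uri("gateway", "example-repo", "main", "tables/events.parquet"))
print(spark_uri("hadoop-fs", "example-repo", "main", "tables/events.parquet"))
```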


#### LakeFS S3 Gateway (LakeFS S3A Filesystem)

With this setup, distinguishing between data and metadata operations is done entirely on the lakeFS server. Spark apps read from a lakeFS repository using the S3A filesystem which accesses the lakeFS server via the S3 Gateway. The consequence of this is data throughput becomes dependent on the lakeFS server’s throughput.

<center><img src="images/s3a-filesystem-data-access.png" style="width:600px"></center>

Comparing this diagram to the one in the next section makes it clear how the lakeFS Hadoop FileSystem achieves its performance gains.


#### LakeFS Hadoop Filesystem

**Note**: The lakeFS FileSystem currently only supports Spark with Apache Hadoop 2.7, but support for Hadoop 3 is on the [roadmap](https://docs.lakefs.io/understand/roadmap.html#hadoop-3-support-in-all-lakefs-clients-high-priority).

I was curious how this functionality worked so I did some digging. In the [documentation about the spark integration](https://docs.lakefs.io/integrations/spark.html#two-tiered-spark-support) and this article about the [LakeFS Hadoop filesystem](https://lakefs.io/advancing-lakefs-version-data-at-scale-with-spark/) I found some helpful information to explain how this integration works. 

The LakeFS team has built and released the "LakeFS Hadoop Filesystem", which they describe as a "native Hadoop FileSystem implementation". In other words, the LakeFS team has written a Spark storage adapter that makes lakeFS look like a Hadoop filesystem. When we use Spark to read or write data, we can channel the operation through this adapter.

This is where the magic happens. The adapter contains logic that splits each operation into metadata requests and data requests. The metadata requests (like translating a lakeFS path to the path on the underlying data store) still go through the lakeFS server, but the data itself is read from and written to the underlying data store directly rather than being routed through the lakeFS server as an intermediary. This potentially eliminates bottlenecks and reduces latency. The following image explains the concept:

<center><img src="images/lakefs-hadoop-filesystem-architecture.png" style="width:600px"></center>
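The split between metadata and data requests can be sketched with a toy model. Everything here is hypothetical (the class names, the addresses, the byte counter); it only illustrates why the server handles less traffic in the Hadoop FileSystem mode and is not how lakeFS is implemented.

```python
class FakeObjectStore:
    """Stands in for the underlying store (e.g. S3)."""

    def __init__(self):
        self.objects = {}

    def get(self, physical_address):
        return self.objects[physical_address]


class FakeLakeFSServer:
    """Holds only metadata: logical lakeFS path -> physical address."""

    def __init__(self, store):
        self.store = store
        self.metadata = {}
        self.bytes_served = 0  # data bytes that passed through the server

    def stat(self, logical_path):
        # Metadata request: returns a small address, never the data itself.
        return self.metadata[logical_path]

    def gateway_get(self, logical_path):
        # S3 gateway style: the server fetches the data and relays it.
        data = self.store.get(self.metadata[logical_path])
        self.bytes_served += len(data)
        return data


def read_via_gateway(server, logical_path):
    return server.gateway_get(logical_path)


def read_via_hadoop_fs(server, store, logical_path):
    physical_address = server.stat(logical_path)  # metadata from lakeFS
    return store.get(physical_address)            # data straight from the store


store = FakeObjectStore()
store.objects["s3://bucket/data/abc123"] = b"x" * 1000
server = FakeLakeFSServer(store)
server.metadata["repo/main/part-0.parquet"] = "s3://bucket/data/abc123"

read_via_gateway(server, "repo/main/part-0.parquet")
print(server.bytes_served)  # the gateway read passed through the server
read_via_hadoop_fs(server, store, "repo/main/part-0.parquet")
print(server.bytes_served)  # unchanged by the direct read
```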


### Deltalake

As Delta Lake sits on top of a filesystem or data lake, it can be supported by lakeFS. Because Delta Lake is accessed through Spark, one configures Spark to speak to lakeFS and Delta Lake as one normally would.

But there is one major limitation to be aware of:

As noted in the [documentation](https://www.databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html), a limiting factor of this integration is the Delta table's DeltaLog. The [Delta Lake transaction log](https://www.databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html) (also known as the DeltaLog) is an ordered record of every transaction that has ever been performed on a Delta Lake table since its inception. This exposes us to one of the classic problems of parallel timelines: when merging branches, a human needs to reconcile which events occurred and in which order, manually auditing the transaction log to resolve conflicting sets of transactions.

For this reason, the lakeFS team recommends that one only write to a single branch.
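The problem can be illustrated with a toy three-way merge over path-to-address mappings. This is a simplified model rather than lakeFS's actual merge code, and the addresses are made up. It assumes the Delta log stores each table version as a zero-padded JSON file under `_delta_log/`.

```python
def three_way_merge(base, ours, theirs):
    """Merge two branches relative to their common ancestor.

    Each argument maps an object path to the address of its content.
    A path changed differently on both sides is reported as a conflict.
    """
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:        # both sides agree (unchanged or identical change)
            winner = o
        elif o == b:      # only theirs changed
            winner = t
        elif t == b:      # only ours changed
            winner = o
        else:             # both changed, in different ways
            conflicts.append(key)
            continue
        if winner is not None:  # None means the path was deleted
            merged[key] = winner
    return merged, conflicts


def log_entry(version):
    return f"_delta_log/{version:020d}.json"


base = {log_entry(0): "addr-base-0"}
# Each branch independently commits "version 1" of the same table.
main = {**base, log_entry(1): "addr-main-1"}
dev = {**base, log_entry(1): "addr-dev-1"}

merged, conflicts = three_way_merge(base, main, dev)
print(conflicts)  # the version-1 log entry exists on both sides with different content
```

Because both branches wrote the same log file name with different contents, the merge cannot pick a winner automatically, which is the manual reconciliation described above.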


# Deployment Options

According to the documentation, there are many [ways](https://docs.lakefs.io/understand/architecture.html#ways-to-deploy-lakefs) to deploy the lakeFS server. We can use a public cloud provider like AWS, GCP, or Azure. We can also use a self-hosted solution using bare metal, Docker, or Kubernetes.

## Self-hosted Options

For self-hosted solutions a user can choose between:
- [Running](https://docs.lakefs.io/quickstart/more_quickstart_options.html#using-the-binary) the provided [Binaries](https://github.com/treeverse/lakeFS/releases)
- Running a [Docker Container](https://hub.docker.com/r/treeverse/lakefs)
- Deploying a [Helm Chart](https://artifacthub.io/packages/helm/lakefs/lakefs) in Kubernetes

Otherwise, there are also cloud hosted options for running lakeFS on K8S, ECS, Google Compute Engine and more.

In this [notebook](Basic%20Initial%20Setup.ipynb) I perform a basic setup.

# How Does LakeFS Versioning Work

As mentioned earlier, LakeFS is an overlay file system. Like git, it does some magic to manage the different versions of files and folders contained in our repository.

As with git, in LakeFS, our largest container is the repository. The repository is made up of branches and commits. The branches and commits point to specific versions of files.

As such, when we want to access a particular file from LakeFS we need to give the server a few pieces of information so it can look up and serve our data. We need to know not only its relative path, but also the name of the repository that holds the file and the branch or commit corresponding to the desired version of that file.

We will see that the LakeFS API builds this information into the path of a given file. As such, when we refer to data we would use paths resembling the following:

- Repositories: `lakefs://<repo-name>`
- Commits: `lakefs://<repo-name>/<commit-id>`
- Branches: `lakefs://<repo-name>/<branch-id>`
- Files (objects): `lakefs://<repo-name>/<branch-id>/<object path>`

And thus reading a file might resemble this:

```python
# Assumes an existing SparkSession named `spark` that is configured for lakeFS.
df = spark.read.parquet('lakefs://<repo-name>/<branch-id>/<object path>')
```
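To show how the repository, reference, and object path are packed into a single URI, here is a small hypothetical parser. It assumes the `lakefs://<repo>/<ref>/<path>` layout mentioned in the Hadoop FileSystem description earlier and is not part of any lakeFS client library.

```python
from typing import NamedTuple, Optional


class LakeFSUri(NamedTuple):
    repo: str
    ref: Optional[str]   # branch name or commit ID
    path: Optional[str]  # object path inside the repository


def parse_lakefs_uri(uri: str) -> LakeFSUri:
    """Split a lakefs:// URI into repository, ref, and object path."""
    prefix = "lakefs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a lakeFS URI: {uri!r}")
    # Split at most twice so slashes inside the object path are preserved.
    parts = uri[len(prefix):].split("/", 2)
    if not parts[0]:
        raise ValueError(f"missing repository name: {uri!r}")
    ref = parts[1] if len(parts) > 1 and parts[1] else None
    path = parts[2] if len(parts) > 2 and parts[2] else None
    return LakeFSUri(parts[0], ref, path)


print(parse_lakefs_uri("lakefs://example-repo/main/tables/events/part-0.parquet"))
```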

## What Are LakeFS Commits
A LakeFS commit maps a LakeFS file path, along with its metadata, to an actual file path on the underlying store. In the documentation this is sometimes described as mapping names to objects or keys to values. As we will see, if a file path changes, the two different keys in the two commits will point to the same underlying file. But if the data changes, the same key in the two commits will point to different files on the underlying filesystem.

<center><img src='images/commit-example.png'></center>
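The same idea can be written down as plain dictionaries. This is a deliberately simplified sketch with made-up paths and addresses, where each commit is just a mapping from a logical path to a physical address:

```python
# Commit 1: the file is first written.
commit_1 = {"raw/users.csv": "s3://bucket/data/obj-001"}

# Commit 2: the file is moved. The key changes, the underlying object does not.
commit_2 = {"staging/users.csv": "s3://bucket/data/obj-001"}

# Commit 3: the file's contents change. Same key, new underlying object.
commit_3 = {"staging/users.csv": "s3://bucket/data/obj-002"}


def resolve(commit, logical_path):
    """Translate a logical lakeFS path to its physical address in a commit."""
    return commit[logical_path]


# A rename shares storage between commits: no data is duplicated.
print(resolve(commit_1, "raw/users.csv") == resolve(commit_2, "staging/users.csv"))

# A data change leaves the old object in place for older commits to read.
print(resolve(commit_2, "staging/users.csv") == resolve(commit_3, "staging/users.csv"))
```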

The example above is a simplified representation of commits. In this section we go deeper down the rabbit hole. As we will see, LakeFS commits are stored as a B+ tree of SSTables.

### What are SSTables
TL;DR: an SSTable (Sorted String Table) is a sorted, immutable key-value file.

SSTable refers to a data structure and the corresponding persistent file format. It is used by a number of NoSQL databases, specifically distributed databases and key-value stores built on Log-Structured Merge (LSM) trees, such as ScyllaDB, Apache Cassandra, and BigTable.

An SSTable provides a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings. Like any data structure, SSTable implementations provide operations for accessing and managing the data, for example methods to look up the value associated with a specified key or to iterate over all key/value pairs in a specified key range.

An SSTable is partitioned into blocks and provides a block index. The index is loaded into memory when the SSTable file is opened and makes it possible to locate a given block without excessive disk seeks (i.e. searching the disk). Additionally, if resources allow, the entire SSTable can be loaded into memory, avoiding the disk entirely during a search.
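A toy in-memory version makes the block index idea concrete. This sketch is only a model of the concept: real SSTable formats add compression, checksums, and on-disk layout details that are omitted here, and the class name and block size are arbitrary.

```python
import bisect


class ToySSTable:
    """An immutable, sorted key-value map split into fixed-size blocks."""

    def __init__(self, items, block_size=2):
        entries = sorted(items.items())
        self.blocks = [
            entries[i:i + block_size]
            for i in range(0, len(entries), block_size)
        ]
        # Block index: the last key of each block. This small list is what
        # would be held in memory after opening the file.
        self.index = [block[-1][0] for block in self.blocks]

    def get(self, key):
        """Return the value for key, or None if the key is absent."""
        # Find the first block whose last key is >= key.
        i = bisect.bisect_left(self.index, key)
        if i == len(self.blocks):
            return None
        for k, v in self.blocks[i]:  # only this one block is read
            if k == key:
                return v
        return None

    def scan(self, start, end):
        """Yield (key, value) pairs with start <= key < end, in key order."""
        i = bisect.bisect_left(self.index, start)
        for block in self.blocks[i:]:
            for k, v in block:
                if k >= end:
                    return
                if k >= start:
                    yield k, v


table = ToySSTable({b"a": b"1", b"b": b"2", b"c": b"3", b"d": b"4", b"e": b"5"})
print(table.get(b"c"))
print(list(table.scan(b"b", b"e")))
```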

A helpful conversation on the topic can be found in [this article](https://stackoverflow.com/questions/2576012/what-is-an-sstable) or [this article](http://distributeddatastore.blogspot.com/2013/08/cassandra-sstable-storage-format.html) or [this one](https://en.wikipedia.org/wiki/Log-structured_merge-tree).

### Commits Are Stored as a B+ Tree of SSTables (i.e. Graveler Files)

Each lakeFS commit is represented as a tree structure (for speed), specifically a B+ tree of height 2. The namespace of keys, i.e. the keyspace (the list of file paths for all files in the repo), is sorted and split into blocks, or ranges. Each range is stored in its own SSTable at level 2. Because it is sorted, each range has a start key and an end key marking its boundaries, and ranges do not overlap. The root of the tree (level 1) contains a sorted list of the last keys of all the ranges and maps each one to the corresponding range. This structure speeds up the lookup of a file in the repository: we seek on the root to find the range, then seek within that range, rather than potentially scanning the entire keyspace.

We can see an example commit below:

<center><img src='images/lakefs-commit-btree.png'></center>
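The two-step lookup (root first, then a single range) can be sketched as follows. This is a simplified in-memory model with made-up paths and a hypothetical `range_size` parameter. In lakeFS the root and each range would be separate SSTable files rather than Python lists.

```python
import bisect


def build_commit(entries, range_size=3):
    """Split a sorted keyspace into ranges and build the root over them."""
    items = sorted(entries.items())
    ranges = [items[i:i + range_size] for i in range(0, len(items), range_size)]
    # Root (level 1): the last key of each range, paired with the range's position.
    root = [(rng[-1][0], idx) for idx, rng in enumerate(ranges)]
    return root, ranges


def lookup(root, ranges, path):
    """Seek on the root to pick a range, then seek inside that range."""
    last_keys = [last for last, _ in root]
    pos = bisect.bisect_left(last_keys, path)  # first range whose last key >= path
    if pos == len(root):
        return None
    _, range_id = root[pos]
    rng = ranges[range_id]
    keys = [k for k, _ in rng]
    i = bisect.bisect_left(keys, path)
    if i < len(keys) and keys[i] == path:
        return rng[i][1]
    return None


files = {
    "a/1.parquet": "obj-01", "a/2.parquet": "obj-02", "b/1.parquet": "obj-03",
    "b/2.parquet": "obj-04", "c/1.parquet": "obj-05", "c/2.parquet": "obj-06",
    "d/1.parquet": "obj-07",
}
root, ranges = build_commit(files)
print([last for last, _ in root])
print(lookup(root, ranges, "c/1.parquet"))
```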

The commit is stored in a standardized format called "Graveler", and thus the SSTable files are referred to as graveler files. To be even more specific, LakeFS uses the RocksDB SSTable file format, implemented with the Pebble SSTable library from CockroachDB.

More information on this file format can be found [here](https://docs.lakefs.io/understand/versioning-internals.html) or [here](https://lakefs.io/concrete-graveler-committing-data-to-pebbledb-sstables) or [here](https://lakefs.io/concrete-graveler-splitting-for-reuse/).