# Overview
In this notebook we are going to work with lakeFS to set up and manage our first repository.
We will use several clients to interact with this repository. I will be using the local storage backend.

# 1. Prerequisites
This assumes we have a working installation and have followed the steps in the [Basic Initial Setup](Basic%20Initial%20Setup.ipynb) notebook.

# 2. Create A Repository
Once logged in, we see that there are no repositories.

<center><img src="images/lakefs-create-repo.png" style="width:800px"></center>

As we see, there are two options to select from when creating a repository. 

<center><img src="images/lakefs-create-repo-choice.png" style="width:800px"></center>

At the end of the day, both of these options lead to the same result; there is only one type of repo in lakeFS.

<center><img src="images/lakefs-repos-view.png" style="width:800px"></center>

But for completeness we will walk through both options in the create repo wizard.

## 2.1. Creating a blank repository

The [official documentation](https://docs.lakefs.io/quickstart/repository.html#create-the-repository) covers this scenario. After selecting the repo type, the wizard takes us to the following screen.

<center><img src="images/lakefs-create-blank-repo.png" style="width:800px"></center>


I had a look through the documentation to understand the various fields.

The **Repository ID** is a unique identifier, similar to the repo name one would choose in a git repository.

The **Storage Namespace** refers to the [underlying storage](https://docs.lakefs.io/understand/model.html#concepts-unique-to-lakefs). According to the documentation:

> The underlying storage is a location in an object store where lakeFS keeps your objects and some immutable metadata.
>
> The repository’s storage namespace is a location in the underlying storage where data for this repository will be stored.

Additionally, lakeFS sometimes refers to the underlying storage as physical storage. The path used to store the contents of an object is then termed a physical path.

The value for Storage Namespace begins with a prefix that denotes the storage type. The two used here are the local prefix, which references a path on our local system, and the s3 prefix, which references a storage location in an S3-compatible backend. For example:
- local://path 
- s3://example-bucket/prefix
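
To make the shape of these values concrete, here is a minimal Python sketch that splits a storage namespace into its prefix and location. The helper name `split_namespace` is my own and is not part of lakeFS; it only mirrors the `prefix://location` form of the two examples above.

```python
def split_namespace(namespace: str) -> tuple[str, str]:
    """Split 'prefix://location' into (prefix, location)."""
    prefix, separator, location = namespace.partition("://")
    # Reject values that do not follow the prefix://location shape.
    if not separator or not prefix or not location:
        raise ValueError(f"not a storage namespace: {namespace!r}")
    return prefix, location


print(split_namespace("local://path"))
print(split_namespace("s3://example-bucket/prefix"))
```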


<center><img src="images/lakefs-create-repo-values.png" style="width:800px"></center>

Once created we see the following page

<center><img src="images/lakefs-created-blank-repo-page.png" style="width:800px"></center>
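
The same step can probably be done without the UI. The sketch below only builds (and does not send) the HTTP request that I would expect to create a repository through the lakeFS REST API. The endpoint path `/api/v1/repositories`, the JSON field names, and the use of basic authentication are my reading of the lakeFS API and should be checked against the API reference for your version. The server address, credentials, repository name, and namespace are all placeholders.

```python
import base64
import json
import urllib.request


def build_create_repo_request(endpoint, access_key, secret_key, name, storage_namespace):
    """Build, without sending, a POST request that asks lakeFS to create a repository."""
    body = json.dumps({
        "name": name,
        "storage_namespace": storage_namespace,
        "default_branch": "main",
    }).encode("utf-8")
    # Assumption: the API accepts the access key pair via HTTP basic auth.
    token = base64.b64encode(f"{access_key}:{secret_key}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{endpoint.rstrip('/')}/api/v1/repositories",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


# Placeholder values only; substitute your own server and credentials.
request = build_create_repo_request(
    "http://localhost:8000",
    "<access-key-id>",
    "<secret-access-key>",
    "example-repo",
    "local://example-repo-namespace",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(request)
```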

I was curious about where the files were on the local filesystem. I looked at the configured root directory and saw that a directory was created. I also saw that a dummy file was placed into the directory as a sanity check.

```
[root@localhost lakefs]# pwd
/data/datalake/lakefs

[root@localhost lakefs]# ls -la
total 0
drwx------ 1 root root 1 Sep  8 17:21 .
drwxr-xr-x 1 root root 2 Sep  8 16:57 ..
drwxr-x--- 1 root root 1 Sep  8 17:21 test-blank-repo-namespace

[root@localhost lakefs]# ls -la test-blank-repo-namespace/
total 1
drwxr-x--- 1 root root  1 Sep  8 17:21 .
drwx------ 1 root root  1 Sep  8 17:21 ..
-rw-r--r-- 1 root root 70 Sep  8 17:21 dummy

[root@localhost lakefs]# ls -la test-blank-repo-namespace/dummy
-rw-r--r-- 1 root root 70 Sep  8 17:21 test-blank-repo-namespace/dummy

[root@localhost lakefs]# cat test-blank-repo-namespace/dummy
this is dummy data - created by lakeFS in order to check accessibility
```
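
The same inspection can be scripted. This is a small sketch, assuming the script runs on the lakeFS host with read access to the storage root; the function name `list_namespace` is hypothetical.

```python
from pathlib import Path


def list_namespace(root):
    """Return (relative path, size in bytes) for every file under root, sorted by path."""
    root = Path(root)
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )


# Usage on the host from the transcript above:
#   for path, size in list_namespace("/data/datalake/lakefs"):
#       print(size, path)
```

I will reuse this kind of listing later to see how uploads and commits change the directory.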





## 2.2. Creating a Spark Quickstart Repo

**Note**: Because we are using the local storage backend we won't be able to explore all the functionality of this option. Not to worry, we will revisit this later with a more complex configuration.

We now select the "Spark Quickstart" option from the drop-down to create a repository.

<center><img src="images/lakefs-create-spark-repo.png" style="width:800px"></center>

**Note**: This feature was not particularly well documented. There is a [design document](https://github.com/treeverse/lakeFS/blob/e7f9f9de4711c1518ba959243fc38e45af7900d1/design/open/ttv-wizard-mvp.md) in the GitHub repo that outlines how this wizard works at a high level. I also found [this GitHub issue](https://github.com/treeverse/lakeFS/issues/3411), which lays out the rationale for adding this component to the wizard. It comes down to comfort and convenience:

> To get started using a new lakeFS installation with Spark, the following manual steps are typically required:
>
> 1. Create a repository and configure it with a storage namespace
> 2. Bring in existing data (whether using import, copy, ingest, upload(?), etc)
> 3. Connect Spark using the installation's S3 Gateway or configure Spark to use lakeFSFS with this installation
> 4. Understand how to use lakeFS structured URIs in the existing code base or more likely configure metastore for this indirection to have existing code run on a separate branch/commit (i.e. I want SELECT * FROM events; to use the events table that exists in my experiment-oz-temp branch, not main or production!)
> 5. Now, integrate Hive Metastore/Glue Data Catalog with lakeFS to be able to have a schema for my branch that contains all the same tables that I have in prod.
>



I had to look through the source code to understand error messages I was getting. I will add nuggets of info as we go along.



We then see the wizard activate and we see a UI similar to that of the blank repo.

<center><img src="images/lakefs-create-spark-repo-2.png" style="width:800px"></center>

<center><img src="images/lakefs-create-spark-repo-3.png" style="width:400px"></center>

Clicking next we then have the ability to import data. As noted in the UI:

> Import doesn't copy objects. It only creates links to the objects in the lakeFS metadata layer. Don't worry, we will never change objects in the import source. [Learn more](https://docs.lakefs.io/setup/import.html).

<center><img src="images/lakefs-create-spark-repo-4.png" style="width:400px"></center>

I created some dummy data so we could see what the import actually did.

```
[root@localhost lakefs]# mkdir /data/datalake/dummy-data
[root@localhost lakefs]# echo "Hello, World" >  /data/datalake/dummy-data/my_file.txt
```

I plugged in the values and clicked next.

<center><img src="images/lakefs-create-spark-repo-5.png" style="width:400px"></center>


But this raised an error. On my screen I saw a red error message warning me that there was a problem.

<center><img src="images/lakefs-create-spark-repo-6.png" style="width:400px"></center>

I searched through the source code to find instances where this phrasing is used. I saw that the WalkerFactory's GetWalker function does not support a local storage backend. Looking at the [source code](https://github.com/treeverse/lakeFS/blob/71db8f37656ad021f3178a376651519ed24f4cf7/pkg/ingest/store/factory.go#L158) it looks like the only supported URI schemes at this point are s3://, gs:// for GCP, or http:// and https:// for Azure.

I was a bit confused by this because I was able to create a repo and upload files to the local storage backend. After looking at the code and consulting blogs and documentation, I realized that the issue is simply that the "walker" (the component that walks the file system, enumerates file paths, and reads files) was not programmed to work on the local filesystem. After chatting in the official [slack channel](https://go.lakefs.io/JoinSlack), I learned that this is because it is assumed that Spark cannot access the files on the local filesystem of the lakeFS server. The developers had not considered the case where a remote filesystem is mounted on all the nodes.
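
To illustrate the behaviour I ran into, here is a rough Python sketch of a scheme check. It is not the lakeFS code (which is written in Go); the set of schemes is simply the list I read out of the linked source file, and the `local://` URI is a hypothetical example of an import source on the local backend.

```python
from urllib.parse import urlsplit

# Schemes reported above as supported by the import walker at the time of writing.
SUPPORTED_IMPORT_SCHEMES = {"s3", "gs", "http", "https"}


def can_import_from(uri: str) -> bool:
    """Return True if the URI scheme is one of the supported import schemes."""
    return urlsplit(uri).scheme.lower() in SUPPORTED_IMPORT_SCHEMES


print(can_import_from("s3://example-bucket/prefix"))          # True
print(can_import_from("local:///data/datalake/dummy-data"))   # False
```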

As a workaround we have a few options:
1. Skip the wizard and upload manually as we do with the basic repo
2. Spin up an HTTP file server to serve our local files
3. Spin up MinIO (an S3-compatible API)
4. Switch to an S3 data store

For now I will skip this.
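
For reference, option 2 could look something like the sketch below, which serves a local directory over HTTP using only the Python standard library. I have not verified that the import wizard accepts such a server as a source (the `http://` and `https://` schemes appear to be intended for Azure), so treat this purely as an illustration of the idea. The function name, host, and port are my own choices.

```python
import functools
import http.server


def make_file_server(directory, host="127.0.0.1", port=8080):
    """Create an HTTP server that serves the files under `directory`."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    return http.server.ThreadingHTTPServer((host, port), handler)


# Usage (blocks until interrupted with Ctrl+C):
#   server = make_file_server("/data/datalake/dummy-data")
#   server.serve_forever()
```

Note that this server has no authentication, so it should only be bound to an interface you trust.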

Lastly, we see the Spark configurations page, which gives us the Spark configurations required to connect to our datastore.

<center><img src="images/lakefs-create-spark-repo-7.png" style="width:400px"></center>
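
The exact values from the wizard are only visible in the screenshot, so I will not try to reproduce them. As a rough sketch of what this kind of configuration typically looks like, the snippet below builds the standard Hadoop S3A properties commonly used to point Spark at an S3-compatible endpoint such as the lakeFS S3 gateway. The endpoint and credentials are placeholders, and the wizard's output for your installation may differ.

```python
def s3_gateway_spark_conf(endpoint, access_key, secret_key):
    """Return Spark settings that point the S3A filesystem at an S3-compatible endpoint."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Path-style access keeps the bucket name in the URL path instead of the host name.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


# Placeholder values only.
conf = s3_gateway_spark_conf("http://localhost:8000", "<access-key-id>", "<secret-access-key>")
for key, value in conf.items():
    print(f"--conf {key}={value}")
```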


# 3. Branching And Merging

In this next section we look at how branching and merging work. We will not go through any complex branching strategy. We will just explore the basics.

## 3.1. Uploading a document to main branch

In this example we upload a simple text file from our local filesystem to the main branch on the lakeFS server through the GUI.

<center><img src="images/lakefs-upload-1.png" style="width:800px"></center>
<center><img src="images/lakefs-upload-2.png" style="width:400px"></center>
<center><img src="images/lakefs-upload-3.png" style="width:800px"></center>
<center><img src="images/lakefs-upload-4.png" style="width:400px"></center>

**Note**: This file simply contains the text "Hello, World!"

<center><img src="images/lakefs-upload-5.png" style="width:800px"></center>

## 3.2. Making a commit

Looking at the uncommitted changes tab we see that the file is recognized as being changed and we have the option of making a commit.

<center><img src="images/lakefs-changes-1.png" style="width:800px"></center>
<center><img src="images/lakefs-changes-2.png" style="width:400px"></center>
<center><img src="images/lakefs-changes-3.png" style="width:800px"></center>
<center><img src="images/lakefs-changes-4.png" style="width:800px"></center>

## 3.3. Understand changes in filesystem

I now wanted to understand how staged changes and commits were represented on the underlying file system. I created a second file with the name test2 and the text "hello, world! 2". I then had a look at the underlying filesystem.

We can see that two files appear in the repository. Although renamed, the files contain the contents we uploaded.

```
[root@localhost test-blank-repo-namespace]# ls -la
total 2
drwxr-x--- 1 root root  4 Sep  9 03:50 .
drwx------ 1 root root  2 Sep  9 03:47 ..
-rw-r--r-- 1 root root 13 Sep  9 03:47 4b7c506d1e0449bfbd5ec165d93d3f25
-rw-r--r-- 1 root root 15 Sep  9 03:50 637a87413c314022b23af24590d8f60f
-rw-r--r-- 1 root root 70 Sep  9 03:47 dummy
drwxr-x--- 1 root root  2 Sep  9 03:50 _lakefs

[root@localhost test-blank-repo-namespace]# cat 4b7c506d1e0449bfbd5ec165d93d3f25
hello, world![root@localhost test-blank-repo-namespace]#

[root@localhost test-blank-repo-namespace]# cat 637a87413c314022b23af24590d8f60f
hello, world! 2
```

Additionally, we see a _lakefs directory.

```
[root@localhost test-blank-repo-namespace]# ls -la _lakefs/
total 3
drwxr-x--- 1 root root    2 Sep  8 23:51 .
drwxr-x--- 1 root root    4 Sep  8 23:56 ..
-rw-r--r-- 1 root root 1058 Sep  8 23:51 8565a8d7cf787aaf005af5b5e6e145f8c62cce6544eb3b0e87acd17c10e666c6
-rw-r--r-- 1 root root 1018 Sep  8 23:51 b3f4878b4f146f36da8800fc12635176903622f9655673aad972c9af13208b83
```

It's a bit complicated to explain exactly what is in this directory. In short, the directory contains commit metadata: each file records which objects are contained in which commit. By committing, the metadata that was in the Postgres database is written to the underlying storage under the _lakefs prefix (in our case the local filesystem rather than S3). Each of the files under the _lakefs prefix is a Graveler file, which is an “immutable” SSTable with metadata compatible with RocksDB. The name of the file itself is a function of its content (we say the file is “content-addressable”).

More information can be found in the [versioning internals](https://docs.lakefs.io/understand/versioning-internals.html) documentation or [this article](https://blog.dataminded.com/what-is-lakefs-a-critical-survey-edce708a9b8e)
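
To show what "content-addressable" means in practice, here is a tiny Python sketch that derives a name from a file's bytes. Using SHA-256 is my assumption: the 64-character hexadecimal names in the listing above are consistent with it, but I have not confirmed the exact hashing scheme lakeFS uses or what input it hashes.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Return a name that depends only on the content (hex SHA-256 digest)."""
    return hashlib.sha256(data).hexdigest()


a = content_address(b"example metadata")
b = content_address(b"example metadata")
c = content_address(b"example metadata, changed")
print(a == b)   # True: identical content always yields the same name
print(a == c)   # False: any change in content yields a different name
print(len(a))   # 64
```

One consequence of this design is that such files never need to be updated in place: new content simply produces a new file name.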

## 3.4. Creating a branch

<center><img src="images/lakefs-create-branch.png" style="width:800px"></center>
<center><img src="images/lakefs-create-branch-2.png" style="width:400px"></center>
<center><img src="images/lakefs-create-branch-3.png" style="width:800px"></center>
<center><img src="images/lakefs-create-branch-4.png" style="width:800px"></center>



Looking at the underlying filesystem we see that data was not duplicated! We still see the original committed file and the staged file from the main branch. We can see that lakeFS does not move files around in the same way that git does. This is great for scalability!

I uploaded a test3 file to the branch and we see it was added to the filesystem:

```
[root@localhost test-blank-repo-namespace]# ls -la
total 2
drwxr-x--- 1 root root  5 Sep  9 03:53 .
drwx------ 1 root root  2 Sep  9 03:47 ..
-rw-r--r-- 1 root root 13 Sep  9 03:47 4b7c506d1e0449bfbd5ec165d93d3f25
-rw-r--r-- 1 root root 15 Sep  9 03:53 5b64d28e92ca43e5a285321b1d017405
-rw-r--r-- 1 root root 15 Sep  9 03:50 637a87413c314022b23af24590d8f60f
-rw-r--r-- 1 root root 70 Sep  9 03:47 dummy
drwxr-x--- 1 root root  2 Sep  9 03:50 _lakefs

[root@localhost test-blank-repo-namespace]# cat 5b64d28e92ca43e5a285321b1d017405
hello, world! 3
```
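
A toy model helps me reason about why branching is cheap. The sketch below is not how lakeFS is implemented (it uses Graveler metadata files, not Python dictionaries); it only illustrates the idea that objects are stored once and a branch is a mapping from logical paths to physical addresses. The class, the `obj-N` addresses, the branch name `feature`, and the file name `test` are all made up for the example.

```python
class ToyRepo:
    """Toy model: objects are stored once; branches only map paths to addresses."""

    def __init__(self):
        self.objects = {}               # physical address -> bytes
        self.branches = {"main": {}}    # branch -> {logical path: physical address}
        self._next_id = 0

    def upload(self, branch, path, data):
        # Every upload writes a new object and points the path at it.
        address = f"obj-{self._next_id}"
        self._next_id += 1
        self.objects[address] = data
        self.branches[branch][path] = address

    def create_branch(self, name, source):
        # Copy only the path -> address mapping, never the object bytes.
        self.branches[name] = dict(self.branches[source])


repo = ToyRepo()
repo.upload("main", "test", b"hello, world!")
repo.upload("main", "test2", b"hello, world! 2")
repo.create_branch("feature", "main")
print(len(repo.objects))                # 2: branching added no objects
repo.upload("feature", "test3", b"hello, world! 3")
print(len(repo.objects))                # 3: only the new upload was stored
print(sorted(repo.branches["main"]))    # ['test', 'test2']
```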


## 3.5. Modifying files
Like git, I have the ability to modify staged or committed files on a branch. The way to do this in the GUI is simply to upload the new revision and specify an existing file name in the dialogue box. In doing so, we will see that the UI is able to distinguish between additions and modifications.

<center><img src="images/lakefs-modifications.png" style="width:800px"></center>

I noticed we cannot compare the uncommitted changes no matter how I configure the comparison.

<center><img src="images/lakefs-modify-2.png" style="width:800px"></center>

We can only compare the committed changes.

<center><img src="images/lakefs-modify-4.png" style="width:800px"></center>

Making a commit, we see the information included in the commit.

<center><img src="images/lakefs-modify-3.png" style="width:800px"></center>

Looking at the filesystem we see that old files remain (we may want to revert to them in the future).

```
[root@localhost test-blank-repo-namespace]# ls -la
total 3
drwxr-x--- 1 root root  6 Sep  9 03:56 .
drwx------ 1 root root  2 Sep  9 03:47 ..
-rw-r--r-- 1 root root 13 Sep  9 03:47 4b7c506d1e0449bfbd5ec165d93d3f25
-rw-r--r-- 1 root root 15 Sep  9 03:53 5b64d28e92ca43e5a285321b1d017405
-rw-r--r-- 1 root root 15 Sep  9 03:50 637a87413c314022b23af24590d8f60f
-rw-r--r-- 1 root root 15 Sep  9 03:56 daaf5fb4fde44d3e8a7e81553e3f7834
-rw-r--r-- 1 root root 70 Sep  9 03:47 dummy
drwxr-x--- 1 root root  4 Sep  9 04:01 _lakefs

[root@localhost test-blank-repo-namespace]# cat 4b7c506d1e0449bfbd5ec165d93d3f25
hello, world!

[root@localhost test-blank-repo-namespace]# cat 637a87413c314022b23af24590d8f60f
hello, world! 2

[root@localhost test-blank-repo-namespace]# cat 5b64d28e92ca43e5a285321b1d017405
hello, world! 3

[root@localhost test-blank-repo-namespace]# cat daaf5fb4fde44d3e8a7e81553e3f7834
hello, world! 4
```

## 3.6. Deleting Files
When we delete files they remain on disk. According to [the documentation](https://docs.lakefs.io/reference/garbage-collection.html):

> By default, lakeFS keeps all your objects forever. This allows you to travel back in time to previous versions of your data. However, sometimes you may want to hard-delete your objects - namely, delete them from the underlying storage. Reasons for this include cost-reduction and privacy policies.

> Garbage collection rules in lakeFS define for how long to retain objects after they have been deleted (see more information below). lakeFS provides a Spark program to hard-delete objects that have been deleted and whose retention period has ended according to the GC rules. The GC job does not remove any commits: you will still be able to use commits containing hard-deleted objects, but trying to read these objects from lakeFS will result in a 410 Gone HTTP status.

> Note At this point, lakeFS supports Garbage Collection only on S3 and Azure. We have concrete plans to extend the support to GCP.
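
The retention idea in the quote can be sketched as a simple time comparison. This is a deliberate simplification and not the lakeFS garbage collection logic, which works from GC rules and commits; the function name, the 21-day retention period, and the dates are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def eligible_for_hard_delete(deleted_at, retention_days, now=None):
    """Return True once the retention period since deletion has fully elapsed."""
    now = now or datetime.now(timezone.utc)
    return now - deleted_at >= timedelta(days=retention_days)


deleted = datetime(2022, 9, 9, tzinfo=timezone.utc)
print(eligible_for_hard_delete(deleted, 21, now=datetime(2022, 9, 20, tzinfo=timezone.utc)))  # False
print(eligible_for_hard_delete(deleted, 21, now=datetime(2022, 10, 1, tzinfo=timezone.utc)))  # True
```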

## 3.7. Merging Branches

At this point I would like to merge my feature branch back into my main branch.

<center><img src="images/lakefs-merge-1.png" style="width:800px"></center>
<center><img src="images/lakefs-merge-2.png" style="width:400px"></center>
<center><img src="images/lakefs-merge-3.png" style="width:400px"></center>
<center><img src="images/lakefs-merge-4.png" style="width:400px"></center>
<center><img src="images/lakefs-merge-5.png" style="width:800px"></center>
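
To build some intuition for what a merge does with path-to-object mappings, here is a generic three-way merge sketch in Python. It is an assumption on my part that this captures the spirit of the lakeFS merge; the real implementation operates on its own metadata structures and may differ in the details. The function name and the single-letter addresses are made up for the example.

```python
def three_way_merge(base, source, dest):
    """Merge {path: address} mappings using their common ancestor `base`.

    A missing path is treated as None, so additions and deletions are handled
    the same way as modifications. Raises ValueError on conflicting changes.
    """
    merged = {}
    for path in set(base) | set(source) | set(dest):
        b, s, d = base.get(path), source.get(path), dest.get(path)
        if s == d:
            result = s      # both sides agree
        elif s == b:
            result = d      # only the destination changed this path
        elif d == b:
            result = s      # only the source changed this path
        else:
            raise ValueError(f"conflict on {path!r}")
        if result is not None:
            merged[path] = result
    return merged


base = {"test": "a", "test2": "b"}
feature = {"test": "a", "test2": "b", "test3": "c"}   # added test3 on the branch
main = {"test": "a", "test2": "b"}                    # unchanged since branching
print(sorted(three_way_merge(base, feature, main).items()))
# [('test', 'a'), ('test2', 'b'), ('test3', 'c')]
```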
