<img src="./images/logo.svg" alt="lakeFS logo" width=300/> 

# lakeFS for Data Collaboration

## Prerequisites

* This notebook requires a connection to lakeFS Cloud or lakeFS Enterprise.
* Register for lakeFS Cloud at https://lakefs.cloud/register, or contact us for a lakeFS Enterprise key at https://lakefs.io/contact-sales/

### The image below demonstrates the setup created in this sample notebook:
*  A single lakeFS repository with a protected main branch that stores production data.
*  Three groups:
    * **Admins**: including a single user
    * **Data Scientists**: including a single user
    * **Developers**: including two users
* An **FSBlockMergingToMain** policy that prevents members of the Data Scientists group from promoting data to production.
* Multiple branches created by individual users

![data_collaboration](./images/data_colab.png)

## Config

### Change your lakeFS credentials

In [None]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io'
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Storage Information
##### Change the Storage Namespace to a location in the bucket you've configured. The storage namespace is a location in the underlying storage where data for this repository will be stored.

In [None]:
storageNamespace = 's3://example/' # e.g. "s3://username-lakefs-cloud/"
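Because the repository name is appended to this value later in the notebook, a trailing slash on the namespace can easily end up doubled. The following is a minimal sketch of a helper that joins the two parts with exactly one slash. The name `repo_storage_namespace` is hypothetical and not part of the lakeFS SDK.

```python
def repo_storage_namespace(base: str, repo: str) -> str:
    """Join a base storage namespace and a repository name with exactly one slash."""
    # Drop trailing slashes from the base and stray slashes around the repo name.
    return f"{base.rstrip('/')}/{repo.strip('/')}"


print(repo_storage_namespace("s3://example/", "data-collaboration-repo"))
# prints: s3://example/data-collaboration-repo
```

The same result is produced whether or not the base namespace ends with a slash.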

## Setup

**(you shouldn't need to change anything in this section, just run it)**

## You can change the lakeFS repo name (use an existing repo or provide a new name)

In [None]:
repo_name = "data-collaboration-repo"

## Versioning Information

In [None]:
mainBranch = "main"
fileName = "lakefs_test.csv"

### Import libraries

In [None]:
%xmode Minimal
import lakefs
from lakefs.client import Client
import lakefs_sdk
from lakefs_sdk.client import LakeFSClient
from lakefs_sdk import models
from assets.lakefs_demo import print_commit, print_diff

## Working with the lakeFS Python client API

In [None]:
if 'superUserClient' not in locals():
    configuration = lakefs_sdk.Configuration(
        host=lakefsEndPoint,
        username=lakefsAccessKey,
        password=lakefsSecretKey,
    )
    superUserClient = LakeFSClient(configuration)

### Verify lakeFS credentials by getting lakeFS version

In [None]:
print("Verifying lakeFS credentials…")
try:
    v=superUserClient.internal_api.get_lake_fs_version().version
except Exception:
    print("🛑 failed to get lakeFS version")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v}")

## Super User creates an "admin1" user

In [None]:
superUserClient.auth_api.create_user(
    user_creation=models.UserCreation(
        id='admin1'))

## Super User adds "admin1" user to the "Admins" group auto-created by lakeFS

In [None]:
groupName='Admins'

has_more = True
next_offset = ""
while has_more:
    groups = superUserClient.auth_api.list_groups(after=next_offset)
    for r in groups.results:
        if r.name == groupName:
            groupId = r.id
            break
    has_more = groups.pagination.has_more
    next_offset = groups.pagination.next_offset
    
superUserClient.auth_api.add_group_membership(
    group_id=groupId,
    user_id='admin1')
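The lookup above walks the paginated group list until it finds a matching name. As a rough sketch, the same pattern can be wrapped in a function that stops as soon as the group is found and returns `None` when no page contains it. The helper name `find_group_id` and the fake two-page data are assumptions for illustration; the only thing taken from the notebook is the shape of the response (`results`, `pagination.has_more`, `pagination.next_offset`).

```python
from types import SimpleNamespace


def find_group_id(list_groups, name):
    """Return the id of the group called `name`, or None if no page contains it.

    `list_groups` is any callable that accepts `after=` and returns an object
    with `.results` and `.pagination` (`has_more`, `next_offset`).
    """
    after = ""
    while True:
        page = list_groups(after=after)
        for group in page.results:
            if group.name == name:
                return group.id
        if not page.pagination.has_more:
            return None
        after = page.pagination.next_offset


# Stand-in for the server: two pages of made-up groups (illustrative data only).
_pages = {
    "": SimpleNamespace(
        results=[SimpleNamespace(id="g1", name="Admins")],
        pagination=SimpleNamespace(has_more=True, next_offset="p2"),
    ),
    "p2": SimpleNamespace(
        results=[SimpleNamespace(id="g2", name="Developers")],
        pagination=SimpleNamespace(has_more=False, next_offset=""),
    ),
}


def fake_list_groups(after=""):
    return _pages[after]


print(find_group_id(fake_list_groups, "Developers"))  # prints: g2
```

Against a real server, the call would presumably look like `find_group_id(superUserClient.auth_api.list_groups, 'Admins')`. Checking for `None` before calling `add_group_membership` avoids a `NameError` when the group does not exist.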

## Create credentials for "admin1" user

In [None]:
credentials = superUserClient.auth_api.create_credentials(user_id='admin1')
print(credentials)
admin1AccessKey = credentials.access_key_id
admin1SecretKey = credentials.secret_access_key

## Create a lakeFS Python client for "admin1" user

In [None]:
configuration = lakefs_sdk.Configuration(
    host=lakefsEndPoint,
    username=admin1AccessKey,
    password=admin1SecretKey,
)
admin1Client = LakeFSClient(configuration)

admin1LakefsClient = Client(
    host=lakefsEndPoint,
    username=admin1AccessKey,
    password=admin1SecretKey,
)


print("Created lakeFS client for admin1.")

## Verify user for "admin1Client" Python client

In [None]:
admin1Client.auth_api.get_current_user()

# The Demo Starts Here

##### "admin1" does the rest of the setup to define data collaboration rules specific to the organization

#### "admin1" creates "developer1" and "developer2" users

In [None]:
admin1Client.auth_api.create_user(
    user_creation=models.UserCreation(
        id='developer1'))

In [None]:
admin1Client.auth_api.create_user(
    user_creation=models.UserCreation(
        id='developer2'))

## "admin1" adds "developer1" and "developer2" to the lakeFS-created "Developers" group

In [None]:
groupNameDevelopers='Developers'

has_more = True
next_offset = ""
while has_more:
    groups = admin1Client.auth_api.list_groups(after=next_offset)
    for r in groups.results:
        if r.name == groupNameDevelopers:
            groupIdDevelopers = r.id
            break
    has_more = groups.pagination.has_more
    next_offset = groups.pagination.next_offset
    
admin1Client.auth_api.add_group_membership(
    group_id=groupIdDevelopers,
    user_id='developer1')

admin1Client.auth_api.add_group_membership(
    group_id=groupIdDevelopers,
    user_id='developer2')

## Create credentials for "developer1" and "developer2" users

In [None]:
credentials = admin1Client.auth_api.create_credentials(user_id='developer1')
print(credentials)
developer1AccessKey = credentials.access_key_id
developer1SecretKey = credentials.secret_access_key

credentials = admin1Client.auth_api.create_credentials(user_id='developer2')
print(credentials)
developer2AccessKey = credentials.access_key_id
developer2SecretKey = credentials.secret_access_key

## Create lakeFS Python client for "developer1" and "developer2" users

In [None]:
configuration = lakefs_sdk.Configuration(
    host=lakefsEndPoint,
    username=developer1AccessKey,
    password=developer1SecretKey,
)
developer1Client = LakeFSClient(configuration)

developer1LakeFSClient = Client(
    host=lakefsEndPoint,
    username=developer1AccessKey,
    password=developer1SecretKey,
)
    
print("Created lakeFS client for developer1.")

In [None]:
configuration = lakefs_sdk.Configuration(
    host=lakefsEndPoint,
    username=developer2AccessKey,
    password=developer2SecretKey,
)
developer2Client = LakeFSClient(configuration)

developer2LakeFSClient = Client(
    host=lakefsEndPoint,
    username=developer2AccessKey,
    password=developer2SecretKey,
)
    
print("Created lakeFS client for developer2.")

## Verify user for "developer1" and "developer2" Python clients

In [None]:
developer1Client.auth_api.get_current_user()

In [None]:
developer2Client.auth_api.get_current_user()

## "admin1" creates "DataScientists" group

In [None]:
DataScientistsGroup = admin1Client.auth_api.create_group(
    group_creation=models.GroupCreation(
        id='DataScientists'))

## "admin1" attaches lakeFS created "AuthManageOwnCredentials" policy to "DataScientists" group

In [None]:
admin1Client.auth_api.attach_policy_to_group(
    group_id=DataScientistsGroup.id,
    policy_id='AuthManageOwnCredentials')

## "admin1" attaches lakeFS created "FSReadWriteAll" policy to "DataScientists" group

In [None]:
admin1Client.auth_api.attach_policy_to_group(
    group_id=DataScientistsGroup.id,
    policy_id='FSReadWriteAll')

## "admin1" attaches lakeFS created "RepoManagementReadAll" policy to "DataScientists" group

In [None]:
admin1Client.auth_api.attach_policy_to_group(
    group_id=DataScientistsGroup.id,
    policy_id='RepoManagementReadAll')

## "admin1" creates "data_scientist1" user

In [None]:
admin1Client.auth_api.create_user(
    user_creation=models.UserCreation(
        id='data_scientist1'))

## "admin1" adds "data_scientist1" user to "DataScientists" group

In [None]:
admin1Client.auth_api.add_group_membership(
    group_id=DataScientistsGroup.id,
    user_id='data_scientist1')

## Create credentials for "data_scientist1" user

In [None]:
credentials = admin1Client.auth_api.create_credentials(user_id='data_scientist1')
print(credentials)
data_scientist1AccessKey = credentials.access_key_id
data_scientist1SecretKey = credentials.secret_access_key

## Create lakeFS Python client for "data_scientist1" user

In [None]:
configuration = lakefs_sdk.Configuration(
    host=lakefsEndPoint,
    username=data_scientist1AccessKey,
    password=data_scientist1SecretKey,
)
data_scientist1Client = LakeFSClient(configuration)

data_scientist1LakeFSClient = Client(
    host=lakefsEndPoint,
    username=data_scientist1AccessKey,
    password=data_scientist1SecretKey,
)
    
print("Created lakeFS client for data_scientist1.")

## Verify user for "data_scientist1Client" Python client

In [None]:
data_scientist1Client.auth_api.get_current_user()

## "admin1" creates the "FSBlockMergingToMain" policy, which denies commits to the main branch and therefore blocks merges into it

In [None]:
admin1Client.auth_api.create_policy(
    policy=models.Policy(
        id='FSBlockMergingToMain',
        statement=[models.Statement(
            effect="deny",
            resource="arn:lakefs:fs:::repository/*/branch/main",
            action=["fs:CreateCommit"],
        ),
        ]
    )
)
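To make the effect of a `deny` statement easier to reason about, here is a deliberately simplified model of policy evaluation: an explicit deny always wins, and otherwise at least one matching allow is required. This is only a sketch, not the lakeFS implementation. The function `is_allowed`, the dictionary layout, the broad allow statement, and the use of `fnmatch` for wildcard matching are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase


def is_allowed(statements, action, resource):
    """Simplified evaluation: any matching deny wins, else a matching allow is needed."""
    allowed = False
    for st in statements:
        if not any(fnmatchcase(action, pattern) for pattern in st["action"]):
            continue  # statement does not cover this action
        if not fnmatchcase(resource, st["resource"]):
            continue  # statement does not cover this resource
        if st["effect"] == "deny":
            return False  # explicit deny overrides every allow
        allowed = True
    return allowed


statements = [
    # Hypothetical broad allow, standing in for the group's read/write policy.
    {"effect": "allow", "action": ["fs:*"], "resource": "*"},
    # Same shape as the FSBlockMergingToMain statement above.
    {
        "effect": "deny",
        "action": ["fs:CreateCommit"],
        "resource": "arn:lakefs:fs:::repository/*/branch/main",
    },
]

main_arn = "arn:lakefs:fs:::repository/data-collaboration-repo/branch/main"
ds_arn = "arn:lakefs:fs:::repository/data-collaboration-repo/branch/ds_branch"

print(is_allowed(statements, "fs:CreateCommit", main_arn))  # prints: False
print(is_allowed(statements, "fs:CreateCommit", ds_arn))    # prints: True
```

Under this model, a group member can still commit to any other branch; only commits that target `main` are rejected.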

## "admin1" attaches "FSBlockMergingToMain" policy to "DataScientists" group

In [None]:
admin1Client.auth_api.attach_policy_to_group(
    group_id=DataScientistsGroup.id,
    policy_id='FSBlockMergingToMain')

## "admin1" creates the repo (if it already exists on your lakeFS server, the existing repo is reused)

In [None]:
# Strip any trailing slash from the namespace so the joined path has no double slash
repo = lakefs.Repository(repo_name, client=admin1LakefsClient).create(storage_namespace=f"{storageNamespace.rstrip('/')}/{repo_name}", default_branch=mainBranch, exist_ok=True)
branchMain = repo.branch(mainBranch)
print(repo)

## "admin1" protects the main branch so no one can write to it directly; any subsequent change must arrive by merging another branch

In [None]:
admin1Client.repositories_api.set_branch_protection_rules(
    repository=repo_name,
    branch_protection_rule=[models.BranchProtectionRule(
        pattern=mainBranch)])

## "admin1" tries to upload a file to a "shopping_transactions/raw" folder on the main branch, but the upload fails because main branch is protected

In [None]:
contentToUpload = open(f"/data/{fileName}", 'r').read()
branchMain.object('shopping_transactions/raw/'+fileName).upload(data=contentToUpload, mode='wb', pre_sign=False)

## "admin1" creates an "ingest-shopping-transactions" branch

In [None]:
branchIngestShoppingTransactions = repo.branch('ingest-shopping-transactions').create(source_reference=mainBranch)
print("ingest-shopping-transactions ref:", branchIngestShoppingTransactions.get_commit().id)

## "admin1" uploads the file to "shopping_transactions/raw" folder in "ingest-shopping-transactions" branch

In [None]:
contentToUpload = open(f"/data/{fileName}", 'r').read()
branchIngestShoppingTransactions.object('shopping_transactions/raw/'+fileName).upload(data=contentToUpload, mode='wb', pre_sign=False)

## "admin1" commits changes and attaches some metadata

In [None]:
ref = branchIngestShoppingTransactions.commit(message='Ingested raw shopping transactions data!', metadata={'using': 'python_sdk'})
print_commit(ref.get_commit())

## "admin1" merges "ingest-shopping-transactions" branch to main branch

In [None]:
res = branchIngestShoppingTransactions.merge_into(branchMain)
print(res)

## "developer1" is changing a script that transforms raw shopping transactions data into the datasets the user application consumes
### "developer1" wants to test the change against real production data under shopping_transactions/raw. To do that, they create a branch from "main"

In [None]:
branchTransformationsChange = repo.branch('transformations-change').create(source_reference=mainBranch)
print("transformations-change ref:", branchTransformationsChange.get_commit().id)

## At the same time, "developer2" is ingesting new raw data into "shopping_transactions/raw"
### "developer2" creates an "ingest-shopping-transactions-2" branch

In [None]:
branchSecondIngestion = repo.branch('ingest-shopping-transactions-2').create(source_reference=mainBranch)
print("ingest-shopping-transactions-2 ref:", branchSecondIngestion.get_commit().id)

## "developer2" uploads additional data to "shopping_transactions/raw" folder in "ingest-shopping-transactions-2" branch

In [None]:
contentToUpload = open(f"/data/{fileName}", 'r').read()
branchSecondIngestion.object('shopping_transactions/raw/rawdata2.csv').upload(data=contentToUpload, mode='wb', pre_sign=False)

## "developer2" commits with additional commit metadata

In [None]:
ref = branchSecondIngestion.commit(message='Ingested raw shopping transactions data!', metadata={'using': 'python_sdk'})
print_commit(ref.get_commit())

## "developer2" merges "ingest-shopping-transactions-2" into main, introducing new data to production

In [None]:
res = branchSecondIngestion.merge_into(branchMain)
print(res)

## The "developer1" branch still points to the production data version it was created from and does not see the recent change made by "developer2"

In [None]:
diff = branchTransformationsChange.diff(other_ref=branchSecondIngestion)
print_diff(diff)
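When a diff grows beyond a handful of entries, a per-type count can be easier to read than the full listing. The sketch below assumes each change exposes a `type` attribute (for example `added` or `removed`); that attribute name and the helper `summarize_changes` are assumptions, and the sample changes are made-up stand-ins rather than real lakeFS output.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_changes(changes):
    """Count changes by their `type` attribute and return a plain dict."""
    return dict(Counter(change.type for change in changes))


# Made-up changes for illustration only.
sample_changes = [
    SimpleNamespace(type="added", path="shopping_transactions/raw/rawdata2.csv"),
    SimpleNamespace(type="added", path="shopping_transactions/raw/rawdata3.csv"),
    SimpleNamespace(type="removed", path="shopping_transactions/raw/old.csv"),
]

print(summarize_changes(sample_changes))  # prints: {'added': 2, 'removed': 1}
```

If the diff object is a one-shot iterator, summarizing it consumes it, so it may need to be materialized with `list(...)` first when it is also printed.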

## "data_scientist1" creates "ds_branch" branch

In [None]:
branchDSBranch = lakefs.Repository(repo_name, client=data_scientist1LakeFSClient).branch('ds_branch').create(source_reference=mainBranch)
print("ds_branch ref:", branchDSBranch.get_commit().id)

## "data_scientist1" uploads a new file to "ds_branch" branch

In [None]:
contentToUpload = open('/data/lakefs_test_new.csv', 'r').read()
branchDSBranch.object('ds/lakefs_test_new.csv').upload(data=contentToUpload, mode='wb', pre_sign=False)

## "data_scientist1" commits changes and attaches some metadata

In [None]:
ref = branchDSBranch.commit(message='Added new data file!', metadata={'using': 'python_sdk'})
print_commit(ref.get_commit())

## But "data_scientist1" can't merge "ds_branch" branch to main branch due to "FSBlockMergingToMain" policy attached to "DataScientists" group

In [None]:
branchDSBranch.merge_into(mainBranch)

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack