# Managing the Data Lifecycle with lakeFS

## Efficient DataOps for High-Quality Data Products

#### Environment Set-Up
###### (Run only once)


<img src="https://lakefs.io/wp-content/uploads/2022/06/what-is-lakefs-slide.png" width=800/>

<img src="https://lakefs.io/wp-content/uploads/2022/06/why-git-for-data-2.png" width=800/>

In [0]:
# Set the lakeFS endpoint, access key, and secret key used later to configure the Python client
lakefsEndPoint = 'https://YourEndPoint/' # e.g. 'https://username.azure_region_name.lakefscloud.io'
lakefsAccessKey = 'AKIAlakeFSAccessKey'
lakefsSecretKey = 'lakeFSSecretKey'
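
Hardcoding the keys in a notebook cell is convenient for a demo but easy to leak. One possible alternative, sketched below, reads the same three values from environment variables. The variable names `LAKEFS_ENDPOINT`, `LAKEFS_ACCESS_KEY`, and `LAKEFS_SECRET_KEY` and the helper `load_lakefs_credentials` are hypothetical and not part of lakeFS; adapt them to however your workspace stores secrets.

```python
import os


def load_lakefs_credentials(env=None):
    """Return (endpoint, access_key, secret_key) read from a mapping of variables.

    The variable names below are illustrative assumptions, not a lakeFS convention.
    """
    env = os.environ if env is None else env
    names = ("LAKEFS_ENDPOINT", "LAKEFS_ACCESS_KEY", "LAKEFS_SECRET_KEY")
    # Fail early with a clear message instead of passing empty values to the client.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in names)
```

With the variables set, `lakefsEndPoint, lakefsAccessKey, lakefsSecretKey = load_lakefs_credentials()` could stand in for the three assignments above.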

In [0]:
# Configure the repository name, storage namespace, source branch, and data path

repositoryName = "learn-lakefs-python-repo"
storageNamespace = "https://storage-account-name.blob.core.windows.net/storage-container-name/"  + repositoryName # Unique per repository
sourceBranch = "main"
dataPath = "product-reviews"

In [0]:
import lakefs
from lakefs.client import Client

clt = Client(
    host=lakefsEndPoint,
    username=lakefsAccessKey,
    password=lakefsSecretKey,
)

print("Created lakeFS client.")

In [0]:
repo = lakefs.Repository(repositoryName, client=clt).create(storage_namespace=storageNamespace, exist_ok=True)
print(repo)

In [0]:
# Read data from the Databricks datasets and insert it into the newly created repository (creates the initial data)

import_data_path = "/databricks-datasets/amazon/test4K/"
df = spark.read.parquet(import_data_path)
df.write.format("parquet").save("lakefs://{}/{}/{}".format(repositoryName,sourceBranch,dataPath))
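
The paths used throughout this notebook follow the pattern `lakefs://<repository>/<branch>/<path>`, as the `format` call above shows. A minimal sketch of a helper that builds such a URI is given below; `lakefs_uri` is a hypothetical name introduced here for illustration, not a function of the lakeFS SDK.

```python
def lakefs_uri(repository, ref, path=""):
    """Build a lakefs://<repository>/<ref>/<path> URI from its three parts."""
    if not repository or not ref:
        raise ValueError("repository and ref must be non-empty")
    # Strip a leading slash so the path never produces a double slash.
    return "lakefs://{}/{}/{}".format(repository, ref, path.lstrip("/"))


# Same repository and path, addressed on two different branches.
main_uri = lakefs_uri("learn-lakefs-python-repo", "main", "product-reviews")
experiment_uri = lakefs_uri("learn-lakefs-python-repo", "experiment", "product-reviews")
```

Only the middle segment changes between branches, which is why the later cells can switch from production to the experiment by editing one part of the path.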


In [0]:
repo.branch(sourceBranch).commit(message='Added initial data')

### Create a production-identical, isolated environment

In [0]:
# Review production data from your production "main" branch
# Note: this example uses static strings instead of parameters for readability

df = spark.read.parquet("lakefs://learn-lakefs-python-repo/main/product-reviews/")
df.show()

In [0]:
repo.branch("experiment").create(source_reference="main")

In [0]:
# Read data from the experiment branch
df = spark.read.parquet("lakefs://learn-lakefs-python-repo/experiment/product-reviews/")
df.show()

### Run ETL Data Pipelines in isolation
#### Delete 1- and 5-star reviews & repartition by rating

In [0]:
# Remove the extreme ratings: 1-star (overly unhappy) and 5-star (overly happy)

df_no_1star = df.where(df.rating != '1')
df_no_5star = df_no_1star.where(df_no_1star.rating != '5')

df = df_no_5star
df.show()
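
To see what the two-step filter keeps without a Spark session, here is a plain-Python sketch over a handful of made-up rows. The rows and the helper `drop_extreme_ratings` are illustrative assumptions. The notebook does not show the type of the `rating` column, so the sketch converts each value to `float` before comparing.

```python
EXCLUDED_RATINGS = {1.0, 5.0}


def drop_extreme_ratings(rows, excluded=EXCLUDED_RATINGS):
    """Keep only rows whose rating is not in the excluded set."""
    # Convert to float so "1", 1, and 1.0 are all treated as the same rating.
    return [row for row in rows if float(row["rating"]) not in excluded]


# Made-up sample rows, not taken from the Databricks dataset.
sample_rows = [
    {"rating": "1", "review": "made-up row"},
    {"rating": "2", "review": "made-up row"},
    {"rating": "3.0", "review": "made-up row"},
    {"rating": 4, "review": "made-up row"},
    {"rating": 5.0, "review": "made-up row"},
]

kept = drop_extreme_ratings(sample_rows)
kept_ratings = [row["rating"] for row in kept]
```

Only the rows rated 2, 3, and 4 survive, which mirrors the intent of chaining the two `where` calls.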

In [0]:
# Repartition by rating

df.write.partitionBy("rating").format("parquet").save("lakefs://learn-lakefs-python-repo/experiment/product-reviews_by_rating")

In [0]:
repo.branch("experiment").commit(
    message='Removed 1- and 5-star reviews and repartitioned by rating',
    metadata={'using': 'python_sdk', 
              '::lakefs::CodeVersion::url[url:ui]': 'https://dbc-8ada78b6-3a6d.cloud.databricks.com/?o=8376305627582670#notebook/3047610880313535/command/285921331468809'})

In [0]:
# What are the differences between the two branches?
main = repo.branch("main")
for diff in main.diff(other_ref="experiment"):
    print(diff)
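
Printing every entry gets noisy once a branch touches many files. A small sketch of a summary that counts entries by change type follows. It assumes each entry yielded by `diff` exposes a `type` attribute (for example `added` or `removed`) and a `path` attribute; `FakeChange` is a stand-in defined here so the example runs without a lakeFS server, and its paths are made up.

```python
from collections import Counter, namedtuple


def summarize_changes(changes):
    """Count diff entries by their `type` attribute."""
    return Counter(change.type for change in changes)


# Stand-in for the objects yielded by main.diff(...); field names are an assumption.
FakeChange = namedtuple("FakeChange", ["type", "path"])

sample_changes = [
    FakeChange("added", "product-reviews_by_rating/file-a.parquet"),
    FakeChange("added", "product-reviews_by_rating/file-b.parquet"),
    FakeChange("removed", "some-other-path/file-c.parquet"),
]

summary = summarize_changes(sample_changes)
for change_type, count in sorted(summary.items()):
    print(change_type, count)
```

Under that assumption, `summarize_changes(main.diff(other_ref="experiment"))` would give the same kind of breakdown for the real branches.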

In [0]:
# Query rating breakdown

df = spark.read.parquet("lakefs://learn-lakefs-python-repo/experiment/product-reviews_by_rating")
df.groupby("rating").count().display()


### Merge Changes into Main
#### Once you are satisfied, merge into main

In [0]:
# Uncomment the two lines below to merge the experiment branch into main
#res = repo.branch("experiment").merge_into(repo.branch("main"))
#print(res)

### Unhappy with the changes? Don't merge into main
#### Delete the experiment branch

In [0]:
repo.branch("experiment").delete()


<img src="https://lakefs.io/wp-content/uploads/2022/06/how-does-lakefs-work-1.png" width=800/>