# Isolated Reproducible Unstructured Datasets for ML

### Prerequisites

###### This notebook requires a connection to a lakeFS server.
###### To spin up lakeFS quickly, use [lakeFS Cloud](https://lakefs.cloud), which provides a lakeFS server on demand with a single click.
###### Alternatively, refer to the [lakeFS Quickstart doc](https://docs.lakefs.io/quickstart/installing.html).

## Setup Task: Download the Images and Annotations datasets used for this demo from [http://vision.stanford.edu/aditya86/ImageNetDogs/](http://vision.stanford.edu/aditya86/ImageNetDogs/) and upload them to a storage container
#### Replace `storage-account-name` and `sample-dog-images-container-name` with your own values

In [0]:
containerURL = 'https://storage-account-name.adls.core.windows.net/sample-dog-images-container-name'

## Setup Task: Download the [changed Images and Annotations datasets](https://github.com/treeverse/lakeFS-samples/tree/main/01_standalone_examples/azure-databricks/data/stanforddogsdataset/changed) and upload them to a different storage container
#### Replace `storage-account-name` and `sample-dog-images-changed-container-name` with your own values

In [0]:
containerURLforChangedData = 'https://storage-account-name.adls.core.windows.net/sample-dog-images-changed-container-name'

## Setup Task: Set your lakeFS credentials

In [0]:
lakefsEndPoint = 'https://YourEndPoint/' # e.g. 'https://username.azure_region_name.lakefscloud.io'
lakefsAccessKey = 'AKIAlakeFSAccessKey'
lakefsSecretKey = 'lakeFSSecretKey'

## Setup Task: Optionally change the lakeFS repo name

In [0]:
repo_name = "images-repo"

## Setup Task: Storage Information
#### Change the Storage Namespace to a location you’ve configured. The storage namespace is a location in the underlying storage where data for this repository will be stored.

In [0]:
import random
storageNamespace = 'https://storage-account-name.blob.core.windows.net/storage-container-name/'+repo_name+'/'+str(random.randint(1,100000000))
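
The setup cells above contain sample values such as `storage-account-name` that must be replaced before the rest of the notebook can run. As a minimal sketch, a pre-flight check could flag any setting that still holds a sample value. The helper `find_placeholders`, the `PLACEHOLDERS` tuple, and the example endpoint are hypothetical and are not part of the notebook or the lakeFS SDK.

```python
# Hypothetical pre-flight check; the placeholder list mirrors the sample values used above.
PLACEHOLDERS = (
    "storage-account-name",
    "storage-container-name",
    "sample-dog-images-container-name",
    "sample-dog-images-changed-container-name",
    "YourEndPoint",
)


def find_placeholders(settings):
    """Return the names of settings whose values still contain a sample placeholder."""
    return sorted(
        name
        for name, value in settings.items()
        if any(placeholder in value for placeholder in PLACEHOLDERS)
    )


# Illustrative values only: the first is untouched, the second has been filled in.
example_settings = {
    "containerURL": "https://storage-account-name.adls.core.windows.net/sample-dog-images-container-name",
    "lakefsEndPoint": "https://example.lakefscloud.io",
}
unset = find_placeholders(example_settings)
print(unset)
```

In the notebook itself, the same function could be given the real variables, for example `{"containerURL": containerURL, "storageNamespace": storageNamespace}`, and the run stopped if the returned list is not empty.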

## Define variables

In [0]:
mainBranch = "main"
emptyBranch = "empty"
AnnotationsFolderName = "Annotations"
ImagesFolderName = "Images"

AfghanHoundSourcePath = "n02088094-Afghan_hound"
AfghanHoundFileName = "n02088094-Afghan_hound/n02088094_115.jpg"
WalkerHoundSourcePath = "n02089867-Walker_hound"
WalkerHoundFileName = "n02089867-Walker_hound/n02089867_24.jpg"

## Setup Task: Run additional [Setup](./?o=8911673420610391#notebook/634747576127085) tasks here

In [0]:
%run ./unstructuredDataMLDemoSetup

## Setup Task: Import Images and Annotations datasets to lakeFS repository

In [0]:
commitMessage='Imported all annotations and images'
commitMetadata={'version': '1.0'}

importer = branchMain.import_data(commit_message=commitMessage, metadata=commitMetadata)
importer.prefix(object_store_uri=containerURL, destination="")

import_objects(mainBranch, importer)

# Project Starts

## Project label and version information

In [0]:
classLabel = "_hound"
version = "v1"

## Create empty Project v1 branch

In [0]:
projectBranchV1 = "project"+classLabel+"_"+version
branchProjectV1 = repo.branch(projectBranchV1).create(source_reference=emptyBranch, exist_ok=True)

## Get a list of all Annotations folders

In [0]:
AnnotationsFolders = branchMain.objects(
    prefix=AnnotationsFolderName+'/',
    delimiter='/')

## Import all annotations and images for a particular class label

In [0]:
commitMessage='Imported annotation and images for class label ending with '+classLabel
commitMetadata={'classLabel': classLabel,'version': version}

importer = branchProjectV1.import_data(commit_message=commitMessage, metadata=commitMetadata)

for AnnotationsFolder in AnnotationsFolders:
    # Import only folders whose name ends with classLabel
    if AnnotationsFolder.path.endswith(classLabel+'/'):
        print("Importing annotation and images in folder: " + AnnotationsFolder.path)

        # Annotations folder for this class
        importer.prefix(object_store_uri=containerURL+'/'+AnnotationsFolder.path, destination=AnnotationsFolder.path)
        # Matching Images folder: same path with the Annotations folder name swapped for Images
        importer.prefix(object_store_uri=containerURL+'/'+AnnotationsFolder.path.replace(AnnotationsFolderName, ImagesFolderName),
                        destination=AnnotationsFolder.path.replace(AnnotationsFolderName, ImagesFolderName))

import_objects(projectBranchV1, importer)
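
The folder selection in the loop above is plain string matching, so it can be tried without a lakeFS connection. The sketch below uses a hypothetical helper, `select_class_folders`, and a hand-written list of sample paths in place of the listing returned from the main branch. Unlike the notebook cell, it passes a count of 1 to `str.replace` so that only the first occurrence of the folder name is swapped.

```python
ANNOTATIONS = "Annotations"
IMAGES = "Images"


def select_class_folders(annotation_paths, class_label):
    """Pair each annotation folder that matches the class label with its image folder."""
    pairs = []
    for path in annotation_paths:
        if path.endswith(class_label + "/"):
            # Swap only the first occurrence so later path segments are left untouched.
            pairs.append((path, path.replace(ANNOTATIONS, IMAGES, 1)))
    return pairs


# Hand-written sample listing; the real one comes from the main branch.
sample_paths = [
    "Annotations/n02085620-Chihuahua/",
    "Annotations/n02088094-Afghan_hound/",
    "Annotations/n02089867-Walker_hound/",
]
for annotations_path, images_path in select_class_folders(sample_paths, "_hound"):
    print(annotations_path, "->", images_path)
```

With the class label `_hound`, the Chihuahua folder is skipped and the two hound folders are each paired with their `Images/` counterpart.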

## Some of the images changed

## Changed images

<img src="https://raw.githubusercontent.com/treeverse/lakeFS-samples/main/01_standalone_examples/azure-databricks/data/stanforddogsdataset/changed/Images/n02088094-Afghan_hound/n02088094_26.jpg" width=150/>
<img src="https://raw.githubusercontent.com/treeverse/lakeFS-samples/main/01_standalone_examples/azure-databricks/data/stanforddogsdataset/changed/Images/n02088094-Afghan_hound/n02088094_60.jpg" width=330/>
<img src="https://raw.githubusercontent.com/treeverse/lakeFS-samples/main/01_standalone_examples/azure-databricks/data/stanforddogsdataset/changed/Images/n02088094-Afghan_hound/n02088094_93.jpg" width=310/>
<img src="https://raw.githubusercontent.com/treeverse/lakeFS-samples/main/01_standalone_examples/azure-databricks/data/stanforddogsdataset/changed/Images/n02088094-Afghan_hound/n02088094_115.jpg" width=310/>

## Upload changed annotations and images

In [0]:
commitMessage='Uploaded changed annotation and images for class label ending with '+classLabel+' and version '+version
commitMetadata={'classLabel': classLabel, 'version': version}

importer = branchProjectV1.import_data(commit_message=commitMessage, metadata=commitMetadata)
importer.prefix(object_store_uri=containerURLforChangedData+'/'+AnnotationsFolderName+'/'+AfghanHoundSourcePath,
                destination=AnnotationsFolderName)
importer.prefix(object_store_uri=containerURLforChangedData+'/'+ImagesFolderName+'/'+AfghanHoundSourcePath,
                destination=ImagesFolderName)

import_objects(projectBranchV1, importer)

## Get stats for image on main branch

In [0]:
objects = branchMain.objects(
    prefix=ImagesFolderName+'/'+AfghanHoundFileName)

# Use obj rather than object to avoid shadowing the Python builtin
for obj in objects:
    print(obj)

## Get stats for image on project branch

In [0]:
objects = branchProjectV1.objects(
    prefix=ImagesFolderName+'/'+AfghanHoundFileName)

# Use obj rather than object to avoid shadowing the Python builtin
for obj in objects:
    print(obj)

## Add a v1 tag for future use. You can also run your model against this tag.

In [0]:
import datetime
tagV1 = datetime.datetime.now().strftime("%Y_%m_%d")+f"_{projectBranchV1}"

lakefs.Tag(repository_id=repo_name, tag_id=tagV1, client=clt).create(projectBranchV1, exist_ok=True)
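
The tag ID combines the current date with the branch name. A minimal sketch of that naming scheme, using a hypothetical helper `make_tag_name` and an arbitrary fixed date so the result is reproducible:

```python
import datetime


def make_tag_name(branch_name, when):
    """Build a tag ID of the form YYYY_MM_DD_<branch name>."""
    return when.strftime("%Y_%m_%d") + f"_{branch_name}"


# The date below is an arbitrary example; the notebook uses the current date.
print(make_tag_name("project_hound_v1", datetime.date(2024, 1, 15)))
```

Because the date is part of the ID, rerunning the tagging cell on a later day produces a different tag name instead of reusing the earlier one.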

## Create Project v2 branch sourced from v1 branch

In [0]:
version = "v2"

In [0]:
projectBranchV2 = "project"+classLabel+"_"+version
branchProjectV2 = repo.branch(projectBranchV2).create(source_reference=projectBranchV1, exist_ok=True)

## Some of the images changed

## Changed images

<img src="https://raw.githubusercontent.com/treeverse/lakeFS-samples/main/01_standalone_examples/azure-databricks/data/stanforddogsdataset/changed/Images/n02089867-Walker_hound/n02089867_24.jpg" width=150/>
<img src="https://raw.githubusercontent.com/treeverse/lakeFS-samples/main/01_standalone_examples/azure-databricks/data/stanforddogsdataset/changed/Images/n02089867-Walker_hound/n02089867_31.jpg" width=295/>
<img src="https://raw.githubusercontent.com/treeverse/lakeFS-samples/main/01_standalone_examples/azure-databricks/data/stanforddogsdataset/changed/Images/n02089867-Walker_hound/n02089867_42.jpg" width=295/>
<img src="https://raw.githubusercontent.com/treeverse/lakeFS-samples/main/01_standalone_examples/azure-databricks/data/stanforddogsdataset/changed/Images/n02089867-Walker_hound/n02089867_55.jpg" width=295/>
<img src="https://raw.githubusercontent.com/treeverse/lakeFS-samples/main/01_standalone_examples/azure-databricks/data/stanforddogsdataset/changed/Images/n02089867-Walker_hound/n02089867_90.jpg" width=295/>

## Upload changed annotations and images

In [0]:
commitMessage='Uploaded changed annotation and images for class label ending with '+classLabel+' and version '+version
commitMetadata={'classLabel': classLabel, 'version': version}

importer = branchProjectV2.import_data(commit_message=commitMessage, metadata=commitMetadata)
importer.prefix(object_store_uri=containerURLforChangedData+'/'+AnnotationsFolderName+'/'+WalkerHoundSourcePath,
                destination=AnnotationsFolderName)
importer.prefix(object_store_uri=containerURLforChangedData+'/'+ImagesFolderName+'/'+WalkerHoundSourcePath,
                destination=ImagesFolderName)

import_objects(projectBranchV2, importer)

## Review commit log

In [0]:
from tabulate import tabulate

# Collect one single-column row per commit on the v2 branch
results = [
    [commit.message]
    for commit in lakefs.Reference(repository_id=repo_name, reference_id=projectBranchV2, client=clt).log()]

print(tabulate(
    results,
    headers=['Message']))
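
The log above shows only commit messages, but each commit in this notebook is also created with `classLabel` and `version` metadata. The sketch below shows one way the rows could be widened to include that information. It runs on stand-in records: the `Commit` namedtuple, its field names, and the sample IDs are assumptions made for illustration and should be checked against the commit objects your SDK version returns.

```python
from collections import namedtuple

# Stand-in for the commit records returned by the log; field names are assumptions.
Commit = namedtuple("Commit", ["id", "message", "metadata"])


def log_rows(commits):
    """Turn commits into [short id, version, message] rows for tabular display."""
    return [
        [commit.id[:8], commit.metadata.get("version", ""), commit.message]
        for commit in commits
    ]


# Made-up sample records, not real commit IDs.
sample_log = [
    Commit("b2c3d4e5f6a7b8c9", "Uploaded changed annotation and images",
           {"classLabel": "_hound", "version": "v2"}),
    Commit("a1b2c3d4e5f6a7b8", "Imported annotation and images",
           {"classLabel": "_hound", "version": "v1"}),
]
for row in log_rows(sample_log):
    print(" | ".join(row))
```

The resulting rows can be passed to `tabulate` with a matching three-item `headers` list.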

## Add a v2 tag for future use. You can also run your model against this tag.

In [0]:
tagV2 = datetime.datetime.now().strftime("%Y_%m_%d")+f"_{projectBranchV2}"

lakefs.Tag(repository_id=repo_name, tag_id=tagV2, client=clt).create(projectBranchV2, exist_ok=True)

## Get image stats using v1 tag

In [0]:
objects = repo.tag(tagV1).objects(
    prefix=ImagesFolderName+'/'+AfghanHoundFileName)

# Use obj rather than object to avoid shadowing the Python builtin
for obj in objects:
    print(obj)

## Get image stats using v2 tag

In [0]:
objects = repo.tag(tagV2).objects(
    prefix=ImagesFolderName+'/'+WalkerHoundFileName)

# Use obj rather than object to avoid shadowing the Python builtin
for obj in objects:
    print(obj)

## Diff between the v1 and v2 project branches

In [0]:
diff = branchProjectV1.diff(other_ref=projectBranchV2)
print_diff(diff)
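
Conceptually, a diff between two branches reports every path whose content differs between them. The sketch below illustrates the idea on two plain `{path: checksum}` dictionaries. The function `diff_listings`, the placeholder checksums, and the labels `added`, `removed`, and `changed` are chosen for illustration and do not reproduce how lakeFS computes or reports a diff.

```python
def diff_listings(left, right):
    """Compare two {path: checksum} listings and classify each differing path."""
    changes = {}
    for path in sorted(set(left) | set(right)):
        if path not in left:
            changes[path] = "added"
        elif path not in right:
            changes[path] = "removed"
        elif left[path] != right[path]:
            changes[path] = "changed"
    return changes


# Placeholder checksums: only the Walker hound image differs between the two listings.
v1_listing = {
    "Images/n02088094-Afghan_hound/n02088094_115.jpg": "checksum-a",
    "Images/n02089867-Walker_hound/n02089867_24.jpg": "checksum-b",
}
v2_listing = {
    "Images/n02088094-Afghan_hound/n02088094_115.jpg": "checksum-a",
    "Images/n02089867-Walker_hound/n02089867_24.jpg": "checksum-c",
}
print(diff_listings(v1_listing, v2_listing))
```

Paths with identical checksums on both sides are left out of the result, which is why the unchanged Afghan hound image does not appear.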

## If you made a mistake, you can atomically roll back all changes

### Roll back changes in the v2 branch

In [0]:
branchProjectV2.revert(parent_number=1, reference=projectBranchV2)

## Diff between the v1 and v2 project branches after the rollback

In [0]:
diff = branchProjectV1.diff(other_ref=projectBranchV2)
print_diff(diff)

# Project Completes

## More Questions?

###### Join the [lakeFS Slack group](https://lakefs.io/slack)