# Migrate or clone a lakeFS repository

#### Use this notebook to migrate or clone a source repository to a target repository, either within the same lakeFS environment or across different lakeFS environments
#### You can also use it to test repository migration or cloning

## Prerequisites

#### 1. A source and a target lakeFS environment (you can [deploy one independently](https://docs.lakefs.io/deploy/) or use the hosted solution [lakeFS Cloud](https://lakefs.cloud))
#### 2. Object storage for both source and target repositories

## Setup Task: Import required Python packages

In [None]:
%xmode Minimal
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient
import random
import os
import datetime

## Setup Task: Change your lakeFS credentials for the source lakeFS environment

In [None]:
sourceLakefsEndPoint = '<Source lakeFS Endpoint URL>' # e.g. 'https://username.aws_region_name.lakefscloud.io'
sourceLakefsAccessKey = '<Source lakeFS Access Key>'
sourceLakefsSecretKey = '<Source lakeFS Secret Key>'

## Setup Task: Change your lakeFS credentials for the target lakeFS environment

#### If your source and target environments are the same, use the same credentials for the target as you specified for the source above

In [None]:
targetLakefsEndPoint = '<Target lakeFS Endpoint URL>' # e.g. 'https://username.aws_region_name.lakefscloud.io'
targetLakefsAccessKey = '<Target lakeFS Access Key>'
targetLakefsSecretKey = '<Target lakeFS Secret Key>'
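
Keys typed directly into a notebook are easy to leak when the notebook is shared. As a minimal sketch, the values above could instead be read from environment variables. The helper `load_lakefs_credentials` and the variable names (`SOURCE_LAKEFS_ENDPOINT` and so on) are hypothetical conventions for this example, not something lakeFS defines.

```python
import os

def load_lakefs_credentials(prefix, env=None):
    """Read endpoint, access key, and secret key for one environment.

    `prefix` is 'SOURCE' or 'TARGET'. The variable names are an assumed
    convention for this sketch only.
    """
    env = os.environ if env is None else env
    names = {
        'endpoint': prefix + '_LAKEFS_ENDPOINT',
        'access_key': prefix + '_LAKEFS_ACCESS_KEY',
        'secret_key': prefix + '_LAKEFS_SECRET_KEY',
    }
    # Fail early with a clear message instead of sending empty credentials.
    missing = [name for name in names.values() if not env.get(name)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return {key: env[name] for key, name in names.items()}
```

With the variables exported before starting the notebook, `creds = load_lakefs_credentials('SOURCE')` would return a dict whose `endpoint`, `access_key`, and `secret_key` entries can be assigned to `sourceLakefsEndPoint`, `sourceLakefsAccessKey`, and `sourceLakefsSecretKey`.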

## Setup Task: You can change lakeFS repo names

#### Source Repo: This can be an existing repo. If you don't have one, you will create it in this notebook.

#### Target Repo: Don't create the target repo ahead of time. You will create it in this notebook.

In [None]:
source_repo_name = "source-repo"
target_repo_name = "target-repo"

# IMPORTANT: Run the next few cells if you are using AWS for lakeFS

#### Skip this section and go to the Azure section if you are using Azure for lakeFS

## Setup Task: Import AWS CLI package

In [None]:
from awscliv2.api import AWSAPI

## Setup Task: Change source and target bucket names

#### Buckets should already exist
#### If you are testing migration/cloning, you can use the same bucket with different folders for the source and target

In [None]:
sourceBucket = 'sourceBucketName'
targetBucket = 'targetBucketName'

## Setup Task: Change AWS credentials
#### Provide AWS region and access key information

In [None]:
aws_region = '<AWS region name>' # e.g. us-east-1
aws_access_key_id = 'aaaaaaaaaaaaa'
aws_secret_access_key = 'bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb'

## Setup Task: Set storage namespace variables

In [None]:
source_storage_namespace = 's3://' + sourceBucket
print('Storage Namespace for source Repository: ' + source_storage_namespace)

target_storage_namespace = 's3://' + targetBucket
print('Storage Namespace for target Repository: ' + target_storage_namespace)
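
When a single bucket is used for testing, the two namespaces need distinct folders (prefixes) inside it. The sketch below shows one way to build such a namespace. The helper `s3_namespace` and the folder names are hypothetical, and the copy command in Step 3 would need the same prefixes added to its paths.

```python
def s3_namespace(bucket, prefix=''):
    """Return 's3://bucket' or 's3://bucket/prefix' without stray slashes."""
    bucket = bucket.strip('/')
    prefix = prefix.strip('/')
    if not bucket:
        raise ValueError('bucket name must not be empty')
    return 's3://' + bucket + ('/' + prefix if prefix else '')

# Hypothetical layout: one bucket, one folder per repository.
example_source_namespace = s3_namespace('my-test-bucket', 'source')
example_target_namespace = s3_namespace('my-test-bucket', 'target')
```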

## Setup Task: Set AWS credentials for the default profile

In [None]:
aws_api = AWSAPI()

aws_api.set_credentials("default", aws_access_key_id, aws_secret_access_key, "", aws_region)

# END OF AWS SECTION

# IMPORTANT: Run the next few cells if you are using Azure for lakeFS

## Setup Task: Change storage account names and container names for the source & target

#### Storage account names can be the same for the source and target
#### Containers should already exist

In [None]:
source_storage_account_name = 'source_storage_account_name'
target_storage_account_name = 'target_storage_account_name'
sourceContainer = 'sourceContainer'
targetContainer = 'targetContainer'

## Setup Task: Change SAS Tokens

#### You will copy data from the source storage container to the target storage container with [azcopy](https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10?toc=%2Fazure%2Fstorage%2Fblobs%2Ftoc.json&bc=%2Fazure%2Fstorage%2Fblobs%2Fbreadcrumb%2Ftoc.json) (pre-installed in this container), using Shared Access Signature (SAS) tokens to [authorize azcopy](https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10?toc=%2Fazure%2Fstorage%2Fblobs%2Ftoc.json&bc=%2Fazure%2Fstorage%2Fblobs%2Fbreadcrumb%2Ftoc.json#authorize-azcopy).

#### Follow this document to [create SAS tokens for your storage containers](https://learn.microsoft.com/en-us/azure/cognitive-services/translator/document-translation/how-to-guides/create-sas-tokens?tabs=Containers). When creating the SAS tokens, select Read and List permissions for the source container and Write permission for the target container.

In [None]:
source_container_SAS_token = 'source_container_SAS_token'

In [None]:
target_container_SAS_token = 'target_container_SAS_token'

## Setup Task: Set storage namespace variables

In [None]:
source_storage_namespace = 'https://' + source_storage_account_name + '.blob.core.windows.net/' + sourceContainer + '/'
print('Storage Namespace for source Repository: ' + source_storage_namespace)

target_storage_namespace = 'https://' + target_storage_account_name + '.blob.core.windows.net/' + targetContainer + '/'
print('Storage Namespace for target Repository: ' + target_storage_namespace)

# END OF AZURE SECTION

## Setup Task: Create lakeFS Python client for source lakeFS environment

In [None]:
if 'sourceClient' not in locals():
    # lakeFS credentials and endpoint
    configuration = lakefs_client.Configuration()
    configuration.username = sourceLakefsAccessKey
    configuration.password = sourceLakefsSecretKey
    configuration.host = sourceLakefsEndPoint

    sourceClient = LakeFSClient(configuration)
    print("Created source lakeFS client.")

## Setup Task: Create lakeFS Python client for target lakeFS environment

In [None]:
if 'targetClient' not in locals():
    # lakeFS credentials and endpoint
    configuration = lakefs_client.Configuration()
    configuration.username = targetLakefsAccessKey
    configuration.password = targetLakefsSecretKey
    configuration.host = targetLakefsEndPoint

    targetClient = LakeFSClient(configuration)
    print("Created target lakeFS client.")

## IMPORTANT: If you don't have an existing source repo, run the next few cells to create a sample source repo and populate it with sample data. Otherwise, go to the "Step 1 - Commit Changes" section

## Setup Task: You can change values for these variables or leave as is

In [None]:
mainBranch = "main"
testBranch = "test"
fileName = "lakefs_test.csv" # Don't change this sample data file name (included in this container)

## Setup Task: Create source repo

In [None]:
sourceClient.repositories.create_repository(
    repository_creation=models.RepositoryCreation(
        name=source_repo_name,
        storage_namespace=source_storage_namespace,
        default_branch=mainBranch))

## Setup Task: Upload sample data file to the main branch

In [None]:
# Only a single file per upload, which must be named "content"
with open(os.path.join(os.path.expanduser('~'), fileName), 'rb') as contentToUpload:
    sourceClient.objects.upload_object(
        repository=source_repo_name,
        branch=mainBranch,
        path=fileName,
        content=contentToUpload)

## Setup Task: Create a random number of branches, upload the sample data file to some of them, and commit the data in some of those

In [None]:
no_of_branches = random.randint(5, 10)

for i in range(no_of_branches):
    branch_name = testBranch+datetime.datetime.now().strftime("_%Y_%m_%d_%H_%M_%S_%f")
    sourceClient.branches.create_branch(
        repository=source_repo_name,
        branch_creation=models.BranchCreation(
            name=branch_name,
            source=mainBranch))
    print('Created branch: ' + branch_name)
    
    upload_object = bool(random.getrandbits(1))
    if upload_object:
        # Only a single file per upload, which must be named "content"
        with open(os.path.join(os.path.expanduser('~'), fileName), 'rb') as contentToUpload:
            sourceClient.objects.upload_object(
                repository=source_repo_name,
                branch=branch_name,
                path=fileName,
                content=contentToUpload)
        print('    Data file uploaded to the branch')
        
        commit_changes = bool(random.getrandbits(1))
        if commit_changes:
            sourceClient.commits.commit(
                repository=source_repo_name,
                branch=branch_name,
                commit_creation=models.CommitCreation(
                    message='Added a file!'))
            print('    Changes committed for the branch')
            
print('Number of branches Created: ' + str(no_of_branches))

# END OF THE SOURCE REPO CREATION SECTION AND SETUP TASKS

# Step 1 - Commit Changes

## IMPORTANT: Uncommitted data is not migrated, so check for uncommitted data first (this might take time if you have many branches in the source repo)

In [None]:
has_more = True
after = ""

while has_more:
    list_branches = sourceClient.branches.list_branches(
        repository=source_repo_name,
        after=after)

    for branch in list_branches.results:
        get_diff = sourceClient.branches.diff_branch(
            repository=source_repo_name,
            branch=branch.id,
            amount=1)
        if get_diff.results:
            print('Branch with uncommitted data: ' + branch.id)

    # pagination
    has_more = list_branches.pagination.has_more
    after = list_branches.pagination.next_offset
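
The `has_more` / `next_offset` loop above recurs wherever a lakeFS listing is paged. As a sketch, it could be factored into a small generator. The name `iter_results` is hypothetical, and the stand-in pages below only imitate the shape of a `list_branches` response (`results` plus `pagination`) so that the example runs without a lakeFS server.

```python
from types import SimpleNamespace

def iter_results(fetch_page):
    """Yield every item from a paginated listing.

    `fetch_page(after)` must return an object with `.results` and
    `.pagination.has_more` / `.pagination.next_offset`.
    """
    after = ''
    while True:
        page = fetch_page(after)
        yield from page.results
        if not page.pagination.has_more:
            break
        after = page.pagination.next_offset

# Stand-in pages for illustration only; real pages come from list_branches.
_pages = {
    '': SimpleNamespace(
        results=['main', 'test_1'],
        pagination=SimpleNamespace(has_more=True, next_offset='test_1')),
    'test_1': SimpleNamespace(
        results=['test_2'],
        pagination=SimpleNamespace(has_more=False, next_offset='')),
}
branch_ids = list(iter_results(lambda after: _pages[after]))
```

Against a real environment, the callback would wrap the call already used above, for example `lambda after: sourceClient.branches.list_branches(repository=source_repo_name, after=after)`.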

## OPTIONAL: Commit any uncommitted data in your source repo (this might take time if you have many branches in the source repo)
#### If you want, you can manually commit the changes for the branches listed above from the lakeFS UI

#### After this finishes, you can run the previous cell again to verify that no uncommitted data remains

In [None]:
has_more = True
after = ""

while has_more:
    list_branches = sourceClient.branches.list_branches(
        repository=source_repo_name,
        after=after)

    for branch in list_branches.results:
        get_diff = sourceClient.branches.diff_branch(
            repository=source_repo_name,
            branch=branch.id,
            amount=1)
        if get_diff.results:
            sourceClient.commits.commit(
                repository=source_repo_name,
                branch=branch.id,
                commit_creation=models.CommitCreation(
                    message='Added a file!'))
            # Report only after the commit call has returned without error
            print('Committed changes for Branch: ' + branch.id)
    # pagination
    has_more = list_branches.pagination.has_more
    after = list_branches.pagination.next_offset

# Step 2 - Backup Metadata of Source Repository

In [None]:
dump_refs = sourceClient.refs.dump_refs(
    repository=source_repo_name)

print(dump_refs)
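
The `dump_refs` variable lives only in the notebook kernel, so a restart between Step 2 and Step 5 would lose it. One hedged option is to write the manifest to a local JSON file. This sketch assumes the dump can be converted to a plain dict (generated client models usually offer `to_dict()`, but check your client version). The helper names, the file name, and the example values are hypothetical.

```python
import json
import os
import tempfile

def save_refs_manifest(manifest, path):
    """Write the refs manifest (a plain dict) to a JSON file."""
    with open(path, 'w') as f:
        json.dump(manifest, f, indent=2)

def load_refs_manifest(path):
    """Read a refs manifest back from a JSON file."""
    with open(path) as f:
        return json.load(f)

# Illustrative values only; a real manifest comes from dump_refs.
example_manifest = {
    'commits_meta_range_id': 'example-commits-id',
    'tags_meta_range_id': 'example-tags-id',
    'branches_meta_range_id': 'example-branches-id',
}
manifest_path = os.path.join(tempfile.gettempdir(), 'refs_manifest_example.json')
save_refs_manifest(example_manifest, manifest_path)
restored_manifest = load_refs_manifest(manifest_path)
```

In the notebook, the dict passed to `save_refs_manifest` would come from the dump itself, and the loaded dict would need to be turned back into the model type that `restore_refs` expects. Verify both conversions against the client version you have installed.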

# Step 3 - Copy Data from Source to Target

#### IMPORTANT: Run if you are using AWS for lakeFS

In [None]:
s3SyncCommand = 'aws s3 sync s3://' + sourceBucket + ' s3://' + targetBucket
! $s3SyncCommand
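
Before copying a large bucket, it can help to preview what would be transferred. The sketch below builds the same `aws s3 sync` command as an argument list and optionally adds the AWS CLI's `--dryrun` flag. The helper name `build_s3_sync_command` is hypothetical, and the bucket names are the placeholders used earlier in this notebook.

```python
import shlex

def build_s3_sync_command(source_bucket, target_bucket, dry_run=False):
    """Return the aws s3 sync command as a list of arguments."""
    command = ['aws', 's3', 'sync',
               's3://' + source_bucket, 's3://' + target_bucket]
    if dry_run:
        # --dryrun lists the operations without performing them
        command.append('--dryrun')
    return command

preview_command = build_s3_sync_command(
    'sourceBucketName', 'targetBucketName', dry_run=True)
print(shlex.join(preview_command))
```

The list form can be passed to `subprocess.run(preview_command, check=True)`, which avoids shell quoting problems and raises an error if the command fails.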

#### IMPORTANT: Run if you are using Azure for lakeFS

In [None]:
azureCopyCommand = "azcopy copy 'https://" + source_storage_account_name + ".blob.core.windows.net/" + sourceContainer + "?" + source_container_SAS_token + \
    "' 'https://" + target_storage_account_name + ".blob.core.windows.net/" + targetContainer + "?" + target_container_SAS_token + "' --recursive"

! $azureCopyCommand
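
The command string above embeds both SAS tokens, so printing or logging it would expose them. As a sketch, the command can be built as an argument list together with a masked copy that is safe to display. All helper names and the example account, container, and token values are hypothetical.

```python
def container_url(account, container, sas_token):
    """Build a container URL with the SAS token as its query string."""
    # Tolerate a token that was pasted with its leading '?'.
    return ('https://' + account + '.blob.core.windows.net/' + container
            + '?' + sas_token.lstrip('?'))

def build_azcopy_command(source_url, target_url):
    """Return the azcopy copy command as a list of arguments."""
    return ['azcopy', 'copy', source_url, target_url, '--recursive']

def mask_sas(command):
    """Return a copy of the command with every query string hidden."""
    return [part.split('?', 1)[0] + '?<hidden>' if '?' in part else part
            for part in command]

# Placeholder values for illustration only.
example_command = build_azcopy_command(
    container_url('srcaccount', 'sourcecontainer', '?sv=example'),
    container_url('dstaccount', 'targetcontainer', 'sv=example'))
masked_command = mask_sas(example_command)
print(' '.join(masked_command))
```

The unmasked list can be handed to `subprocess.run(example_command, check=True)`, while only `masked_command` is printed.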

# Step 4 - Create Target Bare Repository

In [None]:
targetClient.repositories.create_repository(
    repository_creation=models.RepositoryCreation(
        name=target_repo_name,
        storage_namespace=target_storage_namespace,
        default_branch=mainBranch),
    bare=True)

# Step 5 - Restore Metadata to Target Repository

In [None]:
targetClient.refs.restore_refs(
    repository=target_repo_name,
    refs_dump=dump_refs)

print('Repository migrated')
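
A quick sanity check after the restore is to compare branch heads between the two repositories. The sketch below works on plain dicts that map branch name to head commit ID. It assumes that each listed branch exposes both values and that a restored repository keeps the source commit IDs; confirm both against your lakeFS version. The helper name and the example data are hypothetical.

```python
def compare_branch_heads(source_heads, target_heads):
    """Return (missing, mismatched) branch names.

    `missing` are source branches absent from the target; `mismatched`
    exist in both but point at different commits.
    """
    missing = sorted(set(source_heads) - set(target_heads))
    mismatched = sorted(
        name for name, commit in source_heads.items()
        if name in target_heads and target_heads[name] != commit)
    return missing, mismatched

# Made-up data for illustration only.
example_source = {'main': 'c1', 'test_a': 'c2', 'test_b': 'c3'}
example_target = {'main': 'c1', 'test_a': 'c9'}
missing, mismatched = compare_branch_heads(example_source, example_target)
print('Missing branches:', missing)
print('Mismatched branches:', mismatched)
```

Two empty lists would indicate that every source branch exists in the target and points at the same commit.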

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack