# Back up, migrate, or clone a lakeFS repository on Azure

#### Use this notebook to back up and restore, migrate, or clone a source repository to a target repository, either within the same lakeFS environment or across different lakeFS environments

## Prerequisites

#### 1. A source and a target lakeFS environment (you can [deploy one independently](https://docs.lakefs.io/deploy/) or use the hosted solution [lakeFS Cloud](https://lakefs.cloud))
#### 2. A source repository
#### 3. A storage container for the target repository. Don't create the target repository ahead of time (you will create it in this notebook)

## Setup Task: Import required Python packages

In [None]:
%xmode Minimal
import lakefs
from lakefs.client import Client
import lakefs_sdk
from lakefs_sdk.client import LakeFSClient
import random
import os
import datetime
import json

## Setup Task: Change your lakeFS credentials for the source lakeFS environment

In [None]:
sourceLakefsEndPoint = '<Source lakeFS Endpoint URL>' # e.g. 'https://username.azure_region_name.lakefscloud.io'
sourceLakefsAccessKey = '<Source lakeFS Access Key>'
sourceLakefsSecretKey = '<Source lakeFS Secret Key>'

## Setup Task: Change your lakeFS credentials for the target lakeFS environment

#### If your source and target environments are the same, use the same credentials for the target as you specified for the source above

In [None]:
targetLakefsEndPoint = '<Target lakeFS Endpoint URL>' # e.g. 'https://username.azure_region_name.lakefscloud.io'
targetLakefsAccessKey = '<Target lakeFS Access Key>'
targetLakefsSecretKey = '<Target lakeFS Secret Key>'

## Setup Task: Change lakeFS repo names

In [None]:
source_repo_name = "source-repo"
target_repo_name = "target-repo"

## Setup Task: Change main/production branch name for the source repo

In [None]:
source_main_branch = "main"

## Setup Task: Change storage account names and container names for the source & target

#### The storage account name can be the same for the source and the target

In [None]:
source_storage_namespace = 'https://source-storage-account-name.blob.core.windows.net/sourceContainer'
target_storage_namespace = 'https://target-storage-account-name.blob.core.windows.net/targetContainer'
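A malformed namespace URL only surfaces later, when the copy or the repository creation fails. The following is a minimal sketch of a sanity check, assuming the namespaces follow the `https://<account>.blob.core.windows.net/<container>` pattern shown above. The helper name `parse_blob_namespace` is hypothetical and is not part of lakeFS or the Azure SDK.

```python
from urllib.parse import urlparse

BLOB_HOST_SUFFIX = ".blob.core.windows.net"

def parse_blob_namespace(namespace):
    """Split a Blob storage namespace URL into (account, container, prefix).

    Hypothetical helper for illustration; raises ValueError on unexpected input.
    """
    parsed = urlparse(namespace)
    if parsed.scheme != "https":
        raise ValueError(f"expected an https URL, got {namespace!r}")
    if not parsed.netloc.endswith(BLOB_HOST_SUFFIX):
        raise ValueError(f"expected a {BLOB_HOST_SUFFIX} host, got {parsed.netloc!r}")
    account = parsed.netloc[: -len(BLOB_HOST_SUFFIX)]
    # The first path segment is the container; anything after it is a prefix.
    parts = parsed.path.strip("/").split("/", 1)
    container = parts[0]
    if not account or not container:
        raise ValueError(f"missing storage account or container in {namespace!r}")
    prefix = parts[1] if len(parts) > 1 else ""
    return account, container, prefix
```

Calling `parse_blob_namespace(source_storage_namespace)` and `parse_blob_namespace(target_storage_namespace)` before continuing would catch a typo in either value early.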

## Setup Task: Change SAS Tokens

#### You will copy data from the source storage container to the target storage container using [azcopy](https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10?toc=%2Fazure%2Fstorage%2Fblobs%2Ftoc.json&bc=%2Fazure%2Fstorage%2Fblobs%2Fbreadcrumb%2Ftoc.json), which is pre-installed in this container, and you will use Shared Access Signature (SAS) tokens to [authorize azcopy](https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10?toc=%2Fazure%2Fstorage%2Fblobs%2Ftoc.json&bc=%2Fazure%2Fstorage%2Fblobs%2Fbreadcrumb%2Ftoc.json#authorize-azcopy).

#### Use this document to [create SAS tokens for your storage containers](https://learn.microsoft.com/en-us/azure/cognitive-services/translator/document-translation/how-to-guides/create-sas-tokens?tabs=Containers). When creating the SAS tokens, select "Read" and "List" permissions for the source container and "Write" permission for the target container. Step 5 also downloads refs_manifest.json from the target container with the target token, so that token needs "Read" permission as well.

In [None]:
source_container_SAS_token = 'source_container_SAS_token'

In [None]:
target_container_SAS_token = 'target_container_SAS_token'
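A SAS token is a URL query string, and its `sp` field lists the granted permissions as letters (`r` for read, `l` for list, `w` for write). The following is a minimal sketch that checks a token for the permissions named above before any copy is attempted. The helper names are hypothetical, and the check only inspects the token text; it does not verify the token against Azure.

```python
from urllib.parse import parse_qs

def sas_permissions(sas_token):
    """Return the set of permission letters found in the token's `sp` field."""
    fields = parse_qs(sas_token.lstrip("?"))
    return set(fields.get("sp", [""])[0])

def missing_permissions(sas_token, required):
    """Return the required permission letters that the token does not grant."""
    return set(required) - sas_permissions(sas_token)
```

For example, `missing_permissions(source_container_SAS_token, "rl")` and `missing_permissions(target_container_SAS_token, "w")` should both return an empty set if the tokens were created as described.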

## Setup Task: Create lakeFS Python client for source lakeFS environment

In [None]:
if 'source_lakefs_client' not in locals():
    # High-level client, used for repository and branch operations
    source_lakefs_client = Client(
        host=sourceLakefsEndPoint,
        username=sourceLakefsAccessKey,
        password=sourceLakefsSecretKey,
    )

    # Low-level SDK client, used later for the internal dump_refs API
    configuration = lakefs_sdk.Configuration(
        host=sourceLakefsEndPoint,
        username=sourceLakefsAccessKey,
        password=sourceLakefsSecretKey,
    )
    source_lakefs_sdk_client = LakeFSClient(configuration)

    print("Verifying lakeFS credentials…")
    try:
        v = source_lakefs_client.version
        sourceRepo = lakefs.Repository(source_repo_name, client=source_lakefs_client)
    except Exception:
        print("🛑 failed to get lakeFS version")
    else:
        print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v}")

## Setup Task: Create lakeFS Python client for target lakeFS environment

In [None]:
if 'target_lakefs_client' not in locals():
    # High-level client, used to create the target repository
    target_lakefs_client = Client(
        host=targetLakefsEndPoint,
        username=targetLakefsAccessKey,
        password=targetLakefsSecretKey,
    )

    # Low-level SDK client, used later for the internal restore_refs API
    configuration = lakefs_sdk.Configuration(
        host=targetLakefsEndPoint,
        username=targetLakefsAccessKey,
        password=targetLakefsSecretKey,
    )
    target_lakefs_sdk_client = LakeFSClient(configuration)

    print("Verifying lakeFS credentials…")
    try:
        v = target_lakefs_client.version
    except Exception:
        print("🛑 failed to get lakeFS version")
    else:
        print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v}")

# Step 1 - Commit Changes

## IMPORTANT: Uncommitted data is not migrated, so check for uncommitted data first (this might take time if the source repo has many branches)

In [None]:
for branch in sourceRepo.branches():
    # One uncommitted change is enough to flag the branch, so stop at the first
    for _ in sourceRepo.branch(branch.id).uncommitted():
        print('Branch with uncommitted data: ' + branch.id)
        break
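If you would rather collect the affected branches into a list than print them, the same check can be wrapped in a function. This is a minimal sketch; `branches_with_uncommitted` is a hypothetical helper that relies only on the `branches()`, `branch(id)`, and `uncommitted()` calls already used above.

```python
def branches_with_uncommitted(repo):
    """Return the ids of all branches that have at least one uncommitted change."""
    dirty = []
    for branch in repo.branches():
        # Pull only the first change; the remaining ones are not needed for the check.
        first_change = next(iter(repo.branch(branch.id).uncommitted()), None)
        if first_change is not None:
            dirty.append(branch.id)
    return dirty

# Usage in this notebook (sourceRepo is created in the setup cells):
# print(branches_with_uncommitted(sourceRepo))
```

An empty list means every branch is fully committed and the repository is ready for the metadata dump.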

## OPTIONAL: Commit any uncommitted data in your source repo (this might take time if the source repo has many branches)
#### Alternatively, you can commit the changes for the branches listed above manually from the lakeFS UI

#### Afterwards, you can run the previous cell again to verify that no uncommitted data remains

In [None]:
for branch in sourceRepo.branches():
    # Commit once per branch, and only if the branch has at least one uncommitted change
    for _ in sourceRepo.branch(branch.id).uncommitted():
        ref = sourceRepo.branch(branch.id).commit(message='Committed changes during the migration of the repository')
        print(ref.get_commit())
        break

# Step 2 - Dump Metadata of Source Repository
### IMPORTANT: Shut down the lakeFS services immediately after dumping the metadata so that nobody can change the source repository

In [None]:
source_lakefs_sdk_client.internal_api.dump_refs(source_repo_name)

# Step 3 - Copy Data from Source to Target
### You can restart lakeFS services after copying the data from source to target

In [None]:
# Note: the printed command includes both SAS tokens
azureCopyCommand = (
    f"azcopy copy '{source_storage_namespace}/*?{source_container_SAS_token}' "
    f"'{target_storage_namespace}?{target_container_SAS_token}' --recursive"
)
print(azureCopyCommand)

! $azureCopyCommand
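Printing the copy command writes both SAS tokens into the notebook output. The following is a minimal sketch of an alternative that builds the same `azcopy copy` invocation as an argument list and redacts the `sig` value before printing. The helper names `build_azcopy_args` and `redact_sas` are hypothetical, and running the list through `subprocess.run` instead of the `!` shell escape is an assumption about how you want to execute it.

```python
import re

def build_azcopy_args(source_namespace, source_sas, target_namespace, target_sas):
    """Return the argument list for a recursive container-to-container copy."""
    return [
        "azcopy",
        "copy",
        f"{source_namespace}/*?{source_sas}",
        f"{target_namespace}?{target_sas}",
        "--recursive",
    ]

def redact_sas(text):
    """Replace every `sig=` value so the signature is not shown in output."""
    return re.sub(r"(sig=)[^&\s']+", r"\1REDACTED", text)

# Usage in this notebook:
# import subprocess
# args = build_azcopy_args(source_storage_namespace, source_container_SAS_token,
#                          target_storage_namespace, target_container_SAS_token)
# print(redact_sas(" ".join(args)))
# subprocess.run(args, check=True)
```

Passing a list to `subprocess.run` bypasses the shell, so the `*` and `&` characters in the URLs need no quoting.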

# Step 4 - Create Target Bare Repository

#### IMPORTANT: For the Backup & Restore process, run this step only when you want to restore the repository

In [None]:
# Create the target as a bare repository so that the metadata can be restored into it in Step 5
lakefs.Repository(target_repo_name, client=target_lakefs_client).create(
    storage_namespace=target_storage_namespace,
    default_branch=source_main_branch,
    bare=True,
)

# Step 5 - Restore Metadata to Target Repository

#### IMPORTANT: For the Backup & Restore process, run this step only when you want to restore the repository

### Download the metadata file (refs_manifest.json) created in Step 2

In [None]:
azureDownloadRefsManifestFileCommand = (
    f"azcopy copy '{target_storage_namespace}/_lakefs/refs_manifest.json?{target_container_SAS_token}' ."
)

! $azureDownloadRefsManifestFileCommand

### Read the refs_manifest.json file and restore the metadata to the new repository

In [None]:
with open('./refs_manifest.json') as file:
    refs_manifest_json = json.load(file)
    print(refs_manifest_json)
    
target_lakefs_sdk_client.internal_api.restore_refs(target_repo_name, refs_manifest_json)
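A truncated or wrong manifest file would make the restore fail, or would restore incomplete metadata, so a quick check before calling `restore_refs` can help. The following is a minimal sketch. The three field names are an assumption based on the lakeFS refs dump schema and should be verified against the manifest your lakeFS version actually produces; `missing_manifest_keys` is a hypothetical helper.

```python
# Assumed field names of the refs manifest; verify against your own refs_manifest.json.
EXPECTED_MANIFEST_KEYS = (
    "commits_meta_range_id",
    "tags_meta_range_id",
    "branches_meta_range_id",
)

def missing_manifest_keys(manifest):
    """Return the expected keys that are absent or empty in the manifest dict."""
    return [key for key in EXPECTED_MANIFEST_KEYS if not manifest.get(key)]

# Usage in this notebook, before calling restore_refs:
# missing = missing_manifest_keys(refs_manifest_json)
# if missing:
#     raise ValueError(f"refs_manifest.json is missing: {missing}")
```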

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack