# Migrate or clone a local lakeFS repository to AWS

#### Use this notebook to migrate or clone a local source repository to a target repository on AWS

## Prerequisites

#### 1. A source and a target lakeFS environment (you can [deploy one independently](https://docs.lakefs.io/deploy/) or use the hosted solution [lakeFS Cloud](https://lakefs.cloud))
#### 2. A source repository
#### 3. An S3 bucket for the target repository. Do not create the target repository ahead of time; this notebook creates it in Step 4

## Setup Task: Import required Python packages

In [None]:
%xmode Minimal
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient
import random
import os
import datetime
from awscliv2.api import AWSAPI

## Setup Task: Change your lakeFS credentials for the local source lakeFS environment

In [None]:
sourceLakefsEndPoint = '<Source lakeFS Endpoint URL>' # e.g. 'http://host.docker.internal:8000' if you are running lakeFS in a separate Docker container on your local machine
sourceLakefsAccessKey = '<Source lakeFS Access Key>'
sourceLakefsSecretKey = '<Source lakeFS Secret Key>'

## Setup Task: Change your lakeFS credentials for the target lakeFS environment

In [None]:
targetLakefsEndPoint = '<Target lakeFS Endpoint URL>' # e.g. 'https://username.aws_region_name.lakefscloud.io'
targetLakefsAccessKey = '<Target lakeFS Access Key>'
targetLakefsSecretKey = '<Target lakeFS Secret Key>'
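Keys typed directly into a cell are easy to leak when the notebook is shared. As a minimal sketch, assuming you export the values as environment variables before starting Jupyter, a small helper can read them and fall back to the placeholder. The helper `setting_from_env` and the variable names starting with `DEMO_` are hypothetical and exist only for this illustration.

```python
import os

def setting_from_env(name, fallback):
    """Return the environment variable `name`, or `fallback` if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    return value or fallback

# Hypothetical variable names, set and cleared here only to demonstrate both paths.
os.environ["DEMO_LAKEFS_ENDPOINT"] = "http://localhost:8000"
os.environ.pop("DEMO_LAKEFS_UNSET_VARIABLE", None)

demo_endpoint = setting_from_env("DEMO_LAKEFS_ENDPOINT", "<Source lakeFS Endpoint URL>")
demo_missing = setting_from_env("DEMO_LAKEFS_UNSET_VARIABLE", "<Source lakeFS Access Key>")

print(demo_endpoint)
print(demo_missing)
```

If the fallback placeholder is what comes back, the variable was not exported in the environment that launched the notebook.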

## Setup Task: Change lakeFS repo names

In [None]:
source_repo_name = "source-repo"
target_repo_name = "target-repo"

## Setup Task: Change main/production branch name for the source repo

In [None]:
source_main_branch = "main"

## Setup Task: Change source and target storage namespace

In [None]:
source_storage_namespace = 'local://localSourceFolder' # storage namespace for the local source repo
target_storage_namespace = 's3://myTargetBucket' # change target Bucket Name on S3
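Step 3 later splits the source namespace on `://` to find the local folder name, so a typo here only surfaces much later. The following sketch checks the `scheme://location` shape up front. The helper name `split_namespace` is an assumption made for this example, not part of lakeFS.

```python
def split_namespace(namespace):
    """Split 'scheme://location' into (scheme, location); raise ValueError if malformed."""
    scheme, separator, location = namespace.partition('://')
    if not separator or not scheme or not location:
        raise ValueError('Expected <scheme>://<location>, got: ' + repr(namespace))
    # A trailing slash does not change which folder or bucket is meant.
    return scheme, location.rstrip('/')

print(split_namespace('local://localSourceFolder'))
print(split_namespace('s3://myTargetBucket'))
```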

## Setup Task: Change AWS credentials

In [None]:
aws_region = '<AWS region name>' # e.g. us-east-1
aws_access_key_id = '<AWS Access Key ID>'
aws_secret_access_key = '<AWS Secret Access Key>'

## Setup Task: Set AWS credentials for the default profile

In [None]:
aws_api = AWSAPI()

aws_api.set_credentials("default", aws_access_key_id, aws_secret_access_key, "", aws_region)

## Setup Task: Create lakeFS Python client for source lakeFS environment

In [None]:
if 'sourceClient' not in locals():
    # lakeFS credentials and endpoint
    configuration = lakefs_client.Configuration()
    configuration.username = sourceLakefsAccessKey
    configuration.password = sourceLakefsSecretKey
    configuration.host = sourceLakefsEndPoint

    sourceClient = LakeFSClient(configuration)
    print("Created source lakeFS client.")

## Setup Task: Create lakeFS Python client for target lakeFS environment

In [None]:
if 'targetClient' not in locals():
    # lakeFS credentials and endpoint
    configuration = lakefs_client.Configuration()
    configuration.username = targetLakefsAccessKey
    configuration.password = targetLakefsSecretKey
    configuration.host = targetLakefsEndPoint

    targetClient = LakeFSClient(configuration)
    print("Created target lakeFS client.")

# Step 1 - Commit Changes

## IMPORTANT: Uncommitted data is not migrated, so check for uncommitted data first (this might take time if the source repo has many branches)

In [None]:
has_more = True
after = ""

while has_more:
    list_branches = sourceClient.branches.list_branches(
        repository=source_repo_name,
        after=after)

    for branch in list_branches.results:
        get_diff = sourceClient.branches.diff_branch(
            repository=source_repo_name,
            branch=branch.id,
            amount=1)
        if get_diff.results:
            print('Branch with uncommitted data: ' + branch.id)

    # pagination
    has_more = list_branches.pagination.has_more
    after = list_branches.pagination.next_offset
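The `has_more` / `next_offset` loop above is repeated in the next cell, so it can be factored into a generator. This is a sketch only: `iter_results` is a hypothetical helper, and `fake_list_branches` is a stand-in that mimics the shape of the paginated response (`.results`, `.pagination.has_more`, `.pagination.next_offset`) so the example runs without a lakeFS server.

```python
from types import SimpleNamespace

def iter_results(fetch_page):
    """Yield every item from a paginated listing.

    `fetch_page(after)` must return an object with `.results` and
    `.pagination.has_more` / `.pagination.next_offset`.
    """
    after = ""
    while True:
        page = fetch_page(after)
        yield from page.results
        if not page.pagination.has_more:
            break
        after = page.pagination.next_offset

# Stand-in data: offset -> (branch names, has_more, next_offset).
_FAKE_PAGES = {
    "": (["dev", "main"], True, "main"),
    "main": (["test"], False, ""),
}

def fake_list_branches(after):
    names, has_more, next_offset = _FAKE_PAGES[after]
    return SimpleNamespace(
        results=[SimpleNamespace(id=name) for name in names],
        pagination=SimpleNamespace(has_more=has_more, next_offset=next_offset))

branch_ids = [branch.id for branch in iter_results(fake_list_branches)]
print(branch_ids)
```

With the real client, the callable passed in would wrap the same call used above, for example `lambda after: sourceClient.branches.list_branches(repository=source_repo_name, after=after)`.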

## OPTIONAL: Commit any uncommitted data in your source repo (this might take time if the source repo has many branches)
#### Alternatively, you can commit the changes for the branches listed above manually from the lakeFS UI

#### After this finishes, you can rerun the previous cell to verify that no uncommitted data remains

In [None]:
has_more = True
after = ""

while has_more:
    list_branches = sourceClient.branches.list_branches(
        repository=source_repo_name,
        after=after)

    for branch in list_branches.results:
        get_diff = sourceClient.branches.diff_branch(
            repository=source_repo_name,
            branch=branch.id,
            amount=1)
        if get_diff.results:
            # Commit first, then report, so the message is only printed on success
            sourceClient.commits.commit(
                repository=source_repo_name,
                branch=branch.id,
                commit_creation=models.CommitCreation(
                    message='Committed changes during the migration of the repository'))
            print('Committed changes for branch: ' + branch.id)
    # pagination
    has_more = list_branches.pagination.has_more
    after = list_branches.pagination.next_offset

# Step 2 - Dump Metadata of Source Repository

In [None]:
dump_refs = sourceClient.refs.dump_refs(
    repository=source_repo_name)

print(dump_refs)
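The dump lives only in notebook memory, so a kernel restart between this step and Step 5 would lose it. The sketch below writes the three metarange IDs to a JSON file and reads them back. It works on a plain dictionary; the field names, the idea of converting the client's response to a dictionary first (for example with a `to_dict()` method), the file name, and the helpers `save_refs` / `load_refs` are all assumptions to check against your client version.

```python
import json
import os
import tempfile

# Field names assumed for the refs dump; verify against your lakeFS client.
REFS_FIELDS = ("commits_meta_range_id", "tags_meta_range_id", "branches_meta_range_id")

def save_refs(refs, path):
    """Write the expected refs fields to `path` as JSON; fail if any is missing."""
    missing = [field for field in REFS_FIELDS if not refs.get(field)]
    if missing:
        raise ValueError("Refs dump is missing: " + ", ".join(missing))
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({field: refs[field] for field in REFS_FIELDS}, handle, indent=2)

def load_refs(path):
    """Read a refs dictionary previously written by save_refs."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)

# Placeholder IDs for demonstration only; real values come from the dump above.
example_refs = {
    "commits_meta_range_id": "placeholder-commits-id",
    "tags_meta_range_id": "placeholder-tags-id",
    "branches_meta_range_id": "placeholder-branches-id",
}
refs_path = os.path.join(tempfile.mkdtemp(), "refs_dump.json")
save_refs(example_refs, refs_path)
restored_refs = load_refs(refs_path)
print(restored_refs == example_refs)
```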

# Step 3 - Copy Data from Source to Target

#### You can copy data directly from local storage to the target storage on your own,
#### or you can run the command printed below on your local machine to first copy data from the lakeFS Docker container to your local machine
#### (before running the command, change the lakeFS Docker container name and go to the folder where you cloned the lakeFS-samples Git repo)

In [None]:
lakefs_docker_container_name = 'lakefs'
print('docker cp ' + lakefs_docker_container_name + ':/home/lakefs/lakefs/data/block/' + source_storage_namespace.split('://')[1] +
      '/ lakeFS-samples/14-migrate-or-clone-repo/localDownloadedSourceFolder/')

#### Now copy data from your local machine to target storage on S3

In [None]:
s3SyncCommand = 'aws s3 sync ./localDownloadedSourceFolder/ ' + target_storage_namespace
print(s3SyncCommand)

! $s3SyncCommand
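Before uploading, it can help to preview what the sync would transfer. The sketch below builds the same command as an argument list and optionally appends the AWS CLI's `--dryrun` flag, which lists the operations without performing them. The helper name `build_sync_command` is hypothetical.

```python
import shlex

def build_sync_command(local_dir, target_namespace, dry_run=False):
    """Return the `aws s3 sync` invocation as an argument list."""
    command = ["aws", "s3", "sync", local_dir, target_namespace]
    if dry_run:
        # --dryrun prints the planned copies without uploading anything.
        command.append("--dryrun")
    return command

preview = build_sync_command("./localDownloadedSourceFolder/", "s3://myTargetBucket", dry_run=True)
print(shlex.join(preview))
```

An argument list avoids quoting problems when a path contains spaces, and it could be passed to `subprocess.run(preview, check=True)` instead of the `!` shell escape.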

# Step 4 - Create Target Bare Repository

In [None]:
targetClient.repositories.create_repository(
    repository_creation=models.RepositoryCreation(
        name=target_repo_name,
        storage_namespace=target_storage_namespace,
        default_branch=source_main_branch),
    bare=True)

# Step 5 - Restore Metadata to Target Repository

In [None]:
targetClient.refs.restore_refs(
    repository=target_repo_name,
    refs_dump=dump_refs)

print('Repository migrated')
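The message above is printed without checking the result. One way to verify the migration is to compare each branch's head commit in the source and target repositories. The sketch below does the comparison on plain dictionaries that map branch name to commit ID; how you fill them (for instance from the `id` and `commit_id` of each listed branch) is an assumption, and `diff_branch_heads` is a hypothetical helper.

```python
def diff_branch_heads(source_heads, target_heads):
    """Return {branch: (source_commit, target_commit)} for every branch that differs."""
    mismatches = {}
    for branch in sorted(set(source_heads) | set(target_heads)):
        source_commit = source_heads.get(branch)
        target_commit = target_heads.get(branch)
        if source_commit != target_commit:
            mismatches[branch] = (source_commit, target_commit)
    return mismatches

# Made-up commit IDs: the "dev" branch is missing from the target.
source_heads = {"main": "c1", "dev": "c2"}
target_heads = {"main": "c1"}
print(diff_branch_heads(source_heads, target_heads))
```

An empty dictionary means every branch points at the same commit on both sides; a `None` in a pair marks a branch that exists on only one side.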

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack