# Lab 1 - Classical QMC with AWS Batch

In this lab, we will run a classical QMC workload on [AWS Batch](https://docs.aws.amazon.com/batch/latest/userguide/what-is-batch.html). AWS Batch helps you to run batch computing workloads on the AWS Cloud and makes it easy for developers, scientists, and engineers to access large amounts of compute resources. AWS Batch removes the undifferentiated heavy lifting of configuring and managing the required infrastructure. 

As a fully managed service, AWS Batch helps you to run batch computing workloads of any scale. AWS Batch automatically provisions compute resources and optimizes the workload distribution based on the quantity and scale of the workloads. With AWS Batch, there's no need to install or manage batch computing software, so you can focus your time on analyzing results and solving problems.

In the following, we go through the steps to:
1. Create an AWS Batch compute environment
2. Package the QMC code for our workload in a Docker image and upload it to the AWS cloud
3. Run a small scale QMC example requiring only one compute instance
4. Run a large scale QMC example with 200 child jobs running on multiple compute instances in parallel
5. Compare the results of our small and large scale experiments

### AWS SDK for Python (Boto3)

Throughout the tutorial, we will interact with AWS service APIs using [Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html), the AWS SDK for Python. Boto3 simplifies the use of AWS services by providing a set of libraries that are consistent and familiar for Python developers. The AWS SDK for Python provides object-oriented APIs for each AWS service we will use in this lab. Let's instantiate the [Boto3 clients](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/clients.html#low-level-clients) for the service APIs we need in this lab.

In [None]:
import boto3

cfn_client = boto3.client('cloudformation')
batch_client = boto3.client('batch')
sts_client = boto3.client("sts")
s3_client = boto3.client("s3")

We also use Boto3 to get information about the AWS account and AWS region we use in this lab. An AWS region is one component of the [AWS global infrastructure](https://aws.amazon.com/about-aws/global-infrastructure). It is a separate geographic area designed to be isolated from the other regions. When you use a service, select a region to determine in which geographic location your resources will be deployed. When you view your resources, you see only the resources that are tied to the region that you specified. This is because regions are isolated from each other, and we don't automatically replicate resources across regions.

In [None]:
ACCOUNT_ID = sts_client.get_caller_identity().get("Account")
print("Account ID:", ACCOUNT_ID)

my_session = boto3.session.Session()
WORKING_REGION = my_session.region_name
print("Region:", WORKING_REGION)

### Create the AWS Batch components

Batch provides all of the necessary functionality to run high-scale, compute-intensive workloads on top of AWS managed container orchestration services. The service runs your workloads in Docker containers from images you provide and that are pulled from container registries, which may exist within or outside of your AWS infrastructure.

Let's create the infrastructure we need to run our experiments today. To use AWS Batch, we need the following components:
* A Batch __[compute environment](https://docs.aws.amazon.com/batch/latest/userguide/compute_environments.html)__ defining the compute resources our jobs run on.
* A Batch __[job queue](https://docs.aws.amazon.com/batch/latest/userguide/job_queues.html)__ to which we submit our jobs so they can be scheduled to run in a compute environment.
* A Batch __[job definition](https://docs.aws.amazon.com/batch/latest/userguide/job_definitions.html)__ serving as the blueprint for how jobs are run, including memory and CPU requirements as well as container properties like the container image and environment variables.

For our tutorial, we will also need
* An __[Amazon Simple Storage Service (S3) bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html)__ we use to store job input and output data.
* An __[Amazon Elastic Container Registry (ECR) repository](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html)__ where we upload and store the Docker image defining our job runtime.

![](./images/batch-architecture.png)

We could create all these resources through the web browser in the AWS Management Console, or programmatically using the AWS SDKs or the AWS CLI. Today, we use [AWS CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html) instead. CloudFormation is a service that helps you model and set up your AWS resources so that you can spend less time managing those resources and more time focusing on your applications that run in AWS. We use the template [`batch-environment.yaml`](./batch-environment.yaml), which describes all the AWS resources we need for our QMC simulations with AWS Batch, and CloudFormation takes care of provisioning and configuring those resources.

In [None]:
with open('batch-environment.yaml', 'r') as file:
    template_body = file.read()


stack_name = 'batch-environment'

try:
    print(f"Creating CloudFormation stack {stack_name}")
    cfn_client.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
    )
    print("Waiting for CloudFormation stack to complete...")
    waiter = cfn_client.get_waiter("stack_create_complete")
    waiter.wait(
        StackName=stack_name,
        WaiterConfig={
            'Delay': 10,
            'MaxAttempts': 150
        }
    )
    print("CloudFormation stack completed.")
except cfn_client.exceptions.AlreadyExistsException:
    print("Stack already exists. Updating CloudFormation stack")
    try:
        cfn_client.update_stack(
            StackName=stack_name,
            TemplateBody=template_body,
        )
        print("Waiting for CloudFormation stack to be updated...")
        waiter = cfn_client.get_waiter("stack_update_complete")
        waiter.wait(
            StackName=stack_name,
            WaiterConfig={
                'Delay': 10,
                'MaxAttempts': 150
            }
        )
        print("CloudFormation stack updated.")
    except cfn_client.exceptions.ClientError as e:
        print(e)

print()
for output in cfn_client.describe_stacks(StackName=stack_name).get('Stacks')[0].get('Outputs'):
    print(f"{output.get('OutputKey')}: {output.get('OutputValue')}")

<div class="alert alert-block alert-success">
<b>Activity:</b> Navigate to the <a href="https://us-east-1.console.aws.amazon.com/cloudformation/home">AWS CloudFormation management console</a> to check the status of your stack and see the individual resources being created (see the screenshot below). You may also review the template <code>batch-environment.yaml</code> to learn how we describe the resources that CloudFormation creates for us.
</div>

![](./images/cfn-console.png)

<div class="alert alert-block alert-success">
<b>Activity:</b> Assign the stack outputs to the related variables below.
</div>

In [None]:
data_bucket_name = ""  # FIXME
batch_image_repository_uri = ""  # FIXME
job_queue = ""  # FIXME
job_definition = ""  # FIXME
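If you prefer not to copy the values by hand, the stack outputs can also be collected into a dictionary. The following is a minimal sketch; the helper name `stack_outputs_to_dict` and the output keys in the example are hypothetical, because the actual key names are defined in `batch-environment.yaml`.

```python
def stack_outputs_to_dict(stack_description):
    """Map each CloudFormation OutputKey to its OutputValue."""
    outputs = stack_description.get("Outputs") or []
    return {item["OutputKey"]: item["OutputValue"] for item in outputs}


# Hand-written stack description; these keys and values are made up.
example_stack = {
    "StackName": "batch-environment",
    "Outputs": [
        {"OutputKey": "ExampleBucketName", "OutputValue": "example-bucket"},
        {"OutputKey": "ExampleJobQueue", "OutputValue": "example-queue"},
    ],
}
example_outputs = stack_outputs_to_dict(example_stack)
print(example_outputs["ExampleBucketName"])
```

In the notebook, passing `cfn_client.describe_stacks(StackName=stack_name)["Stacks"][0]` to the helper would return the real outputs, which you could then look up by the keys printed in the previous cell.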

### Algorithm code and runtime

Now that we have set up Batch to run jobs, we have to provide the algorithm code and describe the job runtime. For that purpose, we create a Docker image from which Batch will start a container to execute our workload. For this tutorial, we provide all the files, including the source code for the QMC example, the Dockerfile, and the container entrypoint script, so you don't have to write everything yourself.

<div class="alert alert-block alert-success">
<b>Activity:</b> (Optional) To familiarize yourself with the provided code, you may review the directories <code>batch_container_image</code> and <code>afqmc</code>.
</div>

Now let's build and push our Docker image. Since we push it to a private repository, we must first authenticate our local Docker client to the private ECR registry in our AWS account (learn more [here](https://docs.aws.amazon.com/AmazonECR/latest/userguide/registry_auth.html)). Next, we build the Docker image locally in our development environment. Finally, we push the image to the repository.

<div class="alert alert-block alert-info">
<b>Note:</b> The next cell takes about 3 minutes to complete.
</div>

In [None]:
%%time
import subprocess


def run(command):
    # Run a shell command and stop with an error if it fails
    subprocess.run(command, shell=True, check=True)


registry = f"{ACCOUNT_ID}.dkr.ecr.{WORKING_REGION}.amazonaws.com"

print("Authenticating Docker to your Amazon ECR private registry...")
run(f"aws ecr get-login-password --region {WORKING_REGION} | docker login --username AWS --password-stdin {registry}")

print("Building your Docker image locally...")
run(f"docker build --quiet --platform linux/amd64 -f batch_container_image/Dockerfile -t {batch_image_repository_uri} .")

print("Pushing your Docker image to your ECR repository...")
run(f"docker push --quiet {batch_image_repository_uri}")

print("All done.")
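To illustrate what the authentication step exchanges, the sketch below decodes an ECR authorization entry of the kind returned by the ECR `get_authorization_token` API: a base64-encoded `username:password` pair plus the registry endpoint. The helper name `decode_ecr_authorization` and the fake entry are made up for illustration; the lab itself relies on the AWS CLI command above.

```python
import base64


def decode_ecr_authorization(authorization_data):
    """Split one ECR authorizationData entry into username, password, registry."""
    token = base64.b64decode(authorization_data["authorizationToken"]).decode("utf-8")
    username, password = token.split(":", 1)
    return username, password, authorization_data["proxyEndpoint"]


# Fake entry for illustration only; a real token comes from the ECR API.
fake_entry = {
    "authorizationToken": base64.b64encode(b"AWS:not-a-real-password").decode("ascii"),
    "proxyEndpoint": "https://123456789012.dkr.ecr.us-east-1.amazonaws.com",
}
username, password, registry_endpoint = decode_ecr_authorization(fake_entry)
print(username, registry_endpoint)
```

With real credentials, the entry would come from `boto3.client("ecr").get_authorization_token()["authorizationData"][0]`.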

<div class="alert alert-block alert-success">
<b>Activity:</b> Navigate to the <a href="https://us-east-1.console.aws.amazon.com/ecr/private-registry/repositories">Amazon ECR management console</a> and check that the image has been successfully pushed to the ECR repository. See the screenshot below.
</div>

![](images/ecr-console.png)

### Submit a job to AWS Batch

Now that we have created a compute environment for Batch and uploaded a container image defining our algorithm execution to ECR, it's time to submit our first QMC job.

In [None]:
response = batch_client.submit_job(
    jobName="lab-1-classical-small-scale",
    jobQueue=job_queue,
    jobDefinition=job_definition,
    containerOverrides={
        "environment": [
            {"name": "JOB_ENTRY_POINT", "value": "run_classical_afqmc"},
            {"name": "JOB_DTAU", "value": "0.005"},
            {"name": "JOB_TIME_STEPS", "value": "200"}
        ]
    }
)

jobId = response["jobId"]
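Batch expects environment overrides as a list of name/value pairs in which every value is a string. When you vary job parameters often, a small helper can build that list from a plain dictionary. This is a sketch only; the helper name `to_batch_environment` is hypothetical.

```python
def to_batch_environment(variables):
    """Convert a plain dict into the name/value list used in containerOverrides."""
    return [{"name": str(name), "value": str(value)} for name, value in variables.items()]


environment = to_batch_environment(
    {"JOB_ENTRY_POINT": "run_classical_afqmc", "JOB_DTAU": 0.005, "JOB_TIME_STEPS": 200}
)
print(environment)
```

The resulting list could be passed as `containerOverrides={"environment": environment}` when calling `submit_job`.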

<div class="alert alert-block alert-info">
<b>Note:</b>
<ul>
<li>The environment variable <b>JOB_ENTRY_POINT</b> that we specified during job submission is used to select our algorithm script.</li>
<li>The container entrypoint script writes to a file in the local directory and uploads the results to our S3 bucket.</li>
</ul>
</div>

<div class="alert alert-block alert-success">
<b>Activity:</b> Navigate to the <a href="https://us-east-1.console.aws.amazon.com/batch/home">AWS Batch management console</a> and find the job you submitted above. See the screenshot below.
</div>

![](images/batch-console.png)

Your Batch job passes through several lifecycle states: `SUBMITTED`, `PENDING`, `RUNNABLE`, `STARTING`, `RUNNING`, and finally `SUCCEEDED` (or `FAILED` if the job does not complete successfully).

Learn more about Batch job states __[here](https://docs.aws.amazon.com/batch/latest/userguide/job_states.html)__.

In [None]:
batch_client.describe_jobs(jobs=[jobId]).get("jobs")[0].get("status")
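Rather than re-running the cell above by hand, you could poll the job status until it reaches a terminal state. The following is a minimal sketch; `wait_for_job`, its default intervals, and the `FakeBatchClient` stand-in are hypothetical and only there so the example runs without an AWS account.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def wait_for_job(client, job_id, poll_seconds=15, max_polls=240):
    """Poll describe_jobs until the job reaches SUCCEEDED or FAILED."""
    status = None
    for _ in range(max_polls):
        status = client.describe_jobs(jobs=[job_id])["jobs"][0]["status"]
        print("Job status:", status)
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} still in state {status} after {max_polls} polls")


class FakeBatchClient:
    """Stand-in that replays a fixed sequence of states, for illustration only."""

    def __init__(self, states):
        self._states = iter(states)

    def describe_jobs(self, jobs):
        return {"jobs": [{"jobId": jobs[0], "status": next(self._states)}]}


final_status = wait_for_job(
    FakeBatchClient(["SUBMITTED", "RUNNABLE", "RUNNING", "SUCCEEDED"]),
    "example-job-id",
    poll_seconds=0,
)
```

In the notebook, `wait_for_job(batch_client, jobId)` would block until the submitted job has finished.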

<div class="alert alert-block alert-success">
<b>Activity:</b> Once your job is in the <code>SUCCEEDED</code> state, navigate to the S3 management console and locate the results file.
</div>

### Postprocessing of job results

Next, we can download the job result and postprocess the data, locally.

The total energy of AFQMC at every time step is evaluated by weight-averaging the local energy of every walker sample as
$$
E = \sum_{l=1}^{N} w_l \frac{\langle \Psi_T|H|\phi_l\rangle}{\langle \Psi_T|\phi_l\rangle},
$$
where $|\Psi_T\rangle$ is the trial wavefunction. The local energies and weights are separately saved into the S3 bucket. For our first, small-scale example, there is only one file we have to download.
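As a small worked example of this weighted average, the sketch below uses invented numbers for two walkers and three time steps. It assumes the weights in the formula are meant to be normalized over the walkers at each time step, which is what `np.average` does when it divides by the sum of the weights.

```python
import numpy as np

# Two walkers (rows) and three time steps (columns); all numbers are invented.
local_energies = np.array([
    [-1.0 + 0.1j, -1.1 + 0.0j, -1.2 - 0.1j],
    [-1.2 - 0.1j, -1.1 + 0.0j, -1.0 + 0.1j],
])
weights = np.array([
    [1.0, 1.0, 3.0],
    [1.0, 3.0, 1.0],
])

# np.average divides by the sum of the weights in each column, so the weights
# do not have to be normalized beforehand.
energies = np.real(np.average(local_energies, weights=weights, axis=0))
print(energies)
```

At the last time step, the first walker carries three times the weight of the second, so the average is pulled toward its local energy: $(3 \cdot (-1.2) + 1 \cdot (-1.0)) / 4 = -1.15$.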

In [None]:
import json
import numpy as np
from pathlib import Path

job = batch_client.describe_jobs(jobs=[jobId])["jobs"][0]
job_status = job["status"]
if job_status == "SUCCEEDED":
    Path("results/lab-1").mkdir(parents=True, exist_ok=True)
    s3_client.download_file(
        data_bucket_name,
        f"batch/{jobId}/results.json",
        f"results/lab-1/result.json")
else:
    print(f"Your job is in status {job_status}")

In [None]:
# Local postprocessing of results
with open("results/lab-1/result.json", "r") as file:
    data = json.load(file)

# Each entry holds the values of one walker at every time step
local_energies_real = np.array(data["local_energies_real"])
local_energies_imag = np.array(data["local_energies_imag"])
weights = np.array(data["weights"])

# Recombine the real and imaginary parts into complex local energies
local_energies = local_energies_real + 1.0j * local_energies_imag

# The weighted average over walkers (axis 0) gives one energy per time step
energies = np.real(np.average(local_energies, weights=weights, axis=0))

Let's plot the energies and compare to the reference value.

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline

plt.plot(
    0.005 * np.arange(200),
    energies,
    linestyle="dashed",
    marker=".",
    color="tab:blue",
    label="classical",
)
plt.axhline(-1.137117067345732, linestyle="dashed", color="black")
plt.title(r"Ground state estimation of H$_2$ using AFQMC", fontsize=16)
plt.legend(fontsize=14, loc="upper right")
plt.xlabel(r"$\tau$", fontsize=14)
plt.ylabel("Energy", fontsize=14)
plt.yticks(fontsize=14)
plt.tick_params(direction="in", labelsize=14)
plt.show()

The results might oscillate, but that's expected because we have only used a small number of samples. In the next section, we scale this calculation up using Batch.

## Run classical AFQMC at scale

Now, we're going to scale up the previous example to a larger number of QMC walkers by using the Batch feature called [array jobs](https://docs.aws.amazon.com/batch/latest/userguide/array_jobs.html). Each child job runs only 2 walkers, just like in the previous example; by running 200 child jobs, we increase the total number of walkers in this simulation to 400. At the end of this lab, you will see that the results become much more accurate than above.

### Submit the job

<div class="alert alert-block alert-success">
<b>Activity:</b> Submit the job and check its status in the Batch management console. You will notice that the parent job remains in the <code>PENDING</code> state while the child jobs run.
</div>

<div class="alert alert-block alert-info">
<b>Note:</b> This job is an array job, and we define the number of child jobs using <code>arrayProperties={"size" : 200}</code>.
</div>

In [None]:
response = batch_client.submit_job(
    jobName="lab-1-classical-large-scale",
    jobQueue=job_queue,
    jobDefinition=job_definition,
    containerOverrides={
        "environment": [
            {"name": "JOB_ENTRY_POINT", "value": "run_classical_afqmc"},
            {"name": "JOB_DTAU", "value": "0.005"},
            {"name": "JOB_TIME_STEPS", "value": "600"}
        ]
    },
    arrayProperties={"size": 200}
)

jobId = response["jobId"]
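Inside each child container, AWS Batch sets the environment variable `AWS_BATCH_JOB_ARRAY_INDEX`, and the child's job ID is the parent job ID followed by a colon and that index. The sketch below shows how a script could read both; the helper name `child_job_context` is hypothetical, and the lab's own entrypoint script may use the index differently.

```python
import os


def child_job_context(parent_job_id, environ=os.environ):
    """Return the array index and child job ID for the current container."""
    # AWS Batch sets AWS_BATCH_JOB_ARRAY_INDEX in every child of an array job.
    # Defaulting to 0 lets the same code run outside of an array job.
    index = int(environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
    return index, f"{parent_job_id}:{index}"


index, child_id = child_job_context(
    "example-parent-id", environ={"AWS_BATCH_JOB_ARRAY_INDEX": "17"}
)
print(index, child_id)
```

The `parent:index` pattern is the same one used in the S3 keys when we download the results of the child jobs.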

Track the job status:

In [None]:
batch_client.describe_jobs(jobs=[jobId]).get("jobs")[0].get("status")
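For an array job, the status of the parent alone says little about progress. The job description returned by `describe_jobs` also contains `arrayProperties` with a `statusSummary` that counts the child jobs per state. The following sketch summarizes it; the helper name `summarize_array_job` and the example counts are made up.

```python
def summarize_array_job(job_description):
    """Return (finished, total, counts) for an array parent job description."""
    array_properties = job_description.get("arrayProperties", {})
    counts = array_properties.get("statusSummary", {})
    total = array_properties.get("size", 0)
    finished = counts.get("SUCCEEDED", 0) + counts.get("FAILED", 0)
    return finished, total, counts


# Hand-written description for illustration; the counts are made up.
example_job = {
    "status": "PENDING",
    "arrayProperties": {
        "size": 200,
        "statusSummary": {"SUCCEEDED": 120, "RUNNING": 50, "RUNNABLE": 30, "FAILED": 0},
    },
}
finished, total, counts = summarize_array_job(example_job)
print(f"{finished}/{total} child jobs finished")
```

In the notebook, you would pass `batch_client.describe_jobs(jobs=[jobId])["jobs"][0]` to the helper.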

<div class="alert alert-block alert-success">
<b>Activity:</b> When the jobs complete, use the <a href="https://us-east-1.console.aws.amazon.com/s3/home">Amazon S3 management console</a> to locate the output. Then use Boto3 to fetch and plot the results.
</div>

### Postprocessing of job results

The processing of our job results is computationally more intensive this time. We have to download 200 files and average the energies over a much larger dataset.

In [None]:
import json
import numpy as np
from pathlib import Path

job = batch_client.describe_jobs(jobs=[jobId])["jobs"][0]
job_status = job["status"]
if job_status == "SUCCEEDED":
    Path("results/lab-1").mkdir(parents=True, exist_ok=True)
    s3_client = boto3.client("s3")
    for i in range(200):
        s3_client.download_file(
            data_bucket_name, 
            f"batch/{jobId}:{i}/results.json",
            f"results/lab-1/result_{i}.json"
        )
else:
    print(f"Your job is in status {job_status}")



local_energies_real = []
local_energies_imag = []
weights = []

# Collect the walkers of all 200 child jobs into common lists
for i in range(200):
    with open(f"results/lab-1/result_{i}.json", "r") as file:
        data = json.load(file)
    local_energies_real.extend(data["local_energies_real"])
    local_energies_imag.extend(data["local_energies_imag"])
    weights.extend(data["weights"])

# Recombine real and imaginary parts, then average over all walkers
local_energies = np.array(local_energies_real) + 1.0j * np.array(local_energies_imag)
energies = np.real(np.average(local_energies, weights=weights, axis=0))
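Downloading 200 small files one after another is dominated by request latency. As an optional variation, the downloads could run concurrently in a thread pool, since Boto3 low-level clients can be shared across threads. This is a sketch under that assumption; `download_array_results`, the worker count, and the `RecordingClient` stand-in are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def download_array_results(client, bucket, parent_job_id, size, target_dir, max_workers=16):
    """Download results.json of every child job using a small thread pool."""

    def fetch(index):
        key = f"batch/{parent_job_id}:{index}/results.json"
        path = f"{target_dir}/result_{index}.json"
        client.download_file(bucket, key, path)
        return path

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map returns the results in the order of the indices
        return list(pool.map(fetch, range(size)))


class RecordingClient:
    """Stand-in for the S3 client that only records the requested keys."""

    def __init__(self):
        self.keys = []

    def download_file(self, bucket, key, filename):
        self.keys.append(key)


recorder = RecordingClient()
paths = download_array_results(recorder, "example-bucket", "example-id", 4, "results/lab-1")
print(paths)
```

In the notebook, the call would be `download_array_results(s3_client, data_bucket_name, jobId, 200, "results/lab-1")`, after the target directory has been created.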

Let's plot the energies and compare to the reference value, again.

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline

plt.plot(
    0.005 * np.arange(600),
    energies,
    linestyle="dashed",
    marker=".",
    color="tab:blue",
    label="classical",
)
plt.axhline(-1.137117067345732, linestyle="dashed", color="black")
plt.title(r"Ground state estimation of H$_2$ using AFQMC", fontsize=16)
plt.legend(fontsize=14, loc="upper right")
plt.xlabel(r"$\tau$", fontsize=14)
plt.ylabel("Energy", fontsize=14)
plt.yticks(fontsize=14)
plt.tick_params(direction="in", labelsize=14)
plt.show()

By comparing with the previous run, it's clear that the oscillations are much smaller and the results converge to the reference value.
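To put a number on this comparison, one option is to look at the second half of each energy series and compute its mean deviation from the reference value and its standard deviation. The sketch below uses synthetic series, not results of the lab; the helper name `tail_statistics` and the choice of the last 50% of time steps are assumptions.

```python
import numpy as np

REFERENCE_ENERGY = -1.137117067345732  # reference value used in the plots above


def tail_statistics(energies, reference=REFERENCE_ENERGY, tail_fraction=0.5):
    """Return (mean deviation from reference, standard deviation) over the tail."""
    energies = np.asarray(energies, dtype=float)
    start = int(len(energies) * (1.0 - tail_fraction))
    tail = energies[start:]
    return float(np.mean(tail) - reference), float(np.std(tail))


# Synthetic series for illustration only; these are not results of the lab.
rng = np.random.default_rng(0)
noisy = REFERENCE_ENERGY + 0.05 * rng.standard_normal(200)
smooth = REFERENCE_ENERGY + 0.005 * rng.standard_normal(600)
print(tail_statistics(noisy))
print(tail_statistics(smooth))
```

To apply this to your own runs, keep a copy of `energies` from the small-scale example before the large-scale postprocessing cell overwrites it.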

## Wrapping up

<div class="alert alert-block alert-info">
<b>You reached the end of lab 1. Well done!</b>
</div>