docs: update docs for azbatch and dockerhub ref (#2298)
### Description

- Patches the Azure Batch Executor docs to include instructions for
generating an SAS token
- Updates the Azure Batch Executor test to pull docker image from the
latest snakemake release

### QC

* [x] The PR contains a test case for the changes or the changes are
already covered by an existing test case.
* [x] The documentation (`docs/`) is updated to reflect the changes or
this is not necessary (e.g. if the change does neither modify the
language nor the behavior or functionalities of Snakemake).
jakevc committed Jun 14, 2023
1 parent bca7959 commit 908dbf1
Showing 4 changed files with 154 additions and 40 deletions.
100 changes: 100 additions & 0 deletions docs/executing/cloud.rst
@@ -454,3 +454,103 @@ the `FUNNEL_SERVER_USER` and `FUNNEL_SERVER_PASSWORD` as environment variables
$ export FUNNEL_SERVER_USER=funnel
$ export FUNNEL_SERVER_PASSWORD=abc123
-----------------------------------------------------------------
Executing a Snakemake workflow via Azure Batch
-----------------------------------------------------------------

First, install the `Azure CLI <https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest>`_.
Then install the Azure-related dependencies:

.. code:: console

    conda create -c bioconda -c conda-forge -n snakemake snakemake msrest azure-batch azure-storage-blob azure-mgmt-batch azure-identity
    conda activate snakemake

Data in Azure Storage
~~~~~~~~~~~~~~~~~~~~~~

Using this executor typically requires you to start with large data files
already in Azure Storage, and then interact with them via Azure Batch. An easy way to do this is to use the
`azcopy <https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10>`__ command-line client.
For example, here is how we might upload a file to storage using it:

.. code-block:: console

    $ azcopy copy mydata.txt "https://$account.blob.core.windows.net/snakemake-bucket/1/mydata.txt"

The Snakemake Azure Batch executor will not work with data in a storage account that has "hierarchical namespace" enabled.
Azure hierarchical namespace is a newer API for Azure Storage, also called "ADLS Gen2".
Snakemake does not currently support this storage format because its Python API is distinct from that of traditional blob storage.
For more details see: https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace.
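
If you are unsure whether an existing account has hierarchical namespace enabled, you can check from the
command line (a quick sketch; it assumes ``$stgacct`` and ``$resgroup`` hold your storage account and resource group names):

.. code-block:: console

    # prints "true" if ADLS Gen2 / hierarchical namespace is enabled
    $ az storage account show -n $stgacct -g $resgroup --query isHnsEnabled -o tsv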


Execution
~~~~~~~~~

Before you can execute the workflow, you will need to set up the credentials that allow the batch nodes to
read and write from blob storage. For the AzBlob storage provider in
Snakemake, this is done through environment variables.

Set the required environment variables:

.. code-block:: console

    $ export AZ_BLOB_PREFIX=<Azure_Blob_name>
    $ export AZ_BATCH_ACCOUNT_URL="<AZ_BATCH_ACCOUNT_URL>"
    $ export AZ_BATCH_ACCOUNT_KEY="<AZ_BATCH_ACCOUNT_KEY>"
    $ export AZ_BLOB_ACCOUNT_URL="<AZ_BLOB_ACCOUNT_URL_with_SAS>"

Now we can run Snakemake using:

.. code-block:: console

    $ snakemake \
        --default-remote-prefix $AZ_BLOB_PREFIX \
        --use-conda \
        --default-remote-provider AzBlob \
        --envvars AZ_BLOB_ACCOUNT_URL \
        --az-batch \
        --container-image snakemake/snakemake \
        --az-batch-account-url $AZ_BATCH_ACCOUNT_URL

This will use the default Snakemake image from Docker Hub. If you would like to use your
own, make sure that the image contains the same Snakemake version as installed locally
and also supports Azure Blob Storage. The optional BATCH_CONTAINER_REGISTRY can be configured
to fetch from your own container registry. If that registry is an Azure Container Registry
that the managed identity has access to, then BATCH_CONTAINER_REGISTRY_USER and BATCH_CONTAINER_REGISTRY_PASS are not needed.
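
For example, to pull the container image from a private registry, the three variables above might be set along these lines (a sketch with illustrative values):

.. code-block:: console

    $ export BATCH_CONTAINER_REGISTRY=myregistry.azurecr.io
    $ export BATCH_CONTAINER_REGISTRY_USER=myregistry
    $ export BATCH_CONTAINER_REGISTRY_PASS="<registry-password>"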

After completion, all results, including logs, can be found in the blob container prefix specified by ``--default-remote-prefix``.
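
For example, you can inspect the outputs with the Azure CLI (a sketch; it assumes the ``snakemake-bucket`` container from the upload example above and an account name in ``$account``):

.. code-block:: console

    $ az storage blob list --account-name $account --container-name snakemake-bucket --prefix 1/ --auth-mode login -o table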

Additional configuration
~~~~~~~~~~~~~~~~~~~~~~~~

**Defining a Start Task**

A start task can optionally be specified as a shell script that runs during each node's startup as it is added to the batch pool.
To specify a start task, set the environment variable BATCH_NODE_START_TASK_SAS_URL to the SAS URL of a start task shell script.
Store your shell script in a blob storage account and generate an SAS URL to the shell script blob object.
You can generate the SAS URL using the Azure portal, or on the command line using the following command structure:

.. code-block:: console

    $ stgacct="storage-account-name"
    $ container="container-name"
    $ expiry="2024-01-01"
    $ blob_name="starttask.sh"
    $ SAS_TOKEN=$(az storage blob generate-sas --account-name $stgacct --container-name $container --name $blob_name --permissions r --auth-mode login --as-user --expiry $expiry -o tsv)
    $ BLOB_URL=$(az storage blob url --account-name $stgacct --container-name $container --name $blob_name --auth-mode login -o tsv)
    # then export the full SAS URL
    $ export BATCH_NODE_START_TASK_SAS_URL="${BLOB_URL}?${SAS_TOKEN}"

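The start task itself is just a shell script executed on each node; for example, a minimal, purely illustrative
``starttask.sh`` could install an extra system package (this assumes an Ubuntu-based pool image and sufficient privileges):

.. code-block:: bash

    #!/usr/bin/env bash
    # illustrative start task: install an extra system tool on each node
    set -euo pipefail
    apt-get update -y
    apt-get install -y jq
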
**Autoscaling and Task Distribution**

The Azure Batch executor supports autoscaling of the batch nodes when the flag ``--az-batch-enable-autoscale`` is included.
This flag sets the initial dedicated node count of the pool to zero and re-evaluates the number of nodes to be spun up or down based on the number of remaining tasks to run over a five-minute interval.
Since five minutes is the smallest allowed interval for Azure Batch autoscaling, this feature is most useful for long-running jobs. For more information on Azure Batch autoscaling configuration, see: https://learn.microsoft.com/en-us/azure/batch/batch-automatic-scaling.
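
For example, autoscaling is enabled by adding the flag to the invocation shown above:

.. code-block:: console

    $ snakemake \
        --default-remote-prefix $AZ_BLOB_PREFIX \
        --use-conda \
        --default-remote-provider AzBlob \
        --envvars AZ_BLOB_ACCOUNT_URL \
        --az-batch \
        --az-batch-enable-autoscale \
        --container-image snakemake/snakemake \
        --az-batch-account-url $AZ_BATCH_ACCOUNT_URL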

For shorter-running jobs it might be more cost- or time-effective to choose a VM size with more cores (``BATCH_POOL_VM_SIZE``) and increase ``BATCH_TASKS_PER_NODE``. Or, if you want to keep tasks running on separate nodes, you can set a larger ``BATCH_POOL_NODE_COUNT``.
It may require experimentation to find the most efficient and cost-effective task distribution model for your use case, depending on what you are optimizing for. For more details on the limitations of Azure Batch node/task distribution, see: https://learn.microsoft.com/en-us/azure/batch/batch-parallel-node-tasks.
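
For instance, to pack several short tasks onto each node, you might override the defaults along these lines (a sketch with illustrative values; pick a VM size with enough cores for the chosen tasks-per-node count):

.. code-block:: console

    $ export BATCH_POOL_VM_SIZE="Standard_D4s_v3"   # 4 vCPUs (illustrative)
    $ export BATCH_TASKS_PER_NODE=4
    $ export BATCH_POOL_NODE_COUNT=1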

89 changes: 51 additions & 38 deletions docs/executor_tutorial/azure_batch.rst
@@ -8,18 +8,19 @@ Azure Batch Tutorial
.. _AZCLI: https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest

In this tutorial we will show how to execute a Snakemake workflow
on Azure Batch nodes using Azure Blob Storage. One could use attached storage
solutions as a shared file system, but this adds an unnecessary level of complexity
and most importantly costs. Instead we use cheap Azure Blob Storage,
which is used by Snakemake to automatically stage data in and out for
every job. Please visit the `Azure Batch Documentation
<https://learn.microsoft.com/en-us/azure/batch/batch-technical-overview#how-it-works>`__
for an overview of the various components of Azure Batch.

Following the steps below you will:

#. Set up Azure Blob Storage, and sync the Snakemake tutorial data to the storage container
#. Create an Azure Batch account
#. Configure credentials
#. Run the example Snakemake workflow on the batch account


@@ -29,20 +30,20 @@ Setup
To go through this tutorial, you need the following software installed:

* Python_ ≥3.6
* Snakemake_ ≥7.28.0
* AZCLI_


First install conda as outlined in the :ref:`tutorial <tutorial-setup>`,
and then install the full Snakemake with some additional Azure-related dependencies and AZCLI_:

.. code:: console

    conda create -c bioconda -c conda-forge -n snakemake snakemake msrest azure-batch azure-storage-blob azure-mgmt-batch azure-identity

Naturally, you can omit the deployment of such an environment in case you already have it, or you can update an existing Snakemake environment with the additional dependencies.

Create an Azure Storage Account and upload example data
::::::::::::::::::::::::::::::::::::::::::::::::::::::::

We will be starting from scratch, i.e. we will
@@ -51,6 +52,9 @@ existing resources instead.

.. code:: console

    # log in to the Azure CLI
    az login

    # change the following names as required
    # Azure region where to run:
    export region=westus
@@ -67,11 +71,29 @@ existing resources instead.
    # create a general purpose storage account with cheapest SKU
    az storage account create -n $stgacct -g $resgroup --sku Standard_LRS -l $region
Get a key for that account and save it as ``stgkey``, then generate a storage account SAS token that expires five days from now; you will use the SAS token to authenticate to Blob Storage:

.. code:: console

    # get a date 5 days from today
    export expiry_date=`date -u -d "+5 days" '+%Y-%m-%dT%H:%MZ'`

    # get the storage account key and storage endpoint
    export stgkey=$(az storage account keys list -g $resgroup -n $stgacct -o tsv | head -n1 | cut -f 4)
    export stgurl=$(az storage account show-connection-string -g $resgroup -n $stgacct --protocol https -o tsv | cut -f5,9 -d ';' | cut -f 2 -d '=')

    # get a storage account SAS token to use for AZ_BLOB_ACCOUNT_URL
    export sas=$(az storage account generate-sas --account-name $stgacct \
        --account-key $stgkey \
        --expiry $expiry_date \
        --https-only \
        --permissions acdlrw \
        --resource-types sco \
        --services bf \
        --out tsv)

    # construct a blob account url with SAS token
    export storage_account_url_with_sas="${stgurl}?${sas}"

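You can optionally sanity-check the SAS token before moving on (a quick sketch; it simply lists the containers visible to the token):

.. code:: console

    az storage container list --account-name $stgacct --sas-token "$sas" -o table
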
Next, you will create a storage container (think: bucket) to upload the Snakemake tutorial data to:

@@ -120,17 +142,15 @@ The format of the batch account url is :code:`https://${accountname}.${region}.b
    # get batch account url from command line
    export batch_endpoint=$(az batch account show --name $accountname --resource-group $resgroup --query "accountEndpoint" --output tsv)
    export batch_account_url="https://${batch_endpoint}"

    # set the batch account key
    export az_batch_account_key=$(az batch account keys list --resource-group $resgroup --name $accountname -o tsv | head -n1 | cut -f2)

To run the test workflow, two primary environment variables need to be set locally for the Snakemake invocation:
the Azure Batch account key, and the Azure storage account URL with an SAS credential. More details about ``AZ_BLOB_ACCOUNT_URL``
are described in the section below.

.. code:: console
@@ -142,7 +162,7 @@ are described in the section below.
Running the workflow
::::::::::::::::::::

Below we will run an example Snakemake workflow, using conda environments to install dependencies at runtime.
Clone the example workflow and cd into the directory:

.. code:: console
@@ -162,23 +182,16 @@ Clone the example workflow and cd into the directory:
└── src
└── plot-quals.py
To authenticate Azure Blob Storage, we set ``AZ_BLOB_ACCOUNT_URL``,
which takes the form ``https://<accountname>.blob.core.windows.net/?<sas_token>``.
The SAS URL can be constructed manually from the Azure portal, or on the command line using the commands shown in the above
section on storage account configuration. The value for ``AZ_BLOB_ACCOUNT_URL`` must be enclosed in double quotes, as the SAS token
contains special characters that need to be escaped.

When using Azure Storage and Snakemake without the Azure Batch executor, it is valid to use storage account key credentials via the variable ``AZ_BLOB_CREDENTIAL``,
but this type of authentication is not supported with Azure Batch, so we must use ``AZ_BLOB_ACCOUNT_URL`` with an SAS token credential when using the Azure Batch executor.

We'll pass ``AZ_BLOB_ACCOUNT_URL`` on to the batch nodes with the ``--envvars`` flag (see below).

The following optional environment variables can be set to override their associated default values,
and are used to change the runtime configuration of the batch nodes themselves:
@@ -256,7 +269,7 @@ Now you are ready to run the analysis:
    export AZ_BLOB_PREFIX=snakemake-tutorial
    export AZ_BATCH_ACCOUNT_URL="${batch_account_url}"
    export AZ_BATCH_ACCOUNT_KEY="${az_batch_account_key}"
    export AZ_BLOB_ACCOUNT_URL="${storage_account_url_with_sas}"

    # optional environment variables with defaults listed
@@ -304,8 +317,8 @@ Now you are ready to run the analysis:
This will use the default Snakemake image from Docker Hub. If you would like to use your
own, make sure that the image contains the same Snakemake version as installed locally
and also supports Azure Blob Storage. The optional BATCH_CONTAINER_REGISTRY can be configured
to fetch from your own container registry. If that registry is an Azure Container Registry
that the managed identity has access to, then BATCH_CONTAINER_REGISTRY_USER and BATCH_CONTAINER_REGISTRY_PASS are not needed.

After completion all results including
@@ -341,7 +354,7 @@ logs can be found in the blob container prefix specified by `--default-remote-pr
results/sorted_reads/C.bam BlockBlob Hot 2248758 application/octet-stream 2022-12-28T18:18:58+00:00
results/sorted_reads/C.bam.bai BlockBlob Hot 344 application/octet-stream 2022-12-28T18:21:23+00:00

Once the execution is complete, the batch nodes will scale down
automatically. If you are not planning to run anything else, it makes
sense to shut it down entirely:
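
A minimal teardown sketch (assuming the ``$resgroup`` and ``$accountname`` variables defined during setup; adapt the commands to your own resource names):

.. code:: console

    # delete only the batch account
    az batch account delete --name $accountname --resource-group $resgroup --yes

    # or remove the entire resource group, including the storage account
    az group delete --name $resgroup --yes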

@@ -372,9 +385,9 @@ You can generate an SAS URL to the blob using the azure portal or the command li
Autoscaling and Task Distribution
::::::::::::::::::::::::::::::::::

The Azure Batch executor supports autoscaling of the batch nodes when the flag ``--az-batch-enable-autoscale`` is included.
This flag sets the initial dedicated node count of the pool to zero and re-evaluates the number of nodes to be spun up or down based on the number of remaining tasks to run over a five-minute interval.
Since five minutes is the smallest allowed interval for Azure Batch autoscaling, this feature is most useful for long-running jobs. For more information on Azure Batch autoscaling configuration, see: https://learn.microsoft.com/en-us/azure/batch/batch-automatic-scaling.

For shorter-running jobs it might be more cost- or time-effective to choose a VM size with more cores (``BATCH_POOL_VM_SIZE``) and increase ``BATCH_TASKS_PER_NODE``. Or, if you want to keep tasks running on separate nodes, you can set a larger ``BATCH_POOL_NODE_COUNT``.
It may require experimentation to find the most efficient and cost-effective task distribution model for your use case, depending on what you are optimizing for. For more details on the limitations of Azure Batch node/task distribution, see: https://learn.microsoft.com/en-us/azure/batch/batch-parallel-node-tasks.
3 changes: 2 additions & 1 deletion snakemake/executors/azure_batch.py
@@ -660,7 +660,7 @@ def create_batch_pool(self):
# Specify container configuration, fetching an image
# https://docs.microsoft.com/en-us/azure/batch/batch-docker-container-workloads#prefetch-images-for-container-configuration
container_config = batchmodels.ContainerConfiguration(
    type="dockerCompatible", container_image_names=[self.container_image]
)

user = None
@@ -696,6 +696,7 @@ def create_batch_pool(self):
# Specify container configuration, fetching an image
# https://docs.microsoft.com/en-us/azure/batch/batch-docker-container-workloads#prefetch-images-for-container-configuration
container_config = batchmodels.ContainerConfiguration(
    type="dockerCompatible",
    container_image_names=[self.container_image],
    container_registries=registry_conf,
)
2 changes: 1 addition & 1 deletion tests/test_azure_batch_executor.py
@@ -18,7 +18,7 @@ def test_az_batch_executor():
run(
    path=wdir,
    default_remote_prefix=prefix,
    container_image="snakemake/snakemake",
    envvars=["AZ_BLOB_ACCOUNT_URL", "AZ_BLOB_CREDENTIAL"],
    az_batch=True,
    az_batch_account_url=bau,
