# Getting started

Data associated with the Allen Brain Cell Atlas is hosted on Amazon Web Services (AWS) in an S3 bucket as an AWS Public Dataset.
No account or login is required. The S3 bucket is located here [arn:aws:s3:::allen-brain-cell-atlas](https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html). You will need to be connected to the internet to run this notebook.

Each release has an associated **manifest.json** which lists the specific versions of all directories and files that are part of the release. We recommend using the manifest as the starting point for data download and usage.

Expression matrices are stored in the [anndata h5ad format](https://anndata.readthedocs.io/en/latest/) and need to be downloaded to a local file system before use.

The **AWS Command Line Interface ([AWS CLI](https://aws.amazon.com/cli/))** is a simple option to download specific directories and files from S3. Download and installation instructions can be found here: https://aws.amazon.com/cli/.

This notebook shows how to format AWS CLI commands to download the data required for the tutorials. You can copy those commands into a terminal shell, or run them directly in this notebook through the "subprocess.run" lines in the code.
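
Before running any of the download cells, it can help to confirm that the AWS CLI is actually installed and visible on your PATH. The following is a minimal sketch using only the standard library; `find_aws_cli` is a hypothetical helper introduced here for illustration and is not part of the notebook.

```python
import shutil

def find_aws_cli(executable="aws"):
    """Return the full path of the AWS CLI executable, or None if it is not on PATH."""
    return shutil.which(executable)

aws_path = find_aws_cli()
if aws_path is None:
    print("AWS CLI not found on PATH; see https://aws.amazon.com/cli/ for installation instructions")
else:
    print("AWS CLI found at:", aws_path)
```

If the check reports that the CLI is missing, the commands printed later in this notebook can still be copied and run on another machine where it is installed.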


In [1]:
import requests
import json
import os
import pathlib
import subprocess
import time

## Using the file manifest

Let's open the manifest.json file associated with the current release.

In [2]:
version = '20231215'
url = 'https://allen-brain-cell-atlas.s3-us-west-2.amazonaws.com/releases/%s/manifest.json' % version
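response = requests.get(url)
response.raise_for_status()  # stop early with a clear error if the manifest cannot be fetched
manifest = response.json()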
print("version: ", manifest['version'])

version:  20231215


At the top level, the manifest consists of the release *version* tag, the S3 *resource_uri*, and the dictionaries *directory_listing* and *file_listing*. A simple option to download data is to use the AWS CLI to download specific directories or files. All the example notebooks in this repository assume that data has been downloaded locally in the same file organization as specified by the "relative_path" field in the manifest.
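
The layout convention described here can be sketched as a small function that derives both the remote S3 location and the expected local location from a "relative_path". This is only an illustration: `resolve_paths` is a hypothetical helper, and the download root `../abc_download_root` is an example value rather than a required path.

```python
import pathlib

def resolve_paths(resource_uri, relative_path, download_base):
    """Return (remote S3 URI, local path) for one manifest entry."""
    # Join with exactly one slash, whether or not resource_uri ends with one.
    remote = resource_uri.rstrip("/") + "/" + relative_path
    # Mirror the manifest's relative_path underneath the local download root.
    local = pathlib.Path(download_base) / relative_path
    return remote, local

remote, local = resolve_paths(
    "s3://allen-brain-cell-atlas/",
    "metadata/MERFISH-C57BL6J-638850/20231215",
    "../abc_download_root",  # illustrative download root
)
print(remote)
print(local)
```

Keeping the local tree identical to the "relative_path" values means a notebook only needs to know the download root to find any file listed in the manifest.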

In [3]:
manifest.keys()
print("version:",manifest['version'])
print("resource_uri:",manifest['resource_uri'])

version: 20231215
resource_uri: s3://allen-brain-cell-atlas/


Let's look at the information associated with the spatial transcriptomics dataset **MERFISH-C57BL6J-638850**. This dataset has two related directories: *expression_matrices*, containing a set of h5ad files, and *metadata*, containing a set of csv files. Use the *view_link* URL to browse the directories in a web browser.

In [4]:
expression_matrices = manifest['directory_listing']['MERFISH-C57BL6J-638850']['directories']['expression_matrices']
print(expression_matrices)
print(expression_matrices['view_link'])

{'version': '20230830', 'relative_path': 'expression_matrices/MERFISH-C57BL6J-638850/20230830', 'url': 'https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/expression_matrices/MERFISH-C57BL6J-638850/20230830/', 'view_link': 'https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html#expression_matrices/MERFISH-C57BL6J-638850/20230830/', 'total_size': 15255179148}
https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html#expression_matrices/MERFISH-C57BL6J-638850/20230830/


In [5]:
metadata = manifest['directory_listing']['MERFISH-C57BL6J-638850']['directories']['metadata']
print(metadata)
print(metadata['view_link'])

{'version': '20231215', 'relative_path': 'metadata/MERFISH-C57BL6J-638850/20231215', 'url': 'https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/metadata/MERFISH-C57BL6J-638850/20231215/', 'view_link': 'https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html#metadata/MERFISH-C57BL6J-638850/20231215/', 'total_size': 1942629358}
https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html#metadata/MERFISH-C57BL6J-638850/20231215/


Directory sizes are also reported as part of the manifest.json. WARNING: the expression matrix directories can be very large (> 100 GB).

In [6]:
GB = 1024.0 ** 3  # number of bytes in one gigabyte (binary definition)

for r in manifest['directory_listing'] :    
    r_dict =  manifest['directory_listing'][r]
    for d in r_dict['directories'] :
        d_dict = r_dict['directories'][d]        
        print(d_dict['relative_path'],":",'%0.2f GB' % (d_dict['total_size']/GB))
        

expression_matrices/MERFISH-C57BL6J-638850/20230830 : 14.21 GB
metadata/MERFISH-C57BL6J-638850/20231215 : 1.81 GB
expression_matrices/MERFISH-C57BL6J-638850-sections/20230630 : 14.31 GB
expression_matrices/WMB-10Xv2/20230630 : 104.16 GB
expression_matrices/WMB-10Xv3/20230630 : 176.41 GB
expression_matrices/WMB-10XMulti/20230830 : 0.21 GB
metadata/WMB-10X/20231215 : 2.39 GB
metadata/WMB-taxonomy/20231215 : 0.01 GB
metadata/WMB-neighborhoods/20231215 : 3.00 GB
image_volumes/Allen-CCF-2020/20230630 : 0.37 GB
metadata/Allen-CCF-2020/20230630 : 0.00 GB
image_volumes/MERFISH-C57BL6J-638850-CCF/20230630 : 0.11 GB
metadata/MERFISH-C57BL6J-638850-CCF/20231215 : 2.01 GB
expression_matrices/Zhuang-ABCA-1/20230830 : 3.09 GB
metadata/Zhuang-ABCA-1/20231215 : 1.33 GB
metadata/Zhuang-ABCA-1-CCF/20230830 : 0.21 GB
expression_matrices/Zhuang-ABCA-2/20230830 : 1.30 GB
metadata/Zhuang-ABCA-2/20231215 : 0.57 GB
metadata/Zhuang-ABCA-2-CCF/20230830 : 0.08 GB
expression_matrices/Zhuang-ABCA-3/20230830 : 1.69
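
To estimate how much disk space a given kind of directory needs before downloading, the per-directory sizes can be summed by kind. The sketch below assumes the same nesting as the manifest's *directory_listing* and, to stay self-contained, uses a small sample built from the two MERFISH-C57BL6J-638850 sizes printed earlier; `total_size_by_kind` is a hypothetical helper.

```python
from collections import defaultdict

GB = 1024.0 ** 3

# Sample shaped like manifest['directory_listing'], using sizes shown earlier in this notebook.
sample_listing = {
    "MERFISH-C57BL6J-638850": {
        "directories": {
            "expression_matrices": {"total_size": 15255179148},
            "metadata": {"total_size": 1942629358},
        }
    }
}

def total_size_by_kind(directory_listing):
    """Sum 'total_size' in bytes for each directory kind across all releases."""
    totals = defaultdict(int)
    for release in directory_listing.values():
        for kind, entry in release["directories"].items():
            totals[kind] += entry["total_size"]
    return dict(totals)

totals = total_size_by_kind(sample_listing)
for kind, size in sorted(totals.items()):
    print("%s: %0.2f GB" % (kind, size / GB))
```

Passing `manifest['directory_listing']` instead of the sample should give the totals for the full release.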

## Downloading files for the tutorial notebooks

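Suppose you would like to download data to a local directory such as *../abc_download_root*. Set *download_base* to that directory; the outputs shown in this notebook were generated with the path set in the next cell.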

In [7]:
download_base = '/Volumes/LaCie/2024_Allen_Brain_Cell_Atlas'

### Downloading all metadata directories

Since the metadata directories are relatively small, we will download all of them. We loop through the manifest and download each metadata directory using the **[AWS CLI](https://aws.amazon.com/cli/)** sync command. This should take less than 5 minutes.

In [9]:
for r in manifest['directory_listing'] :
    
    r_dict =  manifest['directory_listing'][r]
    
    for d in r_dict['directories'] :
        
        if d != 'metadata' :
            continue
        d_dict = r_dict['directories'][d]
        local_path = os.path.join( download_base, d_dict['relative_path'])
        local_path = pathlib.Path( local_path )
        remote_path = manifest['resource_uri'] + d_dict['relative_path']
        
        command = "aws s3 sync --no-sign-request %s %s" % (remote_path, local_path)
        print(command)
        
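        start = time.process_time()
        # Runs the command; comment out the next line to only print it.
        # Note: process_time() counts CPU time of this process, not the wall-clock duration of the transfer.
        result = subprocess.run(command.split(),stdout=subprocess.PIPE)
        print("time taken: ", time.process_time() - start)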
  

aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/MERFISH-C57BL6J-638850/20231215 /Volumes/LaCie/2024_Allen_Brain_Cell_Atlas/metadata/MERFISH-C57BL6J-638850/20231215
time taken:  0.07871700000000015
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/WMB-10X/20231215 /Volumes/LaCie/2024_Allen_Brain_Cell_Atlas/metadata/WMB-10X/20231215
time taken:  0.0960399999999999
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/WMB-taxonomy/20231215 /Volumes/LaCie/2024_Allen_Brain_Cell_Atlas/metadata/WMB-taxonomy/20231215
time taken:  0.006442000000000059
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/WMB-neighborhoods/20231215 /Volumes/LaCie/2024_Allen_Brain_Cell_Atlas/metadata/WMB-neighborhoods/20231215
time taken:  0.10831099999999982
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/Allen-CCF-2020/20230630 /Volumes/LaCie/2024_Allen_Brain_Cell_Atlas/metadata/Allen-CCF-2020/20230630
time taken:  0.00495200000000
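
The cells in this notebook build each command as a single string and split it on whitespace, which breaks if the download directory contains a space. One alternative, sketched below, is to build the command as a list of arguments from the start. `build_sync_command` is a hypothetical helper, and the local path is an invented example chosen to contain spaces.

```python
import shlex

def build_sync_command(remote_path, local_path):
    """Return the 'aws s3 sync' command as an argument list suitable for subprocess.run."""
    return ["aws", "s3", "sync", "--no-sign-request", str(remote_path), str(local_path)]

args = build_sync_command(
    "s3://allen-brain-cell-atlas/metadata/WMB-taxonomy/20231215",
    "/tmp/abc download root/metadata/WMB-taxonomy/20231215",  # illustrative path with spaces
)

# shlex.join quotes arguments where needed, so the printed line can be pasted into a POSIX shell.
print(shlex.join(args))

# To run it from Python, pass the list directly (left commented out here):
# subprocess.run(args, check=True)
```

Because the arguments are never re-parsed, each path reaches the AWS CLI as a single argument regardless of the characters it contains.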

### Downloading one 10x expression matrix
To run the 10x part 1 notebook, you first need to download the log2 version of the "WMB-10Xv2-TH" matrix (4 GB). The download takes about 1 minute, depending on your network speed.

We define a simple helper function to create the required AWS command. You can copy the command into a terminal shell to run it, or run it inside this notebook through the "subprocess.run" line of code.

In [11]:
def download_file( file_dict ) :
    
    print(file_dict['relative_path'],file_dict['size'])
    local_path = os.path.join( download_base, file_dict['relative_path'] )
    local_path = pathlib.Path( local_path )
    remote_path = manifest['resource_uri'] + file_dict['relative_path']

    command = "aws s3 cp --no-sign-request %s %s" % (remote_path, local_path)
    print(command)

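    start = time.process_time()
    # Runs the command; comment out the next line to only print it.
    # Note: process_time() counts CPU time of this process, not the wall-clock duration of the transfer.
    result = subprocess.run(command.split(' '),stdout=subprocess.PIPE)
    print("time taken: ", time.process_time() - start)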

In [12]:
expression_matrices = manifest['file_listing']['WMB-10Xv2']['expression_matrices']
file_dict = expression_matrices['WMB-10Xv2-TH']['log2']['files']['h5ad']
print('size:',file_dict['size'])
download_file( file_dict )

size: 4038679930
expression_matrices/WMB-10Xv2/20230630/WMB-10Xv2-TH-log2.h5ad 4038679930
aws s3 cp --no-sign-request s3://allen-brain-cell-atlas/expression_matrices/WMB-10Xv2/20230630/WMB-10Xv2-TH-log2.h5ad /Volumes/LaCie/2024_Allen_Brain_Cell_Atlas/expression_matrices/WMB-10Xv2/20230630/WMB-10Xv2-TH-log2.h5ad
time taken:  0.14806600000000003
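
The "time taken" values reported in this notebook are fractions of a second even for multi-gigabyte files. That is because `time.process_time()` counts only the CPU time used by the notebook process, which mostly waits while the AWS CLI does the transfer. To measure how long a download really takes, `time.perf_counter()` is the usual choice. The sketch below contrasts the two, using `time.sleep` as a stand-in for waiting on a download; `timed` is a hypothetical helper.

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, wall-clock seconds, CPU seconds of this process)."""
    start_wall = time.perf_counter()
    start_cpu = time.process_time()
    result = func(*args, **kwargs)
    wall = time.perf_counter() - start_wall
    cpu = time.process_time() - start_cpu
    return result, wall, cpu

# Sleeping stands in for waiting on an external download process.
_, wall, cpu = timed(time.sleep, 0.2)
print("wall-clock: %.2f s, CPU: %.2f s" % (wall, cpu))
```

The wall-clock figure reflects the waiting time, while the CPU figure stays close to zero, which mirrors the small values seen in the outputs here.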


### Downloading the MERFISH expression matrix

To run the MERFISH part 1 notebook, you first need to download the log2 version of the "C57BL6J-638850" matrix (7 GB). The download takes about 3 minutes, depending on your network speed.

In [13]:
datasets = ['MERFISH-C57BL6J-638850']
for d in datasets :
    expression_matrices = manifest['file_listing'][d]['expression_matrices']
    file_dict = expression_matrices['C57BL6J-638850']['log2']['files']['h5ad']
    print('size:',file_dict['size'])
    download_file( file_dict )

size: 7627589574
expression_matrices/MERFISH-C57BL6J-638850/20230830/C57BL6J-638850-log2.h5ad 7627589574
aws s3 cp --no-sign-request s3://allen-brain-cell-atlas/expression_matrices/MERFISH-C57BL6J-638850/20230830/C57BL6J-638850-log2.h5ad /Volumes/LaCie/2024_Allen_Brain_Cell_Atlas/expression_matrices/MERFISH-C57BL6J-638850/20230830/C57BL6J-638850-log2.h5ad
time taken:  0.2750950000000001


To run the Zhuang MERFISH notebook, you first need to download the log2 versions of the expression matrices for all 4 brain specimens.

In [14]:
datasets = ['Zhuang-ABCA-1','Zhuang-ABCA-2','Zhuang-ABCA-3','Zhuang-ABCA-4']
for d in datasets :
    expression_matrices = manifest['file_listing'][d]['expression_matrices']
    file_dict = expression_matrices[d]['log2']['files']['h5ad']
    print('size:',file_dict['size'])
    download_file( file_dict )

size: 2128478610
expression_matrices/Zhuang-ABCA-1/20230830/Zhuang-ABCA-1-log2.h5ad 2128478610
aws s3 cp --no-sign-request s3://allen-brain-cell-atlas/expression_matrices/Zhuang-ABCA-1/20230830/Zhuang-ABCA-1-log2.h5ad /Volumes/LaCie/2024_Allen_Brain_Cell_Atlas/expression_matrices/Zhuang-ABCA-1/20230830/Zhuang-ABCA-1-log2.h5ad
time taken:  0.07982400000000034
size: 871420938
expression_matrices/Zhuang-ABCA-2/20230830/Zhuang-ABCA-2-log2.h5ad 871420938
aws s3 cp --no-sign-request s3://allen-brain-cell-atlas/expression_matrices/Zhuang-ABCA-2/20230830/Zhuang-ABCA-2-log2.h5ad /Volumes/LaCie/2024_Allen_Brain_Cell_Atlas/expression_matrices/Zhuang-ABCA-2/20230830/Zhuang-ABCA-2-log2.h5ad
time taken:  0.03586199999999984
size: 1160586154
expression_matrices/Zhuang-ABCA-3/20230830/Zhuang-ABCA-3-log2.h5ad 1160586154
aws s3 cp --no-sign-request s3://allen-brain-cell-atlas/expression_matrices/Zhuang-ABCA-3/20230830/Zhuang-ABCA-3-log2.h5ad /Volumes/LaCie/2024_Allen_Brain_Cell_Atlas/expression_matrices
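
Since each file entry in the manifest reports its "size" in bytes, one simple sanity check after a download is to compare that number with the size of the local file. The sketch below is an assumption about how such a check could look, not part of the original notebook; `is_download_complete` is a hypothetical helper, demonstrated on a small temporary file.

```python
import pathlib
import tempfile

def is_download_complete(local_path, expected_size):
    """Return True if local_path is a file whose size in bytes equals expected_size."""
    path = pathlib.Path(local_path)
    return path.is_file() and path.stat().st_size == expected_size

# Demonstration on a small temporary file standing in for a downloaded h5ad file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = pathlib.Path(tmp_dir) / "demo.h5ad"
    demo_file.write_bytes(b"\0" * 1024)
    complete = is_download_complete(demo_file, 1024)
    truncated = is_download_complete(demo_file, 2048)
    missing = is_download_complete(pathlib.Path(tmp_dir) / "absent.h5ad", 1024)

print(complete, truncated, missing)
```

In this notebook the call would take the form `is_download_complete(pathlib.Path(download_base) / file_dict['relative_path'], file_dict['size'])`. A matching size does not prove the contents are intact, but a mismatch is a clear sign of an interrupted transfer.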

### Downloading all image volumes

To run the CCF and MERFISH-to-CCF registration notebooks, you first need to download both sets of image volumes.

In [15]:
for r in manifest['directory_listing'] :
    
    r_dict =  manifest['directory_listing'][r]
    
    for d in r_dict['directories'] :
        
        if d != 'image_volumes' :
            continue
        d_dict = r_dict['directories'][d]
        local_path = os.path.join( download_base, d_dict['relative_path'])
        local_path = pathlib.Path( local_path )
        remote_path = manifest['resource_uri'] + d_dict['relative_path']
        
        command = "aws s3 sync --no-sign-request %s %s" % (remote_path, local_path)
        print(command)
        
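        start = time.process_time()
        # Runs the command; comment out the next line to only print it.
        # Note: process_time() counts CPU time of this process, not the wall-clock duration of the transfer.
        result = subprocess.run(command.split(),stdout=subprocess.PIPE)
        print("time taken: ", time.process_time() - start)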
  

aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/image_volumes/Allen-CCF-2020/20230630 /Volumes/LaCie/2024_Allen_Brain_Cell_Atlas/image_volumes/Allen-CCF-2020/20230630
time taken:  0.022386000000000017
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/image_volumes/MERFISH-C57BL6J-638850-CCF/20230630 /Volumes/LaCie/2024_Allen_Brain_Cell_Atlas/image_volumes/MERFISH-C57BL6J-638850-CCF/20230630
time taken:  0.008987999999999996
