# Getting Started with the Human Cell Atlas

## Overview

[Human Cell Atlas (HCA)](https://humancellatlas.org) data is organized into "Projects" which can be discovered through the Catalog. You can interactively [browse the Catalog](https://data.humancellatlas.org/explore/projects) or access it via the public [Data Browser API](https://data.humancellatlas.org/apis/api-documentation/data-browser-api).

This notebook demonstrates how to:

- access the Catalog
- access metadata for a Project
- download the data for a Project

Examples illustrated in this notebook are based on [examples](https://github.com/verily-src/azul/blob/prod/docs/) in the [Azul](https://github.com/verily-src/azul) repository.

## Setup

Run the cells below to set up the libraries and utilities used in this notebook.

### Import libraries

In [None]:
# Import the standard "requests" library for programmatic access of HTTP URLs
import requests

# Import the standard "os" module for URL path manipulation
import os

# Import "tqdm" to display a progress bar during downloads
from tqdm import tqdm

### Set notebook globals

The notebook needs to know:

- The URL endpoint for the HCA catalog
- The name prefix that identifies DCP catalogs
- Where to save downloaded files

In [None]:
CATALOG_PREFIX = 'dcp'
ENDPOINT_URL = 'https://service.azul.data.humancellatlas.org/index'
CATALOGS_URL = f'{ENDPOINT_URL}/catalogs'
PROJECTS_URL = f'{ENDPOINT_URL}/projects'

HCA_EXAMPLES_DIR = os.path.expanduser('~/wb-tutorials/hca')
OUTPUT_DIR = os.path.join(HCA_EXAMPLES_DIR, 'data')

!mkdir -p "{OUTPUT_DIR}"

### Create utility routines

#### fetch_json

Fetch a URL, raise an exception on an HTTP error, and return the parsed JSON response on success.

In [None]:
def fetch_json(url: str, params: dict = None) -> dict:
    response = requests.get(url, params=params)
    response.raise_for_status()
    
    return response.json()

#### list_catalogs

Return the list of public (non-internal) catalogs from the server.

The list of catalogs is expected to look something like:

  * ['dcp31', 'dcp32', 'dcp1', 'lm2', 'lm3']

In [None]:
def list_catalogs() -> list:
    response = fetch_json(CATALOGS_URL, None)

    catalogs = []
    for catalog, details in response['catalogs'].items():
        if not details['internal']:
            catalogs.append(catalog)

    return catalogs
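
To make the filtering step concrete, the snippet below applies the same `internal` check to a hand-written response. The catalog names and flags are hypothetical, and the real payload may carry additional fields; this is a sketch of the shape `list_catalogs` relies on, not a copy of the server's output.

```python
# Hypothetical response shaped like the one list_catalogs() reads.
# The catalog names and 'internal' flags are made up for illustration.
sample_response = {
    'catalogs': {
        'dcp31': {'internal': False},
        'dcp32': {'internal': False},
        'dcp32-it': {'internal': True},
        'lm3': {'internal': False},
    },
}

# Keep only the catalogs that are not flagged as internal
public_catalogs = [
    name
    for name, details in sample_response['catalogs'].items()
    if not details['internal']
]

print(public_catalogs)  # ['dcp31', 'dcp32', 'lm3']
```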

#### get_dcp_catalog

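The Data Coordination Platform (DCP) publishes new catalogs periodically.
Select the "latest" DCP catalog, taken here to be the one with the largest numeric suffix (for example, `dcp32` rather than `dcp31`).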

In [None]:
def get_dcp_catalog() -> str:
    # We want the latest DCP catalog.
    catalogs = list_catalogs()
    
    # Extract the 'dcp' catalogs
    dcp_catalogs = [c for c in catalogs if c.startswith(CATALOG_PREFIX)]
    
    # Get the largest numerically
    max_value = 0
    max_catalog = None
    for c in dcp_catalogs:
        if int(c[len(CATALOG_PREFIX):]) > max_value:
            max_value = int(c[len(CATALOG_PREFIX):])
            max_catalog = c
    
    return max_catalog
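
The selection rule can also be tried without contacting the server. The sketch below is a hypothetical variant (the helper name `latest_catalog` is not part of the notebook) that compares suffixes as integers and skips names whose suffix is not numeric. It assumes catalog names follow the `dcp<number>` pattern shown in the example list.

```python
def latest_catalog(catalogs, prefix='dcp'):
    """Return the catalog with the largest numeric suffix, or None."""
    numbered = {}
    for name in catalogs:
        suffix = name[len(prefix):]
        # Skip names without the prefix or with a non-numeric suffix
        if name.startswith(prefix) and suffix.isdecimal():
            numbered[name] = int(suffix)
    if not numbered:
        return None
    # Compare by integer value, so 'dcp32' ranks above 'dcp9'
    return max(numbered, key=numbered.get)


print(latest_catalog(['dcp31', 'dcp32', 'dcp1', 'lm2', 'lm3']))  # dcp32
print(latest_catalog(['dcp9', 'dcp32']))                         # dcp32
```

Comparing the suffixes as integers matters: a plain string comparison would rank `dcp9` above `dcp32`.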

#### download_file 

Downloads the content of the specified URL to a local output path,
while displaying a progress bar.

In [None]:
def download_file(url: str, output_path: str) -> None:
    # Start the request stream
    response = requests.get(url, stream=True)
    response.raise_for_status()

    # Get the content length so the progress bar can display accurate progress
    total = int(response.headers.get('content-length', 0))
    print(f'Downloading to: {output_path}', flush=True)
    
    # Fetch the content in chunks, updating the progress bar
    with open(output_path, 'wb') as f:
        with tqdm(total=total, unit='B', unit_scale=True, unit_divisor=1024) as bar:
            for chunk in response.iter_content(chunk_size=1024):
                size = f.write(chunk)
                bar.update(size)

#### get_project_request_params

Build the query parameters used to fetch the list of projects in the HCA catalog.

In [None]:
def get_project_request_params(catalog: str, max_projects: int) -> dict:

    # Set up request parameters
    return {
      'catalog': catalog,
      'size': max_projects,
      'sort': 'projectTitle',
      'order': 'asc'
    }
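
For reference, these parameters end up in the query string of the request URL. The sketch below builds that URL by hand with the standard library; `requests` performs an equivalent encoding when it is given `params`. The catalog name `dcp32` is only an example value.

```python
from urllib.parse import urlencode

# Example parameters; 'dcp32' is a placeholder catalog name
params = {
    'catalog': 'dcp32',
    'size': 10,
    'sort': 'projectTitle',
    'order': 'asc',
}

base_url = 'https://service.azul.data.humancellatlas.org/index/projects'

# Encode the parameters as a query string and append them to the base URL
full_url = f'{base_url}?{urlencode(params)}'
print(full_url)
```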

#### list_projects 

Fetch the list of projects in the HCA catalog.
Return a list of project titles and UUIDs.

In [None]:
def list_projects(catalog: str, max_projects: int) -> list:

    # Allocate a list to populate for return
    project_list = []

    print(f"Fetching first {max_projects} projects:")
    
    # Set up the fetch parameters
    url = PROJECTS_URL
    params = get_project_request_params(catalog, max_projects)
    
    while url and len(project_list) < max_projects:
        response_json = fetch_json(url, params)

        # Iterate over results, pulling out key project elements
        for hit in response_json['hits']:
            uuid = hit['entryId']
            shortname = hit['projects'][0]['projectShortname']
            title = hit['projects'][0]['projectTitle']

            print("-----------------------")
            print(f"Title: {title}")
            print(f"Shortname: {shortname}")
            print(f"Id: {uuid}")

            project_list.append({'title': title, 'uuid': uuid})

        # Follow the pagination link, if any. The "next" URL already carries
        # the query parameters, so none are passed on subsequent requests.
        # A missing "next" link ends the loop.
        url = response_json['pagination']['next']
        params = None

    return project_list
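
The pagination loop can be exercised offline. The sketch below replaces the server with a hypothetical two-page dictionary keyed by URL; the page contents, the `collect_ids` helper, and the `fetch` parameter are all illustrative assumptions. Only the `hits` and `pagination.next` fields mirror what `list_projects` reads.

```python
# Hypothetical two-page result set, keyed by URL, standing in for the server
FAKE_PAGES = {
    'page-1': {
        'hits': [{'entryId': 'a'}, {'entryId': 'b'}],
        'pagination': {'next': 'page-2'},
    },
    'page-2': {
        'hits': [{'entryId': 'c'}],
        'pagination': {'next': None},
    },
}


def collect_ids(first_url, max_items, fetch=FAKE_PAGES.get):
    """Follow 'next' links until there are no more pages or max_items is reached."""
    ids = []
    url = first_url
    while url and len(ids) < max_items:
        page = fetch(url)
        for hit in page['hits']:
            if len(ids) == max_items:
                break
            ids.append(hit['entryId'])
        # A missing 'next' link ends the loop
        url = page['pagination']['next']
    return ids


print(collect_ids('page-1', 10))  # ['a', 'b', 'c']
print(collect_ids('page-1', 1))   # ['a']
```

Unlike `list_projects`, this variant also stops in the middle of a page, so it never returns more than `max_items` entries.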

#### iterate_matrices_tree

Recursively traverse a matrix tree and yield the leaf nodes which
contain the details for each matrix file (e.g. file name, url, size).

The matrix format specification can be found [here](https://github.com/HumanCellAtlas/dcp2/blob/main/docs/dcp2_system_design.rst).

In [None]:
def iterate_matrices_tree(tree, keys=()):
    if isinstance(tree, dict):
        for k, v in tree.items():
            yield from iterate_matrices_tree(v, keys=(*keys, k))
    elif isinstance(tree, list):
        for file in tree:
            yield keys, file
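    else:
        # Anything other than a dict or a list is not a valid tree node
        raise TypeError(f'Unexpected node type in matrix tree: {type(tree).__name__}')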
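
A small worked example may help show what the traversal yields. The tree below is hypothetical: its grouping keys, file names, and URLs are invented, and real trees may nest differently. The function is repeated so the snippet runs on its own.

```python
def iterate_matrices_tree(tree, keys=()):
    """Yield (path, file) pairs for every file in a nested dict-of-lists tree."""
    if isinstance(tree, dict):
        for k, v in tree.items():
            yield from iterate_matrices_tree(v, keys=(*keys, k))
    elif isinstance(tree, list):
        for file in tree:
            yield keys, file
    else:
        raise TypeError(f'Unexpected node type: {type(tree).__name__}')


# Hypothetical tree; keys, names, and URLs are made up for illustration
sample_tree = {
    'genusSpecies': {
        'Homo sapiens': {
            'organ': {
                'blood': [
                    {'name': 'a.loom', 'url': 'https://example.org/a', 'size': 10},
                    {'name': 'b.loom', 'url': 'https://example.org/b', 'size': 20},
                ],
                'lung': [
                    {'name': 'c.loom', 'url': 'https://example.org/c', 'size': 30},
                ],
            },
        },
    },
}

leaves = list(iterate_matrices_tree(sample_tree))
for path, file_info in leaves:
    print(' / '.join(path), '->', file_info['name'])
```

Each yielded `path` is the tuple of dictionary keys leading to the file, so the first leaf here has the path `('genusSpecies', 'Homo sapiens', 'organ', 'blood')`.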

#### download_project_files

Fetch a project's metadata, find the file URLs, and download the contents.

In [None]:
def download_project_files(catalog: str, project_uuid: str, output_path: str):
    # Fetch the project metadata
    project_url = f'{PROJECTS_URL}/{project_uuid}'
    response_json = fetch_json(project_url, {'catalog': catalog})

    # Grab the project from the response
    project = response_json['projects'][0]

    # It is possible for a matrix file to be included multiple times in the projects response,
    # so a set of downloaded URLs is maintained to prevent downloading any file more than once.
    file_urls = set()
    
    # Iterate over the matrices and the contributed analyses to find project files
    for key in ('matrices', 'contributedAnalyses'):
        tree = project[key]
        for path, file_info in iterate_matrices_tree(tree):
            url = file_info['url']
            if url not in file_urls:
                dest_path = os.path.join(output_path, file_info['name'])
                download_file(url, dest_path)
                file_urls.add(url)
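
Before downloading, it can be useful to see which files would be fetched and how large they are. The sketch below applies the same URL-based de-duplication to a hypothetical list of leaf entries; the helper name `unique_by_url` and the sample entries are assumptions, and the `size` field is assumed to be a byte count.

```python
def unique_by_url(files):
    """Return the files in order, keeping only the first entry for each URL."""
    seen_urls = set()
    unique = []
    for file_info in files:
        if file_info['url'] not in seen_urls:
            seen_urls.add(file_info['url'])
            unique.append(file_info)
    return unique


# Hypothetical leaf entries; the same file appears under two branches
sample_files = [
    {'name': 'counts.loom', 'url': 'https://example.org/files/1', 'size': 2048},
    {'name': 'cells.csv', 'url': 'https://example.org/files/2', 'size': 512},
    {'name': 'counts.loom', 'url': 'https://example.org/files/1', 'size': 2048},
]

planned = unique_by_url(sample_files)
total_bytes = sum(f['size'] for f in planned)
print(f'{len(planned)} files, {total_bytes} bytes')  # 2 files, 2560 bytes
```

De-duplicating by URL does not protect against two different URLs that share a file name. In that case, `download_project_files` writes both to the same destination path and the second download overwrites the first.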

## Access HCA

### Get the latest catalog

From the list of catalogs, find the "latest" one.

In [None]:
CATALOG = get_dcp_catalog()
print(f"The DCP catalog is: {CATALOG}")

### Fetch project list

From the catalog, get a short list of projects and print them.

In [None]:
PROJECT_LIST = list_projects(CATALOG, 10)

### Download project files

Download the files for the first project in our list.

In [None]:
TARGET_PROJECT = PROJECT_LIST[0]

print(f"Downloading files for project '{TARGET_PROJECT['title']}'")
download_project_files(CATALOG, TARGET_PROJECT['uuid'], OUTPUT_DIR)
print("Downloads Complete.")
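
To check what was written, the output directory can be listed with file sizes. This is a minimal sketch: `summarize_directory` is a hypothetical helper, and the demonstration uses a temporary directory with a made-up file rather than the notebook's `OUTPUT_DIR`.

```python
import os
import tempfile


def summarize_directory(path):
    """Return sorted (file name, size in bytes) pairs for regular files in path."""
    entries = []
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_file():
                entries.append((entry.name, entry.stat().st_size))
    return sorted(entries)


# Demonstrate on a temporary directory containing one 2 KiB file
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, 'example.loom'), 'wb') as f:
        f.write(b'\0' * 2048)

    for name, size in summarize_directory(demo_dir):
        print(f'{name}: {size / 1024:.1f} KiB')  # example.loom: 2.0 KiB
```

In the notebook itself, the same helper could be called as `summarize_directory(OUTPUT_DIR)` after the downloads finish.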

## Provenance

Generate information about this notebook environment and the packages installed.

In [None]:
!date

Conda and pip installed packages:

In [None]:
!conda env export

JupyterLab extensions:

In [None]:
!jupyter labextension list

Number of cores:

In [None]:
!grep ^processor /proc/cpuinfo | wc -l

Memory:

In [None]:
!grep "^MemTotal:" /proc/meminfo

---

Copyright 2023 Verily Life Sciences LLC

Use of this source code is governed by a BSD-style
license that can be found in the LICENSE file or at
https://developers.google.com/open-source/licenses/bsd