# Summary

* This is a tutorial on using Python for accessing the scBaseCamp dataset hosted by the Arc Institute.
* The data can be streamed or downloaded locally.
  * For small jobs (e.g., summarizing the some metadata), streaming is recommended.
  * For large jobs (e.g., training a model), downloading is recommended.
* See the [README](README.md#metadata) for a description of the obs metadata.

# Setup

### Installation

If needed, install the necessary dependencies.

You can use the [conda environment](../conda_envs/python.yml) provided in this git repository. To do so:

In [None]:
!which conda && conda env create -q -f ../conda_envs/python.yml

# Load packages

In [4]:
import os
import pandas as pd
import scanpy as sc
import pyarrow.dataset as ds
import gcsfs

In [5]:
# initialize GCS file system for reading data from GCS
fs = gcsfs.GCSFileSystem()

# Data location

In [6]:
# GCS bucket path
gcs_base_path = "gs://arc-ctc-scbasecamp/2025-02-25/"

# List available files

Let's see what we have to work with!

In [None]:
# helper function
def get_parquet_files(gcs_base_path: str, target: str=None, endswith: str=None):
    files = fs.glob(os.path.join(gcs_base_path, "**"))
    if target:
        files = [f for f in files if os.path.basename(f) == target]
    else:
        files = [f for f in files if f.endswith(endswith)]
    file_list = []
    for f in files:
        file_list.append(f.split("/")[-3:-1] + [f])
    return pd.DataFrame(file_list, columns=["organism", "feature_type", "file_path"])

## Parquet files

* Contain the obs metadata

In [18]:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]

'/home/nickyoungblut/.gcp/c-tc-429521-6f6f5b8ccd93.json'

In [None]:
# set the path to the metadata files
gcs_path = os.path.join(gcs_base_path, "metadata")
gcs_path

'gs://arc-ctc-scbasecamp/2025-02-25/metadata'

### Per-sample metadata

In [None]:
# list files
sample_pq_files = get_parquet_files(gcs_path, "sample_metadata.parquet.gz")
sample_pq_files.head()

Unnamed: 0,organism,feature_type,file_path
0,Arabidopsis_thaliana,Gene,arc-ctc-scbasecamp/2025-02-25/metadata/Arabido...
1,Arabidopsis_thaliana,GeneFull_Ex50pAS,arc-ctc-scbasecamp/2025-02-25/metadata/Arabido...
2,Arabidopsis_thaliana,Velocyto,arc-ctc-scbasecamp/2025-02-25/metadata/Arabido...
3,Bos_taurus,Gene,arc-ctc-scbasecamp/2025-02-25/metadata/Bos_tau...
4,Bos_taurus,GeneFull_Ex50pAS,arc-ctc-scbasecamp/2025-02-25/metadata/Bos_tau...


**Notes:**

* As you can see, the files are organized by `organism` and `count_type` (STARsolo output type)

### Per-obs metadata

In [10]:
# list files
obs_pq_files = get_parquet_files(gcs_path, "obs_metadata.parquet.gz")
obs_pq_files.head()

Unnamed: 0,organism,feature_type,file_path
0,Arabidopsis_thaliana,Gene,arc-ctc-scbasecamp/2025-02-25/metadata/Arabido...
1,Arabidopsis_thaliana,GeneFull_Ex50pAS,arc-ctc-scbasecamp/2025-02-25/metadata/Arabido...
2,Bos_taurus,Gene,arc-ctc-scbasecamp/2025-02-25/metadata/Bos_tau...
3,Bos_taurus,GeneFull_Ex50pAS,arc-ctc-scbasecamp/2025-02-25/metadata/Bos_tau...
4,Caenorhabditis_elegans,Gene,arc-ctc-scbasecamp/2025-02-25/metadata/Caenorh...


## h5ad files 

* Contain count matrices and per-obs metadata

In [11]:
gcs_path = os.path.join(gcs_base_path, "h5ad")
gcs_path

'gs://arc-ctc-scbasecamp/2025-02-25/h5ad'

In [12]:
# list files
h5ad_files = get_parquet_files(gcs_path, endswith=".h5ad.gz")
print(h5ad_files.shape)
h5ad_files.head()

(42216, 3)


Unnamed: 0,organism,feature_type,file_path
0,Arabidopsis_thaliana,Gene,arc-ctc-scbasecamp/2025-02-25/h5ad/Arabidopsis...
1,Arabidopsis_thaliana,Gene,arc-ctc-scbasecamp/2025-02-25/h5ad/Arabidopsis...
2,Arabidopsis_thaliana,Gene,arc-ctc-scbasecamp/2025-02-25/h5ad/Arabidopsis...
3,Arabidopsis_thaliana,Gene,arc-ctc-scbasecamp/2025-02-25/h5ad/Arabidopsis...
4,Arabidopsis_thaliana,Gene,arc-ctc-scbasecamp/2025-02-25/h5ad/Arabidopsis...


# Obs metadata

* `obs` ≃ cell

In [13]:
# select a particular STARsolo output type
## "GeneFull_Ex50pAS" is most similar to CellRanger output
target_feature_type = "GeneFull_Ex50pAS"

### Per-sample

* Useful for quickly summarizing the per-sample metadata (a small file versus the entire obs metadata file; see below).

In [14]:
# filter to target count type
sample_pq_files_f = sample_pq_files[sample_pq_files["feature_type"] == target_feature_type]
sample_pq_files_f

Unnamed: 0,organism,feature_type,file_path
1,Arabidopsis_thaliana,GeneFull_Ex50pAS,arc-ctc-scbasecamp/2025-02-25/metadata/Arabido...
4,Bos_taurus,GeneFull_Ex50pAS,arc-ctc-scbasecamp/2025-02-25/metadata/Bos_tau...
7,Caenorhabditis_elegans,GeneFull_Ex50pAS,arc-ctc-scbasecamp/2025-02-25/metadata/Caenorh...
10,Callithrix_jacchus,GeneFull_Ex50pAS,arc-ctc-scbasecamp/2025-02-25/metadata/Callith...
13,Danio_rerio,GeneFull_Ex50pAS,arc-ctc-scbasecamp/2025-02-25/metadata/Danio_r...
16,Drosophila_melanogaster,GeneFull_Ex50pAS,arc-ctc-scbasecamp/2025-02-25/metadata/Drosoph...
19,Equus_caballus,GeneFull_Ex50pAS,arc-ctc-scbasecamp/2025-02-25/metadata/Equus_c...
22,Gallus_gallus,GeneFull_Ex50pAS,arc-ctc-scbasecamp/2025-02-25/metadata/Gallus_...
25,Gorilla_gorilla,GeneFull_Ex50pAS,arc-ctc-scbasecamp/2025-02-25/metadata/Gorilla...
28,Heterocephalus_glaber,GeneFull_Ex50pAS,arc-ctc-scbasecamp/2025-02-25/metadata/Heteroc...


In [15]:
# we will just read the first 3 rows of each file
row_count = 3
sample_metadata = []
for i,row in sample_pq_files_f.iterrows():
    sample_metadata.append(
        ds.dataset(row["file_path"], filesystem=fs, format="parquet")
        .head(row_count)
        .to_pandas()
    )
sample_metadata = pd.concat(sample_metadata)

print(sample_metadata.shape)
sample_metadata.head()

(62, 14)


Unnamed: 0,entrez_id,srx_accession,file_path,obs_count,lib_prep,tech_10x,cell_prep,organism,tissue,disease,purturbation,cell_line,czi_collection_id,czi_collection_name
0,24123125,SRX17302366,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Arabid...,9036,10x_Genomics,3_prime_gex,single_cell,Arabidopsis thaliana,other,not specified,"BL (Brassinolide), 100nM, 0.5 hours post-treat...",WT Col-0,,
1,24123140,SRX17302381,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Arabid...,14317,10x_Genomics,3_prime_gex,single_cell,Arabidopsis thaliana,other,not specified,"control treatment, age: 7 days",WT Col-0,,
2,24123142,SRX17302383,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Arabid...,20075,10x_Genomics,3_prime_gex,single_cell,Arabidopsis thaliana,other,unsure,control,unsure,,
0,32702158,SRX24387177,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Bos_ta...,11084,10x_Genomics,3_prime_gex,single_cell,Bos taurus,lung,"Arthritis, Rheumatoid",unsure,unsure,,
1,32702159,SRX24387178,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Bos_ta...,21404,10x_Genomics,3_prime_gex,single_cell,Bos taurus,lung,unsure,unsure,unsure,,


In [16]:
sample_metadata.columns

Index(['entrez_id', 'srx_accession', 'file_path', 'obs_count', 'lib_prep',
       'tech_10x', 'cell_prep', 'organism', 'tissue', 'disease',
       'purturbation', 'cell_line', 'czi_collection_id',
       'czi_collection_name'],
      dtype='object')

In [68]:
# number of cells in this slice of the dataset
print(f"Cell count: {sample_metadata['obs_count'].sum()}")

Cell count: 532944


In [69]:
# cell count per organism
obs_count_org = (
    sample_metadata.groupby("organism")["obs_count"]
    .sum().to_frame().reset_index()
    .sort_values("obs_count", ascending=False)
    .reset_index(drop=True)
)
obs_count_org

Unnamed: 0,organism,obs_count
0,Solanum lycopersicum,55212
1,Arabidopsis thaliana,43428
2,Bos taurus,40851
3,Macaca mulatta,37939
4,Callithrix jacchus,35467
5,Equus caballus,30142
6,Ovis aries,29690
7,Sus scrofa,27760
8,Gallus gallus,27099
9,Schistosoma mansoni,25526


### Per-observation

* The per-obs metadata contains metadata specific to each obs (e.g., gene count)

In [70]:
# filter to just human samples
human_samples = sample_metadata[sample_metadata["organism"] == "Homo sapiens"]
human_samples

Unnamed: 0,entrez_id,srx_accession,file_path,obs_count,lib_prep,tech_10x,cell_prep,organism,tissue,disease,purturbation,cell_line,czi_collection_id,czi_collection_name
0,29110018,ERX11148735,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_s...,747,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,surplus skin from breast reconstruction surgery,not applicable,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...
1,29110027,ERX11148744,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_s...,2379,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,treated with dispase II and collagenase for ce...,keratinocyte CD49f-,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...
2,29110026,ERX11148743,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_s...,2316,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,treated with dispase II and collagenase for ce...,epidermal myeloid cells,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...


In [74]:
# select certain columns and row filtering
infile = os.path.join(gcs_base_path, "metadata", "Homo_sapiens", target_feature_type, "obs_metadata.parquet.gz")
dataset = ds.dataset(infile, filesystem=fs, format="parquet")
obs_metadata_target = dataset.to_table(filter=(ds.field('SRX_accession').isin(human_samples["srx_accession"]))).to_pandas()
obs_metadata_target

Unnamed: 0,gene_count,umi_count,SRX_accession,cell_barcode
0,1966,9930.0,ERX11148735,AAACCTGAGTCGCCGT
1,931,1479.0,ERX11148735,AAACCTGTCTTGAGGT
2,3234,19343.0,ERX11148735,AAACGGGCATACGCTA
3,2882,22176.0,ERX11148735,AAAGATGAGAAACCTA
4,484,1035.0,ERX11148735,AAAGATGCAGATCTGT
...,...,...,...,...
2374,2361,12167.0,ERX11148744,TTTGTCAAGACTTGAA
2375,1803,6814.0,ERX11148744,TTTGTCACACAAGCCC
2376,991,2859.0,ERX11148744,TTTGTCACAGACGCCT
2377,2445,11812.0,ERX11148744,TTTGTCAGTTATGCGT


In [None]:
# observations per sample in out data subset
obs_metadata_target["SRX_accession"].value_counts()

SRX_accession
ERX11148744    2379
ERX11148743    2316
ERX11148735     747
Name: count, dtype: int64

# Read h5ad files

### Example: select human samples

In [78]:
# we have a set of samples
target_samples = sample_metadata[sample_metadata["organism"] == "Homo sapiens"]
target_samples

Unnamed: 0,entrez_id,srx_accession,file_path,obs_count,lib_prep,tech_10x,cell_prep,organism,tissue,disease,purturbation,cell_line,czi_collection_id,czi_collection_name
0,29110018,ERX11148735,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_s...,747,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,surplus skin from breast reconstruction surgery,not applicable,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...
1,29110027,ERX11148744,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_s...,2379,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,treated with dispase II and collagenase for ce...,keratinocyte CD49f-,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...
2,29110026,ERX11148743,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_s...,2316,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,treated with dispase II and collagenase for ce...,epidermal myeloid cells,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...


In [79]:
# read in the anndata for those samples
adata = []
for infile in target_samples["file_path"].tolist():
    with fs.open(infile, 'rb') as f:
        adata.append(sc.read_h5ad(f))

# combine anndata objects
adata = sc.concat(adata)
adata

FileNotFoundError: arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_sapiens/ERX11148735.h5ad.gz

In [37]:
# number of obs per SRX accession
adata.obs["SRX_accession"].value_counts()

SRX_accession
ERX11148744    2379
ERX11148743    2316
ERX11148735     747
Name: count, dtype: int64

In [39]:
# add per-sample metadata to the anndata object
adata.obs = adata.obs.reset_index().merge(
    target_samples, left_on="SRX_accession", right_on="srx_accession", how="inner"
)
adata.obs.head()

Unnamed: 0,index,gene_count,umi_count,SRX_accession,entrez_id,srx_accession,file_path,obs_count,lib_prep,tech_10x,cell_prep,organism,tissue,disease,purturbation,cell_line,czi_collection_id,czi_collection_name
0,AAACCTGAGTCGCCGT,1966,9930.0,ERX11148735,29110018,ERX11148735,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_s...,747,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,surplus skin from breast reconstruction surgery,not applicable,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...
1,AAACCTGTCTTGAGGT,931,1479.0,ERX11148735,29110018,ERX11148735,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_s...,747,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,surplus skin from breast reconstruction surgery,not applicable,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...
2,AAACGGGCATACGCTA,3234,19343.0,ERX11148735,29110018,ERX11148735,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_s...,747,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,surplus skin from breast reconstruction surgery,not applicable,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...
3,AAAGATGAGAAACCTA,2882,22176.0,ERX11148735,29110018,ERX11148735,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_s...,747,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,surplus skin from breast reconstruction surgery,not applicable,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...
4,AAAGATGCAGATCTGT,484,1035.0,ERX11148735,29110018,ERX11148735,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_s...,747,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,surplus skin from breast reconstruction surgery,not applicable,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...


### Example: human samples with gene count >= 1000

In [49]:
# get target samples
sample_metadata_target = sample_metadata.merge(
    obs_metadata[obs_metadata["gene_count"] >= 1000], 
    left_on="srx_accession", 
    right_on="SRX_accession", 
    how="inner"
)
print(f"SRX count: {sample_metadata_target}")

Unnamed: 0,entrez_id,srx_accession,file_path,obs_count,lib_prep,tech_10x,cell_prep,organism,tissue,disease,purturbation,cell_line,czi_collection_id,czi_collection_name,gene_count,umi_count,SRX_accession,cell_barcode


In [50]:
# for the sake of this tutorial, just use the first 3
sample_metadata_target = sample_metadata_target.loc[:3]
sample_metadata_target

Unnamed: 0,entrez_id,srx_accession,file_path,obs_count,lib_prep,tech_10x,cell_prep,organism,tissue,disease,purturbation,cell_line,czi_collection_id,czi_collection_name,gene_count,umi_count,SRX_accession,cell_barcode


In [None]:
# read in h5ad files
adata = []
for infile in sample_metadata_target["file_path"].tolist():
    with fs.open(infile, 'rb') as f:
        adata.append(sc.read_h5ad(f))

# combine anndata objects
adata = sc.concat(adata)
adata

# Downloading files

You can use [gsutil](https://cloud.google.com/storage/docs/gsutil) to download any of the files in the bucket
and work with them locally. 

Please be considerate to the [cost of egress](https://cloud.google.com/storage/pricing) when download the data from Google Cloud Storage.

For example:

```bash
gsutil cp gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_sapiens/ERX4319106.h5ad.gz .
```

***

# sessionInfo

In [1]:
!pip list

Package                   Version
------------------------- --------------
aiohappyeyeballs          2.4.6
aiohttp                   3.11.12
aiosignal                 1.3.2
anndata                   0.11.3
anyio                     4.8.0
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
array_api_compat          1.10.0
arrow                     1.3.0
asttokens                 3.0.0
async-lru                 2.0.4
attrs                     25.1.0
babel                     2.17.0
beautifulsoup4            4.13.3
bleach                    6.2.0
blinker                   1.9.0
Brotli                    1.1.0
cached-property           1.5.2
cachetools                5.5.2
certifi                   2025.1.31
cffi                      1.17.1
charset-normalizer        3.4.1
click                     8.1.8
colorama                  0.4.6
comm                      0.2.2
contourpy                 1.3.1
cryptography              44.0.1
cycler                    0.12.1
debugpy      