# Summary

* This is a tutorial on using Python for accessing the Tahoe-100M dataset hosted by the Arc Institute.
* The data can be streamed or downloaded locally.
  * For small jobs (e.g., summarizing some of the metadata), streaming is recommended.
  * For large jobs (e.g., training a model), downloading is recommended.
* See the [README](README.md#obs-cell-metadata) for a description of the obs metadata.

# Setup

### Installation

If needed, install the necessary dependencies.

You can use the `pyproject.toml` provided in this Git repository with uv.
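
Before running the notebook, it can help to confirm that the packages imported in the next cell are actually installed. The sketch below uses only the standard library; the helper name `missing_packages` is hypothetical, and the package list simply mirrors the third-party imports used in this tutorial.

```python
from importlib.util import find_spec

# Top-level third-party packages imported in this tutorial.
REQUIRED = ["pandas", "scanpy", "pyarrow", "gcsfs"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

This only checks that each package can be located; it does not verify versions.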

### Load dependencies

In [1]:
import io
import pandas as pd
import scanpy as sc
import pyarrow.dataset as ds
import gcsfs

In [20]:
def humanize(n_bytes):
    for unit in ("B","KiB","MiB","GiB","TiB"):
        if n_bytes < 1024:
            return f"{n_bytes:.2f}{unit}"
        n_bytes /= 1024
    return f"{n_bytes:.2f}PiB"


In [2]:
# initialize GCS file system for reading data from GCS
fs = gcsfs.GCSFileSystem()

### Data location

In [None]:
# NOTE: gcsfs methods accept paths without the "gs://" prefix
base_path = "arc-ctc-tahoe100/2025-02-25/"
# full URI, used below to build file paths
gcp_base_path = f"gs://{base_path}"

# get total size (in bytes) of everything under that prefix
total_bytes = fs.du(base_path)
print(f"Total size under {gcp_base_path}: {humanize(total_bytes)}")

Total size under gs://arc-ctc-tahoe100/2025-02-25/: 315.75GiB
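
To see how that total breaks down, one option is to compute the size of each entry directly under the prefix. This is a minimal sketch, assuming `fs` follows the fsspec interface (`ls(path, detail=True)` returning dicts with `name`, `type`, and `size`, and `du(path)` returning a byte count); the helper name `sizes_by_entry` is hypothetical.

```python
def sizes_by_entry(fs, prefix):
    """Map each entry directly under `prefix` to its size in bytes, largest first.

    `fs` is assumed to be an fsspec-style filesystem such as gcsfs.GCSFileSystem.
    """
    sizes = {}
    for entry in fs.ls(prefix, detail=True):
        if entry["type"] == "directory":
            # fs.du() sums the sizes of all objects under a path
            sizes[entry["name"]] = fs.du(entry["name"])
        else:
            sizes[entry["name"]] = entry["size"]
    return dict(sorted(sizes.items(), key=lambda item: item[1], reverse=True))
```

With the objects defined above, `sizes_by_entry(fs, base_path)` returns a dict whose values can be passed to `humanize` for display. Note that `du` has to list every object under a directory, so this can take a while on large prefixes.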


# Obs metadata

* `obs` ≈ cells (each observation is roughly one cell)

### Per-sample

* The per-sample metadata file is small, so it is a quick way to summarize samples without reading the much larger obs metadata file (see below).

In [5]:
# path to sample metadata
infile = "/".join([gcp_base_path.rstrip("/"), 'metadata', 'sample_metadata.parquet'])
infile

'gs://arc-ctc-tahoe100/2025-02-25/metadata/sample_metadata.parquet'

In [6]:
# read just the first 3 rows
sample_metadata = ds.dataset(infile, filesystem=fs, format="parquet").head(3).to_pandas()
sample_metadata

Unnamed: 0,sample,plate,mean_gene_count,mean_tscp_count,mean_mread_count,mean_pcnt_mito,drug,drugname_drugconc
0,smp_1495,plate1,1354.169768,2027.11594,2444.032416,0.033956,Infigratinib,"[('Infigratinib', 0.05, 'uM')]"
1,smp_1496,plate1,1404.454157,2226.282791,2690.68597,0.071723,Erdafitinib,"[('Erdafitinib ', 0.05, 'uM')]"
2,smp_1497,plate1,1205.267794,1859.375821,2246.200127,0.084853,Everolimus,"[('Everolimus', 0.05, 'uM')]"
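
The `drugname_drugconc` values above are displayed as the text form of a list of `(drug, concentration, unit)` tuples. Assuming the column is stored as strings in that form (an inference from the displayed output, not something confirmed here), `ast.literal_eval` can parse it without executing arbitrary code. The helper name `parse_drugconc` is hypothetical. Drug names are stripped because at least one entry above carries trailing whitespace (`'Erdafitinib '`).

```python
import ast


def parse_drugconc(value):
    """Parse a string like "[('Infigratinib', 0.05, 'uM')]" into a list of dicts."""
    parsed = ast.literal_eval(value)  # evaluates literals only, never arbitrary code
    return [
        {"drug": name.strip(), "conc": float(conc), "unit": unit}
        for name, conc, unit in parsed
    ]


print(parse_drugconc("[('Erdafitinib ', 0.05, 'uM')]"))
# [{'drug': 'Erdafitinib', 'conc': 0.05, 'unit': 'uM'}]
```

On a DataFrame, this could be applied with `sample_metadata["drugname_drugconc"].map(parse_drugconc)`.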


In [8]:
# select specific columns and filter rows
columns_to_read = ['sample', 'plate', 'mean_gene_count']  # columns to load
dataset = ds.dataset(infile, filesystem=fs, format="parquet")
sample_metadata = dataset.to_table(filter=(ds.field('mean_gene_count') > 2000), columns=columns_to_read).to_pandas()
sample_metadata

Unnamed: 0,sample,plate,mean_gene_count
0,smp_1598,plate2,2196.279615
1,smp_1604,plate2,2039.014042
2,smp_1605,plate2,2043.083782
3,smp_2044,plate6,2155.888048
4,smp_2046,plate6,2072.778266
5,smp_2054,plate6,2035.670581
6,smp_2056,plate6,2058.926286
7,smp_2060,plate6,2094.526584
8,smp_2066,plate6,2229.688468
9,smp_2067,plate6,2192.958014


In [7]:
# get the number of samples
columns_to_read = ["sample"]  # Specify the columns you need
dataset = ds.dataset(infile, filesystem=fs, format="parquet")
sample_count = dataset.to_table(columns=columns_to_read).to_pandas()["sample"].nunique()
print(f"Number of samples: {sample_count}")

Number of samples: 1344


In [8]:
# get samples per plate
columns_to_read = ["plate", "sample"]  # Specify the columns you need
dataset = ds.dataset(infile, filesystem=fs, format="parquet")
samples_per_plate = dataset.to_table(columns=columns_to_read).to_pandas().groupby("plate").size()
samples_per_plate

plate
plate1     96
plate10    96
plate11    96
plate12    96
plate13    96
plate14    96
plate2     96
plate3     96
plate4     96
plate5     96
plate6     96
plate7     96
plate8     96
plate9     96
dtype: int64
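
The plates above are sorted lexicographically, so `plate10` comes before `plate2`. A small sort key that extracts the numeric suffix gives the natural order. This is a sketch that assumes every label has the form `plate<N>`; the helper name `plate_number` is hypothetical.

```python
import re


def plate_number(label):
    """Return the integer suffix of a label such as 'plate10'."""
    match = re.fullmatch(r"plate(\d+)", label)
    if match is None:
        raise ValueError(f"unexpected plate label: {label!r}")
    return int(match.group(1))


plates = ["plate1", "plate10", "plate11", "plate2", "plate3"]
print(sorted(plates, key=plate_number))
# ['plate1', 'plate2', 'plate3', 'plate10', 'plate11']
```

For the pandas Series above, the same key can be used as `samples_per_plate.sort_index(key=lambda idx: idx.map(plate_number))`.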

### Per-observation

* `obs` ≈ cells (each observation is roughly one cell)
* For the sake of this tutorial, we will pull only the first 100,000 observations.
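
The first rows of a file are not necessarily representative of the whole dataset. To summarize a single column over the full file without loading it all at once, one option is to iterate over record batches. This is a minimal sketch, assuming `dataset` is a `pyarrow.dataset.Dataset` like the ones created with `ds.dataset(...)` in this tutorial; the helper name `count_values` and the default batch size are hypothetical.

```python
from collections import Counter


def count_values(dataset, column, batch_size=100_000):
    """Count occurrences of each value in `column`, one record batch at a time.

    `dataset` is assumed to be a pyarrow.dataset.Dataset (e.g. from ds.dataset(...)).
    """
    counts = Counter()
    # Only the requested column is read, and it is processed batch by batch,
    # so the full column is never converted to Python objects in one go.
    for batch in dataset.to_batches(columns=[column], batch_size=batch_size):
        counts.update(batch.column(0).to_pylist())
    return counts
```

For example, `count_values(dataset, "plate")` on a dataset opened on the obs metadata file would tally observations per plate. This still streams the entire column from GCS, so it takes longer than `head(...)`.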

In [9]:
# set the path to the obs_metadata file
infile = "/".join([gcp_base_path.rstrip("/"), 'metadata', 'obs_metadata.parquet'])
infile

'gs://arc-ctc-tahoe100/2025-02-25/metadata/obs_metadata.parquet'

In [10]:
# read a subset of the metadata
obs_metadata = ds.dataset(infile, filesystem=fs, format="parquet").head(100000).to_pandas()
obs_metadata

Unnamed: 0,plate,BARCODE_SUB_LIB_ID,sample,gene_count,tscp_count,mread_count,drugname_drugconc,drug,cell_line,sublibrary,BARCODE,pcnt_mito,S_score,G2M_score,phase,pass_filter,cell_name
0,plate10,01_001_001-lib_1681,smp_2359,1379,2172,2559,"[('Bestatin (hydrochloride)', 0.05, 'uM')]",Bestatin (hydrochloride),CVCL_1478,lib_1681,01_001_001,0.029926,-0.229665,-0.190110,G1,full,NCI-H1573
1,plate10,01_002_149-lib_1681,smp_2359,975,1256,1470,"[('Bestatin (hydrochloride)', 0.05, 'uM')]",Bestatin (hydrochloride),CVCL_0459,lib_1681,01_002_149,0.026274,-0.167578,-0.132784,G1,full,NCI-H460
2,plate10,01_003_052-lib_1681,smp_2359,865,1239,1446,"[('Bestatin (hydrochloride)', 0.05, 'uM')]",Bestatin (hydrochloride),CVCL_C466,lib_1681,01_003_052,0.033898,-0.200957,-0.161538,G1,full,hTERT-HPNE
3,plate10,01_003_090-lib_1681,smp_2359,393,484,559,"[('Bestatin (hydrochloride)', 0.05, 'uM')]",Bestatin (hydrochloride),CVCL_1724,lib_1681,01_003_090,0.037190,-0.052746,-0.076190,G1,minimal,SW48
4,plate10,01_003_093-lib_1681,smp_2359,2657,5325,6269,"[('Bestatin (hydrochloride)', 0.05, 'uM')]",Bestatin (hydrochloride),CVCL_1285,lib_1681,01_003_093,0.017465,-0.636364,-0.614103,G1,full,HOP62
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,plate10,72_141_112-lib_1682,smp_2430,1481,2171,2559,"[('γ-Oryzanol', 0.05, 'uM')]",γ-Oryzanol,CVCL_0366,lib_1682,72_141_112,0.042837,0.000000,-0.100386,S,full,SNU-423
99996,plate10,72_141_131-lib_1682,smp_2430,1430,2119,2496,"[('γ-Oryzanol', 0.05, 'uM')]",γ-Oryzanol,CVCL_0371,lib_1682,72_141_131,0.074563,-0.009524,-0.028340,G1,full,KATO III
99997,plate10,72_141_184-lib_1682,smp_2430,827,1044,1201,"[('γ-Oryzanol', 0.05, 'uM')]",γ-Oryzanol,CVCL_1693,lib_1682,72_141_184,0.045019,-0.028571,-0.057324,G1,full,SHP-77
99998,plate10,72_142_157-lib_1682,smp_2430,875,1153,1333,"[('γ-Oryzanol', 0.05, 'uM')]",γ-Oryzanol,CVCL_0320,lib_1682,72_142_157,0.071119,-0.061905,-0.042970,G1,full,HT-29


In [11]:
# sample count
obs_metadata["sample"].nunique()

96

In [12]:
# gene count distribution
pd.options.display.float_format = '{:.0f}'.format
obs_metadata["gene_count"].describe()

count   100000
mean      1382
std        735
min        268
25%        896
50%       1209
75%       1661
max       9395
Name: gene_count, dtype: float64

In [13]:
# tscp (UMI) count distribution
pd.options.display.float_format = '{:.0f}'.format
obs_metadata["tscp_count"].describe()

count   100000
mean      2214
std       1833
min        392
25%       1230
50%       1748
75%       2583
max      54006
Name: tscp_count, dtype: float64

# Reading in h5ad files

* For this tutorial, we will read a subsampled version of one h5ad file, since the per-plate h5ad files are rather large.

In [14]:
# set the path to the subsampled h5ad file
infile = "gs://arc-ctc-tahoe100/2025-02-25/tutorial/plate3_2k-obs.h5ad"

In [15]:
# read in the h5ad file
with fs.open(infile, 'rb') as f:
    adata = sc.read_h5ad(f)
adata

AnnData object with n_obs × n_vars = 2000 × 62710
    obs: 'sample', 'gene_count', 'tscp_count', 'mread_count', 'drugname_drugconc', 'drug', 'cell_line', 'sublibrary', 'BARCODE', 'pcnt_mito', 'S_score', 'G2M_score', 'phase', 'pass_filter', 'cell_name', 'plate'

In [16]:
# look at the obs metadata
print(adata.obs.shape)
adata.obs.head()

(2000, 16)


BARCODE_SUB_LIB_ID,sample,gene_count,tscp_count,mread_count,drugname_drugconc,drug,cell_line,sublibrary,BARCODE,pcnt_mito,S_score,G2M_score,phase,pass_filter,cell_name,plate
01_001_117-lib_1009,smp_1687,1110,1404,1655,"[('Infigratinib', 5.0, 'uM')]",Infigratinib,CVCL_1693,lib_1009,01_001_117,0,0,0,G2M,full,SHP-77,plate3
01_001_122-lib_1009,smp_1687,1011,1324,1577,"[('Infigratinib', 5.0, 'uM')]",Infigratinib,CVCL_1495,lib_1009,01_001_122,0,0,0,G1,full,NCI-H1792,plate3
01_001_172-lib_1009,smp_1687,835,1042,1240,"[('Infigratinib', 5.0, 'uM')]",Infigratinib,CVCL_0399,lib_1009,01_001_172,0,0,0,G2M,full,LoVo,plate3
01_002_058-lib_1009,smp_1687,754,902,1040,"[('Infigratinib', 5.0, 'uM')]",Infigratinib,CVCL_1056,lib_1009,01_002_058,0,0,0,G1,full,A498,plate3
01_002_063-lib_1009,smp_1687,1546,2288,2695,"[('Infigratinib', 5.0, 'uM')]",Infigratinib,CVCL_0480,lib_1009,01_002_063,0,0,0,G2M,full,PANC-1,plate3


#### Next steps

You can then use the AnnData object for various downstream analyses.

# Downloading files

You can use [gsutil](https://cloud.google.com/storage/docs/gsutil) to download any of the files in the bucket
and work with them locally.

Please be mindful of the [cost of egress](https://cloud.google.com/storage/pricing) when downloading data from Google Cloud Storage.

For example:

```bash
gsutil cp gs://arc-ctc-tahoe100/2025-02-25/tutorial/plate3_2k-obs.h5ad .
```

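For large data transfers, `gsutil rsync` is preferable because it copies only files that are missing or changed at the destination, so an interrupted transfer can be resumed (add `-r` to recurse into subdirectories):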

```bash
gsutil rsync gs://arc-ctc-tahoe100/2025-02-25/tutorial/ .
```
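
The same kind of transfer can be done from Python with the `fs` object used earlier, checking the size first to keep egress in mind. This is a sketch, assuming `fs` follows the fsspec interface (`du(path)` and `get(rpath, lpath, recursive=True)`); the helper name `guarded_download` and the 1 GiB default limit are hypothetical choices for illustration.

```python
def guarded_download(fs, remote_prefix, local_dir, max_bytes=1 * 1024**3):
    """Download everything under `remote_prefix` unless it exceeds `max_bytes`.

    `fs` is assumed to be an fsspec-style filesystem such as gcsfs.GCSFileSystem.
    The 1 GiB default is an arbitrary illustrative limit.
    """
    total = fs.du(remote_prefix)  # total size in bytes under the prefix
    if total > max_bytes:
        raise ValueError(
            f"{remote_prefix} holds {total} bytes, above the {max_bytes}-byte limit"
        )
    # recursive=True copies the whole directory tree
    fs.get(remote_prefix, local_dir, recursive=True)
    return total
```

For example, `guarded_download(fs, "arc-ctc-tahoe100/2025-02-25/tutorial/", "tutorial_data")` would fetch the tutorial files only if they total less than the limit. Unlike `gsutil rsync`, this sketch does not skip files that already exist locally.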

***

# Session Info

In [17]:
!conda list

# packages in environment at /home/nickyoungblut/miniforge3/envs/scbasecamp_env:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
aiohappyeyeballs          2.6.1              pyhd8ed1ab_0    conda-forge
aiohttp                   3.11.14         py313h8060acc_0    conda-forge
aiosignal                 1.3.2              pyhd8ed1ab_0    conda-forge
anndata                   0.11.3             pyhd8ed1ab_0    conda-forge
anyio                     4.9.0              pyh29332c3_0    conda-forge
argon2-cffi               23.1.0             pyhd8ed1ab_1    conda-forge
argon2-cffi-bindings      21.2.0          py313h536fd9c_5    conda-forge
array-api-compat          1.11.2             pyh29332c3_0    conda-forge
arrow                     1.3.0              pyhd8ed1ab_1    conda-forge
asttokens                 3.0.0              py