# Construct pySCENIC GRN

We use pySCENIC to construct a gene regulatory network (GRN) for each scale dataset. For brevity, we show the preprocessing steps for scale-2 only (the files named `stage_2` below); the other scales are preprocessed identically.

## Library imports

In [11]:
# import dependencies
import glob

import loompy as lp

import pandas as pd

import anndata as ad
import scanpy as sc

from rgv_tools import DATA_DIR

## General settings

In [7]:
sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)

## Constants

In [8]:
DATASET = "mouse_neural_crest"

In [22]:
SAVE_DATA = True
if SAVE_DATA:
    (DATA_DIR / DATASET / "processed").mkdir(parents=True, exist_ok=True)
    (DATA_DIR / DATASET / "processed" / "scenic").mkdir(parents=True, exist_ok=True)

## Data loading

In [17]:
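# Load the QC-filtered raw counts and the processed stage-2 object
adata_raw = ad.io.read_h5ad(DATA_DIR / DATASET / "raw" / "GSE201257_adata_QC_filtered.h5ad")
adata = ad.io.read_h5ad(DATA_DIR / DATASET / "processed" / "adata_stage2_processed.h5ad")
# Keep the raw counts only for the cells retained in the processed object
adata = adata_raw[adata.obs_names]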

In [18]:
adata

View of AnnData object with n_obs × n_vars = 5139 × 24489
    obs: 'plates', 'devtime', 'location', 'n_genes_by_counts', 'total_counts', 'total_counts_ERCC', 'pct_counts_ERCC'
    var: 'ERCC', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'

In [19]:
sc.pp.filter_genes(adata, min_cells=5)

filtered out 4325 genes that are detected in less than 5 cells


In [20]:
sc.pp.normalize_total(adata, target_sum=1e3)
sc.pp.log1p(adata)

normalizing counts per cell
    finished (0:00:00)
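
To make the normalization step concrete, the following is a minimal NumPy sketch of what per-cell scaling to a fixed total followed by `log1p` computes. The helper `normalize_log1p` and the toy count matrix are hypothetical and only for illustration; the handling of all-zero cells is an assumption, and the notebook itself relies on `sc.pp.normalize_total` and `sc.pp.log1p`.

```python
import numpy as np


def normalize_log1p(counts, target_sum=1e3):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x).

    Hypothetical helper for illustration only.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Assumption: leave all-zero cells unchanged instead of dividing by zero
    totals[totals == 0] = 1.0
    scaled = counts / totals * target_sum
    return np.log1p(scaled)


# Toy matrix with 2 cells (rows) and 3 genes (columns)
toy_counts = np.array([[1, 3, 0], [10, 0, 10]])
toy_norm = normalize_log1p(toy_counts)

# Undoing the log transform recovers per-cell totals equal to target_sum
toy_totals = np.expm1(toy_norm).sum(axis=1)
```

After this transformation every cell has the same total on the linear scale, so differences in sequencing depth between cells no longer dominate the expression values passed to pySCENIC.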


In [21]:
adata = sc.AnnData(adata.X, obs=adata.obs, var=adata.var)
adata.var["Gene"] = adata.var_names
adata.obs["CellID"] = adata.obs_names

In [23]:
if SAVE_DATA:
    adata.write_loom(DATA_DIR / DATASET / "processed" / "scenic" / "adata_stage_2_check.loom")

In [24]:
adata.X = adata.X.toarray().copy()

## SCENIC step

In [25]:
f_loom_path_scenic = DATA_DIR / DATASET / "processed" / "scenic" / "adata_stage_2_check.loom"
f_tfs = "allTFs_mm.txt"
adj_path = DATA_DIR / DATASET / "processed" / "scenic" / "adj_stage_2.csv"

In [None]:
!pyscenic grn {f_loom_path_scenic} {f_tfs} -o {adj_path} --num_workers 30


2025-02-18 10:11:10,542 - pyscenic.cli.pyscenic - INFO - Loading expression matrix.

2025-02-18 10:11:12,711 - pyscenic.cli.pyscenic - INFO - Inferring regulatory networks.
preparing dask client
parsing input
creating dask graph
30 partitions
computing dask graph
This may cause some slowdown.
Consider scattering data ahead of time and using futures.
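
Before moving on to motif pruning, it can help to inspect the adjacency table written by the GRN step. The sketch below assumes the table has the columns `TF`, `target`, and `importance`; the helper `top_targets_per_tf`, the toy values, and the gene names are hypothetical and not taken from the actual output.

```python
import pandas as pd


def top_targets_per_tf(adjacencies, n_top=2):
    """Return the `n_top` highest-importance targets for every TF.

    Hypothetical helper; assumes columns "TF", "target", "importance".
    """
    ranked = adjacencies.sort_values(["TF", "importance"], ascending=[True, False])
    return ranked.groupby("TF", sort=True).head(n_top).reset_index(drop=True)


# Toy adjacency table standing in for the CSV written by the GRN step
toy_adj = pd.DataFrame(
    {
        "TF": ["Sox10", "Sox10", "Sox10", "Foxd3"],
        "target": ["Plp1", "Mpz", "Erbb3", "Sox10"],
        "importance": [5.0, 9.0, 1.0, 3.0],
    }
)
top = top_targets_per_tf(toy_adj, n_top=2)
```

On the real data, the same function could be applied to `pd.read_csv(adj_path)`, provided the column names match the assumption above.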


In [26]:
# ranking databases (download the feather files following the pySCENIC instructions)
f_db_glob = "cisTarget_databases/*feather"
f_db_names = " ".join(glob.glob(f_db_glob))

# motif annotations (download the table following the pySCENIC instructions)
f_motif_path = "cisTarget_databases/motifs-v9-nr.mgi-m0.001-o0.0.tbl"

regulon_path = DATA_DIR / DATASET / "processed" / "scenic" / "stage_2_all_regulons.csv"
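
Because `glob.glob` returns an empty list when nothing matches, a missing download would silently produce an empty `f_db_names` string. A small guard like the one sketched below could catch this before the `ctx` step is launched. The helper `resolve_ranking_databases` is hypothetical, and the demo uses empty placeholder files in a temporary directory rather than real databases.

```python
import glob
import os
import tempfile


def resolve_ranking_databases(pattern):
    """Return the sorted paths matching `pattern`, or fail loudly if none exist.

    Hypothetical helper for illustration only.
    """
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"No ranking databases match {pattern!r}")
    return paths


# Demo with empty placeholder files (not real cisTarget databases)
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("b.rankings.feather", "a.rankings.feather"):
        open(os.path.join(tmp_dir, name), "w").close()
    demo_paths = resolve_ranking_databases(os.path.join(tmp_dir, "*feather"))
    demo_basenames = [os.path.basename(p) for p in demo_paths]
    # Space-separated string in the form expected on the command line
    demo_db_names = " ".join(demo_paths)
```

In the notebook, the equivalent call would be `" ".join(resolve_ranking_databases(f_db_glob))`.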

In [14]:
!pyscenic ctx {adj_path} \
    {f_db_names} \
    --annotations_fname {f_motif_path} \
    --expression_mtx_fname {f_loom_path_scenic} \
    --output {regulon_path} \
    --all_modules \
    --num_workers 30


2025-04-01 22:31:47,930 - pyscenic.cli.pyscenic - INFO - Creating modules.

2025-04-01 22:31:48,691 - pyscenic.cli.pyscenic - INFO - Loading expression matrix.

2025-04-01 22:31:52,850 - pyscenic.utils - INFO - Calculating Pearson correlations.

	Dropout masking is currently set to [False].

2025-04-01 22:32:46,533 - pyscenic.utils - INFO - Creating modules.

2025-04-01 22:34:21,642 - pyscenic.cli.pyscenic - INFO - Loading databases.

2025-04-01 22:34:23,242 - pyscenic.cli.pyscenic - INFO - Calculating regulons.

2025-04-01 22:34:23,242 - pyscenic.prune - INFO - Using 30 workers.

2025-04-01 22:34:31,838 - pyscenic.prune - INFO - Worker mm10__refseq-r80__10kb_up_and_down_tss.mc9nr.genes_vs_motifs.rankings(9): database loaded in memory.

2025-04-01 22:34:32,130 

In [15]:
f_pyscenic_output = DATA_DIR / DATASET / "processed" / "scenic" / "pyscenic_output_stage_2_all_regulons.loom"

In [16]:
!pyscenic aucell \
    {f_loom_path_scenic} \
    {regulon_path} \
    --output {f_pyscenic_output} \
    --num_workers 2


2025-04-01 23:04:07,877 - pyscenic.cli.pyscenic - INFO - Loading expression matrix.

2025-04-01 23:04:16,489 - pyscenic.cli.pyscenic - INFO - Loading gene signatures.
Create regulons from a dataframe of enriched features.
Additional columns saved: []

2025-04-01 23:04:32,049 - pyscenic.cli.pyscenic - INFO - Calculating cellular enrichment.

2025-04-01 23:05:21,417 - pyscenic.cli.pyscenic - INFO - Writing results to file.

In [17]:
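# Open the AUCell output loom file
lf = lp.connect(f_pyscenic_output, mode="r+", validate=False)
# Expression matrix (genes x cells), per-cell regulon AUC scores, and per-gene regulon membership
exprMat = pd.DataFrame(lf[:, :], index=lf.ra.Gene, columns=lf.ca.CellID)
auc_mtx = pd.DataFrame(lf.ca.RegulonsAUC, index=lf.ca.CellID)
regulons = lf.ra.Regulons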

In [18]:
# Convert the structured array (one record per gene) into a regulons x genes table
res = pd.concat([pd.Series(r.tolist(), index=regulons.dtype.names) for r in regulons], axis=1)

In [19]:
res.columns = lf.row_attrs["var_names"]
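
The conversion in the cells above can be hard to follow without seeing the data layout. The sketch below reproduces it on a toy structured array with one record per gene and one field per regulon. The regulon names, gene names, and membership values are made up for illustration, and the `(+)` suffix is only assumed to match pySCENIC's naming.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `lf.ra.Regulons`: one record per gene, one field per regulon
toy_regulons = np.array(
    [(1, 0), (0, 1), (1, 1)],
    dtype=[("Sox10(+)", "i8"), ("Foxd3(+)", "i8")],
)
toy_genes = ["Plp1", "Ets1", "Erbb3"]  # illustrative gene names

# Same construction as in the notebook: one column per gene, one row per regulon
toy_res = pd.concat(
    [pd.Series(r.tolist(), index=toy_regulons.dtype.names) for r in toy_regulons],
    axis=1,
)
toy_res.columns = toy_genes

# Read the target genes of each regulon off the binary membership matrix
toy_targets = {
    regulon: toy_res.columns[toy_res.loc[regulon] == 1].tolist()
    for regulon in toy_res.index
}
```

The resulting table has regulons as rows and genes as columns, with 1 marking a gene as a member of the regulon, which is the layout written to CSV in the final step.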

## Save data

In [20]:
if SAVE_DATA:
    res.to_csv(DATA_DIR / DATASET / "processed" / "regulon_mat_stage_2_all_regulons.csv")