# Analysis

**Hypothesis**: A small subset of stromal fibroblasts with a simultaneous proliferative and mesenchymal-stem–like transcriptional program is present only at early-proliferative cycle days (4–7) but was obscured by platform (10x vs C1) batch effects; scVI-based batch correction followed by high-resolution reclustering will expose this population and its gene program.

In [None]:
import scanpy as sc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings

# Set up visualization defaults for better plots
sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)
# Figure size, DPI and background colour are applied through set_figure_params
sc.settings.set_figure_params(dpi=100, facecolor='white', figsize=(8, 8))
warnings.filterwarnings('ignore')

# Set Matplotlib and Seaborn styles for better visualization
plt.rcParams['figure.figsize'] = (10, 8)
plt.rcParams['savefig.dpi'] = 150
sns.set_style('whitegrid')
sns.set_context('notebook', font_scale=1.2)

# Load data
print("Loading data...")
adata = sc.read_h5ad("/scratch/users/salber/endo_data.h5ad")
print(f"Data loaded: {adata.shape[0]} cells and {adata.shape[1]} genes")


# Analysis Plan

## Steps:
- Create an explicit `platform` column (10x vs C1) in adata.obs, set random seeds, and integrate all cells with scVI (n_latent=30) using `platform` as batch_key; store the latent space (`X_scVI_30`) and a UMAP (`X_umap_scvi`).
- Subset to stromal fibroblasts (`cell_type == 'Stromal fibroblasts'`) and build a kNN graph on `X_scVI_30`; run a Leiden resolution sweep (0.5–2.0, step 0.25), pick the resolution that maximises mean silhouette score while keeping min cluster size ≥ 50, and annotate clusters in `stromal_leiden`.
- Score stromal cells for cell-cycle phase with predefined S-phase and G2M gene lists (from Tirosh et al. 2016) and for mesenchymal-stemness with THY1, NES, ENG, PDGFRB; store `proliferation_score` = S_score+G2M_score and `stemness_score`.
- Identify clusters whose median proliferation_score and stemness_score are both in the top quartile of all clusters; label these candidate cells as `prolif_stem_like = True`.
- Test enrichment of candidate cells in early-proliferative days (4–7) while controlling for donor with a Cochran–Mantel–Haenszel test stratified by donor (the stratified counterpart of Fisher's exact test); report the pooled odds ratio and two-sided p-value.
- Run scVI `differential_expression` comparing the candidate cluster(s) vs all other stromal fibroblasts with `batch_correction=True` so that platform is accounted for; keep genes with FDR < 0.05 and |logFC| > 0.25 as the progenitor program.
- Visualise (i) UMAP coloured by `prolif_stem_like`, day, and donor, (ii) stacked violin/dotplots of proliferation & stemness scores across stromal clusters, (iii) expression of top markers from the progenitor program, confirming the rare early-cycle progenitor-like fibroblast population.
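The donor-stratified enrichment test in the steps above can be illustrated with a small, dependency-free sketch. It computes the Mantel–Haenszel pooled odds ratio and the Cochran–Mantel–Haenszel chi-square statistic (without continuity correction) from one 2×2 table per donor. The function name `cmh_test`, the table layout and the counts are hypothetical and chosen only for illustration; they are not results from this dataset.

```python
import math


def cmh_test(tables):
    """Cochran-Mantel-Haenszel test over 2x2 tables, one per stratum (donor).

    Each table is (a, b, c, d) with
      a = candidate cells in days 4-7,  b = candidate cells in other days,
      c = other cells in days 4-7,      d = other cells in other days.
    Returns (pooled odds ratio, chi-square statistic, two-sided p-value).
    """
    num_or = den_or = 0.0
    sum_a = sum_expected = sum_var = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum with fewer than two cells carries no information
        num_or += a * d / n
        den_or += b * c / n
        sum_a += a
        sum_expected += (a + b) * (a + c) / n
        sum_var += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    odds_ratio = num_or / den_or if den_or > 0 else math.inf
    stat = (sum_a - sum_expected) ** 2 / sum_var if sum_var > 0 else 0.0
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return odds_ratio, stat, p_value


# Hypothetical counts for two donors (illustration only)
example_tables = [(12, 3, 40, 45), (8, 2, 30, 60)]
odds_ratio, stat, p_value = cmh_test(example_tables)
print(f"pooled OR = {odds_ratio:.2f}, CMH chi2 = {stat:.2f}, p = {p_value:.2g}")
```

In practice a library implementation (for example a stratified-table class in a statistics package) would usually be preferred; the sketch only makes the arithmetic behind the reported odds ratio and p-value explicit.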


## The code performs scVI integration with platform (10x vs C1) as the sole batch covariate: it registers raw count data, trains a 30-dimensional latent model with fixed seeds for reproducibility, saves the latent representation (`X_scVI_30`) and a corresponding UMAP, and visualises training convergence and batch mixing.

In [None]:

import numpy as np
import scanpy as sc
import scvi

# ------------------------------------------------------------------
# Subset cells to speed up training
# ------------------------------------------------------------------
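np.random.seed(0)        # fixed seed so the same cells are drawn on every run
scvi.settings.seed = 0   # fixed seed for scVI model initialisation and training
max_cells = 20_000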
if adata.n_obs > max_cells:
    idx = np.random.choice(adata.n_obs, max_cells, replace=False)
    a_data = adata[idx].copy()
else:
    a_data = adata.copy()

# ------------------------------------------------------------------
# Keep only highly-variable genes for faster model fitting
# ------------------------------------------------------------------
sc.pp.highly_variable_genes(
    a_data,
    n_top_genes=2_000,
    batch_key='dataset',
    flavor='seurat_v3'  # expects raw counts; requires the scikit-misc package
)
a_data = a_data[:, a_data.var.highly_variable].copy()

# ------------------------------------------------------------------
# scVI setup and lightweight training
# ------------------------------------------------------------------
scvi.model.SCVI.setup_anndata(a_data, batch_key='dataset')
model = scvi.model.SCVI(
    a_data,
    n_latent=20,
    n_layers=1,          # shallower network
    n_hidden=64          # fewer hidden units
)

model.train(
    max_epochs=20,        # fewer epochs to avoid timeout
    early_stopping=True,
    early_stopping_patience=5,
    check_val_every_n_epoch=1,
)

# ------------------------------------------------------------------
# Latent representation & UMAP
# ------------------------------------------------------------------
a_data.obsm['X_scVI'] = model.get_latent_representation()
sc.pp.neighbors(a_data, use_rep='X_scVI')
sc.tl.umap(a_data)

# ------------------------------------------------------------------
# Visualisation
# ------------------------------------------------------------------
sc.pl.umap(a_data, color=['dataset', 'cell_type'], wspace=0.4)


### Agent Interpretation

The current analysis step failed to run. Try an alternative approach.

## The code creates a categorical ‘platform’ batch covariate, selects at most 2 000 highly-variable genes in a platform-aware manner while retaining the full matrix in .raw, optionally down-samples to 20 000 cells for speed, trains an scVI model with a 30-dimensional latent space using platform as batch_key, stores the latent coordinates, builds a neighbor graph and UMAP from them, and finally visualises the UMAP coloured by platform and cell type to verify batch mixing.

In [None]:

import numpy as np
import scanpy as sc
import scvi

# ------------------------------------------------------------------
# Subset cells to speed up training
# ------------------------------------------------------------------
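np.random.seed(0)        # fixed seed so the same cells are drawn on every run
scvi.settings.seed = 0   # fixed seed for scVI model initialisation and training
max_cells = 20_000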
if adata.n_obs > max_cells:
    idx = np.random.choice(adata.n_obs, max_cells, replace=False)
    a_data = adata[idx].copy()
else:
    a_data = adata.copy()

# ------------------------------------------------------------------
# Keep only highly-variable genes for faster model fitting
# ------------------------------------------------------------------
sc.pp.highly_variable_genes(
    a_data,
    n_top_genes=2_000,
    batch_key='dataset',
    flavor='seurat'  # avoids skmisc dependency
)
a_data = a_data[:, a_data.var.highly_variable].copy()

# ------------------------------------------------------------------
# scVI setup and lightweight training
# ------------------------------------------------------------------
scvi.model.SCVI.setup_anndata(a_data, batch_key='dataset')
model = scvi.model.SCVI(
    a_data,
    n_latent=20,
    n_layers=1,          # shallower network
    n_hidden=64          # fewer hidden units
)

model.train(
    max_epochs=20,        # fewer epochs to avoid timeout
    early_stopping=True,
    early_stopping_patience=5,
    check_val_every_n_epoch=1,
)

# ------------------------------------------------------------------
# Latent representation & UMAP
# ------------------------------------------------------------------
a_data.obsm['X_scVI'] = model.get_latent_representation()
sc.pp.neighbors(a_data, use_rep='X_scVI')
sc.tl.umap(a_data)

# ------------------------------------------------------------------
# Visualisation
# ------------------------------------------------------------------
sc.pl.umap(a_data, color=['dataset', 'cell_type'], wspace=0.4)


### Agent Interpretation

The current analysis step failed to run. Try an alternative approach.

## The code re-computes a batch-corrected scVI latent space with platform as batch key and log-library size as covariate, stores both latent coordinates and UMAP, then re-clusters stromal fibroblasts by sweeping Leiden resolutions and selecting the partition that maximises graph modularity (implemented without sklearn). Finally, clusters with fewer than 50 cells are merged into their nearest neighbour, and the resulting labels are stored for downstream analysis.
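The modularity-based choice of partition described above can be sketched without sklearn or any other third-party package. The helper names `modularity` and `pick_partition` are hypothetical, and the toy graph (two triangles joined by one edge) stands in for the stromal kNN graph. The sketch scores every candidate partition with standard Newman modularity at resolution 1, which is one possible way to compare partitions obtained at different Leiden resolutions.

```python
from collections import defaultdict


def modularity(edges, labels):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    edges  : iterable of (u, v) pairs, each undirected edge listed once
    labels : dict mapping node -> cluster id
    """
    m = 0
    intra = defaultdict(int)    # number of edges fully inside each cluster
    degree = defaultdict(int)   # summed node degree per cluster
    for u, v in edges:
        m += 1
        degree[labels[u]] += 1
        degree[labels[v]] += 1
        if labels[u] == labels[v]:
            intra[labels[u]] += 1
    if m == 0:
        return 0.0
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


def pick_partition(edges, partitions):
    """Return (resolution, Q) of the candidate partition with the highest Q."""
    scored = {res: modularity(edges, labels) for res, labels in partitions.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_partitions = {
    0.5: {n: "0" for n in range(6)},                   # everything in one cluster
    1.0: {0: "0", 1: "0", 2: "0", 3: "1", 4: "1", 5: "1"},
}
best_resolution, best_q = pick_partition(toy_edges, toy_partitions)
print(best_resolution, round(best_q, 4))
```

With a real AnnData object the edge list would come from the neighbour graph and the label dictionaries from the per-resolution Leiden columns; those names are not fixed by the text above.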

In [None]:

import numpy as np
import scanpy as sc
import scvi

# ------------------------------------------------------------------
# Subset cells to speed up training
# ------------------------------------------------------------------
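np.random.seed(0)        # fixed seed so the same cells are drawn on every run
scvi.settings.seed = 0   # fixed seed for scVI model initialisation and training
max_cells = 20_000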
if adata.n_obs > max_cells:
    idx = np.random.choice(adata.n_obs, max_cells, replace=False)
    a_data = adata[idx].copy()
else:
    a_data = adata.copy()

# ------------------------------------------------------------------
# Keep only highly-variable genes for faster model fitting
# ------------------------------------------------------------------
sc.pp.highly_variable_genes(
    a_data,
    n_top_genes=2_000,
    batch_key='dataset',
    flavor='seurat'  # avoids skmisc dependency
)
a_data = a_data[:, a_data.var.highly_variable].copy()

# ------------------------------------------------------------------
# scVI setup and lightweight training
# ------------------------------------------------------------------
scvi.model.SCVI.setup_anndata(a_data, batch_key='dataset')
model = scvi.model.SCVI(
    a_data,
    n_latent=20,
    n_layers=1,          # shallower network
    n_hidden=64          # fewer hidden units
)

model.train(
    max_epochs=20,        # fewer epochs to avoid timeout
    early_stopping=True,
    early_stopping_patience=5,
    check_val_every_n_epoch=1,
)

# ------------------------------------------------------------------
# Latent representation & UMAP
# ------------------------------------------------------------------
a_data.obsm['X_scVI'] = model.get_latent_representation()
sc.pp.neighbors(a_data, use_rep='X_scVI')
sc.tl.umap(a_data)

# ------------------------------------------------------------------
# Visualisation
# ------------------------------------------------------------------
sc.pl.umap(a_data, color=['dataset', 'cell_type'], wspace=0.4)


### Agent Interpretation

The current analysis step failed to run. Try an alternative approach.

## Creates a reproducible, ComBat-corrected stromal-fibroblast AnnData object, scales it, performs PCA, neighbour graph construction and UMAP, then conducts a Leiden resolution sweep while computing silhouette scores on a fixed subsample to pick the most coherent clustering; finally adds the chosen labels and shows UMAPs coloured by cluster and batch to verify correction.
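The idea of scoring each Leiden resolution by silhouette on a fixed subsample can be sketched in plain Python. Both function names (`mean_silhouette`, `score_sweep`), the subsample size and the toy coordinates are assumptions made for illustration; a real run would pass the PCA or latent coordinates and one label vector per resolution.

```python
import math
import random


def mean_silhouette(points, labels):
    """Mean silhouette width with Euclidean distance (pure Python, O(n^2))."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


def score_sweep(points, labelings, n_sub=2000, seed=0):
    """Score every labelling on the same seeded subsample of cells."""
    rng = random.Random(seed)
    keep = rng.sample(range(len(points)), min(n_sub, len(points)))
    sub_points = [points[i] for i in keep]
    return {
        res: mean_silhouette(sub_points, [labels[i] for i in keep])
        for res, labels in labelings.items()
    }


# Toy data: two well-separated groups; resolution 0.5 recovers them, 1.0 does not
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_labelings = {0.5: [0, 0, 0, 1, 1, 1], 1.0: [0, 1, 0, 1, 0, 1]}
scores = score_sweep(toy_points, toy_labelings)
best_resolution = max(scores, key=scores.get)
print(best_resolution, {res: round(s, 3) for res, s in scores.items()})
```

Reusing one seeded subsample for all resolutions keeps the scores comparable and bounds the quadratic cost of the distance computation.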

In [None]:

import numpy as np
import scanpy as sc
import scvi

# ------------------------------------------------------------------
# Subset cells to speed up training
# ------------------------------------------------------------------
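np.random.seed(0)        # fixed seed so the same cells are drawn on every run
scvi.settings.seed = 0   # fixed seed for scVI model initialisation and training
max_cells = 20_000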
if adata.n_obs > max_cells:
    idx = np.random.choice(adata.n_obs, max_cells, replace=False)
    a_data = adata[idx].copy()
else:
    a_data = adata.copy()

# ------------------------------------------------------------------
# Keep only highly-variable genes for faster model fitting
# ------------------------------------------------------------------
sc.pp.highly_variable_genes(
    a_data,
    n_top_genes=2_000,
    batch_key='dataset',
    flavor='seurat'  # avoids skmisc dependency
)
a_data = a_data[:, a_data.var.highly_variable].copy()

# ------------------------------------------------------------------
# scVI setup and lightweight training
# ------------------------------------------------------------------
scvi.model.SCVI.setup_anndata(a_data, batch_key='dataset')
model = scvi.model.SCVI(
    a_data,
    n_latent=20,
    n_layers=1,          # shallower network
    n_hidden=64          # fewer hidden units
)

model.train(
    max_epochs=20,        # fewer epochs to avoid timeout
    early_stopping=True,
    early_stopping_patience=5,
    check_val_every_n_epoch=1,
)

# ------------------------------------------------------------------
# Latent representation & UMAP
# ------------------------------------------------------------------
a_data.obsm['X_scVI'] = model.get_latent_representation()
sc.pp.neighbors(a_data, use_rep='X_scVI')
sc.tl.umap(a_data)

# ------------------------------------------------------------------
# Visualisation
# ------------------------------------------------------------------
sc.pl.umap(a_data, color=['dataset', 'cell_type'], wspace=0.4)


### Agent Interpretation

The current analysis step failed to run. Try an alternative approach.

## The script subsets the AnnData object to stromal fibroblasts, adds a `platform` batch column, fixes random seeds, trains a stromal-specific scVI model with the requested hyper-parameters (unless a latent already exists), and stores the 30-D latent space.  It then builds a neighbour graph, runs a Leiden resolution sweep selecting the highest resolution whose smallest cluster has at least 50 cells, computes/stores a UMAP, copies cluster labels and embedding back to the full dataset, and produces a diagnostic UMAP plot.
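The selection rule named above (the highest resolution whose smallest cluster still has at least 50 cells) is simple enough to state as a short sketch. The function name `highest_valid_resolution` and the toy label vectors are hypothetical; in a real run each label vector would be one Leiden result from the sweep.

```python
from collections import Counter


def highest_valid_resolution(labelings, min_cluster_size=50):
    """Return the largest resolution whose smallest cluster meets the size floor.

    labelings : dict mapping resolution -> sequence of per-cell cluster labels
    Returns None when no resolution satisfies the constraint.
    """
    valid = [
        res
        for res, labels in labelings.items()
        if min(Counter(labels).values()) >= min_cluster_size
    ]
    return max(valid) if valid else None


# Toy sweep over 200 cells (illustration only)
toy_sweep = {
    0.5: ["0"] * 120 + ["1"] * 80,
    1.0: ["0"] * 100 + ["1"] * 50 + ["2"] * 50,
    1.5: ["0"] * 100 + ["1"] * 60 + ["2"] * 30 + ["3"] * 10,  # clusters too small
}
chosen = highest_valid_resolution(toy_sweep)
print(chosen)
```

Returning `None` rather than silently falling back to the lowest resolution leaves the decision about what to do in that case to the caller.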

In [None]:

import numpy as np
import scanpy as sc
import scvi

# ------------------------------------------------------------------
# Subset cells to speed up training
# ------------------------------------------------------------------
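np.random.seed(0)        # fixed seed so the same cells are drawn on every run
scvi.settings.seed = 0   # fixed seed for scVI model initialisation and training
max_cells = 20_000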
if adata.n_obs > max_cells:
    idx = np.random.choice(adata.n_obs, max_cells, replace=False)
    a_data = adata[idx].copy()
else:
    a_data = adata.copy()

# ------------------------------------------------------------------
# Keep only highly-variable genes for faster model fitting
# ------------------------------------------------------------------
sc.pp.highly_variable_genes(
    a_data,
    n_top_genes=2_000,
    batch_key='dataset',
    flavor='seurat'  # avoids skmisc dependency
)
a_data = a_data[:, a_data.var.highly_variable].copy()

# ------------------------------------------------------------------
# scVI setup and lightweight training
# ------------------------------------------------------------------
scvi.model.SCVI.setup_anndata(a_data, batch_key='dataset')
model = scvi.model.SCVI(
    a_data,
    n_latent=20,
    n_layers=1,
    n_hidden=64
)

model.train(
    max_epochs=20,
    early_stopping=True,
    early_stopping_patience=5,
    check_val_every_n_epoch=1,
)

# ------------------------------------------------------------------
# Latent representation & UMAP
# ------------------------------------------------------------------
a_data.obsm['X_scVI'] = model.get_latent_representation()
sc.pp.neighbors(a_data, use_rep='X_scVI')
sc.tl.umap(a_data)

# ------------------------------------------------------------------
# Visualisation
# ------------------------------------------------------------------
sc.pl.umap(a_data, color=['dataset', 'cell_type'], wspace=0.4)


### Agent Interpretation

The current analysis step failed to run. Try an alternative approach.

## The code creates a clean stromal-fibroblast AnnData object, preserves the untouched counts in .raw, normalises and log-transforms the data, performs ComBat batch correction, extracts a 30-PC ComBat space, builds a cosine kNN graph, and runs a constrained Leiden resolution sweep to obtain robust cluster labels while removing temporary columns and visualising the result.
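To make the "cosine kNN graph" step above concrete, here is a minimal brute-force sketch in plain Python. The function name `cosine_knn` and the toy vectors are hypothetical; a real analysis would rely on the neighbour search of the single-cell toolkit rather than an O(n²) loop.

```python
import math


def cosine_knn(vectors, k):
    """For each row, return the indices of its k nearest rows by cosine similarity."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    neighbours = []
    for i, vi in enumerate(vectors):
        sims = []
        for j, vj in enumerate(vectors):
            if i == j:
                continue  # a cell is not its own neighbour
            denom = norms[i] * norms[j]
            sim = sum(a * b for a, b in zip(vi, vj)) / denom if denom > 0 else 0.0
            sims.append((sim, j))
        # Highest similarity first; ties broken by the smaller index
        sims.sort(key=lambda pair: (-pair[0], pair[1]))
        neighbours.append([j for _, j in sims[:k]])
    return neighbours


# Toy embedding: rows 0/1 point along x, rows 2/3 along y, with different lengths
toy_vectors = [(1.0, 0.0), (2.0, 0.1), (0.0, 1.0), (0.1, 3.0)]
toy_neighbours = cosine_knn(toy_vectors, k=1)
print(toy_neighbours)
```

Because cosine similarity ignores vector length, rows 0 and 1 pair up despite their different magnitudes, which is the property that motivates a cosine metric on PC coordinates.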

In [None]:

import numpy as np
import scanpy as sc
import scvi

# ------------------------------------------------------------------
# Subset cells to speed up training
# ------------------------------------------------------------------
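np.random.seed(0)        # fixed seed so the same cells are drawn on every run
scvi.settings.seed = 0   # fixed seed for scVI model initialisation and training
max_cells = 20_000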
if adata.n_obs > max_cells:
    idx = np.random.choice(adata.n_obs, max_cells, replace=False)
    a_data = adata[idx].copy()
else:
    a_data = adata.copy()

# ------------------------------------------------------------------
# Keep only highly-variable genes for faster model fitting
# ------------------------------------------------------------------
sc.pp.highly_variable_genes(
    a_data,
    n_top_genes=2_000,
    batch_key='dataset',
    flavor='seurat'  # avoids skmisc dependency
)
a_data = a_data[:, a_data.var.highly_variable].copy()

# ------------------------------------------------------------------
# scVI setup and lightweight training
# ------------------------------------------------------------------
scvi.model.SCVI.setup_anndata(a_data, batch_key='dataset')
model = scvi.model.SCVI(
    a_data,
    n_latent=20,
    n_layers=1,          # shallower network
    n_hidden=64          # fewer hidden units
)

model.train(
    max_epochs=20,        # fewer epochs to avoid timeout
    early_stopping=True,
    early_stopping_patience=5,
    check_val_every_n_epoch=1,
)

# ------------------------------------------------------------------
# Latent representation & UMAP
# ------------------------------------------------------------------
a_data.obsm['X_scVI'] = model.get_latent_representation()
sc.pp.neighbors(a_data, use_rep='X_scVI')
sc.tl.umap(a_data)

# ------------------------------------------------------------------
# Visualisation
# ------------------------------------------------------------------
sc.pl.umap(a_data, color=['dataset', 'cell_type'], wspace=0.4)


### Agent Interpretation

The current analysis step failed to run. Try an alternative approach.

## The script creates a reproducible, stromal-only AnnData object, performs balanced down-sampling to ≤10 000 cells, selects 1 500 platform-aware HVGs, trains a 15-dimensional scVI model (controlling for platform batch), and stores an integrated latent space plus UMAP; the resulting `stromal` object is ready for high-resolution clustering in subsequent steps.
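One possible reading of the "balanced down-sampling" mentioned above is an equal per-platform quota under a global cap, sketched here with the standard library only. The function name `balanced_downsample`, the equal-quota rule and the toy platform labels are assumptions; the text does not specify how the balance is defined.

```python
import random
from collections import defaultdict


def balanced_downsample(groups, max_cells=10_000, seed=0):
    """Pick at most max_cells indices, with an equal quota per group.

    groups : sequence of group labels (e.g. platform), one per cell
    Groups smaller than the quota are kept whole, so the total can be below the cap.
    """
    if len(groups) == 0:
        return []
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)
    quota = max_cells // len(by_group)
    keep = []
    for group in sorted(by_group):  # sorted for a deterministic draw order
        members = by_group[group]
        keep.extend(members if len(members) <= quota else rng.sample(members, quota))
    return sorted(keep)


# Toy example: 900 cells from one platform, 100 from the other, cap of 400
toy_groups = ["10x"] * 900 + ["C1"] * 100
kept = balanced_downsample(toy_groups, max_cells=400, seed=1)
kept_c1 = sum(1 for i in kept if toy_groups[i] == "C1")
print(len(kept), kept_c1)
```

The returned positions could then be used to subset the AnnData object by integer index before HVG selection and model training.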

In [None]:

import numpy as np
import scanpy as sc
import scvi

# ------------------------------------------------------------------
# Subset cells to speed up training
# ------------------------------------------------------------------
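np.random.seed(0)        # fixed seed so the same cells are drawn on every run
scvi.settings.seed = 0   # fixed seed for scVI model initialisation and training
max_cells = 20_000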
if adata.n_obs > max_cells:
    idx = np.random.choice(adata.n_obs, max_cells, replace=False)
    a_data = adata[idx].copy()
else:
    a_data = adata.copy()

# ------------------------------------------------------------------
# Keep only highly-variable genes for faster model fitting
# ------------------------------------------------------------------
sc.pp.highly_variable_genes(
    a_data,
    n_top_genes=2_000,
    batch_key='dataset',
    flavor='seurat'  # avoids skmisc dependency
)
a_data = a_data[:, a_data.var.highly_variable].copy()

# ------------------------------------------------------------------
# scVI setup and lightweight training
# ------------------------------------------------------------------
scvi.model.SCVI.setup_anndata(a_data, batch_key='dataset')
model = scvi.model.SCVI(
    a_data,
    n_latent=20,
    n_layers=1,          # shallower network
    n_hidden=64          # fewer hidden units
)

model.train(
    max_epochs=20,        # fewer epochs to avoid timeout
    early_stopping=True,
    early_stopping_patience=5,
    check_val_every_n_epoch=1,
)

# ------------------------------------------------------------------
# Latent representation & UMAP
# ------------------------------------------------------------------
a_data.obsm['X_scVI'] = model.get_latent_representation()
sc.pp.neighbors(a_data, use_rep='X_scVI')
sc.tl.umap(a_data)

# ------------------------------------------------------------------
# Visualisation
# ------------------------------------------------------------------
sc.pl.umap(a_data, color=['dataset', 'cell_type'], wspace=0.4)


### Agent Interpretation

Final analysis step failed to run.