# Setup

# MissionBio Tapestri Single-Cell Analysis with EspressoPro

This notebook demonstrates a comprehensive single-cell protein analysis pipeline using MissionBio Tapestri data. The analysis leverages EspressoPro for automated cell type annotation and includes advanced refinement techniques.

## Key Features:
- **Data**: Multi-sample PBMC dataset (HD01 and HD02 samples)
- **Analysis**: Complete protein-based single-cell characterization
- **Methods**: UMAP dimensionality reduction, graph-based clustering, automated annotation
- **Refinement**: Small cluster handling, mixed cluster resolution, KNN consensus, mast cell detection

## Workflow Overview:
1. Data loading and setup
2. Sample-wise analysis (normalization, dimensionality reduction, clustering)
3. EspressoPro automated cell type prediction
4. Multi-step annotation refinement
5. Final visualization and validation

## Loading modules

In [1]:
import missionbio.mosaic as ms
import os
import espressopro as ep
import anndata as ad
import scanpy as sc
import pandas as pd
import numpy as np
import random
import plotly.graph_objects as go



In [2]:
import warnings
import pandas as pd

# Silence pandas PerformanceWarning messages for the whole session
warnings.filterwarnings("ignore", category=pd.errors.PerformanceWarning)

## Setting seed

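To fix Python's hash seed for the whole conda environment, run the following in a shell and reactivate the environment before starting the kernel:

conda env config vars set PYTHONHASHSEED=0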

In [3]:
os.environ['PYTHONHASHSEED'] = '0'
random.seed(42)
np.random.seed(42)

In [4]:
import sys

def ensure_pythonhashseed(seed=0):
    """Restart the interpreter with PYTHONHASHSEED set, unless it already matches."""
    current_seed = os.environ.get("PYTHONHASHSEED")

    seed = str(seed)
    if current_seed is None or current_seed != seed:
        print(f'Setting PYTHONHASHSEED="{seed}"')
        os.environ["PYTHONHASHSEED"] = seed
        # Re-execute the current process so the new hash seed takes effect
        os.execl(sys.executable, sys.executable, *sys.argv)
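Python reads `PYTHONHASHSEED` only when the interpreter starts, which is why the function above re-executes the process instead of just updating `os.environ`. The sketch below illustrates this by hashing a string in fresh child interpreters launched with a fixed seed. The helper name `string_hash_in_child` and the example string are hypothetical and not part of the notebook.

```python
import os
import subprocess
import sys

def string_hash_in_child(seed, text="tapestri"):
    """Return hash(text) computed by a new interpreter started with PYTHONHASHSEED=seed.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())

# Two separate interpreters started with the same seed agree on the hash.
first = string_hash_in_child(0)
second = string_hash_in_child(0)
print(first == second)
```

If the two values ever differed, string hashing (and therefore set and dict ordering effects that depend on it) would not be repeatable between runs.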

In [5]:
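import random

# Sanity check: with the seed fixed above, this value should be identical on every run.
# The variable is named check_value to avoid shadowing the built-in hash().
check_value = random.getrandbits(128)

print("hash value: %032x" % check_value)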

hash value: bdd640fb06671ad11c80317fa3b1799d


# Loading data

In [6]:
PBMC_samples = ms.load_example_dataset(path="Multisample PBMC", single=False)


## Data Loading

The cell above loads the MissionBio example multi-sample PBMC dataset, which contains protein expression measurements from two healthy donor samples (HD01 and HD02) acquired on the Tapestri platform.

# PBMC - HD01 Analysis

In [7]:
PBMC_HD01 = PBMC_samples.samples[0]

In [8]:
PBMC_HD01

<missionbio.mosaic.sample.Sample at 0x10b972d40>

## Normalisation

Performing two essential preprocessing steps:
1. **Normalization**: Corrects for technical variations and library size differences between cells
2. **Scaling**: Standardizes protein expression values to enable proper downstream analysis

These steps ensure that subsequent dimensionality reduction and clustering are not biased by technical artifacts.
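To make the two steps concrete, here is a minimal sketch on a toy count matrix. It is not EspressoPro's code: the run log reports MissionBio NSP normalization, whereas this example swaps in a centered log-ratio (CLR) transform as a simple stand-in, followed by per-protein z-scoring. The function names and the toy counts are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev

def clr_normalize(counts, pseudocount=1):
    """Centered log-ratio of one cell's counts (stand-in for NSP, not NSP itself)."""
    logs = [math.log(c + pseudocount) for c in counts]
    centre = mean(logs)
    return [value - centre for value in logs]

def zscore_columns(matrix):
    """Scale each column (protein) to mean 0 and unit variance across cells."""
    scaled_columns = []
    for column in zip(*matrix):
        mu, sd = mean(column), pstdev(column)
        # Constant columns carry no information; map them to 0 instead of dividing by 0.
        scaled_columns.append([(v - mu) / sd if sd > 0 else 0.0 for v in column])
    return [list(row) for row in zip(*scaled_columns)]

# Toy data: 3 cells x 3 proteins, with very different total counts per cell.
raw_counts = [
    [120, 3, 45],
    [300, 10, 80],
    [60, 1, 30],
]
normalized = [clr_normalize(cell) for cell in raw_counts]
scaled = zscore_columns(normalized)
```

After normalization each cell's values are centered on zero regardless of its total count, and after scaling each protein contributes on a comparable scale to PCA and UMAP.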

In [9]:
ep.Normalise_protein_data(PBMC_HD01)
ep.Scale_protein_data(PBMC_HD01)

Fitting GMMs:   0%|          | 0/1943 [00:00<?, ?it/s]

[Normalise_protein_data] Applied MissionBio NSP normalization
[Scale_protein_data] Scaled 'Normalized_reads' -> saved as 'Scaled_reads'


## Dimensionality reduction

In [10]:
PBMC_HD01.protein.run_pca(attribute='Normalized_reads', components=8, show_plot=False, random_state=42, svd_solver='randomized')

UMAP (Uniform Manifold Approximation and Projection) reduces the high-dimensional protein data to 2D for visualization while preserving local neighborhood structure. 

**Parameters used:**
- `n_neighbors=50`: Larger neighborhood for global structure preservation
- `min_dist=0.1`: Allows for tight clustering of similar cells  
- `spread=8`: Broader distribution of points in embedding space
- `random_state=42`: Ensures reproducible results

In [11]:
PBMC_HD01.protein.run_umap(attribute='pca', random_state=42, n_neighbors=50, min_dist=0.1, spread=8, n_components=2)


'force_all_finite' was renamed to 'ensure_all_finite' in 1.6 and will be removed in 1.8.


divide by zero encountered in power


n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
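Since HD02 is processed later with the same PCA and UMAP settings, one option is to keep the keyword arguments in a single place and reuse them for both samples. The sketch below only repackages the `run_pca` and `run_umap` calls shown in this notebook; the helper name `reduce_dimensions` and the two dictionary names are hypothetical.

```python
# Keyword arguments copied from the run_pca / run_umap cells in this notebook.
PCA_PARAMS = dict(
    attribute="Normalized_reads",
    components=8,
    show_plot=False,
    random_state=42,
    svd_solver="randomized",
)
UMAP_PARAMS = dict(
    attribute="pca",
    random_state=42,
    n_neighbors=50,
    min_dist=0.1,
    spread=8,
    n_components=2,
)

def reduce_dimensions(sample):
    """Run PCA, then UMAP, on sample.protein with the shared settings.

    Hypothetical convenience wrapper; it adds no behaviour beyond the two calls.
    """
    sample.protein.run_pca(**PCA_PARAMS)
    sample.protein.run_umap(**UMAP_PARAMS)
    return sample
```

With this in place, `reduce_dimensions(PBMC_HD01)` and `reduce_dimensions(PBMC_HD02)` are guaranteed to use identical parameters, which makes the two embeddings easier to compare.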



## Unsupervised clustering

In [12]:
PBMC_HD01.protein.cluster(attribute='umap', method='graph-community', k=5, random_state=42) 

  0%|          | 0/9715 [00:00<?, ?it/s]

Number of clusters found: 73.
Modularity: 0.969
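Modularity scores how cleanly a partition separates a graph: values near 1 mean that most edges fall inside clusters instead of between them. The sketch below computes the standard Newman modularity for an unweighted, undirected graph. It assumes the value printed above is a comparable quantity, which the output itself does not confirm; the function name and toy graph are illustrative.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Newman modularity Q = sum over communities of (e_c / m - (d_c / 2m)^2).

    edges        : list of (u, v) pairs, each undirected edge listed once
    community_of : dict mapping node -> community id
    """
    m = len(edges)
    internal = defaultdict(int)    # e_c: edges with both ends in community c
    degree_sum = defaultdict(int)  # d_c: total degree of nodes in community c
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )

# Two disconnected triangles, each assigned to its own community.
triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity(triangles, split))  # 0.5
```

Putting all six nodes into one community gives a modularity of 0, so the score rewards partitions that match the graph's structure.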


In [13]:
fig = PBMC_HD01.protein.scatterplot(attribute='umap',colorby='label')
go.Figure(fig)

## EspressoPro predictions and annotation

EspressoPro uses machine learning models trained on reference atlases (the run log below lists Hao, Triana, Zhang, and Luecken) to automatically predict cell types from protein expression profiles.

**Process:**
1. `generate_predictions()`: Creates probability scores for each potential cell type
2. `annotate_data()`: Assigns final cell type labels based on highest confidence predictions

**Output annotations:**
- `Simplified.Celltype`: Broad cell type categories (e.g., T cells, B cells, Monocytes)
- `Detailed.Celltype`: Specific cell subtypes (e.g., CD4+ T cells, CD8+ T cells, NK cells)
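The second step, turning per-class scores into one label per cell, can be sketched as a simple arg-max. This is an illustration of the general idea and not EspressoPro's implementation; the `min_confidence` and `fallback` parameters, the function name, and the example scores are all hypothetical.

```python
def assign_labels(probabilities, min_confidence=0.0, fallback="Unassigned"):
    """Pick the highest-scoring cell type for each cell.

    probabilities  : list of dicts mapping cell type -> score, one dict per cell
    min_confidence : hypothetical cut-off below which the fallback label is used
    """
    labels = []
    for scores in probabilities:
        best = max(scores, key=scores.get)
        labels.append(best if scores[best] >= min_confidence else fallback)
    return labels

# Made-up scores for two cells.
example_scores = [
    {"T cells": 0.8, "B cells": 0.1, "Monocytes": 0.1},
    {"T cells": 0.2, "B cells": 0.7, "Monocytes": 0.1},
]
print(assign_labels(example_scores))
# ['T cells', 'B cells']
print(assign_labels(example_scores, min_confidence=0.75))
# ['T cells', 'Unassigned']
```

A confidence cut-off like this is one common way to avoid forcing a label onto ambiguous cells; whether EspressoPro applies one is not stated in this notebook.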

In [14]:
# Generate per-class prediction scores, then assign cell type labels
PBMC_HD01 = ep.generate_predictions(obj=PBMC_HD01)
PBMC_HD01 = ep.annotate_data(obj=PBMC_HD01)


[generate_predictions] Using default models path: /Users/kgurashi/anaconda3/envs/mosaic_2/lib/python3.10/site-packages/espressopro/data/Pre_trained_models/TotalSeqD_Heme_Oncology_CAT399906
[generate_predictions] Using default data path: /Users/kgurashi/anaconda3/envs/mosaic_2/lib/python3.10/site-packages/espressopro/data/Pre_trained_models/TotalSeqD_Heme_Oncology_CAT399906
[generate_predictions] Ensuring models are available...
[generate_predictions] Using all atlases: Hao, Triana, Zhang, Luecken
[generate_predictions] query_df shape: 1943 cells x 45 features
[generate_predictions] first 12 columns: ['CD10', 'CD117', 'CD11b', 'CD11c', 'CD123', 'CD13', 'CD138', 'CD14', 'CD141', 'CD16', 'CD163', 'CD19']
                        CD10     CD117     CD11b     CD11c     CD123      CD13     CD138      CD14     CD141      CD16     CD163      CD19
Normalized_reads                                                                                                                          
AACAACCTACA


Transforming to str index.





In [15]:
fig = PBMC_HD01.protein.scatterplot(attribute='umap',colorby='Simplified.Celltype')
go.Figure(fig)

ValueError: Simplified.Celltype not found in the row attributes, col attributes or the layers
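The error above means the annotation was not present on the protein assay when the plot ran. A small guard can check for a field before plotting. The sketch assumes the assay exposes dict-like `row_attrs`, `col_attrs`, and `layers` (the three places named in the error message); the helper name `has_annotation` is hypothetical, and a stand-in object is used so the example runs without Mosaic.

```python
from types import SimpleNamespace

def has_annotation(assay, name):
    """Return True if `name` exists in the assay's row attributes, column attributes or layers.

    Assumes dict-like `row_attrs`, `col_attrs` and `layers`; missing stores count as empty.
    """
    stores = ("row_attrs", "col_attrs", "layers")
    return any(name in getattr(assay, store, {}) for store in stores)

# Stand-in for a protein assay that has clusters and a UMAP but no cell type labels yet.
demo_assay = SimpleNamespace(
    row_attrs={"label": ["1", "2"], "umap": [[0.0, 1.0], [1.0, 0.0]]},
    col_attrs={"id": ["CD10", "CD117"]},
    layers={"Normalized_reads": [[0.1, 0.2], [0.3, 0.4]]},
)
print(has_annotation(demo_assay, "Simplified.Celltype"))  # False
print(has_annotation(demo_assay, "label"))                # True
```

In the notebook, a call such as `has_annotation(PBMC_HD01.protein, 'Simplified.Celltype')` before the scatterplot would show whether the annotation step actually wrote the field.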

In [None]:
fig = PBMC_HD01.protein.scatterplot(attribute='umap',colorby='Detailed.Celltype')
go.Figure(fig)

## Mark rare celltypes as "small"

Annotation groups with fewer than 3 cells (`min_cells=3`) are often artifacts or noise in single-cell data. The `mark_small_clusters()` function relabels these rare groups as "small" to prevent over-interpretation.

**Benefits:**
- Reduces noise in downstream analysis
- Focuses attention on biologically meaningful cell populations
- Improves visualization clarity
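The relabelling idea can be sketched with a label count. This is an illustration of the concept and not the code behind `mark_small_clusters()`; the function name `mark_small` and the example labels are assumptions.

```python
from collections import Counter

def mark_small(labels, min_cells=3, small_label="small"):
    """Replace any label carried by fewer than `min_cells` cells with `small_label`."""
    counts = Counter(labels)
    return [
        label if counts[label] >= min_cells else small_label
        for label in labels
    ]

# Four T cells and two cells of a rare label.
example_labels = ["T cells"] * 4 + ["Mast"] * 2
print(mark_small(example_labels))
# ['T cells', 'T cells', 'T cells', 'T cells', 'small', 'small']
```

Raising `min_cells` makes the filter stricter, at the cost of hiding genuinely rare populations, so the threshold is worth checking against the expected biology.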

In [None]:
PBMC_HD01 = ep.mark_small_clusters(PBMC_HD01, "Simplified.Celltype", min_cells=3)
PBMC_HD01 = ep.mark_small_clusters(PBMC_HD01, "Detailed.Celltype", min_cells=3)

In [None]:
fig = PBMC_HD01.protein.scatterplot(attribute='umap',colorby='Simplified.Celltype')
go.Figure(fig)

In [None]:
PBMC_HD01.protein.signaturemap('Normalized_reads',
                           splitby='Simplified.Celltype')

In [None]:
fig = PBMC_HD01.protein.scatterplot(attribute='umap',colorby='Detailed.Celltype')
go.Figure(fig)

In [None]:
PBMC_HD01.protein.signaturemap('Normalized_reads',
                           splitby='Detailed.Celltype')

## Mark clusters with mixed annotation as "mixed"

In [None]:
PBMC_HD01 = ep.mark_mixed_clusters(PBMC_HD01, "Simplified.Celltype")
PBMC_HD01 = ep.mark_mixed_clusters(PBMC_HD01, "Detailed.Celltype")

In [None]:
fig = PBMC_HD01.protein.scatterplot(attribute='umap',colorby='Simplified.Celltype')
go.Figure(fig)

In [None]:
PBMC_HD01.protein.signaturemap('Normalized_reads',
                           splitby='Simplified.Celltype')

In [None]:
fig = PBMC_HD01.protein.scatterplot(attribute='umap',colorby='Detailed.Celltype')
go.Figure(fig)

In [None]:
PBMC_HD01.protein.signaturemap('Normalized_reads',
                           splitby='Detailed.Celltype')

## Refine annotation using unsupervised data

This advanced refinement step improves annotation accuracy by leveraging local neighborhood information in the embedding space.

**Method:**
- Examines each cell's nearest neighbors in UMAP space
- Applies consensus voting based on neighbor cell type assignments
- Corrects isolated mislabeled cells by considering their local context
- Creates the `Detailed.Celltype_refined_consensus` field with improved annotations

**Advantage:** Reduces single-cell annotation errors by incorporating neighborhood information from the UMAP embedding.
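A brute-force version of neighbor voting is shown below as a rough sketch. It is not the implementation of `refine_labels_by_knn_consensus()`; the parameters `k` and `min_agreement`, the function name, and the toy coordinates are hypothetical.

```python
import math
from collections import Counter

def knn_consensus(coords, labels, k=3, min_agreement=0.6):
    """Relabel each cell with its neighbours' majority label when agreement is high enough.

    Votes always use the original labels, so the result does not depend on cell order.
    """
    refined = []
    for i, point in enumerate(coords):
        others = [j for j in range(len(coords)) if j != i]
        neighbours = sorted(others, key=lambda j: math.dist(point, coords[j]))[:k]
        if not neighbours:
            refined.append(labels[i])  # a lone cell keeps its label
            continue
        winner, count = Counter(labels[j] for j in neighbours).most_common(1)[0]
        agreement = count / len(neighbours)
        refined.append(winner if agreement >= min_agreement else labels[i])
    return refined

# Two well-separated groups in 2D; the third cell is mislabelled as "B".
coords = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = ["T", "T", "B", "T", "B", "B", "B", "B"]
print(knn_consensus(coords, labels))
# ['T', 'T', 'T', 'T', 'B', 'B', 'B', 'B']
```

The isolated "B" inside the first group is outvoted by its three nearest neighbours, while cells whose neighbours already agree with them are left unchanged.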

In [None]:
PBMC_HD01 = ep.refine_labels_by_knn_consensus(PBMC_HD01, label_col='Detailed.Celltype')

In [None]:
PBMC_HD01.protein.signaturemap('Normalized_reads',
                           splitby='Detailed.Celltype_refined_consensus')

## Expand celltypes to whole clusters

In [None]:
PBMC_HD01 = ep.suggest_cluster_celltype_identity(
    sample=PBMC_HD01,
    annotation="Detailed.Celltype_refined_consensus", rewrite=True)
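One simple way to extend cell-level labels to whole clusters is a majority vote within each cluster. The sketch below shows that idea only; it is not claimed to be what `suggest_cluster_celltype_identity()` does internally, and the function name and toy inputs are assumptions.

```python
from collections import Counter, defaultdict

def expand_labels_to_clusters(cluster_ids, cell_labels):
    """Give every cell the most common label found in its cluster."""
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, cell_labels):
        votes[cluster][label] += 1
    winner = {
        cluster: counts.most_common(1)[0][0]
        for cluster, counts in votes.items()
    }
    return [winner[cluster] for cluster in cluster_ids]

# Cluster 1 holds two "T" cells and one "B" cell; cluster 2 is uniformly "NK".
clusters = [1, 1, 1, 2, 2]
cell_labels = ["T", "T", "B", "NK", "NK"]
print(expand_labels_to_clusters(clusters, cell_labels))
# ['T', 'T', 'T', 'NK', 'NK']
```

A cluster-level label is easier to read on a UMAP, but it hides within-cluster disagreement, so it is worth comparing against the per-cell annotation.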

In [None]:
fig = PBMC_HD01.protein.scatterplot(attribute='umap',colorby='annotated_clusters')
go.Figure(fig)

In [None]:
PBMC_HD01.protein.signaturemap('Normalized_reads',
                           splitby='annotated_clusters')

## Add mast cells

In [None]:
PBMC_HD01 = ep.add_mast_annotation(PBMC_HD01, field_out='Detailed.Celltype_refined_consensus')

[Mast] threshold 1.636 → 20 cells


## Final visualization of refined cell type annotations

This UMAP plot displays the final refined cell type annotations after all processing steps, including:
- Initial EspressoPro predictions
- Flagging of small clusters (< 3 cells) as "small"
- Flagging of mixed clusters as "mixed"
- KNN consensus refinement
- Addition of mast cell annotations

The `Detailed.Celltype_refined_consensus` field holds the final refined cell type assignment for each cell.

In [None]:
fig = PBMC_HD01.protein.scatterplot(attribute='umap',colorby='Detailed.Celltype_refined_consensus')
go.Figure(fig)

# PBMC - HD02 Analysis

In [None]:
PBMC_HD02 = PBMC_samples.samples[1]

In [None]:
PBMC_HD02

<missionbio.mosaic.sample.Sample at 0x329d1ce50>

## Normalisation

In [None]:
ep.Normalise_protein_data(PBMC_HD02)
ep.Scale_protein_data(PBMC_HD02)

Fitting GMMs:   0%|          | 0/1808 [00:00<?, ?it/s]

[Normalise_protein_data] Applied MissionBio NSP normalization
[Scale_protein_data] Scaled 'Normalized_reads' -> saved as 'Scaled_reads'


## Dimensionality reduction

In [None]:
PBMC_HD02.protein.run_pca(attribute='Normalized_reads', components=8, show_plot=False, random_state=42, svd_solver='randomized')

In [None]:
PBMC_HD02.protein.run_umap(attribute='pca', random_state=42, n_neighbors=50, min_dist=0.1, spread=8, n_components=2)


'force_all_finite' was renamed to 'ensure_all_finite' in 1.6 and will be removed in 1.8.


divide by zero encountered in power


n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.



## Unsupervised clustering

In [None]:
PBMC_HD02.protein.cluster(attribute='umap', method='graph-community', k=5, random_state=42) 

  0%|          | 0/9040 [00:00<?, ?it/s]

Number of clusters found: 73.
Modularity: 0.970


In [None]:
fig = PBMC_HD02.protein.scatterplot(attribute='umap',colorby='label')
go.Figure(fig)

## EspressoPro predictions and annotation

In [None]:
# Generate per-class prediction scores, then assign cell type labels
PBMC_HD02 = ep.generate_predictions(obj=PBMC_HD02)
PBMC_HD02 = ep.annotate_data(obj=PBMC_HD02)


[generate_predictions] Using default models path: /Users/kgurashi/anaconda3/envs/mosaic_2/lib/python3.10/site-packages/espressopro/data/Pre_trained_models/TotalSeqD_Heme_Oncology_CAT399906
[generate_predictions] Using default data path: /Users/kgurashi/anaconda3/envs/mosaic_2/lib/python3.10/site-packages/espressopro/data/Pre_trained_models/TotalSeqD_Heme_Oncology_CAT399906
[generate_predictions] Ensuring models are available...
[generate_predictions] Using all atlases: Hao, Triana, Zhang, Luecken
[generate_predictions] query_df shape: 1808 cells x 45 features
[generate_predictions] first 12 columns: ['CD10', 'CD117', 'CD11b', 'CD11c', 'CD123', 'CD13', 'CD138', 'CD14', 'CD141', 'CD16', 'CD163', 'CD19']
                        CD10     CD117     CD11b     CD11c     CD123      CD13     CD138      CD14     CD141      CD16     CD163      CD19
Normalized_reads                                                                                                                          
AACAACCTACA


Transforming to str index.





In [None]:
fig = PBMC_HD02.protein.scatterplot(attribute='umap',colorby='Simplified.Celltype')
go.Figure(fig)

In [None]:
fig = PBMC_HD02.protein.scatterplot(attribute='umap',colorby='Detailed.Celltype')
go.Figure(fig)

## Mark rare celltypes as "small"

In [None]:
PBMC_HD02 = ep.mark_small_clusters(PBMC_HD02, "Simplified.Celltype", min_cells=3)
PBMC_HD02 = ep.mark_small_clusters(PBMC_HD02, "Detailed.Celltype", min_cells=3)

In [None]:
fig = PBMC_HD02.protein.scatterplot(attribute='umap',colorby='Simplified.Celltype')
go.Figure(fig)

In [None]:
PBMC_HD02.protein.signaturemap('Normalized_reads',
                           splitby='Simplified.Celltype')

In [None]:
fig = PBMC_HD02.protein.scatterplot(attribute='umap',colorby='Detailed.Celltype')
go.Figure(fig)

In [None]:
PBMC_HD02.protein.signaturemap('Normalized_reads',
                           splitby='Detailed.Celltype')

## Mark clusters with mixed annotation as "mixed"

In [None]:
PBMC_HD02 = ep.mark_mixed_clusters(PBMC_HD02, "Simplified.Celltype")
PBMC_HD02 = ep.mark_mixed_clusters(PBMC_HD02, "Detailed.Celltype")

In [None]:
fig = PBMC_HD02.protein.scatterplot(attribute='umap',colorby='Simplified.Celltype')
go.Figure(fig)

In [None]:
PBMC_HD02.protein.signaturemap('Normalized_reads',
                           splitby='Simplified.Celltype')

In [None]:
fig = PBMC_HD02.protein.scatterplot(attribute='umap',colorby='Detailed.Celltype')
go.Figure(fig)

In [None]:
PBMC_HD02.protein.signaturemap('Normalized_reads',
                           splitby='Detailed.Celltype')

## Refine annotation using unsupervised data

In [None]:
PBMC_HD02 = ep.refine_labels_by_knn_consensus(PBMC_HD02, label_col='Detailed.Celltype')

In [None]:
PBMC_HD02.protein.signaturemap('Normalized_reads',
                           splitby='Detailed.Celltype_refined_consensus')

## Expand celltypes to whole clusters

In [None]:
PBMC_HD02 = ep.suggest_cluster_celltype_identity(
    sample=PBMC_HD02,
    annotation="Detailed.Celltype_refined_consensus", rewrite=True)

In [None]:
fig = PBMC_HD02.protein.scatterplot(attribute='umap',colorby='annotated_clusters')
go.Figure(fig)

In [None]:
PBMC_HD02.protein.signaturemap('Normalized_reads',
                           splitby='annotated_clusters')

## Add mast cells

In [None]:
PBMC_HD02 = ep.add_mast_annotation(PBMC_HD02, field_out='Detailed.Celltype_refined_consensus')

[Mast] threshold 1.841 → 19 cells


## Final visualization of refined cell type annotations

This UMAP plot displays the final refined cell type annotations after all processing steps, including:
- Initial EspressoPro predictions
- Flagging of small clusters (< 3 cells) as "small"
- Flagging of mixed clusters as "mixed"
- KNN consensus refinement
- Addition of mast cell annotations

The `Detailed.Celltype_refined_consensus` field holds the final refined cell type assignment for each cell.

In [None]:
# Final UMAP visualization showing refined cell type annotations
# This plot represents the culmination of the entire analysis pipeline
fig = PBMC_HD02.protein.scatterplot(attribute='umap',colorby='Detailed.Celltype_refined_consensus')
go.Figure(fig)

In [None]:
PBMC_HD02.protein.signaturemap('Normalized_reads',
                           splitby='Detailed.Celltype_refined_consensus')

In [None]:
fig = PBMC_HD02.protein.heatmap(attribute='Normalized_reads', splitby='Detailed.Celltype_refined_consensus')
go.Figure(fig)

# Analysis Summary

## Completed Analysis Pipeline

This notebook demonstrates a single-cell protein analysis workflow on two PBMC samples (HD01 and HD02). Both samples go through the same processing steps:

### Key Analysis Steps:
1. **Data preprocessing** - Normalization and scaling of protein expression
2. **Dimensionality reduction** - PCA followed by UMAP for visualization  
3. **Unsupervised clustering** - Graph-community detection for initial cell grouping
4. **Automated annotation** - EspressoPro machine learning predictions
5. **Quality control** - Detection of small clusters and mixed populations
6. **Refinement** - KNN consensus and cluster-based improvements
7. **Specialized detection** - Mast cell identification
8. **Validation** - Multiple visualization methods (UMAP, heatmaps, signature maps)
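Because both samples go through the same annotation and refinement calls, those calls can be collected into one function. The sketch below simply chains the EspressoPro calls in the order and with the arguments used in this notebook; the wrapper name `annotate_sample` is hypothetical, and the `ep` module is passed in as an argument so the function has no hidden dependencies.

```python
def annotate_sample(sample, ep, min_cells=3):
    """Apply this notebook's annotation and refinement calls to one sample, in order.

    `ep` is the imported espressopro module. Hypothetical wrapper for illustration.
    """
    columns = ("Simplified.Celltype", "Detailed.Celltype")
    refined = "Detailed.Celltype_refined_consensus"

    # Automated annotation
    sample = ep.generate_predictions(obj=sample)
    sample = ep.annotate_data(obj=sample)

    # Flag rare and mixed groups at both annotation levels
    for column in columns:
        sample = ep.mark_small_clusters(sample, column, min_cells=min_cells)
    for column in columns:
        sample = ep.mark_mixed_clusters(sample, column)

    # Neighbour-based refinement, cluster-level labels, then mast cells
    sample = ep.refine_labels_by_knn_consensus(sample, label_col="Detailed.Celltype")
    sample = ep.suggest_cluster_celltype_identity(
        sample=sample, annotation=refined, rewrite=True
    )
    sample = ep.add_mast_annotation(sample, field_out=refined)
    return sample
```

In the notebook this would be called as `PBMC_HD02 = annotate_sample(PBMC_HD02, ep)`. The intermediate plots are left out on purpose, so they can still be drawn between steps when inspecting a sample interactively.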

### Final Results:
- **Cell type annotations** at two levels of detail (`Simplified.Celltype` and `Detailed.Celltype`), plus a refined consensus field
- **Cell populations** cross-checked with UMAP plots, signature maps, and a heatmap
- **Visualizations** that support biological interpretation
- **Seeded workflow** that can be rerun on similar datasets

### Applications:
This analysis framework can be applied to:
- Clinical sample characterization
- Treatment response studies  
- Cell atlas construction
- Biomarker discovery
- Quality control for single-cell experiments