# Seurat Clustering in Human PBMC Single Cells

Following the R workflow at https://satijalab.org/seurat/articles/pbmc3k_tutorial.html, this notebook prepares the data for processing in R and then evaluates the results.

We first load the reduced 68k PBMC dataset that ships with `scanpy`. It has already been processed with the standard `scanpy` workflow: read counts are normalised and the genes are subset to the highly variable ones.
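As a rough illustration of the normalisation and highly-variable-gene subsetting mentioned above, here is a minimal NumPy sketch. It is not the `scanpy` implementation: the helper names `normalize_log1p` and `top_variable_genes` are hypothetical, the target of 10,000 counts per cell is an assumption, and plain variance ranking stands in for `scanpy`'s dispersion-based gene selection.

```python
import numpy as np


def normalize_log1p(counts: np.ndarray, target_sum: float = 1e4) -> np.ndarray:
    """Scale each cell (row) to `target_sum` total counts, then apply log1p."""
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    return np.log1p(counts / totals * target_sum)


def top_variable_genes(logged: np.ndarray, n_top: int) -> np.ndarray:
    """Return the sorted column indices of the `n_top` highest-variance genes."""
    variances = logged.var(axis=0)
    return np.sort(np.argsort(variances)[::-1][:n_top])


# Toy count matrix (cells x genes) standing in for real read counts.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=2.0, size=(50, 20)).astype(float)

logged = normalize_log1p(counts)
hvg_idx = top_variable_genes(logged, n_top=5)
subset = logged[:, hvg_idx]
print(subset.shape)
```

The reduced dataset used in this notebook already has these steps applied, so the sketch only shows what "normalised and subset" means in practice.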

In [None]:
%reload_ext autoreload
%autoreload 2
from __future__ import annotations
import pandas as pd
import scanpy as sc
import os
import warnings
from metrics import *  # local helper module; provides calc_stats used below
import scipy.io
import scipy.sparse


#!pip install ipywidgets --upgrade
os.environ["LOKY_MAX_CPU_COUNT"] = '4'
# Set all figure parameters in one call; separate calls reset the omitted ones to their defaults.
sc.set_figure_params(dpi=200, frameon=False, figsize=(4, 4))
sc.logging.print_versions()
# Filter known warnings emitted by the current version of matplotlib.
warnings.filterwarnings("ignore", message=".*Parameters 'cmap' will be ignored.*", category=UserWarning)
warnings.filterwarnings("ignore", message="Tight layout not applied.*", category=UserWarning)

-----
anndata     0.10.8
scanpy      1.10.2
-----
PIL                 10.3.0
activity            NA
asttokens           NA
cffi                1.17.0
colorama            0.4.6
comm                0.2.2
cycler              0.12.1
cython_runtime      NA
dateutil            2.9.0.post0
debugpy             1.8.2
decorator           5.1.1
decoupler           1.7.0
deprecated          1.2.14
dill                0.3.8
exceptiongroup      1.2.1
executing           2.0.1
future              1.0.0
graphtools          1.5.3
h5py                3.11.0
igraph              0.11.6
importlib_resources NA
ipykernel           6.29.4
ipywidgets          8.1.3
jedi                0.19.1
joblib              1.4.2
kiwisolver          1.4.5
legacy_api_wrap     NA
llvmlite            0.43.0
louvain             0.8.2
magic               3.0.0
matplotlib          3.9.0
matplotlib_inline   0.1.7
metrics             NA
mpl_toolkits        NA
natsort             8.4.0
nt                  NA
numba               0.6

The original dataset comes from a single donor and was released by 10x Genomics as Fresh 68k PBMCs (Donor A):

- ~68,000 cells detected
- Sequenced on Illumina NextSeq 500 High Output with ~20,000 reads per cell
- 98bp read1 (transcript), 8bp I5 sample barcode, 14bp I7 GemCode barcode and 5bp read2 (UMI)
- Analysis run with --cells=24000

In [2]:
adata = sc.datasets.pbmc68k_reduced()
adata

AnnData object with n_obs × n_vars = 700 × 765
    obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain'
    var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

In [6]:
# Map each bulk cell-type label to an integer class id.
label_map = {
    'CD14+ Monocyte': 0,
    'Dendritic': 1,
    'CD56+ NK': 2,
    'CD4+/CD25 T Reg': 3,
    'CD19+ B': 4,
    'CD8+ Cytotoxic T': 5,
    'CD4+/CD45RO+ Memory': 6,
    'CD8+/CD45RA+ Naive Cytotoxic': 7,
    'CD4+/CD45RA+/CD25- Naive T': 8,
    'CD34+': 9,
}
adata.obs['labels'] = adata.obs.bulk_labels.map(label_map)
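Because `Series.map` silently yields `NaN` for any value missing from the dictionary, it can be worth checking that every label was mapped. The sketch below uses a toy series in place of `adata.obs.bulk_labels`; the `'Unknown'` label and the shortened mapping are made up for illustration.

```python
import pandas as pd

# Toy stand-in for adata.obs.bulk_labels; 'Unknown' is deliberately absent from the mapping.
bulk_labels = pd.Series(['CD19+ B', 'Dendritic', 'Unknown'], name='bulk_labels')
label_map = {'CD19+ B': 4, 'Dendritic': 1}

labels = bulk_labels.map(label_map)

# Any label that maps to NaN was not covered by the dictionary.
unmapped = bulk_labels[labels.isna()].unique().tolist()
print(unmapped)  # ['Unknown']
```

An empty list means the mapping covers every label present in the data.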

In [None]:
# Step 1: Export barcodes (cell names) to barcodes.tsv.
os.makedirs('./data/hg19b', exist_ok=True)  # make sure the output directory exists
barcodes = pd.DataFrame(adata.obs_names)
barcodes.to_csv('./data/hg19b/barcodes.tsv', sep='\t', header=False, index=False)

# Step 2: Export gene names to genes.tsv.
# Take the names from adata.raw so they match the rows of the exported matrix.
genes = pd.DataFrame(adata.raw.var_names)
genes.to_csv('./data/hg19b/genes.tsv', sep='\t', header=False, index=False)

# Step 3: Export the expression matrix to matrix.mtx.
# Transpose to genes x cells and convert to a sparse matrix if it is not one already.
raw_data = adata.raw.X.T
if not scipy.sparse.issparse(raw_data):
    raw_data = scipy.sparse.csr_matrix(raw_data)

# Verify that the matrix dimensions match the exported gene and barcode lists.
num_genes, num_cells = raw_data.shape
assert num_genes == len(genes) and num_cells == len(barcodes)

print(f"Number of genes: {num_genes}")
print(f"Number of cells: {num_cells}")

# Save the sparse matrix in Matrix Market format.
scipy.io.mmwrite('./data/hg19b/matrix.mtx', raw_data)

Number of genes: 765
Number of cells: 700
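Before handing the files to R, the export can be checked with a round trip: write the three files, read them back, and confirm that the matrix is genes × cells and unchanged. The sketch below does this on a toy matrix in a temporary directory; the cell and gene names are placeholders, and nothing here depends on what `seurat.R` does with the files.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import scipy.io
import scipy.sparse

# Toy cells x genes matrix standing in for adata.raw.X (3 cells, 4 genes).
cells = ['cell_a', 'cell_b', 'cell_c']
genes = ['g1', 'g2', 'g3', 'g4']
x = scipy.sparse.csr_matrix(np.array([[0, 1, 0, 2], [3, 0, 0, 0], [0, 0, 4, 5]]))

with tempfile.TemporaryDirectory() as out_dir:
    pd.DataFrame(cells).to_csv(os.path.join(out_dir, 'barcodes.tsv'), sep='\t', header=False, index=False)
    pd.DataFrame(genes).to_csv(os.path.join(out_dir, 'genes.tsv'), sep='\t', header=False, index=False)
    scipy.io.mmwrite(os.path.join(out_dir, 'matrix.mtx'), x.T)  # genes x cells

    # Read everything back and compare against what was written.
    matrix = scipy.sparse.csr_matrix(scipy.io.mmread(os.path.join(out_dir, 'matrix.mtx')))
    n_barcodes = len(pd.read_csv(os.path.join(out_dir, 'barcodes.tsv'), sep='\t', header=None))
    n_genes = len(pd.read_csv(os.path.join(out_dir, 'genes.tsv'), sep='\t', header=None))

# The matrix must have one row per gene, one column per barcode, and identical entries.
round_trip_ok = matrix.shape == (n_genes, n_barcodes) and (matrix != x.T).nnz == 0
print(matrix.shape, round_trip_ok)
```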


### Run the R script `seurat.R` to calculate cluster assignments and save the results as `seurat_clusters.csv`.

In [11]:
# Step 1: Read the Seurat CSV file into a DataFrame (df1).
df1 = pd.read_csv('./data/seurat_clusters.csv')

# Step 2: Build a DataFrame of true labels (true_labels) from the adata object, keyed by cell barcode.
true_labels = adata.obs['labels'].reset_index()
true_labels.columns = ['Cell', 'True_Label']

# Merge the dataframes on the cell identifiers.
merged_df = pd.merge(df1, true_labels, on='Cell')

# Display the merged dataframe.
print(merged_df.head())

               Cell  Cluster True_Label
0  AAAGCCTGGCTAAC-1        1          0
1  AAATTCGATGCACA-1        1          1
2  AACACGTGGTCTTT-1        4          2
3  AAGTGCACGTGCTA-1        8          3
4  ACACGAACGGAGTG-1        5          1
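An inner merge silently drops cells whose barcodes appear in only one of the two tables, so it can help to look for such cells before trusting the merged result. The following sketch uses toy frames in place of `df1` and `true_labels`; the barcodes `c1` to `c4` are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for seurat_clusters.csv (df1) and the labels derived from adata.obs.
df1 = pd.DataFrame({'Cell': ['c1', 'c2', 'c3'], 'Cluster': [1, 1, 4]})
true_labels = pd.DataFrame({'Cell': ['c1', 'c2', 'c4'], 'True_Label': [0, 1, 2]})

# An outer merge with indicator=True records which table each row came from.
check = pd.merge(df1, true_labels, on='Cell', how='outer', indicator=True)
unmatched = check.loc[check['_merge'] != 'both', ['Cell', '_merge']]
print(unmatched)

# validate='one_to_one' raises pandas.errors.MergeError if a barcode is duplicated.
merged_df = pd.merge(df1, true_labels, on='Cell', validate='one_to_one')
print(len(merged_df))
```

If `unmatched` is empty and the merged frame has as many rows as `adata` has cells, every cell received both a cluster and a true label.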


In [None]:
# Load the PCA embeddings used for metric calculation.
pca_embedings = pd.read_csv('./data/pca_embedings.csv', index_col=0)

# Use the true labels from the merged DataFrame.
true_labels = merged_df.True_Label

calc_stats(pca_embedings, true_labels, merged_df.Cluster)

Silhouette Score: 0.10989487711324464
Calinski-Harabasz Index: 51.29868134246531
Special accuracy: 0.6285714285714286
completeness score: 0.5930755615265298
homogeneity_score: 0.6787977727839148
adjusted_mutual_info_score: 0.6214094122697238
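The local `metrics` module is not shown here, so the sketch below is only a guess at how figures like these could be computed with scikit-learn and SciPy. The function names `matched_accuracy` and `clustering_report` are hypothetical, and two points are assumptions: that the silhouette and Calinski-Harabasz scores are computed against the predicted clusters, and that "Special accuracy" means accuracy after an optimal one-to-one matching of clusters to labels.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import (
    adjusted_mutual_info_score,
    calinski_harabasz_score,
    completeness_score,
    homogeneity_score,
    silhouette_score,
)


def matched_accuracy(true_labels, cluster_labels) -> float:
    """Accuracy after the best one-to-one matching of clusters to labels."""
    true_ids, true_idx = np.unique(true_labels, return_inverse=True)
    clus_ids, clus_idx = np.unique(cluster_labels, return_inverse=True)
    # contingency[i, j] counts cells in cluster i that carry true label j.
    contingency = np.zeros((len(clus_ids), len(true_ids)), dtype=int)
    np.add.at(contingency, (clus_idx, true_idx), 1)
    # The Hungarian algorithm picks the matching that maximises agreement.
    rows, cols = linear_sum_assignment(contingency, maximize=True)
    return contingency[rows, cols].sum() / len(true_idx)


def clustering_report(embedding, true_labels, cluster_labels) -> dict:
    """Collect internal (embedding-based) and external (label-based) scores."""
    return {
        'silhouette': silhouette_score(embedding, cluster_labels),
        'calinski_harabasz': calinski_harabasz_score(embedding, cluster_labels),
        'matched_accuracy': matched_accuracy(true_labels, cluster_labels),
        'completeness': completeness_score(true_labels, cluster_labels),
        'homogeneity': homogeneity_score(true_labels, cluster_labels),
        'adjusted_mutual_info': adjusted_mutual_info_score(true_labels, cluster_labels),
    }


# Toy data: two well-separated blobs, clustered perfectly but with different ids.
rng = np.random.default_rng(0)
embedding = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
truth = np.array([0] * 20 + [1] * 20)
clusters = np.array([7] * 20 + [3] * 20)

report = clustering_report(embedding, truth, clusters)
for name, value in report.items():
    print(f"{name}: {value}")
```

With real data, the rows of the embedding must be in the same cell order as the label and cluster vectors; otherwise the embedding-based scores are meaningless.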
