### PBMC clustering (global view)

**Objective**
Cluster PBMC cells to obtain a global map of major immune cell populations.

*What major immune cell populations are present, and how are they transcriptionally organized?*

**Input**
- Normalized object from Notebook 03

**Methods**
- PCA on highly variable genes (HVGs)
- k-nearest-neighbor (kNN) graph construction on the top PCs
- Leiden clustering on the kNN graph
- UMAP visualization

**Output**
- UMAP plot with clusters
- Cluster assignments per cell

Dimensionality reduction followed by graph-based clustering separates the cells into transcriptionally distinct groups, giving a global map of immune cell diversity.
These unsupervised clusters are the starting point for downstream cell type annotation and biological interpretation.

This notebook builds a global PBMC cell map using HVG-based PCA, a kNN neighbor graph, Leiden clustering, and UMAP.
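
To make the "kNN neighbor graph" step concrete, here is a toy sketch in plain Python. It is only an illustration of the idea (each cell is linked to its `k` closest cells in PCA space), not how scanpy computes the graph; `knn_graph` and `toy_pcs` are made-up names for this example.

```python
import math

def knn_graph(points, k):
    """Return {index: [indices of the k nearest other points]} by Euclidean distance."""
    graph = {}
    for i, p in enumerate(points):
        # Distance from point i to every other point (self excluded).
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()  # nearest first; ties are broken by index
        graph[i] = [j for _, j in dists[:k]]
    return graph

# Toy "PCA coordinates": two well-separated groups of three cells each.
toy_pcs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
toy_graph = knn_graph(toy_pcs, k=2)
print(toy_graph)
```

In this toy case every cell's neighbors fall inside its own group, which is the structure a graph-based method such as Leiden then groups into clusters.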

In [2]:
# Setup and load HVG object 
import os
import scanpy as sc

PROJECT_ROOT = "/mnt/c/Users/yasmi/OneDrive/Desktop/Mini-Projets/scRNA_Influenza_Patients"
os.chdir(PROJECT_ROOT)

adata = sc.read_h5ad("results/adata_norm_hvg.h5ad")
print("condition?", "condition" in adata.obs.columns)
print("HVG?", "highly_variable" in adata.var.columns)

condition? True
HVG? True


  utils.warn_names_duplicates("obs")
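
The truncated warning above appears to come from AnnData's check for duplicated observation (cell) names. A minimal sketch, assuming the barcodes are available as a plain list (for example `list(adata.obs_names)`), shows how to see which names repeat; the barcodes below are invented for illustration.

```python
from collections import Counter

def duplicated_names(names):
    """Return {name: count} for every name that appears more than once."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}

# Hypothetical barcodes; in the notebook this would be list(adata.obs_names).
toy_barcodes = ["AAAC-1", "AAAG-1", "AAAC-1", "AAAT-1", "AAAG-1", "AAAC-1"]
print(duplicated_names(toy_barcodes))
```

If duplicates are confirmed, AnnData provides `adata.obs_names_make_unique()` to append suffixes; whether that is appropriate here depends on how the samples were merged.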


In [3]:
# Safety guard (optional)
if "highly_variable" not in adata.var.columns:
    print("WARNING: no HVG flag found, using all genes")
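
Note that the guard above only prints a message; any later cell that indexes `adata.var["highly_variable"]` would still fail if the column is missing. One way to make the fallback real is to build the gene mask in a single place. The sketch below shows that logic on plain lists; `hvg_mask` is a hypothetical helper, not part of scanpy.

```python
def hvg_mask(gene_names, hvg_flags=None):
    """Boolean mask over genes: the HVG flags if given, otherwise all True."""
    if hvg_flags is None:
        print("WARNING: no HVG flag found, using all genes")
        return [True] * len(gene_names)
    if len(hvg_flags) != len(gene_names):
        raise ValueError("hvg_flags must have one entry per gene")
    return [bool(flag) for flag in hvg_flags]

# Toy usage with made-up gene names.
toy_genes = ["CD3E", "MS4A1", "LYZ"]
print(hvg_mask(toy_genes, [True, False, True]))  # flags present
print(hvg_mask(toy_genes))                       # flags missing: keep all genes
```

In the notebook, the flags would be passed as `adata.var["highly_variable"]` when that column exists and `None` otherwise.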

In [None]:
print(adata.obs.columns)

In [None]:
import scanpy as sc

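# Use HVGs only. This slice is a view; AnnData turns it into a real copy
# once PCA writes its results, so the copy is delayed rather than avoided.
adata_hvg = adata[:, adata.var["highly_variable"]]

# PCA on unscaled data (scaling is skipped here to avoid a MemoryError)
sc.tl.pca(adata_hvg, svd_solver="arpack")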

# Neighbors + Leiden + UMAP
sc.pp.neighbors(adata_hvg, n_neighbors=15, n_pcs=30)
sc.tl.leiden(adata_hvg, resolution=0.5)
sc.tl.umap(adata_hvg)

# Save clustered object (keeps condition/sample_id in obs)
adata_hvg.write("results/adata_cond_clustered.h5ad")

print("Saved:", adata_hvg.shape)
print("Has condition?", "condition" in adata_hvg.obs.columns)
print("Clusters:", adata_hvg.obs["leiden"].nunique())
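
The cell above fixes `n_pcs=30`. A rough way to sanity-check that choice is to look at the cumulative variance explained by the leading PCs. The sketch below assumes the per-PC ratios are available as a list (scanpy stores them in `adata.uns["pca"]["variance_ratio"]` after `sc.tl.pca`); the helper name, the toy ratios, and the 0.85 target are illustrative assumptions.

```python
from itertools import accumulate

def n_pcs_for_variance(variance_ratio, target=0.85):
    """Smallest number of leading PCs whose cumulative variance ratio reaches `target`."""
    for n, cumulative in enumerate(accumulate(variance_ratio), start=1):
        if cumulative >= target:
            return n
    return len(variance_ratio)  # target never reached: keep every computed PC

# Toy ratios, not taken from the PBMC data.
toy_ratio = [0.40, 0.25, 0.15, 0.10, 0.05, 0.05]
print(n_pcs_for_variance(toy_ratio, target=0.85))
```

On real single-cell data the computed PCs usually explain well under 100% of the variance, so a high target may never be reached; `sc.pl.pca_variance_ratio(adata_hvg)` gives the usual elbow plot as a visual alternative.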


In [None]:
# NOTE: This notebook was originally run with an HVG flag stored in adata.var.
# The saved object used later may not contain 'highly_variable' depending on how it was exported.
# This guard keeps the notebook runnable without redoing HVG selection.

import os
import scanpy as sc

# --- Project root ---
os.chdir("/mnt/c/Users/yasmi/OneDrive/Desktop/Mini-Projets/scRNA_Influenza_Patients")

# --- Load the object with condition labels ---
# NOTE: the filename says "raw"; confirm that X is normalized/log-transformed before scaling.
adata = sc.read_h5ad("results/adata_with_condition_raw.h5ad")

# --- Use HVGs for clustering ---
# Use HVGs if available, otherwise keep all genes
if "highly_variable" in adata.var.columns:
    adata = adata[:, adata.var["highly_variable"]].copy()
else:
    print("WARNING: 'highly_variable' not found in adata.var; using all genes (no HVG filtering).")

# --- Scale + PCA ---
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver="arpack")

# --- Neighbors graph (core step for clustering/UMAP) ---
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30)

# --- Clustering ---
sc.tl.leiden(adata, resolution=0.5)

# --- UMAP (visual embedding) ---
sc.tl.umap(adata)

# --- Save for downstream notebooks ---
adata.write("results/adata_clustered_pbmc.h5ad")

# --- Minimal QC prints (always keep) ---
print("Cells x Genes:", adata.n_obs, "x", adata.n_vars)
print("Number of clusters (leiden):", adata.obs["leiden"].nunique())
print("Cluster sizes (top 10):")
print(adata.obs["leiden"].value_counts().head(10))
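
Beyond cluster sizes, it is often useful to check how each cluster splits across conditions, since `condition` is kept in `adata.obs`. The sketch below shows the computation on plain lists; the helper name and the condition labels (`"flu"`, `"healthy"`) are placeholders, not values taken from the dataset.

```python
from collections import Counter, defaultdict

def cluster_composition(clusters, conditions):
    """Return {cluster: {condition: fraction of that cluster's cells}}."""
    counts = defaultdict(Counter)
    for cluster, condition in zip(clusters, conditions):
        counts[cluster][condition] += 1
    return {
        cluster: {cond: n / sum(c.values()) for cond, n in c.items()}
        for cluster, c in counts.items()
    }

# Toy labels; in the notebook these would come from adata.obs["leiden"]
# and adata.obs["condition"].
toy_clusters = ["0", "0", "0", "0", "1", "1"]
toy_conditions = ["flu", "flu", "flu", "healthy", "healthy", "healthy"]
toy_comp = cluster_composition(toy_clusters, toy_conditions)
print(toy_comp)
```

With pandas, `pd.crosstab(adata.obs["leiden"], adata.obs["condition"], normalize="index")` should give the same table directly.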


In [None]:
import scanpy as sc

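# UMAP visualization (global map)
# With save=..., scanpy prefixes "umap" and writes to sc.settings.figdir
# (by default ./figures/umap_UMAP_PBMC_Leiden_clusters.png).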
sc.pl.umap(
    adata,
    color="leiden",
    frameon=False,
    show=False,
    save="_UMAP_PBMC_Leiden_clusters.png"
)

In [None]:
# Save again as adata_cond_clustered.h5ad
# (this overwrites the file written by the unscaled run above)
adata.write("results/adata_cond_clustered.h5ad")
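
Because this cell writes to the same path as the earlier unscaled run, the first result is silently replaced. If that is not intended, a small helper can pick a free filename instead. This is a minimal sketch; `unique_path` is a hypothetical helper and the `_N` suffix scheme is an arbitrary choice.

```python
from pathlib import Path

def unique_path(path):
    """Return `path` if unused, else the first 'stem_N.suffix' variant that does not exist."""
    path = Path(path)
    if not path.exists():
        return path
    n = 1
    while True:
        candidate = path.with_name(f"{path.stem}_{n}{path.suffix}")
        if not candidate.exists():
            return candidate
        n += 1

# Possible usage in the notebook (left commented out so the sketch has no side effects):
# adata.write(unique_path("results/adata_cond_clustered.h5ad"))
```

Giving the scaled and unscaled runs distinct, descriptive filenames would be an equally simple alternative.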