## Cell quality filtering

**Objective**
Perform cell-level quality control: remove low-quality cells based on standard QC metrics (number of detected genes, total counts, and mitochondrial read percentage).


**Inputs**
- Filtered raw object from notebook 01 (`results/adata_with_condition_raw.h5ad`)

**QC metrics used**
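- nGenes: genes detected per cell (`n_genes_by_counts`)
- nUMI: total counts per cell (`total_counts`)
- percent_mito: percentage of counts from mitochondrial genes (`pct_counts_mt`)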

**Filtering thresholds**
- nGenes >= 200
- %mito <= 15%
- nUMI <= 40,000
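
As a minimal sketch of how these thresholds combine, the rule can be written as a single predicate that a cell must satisfy in full. The constant names, the `passes_qc` helper, and the toy cells are hypothetical and only illustrate the logic; the values mirror the inclusive comparisons in the final filtering cell of this notebook, where the same test is applied to `adata.obs` as a boolean mask.

```python
# Hypothetical constant names; the values mirror the final filtering cell.
MIN_GENES = 200
MAX_PCT_MITO = 15.0
MAX_TOTAL_COUNTS = 40_000


def passes_qc(n_genes, pct_mito, total_counts):
    """Return True if a cell satisfies all three inclusive thresholds."""
    return (
        n_genes >= MIN_GENES
        and pct_mito <= MAX_PCT_MITO
        and total_counts <= MAX_TOTAL_COUNTS
    )


# Toy cells with made-up values (not taken from the dataset).
toy_cells = [
    {"n_genes": 874, "pct_mito": 1.2, "total_counts": 2371},    # typical cell
    {"n_genes": 200, "pct_mito": 15.0, "total_counts": 40000},  # on every boundary
    {"n_genes": 44, "pct_mito": 0.5, "total_counts": 500},      # too few genes
    {"n_genes": 900, "pct_mito": 89.8, "total_counts": 3000},   # high mito
    {"n_genes": 6762, "pct_mito": 2.0, "total_counts": 62395},  # very high counts
]

kept = [passes_qc(**cell) for cell in toy_cells]
print(kept)
```

The second toy cell sits exactly on all three boundaries and is kept because the comparisons are inclusive; with strict comparisons (`>` and `<`) it would be dropped.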

**Output**
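- Cleaned object after cell quality filtering, saved to `results/adata_qc.h5ad` for downstream analysis
- Intermediate object from the first filtering pass (genes and mito only), saved to `results/adata_cond_qc.h5ad`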

With the final thresholds on detected genes, mitochondrial content, and total counts, 102,545 of 103,202 cells are retained.


In [1]:
import os

os.chdir("/mnt/c/Users/yasmi/OneDrive/Desktop/Mini-Projets/scRNA_Influenza_Patients")
print("cwd set to:", os.getcwd())

cwd set to: /mnt/c/Users/yasmi/OneDrive/Desktop/Mini-Projets/scRNA_Influenza_Patients


In [2]:
import scanpy as sc

# Load raw object from Notebook 01
adata = sc.read_h5ad("results/adata_with_condition_raw.h5ad")

# Flag mitochondrial genes (human MT genes usually start with "MT-")
adata.var["mt"] = adata.var_names.str.upper().str.startswith("MT-")

# Compute all QC metrics in one call, including mitochondrial percentage (pct_counts_mt)
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

# Quick check: what QC columns do we have?
print(adata.obs.columns.tolist())

# Show key QC summaries
print(adata.obs[["total_counts", "n_genes_by_counts", "pct_counts_mt"]].describe())


  utils.warn_names_duplicates("obs")


['gsm', 'sample_id', 'condition', 'replicate', 'batch', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt']
        total_counts  n_genes_by_counts  pct_counts_mt
count  103202.000000      103202.000000  103202.000000
mean     3184.927979        1069.556016       2.657958
std      2799.308838         620.927868       3.531251
min       500.000000          44.000000       0.000000
25%      1803.000000         702.000000       0.421496
50%      2371.000000         874.000000       1.246488
75%      3344.000000        1198.000000       3.978809
max     62395.000000        6762.000000      89.819374


In [3]:
import scanpy as sc

adata = sc.read_h5ad("results/adata_with_condition_raw.h5ad")

# Flag mitochondrial genes (human MT- genes), then compute QC metrics in one call
adata.var["mt"] = adata.var_names.str.upper().str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

# First pass: simple default filters with strict inequalities (refined in the next cell)
adata_qc = adata[(adata.obs["n_genes_by_counts"] > 200) & (adata.obs["pct_counts_mt"] < 15)].copy()

print("Cells before QC:", adata.n_obs)
print("Cells after QC:", adata_qc.n_obs)

adata_qc.write("results/adata_cond_qc.h5ad")

  utils.warn_names_duplicates("obs")
  utils.warn_names_duplicates("obs")


Cells before QC: 103202
Cells after QC: 102555


In [4]:
# Final filtering: inclusive thresholds plus an upper bound on total counts
# Number of cells before filtering
print("Cells before QC:", adata.n_obs)

# Apply filters
adata_qc = adata[
    (adata.obs["n_genes_by_counts"] >= 200) &
    (adata.obs["pct_counts_mt"] <= 15) &
    (adata.obs["total_counts"] <= 40000)
].copy()

# Number of cells after filtering
print("Cells after QC:", adata_qc.n_obs)


Cells before QC: 103202
Cells after QC: 102545
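
The before/after totals above do not show which filter removed which cells. A per-filter breakdown is one way to check that, sketched here in plain Python on made-up values; the `qc_breakdown` helper and the toy tuples are hypothetical, and in this notebook the same counts could presumably be obtained by summing boolean masks built from the `adata.obs` columns.

```python
def qc_breakdown(cells, min_genes=200, max_pct_mito=15.0, max_total_counts=40_000):
    """Count how many cells fail each filter and how many pass all of them.

    `cells` is an iterable of (n_genes, pct_mito, total_counts) tuples.
    """
    failed = {"low_genes": 0, "high_mito": 0, "high_counts": 0}
    n_kept = 0
    for n_genes, pct_mito, total_counts in cells:
        low_genes = n_genes < min_genes
        high_mito = pct_mito > max_pct_mito
        high_counts = total_counts > max_total_counts
        # A cell can fail more than one filter, so it may be counted in several bins.
        failed["low_genes"] += int(low_genes)
        failed["high_mito"] += int(high_mito)
        failed["high_counts"] += int(high_counts)
        if not (low_genes or high_mito or high_counts):
            n_kept += 1
    return failed, n_kept


# Toy cells with made-up values (not taken from the dataset).
toy_cells = [
    (874, 1.2, 2371),    # passes every filter
    (150, 20.0, 800),    # fails both the gene and the mito filter
    (44, 0.5, 500),      # fails the gene filter
    (900, 89.8, 3000),   # fails the mito filter
    (6762, 2.0, 62395),  # fails the total-counts filter
]

failed, n_kept = qc_breakdown(toy_cells)
print(failed)
print("kept:", n_kept, "removed:", len(toy_cells) - n_kept)
```

Because one cell can fail several filters, the per-filter counts may add up to more than the number of removed cells, as they do in this toy example.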


  utils.warn_names_duplicates("obs")


In [5]:
# Save results
adata_qc.write("results/adata_qc.h5ad")