### Normalization and highly variable genes (HVG)

**Objective**
Normalize gene expression values and identify highly variable genes for downstream analysis.

*How can technical variability be minimized to allow meaningful comparison between cells?* 

**Input**
- Cleaned object from cell quality filtering (Notebook 02)

**Methods**
- Library-size normalization
- Log-transformation
- Identification of highly variable genes (HVG)

**Output**
- Normalized object
- List of highly variable genes

Library-size normalization (each cell scaled to 10,000 total counts) followed by a log(1 + x) transformation was applied to correct for differences in sequencing depth across cells.
The 2,000 most variable genes were then selected (Seurat flavor) so that downstream analyses focus on the genes carrying most of the cell-to-cell variability rather than on uninformative or noisy genes.
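
To make the depth correction concrete, here is a minimal, dependency-free sketch of what library-size normalization followed by log(1 + x) does to a count matrix. It is an illustration only, not scanpy's implementation: the function name `normalize_log1p`, the toy counts, and the handling of empty cells are assumptions made for this example.

```python
import math

def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    result = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # Assumption for this sketch: an empty cell stays all zeros.
            result.append([0.0 for _ in cell])
            continue
        scale = target_sum / total
        result.append([math.log1p(count * scale) for count in cell])
    return result

# Hypothetical toy data: two cells with the same composition,
# the second one sequenced ten times deeper.
toy_counts = [
    [10, 0, 30],    # shallow cell, 40 counts in total
    [100, 0, 300],  # deep cell, 400 counts in total
]
toy_norm = normalize_log1p(toy_counts)

# After normalization the two cells are indistinguishable:
# the depth difference is gone, the composition is kept.
print(toy_norm[0])
print(toy_norm[1])
```

The log transformation is applied after scaling so that highly expressed genes do not dominate the variance used in the HVG step.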


In [1]:
import os

PROJECT_ROOT = "/mnt/c/Users/yasmi/OneDrive/Desktop/Mini-Projets/scRNA_Influenza_Patients"
os.chdir(PROJECT_ROOT)

print("Working directory:", os.getcwd())


Working directory: /mnt/c/Users/yasmi/OneDrive/Desktop/Mini-Projets/scRNA_Influenza_Patients


In [2]:
import scanpy as sc

# Set the project root (as in the other notebooks)
os.chdir("/mnt/c/Users/yasmi/OneDrive/Desktop/Mini-Projets/scRNA_Influenza_Patients")

# Load QC-filtered data
adata = sc.read_h5ad("results/adata_with_condition_raw.h5ad")

# 1) Normalization (make cells comparable by scaling each to 10,000 total counts)
sc.pp.normalize_total(adata, target_sum=1e4)

# 2) Log transformation, log(1 + x)
sc.pp.log1p(adata)

# 3) Identify highly variable genes (HVG)
sc.pp.highly_variable_genes(adata, flavor="seurat", n_top_genes=2000)

# Quick checks
print("HVG:", adata.var["highly_variable"].sum())
print(adata)

# Keep only HVGs for downstream analysis
adata = adata[:, adata.var["highly_variable"]].copy()
print("After HVG subsetting:", adata)


  utils.warn_names_duplicates("obs")


HVG: 2000
AnnData object with n_obs × n_vars = 103202 × 37493
    obs: 'gsm', 'sample_id', 'condition', 'replicate', 'batch'
    var: 'gene_ids', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg'
After HVG subsetting: AnnData object with n_obs × n_vars = 103202 × 2000
    obs: 'gsm', 'sample_id', 'condition', 'replicate', 'batch'
    var: 'gene_ids', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg'


  utils.warn_names_duplicates("obs")
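
The idea behind the HVG step above can be sketched as ranking genes by their dispersion (variance divided by mean). This is a deliberately simplified illustration: as far as I understand it, the `flavor="seurat"` method also groups genes into mean-expression bins and standardizes the dispersion within each bin (the `dispersions_norm` column), which this sketch leaves out. The function name, gene names, and toy values are hypothetical.

```python
from statistics import mean, variance

def rank_genes_by_dispersion(expr, gene_names, n_top):
    """Return the `n_top` genes with the highest variance-to-mean ratio.

    `expr` is a list of cells, each a list of expression values
    ordered like `gene_names`.
    """
    dispersions = {}
    for j, gene in enumerate(gene_names):
        values = [cell[j] for cell in expr]
        mu = mean(values)
        # Assumption for this sketch: genes with zero mean get dispersion 0.
        dispersions[gene] = variance(values) / mu if mu > 0 else 0.0
    ranked = sorted(gene_names, key=lambda g: dispersions[g], reverse=True)
    return ranked[:n_top], dispersions

# Hypothetical toy matrix: 4 cells x 3 genes.
toy_expr = [
    [5.0, 0.0, 2.0],
    [5.0, 9.0, 2.0],
    [5.0, 0.0, 3.0],
    [5.0, 9.0, 1.0],
]
toy_genes = ["housekeeping", "bimodal", "mild"]

top_genes, toy_disp = rank_genes_by_dispersion(toy_expr, toy_genes, n_top=1)
print(top_genes)
print(toy_disp)
```

A gene expressed at a constant level in every cell has zero dispersion and is never selected, while a gene that is on in some cells and off in others ranks first. That is the kind of signal the 2,000 retained genes are meant to capture.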


In [3]:
# Save the normalized, HVG-subset object for the next notebook
adata.write("results/adata_norm_hvg.h5ad")