# Technical Challenge Notebook  
*by Rüçhan Ekren - Toulouse - 15/08/2025*

---
### Task 2: **Review and Enhancement of the Patient Stratification Pipeline** 
---
**Objective:** Critically assess the currently suggested patient stratification pipeline and propose enhancements to improve its analytical rigor, biological relevance, and scalability.

[Assessment of the Current Pipeline](#assessment-of-the-current-pipeline)

[Suggested Pipeline](#suggested-pipeline)


Back to [Head of the page](#technical-challenge-notebook)



<a id="assessment-of-the-current-pipeline"></a>
### Assessment of the Current Pipeline
---

<div style="text-align: center;">
    <img src="patient_stratification_current_pipeline.png" alt="patient_stratification_current_pipeline">
</div>


The current pipeline aims to stratify patients into molecular subtypes using bulk transcriptomic data, with the overarching goal of linking these subtypes to clinical characteristics. The workflow begins with bulk RNA-seq data, selects the 100 most variable genes, applies hierarchical clustering, and finally compares the resulting clusters to clinical data for interpretability.

While this approach is straightforward, there are several critical limitations:

-   Lack of Normalization and Batch Correction: The pipeline omits normalization and batch effect correction, both of which are essential for RNA-seq analysis. Without these steps, technical variation can dominate the data, causing the “high variance” genes to reflect artifacts rather than true biological differences.

-   Simplistic Feature Selection: Selecting the top 100 genes by variance is a coarse strategy, especially for count-based RNA-seq data. Variance-stabilizing methods such as VST or rlog from DESeq2 better account for the mean–variance relationship and reduce bias toward highly expressed genes. Moreover, fixing the number at exactly 100 is arbitrary and may exclude important signals.

-   Limited Clustering Strategy: Using only hierarchical clustering restricts the ability to detect diverse patterns in the data. Alternative approaches such as non-negative matrix factorization (NMF), k-means, or consensus clustering can reveal complementary structure and improve robustness.

-   Late Integration of Clinical Data: Restricting clinical data use to the final interpretation stage misses the opportunity to guide unsupervised analysis toward clinically meaningful patterns. Early integration, for example through semi-supervised clustering or multi-omics factor analysis, could produce clusters with stronger clinical relevance.

Overall, the pipeline could be made more robust by incorporating normalization and batch correction early, using statistically sound feature selection, testing multiple clustering methods, and integrating clinical data throughout the analysis rather than at the very end.

Back to [Head of the page](#technical-challenge-notebook)



<a id="suggested-pipeline"></a>
### Suggested Pipeline
---

<div style="text-align: center;">
  <img src="suggested_workflow.png" alt="Suggested workflow" width="800" height="600" style="display:block;margin-left:auto;margin-right:auto;">
</div>

The following analysis workflow addresses the limitations of the current pipeline:

### **RNA-seq**

1. **QC:**  

   - Run **FastQC** on raw reads.

   - Aggregate QC reports using **MultiQC**.

   *(Ensures raw data quality before downstream steps.)*

2. **Processing:**  
   
   - Use the `nf-core/rnaseq` pipeline, or run `STAR` + `Salmon` directly, for alignment and quantification.

3. **Normalization & Batch Correction:**

   - Perform batch effect correction using **pyComBat-Seq**. It models raw counts, so run it before any transformation.

   - Apply statistically sound normalization for count data (e.g., variance-stabilizing transformation or rlog equivalent).

   *(Removes technical noise before feature selection.)*

4. **Feature Selection:**

   - Select features from variance-stabilized data.

   - Optionally guide selection using clinical or biological relevance.

   *(Avoids arbitrary cutoffs and captures relevant signals.)*
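
The normalization and feature selection steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the proposed implementation: log2(CPM + 1) is used here only as a simple stand-in for VST/rlog, and the function names `log_cpm` and `select_variable_genes`, the `top_fraction` parameter, and the toy counts are all assumptions made for the example.

```python
import numpy as np

def log_cpm(counts, prior_count=1.0):
    """Library-size normalize a samples x genes count matrix, then log2-transform.

    Assumes every sample has a non-zero library size.
    """
    counts = np.asarray(counts, dtype=float)
    lib_size = counts.sum(axis=1, keepdims=True)
    cpm = counts / lib_size * 1e6
    return np.log2(cpm + prior_count)

def select_variable_genes(log_expr, top_fraction=0.1):
    """Return column indices of the most variable genes, highest variance first."""
    variances = log_expr.var(axis=0, ddof=1)
    n_keep = max(1, int(round(top_fraction * log_expr.shape[1])))
    return np.argsort(variances)[::-1][:n_keep]

# Toy data (hypothetical): 6 samples x 5 genes; gene 2 is the only one
# that changes strongly between the first and last three samples.
counts = np.array([
    [100, 50,  10, 200, 30],
    [110, 55,  12, 190, 28],
    [ 95, 48,   9, 210, 33],
    [105, 52, 400, 205, 31],
    [ 98, 49, 380, 195, 29],
    [102, 51, 420, 200, 30],
])
log_expr = log_cpm(counts)
selected = select_variable_genes(log_expr, top_fraction=0.2)
print(selected)
```

Selecting a fraction of genes rather than a fixed count is still a tunable choice, so in practice it would be worth checking how sensitive the clusters are to `top_fraction`.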

---

### **Cell Type Deconvolution**

- **Inference:** Use the normalized bulk RNA-seq expression matrix as input for a deconvolution tool such as [CellAnneal](https://doi.org/10.21105/joss.05610). If available, use a relevant scRNA-seq dataset as reference.

- **Output:** Produces a matrix of estimated cell type fractions per patient.

- **Statistical Testing:** Apply non-parametric tests (Kruskal–Wallis, Wilcoxon rank-sum) to identify cell types with significantly different abundances between groups.

- **Stability Assessment:** Perform bootstrapping or cross-validation to ensure robustness.
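
As a rough sketch of the statistical testing step, the snippet below runs a Kruskal-Wallis test per cell type with `scipy.stats.kruskal`. The helper name `compare_cell_fractions`, the Bonferroni adjustment, and the toy fractions are assumptions for illustration; a real analysis might prefer a different multiple-testing correction.

```python
import numpy as np
from scipy.stats import kruskal

def compare_cell_fractions(fractions, groups, cell_types):
    """Kruskal-Wallis test per cell type; returns Bonferroni-adjusted p-values."""
    fractions = np.asarray(fractions, dtype=float)
    groups = np.asarray(groups)
    group_labels = np.unique(groups)
    adjusted = {}
    for j, name in enumerate(cell_types):
        # One array of fractions per patient group for this cell type.
        samples = [fractions[groups == g, j] for g in group_labels]
        _, p_value = kruskal(*samples)
        adjusted[name] = min(1.0, p_value * len(cell_types))
    return adjusted

# Toy data (hypothetical): 15 patients in 3 groups, 2 cell types.
# Only two cell types are shown, so rows do not sum to 1.
t_cells = [0.10, 0.12, 0.11, 0.13, 0.09,
           0.30, 0.32, 0.31, 0.29, 0.33,
           0.50, 0.52, 0.49, 0.51, 0.53]   # clearly shifted between groups
b_cells = [0.21, 0.26, 0.28, 0.32, 0.33,
           0.22, 0.25, 0.29, 0.30, 0.34,
           0.23, 0.24, 0.27, 0.31, 0.35]   # interleaved, no group effect
fractions = np.column_stack([t_cells, b_cells])
groups = ["A"] * 5 + ["B"] * 5 + ["C"] * 5

adjusted = compare_cell_fractions(fractions, groups, ["T cells", "B cells"])
print(adjusted)
```

For two-group comparisons, `scipy.stats.mannwhitneyu` would play the role of the Wilcoxon rank-sum test in the same loop.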

---

### **CNV Usage / Inference**

- If Copy Number Variation (CNV) data is available, retrieve it; otherwise, infer CNVs from expression profiles:
  
    - If the data happens to be scRNA-seq, use tools such as [infercnvpy](https://github.com/icbi-lab/infercnvpy).
  
    - For bulk RNA-seq data, the CNV inference tool I found, [CaSpER](https://www.nature.com/articles/s41467-019-13779-x), is R-based but can serve as an alternative.

- Outputs are integrated with expression and cell fraction data.

---

### **Quality Control Checkpoint**

- Missing values in clinical data are common and hard to impute robustly. `missingno` can reveal their pattern, which helps prioritize features and choose an imputation strategy where one is feasible.

- Use **scanpy** or `seaborn`/`matplotlib` with **scikit-learn** to generate PCA and UMAP plots.  

- Confirm removal of technical artifacts before integration.
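
One possible way to make the "technical artifacts removed" check quantitative, alongside the PCA/UMAP plots, is to compute a silhouette score of the batch labels on the top principal components: a value near zero suggests batches are well mixed, a value near one suggests a residual batch effect. The sketch below assumes scikit-learn; the helper name `batch_silhouette` and the simulated data are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

def batch_silhouette(expr, batches, n_components=2, seed=0):
    """Project samples onto the top PCs and score how strongly batches separate."""
    coords = PCA(n_components=n_components, random_state=seed).fit_transform(expr)
    return coords, silhouette_score(coords, batches)

# Simulated data (hypothetical): 40 samples x 50 genes, two batches.
rng = np.random.default_rng(0)
batches = np.array(["b1"] * 20 + ["b2"] * 20)
clean = rng.normal(size=(40, 50))
shifted = clean.copy()
shifted[batches == "b2"] += 5.0  # mimic an uncorrected batch shift

_, score_clean = batch_silhouette(clean, batches)
_, score_shifted = batch_silhouette(shifted, batches)
print(f"no batch effect: {score_clean:.2f}, batch effect: {score_shifted:.2f}")
```

The returned `coords` can be passed directly to `seaborn`/`matplotlib` for the PCA scatter plot, colored by batch.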

---

### **Multi-Modal, (Multi-) Omics and Clinical Data Integration**

- Integrate gene expression, cell fractions, CNVs, and **clinical data at this stage** using:  

  - [mofapy2](https://github.com/bioFAM/mofapy2) / [mofax](https://github.com/bioFAM/mofax)  

  - Or experimental VAE-based tools like [move](https://github.com/rasmussenlab/move).  

- *(Ensures clinical data influences clustering rather than only post-hoc interpretation.)*
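
To illustrate the idea of joint integration without relying on the exact `mofapy2` API, here is a deliberately simple stand-in: each view is z-scored, down-weighted by its number of features so that large views do not dominate, concatenated, and decomposed with PCA via SVD. This is not MOFA (no per-view likelihoods, sparsity priors, or missing-value handling), and the function name `joint_factors` and the random toy views are assumptions. Clinical variables are assumed to be numerically encoded already.

```python
import numpy as np

def joint_factors(views, n_factors=2):
    """Block-scaled concatenation of views followed by PCA (via SVD)."""
    blocks = []
    for mat in views.values():
        mat = np.asarray(mat, dtype=float)
        std = mat.std(axis=0, ddof=1)
        std[std == 0] = 1.0  # constant features carry no signal; avoid dividing by zero
        z = (mat - mat.mean(axis=0)) / std
        # Weight each view by 1/sqrt(n_features) so views contribute comparably.
        blocks.append(z / np.sqrt(mat.shape[1]))
    stacked = np.hstack(blocks)
    u, s, _ = np.linalg.svd(stacked, full_matrices=False)
    return u[:, :n_factors] * s[:n_factors]

# Toy views (hypothetical) sharing the same 30 patients as rows.
rng = np.random.default_rng(0)
views = {
    "expression": rng.normal(size=(30, 200)),
    "cell_fractions": rng.normal(size=(30, 5)),
    "clinical": rng.normal(size=(30, 3)),
}
factors = joint_factors(views, n_factors=2)
print(factors.shape)
```

The resulting patient-by-factor matrix plays the same role as the latent factors that a dedicated tool would return for the clustering step.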

---

### **Clustering**

- Perform clustering on **integrated latent factors**, not on a single data type.  

- Compare multiple algorithms:  

  - `GaussianMixture`  

  - `AgglomerativeClustering`  

  - `KMeans`  

  - Consensus clustering  

- *(Improves robustness by avoiding reliance on a single method.)*
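
The comparison of algorithms listed above could look roughly like this with scikit-learn. The helper name `compare_clusterings`, the choice of silhouette and adjusted Rand index as comparison metrics, and the simulated latent factors are assumptions for the sketch; pairwise agreement is only a light proxy for full consensus clustering.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.mixture import GaussianMixture

def compare_clusterings(factors, n_clusters, seed=0):
    """Fit three algorithms on the same factors; report labels, silhouette, agreement."""
    models = {
        "KMeans": KMeans(n_clusters=n_clusters, n_init=10, random_state=seed),
        "AgglomerativeClustering": AgglomerativeClustering(n_clusters=n_clusters),
        "GaussianMixture": GaussianMixture(
            n_components=n_clusters, n_init=5, random_state=seed
        ),
    }
    labels = {name: model.fit_predict(factors) for name, model in models.items()}
    silhouettes = {name: silhouette_score(factors, lab) for name, lab in labels.items()}
    # Adjusted Rand index between every pair of methods (1.0 = identical partitions).
    agreement = {
        (a, b): adjusted_rand_score(labels[a], labels[b])
        for a, b in combinations(labels, 2)
    }
    return labels, silhouettes, agreement

# Simulated latent factors (hypothetical): 3 well-separated groups of 30 patients.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [10, 10, 10], [-10, 10, -10]])
factors = np.vstack([c + rng.normal(size=(30, 3)) for c in centers])

labels, silhouettes, agreement = compare_clusterings(factors, n_clusters=3)
for pair, ari in agreement.items():
    print(pair, round(ari, 3))
```

Low agreement between methods on real data would be a signal to revisit the number of clusters or the upstream factors before interpreting any subtype.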

---

### **Comprehensive Downstream Analysis**

With robust clusters in hand:

1. **Survival Analysis:**  

   - Use `lifelines` or `scikit-survival` for Kaplan–Meier curves and survival tests.

2. **Differential Expression & Pathway Analysis:**  

   - Use `pydeseq2` with covariate adjustment for DE analysis.  

   - Perform pathway enrichment using `gseapy`.
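
For the survival step, `lifelines` or `scikit-survival` would be used in practice. The hand-rolled sketch below only shows what a Kaplan-Meier estimator computes for one cluster, in plain Python; the function name `kaplan_meier` and the toy follow-up times are hypothetical.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per patient
    events:    1 if the event (e.g. death) was observed, 0 if censored
    """
    pairs = sorted(zip(durations, events))
    n_at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += int(bool(pairs[i][1]))
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set after time t.
        n_at_risk -= removed
    return curve

# Toy cluster (hypothetical): follow-up in months, third patient censored.
curve = kaplan_meier([1, 2, 3, 4], [1, 1, 0, 1])
for time, prob in curve:
    print(time, round(prob, 3))
```

Running the same function per cluster gives the curves to compare; a log-rank test from one of the libraries above would then assess whether they differ.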

---

### **Practical Implementation & Benefits**

- **Efficiency:** Parallelize per-sample processing.

- **Reliability:** QC checkpoints before data integration.  

- **Reusability:** Modular preprocessing for reuse in other projects.  

- **Reproducibility:** Orchestrate with **Nextflow** for scalable and reproducible execution.

---




Back to [Head of the page](#technical-challenge-notebook)