# AI4Health - 03 – Omics: Precision Medicine

---

## Introduction

Modern medicine is increasingly driven by **omics data**: large-scale measurements of genes, transcripts, proteins, and more. In diseases like acute myeloid leukemia (AML), analysing thousands of gene expression levels per patient has revealed hidden subtypes that respond differently to treatment. This ability to discover new patient groups is a cornerstone of **precision medicine**, where therapies are tailored to the unique biology of each individual.

In this notebook, you will use unsupervised machine learning to explore gene expression data from leukemia patients. By applying dimensionality reduction and clustering techniques, you will uncover patterns that may correspond to clinically meaningful subgroups—potentially leading to more personalised care.

You will learn how to:

- Understand the structure and challenges of gene expression (omics) datasets
- Apply **Principal Component Analysis (PCA)** to reduce dimensionality and reveal major biological signals
- Use **K-Means clustering** to identify potential patient subtypes
- Visualise and interpret the results in the context of precision medicine

By the end of this notebook, you will have hands-on experience with the core steps of omics data analysis, and a deeper appreciation for how machine learning can drive discoveries in personalised healthcare.

### Learning Objectives:

- Recognise the challenges of high-dimensional omics data
- Apply PCA for dimensionality reduction in gene expression analysis
- Use clustering to identify potential disease subtypes
- Reflect on the clinical impact of data-driven patient stratification

---

## Additional Context

### What is Omics Data?

Omics data refers to large-scale biological measurements that capture the molecular makeup of cells, tissues, or organisms. Common omics types include:
- **Genomics** (DNA sequences and mutations)
- **Transcriptomics** (gene expression levels)
- **Proteomics** (protein abundance)
- **Metabolomics** (small molecule metabolites)

In clinical research, omics data is often represented as a matrix where each row is a patient (or sample) and each column is a molecular feature (e.g., gene or protein). These datasets are typically **high-dimensional**—with thousands of features but relatively few samples.
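
To make the matrix layout concrete, here is a minimal sketch with made-up numbers. The sizes, the random values and the names `expression`, `patient_ids` and `gene_ids` are illustrative assumptions, not part of any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy matrix: 5 patients (rows) x 8 genes (columns)
n_patients, n_genes = 5, 8
expression = rng.normal(loc=8.0, scale=2.0, size=(n_patients, n_genes))

patient_ids = [f"patient_{i}" for i in range(n_patients)]
gene_ids = [f"gene_{j}" for j in range(n_genes)]

# One row holds one patient's full expression profile
profile = expression[0]
# One column holds one gene measured across all patients
gene_across_patients = expression[:, 0]

print("matrix shape (patients, genes):", expression.shape)
print(f"{patient_ids[0]} profile length:", profile.shape[0])
print(f"{gene_ids[0]} measured in", gene_across_patients.shape[0], "patients")
```

Real omics matrices follow the same layout, only with far more columns than rows.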

### Gene‑expression matrices & the need for dimensionality reduction
A typical gene‑expression experiment measures **thousands of genes** for **only a few dozen patients**. The leukemia dataset used in this notebook, for example, has:

| patients (rows) | genes (columns) |
|-----------------|-----------------|
| 72              | 7 129           |

That many columns cause:
* **Curse of dimensionality** – Euclidean distances become less meaningful, hurting clustering.
* **Over‑fitting** – more features than samples means many spurious patterns.
* **Computation & memory** – every extra dimension costs RAM & CPU.
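
The first point can be checked numerically. The sketch below uses pure random noise (an assumption chosen for illustration, not real expression data) and compares how much pairwise Euclidean distances vary in 2 dimensions versus 7 129 dimensions. The helper `distance_contrast` is a hypothetical name introduced here.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

def distance_contrast(n_samples, n_features):
    """Relative spread (max - min) / min of all pairwise Euclidean distances."""
    points = rng.normal(size=(n_samples, n_features))
    d = pdist(points)  # condensed vector with one entry per pair of samples
    return (d.max() - d.min()) / d.min()

contrast_low = distance_contrast(72, 2)      # 72 samples, 2 features
contrast_high = distance_contrast(72, 7129)  # 72 samples, 7 129 features

print(f"relative spread in    2 dimensions: {contrast_low:.2f}")
print(f"relative spread in 7129 dimensions: {contrast_high:.2f}")
```

In the high-dimensional case the nearest and farthest pairs end up almost equally far apart, which is why distance-based methods such as K-Means tend to benefit from a reduction step first.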

### Principal Component Analysis (PCA) in this context
PCA rotates the data into a new set of orthogonal axes (principal components, PCs) ordered by how much variance they capture. For gene‑expression data, the steps are:
1. **Center & scale** each gene (z‑score).
2. Compute the covariance matrix of the genes.
3. Eigen‑decompose it to get the PCs.
4. Keep the top *k* PCs that explain most variance (often 2‑50), giving a *compressed* representation while preserving the major biological signals.
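
The four steps above can be sketched directly in NumPy and compared against scikit-learn. The toy matrix and the names `X_toy`, `Z`, `scores` and `pca_check` are illustrative assumptions. For very wide matrices, libraries generally avoid forming the full gene-by-gene covariance matrix and use a singular value decomposition instead, so treat this as a teaching sketch rather than a production recipe.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 6))     # toy matrix: 30 samples x 6 genes
X_toy[:, 1] += 2.0 * X_toy[:, 0]     # make two genes correlated

# 1. Centre and scale each gene (z-score)
Z = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# 2. Covariance matrix of the genes (6 x 6)
C = np.cov(Z, rowvar=False)

# 3. Eigen-decomposition; eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]    # reorder from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Keep the top k components and project the data onto them
k = 2
scores = Z @ eigvecs[:, :k]
explained_ratio = eigvals / eigvals.sum()

# Cross-check against scikit-learn (component signs are arbitrary)
pca_check = PCA(n_components=k, svd_solver="full").fit(Z)
print("manual  :", np.round(explained_ratio[:k], 4))
print("sklearn :", np.round(pca_check.explained_variance_ratio_, 4))
```

Each component is only defined up to sign, so the manual scores may be mirrored relative to scikit-learn's while describing the same axes.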

### Why Use Clustering in Precision Medicine?

Clustering algorithms like **K-Means** group patients based on molecular similarity, without using predefined labels. This can reveal:
- **Hidden subtypes** of disease that may respond differently to treatment
- **Patient stratification** for personalised therapies
- **Biological insights** into disease mechanisms

Unsupervised clustering is a key step in **precision medicine**, where the goal is to tailor treatments to the unique molecular profile of each patient.
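
To show what "grouping by similarity" means mechanically, here is a minimal sketch of the two alternating steps behind K-Means (assign each sample to its nearest centroid, then move each centroid to the mean of its members). The functions `kmeans_sketch` and `nearest_centroid` and the toy data are hypothetical names and values introduced for illustration. scikit-learn's `KMeans` adds a smarter initialisation (k-means++) and convergence checks on top of this basic idea.

```python
import numpy as np

def nearest_centroid(X, centroids):
    """Index of the closest centroid (Euclidean distance) for every sample."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans_sketch(X, k, n_iter=20, seed=0):
    """Plain alternating iterations: assign samples, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct samples chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = nearest_centroid(X, centroids)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return nearest_centroid(X, centroids), centroids

# Two made-up patient groups, 20 samples each, 5 features
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, size=(20, 5))
group_b = rng.normal(loc=10.0, size=(20, 5))
X_demo = np.vstack([group_a, group_b])

labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2)

# Within-cluster sum of squares versus total sum of squares
wcss = ((X_demo - centroids_demo[labels_demo]) ** 2).sum()
tss = ((X_demo - X_demo.mean(axis=0)) ** 2).sum()
print("cluster sizes:", np.bincount(labels_demo, minlength=2))
print(f"within-cluster / total sum of squares: {wcss / tss:.3f}")
```

No labels are used at any point: the grouping comes purely from distances between samples, which is why the result still needs biological or clinical interpretation.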

### Key Concepts in Omics Machine Learning

When applying machine learning to omics data, several foundational concepts are essential for meaningful analysis and interpretation. These concepts help address the unique challenges posed by high-dimensional biological datasets and ensure that results are robust and biologically relevant.

- **Normalisation**: Ensures each gene or feature contributes equally by removing scale differences.
- **Principal Components**: New axes that summarise the main sources of variation in the data.
- **Cluster Validation**: Assessing whether discovered groups are biologically or clinically meaningful.
- **Visualisation**: Essential for interpreting patterns and communicating findings in high-dimensional data.
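
As one possible illustration of cluster validation, the sketch below computes the silhouette score for several values of k on synthetic blobs. The data, the range of k and the variable names are assumptions made for this example. A high silhouette only indicates compact, well-separated groups; it does not show that the groups are biologically or clinically meaningful.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for a reduced expression matrix: 3 well-separated groups
X_demo, _ = make_blobs(n_samples=120, n_features=20, centers=3,
                       cluster_std=1.0, random_state=42)

silhouettes = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    # Silhouette lies in [-1, 1]; higher means tighter, better separated clusters
    silhouettes[k] = silhouette_score(X_demo, labels_k)

best_k = max(silhouettes, key=silhouettes.get)
for k, s in silhouettes.items():
    print(f"k={k}: silhouette={s:.3f}")
print("k with the highest silhouette:", best_k)
```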

### Clinical and Ethical Considerations

When working with omics data, it is essential to address clinical and ethical issues alongside technical analysis. Responsible machine learning in precision medicine demands that models are interpretable and validated with independent data and clinical outcomes. Protecting patient privacy is paramount, and care must be taken to ensure findings are generalisable and do not worsen health disparities, so that advances translate into real-world benefits without unintended harm.

---

## Related Guides

- *MatPlotLib - Pyplot:* https://matplotlib.org/stable/tutorials/pyplot.html
- *Seaborn - Kernel density estimation:* https://seaborn.pydata.org/tutorial/distributions.html#tutorial-kde
- *SciKit-Learn - Downloading datasets from openml.org:* https://scikit-learn.org/stable/datasets/loading_other_datasets.html#openml
- *SciKit-Learn - Generated datasets (blobs):* https://scikit-learn.org/stable/datasets/sample_generators.html
- *SciKit-Learn - K-means:* https://scikit-learn.org/stable/modules/clustering.html#k-means
- *SciKit-Learn - Preprocessing data (standard scaler):* https://scikit-learn.org/stable/modules/preprocessing.html
- *SciKit-Learn - Principal component analysis (PCA):* https://scikit-learn.org/stable/modules/decomposition.html#pca

---

## Step 1: Load Required Libraries

Before starting the analysis, we import the essential Python libraries: NumPy for data manipulation, scikit-learn for loading data, scaling, dimensionality reduction (PCA) and clustering (KMeans), and Matplotlib and Seaborn for visualisation. Together, these libraries allow us to handle large datasets and explore high-dimensional gene expression data.

In [None]:
import matplotlib.pyplot as plt
import numpy
import seaborn as sns

from sklearn.cluster import KMeans
from sklearn.datasets import fetch_openml, make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Matplotlib settings for crisp plots
plt.rcParams['figure.dpi'] = 120

print("OK")

**Questions:**

- **1.1.** Why do we need specialised libraries for analysing high-dimensional omics data?
- **1.2.** What is the role of dimensionality reduction (like PCA) in gene expression analysis?
- **1.3.** How do clustering algorithms such as K-Means contribute to precision medicine?
- **1.4.** Why is visualisation important when working with thousands of features?

---

## Step 2: Load gene‑expression data

Gene expression datasets typically consist of thousands of genes measured across a relatively small number of patients. In this step, we’ll load such a dataset, ideally a real-world leukemia dataset, but we’ll generate synthetic data if needed. We’ll examine the shape and structure of the data, which is crucial for understanding the challenges of high-dimensional analysis. Previewing the data helps us verify that it loaded correctly and gives us a sense of what each row (patient) and column (gene) represents, setting the stage for meaningful downstream analysis.

- *SciKit-Learn - fetch_openml:* https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html
- *SciKit-Learn - make_blobs:* https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html

In [None]:
# Try to fetch the classic Golub leukemia dataset from OpenML (ID 1104).
# If the download fails (e.g. no internet access), fall back to a small synthetic
# dataset so the notebook still runs on limited hardware.
try:
    leukemia = fetch_openml(data_id=1104, as_frame=False)  # 72 patients × 7 129 genes
    X = leukemia.data
    y = leukemia.target.astype(str)  # 'ALL' vs 'AML'
    print(f"Leukemia matrix shape: {X.shape}")
except Exception as e:
    # Report why the download failed before switching to synthetic data
    print(f"OpenML download failed ({e}) - generating synthetic data.")
    X, y = make_blobs(n_samples=120, n_features=500, centers=3, random_state=42)
    y = y.astype(str)
    print(f"Synthetic matrix shape: {X.shape}")

**Questions:**

- **2.1.** Why are gene expression datasets typically so high-dimensional, and what challenges does this create for analysis?
- **2.2.** What are the differences between real and synthetic gene expression data, and how might this affect your results?
- **2.3.** Why is it important to know the biological meaning of your samples and features before proceeding?
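
Beyond printing the shape, a few quick checks help confirm that the data loaded as expected. The sketch below runs them on a synthetic matrix generated with the same settings as the fallback above, so it works offline. The names `X_demo` and `y_demo` are placeholders, and the same checks could be applied to `X` and `y` from the cell above.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Synthetic stand-in, generated like the fallback data in this step
X_demo, y_demo = make_blobs(n_samples=120, n_features=500, centers=3, random_state=42)
y_demo = y_demo.astype(str)

n_samples, n_features = X_demo.shape
classes, counts = np.unique(y_demo, return_counts=True)
n_missing = int(np.isnan(X_demo).sum())

print(f"{n_samples} samples x {n_features} features")
print("features per sample:", n_features / n_samples)
print("label counts:", dict(zip(classes.tolist(), counts.tolist())))
print("missing values:", n_missing)
print("value range:", X_demo.min().round(2), "to", X_demo.max().round(2))
```

Strongly imbalanced labels, missing values or an unexpected value range are all worth investigating before any scaling or PCA.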

---

## Step 3: Normalise (z‑score)

Raw gene expression values can vary greatly in scale, both across genes and between patients. To ensure that each gene contributes equally to our analysis, we standardise the data by centering each gene to have a mean of 0 and scaling to unit variance (z-scoring). This normalisation step is critical before applying PCA or clustering, as it prevents highly expressed genes from dominating the results and ensures that all features are on a comparable scale. We will check the mean and standard deviation after scaling to confirm that the normalisation was successful.

- *SciKit-Learn - StandardScaler:* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

In [None]:
# Each gene is centred to mean 0 and scaled to unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # shape unchanged
print(f"Scaled matrix mean = {X_scaled.mean():.3f}, std = {X_scaled.std():.3f}")

**Questions:**

- **3.1.** Why is normalisation (z-scoring) necessary for gene expression data before PCA or clustering?
- **3.2.** What could happen if you skip normalisation when analysing omics data?
- **3.3.** How does scaling each gene to mean 0 and variance 1 affect the interpretation of downstream results?
- **3.4.** Are there situations where a different normalisation method might be more appropriate?
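
The overall mean and standard deviation printed above can hide problems in individual genes, so a per-gene check is a useful complement. The sketch below also applies a `log2(x + 1)` transform before z-scoring, which is a common choice for skewed intensity or count data. Whether it suits a particular dataset depends on how that dataset was already preprocessed, so treat it as an assumption here. The toy data and all variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Skewed, strictly positive toy intensities: 50 samples x 10 genes
raw = rng.lognormal(mean=5.0, sigma=1.0, size=(50, 10))
raw[:, 3] = 1.0  # a constant gene that carries no information

logged = np.log2(raw + 1.0)                      # compress very large values
scaled = StandardScaler().fit_transform(logged)  # z-score each gene (column)

gene_means = scaled.mean(axis=0)
gene_stds = scaled.std(axis=0)

# Constant genes cannot be scaled to unit variance; they end up as all zeros
constant_genes = np.flatnonzero(gene_stds < 1e-8)

print("largest |mean| across genes:", np.abs(gene_means).max())
print("per-gene std:", np.round(gene_stds, 3))
print("constant genes (column indices):", constant_genes)
```

Constant genes like the one planted here are one reason the overall standard deviation of a scaled matrix can come out slightly below 1.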

---

## Step 4: Visualise Normalised Data

Before proceeding to PCA, it is helpful to visualise the normalised data to ensure that the scaling process was successful. This step includes plotting a heatmap of the normalised gene expression matrix and visualising the distribution of a few selected genes.

Visualisation helps confirm that the data is properly scaled and ready for dimensionality reduction.

- *Seaborn - heatmap:* https://seaborn.pydata.org/generated/seaborn.heatmap.html
- *Seaborn - kdeplot:* https://seaborn.pydata.org/generated/seaborn.kdeplot.html

In [None]:
# Plot a heatmap of a subset of the normalised data
plt.figure(figsize=(6, 4))
sns.heatmap(X_scaled[:10, :50], cmap="viridis", cbar=True)
plt.title("Heatmap of Normalised Gene Expression (Subset)")
plt.xlabel("Genes")
plt.ylabel("Patients")
plt.show()

# Visualise the distribution of a few selected genes
plt.figure(figsize=(6, 4))
for i in range(3):  # Plot distributions for the first 3 genes
    sns.kdeplot(X_scaled[:, i], label=f"Gene {i+1}")
plt.title("Distribution of Selected Normalised Genes")
plt.xlabel("Z-Score")
plt.ylabel("Density")
plt.legend()
plt.show()

**Questions:**

- **4.1.** What insights can you gain from the heatmap of the normalised gene expression data?
- **4.2.** Why is it important to visualise the distribution of selected genes after normalisation?
- **4.3.** How might anomalies in the visualisation indicate issues with the normalisation process?

---

## Step 5: Principal Component Analysis (PCA)

With normalised data, we can now apply Principal Component Analysis (PCA) to reduce the dataset’s dimensionality. PCA transforms the data into a new set of orthogonal axes (principal components) that capture the greatest variance. By projecting the data onto the top principal components, we retain the most important biological signals while discarding noise and redundancy. This step makes the data easier to visualise and interpret, and prepares it for clustering. We will plot the first two principal components to visually explore patterns and potential groupings among patients.

- *SciKit-Learn - PCA:* https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

In [None]:
# Retain just enough PCs to explain 90 % variance
pca = PCA(n_components=0.90, svd_solver='full', random_state=0)
X_pca = pca.fit_transform(X_scaled)

print(f"Reduced to {X_pca.shape[1]} PCs covering {pca.explained_variance_ratio_.sum():.2%} variance")

# Plot first 2 PCs
plt.figure()
plt.scatter(X_pca[:,0], X_pca[:,1], s=20, alpha=0.7)
plt.title("PCA - first two components")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()

**Questions:**

- **5.1.** Do you notice distinct groups or gradients, and what might separate those samples biologically?
- **5.2.** How do you decide how many principal components to retain?
- **5.3.** What does it mean if you see clear groupings or gradients in the first two PCs?
- **5.4.** What are the limitations of PCA for exploring biological data?
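
One way to start interpreting a component is to look at its loadings, the weights each gene receives in `pca.components_`. The sketch below plants a known signal in synthetic data (20 genes shifted in half of the samples) and checks which genes carry the largest absolute weights on PC1. The data, the group structure and the variable names are assumptions made for this example. On real data, highly weighted genes are candidates for follow-up, not confirmed markers.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples, n_genes, n_signal = 100, 100, 20

# Pure noise, plus a shift in the first 20 genes for the second half of samples
X_demo = rng.normal(size=(n_samples, n_genes))
group = np.repeat([0, 1], n_samples // 2)
X_demo[group == 1, :n_signal] += 4.0

Z = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=5, svd_solver="full").fit(Z)

# Row 0 of components_ holds the weight of every gene on PC1
loadings_pc1 = pca_demo.components_[0]
top_genes = np.argsort(np.abs(loadings_pc1))[::-1][:n_signal]
n_recovered = len(set(top_genes.tolist()) & set(range(n_signal)))

print("genes with the largest |loading| on PC1:", np.sort(top_genes))
print(f"planted signal genes among them: {n_recovered} of {n_signal}")
```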

---

## Step 6: Evaluate PCA Results

After performing PCA, it is important to evaluate the explained variance ratio for each principal component. This step includes plotting a scree plot to show how much variance is captured by each component and deciding the optimal number of components to retain.

The scree plot helps identify the "elbow point," where additional components contribute less to the total variance.

In [None]:
# Plot the explained variance ratio for each principal component
plt.figure(figsize=(6, 3))
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_, marker='o', linestyle='--')
plt.title("Scree Plot")
plt.xlabel("Principal Component")
plt.ylabel("Explained Variance Ratio")
plt.grid()
plt.show()

# Cumulative explained variance
cumulative_variance = pca.explained_variance_ratio_.cumsum()
plt.figure(figsize=(6, 3))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='--', color='orange')
plt.title("Cumulative Explained Variance")
plt.xlabel("Number of Principal Components")
plt.ylabel("Cumulative Variance Explained")
plt.axhline(y=0.90, color='r', linestyle='--', label="90% Variance")
plt.legend()
plt.grid()
plt.show()

**Questions:**

- **6.1.** What does the "elbow point" in the scree plot represent, and how does it guide the selection of principal components?
- **6.2.** Why is it important to consider cumulative explained variance when deciding the number of components to retain?
- **6.3.** How might retaining too few or too many principal components affect the downstream clustering results?
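
The cumulative curve can also be read off programmatically. The sketch below fits a full PCA on random noise (an assumption chosen so that the example runs offline), finds the smallest number of components whose cumulative explained variance exceeds 90 %, and compares that with what `PCA(n_components=0.90)` selects. Variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(72, 300))  # 72 samples x 300 features of pure noise

# Fit all components, then accumulate their explained variance ratios
pca_full = PCA(svd_solver="full").fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 0.90
n_for_90 = int(np.searchsorted(cumulative, 0.90, side="right") + 1)

# scikit-learn applies the same rule when n_components is a fraction
pca_90 = PCA(n_components=0.90, svd_solver="full").fit(X_demo)

print("components needed for 90 % variance:", n_for_90)
print("components chosen by PCA(n_components=0.90):", pca_90.n_components_)
```

With structureless noise, many components are needed because no single direction dominates; data with strong biological structure typically reach the threshold with far fewer.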

---

## Step 7: K‑Means Clustering on PCA Space

After reducing the data’s dimensionality, we use K-Means clustering to group patients based on their gene expression profiles in the PCA-transformed space. Clustering helps us identify potential subtypes or patterns that may correspond to clinically meaningful groups. By visualising the clusters on the first two principal components, we can assess whether distinct groups emerge and consider their possible biological or clinical significance. This step demonstrates how unsupervised learning can generate hypotheses about disease subtypes and inform precision medicine.

- *SciKit-Learn - KMeans:* https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

In [None]:
# Cluster in the reduced space (keep k small for resource limits)
k = 3

# n_init='auto' requires scikit-learn >= 1.2; on older versions pass an integer, e.g. n_init=10
kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto')
labels = kmeans.fit_predict(X_pca)

print("Cluster sizes:", numpy.bincount(labels))

# Visualise clusters
plt.figure()
for lab in range(k):
    plt.scatter(X_pca[labels==lab, 0], X_pca[labels==lab, 1], label=f"Cluster {lab}", s=20, alpha=0.7)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("K-Means clusters on PCA projection")
plt.legend()
plt.show()

**Questions:**

- **7.1.** What does this cluster composition suggest?
- **7.2.** What clinical decisions might hinge on distinguishing these clusters?
- **7.3.** How do you choose the number of clusters (k) in K-Means, and what are the risks of choosing too many or too few?
- **7.4.** What does it mean if clusters align (or do not align) with known clinical subtypes?
- **7.5.** How could clustering results inform clinical decisions or future research in precision medicine?
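
When reference labels are available (such as the ALL/AML annotation), one way to approach question 7.4 is to cross-tabulate clusters against those labels and compute the adjusted Rand index (ARI). The sketch below does this on synthetic blobs with known group membership. The data and names are assumptions for illustration, and the same calls could be applied to `y` and `labels` from the cells above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

# Synthetic data with a known grouping that stands in for clinical labels
X_demo, y_true = make_blobs(n_samples=120, n_features=20, centers=3, random_state=42)
labels_demo = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_demo)

# ARI: 1.0 for identical partitions, around 0 for random agreement
ari = adjusted_rand_score(y_true, labels_demo)

# Rows = known groups, columns = clusters (cluster numbering is arbitrary)
table = contingency_matrix(y_true, labels_demo)

print(f"adjusted Rand index: {ari:.3f}")
print(table)
```

A low ARI does not necessarily mean the clustering is useless: the clusters may reflect a different source of variation (batch effects, sample quality, or an unannotated subtype) that deserves its own investigation.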

---

## Step 8: Summary and Reflection

In this notebook, you explored the application of unsupervised machine learning to high-dimensional gene expression (omics) data—a cornerstone of modern precision medicine. Starting with the challenges of analysing thousands of genes across a small number of patients, you learned how dimensionality reduction (PCA) can reveal major biological patterns and make complex data more interpretable. Clustering in the reduced PCA space allowed you to identify potential patient subtypes, which may correspond to clinically meaningful groups and inform more personalised treatment strategies.

Throughout the workflow, you saw the importance of careful data preprocessing (such as normalisation), the power of visualisation for understanding hidden structure, and the limitations of unsupervised methods in biological discovery. You also reflected on how these computational techniques can drive advances in personalised healthcare, while recognising the need for biological validation and clinical context.

### Summary

- Omics datasets are high-dimensional and require dimensionality reduction for effective analysis.
- PCA helps uncover major biological signals and prepares data for clustering.
- K-Means clustering in PCA space can reveal potential disease subtypes.
- Visualisation is key to interpreting complex patterns in gene expression data.
- Unsupervised methods can generate hypotheses for precision medicine, but require clinical validation.

### What's next?

- **8.1.** How could you validate whether discovered clusters correspond to real clinical subtypes?
- **8.2.** What additional data (e.g., clinical outcomes, genetic mutations) could help interpret the clusters?
- **8.3.** How might these methods be extended to multi-omics or longitudinal data?
- **8.4.** What are the ethical and practical considerations when using unsupervised learning to guide clinical decisions?

---

## Explore Further

### Articles

- **Unsupervised machine learning for exploratory data analysis in imaging mass spectrometry**
<br>*Mass Spectrometry Reviews*
  - https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/10.1002/mas.21602

- **Dimension reduction techniques for the integrative analysis of multi-omics data**
<br>*Briefings in Bioinformatics*
  - https://academic.oup.com/bib/article/17/4/628/2240645

- **Clustering and visualization of single-cell RNA-seq data using path metrics**
<br>*PLOS Computational Biology*
  - https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012014