Merged
11 changes: 10 additions & 1 deletion _toc.yml
Original file line number Diff line number Diff line change
@@ -34,4 +34,13 @@ parts:
- file: tutorials/brain-disorder-diagnosis/extend-reading/data-config
title: Data & Config
- file: tutorials/drug-target-interaction/tutorial
- file: tutorials/multiomics-cancer-classification/tutorial
- file: tutorials/multiomics-cancer-classification/cancer-tutorial
sections:
- file: tutorials/multiomics-cancer-classification/extend-reading/data
title: Data
- file: tutorials/multiomics-cancer-classification/extend-reading/helper-functions
title: Helper Functions & Model Details
- file: tutorials/multiomics-cancer-classification/extend-reading/interpretation-study
title: Interpretation Study
- file: tutorials/multiomics-cancer-classification/extend-reading/extension-tasks
title: Extension Tasks
95 changes: 9 additions & 86 deletions tutorials/drug-target-interaction/tutorial.ipynb
@@ -77,24 +77,8 @@
"source": [
"The main package required for this tutorial is `PyKale`.\n",
"\n",
"`PyKale` is an open-source interdisciplinary machine learning library developed at the University of Sheffield, with a focus on applications in biomedical and scientific domains."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install --quiet git+https://github.com/pykale/pykale@main\\\n",
" && echo \"PyKale installed successfully ✅\" \\\n",
" || echo \"Failed to install PyKale ❌\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`PyKale` is an open-source interdisciplinary machine learning library developed at the University of Sheffield, with a focus on applications in biomedical and scientific domains.\n",
"\n",
"Then, we install `PyG` (PyTorch Geometric) and related packages.\n",
"\n",
"Please **do not** re-run this installation after it has completed. Running it multiple times will trigger issues related to `PyG`. If you want to re-run this installation, please click `Runtime` in the top menu and choose `Disconnect and delete runtime` before installing."
@@ -106,74 +90,13 @@
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import os\n",
"os.environ['TORCH'] = torch.__version__\n",
"!pip install -q torch-scatter -f https://data.pyg.org/whl/torch-${TORCH}.html\n",
"!pip install -q torch-sparse -f https://data.pyg.org/whl/torch-${TORCH}.html\n",
"\n",
"!pip install -q git+https://github.com/pyg-team/pytorch_geometric.git \\\n",
" && echo \"PyG installed successfully ✅\" \\\n",
" || echo \"Failed to install PyG ❌\"\n",
"\n",
"!pip install rdkit-pypi \\\n",
" && echo \"rdkit installed successfully ✅\" \\\n",
" || echo \"Failed to install rdkit ❌\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, we install other required packages in [`mmai-tutorials/requirements.txt`](https://github.com/pykale/mmai-tutorial/blob/main/requirements.txt)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%cd /content/mmai-tutorials/tutorials/drug-target-interaction\n",
"\n",
"!pip install --quiet -r /content/mmai-tutorials/requirements.txt \\\n",
" && echo \"Required packages installed successfully ✅\" \\\n",
" || echo \"Failed to install required packages ❌\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Please run the following block to reinstall `NumPy` to avoid bugs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"!pip install --upgrade --force-reinstall numpy==2.0.0\n",
"os.kill(os.getpid(), 9)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Install yacs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install yacs"
"!pip install --quiet \\\n",
" git+https://github.com/pykale/pykale@main \\\n",
" yacs==0.1.8 \\\n",
" torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric \\\n",
" -f https://data.pyg.org/whl/torch-2.6.0+cu124.html \\\n",
" && echo \"PyKale, yacs, and PyG packages installed successfully ✅\" \\\n",
" || echo \"Failed to install PyKale, yacs, or PyG packages ❌\""
]
},
{
551 changes: 551 additions & 0 deletions tutorials/multiomics-cancer-classification/cancer-tutorial.ipynb

Large diffs are not rendered by default.

@@ -0,0 +1,5 @@
A2ML1|144568
ABAT|18
ABCA13|154664
ABCC11|85320
ABCC8|6833
@@ -0,0 +1,5 @@
1.000000000000000000e+00
2.000000000000000000e+00
3.000000000000000000e+00
0.000000000000000000e+00
4.000000000000000000e+00
@@ -0,0 +1,5 @@
1.000000000000000000e+00
2.000000000000000000e+00
3.000000000000000000e+00
0.000000000000000000e+00
4.000000000000000000e+00

Large diffs are not rendered by default.

Large diffs are not rendered by default.

46 changes: 46 additions & 0 deletions tutorials/multiomics-cancer-classification/extend-reading/data.md
@@ -0,0 +1,46 @@
# Data

## Data Organization

To illustrate the format and organization of the input data, we provide example files in the [`data-example` folder](https://github.com/pykale/mmai-tutorials/blob/main/tutorials/multiomics-cancer-classification/data-example/).

For each modality (omics type), the raw data is split into five files:
- [`{}_feat_name.csv`](https://github.com/pykale/mmai-tutorials/blob/main/tutorials/multiomics-cancer-classification/data-example/1_feat_name.csv): Names of all features of the current modality. Each row is one feature.
- [`{}_lbl_te.csv`](https://github.com/pykale/mmai-tutorials/blob/main/tutorials/multiomics-cancer-classification/data-example/1_lbl_te.csv): Labels of the test set.
- [`{}_lbl_tr.csv`](https://github.com/pykale/mmai-tutorials/blob/main/tutorials/multiomics-cancer-classification/data-example/1_lbl_tr.csv): Labels of the training set.
- [`{}_te.csv`](https://github.com/pykale/mmai-tutorials/blob/main/tutorials/multiomics-cancer-classification/data-example/1_te.csv): Feature values for the test examples. Each row is one example and each column is one feature. In this tutorial, the values are normalized gene expression levels for mRNA and miRNA, and $\beta$ values for DNA methylation.
- [`{}_tr.csv`](https://github.com/pykale/mmai-tutorials/blob/main/tutorials/multiomics-cancer-classification/data-example/1_tr.csv): Feature values for the training examples. Each row is one example and each column is one feature. In this tutorial, the values are normalized gene expression levels for mRNA and miRNA, and $\beta$ values for DNA methylation.

The `{}` placeholder is the index of the modality, starting from 1. For example, with two modalities the files are named `1_feat_name.csv`, ..., `2_feat_name.csv`, ...

After organizing the data, don't forget to set `DATASET.NUM_MODALITIES` in the `.yaml` file to the number of modalities.
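
The naming scheme above can be sketched as a small loader. This is only an illustration of the file layout, not the loader used by `PyKale`: the helper names `modality_files` and `load_modality` are hypothetical, and the sketch assumes comma-separated numeric files like the ones in the example folder.

```python
import tempfile
from pathlib import Path

import numpy as np


def modality_files(data_dir, modality_index):
    """Return the five file paths for one modality (index starts at 1)."""
    data_dir = Path(data_dir)
    suffixes = ["feat_name", "lbl_te", "lbl_tr", "te", "tr"]
    return {s: data_dir / f"{modality_index}_{s}.csv" for s in suffixes}


def load_modality(data_dir, modality_index):
    """Load feature names, feature matrices, and labels for one modality."""
    files = modality_files(data_dir, modality_index)
    return {
        # One feature name per row.
        "feat_name": files["feat_name"].read_text().splitlines(),
        # Rows are examples, columns are features.
        "x_tr": np.loadtxt(files["tr"], delimiter=",", ndmin=2),
        "x_te": np.loadtxt(files["te"], delimiter=",", ndmin=2),
        # Labels are stored as floats (e.g. 1.0e+00), so cast them to int.
        "y_tr": np.loadtxt(files["lbl_tr"], delimiter=",", ndmin=1).astype(int),
        "y_te": np.loadtxt(files["lbl_te"], delimiter=",", ndmin=1).astype(int),
    }


# Write a tiny, made-up modality to a temporary folder and load it back.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "1_feat_name.csv").write_text("A2ML1|144568\nABAT|18\n")
    np.savetxt(tmp / "1_tr.csv", [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], delimiter=",")
    np.savetxt(tmp / "1_te.csv", [[0.7, 0.8]], delimiter=",")
    np.savetxt(tmp / "1_lbl_tr.csv", [1.0, 2.0, 0.0])
    np.savetxt(tmp / "1_lbl_te.csv", [4.0])
    data = load_modality(tmp, 1)

print(data["x_tr"].shape, data["y_tr"])
```

For a second modality, the same call with `modality_index=2` would look for `2_feat_name.csv`, `2_tr.csv`, and so on.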

## Description of Datasets in Tutorial

The BRCA and ROSMAP datasets are briefly described in the following tables.

**Table 1**: Characteristics of the preprocessed BRCA multiomics dataset.

| Omics | #Training samples | #Test samples | #Features |
|:----------------:|:-----------------:|:-------------:|:---------:|
| mRNA expression | 612 | 263 | 1000 |
| DNA methylation | 612 | 263 | 1000 |
| miRNA expression | 612 | 263 | 503 |

**Table 2**: Characteristics of the preprocessed ROSMAP multiomics dataset.

| Omics | #Training samples | #Test samples | #Features |
|:----------------:|:-----------------:|:-------------:|:----------:|
| mRNA expression | 245 | 106 | 200 |
| DNA methylation | 245 | 106 | 200 |
| miRNA expression | 245 | 106 | 200 |

## Data Downloading, Loading, and Pre-processing
Data downloading is integrated into `kale.loaddata.multiomics_datasets.SparseMultiomicsDataset`, which also loads the raw `.csv` files and pre-processes them into the graphs we need.

It transforms tabular multiomics data into graph datasets by treating each patient sample as a node and using molecular measurements as node features. For each modality, it reads feature matrices and labels from the `.csv` files and splits them into training and test sets.

Then, it builds similarity networks, as shown in the top figure: it computes sample similarities to define edges between nodes. These similarities are used to construct sparse adjacency matrices, where only the most relevant connections are retained.

Finally, each graph is represented with node features, edge connections, and labels, encapsulating both feature and relational structure for downstream graph-based learning.
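
To make the graph construction concrete, here is a minimal NumPy sketch of one way to turn a feature matrix into a sparse similarity graph. The choice of cosine similarity, the per-node top-k rule, and the names `cosine_similarity`, `sparse_adjacency`, and `edges_per_node` are assumptions made for illustration; the actual procedure in `SparseMultiomicsDataset` may differ.

```python
import numpy as np


def cosine_similarity(x):
    """Pairwise cosine similarity between the rows (samples) of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def sparse_adjacency(x, edges_per_node=2):
    """Keep, for every sample, only the edges to its most similar samples."""
    sim = cosine_similarity(x)
    ranked = sim.copy()
    np.fill_diagonal(ranked, -np.inf)  # ignore self-similarity when ranking
    n = sim.shape[0]
    # Column indices of the top-k most similar samples for each row.
    top = np.argsort(-ranked, axis=1)[:, :edges_per_node]
    rows = np.repeat(np.arange(n), edges_per_node)
    mask = np.zeros((n, n), dtype=bool)
    mask[rows, top.ravel()] = True
    mask |= mask.T  # symmetrise so the graph is undirected
    return np.where(mask, sim, 0.0)


# Four samples forming two obvious pairs.
x = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
adj = sparse_adjacency(x, edges_per_node=1)
print(np.round(adj, 3))
```

Each sample is a node, retained similarities are edge weights, and everything else in the matrix is zero, which is what makes the adjacency sparse.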
@@ -0,0 +1,9 @@
# Extension Tasks
## Task 1 - Unimodal vs. Multimodal
Set `cfg.DATASET.NUM_MODALITIES=1` to use only mRNA expression for prediction, and compare the results with those obtained using all three modalities.

## Task 2 - Try another dataset
To try the ROSMAP dataset, replace `"experiments/BRCA.yaml"` with `"experiments/ROSMAP.yaml"` in the following line under the Configuration section and run the pipeline again.
```python
cfg.merge_from_file("experiments/BRCA.yaml")
```
@@ -0,0 +1,34 @@
# Helper Functions and Model Definition

## Helper Functions

We provide helper functions that can be inspected directly in the `.py` files located in the notebook's current directory. Two additional helper scripts are:
- [`config.py`](https://github.com/pykale/mmai-tutorials/blob/main/tutorials/multiomics-cancer-classification/config.py): Defines the base configuration settings, which can be overridden using a custom `.yaml` file.
- [`model.py`](https://github.com/pykale/mmai-tutorials/blob/main/tutorials/multiomics-cancer-classification/model.py): Defines the network structure of MOGONET.

## Model Definition in `model.py`
The `MogonetModel` class in [`model.py`](https://github.com/pykale/mmai-tutorials/blob/main/tutorials/multiomics-cancer-classification/model.py) is built from `kale.embed` and `kale.predict` and wraps all the necessary components of the MOGONET pipeline based on the configuration.

This wrapper takes care of:
- Building GCN encoders for each omics modality.
- Creating linear classifiers for modality-specific outputs.
- Optionally initializing a VCDN decoder for multimodal fusion.

MOGONET consists of two major parts: modality-specific encoders that encode graph data into a latent space, and the View Correlation Discovery Network (VCDN) for multimodal feature fusion, as shown in the top figure.
The modality-specific encoders come from `kale.embed`, while the VCDN and the prediction head are integrated in `kale.predict`.

### Embedding Extraction

`PyKale` supports graph convolutional networks (GCNs) in `kale.embed`, which serve as the modality-specific encoders for MOGONET.
GCNs are neural networks that generalize convolution operations to graph-structured data by aggregating feature information from neighboring nodes.
With the GCN encoders in `PyKale`, we can encode the raw data into graph embeddings.
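
As a rough illustration of neighbor aggregation, the sketch below implements one textbook GCN propagation step, $H' = \mathrm{ReLU}(\hat{A} H W)$ with $\hat{A} = D^{-1/2}(A + I)D^{-1/2}$, in plain NumPy. It is not the `kale.embed` encoder: the function names and the random weights are placeholders, and the encoder used in the tutorial may differ in its layers and normalization.

```python
import numpy as np


def normalize_adjacency(adj):
    """Symmetric normalisation with self-loops: D^-1/2 (A + I) D^-1/2."""
    adj_hat = adj + np.eye(adj.shape[0])
    deg_inv_sqrt = 1.0 / np.sqrt(adj_hat.sum(axis=1))
    return adj_hat * deg_inv_sqrt[:, None] * deg_inv_sqrt[None, :]


def gcn_layer(adj_norm, features, weight):
    """One propagation step: aggregate neighbours, project, apply ReLU."""
    return np.maximum(adj_norm @ features @ weight, 0.0)


rng = np.random.default_rng(0)
# A 3-node path graph: node 1 is connected to nodes 0 and 2.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
features = rng.normal(size=(3, 4))  # 3 samples, 4 input features
weight = rng.normal(size=(4, 2))    # untrained weights, for illustration only
embeddings = gcn_layer(normalize_adjacency(adj), features, weight)
print(embeddings.shape)
```

Each row of `embeddings` mixes a node's own features with those of its neighbors, which is how the relational structure of the similarity graph enters the learned representation.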

### Prediction of Cancer Subtypes

`PyKale` supports prediction layers that output the final cancer subtype prediction from the input embeddings.
In `kale.predict`, the VCDN of MOGONET fuses modality-specific predictions for the final classification and captures correlations between modalities at the decision level.
In addition, MOGONET uses the linear classifier layer `kale.predict.decode.LinearClassifier`.
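
Decision-level fusion can be sketched as follows. In the MOGONET paper, the VCDN combines the per-modality class probabilities of a sample into a cross-modality tensor (an outer product) and flattens it before a small fully connected network. The snippet below shows only that tensor construction; the function name `cross_modality_tensor` and the probability values are made up for illustration, and the `kale.predict` implementation may be organized differently.

```python
import numpy as np


def cross_modality_tensor(probs_per_modality):
    """Outer product of per-modality class probabilities for one sample.

    With m modalities and c classes the flattened result has c**m entries.
    """
    tensor = np.asarray(probs_per_modality[0], dtype=float)
    for probs in probs_per_modality[1:]:
        tensor = np.multiply.outer(tensor, np.asarray(probs, dtype=float))
    return tensor.ravel()


# Made-up predictions for one sample from three modalities, two classes each.
probs = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4]]
fused_input = cross_modality_tensor(probs)
print(fused_input.shape, fused_input.sum())
```

Every entry is the product of one class probability from each modality, so the vector encodes how the modalities agree or disagree on each class combination.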

### Descriptions of Other APIs in [`model.py`](https://github.com/pykale/mmai-tutorials/blob/main/tutorials/multiomics-cancer-classification/model.py)

[`model.py`](https://github.com/pykale/mmai-tutorials/blob/main/tutorials/multiomics-cancer-classification/model.py) also calls `kale.pipeline.multiomics_trainer`, which provides `MultiomicsTrainer`, the training and evaluation engine that orchestrates how unimodal encoders and the multimodal VCDN fusion layer work together. It supports pretraining and full training regimes.
@@ -0,0 +1,40 @@
# Interpretation Study - Feature Masking Importance Analysis
To understand which multiomics features are the key biomarkers for cancer subtype classification and how the model makes its decisions, we perform an interpretation study.
We identify the most influential features (biomarkers) using **feature-masking-based importance analysis**.

We use `kale.interpret` for interpretation. It provides a function that systematically masks input features and observes the effect on performance, highlighting which features are most important for classification.

## How Is Feature Importance Computed?
The `select_top_features_by_masking` function in `PyKale` implements a feature ablation approach to estimate feature importance for multi-omics data.

For each feature in each modality:

- Temporarily mask (zero out) the feature.

- Evaluate the model on the test set.

- Measure the performance drop (e.g., in F1 score). The larger the drop, the more important the feature is.

- Importance is calculated as $\text{Importance}_j = (\text{FullMetric} - \text{MaskedMetric}_j) \times d$, where $j$ is the feature index and $d$ is the number of features in the modality (to scale the effect).

For demonstration, we use the **F1 score** as the metric to calculate feature importance.
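
The masking procedure can be sketched in a few lines of NumPy. This is a simplified, single-modality illustration of the formula above, not the `select_top_features_by_masking` implementation: `macro_f1`, `masking_importance`, and the toy `predict_fn` are hypothetical stand-ins for the metric and the trained model.

```python
import numpy as np


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))


def masking_importance(predict_fn, x, y, metric_fn=macro_f1):
    """Importance_j = (full metric - metric with feature j zeroed) * d."""
    d = x.shape[1]
    full = metric_fn(y, predict_fn(x))
    importance = np.zeros(d)
    for j in range(d):
        x_masked = x.copy()
        x_masked[:, j] = 0.0  # temporarily mask feature j
        importance[j] = (full - metric_fn(y, predict_fn(x_masked))) * d
    return importance


# Toy stand-in for a trained model: the label depends only on feature 0.
def predict_fn(x):
    return (x[:, 0] > 0.5).astype(int)


rng = np.random.default_rng(0)
x = rng.uniform(size=(200, 3))
y = predict_fn(x)
importance = masking_importance(predict_fn, x, y)
print(importance)
```

In this toy setting only feature 0 carries signal, so masking it lowers the F1 score while masking the other two features leaves it unchanged.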

## Full Results of the Interpretation Study

For reference, we list the most important features reported in the original paper:

**Table 3**: Important features in BRCA dataset.

| Omics | Important features |
|:----------------:|:---------------------------------------------:|
| mRNA expression | SOX11, AMY1A, SLC6A15, FABP7, SLC6A14, SLC6A2, FGFBP1, DSG1, UGT8, ANKRD45, PI3, SERPINB5, COL11A2, ARHGEF4, SOX10 |
| DNA methylation | GPR37L1, MIR563, OR1J4, ATP10B, KRTAP3-3, FLJ41941, TMEM207, CDH26, MT1DP |
| miRNA expression | hsa-mir-205, hsa-mir-187, hsa-mir-452, hsa-mir-20b, hsa-mir-224, hsa-mir-204 |

**Table 4**: Important features in ROSMAP dataset.

| Omics | Important features |
|:----------------:|:---------------------------------------------:|
| mRNA expression | NPNT, CDK18, KIF5A, SPACA6, TCEA3, SYTL1, ARRDC2, APLN |
| DNA methylation | TMC4, AGA, HYAL2, CCL3, TTC15 |
| miRNA expression | hsa-miR-423-3p, hsa-miR-33a, hsa-miR-640, hsa-miR-362-3p, hsa-miR-491-5p, hsa-miR-206, hsa-miR-548b-3p, hsa-miR-127-3p, hsa-miR-106a_hsa-miR-17, hsa-miR-424, hsa-miR-577, hsa-miR-873, hsa-miR-651, hsa-miR-199b-5p, hsa-miR-192, hsa-miR-199a-5p, hsv1-miR-H1 |
760 changes: 0 additions & 760 deletions tutorials/multiomics-cancer-classification/tutorial.ipynb

This file was deleted.