#####========================#####
#####        WTOT-matching v1.0 #####
#####        WTOT-coclust v1.0       #####
#####======================#####
# Application: This repository contains Python and R code to run the algorithms WTOT-matching and WTOT-coclust, as presented in the paper "Optimal transport-based machine learning to match specific patterns: application to the detection of molecular regulation patterns in omics data" by T. T. Y. Nguyen, W. Harchaoui, L. Mégret, C. Mendoza, O. Bouaziz, C. Neri, A. Chambaz (2024). The paper can be found .
# The aim of WTOT-matching and WTOT-coclust is to learn a pattern of correspondence between two datasets in situations where it is desirable to match elements that exhibit an affine relationship (our approach accommodates any relationship, not necessarily affine, as long as it can be parametrized). In the motivating case-study, the challenge is to better understand micro-RNA regulation in Huntington's disease model mice.
# The algorithms unfold in two stages. During the first stage, an optimal transport plan P and an optimal affine transformation are learned, using the Sinkhorn algorithm and a mini-batch gradient descent. During the second stage, P is exploited to derive either several co-clusters (WTOT-coclust) or several sets of matched elements (WTOT-matching).

# The Jupyter notebook `WTOT_MC_demo.ipynb` presents several illustrations. 

# The main files of the repository are:
# - `utils.py`: defines key functions used during the first stage of the algorithms to compute the optimal transport matrix *P*, the kernel, the mapping, the squared Euclidean distance and the best number of co-clusters;
# - `wtot.py`: the core code implementing the first stage of the algorithms;
# - `match_coclust.py`: the core code implementing the second stage of the algorithms.

# The folder `simulations` contains the code used to generate data for the experimental study presented in the paper. The folder `datasets` contains the miRNA and mRNA data obtained in the striatum and cortex of the HD model mice, as well as the results obtained by running WTOT-matching and WTOT-coclust. The file `sample_A4.npz` is a synthetic dataset generated in configuration A4 of the simulation study (see Section 5 of the paper).

#
# Version: WTOT-matching v1.0 ; WTOT-coclust v1.0
# Date: 15 April 2020
#
# Contributors (alphabetical order): O. Bouaziz (1), A. Chambaz (1), W. Harchaoui (1), L. Mégret (2), C. Mendoza (2), C. Neri (2), T. T. Y. Nguyen (1)
# Laboratory:
#   (1) MAP5, F-75006 Paris, France
#   (2) UMR CNRS 8256, Team Brain-C Lab, F-75005 Paris, France
#
# Affiliations:
#   (1) Université Paris Cité, CNRS
#   (2) Sorbonne Université, CNRS
#
#####=================================================================#####
#####       Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license          #####
#####=============================================================#####
#####               Copyright (C) T. T. Y. Nguyen, W. Harchaoui, L. Mégret, C. Mendoza, O. Bouaziz, C. Neri, A. Chambaz (1) #####
#####                       Christian Neri(christian.neri@inserm.fr) Antoine Chambaz(antoine.chambaz@u-paris.fr) 2024                               #####
#####========================#####
#      
#      This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 
#      International License. To view a copy of this license, visit 
#      http://creativecommons.org/licenses/by-nc-nd/4.0/ or send a letter to 
#      Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
#      
#####======================#####
#####=====================#####

# Optimal transport-based machine learning to match specific expression patterns in omics data


In this notebook, we will show how to use (a) optimal transport and (b) matching or co-clustering procedures to match two data sets.


## Imports and installs

In [5]:
# Ignore this cell if the corresponding packages are already installed

#!pip install coclust
#!pip install scikit-learn

In [6]:
import numpy as np
from wtot import wtot
from match_coclust import matching, SCC1_star, SCC1, SCC2_star, SCC2
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Example 1: Real data 

The real data set has kindly been made public by Langfelder et al. (see their articles published in [Nature Neuroscience](https://europepmc.org/article/med/26900923) and [PLOS ONE](https://pubmed.ncbi.nlm.nih.gov/29324753/)).

## Data loading

Load the real data, then convert them to matrices.

In [7]:
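# The files are tab-separated. Pass a single separator: recent pandas versions
# raise a ValueError when both `sep` and `delimiter` are given.
data_micro = pd.read_csv('./datasets/LFC_Cortex_mirna.txt', sep='\t')
data_mess = pd.read_csv('./datasets/LFC_Cortex_mrna.txt', sep='\t')

x = data_micro.values
y = data_mess.values

# Keep the first 200 rows and the first 3 columns of each matrix
x = x[:200, :3]
y = y[:200, :3]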

In [8]:
## Hyperparameters
m = 1
n = 3

## Algorithm WTOT_matching and WTOT_coclust
### First step ( WTOT_...)

We compute the optimal transport matrix, the optimal transformation and an estimator of the "weights" (see the paper).
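To give a rough idea of what an entropic optimal transport matrix is, here is a minimal NumPy sketch of Sinkhorn iterations. It is not the code of `wtot`: it assumes uniform marginals and a plain squared Euclidean cost, and it learns neither the transformation nor the "weights". The names `sinkhorn_plan`, `epsilon` and `n_iter` are hypothetical and chosen for this illustration only.

```python
import numpy as np

def sinkhorn_plan(x, y, epsilon=0.5, n_iter=500):
    """Entropic optimal transport plan between the rows of x and the rows of y.

    Illustration only: uniform marginals, squared Euclidean cost.
    """
    a = np.full(x.shape[0], 1.0 / x.shape[0])  # uniform mass on the rows of x
    b = np.full(y.shape[0], 1.0 / y.shape[0])  # uniform mass on the rows of y
    # Squared Euclidean distance between every row of x and every row of y
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    kernel = np.exp(-cost / epsilon)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (kernel.T @ u)  # rescale to match the column marginal
        u = a / (kernel @ v)    # rescale to match the row marginal
    return u[:, None] * kernel * v[None, :]

rng = np.random.default_rng(0)
x_toy = rng.normal(size=(5, 3))  # made-up data
y_toy = rng.normal(size=(7, 3))  # made-up data
P_toy = sinkhorn_plan(x_toy, y_toy)
print(P_toy.shape)  # (5, 7)
```

Because the last update is on `u`, the rows of `P_toy` sum to 1/5 by construction, while the column sums approach 1/7 as the number of iterations grows.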

In [9]:
results = wtot(x, y, m=m, n=n, batch_size_x=64, batch_size_y=64)

# Optimal transport matrix, optimal transformation and estimated "weights"
pi_np = results['P'] 
theta = results['theta']
w = results['w']

### Second step 
#### Matching

In [10]:
results_match = matching(pi_np)

In [11]:
# the collection calM
N_m = results_match['N_m']
print('The indices of the miRNAs associated to the first mRNA of the list:', N_m[0], '.\n')
# the collection calN
M_n = results_match['M_n']
print('The indices of the mRNAs associated to the first miRNA of the list:', M_n[0], '.\n')

The indices of the miRNAs associated to the first mRNA of the list: {0, 100, 71, 72, 105, 169, 171, 12, 177, 122, 156} .

The indices of the mRNAs associated to the first miRNA of the list: [0, 13, 45, 47, 63, 64, 66, 70, 78, 79, 101, 119, 141, 155, 167, 173, 176, 177, 195] .
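Judging from the printed output, `N_m` and `M_n` are collections of index collections (sets or lists). Under that assumption, a quick way to summarise a matching is to count how many indices are attached to each element and to list the elements left unmatched. The helpers `match_counts` and `unmatched` below are hypothetical, and the toy collection is made up.

```python
# Made-up stand-in for results_match['N_m']: one collection of indices per element
N_m_toy = [{0, 2}, {1}, set()]

def match_counts(collection):
    """Number of matched indices attached to each element."""
    return [len(indices) for indices in collection]

def unmatched(collection):
    """Positions of the elements that received no match at all."""
    return [i for i, indices in enumerate(collection) if len(indices) == 0]

print(match_counts(N_m_toy))  # [2, 1, 0]
print(unmatched(N_m_toy))     # [2]
```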



#### Co-clustering

In [12]:
### WTOT-SCC1
SCC1_res = SCC1(pi_np)

### WTOT-SCC1*
SCC1_star_res = SCC1_star(pi_np, 4)

### WTOT-SCC2
SCC2_res = SCC2(pi_np)

### WTOT-SCC2*
SCC2_star_res = SCC2_star(pi_np, 4)
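One possible way to check a co-clustering against the transport matrix is to measure how much of the mass of P falls in each (row cluster, column cluster) block. The sketch below assumes that cluster labels are available as one integer per row and one per column. This is an assumption, not the documented return format of `SCC1`, `SCC1_star`, `SCC2` or `SCC2_star`, and `cocluster_mass` is a hypothetical helper applied to a made-up matrix.

```python
import numpy as np

def cocluster_mass(P, row_labels, col_labels):
    """Share of the total mass of P carried by each (row cluster, column cluster) block."""
    row_labels = np.asarray(row_labels)
    col_labels = np.asarray(col_labels)
    n_row = row_labels.max() + 1
    n_col = col_labels.max() + 1
    mass = np.zeros((n_row, n_col))
    for k in range(n_row):
        for l in range(n_col):
            # Sub-matrix of P restricted to row cluster k and column cluster l
            mass[k, l] = P[np.ix_(row_labels == k, col_labels == l)].sum()
    return mass / P.sum()

# Made-up transport matrix with two clear blocks
P_blocks = np.array([[0.2, 0.2, 0.0, 0.0],
                     [0.1, 0.1, 0.0, 0.0],
                     [0.0, 0.0, 0.2, 0.2]])
mass_toy = cocluster_mass(P_blocks, row_labels=[0, 0, 1], col_labels=[0, 0, 1, 1])
print(mass_toy)
```

A co-clustering that fits P well concentrates the mass on a few blocks. In this toy case all of it lies on the diagonal blocks.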


# Example 2: Synthetic data

We now present an illustration based on simulated data.

## Data simulation

In [13]:
### an example of synthetic data
datas = np.load('./datasets/sample_A4.npz', allow_pickle = True) # the configuration A4 of the first simulation study
datas = datas['dats']

id_sample = 1
x         = datas[id_sample]['x']
y         = datas[id_sample]['y']
labels_x  = datas[id_sample]['labels_x']
labels_y  = datas[id_sample]['labels_y']
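The indexing above suggests that `dats` is an array of dictionaries, one per simulated sample. NumPy stores such an array with `dtype=object`, which is why `allow_pickle = True` is needed when loading it. The sketch below reproduces this pattern on made-up data kept in memory. The keys mirror those used above, but the content and sizes are invented for the illustration.

```python
import io
import numpy as np

# Two made-up "samples", each stored as a dictionary
samples = np.empty(2, dtype=object)
samples[0] = {'x': np.zeros((3, 2)), 'labels_x': np.array([0, 0, 1])}
samples[1] = {'x': np.ones((4, 2)), 'labels_x': np.array([1, 0, 1, 1])}

# Save to an in-memory buffer instead of a file on disk
buffer = io.BytesIO()
np.savez(buffer, dats=samples)
buffer.seek(0)

# Object arrays are pickled, so loading them requires allow_pickle=True
loaded = np.load(buffer, allow_pickle=True)['dats']
print(loaded[1]['x'].shape)  # (4, 2)
```

Only enable `allow_pickle` for files from a trusted source, since unpickling can execute arbitrary code.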

### First step ( WTOT_...)
We compute the optimal transport matrix, the optimal transformation and an estimator of the "weights".

In [14]:
results = wtot(x, y, m=2, n=1, batch_size_x=64, batch_size_y=64)

# value of the optimal transport matrix, the optimal transformation, and the "weights"
pi_np = results['P'] 
theta = results['theta']
w = results['w']

### Second step 
#### Matching

In [15]:
results_match = matching(pi_np)

In [16]:
# the collection of calM
N_m = results_match['N_m']
print('The indices of the columns associated to the first row:', N_m[0], '.\n')
# the collection of calN
M_n = results_match['M_n']
print('The indices of rows associated to the first column', M_n[0], '.\n')

The indices of the columns associated to the first row: {290, 294, 166, 204, 205, 239, 112, 50, 243, 277, 86, 22, 249, 187} .

The indices of rows associated to the first column [74, 110, 111, 145, 178, 211, 278, 286, 293] .
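With simulated data, the ground-truth labels make it possible to check a matching. The sketch below computes, for each row, the fraction of its matched columns that carry the same label. It assumes that a row and a column belong together when their labels are equal, which should be checked against the simulation design. `label_agreement` is a hypothetical helper and the toy inputs are made up.

```python
def label_agreement(matches, labels_rows, labels_cols):
    """For each row, fraction of matched columns sharing the row's label (None if unmatched)."""
    scores = []
    for i, cols in enumerate(matches):
        cols = list(cols)
        if not cols:
            scores.append(None)  # nothing matched to this row
            continue
        same = sum(labels_cols[j] == labels_rows[i] for j in cols)
        scores.append(same / len(cols))
    return scores

# Made-up matching: row 0 is matched to columns 0 and 1, row 1 to column 2, row 2 to nothing
matches_toy = [{0, 1}, {2}, set()]
scores_toy = label_agreement(matches_toy, labels_rows=[0, 1, 1], labels_cols=[0, 1, 1])
print(scores_toy)  # [0.5, 1.0, None]
```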



#### Co-clustering

In [17]:

### WTOT-SCC1
SCC1_res = SCC1(pi_np)
### WTOT-SCC1*
SCC1_star_res = SCC1_star(pi_np, 4)
### WTOT-SCC2
SCC2_res = SCC2(pi_np)
### WTOT-SCC2*
SCC2_star_res = SCC2_star(pi_np, 4)