
# Optimal transport-based machine learning to match specific expression patterns in omics data

In this notebook, we show how to use optimal transport, followed by matching or co-clustering, to match two data sets.

The methods we use are described in the following paper:

T. T. Y. Nguyen, O. Bouaziz, W. Harchaoui, C. Neri, A. Chambaz, [Optimal transport-based machine learning to match specific expression patterns in omics data](https://arxiv.org/pdf/2107.11192.pdf)


## Imports and installs

In [None]:
# Ignore this cell if the corresponding packages are already installed

#!pip install coclust
#!pip install scikit-learn

In [1]:
import numpy as np
from wtot import wtot
from match_coclust import matching, SCC1_star, SCC1, SCC2_star, SCC2
import pandas as pd




# Example 1: Real data 
## Data loading

The code below assumes that a "datasets/" folder exists in the current directory.

We load the real data, then convert them to matrices.

In [2]:
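# Both files are read as tab-separated. Only one separator argument is passed,
# because recent pandas versions raise an error when sep and delimiter are both given.
data_micro = pd.read_csv('./datasets/LFC_Cortex_mirna.txt', sep='\t')
data_mess = pd.read_csv('./datasets/LFC_Cortex_mrna.txt', sep='\t')

# Convert the data frames to NumPy arrays
x = data_micro.values
y = data_mess.values

# Keep the first 200 rows and the first 3 columns of each matrix
x = x[:200, :3]
y = y[:200, :3]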

In [3]:
## Hyperparameters
m = 1
n = 3

## Algorithms WTOT_matching and WTOT_coclust
### The first step (WTOT_...)
We compute the optimal transport matrix, the optimal transformation and an estimator of the weight.

In [4]:
# Reuse the hyperparameters m and n defined in the previous cell
results = wtot(x, y, m=m, n=n, batch_size_x=64, batch_size_y=64)

# OT matrix, optimal transformation and estimated weight
pi_np = results['P'] 
theta = results['theta']
w = results['w']
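
To make the notion of an optimal transport matrix more concrete, here is a minimal sketch of an entropy-regularised transport plan between the rows of two small matrices, computed with Sinkhorn iterations in plain NumPy. This is a generic textbook technique and not the procedure implemented in `wtot`: there is no transformation and no weight, the marginals are assumed uniform, and the function name `sinkhorn_plan` and its parameters `reg` and `n_iter` are hypothetical.

```python
import numpy as np

def sinkhorn_plan(x, y, reg=0.1, n_iter=200):
    """Entropy-regularised OT plan between the rows of x and the rows of y.

    Illustrative only: uniform marginals, squared Euclidean cost.
    """
    # Squared Euclidean cost between every row of x and every row of y
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    cost = cost / cost.max()  # rescale so that reg is on a comparable scale
    kernel = np.exp(-cost / reg)

    a = np.full(x.shape[0], 1.0 / x.shape[0])  # uniform mass on the rows of x
    b = np.full(y.shape[0], 1.0 / y.shape[0])  # uniform mass on the rows of y

    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (kernel.T @ u)  # rescale columns towards marginal b
        u = a / (kernel @ v)    # rescale rows towards marginal a
    return u[:, None] * kernel * v[None, :]

rng = np.random.default_rng(0)
x_toy = rng.normal(size=(5, 3))
y_toy = rng.normal(size=(7, 3))
plan = sinkhorn_plan(x_toy, y_toy)
print(plan.shape)
print(plan.sum(axis=1))  # row sums equal the uniform marginal 1/5
```

Entry `(i, j)` of such a plan is the mass sent from row `i` of the first matrix to row `j` of the second, which is the kind of object the second step works on.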

### The second step 
#### Matching

In [6]:
results_match = matching(pi_np)

In [7]:
# The collection calM: N_m[0] is the set of columns associated with the first row
N_m = results_match['N_m']
print('Columns associated with the first row:', N_m[0])
# The collection calN: M_n[0] is the set of rows associated with the first column
M_n = results_match['M_n']
print('Rows associated with the first column:', M_n[0])

Columns associated with the first row: {0, 71, 72, 169, 12, 177, 122, 156}
Rows associated with the first column: [0, 2, 13, 45, 47, 64, 66, 70, 78, 79, 101, 119, 155, 167, 173, 176, 195]
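
As an illustration of how sets of associated rows and columns can be read off a transport matrix, the sketch below keeps the entries above a global quantile and collects, for each row, the columns that survive (and vice versa). The thresholding rule, the function name `threshold_matching` and the `quantile` parameter are assumptions made for this example. The rule used inside `matching` may well be different.

```python
import numpy as np

def threshold_matching(plan, quantile=0.9):
    """Hypothetical rule: keep the entries of plan at or above a global quantile."""
    keep = plan >= np.quantile(plan, quantile)
    # For each row, the set of columns whose entry was kept
    cols_per_row = [set(np.flatnonzero(row).tolist()) for row in keep]
    # For each column, the set of rows whose entry was kept
    rows_per_col = [set(np.flatnonzero(col).tolist()) for col in keep.T]
    return cols_per_row, rows_per_col

toy_plan = np.array([[0.30, 0.02, 0.01],
                     [0.03, 0.25, 0.04],
                     [0.02, 0.03, 0.30]])
cols_per_row, rows_per_col = threshold_matching(toy_plan, quantile=0.6)
print(cols_per_row)  # [{0}, {1, 2}, {2}]
print(rows_per_col)  # [{0}, {1}, {1, 2}]
```

A row may be associated with several columns, or with none, which matches the shape of the output printed above.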


#### Co-clustering

In [8]:
# WTOT-SCC1
SCC1_res = SCC1(pi_np)

# WTOT-SCC1*
SCC1_star_res = SCC1_star(pi_np, 4)

# WTOT-SCC2
SCC2_res = SCC2(pi_np)

# WTOT-SCC2*
SCC2_star_res = SCC2_star(pi_np, 4)
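
One common way to inspect a co-clustering is to reorder the rows and columns of the matrix so that members of the same cluster sit next to each other, which makes the block structure visible. The sketch below does this with NumPy on a toy matrix. The structure of the values returned by the `SCC*` functions is not shown in this notebook, so the label arrays here are hypothetical, as is the helper name `reorder_by_clusters`.

```python
import numpy as np

def reorder_by_clusters(matrix, row_labels, col_labels):
    """Permute rows and columns so that each cluster forms a contiguous block."""
    row_order = np.argsort(row_labels, kind="stable")
    col_order = np.argsort(col_labels, kind="stable")
    return matrix[np.ix_(row_order, col_order)], row_order, col_order

toy = np.array([[0.0, 0.4, 0.0, 0.1],
                [0.3, 0.0, 0.2, 0.0],
                [0.0, 0.5, 0.0, 0.2]])
row_labels = np.array([1, 0, 1])     # hypothetical row cluster labels
col_labels = np.array([0, 1, 0, 1])  # hypothetical column cluster labels

blocks, row_order, col_order = reorder_by_clusters(toy, row_labels, col_labels)
print(blocks)
```

The reordered matrix can then be passed to any heatmap routine to obtain a picture of the co-clusters.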


# Example 2: Synthetic data
## Data simulation

In [9]:
# An example of synthetic data: configuration A4 of the first simulation study
datas = np.load('./datasets/sample_A4.npz', allow_pickle=True)
datas = datas['dats']

# Select one simulated sample and extract the two matrices and their labels
id_sample = 1
x = datas[id_sample]['x']
y = datas[id_sample]['y']
labels_x = datas[id_sample]['labels_x']
labels_y = datas[id_sample]['labels_y']

### The first step (WTOT_...)
We compute the optimal transport matrix, the optimal transformation and an estimator of the weight.

In [10]:
results = wtot(x, y, m=2, n=1, batch_size_x=64, batch_size_y=64)

# OT matrix, optimal transformation and estimated weight
pi_np = results['P'] 
theta = results['theta']
w = results['w']

### The second step 
#### Matching

In [11]:
results_match = matching(pi_np)

In [12]:
# The collection calM: N_m[0] is the set of columns associated with the first row
N_m = results_match['N_m']
print('Columns associated with the first row:', N_m[0])
# The collection calN: M_n[0] is the set of rows associated with the first column
M_n = results_match['M_n']
print('Rows associated with the first column:', M_n[0])

Columns associated with the first row: {290, 228, 294, 166, 200, 170, 204, 239, 112, 50, 243, 20, 52, 277, 151}
Rows associated with the first column: [99, 103, 111, 145, 178, 211, 215, 220, 286]


#### Co-clustering

In [13]:

# WTOT-SCC1
SCC1_res = SCC1(pi_np)

# WTOT-SCC1*
SCC1_star_res = SCC1_star(pi_np, 4)

# WTOT-SCC2
SCC2_res = SCC2(pi_np)

# WTOT-SCC2*
SCC2_star_res = SCC2_star(pi_np, 4)
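
Because the synthetic sample comes with `labels_x` and `labels_y`, one possible sanity check is to measure how often the columns matched to a row carry the same label as that row. The sketch below is only a suggestion. It assumes that the matching is available as a sequence of sets of column indices (as `N_m[0]` appears to be in the output above) and that the labels are one-dimensional arrays indexed like the rows and columns of the transport matrix. The helper name `matching_label_agreement` is hypothetical.

```python
import numpy as np

def matching_label_agreement(cols_per_row, labels_rows, labels_cols):
    """For each row, fraction of its matched columns that share the row's label.

    Rows without any matched column get NaN.
    """
    labels_rows = np.asarray(labels_rows)
    labels_cols = np.asarray(labels_cols)
    scores = np.full(len(cols_per_row), np.nan)
    for i, cols in enumerate(cols_per_row):
        idx = np.fromiter(cols, dtype=int)
        if idx.size:
            scores[i] = np.mean(labels_cols[idx] == labels_rows[i])
    return scores

# Toy input: three rows, three columns, two label values
toy_matching = [{0, 1}, {2}, set()]
toy_labels_rows = [0, 1, 1]
toy_labels_cols = [0, 1, 1]
scores = matching_label_agreement(toy_matching, toy_labels_rows, toy_labels_cols)
print(scores)
```

Under the assumptions above, the same function could be called as `matching_label_agreement(N_m, labels_x, labels_y)`, and `np.nanmean` of the result would give a single summary figure.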