# Imports

Before running this example notebook, make sure you have downloaded the entire repository and the data files listed in README.md.

In [5]:
from model_class import Alligner, DataAccessor
import numpy as np
import torch
import pandas as pd

# Create an Alligner

Create an `Alligner` instance by passing it a dictionary of training parameters.

In [2]:
parameters = {
    "vae_learning_rate": 5e-3,  # Learning rate for the VAE (Variational Autoencoder)
    "adv_learning_rate": 5e-3,  # Learning rate for the adversarial network
    "epochs": 50,  # Number of training epochs
    "n_genes": 18178,  # Number of genes in the dataset
    "encoded_dim": 64,  # Dimensionality of the encoded representation
    "n_sources": 2,  # Number of data sources
    "src_weights_src_adv": torch.tensor([0.5, 1.]),  # Weights for source adversarial training
    "kl_weight": 1e-5,  # Weight for the KL divergence loss
    "src_adv_weight": 0.01,  # Weight for the source adversarial loss
    'batch_size': 32,  # Batch size for training
    "early_stopping_patience": 30,  # Patience for early stopping
    'train_ratio': 0.9,  # Ratio of training data
    'val_ratio': 0.1,  # Ratio of validation data
    'test_ratio': 0.,  # Ratio of test data (set to 0 if no test set is used)
    'Datasets': ['TCGA', 'ACH'],  # List of datasets to use
    'selected_gpu': 3,  # Choose the GPU index you want to use
    'space_to_project_into': 'TCGA',  # Space into which data is projected
    'enc_hidden_layers': (256, 128),  # Hidden layer sizes for the encoder
    'enc_dropouts': (0.1, 0.1),  # Dropout rates for the encoder
    'dec_hidden_layers': (128, 256),  # Hidden layer sizes for the decoder
    'dec_dropouts': (0.1, 0.1),  # Dropout rates for the decoder
    'mlp_hidden_layers': (64, 64),  # Hidden layer sizes for the MLP (Multi-Layer Perceptron)
    'mlp_dropouts': None,  # Dropout rates for the MLP (None indicates no dropout)
    'batch_norm_mlp': True,  # Whether to use batch normalization in the MLP
    'softmax_mlp': True,  # Whether to use softmax activation in the MLP
    'batch_norm_enc': True,  # Whether to use batch normalization in the encoder
    'batch_norm_dec': True,  # Whether to use batch normalization in the decoder
    'optimizer_name_src_adv': 'Adam',  # Optimizer name for the source adversarial network
    'optimizer_name_BatchAE': 'Adam',  # Optimizer name for the Batch AutoEncoder
    'discrepancy': 'KLD',  # Discrepancy measure (e.g., KLD for Kullback-Leibler Divergence)
    'discrepancy_weight': 1e-5,  # Weight for the discrepancy loss
    'missing_gene_percentage': 0.5,  # Percentage of missing genes in the dataset
}


alligner = Alligner(parameters)
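Several of these settings have to agree with each other: the three split ratios presumably sum to 1, and each tuple of hidden layers is paired with a tuple of dropout rates. The sketch below is a small sanity check that could be run before constructing the `Alligner`. `check_parameters` is a hypothetical helper, not part of `model_class`, and the rules it enforces are assumptions inferred from the parameter names.

```python
import math

def check_parameters(params):
    """Return a list of human-readable problems found in a parameter dict.

    Hypothetical helper: the rules below are assumptions inferred from the
    parameter names, not checks performed by the Alligner class itself.
    """
    problems = []

    # Assumption: the train/val/test ratios partition the data set.
    total = params["train_ratio"] + params["val_ratio"] + params["test_ratio"]
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        problems.append(f"split ratios sum to {total}, expected 1.0")

    # Assumption: one dropout rate per hidden layer (None means no dropout).
    for prefix in ("enc", "dec", "mlp"):
        layers = params[f"{prefix}_hidden_layers"]
        dropouts = params[f"{prefix}_dropouts"]
        if dropouts is not None and len(dropouts) != len(layers):
            problems.append(f"{prefix}: {len(layers)} layers but {len(dropouts)} dropouts")

    # Assumption: the projection target must be one of the listed datasets.
    if params["space_to_project_into"] not in params["Datasets"]:
        problems.append("space_to_project_into is not in Datasets")

    # Assumption: n_sources counts the entries of Datasets.
    if params["n_sources"] != len(params["Datasets"]):
        problems.append("n_sources does not match len(Datasets)")

    return problems


# A reduced copy of the relevant entries from the dictionary above.
example = {
    "train_ratio": 0.9, "val_ratio": 0.1, "test_ratio": 0.0,
    "enc_hidden_layers": (256, 128), "enc_dropouts": (0.1, 0.1),
    "dec_hidden_layers": (128, 256), "dec_dropouts": (0.1, 0.1),
    "mlp_hidden_layers": (64, 64), "mlp_dropouts": None,
    "Datasets": ["TCGA", "ACH"], "space_to_project_into": "TCGA",
    "n_sources": 2,
}
print(check_parameters(example))
```

With the values used in this notebook the helper reports no problems. Changing, say, `val_ratio` to 0.2 would add a message about the split ratios.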

# Get the data

### Data Access with `DataAccessor`

To access the data, we use the `DataAccessor` class. Given the original data files (CCLE transcriptomic data, TCGA transcriptomic data, the TCGA projects file, CCLE sample metadata, TCGA sample metadata, and the file listing the L1000 genes), it stores everything as pandas DataFrames that are easy to access.

Here is the list of methods and attributes associated with this class:

- **`self.fit(ccle_data_path, tcga_data_path, tcga_projects_path, ccle_metadata_file_path, tcga_metadata_file_path, L1000_file_path)`**:
  - The first method to call after creating an instance.
  - It prepares all the data for later use.

- **`self.n_genes`**:
  - The number of genes shared by the two sample sets (CCLE and TCGA).

- **`self.ccle_final`**:
  - A pandas DataFrame where the last `n_genes` columns are the transcriptomics data for CCLE samples, and the first columns are the metadata (the first two columns being the source and the ID of the sample).

- **`self.tcga_final`**:
  - A pandas DataFrame where the last `n_genes` columns are the transcriptomics data for TCGA samples, and the first columns are the metadata (the first two columns being the source and the ID of the sample).

- **`self.get_features(panda_dataframe, device, standardize=False)`**:
  - Takes a pandas DataFrame such as `ccle_final` or `tcga_final` and returns its transcriptomic data as a PyTorch tensor of shape `n_samples x n_genes`.
  - Places the tensor on the specified device and optionally standardizes the data column-wise.
  - Preserves the order of the genes in the tensor.
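
As a rough illustration of what `get_features` is described as doing, the sketch below slices the last `n_genes` columns of a DataFrame and optionally standardizes each column. It is not the repository's implementation: `features_sketch` and the toy frame are hypothetical, it returns a NumPy array rather than a PyTorch tensor, and the handling of zero-variance columns is an assumption.

```python
import numpy as np
import pandas as pd

def features_sketch(frame, n_genes, standardize=False):
    """Extract the last n_genes columns as a float32 array (n_samples x n_genes)."""
    # Positional slicing keeps the genes in their original column order.
    values = frame.iloc[:, -n_genes:].to_numpy(dtype=np.float32)
    if standardize:
        mean = values.mean(axis=0, keepdims=True)
        std = values.std(axis=0, keepdims=True)
        # Assumption: leave constant columns centered instead of dividing by zero.
        std[std == 0] = 1.0
        values = (values - mean) / std
    return values

# Toy frame laid out like ccle_final: metadata first, genes last (made-up values).
toy = pd.DataFrame({
    "Source": ["CCLE", "CCLE", "CCLE"],
    "ModelID": ["ACH-1", "ACH-2", "ACH-3"],
    "GENE_A": [1.0, 2.0, 3.0],
    "GENE_B": [5.0, 5.0, 5.0],
})

raw = features_sketch(toy, n_genes=2)
scaled = features_sketch(toy, n_genes=2, standardize=True)
print(raw.shape)
print(scaled.mean(axis=0))
```

To get a tensor on a given device from such an array, one option is `torch.from_numpy(values).to(device)`.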

In [3]:
with open('ach_gdsc_indexes.txt', 'r') as file:
    # Read each line and split it into a list of strings
    ach_gdsc_indexes = file.read().splitlines()

test_data_accessor = DataAccessor()

test_data_accessor.fit(
    ccle_data_path="OmicsExpressionProteinCodingGenesTPMLogp1.csv",
    tcga_data_path="TumorCompendium_v11_PolyA_hugo_log2tpm_58581genes_2020-04-09.tsv",
    tcga_projects_path="TCGA_Projects.csv",
    ccle_metadata_file_path="Model.csv",
    tcga_metadata_file_path="clinical.tsv",
    L1000_file_path="geneinfo_beta.txt",
)

df = test_data_accessor.ccle_final
ccle_final_gdsc = df[df['ModelID'].isin(ach_gdsc_indexes)]
ccle_final_not_in_gdsc = df[~df['ModelID'].isin(ach_gdsc_indexes)]


device = torch.device(f"cuda:{parameters['selected_gpu']}")
ccle_features = test_data_accessor.get_features(test_data_accessor.ccle_final, device, standardize=False)
tcga_features = test_data_accessor.get_features(test_data_accessor.tcga_final, device, standardize=False)
ccle_features_gdsc = test_data_accessor.get_features(ccle_final_gdsc, device, standardize=False)
ccle_features_not_in_gdsc = test_data_accessor.get_features(ccle_final_not_in_gdsc, device, standardize=False)
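
The cell above hard-codes `cuda:{selected_gpu}`, which only works on a machine that has a GPU with that index. One hedged alternative is to fall back to the CPU when it does not. `pick_device` is a hypothetical helper, not part of the repository, and whether the `Alligner` itself (which receives `selected_gpu` through its parameters) can run on the CPU is not verified here.

```python
import torch

def pick_device(selected_gpu):
    """Return cuda:<selected_gpu> if that GPU exists, otherwise the CPU."""
    if torch.cuda.is_available() and 0 <= selected_gpu < torch.cuda.device_count():
        return torch.device(f"cuda:{selected_gpu}")
    return torch.device("cpu")

device = pick_device(3)  # 3 mirrors parameters['selected_gpu'] above
print(device)
```

The resulting `device` can be passed to `get_features` exactly as in the cell above.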


# Training

In [4]:
alligner.fit(ccle_features_gdsc, ccle_features_not_in_gdsc, tcga_features, test_data_accessor.must_keep_indices)

Create not corrupted datasets...
Done
Create corrupted datasets...
Done
Concatenate datasets...
Done
Splits indices beteween train and val...
Done
Create samplers...
Done
Create Datloaders...
Done
Create Optimizers...
Done
Starting training...
Epoch 1/50 - Train AE Loss: 4.0692, Train Adv Loss: 0.4503, Train Total Loss: 4.0650
Epoch 2/50 - Train AE Loss: 0.9569, Train Adv Loss: 0.4153, Train Total Loss: 0.9527
Epoch 3/50 - Train AE Loss: 0.8702, Train Adv Loss: 0.4142, Train Total Loss: 0.8661
Epoch 4/50 - Train AE Loss: 0.8361, Train Adv Loss: 0.4071, Train Total Loss: 0.8321
Epoch 5/50 - Train AE Loss: 0.8027, Train Adv Loss: 0.4018, Train Total Loss: 0.7987
Epoch 6/50 - Train AE Loss: 0.7996, Train Adv Loss: 0.4025, Train Total Loss: 0.7956
Epoch 7/50 - Train AE Loss: 0.7757, Train Adv Loss: 0.4047, Train Total Loss: 0.7717
Epoch 8/50 - Train AE Loss: 0.7683, Train Adv Loss: 0.3964, Train Total Loss: 0.7644
Epoch 9/50 - Train AE Loss: 0.7413, Train Adv Loss: 0.3951, Train Total Loss

# Align

In [8]:
ccle_final_copy = test_data_accessor.ccle_final.copy()
ccle_final_copy.iloc[:, -test_data_accessor.n_genes:] = alligner.alligne(ccle_features, space_to_project_into='TCGA')

tcga_final_copy = test_data_accessor.tcga_final.copy()
tcga_final_copy.iloc[:, -test_data_accessor.n_genes:] = alligner.alligne(tcga_features, space_to_project_into='TCGA')


# Remap the data with the metadata and save it

In [9]:
# ----------- For TCGA samples ---------------------

# Drop the 'case_id' column before selecting the metadata columns
tcga_final_copy.drop('case_id', axis=1, inplace=True)

# Select the first two columns containing the metadata
first_two_columns = tcga_final_copy.iloc[:, :2]

# Select the last columns containing the transcriptomic data
last_columns = tcga_final_copy.iloc[:, -test_data_accessor.n_genes:]

# Concatenate the selected columns into a new DataFrame
new_tcga_final_copy = pd.concat([first_two_columns, last_columns], axis=1)
new_tcga_final_copy = new_tcga_final_copy.rename(columns={'case_submitter_id': 'index'})  # Adjust column names



# ----------- For CCLE samples ---------------------

# Select the first two columns containing the metadata
first_two_columns = ccle_final_copy.iloc[:, :2]

# Select the last columns containing the transcriptomic data
last_columns = ccle_final_copy.iloc[:, -test_data_accessor.n_genes:]

# Concatenate the selected columns into a new DataFrame
new_ccle_final_copy = pd.concat([first_two_columns, last_columns], axis=1)
new_ccle_final_copy = new_ccle_final_copy.rename(columns={'ModelID': 'index'})  # Adjust column names



# Final DataFrame:
df = pd.concat([new_tcga_final_copy, new_ccle_final_copy], axis=0)

# Optionally save it with the following line:
# df.to_feather('final_df.feather')

Unnamed: 0,Source,index,TSPAN6,TNMD,DPM1,SCYL3,C1orf112,FGR,CFH,FUCA2,...,OR4F29,TBCE,CCDC39,ARHGAP11B,AL160269.1,POLR2J3,SPDYE11,C8orf44-SGK3,NPBWR1,CDR1
0,TCGA,TCGA-DD-AAVP,3.135174,0.028396,5.864007,2.509013,2.262848,0.752228,3.398299,6.004123,...,0.125754,4.754342,0.151330,0.788762,0.466421,4.190383,0.077371,0.544528,-0.290041,0.124342
1,TCGA,TCGA-KK-A7B2,3.367176,0.037910,6.084354,2.987013,2.448074,1.206126,2.436819,4.580562,...,0.099038,4.317380,0.658841,1.277173,0.090477,4.576279,0.038742,0.670695,0.469221,0.130066
2,TCGA,TCGA-DC-6158,4.501471,0.089853,6.992742,3.417116,4.130279,1.205358,3.043563,6.278275,...,0.118081,4.743795,0.807063,2.780234,0.342528,5.517419,0.049574,0.743986,0.467096,0.132996
3,TCGA,TCGA-DD-A4NP,2.950265,0.007800,5.657268,2.424083,1.807881,0.796275,3.911061,5.852226,...,0.106564,4.573579,0.146527,0.627981,0.441784,3.827664,0.075630,0.598240,-0.363453,0.153568
4,TCGA,TCGA-HQ-A5ND,4.219013,-0.044418,6.873882,2.790040,4.198637,0.713537,3.847236,5.863390,...,0.092701,4.381327,0.516011,2.513701,0.362027,5.056733,0.030816,0.507946,1.093556,0.001036
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1474,CCLE,ACH-003157,4.173240,-0.263910,6.647718,2.810106,3.619272,0.277842,5.766932,6.555429,...,0.071739,4.774451,1.088025,2.425648,0.160152,4.889832,0.039053,0.749143,0.281475,0.465202
1475,CCLE,ACH-003158,4.016809,0.015881,6.709239,2.484421,4.480680,0.214116,3.921771,6.299941,...,0.056172,5.190631,0.819769,2.855280,0.020416,5.090271,0.019383,0.429407,0.245780,0.302200
1476,CCLE,ACH-003159,3.995608,-0.089760,6.970976,2.685435,4.369046,0.213909,4.673708,6.419062,...,0.044034,4.996519,0.859422,2.649955,0.104735,4.940071,0.015761,0.629773,0.316780,0.396788
1477,CCLE,ACH-003160,4.220507,-0.100069,6.990480,2.856611,4.420870,0.204042,4.240157,6.129759,...,0.067665,5.084096,1.045277,2.884316,0.160637,5.106096,0.019712,0.635350,0.440597,0.372604
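
`pd.concat(..., axis=0)` keeps the row labels of each input frame, so the combined `df` can contain repeated labels, and depending on the pandas version `to_feather` may refuse a non-default index. The sketch below shows the difference on two toy frames. The frames and their values are made up, and it is an assumption that renumbering the rows is acceptable for downstream use.

```python
import pandas as pd

# Toy stand-ins for new_tcga_final_copy and new_ccle_final_copy (made-up values).
tcga_part = pd.DataFrame(
    {"Source": ["TCGA", "TCGA"], "index": ["T-1", "T-2"], "GENE_A": [3.1, 3.4]}
)
ccle_part = pd.DataFrame(
    {"Source": ["CCLE", "CCLE"], "index": ["ACH-1", "ACH-2"], "GENE_A": [4.2, 4.0]}
)

# Plain concatenation keeps the row labels of each part as they are.
stacked = pd.concat([tcga_part, ccle_part], axis=0)
print(stacked.index.tolist())

# ignore_index=True renumbers the rows 0..n-1 instead.
renumbered = pd.concat([tcga_part, ccle_part], axis=0, ignore_index=True)
print(renumbered.index.tolist())
```

Calling `df.reset_index(drop=True)` on an already concatenated frame gives the same renumbering.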
