### Python code for **A Novel Sentence Transformer-based Natural Language Processing Approach for Schema Mapping of Electronic Health Records to the OMOP Common Data Model**
In this tutorial, we describe how to download data from Athena, preprocess it, and train a sentence transformer for schema mapping to the OMOP Common Data Model.

### 1/ Download Data from Athena – OHDSI Vocabularies Repository

To begin, navigate to the [Athena website](https://athena.ohdsi.org/). Read and accept the License Agreement, sign in or create an account, then click the **Download** button in the upper-right corner of the page. You will see a list of datasets like this:

<img src="./asset/download_page.png" alt="download page" style="width:70%;"/>


Select the datasets you need. You can also click "select all" to download all of them.

Once you have made your selection, Athena generates a download link and emails it to you. Download the vocabulary files from that link. The files arrive in a compressed archive, which you should extract to a directory for later use.
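
The extraction step can also be scripted. The following is a minimal sketch using only the standard library; the function name `extract_vocabulary` and both paths are placeholders we introduce for illustration, not part of the original workflow.

```python
import zipfile
from pathlib import Path


def extract_vocabulary(zip_path, out_dir):
    """Extract a downloaded vocabulary archive and return the names of its files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return sorted(archive.namelist())


# Hypothetical usage: both paths are placeholders.
# files = extract_vocabulary('path/to/vocabulary_download.zip', 'path/to/your/data/')
# print(files)
```

The later steps in this tutorial read `CONCEPT.csv`, `CONCEPT_RELATIONSHIP.csv`, and `CONCEPT_SYNONYM.csv` from the extraction directory, so it is worth checking that the returned list contains them.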


### 2/ Preprocessing the Training Set

First, install the required packages:

In [None]:
!pip install pandas sentence_transformers

This project requires PyTorch. Please refer to the [official PyTorch website](https://pytorch.org/get-started/locally/) for instructions on setting up PyTorch locally.

In this tutorial, we focus on training a sentence transformer, which is the most essential part of this study.

In [None]:
import pandas as pd

# Specify the path to your data files
data_path = 'path/to/your/data/'  # Modify this to your data directory

# Load the concept relationship dataframe
df1 = pd.read_csv(f'{data_path}CONCEPT_RELATIONSHIP.csv', sep='\t')

# Collect mapping relationships
# Concept 1 is standard concept in OMOP CDM, whereas concept 2 is not.
df1 = df1[df1['relationship_id'] == 'Mapped from']
df1 = df1[['concept_id_1', 'concept_id_2']]

# Load the concepts name table
df2 = pd.read_csv(f'{data_path}CONCEPT.csv', sep='\t')

# Build concept_id lookups from the concept table, then attach the attributes to both sides of each pair
id_to_concept_name = df2.set_index('concept_id')['concept_name'].to_dict()
id_to_domain_id = df2.set_index('concept_id')['domain_id'].to_dict()
id_to_vocabulary_id = df2.set_index('concept_id')['vocabulary_id'].to_dict()
id_to_concept_class_id = df2.set_index('concept_id')['concept_class_id'].to_dict()
id_to_standard_concept = df2.set_index('concept_id')['standard_concept'].to_dict()

df1['concept_name_1'] = df1['concept_id_1'].map(id_to_concept_name)
df1['domain_id_1'] = df1['concept_id_1'].map(id_to_domain_id)
df1['vocabulary_id_1'] = df1['concept_id_1'].map(id_to_vocabulary_id)
df1['concept_class_id_1'] = df1['concept_id_1'].map(id_to_concept_class_id)
df1['standard_concept_1'] = df1['concept_id_1'].map(id_to_standard_concept)

df1['concept_name_2'] = df1['concept_id_2'].map(id_to_concept_name)
df1['domain_id_2'] = df1['concept_id_2'].map(id_to_domain_id)
df1['vocabulary_id_2'] = df1['concept_id_2'].map(id_to_vocabulary_id)
df1['concept_class_id_2'] = df1['concept_id_2'].map(id_to_concept_class_id)
df1['standard_concept_2'] = df1['concept_id_2'].map(id_to_standard_concept)

# Using df1 above, you can filter the dataset by domain_id, vocabulary_id, etc.,
# and then collect a subset as your training data.

# Remove rows with a missing domain_id_1 (e.g. concept_id_1 not found in CONCEPT.csv)
df3 = df1.dropna(subset=['domain_id_1'])
print(df3.shape)

# Confirm that concept 1 is standard, whereas concept 2 is not.
df3 = df3[df3['standard_concept_2'] != 'S']
df3 = df3[df3['standard_concept_1'] == 'S']
print(df3.shape)

# Keep only pairs whose concept classes differ, then drop pairs whose names are identical
df3 = df3[df3['concept_class_id_1'] != df3['concept_class_id_2']]
df4 = df3[['concept_name_1', 'concept_name_2', 'concept_id_1', 'concept_id_2']]
df4 = df4[df4['concept_name_1'] != df4['concept_name_2']]

# Save all non-standard-to-standard mapping pairs collected from Athena
df4.to_csv(f'{data_path}omop-non-standard-to-standard-mapping-pairs.csv', index=False)
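
The filtering mentioned in the comment above (by `domain_id`, `vocabulary_id`, and so on) could look like the following. This is a sketch on a toy dataframe: the rows, the `pairs` name, and the chosen domains and vocabularies are illustrative assumptions, not values taken from Athena.

```python
import pandas as pd

# Toy stand-in for the enriched df1; every value here is made up for illustration.
pairs = pd.DataFrame({
    'concept_name_1': ['Standard condition A', 'Standard drug B', 'Standard measurement C'],
    'concept_name_2': ['Source condition a', 'Source drug b', 'Source measurement c'],
    'domain_id_1': ['Condition', 'Drug', 'Measurement'],
    'vocabulary_id_2': ['ICD10CM', 'NDC', 'LOINC'],
})

# Hypothetical selection: standard concepts in two domains, sources from one vocabulary.
wanted_domains = ['Condition', 'Drug']
wanted_source_vocabularies = ['ICD10CM']

subset = pairs[
    pairs['domain_id_1'].isin(wanted_domains)
    & pairs['vocabulary_id_2'].isin(wanted_source_vocabularies)
]
print(subset[['concept_name_1', 'concept_name_2']])
```

Using `isin` with explicit lists matches whole values, which avoids accidental partial matches when domain or vocabulary names share a prefix.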

Likewise, we can collect synonyms and their corresponding standard concepts in OMOP CDM.

In [None]:
# --------------concept synonyms--------------
# Load datasets
df = pd.read_csv(f'{data_path}CONCEPT_SYNONYM.csv', sep='\t', on_bad_lines = 'skip', low_memory = False)
df2 = pd.read_csv(f'{data_path}CONCEPT.csv', sep = '\t')

# Collect synonym relationships
id_to_concept_name = df2.set_index('concept_id')['concept_name'].to_dict()
id_to_domain_id = df2.set_index('concept_id')['domain_id'].to_dict()
id_to_standarded_or_not = df2.set_index('concept_id')['standard_concept'].to_dict()
df['concept_name'] = df['concept_id'].map(id_to_concept_name)
df['domain_id'] = df['concept_id'].map(id_to_domain_id)
df['standard_concept'] = df['concept_id'].map(id_to_standarded_or_not)

# Drop na
df = df.dropna(subset=['domain_id'])
print(df.shape)



To create a more targeted subset, uncomment and run the three lines in the next code block.  
For example, to focus the model on conditions, procedures, and drugs:

In [None]:
# df = df[df['domain_id'].str.contains('Condition|Procedure|Drug')] 
# print(df.shape)
# print(df.domain_id.unique())

We can skip the code block above, which may result in a larger training set and longer training.  
Including data from domains that we are not interested in did not seem to help model performance much, but it may yield a model that maps more domains.

In [None]:
# Save the data
df = pd.DataFrame({'concept_name_1': df['concept_name'], 'concept_name_2': df['concept_synonym_name'], 'concept_id_1': df['concept_id']})
df.to_csv(f'{data_path}omop-synonyms.csv', index=False)

### 3/ Train Sentence-Transformer
Now we finally move on to training a sentence-transformer model that is better suited to OMOP schema mapping.

In [None]:
# Here we train an existing sentence transformers model
from sentence_transformers import SentenceTransformer, models, SentencesDataset
from torch import nn
from torch.utils.data import DataLoader
from sentence_transformers import losses
import torch
import pandas as pd
import os
from sentence_transformers.readers import InputExample

# To use only GPU 0, uncomment the following line
# os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Load the model
# You can find other sentence transformer models at https://sbert.net/docs/sentence_transformer/pretrained_models.html
# For example, if you have limited GPU resources, you might want to train a smaller model such as all-MiniLM-L6-v2
model = SentenceTransformer('all-mpnet-base-v2')

# You can also build a new sentence-transformer based on an encoder-only model (uncomment the following 3 lines)
# word_embedding_model = models.Transformer('UFNLP/gatortron-base', max_seq_length=32) # use larger max_seq_length, e.g. 64, if you have more powerful GPU(s)
# pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
# model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# If you want to train the model on more than one GPU, uncomment the following 3 lines
# if torch.cuda.device_count() > 1:
#     print("Let's use", torch.cuda.device_count() , "GPUs!")
#     model = nn.DataParallel(model)

# Move the model to GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Train the model using synonyms and mapping relationships on Athena
synonyms = pd.read_csv(f'{data_path}omop-synonyms.csv')
nsts = pd.read_csv(f'{data_path}omop-non-standard-to-standard-mapping-pairs.csv')
training_data = pd.concat([synonyms, nsts])
print(training_data.head())
train_examples = []

for _, row in training_data.iterrows():
    # Skip the row if either name is missing; check this before str(), which would turn NaN into 'nan'
    if pd.isna(row['concept_name_1']) or pd.isna(row['concept_name_2']):
        continue
    
    concept_name_1 = str(row['concept_name_1'])
    concept_name_2 = str(row['concept_name_2'])
    
    # Skip the row if either concept_name_1 or concept_name_2 is empty
    if not concept_name_1 or not concept_name_2:
        continue
    
    texts = [concept_name_1, concept_name_2]
    example = InputExample(texts=texts)
    train_examples.append(example)
    
# Wrap the examples in a dataset and batch them for training
train_dataset = SentencesDataset(train_examples, model)

train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=128)
train_loss = losses.MultipleNegativesRankingLoss(model=model)
num_epochs = 10
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)
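
`MultipleNegativesRankingLoss` needs only positive pairs, which is why the training examples above carry no labels: within a batch, each `concept_name_1` is scored against every `concept_name_2`, and the other rows' names act as negatives. The sketch below illustrates that idea in plain Python on hand-made 2-D vectors. It is not the library's implementation; the function names are ours, and the cosine similarity with a scale of 20 reflects what we understand to be the library defaults.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and every other positive in the batch serves as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch
        top = max(scores)
        log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_sum_exp - scores[i]
    return total / len(anchors)


# Illustrative embeddings: row i of `aligned` is close to row i of `anchors`.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = in_batch_ranking_loss(anchors, aligned)
loss_swapped = in_batch_ranking_loss(anchors, swapped)
print(loss_aligned < loss_swapped)
```

Because the negatives come from the rest of the batch, a larger `batch_size` gives each anchor more negatives to be ranked against, at the cost of more GPU memory.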

Now train the model.

In [None]:
# If you only have one GPU
model.fit(train_objectives=[(train_dataloader, train_loss)],
           epochs=num_epochs,
           warmup_steps=warmup_steps,
           show_progress_bar=True)

# Edit the path to save your model
torch.save(model.state_dict(), "/path/to/the/model")
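
The `warmup_steps` value passed to `fit` above is 10% of the total number of optimizer steps. The arithmetic can be checked without loading any data; in this sketch the helper name `warmup_schedule` and the dataset size of one million pairs are hypothetical, while the batch size and epoch count match the tutorial.

```python
import math


def warmup_schedule(num_examples, batch_size, num_epochs, warmup_fraction=0.1):
    """Return (steps per epoch, total steps, warmup steps) for a training run."""
    # A DataLoader that keeps the last partial batch yields ceil(n / batch_size) batches
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    return steps_per_epoch, total_steps, int(total_steps * warmup_fraction)


# Hypothetical dataset size; the real value is len(train_examples).
schedule = warmup_schedule(num_examples=1_000_000, batch_size=128, num_epochs=10)
print(schedule)
```

Running the same calculation with your own example count gives a quick estimate of how many steps a full run will take before you start it.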

In [None]:
# If you have multiple GPUs, do not run the code block above; instead, uncomment and run the following lines of Python code
# model.module.fit(train_objectives=[(train_dataloader, train_loss)],
#            epochs=num_epochs,
#            warmup_steps=warmup_steps,
#            show_progress_bar=True)

# torch.save(model.module.state_dict(), "/path/to/the/model")