## **Extracting the Embeddings produced by different PTMs**

### *Notebook Outline*

- [**Feature Engineering**](#Features)
- [**Casting Data into an HuggingFace Dataset**](#Casting)
- [**Getting the Embeddings**](#embeddings)


---

In [1]:
# !pip install adapter-transformers
from transformers import AutoAdapterModel
# %run /kaggle/usr/lib/setup/setup.py
%run conf/setup.ipynb
%load_ext memory_profiler
os.environ['TOKENIZERS_PARALLELISM'] = 'true'

In [2]:
## KAGGLE
# data = pd.read_csv('/kaggle/input/subset-wlabels/subset_wlabels.csv').set_index('System ID')

## LOCAL
data = pd.read_csv('subset_wlabels.csv').set_index('System ID')
data['Publication Date'] = pd.to_datetime(data['Publication Date'])
# Fix missing values coding in the data_origin column
data['Data_origin'] = data['Data_origin'].replace('N.A.', pd.NA)
data.sort_values(by='Lenght_Abs', inplace=True)

data.info()
print()
print("# of unique PMCID values:", data['PMCID'].nunique())
print("# of unique PMID values:", data['PMID'].nunique())
print("# of unique DOI values:", data['DOI'].nunique())
print("# of unique Title values:", data['Title'].nunique())

<class 'pandas.core.frame.DataFrame'>
Index: 560 entries, 32804639 to 33046370
Data columns (total 24 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   DOI                       434 non-null    object        
 1   Latest Version            560 non-null    object        
 2   PMCID                     324 non-null    object        
 3   PMID                      368 non-null    float64       
 4   Pub Year                  560 non-null    int64         
 5   Publication Date          560 non-null    datetime64[ns]
 6   Publication Types         560 non-null    object        
 7   Source                    560 non-null    object        
 8   Peer_Review               560 non-null    int64         
 9   Title                     560 non-null    object        
 10  Cleaned_Abs               560 non-null    object        
 11  Lenght_Abs                560 non-null    int64         
 12  Condition      

### **Feature Engineering** <a id="Features"></a>

In this part of the analysis, we perform various data transformations to enrich our dataset. Let's take a look at the steps:

1. **Concatenate Task and Modality**: We create a new label column called "Task_Modality" by combining the modified "Task_(primary)" and "Modality" columns using the string ' with '.

2. **Remove Numeric Prefixes**: We remove numeric prefixes from the "Task_(primary)" column which were present in the original categorization from Born et al. (2020).

3. **Update Task Modality for Reviews**: For rows where "Task_(primary)" is equal to 'Review', we replace 'with' with 'on' in the "Task_Modality" column.

4. **Concatenate Text Fields**: We combine the title and abstract texts into a single sequence, separated by the special token `[SEP]`, and let the model encode the result.
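
The four steps can be tried on a toy frame before touching the real data. The sketch below is a minimal illustration on two made-up rows: the titles, abstracts, and the `toy` name are assumptions, not records from the dataset. It passes `regex=True` explicitly so that the prefix pattern is treated as a regular expression regardless of the pandas version.

```python
import pandas as pd

# Two made-up rows that mimic the label format of the original categorization
toy = pd.DataFrame({
    "Title": ["a toy title", "another toy title"],
    "Cleaned_Abs": ["a toy abstract", "another toy abstract"],
    "Task_(primary)": ["2. Detection/Diagnosis", "7. Review"],
    "Modality": ["CT", "Multimodal"],
})

# Strip numeric prefixes such as "2. " (done first here for readability)
toy["Task_(primary)"] = toy["Task_(primary)"].str.replace(r"^\d+\.\s*", "", regex=True)

# Combine task and modality into one label
toy["Task_Modality"] = toy["Task_(primary)"] + " with " + toy["Modality"]

# Reviews read "Review on <modality>" instead of "Review with <modality>"
is_review = toy["Task_(primary)"] == "Review"
toy.loc[is_review, "Task_Modality"] = (
    toy.loc[is_review, "Task_Modality"].str.replace(" with ", " on ", regex=False)
)

# Join title and abstract with the separator token
toy["Title_Abstract"] = toy["Title"] + " [SEP] " + toy["Cleaned_Abs"]

print(toy[["Task_(primary)", "Task_Modality", "Title_Abstract"]].to_string())
```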

In [3]:
# Concatenate the cleaned "Task_(primary)" column with the "Modality" column using the string ' with '
# (regex=True is explicit because newer pandas versions treat the pattern as a literal string by default)
data['Task_Modality'] = (data['Task_(primary)'].str.replace(r'^\d+\.\s*', '', regex=True) +
                         ' with ' +
                         data['Modality'])

# Remove numeric prefixes from the 'Task_(primary)' column
data['Task_(primary)'] = data['Task_(primary)'].str.split('.').str[-1].str.strip()

# For rows where "Task_(primary)" is 'Review', replace 'with' with 'on' in the "Task_Modality" column
is_review = data['Task_(primary)'] == 'Review'
data.loc[is_review, 'Task_Modality'] = data.loc[is_review, 'Task_Modality'].str.replace('with', 'on', case=False)

# Concatenate Title and Abstract, as is usually done with these texts
data.insert(12, 'Title_Abstract', data['Title'] + ' [SEP] ' + data['Cleaned_Abs'])

data.head(2)




| System ID | DOI | Latest Version | PMCID | PMID | Pub Year | Publication Date | Publication Types | Source | Peer_Review | Title | ... | Visualization Categories | influence_score | popularity_alt_score | popularity_score | influence_alt_score | tweets_count | Data_origin | Task_(primary) | Modality | Task_Modality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32804639 | 10.1109/MPULS.2020.3008354 | Yes | | 32804639.0 | 2020 | 2020-08-18 | Journal Article | Peer reviewed (PubMed) | 1 | ai-driven covid-19 tools to interpret quantify... | ... | Other Topics | 4e-06 | 49.752 | 3e-06 | 185.0 | 0.0 | | Review | Multimodal | Review on Multimodal |
| 36237723 | 10.3348/jksr.2020.0138 | Yes | PMC9431829 | 36237723.0 | 2020 | 2020-11-01 | English Abstract;Journal Article;Review | Peer reviewed (PubMed) | 1 | role of chest radiographs and ct scans and the... | ... | Polymerase Chain Reaction;Reverse Transcription | 2e-06 | 27.552 | 2e-06 | 86.0 | -1.0 | | Review | Multimodal | Review on Multimodal |


<br>

### **Casting Data into an HuggingFace Dataset** <a id="Casting"></a>

[`🤗 Datasets`](https://huggingface.co/docs/datasets/index) supports loading datasets from Pandas DataFrames with the [`from_pandas()`](https://huggingface.co/docs/datasets/tabular_load#pandas-dataframes) method.
When the dataset doesn’t look as expected, we should [explicitly specify the dataset features](https://huggingface.co/docs/datasets/loading#specify-features). A `pandas.Series` may not always carry enough information for **Arrow** to automatically infer a data type.

[Features](https://huggingface.co/docs/datasets/about_dataset_features) defines the internal structure of a dataset. It is used to specify the underlying serialization format. What’s more interesting to you though is that Features contains high-level information about everything from the column names and types, to the ClassLabel. You can think of Features as the backbone of a dataset.
The Features format is simple: `dict[column_name, column_type]`. It is a dictionary of column name and column type pairs. The column type provides a wide range of options for describing the type of data you have.
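
One consequence of this design is worth spelling out: a `ClassLabel` stores integers and decodes them by position in its `names` list, so any integer codes produced beforehand have to follow the same order. The library-free sketch below illustrates the point with a hypothetical `names` list and made-up labels. It only mimics position-based decoding and alphabetical encoding (the order that scikit-learn's `LabelEncoder` produces); it does not call either library.

```python
# Hypothetical class order, as it would be passed to ClassLabel(names=...)
names = ["X-Ray", "CT", "Multimodal", "Ultrasound"]
labels = ["CT", "X-Ray", "Ultrasound", "CT"]

# Position-based codes: consistent with decoding by index into `names`
name_to_id = {name: i for i, name in enumerate(names)}
codes = [name_to_id[label] for label in labels]
decoded = [names[code] for code in codes]

# Alphabetical codes: decoding them through `names` scrambles the labels
alphabetical = sorted(set(names))
alpha_codes = [alphabetical.index(label) for label in labels]
alpha_decoded = [names[code] for code in alpha_codes]

print(codes, decoded)
print(alpha_codes, alpha_decoded)
```

Only the position-based codes round-trip to the original labels; the alphabetical codes decode to different class names whenever `names` is not itself sorted.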

In [4]:
# Class names, in the order used by the ClassLabel features below
modalities = ['X-Ray','CT','Multimodal','Ultrasound']
tasks = ['Risk identification','Detection/Diagnosis','Monitoring/Severity assessment','Prognosis/Treatment','Post-hoc','Segmentation-only','Review']
task_modalities = ['Detection/Diagnosis with X-Ray','Detection/Diagnosis with CT','Monitoring/Severity assessment with CT','Segmentation-only with CT','Detection/Diagnosis with Multimodal','Review on Multimodal','Prognosis/Treatment with CT','Prognosis/Treatment with X-Ray','Monitoring/Severity assessment with X-Ray','Segmentation-only with X-Ray','Detection/Diagnosis with Ultrasound','Review on X-Ray','Post-hoc with X-Ray','Monitoring/Severity assessment with Multimodal','Review on CT','Risk identification with CT','Post-hoc with CT','Segmentation-only with Multimodal','Monitoring/Severity assessment with Ultrasound']

In [5]:
# Encode each label column by position in its class list.
# A ClassLabel decodes integers by index into `names`, so the codes must follow these orders;
# LabelEncoder assigns codes alphabetically instead, and the decoded labels would not match.
label_names = {'Modality': modalities, 'Task_Modality': task_modalities, 'Task_(primary)': tasks}

for column, names in label_names.items():
    # Values that are missing from a list become NaN
    data[column] = data[column].map({name: code for code, name in enumerate(names)})

dataset_features = Features({'DOI': Value(dtype='string'),
'PMCID': Value(dtype='string'),
'PMID': Value(dtype='float64'),
'Publication Types': Value(dtype='string'),
'Title': Value(dtype='string'),
'Title_Abstract': Value(dtype='string'),
'Task_(primary)': ClassLabel(names=tasks),
'Modality': ClassLabel(names=modalities),
'Task_Modality': ClassLabel(names=task_modalities),
'System ID': Value(dtype='string')})

In [6]:
dataset = Dataset.from_pandas(data[['DOI','PMCID','PMID','Publication Types','Title','Title_Abstract',
                                     'Task_(primary)','Modality','Task_Modality']], features = dataset_features)
dataset.features

{'DOI': Value(dtype='string', id=None),
 'PMCID': Value(dtype='string', id=None),
 'PMID': Value(dtype='float64', id=None),
 'Publication Types': Value(dtype='string', id=None),
 'Title': Value(dtype='string', id=None),
 'Title_Abstract': Value(dtype='string', id=None),
 'Task_(primary)': ClassLabel(num_classes=7, names=['Risk identification', 'Detection/Diagnosis', 'Monitoring/Severity assessment', 'Prognosis/Treatment', 'Post-hoc', 'Segmentation-only', 'Review'], id=None),
 'Modality': ClassLabel(num_classes=4, names=['X-Ray', 'CT', 'Multimodal', 'Ultrasound'], id=None),
 'Task_Modality': ClassLabel(num_classes=19, names=['Detection/Diagnosis with X-Ray', 'Detection/Diagnosis with CT', 'Monitoring/Severity assessment with CT', 'Segmentation-only with CT', 'Detection/Diagnosis with Multimodal', 'Review on Multimodal', 'Prognosis/Treatment with CT', 'Prognosis/Treatment with X-Ray', 'Monitoring/Severity assessment with X-Ray', 'Segmentation-only with X-Ray', 'Detection/Diagnosis with U

<br>

### **Getting the Embeddings** <a id="embeddings"></a>

We begin by instantiating the tokenizer for each candidate model. 15 BERT variants were selected for comparison, and we select one at a time by uncommenting the corresponding checkpoint identifier. Every model uses one of two architectures, `bertbase` or `bertlarge`; the models differ in three key aspects: the pre-training dataset, the weight initialization, and the vocabulary.

The tokenizer is initialized from the selected model checkpoint, and we manually set the maximum token length to 512 because it may otherwise default to a very large value (`int(1e30)`). This ensures that the input text is truncated to fit within the model's constraints.

[PreTrainedTokenizerFast](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizerFast) - Base Class for all fast Tokenizers. <br>
[BertTokenizerFast](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizerFast) - Subclass in use here.

In [None]:
# Instantiate the tokenizer for each candidate model
%memit

#model_ckpt = 'bert-base-uncased'
#model_ckpt = 'allenai/scibert_scivocab_uncased'
#model_ckpt = 'dmis-lab/biobert-base-cased-v1.2'
#model_ckpt = 'microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext'
#model_ckpt = 'deepset/covid_bert_base'
#model_ckpt = 'lordtt13/COVID-SciBERT'
#model_ckpt = 'manueltonneau/clinicalcovid-bert-base-cased'
#model_ckpt = 'StanfordAIMI/RadBERT'
#model_ckpt = 'allenai/specter2'

model_ckpt = 'bert-large-cased' # LARGE
#model_ckpt = 'dmis-lab/biobert-large-cased-v1.1' # LARGE
#model_ckpt = 'microsoft/BiomedNLP-PubMedBERT-large-uncased-abstract'
#model_ckpt = 'manueltonneau/biocovid-bert-large-cased' # LARGE

tokenizer = AutoTokenizer.from_pretrained(model_ckpt, model_max_length=512)
# Manually setting model_max_length as it may default to VERY_LARGE_INTEGER (int(1e30)) 
tokenizer.init_kwargs

The `max_model_input_sizes` class attribute is a dictionary whose keys are the shortcut names of the pretrained models using the tokenizer and whose values are the maximum length of the sequence inputs of each model, or `None` if the model has no maximum input size.
Checking this attribute is useful when setting `model_max_length` manually, since that parameter can default to `VERY_LARGE_INTEGER` (*int(1e30)*).
 <br>
 
  **Note:** If `model_max_length` is not set, the tokenizer has no maximum length to truncate to, so the `truncation=True` argument in the function call has no effect.  

In [8]:
tokenizer.max_model_input_sizes

{'bert-base-uncased': 512,
 'bert-large-uncased': 512,
 'bert-base-cased': 512,
 'bert-large-cased': 512,
 'bert-base-multilingual-uncased': 512,
 'bert-base-multilingual-cased': 512,
 'bert-base-chinese': 512,
 'bert-base-german-cased': 512,
 'bert-large-uncased-whole-word-masking': 512,
 'bert-large-cased-whole-word-masking': 512,
 'bert-large-uncased-whole-word-masking-finetuned-squad': 512,
 'bert-large-cased-whole-word-masking-finetuned-squad': 512,
 'bert-base-cased-finetuned-mrpc': 512,
 'bert-base-german-dbmdz-cased': 512,
 'bert-base-german-dbmdz-uncased': 512,
 'TurkuNLP/bert-base-finnish-cased-v1': 512,
 'TurkuNLP/bert-base-finnish-uncased-v1': 512,
 'wietsedv/bert-base-dutch-cased': 512}

In [9]:
print('Vocabulary Size: ' + str(tokenizer.vocab_size))
print('Context Window Size: ' + str(tokenizer.model_max_length)) # Enough for papers' abstracts
print('Model Input Fields: ' + str(tokenizer.model_input_names)) # Names of the fields that the model expects in its forward pass

Vocabulary Size: 28996
Context Window Size: 512
Model Input Fields: ['input_ids', 'token_type_ids', 'attention_mask']


We now test the tokenization process on an individual record from the dataset, using the tokenizer to encode the text of the "Title_Abstract" column. \
To observe the impact of padding, set the padding strategy to `max_length`, which pads the sequence to the maximum length.\
To observe the impact of truncation, tokenize the last record, which originally has 619 tokens and is truncated to 512 tokens.
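
What padding and truncation do to the ID sequence can be pictured without loading a tokenizer. The function below is a simplified, hypothetical stand-in (`pad_or_truncate` is not part of `transformers`). It assumes BERT's usual special-token IDs (`[CLS]` = 101, `[SEP]` = 102, `[PAD]` = 0) and a plain list of content IDs, and it assumes the 619-token count includes the two special tokens. It reproduces only the resulting lengths and masks, not the tokenizer's actual implementation.

```python
def pad_or_truncate(content_ids, max_length=512, pad_to_max=False,
                    cls_id=101, sep_id=102, pad_id=0):
    """Mimic single-sequence BERT encoding with truncation enabled."""
    # Reserve two positions for [CLS] and [SEP], then cut the content
    kept = content_ids[: max_length - 2]
    input_ids = [cls_id] + kept + [sep_id]
    attention_mask = [1] * len(input_ids)
    if pad_to_max:  # analogous to padding='max_length'
        n_pad = max_length - len(input_ids)
        input_ids += [pad_id] * n_pad
        attention_mask += [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# A long record: 617 content tokens + 2 special tokens = 619, cut to 512
long_record = pad_or_truncate(list(range(1000, 1617)))

# A short record: 100 content tokens, padded up to the maximum length
short_record = pad_or_truncate(list(range(1000, 1100)), pad_to_max=True)

print(len(long_record["input_ids"]), len(short_record["input_ids"]))
print(sum(short_record["attention_mask"]))  # number of non-padding positions
```

In the padded record, the attention mask is what tells the model which positions hold real tokens and which hold padding.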

In [10]:
# Testing Tokenization with an individual record
# Check out padding by setting the padding strategy to 'max_length'
# Check out truncation by tokenizing the last record (619 tokens truncated to 512)
encoded_text = tokenizer(dataset['Title_Abstract'][0], padding=True, truncation=True)

print(encoded_text) # transformers.tokenization_utils_base.BatchEncoding
print()
print('Length of a generic individual encoded record ' + str(len(encoded_text['input_ids'])))

{'input_ids': [101, 170, 1182, 118, 4940, 1884, 18312, 118, 1627, 5537, 1106, 19348, 186, 27280, 6120, 13093, 4351, 102, 154, 4746, 24936, 7628, 1110, 170, 1363, 1645, 1165, 1122, 2502, 1106, 3455, 13093, 4351, 1107, 1103, 2147, 1222, 1884, 15789, 27608, 10351, 3653, 113, 18732, 23314, 2137, 118, 1627, 114, 117, 1133, 25220, 3622, 2228, 2070, 6360, 7516, 1277, 1167, 8232, 119, 1706, 1115, 1322, 117, 1317, 1844, 2114, 1138, 4972, 1702, 1106, 8246, 4810, 113, 19016, 114, 1112, 170, 6806, 1111, 3455, 1105, 23389, 161, 118, 11611, 1105, 3254, 18505, 1106, 3702, 8944, 113, 16899, 114, 14884, 1116, 117, 1105, 4395, 1106, 4267, 8517, 22583, 1105, 8804, 18732, 23314, 2137, 118, 1627, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [11]:
# convert the token IDs back to their corresponding string tokens
#tokenizer.convert_ids_to_tokens(encoded_text.input_ids)

<br>

Each of the models used here will be a [`BertModel`](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertModel) instance (`transformers.models.bert.modeling_bert.BertModel`).

The bare BERT model transformer outputs raw hidden states without any specific head on top.
This model inherits from `PreTrainedModel`. Check the superclass documentation for the generic methods the library implements for all its models *(such as downloading or saving, resizing the input embeddings, pruning heads, etc.)*.

The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of cross-attention is added between the self-attention layers, following the architecture described in *Attention Is All You Need*.

To behave as a decoder, the model needs to be initialized with the `is_decoder` argument of the configuration set to `True`. To be used in a Seq2Seq model, the model needs to be initialized with both the `is_decoder` argument and `add_cross_attention` set to `True`; `encoder_hidden_states` is then expected as an input to the forward pass.

> In this notebook, the `get_embeddings()` function works around errors related to the scope of global variables inside the `tokenize()` and `extract_hidden_state()` functions, which were raised unexpectedly in Kaggle's kernel. Instantiating the tokenizer and the model inside `get_embeddings()` gives these functions local access to both resources. Note that the function should be mapped with the whole dataset as a single batch (`batched=True, batch_size=None`); with smaller batches, the model would be loaded once per batch, which is inefficient. Please be aware that the **batch mapping feature is currently under development** and not fully functional at this stage.


In [12]:
# embeddings = dataset.map(get_embeddings, fn_kwargs={"model_ckpt": model_ckpt}, batched=True, batch_size=None)
# embeddings.features

We initialize the candidate model from which to extract the embeddings. We check for the availability of CUDA and assign the appropriate device (GPU if available, else CPU).
We then instantiate the model using `AutoModel.from_pretrained()` with the specified checkpoint (`model_ckpt`) and move it to the available device.
Please note that the commented lines at the bottom of the cell demonstrate how to load the SPECTER model with adapters for specific tasks, if needed.

In [None]:
# Check if CUDA is available, and assign the device accordingly
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Instantiate each candidate model
model = AutoModel.from_pretrained(model_ckpt).to(device)

# Instantiate SPECTER model with adapter
#model = AutoModel.from_pretrained(model_ckpt)
# load the adapter(s) as per the required task, provide an identifier for the adapter in load_as argument and activate it
#model.load_adapter("allenai/specter2_proximity", source="hf", set_active=True)
#model.to(device)
print("running on device: {}".format(device))

In this cell, we generate text embeddings using the selected pre-trained BERT variant. We derive the hidden size from the `in_features` attribute of the pooler's dense layer (`model.pooler.dense.in_features`).

The process involves iterating through the dataset's 'Title_Abstract' column and obtaining embeddings for each abstract using the specified tokenizer and model. The embeddings are stored in three separate arrays, each representing the `CLS` token, `SEP` token, and the pooled representation of the abstract, respectively.

Finally, the generated embeddings are added as new columns in the pandas DataFrame with corresponding names, such as `[model_name]_CLS_embed`, `[model_name]_SEP_embed`, and `[model_name]_POOL_embed`. \
 Each column is then saved as a pandas Series in a serialized pickle (`.pkl`) file.
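
`generate_embeddings()` is defined in the setup notebook and is not shown here. As a rough picture of what the three vectors are, the NumPy sketch below picks them out of a random stand-in for a model's last hidden state. Everything in it is an assumption made for illustration: the `pick_embeddings` helper is hypothetical, `[SEP]` is assumed to have ID 102, and the pooled vector is computed as a mask-weighted mean, whereas the actual function may instead use BERT's `pooler_output`.

```python
import numpy as np

def pick_embeddings(hidden, input_ids, attention_mask, sep_id=102):
    """Return (CLS, SEP, pooled) vectors from a (seq_len, hidden_size) array."""
    input_ids = np.asarray(input_ids)
    mask = np.asarray(attention_mask, dtype=float)
    cls_vec = hidden[0]                                  # [CLS] is the first position
    sep_pos = np.flatnonzero(input_ids == sep_id)[-1]    # position of the final [SEP]
    sep_vec = hidden[sep_pos]
    # Mean over real tokens only; padding positions carry zero weight
    pooled = (hidden * mask[:, None]).sum(axis=0) / mask.sum()
    return cls_vec, sep_vec, pooled

rng = np.random.default_rng(0)
hidden_size = 8  # toy value; BERT base uses 768 and BERT large uses 1024

# Toy layout: [CLS] title [SEP] abstract [SEP] followed by two padding positions
input_ids = [101, 2000, 102, 2001, 2002, 102, 0, 0]
attention_mask = [1, 1, 1, 1, 1, 1, 0, 0]
hidden = rng.normal(size=(len(input_ids), hidden_size))

cls_vec, sep_vec, pooled = pick_embeddings(hidden, input_ids, attention_mask)
print(cls_vec.shape, sep_vec.shape, pooled.shape)
```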



In [15]:
%%time
%memit

# One batch extraction no longer works with the 517 records dataset
#embeddings = generate_embeddings(dataset['Title_Abstract'], tokenizer, model, device)

hidden_size = model.pooler.dense.in_features
n_rows = dataset.num_rows

## Pre-allocate one (n_rows, hidden_size) array per embedding type (hidden_size is 768 for base models, 1024 for large ones)
embeddings = (np.zeros([n_rows, hidden_size]), np.zeros([n_rows, hidden_size]), np.zeros([n_rows, hidden_size]))
for i, abst in enumerate(dataset['Title_Abstract']): 
    embeddings[0][i], embeddings[1][i], embeddings[2][i] = generate_embeddings(abst, tokenizer, model, device)

model_name = model_ckpt.split("/")[-1]

# Optionally save individual arrays to Numpy standard binary format .npy
#np.save(model_name + '_CLS_embed', embeddings[0])
#np.save(model_name + '_SEP_embed', embeddings[1]) 
#np.save(model_name + '_POOL_embed', embeddings[2]) 

# Create new columns in the pandas DataFrame and assign the embeddings
data[model_name + '_CLS_embed'] = embeddings[0].tolist()
data[model_name + '_SEP_embed'] = embeddings[1].tolist()
data[model_name + '_POOL_embed'] = embeddings[2].tolist()

peak memory: 4028.95 MiB, increment: 0.00 MiB
CPU times: user 29.4 s, sys: 236 ms, total: 29.6 s
Wall time: 31.2 s


In [16]:
# Save CLS embeddings and keys to a serialized file
data[model_name + '_CLS_embed'].to_pickle(model_name + '_CLS_embed.pkl')

# Save SEP embeddings and keys to a serialized file
data[model_name + '_SEP_embed'].to_pickle(model_name + '_SEP_embed.pkl')

# Save POOL embeddings and keys to a serialized file
data[model_name + '_POOL_embed'].to_pickle(model_name + '_POOL_embed.pkl')
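
For later use, a pickled Series of this kind can be read back and stacked into a 2-D array. The round trip below is a small sketch on made-up data: the file name, the `toy_embed` Series, and the temporary directory are assumptions for illustration, while `to_pickle` and `read_pickle` are the standard pandas calls.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up 4-dimensional embeddings stored as lists and indexed by System ID
toy_embed = pd.Series(
    [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]],
    index=pd.Index([32804639, 36237723], name="System ID"),
    name="toy_CLS_embed",
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_CLS_embed.pkl")
    toy_embed.to_pickle(path)          # write the Series, index included
    restored = pd.read_pickle(path)    # read it back before the directory is removed

# One row per record, one column per embedding dimension
matrix = np.array(restored.tolist())
print(matrix.shape, restored.index.tolist())
```

Because the index travels with the Series, each row of `matrix` can still be matched to its `System ID` after reloading.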

***