## **Comparing the Embeddings produced by different PTMs**

### *Notebook Outline*

- [**Feature Engineering**](#Features)
- [**Loading [CLS] Embeddings**](#CLS)
- [**Model's Parameters Tuning**](#tuning)
- [**Performing the Comparison**](#CompareCLS)
- [**Loading [SEP] Embeddings**](#SEP)
- [**Performing the Comparison**](#CompareSEP)
- [**Loading POOL Embeddings**](#POOL)
- [**Performing the Comparison**](#ComparePOOL)

---

In [1]:
%run setup.ipynb
%load_ext memory_profiler

In [2]:
data = pd.read_csv('../data/subset_wlabels.csv').set_index('System ID')
data['Publication Date'] = pd.to_datetime(data['Publication Date'])
# Recode the 'N.A.' placeholder in the Data_origin column as a proper missing value
data['Data_origin'] = data['Data_origin'].replace('N.A.', pd.NA)
data.sort_values(by='Lenght_Abs', inplace=True)

data.info()
print()
print("# of unique PMCID values:", data['PMCID'].nunique())
print("# of unique PMID values:", data['PMID'].nunique())
print("# of unique DOI values:", data['DOI'].nunique())
print("# of unique Title values:", data['Title'].nunique())

<class 'pandas.core.frame.DataFrame'>
Index: 560 entries, 32804639 to 33046370
Data columns (total 24 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   DOI                       434 non-null    object        
 1   Latest Version            560 non-null    object        
 2   PMCID                     324 non-null    object        
 3   PMID                      368 non-null    float64       
 4   Pub Year                  560 non-null    int64         
 5   Publication Date          560 non-null    datetime64[ns]
 6   Publication Types         560 non-null    object        
 7   Source                    560 non-null    object        
 8   Peer_Review               560 non-null    int64         
 9   Title                     560 non-null    object        
 10  Cleaned_Abs               560 non-null    object        
 11  Lenght_Abs                560 non-null    int64         
 12  Condition      

### **Feature Engineering** <a id="Features"></a>

In this part of the analysis, we perform various data transformations to enrich our dataset. Let's take a look at the steps:

1. **Concatenate Task and Modality**: We create a new label column called "Task_Modality" by combining the modified "Task_(primary)" and "Modality" columns using the string ' with '.

2. **Remove Numeric Prefixes**: We remove numeric prefixes from the "Task_(primary)" column which were present in the original categorization from Born et al. (2020).

3. **Update Task Modality for Reviews**: For rows where "Task_(primary)" is equal to 'Review', we replace 'with' with 'on' in the "Task_Modality" column.

4. **Initialize Record Labels**: We initialize the "RecordLabel" column by combining the "Title" and a line break. This will be used for future interactive visualization.

5. **Append an Identifier**: We iterate through the rows and append an identifier to the "RecordLabel" column. If the "DOI" is present, we use it. If it is missing, we fall back to the "PMCID", and then to the "PMID". If all three are missing, we append the "Source" and "Publication Date" instead.
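The steps above can be sketched on a toy table. This is a minimal illustration, not the notebook's own cell: the records, the numeric task prefixes (`1.`, `2.`, `6.`), and the helper names `record_label` and `add_labels` are all made up for the example.

```python
import pandas as pd

# Toy records (made-up values) covering every branch of the identifier fallback
toy = pd.DataFrame({
    "Title": ["paper a", "paper b", "paper c", "paper d"],
    "DOI": ["10.1000/xyz", None, None, None],
    "PMCID": [None, "PMC123", None, None],
    "PMID": [float("nan"), float("nan"), 456.0, float("nan")],
    "Source": ["PubMed", "PubMed", "PubMed", "arXiv"],
    "Publication Date": pd.to_datetime(["2020-01-01"] * 4),
    "Task_(primary)": ["1. Detection/Diagnosis", "6. Review",
                       "2. Prognosis/Treatment", "6. Review"],
    "Modality": ["X-Ray", "Multimodal", "CT", "CT"],
})


def record_label(row):
    """Hypothetical helper: title, line break, then the best available identifier."""
    if pd.notna(row["DOI"]):
        ident = f"DOI: {row['DOI']}"
    elif pd.notna(row["PMCID"]):
        ident = f"PMCID: {row['PMCID']}"
    elif pd.notna(row["PMID"]):
        # PMID is stored as a float, so cast it before building the string
        ident = f"PMID: {int(row['PMID'])}"
    else:
        date = pd.Timestamp(row["Publication Date"]).date()
        ident = f"Published on {row['Source']}, {date}"
    return f"{row['Title']} \n{ident}"


def add_labels(df):
    """Hypothetical helper applying steps 1-5 to a copy of the table."""
    out = df.copy()
    # Strip the numeric prefix from the task name
    task = out["Task_(primary)"].str.replace(r"^\d+\.\s*", "", regex=True)
    # Reviews read "Review on <modality>" instead of "Review with <modality>"
    joiner = task.map(lambda t: " on " if t == "Review" else " with ")
    out["Task_(primary)"] = task
    out["Task_Modality"] = task + joiner + out["Modality"]
    out["RecordLabel"] = out.apply(record_label, axis=1)
    return out


labelled = add_labels(toy)
print(labelled[["Task_Modality", "RecordLabel"]])
```

Passing `regex=True` explicitly keeps the prefix removal working across pandas versions, since the default of `Series.str.replace` changed from regex to literal matching.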


In [3]:
# Concatenate "Task_(primary)" (numeric prefix stripped) with "Modality" using the string ' with '
data['Task_Modality'] = (data['Task_(primary)'].str.replace(r'^\d+\.\s*', '', regex=True) +
                         ' with ' +
                         data['Modality'])

# Remove numeric prefixes from 'Task_(primary)' column
data['Task_(primary)'] = data['Task_(primary)'].str.split('.').str[-1].str.strip()

# For rows where "Task_(primary)" is 'Review', replace 'with' with 'on' in "Task_Modality"
is_review = data['Task_(primary)'] == 'Review'
data.loc[is_review, 'Task_Modality'] = data.loc[is_review, 'Task_Modality'].str.replace('with', 'on', case=False, regex=True)

# Initialize the 'RecordLabel' column with the 'Title' and a line break
data['RecordLabel'] = data['Title'] + ' \n'

for i, row in data.iterrows():
    if pd.isnull(row['DOI']):  # Check if 'DOI' is missing
        if pd.notnull(row['PMCID']):
            # If 'PMCID' is present, add it to the 'RecordLabel'
            data.at[i, 'RecordLabel'] += 'PMCID: ' + row['PMCID']
        elif pd.notnull(row['PMID']):
            # If 'PMCID' is missing but 'PMID' is present, add 'PMID' to the 'RecordLabel'
            # ('PMID' is stored as float64, so cast it before concatenating)
            data.at[i, 'RecordLabel'] += 'PMID: ' + str(int(row['PMID']))
        else:
            # If both 'PMCID' and 'PMID' are missing, add 'Source' and 'Publication Date'
            data.at[i, 'RecordLabel'] += 'Published on ' + row['Source'] + ', ' + str(row['Publication Date']).split()[0]
    else:
        # If 'DOI' is present, add it to the 'RecordLabel'
        data.at[i, 'RecordLabel'] += 'DOI: ' + row['DOI']

data.head(2)



| System ID | DOI | Latest Version | PMCID | PMID | Pub Year | Publication Date | Publication Types | Source | Peer_Review | Title | ... | influence_score | popularity_alt_score | popularity_score | influence_alt_score | tweets_count | Data_origin | Task_(primary) | Modality | Task_Modality | RecordLabel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32804639 | 10.1109/MPULS.2020.3008354 | Yes |  | 32804639.0 | 2020 | 2020-08-18 | Journal Article | Peer reviewed (PubMed) | 1 | ai-driven covid-19 tools to interpret quantify... | ... | 4e-06 | 49.752 | 3e-06 | 185.0 | 0.0 |  | Review | Multimodal | Review on Multimodal | ai-driven covid-19 tools to interpret quantify... |
| 36237723 | 10.3348/jksr.2020.0138 | Yes | PMC9431829 | 36237723.0 | 2020 | 2020-11-01 | English Abstract;Journal Article;Review | Peer reviewed (PubMed) | 1 | role of chest radiographs and ct scans and the... | ... | 2e-06 | 27.552 | 2e-06 | 86.0 | -1.0 |  | Review | Multimodal | Review on Multimodal | role of chest radiographs and ct scans and the... |


In [4]:
print('Modality Label Frequency Table:')
print(data['Modality'].value_counts())
print()
print('Task_Modality Label Frequency Table:')
print(data['Task_Modality'].value_counts())

Modality Label Frequency Table:
X-Ray         243
CT            197
Multimodal     70
Ultrasound     50
Name: Modality, dtype: int64

Task_Modality Label Frequency Table:
Detection/Diagnosis with X-Ray                    182
Detection/Diagnosis with CT                        83
Review on Multimodal                               36
Segmentation-only with CT                          34
Detection/Diagnosis with Ultrasound                28
Prognosis/Treatment with CT                        27
Monitoring/Severity assessment with CT             22
Detection/Diagnosis with Multimodal                22
Prognosis/Treatment with X-Ray                     19
Post-hoc with X-Ray                                17
Post-hoc with CT                                   15
Monitoring/Severity assessment with Ultrasound     15
Risk identification with CT                        10
Segmentation-only with X-Ray                        8
Monitoring/Severity assessment with X-Ray           7
Review on CT           

### **Loading [CLS] Embeddings**<a id="CLS"></a>

First, we define the model names, the corresponding Python variable names, and the filenames of the embeddings exported by the previous notebook.

In [5]:
models = [
    'bert-base-uncased',
    'scibert_scivocab_uncased',
    'biobert-base-cased-v1.2',
    'BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext',
    'covid_bert_base',
    'COVID-SciBERT',
    'clinicalcovid-bert-base-cased',
    'RadBERT',
    'specter2',
    'specter2C', # with Classification Adapter
    'specter2P', # with Proximity Adapter

    'bert-large-cased', #LARGE
    'biobert-large-cased-v1.1', #LARGE
    'BiomedNLP-PubMedBERT-large-uncased-abstract', #LARGE
    'biocovid-bert-large-cased', #LARGE
]

# Retrieving filenames as defined earlier (PICKLE)
CLS_embeds = [model_name + '_CLS_embed.pkl' for model_name in models]
SEP_embeds = [model_name + '_SEP_embed.pkl' for model_name in models]
POOL_embeds = [model_name + '_POOL_embed.pkl' for model_name in models]

# In Python, variable names can only contain letters, digits, underscores
models_norm = ['BERT_base','SciBERT', 'BioBERT_Base','PubMedBERT','COVID_BERT', 'COVID_SciBERT','ClinicalCovidBERT','RadBERT','SPECTER','SPECTER_CLF','SPECTER_PRX','BERT_large','BioBERT_Large','PubMedBERT_large','BioCovidBERT']

In the next cells, we load the [CLS] token embeddings for the 15 models. A loop reads each model's embeddings from its pickle file and stores them in a variable named after the model; we then collect all the pandas Series in a list. Memory usage is measured using `%memit`.


In [6]:
## LOADING CLS TOKEN EMBEDDINGS (15)
for i, model in enumerate(models_norm):
    cls_embedding_path = os.path.join(RESULTS_PATH, 'embeddings', CLS_embeds[i])
    # Bind each Series to a notebook-level variable named after the model
    # (avoids exec, which breaks on paths containing backslashes or quotes)
    globals()[model] = pd.read_pickle(cls_embedding_path)
%memit

peak memory: 733.82 MiB, increment: 0.09 MiB


In [7]:
# Collecting all embeddings in an iterable object
all_embeddings = [BERT_base,SciBERT,BioBERT_Base,PubMedBERT,COVID_BERT,COVID_SciBERT,ClinicalCovidBERT,RadBERT,SPECTER,SPECTER_CLF,SPECTER_PRX,BERT_large,BioBERT_Large,PubMedBERT_large,BioCovidBERT]
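As an alternative to creating one variable per model, the embeddings could be kept in a dictionary keyed by model name. The sketch below is only an illustration: `load_embeddings` is a hypothetical helper, and the two tiny pickle files are made up and written to a temporary folder so that the example runs on its own.

```python
import os
import tempfile

import pandas as pd


def load_embeddings(folder, names, filenames):
    """Hypothetical helper: map each model name to its unpickled embeddings."""
    return {
        name: pd.read_pickle(os.path.join(folder, fname))
        for name, fname in zip(names, filenames)
    }


# Demo with two tiny made-up embedding files written to a temporary folder
names = ["ModelA", "ModelB"]
filenames = [name + "_CLS_embed.pkl" for name in names]

with tempfile.TemporaryDirectory() as tmp:
    for fname in filenames:
        pd.Series([[0.1, 0.2], [0.3, 0.4]]).to_pickle(os.path.join(tmp, fname))
    embeddings = load_embeddings(tmp, names, filenames)

# Dictionaries preserve insertion order, so this list follows `names`
all_embeddings = list(embeddings.values())
print(list(embeddings))
```

With this layout, switching from [CLS] to another pooling strategy only changes the list of filenames, and no variable is silently overwritten between sections.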

### **Model's Parameters Tuning**<a id="tuning"></a>

We perform a grid search to find the optimal parameters for our K-Nearest Neighbors (KNN) classifier. We explore values of `n_neighbors` from 1 to 19 and two options for the `weights` parameter (`'uniform'` and `'distance'`). The intended metric is cosine distance, a standard choice for document vectors because it compares the direction of two vectors rather than their magnitude. Note that `KNeighborsClassifier` defaults to the Minkowski (Euclidean) metric, so cosine distance is only used when `metric='cosine'` is passed explicitly.
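To make the difference between the two metrics concrete, here is a small NumPy sketch. The vectors are arbitrary and `cosine_distance` is a helper written for this example, not a function from the notebook.

```python
import numpy as np


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


a = np.array([1.0, 2.0, 3.0])
b = 3.0 * a                      # same direction as a, three times the length
c = np.array([3.0, -1.0, 0.5])   # different direction

# Scaling a vector leaves the cosine distance at (numerically) zero,
# while the Euclidean distance grows with the difference in length.
print("cosine(a, b):   ", cosine_distance(a, b))
print("euclidean(a, b):", np.linalg.norm(a - b))
print("cosine(a, c):   ", cosine_distance(a, c))
```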

To evaluate each combination, we use stratified 10-fold cross-validation with shuffling and a fixed random seed. The scoring metrics are `'accuracy'` and `'balanced_accuracy'`; the best combination is selected on accuracy (`refit='accuracy'`). The best parameters for each model's embeddings are stored in a DataFrame called `best_params`.

Let's see which parameters are considered optimal for each model after the grid search:



In [8]:
%%time
# Perform grid search to find the optimal parameters
grid = {"n_neighbors": range(1,20), "weights":['uniform','distance']}
kf = StratifiedKFold(n_splits = 10, shuffle = True, random_state = 42)
metrics = ['accuracy','balanced_accuracy']
best_params = pd.DataFrame(index=models_norm, columns=grid.keys())  

for i,embedding in enumerate(all_embeddings):
    X = np.array(embedding.tolist())   
    search = GridSearchCV(KNeighborsClassifier(n_jobs = -1), grid, scoring=metrics, refit='accuracy', cv=kf, n_jobs=-1)
    search.fit(X, data['Modality'])

    best_params.at[models_norm[i], 'n_neighbors'] = search.best_params_["n_neighbors"]
    best_params.at[models_norm[i], 'weights'] = search.best_params_["weights"]


In [9]:
display(best_params)
print('Median of best n_neighbors parameters: ' + str(np.median(best_params.n_neighbors)))

| Model | n_neighbors | weights |
|---|---|---|
| BERT_base | 10 | distance |
| SciBERT | 18 | uniform |
| BioBERT_Base | 5 | distance |
| PubMedBERT | 14 | distance |
| COVID_BERT | 15 | uniform |
| COVID_SciBERT | 14 | distance |
| ClinicalCovidBERT | 7 | distance |
| RadBERT | 18 | uniform |
| SPECTER | 6 | distance |
| SPECTER_CLF | 9 | uniform |


Median of best n_neighbors parameters: 13.0
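The same kind of search can be tried on synthetic data. The sketch below is an assumption-laden stand-in, not a rerun of the notebook: the data come from `make_classification`, and the classifier is given `metric="cosine"` explicitly, which is a choice made for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for one model's embeddings (not the notebook's data)
X, y = make_classification(
    n_samples=200, n_features=32, n_informative=8,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

grid = {"n_neighbors": range(1, 20), "weights": ["uniform", "distance"]}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

search = GridSearchCV(
    KNeighborsClassifier(metric="cosine"),  # cosine distance set explicitly
    grid,
    scoring=["accuracy", "balanced_accuracy"],
    refit="accuracy",  # with several scorers, refit names the one that picks the winner
    cv=cv,
)
search.fit(X, y)

# cv_results_ holds one row per parameter combination and one column per scorer
results = pd.DataFrame(search.cv_results_)
best_row = results.loc[search.best_index_]
print(search.best_params_)
print(best_row[["mean_test_accuracy", "mean_test_balanced_accuracy"]])
```

Inspecting `cv_results_` alongside `best_params_` shows how close the runner-up settings are, which helps judge whether a single shared `n_neighbors` (such as the median) is a reasonable compromise.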


### **Comparing [CLS] Embeddings**<a id="CompareCLS"></a>

In [10]:
accuracy, balanced_acc = compare_embeddings(all_embeddings, data['Modality'], k=10, strategy='[CLS]', tables=None, save=False, n_neighbors=13)
%memit

peak memory: 740.20 MiB, increment: 0.00 MiB
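`compare_embeddings` is defined in `setup.ipynb`, and its body is not shown in this notebook. The sketch below is therefore only a guess at the kind of evaluation such a function could run: a cross-validated k-NN score for each set of embeddings, collected in one column per pooling strategy. The helper name `knn_cv_scores`, its signature, and the synthetic data are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier


def knn_cv_scores(embeddings, names, labels, strategy, k=10, n_neighbors=13):
    """Hypothetical helper: cross-validated k-NN scores for each embedding set."""
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    accuracy = pd.DataFrame(index=names, columns=[strategy], dtype=float)
    balanced = pd.DataFrame(index=names, columns=[strategy], dtype=float)
    for name, embedding in zip(names, embeddings):
        X = np.array(embedding.tolist())  # Series of vectors -> 2-D array
        scores = cross_validate(
            KNeighborsClassifier(n_neighbors=n_neighbors),
            X, labels, cv=cv,
            scoring=["accuracy", "balanced_accuracy"],
        )
        accuracy.loc[name, strategy] = scores["test_accuracy"].mean()
        balanced.loc[name, strategy] = scores["test_balanced_accuracy"].mean()
    return accuracy, balanced


# Synthetic demo: one embedding that separates the classes, one that is pure noise
rng = np.random.default_rng(0)
labels = np.repeat(["CT", "X-Ray"], 60)
is_ct = (labels == "CT").astype(float)[:, None]
informative = pd.Series(list(rng.normal(size=(120, 8)) + 2.0 * is_ct))
noise = pd.Series(list(rng.normal(size=(120, 8))))

accuracy, balanced = knn_cv_scores(
    [informative, noise], ["informative", "noise"], labels, strategy="[CLS]"
)
print(accuracy)
```

Reporting balanced accuracy next to plain accuracy matters here because the "Modality" classes are unevenly sized, so a classifier can score well on accuracy while neglecting the smaller classes.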


***

### **Loading [SEP] Embeddings**<a id="SEP"></a>

In [11]:
## LOADING SEP TOKEN EMBEDDINGS (15)
for i, model in enumerate(models_norm):
    sep_embedding_path = os.path.join(RESULTS_PATH, 'embeddings', SEP_embeds[i])
    # Rebind each model's variable to its [SEP] embeddings
    globals()[model] = pd.read_pickle(sep_embedding_path)
%memit

peak memory: 1020.96 MiB, increment: 0.00 MiB


In [12]:
# Collecting all embeddings in an iterable object
all_embeddings = [BERT_base,SciBERT,BioBERT_Base,PubMedBERT,COVID_BERT,COVID_SciBERT,ClinicalCovidBERT,RadBERT,SPECTER,SPECTER_CLF,SPECTER_PRX,BERT_large,BioBERT_Large,PubMedBERT_large,BioCovidBERT]

### **Comparing [SEP] Embeddings**<a id="CompareSEP"></a>

In [13]:
accuracy, balanced_acc = compare_embeddings(all_embeddings, data['Modality'], k=10, strategy='[SEP]',
                                             tables=[accuracy,balanced_acc],
                                             save=False, n_neighbors=13)
%memit

peak memory: 771.13 MiB, increment: 0.01 MiB


***

### **Loading POOL Embeddings**<a id="POOL"></a>

In [14]:
## LOADING POOLED EMBEDDINGS (15)
for i, model in enumerate(models_norm):
    pool_embedding_path = os.path.join(RESULTS_PATH, 'embeddings', POOL_embeds[i])
    # Rebind each model's variable to its pooled embeddings
    globals()[model] = pd.read_pickle(pool_embedding_path)
%memit

peak memory: 1043.77 MiB, increment: 0.00 MiB


In [15]:
# Collecting all embeddings in an iterable object
all_embeddings = [BERT_base,SciBERT,BioBERT_Base,PubMedBERT,COVID_BERT,COVID_SciBERT,ClinicalCovidBERT,RadBERT,SPECTER,SPECTER_CLF,SPECTER_PRX,BERT_large,BioBERT_Large,PubMedBERT_large,BioCovidBERT]

### **Comparing POOL Embeddings**<a id="ComparePOOL"></a>

In [16]:
accuracy, balanced_acc = compare_embeddings(all_embeddings, data['Modality'], k=10, strategy='AVG',
                                             tables=[accuracy,balanced_acc],
                                             save=True, n_neighbors=13)
%memit

peak memory: 770.34 MiB, increment: 0.01 MiB


The chance-level accuracies were obtained using the `DummyClassifier` from scikit-learn with `strategy='stratified'`. A `DummyClassifier` ignores the input features when predicting, so for a fixed random seed its scores are the same regardless of the embeddings provided. These scores serve as simple baselines against which to compare the k-NN results.

In [17]:
chance = chance_knn_accuracy(all_embeddings, data['Modality'], k=10, rs=42)
print("Chance-level accuracy: %0.3f (+/- %0.2f)" % (chance['accuracy'], chance['accuracy_std']))
print()
print("Chance-level balanced accuracy: %0.3f (+/- %0.2f)" % (chance['ba'], chance['ba_std']))

Chance-level accuracy: 0.355 (+/- 0.13)

Chance-level balanced accuracy: 0.248 (+/- 0.09)
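`chance_knn_accuracy` comes from `setup.ipynb` and its implementation is not shown here. As a rough sketch of how such a baseline could be computed, the hypothetical helper `chance_level` below cross-validates a stratified `DummyClassifier`; the labels reuse the class counts from the "Modality" frequency table above, while the feature matrices are placeholders.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate


def chance_level(X, y, k=10, rs=42):
    """Hypothetical baseline: cross-validated scores of a stratified dummy."""
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=rs)
    dummy = DummyClassifier(strategy="stratified", random_state=rs)
    scores = cross_validate(
        dummy, X, y, cv=cv, scoring=["accuracy", "balanced_accuracy"]
    )
    return {
        "accuracy": scores["test_accuracy"].mean(),
        "accuracy_std": scores["test_accuracy"].std(),
        "ba": scores["test_balanced_accuracy"].mean(),
        "ba_std": scores["test_balanced_accuracy"].std(),
    }


# Labels with the class counts reported for "Modality"
y = np.repeat(["X-Ray", "CT", "Multimodal", "Ultrasound"], [243, 197, 70, 50])

# Two unrelated feature matrices: the dummy never looks at them
rng = np.random.default_rng(0)
scores_zeros = chance_level(np.zeros((len(y), 1)), y)
scores_random = chance_level(rng.normal(size=(len(y), 8)), y)

print(scores_zeros)
print(scores_zeros == scores_random)
```

Because the predictions depend only on the class frequencies and the seed, a single baseline is enough for all models and pooling strategies.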


In [18]:
df_display(accuracy, title='k-NN Accuracy over "Modality" prediction', highlight=True)



| Model | [CLS] | [SEP] | AVG |
|---|---|---|---|
| BERT base | 0.543 | 0.577 | 0.613 |
| SciBERT | 0.582 | 0.564 | 0.636 |
| BioBERT Base | 0.532 | 0.652 | 0.611 |
| PubMedBERT | 0.577 | 0.745 | 0.643 |
| COVID BERT | 0.564 | 0.529 | 0.602 |
| COVID SciBERT | 0.645 | 0.605 | 0.627 |
| ClinicalCovidBERT | 0.654 | 0.645 | 0.634 |
| RadBERT | 0.579 | 0.579 | 0.582 |
| SPECTER | 0.825 | 0.838 | 0.688 |
| SPECTER CLF | 0.777 | 0.771 | 0.662 |


In [19]:
df_display(balanced_acc, title='k-NN Balanced Accuracy over "Modality" prediction', highlight=True)



| Model | [CLS] | [SEP] | AVG |
|---|---|---|---|
| BERT base | 0.386 | 0.433 | 0.499 |
| SciBERT | 0.439 | 0.41 | 0.482 |
| BioBERT Base | 0.36 | 0.526 | 0.462 |
| PubMedBERT | 0.429 | 0.586 | 0.501 |
| COVID BERT | 0.43 | 0.376 | 0.449 |
| COVID SciBERT | 0.499 | 0.479 | 0.506 |
| ClinicalCovidBERT | 0.532 | 0.508 | 0.496 |
| RadBERT | 0.38 | 0.38 | 0.397 |
| SPECTER | 0.753 | 0.767 | 0.578 |
| SPECTER CLF | 0.688 | 0.678 | 0.545 |
