Prerequisites
1. Tokenize and preprocess the RELISH and TREC data sets.
    - Use the [medline-preprocessing](https://github.com/zbmed-semtec/medline-preprocessing) module to retrieve both the RELISH and TREC data sets.
    - Make sure to tokenize and preprocess both data sets, then save each one as a .npy file.

In [None]:
import os
from scipy import spatial
import numpy as np
import pandas as pd
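# Switch to the Code directory so that Process.py can be imported; the relative paths below are resolved from there.
os.chdir('../Code')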
from Process import prepareFromNPY, generateWord2VecModel, generateDocumentEmbeddings

Code Strategy
1. Retrieve the PMIDs, Titles and Abstracts separately from each data set.
2. Train a word2vec model using either the RELISH or TREC data set.
    - We use gensim to train the model.
    - Outputs a .model file.
    - Uses pickle to split the .model file into several files in case it exceeds a size threshold.
3. Generate the document embeddings from either the RELISH or TREC data set.
    - Retrieve word embeddings of each token using the word2vec model.
    - Calculate the document embedding of each document using the centroids function,
    which takes the average of all its word embeddings.
    - A document consists of a pair of title and abstract.
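The centroid step in point 3 can be illustrated with a small, dependency-free sketch. This is not the code from `Process.py`: the `centroid` helper, the toy 3-dimensional vectors and the decision to skip tokens that are missing from the vocabulary are all assumptions made for illustration.

```python
def centroid(vectors):
    """Return the element-wise mean of a list of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Hypothetical 3-dimensional word embeddings; real word2vec vectors are much longer.
toy_embeddings = {
    "protein": [1.0, 0.0, 2.0],
    "binding": [3.0, 2.0, 0.0],
    "site": [2.0, 4.0, 1.0],
}
title_tokens = ["protein", "binding"]
abstract_tokens = ["binding", "site", "unknownword"]

# A document is the title plus the abstract.
tokens = title_tokens + abstract_tokens
# Assumption: tokens without an embedding are skipped rather than raising an error.
vectors = [toy_embeddings[t] for t in tokens if t in toy_embeddings]

document_embedding = centroid(vectors)
print(document_embedding)
```

Because "binding" occurs in both the title and the abstract, it contributes twice to the average, so frequent tokens pull the document embedding towards their own vectors.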

1. Retrieve the PMIDs, Titles and Abstracts separately from each data set.

In [None]:
# Parse the RELISH .npy file into PMIDs, titles and abstracts.
pmidRELISH, titleRELISH, abstractRELISH = prepareFromNPY("../Data/RELISH/Tokenized_Input/RELISH_Tokenized_Sample.npy")

2. Train a word2vec model using either the RELISH or TREC data set.

In [None]:
# Generate the word2vec model using gensim.
generateWord2VecModel(titleRELISH, abstractRELISH, "../Data/RELISH/Output/word2vecModel/model.model")

3. Generate the document embeddings from either the RELISH or TREC data set and save as .npy.

In [None]:
# Generate document embeddings for each PMID and save them as .npy files.
generateDocumentEmbeddings(pmidRELISH, titleRELISH, abstractRELISH, directoryOut="../Data/RELISH/Output/DocumentEmbeddings", gensimModelPath="../Data/RELISH/Output/word2vecModel/model.model", saveAs='numpy')

Alternatively, save the embeddings as a pandas DataFrame:

In [None]:
# Generate document embeddings for each PMID and save them as a pandas DataFrame.
generateDocumentEmbeddings(pmidRELISH, titleRELISH, abstractRELISH, directoryOut="../Data/RELISH/Output/DocumentEmbeddings", gensimModelPath="../Data/RELISH/Output/word2vecModel/model.model", saveAs='pandas')

Accessing the embeddings and calculating cosine similarity.

Access embeddings via .npy:

In [None]:
pmidA = 17928366
pmidB = 22569528
embeddingA = np.load(f'../Data/RELISH/Output/DocumentEmbeddings/{pmidA}.npy', allow_pickle=True)
embeddingB = np.load(f'../Data/RELISH/Output/DocumentEmbeddings/{pmidB}.npy', allow_pickle=True)
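# scipy returns the cosine distance, so subtract it from 1 to get the similarity.
similarity = 1 - spatial.distance.cosine(embeddingA, embeddingB)
print(f'The cosine similarity score between the PMIDs {pmidA} and {pmidB}: {round(similarity, 2)}')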
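To make explicit what `1 - spatial.distance.cosine(...)` computes, the sketch below spells out cosine similarity in plain Python. The `cosine_similarity` helper and the toy vectors are hypothetical and serve only as an illustration; for real embeddings the scipy call used in this notebook is the more practical choice.

```python
import math


def cosine_similarity(a, b):
    """Return the dot product of a and b divided by the product of their norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical): same direction, orthogonal, and opposite direction.
print(round(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 2))
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 2))
print(round(cosine_similarity([1.0, 0.0], [-1.0, 0.0]), 2))
```

The score depends only on the angle between the two vectors, not on their lengths, which is why a document and a scaled copy of it score as identical.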

Access embeddings via pickle:

In [None]:
# Load the pickled DataFrame and compare the first two documents.
df = pd.read_pickle("../Data/RELISH/Output/DocumentEmbeddings/embeddings.pkl")
embeddingA = df['embeddings'][0]
embeddingB = df['embeddings'][1]
similarity = 1 - spatial.distance.cosine(embeddingA, embeddingB)
print(f"The cosine similarity score between the PMIDs {df['pmids'][0]} and {df['pmids'][1]}: {round(similarity, 2)}")