Prerequisites
1. Tokenize and preprocess the RELISH and TREC data sets.
    - Use the [medline-preprocessing](https://github.com/zbmed-semtec/medline-preprocessing) module to retrieve both the RELISH and TREC data sets.
    - Tokenize and preprocess both data sets, then save each one as a .npy file.

In [None]:
import os
from scipy import spatial
import numpy as np
import pandas as pd
os.chdir('../Code')
from generate_embeddings import prepare_from_npy, generate_Word2Vec_model, generate_document_embeddings

Code Strategy
1. Retrieve the PMIDs, titles and abstracts separately from each data set.
2. Train a word2vec model using either the RELISH or TREC data set.
    - We use gensim to train the model.
    - Outputs a .model file.
    - Uses pickle to split the .model file into several files in case it exceeds a size threshold.
3. Generate the document embeddings from either the RELISH or TREC data set.
    - Retrieve word embeddings of each token using the word2vec model.
    - Calculate the document embedding for each document using the centroids function,
    which takes the average of all of the document's word embeddings.
    - A document consists of a pair of title and abstract.
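
The centroid step in point 3 can be illustrated with a minimal sketch. The toy vocabulary, the `centroid` helper and the rule of skipping unknown tokens are assumptions made for illustration; they are not taken from `generate_embeddings.py`.

```python
import numpy as np

# Hypothetical toy vocabulary: token -> 3-dimensional word vector.
# A real word2vec model would supply these vectors instead.
word_vectors = {
    "protein": np.array([1.0, 0.0, 2.0]),
    "binding": np.array([3.0, 2.0, 0.0]),
}

def centroid(tokens, vectors, dim):
    """Average the vectors of all known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# "unseen" has no vector (assumption: such tokens are simply skipped).
doc = ["protein", "binding", "unseen"]
doc_embedding = centroid(doc, word_vectors, dim=3)
print(doc_embedding)  # [2. 1. 1.]
```

Skipping tokens without a vector is one possible design choice; it keeps rare words that were dropped from the vocabulary from breaking the average.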

1. Retrieve the PMIDs, titles and abstracts separately from each data set.

In [None]:
#Parse the RELISH .npy file.
pmidRELISH, docRELISH = prepare_from_npy("../data/RELISH/Tokenized_Input/RELISH_Tokenized_Sample.npy")
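
To make the parsing step more concrete, the sketch below writes and reads back a small .npy file. The row layout (PMID, title tokens, abstract tokens) and the choice to concatenate title and abstract into one token list are assumptions for illustration only; the actual behavior of `prepare_from_npy` may differ.

```python
import os
import tempfile
import numpy as np

# Hypothetical records: (PMID, title tokens, abstract tokens).
records = [
    (101, ["gene", "expression"], ["we", "study", "expression"]),
    (102, ["protein", "folding"], ["folding", "is", "modelled"]),
]

# Fill an object array cell by cell so the ragged token lists are stored as-is.
rows = np.empty((len(records), 3), dtype=object)
for i, record in enumerate(records):
    for j, field in enumerate(record):
        rows[i, j] = field

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.npy")
    np.save(path, rows)
    # Object arrays are pickled on disk, so loading needs allow_pickle=True.
    loaded = np.load(path, allow_pickle=True)

# Split into PMIDs and documents (title tokens followed by abstract tokens).
pmids = [int(row[0]) for row in loaded]
docs = [list(row[1]) + list(row[2]) for row in loaded]
print(pmids)    # [101, 102]
print(docs[0])  # ['gene', 'expression', 'we', 'study', 'expression']
```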

2. Train a word2vec model using either the RELISH or TREC data set.

In [None]:
#Generate the word2vec model using gensim.
params = {'vector_size':200, 'epochs':5, 'window':5, 'min_count':2, 'workers':4}
generate_Word2Vec_model(docRELISH, pmidRELISH, params, "../data/RELISH/model.model", False)
#generate_Word2Vec_model(article_doc: list, pmids: list, params: list, filepath_out: str, use_pretrained: bool)

3. Generate the document embeddings from either the RELISH or TREC data set and save as .npy.

In [None]:
#Generate document embeddings for each PMID.
generate_document_embeddings(pmidRELISH, docRELISH, directory_out="../data/RELISH/", gensim_model_path="../data/RELISH/model.model", param_iteration=0)

Access embeddings via pickle:

In [None]:
df = pd.read_pickle("../data/RELISH/0/embeddings.pkl", compression='infer', storage_options=None)
embeddingA = df['embeddings'][0]
embeddingB = df['embeddings'][1]
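# SciPy returns the cosine distance, so the similarity is 1 minus that value.
similarity = 1 - spatial.distance.cosine(embeddingA, embeddingB)
print(f"The cosine similarity score between the PMIDs {df['pmids'][0]} and {df['pmids'][1]}: {round(similarity, 2)}")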
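
For reference, the similarity score printed above can also be computed directly with NumPy. This is a minimal sketch on made-up vectors; the `cosine_similarity` helper is hypothetical and not part of the repository.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vectors pointing in the same direction score 1, orthogonal vectors score 0.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(round(same_direction, 2), round(orthogonal, 2))  # 1.0 0.0
```

Note that the score is undefined when either vector is all zeros, since the denominator becomes zero.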