# Embeddings

Prerequisites

1. Tokenize and preprocess the RELISH and TREC data sets.
    - Use the medline-preprocessing module to retrieve both the RELISH and TREC data sets.
    - Tokenize and preprocess both data sets, then save each one as a .npy file.
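
The saving step might look roughly like the sketch below. The row layout (PMID, title tokens, abstract tokens), the whitespace tokenizer `simple_tokenize`, and the helper `save_preprocessed` are assumptions made for illustration; they are not the actual output format or API of the medline-preprocessing module.

```python
import os
import tempfile

import numpy as np


def simple_tokenize(text):
    """Hypothetical stand-in tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def save_preprocessed(rows, path):
    """Save (pmid, title, abstract) rows as an object array of token lists."""
    data = np.empty((len(rows), 3), dtype=object)
    for i, (pmid, title, abstract) in enumerate(rows):
        data[i, 0] = pmid
        data[i, 1] = simple_tokenize(title)
        data[i, 2] = simple_tokenize(abstract)
    # Object arrays (here: Python lists inside cells) are stored via pickle
    np.save(path, data, allow_pickle=True)
    return data


# Made-up example records, written to a temporary directory
rows = [
    ("101", "Gene expression", "We study gene expression in yeast."),
    ("102", "Protein folding", "Folding pathways are analysed."),
]
out_path = os.path.join(tempfile.mkdtemp(), "sample.npy")
saved = save_preprocessed(rows, out_path)

# Loading an object array requires allow_pickle=True
loaded = np.load(out_path, allow_pickle=True)
```

Because the token lists have different lengths, the array has to use `dtype=object`, which is why `allow_pickle=True` is needed when loading.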

In [None]:
import os

import numpy as np
import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Switch to the Code directory so the local embeddings module can be imported
os.chdir('../Code')
from embeddings import process_data_from_npy, createDoc2VecModel, saveDoc2VecModel, create_document_embeddings

Code Strategy

1. Retrieve the data separately from the TREC or RELISH corpus.
    - Each data set provides PMIDs, titles, and abstracts.
    - Concatenate each title with its abstract to obtain one document per PMID.
2. Train a doc2vec model using either the RELISH or TREC data set.
    - We use gensim to train the model.
    - We use PMIDs as tags for each document during training.
    - Training produces a model, which we save as a .model file.
3. Generate the document embeddings from either the RELISH or TREC data set.
    - Retrieve document embeddings using the doc2vec model by providing PMIDs as tags to the model.
    - Save the embeddings in the .npy format.

Retrieve the PMIDs, titles, abstracts, and documents from the TREC or RELISH data set

In [None]:
# For RELISH
pmids, titles, abstracts, docs = process_data_from_npy("../Data/RELISH/TSV/sample.npy")
# For TREC (uncomment to use TREC instead; running both calls would overwrite the RELISH data)
# pmids, titles, abstracts, docs = process_data_from_npy("../Data/TREC/TSV/sample.npy")
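
For orientation, a loader of this kind could be sketched as follows. This is not the implementation of `process_data_from_npy`; the helper `load_docs` is hypothetical, and the file layout (one row per article holding a PMID, a list of title tokens, and a list of abstract tokens) is an assumption.

```python
import os
import tempfile

import numpy as np


def load_docs(path):
    """Hypothetical loader: return pmids, titles, abstracts, and title+abstract docs."""
    data = np.load(path, allow_pickle=True)
    pmids = [str(row[0]) for row in data]
    titles = [list(row[1]) for row in data]
    abstracts = [list(row[2]) for row in data]
    # A document is the title tokens followed by the abstract tokens
    docs = [title + abstract for title, abstract in zip(titles, abstracts)]
    return pmids, titles, abstracts, docs


# Build a tiny stand-in file with made-up records
records = [
    ("101", ["gene", "expression"], ["in", "yeast"]),
    ("102", ["protein", "folding"], ["pathways"]),
]
demo = np.empty((len(records), 3), dtype=object)
for i, record in enumerate(records):
    for j, value in enumerate(record):
        demo[i, j] = value
demo_path = os.path.join(tempfile.mkdtemp(), "sample.npy")
np.save(demo_path, demo, allow_pickle=True)

demo_pmids, demo_titles, demo_abstracts, demo_docs = load_docs(demo_path)
```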

Define the parameters for the Doc2Vec model

In [None]:
# dm=0 selects the PV-DBOW training algorithm (dm=1 would select PV-DM)
params = {'dm': 0, 'epochs': 5, 'min_count': 1, 'vector_size': 300, 'window': 7, 'workers': 8}

Generate and train the Doc2Vec model using either the RELISH or TREC corpus.

In [None]:
model = createDoc2VecModel(pmids, docs, params)

Save the Doc2Vec model

In [None]:
# Create the model directory (no error if it already exists)
model_directory = '../Data/Models/'
os.makedirs(model_directory, exist_ok=True)
model_path = os.path.join(model_directory, 'doc2vec.model')

# Save the Doc2Vec model
saveDoc2VecModel(model, model_path)

Generate the document embeddings from either the RELISH or TREC corpus and save as .npy.

In [None]:
# Create the directory for the embeddings (no error if it already exists)
embeddings_directory = '../Data/Embeddings/'
os.makedirs(embeddings_directory, exist_ok=True)

# Generate and save the embeddings
create_document_embeddings(pmids, model, embeddings_directory)
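
The lookup-and-save step can be sketched as below. The helper `save_embeddings` and the output file names are hypothetical, and `create_document_embeddings` may store its results in a different layout. The `lookup` argument is anything that maps a PMID tag to a vector; with a trained gensim 4.x model that would be `model.dv`, while a plain dictionary stands in here so the example runs without a model.

```python
import os
import tempfile

import numpy as np


def save_embeddings(pmids, lookup, directory):
    """Hypothetical helper: stack one vector per PMID and save as .npy files."""
    os.makedirs(directory, exist_ok=True)
    matrix = np.vstack([lookup[str(pmid)] for pmid in pmids])
    # Row i of embeddings.npy belongs to entry i of pmids.npy
    np.save(os.path.join(directory, "embeddings.npy"), matrix)
    np.save(os.path.join(directory, "pmids.npy"),
            np.array([str(pmid) for pmid in pmids]))
    return matrix


# Made-up vectors standing in for trained document embeddings
fake_vectors = {
    "101": np.array([0.1, 0.2, 0.3]),
    "102": np.array([0.4, 0.5, 0.6]),
}
demo_dir = os.path.join(tempfile.mkdtemp(), "Embeddings")
matrix = save_embeddings(["101", "102"], fake_vectors, demo_dir)

restored = np.load(os.path.join(demo_dir, "embeddings.npy"))
restored_pmids = np.load(os.path.join(demo_dir, "pmids.npy"))
```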