# Eval helper Overview
* I wrote two helper scripts for running the time-leveraged pipeline: `eval_helper` and `analysis_helper`.
* This notebook focuses on how to use `eval_helper`.
* `eval_helper` finds the years in our modified SemMedDB dataset (TLDR) with the most approved drug-disease indications and can train a knowledge graph embedding model on those selected years.
* It has two main steps:
    * pick the best years to train models on
    * train models on the specified years

In [1]:
import sys
import optuna

sys.path.append("../tools")
import eval_helper

## Initialize the eval_helper class
* The class docstring is fairly self-explanatory about how to use the class.
* `eval_helper` takes several arguments:
    * `models_to_run`: the number of years you want to train models on
    * `strategy`: the approach used to identify the best years to train models on. There are two options: `max_valid` and `max_test_valid`.
    * `train_models`: whether or not to train a model. If `True`, it runs the training loop by calling PyKEEN on the dataset years chosen above.

### Using the max_test_valid strategy

In [2]:
x = eval_helper.TimeDrugRepo(
    data_dir="/home/rogertu/.data/pykeen/datasets/timeresolvedkg/data/time_networks-6_metanode",
    strategy="max_test_valid",
    build_dataset_kwargs={"split_ttv": True},
)

#### Recommended years to train models on

In [3]:
x.recommended_years

[1994, 2020, 1979, 2019, 2017]

In [4]:
x.recommended_counts

| year (i64) | train_indications (i64) | test_indications (i64) | valid_indications (i64) |
| --- | --- | --- | --- |
| 1979 | 1393 | 348 | 1593 |
| 1994 | 2412 | 602 | 1564 |
| 2017 | 3723 | 930 | 444 |
| 2019 | 3989 | 997 | 287 |
| 2020 | 4204 | 1050 | 218 |
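
The counts above show a trade-off between years: later years have more train and test indications but far fewer validation indications. The sketch below makes that explicit using only the rows printed above. Reading the split as roughly 80/20 train/test is an inference from these numbers, not something stated in the `eval_helper` documentation shown here.

```python
# Rows copied from x.recommended_counts above:
# (year, train_indications, test_indications, valid_indications)
rows = [
    (1979, 1393, 348, 1593),
    (1994, 2412, 602, 1564),
    (2017, 3723, 930, 444),
    (2019, 3989, 997, 287),
    (2020, 4204, 1050, 218),
]

# Share of the train + test indications that fall in the test split.
test_share = {year: test / (train + test) for year, train, test, _ in rows}

# Number of validation indications relative to training indications.
valid_to_train = {year: valid / train for year, train, _, valid in rows}

for year, *_ in rows:
    print(
        f"{year}: test share={test_share[year]:.3f}, "
        f"valid/train={valid_to_train[year]:.2f}"
    )
```

Every row has a test share of about 0.20, while the valid/train ratio falls from above 1 in 1979 to below 0.1 in 2020.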


### Using the max_valid strategy

In [5]:
y = eval_helper.TimeDrugRepo(
    data_dir="/home/rogertu/.data/pykeen/datasets/timeresolvedkg/data/time_networks-6_metanode",
    strategy="max_valid",
    build_dataset_kwargs={"split_ttv": True},
)

#### Recommended years to train models on

In [6]:
y.recommended_years

[1994, 1979, 1964, 2007, 1952]

In [7]:
y.recommended_counts

| year (i64) | train_indications (i64) | test_indications (i64) | valid_indications (i64) |
| --- | --- | --- | --- |
| 1952 | 76 | 18 | 498 |
| 1964 | 422 | 105 | 920 |
| 1979 | 1393 | 348 | 1593 |
| 1994 | 2412 | 602 | 1564 |
| 2007 | 3084 | 770 | 1071 |
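
One plausible reading of `max_valid`, going only by its name and the counts printed in this notebook, is "keep the years with the most validation indications". The sketch below checks that reading against the eight years that appear in the two `recommended_counts` outputs. The helper `top_years_by_valid` is hypothetical and is not part of `eval_helper`; the real implementation sees every year in the dataset and may rank or order them differently.

```python
# Rows copied from the two recommended_counts outputs in this notebook:
# (year, train_indications, test_indications, valid_indications)
rows = [
    (1952, 76, 18, 498),
    (1964, 422, 105, 920),
    (1979, 1393, 348, 1593),
    (1994, 2412, 602, 1564),
    (2007, 3084, 770, 1071),
    (2017, 3723, 930, 444),
    (2019, 3989, 997, 287),
    (2020, 4204, 1050, 218),
]


def top_years_by_valid(rows, models_to_run):
    """Hypothetical helper: years with the most validation indications."""
    ranked = sorted(rows, key=lambda row: row[3], reverse=True)
    return [year for year, *_ in ranked[:models_to_run]]


picked = top_years_by_valid(rows, models_to_run=5)
print(picked)
```

Among these eight years, the five picked this way are the same set as `y.recommended_years`, although the order in which `eval_helper` returns them differs from a plain descending sort.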


## Train models
* It's as simple as defining a model configuration and specifying it when initializing the class.
* Here we'll use TransE as an example of how to train models.

In [None]:
# storage location for the postgres hpo database
storage = optuna.storages.RDBStorage(
    url="postgresql+psycopg2://rogertu:admin@localhost/optuna_test"
)

# get the best parameters from the hpo study
transe = optuna.study.load_study(storage=storage, study_name="transe_hpo_time3")
transe_params = transe.best_params

In [None]:
# model kwargs and parameters for training
model_kwargs = dict(
    # Model
    model="TransE",
    model_kwargs=dict(scoring_fct_norm=2, embedding_dim=100),
    # Loss
    loss="InfoNCELoss",
    loss_kwargs=dict(
        margin=transe_params["loss.margin"],
        log_adversarial_temperature=transe_params["loss.log_adversarial_temperature"],
    ),
    # Regularization
    regularizer="LpRegularizer",
    regularizer_kwargs=dict(weight=transe_params["regularizer.weight"]),
    # Training
    training_kwargs=dict(
        num_epochs=10,
        batch_size=144,
        checkpoint_frequency=0,
        checkpoint_name="TransE.pt",
    ),
    # Negative Sampler
    negative_sampler="basic",
    negative_sampler_kwargs=dict(
        num_negs_per_pos=transe_params["negative_sampler.num_negs_per_pos"],
        # corruption_scheme=("h", "r", "t"),  # defines which part of the triple to corrupt
        filtered=True,  # Uses a default 'Bloom' filter to minimize false negatives
    ),
    # optimizer
    optimizer="Adam",
    optimizer_kwargs=dict(lr=transe_params["optimizer.lr"]),
    # lr scheduler
    lr_scheduler="ExponentialLR",
    lr_scheduler_kwargs=dict(gamma=transe_params["lr_scheduler.gamma"]),
    # Tracking
    result_tracker="wandb",
    result_tracker_kwargs=dict(project="KGE-on-time-results", group="transe"),
    # Misc
    device="cuda:0",  # use gpu position 0
)
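
The configuration above looks up each tuned value by a dotted key such as `loss.margin` or `optimizer.lr`, where the part before the dot names the component and the part after it names the keyword argument. The sketch below shows one way to group such a flat dictionary by component so that each group can be passed as a `*_kwargs` dict. The helper `group_dotted_params` is hypothetical, and the numbers are placeholders for illustration only, not values from the `transe_hpo_time3` study.

```python
from collections import defaultdict


def group_dotted_params(params):
    """Hypothetical helper: {"loss.margin": 1.0} -> {"loss": {"margin": 1.0}}."""
    grouped = defaultdict(dict)
    for dotted_key, value in params.items():
        # Split only on the first dot: "<component>.<argument name>".
        component, _, name = dotted_key.partition(".")
        grouped[component][name] = value
    return dict(grouped)


# Placeholder values for illustration only; these are NOT results from the HPO study.
example_params = {
    "loss.margin": 1.0,
    "loss.log_adversarial_temperature": 0.0,
    "regularizer.weight": 0.01,
    "negative_sampler.num_negs_per_pos": 10,
    "optimizer.lr": 0.001,
    "lr_scheduler.gamma": 0.9,
}

grouped = group_dotted_params(example_params)
loss_kwargs = grouped["loss"]
optimizer_kwargs = grouped["optimizer"]
print(loss_kwargs)
print(optimizer_kwargs)
```

Writing the lookups out by hand, as the notebook does, keeps fixed settings such as `filtered=True` next to the tuned ones. Grouping them this way is only a convenience when every tuned key should be forwarded unchanged.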

In [None]:
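# train models
# NOTE: as written, this call matches In [2] and passes neither `train_models`
# nor the `model_kwargs` defined above; see the class docstring for how to supply them.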
x = eval_helper.TimeDrugRepo(
    data_dir="/home/rogertu/.data/pykeen/datasets/timeresolvedkg/data/time_networks-6_metanode",
    strategy="max_test_valid",
    build_dataset_kwargs={"split_ttv": True},
)