In [None]:
import os
import pandas as pd
import torch
from hopwise.data import create_dataset

### 2. Embeddings preprocessing for PGPR

In this step we set the path of the checkpoint file and load it.

In [None]:
checkpoint_name = "saved/.....pth"

In [None]:
checkpoint = torch.load(checkpoint_name, weights_only=False)

**The detected embeddings are**


Now let's list the embeddings that have been saved in the checkpoint.

In [None]:
checkpoint["state_dict"].keys()

**Maybe you want to exclude some embeddings?**

As mentioned before, there may be embeddings that we want to exclude.

In [None]:
excluded = ["relation_bias_embedding.weight","norm_vec.weight","proj_mat_e.weight"]
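
As a minimal sketch of how such an exclusion list can be applied, the snippet below filters parameter names from a toy dictionary. `toy_state_dict` and its shapes are made up for illustration and only stand in for `checkpoint["state_dict"]`.

```python
# Toy stand-in for checkpoint["state_dict"]: parameter name -> shape (hypothetical values).
toy_state_dict = {
    "user_embedding.weight": (5, 4),
    "entity_embedding.weight": (8, 4),
    "relation_embedding.weight": (3, 4),
    "norm_vec.weight": (3, 4),
}
excluded = ["relation_bias_embedding.weight", "norm_vec.weight", "proj_mat_e.weight"]

# Keep only the parameters that are not in the exclusion list.
kept = [name for name in toy_state_dict if name not in excluded]
print(kept)
```

Names in `excluded` that do not appear in the checkpoint are simply ignored, so the same list can be reused across models.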

**The detected dataset is**

Let's check which dataset was used and the folder where it is stored, which is also where the embeddings will be saved.

In [None]:
dataset_name = checkpoint["config"]["dataset"]
data_path = checkpoint["config"]["data_path"]
# Display both values (a notebook cell only shows its last expression)
dataset_name, data_path

**Create the mappings between embedding and original entity/relation/user**
- Users have a 1-1 mapping, so they need no extra mapping.

- We assume that indexing starts at 1 (typically 0 is reserved for [PAD]).

*Note from Francesca: These mappings are needed because the embeddings saved in the model are simply matrices with one row per user, so we have to map each row to the corresponding user in the original dataset. The same holds for entities and relations. So, if an entity in ml1m is called 'mdajd12', then in the final .entityemb file that you will see later in this notebook, the row for the i-th entity will carry its name from the original dataset.*
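
To illustrate the idea on a toy example, the sketch below inverts a token-to-id dictionary so that a row index can be translated back to the original name. The dictionary `token2id` and its contents are hypothetical; it is only assumed to have the same shape as `dataset.field2token_id["tail_id"]`.

```python
# Hypothetical token -> id mapping, shaped like dataset.field2token_id["tail_id"].
token2id = {"[PAD]": 0, "mdajd12": 1, "m.0abc": 2}

# Invert it so that a row index of the embedding matrix gives back the original token.
id2token = {idx: token for token, idx in token2id.items()}

# Row 1 of the entity embedding matrix belongs to the entity called "mdajd12".
print(id2token[1])
```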

In [None]:
dataset = create_dataset(checkpoint["config"])

In [None]:
# Create the reverse mappings (id -> original token)
uid2token = {idx: token for token, idx in dataset.field2token_id["user_id"].items()}
print(uid2token)
eid2token = {idx: token for token, idx in dataset.field2token_id["tail_id"].items()}
print(eid2token)
rid2token = {idx: token for token, idx in dataset.field2token_id["relation_id"].items()}
print(rid2token)

In [None]:
# # add dummy relation, check kge code
# rid2token[len(rid2token)] = 'ui_dummy_relation'

In [None]:
assert len(eid2token.keys()) == checkpoint["state_dict"]["entity_embedding.weight"].shape[0]
assert len(rid2token.keys()) == checkpoint["state_dict"]["relation_embedding.weight"].shape[0]

*If the assertion fails, make sure that you trained the KGE model without explicitly adding dummy relations/entities when creating the relation/entity embeddings!*
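
A bare `assert` does not say by how much the sizes differ. As a hedged alternative, the sketch below reports both numbers. `check_sizes` is a hypothetical helper and not part of hopwise, and the values passed to it are toy data.

```python
def check_sizes(name, mapping, n_rows):
    """Raise a descriptive error if the mapping and the embedding matrix disagree in size."""
    if len(mapping) != n_rows:
        raise ValueError(
            f"{name}: mapping has {len(mapping)} ids but the embedding matrix has {n_rows} rows"
        )


# Toy values (hypothetical): 3 relation ids and 3 embedding rows, so this passes silently.
check_sizes("relation", {0: "[PAD]", 1: "r1", 2: "r2"}, 3)
```

In the notebook, the second and third arguments would be, for example, `rid2token` and `checkpoint["state_dict"]["relation_embedding.weight"].shape[0]`.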

### Create the new embeddings

Now we take the embeddings, apply the mappings, and save the result in a format ready to be used with PGPR 🥳🚀

In [None]:
def format_embedding(weight, columns, emb_type):
    weight = weight.detach().cpu().numpy()
    new_emb_dict = {columns[0]: list(), columns[1]: list()}

    # Pick the id -> token mapping for this embedding type (None if there is none)
    mapping = {"entity": eid2token, "relation": rid2token, "user": uid2token}.get(emb_type)

    # Create index, skipping row 0 (assumed to be reserved for [PAD])
    new_emb_dict[columns[0]] = [mapping[idx] if mapping is not None else idx for idx in range(1, weight.shape[0])]

    # Create embedding
    new_emb_dict[columns[1]] = [" ".join(f"{x}" for x in row) for row in weight[1:]]

    filename = f"{dataset_name}.{emb_type}emb"
    df = pd.DataFrame(new_emb_dict)
    print(f"[+] Saving the new {dataset_name} {columns[0]} embedding in {data_path}/{filename}!")
    df.to_csv(os.path.join(data_path, filename), sep="\t", index=False)
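
To make the resulting file layout concrete, here is a small sketch that writes the same kind of tab-separated content to an in-memory buffer using only the standard library. The weights and tokens are toy values, and the layout is assumed to match what `df.to_csv(..., sep="\t", index=False)` produces for these two columns.

```python
import csv
import io

# Toy weight matrix (hypothetical values); row 0 plays the role of [PAD] and is skipped.
weight = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]
id2token = {0: "[PAD]", 1: "u_a", 2: "u_b"}

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
# Header: one token column and one float sequence column.
writer.writerow(["user_embedding_id:token", "user_embedding:float_seq"])
for idx in range(1, len(weight)):
    # Each row holds the original token and the space-separated embedding values.
    writer.writerow([id2token[idx], " ".join(str(x) for x in weight[idx])])

print(buffer.getvalue())
```

Each line after the header holds one token, a tab, and the embedding values separated by single spaces.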

In [None]:

for emb_name, emb in checkpoint["state_dict"].items():
    if emb_name in excluded:
        continue
    # Which kind of embedding is it? Entity? User? Relation? Item?
    emb_type = emb_name.split("_")[0]
    # Create the new embedding file columns
    columns = [f"{emb_type}_embedding_id:token", f"{emb_type}_embedding:float_seq"]
    print(f"[+] Formatting {emb_name} with columns {columns}")
    format_embedding(emb, columns, emb_type)
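
The embedding type is taken from the prefix of the parameter name. The short sketch below shows what `split("_")[0]` returns for a few names from this notebook, and why parameters such as `norm_vec.weight` belong in the exclusion list: their prefix does not correspond to any of the three mappings.

```python
names = [
    "user_embedding.weight",
    "entity_embedding.weight",
    "relation_embedding.weight",
    "norm_vec.weight",
]

# Derive the type prefix exactly as the loop above does.
types = {name: name.split("_")[0] for name in names}
for name, emb_type in types.items():
    print(f"{name} -> {emb_type}")
```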

### Next?

Let's check that everything is fine!

The newly saved embeddings should now be visible in the dataset folder.

In [None]:
os.listdir(data_path)
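
Instead of reading the listing by eye, the check can be automated. The sketch below is one possible way to do it: `missing_embedding_files` is a hypothetical helper, and the demo runs on a temporary folder rather than on the real `data_path`.

```python
import os
import tempfile


def missing_embedding_files(folder, dataset_name, emb_types=("user", "entity", "relation")):
    """Return the expected '<dataset>.<type>emb' files that are not present in the folder."""
    expected = [f"{dataset_name}.{emb_type}emb" for emb_type in emb_types]
    present = set(os.listdir(folder))
    return [filename for filename in expected if filename not in present]


# Demo on a temporary folder that contains only the user embedding file.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "ml-1m.useremb"), "w").close()
    missing = missing_embedding_files(tmp, "ml-1m")

print(missing)
```

In the notebook, calling `missing_embedding_files(data_path, dataset_name)` should return an empty list when all three files were written.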

### It's not yet over... final steps

Some final checks to make sure everything runs smoothly...

Once every cell has run, the previous cells should have produced output like the one shown below.

**We want to make sure that the dataset configuration is ok.**

Suppose that the output of the format embedding phase is:

```text
    [+] Formatting user_embedding.weight with columns ['userid:token', 'user_embedding:float_seq']
    [+] Saving the new ml-1m userid:token embedding in /home/recsysdatasets/ml-1m/ml-1m.useremb!
    [+] Formatting entity_embedding.weight with columns ['entityid:token', 'entity_embedding:float_seq']
    [+] Saving the new ml-1m entityid:token embedding in /home/recsysdatasets/ml-1m/ml-1m.entityemb!
    [+] Formatting relation_embedding.weight with columns ['relationid:token', 'relation_embedding:float_seq']
    [+] Saving the new ml-1m relationid:token embedding in /home/recsysdatasets/ml-1m/ml-1m.relationemb!
```