### Format Knowledge-Graph Embeddings for Hopwise `dataset.get_preload_weight()` function
This notebook shows how to format the embeddings learned by knowledge-graph embedding (KGE) methods so that they can be loaded with `dataset.get_preload_weight`.


📚 [Load Pretrained Embedding Documentation](https://recbole.io/docs/user_guide/usage/load_pretrained_embedding.html)

**Load Libraries**

In [None]:
import torch
import os
import pandas as pd
import numpy as np
import torch.nn as nn
from hopwise.data import create_dataset

### Load Checkpoint

In [None]:
checkpoint_name = 'saved/TransE-Mar-13-2025_16-18-32.pth'

In [None]:
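# Load on CPU; on PyTorch >= 2.6 you may also need weights_only=False (the checkpoint stores the config too).
checkpoint = torch.load(checkpoint_name, map_location='cpu')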

**The Embeddings detected are**

In [None]:
checkpoint['state_dict'].keys()

**Do you want to exclude some embeddings?**

In [None]:
excluded = ['relation_bias_embedding.weight']
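
Why exclude `relation_bias_embedding.weight`? The formatting loop later in this notebook derives each embedding's type from the text before the first underscore of its key, so two keys with the same prefix would be written to the same file. The sketch below illustrates this with a hypothetical helper, `embedding_types`, and a toy key list built from the keys mentioned in this notebook; it is not part of Hopwise.

```python
def embedding_types(keys, excluded=()):
    """Group state-dict keys by the text before their first underscore."""
    by_type = {}
    for key in keys:
        if key in excluded:
            continue
        by_type.setdefault(key.split("_")[0], []).append(key)
    return by_type


# Toy key list (assumption: mirrors the keys named in this notebook).
toy_keys = [
    "user_embedding.weight",
    "entity_embedding.weight",
    "relation_embedding.weight",
    "relation_bias_embedding.weight",
]

all_types = embedding_types(toy_keys)
kept_types = embedding_types(toy_keys, excluded=["relation_bias_embedding.weight"])

# Any type that lists more than one key would have its file overwritten.
print({emb_type: len(names) for emb_type, names in all_types.items()})
print({emb_type: len(names) for emb_type, names in kept_types.items()})
```

Without the exclusion, both relation keys resolve to the type `relation`, so the second one written would overwrite the first.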

**The Dataset detected is**

In [None]:
dataset_name = checkpoint['config']['dataset']
dataset_name

**The Dataset folder detected is**

In [None]:
data_path = checkpoint['config']['data_path']
data_path

**Create the mappings between embedding and original entity/relation/user**

- Users have a 1-to-1 mapping, so they need no mapping beyond the id-to-token lookup.

- We assume that indexing starts at 1 (typically 0 is reserved for [PAD]).
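
The padding assumption can be illustrated with a toy mapping. The tokens below are made up; in the notebook the real mapping comes from `dataset.field2token_id`. Inverting the mapping and dropping id 0 leaves exactly the ids whose embedding rows get exported.

```python
# Toy token -> id mapping (assumption: id 0 is the padding token).
toy_token2id = {"[PAD]": 0, "m.01": 1, "m.02": 2}

# Invert it so that an embedding row index can be turned back into a token.
toy_id2token = {idx: token for token, idx in toy_token2id.items()}

# Only ids from 1 upwards correspond to real entities.
real_ids = [idx for idx in sorted(toy_id2token) if idx != 0]
real_tokens = [toy_id2token[idx] for idx in real_ids]

print(real_ids)
print(real_tokens)
```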

In [None]:
dataset = create_dataset(checkpoint['config'])

In [None]:
dataset.field2token_id['tail_id']

In [None]:
# Create the reverse (id -> token) mappings; `idx` avoids shadowing the built-in `id`
uid2token = {idx: token for token, idx in dataset.field2token_id['user_id'].items()}
print(uid2token)
eid2token = {idx: token for token, idx in dataset.field2token_id['tail_id'].items()}
print(eid2token)
rid2token = {idx: token for token, idx in dataset.field2token_id['relation_id'].items()}
print(rid2token)

In [None]:
# # add dummy relation, check kge code
# rid2token[len(rid2token)] = 'ui_dummy_relation'

In [None]:
assert (len(eid2token.keys()) == checkpoint['state_dict']['entity_embedding.weight'].shape[0])
assert (len(rid2token.keys()) == checkpoint['state_dict']['relation_embedding.weight'].shape[0])

*If an assertion fails, make sure that the KGE model was trained without explicitly adding dummy relations/entities when creating the relation/entity embeddings.*

### Create the new embeddings

In [None]:
def format_embedding(weight, columns, emb_type):
    """Save one embedding matrix as a tab-separated `<dataset>.<emb_type>emb` file."""
    weight = weight.detach().cpu().numpy()

    # Pick the id -> token mapping for this embedding type (None if the type is unknown)
    mapping = {'entity': eid2token, 'relation': rid2token, 'user': uid2token}.get(emb_type)

    # Row 0 is the [PAD] row, so both columns start at index 1
    new_emb_dict = {
        # Original tokens, or the raw ids when no mapping is available
        columns[0]: [mapping[idx] if mapping is not None else idx for idx in range(1, len(weight))],
        # Each embedding as a space-separated float sequence
        columns[1]: [" ".join(f"{x}" for x in row) for row in weight[1:]],
    }

    filename = f'{dataset_name}.{emb_type}emb'
    df = pd.DataFrame(new_emb_dict)
    print(f"[+] Saving the new {dataset_name} {columns[0]} embedding in {data_path}/{filename}!")
    df.to_csv(os.path.join(data_path, filename), sep='\t', index=False)

In [None]:
for emb_name, emb in checkpoint['state_dict'].items():
    if emb_name in excluded:
        continue
    # Which kind of embedding is this (entity, user, relation, item)?
    emb_type = emb_name.split("_")[0]
    # Create the new embedding file columns
    columns = [f'{emb_type}id:token', f'{emb_type}_embedding:float_seq']
    print(f"[+] Formatting {emb_name} with columns {columns}")
    format_embedding(emb, columns, emb_type)
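
To make the resulting file layout concrete, the following sketch writes the same two-column, tab-separated format using only the standard library. `write_embedding_tsv` and the toy tokens and vectors are illustrative assumptions, not part of the notebook or of Hopwise.

```python
import csv
import io


def write_embedding_tsv(handle, id_column, emb_column, id2token, weight):
    """Write a header, then one 'token<TAB>space-separated floats' row per id."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow([id_column, emb_column])
    # Row 0 is assumed to be the [PAD] row, so it is skipped.
    for idx in range(1, len(weight)):
        writer.writerow([id2token[idx], " ".join(str(x) for x in weight[idx])])


# Toy data: three rows, the first one being the padding row.
toy_id2token = {0: "[PAD]", 1: "m.01", 2: "m.02"}
toy_weight = [[0.0, 0.0], [0.1, -0.2], [0.3, 0.4]]

buffer = io.StringIO()
write_embedding_tsv(
    buffer, "entityid:token", "entity_embedding:float_seq", toy_id2token, toy_weight
)
tsv_text = buffer.getvalue()
print(tsv_text)
```

The header carries the `name:type` pairs, and each following line holds one token and its vector; the padding row never reaches the file.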

### Next?

The dataset folder now contains these files:

In [None]:
os.listdir(data_path)

**We want to make sure that the dataset configuration is ok.**

Suppose that the output of the format embedding phase is:

```text
    [+] Formatting user_embedding.weight with columns ['userid:token', 'user_embedding:float_seq']
    [+] Saving the new ml-1m userid:token embedding in /home/recsysdatasets/ml-1m/ml-1m.useremb!
    [+] Formatting entity_embedding.weight with columns ['entityid:token', 'entity_embedding:float_seq']
    [+] Saving the new ml-1m entityid:token embedding in /home/recsysdatasets/ml-1m/ml-1m.entityemb!
    [+] Formatting relation_embedding.weight with columns ['relationid:token', 'relation_embedding:float_seq']
    [+] Saving the new ml-1m relationid:token embedding in /home/recsysdatasets/ml-1m/ml-1m.relationemb!
```

Then open the dataset configuration file (in our case `hopwise/properties/dataset/ml-1m.yaml`) and add the new files to be loaded:


```yaml
additional_feat_suffix: [useremb, entityemb, relationemb]
load_col:
  useremb: [userid, user_embedding]
  entityemb: [entityid, entity_embedding]
  relationemb: [relationid, relation_embedding]

alias_of_user_id: [userid]
alias_of_entity_id: [entityid]
alias_of_relation_id: [relationid]

preload_weight:
  userid: user_embedding
  entityid: entity_embedding
  relationid: relation_embedding
```
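
Before relying on this configuration, it may help to confirm that each file's header matches its `load_col` entry. The sketch below assumes the `name:type` header convention produced by the formatting step; `header_names` is a hypothetical helper, and an in-memory buffer stands in for a real file such as `ml-1m.useremb`.

```python
import io


def header_names(handle, sep="\t"):
    """Return the column names from a 'name:type' header line, dropping the types."""
    header = handle.readline().rstrip("\r\n")
    return [column.partition(":")[0] for column in header.split(sep)]


# In-memory stand-in for the first lines of an embedding file.
toy_file = io.StringIO("userid:token\tuser_embedding:float_seq\nu1\t0.1 0.2\n")

# Names expected under load_col -> useremb in the configuration.
expected = ["userid", "user_embedding"]

names = header_names(toy_file)
print(names == expected)
```

With a real file, pass an open file handle in place of the buffer.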



### The end

Your model code should now be able to access the pretrained embeddings through:

*Torch*
```python
    pretrained_user_emb = dataset.get_preload_weight('userid')
    pretrained_entity_emb = dataset.get_preload_weight('entityid')
    pretrained_relation_emb = dataset.get_preload_weight('relationid')

    # from_pretrained freezes the weights by default; pass freeze=False to fine-tune them.
    # If the arrays are float64, append .float() to match float32 model parameters.
    self.user_embedding = nn.Embedding.from_pretrained(torch.from_numpy(pretrained_user_emb))
    self.entity_embedding = nn.Embedding.from_pretrained(torch.from_numpy(pretrained_entity_emb))
    self.relation_embedding = nn.Embedding.from_pretrained(torch.from_numpy(pretrained_relation_emb))
```

*NumPy*
```python
    self.user_embedding = dataset.get_preload_weight('userid')
    self.entity_embedding = dataset.get_preload_weight('entityid')
    self.relation_embedding = dataset.get_preload_weight('relationid')
```


