# Producing Vector Variants

This notebook contains the pipeline that downloads vectors from KGvec2go, produces embedding variants (including the autoencoded ones), filters entities, and converts the results to the TXT format expected by the evaluation frameworks. Most steps require a large amount of RAM and free disk space, but each step can be run separately. The last cells let you choose which pipeline steps run for which embedding variants.

### Setup before running this notebook for the first time:

1. Clone the binarizer repo (from Tissier et al.) to the folder `binarizer`

```
git clone https://github.com/tca19/near-lossless-binarization binarizer
```

2. You will need a preinstalled [OpenBLAS package](https://github.com/xianyi/OpenBLAS/wiki/Precompiled-installation-packages) and a C compiler to build the binaries with `make`

```
cd binarizer
make
```

> Note: on macOS with an ARM processor, OpenBLAS (installed with Homebrew) and the Xcode SDK may be in different paths than usual, so some changes will be necessary in `makefile` (lines 21 to 30):

```
CC       = gcc
CFLAGS   = -ansi -pedantic -Wall -Wextra -Wno-unused-result -Ofast -funroll-loops
LDLIBS   = -lopenblas -lm
CPPFLAGS = -I/opt/homebrew/opt/openblas/include -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include
LDFLAGS  = -L/opt/homebrew/opt/openblas/lib

all: binarize similarity_binary topk_binary

binarize: binarize.c
	$(CC) binarize.c -o binarize $(CFLAGS) $(LDLIBS) $(CPPFLAGS) $(LDFLAGS)
```

3. Move this notebook into `binarizer`, and copy the pickle files `dlcc_entities.pickle` and `geval_entities.pickle` from `resources` into `binarizer` as well
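
As a quick sanity check after the setup steps above, a small helper along the following lines could confirm that the compiled `binarize` executable and the two pickle files are in the working directory. This is only a sketch: the helper name `missing_setup_files` and the `REQUIRED_FILES` tuple are illustrative and not part of the original notebook.

```python
from pathlib import Path

# Files the notebook expects in its working directory after the setup steps.
# (Illustrative constant, not defined in the original notebook.)
REQUIRED_FILES = ("binarize", "dlcc_entities.pickle", "geval_entities.pickle")


def missing_setup_files(folder="."):
    """Return the required files that are not present in `folder`."""
    base = Path(folder)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]
```

An empty list from `missing_setup_files()` suggests the setup is complete; any names returned point to a step that still has to be done.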

In [1]:
import pickle
from gensim.models import KeyedVectors
import numpy as np

with open('dlcc_entities.pickle', 'rb') as file:
    dlcc_entities = pickle.load(file)

with open('geval_entities.pickle', 'rb') as file:
    geval_entities = pickle.load(file)

all_entities = dlcc_entities.union(geval_entities)
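
The union built above is what later cells use to filter vectors: a token is kept only if, after expanding a `dbr:` prefix, it appears in `all_entities`. The sketch below illustrates that check on toy data. It assumes the pickled sets hold full DBpedia URIs, and `normalize_token` is a hypothetical helper. Unlike the `str.replace` call used in the notebook, it rewrites only the leading prefix.

```python
DBR_PREFIX = "dbr:"
DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def normalize_token(token):
    """Expand a leading `dbr:` prefix to the full DBpedia resource URI."""
    token = str(token)
    if token.startswith(DBR_PREFIX):
        # Rewrite only the leading prefix, not later occurrences in the name.
        return DBPEDIA_RESOURCE + token[len(DBR_PREFIX):]
    return token


# Toy stand-ins for the pickled sets (assumed to contain full URIs).
toy_dlcc = {"http://dbpedia.org/resource/Berlin"}
toy_geval = {"http://dbpedia.org/resource/Paris"}
toy_all = toy_dlcc.union(toy_geval)

tokens = ["dbr:Berlin", "dbr:Rome", "http://dbpedia.org/resource/Paris"]
kept = [t for t in tokens if normalize_token(t) in toy_all]
```

Here `kept` contains the Berlin and Paris tokens, while `dbr:Rome` is dropped because it is in neither toy set.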

In [2]:
embeddings_source_files = {
    "rdf2vec-cbow-200": {
        "source_kv": "https://data.dws.informatik.uni-mannheim.de/kgvec2go/dbpedia/2021-09/classic-rdf2vec-cbow-200/model.kv",
        "source_npy": "https://data.dws.informatik.uni-mannheim.de/kgvec2go/dbpedia/2021-09/classic-rdf2vec-cbow-200/model.kv.vectors.npy",
    },
    "rdf2vec-cbow-oa-200": {
        "source_txt": "https://data.dws.informatik.uni-mannheim.de/kgvec2go/dbpedia/2021-09/classic-rdf2vec-cbow-oa-200/cwindow200_classic.txt", # 8145384
        "no_header": False,
    },
    "rdf2vec-sg-200": {
        "source_kv": "https://data.dws.informatik.uni-mannheim.de/kgvec2go/dbpedia/2021-09/classic-rdf2vec-sg-200/model.kv",
        "source_npy": "https://data.dws.informatik.uni-mannheim.de/kgvec2go/dbpedia/2021-09/classic-rdf2vec-sg-200/model.kv.vectors.npy",
    },
    "rdf2vec-sg-oa-200": {
        "source_txt": "https://data.dws.informatik.uni-mannheim.de/kgvec2go/dbpedia/2021-09/classic-rdf2vec-sg-oa-200/sgpos200_classic.txt", # 8145384
        "no_header": False,
    },
    "non-rdf2vec-ComplEx": {
        "source_txt": "https://data.dws.informatik.uni-mannheim.de/kgvec2go/dbpedia/2021-09/non-rdf2vec/vectors_dbpedia_ComplEx.txt", # 8499982
        "no_header": True,
    },
    "non-rdf2vec-DistMult": {
        "source_txt": "https://data.dws.informatik.uni-mannheim.de/kgvec2go/dbpedia/2021-09/non-rdf2vec/vectors_dbpedia_DistMult.txt", # 8499982
        "no_header": True,
    },
    "non-rdf2vec-RESCAL": {
        "source_txt": "https://data.dws.informatik.uni-mannheim.de/kgvec2go/dbpedia/2021-09/non-rdf2vec/vectors_dbpedia_RESCAL.txt", # 8499982
        "no_header": True,
    },
    "non-rdf2vec-RotatE": {
        "source_txt": "https://data.dws.informatik.uni-mannheim.de/kgvec2go/dbpedia/2021-09/non-rdf2vec/vectors_dbpedia_RotatE.txt", # 8499982
        "no_header": True,
    },
    "non-rdf2vec-TransE-L1": {
        "source_txt": "https://data.dws.informatik.uni-mannheim.de/kgvec2go/dbpedia/2021-09/non-rdf2vec/vectors_dbpedia_TransE-L1.txt", # 8499982
        "no_header": True,
    },
    "non-rdf2vec-TransE-L2": {
        "source_txt": "https://data.dws.informatik.uni-mannheim.de/kgvec2go/dbpedia/2021-09/non-rdf2vec/vectors_dbpedia_TransE-L2.txt", # 8499982
        "no_header": True,
    },
    "non-rdf2vec-TransR": {
        "source_txt": "https://data.dws.informatik.uni-mannheim.de/kgvec2go/dbpedia/2021-09/non-rdf2vec/vectors_dbpedia_TransR.txt", # 8499982
        "no_header": True,
    },
}
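
The `no_header` flag above tells the loader whether a TXT file starts with the word2vec header line (`<vocab_size> <vector_size>`) or goes straight to the vectors. If you are unsure about a file, a heuristic such as the following sketch can inspect its first line. The function name `has_word2vec_header` is illustrative, and the check would misfire on a one-dimensional embedding whose first token is itself a number.

```python
def has_word2vec_header(first_line):
    """Return True if the line looks like '<vocab_size> <vector_size>'."""
    fields = first_line.split()
    # A header has exactly two fields, both non-negative integers.
    return len(fields) == 2 and all(field.isdigit() for field in fields)


# Toy first lines: one header, one data row.
header_like = has_word2vec_header("3 200\n")
data_like = has_word2vec_header("dbr:Berlin 0.12 -0.4 0.33\n")
```

A file whose first line passes this check would correspond to `"no_header": False` in the dictionary above.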

In [3]:
def download_vectors(embedding_name):
    print(f"Downloading {embedding_name} vectors...")
    
    embedding_dict = embeddings_source_files.get(embedding_name)
    
    if "source_txt" in embedding_dict.keys():
        embedding_source = embedding_dict.get("source_txt")
        
        !curl -o {embedding_name}.txt  {embedding_source}
    
    elif "source_kv" in embedding_dict.keys():
        embedding_source_kv = embedding_dict.get("source_kv")
        embedding_source_npy = embedding_dict.get("source_npy")
        
        !curl -o {embedding_name}.kv  {embedding_source_kv}
        !curl -o {embedding_name}.kv.vectors.npy  {embedding_source_npy}
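
To see what `download_vectors` would fetch without starting a large transfer, the mapping from a configuration entry to local filenames can be written as a pure function. This is a sketch under the same naming scheme as the cell above. `planned_downloads` is a hypothetical helper and the `example.org` URLs are placeholders.

```python
def planned_downloads(embedding_name, source_files):
    """Return (url, local_filename) pairs for one embedding entry."""
    entry = source_files[embedding_name]
    if "source_txt" in entry:
        return [(entry["source_txt"], f"{embedding_name}.txt")]
    # KeyedVectors models come as two files: the .kv file and its .npy array.
    return [
        (entry["source_kv"], f"{embedding_name}.kv"),
        (entry["source_npy"], f"{embedding_name}.kv.vectors.npy"),
    ]


# Placeholder configuration for illustration only.
toy_sources = {
    "toy-txt": {"source_txt": "https://example.org/toy.txt", "no_header": True},
    "toy-kv": {
        "source_kv": "https://example.org/model.kv",
        "source_npy": "https://example.org/model.kv.vectors.npy",
    },
}
txt_plan = planned_downloads("toy-txt", toy_sources)
kv_plan = planned_downloads("toy-kv", toy_sources)
```

Each pair could then be handed to `curl` as in the notebook, or to a Python downloader such as `urllib.request.urlretrieve`.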

In [4]:
def produce_binary_variants(embedding_name, variants=(), remove_original_file=True):
    
    # Create the output folder if it does not exist yet.
    !mkdir -p embeddings
        
    embedding_dict = embeddings_source_files.get(embedding_name)
    # TXT sources are in word2vec text format; the others are gensim KeyedVectors files.
    txt_format = "source_txt" in embedding_dict
    
    if txt_format:
        no_header = embedding_dict.get("no_header")
        word_vectors = KeyedVectors.load_word2vec_format(
            f'{embedding_name}.txt',
            no_header=no_header,
            unicode_errors='replace',
        )
    else:
        word_vectors = KeyedVectors.load(
            f'{embedding_name}.kv',
            mmap='r',
        )
    
    if "200-original" in variants:
        with open(f'embeddings/{embedding_name}-200-original.txt', "w") as file:
            for i in range(len(word_vectors.index_to_key)):
                token = str(word_vectors.index_to_key[i])
                if token.startswith('dbr:'):
                    token = token.replace('dbr:', 'http://dbpedia.org/resource/')
                if token in all_entities:
                    vector_string = ' '.join([str(x) for x in word_vectors.vectors[i].tolist()])
                    file.write(f'{token} {vector_string} \n')
                    
        print(f'Written {embedding_name}-200-original.txt in embeddings folder')
    
    if "200-avgbin" in variants:
        avg_embeddings = np.mean(word_vectors.vectors, axis=0)
        bin_model_vectors = np.greater_equal(word_vectors.vectors, avg_embeddings)
        
        with open(f'embeddings/{embedding_name}-avgbin.txt', "w") as file:
            for i in range(len(word_vectors.index_to_key)):
                token = str(word_vectors.index_to_key[i])
                if token.startswith('dbr:'):
                    token = token.replace('dbr:', 'http://dbpedia.org/resource/')
                if token in all_entities:
                    bin_vector_string = ' '.join([str(x) for x in (bin_model_vectors[i]*1).tolist()])
                    file.write(f'{token} {bin_vector_string} \n')
        
        del avg_embeddings
        del bin_model_vectors
        print(f'Written {embedding_name}-avgbin.txt in embeddings folder')
    
    autoencoded_variants = [variant for variant in variants if "autoencoded" in variant]
    
    if len(autoencoded_variants)>0:
        word_vectors.save_word2vec_format(
            f"{embedding_name}-to-binarize.txt", 
            write_header=True,
        )
        print(f'{embedding_name}-to-binarize.txt is ready to be binarized to VEC files')
    
    del word_vectors
    
    if remove_original_file:
        if txt_format:
            !rm {embedding_name}.txt
        else:
            !rm {embedding_name}.kv
            !rm {embedding_name}.kv.vectors.npy
    
    for autoencoded_variant in autoencoded_variants:
        n_bits = int(autoencoded_variant.split("-")[0])
        !./binarize -input {embedding_name}-to-binarize.txt -output {embedding_name}-{n_bits}-full.vec -n-bits {n_bits}
        !mv {embedding_name}-{n_bits}-full.vec embeddings/
        print(f"Binary file is in embeddings/{embedding_name}-{n_bits}-full.vec")
    
    if remove_original_file and len(autoencoded_variants)>0:
        !rm {embedding_name}-to-binarize.txt
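
The `200-avgbin` variant above thresholds every dimension at its mean, computed over all vectors of the model before entity filtering. The following dependency-free sketch reproduces that rule on a tiny matrix so the effect is easy to inspect. `average_binarize` is an illustrative name, and the notebook itself uses `np.mean` and `np.greater_equal` instead.

```python
def average_binarize(rows):
    """Set a bit to 1 where the value is >= the mean of its column."""
    n_rows = len(rows)
    # Column-wise means, the pure-Python analogue of np.mean(..., axis=0).
    means = [sum(column) / n_rows for column in zip(*rows)]
    return [[int(value >= mean) for value, mean in zip(row, means)] for row in rows]


toy_vectors = [
    [0.0, 4.0, -1.0],
    [2.0, 0.0, -3.0],
    [4.0, 2.0, -2.0],
]
# Column means are 2.0, 2.0 and -2.0.
toy_bits = average_binarize(toy_vectors)
# toy_bits == [[0, 1, 1], [1, 0, 0], [1, 1, 1]]
```

Values equal to the mean map to 1, which matches the `greater_equal` comparison in the cell above.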
    

In [5]:
def int_to_bitlist(int_str):
    """Return the bits of an integer string as a list of '0'/'1' characters."""
    # zfill left-pads values shorter than 64 bits with zeros.
    return list(bin(int(int_str))[2:].zfill(64))

def get_full_binary_vector(line):
    """Expand one VEC line (token followed by integers) into 'token b b b ...'."""
    token, *int_strs = line.split()
    bin_vector = []
    for int_str in int_strs:
        bin_vector.extend(int_to_bitlist(int_str))
    return token + ' ' + ' '.join(bin_vector)
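
As a worked example of the decoding above: judging from the 64-bit padding in `int_to_bitlist`, each integer on a VEC line is treated as one 64-bit word written most significant bit first, so a 128-bit vector is stored as two integers. The sketch below uses Python's format specification instead of manual padding. `decode_vec_line` is a hypothetical name and the input line is made up.

```python
def decode_vec_line(line, word_bits=64):
    """Split a VEC line into its token and a string of '0'/'1' characters."""
    token, *words = line.split()
    # format(value, "064b") renders the integer as binary, zero-padded to 64 digits.
    bits = "".join(format(int(word), f"0{word_bits}b") for word in words)
    return token, bits


# 5 is binary 101; 9223372036854775808 is 2**63, i.e. only the top bit set.
token, bits = decode_vec_line("dbr:Berlin 5 9223372036854775808")
```

Here `bits` has 128 characters: the first 64 end in `101` and the second 64 start with a single `1`.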

In [6]:
def convert_and_filter_vec_to_txt(embedding_name, n_bits_list=(128, 256, 512), remove_vec_file=True, full=True):
    full_str = "-full" if full else ""
    for n_bits in n_bits_list:
        print(f'Filtering entities in embeddings/{embedding_name}-{n_bits}{full_str}.vec...')
        with open(f"embeddings/{embedding_name}-{n_bits}{full_str}.vec", "r", encoding="utf-8") as file:
            lines = file.readlines()

        with open(f"embeddings/{embedding_name}-{n_bits}-autoencoded.txt", "w", encoding="utf-8") as file:
            rows_count = 0
            for line in lines[1:]:
                if line.split()[0] in all_entities:
                    full_binary_vector = get_full_binary_vector(line)
                    file.write(f'{full_binary_vector} \n')
                    rows_count += 1

        print(f'Filtered {rows_count} entities')
        print(f'Successfully written file {embedding_name}-{n_bits}-autoencoded.txt in embeddings folder')

        with open(f"embeddings/{embedding_name}-{n_bits}-autoencoded.vec", "w", encoding="utf-8") as file:
            file.write(f"{rows_count} {n_bits} \n")
            for line in lines[1:]:
                if line.split()[0] in all_entities:
                    file.write(line)

        print(f'Successfully written file {embedding_name}-{n_bits}-autoencoded.vec in embeddings folder')
        
        if remove_vec_file:
            print(f'Deleting {embedding_name}-{n_bits}{full_str}.vec')
            !rm embeddings/{embedding_name}-{n_bits}{full_str}.vec
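
The function above loads the whole VEC file with `readlines()`, which is simple but memory-hungry for files with millions of rows. If memory is tight, the filtering step could instead be written as a generator that works line by line, as in the sketch below. `filter_vec_lines` is an illustrative name and not part of the notebook.

```python
def filter_vec_lines(lines, entities):
    """Yield data lines whose first field is in `entities`, skipping the header."""
    iterator = iter(lines)
    next(iterator, None)  # drop the '<count> <n_bits>' header line
    for line in iterator:
        fields = line.split(maxsplit=1)
        if fields and fields[0] in entities:
            yield line


# Toy VEC content: a header followed by three rows.
toy_lines = ["3 128\n", "a 1 2\n", "b 3 4\n", "c 5 6\n"]
kept_lines = list(filter_vec_lines(toy_lines, {"a", "c"}))
```

Because the filtered VEC file starts with the number of kept rows, a streaming version would need either a first pass to count them or a header written after the fact.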

In [7]:
def run_pipeline(pipeline_config):
    # Work on copies so the removals below leave pipeline_config unchanged
    # and the same configuration can be run again.
    download_embeddings = list(pipeline_config.get("download_embeddings", []))
    produce_binary_embeddings = list(pipeline_config.get("produce_binary_embeddings", []))
    produce_variants = list(pipeline_config.get("produce_variants", []))
    convert_filter_vec_embeddings = list(pipeline_config.get("convert_filter_vec_embeddings", []))
    convert_filter_n_bits = list(pipeline_config.get("convert_filter_n_bits", []))
    
    for embedding_name in download_embeddings:
        download_vectors(embedding_name)
        
        if embedding_name in produce_binary_embeddings and len(produce_variants)>0:
            produce_binary_variants(embedding_name, variants=produce_variants)
            produce_binary_embeddings.remove(embedding_name)
            
            if embedding_name in convert_filter_vec_embeddings and len(convert_filter_n_bits)>0:
                convert_and_filter_vec_to_txt(embedding_name, n_bits_list=convert_filter_n_bits)
                convert_filter_vec_embeddings.remove(embedding_name)
    
    for embedding_name in produce_binary_embeddings:
        produce_binary_variants(embedding_name, variants=produce_variants)

        if embedding_name in convert_filter_vec_embeddings and len(convert_filter_n_bits)>0:
            convert_and_filter_vec_to_txt(embedding_name, n_bits_list=convert_filter_n_bits)
            convert_filter_vec_embeddings.remove(embedding_name)
    
    for embedding_name in convert_filter_vec_embeddings:
        convert_and_filter_vec_to_txt(embedding_name, n_bits_list=convert_filter_n_bits)
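
Because `run_pipeline` interleaves three lists, it can be hard to predict the order of the steps. The sketch below mirrors its control flow but only records what would run, so a configuration can be previewed before committing hours of compute. `plan_pipeline` is a hypothetical helper, and the short names `A` to `D` stand in for real embedding names.

```python
def plan_pipeline(config):
    """Return (step, embedding_name) pairs in the order run_pipeline would use."""
    to_produce = list(config.get("produce_binary_embeddings", []))
    to_convert = list(config.get("convert_filter_vec_embeddings", []))
    variants = config.get("produce_variants", [])
    n_bits = config.get("convert_filter_n_bits", [])
    steps = []

    def produce_then_convert(name):
        steps.append(("produce", name))
        if name in to_convert and n_bits:
            steps.append(("convert", name))
            to_convert.remove(name)

    # 1) Downloads, each followed at once by its own produce/convert steps.
    for name in config.get("download_embeddings", []):
        steps.append(("download", name))
        if name in to_produce and variants:
            to_produce.remove(name)
            produce_then_convert(name)
    # 2) Embeddings that only need variants produced (already on disk).
    for name in list(to_produce):
        produce_then_convert(name)
    # 3) Embeddings that only need VEC conversion and filtering.
    for name in to_convert:
        steps.append(("convert", name))
    return steps


toy_config = {
    "download_embeddings": ["A", "B"],
    "produce_binary_embeddings": ["A", "C"],
    "produce_variants": ["128-autoencoded"],
    "convert_filter_vec_embeddings": ["A", "D"],
    "convert_filter_n_bits": [128],
}
toy_plan = plan_pipeline(toy_config)
```

For this toy configuration the plan is: download, produce and convert `A`; download `B`; produce `C`; convert `D`. Processing each embedding fully before downloading the next keeps the peak disk usage lower than downloading everything first.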
                

**Warning:** most steps of this pipeline require a large amount of RAM and free disk space, and may take hours to complete. The complete pipeline downloads more than 200 GB and may take several days.

In [8]:
pipeline_config = {
    "download_embeddings": [
#         'rdf2vec-cbow-200', 
#         'rdf2vec-cbow-oa-200', 
#         'rdf2vec-sg-200', 
#         'rdf2vec-sg-oa-200', 
#         'non-rdf2vec-ComplEx', 
#         'non-rdf2vec-DistMult', 
#         'non-rdf2vec-RESCAL', 
#         'non-rdf2vec-RotatE', 
#         'non-rdf2vec-TransE-L1', 
#         'non-rdf2vec-TransE-L2', 
#         'non-rdf2vec-TransR',
    ],
    "produce_binary_embeddings": [
#         'rdf2vec-cbow-200', 
#         'rdf2vec-cbow-oa-200', 
#         'rdf2vec-sg-200', 
#         'rdf2vec-sg-oa-200', 
#         'non-rdf2vec-ComplEx', 
#         'non-rdf2vec-DistMult', 
#         'non-rdf2vec-RESCAL', 
#         'non-rdf2vec-RotatE', 
#         'non-rdf2vec-TransE-L1', 
#         'non-rdf2vec-TransE-L2', 
#         'non-rdf2vec-TransR',
    ],
    "produce_variants": [
#         "200-original",
#         "200-avgbin",
#         "128-autoencoded",
#         "256-autoencoded",
#         "512-autoencoded",
    ],
    "convert_filter_vec_embeddings": [
#         'rdf2vec-cbow-200', 
#         'rdf2vec-cbow-oa-200', 
#         'rdf2vec-sg-200', 
#         'rdf2vec-sg-oa-200', 
#         'non-rdf2vec-ComplEx', 
#         'non-rdf2vec-DistMult', 
#         'non-rdf2vec-RESCAL', 
#         'non-rdf2vec-RotatE', 
#         'non-rdf2vec-TransE-L1', 
#         'non-rdf2vec-TransE-L2', 
#         'non-rdf2vec-TransR',
    ],
    "convert_filter_n_bits": [
#         128,
#         256,
#         512,
    ],
}
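
Since a typo in the configuration only surfaces once a long-running step fails, it may be worth checking the names before calling `run_pipeline`. The sketch below is one possible check. `check_pipeline_config` is a hypothetical helper, and its variant pattern (`<bits>-autoencoded`) is stricter than the notebook, which accepts any variant name containing `autoencoded`.

```python
import re

EMBEDDING_KEYS = (
    "download_embeddings",
    "produce_binary_embeddings",
    "convert_filter_vec_embeddings",
)
FIXED_VARIANTS = ("200-original", "200-avgbin")


def check_pipeline_config(config, known_embeddings):
    """Return a list of human-readable problems found in `config`."""
    problems = []
    for key in EMBEDDING_KEYS:
        for name in config.get(key, []):
            if name not in known_embeddings:
                problems.append(f"{key}: unknown embedding {name!r}")
    for variant in config.get("produce_variants", []):
        # Autoencoded variants are expected to look like '128-autoencoded'.
        if variant not in FIXED_VARIANTS and not re.fullmatch(r"\d+-autoencoded", variant):
            problems.append(f"produce_variants: unrecognized variant {variant!r}")
    return problems


toy_problems = check_pipeline_config(
    {
        "download_embeddings": ["rdf2vec-sg-200", "rdf2vec-sg-20"],
        "produce_variants": ["128-autoencoded", "200-avgbin", "avgbin"],
    },
    known_embeddings={"rdf2vec-sg-200"},
)
```

In the notebook, `known_embeddings` would be `embeddings_source_files`, and an empty result means no problems were detected.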

In [9]:
run_pipeline(pipeline_config)

To obtain only the autoencoded binary variants from the VEC files stored in the `resources` folder, move them to the `embeddings` folder and call `convert_and_filter_vec_to_txt()` with `full` set to `False`.
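
The `full` flag only changes which input filename the function looks for. The sketch below spells out the naming scheme used by `convert_and_filter_vec_to_txt()`, so you can check that the moved VEC files have the expected names. `vec_paths` is an illustrative helper, not part of the notebook.

```python
def vec_paths(embedding_name, n_bits, full=True):
    """Return the input and output paths used for one embedding and bit size."""
    full_str = "-full" if full else ""
    stem = f"embeddings/{embedding_name}-{n_bits}"
    return {
        "input_vec": f"{stem}{full_str}.vec",
        "output_txt": f"{stem}-autoencoded.txt",
        "output_vec": f"{stem}-autoencoded.vec",
    }


paths = vec_paths("rdf2vec-sg-200", 128, full=False)
# paths["input_vec"] == "embeddings/rdf2vec-sg-200-128.vec"
```

With `full=True` the input would instead be `embeddings/rdf2vec-sg-200-128-full.vec`, the name produced by the binarization step.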

In [10]:
convert_filter_vec_embeddings = [
#     'rdf2vec-cbow-200', 
#     'rdf2vec-cbow-oa-200', 
#     'rdf2vec-sg-200', 
#     'rdf2vec-sg-oa-200', 
#     'non-rdf2vec-ComplEx', 
#     'non-rdf2vec-DistMult', 
#     'non-rdf2vec-RESCAL', 
#     'non-rdf2vec-RotatE', 
#     'non-rdf2vec-TransE-L1', 
#     'non-rdf2vec-TransE-L2', 
#     'non-rdf2vec-TransR',
]
    
n_bits_list = [
#     128, 
#     256, 
#     512,
]
    
for embedding_name in convert_filter_vec_embeddings:
    convert_and_filter_vec_to_txt(embedding_name, n_bits_list=n_bits_list, remove_vec_file=True, full=False)