The goal of this file is to create an embedding for each relation.

Steps to achieve this:

1.   Download the entity mapping.
2.   Download the relation mapping.
3.   Convert each relation (including inverse relations) to text.
4.   Remove duplicate relations.
5.   Extract embeddings for each relation with a pre-trained fastText model.



In [1]:
import pandas as pd
import os
from tqdm import tqdm
import fasttext
import numpy as np
import json
import logging
import wget
from sh import gunzip

In [2]:
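data_dir = 'data'
mappings_path = os.path.join(data_dir, 'mappings')
embeddings_path = os.path.join(data_dir, 'embeddings')
# exist_ok=True lets this cell be re-run without raising FileExistsError;
# makedirs also creates data_dir itself as a parent directory.
os.makedirs(mappings_path, exist_ok=True)
os.makedirs(embeddings_path, exist_ok=True)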

Wikidata5m follows the identifier system used in Wikidata. Each entity and relation is identified by a unique ID. Entities are prefixed by Q, while relations are prefixed by P.
Each entity and relation also has one or more aliases (human-readable names).
The mapping files that link IDs to aliases are downloaded in the cells below.
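
The parsing cell later in this notebook reads the first two tab-separated fields of each line as the ID and its name. A minimal sketch of that split, assuming each line holds an ID followed by one or more tab-separated aliases (the helper `parse_alias_line` and the sample line are hypothetical, not taken from the real files):

```python
def parse_alias_line(line):
    """Split one tab-separated mapping line into (id, list of aliases)."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], fields[1:]

# Hypothetical sample line, not taken from the real mapping file.
sample = "P0000\texample relation\texample alias\n"
identifier, aliases = parse_alias_line(sample)

# Relations are prefixed by P, entities by Q.
kind = "relation" if identifier.startswith("P") else "entity"
print(identifier, kind, aliases[0])
```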

# ***Download entity mapping***

In [3]:
!gdown 'https://drive.google.com/uc?id=1x9PS3wac1Yq9sDw2zfl8Jso1NSJ05LkU' -O "{os.path.join(mappings_path, 'wd5m_aliases_entities_v3.txt')}"

Downloading...
From (original): https://drive.google.com/uc?id=1x9PS3wac1Yq9sDw2zfl8Jso1NSJ05LkU
From (redirected): https://drive.google.com/uc?id=1x9PS3wac1Yq9sDw2zfl8Jso1NSJ05LkU&confirm=t&uuid=0f4b5df6-9112-49c3-aa76-2f2a20dda734
To: /vol/hd1/users/ahmadi/sadaf/GraphNeighborLM/Better-together/data-preparation/data/mappings/wd5m_aliases_entities_v3.txt
100%|█████████████████████████████████████████| 147M/147M [03:12<00:00, 767kB/s]


# ***Download relation mapping***

In [4]:
!gdown 'https://drive.google.com/uc?id=1tI924TQ1eW4EmIKEpOsHI3TH9ZCdJLDL' -O "{os.path.join(mappings_path, 'wd5m_aliases_relations_v3.txt')}"

Downloading...
From: https://drive.google.com/uc?id=1tI924TQ1eW4EmIKEpOsHI3TH9ZCdJLDL
To: /vol/hd1/users/ahmadi/sadaf/GraphNeighborLM/Better-together/data-preparation/data/mappings/wd5m_aliases_relations_v3.txt
100%|███████████████████████████████████████| 18.5k/18.5k [00:00<00:00, 124kB/s]


# ***Download a pre-trained word embedding model***

In [None]:
wget.download("https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz", out=embeddings_path)

In [2]:
import requests
from tqdm import tqdm

url = "https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz"
output_path = "cc.en.300.bin.gz"

# Send the HTTP request
response = requests.get(url, stream=True)
response.raise_for_status()

# Total file size from the Content-Length header
total_size = int(response.headers.get('Content-Length', 0))

# Show a progress bar with tqdm
with open(output_path, "wb") as f, tqdm(
    desc="Downloading",
    total=total_size,
    unit="B",
    unit_scale=True,
    unit_divisor=1024,
) as progress_bar:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)
        progress_bar.update(len(chunk))

print("Download complete!")


Downloading: 100%|██████████| 4.19G/4.19G [1:44:17<00:00, 720kB/s]  

Download complete!





Unzip the downloaded file.

In [6]:
# embeddings_path = os.path.join(embeddings_path, 'cc.en.300.bin.gz')
gunzip("/home/ahmadi/sadaf/GraphNeighborLM/Better-together/data-preparation/cc.en.300.bin.gz")

''
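
`gunzip` from `sh` relies on the system binary being available. Where it is not, the standard library can do the same job. The sketch below is one possible alternative, demonstrated on a small temporary file rather than the real model; the helper name `gunzip_file` is hypothetical.

```python
import gzip
import os
import shutil
import tempfile

def gunzip_file(src_path, dst_path):
    """Decompress a .gz file to dst_path, streaming so the file never sits fully in memory."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

# Self-contained demonstration on a small temporary file.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "demo.bin.gz")
out_path = os.path.join(tmp_dir, "demo.bin")
with gzip.open(gz_path, "wb") as f:
    f.write(b"placeholder bytes")

gunzip_file(gz_path, out_path)
with open(out_path, "rb") as f:
    restored = f.read()
```

Unlike the `gunzip` command, this sketch leaves the compressed file in place.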

In [5]:
relation_mapping_path = "/home/ahmadi/sadaf/GraphNeighborLM/Better-together/data-preparation/data/mappings/wd5m_aliases_relations_v3.txt"
relation_mapping = {}
with open(relation_mapping_path, 'r') as f:
    for line in tqdm(f):
        # Each line is tab-separated: the ID first, then its aliases.
        fields = line.strip().split('\t')
        id_, name = fields[0], fields[1]
        relation_mapping[id_] = name
        # Register the inverse relation with an "inverse of " prefix.
        relation_mapping["inverse of " + id_] = "inverse of " + name

relation_mapping

828it [00:00, 329777.20it/s]


{'P1389': 'product certification',
 'inverse of P1389': 'inverse of product certification',
 'P885': 'origin of the watercourse',
 'inverse of P885': 'inverse of origin of the watercourse',
 'P2563': 'superhuman feature or ability',
 'inverse of P2563': 'inverse of superhuman feature or ability',
 'P658': 'tracklist',
 'inverse of P658': 'inverse of tracklist',
 'P2098': 'substitute/deputy/replacement of office/officeholder',
 'inverse of P2098': 'inverse of substitute/deputy/replacement of office/officeholder',
 'P2378': 'issued by',
 'inverse of P2378': 'inverse of issued by',
 'P924': 'possible treatment',
 'inverse of P924': 'inverse of possible treatment',
 'P447': 'review score by',
 'inverse of P447': 'inverse of review score by',
 'P3093': 'recovered by',
 'inverse of P3093': 'inverse of recovered by',
 'P624': 'guidance system',
 'inverse of P624': 'inverse of guidance system',
 'P2974': 'habitat',
 'inverse of P2974': 'inverse of habitat',
 'P4387': 'update method',
 'inverse

In [8]:
# Save the relation-to-text dictionary as JSON
with open('/home/ahmadi/sadaf/GraphNeighborLM/Better-together/data-preparation/data/relation2text-wikidata5m.json', 'w') as f:
    json.dump(relation_mapping, f)

In [6]:
print("The number of all the relations", len(relation_mapping))

The number of all the relations 1656


In [7]:
relations = list(set(relation_mapping.values()))  # remove duplicate relations
print("The number of unique relations", len(relations))
print()
print("---------------------")
relations

The number of unique relations 1656

---------------------


['inverse of indigenous to',
 'inverse of intangible cultural heritage status',
 'inverse of mushroom ecological type',
 'natural product of taxon',
 'mandates',
 'inverse of currency',
 'Vox-ATypI classification',
 'compressor type',
 'inverse of notable work',
 'inverse of color',
 'mountain range',
 'inverse of place served by transport hub',
 'inverse of category combines topics',
 'endemic to',
 'inverse of has grammatical mood',
 'inverse of office held by head of state',
 'edibility',
 'inverse of location of discovery',
 'inverse of fossil found in this unit',
 'inverse of activating neurotransmitter',
 'inverse of GUI toolkit or framework',
 'flag bearer',
 'inverse of followed by',
 'inverse of highway system',
 'inverse of file format',
 'inverse of colonel-in-chief',
 'inverse of armament',
 'located on terrain feature',
 'inverse of author of foreword',
 'has active ingredient',
 'inverse of donated by',
 'inverse of crosses',
 'discoverer or inventor',
 'house publication

In [11]:
len(relations)

1656
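
Because `relations` is built from a `set`, its order can differ between Python sessions (string hashing is randomized by default), while both the index file and the embedding matrix depend on that order. A minimal sketch of a stable ordering, using a toy mapping with illustrative values in place of `relation_mapping`:

```python
# Toy mapping standing in for relation_mapping (IDs are illustrative).
toy_mapping = {
    "P1": "habitat",
    "inverse of P1": "inverse of habitat",
    "P2": "issued by",
    "inverse of P2": "inverse of issued by",
}

# sorted() yields the same order in every session, unlike list(set(...)).
stable_relations = sorted(set(toy_mapping.values()))
print(stable_relations)
```

Sorting is only one option; any fixed order works as long as the same order is used for the index and for the embeddings.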

In [12]:
relation2index = {}
for i, rel in enumerate(relations):
    relation2index[rel] = i
    relation2index["inverse of " + rel] = i

In [13]:
relation2index

{'inverse of sourcing circumstances': 766,
 'inverse of inverse of sourcing circumstances': 0,
 'scheduled service destination': 1,
 'inverse of scheduled service destination': 133,
 'instrumentation': 2,
 'inverse of instrumentation': 1272,
 'original language of film or TV show': 3,
 'inverse of original language of film or TV show': 1555,
 'inverse of bowling style': 378,
 'inverse of inverse of bowling style': 4,
 'elected in': 5,
 'inverse of elected in': 1345,
 'inverse of compressor type': 855,
 'inverse of inverse of compressor type': 6,
 'inverse of exclave of': 153,
 'inverse of inverse of exclave of': 7,
 'inverse of transmitted signal type': 1167,
 'inverse of inverse of transmitted signal type': 8,
 'inverse of military casualty classification': 154,
 'inverse of inverse of military casualty classification': 9,
 'inverse of muscle insertion': 1653,
 'inverse of inverse of muscle insertion': 10,
 'inverse of biological process': 1534,
 'inverse of inverse of biological proc
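
The output above contains keys such as 'inverse of inverse of sourcing circumstances'. This happens because `relations` already includes the inverse names and the loop prefixes every entry again. A forward name and its inverse can then end up sharing one index. If the intent is one distinct index per name (an assumption, since the notebook does not state it), a minimal sketch on toy data could look like this:

```python
# Toy list standing in for `relations`; inverse names are already present.
toy_relations = ["inverse of habitat", "habitat", "issued by", "inverse of issued by"]

# One index per name, with no extra prefixing.
toy_relation2index = {rel: i for i, rel in enumerate(toy_relations)}

print(toy_relation2index)
```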

In [14]:
with open('/home/ahmadi/sadaf/GraphNeighborLM/Better-together/data-preparation/data/relation2ind-wikidata5m.json', 'w') as f:
    json.dump(relation2index, f)

In [2]:
model_en = fasttext.load_model('/home/ahmadi/sadaf/GraphNeighborLM/Better-together/data-preparation/cc.en.300.bin')



In [3]:
model_en

<fasttext.FastText._FastText at 0x7f0b6ac3dd80>

The `get_sentence_vector` method generates a single embedding for the entire phrase by averaging the word vectors of its tokens, each divided by its L2 norm.
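
To illustrate the aggregation, the sketch below averages L2-normalized word vectors with NumPy. It is a simplified stand-in, not fastText's implementation: the 3-dimensional `toy_vectors` table and the helper `average_sentence_vector` are hypothetical, and details such as subword handling and the end-of-sentence token are ignored.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors; the real model uses 300 dimensions.
toy_vectors = {
    "issued": np.array([3.0, 0.0, 0.0]),
    "by": np.array([0.0, 4.0, 0.0]),
}

def average_sentence_vector(text, vectors):
    """Average the unit-normalized vectors of whitespace-separated tokens."""
    dim = len(next(iter(vectors.values())))
    normalized = []
    for token in text.split():
        vec = vectors[token]
        norm = np.linalg.norm(vec)
        if norm > 0:  # skip zero vectors to avoid division by zero
            normalized.append(vec / norm)
    if not normalized:
        return np.zeros(dim)
    return np.mean(normalized, axis=0)

sentence_vec = average_sentence_vector("issued by", toy_vectors)
print(sentence_vec)
```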

In [8]:
fasttext_emb = [model_en.get_sentence_vector(rel) for rel in relations]
fasttext_emb

[array([-1.05102062e-02, -2.70554740e-02,  1.70062925e-03,  6.13854267e-03,
         1.06705129e-02, -2.19286140e-02,  8.71353783e-03,  1.09264757e-02,
        -7.21521070e-03,  1.15398085e-02, -1.51267601e-02,  1.83980782e-02,
         3.12181916e-02, -9.59837157e-03, -3.59944031e-02, -4.13522087e-02,
        -1.31014856e-02, -5.15746512e-03,  9.17572901e-03,  5.55152260e-02,
         7.20560132e-03,  2.69850437e-03, -1.97294378e-03, -7.26816244e-04,
        -2.10663639e-02, -3.21214162e-02,  4.79836308e-06,  3.94241251e-02,
        -3.56484279e-02,  8.74919146e-02,  1.08097550e-02, -5.02974726e-03,
         4.24352586e-02,  1.57935880e-02,  5.24038859e-02, -1.33527433e-02,
         2.92124972e-03,  1.75013915e-02, -1.48594258e-02,  9.57308710e-03,
         2.49637607e-02, -1.67066343e-02,  1.59187540e-02, -3.44990455e-02,
        -1.97994690e-02, -4.99870591e-02,  2.72792391e-02,  1.19005954e-02,
        -1.62985474e-02,  3.40876281e-02,  3.14947218e-04, -4.11576554e-02,
        -1.3

In [9]:
len(fasttext_emb)

1656

In [10]:
len(fasttext_emb[0])  # each relation has a 300-dimensional embedding vector

300

In [12]:
np.save("/home/ahmadi/sadaf/GraphNeighborLM/Better-together/data-preparation/data/embeddings/fasttext_vecs-wikidata5m.npy", fasttext_emb)
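
Downstream code would presumably load the saved matrix and look up rows by position. A minimal sketch with toy data and a temporary file (the names `toy_names` and `toy_emb` are hypothetical, and it assumes row i of the matrix corresponds to the i-th relation name):

```python
import os
import tempfile

import numpy as np

toy_names = ["habitat", "issued by"]
toy_emb = [
    np.array([0.1, 0.2, 0.3], dtype=np.float32),
    np.array([0.4, 0.5, 0.6], dtype=np.float32),
]

path = os.path.join(tempfile.mkdtemp(), "toy_vecs.npy")
np.save(path, toy_emb)  # a list of equal-length arrays is stored as one 2-D array

loaded = np.load(path)
row = loaded[toy_names.index("issued by")]
print(loaded.shape, row)
```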