## TransE Model Implementation

In TransE.py, the model's `_calc` and `loss` functions were filled in on top of the provided code, as follows:

```python
    def _calc(self, h, t, r):
        if self.norm_flag:  # normalize embeddings with the L2 norm
            h = F.normalize(h, p=2, dim=-1)
            t = F.normalize(t, p=2, dim=-1)
            r = F.normalize(r, p=2, dim=-1)
        score = h + r - t  # translation residual used by the scoring function
        score = torch.norm(score, self.p_norm, dim=-1)  # distance under the p-norm
        return score
```
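
As a quick sanity check, the scoring rule can be exercised outside the model. The sketch below is a standalone function that mirrors `_calc`; the name `transe_score` and the toy vectors are illustrative assumptions, not part of TransE.py. A triple that satisfies h + r = t exactly should get distance 0, and a corrupted tail should get a larger distance.

```python
import torch
import torch.nn.functional as F

def transe_score(h, r, t, p_norm=1, norm_flag=False):
    """Distance ||h + r - t||_p along the last dimension (smaller = more plausible)."""
    if norm_flag:
        h = F.normalize(h, p=2, dim=-1)
        r = F.normalize(r, p=2, dim=-1)
        t = F.normalize(t, p=2, dim=-1)
    return torch.norm(h + r - t, p_norm, dim=-1)

# Toy embeddings, invented for illustration
h = torch.tensor([[1.0, 0.0, 2.0]])
r = torch.tensor([[0.5, 1.0, -1.0]])
t_true = h + r                               # satisfies h + r = t exactly
t_corrupt = torch.tensor([[0.0, 0.0, 0.0]])  # arbitrary wrong tail

good = transe_score(h, r, t_true)
bad = transe_score(h, r, t_corrupt)
print(good.item(), bad.item())
```

Normalization is switched off here so that the exact match is easy to see; with `norm_flag=True` the three vectors are rescaled independently and the residual is generally no longer zero.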


```python

    def loss(self, pos_score, neg_score):
        # Use torch.nn.MarginRankingLoss, defined as
        #   loss(x1, x2, y) = max(0, -y * (x1 - x2) + margin)
        # With y = 1, x1 = neg_score and x2 = pos_score this becomes
        #   loss = max(0, pos_score - neg_score + margin)
        # reduction='sum' replaces the deprecated size_average=False
        loss_func = nn.MarginRankingLoss(margin=self.margin.item(), reduction='sum')
        y = torch.Tensor([1])
        return loss_func(neg_score, pos_score, y)
```
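
The argument order is easy to get wrong, so it may help to confirm the identity numerically. The following sketch compares `nn.MarginRankingLoss` (called with the negative scores first and y = 1) against a hand-written hinge; the score values are invented for illustration.

```python
import torch
import torch.nn as nn

margin = 1.0
pos_score = torch.tensor([0.5, 2.0, 1.0])  # distances of true triples (made up)
neg_score = torch.tensor([3.0, 2.5, 0.2])  # distances of corrupted triples (made up)

# Library version: x1 = neg_score, x2 = pos_score, y = 1
loss_func = nn.MarginRankingLoss(margin=margin, reduction='sum')
library_loss = loss_func(neg_score, pos_score, torch.ones_like(pos_score))

# Hand-written hinge: sum of max(0, pos - neg + margin) over all pairs
manual_loss = torch.clamp(pos_score - neg_score + margin, min=0).sum()

print(library_loss.item(), manual_loss.item())
```

Only the second and third pairs contribute (0.5 and 1.8); the first negative is already more than one margin away from its positive, so its term is clamped to zero.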


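## Applying TransE

### Predicting the third element of a triple from the other two

The model is trained first. All hyperparameters keep the default values defined in the `Config` class in TransE.py: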

```python
class Config(object):    
    def __init__(self):
        self.p_norm = 1
        self.hidden_size = 50
        self.nbatches = 100
        self.entity = 0
        self.relation = 0
        self.trainTimes = 100
        self.margin = 1.0
        self.learningRate = 0.01
        self.use_gpu = False
```
Training ran for 100 epochs; the loss at selected epochs was:

Epoch 1 | loss: 225758.579712 

Epoch 20 | loss: 15363.347290 

Epoch 30 | loss: 9086.585068 

Epoch 50 | loss: 4878.229809  

Epoch 70 | loss: 3384.202034 

Epoch 100 | loss: 2373.714460


The loss decreases smoothly. From about epoch 90 it begins to fluctuate and level off, settling at roughly 2400.

In [204]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import os
import time
import numpy as np
from TransE import TransE
from TransE import Config
from load_data import PyTorchTrainDataLoader

In [205]:
# Initialize the model
config = Config()
train_dataloader = PyTorchTrainDataLoader(
                            in_path = "./data/", 
                            nbatches = config.nbatches,
                            threads = 8)
    
transe = TransE(
            ent_tot = train_dataloader.get_ent_tot(),
            rel_tot = train_dataloader.get_rel_tot(),
            dim = config.hidden_size, #50
            p_norm = config.p_norm, 
            norm_flag = True,
            margin=config.margin)

In [206]:
# Build dictionaries mapping embedding indices to Wikidata IDs
ent_dic = {}
rel_dic = {}
with open('./data/entity2id.txt', 'r') as f:
    next(f)  # skip the first line of the file
    for index in range(train_dataloader.get_ent_tot()):
        value, key = f.readline().strip().split()
        ent_dic[int(key)] = value
with open('./data/relation2id.txt', 'r') as f:
    next(f)  # skip the first line of the file
    for index in range(train_dataloader.get_rel_tot()):
        value, key = f.readline().strip().split()
        rel_dic[int(key)] = value

In [207]:
# Load the trained embedding parameters
ent_data = np.loadtxt('entity2vec.txt')
rel_data = np.loadtxt('relation2vec.txt')
ent_data = torch.Tensor(ent_data)
rel_data = torch.Tensor(rel_data)
transe.ent_embeddings = transe.ent_embeddings.from_pretrained(ent_data)
transe.rel_embeddings = transe.rel_embeddings.from_pretrained(rel_data)
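
A note on the loading step: `nn.Embedding.from_pretrained` is a classmethod, so calling it on an existing layer builds a new layer from the given matrix rather than modifying the old one. By default it uses `freeze=True`, which suits inference-only use like this notebook. A minimal sketch with a toy weight matrix (the values are invented):

```python
import torch
import torch.nn as nn

weights = torch.tensor([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])  # toy 3 x 2 matrix

frozen = nn.Embedding.from_pretrained(weights)                   # freeze=True by default
trainable = nn.Embedding.from_pretrained(weights, freeze=False)  # weights can be updated

print(frozen.weight.requires_grad, trainable.weight.requires_grad)
print(frozen(torch.LongTensor([2])))  # row 2 of the matrix
```

If the loaded embeddings were to be fine-tuned further, `freeze=False` would be the variant to use.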

In [208]:
transe.ent_embeddings.weight

Parameter containing:
tensor([[-0.2084, -0.5272, -0.0708,  ..., -0.2376,  0.5788,  0.0743],
        [ 0.0119, -0.1622, -0.0521,  ..., -0.1707, -0.3355, -0.5015],
        [-1.6516,  2.0057,  2.1716,  ..., -8.9639,  1.5062, -2.2987],
        ...,
        [ 0.2418, -0.2356,  0.1741,  ...,  0.0579, -0.1919,  0.0439],
        [ 0.0388,  0.1526,  0.1149,  ...,  0.2277,  0.2655, -0.0768],
        [-0.0456,  0.0462,  0.3552,  ...,  0.2218,  0.4954,  0.1430]])

In [209]:
transe.rel_embeddings.weight

Parameter containing:
tensor([[-0.1695,  0.4049,  0.7886,  ...,  0.3512,  0.5796,  0.1826],
        [ 0.4283, -0.0791,  0.5712,  ...,  1.0693, -0.6851, -0.6008],
        [ 0.9707, -0.3461,  0.8030,  ...,  0.7767, -0.7109,  0.9902],
        ...,
        [ 0.1265, -0.0305,  0.1099,  ...,  0.0170, -0.1378, -0.0112],
        [ 0.0812,  0.0130, -0.0776,  ...,  0.0495,  0.0462,  0.0362],
        [ 0.0162,  0.0444, -0.1756,  ...,  0.0398,  0.1665, -0.1241]])

In [210]:
# Predict the tail entities closest to Q30 + P36
data = {'batch_h':torch.LongTensor([list(ent_dic.keys())[list(ent_dic.values()).index('Q30')]]),
        'batch_r':torch.LongTensor([list(rel_dic.keys())[list(rel_dic.values()).index('P36')]]),
        'batch_t':torch.LongTensor([i for i in range(train_dataloader.get_ent_tot())])}
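
The expression `list(ent_dic.keys())[list(ent_dic.values()).index('Q30')]` rebuilds two lists and scans them on every lookup. One possible alternative is to invert the dictionary once and reuse it. The sketch below uses a toy stand-in for `ent_dic`; the indices and the names `ent_dic_toy` and `ent_index` are made up for illustration.

```python
# Toy stand-in for ent_dic (embedding index -> Wikidata ID); the indices are invented
ent_dic_toy = {0: 'Q30', 1: 'Q61', 2: 'Q49'}

# Invert once: Wikidata ID -> embedding index
ent_index = {wikidata_id: idx for idx, wikidata_id in ent_dic_toy.items()}

# Same result as the list-based lookup, but with a single dictionary access
slow = list(ent_dic_toy.keys())[list(ent_dic_toy.values()).index('Q61')]
fast = ent_index['Q61']
print(slow, fast)
```

This assumes every Wikidata ID appears only once in the mapping, which is what an ID-to-index file implies.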

In [221]:
score = transe.predict(data)
for index in score.argsort()[0:10]:
    print(ent_dic[index])

ID   |           Name        | Description
------| ------------------- | -------------
Q30 | United States of America | sovereign state in North America
Q1204  |  <font color='red'> Illinois </font>  | state of the United States of America
Q1221 | <font color='red'> Idaho </font> | state of the United States of America
Q23556 | Atlanta | city in DeKalb and Fulton counties in Georgia, United States
Q61  | <font color='red'> Washington, D.C. </font> | capital city of the United States
Q1345 | Philadelphia | largest city in Pennsylvania, United States
Q5925 | Orange County | county in California, United States
Q60 | New York City | largest city in the United States
Q1010232 | Doylestown | borough and county seat of Bucks County, Pennsylvania in the United States
Q494192 | Bucks County | county in Pennsylvania, United States

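The table above lists the ten tail entities the model ranks as most likely when the head entity is Q30 and the relation is P36. The ideal answer, Washington, D.C., is ranked fifth, and every other entry is a place in the United States, which suggests the predictions are broadly reasonable. However, many of the remaining entries are city-level rather than the expected state-level places; more training data would presumably improve the result.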

In [224]:
score_WashingtonDC = score[list(ent_dic.keys())[list(ent_dic.values()).index('Q61')]]
score_WashingtonDC

5.01134

In [220]:
per = (score_WashingtonDC - score.min())/(score.max()-score.min())
per

0.061476763

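As shown above, Q61 (Washington, D.C.) is the expected tail entity for Q30 and P36. Its predicted score is 5.01, close to the lowest score of 4.61, which belongs to Q30. The lowest score marks the most likely candidate, because the score is the distance of h + r - t and a smaller distance means a higher likelihood. As a rough estimate, if the scores were spread evenly between the minimum and the maximum, the ideal result Q61 would sit at about the 6.1% position of that range.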
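
The min-max position is only a proxy for the ranking: it matches the true percentile only when the scores really are evenly spread. The sketch below contrasts the two measures on made-up distances; the helper names `minmax_position` and `rank_of` are hypothetical and not part of the notebook.

```python
def minmax_position(scores, target_index):
    """Position of one score within [min, max], as used for `per` above."""
    lo, hi = min(scores), max(scores)
    return (scores[target_index] - lo) / (hi - lo)

def rank_of(scores, target_index):
    """1-based rank when sorted ascending (smaller distance = better)."""
    return 1 + sum(s < scores[target_index] for s in scores)

# Made-up distances for six candidate tails; index 2 plays the role of Q61
toy_scores = [4.0, 9.0, 5.0, 12.0, 11.5, 10.0]
print(minmax_position(toy_scores, 2))  # 0.125
print(rank_of(toy_scores, 2))          # 2
```

In this toy case the candidate sits at 12.5% of the score range but is ranked second of six, so reporting the rank alongside the min-max position gives a fuller picture.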

In [225]:
data_1 = {'batch_h':torch.LongTensor([list(ent_dic.keys())[list(ent_dic.values()).index('Q30')]]),
        'batch_t':torch.LongTensor([list(ent_dic.keys())[list(ent_dic.values()).index('Q49')]]),
        'batch_r':torch.LongTensor([i for i in range(train_dataloader.get_rel_tot())])}
score_1 = transe.predict(data_1)

In [230]:
for index in score_1.argsort()[0:10]:
    print(rel_dic[index])

P30
P175
P2853
P35
P1962
P40
P138
P150
P1066
P461


ID   |           Name        | Description
------| ------------------- | -------------
P30 |  <font color='red'> continent </font>  | continent of which the subject is a part
P175  | performer  | actor, musician, band or other performer associated with this role or musical work
P2853 | electrical plug type | standard plug type for mains electricity in a country
P35 | head of state | official with the highest formal authority in a country/state
P1962  | --- | --
P40 | child | subject has object as child. Do not use for stepchildren
P138 | named after | entity or event that inspired the subject's name, or namesake (in at least one language)
P150 | contains administrative territorial entity | (list of) direct subdivisions of an administrative territorial entity
P1066 | student of | person who has taught this person
P461 | opposite of| item that is the opposite of this item

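The table above lists the ten relations the model ranks as most likely when the head entity is Q30 (United States of America) and the tail entity is Q49 (North America).
This time the ideal result, *'continent'*, is also the top-ranked one. Perhaps because relations are less obviously correlated with one another, some highly ranked relations such as *'performer'* and *'electrical plug type'* have no apparent sensible connection to the ideal result. Relations such as *'student of'* and *'child'*, however, express a subordinate relationship between head and tail entity and so bear some similarity to it, while relations such as *'contains administrative territorial entity'* and *'head of state'* are clearly similar.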

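## Tuning model hyperparameters

To observe how the model changes, p_norm and margin are adjusted in the `Config` class in TransE.py. The trained embeddings are saved to `entity2vec_pnorm` / `relation2vec_pnorm` and `entity2vec_margin` / `relation2vec_margin`, respectively.

#### Varying p_norm

With the hyperparameter p_norm set to 2, the model was trained for 100 epochs; the loss at selected epochs was: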


Epoch 1 | loss: 201758.006714  

Epoch 20 | loss: 37889.272583  

Epoch 30 | loss: 36315.447113 

Epoch 50 | loss: 34995.471100  

Epoch 70 | loss: 34456.129486 

Epoch 100 | loss: 33535.526093


Compared with p_norm = 1, the loss at epoch 100 is much larger and the loss decreases more slowly.

Its prediction quality is checked below.

In [231]:
transe_p = TransE(
            ent_tot = train_dataloader.get_ent_tot(),
            rel_tot = train_dataloader.get_rel_tot(),
            dim = config.hidden_size, #50
            p_norm = 2, 
            norm_flag = True,
            margin=config.margin)
ent_data_p = np.loadtxt('entity2vec_pnorm.txt')
rel_data_p = np.loadtxt('relation2vec_pnorm.txt')
ent_data_p = torch.Tensor(ent_data_p)
rel_data_p = torch.Tensor(rel_data_p)
transe_p.ent_embeddings = transe_p.ent_embeddings.from_pretrained(ent_data_p)
transe_p.rel_embeddings = transe_p.rel_embeddings.from_pretrained(rel_data_p)

In [232]:
transe_p.ent_embeddings.weight

Parameter containing:
tensor([[-1.2788e-01,  9.1089e-02, -1.7634e-02,  ...,  3.7515e-01,
         -1.6136e-01,  2.5071e-01],
        [ 3.5123e-02, -2.3281e-02, -1.4830e-01,  ..., -1.1870e-01,
          3.4253e-02, -3.1207e-02],
        [-1.3124e-01, -4.5984e-01,  5.6523e-01,  ...,  7.9198e-02,
         -6.4722e-01,  3.1638e-01],
        ...,
        [ 3.8566e-02, -5.0540e-03,  1.2878e-02,  ...,  1.9400e-04,
         -7.9931e-02, -1.4378e-01],
        [-4.6861e-02,  4.5408e-02, -4.5080e-02,  ..., -9.5583e-02,
         -5.0964e-02, -6.4660e-03],
        [-7.1060e-03,  4.9039e-02,  3.9668e-02,  ...,  3.5273e-02,
         -6.9708e-02, -1.0988e-01]])

In [233]:
transe_p.rel_embeddings.weight

Parameter containing:
tensor([[-0.1374,  0.1423, -0.1763,  ..., -0.0977,  0.2418,  0.1493],
        [ 0.2064, -0.0599, -0.2287,  ...,  0.1218,  0.2965, -0.3699],
        [-0.6295, -0.2186, -0.1002,  ..., -0.2876,  0.4727, -0.0400],
        ...,
        [ 0.0519, -0.1104, -0.0037,  ...,  0.0595, -0.0837,  0.0802],
        [ 0.0144, -0.0369, -0.0788,  ...,  0.0605,  0.0639,  0.0936],
        [-0.0100,  0.0288,  0.0759,  ...,  0.0362,  0.0991,  0.0254]])

In [234]:
score_p = transe_p.predict(data)

In [235]:
for index in score_p.argsort()[0:10]:
    print(ent_dic[index])

Q49201
Q1190590
Q28198
Q462799
Q159288
Q184116
Q840668
Q28260
Q49174
Q462177


ID   |           Name        | Description
------| ------------------- | -------------
Q49201 | Portland | county seat of Cumberland County, Maine, United States
Q1190590  | Encino | neighborhood in the San Fernando Valley region of Los Angeles, California
Q28198 | Jackson  | capital and largest city of Mississippi
Q462799 | Evanston | suburban city in Cook County, Illinois, United States
Q159288  | Santa Barbara | city in California, United States
Q184116 | Gary | city in Lake County, Indiana, United States
Q840668 | La Jolla  | neighborhood in San Diego, California, United States
Q28260 | Lincoln | city in and state capital of Nebraska, United States
Q49174 | Bridgeport | county seat city of Fairfield County, Connecticut, United States
Q462177 | White Plains | county seat of Westchester County, New York, United States

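After changing p_norm, the ten closest tail entities are again mostly place names or otherwise related to the United States. The problem is the same as before, with city-level entities making up an even larger share, and the ideal result does not appear among the ten closest entities.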

In [236]:
score_WashingtonDC_p = score_p[list(ent_dic.keys())[list(ent_dic.values()).index('Q61')]]
score_WashingtonDC_p

0.6319797

In [237]:
per_p = (score_WashingtonDC_p-score_p.min())/(score_p.max()-score_p.min())
per_p

0.21097884

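For the ideal result, Washington, D.C. (Q61), the prediction is markedly worse than with p_norm = 1: if the scores were spread evenly between the minimum and the maximum, it would sit at about the 21.1% position of that range.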
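
The raw scores of the two models (5.01 with p_norm = 1 versus 0.63 with p_norm = 2) are not on the same scale, which is why the comparison uses the normalized position instead. For any vector the L2 norm is never larger than the L1 norm, as the following sketch on a made-up residual shows.

```python
import torch

residual = torch.tensor([0.3, -0.4, 0.0, 1.2])  # made-up h + r - t

l1 = torch.norm(residual, 1, dim=-1)  # sum of absolute values
l2 = torch.norm(residual, 2, dim=-1)  # Euclidean length
print(l1.item(), l2.item())
```

Here the same residual has an L1 distance of about 1.9 and an L2 distance of about 1.3, so a smaller raw score under p_norm = 2 does not by itself indicate a better fit.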

In [238]:
# Predict the most likely relations between Q30 and Q49 with the adjusted model
score_p_1 = transe_p.predict(data_1)
for index in score_p_1.argsort()[0:10]:
    print(rel_dic[index])

P30
P1884
P1399
P1589
P206
P1056
P452
P101
P512
P172


ID   |           Name        | Description
------| ------------------- | -------------
P30 | <font color='red'> continent </font> | continent of which the subject is a part
P1884  | hair color | person's hair color
P1399 | convicted of | crime a person was convicted of
P1589 | lowest point | point with lowest elevation in the country, region, city or area
P206  | located in or next to body of water | sea, lake or river
P1056 | product or material produced | material or product produced by a government agency, business, industry, facility, or process
P452 | industry | specific industry of company or organization
P101 | field of work | specialization of a person or organization
P512 | academic degree | academic degree that the person holds
P172 | ethnic group | subject's ethnicity 

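With the adjusted p_norm, the model still predicts the expected result well: the top-ranked of the ten most likely relations is again *'continent'*. The remaining entries, however, are semantically somewhat less relevant than those of the original model, and most are hard to justify.

#### Varying margin

The original model uses the default margin of 1.0. Here the margin is set to 0.5 and to 2.0 to observe and analyze how the model changes; the results are saved to `entity2vec_margin1` / `relation2vec_margin1` and `entity2vec_margin2` / `relation2vec_margin2`, respectively.

With margin = 0.5, the loss at selected epochs over 100 epochs of training was: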

Epoch 1 | loss: 144220.670654

Epoch 20 | loss: 9628.716835
 
Epoch 30 | loss: 5772.977055

Epoch 50 | loss: 3057.383905
 
Epoch 70 | loss: 2090.676629

Epoch 100 | loss: 1368.878580

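Compared with the original model, the final loss is lower and shows no sign yet of oscillating around a stable value, so it would probably keep decreasing with more epochs.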

In [239]:
transe_m1 = TransE(
            ent_tot = train_dataloader.get_ent_tot(),
            rel_tot = train_dataloader.get_rel_tot(),
            dim = config.hidden_size, #50
            p_norm = 1, 
            norm_flag = True,
            margin= 0.5)
ent_data_m1 = np.loadtxt('entity2vec_margin1.txt')
rel_data_m1 = np.loadtxt('relation2vec_margin1.txt')
ent_data_m1 = torch.Tensor(ent_data_m1)
rel_data_m1 = torch.Tensor(rel_data_m1)
transe_m1.ent_embeddings = transe_m1.ent_embeddings.from_pretrained(ent_data_m1)
transe_m1.rel_embeddings = transe_m1.rel_embeddings.from_pretrained(rel_data_m1)

In [240]:
score_m1 = transe_m1.predict(data)
for index in score_m1.argsort()[0:10]:
    print(ent_dic[index])

Q30
Q61
Q15142
Q49201
Q333886
Q724
Q769446
Q60
Q65
Q47164


ID   |           Name        | Description
------| ------------------- | -------------
Q30 | United States of America | sovereign state in North America
Q61  |  <font color='red'> Washington, D.C. </font> | capital city of the United States
Q15142 | University of Massachusetts Amherst | public university in Massachusetts, USA
Q49201 | Portland | county seat of Cumberland County, Maine, United States
Q333886  | Georgetown University | private university in Washington, D.C., United States
Q724 |   <font color='red'> Maine </font> | state of the United States of America
Q769446 | Van Nuys   | district in Los Angeles, California
Q60 | New York City | largest city in the United States
Q65 | Los Angeles | county seat of Los Angeles County, California; second largest city in the United States by population
Q47164 | Santa Monica | beachfront city in Los Angeles County, California, United States

In [244]:
score_WashingtonDC_m1 = score_m1[list(ent_dic.keys())[list(ent_dic.values()).index('Q61')]]
score_WashingtonDC_m1

5.0541167

In [245]:
per_m1 = (score_WashingtonDC_m1-score_m1.min())/(score_m1.max()-score_m1.min())
per_m1

0.12651896

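Setting aside Q30, which is the head entity itself, the model predicts the expected result very well: Q61 is ranked first among the remaining candidates. One small difference from the earlier models is that two universities, which are less plausible candidates, appear among the top entries.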
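
Excluding the head entity can also be done in code instead of by eye. The sketch below shows one way to take the k best candidates while skipping given indices; `top_k_excluding` is a hypothetical helper and the distances are invented.

```python
def top_k_excluding(scores, k, excluded):
    """Indices of the k smallest scores, skipping the excluded indices."""
    candidates = [i for i in range(len(scores)) if i not in excluded]
    return sorted(candidates, key=lambda i: scores[i])[:k]

# Made-up distances; index 0 plays the role of the head entity Q30
toy_scores = [4.6, 5.0, 7.2, 5.4, 9.1]
print(top_k_excluding(toy_scores, 3, excluded={0}))  # [1, 3, 2]
```

The same idea extends to skipping tails already known from the training triples, if such a filtered ranking is wanted.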

In [246]:
score_m1_1 = transe_m1.predict(data_1)
for index in score_m1_1.argsort()[0:10]:
    print(rel_dic[index])

P30
P37
P1387
P1589
P38
P186
P552
P6
P1891
P1313


ID   |           Name        | Description
------| ------------------- | -------------
P30 | <font color='red'> continent </font> | continent of which the subject is a part
P37  | official language  | language designated as official by this item
P1387 | political alignment | political position within the political spectrum
P1589 | lowest point | point with lowest elevation in the country, region, city or area
P38  | currency | currency used by item
P186 | material used | material the subject is made of or derived from
P552 | handedness | handedness of the person
P6 | head of government | head of the executive power of this town, city, municipality, state, country, or other governmental body
P1891 | signatory | person, country, or organization that has signed an official document
P1313 | office held by head of government| political office that is fulfilled by the head of the government of this item

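Apart from correctly predicting the ideal result P30, the model's other candidate relations seem somewhat worse than those of the original model.

With margin = 2.0, the loss at selected epochs over 100 epochs of training was: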

Epoch 1 | loss: 440228.158203

Epoch 20 | loss: 35354.082520

Epoch 30 | loss: 20993.740387

Epoch 50 | loss: 11657.115448

Epoch 70 | loss: 8485.964310

Epoch 100 | loss: 6375.767132


In [248]:
transe_m2 = TransE(
            ent_tot = train_dataloader.get_ent_tot(),
            rel_tot = train_dataloader.get_rel_tot(),
            dim = config.hidden_size, #50
            p_norm = 1, 
            norm_flag = True,
            margin= 2)
ent_data_m2 = np.loadtxt('entity2vec_margin2.txt')
rel_data_m2 = np.loadtxt('relation2vec_margin2.txt')
ent_data_m2 = torch.Tensor(ent_data_m2)
rel_data_m2 = torch.Tensor(rel_data_m2)
transe_m2.ent_embeddings = transe_m2.ent_embeddings.from_pretrained(ent_data_m2)
transe_m2.rel_embeddings = transe_m2.rel_embeddings.from_pretrained(rel_data_m2)

In [249]:
score_m2 = transe_m2.predict(data)
for index in score_m2.argsort()[0:10]:
    print(ent_dic[index])

Q61
Q60
Q34006
Q30
Q65
Q1588
Q1408
Q1649
Q1439
Q1211


ID   |           Name        | Description
------| ------------------- | -------------
Q61  |  <font color='red'> Washington, D.C. </font>| capital city of the United States
Q60 | New York City | largest city in the United States
Q34006  | Hollywood | district in Los Angeles, California, United States
Q30 | United States of America | sovereign state in North America
Q65 | Los Angeles | county seat of Los Angeles County, California; second largest city in the United States by population
Q1588 |  <font color='red'> Louisiana </font> | state in the southern United States
Q1408 |  <font color='red'> New Jersey </font> | state of the United States of America
Q1649 |  <font color='red'> Oklahoma </font>| state of the United States of America
Q1439 |  <font color='red'> Texas </font> | state in the southern United States
Q1211 |   <font color='red'> South Dakota </font> | state of the United States of America

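Compared with the original model (margin = 1) and the margin = 0.5 model, the margin = 2 model gives clearly better results: its top-ranked prediction is the ideal result Q61, and five of its ten most likely entities are state-level, an improvement over the earlier models, whose lists contained many city-level entities.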

In [252]:
score_m2_1 = transe_m2.predict(data_1)
for index in score_m2_1.argsort()[0:10]:
    print(rel_dic[index])

P30
P1303
P123
P2554
P734
P737
P172
P37
P150
P361


ID   |           Name        | Description
------| ------------------- | -------------
P30 | <font color='red'> continent </font> | continent of which the subject is a part
P1303 | instrument | musical instrument that a person plays or teaches or used in a music occupation
P123 | publisher | organization or person responsible for publishing books, periodicals, printed music, podcasts, games or software
P2554  | production designer | production designer(s) of this motion picture, play, video game or similar
P734 | family name | part of full name of person
P737 | influenced by | this person, idea, etc. is informed by that other person, idea
P172 | ethnic group | subject's ethnicity 
P37  | official language  | language designated as official by this item
P150 | contains administrative territorial entity | (list of) direct subdivisions of an administrative territorial entity
P361 | part of | object of which the subject is a part 

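Relation prediction differs little from the margin = 1 model: besides the ideal result P30, the semantically reasonable results are P150 and P361.

The margin experiments suggest that, within a certain range, a larger margin gives better results. Thinking further about the theoretical role of the margin leads to the following interpretation: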

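Starting from the loss, $\mathrm{Loss}=\max \{ 0,\ d\left( h+r,t\right) -d\left( h'+r,t'\right)+\mathrm{margin} \}$, each pair of a positive and a negative example falls into one of three cases:

1. $d\left( h+r,t\right) +\mathrm{margin} \leq d\left( h'+r,t'\right)$ **The negative example's distance is at least the positive example's distance plus the margin, so this negative has little influence on the model. The loss is 0 and the pair does not contribute to the parameter update.**
2. $d\left( h+r,t\right)  > d\left( h'+r,t'\right)$ **The negative example's distance is smaller than the positive example's, i.e. the residual $h'+r-t'$ of the negative is closer to the origin than that of the positive. The loss is positive and the pair contributes to the parameter update.**
3. $d\left( h+r,t\right) +\mathrm{margin} > d\left( h'+r,t'\right) \geq d\left( h+r,t\right)$ **The negative example's distance is at least the positive example's but smaller than the positive example's plus the margin, so the margin forces this negative to be taken into account. The loss is positive and the pair contributes to the parameter update.**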

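From this analysis, a larger margin makes the model stricter to some extent: more negative examples have to be considered, so the loss is correspondingly larger. A smaller margin makes the model more lenient: fewer negative examples are considered and the loss is smaller. The change in loss is also visible in the experimental results.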
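
The three cases can be made concrete with a small sketch. It holds the distances fixed and only varies the margin, which is a simplification (in real training the embeddings adapt to the margin); `hinge_case` is a hypothetical helper and all distances are invented.

```python
def hinge_case(d_pos, d_neg, margin):
    """Return (case label, loss) for one positive/negative pair."""
    loss = max(0.0, d_pos - d_neg + margin)
    if d_neg >= d_pos + margin:
        return "inactive", loss       # case 1: loss is 0
    if d_neg < d_pos:
        return "misranked", loss      # case 2: negative is closer than the positive
    return "within margin", loss      # case 3: ranked correctly but too close

# One positive (distance 1.0) against three negatives, all values made up
d_pos = 1.0
negatives = [0.4, 1.8, 2.5]
for margin in (0.5, 1.0, 2.0):
    results = [hinge_case(d_pos, d_neg, margin) for d_neg in negatives]
    active = sum(loss > 0 for _, loss in results)
    total = sum(loss for _, loss in results)
    print(margin, active, round(total, 2))
```

With these toy numbers, one, two and then all three negatives contribute as the margin grows from 0.5 to 1.0 to 2.0, and the summed loss grows accordingly. This matches the trend in the training logs, where a larger margin goes with a larger loss.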

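## Summary

In this part of the experiment:

1. The missing model code was filled in to implement TransE.
2. The TransE model was used to predict some triples from Wikidata, and a number of its predictions were analyzed.
3. Using the predictions from item 2 as the evaluation criterion, the effect of changing the hyperparameters p_norm and margin was tested and analyzed.

What I learned:

1. How to implement the TransE model
2. How the margin loss works
3. Better familiarity with several PyTorch packages
4. A deeper understanding of embeddings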