
code details #6

Open
niartnelis opened this issue Sep 18, 2019 · 18 comments

@niartnelis

I didn't get the data in pkl format, so I want to ask about some details in the code:
1. dataloader: `for idx, inp, pos, dep_fw, dep_bw, ans_ne, wgt_ne, ans_rel, wgt_rel in ld_ts` — what are the meanings of these variables?
2. Is pos precomputed for each sentence? Is it part-of-speech tagging?
3. The adjacency matrix in the GCN should capture the dependencies between related words. Is this the syntactic dependency obtained with spaCy?

@tsujuifu (Owner) commented Sep 18, 2019

Hi, thanks for your interest in my work.

Sorry for the missing preprocessing code and data. (I left my previous lab and forgot to back it up.)
I will reproduce and update it later. (Since I have just started my PhD and things are quite busy these days, probably after CVPR or ACL.)

Let me try to clarify your questions here:

1. `for idx, inp, pos, dep_fw, dep_bw, ans_ne, wgt_ne, ans_rel, wgt_rel in ld_ts`
   (BS: batch_size, SL: sentence_length, ED: embedding_dimension)
   - idx: not important, you can just ignore it
   - inp: the input sentence ([BS, SL, ED])
   - pos: the part-of-speech tag of each word ([BS, SL])
   - dep_fw: the dependency adjacency matrix (forward edges) of each word pair ([BS, SL, SL])
   - dep_bw: the dependency adjacency matrix (backward edges) of each word pair ([BS, SL, SL])
   - ans_ne, ans_rel: the output named-entity tag of each word and the relation of each word pair ([BS, SL] and [BS, SL, SL])
   - wgt_ne, wgt_rel: the loss weights for named entities and relations, 1 for positions that contain a named entity or relation, otherwise 0 ([BS, SL], [BS, SL, SL])
2. pos is the part-of-speech tag of each word, and it comes from spaCy.
3. The dependency parsing also comes from spaCy (please note that the dependency tree yields both forward and backward edges).
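
For illustration, here is a minimal sketch of how forward/backward dependency adjacency matrices could be built with spaCy. This is not the repository's original preprocessing; the model name `en_core_web_sm` and the head→dependent / dependent→head edge convention are assumptions.

```python
# Hypothetical sketch: build binary forward/backward dependency adjacency
# matrices for one sentence with spaCy. Not the repository's original code.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this English model is installed

def dep_adjacency(sentence):
    doc = nlp(sentence)
    n = len(doc)
    dep_fw = [[0] * n for _ in range(n)]  # forward: head -> dependent
    dep_bw = [[0] * n for _ in range(n)]  # backward: dependent -> head
    for token in doc:
        if token.i == token.head.i:  # the root points to itself; skip it
            continue
        dep_fw[token.head.i][token.i] = 1
        dep_bw[token.i][token.head.i] = 1
    return dep_fw, dep_bw

fw, bw = dep_adjacency("GraphRel extracts entities and relations jointly.")
print(len(fw), len(bw))  # e.g. 7 7, depending on tokenization
```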

@niartnelis (Author)

Thank you for your reply!

@niartnelis (Author) commented Sep 19, 2019 via email

@akashicMarga

@niartnelis did you figure out the data preprocessing part and the dataset format?

@LuoXukun

Thank you very much! I think I can achieve it now.

@LuoXukun

Can you tell me what -1 means in wgt_ne and wgt_rel?

@zhhhzhang

> Can you tell me what -1 means in wgt_ne and wgt_rel?

Hi, could you share the preprocessing code and data? Thanks!

@zhhhzhang

Hi, have you reproduced the preprocessing code and data yet? Could you share them? Thanks!

@zhhhzhang commented Apr 5, 2020 via email

@nttmac commented Apr 8, 2020

I'm sorry that I haven't yet evaluated whether my reproduction is correct; I will share it with you after I finish.


Hi, have you reproduced it? I have a question: in the second phase there are num_rel GCNs, and summing over them makes the values very large. Won't training diverge?

@zhhhzhang commented Apr 8, 2020 via email

@LuoXukun commented Apr 18, 2020

I have reproduced a dataset in JSON format with the following code, which is not guaranteed to be the same as the author's implementation.

```python
# Preprocessing methods for the NYT dataset. They are written as methods of a
# larger class that provides self.input_path and self.output_path.
# Requires spaCy (with the English model) plus the standard-library modules below.
import json
import os
import unicodedata

import spacy


def tag_graphrel(self, relation_id_path, isTrain=True):
    """Tag the source JSON of NYT for GraphRel.

    Args:
        isTrain: True for the training set, False for the test set; currently unused.
        relation_id_path: path of the relation-to-id dict file.

    Returns:
        Writes one JSON object per line to self.output_path:
        [{
            text:             [seq_length]              the word list of the sentence
            pos:              [seq_length]              the part-of-speech tag of each word
            dep_fw:           [seq_length, seq_length]  the dependency adjacency matrix (forward edges) of each word pair
            dep_bw:           [seq_length, seq_length]  the dependency adjacency matrix (backward edges) of each word pair
            ans_ne:           [seq_length]              the output named-entity tag of each word
            ans_rel:          [seq_length, seq_length]  the output relation of each word pair
            relationMentions: the gold relational triples, [{"label", "label_id", "em1Text", "em2Text"}]
        }]
        (wgt_ne / wgt_rel are not generated; see note (1) below.)
    """
    # spaCy model (spaCy 2.x style shortcut for the English model)
    nlp = spacy.load('en')

    datas = []
    EpochCount = 0
    relation_ids = {"PAD": 0, "None": 1}  # 0 for padding and 1 for the None relation

    # Load the data from the source JSON (one object per line).
    with open(self.input_path, "r", encoding="utf-8") as fr:
        for line in fr.readlines():
            datas.append(json.loads(line))

    # Get the relation-to-id dict.
    if relation_id_path is None:
        print("Please provide the relation_id file path!")
        exit()
    if os.path.exists(relation_id_path):
        print("The relation_id file already exists, let's use it!")
        with open(relation_id_path, mode="r", encoding="utf-8") as f:
            for line_id, line in enumerate(f):
                relation_ids = json.loads(line)
    else:
        print("There is no relation_id file, let's build it from the dataset!")
        for data in datas:
            for relation in data["relationMentions"]:
                if self.normalize_text(relation["label"]) != "None":
                    if relation["label"] not in relation_ids:
                        relation_ids[relation["label"]] = len(relation_ids)
        with open(relation_id_path, mode="w", encoding="utf-8") as f:
            f.write(json.dumps(relation_ids, ensure_ascii=False) + "\n")

    print("The number of relations: ", len(relation_ids))
    print("Relations to id: ", relation_ids)

    fw = open(self.output_path, "w+", encoding="utf-8")

    for data in datas:
        EpochCount += 1
        text_tag = {}

        sentText = self.normalize_text(data["sentText"]).rstrip('\n').rstrip('\r')
        sentDoc = nlp(sentText)
        # text: [seq_length]. The word list of the sentence.
        sentWords = [token.text for token in sentDoc]
        text_tag["text"] = sentWords

        text_tag["pos"] = []
        text_tag["dep_fw"] = [[-1] * len(sentWords) for i in range(len(sentWords))]
        text_tag["dep_bw"] = [[-1] * len(sentWords) for i in range(len(sentWords))]
        for token in sentDoc:
            # pos: [seq_length]. The part-of-speech tag of each word (spaCy integer id).
            text_tag["pos"].append(token.pos)
            # dep_fw / dep_bw: [seq_length, seq_length]. The dependency adjacency
            # matrices (forward / backward edges) of each word pair (spaCy integer ids).
            if token.i >= token.head.i:
                text_tag["dep_fw"][token.i][token.head.i] = token.dep
            else:
                text_tag["dep_bw"][token.i][token.head.i] = token.dep

        # ans_ne: [seq_length]. The output named-entity tag of each word (BIOES scheme).
        text_tag["ans_ne"] = ["O"] * len(sentWords)
        for entity in data["entityMentions"]:
            entity_doc = nlp(self.normalize_text(entity["text"]))
            entity_list = [token.text for token in entity_doc]
            entity_idxs = self.find_all_index(sentWords, entity_list)
            for index in entity_idxs:
                if index[1] - index[0] == 1:
                    text_tag["ans_ne"][index[0]] = "S-" + entity["label"]
                elif index[1] - index[0] == 2:
                    text_tag["ans_ne"][index[0]] = "B-" + entity["label"]
                    text_tag["ans_ne"][index[1] - 1] = "E-" + entity["label"]
                elif index[1] - index[0] > 2:
                    for i in range(index[0], index[1]):
                        text_tag["ans_ne"][i] = "I-" + entity["label"]
                    text_tag["ans_ne"][index[0]] = "B-" + entity["label"]
                    text_tag["ans_ne"][index[1] - 1] = "E-" + entity["label"]

        # ans_rel: [seq_length, seq_length]. The output relation of each word pair.
        # relationMentions: the gold relational triples.
        text_tag["ans_rel"] = [[1] * len(sentWords) for i in range(len(sentWords))]
        text_tag["relationMentions"] = []
        for relation in data["relationMentions"]:
            entity1_list = [token.text for token in nlp(self.normalize_text(relation["em1Text"]))]
            entity2_list = [token.text for token in nlp(self.normalize_text(relation["em2Text"]))]
            entity1_idxs = self.find_all_index(sentWords, entity1_list)
            entity2_idxs = self.find_all_index(sentWords, entity2_list)

            for en1_idx in entity1_idxs:
                for en2_idx in entity2_idxs:
                    for i in range(en1_idx[0], en1_idx[1]):
                        for j in range(en2_idx[0], en2_idx[1]):
                            text_tag["ans_rel"][i][j] = relation_ids[relation["label"]]

            if self.normalize_text(relation["label"]) != "None":
                relation_item = {}
                relation_item["label"] = relation["label"]
                relation_item["label_id"] = relation_ids[relation["label"]]
                relation_item["em1Text"] = entity1_list
                relation_item["em2Text"] = entity2_list
                text_tag["relationMentions"].append(relation_item)

        if EpochCount % 10000 == 0:
            print("Epoch ", EpochCount)

        fw.write(json.dumps(text_tag, ensure_ascii=False) + '\n')

    fw.close()
    print("Successfully transferred the file!\n")
    return


def normalize_text(self, text):
    """Normalize a unicode string to plain ASCII."""
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')


def find_all_index(self, sen_split, word_split):
    """Find all locations of an entity in the sentence.

    Args:
        sen_split: the sentence token list.
        word_split: the entity token list.

    Returns:
        index_list: the list of (start, end) index pairs.
    """
    offset = 0
    index_list = []
    while True:
        if len(index_list) != 0:
            offset = index_list[-1][1]
        start, end = self.find_index(sen_split[offset:], word_split)
        if start == -1 and end == -1:
            break
        if end <= start:
            break
        index_list.append((start + offset, end + offset))
    return index_list


def find_index(self, sen_split, word_split):
    """Find the first location of an entity in the sentence.

    Args:
        sen_split: the sentence token list.
        word_split: the entity token list.

    Returns:
        (start, end) indices of the match, or (-1, -1) if not found.
    """
    for i in range(len(sen_split) - len(word_split) + 1):
        if sen_split[i:i + len(word_split)] == word_split:
            return i, i + len(word_split)
    return -1, -1
```
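
For reference, a minimal usage sketch, assuming the four methods above are placed in a wrapper class; the class name `NYTTagger`, its constructor, and the file names are hypothetical and not part of the original comment:

```python
# Hypothetical wrapper; the class name, constructor, and file names are
# illustrative only. The four functions above are attached as methods.
class NYTTagger:
    def __init__(self, input_path, output_path):
        self.input_path = input_path    # original NYT JSON (one object per line)
        self.output_path = output_path  # tagged output JSON

    tag_graphrel = tag_graphrel
    normalize_text = normalize_text
    find_all_index = find_all_index
    find_index = find_index

tagger = NYTTagger("nyt_train.json", "nyt_train_tagged.json")
tagger.tag_graphrel("relation2id.json")
```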

Note that:
(1) I do not generate wgt_ne and wgt_rel, since you can achieve the same effect through the `weight` argument of the loss function nn.CrossEntropyLoss(); please check the documentation yourself.
(2) You need to remap dep_fw and dep_bw, because spaCy generates some very large values; map them to small integers starting from 0 before the training step (see the sketch after these notes).
(3) self.input_path is the original NYT dataset; self.output_path is your output file path.
(4) You should install the libraries you need.
(5) The original NYT dataset is available here.
(6) Give this a thumbs-up if you find it useful.
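
As a follow-up to note (2), here is a minimal sketch of one reasonable way to remap spaCy's large integer IDs to small contiguous ones; the function name, the reserved "no edge" id, and the surrounding collection step are my own assumptions, not part of LuoXukun's code:

```python
# Hypothetical sketch for note (2): remap spaCy's large integer IDs
# (e.g. the token.dep hashes stored in dep_fw / dep_bw) to small contiguous
# ids starting from 0. The reserved entry and function name are assumptions.
def build_id_map(all_values, reserved=("NONE",)):
    """all_values: every raw ID observed over the whole dataset."""
    mapping = {name: i for i, name in enumerate(reserved)}  # e.g. 0 for "no edge"
    for value in sorted(set(all_values)):
        mapping[value] = len(mapping)
    return mapping

# Usage: collect the raw values while (or after) running tag_graphrel, then
# dep_map = build_id_map(all_dep_ids)
# remapped = dep_map.get(raw_value, dep_map["NONE"])  # -1 / missing -> "no edge"
```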

@niuweicai


Could you send the complete preprocessing code? I can see these methods belong to a class. My email is 605851710@qq.com. Thank you very much!

@rsanshierli


Could you share the complete code with me? I ran into bugs when using your code and couldn't figure them out. If possible, please send it to 470294527@qq.com. Thank you.

@JerAex commented Jul 7, 2020


Could you send the complete preprocessing code? My email is 1979453046@qq.com. Thank you very much!

@LuoXukun

This is all of my preprocessing code. I have verified that it runs; if it doesn't work for you, please debug it yourself.

@niuweicai commented Feb 11, 2022 via email

@fancy999

Do you still need it?


Yes, please.
