Experiments on English datasets such as WikiEvents #51

xxllp · 2022-09-01T07:48:09Z

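I'm planning to experiment with an English dataset. Have you obtained results on WikiEvents? I ask because the pretrained model names in scripts all appear to be Chinese models.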

Spico197 · 2022-09-01T07:50:26Z

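Support for the WikiEvents dataset was added later, but the hyperparameters have not been tuned. Just switch the pretrained model to an English one.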

https://github.com/Spico197/DocEE/blob/main/scripts/run_ptpcg_wikievents_wTgg.sh

xxllp · 2022-09-01T07:52:33Z

I switched to an English model, but data loading raised an error:

   inlcude_complementary_ents=self.include_complementary_ents_flag,
  File "/data/xxl/DocEE/dee/helper/dee.py", line 143, in __init__
    annguid, mspan, str(sent_mrange), sent_text
ValueError: GUID: scenario_en_kairos_14 span range is not correct, span=Prayuth Chan - ocha, range=(11, 15), sent=['[UNK]', 's', '[UNK]', 'o', 'f', '[UNK]', 'e', 'a', 'r', 'l', 'y', '[UNK]', '[UNK]', 'u', 'e', 's', 'd', 'a', 'y', '[UNK]', 't', 'h', 'e', 'r', 'e', '[UNK]', 'w', 'a', 's', '[UNK]', 'n', 'o', '[UNK]', 'c', 'l', 'a', 'i', 'm', '[UNK]', 'o', 'f', '[UNK]', 'r', 'e', 's', 'p', 'o', 'n', 's', 'i', 'b', 'i', 'l', 'i', 't', 'y', '[UNK]', '.', '[UNK]', '[UNK]', 'r', 'a', 'y', 'u', 't', 'h', '[UNK]', '[UNK]', 'h', 'a', 'n', '[UNK]', '-', '[UNK]', 'o', 'c', 'h', 'a', '[UNK]', ',', '[UNK]', 't', 'h', 'e', '[UNK]', 'h', 'e', 'a', 'd', '[UNK]', 'o', 'f', '[UNK]', '[UNK]', 'h', 'a', 'i', 'l', 'a', 'n', 'd', '[UNK]', '’', '[UNK]', 's', '[UNK]', 'm', 'i', 'l', 'i', 't', 'a', 'r', 'y', '[UNK]', 'g', 'o', 'v', 'e', 'r', 'n', 'm', 'e', 'n', 't', '[UNK]', ',', '[UNK]', 's', 'a', 'i', 'd', '[UNK]', 't', 'h', 'a', 't', '[UNK]', 't', 'h', 'e', '[UNK]', 'a', 'u', 't', 'h', 'o', 'r', 'i', 't', 'i', 'e', 's', '[UNK]', 'w', 'e', 'r', 'e', '[UNK]', 's', 'e', 'a', 'r', 'c', 'h', 'i', 'n', 'g', '[UNK]', 'f', 'o', 'r', '[UNK]', 'a', '[UNK]', 'p', 'e', 'r', 's', 'o', 'n', '[UNK]', 's', 'e', 'e', 'n', '[UNK]', 'o', 'n', '[UNK]', 'c', 'l', 'o', 's', 'e', 'd', '[UNK]', '-', '[UNK]', 'c', 'i', 'r', 'c', 'u', 'i', 't', '[UNK]', 'f', 'o', 'o', 't', 'a', 'g', 'e', '[UNK]', 'b', 'u', 't', '[UNK]', 't', 'h', 'a', 't', '[UNK]', 'i', 't', '[UNK]', 'w', 'a', 's', '[UNK]', 'n', 'o', 't', '[UNK]', 'c', 'l', 'e', 'a', 'r', '[UNK]', 'w', 'h', 'o', '[UNK]', 't', 'h', 'e', '[UNK]', 'p', 'e', 'r', 's', 'o', 'n', '[UNK]', 'w', 'a', 's', '[UNK]', ',', '[UNK]', 'n', 'e', 'w', 's', '[UNK]', 'a', 'g', 'e', 'n', 'c', 'i', 'e', 's', '[UNK]', 'r', 'e', 'p', 'o', 'r', 't', 'e', 'd', '[UNK]', '.']

xxllp · 2022-09-01T07:53:12Z

It looks like the words are being split into individual letters.

Spico197 · 2022-09-01T07:57:40Z

When run_mode is wikievents_w_tgg, doc_lang is en, and tokenization splits on whitespace by default. Is your dee package at version 0.3.2?

DocEE/dee/utils.py

Lines 144 to 157 in d6b585e

    
        elif self.doc_lang == "en":
            self.dee_tokenize = self.dee_space_tokenize

    def dee_space_tokenize(self, text):
        """perform space tokenization"""
        tokens = text.split()
        out_tokens = []
        for token in tokens:
            if token in self.vocab:
                out_tokens.append(token)
            else:
                out_tokens.append(self.unk_token)
        return out_tokens

xxllp · 2022-09-01T08:00:26Z

I downloaded the code from GitHub, so the version is correct.

xxllp · 2022-09-01T08:04:21Z

Could there be a problem somewhere in the WikiEvents data processing script?

Spico197 · 2022-09-01T08:11:58Z

It ran fine in my offline tests. If possible, please give me a bit more information, or try debugging it locally.

xxllp · 2022-09-01T08:27:03Z

The error message is above.
For this English sentence, tokenizer.dee_tokenize returns the following. Is this normal?
['[UNK]', 's', '[UNK]', 'o', 'f', '[UNK]', 'e', 'a', 'r', 'l', 'y', '[UNK]', '[UNK]', 'u', 'e', 's', 'd', 'a', 'y', '[UNK]', 't', 'h', 'e', 'r', 'e', '[UNK]', 'w', 'a', 's', '[UNK]', 'n', 'o', '[UNK]', 'c', 'l', 'a', 'i', 'm', '[UNK]', 'o', 'f', '[UNK]', 'r', 'e', 's', 'p', 'o', 'n', 's', 'i', 'b', 'i', 'l', 'i', 't', 'y', '[UNK]', '.', '[UNK]', '[UNK]', 'r', 'a', 'y', 'u', 't', 'h', '[UNK]', '[UNK]', 'h', 'a', 'n', '[UNK]', '-', '[UNK]', 'o', 'c', 'h', 'a', '[UNK]', ',', '[UNK]', 't', 'h', 'e', '[UNK]', 'h', 'e', 'a', 'd', '[UNK]', 'o', 'f', '[UNK]', '[UNK]', 'h', 'a', 'i', 'l', 'a', 'n', 'd', '[UNK]', '’', '[UNK]', 's', '[UNK]', 'm', 'i', 'l', 'i', 't', 'a', 'r', 'y', '[UNK]', 'g', 'o', 'v', 'e', 'r', 'n', 'm', 'e', 'n', 't', '[UNK]', ',', '[UNK]', 's', 'a', 'i', 'd', '[UNK]', 't', 'h', 'a', 't', '[UNK]', 't', 'h', 'e', '[UNK]', 'a', 'u', 't', 'h', 'o', 'r', 'i', 't', 'i', 'e', 's', '[UNK]', 'w', 'e', 'r', 'e', '[UNK]', 's', 'e', 'a', 'r', 'c', 'h', 'i', 'n', 'g', '[UNK]', 'f', 'o', 'r', '[UNK]', 'a', '[UNK]', 'p', 'e', 'r', 's', 'o', 'n', '[UNK]', 's', 'e', 'e', 'n', '[UNK]', 'o', 'n', '[UNK]', 'c', 'l', 'o', 's', 'e', 'd', '[UNK]', '-', '[UNK]', 'c', 'i', 'r', 'c', 'u', 'i', 't', '[UNK]', 'f', 'o', 'o', 't', 'a', 'g', 'e', '[UNK]', 'b', 'u', 't', '[UNK]', 't', 'h', 'a', 't', '[UNK]', 'i', 't', '[UNK]', 'w', 'a', 's', '[UNK]', 'n', 'o', 't', '[UNK]', 'c', 'l', 'e', 'a', 'r', '[UNK]', 'w', 'h', 'o', '[UNK]', 't', 'h', 'e', '[UNK]', 'p', 'e', 'r', 's', 'o', 'n', '[UNK]', 'w', 'a', 's', '[UNK]', ',', '[UNK]', 'n', 'e', 'w', 's', '[UNK]', 'a', 'g', 'e', 'n', 'c', 'i', 'e', 's', '[UNK]', 'r', 'e', 'p', 'o', 'r', 't', 'e', 'd', '[UNK]', '.']

Spico197 · 2022-09-01T08:28:46Z

No, that is not normal. It should be split on whitespace.
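
To illustrate the difference being discussed, here is a minimal sketch. It is not the DocEE implementation: `char_tokenize`, `space_tokenize`, and the toy `vocab` are hypothetical names made up for this example, and the only assumption carried over from the thread is that out-of-vocabulary items are mapped to `[UNK]`.

```python
UNK = "[UNK]"


def char_tokenize(text, vocab):
    """Character-level lookup: every character, spaces included, is one token."""
    return [ch if ch in vocab else UNK for ch in text]


def space_tokenize(text, vocab):
    """Whitespace-level lookup: every space-separated word is one token."""
    return [tok if tok in vocab else UNK for tok in text.split()]


# Toy vocabulary (an assumption): lowercase letters plus a few whole words.
vocab = set("abcdefghijklmnopqrstuvwxyz") | {"of", "early", "there", "was"}

sentence = "As of early Tuesday"
char_tokens = char_tokenize(sentence, vocab)
space_tokens = space_tokenize(sentence, vocab)

print(char_tokens[:6])   # ['[UNK]', 's', '[UNK]', 'o', 'f', '[UNK]']
print(space_tokens)      # ['[UNK]', 'of', 'early', '[UNK]']
print(len(char_tokens), len(space_tokens))  # 19 4
```

With character-level lookup, uppercase letters and spaces fall outside the toy vocabulary and become `[UNK]`, which matches the shape of the token list in the error above. Because the two schemes produce different token counts, a span range that indexes whitespace tokens would no longer point at the right positions in a character-level token list.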

xxllp · 2022-09-01T08:37:49Z

Right, it looks like the splitting step is where the problem is.

xxllp · 2022-09-01T08:52:07Z

I just fixed this, but after a few epochs of training, all the prediction statistics are 0. Is something misaligned for the English data?

Spico197 · 2022-09-01T09:06:32Z

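I tried again locally. There should be no data splitting problem, and training should work out of the box. Could you tell me what changes you made?
All zeros is actually fairly normal on WikiEvents because the dataset is so small, so I recommend pairing it with a pretrained model. With randomly initialized embeddings, as used for DuEE-fin and ChFinAnn, the results will be very poor.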

xxllp · 2022-09-01T09:12:58Z

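In BertTokenizerForDocEE, I set self.dee_tokenize = self.dee_space_tokenize unconditionally, without checking the language.
It seems the language check still detects Chinese.

Which pretrained model do you mean? Is the model loaded at initialization BERT, or a model you trained yourself? I don't see the latter anywhere.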

Spico197 · 2022-09-01T09:17:33Z

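That is really strange. I re-cloned the repo and regenerated the data, and did not run into this problem orz
The script has a use_bert flag. Set it to True to use the BERT+CRF encoding scheme. It may run out of memory (OOM), though, so you will need to adjust the batch size and related parameters accordingly.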

xxllp · 2022-09-01T09:24:33Z

Got it. Maybe some change in my local code caused this.

xxllp · 2022-09-01T09:39:12Z

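I don't think that is where the problem lies. More likely it is that many of the tokens returned by self.dee_space_tokenize are unk, including tokens inside entities.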

xxllp · 2022-09-01T09:39:28Z

BERT doesn't give any results either.

Spico197 · 2022-09-01T10:28:32Z

Are you using a cased or an uncased model? If there are many UNK tokens, you could lowercase all strings and use an uncased model, or simply try a cased model.
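
As a rough illustration of the lowercasing suggestion, the sketch below measures the share of `[UNK]` tokens before and after lowercasing. It is a toy under stated assumptions, not DocEE code: `space_tokenize`, `unk_ratio`, and the small `vocab` set are hypothetical, with the set standing in for an uncased vocabulary that contains only lowercase entries.

```python
UNK = "[UNK]"


def space_tokenize(text, vocab, lowercase=False):
    """Whitespace tokenization with whole-word vocabulary lookup."""
    if lowercase:
        text = text.lower()
    return [tok if tok in vocab else UNK for tok in text.split()]


def unk_ratio(tokens):
    """Fraction of tokens that were mapped to the unknown token."""
    return sum(tok == UNK for tok in tokens) / len(tokens) if tokens else 0.0


# Toy stand-in for an uncased vocabulary (an assumption): lowercase entries only.
vocab = {"the", "head", "of", "military", "government", "said", ","}

sentence = "The head of Thailand military government , said"
raw = space_tokenize(sentence, vocab)
lowered = space_tokenize(sentence, vocab, lowercase=True)

print(raw, unk_ratio(raw))          # 2 of 8 tokens are [UNK] -> 0.25
print(lowered, unk_ratio(lowered))  # 1 of 8 tokens is [UNK] -> 0.125
```

Lowercasing keeps the number of whitespace tokens unchanged, so span ranges stay aligned. It only recovers words whose lowercase form is in the vocabulary; in this toy example "thailand" is still missing and remains `[UNK]`.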

xxllp · 2022-09-02T00:56:56Z

I'm using an uncased model.

xxllp · 2022-09-02T01:07:38Z

I tried it and it still seems to be the same: 0 for several epochs. What F1 did you end up with locally?

Spico197 · 2022-09-02T07:14:06Z

I only verified in debug mode that training runs. I have no training results yet.

xxllp · 2022-09-02T07:39:33Z

In that case, I think the English dataset definitely still needs changes somewhere.

xxllp · 2022-09-05T07:10:09Z

I switched to another dataset and now get results, but they are not very high. What would be a good way to get rid of the unk tokens?

xxllp added the question Further information is requested label Sep 1, 2022

Spico197 closed this as completed Sep 1, 2022
