Encoding error when loading GPT2LMHeadModel #43

aRookieMan · 2020-06-09T10:34:24Z

I wanted to load the model on its own, but it raised an error. What's going on? ...

aRookieMan · 2020-06-09T11:45:54Z

With the latest transformers you only need to pass in the model directory, and the checkpoint file has to be renamed to pytorch_model.bin.
See huggingface/transformers#1620 (comment)
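For anyone hitting the same error, here is a minimal sketch of that fix. The helper names `prepare_model_dir` and `load_gpt2` are hypothetical, and so are any file names you pass in. It also assumes the directory needs a `config.json` next to `pytorch_model.bin`, which the comment above does not mention but `from_pretrained` looks for when given a directory.

```python
import shutil
from pathlib import Path


def prepare_model_dir(checkpoint_file, config_file, out_dir):
    """Copy a checkpoint and its config into out_dir under the default
    file names that from_pretrained() looks for in a directory."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # The weights must be called pytorch_model.bin, as noted in the thread.
    shutil.copyfile(checkpoint_file, out / "pytorch_model.bin")
    # Assumption: the model config sits beside the weights as config.json.
    shutil.copyfile(config_file, out / "config.json")
    return out


def load_gpt2(model_dir):
    """Load GPT2LMHeadModel from a directory prepared as above."""
    # Imported lazily so prepare_model_dir works without transformers installed.
    from transformers import GPT2LMHeadModel

    return GPT2LMHeadModel.from_pretrained(str(model_dir))
```

Copying rather than renaming in place keeps the original checkpoint untouched, which helps if the same file is also used by other scripts.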

michellemashutian · 2020-08-03T07:40:36Z

A question: how do I get the words and their word embeddings out of the model?

aRookieMan · 2020-08-03T07:43:58Z

Load the model with the code in my screenshot, and then you can work with it by following the huggingface/transformers examples.

michellemashutian · 2020-08-03T08:10:30Z

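I have loaded it, but I don't know how to extract them. I just want to dump the word embeddings into a word[space]embedding format. Sorry, this may be a silly question...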

aRookieMan · 2020-08-03T08:14:15Z

Just follow the example from huggingface/transformers:

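import torch
from transformers import BertModel, BertTokenizer

# Example values only; the quick tour loops over several (model, tokenizer, weights) triples.
model_class, tokenizer_class, pretrained_weights = BertModel, BertTokenizer, "bert-base-uncased"

tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

# Encode text. add_special_tokens=True adds [CLS], [SEP], <s>... tokens in the right way for each model.
input_ids = torch.tensor([tokenizer.encode("Here is some text to encode", add_special_tokens=True)])
with torch.no_grad():
    last_hidden_states = model(input_ids)[0]  # Model outputs are tuples; element 0 is the last hidden state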

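This repo is built on huggingface. I recommend reading through the huggingface/transformers repository; there is a Quick tour right on its front page.
You're a beginner, I understand.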

michellemashutian · 2020-08-03T08:25:30Z

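Right, I know about this. So if I want to get the word embeddings out, do I have to encode the words one by one? I thought there was some way to pull them out directly. Thanks!!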

aRookieMan · 2020-08-03T08:29:33Z

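Uh ... I see what you mean. BERT is a dynamic (contextual) language model: the embedding looked up for each token is only a shallow representation, and the vectors that come out after the 12 MHA (multi-head attention) layers are the accurate word vectors. If you want the initial embedding for each token, in theory you can extract it directly: check which nn layers the model has and take out the first one. I haven't tried it myself, though. Good luck!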
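To illustrate the difference between the two kinds of vectors, here is a rough sketch. It assumes `torch` and `transformers` are installed and builds a tiny, randomly initialised `BertModel` from a `BertConfig` so nothing needs to be downloaded. All sizes and token ids are made up for illustration and are not the thread's real model.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny random BERT (illustrative sizes only, not a real checkpoint).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertModel(config)
model.eval()  # disable dropout so the outputs are deterministic

# "Check which nn layers the model has": list the top-level submodules.
child_names = [name for name, _ in model.named_children()]
print(child_names)

# The first layer's token lookup table is exposed directly.
static_table = model.get_input_embeddings()
print(static_table.weight.shape)  # (vocab_size, hidden_size)

# The same token id (5) appears at positions 0 and 2.
input_ids = torch.tensor([[5, 7, 5]])
with torch.no_grad():
    static_vecs = static_table(input_ids)   # plain lookup, no context
    contextual_vecs = model(input_ids)[0]   # output of the last layer

# The lookup gives the same vector for both occurrences of token 5,
# while the final hidden states depend on position and context.
same_static = torch.equal(static_vecs[0, 0], static_vecs[0, 2])
same_contextual = torch.allclose(contextual_vecs[0, 0], contextual_vecs[0, 2])
print(same_static, same_contextual)
```

With a real checkpoint the same calls apply after `from_pretrained`; only the table from `get_input_embeddings()` can be dumped as one fixed vector per vocabulary entry.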

michellemashutian · 2020-08-03T08:47:58Z

It looks like this does the job. The 29522 in the code is the index of a word; if the vocab file is ordered by index, I can generate the file I wanted.

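import torch
from transformers import BertModel

# model_path: directory holding config.json and pytorch_model.bin (not shown in the thread)
model = BertModel.from_pretrained(model_path)
embeds = model.get_input_embeddings()  # nn.Embedding holding the static token embedding table
print(embeds)
input_idx = torch.tensor([29522])  # a plain tensor is enough; torch.autograd.Variable is no longer needed
print(embeds(input_idx))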
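Building on the lookup above, the word[space]embedding file could be produced roughly as follows. This is a sketch, not code from the thread: `read_vocab` and `write_word_embeddings` are hypothetical helpers, and it assumes the vocab file has one token per line with the line number equal to the token index.

```python
def read_vocab(vocab_path):
    """Return tokens in index order (assumes line i holds the token with index i)."""
    with open(vocab_path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def write_word_embeddings(tokens, vectors, out_path, precision=6):
    """Write one 'token v1 v2 ... vn' line per vocabulary entry."""
    if len(tokens) != len(vectors):
        raise ValueError(
            f"vocab has {len(tokens)} entries but there are {len(vectors)} vectors"
        )
    with open(out_path, "w", encoding="utf-8") as f:
        for token, vec in zip(tokens, vectors):
            values = " ".join(f"{x:.{precision}f}" for x in vec)
            f.write(f"{token} {values}\n")
```

With a loaded model, the whole table can be taken in one step instead of looking up indices one at a time, for example `vectors = model.get_input_embeddings().weight.detach().tolist()`, and then passed to `write_word_embeddings(read_vocab(vocab_path), vectors, out_path)`. The length check catches the case where the vocab file and the embedding table do not line up.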

aRookieMan closed this as completed Jun 9, 2020
