Add __len__ & get_vecs_by_tokens to Vectors #561
Conversation
Can you add some motivation to the text? It seems that the get_vecs_by_tokens() function is similar to Field.process() (pad() + numericalize()). Any reason you feel the get_vecs_by_tokens() function should stay with the Vectors class?
Also, we now encourage complete docs and unit tests for a PR. Thanks.
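For context, the proposed behavior could be sketched roughly as follows. This is a pure-Python illustration with plain lists standing in for tensors; the real torchtext Vectors class stores a torch.Tensor, and everything here other than the `__len__` and `get_vecs_by_tokens` names is an assumption for demonstration:

```python
class Vectors:
    """Minimal stand-in for torchtext's Vectors (illustration only)."""
    def __init__(self, stoi, vectors, dim):
        self.stoi = stoi          # token -> row index
        self.vectors = vectors    # list of row vectors (real class uses a Tensor)
        self.dim = dim

    def __len__(self):
        # Number of tokens that have a pretrained vector.
        return len(self.vectors)

    def get_vecs_by_tokens(self, tokens):
        # Look up each token; unknown tokens map to a zero vector.
        zero = [0.0] * self.dim
        return [self.vectors[self.stoi[t]] if t in self.stoi else zero
                for t in tokens]

v = Vectors({'hello': 0, 'world': 1}, [[1.0, 2.0], [3.0, 4.0]], dim=2)
print(len(v))                                # 2
print(v.get_vecs_by_tokens(['world', 'x']))  # [[3.0, 4.0], [0.0, 0.0]]
```

The zero-vector fallback for out-of-vocabulary tokens mirrors how Vectors already handles missing tokens on single-token lookup.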
Thank you for your reply! I'm using torchtext to rewrite a GluonNLP-based program (https://d2l.ai/chapter_natural-language-processing/similarity-analogy.html). I want to use Vectors in a more flexible way. For example:

```python
# GluonNLP implementation
def knn(W, x, k):
    cos = nd.dot(W, x.reshape((-1,))) / (
        (nd.sum(W * W, axis=1) + 1e-9).sqrt() * nd.sum(x * x).sqrt())
    topk = nd.topk(cos, k=k, ret_typ='indices').asnumpy().astype('int32')
    return topk, [cos[i].asscalar() for i in topk]

def get_similar_tokens(query_token, k, embed):
    topk, cos = knn(embed.idx_to_vec,
                    embed.get_vecs_by_tokens([query_token]), k+1)
    for i, c in zip(topk[1:], cos[1:]):  # exclude the input token
        print('cosine sim=%.3f: %s' % (c, embed.idx_to_token[i]))

def get_analogy(token_a, token_b, token_c, embed):
    vecs = embed.get_vecs_by_tokens([token_a, token_b, token_c])
    x = vecs[1] - vecs[0] + vecs[2]
    topk, cos = knn(embed.idx_to_vec, x, 1)
    return embed.idx_to_token[topk[0]]
```

```python
# torchtext implementation
def knn(W, x, k):
    cos = torch.mm(W, x.reshape(-1, 1)) / (
        (torch.sum(W * W, dim=1, keepdim=True) + 1e-9).sqrt() * torch.sum(x * x).sqrt())
    _, topk = torch.topk(cos, k=k, dim=0)
    return topk, [cos[i].item() for i in topk]

def get_similar_tokens(query_token, k, embed):
    topk, cos = knn(embed.vectors, embed[query_token], k+1)
    for i, c in zip(topk[1:], cos[1:]):  # exclude the input token
        print('cosine sim=%.3f: %s' % (c, embed.itos[i]))

def get_analogy(token_a, token_b, token_c, embed):
    vecs_a, vecs_b, vecs_c = embed[token_a], embed[token_b], embed[token_c]
    x = vecs_b - vecs_a + vecs_c
    topk, cos = knn(embed.vectors, x, 1)
    return embed.itos[topk[0]]
```

I didn't add a unit test because this was a small change. I will add one if you think this change is helpful.
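For reference, the cosine-similarity nearest-neighbor search both snippets implement can be sketched without MXNet or PyTorch. This is a pure-Python illustration only; `W` is a list of row vectors and `x` a query vector:

```python
import math

def knn(W, x, k):
    # Cosine similarity of the query x against each row of W,
    # with a small epsilon guarding against zero-norm rows.
    x_norm = math.sqrt(sum(xi * xi for xi in x))
    cos = [sum(wi * xi for wi, xi in zip(row, x)) /
           (math.sqrt(sum(wi * wi for wi in row) + 1e-9) * x_norm)
           for row in W]
    # Indices of the k largest similarities, most similar first.
    topk = sorted(range(len(cos)), key=lambda i: cos[i], reverse=True)[:k]
    return topk, [cos[i] for i in topk]

W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
topk, cos = knn(W, [1.0, 0.0], 2)
print(topk)  # [0, 2]
```

Both framework versions above compute the same quantity; the `k+1` in `get_similar_tokens` accounts for the query token itself ranking first.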
You could retest it tomorrow. I guess the server may be down.
Add __len__ and get_vecs_by_tokens functions to Vectors
I'm using torchtext to rewrite a GluonNLP-based program (https://d2l.ai/chapter_natural-language-processing/similarity-analogy.html). I want to use Vectors in a more flexible way. For example: