
Add __len__ & get_vecs_by_tokens to Vectors #561

Merged
merged 8 commits into from
Jul 24, 2019
Merged

Add __len__ & get_vecs_by_tokens to Vectors #561

merged 8 commits into from
Jul 24, 2019

Conversation

@sangyx (Contributor) commented Jul 19, 2019

Add __len__ and get_vecs_by_tokens functions to Vectors

I'm using torchtext to rewrite a GluonNLP-based program (https://d2l.ai/chapter_natural-language-processing/similarity-analogy.html), and I want to use Vectors in a more flexible way. For example:

# GluonNLP implementation
def knn(W, x, k):
    # cosine similarity between each row of W and the query vector x
    cos = nd.dot(W, x.reshape((-1,))) / (
        (nd.sum(W * W, axis=1) + 1e-9).sqrt() * nd.sum(x * x).sqrt())
    topk = nd.topk(cos, k=k, ret_typ='indices').asnumpy().astype('int32')
    return topk, [cos[i].asscalar() for i in topk]

def get_similar_tokens(query_token, k, embed):
    topk, cos = knn(embed.idx_to_vec,
                    embed.get_vecs_by_tokens([query_token]), k+1)
    for i, c in zip(topk[1:], cos[1:]):  # skip the query token itself
        print('cosine sim=%.3f: %s' % (c, (embed.idx_to_token[i])))

def get_analogy(token_a, token_b, token_c, embed):
    vecs = embed.get_vecs_by_tokens([token_a, token_b, token_c])
    x = vecs[1] - vecs[0] + vecs[2]
    topk, cos = knn(embed.idx_to_vec, x, 1)
    return embed.idx_to_token[topk[0]]

# torchtext implementation
def knn(W, x, k):
    # cosine similarity between each row of W and the query vector x
    cos = torch.mm(W, x.reshape(-1, 1)) / (
        (torch.sum(W * W, dim=1, keepdim=True) + 1e-9).sqrt() * torch.sum(x * x).sqrt())
    _, topk = torch.topk(cos, k=k, dim=0)
    return topk, [cos[i].item() for i in topk]

def get_similar_tokens(query_token, k, embed):
    topk, cos = knn(embed.vectors,
                    embed[query_token], k+1)
    for i, c in zip(topk[1:], cos[1:]):  # skip the query token itself
        print('cosine sim=%.3f: %s' % (c, (embed.itos[i])))

def get_analogy(token_a, token_b, token_c, embed):
    vecs_a, vecs_b, vecs_c = embed[token_a], embed[token_b], embed[token_c]
    x = vecs_b - vecs_a + vecs_c
    topk, cos = knn(embed.vectors, x, 1)
    return embed.itos[topk[0]]
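
With these additions, the lookups above could be written directly against a torchtext Vectors object. The snippet below is only a rough usage sketch, not part of the diff; it assumes get_vecs_by_tokens mirrors the GluonNLP signature and that __len__ returns the number of tokens in the vocabulary:

# hypothetical usage sketch, assuming the new methods behave like their GluonNLP counterparts
import torchtext.vocab as vocab

glove = vocab.GloVe(name='6B', dim=50)   # downloads and caches pretrained vectors on first use
print(len(glove))                        # proposed __len__: number of tokens in the vocabulary
vecs = glove.get_vecs_by_tokens(['man', 'woman', 'king'])
print(vecs.shape)                        # expected: torch.Size([3, 50])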

@zhangguanheng66 (Contributor) left a comment

Can you add some motivation to the text? It seems that the get_vecs_by_tokens() function is similar to Field.process() (pad() + numericalize()). Any reason that you feel the get_vecs_by_tokens() function should stay with the Vectors class?

Also we are now encouraging complete docs and unit tests for a PR. Thanks.

@sangyx (Contributor, Author) commented Jul 20, 2019

Thank you for your reply! I'm using torchtext to rewrite a GluonNLP-based program (https://d2l.ai/chapter_natural-language-processing/similarity-analogy.html), and I want to use Vectors in a more flexible way, as in the example in the description above.


I didn't add unit tests because this was a small change. I will add them if you think this change is helpful.
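
If a test is wanted, a minimal sketch could look like the following. It is only an illustration: it builds a Vectors object by assigning the itos/stoi/vectors/dim attributes directly instead of loading a vector file, which is an assumption about the class internals rather than the project's actual test setup.

# hypothetical unit-test sketch; attribute names (itos, stoi, vectors, dim, unk_init)
# are assumed to match torchtext.vocab.Vectors
import unittest
import torch
from torchtext.vocab import Vectors

class TestVectors(unittest.TestCase):
    def _make_vectors(self):
        vec = Vectors.__new__(Vectors)   # skip file loading for the sketch
        vec.itos = ['a', 'b']
        vec.stoi = {'a': 0, 'b': 1}
        vec.vectors = torch.tensor([[0.1, 0.2], [0.3, 0.4]])
        vec.dim = 2
        vec.unk_init = torch.Tensor.zero_   # fallback for OOV tokens (unused here)
        return vec

    def test_len(self):
        self.assertEqual(len(self._make_vectors()), 2)

    def test_get_vecs_by_tokens(self):
        out = self._make_vectors().get_vecs_by_tokens(['a', 'b'])
        self.assertEqual(tuple(out.shape), (2, 2))

if __name__ == '__main__':
    unittest.main()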

torchtext/vocab.py (review comment resolved)
@zhangguanheng66 (Contributor) commented
You could retest it tomorrow. I guess the server may be down.

@zhangguanheng66 merged commit 9a2f6be into pytorch:master on Jul 24, 2019