
Add __len__ & get_vecs_by_tokens to Vectors #561

Merged
merged 8 commits into from
Jul 24, 2019
Merged

Add __len__ & get_vecs_by_tokens to Vectors #561

merged 8 commits into from
Jul 24, 2019

Conversation

@sangyx (Contributor) commented Jul 19, 2019

Add __len__ and get_vecs_by_tokens functions to Vectors

I'm using torchtext to rewrite a GluonNLP-based program (https://d2l.ai/chapter_natural-language-processing/similarity-analogy.html), and I want to use Vectors in a more flexible way. For example:

# GluonNLP implementation
def knn(W, x, k):
    # cosine similarity between each row of W and the query vector x
    cos = nd.dot(W, x.reshape((-1,))) / (
        (nd.sum(W * W, axis=1) + 1e-9).sqrt() * nd.sum(x * x).sqrt())
    topk = nd.topk(cos, k=k, ret_typ='indices').asnumpy().astype('int32')
    return topk, [cos[i].asscalar() for i in topk]

def get_similar_tokens(query_token, k, embed):
    topk, cos = knn(embed.idx_to_vec,
                    embed.get_vecs_by_tokens([query_token]), k+1)
    for i, c in zip(topk[1:], cos[1:]):  # skip the query token itself
        print('cosine sim=%.3f: %s' % (c, (embed.idx_to_token[i])))

def get_analogy(token_a, token_b, token_c, embed):
    vecs = embed.get_vecs_by_tokens([token_a, token_b, token_c])
    x = vecs[1] - vecs[0] + vecs[2]
    topk, cos = knn(embed.idx_to_vec, x, 1)
    return embed.idx_to_token[topk[0]]

# torchtext implementation
def knn(W, x, k):
    # cosine similarity between each row of W and the query vector x
    cos = torch.mm(W, x.reshape(-1, 1)) / (
        (torch.sum(W * W, dim=1, keepdim=True) + 1e-9).sqrt() * torch.sum(x * x).sqrt())
    _, topk = torch.topk(cos, k=k, dim=0)
    return topk, [cos[i].item() for i in topk]

def get_similar_tokens(query_token, k, embed):
    topk, cos = knn(embed.vectors,
                    embed[query_token], k+1)
    for i, c in zip(topk[1:], cos[1:]):  # skip the query token itself
        print('cosine sim=%.3f: %s' % (c, (embed.itos[i])))

def get_analogy(token_a, token_b, token_c, embed):
    vecs_a, vecs_b, vecs_c = embed[token_a], embed[token_b], embed[token_c]
    x = vecs_b - vecs_a + vecs_c
    topk, cos = knn(embed.vectors, x, 1)
    return embed.itos[topk[0]]
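
With these additions, the lookups above could be written directly against a torchtext Vectors object. The snippet below is only a rough usage sketch, not part of the diff; it assumes get_vecs_by_tokens mirrors the GluonNLP signature and that __len__ returns the number of tokens in the vocabulary:

# hypothetical usage sketch, assuming the new methods behave like their GluonNLP counterparts
import torchtext.vocab as vocab

glove = vocab.GloVe(name='6B', dim=50)   # downloads and caches pretrained vectors on first use
print(len(glove))                        # proposed __len__: number of tokens in the vocabulary
vecs = glove.get_vecs_by_tokens(['man', 'woman', 'king'])
print(vecs.shape)                        # expected: torch.Size([3, 50])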

@zhangguanheng66 (Contributor) left a comment

Can you add some motivation to the text? It seems that the get_vecs_by_tokens() function is similar to Field.process() (pad() + numericalize()). Any reason that you feel the get_vecs_by_tokens() function should stay with the Vectors class?

Also we are now encouraging complete docs and unit tests for a PR. Thanks.

@sangyx (Contributor, Author) commented Jul 20, 2019

Thank you for your reply! I'm using torchtext to rewrite a GluonNLP-based program (https://d2l.ai/chapter_natural-language-processing/similarity-analogy.html), and I want to use Vectors in a more flexible way, as in the example in the description above.


I didn't add unit tests because this was a small change. I will add them if you think this change is helpful.
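
If a test is wanted, a minimal sketch could look like the following. It is only an illustration: it builds a Vectors object by assigning the itos/stoi/vectors/dim attributes directly instead of loading a vector file, which is an assumption about the class internals rather than the project's actual test setup.

# hypothetical unit-test sketch; attribute names (itos, stoi, vectors, dim, unk_init)
# are assumed to match torchtext.vocab.Vectors
import unittest
import torch
from torchtext.vocab import Vectors

class TestVectors(unittest.TestCase):
    def _make_vectors(self):
        vec = Vectors.__new__(Vectors)   # skip file loading for the sketch
        vec.itos = ['a', 'b']
        vec.stoi = {'a': 0, 'b': 1}
        vec.vectors = torch.tensor([[0.1, 0.2], [0.3, 0.4]])
        vec.dim = 2
        vec.unk_init = torch.Tensor.zero_   # fallback for OOV tokens (unused here)
        return vec

    def test_len(self):
        self.assertEqual(len(self._make_vectors()), 2)

    def test_get_vecs_by_tokens(self):
        out = self._make_vectors().get_vecs_by_tokens(['a', 'b'])
        self.assertEqual(tuple(out.shape), (2, 2))

if __name__ == '__main__':
    unittest.main()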

torchtext/vocab.py (review comment resolved)
@zhangguanheng66 (Contributor) commented
You could retest it tomorrow. I guess the server may be down.

@zhangguanheng66 merged commit 9a2f6be into pytorch:master on Jul 24, 2019