Improve support for word representations #58

Closed
elyase opened this issue Apr 21, 2015 · 7 comments

@elyase
Contributor

elyase commented Apr 21, 2015

It would be great to be able to load pretrained word representations, at least the ones from GloVe [1] and word2vec [2].

In the same fashion, it would be useful to have a most_similar method able to efficiently retrieve the top n similar words.

[1] http://nlp.stanford.edu/projects/glove/
[2] https://code.google.com/p/word2vec/

@mfilipov

mfilipov commented May 6, 2015

spacy.tokens.Token.repvec loads the pretrained word representation.
I agree that it would be very useful to be able to efficiently retrieve similar words...
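
For context, a minimal sketch of what that looks like with the built-in vectors (attribute names as in the spaCy API of this era; the example sentence is arbitrary):

import spacy.en

nlp = spacy.en.English()
doc = nlp(u'space exploration')
# Each token carries a pretrained 300-dimensional vector in .repvec
print(doc[0].repvec.shape)   # -> (300,)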

@honnibal
Member

honnibal commented May 6, 2015

On Wednesday, May 6, 2015, mfilipov notifications@github.com wrote:

> spacy.tokens.Token.repvec loads the pretrained word representation.

This should have a better API -- currently you need to precompile the word vectors.

> I agree that it would be very useful to be able to efficiently retrieve similar words...

This is an open research problem. Consider that there are (n^2 - n)/2 combinations, with n on the order of 10^6 for our vocab -- roughly 5 × 10^11 pairs.

The folks over at gensim are thinking about this too: piskvorky/gensim#51

One solution is to stick a cache in front of the query system. We could probably serve a majority of similarity queries from a small cache.
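
To illustrate the caching idea only (a hypothetical sketch, not spaCy code; it assumes a most_similar(word, topn) helper like the ones shown later in this thread):

# Memoise similarity queries so repeated look-ups skip the full vocabulary scan.
_similarity_cache = {}

def cached_most_similar(word, topn=10):
    key = (word, topn)
    if key not in _similarity_cache:
        _similarity_cache[key] = most_similar(word, topn)
    return _similarity_cache[key]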



@geovedi
Contributor

geovedi commented May 7, 2015

I don't know how Matthew generated the built-in embedding vectors, but to me the quality is okay.

But if you're not happy with them and want to use your own embeddings, there's spacy.vocab.write_binary_vectors you can use: it takes a bzip2-compressed vectors file as input and writes the output to .../spacy/en/data/vocab/vec.bin. I've never done it and have only read the code, so I might be wrong. (EDIT: apparently there's a script to do that.)
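
As a rough sketch of what that might look like (untested; the argument order is an assumption from reading the source, and both paths are hypothetical):

from spacy.vocab import write_binary_vectors

# Convert a bzip2-compressed plain-text vectors file into spaCy's binary format
# (assumed signature: input location, output location).
write_binary_vectors('my_vectors.txt.bz2', 'spacy/en/data/vocab/vec.bin')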

Also remember, spaCy expects each vector to have size 300, as it's hardcoded here and here.
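
A quick way to confirm the expected dimensionality (a minimal sketch using the same attributes as the snippets below):

import spacy.en

nlu = spacy.en.English()
# The built-in lexeme vectors are 300-dimensional.
print(nlu.vocab[u'space'].repvec.shape)   # -> (300,)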

For the most_similar function, that's easy to implement, but I'd rather do it somewhere outside spaCy. You can also use gensim to load spaCy's vectors!

import numpy as np
import spacy.en
from gensim.models.word2vec import Word2Vec, Vocab

nlu = spacy.en.English()
model = Word2Vec(size=300)

# Register every spaCy lexeme in gensim's vocabulary, keeping the indices aligned.
for i, lex in enumerate(nlu.vocab):
    model.vocab[lex.orth_] = Vocab(index=i, count=None)
    model.index2word.append(lex.orth_)

# Copy the (already normalised) spaCy vectors straight into gensim.
model.syn0norm = np.asarray([lex.repvec for lex in nlu.vocab])

The vectors are loaded into model.syn0norm because spaCy's vectors are already normalised, and because this avoids gensim's L2-normalisation of all-zero vectors.

In [249]: model.most_similar(u'space')
Out[249]:
[(u'Space', 1.0),
 (u'SPACE', 1.0),
 (u'SPACES', 0.7741692662239075),
 (u'spaces', 0.7741692662239075),
 (u'Spaces', 0.7741692662239075),
 (u'workspace', 0.6425580978393555),
 (u'hyperspace', 0.578804612159729),
 (u'CYBERSPACE', 0.5667369961738586),
 (u'cyberspace', 0.5667369961738586),
 (u'Cyberspace', 0.5667369961738586)]

But if you just want a most_similar function, read the docs on how to compute vector similarity, or you can do something like this...

import numpy as np
import spacy.en

nlu = spacy.en.English()
# Stack every lexeme's (already normalised) vector into one matrix,
# and keep the matching strings so row indices map back to words.
vectors = np.asarray([lex.repvec for lex in nlu.vocab])
vocab = [lex.orth_ for lex in nlu.vocab]

def most_similar(word, topn=5):
    if isinstance(word, str):
        word = unicode(word)  # Python 2: spaCy expects unicode keys
    # Dot products against normalised vectors are cosine similarities.
    dists = np.dot(vectors, nlu.vocab[word].repvec)
    return [(vocab[i], dists[i]) for i in np.argsort(dists)[::-1][:topn]]

Example:

In [189]: most_similar(u'query')
Out[189]:
[(u'query', 0.99999994),
 (u'Query', 0.99999994),
 (u'queries', 0.74670404),
 (u'keystroke', 0.65798581),
 (u'signup', 0.62638754)]

Extra

Maybe you want to find the most similar words within the same sentence. That's also easy; you can have a function like this...

def most_similar_in_sentence(tokens, word, pos_tags=(), topn=5):
    if isinstance(word, str):
        word = unicode(word)  # Python 2: spaCy expects unicode keys
    # Optionally restrict candidates to tokens with the given POS tags.
    if pos_tags:
        candidates = [t for t in tokens if t.pos_ in pos_tags]
    else:
        candidates = list(tokens)
    # Cosine similarity of each candidate token against the query word.
    dists = np.dot(np.asarray([t.repvec for t in candidates]), nlu.vocab[word].repvec)
    return [(candidates[i].orth_, dists[i]) for i in np.argsort(dists)[::-1][:topn]]

Example:

In [199]: text = u"""One solution is to stick a cache in front of the query system. We could probably serve a majority of similarity queries from a small cache."""

In [200]: tokens = nlu(text)

In [201]: most_similar_in_sentence(tokens, u'query')
Out[201]:
[(u'query', 0.99999994),
 (u'queries', 0.74670404),
 (u'cache', 0.49115649),
 (u'cache', 0.49115646),
 (u'solution', 0.36885005)]

Maybe you want to filter based on POS tag.

In [202]: most_similar_in_sentence(tokens, u'query', pos_tags=['VERB'])
Out[202]:
[(u'stick', 0.29887787),
 (u'could', 0.23798984),
 (u'is', 0.2375059),
 (u'serve', 0.23126227)]

In [203]: most_similar_in_sentence(tokens, u'query', pos_tags=['NOUN', 'VERB', 'ADJ', 'NUM'], topn=10)
Out[203]:
[(u'query', 0.99999994),
 (u'queries', 0.74670416),
 (u'cache', 0.49115649),
 (u'cache', 0.49115646),
 (u'solution', 0.36885005),
 (u'system', 0.32158077),
 (u'stick', 0.29887787),
 (u'similarity', 0.29796919),
 (u'One', 0.24750805),
 (u'could', 0.23798984)]

Hope that helps!

@honnibal
Member

honnibal commented May 9, 2015

The vectors are taken from here: https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/

These embeddings were computed from dependencies, using a parser that's very similar to spaCy's (the Goldberg and Nivre (2012) model; spaCy uses some updates on this model, most importantly Brown cluster features).

The dependency-based embeddings seem substantially better to me. At least, they agree more closely with my expectations for what sort of regularities these vectors should be capturing.

Once I get around to generating embeddings myself, I'd like to have a set of vectors keyed by (lemma, POS tag) tuples, instead of the current string-keyed vectors, which rely on somewhat arbitrary text pre-processing. I'd also like to do named entity linking, and have vectors in the same space for entities.

This stuff is on the roadmap, but somewhere behind constituency parsing and the named entity plans.

@honnibal
Member

See additional documentation here: http://spacy.io/tutorials/load-new-word-vectors/

@LopezGG

LopezGG commented Nov 30, 2016

@honnibal: Can you share a code snippet to load Omer Levy's dependency embeddings?

@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators on May 9, 2018