Improve support for word representations #58
spacy.tokens.Token.repvec loads a pretrained word representation.
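For instance, a minimal sketch assuming the spaCy 0.x API used elsewhere in this thread (English(), indexing the returned tokens, and the repvec attribute):

import spacy.en

nlu = spacy.en.English()
tokens = nlu(u'space exploration')
vec = tokens[0].repvec   # pretrained vector for "space"
print vec.shape          # a fixed-size dense vector, e.g. (300,) as in the gensim snippet below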
This is an open research problem. Consider that there are (n^2 - n)/2 similarity pairs to compute over the vocabulary. The folks over at gensim are thinking about this too: piskvorky/gensim#51. One solution is to stick a cache in front of the query system. We could probably serve a majority of similarity queries from a small cache.
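A hedged sketch of that caching idea (my own illustration, not code from this thread; it assumes some most_similar(word, topn) function like the numpy one shown further down):

# Sketch: memoize similarity queries so repeated words are served from a small dict cache.
_cache = {}

def cached_most_similar(word, topn=5):
    key = (word, topn)
    if key not in _cache:
        _cache[key] = most_similar(word, topn)
    return _cache[key]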
I don't know how Matthew generated the built-in embedding vectors, but to me the quality is okay. If you're not happy with them and want to use your own embeddings, that's possible too. Also remember, spaCy expects unicode strings. For example, to expose spaCy's vectors through gensim:

import numpy as np
import spacy.en
from gensim.models.word2vec import Word2Vec, Vocab

nlu = spacy.en.English()
model = Word2Vec(size=300)
for i, lex in enumerate(nlu.vocab):
    model.vocab[lex.orth_] = Vocab(index=i, count=None)
    model.index2word.append(lex.orth_)
# The repvec vectors look pre-normalized (self-similarity is 1.0 below), so assign them to syn0norm directly
model.syn0norm = np.asarray(map(lambda x: x.repvec, nlu.vocab))

With spaCy's vocabulary loaded into gensim, you can query it:

In [249]: model.most_similar(u'space')
Out[249]:
[(u'Space', 1.0),
(u'SPACE', 1.0),
(u'SPACES', 0.7741692662239075),
(u'spaces', 0.7741692662239075),
(u'Spaces', 0.7741692662239075),
(u'workspace', 0.6425580978393555),
(u'hyperspace', 0.578804612159729),
(u'CYBERSPACE', 0.5667369961738586),
(u'cyberspace', 0.5667369961738586),
(u'Cyberspace', 0.5667369961738586)]

But if you just want a most_similar function, you can skip gensim and do it directly with numpy:

import numpy as np
import spacy.en

nlu = spacy.en.English()
# Stack every lexeme's vector into one matrix, and keep the matching strings
vectors = np.asarray(map(lambda x: x.repvec, nlu.vocab))
vocab = map(lambda x: x.orth_, nlu.vocab)

def most_similar(word, topn=5):
    if isinstance(word, str):
        word = unicode(word)
    # Dot products against the (pre-normalized) vectors, sorted descending
    dists = np.dot(vectors, nlu.vocab[word].repvec)
    return map(lambda x: (vocab[x], dists[x]), np.argsort(dists)[::-1][:topn])

Example:

In [189]: most_similar(u'query')
Out[189]:
[(u'query', 0.99999994),
(u'Query', 0.99999994),
(u'queries', 0.74670404),
(u'keystroke', 0.65798581),
(u'signup', 0.62638754)]

Extra: maybe you want to know the most similar words within the same sentence. That's also easy; you can have a function like this:

def most_similar_in_sentence(tokens, word, pos_tags=[], topn=5):
    if isinstance(word, str):
        word = unicode(word)
    # Optionally restrict the candidates to certain part-of-speech tags
    if pos_tags:
        vocab = filter(lambda x: x.pos_ in pos_tags, tokens)
    else:
        vocab = tokens
    dists = np.dot(np.asarray(map(lambda x: x.repvec, vocab)), nlu.vocab[word].repvec)
    return map(lambda x: (vocab[x].orth_, dists[x]), np.argsort(dists)[::-1][:topn])

Example:

In [199]: text = u"""One solution is to stick a cache in front of the query system. We could probably serve a majority of similarity queries from a small cache."""
In [200]: tokens = nlu(text)
In [201]: most_similar_in_sentence(tokens, u'query')
Out[201]:
[(u'query', 0.99999994),
(u'queries', 0.74670404),
(u'cache', 0.49115649),
(u'cache', 0.49115646),
(u'solution', 0.36885005)]

Maybe you want to filter based on POS tag:

In [202]: most_similar_in_sentence(tokens, u'query', pos_tags=['VERB'])
Out[202]:
[(u'stick', 0.29887787),
(u'could', 0.23798984),
(u'is', 0.2375059),
(u'serve', 0.23126227)]
In [203]: most_similar_in_sentence(tokens, u'query', pos_tags=['NOUN', 'VERB', 'ADJ', 'NUM'], topn=10)
Out[203]:
[(u'query', 0.99999994),
(u'queries', 0.74670416),
(u'cache', 0.49115649),
(u'cache', 0.49115646),
(u'solution', 0.36885005),
(u'system', 0.32158077),
(u'stick', 0.29887787),
(u'similarity', 0.29796919),
(u'One', 0.24750805),
(u'could', 0.23798984)]

Hope that helps!
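One caveat worth adding (my note, not from the thread): the plain dot products above only equal cosine similarities because the repvec vectors appear to be unit-normalized, as the self-similarity of 1.0 in the outputs suggests. If you swap in vectors that aren't normalized, you can normalize explicitly; a sketch, assuming the same vectors and vocab built above:

# Sketch: explicit cosine similarity, so the ranking doesn't depend on vector norms.
import numpy as np

norms = np.linalg.norm(vectors, axis=1)
norms[norms == 0] = 1.0                 # avoid dividing by zero for all-zero vectors
unit_vectors = vectors / norms[:, None]

def most_similar_cosine(word, topn=5):
    query = nlu.vocab[word].repvec
    query = query / (np.linalg.norm(query) or 1.0)
    dists = np.dot(unit_vectors, query)
    return [(vocab[i], dists[i]) for i in np.argsort(dists)[::-1][:topn]]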
The vectors are taken from here: https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/

These embeddings were computed from dependencies, using a parser that's very similar to SpaCy's (the Goldberg and Nivre (2012) model; SpaCy uses some updates on this model, most importantly Brown cluster features). The dependency-based embeddings seem substantially better to me. At least, they agree more closely with my expectations for what sort of regularities these vectors should be capturing.

Once I get around to generating embeddings myself, I'd like to have a set of vectors keyed by (lemma, POS tag) tuples, instead of the current string-keyed vectors, which rely on somewhat arbitrary text pre-processing. I'd also like to do named entity linking, and have vectors in the same space for entities. This stuff is in the roadmap, but somewhere behind constituency parsing and the named entity plans.
See additional documentation here: http://spacy.io/tutorials/load-new-word-vectors/
@honnibal: Can you share the code snippet to load Omer Levy's dependency embeddings?
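No snippet for that appears in the thread. As a rough sketch, assuming the download is a plain-text file with one word followed by its vector components per line (which is how those dependency-based embeddings are distributed, as far as I know), reading it into a word-to-vector dict could look like this; getting the vectors into spaCy itself would then follow the tutorial linked above:

# Sketch only: parse a plain-text embeddings file of the form "word v1 v2 ... vn" per line.
# The file name below is a placeholder; point it at wherever you unpacked the download.
import numpy as np

def read_text_embeddings(path):
    embeddings = {}
    with open(path) as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word = parts[0].decode('utf8')  # unicode keys, matching what spaCy expects
            embeddings[word] = np.asarray(parts[1:], dtype='float32')
    return embeddings

dep_vectors = read_text_embeddings('deps.words')
print len(dep_vectors)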
It would be great to be able to load pretrained word representations, at least the ones from GloVe [1] and word2vec [2]. In the same fashion, it would be useful to have a most_similar method able to efficiently retrieve the top n similar words.

[1] http://nlp.stanford.edu/projects/glove/
[2] https://code.google.com/p/word2vec/
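For comparison, gensim already covers the word2vec side of this request outside spaCy. A minimal sketch using gensim's 2015-era API (the file path is a placeholder for whichever pretrained vectors you have):

# Sketch: load pretrained word2vec-format vectors with gensim and query nearest neighbours.
from gensim.models.word2vec import Word2Vec

model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print model.most_similar(u'query', topn=5)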