Question: Difference between Spacy WordVec and Gensim/Google WordVec #338

Closed
ArdavanA opened this issue Apr 13, 2016 · 8 comments

Comments

@ArdavanA

Hi,

Thanks a lot for your fantastic tool, keep up the good work!
I'd like to ask what the difference is between the Google word vector library ( https://code.google.com/archive/p/word2vec/ ) and the one you use in spaCy.

Kind regards

@elyase
Contributor

elyase commented Apr 14, 2016

Google's word2vec can generate word vectors from text. spaCy makes it easy to load these and other word vectors so that you can use them in your NLP tasks.

By default, spaCy currently loads vectors produced by the Levy and Goldberg (2014) dependency-based word2vec model, but you can also load Google's word2vec or GloVe vectors. Please see this blog post for more details on how to do that:

https://spacy.io/docs/tutorials/load-new-word-vectors
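
For anyone unsure what "loading word vectors" involves at the file level, here is a minimal sketch that parses the plain-text word2vec format using only the standard library. It is not how spaCy does it internally; `read_word2vec_text` is a hypothetical helper and the two-word table is made up for illustration. The format assumed here starts with a "count dims" header line (GloVe's own text files, as far as I know, leave that header out).

```python
import io

def read_word2vec_text(handle):
    """Parse the plain-text word2vec format: a "count dims" header line,
    then one "word v1 v2 ..." line per entry."""
    # The word count from the header is read but not checked in this sketch.
    n_words, n_dims = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        if len(parts) != n_dims + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return n_dims, vectors

# Tiny made-up table standing in for a real vectors file.
sample = io.StringIO("2 3\nking 0.1 0.2 0.3\nqueen 0.1 0.25 0.3\n")
dims, table = read_word2vec_text(sample)
```

With a real file you would pass an open file handle instead of the `io.StringIO` object.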

@ArdavanA
Author

Thanks Yasser

@honnibal
Member

Easiest way to load GloVe vectors is now:

import spacy

nlp = spacy.load('en', vectors='en_glove_cc_300_1m')

This will load a subset of the GloVe Common Crawl vectors; it'll give you vectors for 1 million words. This is a large vocabulary and you should get high coverage with it, without the crazy memory requirements of the original unpruned data.
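
To put the memory point in rough numbers, here is a back-of-the-envelope sketch. It assumes the values are stored as 32-bit floats and reads the 300 in the package name as the vector dimensionality; both are assumptions, not something stated above. It counts only the raw values and ignores the word strings and any index overhead. `vector_table_bytes` is a hypothetical helper, not part of spaCy.

```python
def vector_table_bytes(n_words, n_dims, bytes_per_value=4):
    """Raw size of a dense vector table; 4 bytes per value assumes float32."""
    return n_words * n_dims * bytes_per_value

# 1m words x 300 dimensions, as suggested by 'en_glove_cc_300_1m'.
pruned_gb = vector_table_bytes(1_000_000, 300) / 1e9
print("%.1f GB" % pruned_gb)
```

The size grows linearly with the vocabulary, so an unpruned table with more words costs proportionally more.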

This function isn't well documented yet, because we've only recently stabilised the API. I'll fix the blog post.

@deeprnd

deeprnd commented May 11, 2016

This doesn't work and throws an exception:

name = 'en_glove_cc_300_1m'
    def get_lang_class(name):
        lang = re.split('[^a-zA-Z0-9_]', name, 1)[0]
        if lang not in LANGUAGES:
            raise RuntimeError('Language not supported: %s' % lang)
RuntimeError: Language not supported: en_glove_cc_300_1m

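The reason is the regex: the character class '[^a-zA-Z0-9_]' treats '_' as part of the name, so the string is never split at the underscore. The pattern should be just '_', which works for both 'en' and 'en_glove_cc_300_1m' and returns the desired 'en' in each case.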
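
A quick standalone check of the two patterns, using only the standard `re` module (this reproduces just the split, not the rest of spaCy's lookup code):

```python
import re

name = 'en_glove_cc_300_1m'

# Pattern from the traceback: '_' sits inside the negated class, so it is
# never treated as a separator and the whole name comes back unchanged.
original = re.split('[^a-zA-Z0-9_]', name, maxsplit=1)[0]

# Proposed pattern: split on the first underscore only.
proposed = re.split('_', name, maxsplit=1)[0]

print(original)  # en_glove_cc_300_1m
print(proposed)  # en
```

A plain 'en' contains no underscore, so both patterns return it unchanged.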

However, even after fixing the regex, there is another exception:

name = 'en_glove_cc_300_1m', via = None
    def get_package_by_name(name=None, via=None):
        if name is None:
            return
        lang = get_lang_class(name)
        try:
            return sputnik.package(about.title, about.version,
                                   name, data_path=via)
        except PackageNotFoundException as e:
            raise RuntimeError("Model '%s' not installed. Please run 'python -m "
                               "%s.download' to install latest compatible "
                               "model." % (name, lang.module))
RuntimeError: Model 'en_glove_cc_300_1m' not installed. Please run 'python -m spacy.en.download' to install latest compatible model.

Running "python -m spacy.en.download --force all" doesn't help.

I'm running version 0.101.0.
Any thoughts?

@daylen
Contributor

daylen commented May 16, 2016

Ran into the same issue. Per @aie0's suggestion, I changed lang = re.split('[^a-zA-Z0-9_]', name, 1)[0] to lang = re.split('_', name, 1)[0]. I also used nlp = spacy.load('en', vectors='en_glove_cc_300_1m_vectors') instead of nlp = spacy.load('en', vectors='en_glove_cc_300_1m'). The extra _vectors suffix did the trick for me.

@honnibal
Member

This should all be cleaned up in 1.0: the GloVe vectors are installed by default, and it's much easier to use different vectors.

@anujshah1003

I always get this error, even after installing 'en':
ValueError: Word vectors set to length 0. This may be because the data is not installed. If you haven't already, run
python -m spacy.en.download all
to install the data.

@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018