build_vocab with custom word embedding #201

Open
DaehanKim opened this Issue Jan 8, 2018 · 7 comments


DaehanKim commented Jan 8, 2018

Hi :)
I want to use a custom bio-word embedding for some text classification, but I can't find out how.

An old tutorial mentions a 'wv_dir' keyword argument, which I tried without success:

TypeError                                 Traceback (most recent call last)
<ipython-input-48-ac09f554719e> in <module>()
      1 test_field = data.Field()
      2 lang_data = datasets.LanguageModelingDataset(path='pr_data/processed_neg.txt',text_field=test_field)
----> 3 voc = torchtext.vocab.Vocab(wv_dir='bio_wordemb/PubMed-and-PMC-w2v.txt')
      4 
      5 # test_field.build_vocab(lang_data,wv_dir='bio_wordemb/PubMed-and-PMC-w2v.txt')

TypeError: __init__() got an unexpected keyword argument 'wv_dir'

Just as we can load pretrained GloVe embeddings using TEXTFIELD.build_vocab(data, vectors='glove.6B.100d'), is there a similar way to load a custom embedding?

Any help would be much appreciated. Thanks!

jekbradbury added the bug label Jan 9, 2018


Eric-Wallace commented Jan 14, 2018

To my knowledge, the wv_dir argument has been removed. In my opinion, the best way to get custom vectors is to make changes to the vocab.py file yourself. See, for example, how FastText is added:

class FastText(Vectors):
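
For illustration, a minimal subclass along those lines might look like this (just a sketch: the class name is a placeholder, the path is the one from the question above, and it assumes Vectors accepts a path to an existing local file as its name argument):

import os
from torchtext.vocab import Vectors

class BioWordVectors(Vectors):
    # Hypothetical subclass for a local word2vec-format text file,
    # modeled on how FastText subclasses Vectors in vocab.py.
    def __init__(self, path='bio_wordemb/PubMed-and-PMC-w2v.txt', **kwargs):
        # Vectors takes the vector file as `name` and caches the parsed
        # tensors (a .pt file) in the `cache` directory.
        super(BioWordVectors, self).__init__(
            path, cache=os.path.dirname(path), **kwargs)

You could then pass an instance as test_field.build_vocab(lang_data, vectors=BioWordVectors()).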


DaehanKim commented Jan 16, 2018

Thank you. I built a custom extractor and solved the problem.

DaehanKim closed this Jan 16, 2018

jekbradbury reopened this Jan 21, 2018


tomekkorbak commented Mar 26, 2018

@jekbradbury, would it be valuable to work on a PR with better support for using custom word embeddings? If so, I'd be happy to work on this.


t-vi commented Jun 6, 2018

Unless I'm missing something subtle, the obvious right thing to do here is

                if os.path.isfile(vector):
                    vectors[idx] = Vectors(vector, **kwargs)
                else:
                    # ... do the predefined stuff currently there but better use if ... elif instead...

in Vocab.load_vectors when you have a string argument.
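
Concretely, the string-handling branch could look roughly like this (a sketch of the idea only, not the actual torchtext code; pretrained_aliases is the existing dict of predefined names such as 'glove.6B.100d'):

import os
from torchtext.vocab import Vectors, pretrained_aliases

# inside the loop over `vectors` in Vocab.load_vectors (sketch)
if isinstance(vector, str):
    if os.path.isfile(vector):
        # a path to a local embedding file: wrap it directly
        vectors[idx] = Vectors(vector, **kwargs)
    elif vector in pretrained_aliases:
        # one of the predefined names, e.g. 'glove.6B.100d'
        vectors[idx] = pretrained_aliases[vector](**kwargs)
    else:
        raise ValueError(
            "Got string vector {!r}; expected a predefined alias or a "
            "path to an existing vector file".format(vector))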

Needless to say, I'd be happy to send a PR with updated docs etc.


pablogps commented Sep 10, 2018

Is there a way to use a pre-trained embedding that does not involve changing the source code? This seems like pretty important functionality! As a quick fix, I would be very grateful if a pre-trained Spanish embedding were included in the default choices.


bebound commented Dec 14, 2018

This works; xxx.vec should be a file in the standard word2vec text format.

from torchtext.vocab import Vectors

vectors = Vectors(name='xxx.vec', cache='./')
TEXT.build_vocab(train, val, test, vectors=vectors)
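
Once build_vocab has run, the aligned matrix is available as TEXT.vocab.vectors and can initialize an embedding layer, e.g. (plain PyTorch, shown only as a usage sketch):

import torch.nn as nn

# TEXT.vocab.vectors is a (vocab_size, embedding_dim) tensor aligned with
# the vocabulary built above; copy it into the embedding weights
embedding = nn.Embedding(len(TEXT.vocab), TEXT.vocab.vectors.size(1))
embedding.weight.data.copy_(TEXT.vocab.vectors)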

pablogps commented Dec 14, 2018

Yes, you are right! There is a bit more info here.
