
Various Vocab and Vector Enhancements #102

Merged: 41 commits, Sep 11, 2017

Conversation

nelson-liu
Contributor

@nelson-liu nelson-liu commented Sep 3, 2017

This PR provides a few enhancements to the Vocab and Vector classes:

  1. Adds some missing default values to the Vocab docstring.

  2. Adds a test to ensure that the min_freq and specials Vocab construction args work as intended.

  3. Tests that non-standard unicode input works with Vocab.

  4. Replace print statements with logging statements. This could be a bit controversial, but I think it'd be prudent to adopt logging statements project-wide, as they offer users finer-grained control over what gets logged to file/stdout, whereas if we just use print they have no such choice.

  5. Split Vocab tests into one that tests Vocab functionality and one that tests Vector functionality (these were previously merged/conflated).

  6. Add FastText vectors (en and simple, for English and Simple English) to the list of pretrained vectors (more rationale on this in the next bullet point).

  7. Remove the test files that were previously used in .vector_cache, and instead download real vectors from the internet for use in the test. Previously, this was infeasible on CI since downloading GloVe vectors takes forever (nlp.stanford.edu is not very fast, and the files are large). I got around this issue by testing FastText downloads instead, since they're hosted on Amazon S3 (much faster!) and the files are a lot smaller.

  8. This is probably the most controversial change, so let's discuss. Previously, we were opening the vectors as binary files and calling .decode('utf8') for every word, which is quite slow. I've replaced this with io.open(path, encoding='utf8'), which just naturally opens the file for reading in utf8.
    The upside to this is that it's much, much faster. The downside is that we can't skip lines with bad unicode. However, I'm not convinced that skipping malformed lines is even a good thing --- the pretrained vectors shouldn't have bad unicode, and in the future when we let users load their own embeddings they should have sorted out the unicode issues by themselves. Have you had experiences where you needed that behavior (skipping malformed lines) in order to properly read the vectors?

This PR has a lot of changes all over the place, which probably makes it a bit tough to review --- sorry about that. I've tried to keep features relatively isolated in respective commits.
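For reference, the print-to-logging switch described in point 4 might look like this minimal sketch (the logger name and helper function are illustrative, not the PR's actual code):

```python
import logging

# Hypothetical logger; torchtext's actual module/logger names may differ.
logger = logging.getLogger("torchtext.vocab")

def report_progress(loaded, total):
    # Unlike a bare print(), users can silence, redirect, or reformat
    # this message through the standard logging configuration.
    logger.info("Loaded %d / %d vectors", loaded, total)
```

With `logging.basicConfig(level=logging.INFO)` the message reaches stderr much like the old print did, but callers who want quiet output can simply raise the log level or attach their own handlers.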

@jekbradbury
Contributor

Yes, I think either one of the pretrained GloVe files or a pretrained word2vec file I've used in the past has a line with malformed Unicode, but I can't find it now. We should definitely move to your fix, but maybe it's possible to bail and start the slower decoding on failure?

@nelson-liu
Contributor Author

Yes, I think either one of the pretrained GloVe files or a pretrained word2vec file I've used in the past has a line with malformed Unicode, but I can't find it now.

I recall having this issue as well at some point, but I can't remember which set of pretrained vectors it was / how I fixed it... I'll try to see if I can find it in a bit.

maybe it's possible to bail and start the slower decoding on failure?

Good idea, done.
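The agreed-on fallback (fast utf-8 text mode first, bailing out to per-line binary decoding if the file contains malformed unicode) could be sketched like this; the function name is illustrative and this is not the PR's exact code:

```python
import io

def read_vector_lines(path):
    """Read lines from a pretrained-vector file.

    Fast path: open the file directly as utf-8 text. If any line has
    malformed utf-8, restart in binary mode and decode each line
    individually, skipping the lines that fail (the old behavior).
    """
    try:
        with io.open(path, encoding="utf8") as f:
            return f.readlines()
    except UnicodeDecodeError:
        lines = []
        with io.open(path, "rb") as f:
            for raw in f:
                try:
                    lines.append(raw.decode("utf8"))
                except UnicodeDecodeError:
                    continue  # skip malformed lines, as the old code did
        return lines
```

Well-formed files never pay the per-line decode cost, while a single bad line only costs one extra pass over the file.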

@nelson-liu
Contributor Author

So there was a glove release that claimed to fix some utf-8 issues (more details at https://groups.google.com/forum/#!topic/globalvectors/C9Hh4VzcsQQ). Maybe this is what we were remembering?

Also, this topic on the glove mailing list claims that there are anomalous lines in twitter.25d. I edited the tests to use these vectors and ran them on python 2.7/3.5/3.6 and found nothing odd.

nelson-liu and others added 24 commits September 7, 2017 17:15
Fix fasttext unzipping

Remove copy, since we're copying same src and dest

Copy downloaded txt or vec to fname_txt

Skip headers / 1-dimensional vectors

Actually skip headers

Use rstrip and explicitly split on space
Fix test script syntax error

Fix allowed failures

Add travis_wait to slow tests to prevent build timeouts

Instead of using travis_wait, just write output to stdout in slow tests

Fix conflict in travis.yml

Use TorchtextTestCase in test_vocab

Revert "Use TorchtextTestCase in test_vocab"

This reverts commit f899511.

Use TorchtextTestCase in test_vocab

Add project_root as TorchtextTestCase member

Delete charngram cache after tests on CI

Fix .travis.yml to include slow tests

Try using non-container builds for python 2.7 slow tests (more mem)

Remove glove and fasttext vectors on CI after test
@nelson-liu
Contributor Author

nelson-liu commented Sep 8, 2017

To explain for future reference what I've done with the allowed failures in the Travis config:

Testing glove + charngram is pretty slow, since it takes a while to download and unzip the vectors. As a result, I set them as "allowed failures" --- if all the other tests pass, CI will go green. However, these allowed failures will still run, and coverage for the commit will be updated when they finish. This lets us have tests for w2v / charngram vectors + accurate coverage reporting + still maintain speedy build times for PRs where we don't even touch vector behavior.
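A hypothetical .travis.yml fragment for this setup (the job and env-var names are illustrative, not the repository's actual config):

```yaml
matrix:
  allow_failures:
    # Slow vector-download jobs still run and still report coverage,
    # but a red result here does not block the overall build status.
    - env: RUN_SLOW_TESTS=true TOXENV=py27
    - env: RUN_SLOW_TESTS=true TOXENV=py36
```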


@nelson-liu
Contributor Author

nelson-liu commented Sep 10, 2017

To summarize the latest round of changes:

Cherry-picked #114 and added a test, so merging this will merge that PR as well.

Vocab.stoi didn't have an entry for "<unk>". While I know in practice it essentially is there (since the defaultdict would return 0 for "<unk>" due to it not being there), I added an "<unk>": 0 mapping explicitly so stoi is the mirror of itos, as one would expect.

I got rid of the lambda in Vocab, and added an __eq__ function to compare different Vocab instances. I added a test to assert that creating a Vocab, pickling it, and loading it back gives you an equivalent Vocab.
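The lambda mattered because a defaultdict whose default_factory is a lambda cannot be pickled; a module-level function fixes that. A toy sketch of the three changes (explicit "<unk>" entry, no lambda, an __eq__), with illustrative names rather than the PR's actual code:

```python
import pickle
from collections import defaultdict

def _default_unk_index():
    # Module-level function instead of a lambda, so the defaultdict
    # (and any Vocab holding it) survives a pickle round trip.
    return 0

class Vocab(object):
    """Toy stand-in for torchtext's Vocab, illustrating the fixes above."""

    def __init__(self, tokens):
        self.itos = ["<unk>"] + list(tokens)
        self.stoi = defaultdict(_default_unk_index)
        # Explicit "<unk>": 0 entry so stoi is the exact mirror of itos,
        # rather than relying on the defaultdict's fallback value.
        self.stoi.update({tok: i for i, tok in enumerate(self.itos)})

    def __eq__(self, other):
        return self.itos == other.itos and dict(self.stoi) == dict(other.stoi)
```

With this shape, `pickle.loads(pickle.dumps(v)) == v` holds, which is essentially what the new round-trip test asserts.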

Sorry for these last changes; this PR should now really be ready to look at. Ran into the <unk> issue when I was finishing up forthcoming tests for Field, and realized I'd forgotten to fix the lambda while I was at it.

@jekbradbury jekbradbury merged commit 1eebbbf into pytorch:master Sep 11, 2017
@nelson-liu nelson-liu deleted the vocab_fixes branch September 13, 2017 23:11