
Various Vocab and Vector Enhancements #102

Merged: 41 commits, Sep 11, 2017

Conversation

nelson-liu
Contributor

@nelson-liu nelson-liu commented Sep 3, 2017

This PR provides a few enhancements to the Vocab and Vector classes:

  1. Adds some missing default values to the Vocab docstring.

  2. Adds a test to ensure that the min_freq and specials Vocab construction args work as intended.

  3. Tests that non-standard unicode input works with Vocab.

  4. Replace print statements with logging statements. This could be a bit controversial, but I think it'd be prudent to adopt logging statements project-wide, as they offer users finer-grained control over what gets logged to file/stdout, whereas if we just use print they have no such choice.

  5. Split Vocab tests into one that tests Vocab functionality and one that tests Vector functionality (these were previously merged/conflated).

  6. Add FastText vectors (en and simple, for English and Simple English) to the list of pretrained vectors (more rationale on this in the next bullet point).

  7. Remove the test files that were previously used in .vector_cache, and instead download real vectors from the internet for use in the test. Previously, this was infeasible on CI since downloading GloVe vectors takes forever (nlp.stanford.edu is not very fast, and the files are large). I got around this issue by testing FastText downloads instead, since they're hosted on Amazon S3 (much faster!) and the files are a lot smaller.

  8. This is probably the most controversial change, so let's discuss. Previously, we were opening the vectors as binary files and calling .decode('utf8') for every word, which is quite slow. I've replaced this with io.open(path, encoding='utf8'), which just naturally opens the file for reading in utf8.
    The upside to this is that it's much, much faster. The downside is that we can't skip lines with bad unicode. However, I'm not convinced that skipping malformed lines is even a good thing --- the pretrained vectors shouldn't have bad unicode, and in the future when we let users load their own embeddings they should have sorted out the unicode issues by themselves. Have you had experiences where you needed that behavior (skipping malformed lines) in order to properly read the vectors?

This PR has a lot of changes all over the place, which probably makes it a bit tough to review --- sorry about that. I've tried to keep features relatively isolated in respective commits.
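For reference, the print-to-logging switch described in point 4 might look like this minimal sketch (the logger name and helper function are illustrative, not the PR's actual code):

```python
import logging

# Hypothetical logger; torchtext's actual module/logger names may differ.
logger = logging.getLogger("torchtext.vocab")

def report_progress(loaded, total):
    # Unlike a bare print(), users can silence, redirect, or reformat
    # this message through the standard logging configuration.
    logger.info("Loaded %d / %d vectors", loaded, total)
```

With `logging.basicConfig(level=logging.INFO)` the message reaches stderr much like the old print did, but callers who want quiet output can simply raise the log level or attach their own handlers.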

@jekbradbury
Contributor

Yes, I think either one of the pretrained GloVe files or a pretrained word2vec file I've used in the past has a line with malformed Unicode, but I can't find it now. We should definitely move to your fix, but maybe it's possible to bail and start the slower decoding on failure?

@nelson-liu
Contributor Author

Yes, I think either one of the pretrained GloVe files or a pretrained word2vec file I've used in the past has a line with malformed Unicode, but I can't find it now.

I recall having this issue as well at some point, but I can't remember which set of pretrained vectors it was / how I fixed it... I'll try to see if I can find it in a bit.

maybe it's possible to bail and start the slower decoding on failure?

Good idea, done.
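The agreed-on fallback (fast utf-8 text mode first, bailing out to per-line binary decoding if the file contains malformed unicode) could be sketched like this; the function name is illustrative and this is not the PR's exact code:

```python
import io

def read_vector_lines(path):
    """Read lines from a pretrained-vector file.

    Fast path: open the file directly as utf-8 text. If any line has
    malformed utf-8, restart in binary mode and decode each line
    individually, skipping the lines that fail (the old behavior).
    """
    try:
        with io.open(path, encoding="utf8") as f:
            return f.readlines()
    except UnicodeDecodeError:
        lines = []
        with io.open(path, "rb") as f:
            for raw in f:
                try:
                    lines.append(raw.decode("utf8"))
                except UnicodeDecodeError:
                    continue  # skip malformed lines, as the old code did
        return lines
```

Well-formed files never pay the per-line decode cost, while a single bad line only costs one extra pass over the file.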

@nelson-liu
Contributor Author

So there was a glove release that claimed to fix some utf-8 issues (more details at https://groups.google.com/forum/#!topic/globalvectors/C9Hh4VzcsQQ). Maybe this is what we were remembering?

Also, this topic on the glove mailing list claims that there are anomalous lines in twitter.25d. I edited the tests to use these vectors and ran them on python 2.7/3.5/3.6 and found nothing odd.

nelson-liu and others added 24 commits September 7, 2017 17:15
Fix fasttext unzipping

Remove copy, since we're copying same src and dest

Copy downloaded txt or vec to fname_txt

Skip headers / 1-dimensional vectors

Actually skip headers

Use rstrip and explicitly split on space
Fix test script syntax error

Fix allowed failures

Add travis_wait to slow tests to prevent build timeouts

Instead of using travis_wait, just write output to stdout in slow tests

Fix conflict in travis.yml

Use TorchtextTestCase in test_vocab

Revert "Use TorchtextTestCase in test_vocab"

This reverts commit f899511.

Use TorchtextTestCase in test_vocab

Add project_root as TorchtextTestCase member

Delete charngram cache after tests on CI

Fix .travis.yml to include slow tests

Try using non-container builds for python 2.7 slow tests (more mem)

Remove glove and fasttext vectors on CI after test
@nelson-liu
Contributor Author

nelson-liu commented Sep 8, 2017

To explain for future reference what I've done with the allowed failures in the Travis config:

Testing glove + charngram is pretty slow, since it takes a while to download and unzip the vectors. As a result, I set them as "allowed failures" --- if all the other tests pass, CI will go green. However, these allowed failures will still run, and coverage for the commit will be updated when they finish. This lets us have tests for w2v / charngram vectors + accurate coverage reporting + still maintain speedy build times for PRs where we don't even touch vector behavior.
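A hypothetical .travis.yml fragment for this setup (the job and env-var names are illustrative, not the repository's actual config):

```yaml
matrix:
  allow_failures:
    # Slow vector-download jobs still run and still report coverage,
    # but a red result here does not block the overall build status.
    - env: RUN_SLOW_TESTS=true TOXENV=py27
    - env: RUN_SLOW_TESTS=true TOXENV=py36
```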


@nelson-liu
Contributor Author

nelson-liu commented Sep 10, 2017

To summarize the latest round of changes:

Cherry-picked #114 and added a test, so merging this will merge that PR as well.

Vocab.stoi didn't have an entry for "<unk>". While I know in practice it essentially is there (since the defaultdict would return 0 for "<unk>" due to it not being there), I added an "<unk>": 0 mapping explicitly so stoi is the mirror of itos, as one would expect.

I got rid of the lambda in Vocab, and added an __eq__ function to compare different Vocab instances. I added a test to assert that creating a Vocab, pickling it, and loading it back gives you an equivalent Vocab.
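The lambda mattered because a defaultdict whose default_factory is a lambda cannot be pickled; a module-level function fixes that. A toy sketch of the three changes (explicit "<unk>" entry, no lambda, an __eq__), with illustrative names rather than the PR's actual code:

```python
import pickle
from collections import defaultdict

def _default_unk_index():
    # Module-level function instead of a lambda, so the defaultdict
    # (and any Vocab holding it) survives a pickle round trip.
    return 0

class Vocab(object):
    """Toy stand-in for torchtext's Vocab, illustrating the fixes above."""

    def __init__(self, tokens):
        self.itos = ["<unk>"] + list(tokens)
        self.stoi = defaultdict(_default_unk_index)
        # Explicit "<unk>": 0 entry so stoi is the exact mirror of itos,
        # rather than relying on the defaultdict's fallback value.
        self.stoi.update({tok: i for i, tok in enumerate(self.itos)})

    def __eq__(self, other):
        return self.itos == other.itos and dict(self.stoi) == dict(other.stoi)
```

With this shape, `pickle.loads(pickle.dumps(v)) == v` holds, which is essentially what the new round-trip test asserts.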

Sorry for these last changes; this PR should now really be ready to look at. Ran into the <unk> issue when I was finishing up forthcoming tests for Field, and realized I'd forgotten to fix the lambda while I was at it.

@jekbradbury jekbradbury merged commit 1eebbbf into pytorch:master Sep 11, 2017
@nelson-liu nelson-liu deleted the vocab_fixes branch September 13, 2017 23:11