Various Vocab and Vector Enhancements #102
Conversation
Yes, I think either one of the pretrained GloVe files or a pretrained word2vec file I've used in the past has a line with malformed Unicode, but I can't find it now. We should definitely move to your fix, but maybe it's possible to bail and start the slower decoding on failure?
Force-pushed from c1397d4 to 6db1c5d
I recall having this issue as well at some point, but I can't remember which set of pretrained vectors it was / how I fixed it... I'll try to see if I can find it in a bit.
Good idea, done.
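The "bail and fall back to slower decoding on failure" idea agreed on above could look something like this. This is an illustrative sketch only; the function name and structure are assumptions, not the actual commit:

```python
import io
import logging

logger = logging.getLogger(__name__)


def read_vector_lines(path):
    """Read pretrained-vector lines, preferring the fast utf-8 text-mode
    path, and falling back to byte-wise decoding (skipping bad lines)
    only if the file contains malformed unicode.

    Hypothetical helper for illustration, not the torchtext implementation.
    """
    try:
        with io.open(path, encoding="utf8") as f:
            return f.readlines()
    except UnicodeDecodeError:
        logger.warning(
            "Malformed unicode in %s; falling back to byte-wise decoding", path)
        lines = []
        with open(path, "rb") as f:
            for raw in f:
                try:
                    lines.append(raw.decode("utf8"))
                except UnicodeDecodeError:
                    continue  # skip lines with bad unicode
        return lines
```

This keeps the fast `io.open` path for well-formed files and only pays the per-line decoding cost when a file actually triggers a `UnicodeDecodeError`.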
So there was a glove release that claimed to fix some utf-8 issues (more details at https://groups.google.com/forum/#!topic/globalvectors/C9Hh4VzcsQQ). Maybe this is what we were remembering? Also, this topic on the glove mailing list claims that there are anomalous lines in twitter.25d. I edited the tests to use these vectors and ran them on python 2.7/3.5/3.6 and found nothing odd.
- Fix fasttext unzipping
- Remove copy, since we're copying same src and dest
- Copy downloaded txt or vec to fname_txt
- Skip headers / 1-dimensional vectors
- Actually skip headers
- Use rstrip and explicitly split on space
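The last commit's rstrip/split change matters because bare `split()` collapses runs of *any* unicode whitespace, which mis-parses vector lines whose token is itself a whitespace character. A small sketch (the example line is hypothetical, not taken from a real vector file):

```python
# A pretrained-vector line whose token is itself a unicode whitespace
# character (U+00A0, non-breaking space).
line = "\u00a0 0.1 0.2\n"

# Bare split() treats U+00A0 as whitespace, so the token vanishes and
# the line looks like a two-column vector with no word:
print(line.split())              # ['0.1', '0.2']

# rstrip() + an explicit split on a single space preserves the token:
print(line.rstrip().split(" "))  # ['\xa0', '0.1', '0.2']
```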
Force-pushed from 5fc9c26 to 2f6d9cb
- Fix test script syntax error
- Fix allowed failures
- Add travis_wait to slow tests to prevent build timeouts
- Instead of using travis_wait, just write output to stdout in slow tests
- Fix conflict in travis.yml
- Use TorchtextTestCase in test_vocab
- Revert "Use TorchtextTestCase in test_vocab" (reverts commit f899511)
- Use TorchtextTestCase in test_vocab
- Add project_root as TorchtextTestCase member
- Delete charngram cache after tests on CI
- Fix .travis.yml to include slow tests
- Try using non-container builds for python 2.7 slow tests (more mem)
- Remove glove and fasttext vectors on CI after test
To explain for future reference what I've done with the allowed failures in the Travis config: testing glove + charngram is pretty slow, since it takes a while to download and unzip the vectors. As a result, I set them as "allowed failures": if all the other tests pass, CI will go green. However, these allowed failures will still run, and coverage for the commit will be updated when they finish. This lets us have tests for w2v / charngram vectors, accurate coverage reporting, and speedy build times for PRs that don't even touch vector behavior.
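The arrangement described above would look roughly like this in `.travis.yml`. This is a hedged sketch: the env variable name and Python versions are assumptions, not the repo's actual config.

```yaml
# Sketch of the allowed-failures setup described above (names assumed).
matrix:
  include:
    - python: "3.6"
      env: RUN_SLOW=true   # downloads glove + charngram vectors
  allow_failures:
    # Slow vector-download jobs still run (and still update coverage),
    # but a failure or timeout here will not turn the build red.
    - env: RUN_SLOW=true
```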
test/test_vocab.py (outdated)

```python
assert_allclose(vectors[v.stoi['<unk>']], np.zeros(300))
# Delete the vectors after we're done to save disk space on CI
if os.environ.get("TRAVIS") == "true":
    os.remove(os.path.join(self.project_root, ".vector_cache",
```
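The cleanup pattern in the snippet above (whose final argument is truncated) could be factored into a small helper along these lines. The helper and its `fname` parameter are hypothetical, for illustration only:

```python
import os


def remove_cached_vectors(project_root, fname):
    """Delete a cached vector file on CI to save disk space.

    Hypothetical helper sketching the pattern in the review snippet;
    'fname' is not the actual (truncated) argument from the diff.
    """
    if os.environ.get("TRAVIS") == "true":
        path = os.path.join(project_root, ".vector_cache", fname)
        if os.path.exists(path):
            os.remove(path)
```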
This reverts commit f28ebdf.
To summarize the latest round of changes: cherry-picked #114 and added a test, so merging this will merge that PR as well.
I got rid of the lambda in

Sorry for these last changes; this PR should now really be ready to look at. Ran into the
This PR provides a few enhancements to the `Vocab` and `Vector` classes:

- Adds some missing default values to the `Vocab` docstring.
- Adds a test to ensure that the `min_freq` and `specials` `Vocab` construction args work.
- Tests that non-standard unicode input works with `Vocab`.
- Replaces print statements with logging statements. This could be a bit controversial, but I think it would be prudent to adopt logging project-wide: it gives users finer-grained control over what gets logged to file/stdout, whereas with plain print they have no such choice.
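The control that logging offers over print can be illustrated with a minimal sketch (the logger name and `load_vectors` function here are hypothetical, not the actual torchtext code):

```python
import logging

# Module-level logger; users can silence or redirect it per-module.
logger = logging.getLogger("torchtext.vocab")


def load_vectors(name):
    # With print() this message would always hit stdout; with logging,
    # the user chooses the level, handler, and destination.
    logger.info("Loading vectors from %s", name)
    return []


# A downstream user can raise the threshold to suppress info messages:
logging.basicConfig(level=logging.WARNING)
load_vectors("glove.6B.300d")  # emits nothing at WARNING level
```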
- Splits `Vocab` tests into one that tests `Vocab` functionality and one that tests `Vector` functionality (these were previously merged/conflated).
- Adds `FastText` vectors (`en` and `simple`, for English and Simple English) to the list of pretrained vectors (more rationale on this in the next bullet point).
- Removes the test files that were previously used in `.vector_cache`, and instead downloads real vectors from the internet for use in the test. Previously this was hard to do feasibly on CI, since downloading glove vectors takes forever (`nlp.stanford.edu` is not very fast and the files are large). I got around this by testing FastText downloads instead, since they're hosted on Amazon S3 (much faster!) and the filesize is a lot smaller.
- This last one is probably the most controversial change, so let's discuss. Previously, we were opening the vectors as binary files and calling `.decode('utf8')` on every word, which is quite slow. I've replaced this with `io.open(path, encoding='utf8')`, which simply opens the file for reading in `utf8`. The upside is that this is much, much faster. The downside is that we can't skip lines with bad unicode. However, I'm not convinced that skipping malformed lines is even a good thing: the pretrained vectors shouldn't have bad unicode, and in the future, when we let users load their own embeddings, they should have sorted out any unicode issues themselves. Have you had experiences where you needed that behavior (skipping malformed lines) in order to properly read the vectors?
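The two decoding approaches discussed in the last bullet can be sketched side by side. Function and variable names are illustrative, not the actual diff:

```python
import io


def read_slow(path):
    """Old approach: binary read + per-token decode. Slower, but a
    malformed line can be caught and skipped."""
    vectors = {}
    with open(path, "rb") as f:
        for line in f:
            parts = line.rstrip().split(b" ")
            try:
                word = parts[0].decode("utf8")
            except UnicodeDecodeError:
                continue  # skip malformed lines
            vectors[word] = [float(x) for x in parts[1:]]
    return vectors


def read_fast(path):
    """New approach: let io.open decode the whole stream as utf-8.
    Much faster, but a single malformed line raises UnicodeDecodeError
    for the entire file."""
    vectors = {}
    with io.open(path, encoding="utf8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors
```

On well-formed input both produce the same result; they only diverge when a file contains bytes that aren't valid utf-8.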
This PR has a lot of changes all over the place, which probably makes it a bit tough to review; sorry about that. I've tried to keep features relatively isolated in their respective commits.