Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataset fixes and tests #105

Merged
merged 13 commits into from
Sep 8, 2017
Merged

Conversation

nelson-liu
Copy link
Contributor

This PR adds some tests to the Dataset and Example classes. It also sets up a TorchtextTestCase base test case class that is used as a central repository for writing synthetic test data to a temporary directory when running tests.

I've also made a few minor changes to the class behavior:

  1. Previously the dataset code would silently skip input fields that didn't exist in the JSON data. I'm curious if there's a real dataset that this makes easier to work with --- it certainly seems like it could be a pretty big source of bugs for the user, so I've changed it to error when the dataset tries to read a JSON key not actually in the data.

  2. Reading CSV data with Unicode characters previously didn't work with Python 2, since the csv module in python 2 doesn't support unicode input. I've fixed it (by encoding to utf8, running it through the CSV reader, then decoding back to unicode) and added a non-regression test for it.

  3. Making examples from CSV and TSV lines would previously remove the last character if it was \n. I've changed this to use line.rstrip("\n").

Copy link
Contributor

@jekbradbury jekbradbury left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like all these changes. I wish there were a nicer way to deal with the Python 2 CSV thing, but it looks like what you have works consistently.

@jekbradbury jekbradbury merged commit 3aa4d13 into pytorch:master Sep 8, 2017
@nelson-liu nelson-liu deleted the dataset_tests branch September 8, 2017 00:13
jekbradbury added a commit that referenced this pull request Sep 8, 2017
jekbradbury added a commit that referenced this pull request Sep 8, 2017
@nelson-liu nelson-liu restored the dataset_tests branch September 8, 2017 00:24
@nelson-liu nelson-liu deleted the dataset_tests branch September 8, 2017 00:25
bmccann pushed a commit to bmccann/text that referenced this pull request Sep 11, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants