
Conversation

@pmeier pmeier (Collaborator) commented Jun 12, 2019

This adds tests for the CIFAR10 and CIFAR100 datasets, as mentioned in #1006.

@fmassa fmassa (Member) left a comment

Looks great, thanks a lot for the awesome PR!

Just waiting for CI to finish and then this is good to merge

@pmeier pmeier (Collaborator, Author) commented Jun 12, 2019

It seems that the pickle.load() call within the dataset does not work under Python 2. I don't have an environment set up to test Python 2. @fmassa Could you check this?

@fmassa fmassa (Member) commented Jun 12, 2019

@pmeier the pickle file seems to be Python 3-only.

I see two options:

  • re-save and re-upload the pickle with pickle protocol 2 (https://docs.python.org/3/library/pickle.html#data-stream-format); a sketch of this follows below the list
  • add the logic to generate dummy files in the test code, like in:

    @contextlib.contextmanager
    def get_mnist_data(num_images, cls_name, **kwargs):
        def _encode(v):
            return torch.tensor(v, dtype=torch.int32).numpy().tobytes()[::-1]

        def _make_image_file(filename, num_images):
            img = torch.randint(0, 255, size=(28 * 28 * num_images,), dtype=torch.uint8)
            with open(filename, "wb") as f:
                f.write(_encode(2051))  # magic header
                f.write(_encode(num_images))
                f.write(_encode(28))
                f.write(_encode(28))
                f.write(img.numpy().tobytes())

        def _make_label_file(filename, num_images):
            labels = torch.randint(0, 10, size=(num_images,), dtype=torch.uint8)
            with open(filename, "wb") as f:
                f.write(_encode(2049))  # magic header
                f.write(_encode(num_images))
                f.write(labels.numpy().tobytes())

        tmp_dir = tempfile.mkdtemp(**kwargs)
        raw_dir = os.path.join(tmp_dir, cls_name, "raw")
        os.makedirs(raw_dir)
        _make_image_file(os.path.join(raw_dir, "train-images-idx3-ubyte"), num_images)
        _make_label_file(os.path.join(raw_dir, "train-labels-idx1-ubyte"), num_images)
        _make_image_file(os.path.join(raw_dir, "t10k-images-idx3-ubyte"), num_images)
        _make_label_file(os.path.join(raw_dir, "t10k-labels-idx1-ubyte"), num_images)
        try:
            yield tmp_dir
        finally:
            shutil.rmtree(tmp_dir)
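
For option 1, a minimal sketch of the re-save step, assuming a hypothetical file name for the dummy pickle, could look like this:

    import pickle

    # Hypothetical file name for the dummy data used by the CIFAR tests.
    dummy_file = "cifar10_dummy.pkl"

    # Load the file under Python 3 ...
    with open(dummy_file, "rb") as f:
        data = pickle.load(f)

    # ... and write it back with protocol 2, which Python 2 can read as well.
    with open(dummy_file, "wb") as f:
        pickle.dump(data, f, protocol=2)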

Given that you generated the dummy file yourself, I think it makes sense to go for option 2; that way, the data generation logic also stays in the code.

Thoughts?

@codecov-io codecov-io commented Jun 12, 2019

Codecov Report

Merging #1010 into master will increase coverage by 0.12%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #1010      +/-   ##
==========================================
+ Coverage   63.11%   63.24%   +0.12%     
==========================================
  Files          65       65              
  Lines        5073     5148      +75     
  Branches      757      772      +15     
==========================================
+ Hits         3202     3256      +54     
- Misses       1656     1669      +13     
- Partials      215      223       +8
Impacted Files Coverage Δ
torchvision/datasets/mnist.py 50.71% <0%> (-12.53%) ⬇️
torchvision/transforms/transforms.py 81.89% <0%> (-0.65%) ⬇️
torchvision/datasets/cifar.py 78.65% <0%> (+40.44%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2f64dd9...1216308.

@pmeier pmeier (Collaborator, Author) commented Jun 12, 2019

Should be fixed now. As a general thought: Do we want to do the fake data generation on-the-fly for all datasets? If yes, I think we should move the generation to a separate module to avoid cluttering the actual tests.

@fmassa fmassa (Member) left a comment

Thanks!

@fmassa fmassa (Member) commented Jun 12, 2019

Do we want to do the fake data generation on-the-fly for all datasets? If yes, I think we should move the generation to a separate module to avoid cluttering the actual tests.

Yes, I think we should be doing it on-the-fly, and I agree that they should eventually be moved to a separate file.
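
For reference, here is a sketch of what such an on-the-fly generator could look like for CIFAR10, analogous to get_mnist_data above. The helper name and the number of images are made up, and torchvision's md5 integrity checks would still have to be handled separately in the test; this is not the implementation that was merged:

    import contextlib
    import os
    import pickle
    import shutil
    import tempfile

    import numpy as np

    @contextlib.contextmanager
    def get_cifar10_data(num_images_per_file=2):
        # Recreate the directory layout of the extracted CIFAR-10 python archive
        # with random images and labels in a temporary directory.
        tmp_dir = tempfile.mkdtemp()
        batch_dir = os.path.join(tmp_dir, "cifar-10-batches-py")
        os.makedirs(batch_dir)

        def _make_batch_file(filename, num_images):
            batch = {
                "data": np.random.randint(
                    0, 256, size=(num_images, 32 * 32 * 3), dtype=np.uint8
                ),
                "labels": np.random.randint(0, 10, size=(num_images,)).tolist(),
            }
            with open(filename, "wb") as f:
                # protocol 2 keeps the files readable from Python 2 as well
                pickle.dump(batch, f, protocol=2)

        for name in ["data_batch_{}".format(i) for i in range(1, 6)] + ["test_batch"]:
            _make_batch_file(os.path.join(batch_dir, name), num_images_per_file)

        meta = {"label_names": ["class_{}".format(i) for i in range(10)]}
        with open(os.path.join(batch_dir, "batches.meta"), "wb") as f:
            pickle.dump(meta, f, protocol=2)

        try:
            yield tmp_dir
        finally:
            shutil.rmtree(tmp_dir)

A test could then instantiate the dataset inside the with block (e.g. CIFAR10(root=tmp_dir, train=True), assuming the integrity check is skipped or mocked).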

@fmassa fmassa merged commit 67bfb96 into pytorch:master Jun 12, 2019