Conversation

pmeier (Collaborator) commented Sep 30, 2021:

```diff
 packages=find_packages(exclude=('test',)),
 package_data={
-    package_name: ['*.dll', '*.dylib', '*.so']
+    package_name: ['*.dll', '*.dylib', '*.so', '*.categories']
```
pmeier (Collaborator, Author) replied:

We need to register some file type here so that these files are packaged with everything else. They are plain text files, but I've opted to give them a "custom" extension to avoid accidentally including anything else.
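As a rough illustration (plain Python, not part of the PR; the file names are made up), the custom extension means the package-data glob only picks up the generated category files:

```python
import fnmatch

# Files that might sit next to the dataset modules; names are made up
# for illustration.
files = ["caltech101.categories", "caltech256.categories", "README.txt", "notes.md"]

# The "*.categories" glob from package_data matches only the generated
# category files, so unrelated plain-text files are not packaged.
included = [f for f in files if fnmatch.fnmatch(f, "*.categories")]
# included == ["caltech101.categories", "caltech256.categories"]
```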

fmassa (Member) left a comment:

LGTM, thanks!

The only minor thing I would change is the `np` import, otherwise good to merge. Do you prefer to change that in a follow-up PR or directly here?

```python
from typing import Any, Callable, Dict, List, Optional, Tuple, Union
import re

import numpy as np
```
fmassa (Member) commented:

Do we need numpy here? You only use it to cast the array to int64, which you can also do in numpy with a dtype string (or by passing a dtype directly to `torch.as_tensor` / `torch.tensor`).

pmeier (Collaborator, Author) replied:

Unfortunately, we do. The `box_coord` array has dtype `uint16`, which `torch.as_tensor` cannot handle. So we either use numpy to convert to a common dtype first, or we convert the values to a list and build the tensor from that. What would you prefer? In any case I'll add a comment explaining what is going on.
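A minimal sketch of the workaround (the coordinate values are made up; only numpy runs here, with the torch call left as a comment):

```python
import numpy as np

# Annotations loaded from the .mat files carry box coordinates as
# uint16, a dtype torch cannot represent; values are made up here.
box_coord = np.array([[10, 20, 110, 220]], dtype=np.uint16)

# Cast to a dtype torch understands before building the tensor, e.g.
#   torch.as_tensor(box_coord.astype(np.int64))
converted = box_coord.astype(np.int64)
```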

```python
image = decoder(image_buffer) if decoder else image_buffer

ann = read_mat(ann_buffer)
```
fmassa (Member) commented:

Does `ann` contain anything else, or just `box_coord` and `obj_contour`?

pmeier (Collaborator, Author) replied Sep 30, 2021:

Just `box_coord` and `obj_contour`.

```python
anns_dp = TarArchiveReader(anns_dp)
anns_dp = Filter(anns_dp, self._is_ann)

dp = KeyZipper(
```
fmassa (Member) commented:

For my understanding: is this efficient compared to our current Caltech implementation based on the old-style datasets?

pmeier (Collaborator, Author) replied Sep 30, 2021:

I'm not sure how to answer whether this is efficient or not.

Without the `Shuffler` we could use `Zipper` here, and that is basically what we are doing now. If both datapipes, i.e. `images_dp` and `anns_dp`, are already aligned, `KeyZipper` only adds the overhead of evaluating `key_fn` and `ref_key_fn` for every sample.

As an alternative we could put the `Shuffler` after a `Zipper` and remove the `KeyZipper` altogether. Edit: I was wrong, the two archives are not perfectly aligned, so we need the `KeyZipper` anyway.
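A plain-Python sketch (not the actual datapipe API; file names and payloads are made up) of why key-based matching is more robust than positional zipping when the two archives are not aligned:

```python
# Two streams in different orders, standing in for images_dp and anns_dp.
images = [("image_0002.jpg", "IMG2"), ("image_0001.jpg", "IMG1")]
anns = [("annotation_0001.mat", "ANN1"), ("annotation_0002.mat", "ANN2")]

def key(file_name):
    # Shared numeric id, e.g. "0002" from "image_0002.jpg".
    return file_name.rsplit("_", 1)[1].split(".")[0]

# A positional zip would pair IMG2 with ANN1. Matching on the key pairs
# each image with its own annotation regardless of order.
ann_by_key = {key(name): payload for name, payload in anns}
pairs = [(payload, ann_by_key[key(name)]) for name, payload in images]
# pairs == [("IMG2", "ANN2"), ("IMG1", "ANN1")]
```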

```python
create_categories_file(HERE, self.name, categories)


if __name__ == "__main__":
```
fmassa (Member) commented:

Question: do we want to keep this for all datasets?

pmeier (Collaborator, Author) replied:

The idea behind this is that we should keep track of how we generated the category files. With this it is easy to regenerate them if we need to: just run `python -m torchvision.prototype.datasets._builtin.caltech`. Maybe we could also add an aggregation to be able to generate "everything", but I would keep that for the future.
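A minimal sketch of the regeneration idea (the helper name follows the snippet above, but its body, the file format, and the entry point are assumptions, not the PR's code):

```python
import tempfile
from pathlib import Path

def create_categories_file(root, name, categories):
    # Write one category per line to "<name>.categories"; format assumed.
    with open(Path(root) / f"{name}.categories", "w") as f:
        f.write("\n".join(categories) + "\n")

if __name__ == "__main__":
    # Mirrors the "python -m ..." regeneration entry point; a temporary
    # directory stands in for the real package location.
    create_categories_file(tempfile.mkdtemp(), "caltech256", ["ak47", "american-flag"])
```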

@pmeier pmeier changed the title add prototype for Caltech256 dataset add prototypes for Caltech(101|256) datasets Sep 30, 2021
@pmeier pmeier merged commit 055708d into pytorch:main Sep 30, 2021
@pmeier pmeier deleted the datasets/caltech branch September 30, 2021 12:29
facebook-github-bot pushed a commit that referenced this pull request Oct 1, 2021
Summary:
* add prototype for `Caltech256` dataset

* silence mypy

Reviewed By: prabhat00155, NicolasHug

Differential Revision: D31309545

fbshipit-source-id: b1b597c2ac152a77fa5c80f361359d993212ce47
mszhanyi pushed a commit to mszhanyi/vision that referenced this pull request Oct 19, 2021
cyyever pushed a commit to cyyever/vision that referenced this pull request Nov 16, 2021
3 participants