Omniglot Dataset #323

Merged
merged 13 commits into pytorch:master from activatedgeek:omniglot-dataset on Jan 28, 2018

Conversation

@activatedgeek
Contributor

activatedgeek commented Nov 6, 2017

This is the loader for the Omniglot dataset.

One of the use cases of this dataset is one-shot learning, where we sample a pair of images from the dataset and train a neural network to learn a similarity metric between them.

P.S.: It is amazing how simple it is to write data loaders!

@activatedgeek

Contributor

activatedgeek commented Nov 6, 2017

I am particularly unsure how to implement the __len__ function for the randomized pairs. Should I generate a preset number of pairs and just return them on each call to __getitem__?

@activatedgeek activatedgeek changed the title from [WIP] Omniglot Dataset to Omniglot Dataset Nov 6, 2017

@activatedgeek

Contributor

activatedgeek commented Nov 6, 2017

I now precompute the random set of pairs, with an added pair_count argument to the new OmniglotRandomPair class constructor. I think this solves the random pair generation problem and makes the pairs deterministic once the object is instantiated.

@alykhantejani

Thanks for the PR - I've left some inline comments

.gitignore Outdated
docs/build
.idea/

@alykhantejani

alykhantejani Nov 8, 2017

Collaborator

this should be removed as it is specific to your dev environment (PyCharm)

@activatedgeek

activatedgeek Nov 8, 2017

Contributor

Sorry about that. Should have already put that in my global gitignore.

print('Files already downloaded and verified')
return

for fzip in self.zips_md5:

@alykhantejani

alykhantejani Nov 8, 2017

Collaborator

this downloads both the background and evaluation sets even though we allow the user to specify which set they want in the constructor. It would be good to download just the background or the evaluation set, as is done in other datasets. The user will typically have two instances of this object anyway, one for each split.

@alykhantejani

alykhantejani Nov 8, 2017

Collaborator

Also, I think if the files already exist, the download function will terminate early, so the force_extract branch won't be run?

@activatedgeek

activatedgeek Nov 8, 2017

Contributor

Oh yes, the two sets definitely belong in separate downloads. The force_extract was for an adversarial scenario I had imagined, but perhaps it doesn't make sense. I realize it was not worth the added complexity.

' You can use download=True to download it')

self.target_folder = os.path.join(self.root, self._get_target_folder())
self.alphabets_ = list_dir(self.target_folder)

@alykhantejani

alykhantejani Nov 8, 2017

Collaborator

if alphabets should be private, it should have a leading underscore, i.e. _alphabets, not a trailing one

@activatedgeek

activatedgeek Nov 8, 2017

Contributor

Oh! I had interpreted the convention the other way around, hence the notation.

['images_background', '68d2efa1b9178cc56df9314c21c6e718'],
['images_evaluation', '6b91aef0f799c5bb55b94e3f2daec811'],
# Provision in future
# ['images_background_small1', 'e704a628b5459e08445c13499850abc4'],

@alykhantejani

alykhantejani Nov 8, 2017

Collaborator

Please can you remove the commented-out bits and open an issue to add these once this is merged?


self.target_folder = os.path.join(self.root, self._get_target_folder())
self.alphabets_ = list_dir(self.target_folder)
self.characters_ = list(

@alykhantejani

alykhantejani Nov 8, 2017

Collaborator

same comment here about trailing underscore

self.target_folder = os.path.join(self.root, self._get_target_folder())
self.alphabets_ = list_dir(self.target_folder)
self.characters_ = list(
reduce(

@alykhantejani

alykhantejani Nov 8, 2017

Collaborator

nit: the reduce(lambda ...) can be replaced by sum([[join(alphabet, c) for c in list_dir(join(target, alphabet))] for alphabet in _alphabets], []), i.e. you can use sum to do the flatten
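For reference, a tiny standalone illustration of the sum-as-flatten trick, with made-up values:

nested = [['alpha/char1', 'alpha/char2'], ['beta/char1']]
# sum() with [] as the start value concatenates the sublists:
flat = sum(nested, [])
print(flat)  # ['alpha/char1', 'alpha/char2', 'beta/char1']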

@activatedgeek

activatedgeek Nov 8, 2017

Contributor

Neat! That is more succinct.

]
for idx, character in enumerate(self.characters_)
]
self.flat_character_images_ = list(reduce(lambda x, y: x + y, self.character_images_))

@alykhantejani

alykhantejani Nov 8, 2017

Collaborator

nit: same comment here, you can use sum(self.character_images_, []) instead

)
self.character_images_ = [
[
tuple([image, idx])

@alykhantejani

alykhantejani Nov 8, 2017

Collaborator

this can just be (image, idx), no need to explicitly call tuple

return 'images_background' if self.background is True else 'images_evaluation'


class OmniglotRandomPair(Omniglot):

@alykhantejani

alykhantejani Nov 8, 2017

Collaborator

Is this a standard dataset? I couldn't see any reference to it on https://github.com/brendenlake/omniglot

If it's a dataset for your use case only, I'd be less inclined to keep it in torchvision. However, if it's used widely, then it's also fine by me.

@activatedgeek

activatedgeek Nov 8, 2017

Contributor

This dataset is being used for one-shot learning, which has a verification sub-task in the case of metric learning. I'll confess that I have a high bias here, given that I have only read papers where the verification task of comparing two images was needed. But I believe it is valuable to anyone working on this verification task, which is closely related to one-shot learning.

"Widely used" is something that I cannot completely vouch for.

@alykhantejani

alykhantejani Nov 8, 2017

Collaborator

The implementation of this dataset seems a bit complex for what it's trying to achieve. If all you want are random pairs of images and whether or not they are the same character, then why not just sample one (img, idx) pair from self._flattened_character_images and then keep sampling a second one until it matches or doesn't, depending on what is_match is.

There's no need to precompute pairs; just make __len__ return the length of self._flattened_character_images, and when this is used with the PyTorch DataLoader, it will stop sampling after __len__ samples.
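A minimal sketch of this suggestion (illustrative only; it assumes the Omniglot class from this PR and that entries of self._flat_character_images are (image, character_class) tuples):

import random

class OmniglotRandomPair(Omniglot):
    def __getitem__(self, index):
        # Pick the first element and a target label, then rejection-sample
        # the second element until the match condition is satisfied.
        img1, c1 = random.choice(self._flat_character_images)
        is_match = random.randint(0, 1)
        while True:
            img2, c2 = random.choice(self._flat_character_images)
            if (c1 == c2) == bool(is_match):
                break
        return (img1, img2), is_match

    def __len__(self):
        # The DataLoader stops after this many samples per epoch.
        return len(self._flat_character_images)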

@activatedgeek

activatedgeek Nov 9, 2017

Contributor

If I make __len__ equal to the length of self._flattened_character_images, that is not really the true length, so I did not want to conflate those two ideas and put a limitation on what I am trying to achieve. (The total possible number of pairs is very large, hence the pair_count limit.)

I wanted to pre-compute the samples so that every time I use the data loader on the instance, I get the exact same set of pairs. Now that this has come up, I realize that I should also add a random_seed argument to make sure it is truly reproducible. Would that make sense?

If the above makes sense, having self._character_images just makes the pair generation easier. Also, correct me if I am wrong, but if I choose randomly from the flat list self._flat_character_images, then to get a pair that comes from the same character (is_match = 1), I have a feeling it would on average take more iterations to arrive at a favorable collision. The way I have chosen to do it theoretically needs fewer expected iterations.

@activatedgeek activatedgeek force-pushed the activatedgeek:omniglot-dataset branch from 2dfc445 to 594247d Nov 8, 2017

@alykhantejani

Thanks for the updates, I've left some more comments inline, mostly around the OmniglotRandomPair dataset

zip_file.extractall(self.root)

def _get_target_folder(self):
return 'images_background' if self.background is True else 'images_evaluation'

@alykhantejani

alykhantejani Nov 8, 2017

Collaborator

you can remove the is True here


self.target_folder = os.path.join(self.root, self._get_target_folder())
self._alphabets = list_dir(self.target_folder)
self._characters = sum(

@alykhantejani

alykhantejani Nov 8, 2017

Collaborator

nit: this can be done on one line, which IMO makes it more readable, by choosing less verbose variable names and importing join directly i.e.:

from os.path import join
self._characters = sum([[join(a, c) for c in list_dir(join(self.target_folder, a))] for a in self._alphabets], [])

def __init__(self, root, pair_count=10000, background=True,
transform=None, target_transform=None,
download=False):
super(self.__class__, self).__init__(root, background=background,

@alykhantejani

alykhantejani Nov 8, 2017

Collaborator

usually we don't use self.__class__, but instead do super(OmniglotRandomPair, self).__init__(...)

@activatedgeek

activatedgeek Nov 9, 2017

Contributor

Fair enough. Though, is there any major reason why self.__class__ would be bad practice?
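(The usual reason, for the record: self.__class__ is resolved at runtime, so the call breaks as soon as the class is subclassed further. A minimal illustration:

class A(object):
    def __init__(self):
        # BAD: if this line were super(self.__class__, self).__init__(),
        # then when constructing a B, self.__class__ is B even inside
        # A.__init__, so this would call A.__init__ again, forever.
        super(A, self).__init__()

class B(A):
    def __init__(self):
        super(B, self).__init__()

B()  # works with explicit class names; infinite recursion otherwise

This pitfall is not discussed further in the thread.)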

],
[]
)
self._character_images = [

@alykhantejani

alykhantejani Nov 8, 2017

Collaborator

This doesn't look like it's used in this class; why not just wrap this with the sum and use it everywhere you use _flat_character_images? You can probably also make this span a few fewer lines by bunching up the [[, etc.

@activatedgeek

activatedgeek Nov 9, 2017

Contributor

I wanted an easy way to sample images for a given character id during the randomized pair generation. I discuss this more in the other comment.

@activatedgeek

Contributor

activatedgeek commented Nov 10, 2017

Hey @alykhantejani, do you mind checking the updates to the randomized pair generation?

@activatedgeek

Contributor

activatedgeek commented Nov 13, 2017

@alykhantejani I'm writing my comments on the approach here because, weirdly, I couldn't find the thread anywhere when I responded to your review earlier.

If all you want are random pairs of images and whether or not they are the same character, then why not just sample one (img, idx) pair from self._flattened_character_images and then keep sampling a second one until it matches or doesn't, depending on what is_match is.

I believe that way it is certainly harder to arrive at a collision when is_match = 1. The way I have organized the internal data lets me quickly generate a pair for both is_match = {0, 1}; once I flatten the array, I lose the information about the character class.

There's no need to precompute pairs; just make __len__ return the length of self._flattened_character_images, and when this is used with the PyTorch DataLoader, it will stop sampling after __len__ samples.

The problem with this approach is that I would not be staying true to what I wanted to achieve with the class. The total number of combinatorially possible pairs is huge, which is why I introduced a pair_count defaulting to 10000 and set that as the length. It gives the user enough flexibility to sample training pairs as well as set up a one-shot task. If I limit it to the length of the flattened character list, it gives the wrong impression and might confuse the user. I have also added random_seed to allow reproducibility.
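For concreteness, a sketch of the precomputed-pairs approach described above (names such as _character_images follow the earlier discussion; this is an illustration, not the final code):

import random

class OmniglotRandomPair(Omniglot):
    def __init__(self, root, pair_count=10000, random_seed=0, **kwargs):
        super(OmniglotRandomPair, self).__init__(root, **kwargs)
        rng = random.Random(random_seed)  # fixed seed => reproducible pairs
        self._pairs = []
        for _ in range(pair_count):
            is_match = rng.randint(0, 1)
            c1 = rng.randrange(len(self._character_images))
            # Same character for a match, any other character otherwise;
            # keeping per-character image lists avoids rejection sampling.
            c2 = c1 if is_match else rng.choice(
                [c for c in range(len(self._character_images)) if c != c1])
            img1, _ = rng.choice(self._character_images[c1])
            img2, _ = rng.choice(self._character_images[c2])
            self._pairs.append(((img1, img2), is_match))

    def __getitem__(self, index):
        return self._pairs[index]

    def __len__(self):
        return len(self._pairs)  # pair_count, the advertised length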

@alykhantejani

Collaborator

alykhantejani commented Nov 15, 2017

Hey @activatedgeek, sorry for the late response.

Pinging @fmassa in case he has any opinions/thoughts on the random pair dataset

@fmassa

Member

fmassa commented Nov 16, 2017

Hi @activatedgeek , sorry for the delay in replying.

So, my first thought about the Pair dataset is the following: we should try not to add sampling logic into the dataset itself, which should only hold information on how to load individual elements, and let the sampling be performed by the Sampler/DataLoader.
It might look non-trivial in this case, but here is one possibility:

We introduce a new Dataset class, in a similar spirit to ConcatDataset, that draws n samples of the data and returns whatever you want.
Here is a draft implementation (untested):

class MultiDataset(object):
    def __init__(self, dataset, num_outputs=1, transforms=None):
        self.dataset = dataset
        self.num_outputs = num_outputs
        self.transforms = transforms

    def __getitem__(self, idx):
        # here comes the logic to convert a 1d index into
        # self.num_outputs indices, each in range(len(self.dataset))
        individual_idx = []
        for i in range(self.num_outputs):
            individual_idx.append(idx % len(self.dataset))
            idx = idx // len(self.dataset)

        result = []
        for i in reversed(individual_idx):
            result.append(self.dataset[i])

        if self.transforms is not None:
            result = self.transforms(result)

        return result

    def __len__(self):
        return len(self.dataset) ** self.num_outputs

This way, you generate an arbitrarily large dataset on the fly that can accommodate pairs/triplets/etc. of elements of the same dataset. Plus, the logic of how to combine the different targets becomes something the user decides (via the transforms in the MultiDataset; each individual dataset has its own transforms as well).
And, if you want a biased sampler (say, to have a roughly equal number of matches/non-matches), this could maybe be handled by the sampler?
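For concreteness, a hypothetical usage of the MultiDataset sketch for the pair task (pair_transform is an illustrative name, not part of the draft, and omniglot is assumed to be an instance of the Omniglot dataset from this PR):

def pair_transform(samples):
    # samples is the list of num_outputs elements drawn from the dataset;
    # each element is an (image, character_class) tuple.
    (img1, c1), (img2, c2) = samples
    return (img1, img2), int(c1 == c2)  # is_match target

pairs = MultiDataset(omniglot, num_outputs=2, transforms=pair_transform)
(img_a, img_b), is_match = pairs[12345]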

This is just a rough idea, but let me know what you think.

@activatedgeek

Contributor

activatedgeek commented Nov 17, 2017

@fmassa That is a great idea. I was in fact wondering the same thing, because I recently came across a similar requirement for the ImageNet/Mini-ImageNet datasets as well. It didn't feel right to create custom rules every now and then.

So here is what I will do: the Omniglot data loader will return a tuple (image, character_class_id), and I will wrap up this PR.

I will take up a MultiDataset-like utility wrapper in another PR.

Does that sound good?
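For reference, a minimal usage sketch of the loader with the agreed interface (assuming the constructor signature as merged in this PR):

from torchvision import datasets, transforms

# background=True selects images_background; download=True fetches the zip.
omniglot = datasets.Omniglot(root='./data', background=True, download=True,
                             transform=transforms.ToTensor())
image, character_class = omniglot[0]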

@fmassa

Member

fmassa commented Nov 17, 2017

Sounds good! Also, given how generic this dataset is, it might be worth considering sending it to pytorch/tnt, but we can discuss that later.

@activatedgeek

Contributor

activatedgeek commented Nov 17, 2017

@fmassa @alykhantejani Can you please verify if everything is in order now?

@activatedgeek

Contributor

activatedgeek commented Nov 26, 2017

Any updates here?

cc @fmassa @alykhantejani

@activatedgeek

Contributor

activatedgeek commented Nov 28, 2017

Hi, I am wondering what the hold-up is. Is there something I could further help with? This PR has been up for quite a while, and I would really appreciate it if we could make some progress here. (Or close it if it doesn't align well.)

cc @alykhantejani @fmassa

@activatedgeek

Contributor

activatedgeek commented Dec 9, 2017

@alykhantejani @fmassa Any possibility of this getting merged?

@fmassa

Member

fmassa commented Dec 11, 2017

@activatedgeek sorry for the delay, I'll have a look at it today.

@fmassa

Member

fmassa commented Dec 11, 2017

This looks very good, thanks! Once the merge conflicts are addressed, I think this can be merged.

activatedgeek added some commits Dec 22, 2017

@activatedgeek

Contributor

activatedgeek commented Jan 3, 2018

Hey @alykhantejani @fmassa, can we merge this? I can continue the discussion in #338 after this.

@activatedgeek

Contributor

activatedgeek commented Jan 24, 2018

Pinging @fmassa and @alykhantejani here again. Can we please merge this?

@fmassa fmassa merged commit dac9efa into pytorch:master Jan 28, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
@fmassa

Member

fmassa commented Jan 28, 2018

Awesome! Thanks a lot @activatedgeek !

@activatedgeek activatedgeek deleted the activatedgeek:omniglot-dataset branch Jan 28, 2018

@activatedgeek

Contributor

activatedgeek commented Jan 28, 2018

Hey @fmassa thanks! Do you mind taking a look at #338 as well?

@fmassa fmassa referenced this pull request Jan 30, 2018

Closed

OMNIGLOT Dataset #46
