
Document origin of preprocessing mean / std #1965

Merged Mar 31, 2020 (5 commits)

Conversation

pmeier (Collaborator) commented Mar 11, 2020

I finally got around to addressing #1439. Quick recap of the discussion: the origin of the mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225] we use for the normalization transforms on almost every model is only partially known. We know that they were calculated on a random subset of the train split of the ImageNet2012 dataset. Which images were used, how many, and which transformations were applied are unfortunately lost.

I've tried to reproduce them and found that we probably resized each image to 256 and center cropped it to 224 afterwards. In #1439 my calculated stds differed significantly from the values we used. This resulted from the fact that I previously used sqrt(mean([var(img) for img in dataset])) while we probably used mean([std(img) for img in dataset]). You can find the script I've used for all calculations here.
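
To make the difference between the two estimates concrete, here is a minimal sketch of the computation described above. It is not the linked script; the dataset path, subset size, and sampling are placeholders, and only the Resize(256) + CenterCrop(224) pipeline and the two std aggregations are taken from the discussion.

```python
import torch
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

# Resize the shorter side to 256, then take the central 224x224 crop,
# as described above; ToTensor scales pixel values to [0, 1].
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Placeholder path and subset size; the real script samples from the
# ImageNet 2012 train split.
dataset = datasets.ImageFolder("IMAGENET_ROOT/train", transform=transform)
indices = torch.randperm(len(dataset))[:1000]
loader = DataLoader(Subset(dataset, indices), batch_size=1, num_workers=4)

means, stds, variances = [], [], []
for img, _ in loader:
    pixels = img.squeeze(0).reshape(3, -1)  # per-channel pixel values
    means.append(pixels.mean(dim=1))
    stds.append(pixels.std(dim=1))
    variances.append(pixels.var(dim=1))

mean = torch.stack(means).mean(dim=0)
std_avg = torch.stack(stds).mean(dim=0)                # mean([std(img) for img in dataset])
std_pooled = torch.stack(variances).mean(dim=0).sqrt() # sqrt(mean([var(img) for img in dataset]))

print(f"mean={mean.tolist()}")
print(f"std (mean of per-image stds)={std_avg.tolist()}")
print(f"std (sqrt of mean per-image vars)={std_pooled.tolist()}")
```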


Fortunately, varying the num_samples with seed=0 (python imagenet_normalization.py $IMAGENET_ROOT --num-samples N --seed 0)

| num_samples | mean                  | std                   |
|-------------|-----------------------|-----------------------|
| 1000        | [0.483, 0.454, 0.401] | [0.226, 0.223, 0.222] |
| 2000        | [0.482, 0.451, 0.396] | [0.225, 0.222, 0.221] |
| 5000        | [0.484, 0.454, 0.401] | [0.225, 0.221, 0.221] |
| 10000       | [0.485, 0.454, 0.401] | [0.225, 0.221, 0.220] |
| 20000       | [0.484, 0.453, 0.400] | [0.224, 0.220, 0.219] |

as well as varying the seed with num_samples=1000 (python imagenet_normalization.py $IMAGENET_ROOT --num-samples 1000 --seed S)

| seed | mean                  | std                   |
|------|-----------------------|-----------------------|
| 0    | [0.483, 0.454, 0.401] | [0.226, 0.223, 0.222] |
| 1    | [0.485, 0.455, 0.402] | [0.223, 0.218, 0.217] |
| 27   | [0.479, 0.449, 0.398] | [0.225, 0.220, 0.219] |
| 314  | [0.480, 0.454, 0.403] | [0.223, 0.218, 0.217] |
| 4669 | [0.490, 0.458, 0.406] | [0.224, 0.219, 0.219] |

does not change the mean and std by much. Running on the complete dataset without a specific seed (python imagenet_normalization.py $IMAGENET_ROOT) results in

mean=[0.484, 0.454, 0.403], std=[0.225, 0.220, 0.220]

and thus

mean_diff=[1e-3, 2e-3, 3e-3], std_diff=[4e-3, 4e-3, 5e-3] 
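
For clarity, the reported diffs are presumably just the element-wise absolute differences between the values currently used in torchvision and the ones recomputed above:

```python
# Values used in torchvision vs. the ones recomputed over the full train split.
used_mean, used_std = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]
calc_mean, calc_std = [0.484, 0.454, 0.403], [0.225, 0.220, 0.220]

mean_diff = [round(abs(u - c), 3) for u, c in zip(used_mean, calc_mean)]
std_diff = [round(abs(u - c), 3) for u, c in zip(used_std, calc_std)]
print(mean_diff, std_diff)  # [0.001, 0.002, 0.003] [0.004, 0.004, 0.005]
```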

@fmassa How do you want me to document this?

fmassa (Member) commented Mar 13, 2020

Very nice explanation!

I think the location you pointed out in the docs is where I would put it.

One thing we could do is to mention something like


... in the docs
An example of such normalization can be found in the imagenet example here

The mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225] values have been obtained by computing the mean / std over a subset of ImageNet 2012 training images.
...

and then mention that the exact subset has been lost. It would also be useful to mention the equations, and if you think it's too long in the text you can refer to the issue and this PR.
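
For reference, the normalization the proposed docs text would describe is the standard torchvision Normalize transform applied after resizing and cropping, with the documented values, e.g.:

```python
from torchvision import transforms

# Preprocessing used by the pretrained torchvision classification models;
# the mean / std values are the ones whose origin this PR documents.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```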

Thoughts?

pmeier (Collaborator, Author) commented Mar 13, 2020

I'll give it a shot and get back to you if I'm happy with it. Should I document the transformations I used as the "correct" ones or should I also point out that they might have been different?

fmassa (Member) commented Mar 13, 2020

I think the transforms you provided are pretty accurate and describe what we originally did, so just mention them as the "correct" transforms

Review comment on docs/source/models.rst (outdated, resolved)
fmassa (Member) left a comment

Almost good, just a small typo and then this is good to merge

Review comment on docs/source/models.rst (outdated, resolved)
fmassa (Member) commented Mar 20, 2020

Also, can you rebase your branch on top of master? This will fix the CU100 errors

fmassa (Member) left a comment

Thanks Philip!
