
Deliver a fast, accessible snapshot of ImageNet #2

Open

ajbouh opened this issue Mar 27, 2018 · 9 comments

ajbouh commented Mar 27, 2018

We need this to support an end-to-end benchmark at scale.

This issue is related to propelml/propel#417, which tracks a similar effort that's dependent on js-ipfs.

@diasdavid has posted an initial import of ImageNet: propelml/propel#417 (comment)

This import requires the use of directory sharding (as he outlines in his comment). ipfs/kubo#4871 implies that sharded directories won't work in the current implementation of IPTF without some adjustments.

The current implementation of ipfs ls doesn't support sharding, so we'll also need to wait on ipfs/kubo#4874 before we can experiment with this snapshot in IPTF.


ajbouh commented Mar 27, 2018

I pulled the latest go-ipfs code and I've been trying to get ipfs ls --resolve-type=false QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1 to succeed for the better part of this afternoon.

I learned from @Stebalien that --resolve-type=false is needed to avoid fetching the first block of every file in the directory. However, the current sharded directory implementation seems to ignore this option. @Stebalien wrote a fix for this: ipfs/kubo#4884

Running with this fix I've made some progress, but the maximum size of the ipfs bitswap wantlist always seems to be 1. I'm on a wired connection to a 250 Mbps cable modem, and ipfs ls ... has printed nothing even after many minutes of waiting.
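For anyone trying to reproduce this, here's a minimal sketch of how I've been watching the wantlist while the listing runs (the CID is the ImageNet root from above; the one-second polling interval is arbitrary):

$ ipfs ls --resolve-type=false QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1 > /dev/null &
$ # in another shell, count how many blocks are currently being requested
$ while true; do ipfs bitswap wantlist | wc -l; sleep 1; done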

A huge thank you to @Stebalien for his help debugging this so far, but perhaps sharded directories aren't quite ready for primetime yet? @ry @diasdavid


ajbouh commented Mar 28, 2018

More data from experiments after my machine downloaded all the directory shards:

Initial time to list the directory was about 70 seconds.

~/gopath/src/github.com/ipfs/go-ipfs$ time ipfs ls --resolve-type=false QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1 | wc -l
1281167

real    1m9.764s
user    0m11.004s
sys    0m0.396s

Rerunning was about 46 seconds each time.

~/gopath/src/github.com/ipfs/go-ipfs$ time ipfs ls --resolve-type=false QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1 | wc -l
1281167

real    0m45.935s
user    0m11.256s
sys    0m0.444s

Trying to list just the first 10 entries took about the same amount of time as listing all of them.

~/gopath/src/github.com/ipfs/go-ipfs$ time ipfs ls --resolve-type=false QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1 | head -n 10
 … snip …
real    0m46.718s
user    0m10.868s
sys    0m0.460s


ajbouh commented Mar 28, 2018

Also worth recording that the ImageNet dataset itself has some issues (a quick detection sketch follows the quoted list below):

CMYK JPEG files sometimes screw up image loaders (e.g. MATLAB imread), and for the ILSVRC CLS-LOC training set, it takes a long time to sort out these CMYK files. To this end, we list all known CMYK JPEG files as follows:

n01739381_1309.JPEG
n02077923_14822.JPEG
n02447366_23489.JPEG
n02492035_15739.JPEG
n02747177_10752.JPEG
n03018349_4028.JPEG
n03062245_4620.JPEG
n03347037_9675.JPEG
n03467068_12171.JPEG
n03529860_11437.JPEG
n03544143_17228.JPEG
n03633091_5218.JPEG
n03710637_5125.JPEG
n03961711_5286.JPEG
n04033995_2932.JPEG
n04258138_17003.JPEG
n04264628_27969.JPEG
n04336792_7448.JPEG
n04371774_5854.JPEG
n04596742_4225.JPEG
n07583066_647.JPEG
n13037406_4650.JPEG

Also, n02105855_2933.JPEG is actually a PNG file, which may crash the image loader as well if it is not configured generically enough.

https://da-data.blogspot.com/2016/02/cleaning-imagenet-dataset-collected.html
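A minimal sketch for flagging these files before import, assuming ImageMagick's identify and the standard file utility are available and the training images live under ./train (the path is a placeholder):

$ # JPEGs whose colorspace is CMYK rather than sRGB
$ find ./train -name '*.JPEG' | while read -r f; do
>   [ "$(identify -format '%[colorspace]' "$f")" = "CMYK" ] && echo "$f"
> done
$ # files with a .JPEG extension that are actually PNGs
$ find ./train -name '*.JPEG' | while read -r f; do
>   file "$f" | grep -q 'PNG image data' && echo "$f"
> done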


ajbouh commented Mar 29, 2018

Based on the experiments above, it seems like we need to either fix remote sharded directories or work around them with manual sharding.

Fixing remote directory shard enumeration

  • Speed up fetching by having more than one element in the ipfs bitswap wantlist at a time.

@Stebalien We discussed this briefly. Who is the right person to follow up with about this?

Once we have the root directory structure cached locally, we still need to be able to enumerate files without loading the entire list into memory first. There are two possible ways to do this; the first is to precompute a manifest of the listing and link it into the root:

$ export ROOT_CID=...
$ ipfs ls --resolve-type=false $ROOT_CID > manifest
$ MANIFEST_CID=$(ipfs add -Q manifest)   # -Q prints only the final hash
$ ROOT_CID=$(ipfs object patch $ROOT_CID add-link manifest $MANIFEST_CID)
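With a link like that in place, a consumer should be able to stream just the first few entries instead of walking the sharded directory, e.g. (assuming the link is named manifest as above):

$ ipfs cat $ROOT_CID/manifest | head -n 10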

Manual sharding

Alternatively, I can write a script to manually re-shard the ImageNet dataset into directories of < 1k entries.
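A rough sketch of what that script could look like, assuming the flat ImageNet directory is at ./imagenet and we want buckets of at most 1000 files (paths and bucket naming are placeholders):

$ i=0
$ for f in ./imagenet/*.JPEG; do
>   bucket=$(printf 'shard-%04d' $((i / 1000)))   # 1,281,167 files -> ~1282 buckets
>   mkdir -p "./imagenet-sharded/$bucket"
>   mv "$f" "./imagenet-sharded/$bucket/"
>   i=$((i + 1))
> done
$ ipfs add -r ./imagenet-sharded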

cc @diasdavid @lgierth Thoughts?

@daviddias

@victorbjelkholm has been the champion in helping me get this dataset fully pinned on the IPFS infrastructure.

@victorbjelkholm, what about spinning up a node with a large enough disk and giving user access to @ajbouh so that he can pin things directly? I'm sure this would save a bunch of round trips.


victorb commented Mar 29, 2018

what about spinning up a node with a large enough disk

Sure thing. All I need to know is how big a disk is required, what kind of instance (it would be on AWS), and a public key to grant access, and I'll get it created in a few moments.

@daviddias

@victorbjelkholm The disk should be bigger than 500 GB; 1 TB just to be safe. I'll leave the decision on where to host it to you; however, beware of the bandwidth costs, as this machine will be piping out this dataset multiple times.
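For whoever provisions the box, a minimal sketch of the steps once go-ipfs is installed (instance sizing and key handling are left open; the CID is the ImageNet root from above):

$ ipfs init --profile server
$ ipfs daemon &
$ ipfs pin add --progress QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1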


ajbouh commented Jul 17, 2018

@victorbjelkholm Any progress on this?


ajbouh commented Nov 17, 2020

Looks like the relevant IPFS conversation has continued on ipfs/kubo#6523
