
Support for neuroglancer multi-resolution sharded meshes #340

Merged
25 commits merged into seung-lab:master on Jun 8, 2020

Conversation

perlman
Contributor

@perlman commented May 7, 2020

This is an initial effort to read neuroglancer's multi-resolution sharded mesh format.

**TODO**

  • Request specific levels of detail
    • ... and only request the relevant fragments over http
  • Convert from "stored model" space coordinates into correct model coordinates
  • Convert from "model space" into physical space
  • Concatenate meshes into a single mesh
  • Match function of other cloud-volume mesh functions
  • [~] Check for missing manifests
    • (Via bulk prefetching of all manifests?)
    • Added an exists method
  • General code cleanup...

**Other issues (that probably won't be addressed)**

  • read-only
    • Eventual write support would be done via unsharded storage & a compaction stage
  • Only scale parameters in the homogeneous transform are supported
  • Support for unsharded storage of multi-resolution meshes (much of the code would be similar, e.g. the mesh fragments are the same)
  • Fuse meshes? (not sure how important this is)

Sample code

import cloudvolume

test_ids = [710435991, 8392766771, 8393399736]
vol = cloudvolume.CloudVolume('https://storage.googleapis.com/fafb-ffn1-20190805/segmentation', mip=0, cache=False)

for lod in reversed(range(4)):
    # Returns { segid: Mesh }
    meshes = vol.mesh.get(test_ids, lod=lod, concat=True)
    for segment in meshes:
        mesh = meshes[segment]
        mesh.viewer()

@william-silversmith
Contributor

This is great Eric! Looking forward to seeing where this goes. :)

@william-silversmith
Contributor

Linked to issue #267

@perlman
Contributor Author

perlman commented May 9, 2020

This code can now read the sharded meshes from both FAFB-FFN1 & FlyEM Hemibrain.

@william-silversmith: I added a quick hack to get the manifest location back with its contents, and am currently creating a new SimpleStorage for retrieving the meshes. How would you suggest reworking this to be a bit cleaner?

@william-silversmith: What should the mesh.get() function return for lod meshes? (The current return value is a list of fragments, one per lod.)

Much thanks to @jbms for adding clarification on a few key points in the spec.

@william-silversmith
Contributor

[I] am currently creating a new SimpleStorage for retrieving the meshes. How would you suggest reworking this to be a bit cleaner?

I think this is fine for now. The multi-LOD spec departs from the ordinary sharded format, so those departures would have to be stored somewhere... Perhaps a subclass of ShardReader would be slightly cleaner, but the performance cost of creating a new SimpleStorage is very low since the connection pool reuses connections. A ShardReader subclass would be able to avoid returning the offset from get_data, but it would also have to duplicate some code to maintain encapsulation.

The major reworking I'd be looking forward to would be threaded retrieval. I've been working on making that possible for ShardReader.exists, and the techniques can be ported to ShardReader.get_data. Once that happens, we can dramatically increase the performance here.

What should the mesh.get() function return for lod meshes? (The current return value is a list of fragments, one per lod.)

My feeling is that retrieving all the LODs all the time is kind of a waste of the benefits of the format. What about adding a get(segids, lod=scalar or list) flag? Similarly, for fusing, you can fuse the meshes retrieved within a given LOD. If you wanted to get a bit more sophisticated, perhaps we could have a way of retrieving meshes with respect to a point such that it cobbles together a cohesive view of the final mesh based on distance to that point similarly to neuroglancer.

Thanks for your excellent work! I like your code too Eric. <3 :) I'll give it a try a bit later today.

@perlman
Contributor Author

perlman commented May 9, 2020

A ShardReader subclass would be able to avoid returning the offset from get_data, but it would also have to duplicate some code to maintain encapsulation.

Makes sense. Maybe we can put half of that work into a private method in ShardReader, which can then be called from the subclass? (The index lookup is an example of code I wouldn't want replicated.)

My feeling is that retrieving all the LODs all the time is kind of a waste of the benefits of the format. What about adding a get(segids, lod=scalar or list) flag?

Done (and each LOD is downloaded as its own range request). The default is still all meshes defined in the mesh manifest.

Similarly, for fusing, you can fuse the meshes retrieved within a given LOD.

Next step. :)

If you wanted to get a bit more sophisticated, perhaps we could have a way of retrieving meshes with respect to a point such that it cobbles together a cohesive view of the final mesh based on distance to that point similarly to neuroglancer.

I'll leave this to someone else. ;-) I'd be content with a function to do mesh selection by resolution (e.g., I want the best mesh for 32 x 32 x 40).
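Something along those lines is all I'm picturing, e.g. (just a sketch; lod_resolutions_nm is a hypothetical input, not anything that exists in cloud-volume):

def pick_lod(lod_resolutions_nm, requested_nm):
    # Hypothetical helper: given per-LOD effective resolutions (index 0 = finest),
    # return the coarsest LOD that is still at least as fine as the requested
    # resolution, falling back to the finest LOD if nothing qualifies.
    candidates = [i for i, res in enumerate(lod_resolutions_nm) if res <= requested_nm]
    return max(candidates) if candidates else 0

# e.g. pick_lod([8, 16, 32, 64], 32) == 2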

@william-silversmith
Contributor

Maybe we can put half of that work into a private method in ShardReader, which can then be called from the subclass?

Sure! If you're up for it, give it a shot. If it gets awkward though, don't worry too much about the duplication of logic. I like to follow the "rule of three" where I'll sometimes let some code be duplicated until the third use so the commonalities become more clear. Just do whatever looks most elegant to you, which might include keeping the offset parameter.

If you wanted to get a bit more sophisticated, perhaps we could have a way of retrieving meshes with respect to a point such that it cobbles together a cohesive view of the final mesh based on distance to that point similarly to neuroglancer.

I'll leave this to someone else. ;-) I'd be content with a function to do mesh selection by resolution (e.g., I want the best mesh for 32 x 32 z 40).

That's fine, I was just spitballing. It's possible that feature is really only useful for visualization.

Thanks again for all the work you put into this Eric!

@william-silversmith
Contributor

Good news bad news time! Good news: I parallelized get_data so now it should be possible to dramatically speed this function up. Bad news: I suppose we need to re-integrate the return_offset flag.

@perlman changed the title from "WIP: Support for neuroglancer multi-resolution sharded meshes" to "Support for neuroglancer multi-resolution sharded meshes" on Jun 4, 2020
@william-silversmith
Contributor

This is looking pretty good! We can always brush up the performance aspect later. Would you mind writing some test(s) for it? If you point me to a particular shard file, I can upload it to gs://seunglab-test/ so it can be forever tested.

It doesn't need to be anything too crazy, an integration test is sufficient.

@perlman
Contributor Author

perlman commented Jun 6, 2020

This is looking pretty good! We can always brush up the performance aspect later. Would you mind writing some test(s) for it? If you point me to a particular shard file, I can upload it to gs://seunglab-test/ so it can be forever tested.

It doesn't need to be anything too crazy, an integration test is sufficient.

This sounds like a good idea. I don't have the ability to generate any data -- would it make sense just to pull one of the shard files from one of the public neuroglancer datasets (e.g. the hemibrain)?

@william-silversmith
Contributor

william-silversmith commented Jun 6, 2020 via email

@perlman
Contributor Author

perlman commented Jun 7, 2020

This now includes a minimal test case, with a reference to the files accessed. When I have a chance I can try to pick some good & bad IDs that all hash to the same shard file.

@william-silversmith
Contributor

This is hella cool. Thank you so much for your work!!

@william-silversmith merged commit dc2b8f4 into seung-lab:master on Jun 8, 2020
@jefferis

jefferis commented Jun 8, 2020

Hurrah! Thank you both very much (especially @perlman in this case!).

@william-silversmith
Contributor

I'm going to make a new release sometime today. :)

@perlman
Contributor Author

perlman commented Jun 8, 2020

I am sure there will be follow-up PRs to fix bugs & add other useful interfaces (like a bbox() defined by the octree regions). I'll see what pops up when @jefferis and others use it. ;)

@jefferis

@perlman @william-silversmith: a couple of queries for one or both of you after using this in earnest a bit.

  1. In contrast (I think) to some other mesh formats, the get method cannot accept strings (only ints). Would it be hard to change this? I often find 64-bit int support fragile across programming languages (e.g. base R only has int32, so R-to-Python calls don't support int64).
  2. Again, some of the other mesh sources I have used (FlyWire) have a save method. Here there is none, possibly because of the issue of multi-LOD meshes. Is it worth adding? Alternatively, is there a general pattern for saving (e.g. in obj) that is guaranteed to work across mesh formats?

@fcollman
Collaborator

I'll add my two cents on point 2. I think once you have vertices and faces there are a lot of mesh libraries out there for saving meshes. I think of cloud-volume as the interface for getting/saving highly distributed files for cloud-based connectomics. If you want to save versions of meshes as simple files, I think there are existing packages that cover a large range of file formats. MeshParty, for example, uses trimesh, which enables it to save in obj (or ply, or many others).

So in that case MeshParty calls cloud-volume to get a mesh and uses it to instantiate a class derived from a trimesh base class. We can then use that to save meshes in a variety of formats, though as we develop more analytic layers on top of the mesh we have created an HDF-based format to store those extra analytics. One immediately relevant example is a place to store the extra edges that link disconnected portions of proofread meshes. Others would include skeleton representations and the mappings between the skeleton and the mesh.
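As a minimal sketch of that pattern (assuming trimesh is installed; the segment ID is just the illustrative one from the sample code at the top of this PR, and the dict-style return discussed in this thread):

import cloudvolume
import trimesh

vol = cloudvolume.CloudVolume('https://storage.googleapis.com/fafb-ffn1-20190805/segmentation')
segid = 710435991  # illustrative ID borrowed from the sample code above
cv_mesh = vol.mesh.get(segid)[segid]

# Hand the vertices/faces over to trimesh, which knows how to write obj/ply/stl/etc.
tmesh = trimesh.Trimesh(vertices=cv_mesh.vertices, faces=cv_mesh.faces)
tmesh.export('{}.obj'.format(segid))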

@jefferis

jefferis commented Jun 13, 2020

I'll add my two cents on point 2. I think once you have vertices and faces there are a lot of mesh libraries out there for saving meshes.

Thanks, Forrest. That seems absolutely reasonable.

It's just that I wrote some wrapper code to fetch/save meshes from FlyWire and I was imagining that it would just work when I switched to hemibrain as a source, but it doesn't because the interface that cloudvolume presents seems to vary according to the type of mesh source – and to me that was a bit surprising. So it would be good to know if there is a recommended pattern for fetch+save.

PS MeshParty is v nice, but I've found it a bit awkward to install, so for people in the group who just want to fetch and save, I'd rather have it as an optional dependency.

@fcollman
Collaborator

Agreed about the install process; we have work to do there. Mostly I think by making more dependencies optional and writing better cross-platform instructions.

Regarding interfaces, I can see how the LOD dictionary is both necessary and confusing. I would agree with you that a more uniform-looking interface is a better design ideal: for example, making the highest LOD the default and only returning that, while allowing people to specify a LOD or specify that they actually want the full dictionary.

@william-silversmith
Contributor

william-silversmith commented Jun 13, 2020

@jefferis It looks like this was an oversight. I'll make sure to add save back in there. The implementation of save is very simple and worth understanding:

def save(self, segids, filepath=None, file_format='ply'):
    segids = toiter(segids)

    mesh = self.get(segids, fuse=True, remove_duplicate_vertices=True)

    if file_format == 'obj':
      data = mesh.to_obj()
    elif file_format == 'ply':
      data = mesh.to_ply()
    elif file_format == 'precomputed':
      data = mesh.to_precomputed()
    else:
      raise NotImplementedError('Only .obj, .ply, and precomputed are currently supported.')

    if not filepath:
      filepath = str(segids[0]) + "." + file_format
      if len(segids) > 1:
        filepath = "{}_{}.{}".format(segids[0], segids[-1], file_format)

    try:
      filepath.write(data)
    except AttributeError:
      with open(filepath, 'wb') as f:
        f.write(data)

As you can see, each Mesh object comes equipped with a way to convert it into an obj or ply byte string. You can then do what you will with it. I've often thought that the save function is somewhat superfluous and too tightly coupled with get, but it is convenient for someone without a deep understanding of CloudVolume.
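So even without a save method wired up, a one-off along these lines works (a sketch; it reuses the dataset and illustrative ID from the sample code at the top of this thread):

import cloudvolume

vol = cloudvolume.CloudVolume('https://storage.googleapis.com/fafb-ffn1-20190805/segmentation')
segid = 710435991
mesh = vol.mesh.get(segid)[segid]

# to_ply()/to_obj() return byte strings, so writing a file is a one-liner.
with open('{}.ply'.format(segid), 'wb') as f:
    f.write(mesh.to_ply())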

@perlman
Contributor Author

perlman commented Jun 13, 2020

In contrast (I think) to some other mesh formats, the get method cannot accept strings (only ints). Would it be hard to change this? I often find 64 bit int support fragile across programming languages (e.g. base R only has int32, so r to python calls doesn't support int64).

I'll let @william-silversmith chime in. uint64s are used in many functions, all of which would need a hook added to convert. Would it make sense to keep a small wrapper, in Python, alongside your R code for doing the string-to-uint64 conversion?

It's just that I wrote some wrapper code to fetch/save meshes from FlyWire and I was imagining that it would just work when I switched to hemibrain as a source, but it doesn't because the interface that cloudvolume presents seems to vary according to the type of mesh source – and to me that was a bit surprising. So it would be good to know if there is a recommended pattern for fetch+save.

https://github.com/seung-lab/cloud-volume/blob/master/cloudvolume/datasource/precomputed/mesh/multilod.py#L72
vs
https://github.com/seung-lab/cloud-volume/blob/master/cloudvolume/datasource/graphene/mesh/unsharded.py#L137

With the exception of fusing (not implemented yet), the return "should" be the same: { segid: Mesh }.

Please send me a snippet of how you are accessing the meshes and I'll see what I can do. (It's quite likely I broke something in my continual tweaking.)

Alternatively is there a general pattern for saving (e.g. in obj) that is guranteed to work across mesh formats?

I am shoving the data into a standard cloud-volume Mesh object, which ought to behave the same as the others... Maybe there is an issue with the lack of normals?

regarding interfaces, I can see how the LOD dictionary is both necessary and confusing. I would agree with you that a more uniform looking interface is a better design ideal. For example, making the highest LOD the default, and only returning that by default, and allowing people to specify LOD, or specify that they actually want the full dictionary.

@fcollman: Using lod=0 by default is what I converged on for the 'final' version, with no explicit support for multiple LODs. If someone has a use case for getting all LODs, I'd implement a separate function and use a more efficient range request to fetch them all at once.

@perlman
Contributor Author

perlman commented Jun 13, 2020

@jefferis It looks like this was an oversight. I'll make sure to add save back in there. The implementation of save is very simple and worth understanding

Aha! This function was never on my radar as something to implement.

Should it come for free in a base class?

It looks like I failed at having UnshardedLegacyPrecomputedMeshSource extend any existing class...

@william-silversmith
Contributor

I've been using UnshardedLegacyPrecomputedMeshSource as a base class in a few places now, but the more places we use it the more worried I become about the inheritance structure becoming brittle and hard to reason about. Adding it as a base class will fix this case though.

@perlman
Contributor Author

perlman commented Jun 14, 2020

I've been using UnshardedLegacyPrecomputedMeshSource as a base class in a few places now, but the more places we use it the more worried I become about the inheritance structure becoming brittle and hard to reason about. Adding it as a base class will fix this case though.

It sounds like a BaseMeshSource would be useful? I'm not entirely sure which functions should be defined there, but it would go a long way towards having a standard interface for mesh access.
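Something roughly like this is what I'm imagining (the name and contents are hypothetical, just a sketch):

class BaseMeshSource(object):
    """Hypothetical shared interface for all mesh sources."""

    def get(self, segids, **kwargs):
        # Each concrete source implements this and returns { segid: Mesh }.
        raise NotImplementedError()

    def save(self, segids, filepath=None, file_format='ply'):
        # Shared convenience built on get(), along the lines of the snippet above.
        for segid, mesh in self.get(segids).items():
            data = mesh.to_obj() if file_format == 'obj' else mesh.to_ply()
            with open(filepath or '{}.{}'.format(segid, file_format), 'wb') as f:
                f.write(data)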

@william-silversmith
Contributor

william-silversmith commented Jun 14, 2020 via email

@jefferis

Thanks everyone for the additional info, and thanks @william-silversmith for the save example.

Just a note that for my own libraries, for user-facing code equivalent to this:

def save(self, segids, filepath=None, file_format='ply'):
    segids = toiter(segids)

I have a function that might replace toiter, which converts reasonable inputs (including strings and numerics) to int64 ids with some basic validation checks. But my libraries are mostly intended for end users, whereas I guess CloudVolume targets a mix of developers and end users.
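In Python terms I imagine something roughly like this (a sketch only; coerce_segids is a made-up name and the accepted inputs are my assumption):

import numpy as np

def coerce_segids(segids):
    # Accept an int, a digit string, or an iterable of either, and return
    # a list of np.uint64 ids after some basic validation.
    if isinstance(segids, (int, np.integer, str)):
        segids = [segids]
    out = []
    for segid in segids:
        if isinstance(segid, str):
            if not segid.isdigit():
                raise ValueError("Not a non-negative integer id: " + segid)
            segid = int(segid)
        if segid < 0:
            raise ValueError("Segment ids must be non-negative: " + str(segid))
        out.append(np.uint64(segid))
    return out

# e.g. coerce_segids(["710435991", 8392766771]) -> [np.uint64 ids]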
