Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connected components #25

Merged
merged 12 commits into from
Jul 28, 2023
Merged

Connected components #25

merged 12 commits into from
Jul 28, 2023

Conversation

LorenzLamm
Copy link
Collaborator

@LorenzLamm LorenzLamm commented Jul 25, 2023

This PR is about two things:

  1. Added functionality to compute connected components of output segmentation. This will assign different labels to different components to differentiate between different membranes. Also, it is useful to use this in combination with ColabSeg, which recommends connected components as inputs.
  2. Dataloading: As suggested in remove flags from tomogram loading #23 and Pyto tomoloading #22 (review) , I adjusted the dataloading: First, tomogram loading does not depend on flags whether to return a header anymore, but instead returns a Tomogram object which always contains data, header and voxel size. Second, I adjusted the data storing to be more efficient. Depending on the tomogram data type, it will try to find the most efficient .mrc file mode to store the data in. Thus, particularly segmentations can be stored much more efficiently (e.g. 2GB to 500MB). That's still way too much, but the best we can do with the mrc file format I think.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added description of connected components to docs.

tomo = load_tomogram(tomo_path)
labels = load_tomogram(seg_path)
tomo = load_tomogram(tomo_path).data
labels = load_tomogram(seg_path).data
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This script does not care about the tomogram header, so only the data is read in

sw_roi_size=sliding_window_size,
)


@cli.command(name="components", no_args_is_help=True)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is now possible to compute connected components from the segmentation.
The command can be either embedded in the segmentation directly (above) or separately after computation of the segmentation.

@@ -209,12 +225,30 @@ def write_nifti(out_file: str, image: np.ndarray) -> None:
sitk.WriteImage(out_image, out_file)


@dataclass
class Tomogram:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new dataclass contains data, header and voxel size of the tomogram

return data, tomogram.voxel_size
return data

_dtype_to_mode = {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the different data modes used by the mrcfile package

}


def convert_dtype(tomogram: np.ndarray) -> np.ndarray:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going through potential mrcfile data types and trying to find the most efficient one for the input tomogram.

else:
data = tomogram
header = None
data = convert_dtype(data)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, store_tomogram used the numpy array data type to choose the mrc file data type (automatically in mrcfile package). Now, it tries to find the most efficient data representation.

setattr(out_mrc.header, attr, getattr(header, attr))
out_mrc.header.mode = dtype_mode
if voxel_size is not None:
out_mrc.voxel_size = voxel_size
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The voxel_size argument is mainly used for rescaling the tomogram, where pixel size changes, but the original header still contains the original pixel size.

structure = np.ones((3, 3, 3))
labeled_array, num_features = ndimage.label(binary_seg, structure=structure)

# remove small clusters
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still relatively slow, particularly for many different connected components.

@LorenzLamm LorenzLamm marked this pull request as ready for review July 26, 2023 09:17
Copy link
Member

@alisterburt alisterburt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @LorenzLamm - looking great as always, a few non-blocking comments below - please move forward as you see best fit and merge when you're happy!

Comment on lines +40 to +50
if size_thres is not None:
print(
"Removing components smaller than",
size_thres,
"voxels. (This can take a while)",
)
sizes = ndimage.sum(binary_seg, labeled_array, range(1, num_features + 1))
too_small = np.nonzero(sizes < size_thres)[0] + 1 # features labeled from 1
for feat_nr in too_small[::-1]: # iterate in reverse order
labeled_array[labeled_array == feat_nr] = 0
labeled_array[labeled_array > feat_nr] -= 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done it in 2D with a method which may be faster - see
https://github.com/teamtomo/fidder/blob/3ac64b13db256b598bf8951eb9a66137c04b86b6/src/fidder/predict/probabilities_to_mask.py#L5-L32
and specifically this utility function
https://github.com/teamtomo/fidder/blob/3ac64b13db256b598bf8951eb9a66137c04b86b6/src/fidder/utils.py#L128-L156

it's fast but memory requirements go way up for large numbers of regions, in 3D this may be limiting extremely quickly

maybe @kevinyamauchi has tips for doing this quickly in 3D?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I tried it, and it did not give a huge boost in speed, unfortunately. But as you say, memory consumption is getting huge quickly. I tested it with a relatively noisy tomogram with 300+ components, leading to 300x1GB channels. Was kind of feasible on our cluster, but I guess for many users it won't be :/

Comment on lines +235 to +245
data : np.ndarray
The 3D array data representing the tomogram.
header : Any
The header information from the tomogram file.
voxel_size : Any, optional
The voxel size of the tomogram.
"""

data: np.ndarray
header: Any
voxel_size: Optional[Any] = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great as a stopgap but I don't love the Any types here - longer term it might be worth constructing a specific model for the information we care about from the header

class Header(BaseModel):
    bla: float
    bla2: tuple[int, int, int]

    @classmethod
    def from_file(cls, path: PathLike):
        with mrcfile.open(path, header_only=True, permissive=True) as mrc:
            # get stuff from header
            ...
            cls(**stuff)

what do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's much cleaner, thanks! This would also resolve the issue of skipping several header objects in the store_tomogram function, because we can easily control what properties are stored in the header.
Not sure which ones are the best header items to keep. Need to figure out what's necessary e.g. for compatibility with IMOD or Amira. But will try to clean this up in a follow up :)

Comment on lines +130 to +131
store_connected_components: bool = False,
connected_component_thres: int = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be factored out into a separate program membrain-seg label rather than added to the program for segmenting - it feels like a postprocessing step and people might want to use it on pre-existing segmentations

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, currently, both options are implemented: Users can either decide to directly output connected components from the membrain segment function, or use the membrain components function to process already existing segmentation files.

@LorenzLamm LorenzLamm merged commit 25e7ee3 into main Jul 28, 2023
11 checks passed
@LorenzLamm LorenzLamm deleted the connected_components branch July 28, 2023 09:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants