-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Connected components #25
Conversation
… segmentation time.
docs/Usage/Segmentation.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added description of connected components to docs.
tomo = load_tomogram(tomo_path) | ||
labels = load_tomogram(seg_path) | ||
tomo = load_tomogram(tomo_path).data | ||
labels = load_tomogram(seg_path).data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This script does not care about the tomogram header, so only the data is read in
sw_roi_size=sliding_window_size, | ||
) | ||
|
||
|
||
@cli.command(name="components", no_args_is_help=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is now possible to compute connected components from the segmentation.
The command can be either embedded in the segmentation directly (above) or separately after computation of the segmentation.
@@ -209,12 +225,30 @@ def write_nifti(out_file: str, image: np.ndarray) -> None: | |||
sitk.WriteImage(out_image, out_file) | |||
|
|||
|
|||
@dataclass | |||
class Tomogram: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This new dataclass contains data, header and voxel size of the tomogram
return data, tomogram.voxel_size | ||
return data | ||
|
||
_dtype_to_mode = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the different data modes used by the mrcfile package
} | ||
|
||
|
||
def convert_dtype(tomogram: np.ndarray) -> np.ndarray: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going through potential mrcfile data types and trying to find the most efficient one for the input tomogram.
else: | ||
data = tomogram | ||
header = None | ||
data = convert_dtype(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, store_tomogram
used the numpy array data type to choose the mrc file data type (automatically in mrcfile package). Now, it tries to find the most efficient data representation.
setattr(out_mrc.header, attr, getattr(header, attr)) | ||
out_mrc.header.mode = dtype_mode | ||
if voxel_size is not None: | ||
out_mrc.voxel_size = voxel_size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The voxel_size argument is mainly used for rescaling the tomogram, where pixel size changes, but the original header still contains the original pixel size.
structure = np.ones((3, 3, 3)) | ||
labeled_array, num_features = ndimage.label(binary_seg, structure=structure) | ||
|
||
# remove small clusters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still relatively slow, particularly for many different connected components.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey @LorenzLamm - looking great as always, a few non-blocking comments below - please move forward as you see best fit and merge when you're happy!
if size_thres is not None: | ||
print( | ||
"Removing components smaller than", | ||
size_thres, | ||
"voxels. (This can take a while)", | ||
) | ||
sizes = ndimage.sum(binary_seg, labeled_array, range(1, num_features + 1)) | ||
too_small = np.nonzero(sizes < size_thres)[0] + 1 # features labeled from 1 | ||
for feat_nr in too_small[::-1]: # iterate in reverse order | ||
labeled_array[labeled_array == feat_nr] = 0 | ||
labeled_array[labeled_array > feat_nr] -= 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done it in 2D with a method which may be faster - see
https://github.com/teamtomo/fidder/blob/3ac64b13db256b598bf8951eb9a66137c04b86b6/src/fidder/predict/probabilities_to_mask.py#L5-L32
and specifically this utility function
https://github.com/teamtomo/fidder/blob/3ac64b13db256b598bf8951eb9a66137c04b86b6/src/fidder/utils.py#L128-L156
it's fast but memory requirements go way up for large numbers of regions, in 3D this may be limiting extremely quickly
maybe @kevinyamauchi has tips for doing this quickly in 3D?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I tried it, and it did not give a huge boost in speed, unfortunately. But as you say, memory consumption is getting huge quickly. I tested it with a relatively noisy tomogram with 300+ components, leading to 300x1GB channels. Was kind of feasible on our cluster, but I guess for many users it won't be :/
data : np.ndarray | ||
The 3D array data representing the tomogram. | ||
header : Any | ||
The header information from the tomogram file. | ||
voxel_size : Any, optional | ||
The voxel size of the tomogram. | ||
""" | ||
|
||
data: np.ndarray | ||
header: Any | ||
voxel_size: Optional[Any] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great as a stopgap but I don't love the Any
types here - longer term it might be worth constructing a specific model for the information we care about from the header
class Header(BaseModel):
bla: float
bla2: tuple[int, int, int]
@classmethod
def from_file(cls, path: PathLike):
with mrcfile.open(path, header_only=True, permissive=True) as mrc:
# get stuff from header
...
cls(**stuff)
what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's much cleaner, thanks! This would also resolve the issue of skipping several header objects in the store_tomogram
function, because we can easily control what properties are stored in the header.
Not sure which ones are the best header items to keep. Need to figure out what's necessary e.g. for compatibility with IMOD or Amira. But will try to clean this up in a follow up :)
store_connected_components: bool = False, | ||
connected_component_thres: int = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be factored out into a separate program membrain-seg label
rather than added to the program for segmenting - it feels like a postprocessing step and people might want to use it on pre-existing segmentations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, currently, both options are implemented: Users can either decide to directly output connected components from the membrain segment
function, or use the membrain components
function to process already existing segmentation files.
This PR is about two things: