[BUG] Memory leak when importing CocoDetection Dataset with random category_id into FiftyOne #4293
Comments
Thanks @patrontheo. I can't find anything about the COCO format that says the IDs must be sequential, though the examples that are usually shown are that way. It seems to be going wrong in this function: https://github.com/voxel51/fiftyone/blob/develop/fiftyone/utils/coco.py#L1408-L1450 I guess the intent is to give every class a name even if it happened to be missed in the categories dict. However, if the IDs are not intended to be sequential, you run into the problem here. I think a reasonable fix is: if the max ID is more than x times the number of categories, assume we don't have to fill in 0 through N and skip that whole for-loop. Maybe x=10 or so? Are you able to contribute that fix? We always appreciate new (and existing) contributors! |
There may be downstream effects since that function returns a list of classes and would now return a dict. Might need some fixing up for this case. |
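The heuristic suggested above could be sketched roughly as follows. This is only an illustration, not FiftyOne's actual code: the function name `build_classes`, the `sparse_factor` parameter, and the use of `str(i)` as the placeholder for missing IDs are all assumptions.

```python
def build_classes(categories, sparse_factor=10):
    """Hypothetical helper (not part of FiftyOne).

    Returns a dense list of class names when the category IDs look
    roughly sequential, and an {id: name} dict when they look sparse.
    """
    id_to_name = {c["id"]: c["name"] for c in categories}
    if not id_to_name:
        return []

    max_id = max(id_to_name)
    if max_id > sparse_factor * len(id_to_name):
        # IDs look sparse: skip the 0..max_id fill and keep the mapping
        return dict(sorted(id_to_name.items()))

    # IDs look dense: fill gaps with a placeholder (assumed to be str(i))
    return [id_to_name.get(i, str(i)) for i in range(max_id + 1)]
```

Note that the return type now depends on the input, which is exactly the downstream concern raised above: callers that expect a list would need to handle the dict case.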
Thanks for the pointer. I'll try to have a look tomorrow and make a PR. Let's say we have:
categories = [
    {
        "id": 10,
        "name": "Solar Panel",
        "supercategory": "root"
    }
]
The function Do you have any idea why this is wanted? |
I'm not entirely sure to be honest, based on cursory research. Would have to dig/ask around some more. But because it doesn't really hurt anything (except in this case), I'd rather not change the behavior in general - on the off chance someone is relying on that behavior |
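To make the behavior under discussion concrete, here is a minimal sketch of a dense fill applied to the single-category example. It assumes the function produces one label per integer from 0 to the max category ID, as described earlier in the thread; `dense_fill` is a hypothetical name and the `str(i)` placeholder is an assumption.

```python
def dense_fill(categories):
    """Assumed behavior: one label per integer from 0 to the max category ID."""
    id_to_name = {c["id"]: c["name"] for c in categories}
    max_id = max(id_to_name, default=-1)
    # Missing IDs get a placeholder; the real placeholder format may differ
    return [id_to_name.get(i, str(i)) for i in range(max_id + 1)]


categories = [{"id": 10, "name": "Solar Panel", "supercategory": "root"}]
classes = dense_fill(categories)
print(len(classes))  # 11
print(classes[-1])   # Solar Panel
```

The list length follows the max ID rather than the number of categories, so a single category with a very large ID would require a correspondingly large list.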
@swheaton I see some options to fix this:
|
@patrontheo thanks for continuing to look into it.
Feel free to put up a draft PR if you have any code so we can work there instead of in text here. It would also be great to have this scenario as a test if we're going to support it. |
The fact that It's totally fine to change the internal implementation details to store a dict mapping category IDs -> class label strings in all cases. The only "public" invariants that need to be maintained are:
Here's some test code that can be used to verify that the public-facing elements of the COCO I/O interface are working as desired:
import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset(
    "coco-2017",
    split="validation",
    max_samples=50,
)

# CHANGE IN BEHAVIOR: this can just contain a sorted list of the category names
# it doesn't need to include interpolated '#' strings like it currently does
assert len(dataset.default_classes) == 80

# This should contain the category mappings from the input COCO JSON file
assert len(dataset.info["categories"]) == 80
print(dataset.info["categories"])
"""
[
    {'supercategory': 'person', 'id': 1, 'name': 'person'},
    {'supercategory': 'vehicle', 'id': 2, 'name': 'bicycle'},
    ...
    {'supercategory': 'indoor', 'id': 90, 'name': 'toothbrush'}
]
"""

# This should check for `dataset.info["categories"]` in the above format
# If found, the exported category IDs should be retained
# If not found, a new category map that uses 1,2,...,n should be generated
dataset.export(
    export_dir="/tmp/coco1",
    dataset_type=fo.types.COCODetectionDataset,
    # this is optional, as COCODetectionDatasetExporter.log_collection() will
    # automatically pull this from the dataset during export
    # info=dataset.info,
)

# If an explicit `classes` list is provided, only those classes are exported
# This should still respect `dataset.info["categories"]` if available, exporting
# only the specified `classes` but retaining the predefined category IDs
dataset.export(
    export_dir="/tmp/coco2",
    dataset_type=fo.types.COCODetectionDataset,
    classes=["cat"],
)
|
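The export rules described in the comments above (retain stored IDs when `dataset.info["categories"]` exists, otherwise number 1,2,...,n, and filter by `classes` when given) could be sketched in plain Python like this. `resolve_export_categories` and its parameters are hypothetical names for illustration only; this is not how the FiftyOne exporter is implemented.

```python
def resolve_export_categories(info, dataset_classes, classes=None):
    """Hypothetical helper (not part of FiftyOne).

    info: dict that may hold a COCO-style "categories" list
    dataset_classes: class names observed in the dataset
    classes: optional explicit list of class names to export
    """
    stored = (info or {}).get("categories")

    if stored:
        if classes is None:
            # Keep the stored category map, IDs included
            return list(stored)
        keep = set(classes)
        # Export only the requested classes but retain their stored IDs
        return [c for c in stored if c["name"] in keep]

    # No stored map: generate fresh IDs 1, 2, ..., n
    names = list(classes) if classes is not None else sorted(dataset_classes)
    return [{"id": i, "name": name} for i, name in enumerate(names, start=1)]
```

In this sketch, a requested class that is absent from the stored map is silently skipped; whether that should instead raise an error or get a new ID is a design decision the thread does not settle.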
Thanks a lot, I'll try to draft a PR when I get a bit of time |
Thanks we will take a look!! |
Describe the problem
When adding a dataset to FiftyOne with the following code, the cell runs until it crashes the notebook with an out-of-memory error.
This happens if the dataset contains random category ids (instead of ids ranging from 0 to num_categories-1).
FiftyOne should either handle this case (random category ids) or raise a clear error message (and definitely not eat all the memory until it gets killed).
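For illustration, a minimal COCO-style labels file with a non-sequential category id can be generated as below. This is a hypothetical stand-in, not the file from the original report: the id value, image size, and bounding box are made up, and only the standard library is used.

```python
import json
import os
import tempfile

# Hypothetical data: one image, one annotation, one category with a large id
coco = {
    "images": [
        {"id": 1, "file_name": "image.jpg", "width": 640, "height": 480}
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1_000_000_000,
            "bbox": [10, 20, 100, 50],  # [x, y, width, height]
            "area": 5000,
            "iscrowd": 0,
        }
    ],
    "categories": [
        {"id": 1_000_000_000, "name": "Solar Panel", "supercategory": "root"}
    ],
}

# Write the labels to a temporary directory
labels_path = os.path.join(tempfile.mkdtemp(), "labels.json")
with open(labels_path, "w") as f:
    json.dump(coco, f, indent=2)
```

With one category and an id of one billion, any importer that allocates an entry per integer up to the max id would need about a billion entries for a single class.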
Code to reproduce issue
Example of a json file that will lead to this issue (you have to have an image with filename image.jpg):
System information
Python version (python --version): 3.11.6
FiftyOne version (fiftyone --version): 0.23.7
Willingness to contribute
The FiftyOne Community encourages bug fix contributions. Would you or another
member of your organization be willing to contribute a fix for this bug to the
FiftyOne codebase?