[BUG] Memory leak when importing CocoDetection Dataset with random category_id into FiftyOne #4293

patrontheo · 2024-04-19T14:29:09Z

Describe the problem

When adding a dataset into fiftyone with the following code, the cell run until it crashes the notebook because of an out-of-memory error.
This happens if the dataset contains random category ids (instead of ids ranging from 0 to num_categories-1).
Fiftyone should either handle this case (random category ids), or write a clear error message (and definitely not eat all the memory until it gets killed).

Code to reproduce issue

ds = fo.Dataset.from_dir(
    name=f"name",
    overwrite=True, 
    persistent=True, 
    dataset_type=fo.types.COCODetectionDataset,
    data_path='path_to_images',  
    labels_path='path_to_json', 
    label_types=['segmentations'],
    use_polylines=False, 
    include_id=True,
)

Example of a json file that will lead to this issue (you have to have an image with filename image.jpg):

{
  "categories": [
    {
      "id": 74673216,
      "name": "Solar Panel",
      "supercategory": "root"
    }
  ],
  "images": [
    {
      "id": 789,
      "file_name": "image.jpg",
      "height": 1024,
      "width": 1024,
      "extra": {}
    }
  ],
  "annotations": [
       {
      "id": 15567,
      "image_id": 789,
      "category_id": 74673216,
      "segmentation": [
        [
          445.6344437096958,
          354.34683544084845,
          452.1216184704017,
          348.23038026732564,
          411.9939701135346,
          303.93243513679033,
          406.06287473360334,
          310.2342501039047
        ]
      ],
      "area": 521.9080653650577,
      "bbox": [
        406.06287473360334,
        303.93243513679033,
        46.05874373679836,
        50.41440030405812
      ],
      "iscrowd": 0,
      "extra": {}
    }
  ]
}

System information

OS Platform and Distribution (e.g., Linux Ubuntu 22.04): Ubuntu 20.04.6
Python version (python --version): 3.11.6
FiftyOne version (fiftyone --version): 0.23.7
FiftyOne installed from (pip or source): pip

Willingness to contribute

The FiftyOne Community encourages bug fix contributions. Would you or another
member of your organization be willing to contribute a fix for this bug to the
FiftyOne codebase?

Yes. I can contribute a fix for this bug independently
Yes. I would be willing to contribute a fix for this bug with guidance
from the FiftyOne community
No. I cannot contribute a bug fix at this time

The text was updated successfully, but these errors were encountered:

swheaton · 2024-04-19T18:04:30Z

Thanks @patrontheo . I'm not finding anywhere that discusses coco format, that the IDs must be sequential, though the examples that are often shown are that way.

It seems to be going bad in this function: https://github.com/voxel51/fiftyone/blob/develop/fiftyone/utils/coco.py#L1408-L1450

I guess the intent is to give all classes a name even if they happened to be missed in the categories dict. However if they are not intended to be sequential then you run into the problem here.

I think a reasonable thing to do is if the max ID is more than x times the number of categories, assume we don't have to fill in 0 through N and skip that whole for-loop. Maybe, x=10 or so?

Are you able to contribute that fix? We always appreciate new (and existing) contributors!

swheaton · 2024-04-19T18:08:22Z

There may be downstream effects since that function returns a list of classes and would now return a dict. Might need some fixing up for this case.

patrontheo · 2024-04-19T19:41:56Z

Thanks for the pointer, I'll try to have a look tomorrow, and make a PR.
However I'm not sure why the for loop is useful. Why add classes that are not present in the given categories ?

Let's say we have:

categories = [
    {
      "id": 10,
      "name": "Solar Panel",
      "supercategory": "root"
    }
  ]

The function parse_coco_categories returns:
classes: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'Solar Panel']
supercategory_map: {'Solar Panel': {'id': 10, 'name': 'Solar Panel', 'supercategory': 'root'}}

Do you have any idea why this is wanted ?

swheaton · 2024-04-19T20:52:44Z

I'm not entirely sure to be honest, based on cursory research. Would have to dig/ask around some more.

But because it doesn't really hurt anything (except in this case), I'd rather not change the behavior in general - on the off chance someone is relying on that behavior

patrontheo · 2024-04-20T17:34:48Z

@swheaton
In _get_object_label_and_attributes in coco.py, the label is obtained using label = classes[self.category_id]. Then, if the category_is very large, classes[self.category_id] will return IndexError: list index out of range (that's probably why classes is created using the for loop).

I see some options to fix this:

If the given category ids are not sequential (or are very large), raise an error, explaining that the annotation file should be changed to match this criteria. This could be a bit frustrating for the user that fiftyone does not handle this case.
- We could provide the user a function that he can call, and that would read the annotation file, remap the ids to be sequential, and save the new annotation file.
- We could pass a flag remap_ids(False by default), that would remap the ids to something sequential (although I would need some directions on how to pass the flag and in which functions).
If the given category ids are not sequential, we could remap them to 0 to num_categories-1 in parse_coco_categories. Although if the user try to get objects using the category_id there is in the annotation file, it would not work since we changed the ids. Maybe if a remapping is performed, we could print a warning for the user ?

swheaton · 2024-04-22T14:48:41Z

@patrontheo thanks for continuing to look into it.
These 2 things seem to fix it.

If we have decided to not fill the classes list from 0..max(ids) as discussed above, then we can return a dict from id->name instead. When we go to index via category ID, it will work just the same.
In COCODetectionDatasetImporter.setup() around line 560, we have to set info["classes"] = list(classes.values()) if classes is a dict, so that we set the default_classes property on the dataset correctly (we'll get a document validation error otherwise)

feel free to put up a draft PR if you have any code so we can work there instead of in text here.

would also be great to have this scenario as a test if we're going to support it.

brimoor · 2024-04-22T16:33:46Z

The fact that parse_coco_categories() returns an "interpolated" list is an artifact of a peculiarity of the original COCO dataset, which contained 80 classes but the category IDs ranged from 0-90.

It's totally fine to change the internal implementation details to store a dict mapping category IDs -> class label strings in all cases.

The only "public" invariants that need to be maintained are:

When importing a COCO dataset, dataset.default_classes should be populated with the classes list from the COCO JSON file
- this list currently contains the "interpolated" list if category IDs are not sequential, but this can be changed to contain just a sorted list of the actual category names
When importing a COCO dataset, dataset.info should be populated with the info from the COCO JSON file
- this is already true; just needs to stay that way!
When exporting a COCO dataset, if a categories list is available via dataset.info or provided via the info argument of the COCO exporter, those category IDs should be used
- currently, I believe preexisting category IDs are not actually used by the exporter; I'm sneaking in a feature request here while you're in there 🤓

Here's some test code that can be used to verify that the public-facing elements of the COCO I/O interface are working as desired:

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset(
    "coco-2017",
    split="validation",
    max_samples=50,
)

# CHANGE IN BEHAVIOR: this can just contain a sorted list of the category names
# it doesn't need to include interpolated '#' strings like it currently does
assert len(dataset.default_classes) == 80

# This should contain the category mappings from the input COCO JSON file
assert len(dataset.info["categories"]) == 80
print(dataset.info["categories"])
"""
[
    {'supercategory': 'person', 'id': 1, 'name': 'person'},
    {'supercategory': 'vehicle', 'id': 2, 'name': 'bicycle'},
    ...
    {'supercategory': 'indoor', 'id': 90, 'name': 'toothbrush'}
]
"""

# This should check for `dataset.info["categories"]` in the above format
# If found, the exported category IDs should be retained
# If not found, a new category map that uses 1,2,...,n should be generated
dataset.export(
    export_dir="/tmp/coco1",
    dataset_type=fo.types.COCODetectionDataset,

    # this is optional, as COCODetectionDatasetExporter.log_collection() will
    # automatically pull this from the dataset during export
    # info=dataset.info,
)

# If an explicit `classes` list is provided, only those classes are exported
# This should still respect `dataset.info["categories"]` if available, exporting
# only the specified `classes` but retaining the predefined category IDs
dataset.export(
    export_dir="/tmp/coco2",
    dataset_type=fo.types.COCODetectionDataset,
    classes=["cat"],
)

patrontheo · 2024-04-22T16:41:52Z

Thanks a lot, I'll try to draft a PR when I get a bit of time

patrontheo · 2024-04-25T07:50:50Z

Uuuh it was a bit more complicated than I thought 😂.
@brimoor I drafted a PR (#4309), note that I'm not sure what to do regarding add_coco_labels, as explained in the PR.

swheaton · 2024-04-25T12:52:53Z

Thanks we will take a look!!

patrontheo added the bug Bug fixes label Apr 19, 2024

patrontheo mentioned this issue Apr 25, 2024

fix: coco category ids can now be not sequential - avoiding memory leak #4309

Merged

7 tasks

brimoor mentioned this issue May 6, 2024

Support non-sequential COCO category IDs #4354

Merged

brimoor closed this as completed May 9, 2024

brimoor mentioned this issue May 23, 2024

[BUG] Memory leak when using add_coco_labels for instance segmentation with coco_id_field set #4407

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Memory leak when importing CocoDetection Dataset with random category_id into FiftyOne #4293

[BUG] Memory leak when importing CocoDetection Dataset with random category_id into FiftyOne #4293

patrontheo commented Apr 19, 2024

swheaton commented Apr 19, 2024

swheaton commented Apr 19, 2024

patrontheo commented Apr 19, 2024

swheaton commented Apr 19, 2024

patrontheo commented Apr 20, 2024

swheaton commented Apr 22, 2024

brimoor commented Apr 22, 2024 •

edited

Loading

patrontheo commented Apr 22, 2024

patrontheo commented Apr 25, 2024

swheaton commented Apr 25, 2024

[BUG] Memory leak when importing CocoDetection Dataset with random category_id into FiftyOne #4293

[BUG] Memory leak when importing CocoDetection Dataset with random category_id into FiftyOne #4293

Comments

patrontheo commented Apr 19, 2024

Describe the problem

Code to reproduce issue

System information

Willingness to contribute

swheaton commented Apr 19, 2024

swheaton commented Apr 19, 2024

patrontheo commented Apr 19, 2024

swheaton commented Apr 19, 2024

patrontheo commented Apr 20, 2024

swheaton commented Apr 22, 2024

brimoor commented Apr 22, 2024 • edited Loading

patrontheo commented Apr 22, 2024

patrontheo commented Apr 25, 2024

swheaton commented Apr 25, 2024

brimoor commented Apr 22, 2024 •

edited

Loading