Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Memory leak when importing CocoDetection Dataset with random category_id into FiftyOne #4293

Closed
1 of 3 tasks
patrontheo opened this issue Apr 19, 2024 · 10 comments · Fixed by #4309 or #4354
Closed
1 of 3 tasks
Labels
bug Bug fixes

Comments

@patrontheo
Copy link
Contributor

Describe the problem

When adding a dataset into fiftyone with the following code, the cell run until it crashes the notebook because of an out-of-memory error.
This happens if the dataset contains random category ids (instead of ids ranging from 0 to num_categories-1).
Fiftyone should either handle this case (random category ids), or write a clear error message (and definitely not eat all the memory until it gets killed).

Code to reproduce issue

ds = fo.Dataset.from_dir(
    name=f"name",
    overwrite=True, 
    persistent=True, 
    dataset_type=fo.types.COCODetectionDataset,
    data_path='path_to_images',  
    labels_path='path_to_json', 
    label_types=['segmentations'],
    use_polylines=False, 
    include_id=True,
)

Example of a json file that will lead to this issue (you have to have an image with filename image.jpg):

{
  "categories": [
    {
      "id": 74673216,
      "name": "Solar Panel",
      "supercategory": "root"
    }
  ],
  "images": [
    {
      "id": 789,
      "file_name": "image.jpg",
      "height": 1024,
      "width": 1024,
      "extra": {}
    }
  ],
  "annotations": [
       {
      "id": 15567,
      "image_id": 789,
      "category_id": 74673216,
      "segmentation": [
        [
          445.6344437096958,
          354.34683544084845,
          452.1216184704017,
          348.23038026732564,
          411.9939701135346,
          303.93243513679033,
          406.06287473360334,
          310.2342501039047
        ]
      ],
      "area": 521.9080653650577,
      "bbox": [
        406.06287473360334,
        303.93243513679033,
        46.05874373679836,
        50.41440030405812
      ],
      "iscrowd": 0,
      "extra": {}
    }
  ]
}

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 22.04): Ubuntu 20.04.6
  • Python version (python --version): 3.11.6
  • FiftyOne version (fiftyone --version): 0.23.7
  • FiftyOne installed from (pip or source): pip

Willingness to contribute

The FiftyOne Community encourages bug fix contributions. Would you or another
member of your organization be willing to contribute a fix for this bug to the
FiftyOne codebase?

  • Yes. I can contribute a fix for this bug independently
  • Yes. I would be willing to contribute a fix for this bug with guidance
    from the FiftyOne community
  • No. I cannot contribute a bug fix at this time
@patrontheo patrontheo added the bug Bug fixes label Apr 19, 2024
@swheaton
Copy link
Contributor

Thanks @patrontheo . I'm not finding anywhere that discusses coco format, that the IDs must be sequential, though the examples that are often shown are that way.

It seems to be going bad in this function: https://github.com/voxel51/fiftyone/blob/develop/fiftyone/utils/coco.py#L1408-L1450

I guess the intent is to give all classes a name even if they happened to be missed in the categories dict. However if they are not intended to be sequential then you run into the problem here.

I think a reasonable thing to do is if the max ID is more than x times the number of categories, assume we don't have to fill in 0 through N and skip that whole for-loop. Maybe, x=10 or so?

Are you able to contribute that fix? We always appreciate new (and existing) contributors!

@swheaton
Copy link
Contributor

There may be downstream effects since that function returns a list of classes and would now return a dict. Might need some fixing up for this case.

@patrontheo
Copy link
Contributor Author

Thanks for the pointer, I'll try to have a look tomorrow, and make a PR.
However I'm not sure why the for loop is useful. Why add classes that are not present in the given categories ?

Let's say we have:

categories = [
    {
      "id": 10,
      "name": "Solar Panel",
      "supercategory": "root"
    }
  ]

The function parse_coco_categories returns:
classes: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'Solar Panel']
supercategory_map: {'Solar Panel': {'id': 10, 'name': 'Solar Panel', 'supercategory': 'root'}}

Do you have any idea why this is wanted ?

@swheaton
Copy link
Contributor

I'm not entirely sure to be honest, based on cursory research. Would have to dig/ask around some more.

But because it doesn't really hurt anything (except in this case), I'd rather not change the behavior in general - on the off chance someone is relying on that behavior

@patrontheo
Copy link
Contributor Author

@swheaton
In _get_object_label_and_attributes in coco.py, the label is obtained using label = classes[self.category_id]. Then, if the category_is very large, classes[self.category_id] will return IndexError: list index out of range (that's probably why classes is created using the for loop).

I see some options to fix this:

  • If the given category ids are not sequential (or are very large), raise an error, explaining that the annotation file should be changed to match this criteria. This could be a bit frustrating for the user that fiftyone does not handle this case.
    • We could provide the user a function that he can call, and that would read the annotation file, remap the ids to be sequential, and save the new annotation file.
    • We could pass a flag remap_ids(False by default), that would remap the ids to something sequential (although I would need some directions on how to pass the flag and in which functions).
  • If the given category ids are not sequential, we could remap them to 0 to num_categories-1 in parse_coco_categories. Although if the user try to get objects using the category_id there is in the annotation file, it would not work since we changed the ids. Maybe if a remapping is performed, we could print a warning for the user ?

@swheaton
Copy link
Contributor

@patrontheo thanks for continuing to look into it.
These 2 things seem to fix it.

  1. If we have decided to not fill the classes list from 0..max(ids) as discussed above, then we can return a dict from id->name instead. When we go to index via category ID, it will work just the same.
  2. In COCODetectionDatasetImporter.setup() around line 560, we have to set info["classes"] = list(classes.values()) if classes is a dict, so that we set the default_classes property on the dataset correctly (we'll get a document validation error otherwise)

feel free to put up a draft PR if you have any code so we can work there instead of in text here.

would also be great to have this scenario as a test if we're going to support it.

@brimoor
Copy link
Contributor

brimoor commented Apr 22, 2024

The fact that parse_coco_categories() returns an "interpolated" list is an artifact of a peculiarity of the original COCO dataset, which contained 80 classes but the category IDs ranged from 0-90.

It's totally fine to change the internal implementation details to store a dict mapping category IDs -> class label strings in all cases.

The only "public" invariants that need to be maintained are:

  1. When importing a COCO dataset, dataset.default_classes should be populated with the classes list from the COCO JSON file
    • this list currently contains the "interpolated" list if category IDs are not sequential, but this can be changed to contain just a sorted list of the actual category names
  2. When importing a COCO dataset, dataset.info should be populated with the info from the COCO JSON file
    • this is already true; just needs to stay that way!
  3. When exporting a COCO dataset, if a categories list is available via dataset.info or provided via the info argument of the COCO exporter, those category IDs should be used
    • currently, I believe preexisting category IDs are not actually used by the exporter; I'm sneaking in a feature request here while you're in there 🤓

Here's some test code that can be used to verify that the public-facing elements of the COCO I/O interface are working as desired:

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset(
    "coco-2017",
    split="validation",
    max_samples=50,
)

# CHANGE IN BEHAVIOR: this can just contain a sorted list of the category names
# it doesn't need to include interpolated '#' strings like it currently does
assert len(dataset.default_classes) == 80

# This should contain the category mappings from the input COCO JSON file
assert len(dataset.info["categories"]) == 80
print(dataset.info["categories"])
"""
[
    {'supercategory': 'person', 'id': 1, 'name': 'person'},
    {'supercategory': 'vehicle', 'id': 2, 'name': 'bicycle'},
    ...
    {'supercategory': 'indoor', 'id': 90, 'name': 'toothbrush'}
]
"""

# This should check for `dataset.info["categories"]` in the above format
# If found, the exported category IDs should be retained
# If not found, a new category map that uses 1,2,...,n should be generated
dataset.export(
    export_dir="/tmp/coco1",
    dataset_type=fo.types.COCODetectionDataset,

    # this is optional, as COCODetectionDatasetExporter.log_collection() will
    # automatically pull this from the dataset during export
    # info=dataset.info,
)

# If an explicit `classes` list is provided, only those classes are exported
# This should still respect `dataset.info["categories"]` if available, exporting
# only the specified `classes` but retaining the predefined category IDs
dataset.export(
    export_dir="/tmp/coco2",
    dataset_type=fo.types.COCODetectionDataset,
    classes=["cat"],
)

@patrontheo
Copy link
Contributor Author

Thanks a lot, I'll try to draft a PR when I get a bit of time

@patrontheo
Copy link
Contributor Author

Uuuh it was a bit more complicated than I thought 😂.
@brimoor I drafted a PR (#4309), note that I'm not sure what to do regarding add_coco_labels, as explained in the PR.

@swheaton
Copy link
Contributor

Thanks we will take a look!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug fixes
Projects
None yet
3 participants