
Lazy dataset loading #353

Open
wants to merge 5 commits into base: develop

Conversation


@tfriedel tfriedel commented Sep 4, 2023

Description

Implements #316.
The "images" dict in DetectionDataset and ClassificationDataset, which maps str to ndarray, was replaced by LazyLoadDict, whose setter stores only a filename and whose getter loads the image on access. The dataset therefore keeps track of filenames instead of image contents, which keeps memory usage low and allows larger datasets that are not required to fit in memory.

Type of change

Please delete options that are not relevant.

  • Bug fix (breaking change which fixes an issue)

(It is a breaking change though, as the interface changes)

How has this change been tested, please provide a testcase or example of how you tested the change?

Some unit tests were modified, and the change was tested with a modified version of autodistill.
I'm not claiming nothing breaks, as I'm not aware of all use cases, nor have I tested them all. I only tested autodistill with GroundedSAM for object detection using masks and YOLOv8. You are encouraged to do further testing.
Also, the issue of high memory consumption is still present with the annotations attribute of DetectionDataset. It could be addressed in a similar way, or with a shelve instead of a dict.

Docs

  • Docs updated? What were the changes:
    Docs have not been updated, but since the interface changed (images now needs to be a LazyLoadDict instead of a regular dict), any documentation covering this still needs to be updated.

… filenames of images instead of images themselves. Images are loaded lazily, keeping memory usage low.
@CLAassistant

CLAassistant commented Sep 4, 2023

CLA assistant check
All committers have signed the CLA.

Contributor

@github-actions github-actions bot left a comment


Hello there, thank you for opening a PR! 🙏🏻 The team was notified and they will get back to you asap.

@tfriedel tfriedel marked this pull request as ready for review September 4, 2023 19:35
tfriedel added a commit to PEAT-AI/autodistill that referenced this pull request Sep 4, 2023
use LazyLoadDict and shelve for detections to reduce memory consumption
@hardikdava
Collaborator

@tfriedel Thanks for the PR. We will take a look; it might take a bit.
@capjamesg Can you also review this, as it might affect autodistill?
@onuralpszr Do you want to take care of the review?

@onuralpszr
Collaborator

@hardikdava taking the review

@onuralpszr onuralpszr added this to the version: 0.15.0 milestone Sep 5, 2023
@onuralpszr onuralpszr added version: 0.15.0 Feature to be added in `0.15.0` release api:datasets Dataset API api:utilities labels Sep 5, 2023
@capjamesg
Collaborator

capjamesg commented Sep 6, 2023

I am excited about this change! I have run into out-of-memory errors in Colab when working with large datasets in supervision. The only Autodistill use I can find is:

dataset = sv.DetectionDataset(
    self.ontology.classes(), images_map, detections_map
)

dataset.as_yolo(
    output_folder + "/images",
    output_folder + "/annotations",
    min_image_area_percentage=0.01,
    data_yaml_path=output_folder + "/data.yaml",
)

As long as the as_yolo method has been tested and will continue to work as normal, autodistill should not be impacted.

@hardikdava
Collaborator

Sounds fine. @capjamesg can you share a Colab to test the PR, if possible?

@hardikdava
Collaborator

@onuralpszr have you tested the PR with a large dataset? I am afraid we have to move fast with this PR, as it is a blocker for the integration with yoloexplorer. Let me know if you need any help.

@onuralpszr
Collaborator

@onuralpszr have you tested the PR with a large dataset? I am afraid we have to move fast with this PR, as it is a blocker for the integration with yoloexplorer. Let me know if you need any help.

I tested with a medium-sized dataset; I will test a bit more and post my results today. Plus, I had IRL work to finish.

@onuralpszr
Collaborator

Initial memory-usage results for loading images

Before

Images size: 18520 bytes
Images size: 0.01766204833984375 MB
Images size: 1.7248094081878662e-05 GB

After

Images size: 48 bytes
Images size: 4.57763671875e-05 MB
Images size: 4.470348358154297e-08 GB

@onuralpszr
Collaborator

I also ran a memray test with the graphical memory profiler to see heap size and resident size.
Basically, we run a Python script that loads the dataset and then exits.

Before: heap and resident size

[memray screenshot: heap and resident size before the change]

After: heap and resident size

[memray screenshot: heap and resident size after the change]

@onuralpszr
Collaborator

The script I used to load the dataset; I used the Roboflow script to download the datasets:

import sys
import supervision as sv

dataset_location = "fashion-assistant-segmentation-5"
ds = sv.DetectionDataset.from_yolo(
    images_directory_path=f"{dataset_location}/train/images",
    annotations_directory_path=f"{dataset_location}/train/labels",
    data_yaml_path=f"{dataset_location}/data.yaml",
)

# memory usage of the dataset
print(f"Dataset size: {sys.getsizeof(ds)} bytes")
# Convert to MegaBytes (MB)
print(f"Dataset size: {sys.getsizeof(ds) / 1024 / 1024} MB")
# Convert to GigaBytes (GB)
print(f"Dataset size: {sys.getsizeof(ds) / 1024 / 1024 / 1024} GB")

# ds.images memory usage
print(f"Images size: {sys.getsizeof(ds.images)} bytes")
# Convert to MegaBytes (MB)
print(f"Images size: {sys.getsizeof(ds.images) / 1024 / 1024} MB")
# Convert to GigaBytes (GB)
print(f"Images size: {sys.getsizeof(ds.images) / 1024 / 1024 / 1024} GB")

@tfriedel
Author

tfriedel commented Sep 6, 2023

@onuralpszr
I think you cannot use getsizeof to measure the size of the LazyLoadDict, because:

Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.
See recursive sizeof recipe for an example of using getsizeof() recursively to find the size of containers and all their contents.

from: https://docs.python.org/3/library/sys.html

Also, the memray analysis measures memory consumption before garbage collection. You'd need to either trigger the garbage collector manually or use bigger datasets and see whether memory consumption keeps growing even after a garbage collection.
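
For illustration, a rough sketch of both points, reusing the ds from the script above (note that recursing into a LazyLoadDict's values would load every image, so for the lazy version you would measure its internal path mapping, whatever the PR names it, instead):

import gc
import sys


def deep_getsizeof(obj, seen=None):
    """Recursively sum sys.getsizeof over a container and its contents."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0  # don't double-count shared objects
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size


gc.collect()  # collect garbage first so dead objects don't skew the numbers
print(f"Images size: {deep_getsizeof(ds.images)} bytes")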

@onuralpszr
Collaborator

onuralpszr commented Sep 6, 2023

@onuralpszr I think you cannot use getsizeof to measure the size of the LazyLoadDict, because:

Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.
See recursive sizeof recipe for an example of using getsizeof() recursively to find the size of containers and all their contents.

from: https://docs.python.org/3/library/sys.html

Also, the memray analysis measures memory consumption before garbage collection. You'd need to either trigger the garbage collector manually or use bigger datasets and see whether memory consumption keeps growing even after a garbage collection.

It does get bigger. I used a 10k-image set (with the before-PR dataset):

Images size: 295000 bytes
Images size: 0.28133392333984375 MB
Images size: 0.00027474015951156616 GB

@hardikdava
Collaborator

hardikdava commented Sep 7, 2023

@tfriedel @onuralpszr I tested this PR. I would say it is better than the existing solution. @SkalskiP please take a look as well, since this might require changing the API a bit. But the solution works quite well on a large dataset.

@tfriedel iterating over a loaded dataset is not working at all. Can you take a look? It is quite important and used in many places. It would be great if someone could test all the features of sv.DetectionDataset and sv.ClassificationDataset.

@hardikdava
Collaborator

Code to reproduce my issue.

import supervision as sv

data = "train2017"

ds = sv.DetectionDataset.from_yolo(
    images_directory_path=f"../../dataset/coco/images/{data}",
    annotations_directory_path=f"../../dataset/coco/labels/{data}",
    data_yaml_path=f"../../supervision/data/coco128.yaml",
)

for image_name, image, labels in ds:
    print(f"{image_name} : {image.shape}, {len(labels)}")

@tfriedel
Author

tfriedel commented Sep 7, 2023

@hardikdava
I cannot reproduce your problem; for me, this works fine. I downloaded the dataset from https://www.kaggle.com/datasets/ultralytics/coco128 and the YAML from https://raw.githubusercontent.com/ultralytics/ultralytics/main/ultralytics/cfg/datasets/coco128.yaml, adapted the paths, and otherwise used the same code.

@hardikdava
Collaborator

@tfriedel then it might be something wrong on my end. Just a question: can we use a LazyLoadDict for detections as well? That would solve the memory growth too. What do you think?

@tfriedel
Author

tfriedel commented Sep 7, 2023

@hardikdava for detections, see the PR at https://github.com/autodistill/autodistill/pull/48/files,
where I just store them in a shelve (basically a dict that lives on disk instead of in memory).
We cannot use a LazyLoadDict, as the detections don't correspond to files (I think?). So you need a way to write them to disk first; you could do that and then use a LazyLoadDict.
I'm not super happy with the shelve solution, because it just uses a temporary file (maybe on a disk without enough space), and it can also be problematic if that object is later accessed by multiple threads, etc. But it's a quick fix and solved the problem for me; it's surely worth thinking about better options. A sketch of the idea follows.
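
A minimal sketch of the shelve approach (illustrative only; the dict value stands in for any picklable detections object):

import os
import shelve
import tempfile

# an on-disk mapping: keys are strings, values are pickled to a file,
# so the detections never all sit in RAM at once
shelf_path = os.path.join(tempfile.mkdtemp(), "detections")
with shelve.open(shelf_path) as detections_map:
    # the dict below stands in for a picklable detections object
    detections_map["image_001.jpg"] = {"xyxy": [[10, 20, 110, 220]], "class_id": [0]}
    print(detections_map["image_001.jpg"])  # unpickled from disk on access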
Another thing that could be improved is the opening of the images in from_yolo: with a large dataset, every image is read in full only to obtain its height and width. So even though we have a lazy-loading solution, it will still be slow for a large dataset.
I would usually use Image.open() from PIL, as that reads the height and width from the header only and is therefore much faster; I'm not even sure the height and width need to be read at all. See the sketch below.
Anyway, I didn't want to change too much, because I don't know your reasons for going with OpenCV, and maybe adding the Pillow dependency is unwanted.
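
For illustration, reading only the dimensions with PIL (the path is a placeholder):

from PIL import Image

# Image.open is lazy: it parses the header but does not decode pixel data,
# so this is fast even for very large files
with Image.open("some_image.jpg") as img:  # placeholder path
    width, height = img.size
print(width, height)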

@hardikdava
Collaborator

@onuralpszr I tested the code, and it seems we can use this implementation. What are your thoughts on this?

@onuralpszr
Collaborator

@onuralpszr I tested the code, and it seems we can use this implementation. What are your thoughts on this?

Let me take another look as well

@onuralpszr onuralpszr added version: 0.16.0 Feature to be added in `0.16.0` release hacktoberfest Open for contributions during the annual Hacktoberfest event, aimed at encouraging open-source parti and removed version: 0.15.0 Feature to be added in `0.15.0` release labels Oct 8, 2023
@netanelbi

In datasets.formats.yolo, when loading YOLO annotations, the method uses cv2.imread to get the image shape. I suggest using some other method to get the shape (e.g., the PIL header-read approach sketched above), so it will be faster for large datasets.

@AChangXD

@onuralpszr @hardikdava Any update on this? I hit an OOM on my end when using sv.DetectionDataset.from_yolo.

@SkalskiP
Collaborator

Hi, @AChangXD 👋🏻 ! There are no updates yet. But this PR is high on my TODO list. Over the past couple of weeks, I have been quite overwhelmed with non-Supervision work.

@capjamesg
Collaborator

I see that images are only opened with cv2 when they are accessed. Suppose I am labeling 10,000 images and load them all into memory sequentially. My workflow is:

  1. Load an image
  2. Get an annotation
  3. Save the annotation in the dataset object

Will I hit an OOM error at the end, or does Python do something in the background with the image objects that haven't been actively used in a while?

@AChangXD

AChangXD commented Jan 8, 2024

Hi, @AChangXD 👋🏻 ! There are no updates yet. But this PR is high on my TODO list. Over the past couple of weeks, I have been quite overwhelmed with non-Supervision work.

Yeah, getting this working would be amazing! This is a huge blocker for me.

@AChangXD

AChangXD commented Jan 8, 2024

I see that images are only opened with cv2 when they are accessed. Suppose I am labeling 10,000 images and load them all into memory sequentially. My workflow is:

  1. Load an image
  2. Get an annotation
  3. Save the annotation in the dataset object

Will I hit an OOM error at the end, or does Python do something in the background with the image objects that haven't been actively used in a while?

Seeing that it OOM's, I'm guessing it doesn't?

@onuralpszr onuralpszr mentioned this pull request Feb 11, 2024
@Alaaeldinn

@tfriedel @onuralpszr, I have tested the lazy-dataset-loading branches of the supervision and autodistill repos by @tfriedel, and I still can't process a 2k-image dataset; I get an OOM error. It's not how the image paths are handled in the dict structure, but how images are read with cv2 and then saved; that seems to be what eats the memory.

So, to recap what happens under the hood:

  1. Read the images directory
  2. Iterate over all paths and save them in the lazy dict
  3. Read every image with cv2.imread
  4. Annotate
  5. Save the annotations to the detections map
  6. Generate the supervision dataset
  7. Save the image dataset, writing images into the labeled dir

Something is missing that causes all of this OOM.

@tfriedel the two repos are tons of commits behind, so there is no SAHI or NMS, which makes it a little bit tricky. @capjamesg it took a lot of debugging, but if autodistill improved how it deals with large datasets, with all features included, it would be another level.

@onuralpszr
Collaborator

onuralpszr commented Feb 12, 2024

@Alaaeldinn thank you for the debugging and the insight. As a start, let me give you a quick update: it looks like I can't update that branch myself, so I merged the latest develop into this PR branch and created a new branch out of it in the roboflow/supervision repo:

https://github.com/roboflow/supervision/tree/lazy-dataset-loading-updated

So you can try the SAHI/NMS stuff too. I can also check the images and the OOM problem again. If you have an idea, please feel free to share it, or open a PR if needed.

@tfriedel
Author

@Alaaeldinn @onuralpszr
Sorry for the problems you are facing. It is now over 5 months since I implemented and last used this. I didn't have OOM issues for my use case back then, but since the PRs are now so far behind, and it's also possible that my solution was incomplete, I think your approach of creating a new PR makes a lot of sense.

Collaborator

@LinasKo LinasKo left a comment


Doesn't seem to let me add an inline comment there.

In as_folder_structure, line 633:

There's a cv2.imwrite(image_path, image). image is now a path, so this will fail.
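
A minimal sketch of one possible fix, assuming the dict value is now a source file path rather than an ndarray (variable names echo the snippet above; both paths are placeholders):

import shutil

# when `image` holds a source file path rather than an ndarray, copy the
# file instead of re-encoding it with cv2.imwrite
source_path = "dataset/images/example.jpg"  # what `image` would contain
target_path = "output/images/example.jpg"   # what `image_path` would contain
shutil.copy2(source_path, target_path)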
