
Lazy dataset loading #353

Open
wants to merge 5 commits into base: develop

Conversation


@tfriedel tfriedel commented Sep 4, 2023

Description

Implements #316.
The "images" dict in DetectionDataset and ClassificationDataset, which maps str to ndarray, was replaced by LazyLoadDict, whose setter stores only a filename and whose getter loads the image on access. The dataset therefore keeps track of filenames instead of image contents, which keeps memory usage low and allows larger datasets that are not required to fit in memory.

Type of change

Please delete options that are not relevant.

  • Bug fix (breaking change which fixes an issue)

(It is a breaking change though, as the interface changes)

How has this change been tested, please provide a testcase or example of how you tested the change?

Some unit tests were modified, and the change was tested with a modified version of autodistill.
I'm not claiming nothing breaks, as I'm not aware of all use cases, nor have I tested them all. I only tested autodistill with GroundedSAM for object detection using masks and YOLOv8. You are encouraged to do further testing.
Also, the issue of high memory consumption is still present with the annotations attribute of DetectionDataset. It could be addressed in a similar way, or with a shelve instead of a dict.

Docs

  • Docs updated? What were the changes:
    Docs have not been updated, but since the interface changed (images now needs to be a LazyLoadDict instead of a regular dict), any documentation covering this still needs to be updated.

… filenames of images instead of images themselves. Images are loaded lazily, keeping memory usage low.
@CLAassistant

CLAassistant commented Sep 4, 2023

CLA assistant check
All committers have signed the CLA.

Contributor

@github-actions github-actions bot left a comment


Hello there, thank you for opening a PR! 🙏🏻 The team was notified and they will get back to you asap.

@tfriedel tfriedel marked this pull request as ready for review September 4, 2023 19:35
tfriedel added a commit to PEAT-AI/autodistill that referenced this pull request Sep 4, 2023
use LazyLoadDict and shelve for detections to reduce memory consumption
@hardikdava
Collaborator

@tfriedel Thanks for the PR. We will take a look; it might take a bit.
@capjamesg Can you also review this, as it might affect autodistill?
@onuralpszr Do you want to take care of the review?

@onuralpszr
Collaborator

@hardikdava taking the review

@onuralpszr onuralpszr added this to the version: 0.15.0 milestone Sep 5, 2023
@onuralpszr onuralpszr added version: 0.15.0 Feature to be added in `0.15.0` release api:datasets Dataset API api:utilities labels Sep 5, 2023
@capjamesg
Collaborator

capjamesg commented Sep 6, 2023

I am excited about this change! I have run into out-of-memory errors in Colab when working with large datasets in supervision. The only Autodistill use I can find is:

dataset = sv.DetectionDataset(
    self.ontology.classes(), images_map, detections_map
)

dataset.as_yolo(
    output_folder + "/images",
    output_folder + "/annotations",
    min_image_area_percentage=0.01,
    data_yaml_path=output_folder + "/data.yaml",
)

As long as the as_yolo method has been tested and will continue to work as normal, autodistill should not be impacted.

@hardikdava
Collaborator

Sounds fine. @capjamesg can you share a Colab to test the PR, if possible?

@hardikdava
Collaborator

@onuralpszr have you tested the PR with a large dataset? I am afraid we have to move fast with this PR, as it is a blocker for the integration with yoloexplorer. Let me know if you need any help.

@onuralpszr
Collaborator

@onuralpszr have you tested the PR with a large dataset? I am afraid we have to move fast with this PR, as it is a blocker for the integration with yoloexplorer. Let me know if you need any help.

I tested with a medium-sized dataset; I will test a bit more and post my results today. Plus, I had IRL work to finish.

@onuralpszr
Collaborator

Initial memory-usage results for loading images

Before

Images size: 18520 bytes
Images size: 0.01766204833984375 MB
Images size: 1.7248094081878662e-05 GB

After

Images size: 48 bytes
Images size: 4.57763671875e-05 MB
Images size: 4.470348358154297e-08 GB

@onuralpszr
Collaborator

I also ran a memray test with the graphical memory profiler to see heap size and resident size.
Basically, we run a Python script that loads the dataset and then exits.

Before: heap and resident size

[memray screenshot: heap and resident size before the change]

After: heap and resident size

[memray screenshot: heap and resident size after the change]

@onuralpszr
Collaborator

The script I used to load the dataset; I used the Roboflow script to download the datasets:

import sys
import supervision as sv

dataset_location = "fashion-assistant-segmentation-5"
ds = sv.DetectionDataset.from_yolo(
    images_directory_path=f"{dataset_location}/train/images",
    annotations_directory_path=f"{dataset_location}/train/labels",
    data_yaml_path=f"{dataset_location}/data.yaml",
)

# memory usage of the dataset
print(f"Dataset size: {sys.getsizeof(ds)} bytes")
# Convert to MegaBytes (MB)
print(f"Dataset size: {sys.getsizeof(ds) / 1024 / 1024} MB")
# Convert to GigaBytes (GB)
print(f"Dataset size: {sys.getsizeof(ds) / 1024 / 1024 / 1024} GB")

# ds.images memory usage
print(f"Images size: {sys.getsizeof(ds.images)} bytes")
# Convert to MegaBytes (MB)
print(f"Images size: {sys.getsizeof(ds.images) / 1024 / 1024} MB")
# Convert to GigaBytes (GB)
print(f"Images size: {sys.getsizeof(ds.images) / 1024 / 1024 / 1024} GB")

@tfriedel
Author

tfriedel commented Sep 6, 2023

@onuralpszr
I think you cannot use getsizeof to measure the size of the LazyLoadDict, because:

Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.
See recursive sizeof recipe for an example of using getsizeof() recursively to find the size of containers and all their contents.

from: https://docs.python.org/3/library/sys.html

Also, the memray analysis measures memory consumption before garbage collection. You'd need to either trigger the garbage collector manually or use bigger datasets and see whether memory consumption keeps growing even after a garbage collection.
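
For illustration, a rough sketch of both points, reusing the ds from the script above (note that recursing into a LazyLoadDict's values would load every image, so for the lazy version you would measure its internal path mapping, whatever the PR names it, instead):

import gc
import sys


def deep_getsizeof(obj, seen=None):
    """Recursively sum sys.getsizeof over a container and its contents."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0  # don't double-count shared objects
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size


gc.collect()  # collect garbage first so dead objects don't skew the numbers
print(f"Images size: {deep_getsizeof(ds.images)} bytes")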

@onuralpszr
Collaborator

onuralpszr commented Sep 6, 2023

@onuralpszr I think you cannot use getsizeof to measure the size of the LazyLoadDict, because:

Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.
See recursive sizeof recipe for an example of using getsizeof() recursively to find the size of containers and all their contents.

from: https://docs.python.org/3/library/sys.html

Also, the memray analysis measures memory consumption before garbage collection. You'd need to either trigger the garbage collector manually or use bigger datasets and see whether memory consumption keeps growing even after a garbage collection.

It does get bigger. I used a 10k-image set (with the before-PR dataset):

Images size: 295000 bytes
Images size: 0.28133392333984375 MB
Images size: 0.00027474015951156616 GB

@hardikdava
Collaborator

hardikdava commented Sep 7, 2023

@tfriedel @onuralpszr I tested this PR. I would say it is better than the existing solution. @SkalskiP please take a look as well, since this might require changing the API a bit. But the solution works quite well on a large dataset.

@tfriedel iterating over a loaded dataset is not working at all. Can you take a look? It is quite important and used in many places. It would be great if someone could test all the features of sv.DetectionDataset and sv.ClassificationDataset.

@hardikdava
Collaborator

Code to reproduce my issue.

import supervision as sv

data = "train2017"

ds = sv.DetectionDataset.from_yolo(
    images_directory_path=f"../../dataset/coco/images/{data}",
    annotations_directory_path=f"../../dataset/coco/labels/{data}",
    data_yaml_path=f"../../supervision/data/coco128.yaml",
)

for image_name, image, labels in ds:
    print(f"{image_name} : {image.shape}, {len(labels)}")

@tfriedel
Author

tfriedel commented Sep 7, 2023

@hardikdava
I cannot reproduce your problem; for me, this works fine. I downloaded the dataset from https://www.kaggle.com/datasets/ultralytics/coco128 and the YAML from https://raw.githubusercontent.com/ultralytics/ultralytics/main/ultralytics/cfg/datasets/coco128.yaml, adapted the paths, and otherwise used the same code.

@hardikdava
Collaborator

@tfriedel then it might be something wrong on my end. Just a question: can we use a LazyLoadDict for detections as well? That would solve the memory growth too. What do you think?

@tfriedel
Author

tfriedel commented Sep 7, 2023

@hardikdava for detections, see the PR at https://github.com/autodistill/autodistill/pull/48/files,
where I just store them in a shelve (basically a dict that lives on disk instead of in memory).
We cannot use a LazyLoadDict, as the detections don't correspond to files (I think?). So you need a way to write them to disk first; you could do that and then use a LazyLoadDict.
I'm not super happy with the shelve solution, because it just uses a temporary file (maybe on a disk without enough space), and it can also be problematic if that object is later accessed by multiple threads, etc. But it's a quick fix and solved the problem for me; it's surely worth thinking about better options. A sketch of the idea follows.
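
A minimal sketch of the shelve approach (illustrative only; the dict value stands in for any picklable detections object):

import os
import shelve
import tempfile

# an on-disk mapping: keys are strings, values are pickled to a file,
# so the detections never all sit in RAM at once
shelf_path = os.path.join(tempfile.mkdtemp(), "detections")
with shelve.open(shelf_path) as detections_map:
    # the dict below stands in for a picklable detections object
    detections_map["image_001.jpg"] = {"xyxy": [[10, 20, 110, 220]], "class_id": [0]}
    print(detections_map["image_001.jpg"])  # unpickled from disk on access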
Another thing that could be improved is the opening of the images in from_yolo: with a large dataset, every image is read in full only to obtain its height and width. So even though we have a lazy-loading solution, it will still be slow for a large dataset.
I would usually use Image.open() from PIL, as that reads the height and width from the header only and is therefore much faster; I'm not even sure the height and width need to be read at all. See the sketch below.
Anyway, I didn't want to change too much, because I don't know your reasons for going with OpenCV, and maybe adding the Pillow dependency is unwanted.
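
For illustration, reading only the dimensions with PIL (the path is a placeholder):

from PIL import Image

# Image.open is lazy: it parses the header but does not decode pixel data,
# so this is fast even for very large files
with Image.open("some_image.jpg") as img:  # placeholder path
    width, height = img.size
print(width, height)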

@hardikdava
Collaborator

@onuralpszr I tested the code, and it seems we can use this implementation. What are your thoughts on this?

@onuralpszr
Collaborator

@onuralpszr I tested the code, and it seems we can use this implementation. What are your thoughts on this?

Let me take another look as well

@onuralpszr onuralpszr added version: 0.16.0 Feature to be added in `0.16.0` release hacktoberfest Open for contributions during the annual Hacktoberfest event, aimed at encouraging open-source parti and removed version: 0.15.0 Feature to be added in `0.15.0` release labels Oct 8, 2023
@netanelbi

In datasets.formats.yolo, when loading YOLO annotations, the method uses cv2.imread to get the image shape. I suggest using some other method to get the shape (e.g., the PIL header-read approach sketched above), so it will be faster for large datasets.

@AChangXD

@onuralpszr @hardikdava Any update on this? I hit an OOM on my end when using sv.DetectionDataset.from_yolo.

@SkalskiP
Collaborator

Hi, @AChangXD 👋🏻 ! There are no updates yet. But this PR is high on my TODO list. Over the past couple of weeks, I have been quite overwhelmed with non-Supervision work.

@capjamesg
Collaborator

I see that images are only opened with cv2 when they are accessed. Suppose I am labeling 10,000 images and load them all into memory sequentially. My workflow is:

  1. Load an image
  2. Get an annotation
  3. Save the annotation in the dataset object

Will I hit an OOM error at the end, or does Python do something in the background with the image objects that haven't been actively used in a while?

@AChangXD

AChangXD commented Jan 8, 2024

Hi, @AChangXD 👋🏻 ! There are no updates yet. But this PR is high on my TODO list. Over the past couple of weeks, I have been quite overwhelmed with non-Supervision work.

Yeah, getting this working would be amazing! This is a huge blocker for me.

@AChangXD

AChangXD commented Jan 8, 2024

I see that images are only opened with cv2 when they are accessed. Suppose I am labeling 10,000 images and load them all into memory sequentially. My workflow is:

  1. Load an image
  2. Get an annotation
  3. Save the annotation in the dataset object

Will I hit an OOM error at the end, or does Python do something in the background with the image objects that haven't been actively used in a while?

Seeing that it OOM's, I'm guessing it doesn't?

@onuralpszr onuralpszr mentioned this pull request Feb 11, 2024
@Alaaeldinn

@tfriedel @onuralpszr, I have tested the lazy-dataset-loading branches of the supervision and autodistill repos by @tfriedel, and I still can't process a 2k-image dataset; I get an OOM error. It's not how the image paths are handled in the dict structure, but how images are read with cv2 and then saved; that seems to be what eats the memory.

So, to recap what happens under the hood:

  1. Read the images directory
  2. Iterate over all paths and save them in the lazy dict
  3. Read every image with cv2.imread
  4. Annotate
  5. Save the annotations to the detections map
  6. Generate the supervision dataset
  7. Save the image dataset, writing images into the labeled dir

Something is missing that causes all of this OOM.

@tfriedel the two repos are tons of commits behind, so there is no SAHI or NMS, which makes it a little bit tricky. @capjamesg it took a lot of debugging, but if autodistill improved how it deals with large datasets, with all features included, it would be another level.

@onuralpszr
Collaborator

onuralpszr commented Feb 12, 2024

@Alaaeldinn thank you for the debugging and the insight. As a start, let me give you a quick update: it looks like I can't update that branch myself, so I merged the latest develop into this PR branch and created a new branch out of it in the roboflow/supervision repo:

https://github.com/roboflow/supervision/tree/lazy-dataset-loading-updated

So you can try the SAHI/NMS stuff too. I can also check the images and the OOM problem again. If you have an idea, please feel free to share it, or open a PR if needed.

@tfriedel
Author

@Alaaeldinn @onuralpszr
Sorry for the problems you are facing. It is now over 5 months since I implemented and last used this. I didn't have OOM issues for my use case back then, but since the PRs are now so far behind, and it's also possible that my solution was incomplete, I think your approach of creating a new PR makes a lot of sense.

Collaborator

@LinasKo LinasKo left a comment


Doesn't seem to let me add an inline comment there.

In as_folder_structure, line 633:

There's a cv2.imwrite(image_path, image). image is now a path, so this will fail.
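
A minimal sketch of one possible fix, assuming the dict value is now a source file path rather than an ndarray (variable names echo the snippet above; both paths are placeholders):

import shutil

# when `image` holds a source file path rather than an ndarray, copy the
# file instead of re-encoding it with cv2.imwrite
source_path = "dataset/images/example.jpg"  # what `image` would contain
target_path = "output/images/example.jpg"   # what `image_path` would contain
shutil.copy2(source_path, target_path)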
