Lazy dataset loading #353
base: develop
Conversation
… filenames of images instead of images themselves. Images are loaded lazily, keeping memory usage low.
Hello there, thank you for opening a PR! 🙏🏻 The team was notified and they will get back to you ASAP.
use LazyLoadDict and shelve for detections to reduce memory consumption
@tfriedel Thanks for the PR. We will take a look. It might take a bit.
@hardikdava taking review
I am excited about this change! I have run into Out of Memory errors in Colab when working with large datasets with

```python
dataset = sv.DetectionDataset(
    self.ontology.classes(), images_map, detections_map
)
dataset.as_yolo(
    output_folder + "/images",
    output_folder + "/annotations",
    min_image_area_percentage=0.01,
    data_yaml_path=output_folder + "/data.yaml",
)
```

As long as the
Sounds fine. @capjamesg can you share a Colab to test the PR, if possible?
@onuralpszr have you tested the PR with a large dataset? I am afraid we have to move fast with this PR, as it is a blocker for the integration with yoloexplorer. Let me know if you need any help.
I tested with medium-sized data. I will test a bit more and post my results today; I also had IRL work to finish.
Initial memory usage results for loading images.

Before:

```
Images size: 18520 bytes
Images size: 0.01766204833984375 MB
Images size: 1.7248094081878662e-05 GB
```

After:

```
Images size: 48 bytes
Images size: 4.57763671875e-05 MB
Images size: 4.470348358154297e-08 GB
```
The script I used to load the dataset (the datasets themselves were downloaded with the Roboflow script):

```python
import sys

import supervision as sv

dataset_location = "fashion-assistant-segmentation-5"
ds = sv.DetectionDataset.from_yolo(
    images_directory_path=f"{dataset_location}/train/images",
    annotations_directory_path=f"{dataset_location}/train/labels",
    data_yaml_path=f"{dataset_location}/data.yaml",
)

# memory usage of the dataset
print(f"Dataset size: {sys.getsizeof(ds)} bytes")
print(f"Dataset size: {sys.getsizeof(ds) / 1024 / 1024} MB")
print(f"Dataset size: {sys.getsizeof(ds) / 1024 / 1024 / 1024} GB")

# memory usage of ds.images
print(f"Images size: {sys.getsizeof(ds.images)} bytes")
print(f"Images size: {sys.getsizeof(ds.images) / 1024 / 1024} MB")
print(f"Images size: {sys.getsizeof(ds.images) / 1024 / 1024 / 1024} GB")
```
@onuralpszr from the sys.getsizeof documentation (https://docs.python.org/3/library/sys.html): only the memory consumption directly attributed to the object is accounted for, not the memory consumption of the objects it refers to. Also, the memray analysis measures memory consumption before garbage collection. You'd need to either trigger the garbage collector manually or use bigger sets and see if the memory consumption keeps growing even after a garbage collection.
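To illustrate the point above, here is a small, self-contained sketch (the dict contents are made up for the demonstration) showing that `sys.getsizeof` reports only the shallow size of a container, so a dict of large image buffers and a dict of short path strings look almost identical unless you also walk the values:

```python
import sys

# A dict holding large "image" buffers vs. a dict holding only path strings.
big_values = {f"img_{i}.jpg": bytes(1_000_000) for i in range(10)}
small_values = {f"img_{i}.jpg": f"img_{i}.jpg" for i in range(10)}

# Shallow sizes: the dict object itself, not the referenced values.
shallow_big = sys.getsizeof(big_values)
shallow_small = sys.getsizeof(small_values)

# A deep measurement has to walk the values as well.
deep_big = shallow_big + sum(sys.getsizeof(v) for v in big_values.values())
deep_small = shallow_small + sum(sys.getsizeof(v) for v in small_values.values())

print(shallow_big, shallow_small)  # nearly identical shallow sizes
print(deep_big, deep_small)        # deep sizes differ by roughly 10 MB
```

This is why the `Images size: 48 bytes` numbers above mostly reflect the container, not the actual savings; a tool like memray (with garbage collection accounted for) gives a truer picture.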
It does get bigger. I used a 10k image set (with the pre-PR dataset): Images size: 295000 bytes
@tfriedel @onuralpszr I tested this PR. I would say it is better than the existing solution. @SkalskiP please take a look as well, since this might need changing the API a bit. But the solution works quite well on a large dataset. @tfriedel iteration over a loaded dataset is not working at all. Can you take a look? It is quite important and used in many places. It would be great if one could test all the features of
Code to reproduce my issue:

```python
import supervision as sv

data = "train2017"
ds = sv.DetectionDataset.from_yolo(
    images_directory_path=f"../../dataset/coco/images/{data}",
    annotations_directory_path=f"../../dataset/coco/labels/{data}",
    data_yaml_path=f"../../supervision/data/coco128.yaml",
)

for image_name, image, labels in ds:
    print(f"{image_name} : {image.shape}, {len(labels)}")
```
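For context on what a fix for the broken iteration might look like, here is a minimal, hypothetical sketch (the names `iter_dataset` and `load` are not from the PR): iteration resolves each image through the lazy mapping one at a time, instead of expecting pre-loaded arrays in the values.

```python
from typing import Callable, Dict, Iterator, Tuple

def iter_dataset(
    image_paths: Dict[str, str],
    annotations: Dict[str, list],
    load: Callable[[str], object],
) -> Iterator[Tuple[str, object, list]]:
    """Yield (name, image, annotations) triples, loading lazily.

    `load` stands in for whatever the real dataset uses (e.g. cv2.imread);
    only one image is materialized per loop step.
    """
    for name, path in image_paths.items():
        yield name, load(path), annotations.get(name, [])

paths = {"0001.jpg": "/data/0001.jpg"}
labels = {"0001.jpg": [("person", 0.9)]}
for name, image, anns in iter_dataset(paths, labels, load=lambda p: f"<{p}>"):
    print(name, image, len(anns))
```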
@hardikdava
@tfriedel then this might be something wrong on my end. Just a question: can we use a lazy dict for detections as well? Then the increase in memory would be solved, too. What do you think?
@hardikdava for detections, see the PR at https://github.com/autodistill/autodistill/pull/48/files
@onuralpszr I tested the code and it seems we can use this implementation. What are your thoughts on this?
Let me have another look as well.
In datasets.formats.yolo, when loading YOLO annotations the method uses cv2.imread to get the image shape. I suggest using some other method to get the shape, so it will be faster for large datasets.
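One way to do this (a sketch, not the PR's code) is to read only the file header instead of decoding the whole image. Pillow's `Image.open` already works this way for its `.size` attribute; for illustration, here is a stdlib-only parser that pulls width and height straight out of a PNG's IHDR chunk:

```python
import struct

def png_dimensions(data: bytes) -> tuple:
    """Return (width, height) from PNG bytes without decoding pixels.

    PNG layout: an 8-byte signature, then the IHDR chunk, whose first
    8 data bytes are big-endian width and height (bytes 16..24 overall).
    """
    if data[:8] != b"\x89PNG\r\n\x1a\n":
        raise ValueError("not a PNG")
    width, height = struct.unpack(">II", data[16:24])
    return width, height

# Minimal header bytes for a 640x480 PNG (signature + IHDR prefix only),
# constructed here just to exercise the parser.
header = (
    b"\x89PNG\r\n\x1a\n"
    + struct.pack(">I", 13) + b"IHDR"
    + struct.pack(">II", 640, 480)
)
print(png_dimensions(header))  # (640, 480)
```

JPEG needs a slightly more involved marker scan, which is why delegating to a library like Pillow is usually the practical choice.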
@onuralpszr @hardikdava Any update on this? OOM'd on my end when using
Hi, @AChangXD 👋🏻! There are no updates yet. But this PR is high on my TODO list. Over the past couple of weeks, I have been quite overwhelmed with non-Supervision work.
I see that images are only opened with
Will I hit an OOM error at the end, or does Python do something in the background with the image objects that haven't been actively used in a while? |
Yeah, getting this working would be amazing! This is a huge blocker for me.
Seeing that it OOMs, I'm guessing it doesn't?
@tfriedel @onuralpszr, I have tested the lazy-dataset-loading branches of the supervision and autodistill repos by @tfriedel, and I still can't process a 2k-image dataset: OOM error. It's not how it handles the image path in the dict structure, but how it reads the image with cv2 and saves it that looks like it is eating memory. So, to recap how things work under the hood: something is missing that is causing all of these OOMs. @tfriedel, the two branches are a ton of commits behind, so there is no SAHI or NMS, which makes it a little bit tricky. @capjamesg, it took a lot of debugging, but if autodistill changed how it deals with large datasets, with all features included, it would be another level.
@Alaaeldinn thank you for the debugging and the insight. For a start, let me give you a quick update. It looks like I can't update that branch myself, so I merged the latest develop into this PR branch and created a new branch out of it in the roboflow/supervision repo: https://github.com/roboflow/supervision/tree/lazy-dataset-loading-updated So you can try with the SAHI/NMS stuff too. I can also check the images and the OOM problem again. If you have an idea, please feel free to share, or open a PR if needed.
@Alaaeldinn @onuralpszr
Doesn't seem to let me add an inline comment there. In as_folder_structure, line 633, there's a cv2.imwrite(image_path, image). image is now a path, so this will fail.
Description
Implements #316
The "images" dict in DetectionDataset and ClassificationDataset that maps from str to ndarray was replaced by LazyLoadDict, where the setter just stores filenames and the getter loads the image. It therefore only keeps track of filenames instead of image contents, which keeps memory usage low and allows larger datasets that are not required to fit in memory.
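As a minimal sketch of the idea (the loader here is injectable for testability; the real PR presumably uses cv2.imread, and this is not the PR's exact code), subclassing MutableMapping keeps existing dict-style call sites working:

```python
from collections.abc import MutableMapping

class LazyLoadDict(MutableMapping):
    """Dict-like mapping: __setitem__ stores a filename,
    __getitem__ loads the image on demand via `loader`."""

    def __init__(self, loader, paths=None):
        self._loader = loader            # e.g. cv2.imread in the real PR
        self._paths = dict(paths or {})

    def __setitem__(self, key, path):
        self._paths[key] = path          # store only the filename

    def __getitem__(self, key):
        return self._loader(self._paths[key])  # load lazily, on access

    def __delitem__(self, key):
        del self._paths[key]

    def __iter__(self):
        return iter(self._paths)

    def __len__(self):
        return len(self._paths)

# Demonstration with a fake loader that records what it was asked to read.
loaded = []
def fake_loader(path):
    loaded.append(path)
    return f"pixels:{path}"

images = LazyLoadDict(fake_loader)
images["cat.jpg"] = "/data/cat.jpg"
print(loaded)                 # nothing read at insert time
print(images["cat.jpg"])      # loaded only on first access
```

Because MutableMapping supplies keys(), items(), values(), etc. for free, most dict-shaped code should keep working, with the caveat from the PR description that the interface change is still breaking for callers that construct the dict directly.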
Type of change
It is a breaking change, as the interface changes.
How has this change been tested? Please provide a test case or example of how you tested the change.
Some unit tests were modified and it was tested with a modified version of autodistill.
I'm not claiming nothing breaks, as I'm not aware of all use cases, nor have I tested them all. I only tested autodistill with GroundedSAM for object detection using masks and YOLOv8. You are encouraged to do further testing.
Also, the issue of high memory consumption is still present for the annotations attribute of DetectionDataset. It could be addressed in a similar way, or with a shelve instead of a dict.
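As a rough illustration of the shelve idea (the paths and the detection schema below are made up for the example), the standard library's shelve can spill per-image annotations to disk so only the entries you actually touch are held in memory:

```python
import os
import shelve
import tempfile

# shelve pickles each value into a dbm-backed file; keys must be strings.
db_path = os.path.join(tempfile.mkdtemp(), "detections")

with shelve.open(db_path) as detections:
    detections["img_0001.jpg"] = {
        "xyxy": [[10, 20, 50, 80]],   # hypothetical box, for illustration
        "class_id": [3],
    }

# Re-open later: values are unpickled lazily, one lookup at a time.
with shelve.open(db_path) as detections:
    print(detections["img_0001.jpg"]["class_id"])  # [3]
```

The trade-off is that every read unpickles and every write repickles the whole entry, so random access is slower than an in-memory dict; for datasets too large for RAM that is usually acceptable.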
Docs
Docs have not been updated, but since the interface changed (images now needs to be a LazyLoadDict instead of a regular dict), any documentation touching this still needs to be updated.