
Export/model inference cleanup (for release) #4243

Closed
brimoor wants to merge 14 commits

Conversation


@brimoor (Contributor) commented Apr 8, 2024

Change log

  • Autosave contexts now respect the default batching strategy and can be configured to use content size-based batching
  • All SDK methods now use autosave=True or its equivalent rather than calling sample.save() in a loop (see the sketch after this list)
  • Renames export_media parameter to write_clips in video sample parsers, for clarity
  • Centralizes the application of select_fields() in compute_embeddings() and compute_patch_embeddings(), for consistency with how this is handled in the apply_model() utility.
  • Adds support for sample collections to fou.iter_slices()
  • Adds a SampleCollection._resolve_media_field() utility
  • Adds a fos.read_files() method to efficiently read from multiple files in a threadpool
  • Adds a warning when uploading large amounts of data to CVAT
  • Removes unneeded pylint: disable=no-member
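
For illustration, here is a minimal sketch of the two batched-saving patterns above. The dataset and field names are hypothetical; the autosave and save_context APIs are the ones referenced in this change log:

import fiftyone as fo

dataset = fo.load_dataset("my-dataset")  # hypothetical dataset

# Pattern 1: autosave while iterating; edits are flushed in batches
for sample in dataset.iter_samples(autosave=True, batch_size=50):
    sample["reviewed"] = True  # hypothetical field

# Pattern 2: explicit save context using content size-based batching
with dataset.save_context(batching_strategy="size") as ctx:
    for sample in dataset:
        sample["reviewed"] = True
        ctx.save(sample)  # queued; written when the byte budget is reached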

@brimoor brimoor requested a review from swheaton April 8, 2024 17:14

coderabbitai bot commented Apr 8, 2024

Walkthrough

The recent updates enhance FiftyOne's media handling and batching operations, introducing more flexible strategies for batch processing and media file management. Key improvements include support for specific group slices in media downloading and clearing, configurable batch sizes and strategies for saving operations, and efficiency boosts in file handling tasks. These changes aim to streamline workflows, improve performance, and offer users greater control over data processing and management.

Changes

  • .../cloud_media.rst, .../config.rst, .../core/collections.py, .../core/dataset.py, .../core/view.py: Enhanced support for group_slices in media operations and introduced a batching_strategy parameter for flexible batch processing.
  • .../core/frame.py: Corrected indentation for better code readability.
  • .../core/models.py, .../core/storage.py, .../utils/annotations.py, .../utils/labelbox.py, .../utils/scale.py, .../utils/video.py: Improved file handling and context management for efficiency and streamlined operations.
  • .../core/stages.py, .../utils/iou.py: Minor code clean-up by removing unnecessary pylint directives.
  • .../utils/data/exporters.py, .../utils/data/parsers.py: Renamed the export_media parameter to write_clips for clarity in media handling.
  • .../utils/eval/detection.py, .../utils/eval/segmentation.py: Refactored evaluation result saving logic for improved clarity.
  • .../utils/geojson.py, .../utils/image.py, .../utils/patches.py, .../utils/utils3d.py: Optimized data loading and handling for better performance and user experience.
  • tests/unittests/model_tests.py: Added unit tests for model inference functionality to ensure reliability and accuracy.



@coderabbitai bot left a comment


Actionable comments posted: 7

Review Status

Configuration used: CodeRabbit UI

Commits reviewed: files that changed from the base of the PR between 0d9af75 and 46d81c4.
Files selected for processing (26)
  • docs/source/teams/cloud_media.rst (6 hunks)
  • docs/source/user_guide/config.rst (1 hunks)
  • fiftyone/core/collections.py (12 hunks)
  • fiftyone/core/dataset.py (8 hunks)
  • fiftyone/core/frame.py (3 hunks)
  • fiftyone/core/models.py (37 hunks)
  • fiftyone/core/stages.py (1 hunks)
  • fiftyone/core/storage.py (8 hunks)
  • fiftyone/core/utils.py (2 hunks)
  • fiftyone/core/view.py (8 hunks)
  • fiftyone/utils/annotations.py (4 hunks)
  • fiftyone/utils/cvat.py (2 hunks)
  • fiftyone/utils/data/exporters.py (2 hunks)
  • fiftyone/utils/data/parsers.py (9 hunks)
  • fiftyone/utils/eval/detection.py (3 hunks)
  • fiftyone/utils/eval/segmentation.py (4 hunks)
  • fiftyone/utils/geojson.py (1 hunks)
  • fiftyone/utils/image.py (2 hunks)
  • fiftyone/utils/iou.py (2 hunks)
  • fiftyone/utils/labelbox.py (6 hunks)
  • fiftyone/utils/labelstudio.py (2 hunks)
  • fiftyone/utils/patches.py (2 hunks)
  • fiftyone/utils/scale.py (4 hunks)
  • fiftyone/utils/utils3d.py (4 hunks)
  • fiftyone/utils/video.py (3 hunks)
  • tests/unittests/model_tests.py (1 hunks)
Files not reviewed due to errors (1)
  • (no review received)
Files skipped from review due to trivial changes (2)
  • fiftyone/core/frame.py
  • fiftyone/core/stages.py
Additional comments not posted (76)
fiftyone/utils/patches.py (1)

55-55: Adding validation for image collections in the __init__ method enhances the robustness of the ImagePatchesExtractor class by ensuring that the samples argument is a valid image collection.
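
A sketch of what such validation might look like, assuming the validate_image_collection helper in fiftyone.core.validation; this is not necessarily the PR's exact implementation:

import fiftyone.core.validation as fov

class ImagePatchesExtractorSketch(object):
    """Hypothetical stand-in for ImagePatchesExtractor's validation."""

    def __init__(self, samples):
        # Raises an error if `samples` is not an image collection
        fov.validate_image_collection(samples)
        self.samples = samples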

fiftyone/utils/image.py (1)

350-354: Leveraging FiftyOne's autosave feature in _transform_images_single by using view.iter_samples(autosave=save) simplifies the code and potentially optimizes the saving process by removing the need for explicit sample.save() calls.

fiftyone/utils/eval/segmentation.py (1)

356-358: Refactoring the logic for saving evaluation results based on the save flag in the evaluate_samples function improves code readability and maintainability by clearly indicating when results should be saved.

fiftyone/utils/eval/detection.py (5)

177-177: Introducing a save variable improves readability.


179-179: Ensure save controls all relevant saving operations.

Verification successful

The provided context and analysis suggest that the save variable is used to control saving operations within the function, consistent with the intent of ensuring save controls all relevant saving operations. Based on the evidence, it appears that the original review comment is aligned with the codebase's implementation.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify that `save` variable is consistently used to control saving operations throughout the function.
grep -n 'eval_key is not None' fiftyone/utils/eval/detection.py

Length of output: 101



Script:

#!/bin/bash
# Search for all occurrences of `save` and any direct saving operations within the function's scope.
grep -n -A 20 -B 20 'save = eval_key is not None' fiftyone/utils/eval/detection.py

Length of output: 1332


191-191: Using autosave=save in iter_samples call aligns with the new saving logic.


208-208: Correctly applying save to conditionally update frame-level fields.


213-213: Applying save to conditionally update sample-level fields is consistent with the intended logic.

fiftyone/utils/iou.py (2)

Line range hint 1-1: Removing pylint: disable=no-member directives suggests improved code quality or refactoring.


Line range hint 1-1: Verify no new linting issues introduced by the removal of pylint: disable=no-member.

fiftyone/core/storage.py (6)

194-210: Consider using a thread pool for opening multiple files concurrently.

The current implementation of open_files sequentially opens each file, which might not be the most efficient approach, especially when dealing with a large number of files. Consider utilizing a thread pool to open multiple files concurrently, which could improve performance significantly.
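
One way such a thread pool could look; this is a sketch, not the PR's implementation, and it assumes plain local paths rather than FiftyOne's storage abstraction:

from concurrent.futures import ThreadPoolExecutor

def open_files_concurrently(paths, mode="r", num_workers=8):
    """Opens multiple files concurrently; results preserve input order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(lambda p: open(p, mode), paths))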


226-242: Ensure proper error handling in read_files when skip_failures is False.

The read_files function is designed to read multiple files into memory, with an option to skip failures. However, it's crucial to ensure that when skip_failures is set to False, the function properly handles errors and provides meaningful feedback to the caller. Consider adding tests or checks to verify the behavior in both scenarios (with and without failures).


786-786: Optimize the move_files function for performance.

The move_files function currently relies on _run to execute _do_move_file tasks. While this is functional, there might be opportunities to optimize performance, especially for large batches of files. Consider exploring more efficient ways to move files, such as batch operations or parallel processing techniques, to enhance the function's performance.


841-841: Review the error handling strategy in delete_files.

The delete_files function includes a mechanism to skip failures based on the skip_failures flag. It's important to ensure that this error handling strategy is consistent and effective across different scenarios, including permissions issues, non-existent files, and other potential errors. Consider reviewing and testing the error handling to confirm its robustness.


956-967: Validate the mode parameter in _do_open_file.

The _do_open_file function opens a file with the specified mode. To prevent potential errors or security issues, consider validating the mode parameter to ensure it's one of the expected values (e.g., "r", "rb", "w", "wb"). This validation can help avoid unexpected behavior or vulnerabilities related to file handling.


973-984: Consider adding a size limit for files in _do_read_file.

The _do_read_file function reads the entire content of a file into memory, which could lead to memory issues for very large files. Consider implementing a size limit or a warning mechanism to alert the user when attempting to read a file that exceeds a certain size threshold. This can help prevent potential out-of-memory errors and improve the function's usability.
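
A sketch of such a guard, assuming local paths and a hypothetical threshold constant:

import os

MAX_READ_BYTES = 1 << 30  # 1 GB; hypothetical limit

def read_file_checked(path, binary=False):
    """Reads a file into memory, refusing files above MAX_READ_BYTES."""
    size = os.path.getsize(path)
    if size > MAX_READ_BYTES:
        raise ValueError(
            "Refusing to read '%s' (%d bytes exceeds %d byte limit)"
            % (path, size, MAX_READ_BYTES)
        )

    with open(path, "rb" if binary else "r") as f:
        return f.read()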

docs/source/teams/cloud_media.rst (4)

403-411: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [390-407]

Clarify the group_slices parameter in fo.Dataset.download_media.

The addition of the group_slices parameter in the fo.Dataset.download_media method is a valuable feature. However, the documentation could benefit from a more detailed explanation or examples of how to use this parameter, especially for users unfamiliar with the concept of group slices. Consider adding a brief example or a more detailed description to help users understand how to leverage this new functionality effectively.


423-447: Document the impact of batch_size and target_size_bytes on download performance.

The documentation for fo.Dataset.download_context introduces batch_size and target_size_bytes parameters for configuring the download strategy. It would be beneficial to include information about how these parameters impact download performance and efficiency. Providing guidance or best practices on choosing values for these parameters could help users optimize their download processes.
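
A hypothetical usage sketch, assuming download_context accepts the documented keyword arguments and is used as a context manager:

import fiftyone as fo

dataset = fo.load_dataset("my-cloud-dataset")  # hypothetical Teams dataset

# Media would be downloaded in ~256 MB batches rather than per-file
with dataset.download_context(target_size_bytes=256 * 1024**2):
    for sample in dataset:
        pass  # process locally cached media here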


490-496: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [493-503]

Expand on the output of fo.Dataset.cache_stats with group_slices.

The documentation for fo.Dataset.cache_stats now includes a group_slices parameter. To enhance clarity, consider providing an example of the output of this method when group_slices is used. This could help users understand what additional information is provided by specifying group slices and how it might be useful for their workflows.


500-515: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [512-522]

Emphasize the importance of group_slices in fo.Dataset.clear_media.

The addition of the group_slices parameter to fo.Dataset.clear_media is an important update. To ensure users fully understand its significance, consider emphasizing how group_slices can be used to selectively clear media from the cache. This could involve providing a use case or scenario where clearing media for specific group slices is particularly beneficial.

fiftyone/utils/geojson.py (1)

145-158: Consider adding explicit error handling or logging within the try block of load_location_data.

While the use of a try-finally block ensures that the saving operation is attempted even if an error occurs, it might be beneficial to log errors or handle specific exceptions that could occur during the loading of location data. This would improve the robustness of the function and provide more insight into any issues that arise.

fiftyone/utils/utils3d.py (1)

511-518: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [449-515]

Clarify the usage and expected type of the skip_failures parameter in compute_orthographic_projection_images.

The skip_failures parameter is documented as a boolean, but its usage suggests it might accept string values (e.g., "ignore"). This could lead to confusion about the expected type and values for this parameter. Consider clarifying its usage in the documentation or adjusting the implementation to ensure consistency with the documented type. Ensuring clear and consistent documentation and implementation for parameters is crucial for maintainability and usability.

fiftyone/utils/video.py (3)

750-750: Consider adding a comment explaining the save variable logic for clarity.

While the logic for determining when to autosave samples is clear from the code, adding a brief comment explaining the conditions under which autosaving is enabled (i.e., when save_filepaths and sample_frames are both true, or when update_filepaths is true) could improve code readability and maintainability.
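
Based on the conditions described in this comment, such a commented helper might look like the following (hypothetical; the PR inlines this as a local variable):

def _should_autosave(save_filepaths, sample_frames, update_filepaths):
    # Autosave when sampled-clip filepaths are being recorded on samples,
    # or when existing filepaths are being updated in place
    return (save_filepaths and sample_frames) or update_filepaths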


765-767: Optimization of saving logic using autosave=save is a significant improvement.

The use of autosave=save in the view.iter_samples call is a well-thought-out change that optimizes the saving logic by potentially reducing the number of database writes. This change aligns with the PR objectives of enhancing efficiency and maintainability.


762-770: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [750-767]

Ensure comprehensive testing of the updated saving logic.

Given the complexity of the _transform_videos function and the introduction of conditional autosaving, it's crucial to ensure that this logic is thoroughly tested across various scenarios. This includes testing with different combinations of save_filepaths, sample_frames, and update_filepaths to verify that samples are saved correctly and efficiently in all cases.

fiftyone/utils/scale.py (4)

23-23: The addition of import fiftyone.core.collections as foc supports the new SaveContext object. This change is necessary and correctly implemented.


211-211: Initializing the progress bar with the total number of labels and the progress parameter enhances user feedback during the import process. This change is well-integrated and improves the function's usability.


212-212: Introducing the SaveContext object and using it within a context manager alongside the progress bar is a significant improvement. This change optimizes the saving mechanism by batching save operations, which is especially beneficial for large datasets.


252-252: Updating the saving mechanism to leverage the SaveContext object for saving samples is a prudent enhancement. This approach efficiently manages database transactions and reduces overhead, which is particularly advantageous for operations involving large datasets.
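
A sketch of the pattern described in these comments, with hypothetical label data; ProgressBar and SaveContext are the FiftyOne utilities referenced above:

import fiftyone.core.collections as foc
import fiftyone.core.utils as fou

def import_labels(samples, labels):
    # `labels` is a hypothetical dict mapping sample IDs to parsed labels
    with fou.ProgressBar(total=len(labels)) as pb, foc.SaveContext(
        samples
    ) as ctx:
        for sample in pb(samples.select(list(labels.keys()))):
            sample["ground_truth"] = labels[sample.id]  # hypothetical field
            ctx.save(sample)  # queued and flushed in batches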

docs/source/user_guide/config.rst (1)

59-61: The addition of save_context() to the default_batcher configuration description accurately reflects its expanded usage. Ensure this change is consistent with the actual code implementation.

Verification successful

The search results from the source code provide instances where save_context() is used, particularly in fiftyone/core/dataset.py, fiftyone/core/view.py, and fiftyone/core/collections.py. These instances show the method being used in various contexts, including within dataset and view operations, with parameters such as batch_size and batching_strategy being passed to it. This indicates that save_context() is indeed designed to work with configurable batching strategies, which could include the default_batcher configuration as mentioned in the documentation.

However, the specific term default_batcher was not found directly in the search results. This could mean that the term default_batcher is a configuration parameter or a concept that indirectly affects how save_context() operates, rather than being directly referenced within the method's implementation or its immediate documentation.

Given the evidence from the source code showing save_context()'s flexible usage with batching strategies, and without direct references to default_batcher, it can be inferred that the addition of save_context() to the default_batcher configuration description in the documentation is consistent with the method's design to work with configurable batching strategies. However, the exact term default_batcher being part of the method's implementation or documentation could not be verified directly.

Based on this analysis, the review comment is considered to be verified with the understanding that default_batcher refers to a concept or configuration affecting batching strategies, which save_context() is designed to work with.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify that `save_context()` method uses `default_batcher` configuration.
rg "save_context" --type rst

Length of output: 664



Script:

#!/bin/bash
# Search for mentions of `default_batcher` in relation to `save_context()` in the source code or documentation comments.
rg "save_context" -A 5 -B 5 --type py

Length of output: 9915

fiftyone/core/view.py (3)

481-501: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [434-493]

Consider adding type hints for the new parameters to improve code readability and maintainability.

- def iter_samples(self, progress=False, autosave=False, batch_size=None, batching_strategy=None):
+ def iter_samples(self, progress: bool = False, autosave: bool = False, batch_size: Optional[Union[int, float]] = None, batching_strategy: Optional[str] = None):

493-498: The introduction of the batching_strategy parameter enhances the flexibility of autosaving samples. However, ensure that the implementation of this feature is thoroughly tested, especially for edge cases such as extremely large batch sizes or very short latencies.


619-639: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [572-636]

The changes to iter_groups mirror those in iter_samples, introducing the batching_strategy parameter. As with iter_samples, consider adding type hints for the new parameters and ensure thorough testing of the new functionality.

- def iter_groups(self, group_slices=None, progress=False, autosave=False, batch_size=None, batching_strategy=None):
+ def iter_groups(self, group_slices: Optional[Iterable[str]] = None, progress: bool = False, autosave: bool = False, batch_size: Optional[Union[int, float]] = None, batching_strategy: Optional[str] = None):
fiftyone/utils/data/parsers.py (7)

1681-1682: Consider documenting the default value of write_clips in the class docstring.


1727-1727: Validate that clip_path does not already exist before extracting a clip to avoid unintentional overwrites.


1760-1760: Clarify in the docstring that write_clips controls the behavior for ClipView instances specifically.


1848-1848: The constructor correctly passes parameters to the ExtractClipsMixin base class. Ensure consistent documentation for write_clips across related classes.


1833-1833: As with FiftyOneUnlabeledVideoSampleParser, clarify in the docstring the specific conditions under which write_clips applies.


1859-1859: The constructor correctly initializes base classes and handles the new parameters for label handling. Ensure that the write_clips parameter is consistently documented and understood across classes.


Line range hint 1905-1905: Ensure that the compute_metadata parameter is clearly documented, including its impact on performance when enabled.

fiftyone/core/models.py (5)

322-325: The use of contextlib.ExitStack() for managing contexts and resources is a good practice, ensuring proper cleanup in case of errors.


343-343: Centralizing save operations with ctx.save(sample) is a strategic improvement, potentially optimizing database interactions.


905-906: Ensure that the logic for adjusting field selection based on media type is correctly implemented, especially in the context of embeddings computation.


1447-1452: Verify the new utilities and methods added for handling sample collections and efficient file reading to ensure they meet the intended requirements and improve performance.


176-198: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [1-1794]

Ensure that the removal of unneeded pylint: disable=no-member directives does not introduce any new linting issues.

fiftyone/core/utils.py (1)

1608-1656: Ensure the function handles all expected input types correctly, especially when batch_size is neither None, an integer, nor a float.

+ # Verify handling of unexpected `batch_size` types
+ if not isinstance(batch_size, (type(None), int, float)):
+     raise TypeError("`batch_size` must be None, an integer, or a float")
fiftyone/utils/labelbox.py (2)

1645-1657: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [1587-1654]

Consider optimizing the sample addition process in import_from_labelbox.

The current implementation adds new samples one by one inside the loop, which can be inefficient for large datasets. Consider collecting all new samples in a list and adding them in a single call to dataset.add_samples(new_samples) after the loop. This approach can significantly reduce database operations and improve performance.
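
A sketch of the batched alternative, with hypothetical parsing details:

import fiftyone as fo

def add_labelbox_samples(dataset, records):
    # `records` is a hypothetical iterable of parsed Labelbox exports
    new_samples = [fo.Sample(filepath=r["filepath"]) for r in records]

    # One bulk insert instead of N per-sample database operations
    dataset.add_samples(new_samples)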


1847-1855: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [1850-1870]

Optimize the media upload process in upload_media_to_labelbox.

The current implementation uploads media files one by one, which can be slow for large datasets. Consider using the create_data_rows() method provided by Labelbox, which allows for batch uploading of media files. This method can significantly improve the upload speed by reducing the number of API calls made to Labelbox.

fiftyone/utils/annotations.py (4)

1347-1352: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [1313-1494]

Ensure DrawConfig parsing and usage are correctly implemented in draw_labeled_images.


1347-1352: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [1313-1494]

Verify the conversion of label fields to ImageLabels and the annotation process in draw_labeled_image.


1347-1352: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [1313-1494]

Ensure correct handling of clips and full videos in draw_labeled_videos.

Verification successful

The expanded search successfully identified a reference to is_clips within the draw_labeled_videos function, confirming that the function does indeed differentiate between clips and full videos through the use of the is_clips variable. This variable is derived from the dataset's _is_clips property, indicating a designed approach to handle clips and full videos distinctly within the function's logic.

Based on this finding, it can be concluded that the handling of clips and full videos in the draw_labeled_videos function aligns with the intended behavior as described in the review comment.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify that the handling of clips and full videos aligns with the function's logic and intended behavior.
grep -A 30 "def draw_labeled_videos(" fiftyone/utils/annotations.py | grep "is_clips"

Length of output: 85



Script:

#!/bin/bash
# Expand the search to include a broader portion of the function's body and related logic to understand how clips and full videos are handled.
grep -A 50 -B 20 "def draw_labeled_videos(" fiftyone/utils/annotations.py | grep -E "is_clips|clip_bounds"

Length of output: 150


1347-1352: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [1313-1494]

Verify the conversion of label fields to VideoLabels and the annotation process in draw_labeled_video.

Verification successful

The verification process has confirmed the presence and usage of both _to_video_labels and etaa.annotate_video within the draw_labeled_video function in fiftyone/utils/annotations.py. This aligns with the review comment's request to verify the conversion of label fields to VideoLabels and the annotation process, indicating that the function correctly implements these steps.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Check the conversion and annotation process for correctness and efficiency.
grep -A 20 "def draw_labeled_video(" fiftyone/utils/annotations.py | grep "_to_video_labels"
grep -A 20 "def draw_labeled_video(" fiftyone/utils/annotations.py | grep "etaa.annotate_video"

Length of output: 188



Script:

#!/bin/bash
# Extend the search to the entire `draw_labeled_video` function for `_to_video_labels` and `etaa.annotate_video`
grep -C 50 "def draw_labeled_video(" fiftyone/utils/annotations.py | grep "_to_video_labels"
grep -C 50 "def draw_labeled_video(" fiftyone/utils/annotations.py | grep "etaa.annotate_video"

Length of output: 284

fiftyone/utils/data/exporters.py (8)

Line range hint 1085-1085: Consider adding type hints to improve code readability and maintainability.


Line range hint 1165-1165: Ensure that the setup method properly handles exceptions that may arise during setup, such as issues with creating directories or initializing exporters.


Line range hint 1177-1177: Verify that the export_sample method correctly handles cases where the media cannot be exported due to permissions or other filesystem issues.


Line range hint 1191-1191: Consider handling potential exceptions when writing the labels JSON file, such as permissions issues or disk space limitations.


Line range hint 1335-1335: Consider adding type hints to improve code readability and maintainability.


Line range hint 1395-1395: Ensure that the setup method properly handles exceptions that may arise during setup, such as issues with creating directories.


Line range hint 1407-1407: Verify that the export_sample method correctly handles cases where the media cannot be exported due to permissions or other filesystem issues.


Line range hint 1421-1421: Ensure that the close method properly finalizes all aspects of the export, including handling any cleanup necessary for partial exports.

fiftyone/utils/cvat.py (3)

4404-4406: Consider precomputing metadata asynchronously or in parallel.

The call to samples.compute_metadata() is a potentially expensive operation, especially for large datasets. Consider precomputing this metadata asynchronously or in parallel to improve performance.


5285-5296: Consider adding a configuration option to control the warning threshold.

The warning about the required RAM for uploading all images to CVAT in a single task is helpful, but the threshold is hardcoded. Consider adding a configuration option to allow users to adjust this threshold based on their system capabilities or preferences.


4404-4406: Consider parallelizing metadata computation for batch uploads.

The metadata computation step in the batch upload process could potentially be a bottleneck for large datasets. Consider parallelizing this step to improve the overall upload performance.

fiftyone/core/dataset.py (8)

2144-2149: Introducing batching_strategy parameter enhances flexibility in batch saving. Ensure backward compatibility and update documentation accordingly.


2171-2174: Example usage of batching_strategy is clear. Consider including these examples in the official documentation for better user guidance.


2193-2207: Documentation for batch_size and batching_strategy is clear and comprehensive. Ensure consistency with example usage provided elsewhere.


2220-2224: Implementation of batching_strategy in autosave context appears correct. Recommend thorough testing to ensure expected behavior under various configurations.


2267-2267: Introducing batching_strategy to iter_groups enhances batch saving flexibility. Ensure backward compatibility and update documentation accordingly.


2291-2294: Example usage of batching_strategy with iter_groups is clear. Consider including these examples in the official documentation for better user guidance.


2316-2330: Documentation for batch_size and batching_strategy in iter_groups is clear and comprehensive. Ensure consistency with example usage provided elsewhere.


2347-2351: Implementation of batching_strategy in autosave context for groups appears correct. Recommend thorough testing to ensure expected behavior under various configurations.

fiftyone/core/collections.py (5)

118-128: Ensure proper initialization of batching strategy-related variables.

The initialization of variables related to the batching strategy (_curr_batch_size, _curr_batch_size_bytes, _last_time) in the __enter__ method is correctly handled based on the chosen strategy. This setup ensures that the context is ready for batch processing as soon as it's entered, which is a good practice.


161-180: Optimize the calculation of _curr_batch_size_bytes.

The calculation of _curr_batch_size_bytes in the if self._batching_strategy == "size": block involves converting operations to strings and then summing their lengths. This approach might not accurately reflect the actual byte size of the operations and could be inefficient. Consider a more direct method of estimating or tracking the size of operations to improve accuracy and performance.
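
One more direct estimate, assuming each pending operation can be expressed as a BSON-serializable document; this sketch uses pymongo's bson package:

import bson

def estimate_doc_size_bytes(doc):
    """Returns the serialized BSON size of `doc`, in bytes."""
    return len(bson.BSON.encode(doc))

This tracks what MongoDB actually stores more closely than summing len(str(op)) over the operations.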


874-894: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [862-891]

Clarify the behavior of batch_size and batching_strategy in documentation.

The documentation for the iter_samples method explains the batch_size and batching_strategy parameters but could benefit from additional clarity on how these parameters interact and their default behaviors. Providing examples or more detailed descriptions could help users better understand how to use these features effectively.


987-1013: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [955-1010]

Simplify the save_context method's interface if possible.

The save_context method's interface is consistent with the rest of the changes, offering flexibility in configuring the batching strategy. However, consider if there are ways to simplify this interface for common use cases, potentially by setting sensible defaults or providing utility functions for common configurations. Simplifying the interface can make it easier for users to adopt and use these features.


10082-10086: Ensure consistent handling of the "filepath" field.

The conditional addition or removal of the "filepath" field based on the include_filepath flag in _get_media_fields is a clear and straightforward way to manage the inclusion of this field. This approach allows for flexible handling of media fields, which is beneficial in different contexts.

Comment on lines +1608 to +1656
def parse_batching_strategy(batch_size=None, batching_strategy=None):
    """Parses the given batching strategy configuration, applying any default
    config settings as necessary.

    Args:
        batch_size (None): the batch size to use. If a ``batching_strategy``
            is provided, this parameter configures that strategy as described
            below. If no ``batching_strategy`` is provided, this can either be
            an integer specifying the number of samples to save in a batch (in
            which case ``batching_strategy`` is implicitly set to
            ``"static"``) or a float number of seconds between batched saves
            (in which case ``batching_strategy`` is implicitly set to
            ``"latency"``)
        batching_strategy (None): the batching strategy to use for each save
            operation. Supported values are:

            -   ``"static"``: a fixed sample batch size for each save
            -   ``"size"``: a target batch size, in bytes, for each save
            -   ``"latency"``: a target latency, in seconds, between saves
    """
    if batching_strategy is None:
        if batch_size is None:
            batching_strategy = fo.config.default_batcher
        elif isinstance(batch_size, numbers.Integral):
            batching_strategy = "static"
        elif isinstance(batch_size, numbers.Number):
            batching_strategy = "latency"
        else:
            raise ValueError(
                "Unsupported batch size %s; must be an integer or float"
                % batch_size
            )

    supported_batching_strategies = ("static", "size", "latency")
    if batching_strategy not in supported_batching_strategies:
        raise ValueError(
            "Unsupported batching strategy '%s'; supported values are %s"
            % (batching_strategy, supported_batching_strategies)
        )

    if batch_size is None:
        if batching_strategy == "static":
            batch_size = fo.config.batcher_static_size
        elif batching_strategy == "size":
            batch_size = fo.config.batcher_target_size_bytes
        elif batching_strategy == "latency":
            batch_size = fo.config.batcher_target_latency

    return batch_size, batching_strategy


Consider adding type hints for better code readability and static type checking.

- def parse_batching_strategy(batch_size=None, batching_strategy=None):
+ def parse_batching_strategy(batch_size: Optional[Union[int, float]] = None, batching_strategy: Optional[str] = None) -> Tuple[Union[int, float], str]:

The function documentation could be expanded to include examples of usage and more detailed explanations of the parameters and return values.

+ """
+ Examples:
+     >>> parse_batching_strategy()
+     (1, 'static')  # Default values from config
+     >>> parse_batching_strategy(10)
+     (10, 'static')  # Explicit batch size, inferred static strategy
+     >>> parse_batching_strategy(0.5, 'latency')
+     (0.5, 'latency')  # Explicit batch size and strategy
+ """


Comment on lines +82 to +105
        batch_size (None): the batch size to use. If a ``batching_strategy``
            is provided, this parameter configures the strategy as described
            below. If no ``batching_strategy`` is provided, this can either be
            an integer specifying the number of samples to save in a batch (in
            which case ``batching_strategy`` is implicitly set to
            ``"static"``) or a float number of seconds between batched saves
            (in which case ``batching_strategy`` is implicitly set to
            ``"latency"``)
        batching_strategy (None): the batching strategy to use for each save
            operation. Supported values are:

            -   ``"static"``: a fixed sample batch size for each save
            -   ``"size"``: a target batch size, in bytes, for each save
            -   ``"latency"``: a target latency, in seconds, between saves
    """

-   def __init__(self, sample_collection, batch_size=None):
-       if batch_size is None:
-           batch_size = 0.2
+   def __init__(
+       self,
+       sample_collection,
+       batch_size=None,
+       batching_strategy=None,
+   ):
+       batch_size, batching_strategy = fou.parse_batching_strategy(
+           batch_size=batch_size, batching_strategy=batching_strategy
+       )

Consider validating batch_size and batching_strategy parameters.

The constructor of SaveContext accepts batch_size and batching_strategy parameters but does not perform any validation on them. It's important to validate these parameters to ensure they meet the expected criteria (e.g., batch_size is a positive number, batching_strategy is one of the allowed values). This can prevent potential runtime errors and improve the robustness of the code.
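
A sketch of such validation, under the assumption that batch_size must be a positive number and batching_strategy one of the documented values:

def validate_batching_args(batch_size, batching_strategy):
    if batch_size is not None and (
        not isinstance(batch_size, (int, float)) or batch_size <= 0
    ):
        raise ValueError("batch_size must be a positive number")

    if batching_strategy not in (None, "static", "size", "latency"):
        raise ValueError(
            "batching_strategy must be one of 'static', 'size', 'latency'"
        )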

Comment on lines +10108 to +10120
def _resolve_media_field(self, media_field):
    _media_field, is_frame_field = self._handle_frame_field(media_field)

    media_fields = self._get_media_fields(frames=is_frame_field)
    for root, leaf in media_fields.items():
        if leaf is not None:
            leaf = root + "." + leaf

        if _media_field in (root, leaf):
            return leaf if leaf is not None else root

    raise ValueError("'%s' is not a valid media field" % media_field)


Validate the media_field parameter in _resolve_media_field.

The _resolve_media_field method raises a ValueError if the provided media_field is not valid. It's good practice to validate input parameters, but consider adding more detailed error messages or suggestions for valid values to help users correct their input more easily.
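
A sketch of a more informative error, written as a hypothetical helper:

def raise_invalid_media_field(media_field, media_fields):
    raise ValueError(
        "'%s' is not a valid media field; valid fields are %s"
        % (media_field, sorted(media_fields))
    )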

@@ -19,6 +19,7 @@
 import eta.core.web as etaw

 import fiftyone as fo
+import fiftyone.core.collections as foc

Consider removing the unused import fiftyone.core.collections if it's not being utilized within this file.

@@ -1707,7 +1707,7 @@ def _get_clip_path(self, sample):
         video_path = sample.filepath
         basename, ext = os.path.splitext(os.path.basename(video_path))

-        if self.export_media:
+        if self.write_clips:

Ensure that clip_dir is created if it does not exist when write_clips is True.

if self.write_clips:
    if self.clip_dir is None:
        self.clip_dir = etau.make_temp_dir()
+
+   os.makedirs(self.clip_dir, exist_ok=True)


Comment on lines +2014 to +2027
try:
    end = len(sliceable)
except:
    end = None

start = 0
while True:
    if end is not None and start >= end:
        return

    chunk = sliceable[start : (start + batch_size)]
-   if len(chunk) == 0:  # works for numpy arrays, Torch tensors, etc
+
+   # works for numpy arrays, Torch tensors, etc
+   if end is None and len(chunk) == 0:

The handling of cases where the length of the sliceable object is not available could be made more explicit in the documentation.

+ """
+ Note:
+     If the length of the `sliceable` object is not available, this function
+     will attempt to iterate over it until it's exhausted. This behavior is
+     suitable for objects like generators, but it may have unintended
+     consequences for infinite generators.
+ """


@brimoor (Contributor, Author) commented Apr 8, 2024

closing in favor of #4244

@brimoor brimoor closed this Apr 8, 2024
@brimoor brimoor deleted the models-cleanup-release branch April 8, 2024 18:08