-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[IO-1444] LongVideo Support #642
Conversation
IO-1444 Extract frames out of the original video
Given Darwin JSON 2.1, we need to amend how
https://www.notion.so/v7labs/Pitch-Long-Videos-5a01c12bdc0e45abbd7b4fd5da149f5a |
dt.SegmentManifest(slot=slot, segment=segment_int, total_frames=len(seg_manifests), items=seg_manifests) | ||
) | ||
|
||
# Calculate the absolute frame number for each item, as manifests are per segment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not strictly required for this approach as we do it per segment anyway, but if we do ever want to try a full file download approach then it's mapped back from the manifest
|
||
def test_parse_manifests(manifest_paths: List[Path]) -> None: | ||
segment_manifests = dm._parse_manifests(manifest_paths, "0") | ||
assert len(segment_manifests) == 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These long assert chains brought to you by copilot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lots of comments to consider. Will approve on request though.
!darwin/future/tests/data_objects/workflow/data | ||
!tests/darwin/dataset/data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Always like a data based test
darwin/dataset/download_manager.py
Outdated
|
||
|
||
def get_segment_manifests(slot: dt.Slot, parent_path: Path, api_key: str) -> List[dt.SegmentManifest]: | ||
temp_dir = parent_path / "temp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you avoid TemporaryDirectory
for xplat reasons?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just always hesitant with temps and supporting other platforms
if annotation_format == "json": | ||
return _download_image_from_json_annotation( | ||
api_key, annotation_path, images_path, use_folders, video_frames, force_slots, ignore_slots | ||
) | ||
else: | ||
console = Console() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated, but I feel like we need a Logging
style console getter/factory to maintain the console(s) in use.
def _download_all_slots_from_json_annotation(annotation, api_key, parent_path, video_frames): | ||
def _download_all_slots_from_json_annotation( | ||
annotation: dt.AnnotationFile, api_key: str, parent_path: Path, video_frames: bool | ||
) -> Iterable[Callable[[], None]]: | ||
generator = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not your code, but this feels like a name we should change, given that it's a list, and really not even acting as a generator. At best it could feed a generator's output.
Anyway, I digress slightly...
darwin/dataset/download_manager.py
Outdated
def download_manifest_txts(urls: List[str], api_key: str, folder: Path) -> List[Path]: | ||
paths = [] | ||
with requests.Session() as session: | ||
retries = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment on backoff
raise Exception(f"Manifest file ({url}) is empty.") | ||
path = folder / f"manifest_{index + 1}.txt" | ||
with open(str(path), "wb") as file: | ||
file.write(response.content) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels like the bubbled exceptions from these might be a bit non-specific if writes fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah true, we can work that out in post though
return segment_manifests | ||
|
||
|
||
def _parse_manifests(paths: List[Path], slot: str) -> List[dt.SegmentManifest]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is the hardest to read, but I don't think I'd change it, because the obstacle is really domain knowledge I think.
@@ -360,6 +360,12 @@ class Slot: | |||
#: Metadata of the slot | |||
metadata: Optional[Dict[str, UnknownType]] = None | |||
|
|||
#: Frame Manifest for video slots | |||
frame_manifest: Optional[List[Dict[str, UnknownType]]] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Completely optional change, but Dict[str, UnknownType]
is interchangable with JSONType
In schema https://darwin-public.s3.eu-west-1.amazonaws.com/darwin_json/2.1/schema.json we don't have frame_manifest field - we have frame_manifests instead.
darwin/dataset/download_manager.py
Outdated
|
||
def _extract_frames_from_segment(path: Path, manifest: dt.SegmentManifest) -> None: | ||
try: | ||
import cv2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This dependency needs to be added to README with some explanation about . Right now, I managed to install it only by upping python version requirement and hardcoding numpy
version.
Also,
$ pip install "darwin[ocv]"
ERROR: Could not find a version that satisfies the requirement darwin[ocv] (from versions: none)
ERROR: No matching distribution found for darwin[ocv]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the pip install message is only relevant after this PR gets deployed as it needs to be packaged for pip
segment_url = slot.segments[index]["url"] | ||
path = video_path / f".{index:07d}.ts" | ||
generator.append( | ||
functools.partial(_download_and_extract_video_segment, segment_url, api_key, path, manifest) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It takes too long right now. poetry run python -m darwin.cli dataset pull --video-frames 23.98s user 6.25s system 201% cpu 14.995 total
24 seconds for 76 frames on m2 CPU.
What if we have few thousand frames at least? Maybe at least show ETA with proper progress bar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Retested. For identical videos, "old" video download is 1 second and new way of download is 16 seconds on 39 frame video. It should not be that bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another inconsistency we have right now are output messages in dataset pull --video-frames
.
If I download video with frames video, I get this:
Going to download 349 files to .....
Total file count after download completed 447.
But if video is without frames, this output is weird:
Going to download 2 files to ....
Total file count after download completed 98.
Maybe add information about 2 files downloaded and 96 generated?
darwin/dataset/download_manager.py
Outdated
raise Exception(f"Failed to read frame {frame_index} from video segment {path}") | ||
if frame_index in frames_to_extract: | ||
frames_to_extract.remove(frame_index) | ||
frame_path = path.parent / f"{frame_index:07d}.png" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
worth noting that this way we persist original frame numbers in filenames, while our existing export downloads are sequential (0-1-2-3-4, not 5-10-15-20).
Unfortunately because of the way this process was originally written, it would be basically impossible without a full rewrite to capture that information. It's basically using len(download_functions) as a proxy for the amount of images, but to keep consistency, the download_function for segments is the only thing that knows how many frames will get extracted and no quick/easy way to pass that back up because it's a subprocess |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't retested it, but all my comments have been addressed.
Maybe create a ticket to do that in the future then? Right now it's not obvious, and numbers just don't match. |
Can maybe put in a placeholder such that if video-frames flag is selected it doesn't print that information but I think rewriting the download processor probably won't happen over a darwin py v2 implementation. |
Problem
LongVideo features isn't supported by the current downloader
Solution
Add in support at the download_manager
segments
andframe_manifest
Changelog
LongVideos supported in darwin-py