[IO-1444] LongVideo Support #642

Nathanjp91 · 2023-08-03T07:55:47Z

Problem

LongVideo features aren't supported by the current downloader.

Solution

Add in support at the download_manager

Amended slot information and parser to include segments and frame_manifest
Added new dataclass models representing the manifest file and rows of the manifest
Added functions to retrieve manifests from the URL retrieved via slot.frame_manifest
- get_segment_manifests -> download_manifest_txts -> _parse_manifests
- Supports multiple segment files
- Collates into a SegmentManifest object, with all manifests summarized into List[SegmentManifest]
- Cleans up downloads after completion
Added functions to retrieve video segments
- Segments are preferred over the full video, as this works around HEVC support issues
- _download_and_extract_video_segment -> _download_video_segment -> _extract_frames_from_segment
- Gets passed the segment manifest relevant to the file
- Cleans up the segment after extraction
- Raises if cv2 is not importable

Changelog

LongVideos supported in darwin-py

linear · 2023-08-03T07:55:49Z

IO-1444 Extract frames out of the original video

Given Darwin JSON 2.1, we need to amend how dataset pull --video-frames works:

If Darwin JSON contains $.item.slots[n].frame_urls, we keep existing implementation
If Darwin JSON doesn't contain $.item.slots[n].frame_urls, we use the original video file + $.item.slots[n].frame_manifests to extract frames on the fly (the result should be the same as when fetching frames from $.item.slots[n].frame_urls currently: same frame filenames etc.).
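The two cases above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name frame_source_for_slot is hypothetical, the slot is modelled as a plain dict taken from $.item.slots[n], and an empty frame_urls list is assumed to count as "not present".

```python
from typing import Any, Dict


def frame_source_for_slot(slot: Dict[str, Any]) -> str:
    """Decide where the frames of one slot should come from.

    Hypothetical helper: `slot` is assumed to be one entry of $.item.slots.
    """
    # Existing behaviour: per-frame URLs are present, so download them directly.
    if slot.get("frame_urls"):
        return "frame_urls"
    # New behaviour: no frame URLs, so extract frames from the video
    # using the manifests to work out which frames are needed.
    if slot.get("frame_manifests"):
        return "frame_manifests"
    raise ValueError("Slot has neither frame_urls nor frame_manifests")
```

Whichever branch is taken, the ticket expects the same frame filenames on disk.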

https://www.notion.so/v7labs/Pitch-Long-Videos-5a01c12bdc0e45abbd7b4fd5da149f5a

Nathanjp91 · 2023-08-03T08:33:01Z

darwin/dataset/download_manager.py

+            dt.SegmentManifest(slot=slot, segment=segment_int, total_frames=len(seg_manifests), items=seg_manifests)
+        )
+
+    # Calculate the absolute frame number for each item, as manifests are per segment


This is not strictly required for this approach, as we do it per segment anyway, but if we ever want to try a full-file download approach, the absolute frame number is already mapped back from the manifest.
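For readers following along, the per-segment to absolute mapping might look something like the sketch below. The SegmentManifest fields (slot, segment, total_frames, items) come from the diff above; the ManifestItem fields frame and absolute_frame, and the function name assign_absolute_frames, are assumptions made for illustration and may not match darwin/datatypes.py.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ManifestItem:
    # Hypothetical fields: the real manifest row model may differ.
    frame: int  # frame index within its own segment
    absolute_frame: Optional[int] = None  # frame index within the whole video


@dataclass
class SegmentManifest:
    slot: str
    segment: int
    total_frames: int
    items: List[ManifestItem] = field(default_factory=list)


def assign_absolute_frames(manifests: List[SegmentManifest]) -> None:
    """Fill in absolute_frame by adding up the frames of earlier segments."""
    offset = 0
    # Walk segments in order so each one starts where the previous one ended.
    for manifest in sorted(manifests, key=lambda m: m.segment):
        for item in manifest.items:
            item.absolute_frame = offset + item.frame
        offset += manifest.total_frames
```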

Nathanjp91 · 2023-08-03T08:33:55Z

tests/darwin/dataset/download_manager_test.py

+
+def test_parse_manifests(manifest_paths: List[Path]) -> None:
+    segment_manifests = dm._parse_manifests(manifest_paths, "0")
+    assert len(segment_manifests) == 4


These long assert chains brought to you by copilot

owencjones

Lots of comments to consider. Will approve on request though.

owencjones · 2023-08-03T09:10:43Z

.gitignore

+!darwin/future/tests/data_objects/workflow/data
+!tests/darwin/dataset/data


Always like a data based test

owencjones · 2023-08-03T09:11:42Z

darwin/dataset/download_manager.py

+
+
+def get_segment_manifests(slot: dt.Slot, parent_path: Path, api_key: str) -> List[dt.SegmentManifest]:
+    temp_dir = parent_path / "temp"


Did you avoid TemporaryDirectory for xplat reasons?

I'm just always hesitant with temps and supporting other platforms
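For comparison, a version using tempfile.TemporaryDirectory from the standard library could look like the sketch below. The function name and the manifest_1.txt filename are illustrative only, and whether this addresses the xplat concern is for the authors to judge. Passing dir=parent_path keeps the temporary folder next to the download target rather than in the system temp location.

```python
import tempfile
from pathlib import Path


def write_manifest_in_temp_dir(parent_path: Path, content: bytes) -> int:
    """Write content to a temporary folder under parent_path and return its size.

    Hypothetical helper: the folder and everything in it is removed when the
    `with` block exits, even if an exception is raised inside it.
    """
    parent_path.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory(dir=parent_path) as temp_dir:
        manifest_path = Path(temp_dir) / "manifest_1.txt"
        manifest_path.write_bytes(content)
        size = manifest_path.stat().st_size
    # At this point the temporary directory no longer exists.
    return size
```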

owencjones · 2023-08-03T09:12:57Z

darwin/dataset/download_manager.py

    if annotation_format == "json":
        return _download_image_from_json_annotation(
            api_key, annotation_path, images_path, use_folders, video_frames, force_slots, ignore_slots
        )
    else:
+        console = Console()


Unrelated, but I feel like we need a Logging style console getter/factory to maintain the console(s) in use.

owencjones · 2023-08-03T09:14:51Z

darwin/dataset/download_manager.py

-def _download_all_slots_from_json_annotation(annotation, api_key, parent_path, video_frames):
+def _download_all_slots_from_json_annotation(
+    annotation: dt.AnnotationFile, api_key: str, parent_path: Path, video_frames: bool
+) -> Iterable[Callable[[], None]]:
    generator = []


Not your code, but this feels like a name we should change, given that it's a list, and really not even acting as a generator. At best it could feed a generator's output.

Anyway, I digress slightly...
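To make the naming point concrete, here is a small sketch contrasting a list of callables with an actual generator. All names here (fake_download, build_download_functions, iter_download_functions) are hypothetical and not taken from download_manager.py.

```python
import functools
from typing import Callable, Iterable, Iterator, List


def fake_download(url: str) -> str:
    # Stand-in for a real download function, for illustration only.
    return f"downloaded {url}"


def build_download_functions(urls: Iterable[str]) -> List[Callable[[], str]]:
    """What the current code does in spirit: collect callables in a list."""
    download_functions = []
    for url in urls:
        download_functions.append(functools.partial(fake_download, url))
    return download_functions


def iter_download_functions(urls: Iterable[str]) -> Iterator[Callable[[], str]]:
    """An actual generator: callables are produced lazily, one at a time."""
    for url in urls:
        yield functools.partial(fake_download, url)
```

The list version supports len(), which matters if the caller reports a file count up front; the generator version does not.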

owencjones · 2023-08-03T10:02:05Z

darwin/dataset/download_manager.py

+def download_manifest_txts(urls: List[str], api_key: str, folder: Path) -> List[Path]:
+    paths = []
+    with requests.Session() as session:
+        retries = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])


Same comment on backoff

owencjones · 2023-08-03T10:02:51Z

darwin/dataset/download_manager.py

+                raise Exception(f"Manifest file ({url}) is empty.")
+            path = folder / f"manifest_{index + 1}.txt"
+            with open(str(path), "wb") as file:
+                file.write(response.content)


Feels like the bubbled exceptions from these might be a bit non-specific if writes fail?

Yeah true, we can work that out in post though
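One possible shape for a more specific error is sketched below. ManifestWriteError and save_manifest are hypothetical names, not part of this PR; the idea is only to wrap the OSError from the write so the caller can see which manifest and path were involved.

```python
from pathlib import Path


class ManifestWriteError(Exception):
    """Hypothetical exception for a manifest that could not be saved to disk."""


def save_manifest(content: bytes, path: Path, url: str) -> Path:
    """Write manifest bytes to path, raising a specific error on failure."""
    if not content:
        raise ValueError(f"Manifest file ({url}) is empty.")
    try:
        with open(path, "wb") as file:
            file.write(content)
    except OSError as error:
        # Keep the original error as the cause so no detail is lost.
        raise ManifestWriteError(f"Could not write manifest from {url} to {path}") from error
    return path
```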

owencjones · 2023-08-03T10:32:02Z

darwin/dataset/download_manager.py

+    return segment_manifests
+
+
+def _parse_manifests(paths: List[Path], slot: str) -> List[dt.SegmentManifest]:


This one is the hardest to read, but I don't think I'd change it, because the obstacle is really domain knowledge I think.

owencjones · 2023-08-03T10:33:33Z

darwin/datatypes.py

@@ -360,6 +360,12 @@ class Slot:
    #: Metadata of the slot
    metadata: Optional[Dict[str, UnknownType]] = None

+    #: Frame Manifest for video slots
+    frame_manifest: Optional[List[Dict[str, UnknownType]]] = None


Completely optional change, but Dict[str, UnknownType] is interchangeable with JSONType

In schema https://darwin-public.s3.eu-west-1.amazonaws.com/darwin_json/2.1/schema.json we don't have frame_manifest field - we have frame_manifests instead.

brain-geek · 2023-08-10T12:49:36Z

darwin/dataset/download_manager.py

+
+def _extract_frames_from_segment(path: Path, manifest: dt.SegmentManifest) -> None:
+    try:
+        import cv2


This dependency needs to be added to README with some explanation about . Right now, I managed to install it only by upping python version requirement and hardcoding numpy version.

Also,

$ pip install "darwin[ocv]"
ERROR: Could not find a version that satisfies the requirement darwin[ocv] (from versions: none)
ERROR: No matching distribution found for darwin[ocv]

Yeah the pip install message is only relevant after this PR gets deployed as it needs to be packaged for pip
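As a general illustration of the "raise if the optional dependency is missing" pattern discussed here, a guard could be written along these lines. The helper name require_optional_module and the install_hint text are assumptions; the exact install command to show users should come from the packaged release.

```python
import importlib
from types import ModuleType


def require_optional_module(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency or fail with an actionable message.

    Hypothetical helper: install_hint is free text supplied by the caller.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as error:
        raise ImportError(
            f"{module_name} is required for this feature. {install_hint}"
        ) from error
```

A call such as require_optional_module("cv2", "<install instructions>") would then be made only on the code path that extracts frames, so users who never pull long videos do not need the dependency.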

brain-geek · 2023-08-10T12:56:48Z

darwin/dataset/download_manager.py

+                segment_url = slot.segments[index]["url"]
+                path = video_path / f".{index:07d}.ts"
+                generator.append(
+                    functools.partial(_download_and_extract_video_segment, segment_url, api_key, path, manifest)


It takes too long right now:

poetry run python -m darwin.cli dataset pull --video-frames  23.98s user 6.25s system 201% cpu 14.995 total

24 seconds for 76 frames on m2 CPU.

What if we have at least a few thousand frames? Maybe at least show an ETA with a proper progress bar?

Retested. For identical videos, the "old" video download takes 1 second and the new way of downloading takes 16 seconds on a 39-frame video. It should not be that bad.

brain-geek

Another inconsistency we have right now is the output messages in dataset pull --video-frames.

If I download a video with frames, I get this:

Going to download 349 files to .....

Total file count after download completed 447.

But if the video is without frames, the output is weird:

Going to download 2 files to ....

Total file count after download completed 98.

Maybe add information about 2 files downloaded and 96 generated?

brain-geek · 2023-08-11T00:14:24Z

darwin/dataset/download_manager.py

+            raise Exception(f"Failed to read frame {frame_index} from video segment {path}")
+        if frame_index in frames_to_extract:
+            frames_to_extract.remove(frame_index)
+            frame_path = path.parent / f"{frame_index:07d}.png"


Worth noting that this way we persist the original frame numbers in filenames, while our existing export downloads are sequential (0-1-2-3-4, not 5-10-15-20).
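To illustrate the difference, one way to map original frame numbers to sequential filenames is sketched below. The function name sequential_frame_names is hypothetical; the seven-digit zero padding and .png suffix follow the frame_path format in the diff above.

```python
from typing import Dict, Iterable


def sequential_frame_names(frame_indices: Iterable[int], suffix: str = ".png") -> Dict[int, str]:
    """Map each original frame index to a filename numbered by position.

    Hypothetical helper: frames are ordered by their original index and
    numbered 0, 1, 2, ... regardless of gaps in the original numbering.
    """
    ordered = sorted(set(frame_indices))
    return {original: f"{position:07d}{suffix}" for position, original in enumerate(ordered)}
```

With frames 5, 10, 15 and 20, this yields 0000000.png through 0000003.png instead of 0000005.png through 0000020.png.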

Nathanjp91 · 2023-08-16T11:14:13Z

@brain-geek

Maybe add information about 2 files downloaded and 96 generated?

Unfortunately, because of the way this process was originally written, it would be basically impossible to capture that information without a full rewrite. It's basically using len(download_functions) as a proxy for the number of images, but to keep consistency, the download_function for segments is the only thing that knows how many frames will get extracted, and there is no quick/easy way to pass that back up because it's a subprocess.

brain-geek

I haven't retested it, but all my comments have been addressed.

brain-geek · 2023-08-16T12:02:41Z

@Nathanjp91

Unfortunately because of the way this process was originally written, it would be basically impossible without a full rewrite to capture that information

Maybe create a ticket to do that in the future then? Right now it's not obvious, and numbers just don't match.

Nathanjp91 · 2023-08-16T12:09:12Z

Maybe create a ticket to do that in the future then? Right now it's not obvious, and numbers just don't match.

We can maybe put in a placeholder so that if the video-frames flag is selected it doesn't print that information, but I think rewriting the download processor probably won't happen over a darwin-py v2 implementation.

Nathan Perkins added 5 commits August 2, 2023 10:10

WIP LV support

eca8834

WIP LV support

f8643bb

manifest tests

effb370

get_segment test

fb5ea09

multi-slot support

aa55d67

Nathan Perkins added 3 commits August 3, 2023 18:13

tests

e1b64dd

test data

da54d1d

test data

4d8f03a

Nathanjp91 commented Aug 3, 2023

frame_manifest

ad37223

owencjones reviewed Aug 3, 2023

Nathan Perkins and others added 4 commits August 3, 2023 21:42

cv2 and path changes

e9b230c

revisions

88fc1b0

Fix error message (right now it drops all in []

b9b356e

Fix field name in darwin slot parsing

b70226f

In schema https://darwin-public.s3.eu-west-1.amazonaws.com/darwin_json/2.1/schema.json we don't have frame_manifest field - we have frame_manifests instead.

brain-geek reviewed Aug 10, 2023

brain-geek reviewed Aug 11, 2023

rslota and others added 2 commits August 14, 2023 16:07

Index extracted frames in sequence

cca7cc1

PR revisions

cabf579

Nathanjp91 marked this pull request as ready for review August 16, 2023 10:56

removing useless code

992168b

brain-geek approved these changes Aug 16, 2023

merge master

be1dd6f

Nathanjp91 merged commit a5238b1 into master Aug 17, 2023
9 checks passed
