Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[IO-1444] LongVideo Support #642

Merged
merged 17 commits into from
Aug 17, 2023
Merged

[IO-1444] LongVideo Support #642

merged 17 commits into from
Aug 17, 2023

Conversation

Nathanjp91
Copy link
Contributor

Problem

LongVideo features isn't supported by the current downloader

Solution

Add in support at the download_manager

  • Ammended slot information and parser to include segments and frame_manifest
  • Added new dataclass models representing the manifest file and rows of the manifest
  • Added functions to retrieve manifests from url retrieved via slot.frame_manifest
    • get_segment_manifests -> download_manifest_txts -> _parse_manifest
    • Supports multiple segment files
    • collates into SegmentManifest object, with all manifests summarized into List[SegmentManifest]
    • Cleans downloads after completion
  • Added functions to retrieve video segments
    • Segments preffered over full video as overcomes hvec support
    • _download_and_extract_video_segment -> _download_video_segment -> _extract_frames_from_segment
    • gets passed segment manifest relevent to the file
    • cleans up segment after extraction
    • raises if cv2 is not importable

Changelog

LongVideos supported in darwin-py

@linear
Copy link

linear bot commented Aug 3, 2023

IO-1444 Extract frames out of the original video

Given Darwin JSON 2.1, we need to amend how dataset pull --video-frames works:

  • If Darwin JSON contains $.item.slots[n].frame_urls, we keep existing implementation
  • If Darwin JSON doesn't contain $.item.slots[n].frame_urls we use orignal video file + $.item.slots[n].frame_manifests to extract frames on the fly (result of this should be the same as when fetching frames from $.item.slots[n].frame_urls currently - same frame filenames etc).

https://www.notion.so/v7labs/Pitch-Long-Videos-5a01c12bdc0e45abbd7b4fd5da149f5a

dt.SegmentManifest(slot=slot, segment=segment_int, total_frames=len(seg_manifests), items=seg_manifests)
)

# Calculate the absolute frame number for each item, as manifests are per segment
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not strictly required for this approach as we do it per segment anyway, but if we do ever want to try a full file download approach then it's mapped back from the manifest


def test_parse_manifests(manifest_paths: List[Path]) -> None:
segment_manifests = dm._parse_manifests(manifest_paths, "0")
assert len(segment_manifests) == 4
Copy link
Contributor Author

@Nathanjp91 Nathanjp91 Aug 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These long assert chains brought to you by copilot

Copy link
Contributor

@owencjones owencjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of comments to consider. Will approve on request though.

!darwin/future/tests/data_objects/workflow/data
!tests/darwin/dataset/data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always like a data based test



def get_segment_manifests(slot: dt.Slot, parent_path: Path, api_key: str) -> List[dt.SegmentManifest]:
temp_dir = parent_path / "temp"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you avoid TemporaryDirectory for xplat reasons?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just always hesitant with temps and supporting other platforms

if annotation_format == "json":
return _download_image_from_json_annotation(
api_key, annotation_path, images_path, use_folders, video_frames, force_slots, ignore_slots
)
else:
console = Console()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated, but I feel like we need a Logging style console getter/factory to maintain the console(s) in use.

def _download_all_slots_from_json_annotation(annotation, api_key, parent_path, video_frames):
def _download_all_slots_from_json_annotation(
annotation: dt.AnnotationFile, api_key: str, parent_path: Path, video_frames: bool
) -> Iterable[Callable[[], None]]:
generator = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not your code, but this feels like a name we should change, given that it's a list, and really not even acting as a generator. At best it could feed a generator's output.

Anyway, I digress slightly...

darwin/dataset/download_manager.py Outdated Show resolved Hide resolved
def download_manifest_txts(urls: List[str], api_key: str, folder: Path) -> List[Path]:
paths = []
with requests.Session() as session:
retries = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment on backoff

raise Exception(f"Manifest file ({url}) is empty.")
path = folder / f"manifest_{index + 1}.txt"
with open(str(path), "wb") as file:
file.write(response.content)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like the bubbled exceptions from these might be a bit non-specific if writes fail?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah true, we can work that out in post though

darwin/dataset/download_manager.py Outdated Show resolved Hide resolved
return segment_manifests


def _parse_manifests(paths: List[Path], slot: str) -> List[dt.SegmentManifest]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is the hardest to read, but I don't think I'd change it, because the obstacle is really domain knowledge I think.

@@ -360,6 +360,12 @@ class Slot:
#: Metadata of the slot
metadata: Optional[Dict[str, UnknownType]] = None

#: Frame Manifest for video slots
frame_manifest: Optional[List[Dict[str, UnknownType]]] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely optional change, but Dict[str, UnknownType] is interchangable with JSONType


def _extract_frames_from_segment(path: Path, manifest: dt.SegmentManifest) -> None:
try:
import cv2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This dependency needs to be added to README with some explanation about . Right now, I managed to install it only by upping python version requirement and hardcoding numpy version.

Also,

$ pip install "darwin[ocv]"
ERROR: Could not find a version that satisfies the requirement darwin[ocv] (from versions: none)
ERROR: No matching distribution found for darwin[ocv]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the pip install message is only relevant after this PR gets deployed as it needs to be packaged for pip

segment_url = slot.segments[index]["url"]
path = video_path / f".{index:07d}.ts"
generator.append(
functools.partial(_download_and_extract_video_segment, segment_url, api_key, path, manifest)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It takes too long right now. poetry run python -m darwin.cli dataset pull --video-frames 23.98s user 6.25s system 201% cpu 14.995 total 24 seconds for 76 frames on m2 CPU.

What if we have few thousand frames at least? Maybe at least show ETA with proper progress bar?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retested. For identical videos, "old" video download is 1 second and new way of download is 16 seconds on 39 frame video. It should not be that bad.

Copy link
Contributor

@brain-geek brain-geek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another inconsistency we have right now are output messages in dataset pull --video-frames.

If I download video with frames video, I get this:

Going to download 349 files to .....

Total file count after download completed 447.

But if video is without frames, this output is weird:

Going to download 2 files to ....

Total file count after download completed 98.

Maybe add information about 2 files downloaded and 96 generated?

raise Exception(f"Failed to read frame {frame_index} from video segment {path}")
if frame_index in frames_to_extract:
frames_to_extract.remove(frame_index)
frame_path = path.parent / f"{frame_index:07d}.png"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth noting that this way we persist original frame numbers in filenames, while our existing export downloads are sequential (0-1-2-3-4, not 5-10-15-20).

@Nathanjp91 Nathanjp91 marked this pull request as ready for review August 16, 2023 10:56
@Nathanjp91
Copy link
Contributor Author

Nathanjp91 commented Aug 16, 2023

@brain-geek

Maybe add information about 2 files downloaded and 96 generated?

Unfortunately because of the way this process was originally written, it would be basically impossible without a full rewrite to capture that information. It's basically using len(download_functions) as a proxy for the amount of images, but to keep consistency, the download_function for segments is the only thing that knows how many frames will get extracted and no quick/easy way to pass that back up because it's a subprocess

Copy link
Contributor

@brain-geek brain-geek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't retested it, but all my comments have been addressed.

@brain-geek
Copy link
Contributor

@Nathanjp91

Unfortunately because of the way this process was originally written, it would be basically impossible without a full rewrite to capture that information

Maybe create a ticket to do that in the future then? Right now it's not obvious, and numbers just don't match.

@Nathanjp91
Copy link
Contributor Author

Maybe create a ticket to do that in the future then? Right now it's not obvious, and numbers just don't match.

Can maybe put in a placeholder such that if video-frames flag is selected it doesn't print that information but I think rewriting the download processor probably won't happen over a darwin py v2 implementation.

@Nathanjp91 Nathanjp91 merged commit a5238b1 into master Aug 17, 2023
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants