Content Validation

rhoopr edited this page Apr 10, 2026 · 4 revisions

Every downloaded file is written under a temporary suffix and validated before it is renamed to its final path. Validation catches corrupted downloads, truncated transfers, and CDN error pages served as HTTP 200.
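The validate-then-rename flow can be sketched roughly as follows. This is an illustrative Python sketch, not kei's actual code: `finalize_download`, the `validate` callback, and the `.part` suffix in the usage note are hypothetical names, and deleting the temp file on any validation failure is an assumption.

```python
import os

def finalize_download(temp_path, final_path, validate):
    """Promote temp_path to final_path only if validate(temp_path) is truthy.

    `validate` is a hypothetical callback standing in for the checks
    described on this page. On failure the temp file is deleted
    (an assumption) so a later retry starts clean.
    """
    if not validate(temp_path):
        os.remove(temp_path)
        return False
    # os.replace overwrites an existing destination and is atomic
    # when both paths are on the same filesystem.
    os.replace(temp_path, final_path)
    return True
```

A caller would pass something like `finalize_download("IMG_0001.JPG.part", "IMG_0001.JPG", my_validator)`, so a partially written file never appears under its final name.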

Magic Byte Signatures

Known media types are validated by checking the first bytes of the file against expected signatures. A mismatch produces a warning (not an error) because format variants exist -- for example, classic QuickTime MOV files start with wide or mdat instead of ftyp.

| Extension(s) | Offset | Expected Bytes | Notes |
| --- | --- | --- | --- |
| .jpg, .jpeg | 0 | FF D8 | JPEG SOI marker |
| .png | 0 | 89 50 4E 47 | \x89PNG |
| .heic, .heif, .mov, .mp4, .m4v | 4 | 66 74 79 70 | ISO BMFF ftyp atom |
| .gif | 0 | 47 49 46 38 | GIF8 (covers GIF87a/GIF89a) |
| .tiff, .tif | 0 | 49 49 2A 00 or 4D 4D 00 2A | Little-endian (II) or big-endian (MM) TIFF |
| .webp | 0 and 8 | 52 49 46 46 at 0, 57 45 42 50 at 8 | RIFF container with WEBP chunk |

Files with unknown extensions skip magic byte checks entirely.
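As a rough illustration of the signature table, the check could look like the Python sketch below. The names `SIGNATURES` and `check_magic` are hypothetical, and returning `None` for "skipped" is a choice made for this example, not a description of kei's internals.

```python
import os

# Each extension maps to a list of alternatives; every (offset, bytes)
# pair within one alternative must match for that alternative to pass.
_JPEG = [[(0, bytes.fromhex("FF D8"))]]
_TIFF = [[(0, bytes.fromhex("49 49 2A 00"))], [(0, bytes.fromhex("4D 4D 00 2A"))]]
_FTYP = [[(4, b"ftyp")]]

SIGNATURES = {
    ".jpg": _JPEG, ".jpeg": _JPEG,
    ".png": [[(0, b"\x89PNG")]],
    ".gif": [[(0, b"GIF8")]],
    ".tiff": _TIFF, ".tif": _TIFF,
    ".webp": [[(0, b"RIFF"), (8, b"WEBP")]],
    ".heic": _FTYP, ".heif": _FTYP, ".mov": _FTYP, ".mp4": _FTYP, ".m4v": _FTYP,
}

def check_magic(filename, header):
    """Return True on a match, False on a mismatch (a warning, not an
    error), or None when the extension is unknown and the check is skipped."""
    ext = os.path.splitext(filename)[1].lower()
    alternatives = SIGNATURES.get(ext)
    if alternatives is None:
        return None
    return any(
        all(header[off:off + len(sig)] == sig for off, sig in alt)
        for alt in alternatives
    )
```

Only the first 12 bytes of the file are needed to evaluate every row in the table, so `header` can be a small prefix read.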

ISO BMFF Containers (HEIC, MOV, MP4)

The ISO Base Media File Format family (HEIC, HEIF, MOV, MP4, M4V) is identified by the ftyp atom at offset 4. The first 4 bytes are the atom size (big-endian uint32), followed by the four-byte atom type. Common brand codes at offset 8:

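| Brand | Format |
| --- | --- |
| heic | HEIC (HEIF with HEVC) |
| mif1 | HEIF |
| qt | QuickTime MOV |
| isom | ISO MP4 |
| mp42 | MP4 v2 |
| M4V | Apple M4V |

Brand codes are always four bytes; shorter names are padded with trailing spaces (qt is followed by two spaces, M4V by one).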

Classic QuickTime files may start with other atoms (wide, mdat, moov, free) instead of ftyp. These produce a magic byte mismatch warning but are accepted. This is expected for live photo companion MOV files from older iPhones.
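The atom layout described above (size, type, then brand) can be read with a few lines of Python. This is a minimal sketch; `read_ftyp_brand` is a hypothetical helper name, and it only inspects the first atom in the file.

```python
import struct

def read_ftyp_brand(header):
    """Return the 4-byte major brand if the file starts with an ftyp
    atom, otherwise None (e.g. classic QuickTime starting with wide)."""
    if len(header) < 12:
        return None
    # Bytes 0-3: atom size as big-endian uint32; bytes 4-7: atom type.
    _size, atom_type = struct.unpack(">I4s", header[:8])
    if atom_type != b"ftyp":
        return None
    # Bytes 8-11: major brand, space-padded to four bytes.
    return header[8:12]
```

A `None` result for a .mov file corresponds to the mismatch warning case: the file is still accepted, it just does not begin with ftyp.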

HTML Rejection

Apple's CDN occasionally serves error or rate-limit pages as HTTP 200 with an HTML body. HTML content is always rejected as a hard error regardless of file extension. Detection checks the first non-whitespace bytes for:

  • <! (catches <!DOCTYPE html>)
  • <html (case-insensitive)

HTML rejection is retryable -- the download is deleted and re-attempted on the next pass with a fresh CDN URL.

There is also an HTTP-level check: responses with Content-Type: text/html are rejected before any bytes are written to disk.

XML content (e.g., AAE sidecar files which are Apple plist XML) is not rejected. The <?xml prefix does not trigger HTML detection.
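The two HTML checks can be approximated as below. This is a sketch in Python with hypothetical function names (`looks_like_html`, `is_html_content_type`); how kei strips whitespace or parses the header may differ.

```python
def looks_like_html(head):
    """Body-level check: inspect the first non-whitespace bytes."""
    stripped = head.lstrip()  # bytes.lstrip() removes ASCII whitespace
    if stripped.startswith(b"<!"):
        return True  # e.g. <!DOCTYPE html>
    return stripped[:5].lower() == b"<html"

def is_html_content_type(value):
    """Header-level check: compare the media type, ignoring parameters
    such as charset and ignoring case."""
    if value is None:
        return False
    return value.split(";", 1)[0].strip().lower() == "text/html"
```

Because `<?xml` starts with `<?` rather than `<!` or `<html`, AAE sidecars pass the body check untouched.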

Apple UTI to Extension Mapping

iCloud returns file types as Apple Uniform Type Identifiers (UTIs) in the asset_type field. kei maps these to file extensions when the original filename's extension doesn't match:

| UTI | Extension |
| --- | --- |
| public.heic | HEIC |
| public.heif | HEIF |
| public.jpeg | JPG |
| public.png | PNG |
| com.apple.quicktime-movie | MOV |
| com.adobe.raw-image | DNG |
| com.canon.cr2-raw-image | CR2 |
| com.canon.crw-raw-image | CRW |
| com.canon.cr3-raw-image | CR3 |
| com.sony.arw-raw-image | ARW |
| com.fuji.raw-image | RAF |
| com.panasonic.rw2-raw-image | RW2 |
| com.nikon.nrw-raw-image | NRF |
| com.nikon.raw-image | NEF |
| com.pentax.raw-image | PEF |
| com.olympus.raw-image | ORF |
| com.olympus.or-raw-image | ORF |
| org.webmproject.webp | WEBP |

Assets with unrecognized UTIs keep their original filename extension unchanged.
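One way to picture the lookup is the Python sketch below. The dictionary is copied from the table; `resolve_filename` is a hypothetical name, and treating "doesn't match" as a case-insensitive comparison of the extension is an assumption rather than confirmed kei behavior.

```python
import os

UTI_TO_EXTENSION = {
    "public.heic": "HEIC", "public.heif": "HEIF",
    "public.jpeg": "JPG", "public.png": "PNG",
    "com.apple.quicktime-movie": "MOV",
    "com.adobe.raw-image": "DNG",
    "com.canon.cr2-raw-image": "CR2", "com.canon.crw-raw-image": "CRW",
    "com.canon.cr3-raw-image": "CR3",
    "com.sony.arw-raw-image": "ARW", "com.fuji.raw-image": "RAF",
    "com.panasonic.rw2-raw-image": "RW2",
    "com.nikon.nrw-raw-image": "NRF", "com.nikon.raw-image": "NEF",
    "com.pentax.raw-image": "PEF",
    "com.olympus.raw-image": "ORF", "com.olympus.or-raw-image": "ORF",
    "org.webmproject.webp": "WEBP",
}

def resolve_filename(filename, asset_type):
    """Swap in the mapped extension when the UTI is known and the
    filename's extension differs (case-insensitive, an assumption)."""
    mapped = UTI_TO_EXTENSION.get(asset_type)
    if mapped is None:
        return filename  # unrecognized UTI: keep the name unchanged
    stem, ext = os.path.splitext(filename)
    if ext[1:].upper() == mapped:
        return filename
    return f"{stem}.{mapped}"
```

For example, under these assumptions `resolve_filename("IMG_0001.JPG", "public.heic")` yields `IMG_0001.HEIC`, while a filename that already ends in `.heic` is left alone.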

Content-Length Verification

Before checksum comparison, the pipeline verifies that the number of bytes received matches the Content-Length header. This catches truncated downloads early and triggers an automatic retry, avoiding wasted time computing a SHA-256 on incomplete data.
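The length check itself is a simple comparison, sketched here in Python. `length_matches` is a hypothetical name, and skipping the check when the header is missing or unparseable is an assumption made for this example.

```python
def length_matches(bytes_received, content_length_header):
    """Return False only when a usable Content-Length header disagrees
    with the number of bytes actually received."""
    if content_length_header is None:
        return True  # assumption: no header means nothing to compare
    try:
        expected = int(content_length_header)
    except ValueError:
        return True  # assumption: an unparseable header is ignored
    return bytes_received == expected
```

A `False` result is the signal to discard the partial file and retry, before any hashing work is done.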

Checksum Note

Apple's fileChecksum field in the CloudKit API is an MMCS (MobileMe Chunked Storage) compound signature used for internal chunk routing -- it is not a SHA-1 or SHA-256 content hash, so it cannot be compared against a hash of the downloaded bytes. File integrity after download is instead verified by kei verify --checksums, which recomputes the SHA-256 of the file on disk and compares it against the locally computed hash stored at download time.
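The comparison against a stored hash can be sketched with Python's standard hashlib. The helper names and the hex-string storage format are assumptions for illustration; this is not how kei verify is necessarily implemented.

```python
import hashlib

def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash the file in fixed-size chunks so large videos are not
    loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_stored_hash(path, stored_hex):
    """Compare the current on-disk hash with the value recorded at
    download time (assumed here to be a hex string)."""
    return sha256_of_file(path) == stored_hex.lower()
```

A mismatch means the file changed or was damaged after download, which is something the Content-Length and magic byte checks cannot detect on their own.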
