Download Pipeline
kei uses a streaming, concurrent download architecture designed for large libraries.
Assets flow from the iCloud API directly into the download pipeline. The API is paginated, and while one page of results is being downloaded, the next page is being fetched. Downloads begin as soon as the first page of photos is returned, without waiting to count the total library size first.
For a library with 100k+ photos, downloads begin within seconds of authentication completing.
On subsequent runs, the pipeline checks for changes using a CloudKit sync token before enumerating. If nothing changed, the cycle completes in 1-2 API calls. When changes exist, only the delta is processed - new assets go through the download pipeline, deletions are logged and skipped. Clear sync tokens with kei reset sync-token when you need the next sync to do a full enumeration. See State Tracking.
Multiple files can be downloaded simultaneously. Set the number of concurrent downloads with [download].threads in the TOML config. Downloads use buffer_unordered, so files complete in whatever order they finish, not the order in which they were started.
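The completion-order behaviour can be illustrated with a small sketch. It deliberately swaps buffer_unordered for plain std threads and a channel so that it needs no external crates; run_unordered, the u64 job type, and the worker count are hypothetical names chosen for illustration, not kei's code.

```rust
use std::collections::VecDeque;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Runs `jobs` on at most `threads` workers and returns the results in
/// completion order. Hypothetical stand-in for `buffer_unordered`.
fn run_unordered<F>(jobs: Vec<u64>, threads: usize, work: F) -> Vec<u64>
where
    F: Fn(u64) -> u64 + Send + Sync + 'static,
{
    let queue = Arc::new(Mutex::new(VecDeque::from(jobs)));
    let work = Arc::new(work);
    let (tx, rx) = mpsc::channel();

    let mut handles = Vec::new();
    for _ in 0..threads.max(1) {
        let queue = Arc::clone(&queue);
        let work = Arc::clone(&work);
        let tx = tx.clone();
        handles.push(thread::spawn(move || loop {
            // Take the next job; the lock is released before the work runs.
            let job = queue.lock().unwrap().pop_front();
            match job {
                Some(j) => tx.send((*work)(j)).unwrap(),
                None => break,
            }
        }));
    }
    drop(tx); // the receiver finishes once every worker has exited

    // Results arrive in whatever order the workers finish them.
    let results: Vec<u64> = rx.iter().collect();
    for handle in handles {
        handle.join().unwrap();
    }
    results
}
```

Because results are collected as they arrive, a caller that needs a stable order has to sort or key them afterwards.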
File downloads use a dedicated HTTP client with no total request timeout. This prevents large files (e.g., 500MB MOV videos) from being killed mid-transfer when sharing bandwidth with many concurrent downloads. The download client uses:
- 30s connect timeout - fast failure for unreachable hosts
- 120s read timeout - detects stalled connections (no bytes received for 120s) without killing slow-but-progressing transfers
API calls (authentication, album enumeration) continue to use a 30s total timeout for fast failure on errors.
If a download is interrupted, the partially downloaded temp file is kept. On the next run, the existing bytes are verified and the download resumes from where it left off using HTTP Range requests. The locally-stored SHA-256 checksum covers the entire file (existing + new bytes) and is used by verify --checksums.
Resume hashing uses a 256KB buffer for fast re-verification of large partial files.
If the server does not support Range requests and returns HTTP 200 instead of 206, the download restarts from the beginning. This is logged at info level so you can see when a resume was reset.
When the server does return partial content, kei validates Content-Range before appending to the existing .kei-tmp file. A mismatched range resets the partial file rather than stitching unrelated byte ranges together.
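The resume rules above (206 with a matching Content-Range appends, anything else restarts) can be sketched as a small decision function. This is a minimal illustration under stated assumptions: ResumePlan, content_range_start, and plan_resume are hypothetical names, and only the start offset of the range is checked here.

```rust
/// What to do with an existing partial file once the response headers arrive.
#[derive(Debug, PartialEq)]
enum ResumePlan {
    /// Append the response body to the existing partial file.
    Append,
    /// Discard the partial file and write the body from byte zero.
    Restart,
}

/// Parses the start offset from a `bytes START-END/TOTAL` Content-Range value.
fn content_range_start(value: &str) -> Option<u64> {
    let rest = value.trim().strip_prefix("bytes ")?;
    let (range, _total) = rest.split_once('/')?;
    let (start, _end) = range.split_once('-')?;
    start.trim().parse().ok()
}

/// Decides between appending and restarting (hypothetical helper).
fn plan_resume(status: u16, content_range: Option<&str>, existing_len: u64) -> ResumePlan {
    match (status, content_range.and_then(content_range_start)) {
        // Partial content that begins exactly where the local file ends.
        (206, Some(start)) if start == existing_len => ResumePlan::Append,
        // HTTP 200, a missing header, or a mismatched range: start over.
        _ => ResumePlan::Restart,
    }
}
```

For example, plan_resume(206, Some("bytes 1000-4999/5000"), 1000) yields Append, while a 200 response or a range starting at a different offset yields Restart.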
Partial downloads use .kei-tmp as the temp file suffix by default (changed from .part in v0.2.0). This avoids conflicts with Nextcloud, Seafile, and other WebDAV sync clients that reject or ignore .part files. The suffix is configurable with [download].temp_suffix.
Before comparing checksums, the pipeline verifies that the number of bytes received matches the Content-Length header from Apple's CDN. This catches truncated downloads (e.g., Apple silently cutting off large videos at ~1 GB) early and triggers an automatic retry, avoiding wasted time computing a checksum on incomplete data.
After the main download pass, any failed downloads get a second attempt. The cleanup pass re-fetches CDN URLs from iCloud before retrying, which fixes failures caused by expired download URLs on large files.
Deferred unfiled passes also reopen their iCloud stream after album passes finish. That keeps signed CDN URLs fresher and avoids spending a long run on URLs that expired while album work was still active.
If Apple invalidates the session mid-sync (common on very large libraries that run for hours), downloads start failing with HTTP 401/403 errors. The pipeline detects these auth errors and, after a threshold is reached, pauses downloads and triggers automatic re-authentication.
Once re-authenticated, the download pass resumes. Already-downloaded files are naturally skipped, so progress isn't lost. See Authentication for details.
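The threshold logic might look roughly like the following. This is a sketch only: AuthErrorTracker is a hypothetical type, the threshold value is a caller-supplied assumption (the text above does not state kei's value), and the sketch counts total rather than consecutive auth failures.

```rust
/// Counts HTTP 401/403 responses and reports when re-authentication is due.
/// Hypothetical type for illustration.
struct AuthErrorTracker {
    threshold: u32, // assumed to be configurable; kei's real value is not given
    seen: u32,
}

impl AuthErrorTracker {
    fn new(threshold: u32) -> Self {
        Self { threshold, seen: 0 }
    }

    /// Records a response status; returns true once the threshold is reached.
    fn record(&mut self, status: u16) -> bool {
        if status == 401 || status == 403 {
            self.seen += 1;
        }
        self.seen >= self.threshold
    }

    /// Clears the counter after a successful re-authentication.
    fn reset(&mut self) {
        self.seen = 0;
    }
}
```

A download loop would call record for each response, pause and re-authenticate when it returns true, then call reset before resuming.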
Every download is verified against the expected file size (from both the Content-Length header and the API's size field) and validated for correct content type using magic byte checks (JPEG, HEIC, PNG, MOV, etc.). HTML and JSON error pages served by Apple's CDN are detected and rejected with an automatic retry.
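As an illustration of the size and magic-byte checks, here is a minimal sketch. Kind, sniff_kind, and validate are hypothetical names, the signature list is deliberately short, and the "ftyp" test only covers ISO base media files that begin with an ftyp box (HEIC, and MOV files written that way).

```rust
/// Coarse content kinds recognised by this sketch.
#[derive(Debug, PartialEq)]
enum Kind { Jpeg, Png, IsoMedia, ErrorPage, Unknown }

/// Classifies a file by its leading bytes.
fn sniff_kind(head: &[u8]) -> Kind {
    if head.starts_with(&[0xFF, 0xD8, 0xFF]) {
        Kind::Jpeg
    } else if head.starts_with(&[0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A]) {
        Kind::Png
    } else if head.len() >= 8 && &head[4..8] == b"ftyp" {
        // A 4-byte box size followed by "ftyp" marks an ISO base media file.
        Kind::IsoMedia
    } else {
        // The first non-whitespace byte of an HTML or JSON body.
        match head.iter().copied().find(|b| !b.is_ascii_whitespace()) {
            Some(b'<') | Some(b'{') | Some(b'[') => Kind::ErrorPage,
            _ => Kind::Unknown,
        }
    }
}

/// Checks the byte count first, then the signature (hypothetical helper).
fn validate(head: &[u8], received: u64, expected: u64) -> Result<Kind, String> {
    if received != expected {
        return Err(format!("size mismatch: got {received} bytes, expected {expected}"));
    }
    match sniff_kind(head) {
        Kind::ErrorPage => Err("body looks like an HTML or JSON error page".to_string()),
        Kind::Unknown => Err("unrecognised file signature".to_string()),
        kind => Ok(kind),
    }
}
```

Checking the size before the signature keeps the cheap test first; either failure would be the point at which a retry is scheduled.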
Note: Apple's fileChecksum field in the CloudKit API is an MMCS (MobileMe Chunked Storage) compound signature used for internal chunk routing - it is not a SHA-1 or SHA-256 content hash. It cannot be used to verify downloaded file content. File integrity after download is checked by verify --checksums, which compares a locally-computed SHA-256 (stored at download time) against the file on disk.
State is flushed to SQLite after each successful download. If the process is killed mid-sync, the worst case is one file gets re-downloaded on the next run - nothing is lost.
EXIF datetime tags and XMP Toolkit writes use the same temp-file and atomic-rename path as media downloads. If a metadata write fails, the file still lands in the download directory with a warning, but the final path never contains a half-written metadata block.
Content is validated (size and format) while the file is still .kei-tmp. A failed validation doesn't leave garbage in the download directory.
When a shutdown signal is received (Ctrl+C, SIGTERM, SIGHUP), the pipeline finishes any downloads already in flight, then stops processing new ones. Partial .kei-tmp files from interrupted downloads are preserved and resumed on the next run via HTTP Range requests.
After 30 seconds, remaining in-flight downloads are cancelled. A second signal force-exits immediately.
When multiple iCloud assets share the same filename (common with generic names like IMG_0001.jpg), the default [photos].file_match_policy = "name-size-dedup-with-suffix" keeps both files:
- If a file exists with the same size, it's considered already downloaded and skipped
- If a file exists with a different size, the new file is saved with a size suffix (e.g., photo-12345.jpg)
This matches Python icloudpd's behavior and keeps photos from being silently skipped due to filename collisions.
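The two rules above can be sketched as a single function. MatchOutcome and resolve_collision are hypothetical names, and the sketch assumes the suffix is the new asset's size in bytes inserted before the extension, which is consistent with the photo-12345.jpg example but not confirmed beyond it.

```rust
/// Result of applying the name-size-dedup-with-suffix policy to one asset.
#[derive(Debug, PartialEq)]
enum MatchOutcome {
    /// Download the asset and save it under this filename.
    Download(String),
    /// A file with the same name and size already exists.
    Skip,
}

/// `existing_size` is the size of a file already on disk with this name, if any.
fn resolve_collision(filename: &str, existing_size: Option<u64>, new_size: u64) -> MatchOutcome {
    match existing_size {
        None => MatchOutcome::Download(filename.to_string()),
        Some(size) if size == new_size => MatchOutcome::Skip,
        Some(_) => {
            // Split "photo.jpg" into ("photo", ".jpg"); names without an
            // extension (or starting with a dot) get the suffix at the end.
            let (stem, ext) = match filename.rfind('.') {
                Some(i) if i > 0 => filename.split_at(i),
                _ => (filename, ""),
            };
            MatchOutcome::Download(format!("{stem}-{new_size}{ext}"))
        }
    }
}
```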
macOS screenshots include time-of-day in the filename, with the AM/PM separator varying by locale: a regular space (1.40.01 PM), a narrow no-break space U+202F (1.40.01 PM), or no space (1.40.01PM). kei normalizes these variants when checking for existing files, so a screenshot downloaded with one locale setting won't be re-downloaded after switching locales.
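One way to compare such names is to reduce all three variants to a single canonical key. The sketch below is a simplification under stated assumptions: normalize_meridiem is a hypothetical helper, it picks the no-space form as canonical, and it rewrites every " AM" or " PM" in the name rather than only the one in the timestamp. The screenshot filenames in the usage note are illustrative.

```rust
/// Returns a comparison key in which a regular space or U+202F directly
/// before "AM" or "PM" has been removed. Hypothetical helper.
fn normalize_meridiem(name: &str) -> String {
    let mut out = name.to_string();
    for marker in ["AM", "PM"] {
        for sep in [' ', '\u{202F}'] {
            out = out.replace(&format!("{sep}{marker}"), marker);
        }
    }
    out
}
```

With this key, "Screenshot 2024-01-05 at 1.40.01 PM.png" and "Screenshot 2024-01-05 at 1.40.01PM.png" compare equal, so an existing file under either spelling counts as already downloaded.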
Invalid filesystem characters in filenames are replaced with _. A file named photo:/name becomes photo__name.
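A minimal sketch of this replacement follows. The exact character set kei treats as invalid is not listed above, so the set used here (path separators, the characters Windows forbids, and control characters) is an assumption, and sanitize_filename is a hypothetical name.

```rust
/// Replaces characters that are unsafe in filenames with '_'.
/// The character set is an assumption made for illustration.
fn sanitize_filename(name: &str) -> String {
    name.chars()
        .map(|c| match c {
            '/' | '\\' | ':' | '*' | '?' | '"' | '<' | '>' | '|' => '_',
            c if c.is_control() => '_',
            c => c,
        })
        .collect()
}
```

Each invalid character maps to exactly one underscore, which is why photo:/name becomes photo__name with two underscores.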
Filenames exceeding the 255-byte filesystem limit are truncated while preserving the extension.
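Truncating by bytes needs some care with multi-byte UTF-8 names, since cutting in the middle of a character would produce an invalid string. The following sketch shows one way to do it; truncate_filename is a hypothetical helper, not kei's function.

```rust
/// Shortens `name` to at most `max_bytes` bytes, keeping the extension and
/// never splitting a UTF-8 character. Hypothetical helper.
fn truncate_filename(name: &str, max_bytes: usize) -> String {
    if name.len() <= max_bytes {
        return name.to_string();
    }
    // Keep the extension only if it leaves room for at least part of the stem.
    let (stem, ext) = match name.rfind('.') {
        Some(i) if i > 0 && name.len() - i < max_bytes => name.split_at(i),
        _ => (name, ""),
    };
    let mut end = (max_bytes - ext.len()).min(stem.len());
    // Back up to the nearest character boundary (index 0 always qualifies).
    while !stem.is_char_boundary(end) {
        end -= 1;
    }
    format!("{}{}", &stem[..end], ext)
}
```

The result can be slightly shorter than the limit when the cut point falls inside a multi-byte character.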
Album names and folder template values are sanitized to prevent directory traversal. The sanitizer strips ../ sequences, Windows reserved names (CON, NUL, etc.), and leading dots that could create hidden directories.
WEBP images (org.webmproject.webp) are correctly classified as images. Previously they defaulted to the movie type, which caused video-skip media filters to incorrectly exclude WEBP photos.
Downloads write to a temporary file (.kei-tmp by default), then atomically rename to the final path. This prevents partial files from appearing in the download directory.
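The write-then-rename pattern can be sketched with the standard library alone. write_atomically is a hypothetical helper, and appending the suffix to the full final filename is an assumption about how the temp path is built.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Writes `bytes` to "<final_path><temp_suffix>", then renames it into place.
/// Hypothetical helper for illustration.
fn write_atomically(final_path: &Path, bytes: &[u8], temp_suffix: &str) -> io::Result<()> {
    // Keep the temp file next to the destination so the rename stays on one
    // filesystem.
    let mut temp_name = final_path.as_os_str().to_owned();
    temp_name.push(temp_suffix);
    let temp_path = PathBuf::from(temp_name);

    let mut file = File::create(&temp_path)?;
    file.write_all(bytes)?;
    file.sync_all()?; // flush to disk before the final name becomes visible
    drop(file);

    fs::rename(&temp_path, final_path)
}
```

If the process dies before the rename, only the temp file exists, so the final path never holds partial data.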
After each cycle, kei logs a human-readable summary:
50 downloaded, 350 skipped, 0 failed (400 total)
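A summary line in this shape could be produced by a small Display implementation. CycleSummary and its fields are hypothetical, and the sketch assumes the total is the sum of the three counters, which matches the example above (50 + 350 + 0 = 400).

```rust
use std::fmt;

/// Per-cycle counters (hypothetical struct for illustration).
struct CycleSummary {
    downloaded: u64,
    skipped: u64,
    failed: u64,
}

impl fmt::Display for CycleSummary {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Assumption: the total is simply the sum of the three counters.
        let total = self.downloaded + self.skipped + self.failed;
        write!(
            f,
            "{} downloaded, {} skipped, {} failed ({} total)",
            self.downloaded, self.skipped, self.failed, total
        )
    }
}
```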
Bytes downloaded and bytes written to disk are tracked separately through the pipeline. The bytes-written figure can be higher than the bytes-downloaded figure when EXIF datetime tags are written to files.
For machine-readable output with per-reason skip breakdowns and byte totals, set [report].json. The same stats are available to notification scripts as environment variables.
Full-enumeration runs include full_enumeration_reason in the JSON report and logs. Prometheus exposes the same bounded vocabulary through kei_sync_full_enumeration_reason_total{reason="..."}.
A progress bar tracks downloads in real time, showing the number of assets processed out of the total. It auto-hides when stdout is not a TTY (e.g., cron jobs or piped output) or when --no-progress-bar is set. For a persistent setting, use [ui].progress_bar = false.
Already-downloaded files advance the progress counter on resume, so the bar reflects true progress rather than starting from zero.
The total is based on photo count. Each photo can produce multiple files (live photo MOVs, RAW alternates), so the counter may slightly overshoot the total.