Skip to content

AbstractEmbedFinder::createDocumentFromFeed — duplicate detection bugs with file hash and embedId mismatch #428

@eliot488995568

Description

@eliot488995568

Two bugs in AbstractEmbedFinder::createDocumentFromFeed() cause incorrect embed document handling

Summary

Two related bugs in AbstractEmbedFinder::createDocumentFromFeed() cause incorrect behavior when creating embed documents:

  1. Different embeds with the same cover image overwrite each other — when areDuplicatesAllowed() returns false, the AbstractDocumentFactory deduplicates by file hash of the downloaded thumbnail. If two different videos share the same cover (e.g. a default placeholder), the factory returns the existing document, and lines 249-250 then overwrite its embedId and embedPlatform, silently destroying the link to the first video.

  2. Same embed imported twice creates duplicatesvalidateEmbedId() in AbstractYoutubeEmbedFinder, AbstractDailymotionEmbedFinder, and AbstractTedEmbedFinder stores the full URL as embedId. But getFeed() (called later via downloadThumbnail()) rewrites $this->embedId to the real platform ID (e.g. YouTube extracts it from the oEmbed HTML response at line 81). The getExistingDocument() check at line 217 runs before this normalization, so it searches for the full URL while the database contains only the cleaned ID — the lookup always misses.


Bug 1 — File hash collision overwrites embedId

Flow

  1. Video A (embedId="AAA") is created with a cover that hashes to X
  2. Video B (embedId="BBB") has the same cover (hash X)
  3. getExistingDocument(_, "BBB", "youtube") → not found (different embedId) → continues
  4. documentFactory->getDocument(false, false) → finds Document A by hash X → returns it
  5. $document->setEmbedId("BBB") → overwrites Video A's embedId
  6. flush() → Video A's link is lost

Affected platforms

YouTube, Vimeo, Dailymotion, Ted, PodEduc, Unsplash — all platforms where areDuplicatesAllowed() returns false (the default).

Spotify, Deezer, Soundcloud, Mixcloud, and Podcast already override areDuplicatesAllowed() to return true, so they are not affected.

Suggested fix

areDuplicatesAllowed() should return true for all embed finders. The file hash deduplication is designed for regular file uploads, not for embed documents where two different media can legitimately share the same thumbnail. The getExistingDocument() check (by embedId + embedPlatform) already handles true embed deduplication.


Bug 2 — embedId not normalized before duplicate lookup

Flow (YouTube example)

  1. URL https://youtu.be/PFVXOb52E3g?si=XbGd0nJGxcAcjJgx is passed
  2. validateEmbedId() matches the $idPattern regex but returns $embedId (the full URL) instead of $matches['id']
  3. getExistingDocument() at line 217 searches for the full URL
  4. downloadThumbnail() → calls getFeed() → YouTube's getFeed() (line 81) rewrites $this->embedId to "PFVXOb52E3g"
  5. setEmbedId() at line 249 stores "PFVXOb52E3g" in the database
  6. On second import of the same URL, getExistingDocument() searches for the full URL but the DB contains "PFVXOb52E3g" → not found → duplicate created

Affected platforms

  • AbstractYoutubeEmbedFinder::validateEmbedId() — returns $embedId instead of $matches['id']
  • AbstractDailymotionEmbedFinder::validateEmbedId() — same issue
  • AbstractTedEmbedFinder::validateEmbedId() — same issue

Other finders are correct: AbstractVimeoEmbedFinder returns $matches['id'], AbstractPodEducEmbedFinder normalizes the URL.

Suggested fix

validateEmbedId() should return $matches['id'] (the extracted platform ID) instead of the raw $embedId string, so that $this->embedId is normalized from construction and getExistingDocument() can match it against what's stored in the database.

Metadata

Metadata

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions