Two bugs in AbstractEmbedFinder::createDocumentFromFeed() cause incorrect embed document handling
Summary
Two related bugs in AbstractEmbedFinder::createDocumentFromFeed() cause incorrect behavior when creating embed documents:
- Different embeds with the same cover image overwrite each other: when areDuplicatesAllowed() returns false, the AbstractDocumentFactory deduplicates by file hash of the downloaded thumbnail. If two different videos share the same cover (e.g. a default placeholder), the factory returns the existing document, and lines 249-250 then overwrite its embedId and embedPlatform, silently destroying the link to the first video.
- Same embed imported twice creates duplicates: validateEmbedId() in AbstractYoutubeEmbedFinder, AbstractDailymotionEmbedFinder, and AbstractTedEmbedFinder stores the full URL as embedId. But getFeed() (called later via downloadThumbnail()) rewrites $this->embedId to the real platform ID (e.g. YouTube extracts it from the oEmbed HTML response at line 81). The getExistingDocument() check at line 217 runs before this normalization, so it searches for the full URL while the database contains only the cleaned ID, and the lookup always misses.
Bug 1 — File hash collision overwrites embedId
Flow
- Video A (embedId="AAA") is created with a cover that hashes to X
- Video B (embedId="BBB") has the same cover (hash X)
- getExistingDocument(_, "BBB", "youtube") → not found (different embedId) → continues
- documentFactory->getDocument(false, false) → finds Document A by hash X → returns it
- $document->setEmbedId("BBB") → overwrites Video A's embedId
- flush() → Video A's link is lost
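
The flow above can be reproduced in isolation with a small in-memory model. This is only a sketch, assuming PHP 8: FakeDocument, FakeDocumentStore and importEmbed() are hypothetical stand-ins for the entity, the document factory and createDocumentFromFeed(), not the project's real classes.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for a stored document; not the project's entity class.
final class FakeDocument
{
    public function __construct(
        public string $fileHash,
        public ?string $embedId = null,
        public ?string $embedPlatform = null,
    ) {
    }
}

// Hypothetical in-memory store that mimics the two lookups described above.
final class FakeDocumentStore
{
    /** @var FakeDocument[] */
    public array $documents = [];

    // Mimics getExistingDocument(): lookup by embedId + embedPlatform.
    public function findByEmbed(string $embedId, string $platform): ?FakeDocument
    {
        foreach ($this->documents as $document) {
            if ($document->embedId === $embedId && $document->embedPlatform === $platform) {
                return $document;
            }
        }
        return null;
    }

    // Mimics the factory: when duplicates are not allowed, reuse by file hash.
    public function getOrCreateByHash(string $fileHash, bool $allowDuplicates): FakeDocument
    {
        if (!$allowDuplicates) {
            foreach ($this->documents as $document) {
                if ($document->fileHash === $fileHash) {
                    return $document;
                }
            }
        }
        $document = new FakeDocument($fileHash);
        $this->documents[] = $document;
        return $document;
    }
}

// Hypothetical stand-in for createDocumentFromFeed().
function importEmbed(
    FakeDocumentStore $store,
    string $embedId,
    string $platform,
    string $coverHash,
    bool $allowDuplicates
): FakeDocument {
    $existing = $store->findByEmbed($embedId, $platform);
    if ($existing !== null) {
        return $existing;
    }
    $document = $store->getOrCreateByHash($coverHash, $allowDuplicates);
    // Unconditional overwrite, as described for lines 249-250.
    $document->embedId = $embedId;
    $document->embedPlatform = $platform;
    return $document;
}

$store = new FakeDocumentStore();
$a = importEmbed($store, 'AAA', 'youtube', 'X', false);
$b = importEmbed($store, 'BBB', 'youtube', 'X', false);

var_dump($a === $b);   // bool(true): Video B reused Video A's document
var_dump($a->embedId); // string(3) "BBB": Video A's embedId is gone
```

Running the same two imports with $allowDuplicates set to true produces two separate documents in this model, each keeping its own embedId.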
Affected platforms
YouTube, Vimeo, Dailymotion, Ted, PodEduc, Unsplash — all platforms where areDuplicatesAllowed() returns false (the default).
Spotify, Deezer, Soundcloud, Mixcloud, and Podcast already override areDuplicatesAllowed() to return true, so they are not affected.
Suggested fix
areDuplicatesAllowed() should return true for all embed finders. The file hash deduplication is designed for regular file uploads, not for embed documents where two different media can legitimately share the same thumbnail. The getExistingDocument() check (by embedId + embedPlatform) already handles true embed deduplication.
Bug 2 — embedId not normalized before duplicate lookup
Flow (YouTube example)
- URL https://youtu.be/PFVXOb52E3g?si=XbGd0nJGxcAcjJgx is passed
- validateEmbedId() matches the $idPattern regex but returns $embedId (the full URL) instead of $matches['id']
- getExistingDocument() at line 217 searches for the full URL
- downloadThumbnail() → calls getFeed() → YouTube's getFeed() (line 81) rewrites $this->embedId to "PFVXOb52E3g"
- setEmbedId() at line 249 stores "PFVXOb52E3g" in the database
- On second import of the same URL, getExistingDocument() searches for the full URL but the DB contains "PFVXOb52E3g" → not found → duplicate created
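
The ordering problem can be sketched without any network call. The names below are hypothetical: SKETCH_ID_PATTERN is a simplified stand-in for the real $idPattern, the $stored array stands in for the embedId column, and a regex match replaces the oEmbed response that the real getFeed() is described as using.

```php
<?php
declare(strict_types=1);

// Hypothetical, simplified pattern; the project's real $idPattern may differ.
const SKETCH_ID_PATTERN = '#^https://youtu\.be/(?<id>[a-zA-Z0-9_-]+)#';

// Mimics the buggy validateEmbedId(): the pattern matches, but the raw input is returned.
function validateEmbedIdBuggy(string $embedId): string
{
    if (preg_match(SKETCH_ID_PATTERN, $embedId) !== 1) {
        throw new InvalidArgumentException('Unrecognised YouTube URL');
    }
    return $embedId;
}

// Returns true when a new document would be created, false when an existing one is reused.
function importUrl(string $url, array &$storedEmbedIds): bool
{
    $embedId = validateEmbedIdBuggy($url);

    // Duplicate lookup runs here, with the un-normalized value (the full URL).
    if (in_array($embedId, $storedEmbedIds, true)) {
        return false;
    }

    // Stand-in for getFeed(): the platform ID only becomes known after the lookup.
    preg_match(SKETCH_ID_PATTERN, $url, $matches);
    $storedEmbedIds[] = $matches['id'];
    return true;
}

$stored = [];
$url = 'https://youtu.be/PFVXOb52E3g?si=XbGd0nJGxcAcjJgx';

$first = importUrl($url, $stored);  // true: first import creates a document
$second = importUrl($url, $stored); // true again: the lookup missed, so a duplicate is created
```

After both calls, $stored holds "PFVXOb52E3g" twice, which corresponds to the duplicate rows described in the last step of the flow.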
Affected platforms
AbstractYoutubeEmbedFinder::validateEmbedId() — returns $embedId instead of $matches['id']
AbstractDailymotionEmbedFinder::validateEmbedId() — same issue
AbstractTedEmbedFinder::validateEmbedId() — same issue
Other finders are correct: AbstractVimeoEmbedFinder returns $matches['id'], AbstractPodEducEmbedFinder normalizes the URL.
Suggested fix
validateEmbedId() should return $matches['id'] (the extracted platform ID) instead of the raw $embedId string, so that $this->embedId is normalized from construction and getExistingDocument() can match it against what's stored in the database.
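
A minimal sketch of the corrected method, under stated assumptions: SketchYoutubeFinder and its ID_PATTERN are hypothetical and much simpler than the real finder, and the real validateEmbedId() signature and error handling may differ. The only point illustrated is returning the named capture group rather than the raw input.

```php
<?php
declare(strict_types=1);

// Hypothetical finder; it models only the validation step.
final class SketchYoutubeFinder
{
    // Hypothetical, simplified pattern; the project's real $idPattern may differ.
    private const ID_PATTERN =
        '#^https://(?:www\.youtube\.com/watch\?v=|youtu\.be/)(?<id>[a-zA-Z0-9_-]+)#';

    public function validateEmbedId(string $embedId): string
    {
        if (preg_match(self::ID_PATTERN, $embedId, $matches) === 1) {
            // Return the extracted platform ID, not the raw URL.
            return $matches['id'];
        }
        throw new InvalidArgumentException('Unrecognised YouTube URL');
    }
}

$finder = new SketchYoutubeFinder();
$short = $finder->validateEmbedId('https://youtu.be/PFVXOb52E3g?si=XbGd0nJGxcAcjJgx');
$long = $finder->validateEmbedId('https://www.youtube.com/watch?v=PFVXOb52E3g');

var_dump($short === $long); // bool(true): both URL forms normalize to the same ID
```

With the ID normalized at validation time, a lookup by embedId compares like with like on every import, whichever URL form was pasted.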