Fix AioStreams metadata disambiguation using watch history and year #47
Conversation
…metadata matching
Co-authored-by: willtho89 <4933503+willtho89@users.noreply.github.com>
Pull request overview
Improves AioStreams metadata selection for ambiguous TV titles by using a user’s watch history (episode continuity + IDs) and ensuring year-in-filename influences search and candidate choice.
Changes:
- Made `_select_candidate_for_entry()` async and added `db`/`user_id` so selection can consult watch history (signature sketched below).
- Added `_check_series_continuity()` to prefer candidates that match previously watched episodes (ID-based).
- Updated tests to call the new async selector and added a continuity test case.
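For orientation, a rough sketch of the new selector signature, inferred from how the updated tests call it (parameter types are not shown in the diff, so none are assumed here):

```python
async def _select_candidate_for_entry(db, user_id, entry, candidates):
    """Now async and db/user_id-aware so selection can consult the user's watch history."""
    ...  # candidate scoring, including the new _check_series_continuity() check
```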
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| backend/src/librarysync/jobs/aiostreams_import.py | Adds async candidate selection and DB-backed continuity logic to disambiguate same-title TV shows. |
| backend/tests/test_aiostreams_import.py | Updates tests for async signature and adds coverage for continuity-based selection. |
```python
# First, find all TV shows the user has watched
# Use a subquery to get the max season and episode per show
# Note: This may return max_season from one episode and max_episode from another,
# but it's acceptable for our continuity check as we're looking for a general pattern
result = await db.execute(
    select(
        MediaItem,
        func.max(EpisodeItem.season_number).label("max_season"),
        func.max(EpisodeItem.episode_number).label("max_episode"),
    )
    .join(EpisodeItem, EpisodeItem.show_media_item_id == MediaItem.id)
    .join(WatchedItem, and_(
        WatchedItem.episode_item_id == EpisodeItem.id,
        WatchedItem.user_id == user_id
    ))
    .where(MediaItem.media_type == "tv")
    .group_by(MediaItem.id)
```
Computing max_season and max_episode independently can produce an impossible (season, episode) pair (e.g., max season from one row and max episode from another). This can incorrectly reject true continuations (e.g., user watched S03E01 but max_episode comes from S01E22, causing the nearby/continuation checks to fail). Consider selecting the latest watched episode per show using a single ordering over (season_number DESC, episode_number DESC) (e.g., a window function / DISTINCT ON in Postgres), or a two-step aggregate (max season, then max episode constrained to that season).
Suggested change:

```python
# First, find the latest watched episode per show for this user, in a way that
# guarantees (max_season, max_episode) comes from a real episode row.
#
# Step 1: find the max season per show that the user has watched.
max_season_per_show_subq = (
    select(
        EpisodeItem.show_media_item_id.label("show_id"),
        func.max(EpisodeItem.season_number).label("max_season"),
    )
    .join(
        WatchedItem,
        and_(
            WatchedItem.episode_item_id == EpisodeItem.id,
            WatchedItem.user_id == user_id,
        ),
    )
    .group_by(EpisodeItem.show_media_item_id)
    .subquery()
)
# Step 2: within that max season, find the max episode the user has watched per show.
max_episode_in_max_season_subq = (
    select(
        EpisodeItem.show_media_item_id.label("show_id"),
        func.max(EpisodeItem.episode_number).label("max_episode"),
    )
    .join(
        max_season_per_show_subq,
        and_(
            max_season_per_show_subq.c.show_id == EpisodeItem.show_media_item_id,
            max_season_per_show_subq.c.max_season == EpisodeItem.season_number,
        ),
    )
    # Restrict to this user's watched episodes, matching Step 1.
    .join(
        WatchedItem,
        and_(
            WatchedItem.episode_item_id == EpisodeItem.id,
            WatchedItem.user_id == user_id,
        ),
    )
    .group_by(EpisodeItem.show_media_item_id)
    .subquery()
)
# Step 3: join the aggregates back to MediaItem to get per-show latest season/episode.
result = await db.execute(
    select(
        MediaItem,
        max_season_per_show_subq.c.max_season.label("max_season"),
        max_episode_in_max_season_subq.c.max_episode.label("max_episode"),
    )
    .join(
        max_season_per_show_subq,
        max_season_per_show_subq.c.show_id == MediaItem.id,
    )
    .join(
        max_episode_in_max_season_subq,
        max_episode_in_max_season_subq.c.show_id == MediaItem.id,
    )
    .where(MediaItem.media_type == "tv")
```
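For completeness, the window-function variant mentioned in the comment could look roughly like this. It is a sketch only, written as if it sat inside the same async helper (so `db`, `user_id`, and the `MediaItem`/`EpisodeItem`/`WatchedItem` models are assumed to be in scope), not the PR's actual code:

```python
from sqlalchemy import and_, func, select

# Number each watched episode within its show, newest (season, episode) first,
# so rank 1 is always a real (season, episode) pair taken from a single row.
latest_watched = (
    select(
        EpisodeItem.show_media_item_id.label("show_id"),
        EpisodeItem.season_number.label("max_season"),
        EpisodeItem.episode_number.label("max_episode"),
        func.row_number()
        .over(
            partition_by=EpisodeItem.show_media_item_id,
            order_by=[
                EpisodeItem.season_number.desc(),
                EpisodeItem.episode_number.desc(),
            ],
        )
        .label("rn"),
    )
    .join(
        WatchedItem,
        and_(
            WatchedItem.episode_item_id == EpisodeItem.id,
            WatchedItem.user_id == user_id,
        ),
    )
    .subquery()
)

result = await db.execute(
    select(MediaItem, latest_watched.c.max_season, latest_watched.c.max_episode)
    .join(latest_watched, latest_watched.c.show_id == MediaItem.id)
    .where(latest_watched.c.rn == 1, MediaItem.media_type == "tv")
)
```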
```python
result = await db.execute(
    select(
        MediaItem,
        func.max(EpisodeItem.season_number).label("max_season"),
        func.max(EpisodeItem.episode_number).label("max_episode"),
    )
    .join(EpisodeItem, EpisodeItem.show_media_item_id == MediaItem.id)
    .join(WatchedItem, and_(
        WatchedItem.episode_item_id == EpisodeItem.id,
        WatchedItem.user_id == user_id
    ))
    .where(MediaItem.media_type == "tv")
    .group_by(MediaItem.id)
)
```
Despite computing title_key, the query fetches all TV shows a user has watched and only later filters by title in Python. For users with large histories this can be unnecessarily expensive. Consider pushing at least a coarse title filter into SQL (e.g., ilike / lower(title) equality on entry.title) so you only aggregate candidate-relevant shows, then keep the stricter _normalize_title_key check in Python if needed.
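For example, a coarse filter could be appended to the statement quoted above. This is a sketch only, assuming `entry.title` holds the parsed title; the stricter `_normalize_title_key` comparison would still run in Python on the smaller result set:

```python
result = await db.execute(
    select(
        MediaItem,
        func.max(EpisodeItem.season_number).label("max_season"),
        func.max(EpisodeItem.episode_number).label("max_episode"),
    )
    .join(EpisodeItem, EpisodeItem.show_media_item_id == MediaItem.id)
    .join(WatchedItem, and_(
        WatchedItem.episode_item_id == EpisodeItem.id,
        WatchedItem.user_id == user_id
    ))
    .where(MediaItem.media_type == "tv")
    # Coarse SQL-side title filter so only candidate-relevant shows are aggregated.
    .where(func.lower(MediaItem.title) == entry.title.strip().lower())
    .group_by(MediaItem.id)
)
```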
```python
selected = asyncio.run(
    aiostreams_import._select_candidate_for_entry(db, "test_user", entry, candidates)
)
```
Using asyncio.run() inside unittest.TestCase can break if tests are executed under an environment that already has a running event loop. Prefer unittest.IsolatedAsyncioTestCase with async def test_... methods (and await ...) for these async tests.
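A sketch of that shape; the import path is assumed from the repository layout, and `build_entry_and_candidates()` is a hypothetical stand-in for the fixture setup the current test already performs:

```python
import unittest
from unittest.mock import AsyncMock

from librarysync.jobs import aiostreams_import  # assumed import path


class SelectCandidateContinuityTest(unittest.IsolatedAsyncioTestCase):
    async def test_select_candidate_awaited_directly(self):
        db = AsyncMock()  # or the async session fixture used elsewhere in the suite
        entry, candidates = build_entry_and_candidates()  # hypothetical helper, not a real API
        selected = await aiostreams_import._select_candidate_for_entry(
            db, "test_user", entry, candidates
        )
        self.assertIsNotNone(selected)
```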
```python
# 1. Are TV shows
# 2. Have a matching normalized title
# 3. Have episode watches by this user
from sqlalchemy import and_, func
```
Minor cleanup: importing and_/func inside the function makes dependencies less discoverable; prefer module-level imports for consistency. Also, the if entry.season_number is not None: check is redundant because the function already returns early when season_number is None.
```python
if entry.season_number is not None:
    is_continuation = (
```
Minor cleanup: importing and_/func inside the function makes dependencies less discoverable; prefer module-level imports for consistency. Also, the if entry.season_number is not None: check is redundant because the function already returns early when season_number is None.
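Concretely, the cleanup could look like the sketch below; the continuity condition itself is not shown in the diff, so it is elided here:

```python
# Module level, alongside the other SQLAlchemy imports:
from sqlalchemy import and_, func

# Inside _check_series_continuity(): the function already returns early when
# entry.season_number is None, so the wrapping `if` can be dropped and the
# assignment can sit directly in the function body.
is_continuation = (
    ...  # same condition as before, unchanged
)
```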
AioStreams provides only filenames without media IDs, causing ambiguous metadata matches when multiple shows share titles (e.g., "Fallout" TV show vs "Fallout" anime).
Changes

- Series continuity check - query the user's watch history to disambiguate TV show candidates.
- Year extraction - verify that the existing PTT year parsing flows correctly into search: `["Fallout 2024", "Fallout"]` (see the sketch at the end of this description).

Impact

Implementation

- Made `_select_candidate_for_entry()` async, added `db`/`user_id` params
- Added `_check_series_continuity()` with a SQLAlchemy query over watch history
- Uses `max_season`/`max_episode` in the continuity logic
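For illustration, the year-in-filename behaviour amounts to something like the helper below; `build_search_queries` is hypothetical, and only the resulting `["Fallout 2024", "Fallout"]` ordering is taken from the description:

```python
def build_search_queries(title: str, year: int | None) -> list[str]:
    """Hypothetical sketch: try the year-qualified title first, then fall back to the bare title."""
    queries = [f"{title} {year}"] if year is not None else []
    queries.append(title)
    return queries


assert build_search_queries("Fallout", 2024) == ["Fallout 2024", "Fallout"]
assert build_search_queries("Fallout", None) == ["Fallout"]
```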