Reclaim orphaned article content; stop prune_old_articles leaking it#633

Merged
mircealungu merged 8 commits into master from cleanup-orphaned-article-content
May 27, 2026
@mircealungu
Member

Problem

A production snapshot revealed that ~95–98% of every article-content table was orphaned — rows belonging to articles that had already been deleted:

table                        total rows   live rows   orphaned
article_fragment             30,178,086     519,067        98%
new_text                     22,724,665     960,883        96%
source_text                   2,403,486      45,425        98%
source                        2,403,478      45,426        98%
article_tokenization_cache      265,467       7,646        97%

This bloats both production storage (~25 GB) and the anonymized backup dump.

Root causes (both fixed here)

  1. prune_old_articles.py deletes with FOREIGN_KEY_CHECKS = 0 (necessary to get past article's NO ACTION children like user_article/user_reading_session). But FK-checks-off also suppresses the ON DELETE CASCADE children — so fragments, tokenization cache, CEFR assessment, etc. were never cleaned. delete_in_batches now deletes the cascade-owned children explicitly (incl. article_fragment_context, which is nested under article_fragment).

  2. new_text / source / source_text have no cascade path at all and are content-deduplicated (one row shared across articles and user data), so no per-article delete can safely remove them. New tool cleanup_orphaned_content.py reclaims them — and clears the historical backlog — deleting only rows not referenced by any surviving article fragment, bookmark_context, caption, bookmark, user_activity_data, or video. Dry-run by default; --execute to apply, --optimize to return disk to the OS.
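The first root cause can be reproduced in miniature. The sketch below uses SQLite's `PRAGMA foreign_keys` as a stand-in for MySQL's `FOREIGN_KEY_CHECKS` (an assumption made so the example runs without a server), and the two-table schema is a simplified, hypothetical version of `article` / `article_fragment`. It shows that with FK enforcement off, the `ON DELETE CASCADE` action never fires and the child rows stay behind.

```python
import sqlite3


def fragments_left_after_delete(fk_checks_on: bool) -> int:
    """Delete one article and return how many of its fragments survive."""
    conn = sqlite3.connect(":memory:")
    # SQLite analog of MySQL's SET FOREIGN_KEY_CHECKS = 1 / 0 (assumption).
    conn.execute(f"PRAGMA foreign_keys = {'ON' if fk_checks_on else 'OFF'}")
    conn.executescript(
        """
        CREATE TABLE article (id INTEGER PRIMARY KEY);
        CREATE TABLE article_fragment (
            id INTEGER PRIMARY KEY,
            article_id INTEGER REFERENCES article(id) ON DELETE CASCADE
        );
        INSERT INTO article VALUES (1);
        INSERT INTO article_fragment VALUES (10, 1), (11, 1);
        """
    )
    conn.execute("DELETE FROM article WHERE id = 1")
    (count,) = conn.execute("SELECT COUNT(*) FROM article_fragment").fetchone()
    conn.close()
    return count


if __name__ == "__main__":
    print("FK checks on :", fragments_left_after_delete(True))   # cascade fires
    print("FK checks off:", fragments_left_after_delete(False))  # fragments leak
```

With enforcement on, both fragments are removed together with the article; with it off, both remain as orphans, which is the leak described in item 1.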

Validation (on a real production snapshot)

  • cleanup_orphaned_content.py --execute deleted 56,396,735 rows with no FK errors; every table landed exactly on its computed keep-count.
  • Anonymized dump shrank from 6.3 GB → 634 MB zipped (~10×).
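The deleted-row count can be cross-checked against the table in the Problem section: summing `total - live` over the five tables should give the number of rows the tool removed. The snippet below does only that arithmetic, using the figures quoted in this description; the 1 GB = 1000 MB conversion for the dump ratio is an assumption.

```python
# (total, live) row counts copied from the table in the Problem section.
counts = {
    "article_fragment": (30_178_086, 519_067),
    "new_text": (22_724_665, 960_883),
    "source_text": (2_403_486, 45_425),
    "source": (2_403_478, 45_426),
    "article_tokenization_cache": (265_467, 7_646),
}

# Orphaned rows per table, and the percentage shown in the table.
orphaned = {name: total - live for name, (total, live) in counts.items()}
percent = {name: round(100 * orphaned[name] / counts[name][0]) for name in counts}
total_orphaned = sum(orphaned.values())

# Dump shrink factor, assuming 6.3 GB = 6300 MB.
shrink_factor = round(6300 / 634, 1)

if __name__ == "__main__":
    print(f"{total_orphaned:,}")  # 56,396,735, the count reported above
    print(percent)
    print(shrink_factor)
```

The sum matches the reported 56,396,735 deleted rows exactly, which is consistent with every table landing on its computed keep-count.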

Follow-up

Recommend running cleanup_orphaned_content.py in cron right after prune_old_articles.py --apply to reclaim shared content on an ongoing basis.

🤖 Generated with Claude Code

Article deletion was leaving ~95% of the content tables orphaned. Two causes,
both fixed here:

1. prune_old_articles.py deletes with FOREIGN_KEY_CHECKS=0 (needed to get past
   article's NO ACTION children). With FK checks off, the ON DELETE CASCADE
   children are never cleaned up. delete_in_batches now deletes article's
   cascade-owned children explicitly (article_fragment + its
   article_fragment_context, article_tokenization_cache, article_cefr_assessment,
   and the rest of the CASCADE set) so pruning no longer leaks them.

2. The shared, deduplicated content tables (new_text / source / source_text)
   have no cascade path at all — a row can be shared across articles and user
   data. New tool cleanup_orphaned_content.py reclaims them (and clears the
   historical backlog) by deleting only rows not referenced by any surviving
   article fragment, bookmark_context, caption, bookmark, user_activity_data,
   or video. Dry-run by default; --execute to apply, --optimize to reclaim disk.
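The sweep described in item 2 can be sketched as a single guarded delete. This is not the code of cleanup_orphaned_content.py: the referencing columns (`text_id`), the reduced set of two referencing tables, the function names, and the use of SQLite are all assumptions made to keep the example small and runnable. It only illustrates the shape of the check: a shared row is removed only when no surviving table points at it, and nothing is deleted unless asked.

```python
import sqlite3

# Hypothetical (table, column) pairs pointing at new_text; the real schema may differ.
NEW_TEXT_REFERENCES = [("article_fragment", "text_id"), ("bookmark", "text_id")]


def sweep_orphaned_new_text(conn, execute=False):
    """Dry run: count orphaned new_text rows. With execute=True: delete them."""
    guards = " AND ".join(
        f"NOT EXISTS (SELECT 1 FROM {table} r WHERE r.{column} = new_text.id)"
        for table, column in NEW_TEXT_REFERENCES
    )
    if not execute:
        (count,) = conn.execute(
            f"SELECT COUNT(*) FROM new_text WHERE {guards}"
        ).fetchone()
        return count
    cursor = conn.execute(f"DELETE FROM new_text WHERE {guards}")
    conn.commit()
    return cursor.rowcount


def build_sweep_demo_db():
    """Three shared texts: one used by a fragment, one by a bookmark, one by nobody."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE new_text (id INTEGER PRIMARY KEY);
        CREATE TABLE article_fragment (text_id INTEGER);
        CREATE TABLE bookmark (text_id INTEGER);
        INSERT INTO new_text VALUES (1), (2), (3);
        INSERT INTO article_fragment VALUES (1);
        INSERT INTO bookmark VALUES (2);
        """
    )
    return conn


if __name__ == "__main__":
    db = build_sweep_demo_db()
    print("would delete:", sweep_orphaned_new_text(db))
    print("deleted:", sweep_orphaned_new_text(db, execute=True))
```

Defaulting `execute` to `False` mirrors the dry-run-by-default behaviour: the first call only reports what would go, and the second actually removes the one unreferenced row.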

Measured on a production snapshot: ~32M deletable rows, shrinking the dump from
~40 GB to ~4 GB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

ArchLens - No architecturally relevant changes to the existing views

mircealungu and others added 7 commits May 25, 2026 21:50
…h data

When pruning an article, only delete its regenerable/computed children
(cefr_assessment, classification, tokenization_cache, topic_map,
url_keyword_map, difficulty_lingo_rank, broken_code_map, grammar_correction_log)
plus fragments. Deliberately stop deleting the user/research/teacher cascade
children — user_activity_data (learning analytics), personal_copy,
cohort_article_map, article_topic_user_feedback, user_article_broken_report —
which are worth preserving even after an article is pruned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Strengthen referenced_article_ids() so an article is never pruned while it is
pointed at by personal_copy, cohort_article_map, user_activity_data,
article_topic_user_feedback, or user_article_broken_report. These tables hold
only pointers (no content of their own), so the article must survive for them
to mean anything; protecting the article also guarantees none of the
user/research cascade children we intentionally don't delete can become a
dangling orphan. This exactly complements CASCADE_CHILDREN (the derived/
computed children we do delete).
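A rough sketch of that protection step, assuming every pointer table exposes an `article_id` column (an assumption; the real columns and the real referenced_article_ids() may differ, hence the `_sketch` suffix). It collects every article id that a pointer table still references and filters those out of the prune candidates.

```python
import sqlite3

# Tables named in the commit message; the shared article_id column is an assumption.
POINTER_TABLES = [
    "personal_copy",
    "cohort_article_map",
    "user_activity_data",
    "article_topic_user_feedback",
    "user_article_broken_report",
]


def referenced_article_ids_sketch(conn):
    """Return the set of article ids that any pointer table still references."""
    protected = set()
    for table in POINTER_TABLES:
        rows = conn.execute(
            f"SELECT DISTINCT article_id FROM {table} WHERE article_id IS NOT NULL"
        )
        protected.update(row[0] for row in rows)
    return protected


def prunable(conn, candidate_ids):
    """Drop every candidate that a pointer table references."""
    protected = referenced_article_ids_sketch(conn)
    return [article_id for article_id in candidate_ids if article_id not in protected]


def build_pointer_demo_db():
    """Demo data: article 2 has a personal copy, article 3 has activity data."""
    conn = sqlite3.connect(":memory:")
    for table in POINTER_TABLES:
        conn.execute(f"CREATE TABLE {table} (article_id INTEGER)")
    conn.execute("INSERT INTO personal_copy VALUES (2)")
    conn.execute("INSERT INTO user_activity_data VALUES (3), (NULL)")
    conn.commit()
    return conn


if __name__ == "__main__":
    db = build_pointer_demo_db()
    print(prunable(db, [1, 2, 3, 4]))
```

In the demo, articles 2 and 3 are held back because a pointer row references them, while 1 and 4 remain eligible for pruning.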

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the fragile FOREIGN_KEY_CHECKS=0 + manual cascade approach (which
crashed mid-run when the session setting was lost across pooled connections,
and silently cascade-deleted data we meant to keep) with FK-checks-ON deletion:

  - DB CASCADE removes the derived/regenerable children automatically; the
    manual CASCADE_CHILDREN list and delete_article_owned_children are gone.
  - Migration 26-05-26 switches the data we keep (personal_copy,
    user_activity_data, cohort_article_map, article_topic_user_feedback,
    user_article_broken_report) from ON DELETE CASCADE to RESTRICT, so deleting
    a referenced article is blocked instead of silently destroying that data.
  - parent_article_id stays CASCADE; prune additionally protects an original
    whose AI-simplification is referenced, so families stay intact and no
    simplification is orphaned. Unreferenced simplifications cascade out with
    their pruned original.
  - referenced_article_ids() pre-filters the blocking tables (so we don't
    attempt doomed deletes); if it ever drifts, delete_in_batches ABORTS LOUDLY
    naming the blocking FK rather than skipping — so the gap is noticed.
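The FK-checks-on behaviour in the list above can be sketched as follows. SQLite stands in for MySQL, the schema is reduced to one CASCADE child and one RESTRICT child, and `PruneAborted` and `delete_in_batches_sketch` are hypothetical names, not the project's code. One difference worth noting: MySQL's error message names the blocking constraint, whereas SQLite only reports that a foreign key constraint failed.

```python
import sqlite3


class PruneAborted(RuntimeError):
    """Hypothetical error raised when a foreign key blocks a delete."""


def build_restrict_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # FK checks stay on throughout
    conn.executescript(
        """
        CREATE TABLE article (id INTEGER PRIMARY KEY);
        CREATE TABLE article_tokenization_cache (
            article_id INTEGER REFERENCES article(id) ON DELETE CASCADE);
        CREATE TABLE personal_copy (
            article_id INTEGER REFERENCES article(id) ON DELETE RESTRICT);
        INSERT INTO article VALUES (1), (2);
        INSERT INTO article_tokenization_cache VALUES (1), (2);
        INSERT INTO personal_copy VALUES (2);
        """
    )
    return conn


def delete_in_batches_sketch(conn, article_ids, batch_size=1000):
    """Delete articles batch by batch; abort on the first FK violation."""
    deleted = 0
    for start in range(0, len(article_ids), batch_size):
        batch = article_ids[start:start + batch_size]
        marks = ",".join("?" * len(batch))
        try:
            cursor = conn.execute(
                f"DELETE FROM article WHERE id IN ({marks})", batch
            )
            conn.commit()
        except sqlite3.IntegrityError as exc:
            conn.rollback()
            # Abort rather than skip, so a gap in the pre-filter gets noticed.
            raise PruneAborted(f"delete blocked for batch {batch}: {exc}") from exc
        deleted += cursor.rowcount
    return deleted


if __name__ == "__main__":
    db = build_restrict_demo_db()
    print(delete_in_batches_sketch(db, [1]))  # cache row cascades away
    try:
        delete_in_batches_sketch(db, [2])     # personal_copy blocks this one
    except PruneAborted as err:
        print("aborted:", err)
```

Raising instead of continuing is the design choice the commit describes: a blocked delete means the pre-filter missed a reference, and silently skipping would hide that drift.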

Validated on a full un-anonymized prod mirror (2.47M articles): prune --apply
ran to completion with zero new dangling references across every article-child
table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pruning an article orphans its de-duplicated new_text/source/source_text (no FK
from article, so the cascade can't reach them). Rather than relying on a
separate sweep, prune now reclaims them per batch: it captures the text_id /
source_id / source_text_id the batch points at, deletes the articles, then
deletes those rows that are now referenced by nobody (same NOT EXISTS guards as
cleanup_orphaned_content.py, but scoped by id -> no full-table scan).
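The capture, delete, reclaim sequence might look roughly like this. It is a sketch under assumptions: SQLite instead of MySQL, only `new_text` is handled (not `source` / `source_text`), the `text_id` columns and the two guard tables are hypothetical, and `prune_batch_and_reclaim` is an illustrative name rather than the function in the PR.

```python
import sqlite3


def build_reclaim_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE new_text (id INTEGER PRIMARY KEY);
        CREATE TABLE article (id INTEGER PRIMARY KEY);
        CREATE TABLE article_fragment (
            article_id INTEGER REFERENCES article(id) ON DELETE CASCADE,
            text_id INTEGER REFERENCES new_text(id));
        CREATE TABLE bookmark (text_id INTEGER REFERENCES new_text(id));
        INSERT INTO new_text VALUES (1), (2), (3);
        INSERT INTO article VALUES (1), (2);
        INSERT INTO article_fragment VALUES (1, 1), (1, 2), (1, 3), (2, 2);
        INSERT INTO bookmark VALUES (3);
        """
    )
    return conn


def prune_batch_and_reclaim(conn, article_ids):
    """Delete a batch of articles, then reclaim the shared text only they used."""
    marks = ",".join("?" * len(article_ids))
    # 1. Capture the shared ids the batch points at, before the rows vanish.
    text_ids = [row[0] for row in conn.execute(
        f"SELECT DISTINCT text_id FROM article_fragment WHERE article_id IN ({marks})",
        article_ids,
    )]
    # 2. Delete the articles; ON DELETE CASCADE removes their fragments.
    conn.execute(f"DELETE FROM article WHERE id IN ({marks})", article_ids)
    # 3. Delete captured rows that nobody references any more, scoped by id.
    reclaimed = 0
    if text_ids:
        text_marks = ",".join("?" * len(text_ids))
        cursor = conn.execute(
            f"DELETE FROM new_text WHERE id IN ({text_marks}) "
            "AND NOT EXISTS (SELECT 1 FROM article_fragment f WHERE f.text_id = new_text.id) "
            "AND NOT EXISTS (SELECT 1 FROM bookmark b WHERE b.text_id = new_text.id)",
            text_ids,
        )
        reclaimed = cursor.rowcount
    conn.commit()
    return reclaimed


if __name__ == "__main__":
    db = build_reclaim_demo_db()
    print(prune_batch_and_reclaim(db, [1]))
```

In the demo, pruning article 1 captures texts 1, 2 and 3 but reclaims only text 1: text 2 is still used by article 2 and text 3 by a bookmark. Restricting the delete to the captured ids is what avoids scanning the whole shared table.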

Net: one command does the whole job. tools/cleanup_orphaned_content.py stays for
the one-time historical backlog and the anonymization pipeline.

Smoke-tested: pruning one article reclaimed 16 new_text + 1 source + 1
source_text. Article-deletion path validated at 595K-article scale earlier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
anonymize_users.py deleted its unreferenced articles with the same leaky
pattern prune used to use (FOREIGN_KEY_CHECKS=0, no content reclaim), which is
what created the ~15M-row orphan backlog that bloated every anon backup.

Extract the validated FK-checks-ON deletion into zeeguu/core/article_pruning.py
(referenced_article_ids, reclaim_shared_content, delete_articles_in_batches) and
use it from BOTH prune_old_articles.py and anonymize_users.py. Now:
  - neither path disables FK checks or leaves orphans;
  - both protect exactly the same referenced set (incl. bookmarks, simplification
    parents) and abort loudly on a pre-filter gap;
  - the anon DB comes out clean, so the backup is small without a separate sweep.

Smoke-tested via prune --apply (shared path): deleted + reclaimed content, zero
orphans across all article-child tables.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mircealungu mircealungu merged commit f4a2e36 into master May 27, 2026
3 checks passed