Skip binary indexing during upgrade #446

tusmester opened this Issue Aug 16, 2018 · 1 comment


tusmester commented Aug 16, 2018

Reindexing the whole database takes too much time. A possible workaround: skip the binary text extract (at least for documents) and index only the metadata, then provide a tool or API for reindexing files gradually later, while the site is already running.

  • verify that the assumption is true
    • 3300 office files, hard reindex (gap growth is exponential)
      • original code: 1:42 min
      • skip binaries: 1:14 min
  • make the necessary changes in the populator, in the step and the upgrade package (parameters, etc.)
    • add a parameter to the populator/step for skipping binaries
    • open the API for injecting binary text extracts, without reindexing the whole content (currently this is hidden)
  • documentation (snadmin tools, upgrade blog post)
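The skip-binaries switch on the populator could work along these lines. This is a minimal illustrative sketch, not the actual sensenet populator API; the names `build_index_document`, `extract_text` and the `needs_text_extract` flag are assumptions.

```python
def extract_text(binary: bytes) -> str:
    # Stand-in for the expensive office-document parsing step.
    return binary.decode("utf-8", errors="ignore")

def build_index_document(content: dict, skip_binaries: bool = False) -> dict:
    """Build an index document, optionally omitting the costly text extract."""
    doc = {
        "id": content["id"],
        "name": content["name"],
        "type": content["type"],
    }
    if skip_binaries:
        # Index metadata only; mark the document so a background job
        # can inject the binary text extract later through the opened API.
        doc["text"] = ""
        doc["needs_text_extract"] = True
    else:
        doc["text"] = extract_text(content["binary"])
        doc["needs_text_extract"] = False
    return doc
```

With the flag on, the upgrade only pays for metadata indexing, which is where the 1:42 vs 1:14 difference above comes from.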


Requirements

  • The secondary indexing should work silently in the background, without overloading the server.
  • It has to survive site, tool and db restarts, as it may run for many hours.
  • We have to know the progress, percentage of already re-indexed files.
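The restart-survival and progress requirements boil down to checkpointing after every item. A minimal sketch, assuming a JSON state file as the persistence mechanism (the real implementation would persist into the database; all names here are illustrative):

```python
import json
import os

def load_state(path: str, total: int) -> dict:
    """Resume from a previously saved state, or start fresh."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"done": 0, "total": total}

def save_state(path: str, state: dict) -> None:
    # Persist progress after every item so the job survives
    # site, tool and db restarts.
    with open(path, "w") as f:
        json.dump(state, f)

def progress_percent(state: dict) -> int:
    """Percentage of already re-indexed files."""
    return 100 * state["done"] // state["total"]

def reindex(path: str, pending_ids, state: dict, reindex_one) -> None:
    for version_id in pending_ids:
        reindex_one(version_id)      # the actual per-file work
        state["done"] += 1
        save_state(path, state)      # restart-safe checkpoint
```

Throttling (to avoid overloading the server) could be added as a simple sleep or batch delay inside the loop.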

Options for re-indexing binaries

  • a) solve it using indexing activities (a single one, or one per content)
    • PRO: built-in API
    • CON:
      • one activity per file --> too many activities
      • how and when to create these activities?
  • a2) Hybrid solution: a single activity for a range of content items (1-100, 101-200, ...). A new activity is created when the previous one is finished.
    • PRO: this makes it possible to continue the indexing process after a site restart.
    • CON: -
  • b) maintenance (like the binary cleanup mechanism)
    • PRO: controlled by us
    • CON: runs on all web nodes, no explicit locking (may index the same file on multiple web nodes)
  • c) task management
    • PRO: scalable
    • CON: taskman is not installed everywhere, complicated install
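Option a2) can be sketched as a simple range generator driving one activity at a time. This is an illustration of the idea only; `execute_activity` and the range shape are assumptions, not the sensenet activity API:

```python
def range_activities(max_id: int, batch_size: int = 100):
    """Yield content-id ranges such as (1, 100), (101, 200), ..."""
    first = 1
    while first <= max_id:
        yield (first, min(first + batch_size - 1, max_id))
        first += batch_size

def run_hybrid_reindex(max_id: int, execute_activity) -> None:
    # A new activity is created only after the previous one finished;
    # persisting the last completed range would let the process
    # continue after a site restart.
    for content_range in range_activities(max_id):
        execute_activity(content_range)
```

Because each activity covers a whole range, the total activity count stays small (max_id / batch_size) instead of one per file.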

tusmester commented Sep 2, 2018

The chosen solution:

During indexing we create small tasks in a dedicated table for documents whose binaries we skipped. After the site restarts, we load and lock these tasks so that they are not executed multiple times, and reindex those documents gradually.

During the patch there is no need to save the index document into the index itself; we only have to regenerate the serialized index document stored in the Versions table, because the format of the index document stored in the Lucene index has not actually changed.
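The load-and-lock step can be sketched with SQLite as a stand-in database. The table and column names (`IndexingTasks`, `VersionId`, `LockedBy`) are illustrative assumptions, not the actual sensenet schema; the point is the atomic claim that prevents two nodes from reindexing the same file:

```python
import sqlite3

def create_task_table(con: sqlite3.Connection) -> None:
    # Dedicated table holding one small task per skipped binary.
    con.execute("""CREATE TABLE IndexingTasks (
        VersionId INTEGER PRIMARY KEY,
        LockedBy  TEXT)""")

def add_task(con: sqlite3.Connection, version_id: int) -> None:
    con.execute("INSERT INTO IndexingTasks (VersionId) VALUES (?)",
                (version_id,))

def lock_tasks(con: sqlite3.Connection, agent: str, count: int = 10) -> list:
    """Atomically claim up to `count` unlocked tasks for one agent,
    so the same document is never reindexed by two web nodes."""
    with con:  # one transaction for the whole claim
        con.execute(
            "UPDATE IndexingTasks SET LockedBy = ? "
            "WHERE LockedBy IS NULL AND VersionId IN ("
            "  SELECT VersionId FROM IndexingTasks"
            "  WHERE LockedBy IS NULL LIMIT ?)",
            (agent, count))
    rows = con.execute(
        "SELECT VersionId FROM IndexingTasks WHERE LockedBy = ?",
        (agent,)).fetchall()
    return [version_id for (version_id,) in rows]
```

Each web node would lock a small batch, regenerate the serialized index documents for those versions, delete the finished tasks, and repeat until the table is empty; the remaining row count also gives the progress percentage.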

@tusmester tusmester removed the discussion label Sep 2, 2018

@tusmester tusmester added this to the Sprint 166 milestone Sep 2, 2018

@kultsar kultsar closed this Sep 5, 2018
