Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Advanced Srcset and Style Sheet Containing Media Queries Preservation #359

Merged

Conversation

N0taN3rd
Copy link
Contributor

@N0taN3rd N0taN3rd commented Jul 24, 2018

Description

This PR introduces to Pywb via changes to wombat.js and the addition of a new static resource wombatPreservationWorker.js the ability to preserve
the set of values contained in the srcset attribute and images contained in CSS media query rules.

The preservation scheme is split into two parts:

  1. wombat: Initiating wombatPreservationWorker and sending it srcset and media query values for preservation
  2. wombatPreservationWorker: The web worker responsible for fetching the resources

wombat

In order to facilitate this new preservation scheme the changes to wombat are as follows.

Three new internal global variables
  1. WBPreserWorker: a reference to the preservation worker abstraction used by wombat itself
  2. wbSheetMediaQChecker: a reference to the load event callback for checking the stylesheet introduced by link[rel='stylesheet']
  3. wbUsePresWorker: flag indicating if we should use the preservation worker.

The first two global variables (WBPreserWorker and wbSheetMediaQChecker) remain undefined unless the third (wbUsePresWorker) is true.
wbUsePresWorker is set to true when window.Worker is non-null and wombat is operating in live mode or the wbinfo.top_url contains record.

New init function initPreserveWorker and PWorker

The new init function, initPreserveWorker, contains wombats interface to the new wombatPreservationWorker, that was designed to be cross-frame safe.
The init function only introduces the PWorker interface when the wbUsePresWorker flag is true.

initPreserveWorker operates similar to the existing web worker override except that it must come first in order to use non-overridden reference to window.Worker
which is used by PWorker (preservation worker).

PWorker is an ES5 class that provides functionality for cross-frame preservation of srcset and media query CSS rule preservation.
Creation of the PWorker takes two arguments prefix (wb_abs_prefix) and mod (wbinfo.mod).
If wombat is operating in the __WB_replay_top browser context, the backing worker is created otherwise it is not re-created.
The URL to the backing worker includes query parameters prefix and mod which are used by wombatPreservationWorker to create itself if we can use window.URL
otherwise we must immediately send an init message containing those values to the worker.

The PWorker class exposes six functions:

  1. deferredSheetExtraction: Method for checking a style or link[rel='stylesheet'] elements sheet property for media query rules.
    Done in a manner that does not block the UI thread by using Promise.resolve (when available) or setTimeout(cb, 1).
    Used by the checkStyle (value of wbSheetMediaQChecker), rewrite_elem and override_html_assign functions.
    Usage in rewrite_elem is tied to elements style (text contents changed and sheet != null) and link[rel='stylesheet'] (via link.addEventListener('load',wbSheetMediaQChecker)).

  2. terminate: terminate the backing web worker, only terminates worker when in the __WB_replay_top browser context.

  3. postMessage: send a message to the backing web worker. If wombat is in the __WB_replay_top browser context sends message directly
    to worker otherwise forwards msg to __WB_replay_top which will then send the msg directly to the backing worker.

  4. preserveSrcset: method used when sending srcset values from rewrite_srcset

  5. preserveMedia: method used when sending media query values from deferredSheetExtraction

  6. extractFromLocalDoc: method to check document.stylesheets and document.querySelectorAll('*[srcset]') when notify_top is called from init_top_frame_notify

wombatPreservationWorker

The new file wombatPreservationWorker.js contains the worker code used for preserving srcset and media query values.
Because it is expected of wombat to operate in a wide range of browsers this file contains minimal polyfills of Promise and fetch if these symbols are not defined.

This worker expects to receive two messages from wombat:

  1. init: when the WHATWG URL parser is not available to instantiate the Preserver class.
  2. values: contains values to be preserved.

The Preserver class (ES5) is responsible for the extraction and preservation of the desired values.
The Preserver tracks seen URLs, up to 2500 before resetting, in order to lessen the load against the server(s) requests are made too.
If the Preserver has seen a URL no fetch is made otherwise it is added to the seen cache.

The Preserver class exposes 9 functions:

  1. fixupURL: Ensures the URL to be fetched are absolute
  2. safeFetch: Initiates a URL fetch or queues the URL to be fetched if we are waiting for a batch of fetches to complete
  3. urlExtractor: Extracts the URL from sheet text (media query) in a similar manner as style_replacer
  4. fetchDone: When a batch of fetches is done ensures the URL queue is drained and if the Preserver has seen 2500 URLs clears the seen count for GC purposes.
  5. fetchAll: Ensures all fetches complete, fetchDone is called and indicates we are to queue URLs until fetches complete.
    If Preserver is queuing URLs or no fetches are to be made this is a no op.
  6. drainQ: Drains the queue by calling safeFetch for each queued URL and then calls fetchAll
  7. extractMedia: For each style sheet that contained a media query, extract URLs from it's text and fetch initiate a fetch for it if there was a URL
  8. extractSrcset: For each srcset value, if the value is from Pworker.preserveSrcset only fetch URL otherwise from Pworker.extractFromLocalDoc need to extract split full srcset attribute value and then fetch each URL contained
  9. preserveMediaSrcset: Used when the values message is received and calls extractMedia, extractSrcset and fetchAll.

Motivation and Context

Ref Rhizome-Conifer/conifer#64

Types of changes

  • Replay fix (fixes a replay specific issue)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

@N0taN3rd N0taN3rd changed the title [WIP Do Not Merge] Advanced Srcset and Media Query Based StyleSheet Rules Preservation [WIP Do Not Merge] Advanced Srcset and Media Query Style Sheet Rules Preservation Jul 24, 2018
@codecov
Copy link

codecov bot commented Jul 24, 2018

Codecov Report

Merging #359 into develop will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff            @@
##           develop     #359   +/-   ##
========================================
  Coverage    87.78%   87.78%           
========================================
  Files           59       59           
  Lines         7107     7107           
  Branches      1256     1256           
========================================
  Hits          6239     6239           
  Misses         582      582           
  Partials       286      286

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 841687f...d983a8a. Read the comment docs.

@N0taN3rd N0taN3rd changed the title [WIP Do Not Merge] Advanced Srcset and Media Query Style Sheet Rules Preservation Advanced Srcset and Style Sheet Containing Media Queries Preservation Jul 25, 2018
@N0taN3rd N0taN3rd requested a review from ikreymer July 25, 2018 18:03
@N0taN3rd N0taN3rd force-pushed the srcset-mediaquery-advanced-preservation branch from 71f4cb9 to 36493a7 Compare August 10, 2018 16:11
}
// depending on if we have promises, defer things until
// next time the Promise.resolve or setTimeout Qs are cleared
if (typeof $wbwindow.Promise !== 'undefined') {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it safe to assume we have Promises if we have workers? If so, maybe should just do it one or the other, as there's already enough conditionals here. :) I think we use Promises already in wombat..

var WBPreserWorker;
var wbSheetMediaQChecker;
var wbUsePresWorker = $wbwindow.Worker != null &&
(wbinfo.is_live || wbinfo.top_url.indexOf('/record/') !== -1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we shouldn't need the /record/ check, is_live should be sufficient..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right you are.
is_live is sufficient, changing

for (var i = 0; i < values.length; i++) {
values[i] = rewrite_url(values[i].trim());
}

if (WBPreserWorker) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be checking wbUsePresWorker also?

@@ -3630,6 +3804,9 @@ var _WBWombat = function($wbwindow, wbinfo) {
}

send_top_message(message);
if (WBPreserWorker) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check wbUsePresWorker for consistency?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

notify_top() actually runs twice (for readyState interactive and for readyState complete, to allow the banner to update as soon as possible). Probably should send to worker only when readyState is complete?

encodeURIComponent(prefix) + '&mod=' +
encodeURIComponent(mod);
this.worker = new Worker(workerURL);
if (typeof $wbwindow.URL === 'undefined') {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should just be the default, to avoid the multiple init paths? (I wonder which browsers support workers and not window.URL..) Just trying to remove variation :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would seem that safari is the only browser that might give us trouble is safari (searchParams available in v10.1)

@ikreymer
Copy link
Member

This is all really awesome, just added few minor comments.

And one other, we also want this to work in proxy mode. Might be more tricky since we're not intercepting element addition in proxy mode? (hopefully can avoid mutationobservers and just check periodically?) Maybe that should be a separate task/PR, what do you think?

@N0taN3rd N0taN3rd force-pushed the srcset-mediaquery-advanced-preservation branch 2 times, most recently from 6ecf31e to 496fa7d Compare August 20, 2018 17:45
…preservation of srcset values to fix Rhizome-Conifer/conifer#64.

wombat.js:
- Finalized PreserveWorker that preserves srcset values and Media Query values
- Defered extraction and preservation of the values to be preserved so that the UI thread is not clobered
- Hooked into places where wombat rewrites the values we are interested in
wombatPreservationWorker.js:
- Updated handling of srcset extraction now that we are sending wombat srcset rewrites
- Added check to see if we have seen a URL to be fetched
- Added light polyfill of Promise and fetch if they are not defined in wombatPreservationWorker.js, for safari
wombat.spec.js
- Updated to include values necessary to work with PWorker changes.
@N0taN3rd N0taN3rd force-pushed the srcset-mediaquery-advanced-preservation branch from 496fa7d to d983a8a Compare August 20, 2018 17:48
@N0taN3rd
Copy link
Contributor Author

Probably best to add proxy support in another task because we would have to check periodically and thus be sending repeat values to the worker depending on the amount of time the browser stays on the same page.

A more involved deduplication strategy is in order to account for this fact

@ikreymer ikreymer merged commit b4d4be8 into webrecorder:develop Aug 20, 2018
@N0taN3rd N0taN3rd deleted the srcset-mediaquery-advanced-preservation branch August 20, 2018 21:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants