Reduce crawler load on servers #1581

Merged: 1 commit, Jun 6, 2024

6 changes: 6 additions & 0 deletions src/lib/mock-server.js
@@ -123,6 +123,12 @@ mockAgent
.reply(200, '')
.persist();

mockAgent
.get("https://www.w3.org")
.intercept({ method: "GET", path: "/StyleSheets/TR/2021/dark.css" })
.reply(200, '')
.persist();

Comment on lines +126 to +131 (Member Author):

Forgot to add: this is unrelated but now needed (I suspect the style sheet was added recently) to avoid a test complaint that the request for this resource is not being mocked.

mockAgent
.get("https://www.w3.org")
.intercept({ method: "GET", path: "/Tools/respec/respec-highlight" })
168 changes: 148 additions & 20 deletions src/lib/specs-crawler.js
@@ -31,6 +31,88 @@ const {

const {version: reffyVersion} = require('../../package.json');

/**
* To be friendly with servers, requests are serialized per origin server,
* and the code sleeps a bit between requests to a given origin server.
* To achieve this, the code needs to take a lock on the origin it wants to
* send a request to.
*/
const originLocks = {};


/**
* Helper function to sleep for a specified number of milliseconds
*/
function sleep(ms) {
return new Promise(resolve => setTimeout(resolve, ms, 'slept'));
}


/**
* Helper function to interleave values of a list of arrays.
*
* For example:
* interleave([0, 2, 4, 6, 8], [1, 3, 5]) returns [0, 1, 2, 3, 4, 5, 6, 8]
* interleave([0, 3], [1, 4], [2, 5]) returns [0, 1, 2, 3, 4, 5]
*
* The function is used to sort the list of specs to crawl so as to distribute
* origins throughout the list.
*
* Note the function happily modifies (and empties in practice) the arrays
* it receives as arguments.
*/
function interleave(firstArray, ...furtherArrays) {
if (firstArray?.length > 0) {
// Return the concatenation of the first item in the first array,
// and of the result of interleaving remaining arrays, putting the
// first array last in the list.
const firstItem = firstArray.shift();
return [firstItem, ...interleave(...furtherArrays, firstArray)];
}
else {
// First array is empty, let's proceed with remaining arrays
// until there's nothing else to proceed.
if (furtherArrays.length > 0) {
return interleave(...furtherArrays);
}
else {
return [];
}
}
}


/**
* Helper function that returns the "origin" of a URL, defined in a loose way
* as the part of the true origin that identifies the server that's going to
* serve the resource.
*
* For example "github.io" for all specs under github.io, "whatwg.org" for
* all WHATWG specs, "csswg.org" for CSS specs at large (including Houdini
* and FXTF specs since they are served by the same server).
*/
function getOrigin(url) {
if (!url) {
return '';
}
const origin = (new URL(url)).origin;
if (origin.endsWith('.whatwg.org')) {
return 'whatwg.org';
}
else if (origin.endsWith('.github.io')) {
return 'github.io';
}
else if (origin.endsWith('.csswg.org') ||
origin.endsWith('.css-houdini.org') ||
origin.endsWith('.fxtf.org')) {
return 'csswg.org';
}
else {
return origin;
}
}


/**
* Return the spec if crawl succeeded or crawl result from given fallback list
* if crawl yielded an error (and fallback does exist).
@@ -95,24 +177,51 @@ async function crawlSpec(spec, crawlOptions) {
result = {};
}
else {
result = await processSpecification(
urlToCrawl,
(spec, modules) => {
const idToHeading = modules.find(m => m.needsIdToHeadingMap) ?
window.reffy.mapIdsToHeadings() : null;
const res = {
crawled: window.location.toString()
};
modules.forEach(mod => {
res[mod.property] = window.reffy[mod.name](spec, idToHeading);
});
return res;
},
[spec, crawlOptions.modules],
{ quiet: crawlOptions.quiet,
forceLocalFetch: crawlOptions.forceLocalFetch,
...cacheInfo}
);
// To be friendly with servers, requests are serialized per origin
// and only sent after a couple of seconds.
const origin = getOrigin(urlToCrawl.url);
let originLock = originLocks[origin];
if (!originLock) {
originLock = {
locked: false,
last: 0
};
originLocks[origin] = originLock;
}
// Wait for the "lock" on the origin. Once we can take it, sleep as
// needed to only send a request after enough time has elapsed.
while (originLock.locked) {
await sleep(100);
}
originLock.locked = true;
const now = Date.now();
if (now - originLock.last < 2000) {
await sleep(2000 - (now - originLock.last));
}
try {
result = await processSpecification(
urlToCrawl,
(spec, modules) => {
const idToHeading = modules.find(m => m.needsIdToHeadingMap) ?
window.reffy.mapIdsToHeadings() : null;
const res = {
crawled: window.location.toString()
};
modules.forEach(mod => {
res[mod.property] = window.reffy[mod.name](spec, idToHeading);
});
return res;
},
[spec, crawlOptions.modules],
{ quiet: crawlOptions.quiet,
forceLocalFetch: crawlOptions.forceLocalFetch,
...cacheInfo}
);
}
finally {
originLock.last = Date.now();
originLock.locked = false;
}
if (result.status === "notmodified" && fallback) {
crawlOptions.quiet ?? console.warn(`skipping ${spec.url}, no change`);
const copy = Object.assign({}, fallback);
@@ -343,14 +452,33 @@ async function crawlList(speclist, crawlOptions) {
return { spec, readyToCrawl, resolve, reject };
});

// While we want results to be returned following the initial order of the
// specs, to avoid sending too many requests at once to the same origin,
// we'll sort specs so that origins get interleaved.
// Note: there may be specs without URL (ISO specs)
const specsByOrigin = {};
for (const spec of list) {
const toCrawl = crawlOptions.publishedVersion ?
(spec.release ?? spec.nightly) :
spec.nightly;
const origin = getOrigin(toCrawl?.url);
if (!specsByOrigin[origin]) {
specsByOrigin[origin] = [];
}
specsByOrigin[origin].push(spec);
}
const spreadList = interleave(...Object.values(specsByOrigin));

Comment (Member):

I wonder how much the interleaving buys us vs managing a proper per-"origin" queue of requests that would let us process requests in parallel without pauses as long as there is no lock on the relevant origins.

I think e.g. https://github.com/w3c/cg-monitor/blob/5ceea24a1eaa7c1736d38909bcbb47df4ae297a3/lib/caching-queued-fetch.js#L68 accomplishes this for the CG monitor

Comment (Member Author):

Optimization-wise, I don't know if either approach really trumps the other. From a readability and maintenance perspective, though, a proper per-origin queue would be cleaner! I was more trying to patch the existing code than to rewrite it.

One constraint I think we need in Reffy is that parallelism needs to be kept to a reasonable level, say 4-5 specs in parallel maximum, because loading too many specs at once in Puppeteer may use a lot of RAM. If we just enqueue everything on a per-origin queue without throttling, we may end up in a situation where 10 specs from 10 different origins get processed at once.

Right now that throttling is done at the spec level, which is why I chose to shuffle the list instead, to reduce the likelihood that the 4 specs that get processed in parallel target the same origin server. We'd need to apply that throttling to the list of origins instead (process 4 origins in parallel among the list of origins). That can also very likely be done "properly" ;)
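
To make the alternative concrete, here is a minimal sketch of a per-origin queue with the throttling applied at the origin level, along the lines discussed above. It is not part of the PR; all names (enqueue, drainOrigin, MAX_PARALLEL_ORIGINS, ORIGIN_DELAY_MS) are hypothetical:

// Hypothetical sketch, not part of the PR: per-origin FIFO queues with a
// global cap on how many origins get drained in parallel.
const MAX_PARALLEL_ORIGINS = 4;   // parallelism applied to origins, not specs
const ORIGIN_DELAY_MS = 2000;     // pause between two requests to one origin

const queues = new Map();         // origin -> FIFO of pending tasks
const runningOrigins = new Set(); // origins currently being drained

function enqueue(origin, task) {
  return new Promise((resolve, reject) => {
    if (!queues.has(origin)) {
      queues.set(origin, []);
    }
    queues.get(origin).push({ task, resolve, reject });
    drainOrigin(origin);
  });
}

async function drainOrigin(origin) {
  // Respect the global cap and don't drain the same origin twice
  if (runningOrigins.has(origin) ||
      runningOrigins.size >= MAX_PARALLEL_ORIGINS) {
    return;
  }
  runningOrigins.add(origin);
  const queue = queues.get(origin);
  while (queue.length > 0) {
    const { task, resolve, reject } = queue.shift();
    try {
      resolve(await task());
    }
    catch (err) {
      reject(err);
    }
    // Courtesy delay before the next request to the same origin
    await new Promise(res => setTimeout(res, ORIGIN_DELAY_MS));
  }
  runningOrigins.delete(origin);
  // A parallelism slot opened up; resume any origin that was waiting for it
  for (const [waiting, pending] of queues) {
    if (pending.length > 0) {
      drainOrigin(waiting);
    }
  }
}

Each crawl could then be wrapped as enqueue(getOrigin(url), () => crawlSpec(spec, options)), preserving the two-second courtesy delay per origin while letting up to four origins progress in parallel.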


// In debug mode, specs are processed one by one. In normal mode,
// specs are processed in chunks
const chunkSize = Math.min((crawlOptions.debug ? 1 : 4), list.length);

let pos = 0;
function flagNextSpecAsReadyToCrawl() {
if (pos < listAndPromise.length) {
listAndPromise[pos].resolve();
if (pos < spreadList.length) {
const spec = spreadList[pos];
const specAndPromise = listAndPromise.find(sp => sp.spec === spec);
specAndPromise.resolve();
pos += 1;
}
}