Merged
Conversation
calebjclark
reviewed
Jan 6, 2023
databox/docs/advanced/hero-plugin.md
Outdated
Contributor
Do we still have an afterRun?
calebjclark
reviewed
Jan 6, 2023
databox/docs/basics/crawler.md
Outdated
Comment on lines 21 to 22
Contributor
Is crawl an array or an object? The documentation below makes it seem like an array, but this code makes it seem like an object.
Contributor
Author
Should be an array
Contributor
Does HeroReplay accept an array?
Contributor
Author
No, should be
const [crawl] = await crawler.stream({ input: { maxTimeInCache: 60 } });
const heroReplay = await HeroReplay(crawl);
NOTE: depends on @ulixee/shared PR 11
This PR is a mini-overhaul of Output and Plugins, adds a new type of Function called a Crawler, and changes PassthroughFunctions to have an onRequest and onResponse handler.
Output
Output is now an object you can create with a single function called emit(). When you emit an Output object, it is published as an event to the caller. The caller could be a Local Databox Function run, a Databox.query api call via the client, or a Passthrough Function. This new structure means we can handle data updates or "forwards" immediately as output is processed. It also means a partial response is possible where you get a few records back, but an error occurs midstream.
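A minimal TypeScript sketch of the emit-as-event idea described above. It does not use the Databox API: runFunction, OutputRecord, collect, and the failure on the third record are hypothetical stand-ins, chosen only to show how a caller can receive some records before an error arrives midstream.

```typescript
// Hypothetical stand-in types for illustration; not the actual Databox API.
type OutputRecord = { title: string };
type Emit = (record: OutputRecord) => void;

// Simulates a function run that emits records one at a time and fails midstream.
async function runFunction(emit: Emit): Promise<void> {
  const titles = ['first', 'second', 'third'];
  for (const title of titles) {
    if (title === 'third') throw new Error('upstream failed');
    emit({ title }); // each emitted record reaches the caller immediately
  }
}

// The caller keeps whatever was emitted before the error surfaced.
async function collect(): Promise<{ records: OutputRecord[]; error?: string }> {
  const records: OutputRecord[] = [];
  try {
    await runFunction(record => {
      records.push(record);
    });
    return { records };
  } catch (err) {
    return { records, error: (err as Error).message };
  }
}
```

In this sketch the caller ends up with two records plus the error message, which is the partial-response case the paragraph mentions.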
Plugin simplification
During the course of building this feature, it became clear that having beforeRun and afterRun on a function that could stream results would make the plugins very complicated. For this reason, the Plugin structure was changed in two ways: beforeRun and afterRun were removed from functions and plugins, and you can only extend what will be passed into the run function.
Crawlers
A new type of function called a Crawler has been created. It generates "assets" that can be consumed to extract data. A Crawler is intended to allow a user to use a live browser (like Hero) to generate assets that can be extracted in further scripts (eg, HeroReplay).
toCrawlerOutput. This contains the type of Crawler, as well as a SessionId and version. Right now, the only implementer is HeroReplay, but it would be easy to create for other types of automated browsers.
Caching
Crawlers automatically keep track of inputs mapped to the created sessionIds in a private table called "cache" (can be disabled with disableCache in constructor). If a schema is provided, columns are created for each input field. If a consuming Function provides a "maxTimeInCache", the cache will look for matching inputs and return any sessionId newer than the provided maxTimeInCache. An absence of maxTimeInCache means don't look in the cache.
PassthroughFunctions
Passthrough Functions needed to change to support the new streaming output records. To do so, we needed a new structure. The new structure allows two phases, both optionally enhanced with Plugin constructors (eg, Hero).
onRequest is called before an upstream call is made. This function allows a PassthroughFunction author to manipulate input arguments before sending to a source Databox.
onResponse is called once the remote function has been invoked. A stream object is provided to the callback. This stream can be invoked like an AsyncIterable (for await (const output of stream)), or it can be awaited directly to wait for all results. The author then can emit Output as desired. If no onResponse function is provided, the output records are re-emitted automatically.
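The two phases can be sketched in plain TypeScript. This is only an illustration under stated assumptions, not the PassthroughFunction API: ResultStream, upstream, passthrough, and the lowercase/uppercase transforms are all hypothetical. The sketch shows one way a stream object could support both for await and a direct await.

```typescript
// Hypothetical stand-ins for illustration; not the actual PassthroughFunction API.
type Input = { url: string };
type Output = { title: string };

// A stream that can be iterated record by record or awaited for all results.
// Each iteration or await re-runs the source, which is fine for a sketch.
class ResultStream<T> implements AsyncIterable<T>, PromiseLike<T[]> {
  constructor(private readonly source: () => AsyncGenerator<T>) {}

  [Symbol.asyncIterator](): AsyncIterator<T> {
    return this.source();
  }

  then<R1 = T[], R2 = never>(
    onfulfilled?: ((value: T[]) => R1 | PromiseLike<R1>) | null,
    onrejected?: ((reason: any) => R2 | PromiseLike<R2>) | null,
  ): PromiseLike<R1 | R2> {
    const all = (async () => {
      const results: T[] = [];
      for await (const item of this.source()) results.push(item);
      return results;
    })();
    return all.then(onfulfilled, onrejected);
  }
}

// Stands in for the remote (source) function.
async function* upstream(input: Input): AsyncGenerator<Output> {
  yield { title: `${input.url} #1` };
  yield { title: `${input.url} #2` };
}

async function passthrough(input: Input): Promise<Output[]> {
  const emitted: Output[] = [];
  const emit = (output: Output): void => {
    emitted.push(output);
  };

  // Phase 1 (onRequest): adjust the input before the upstream call is made.
  const onRequest = (i: Input): Input => ({ url: i.url.toLowerCase() });

  // Phase 2 (onResponse): consume the stream and emit Output as desired.
  const onResponse = async (stream: ResultStream<Output>): Promise<void> => {
    for await (const output of stream) {
      emit({ title: output.title.toUpperCase() });
    }
  };

  const request = onRequest(input);
  await onResponse(new ResultStream(() => upstream(request)));
  return emitted;
}
```

Inside the onResponse callback, `const all = await stream;` would instead wait for every record before emitting anything.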