
Streaming Output #157

Merged

calebjclark merged 9 commits into main from output on Jan 6, 2023

Conversation

@blakebyrnes
Contributor

NOTE: depends on @ulixee/shared PR 11

This PR is a mini-overhaul of Output and Plugins. It adds a new type of Function called a Crawler and changes PassthroughFunctions to have onRequest and onResponse handlers.

Output

Output is now an object you can create; it exposes a single function called emit(). When you emit an Output object, it is published as an event to the caller. The caller could be a local Databox Function run, a Databox.query API call via the client, or a Passthrough Function. This new structure means we can handle data updates or "forwards" immediately as output is processed. It also means a partial response is possible: you may get a few records back before an error occurs midstream.
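A minimal sketch of this flow, assuming the run context exposes the Output constructor (the function name, fields, and fetch helper are illustrative, not part of this PR):

import Databox, { Function } from '@ulixee/databox';

export default new Databox({
  functions: {
    latestPrices: new Function(async ({ Output }) => {
      for (const record of await fetchLatestPrices()) {
        const output = new Output();
        output.symbol = record.symbol;
        output.price = record.price;
        // emit() publishes this record to the caller immediately,
        // so a consumer can process partial results mid-stream
        output.emit();
      }
    }),
  },
});

// hypothetical stand-in for real scraping logic
async function fetchLatestPrices(): Promise<{ symbol: string; price: number }[]> {
  return [{ symbol: 'ULX', price: 1.01 }];
}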

Plugin simplification

While building this feature, it became clear that having beforeRun and afterRun on a function that could stream results would make the plugins very complicated. For this reason, the Plugin structure was changed in two ways:

  1. We removed beforeRun and afterRun from functions and plugins. You can only extend what will be passed into the run function.
  2. Since you don't always want a HeroReplay in this new setup, the Hero and Puppeteer plugins now add constructors for Hero, HeroReplay and Puppeteer instead of instances. The function implementation can create whatever it needs (see the sketch after this list).
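A minimal sketch of a Function using an injected constructor; the import path and plugin name here are assumptions, not confirmed by this PR:

import { Function, HeroFunctionPlugin } from '@ulixee/databox-plugins-hero';

const getTitle = new Function(async ({ Hero, Output }) => {
  // the plugin injects the Hero constructor, not an instance,
  // so the function creates exactly what it needs
  const hero = new Hero();
  await hero.goto('https://ulixee.org');
  const output = new Output();
  output.title = await hero.document.title;
  output.emit();
  await hero.close();
}, HeroFunctionPlugin);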

Crawlers

A Crawler is a new type of Function that generates "assets" that can be consumed to extract data. It is intended to let a user drive a live browser (like Hero) to generate assets that can be extracted in further scripts (e.g., with HeroReplay).

  • Crawlers cannot create Output. They must return an object implementing toCrawlerOutput, which contains the type of Crawler as well as a sessionId and version. Right now, the only implementer is HeroReplay, but it would be easy to create one for other types of automated browsers.
  • In this version, Crawlers are private to a Databox, i.e., they must be consumed by a Function (a sketch follows this list).
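A minimal sketch of a Crawler and the return shape described above; the import path, sessionId accessor, and version field are assumptions:

import { Crawler, HeroFunctionPlugin } from '@ulixee/databox-plugins-hero';

const crawler = new Crawler(async ({ Hero }) => {
  const hero = new Hero();
  await hero.goto('https://ulixee.org');
  // a Crawler must return an object implementing toCrawlerOutput,
  // carrying the type of Crawler plus a sessionId and version
  return {
    toCrawlerOutput: async () => ({
      crawler: 'Hero',
      sessionId: await hero.sessionId,
      version: hero.version, // assumed accessor
    }),
  };
}, HeroFunctionPlugin);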

Caching

Crawlers automatically keep track of inputs mapped to the created sessionIds in a private table called "cache" (this can be disabled with disableCache in the constructor). If a schema is provided, columns are created for each input field. If a consuming Function provides a maxTimeInCache, the cache will look for matching inputs and return any sessionId newer than the provided maxTimeInCache. Omitting maxTimeInCache means the cache is not consulted.
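A minimal sketch of a consuming Function using the cache, reusing the crawler sketched above and the stream/HeroReplay calls shown later in this conversation:

import { Function, HeroFunctionPlugin } from '@ulixee/databox-plugins-hero';

const extract = new Function(async ({ HeroReplay, Output }) => {
  // reuse any cached crawl newer than 60 seconds;
  // omitting maxTimeInCache bypasses the cache entirely
  const [crawl] = await crawler.stream({ input: { maxTimeInCache: 60 } });
  const heroReplay = await HeroReplay(crawl);
  // ...extract data from the replayed session, then emit it
  const output = new Output();
  output.sessionId = crawl.sessionId; // assumed field on the crawl result
  output.emit();
}, HeroFunctionPlugin);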

PassthroughFunctions

Passthrough Functions had to change to support the new streaming output records, so we built a new structure. It allows two phases, both optionally enhanced with Plugin constructors (e.g., Hero).

  • onRequest is called before the upstream call is made. It allows a PassthroughFunction author to manipulate input arguments before they are sent to the source Databox.
  • onResponse is called once the remote function has been invoked. A stream object is provided to the callback; it can be consumed as an AsyncIterable (for await (const output of stream)) or awaited directly to wait for all results. The author can then emit Output as desired; an onResponse handler will not re-emit the output records automatically (a sketch of both phases follows this list).
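A minimal sketch of the two phases; the constructor options, remote function name, and callback signatures here are assumptions for illustration:

import { PassthroughFunction } from '@ulixee/databox';

const enhanced = new PassthroughFunction({
  remoteFunction: 'source.latestPrices', // hypothetical upstream function
  async onRequest({ input }) {
    // phase 1: adjust input arguments before the upstream call
    input.symbol = input.symbol?.toUpperCase();
  },
  async onResponse({ stream, Output }) {
    // phase 2: consume the upstream stream as an AsyncIterable
    for await (const record of stream) {
      // re-emit each record; nothing is re-emitted automatically
      const output = new Output();
      Object.assign(output, record, { enrichedAt: Date.now() });
      output.emit();
    }
  },
});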

Contributor

Do we still have an afterRun?

Contributor Author

No, will fix

Comment on lines 21 to 22
Contributor

Is crawl an array or an object? The documentation below makes it seem like it's an array, but this code makes it seem like an object.

Contributor Author

Should be an array

Contributor

Does HeroReplay accept an array?

Contributor Author

No, it should be:

const [crawl] = await crawler.stream({ input: { maxTimeInCache: 60 } });
const heroReplay = await HeroReplay(crawl);

calebjclark merged commit a47cbc2 into main on Jan 6, 2023.
calebjclark deleted the output branch on January 6, 2023 at 16:21.