
Streaming Output #157

Merged

calebjclark merged 9 commits into main from output on Jan 6, 2023

Conversation

@blakebyrnes
Contributor

NOTE: depends on @ulixee/shared PR 11

This PR is a mini-overhaul of Output and Plugins. It adds a new type of Function called a Crawler and changes PassthroughFunctions to have onRequest and onResponse handlers.

Output

Output is now an object you can create; it exposes a single function called emit(). When you emit an Output object, it is published as an event to the caller. The caller could be a local Databox Function run, a Databox.query API call via the client, or a Passthrough Function. This new structure means we can handle data updates or "forwards" immediately as output is processed. It also means a partial response is possible: you may get a few records back before an error occurs midstream.
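A minimal sketch of this flow, assuming the run context exposes the Output constructor (the function name, fields, and fetch helper are illustrative, not part of this PR):

import Databox, { Function } from '@ulixee/databox';

export default new Databox({
  functions: {
    latestPrices: new Function(async ({ Output }) => {
      for (const record of await fetchLatestPrices()) {
        const output = new Output();
        output.symbol = record.symbol;
        output.price = record.price;
        // emit() publishes this record to the caller immediately,
        // so a consumer can process partial results mid-stream
        output.emit();
      }
    }),
  },
});

// hypothetical stand-in for real scraping logic
async function fetchLatestPrices(): Promise<{ symbol: string; price: number }[]> {
  return [{ symbol: 'ULX', price: 1.01 }];
}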

Plugin simplification

While building this feature, it became clear that having beforeRun and afterRun on a function that could stream results would make the plugins very complicated. For this reason, the Plugin structure was changed in two ways:

  1. We removed beforeRun and afterRun from functions and plugins. You can only extend what will be passed into the run function.
  2. Since you don't always want a HeroReplay in this new setup, the Hero and Puppeteer plugins now add constructors for Hero, HeroReplay and Puppeteer instead of instances. The function implementation can create whatever it needs (see the sketch after this list).
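A minimal sketch of a Function using an injected constructor; the import path and plugin name here are assumptions, not confirmed by this PR:

import { Function, HeroFunctionPlugin } from '@ulixee/databox-plugins-hero';

const getTitle = new Function(async ({ Hero, Output }) => {
  // the plugin injects the Hero constructor, not an instance,
  // so the function creates exactly what it needs
  const hero = new Hero();
  await hero.goto('https://ulixee.org');
  const output = new Output();
  output.title = await hero.document.title;
  output.emit();
  await hero.close();
}, HeroFunctionPlugin);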

Crawlers

A Crawler is a new type of Function that generates "assets" that can be consumed to extract data. It is intended to let a user drive a live browser (like Hero) to generate assets that can be extracted in further scripts (e.g., with HeroReplay).

  • Crawlers cannot create Output. They must return an object implementing toCrawlerOutput, which contains the type of Crawler as well as a sessionId and version. Right now, the only implementer is HeroReplay, but it would be easy to create one for other types of automated browsers.
  • In this version, Crawlers are private to a Databox, i.e., they must be consumed by a Function (a sketch follows this list).
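A minimal sketch of a Crawler and the return shape described above; the import path, sessionId accessor, and version field are assumptions:

import { Crawler, HeroFunctionPlugin } from '@ulixee/databox-plugins-hero';

const crawler = new Crawler(async ({ Hero }) => {
  const hero = new Hero();
  await hero.goto('https://ulixee.org');
  // a Crawler must return an object implementing toCrawlerOutput,
  // carrying the type of Crawler plus a sessionId and version
  return {
    toCrawlerOutput: async () => ({
      crawler: 'Hero',
      sessionId: await hero.sessionId,
      version: hero.version, // assumed accessor
    }),
  };
}, HeroFunctionPlugin);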

Caching

Crawlers automatically keep track of inputs mapped to the created sessionIds in a private table called "cache" (this can be disabled with disableCache in the constructor). If a schema is provided, columns are created for each input field. If a consuming Function provides a maxTimeInCache, the cache will look for matching inputs and return any sessionId newer than the provided maxTimeInCache. Omitting maxTimeInCache means the cache is not consulted.
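A minimal sketch of a consuming Function using the cache, reusing the crawler sketched above and the stream/HeroReplay calls shown later in this conversation:

import { Function, HeroFunctionPlugin } from '@ulixee/databox-plugins-hero';

const extract = new Function(async ({ HeroReplay, Output }) => {
  // reuse any cached crawl newer than 60 seconds;
  // omitting maxTimeInCache bypasses the cache entirely
  const [crawl] = await crawler.stream({ input: { maxTimeInCache: 60 } });
  const heroReplay = await HeroReplay(crawl);
  // ...extract data from the replayed session, then emit it
  const output = new Output();
  output.sessionId = crawl.sessionId; // assumed field on the crawl result
  output.emit();
}, HeroFunctionPlugin);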

PassthroughFunctions

Passthrough Functions had to change to support the new streaming output records, so we built a new structure. It allows two phases, both optionally enhanced with Plugin constructors (e.g., Hero).

  • onRequest is called before the upstream call is made. It allows a PassthroughFunction author to manipulate input arguments before they are sent to the source Databox.
  • onResponse is called once the remote function has been invoked. A stream object is provided to the callback; it can be consumed as an AsyncIterable (for await (const output of stream)) or awaited directly to wait for all results. The author can then emit Output as desired; an onResponse handler will not re-emit the output records automatically (a sketch of both phases follows this list).
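A minimal sketch of the two phases; the constructor options, remote function name, and callback signatures here are assumptions for illustration:

import { PassthroughFunction } from '@ulixee/databox';

const enhanced = new PassthroughFunction({
  remoteFunction: 'source.latestPrices', // hypothetical upstream function
  async onRequest({ input }) {
    // phase 1: adjust input arguments before the upstream call
    input.symbol = input.symbol?.toUpperCase();
  },
  async onResponse({ stream, Output }) {
    // phase 2: consume the upstream stream as an AsyncIterable
    for await (const record of stream) {
      // re-emit each record; nothing is re-emitted automatically
      const output = new Output();
      Object.assign(output, record, { enrichedAt: Date.now() });
      output.emit();
    }
  },
});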

Contributor

Do we still have an afterRun?

Contributor Author

No, will fix

Comment on lines 21 to 22
Contributor

Is crawl an array or an object? The documentation below makes it seem like it's an array, but this code makes it seem like an object.

Contributor Author

Should be an array

Contributor

Does HeroReplay accept an array?

Contributor Author

No, it should be:

const [crawl] = await crawler.stream({ input: { maxTimeInCache: 60 } });
const heroReplay = await HeroReplay(crawl);

calebjclark merged commit a47cbc2 into main on Jan 6, 2023.
calebjclark deleted the output branch on January 6, 2023 at 16:21.