feat: crawler documentation
blakebyrnes committed Jan 5, 2023
1 parent a92da44 commit 0615cd8
Showing 19 changed files with 258 additions and 221 deletions.
20 changes: 10 additions & 10 deletions databox/README.md
@@ -1,10 +1,10 @@
# Databox for Hero
# Databox

Databox for Hero is a simple wrapper for your Hero scraper script that converts it into a discrete, composable, and deployable unit.
Databox is a simple wrapper for your scraper script that converts it into a discrete, composable, and deployable unit.

- [x] **Production Proof Your Script** - Production proof your script a thousand different ways.
- [x] **Breaking Notifications** - Get notified when your scripts break.
- [x] **Runs Anywhere** - Containerize your scripts to run everywhere
- [x] **Runs Anywhere** - Containerize your scripts to run everywhere.
- [x] **Works with Chrome Alive!** - Progressively build your scripts with Chrome Alive!
- [x] **Easy Management** - Manage your databoxes like a boss.

@@ -27,22 +27,22 @@ Wrapping your script in a Databox gives it instant access to the input and outpu
script.ts

```js
const Databox = require('@ulixee/databox-plugins-hero');
const { Function, HeroFunctionPlugin } = require('@ulixee/databox-plugins-hero');

new Databox(async databox => {
const { input, Output, Hero } = databox;
new Function(async context => {
const { input, Output, Hero } = context;
const hero = new Hero();
await hero.goto('https://example.org');
Output.emit({ text: `I went to example.org. Your input was: ${input.params.name}` });
});
Output.emit({ text: `I went to example.org. Your input was: ${input.name}` });
}, HeroFunctionPlugin);
```

You can call your script in several ways.

1. Directly from the command line:

```shell script
% node script.js --params.name=Alfonso
% node script.js --input.name=Alfonso
```

2. Through Stream:
@@ -54,7 +54,7 @@ import Stream from '@ulixee/stream';

const stream = new Stream('');

const output = await stream.query({ params: { name: 'Alfonso' } });
const output = await stream.query({ input: { name: 'Alfonso' } });
```
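
To illustrate how a dotted flag such as `--input.name=Alfonso` could end up as `input.name` inside the script, here is a minimal sketch. `parseDottedFlags` is a hypothetical helper written for this example; it is not part of Databox, and the real argument parsing may differ.

```js
// Hypothetical helper (not a Databox API): turns `--a.b=value` flags into nested objects.
function parseDottedFlags(argv) {
  const result = {};
  for (const arg of argv) {
    const match = /^--([^=]+)=(.*)$/.exec(arg);
    if (!match) continue; // skip anything that is not `--key=value`
    const path = match[1].split('.');
    let target = result;
    // Walk (and create) every intermediate object along the dotted path.
    for (const key of path.slice(0, -1)) {
      if (typeof target[key] !== 'object' || target[key] === null) target[key] = {};
      target = target[key];
    }
    target[path[path.length - 1]] = match[2];
  }
  return result;
}

const parsed = parseDottedFlags(['--input.name=Alfonso']);
console.log(parsed.input.name); // prints "Alfonso"
```

Values stay strings in this sketch; any type coercion would be an extra step.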

Browse the [full API docs](https://docs.ulixee.org/databox).
3 changes: 0 additions & 3 deletions databox/client/interfaces/IDataboxMetadata.ts
@@ -17,9 +17,6 @@ export default interface IDataboxMetadata {
crawlersByName: {
[name: string]: {
corePlugins: { [name: string]: string };
remoteFunction?: string;
remoteSource?: string;
remoteDataboxVersionHash?: string;
} & Omit<IFunctionComponents<any, any>, 'run'>;
};
tablesByName: {
4 changes: 0 additions & 4 deletions databox/client/lib/DataboxInternal.ts
@@ -193,16 +193,12 @@ export default class DataboxInternal<
}

for (const [name, func] of Object.entries(this.crawlers)) {
const passThrough = func as unknown as PassthroughFunction<any, any>;
metadata.crawlersByName[name] = {
corePlugins: func.corePlugins ?? {},
schema: func.schema,
pricePerQuery: func.pricePerQuery,
addOnPricing: func.addOnPricing,
minimumPrice: func.minimumPrice,
remoteSource: passThrough?.remoteSource,
remoteFunction: passThrough?.remoteFunction,
remoteDataboxVersionHash: passThrough?.databoxVersionHash,
};
}

108 changes: 30 additions & 78 deletions databox/docs/advanced/hero-plugin.md
@@ -1,12 +1,13 @@
# HeroFunctionPlugin

> HeroFunctionPlugin supercharges your databox Function with full Hero capabilities. It also allow you to organize your script into two execution stages - the "live" `run` callback and a second "replayed" `afterRun` callback.
> HeroFunctionPlugin supercharges your databox Function with full Hero capabilities. It also allows you to organize your script into two execution stages: the "live" Crawler Function and a second Function operating on the cacheable Crawler output.
Databox Functions with HeroFunctionPlugin allow you break down your script into a "live" (run) phase and an "offline extraction" (afterRun) phase.
Databox Functions with HeroFunctionPlugin allow you to break down a script into a "live" Crawler Function and a second "offline" Function operating on the cacheable Crawler output.

The 'run' step is passed a pre-initialized [Hero](https://ulixee.org/docs/hero) instance to interact with a website. You can collect all output in this phase, or you can choose to detach assets like [Resources](https://ulixee.org/docs/hero/docs/hero/advanced-client/detached-resources), [HTML Elements](https://ulixee.org/docs/hero/docs/hero/advanced-client/detached-elements) and [Data Snippets](https://ulixee.org/docs/hero/basic-client/hero-replay#getSnippet) that can be extracted later.
The HeroFunctionPlugin adds two options to a Function's `run` callback:

The 'afterRun' step is passed in a [HeroReplay](https://ulixee.org/docs/hero/docs/hero/basics-client/hero-replay) instance instead of a "live" Hero. You can use this function to pull out data from your [Detached assets](https://ulixee.org/docs/hero/docs/hero/basics-client/hero-replay) (ie, you don't have to run your logic browser-side). It also allows you to run your extraction logic as a unit, which enables you to re-run it on assets collected from your last `run` until your logic works correctly.
- A [Hero](https://ulixee.org/docs/hero) constructor to interact with a website. The constructor will automatically connect to the local Hero Core. You can collect all output in this phase, or you can choose to detach assets like [Resources](https://ulixee.org/docs/hero/docs/hero/advanced-client/detached-resources), [HTML Elements](https://ulixee.org/docs/hero/docs/hero/advanced-client/detached-elements) and [Data Snippets](https://ulixee.org/docs/hero/basic-client/hero-replay#getSnippet) that can be extracted later.
- A [HeroReplay](https://ulixee.org/docs/hero/docs/hero/basics-client/hero-replay) constructor that can be supplied with the sessionId of a previous Hero run. A constructed instance will automatically connect to the local Hero Core. You can use this class to pull out data from your [Detached assets](https://ulixee.org/docs/hero/docs/hero/basics-client/hero-replay) (i.e., you don't have to run your logic browser-side). It also lets you run your extraction logic as a unit, so you can re-run it on assets collected from your Crawler until your logic works correctly.

## Getting Started

@@ -17,8 +18,8 @@ You can run this script as a regular node script and it will run the callback. H
To use HeroFunctionPlugin, import the plugin and include it in the `plugins` vararg array of your Databox Function constructor.

```js
import { HeroFunctionPlugin, Function } from '@ulixee/databox-plugins-hero';
export default new Function(async context => {
import { HeroFunctionPlugin, Crawler } from '@ulixee/databox-plugins-hero';
export default new Crawler(async context => {
const { input, Output, Hero } = context;

const hero = new Hero();
@@ -28,7 +29,7 @@ export default new Function(async context => {
const output = new Output();
output.title = title;
output.body = await hero.document.body.textContent;
await hero.close();
return hero;
}, HeroFunctionPlugin);
```

@@ -37,45 +38,47 @@ export default new Function(async context => {
To use the [HeroReplay](https://ulixee.org/docs/hero/basics-client/hero-replay) extraction phase, you'll simply add an additional afterRun callback:

```js
import { Function, HeroFunctionPlugin } from '@ulixee/databox-plugins-hero';
import Databox from '@ulixee/databox';
import { Crawler, Function, HeroFunctionPlugin } from '@ulixee/databox-plugins-hero';

export default new Function(
{
async run(context) {
const databox = new Databox({
crawlers: {
ulixee: new Crawler(async context => {
const { Hero } = context;
const hero = new Hero();
await hero.goto('https://ulixee.org');
console.log(await hero.sessionId);
await hero.document.querySelector('h1').$addToDetachedElements('h1');
},
async afterRun(context) {
const { input, Output, heroReplay } = context;
const h1 = await hero.detachedElements.get('h1');
}, HeroFunctionPlugin),
},
functions: {
ulixee: new Function(async context => {
const { input, Output, HeroReplay } = context;
const maxTimeInCache = input.maxTimeInCache || 5 * 60;
const crawledContent = await databox.crawl('ulixee', { maxTimeInCache });
const heroReplay = new HeroReplay(crawledContent);
const h1 = await heroReplay.detachedElements.get('h1');
const output = new Output();
output.title = h1.textContent;
},
}, HeroFunctionPlugin),
},
HeroFunctionPlugin,
);
});
export default databox;
```

If you have a prior Hero SessionId to replay, you can run ONLY the `afterRun` phase by running your function as follows:
If you have a prior Hero SessionId to replay, you can run ONLY the `Function` phase by running as follows:

```bash
node ./heroFunction.js --replaySessionId=session123
node ./heroFunction.js --maxTimeInCache=30
```
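
The `maxTimeInCache` input above suggests a freshness check on previously crawled content. The sketch below shows one way such a check could behave. It assumes the value is in seconds (the `5 * 60` default hints at this, but the source does not confirm it), and `createCrawlCache` is a hypothetical in-memory stand-in, not Databox's actual storage or `databox.crawl` implementation.

```js
// Hypothetical cache wrapper (not a Databox API). `runCrawler` does the real work;
// `now` is injectable so the example can use a fake clock.
function createCrawlCache(runCrawler, now = () => Date.now()) {
  const entries = new Map();
  return function crawl(name, { maxTimeInCache = 0 } = {}) {
    const cached = entries.get(name);
    // Assumption: maxTimeInCache is expressed in seconds.
    const ageSeconds = cached ? (now() - cached.storedAt) / 1000 : Infinity;
    if (cached && ageSeconds <= maxTimeInCache) {
      return { ...cached.result, fromCache: true };
    }
    const result = runCrawler(name);
    entries.set(name, { result, storedAt: now() });
    return { ...result, fromCache: false };
  };
}

let clock = 0; // fake clock in milliseconds
let runs = 0;
const crawl = createCrawlCache(name => ({ sessionId: `${name}-${++runs}` }), () => clock);

const first = crawl('ulixee', { maxTimeInCache: 30 }); // nothing cached: crawler runs
clock = 10000;
const second = crawl('ulixee', { maxTimeInCache: 30 }); // 10s old: cached result reused
clock = 60000;
const third = crawl('ulixee', { maxTimeInCache: 30 }); // 60s old: crawler runs again
console.log(first.sessionId, second.sessionId, third.sessionId); // prints "ulixee-1 ulixee-1 ulixee-2"
```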

## Changes to FunctionContext

The HeroFunctionPlugin for Hero adds automatically initialized Hero instances to the `run` and `afterRun` phases of a Function.
The HeroFunctionPlugin adds "automatically connecting" Hero and HeroReplay constructors to the FunctionContext.

### run _(functionContext)_ {#run-hero}

- functionContext.hero `Hero`. Readonly access to a pre-initialized [Hero](https://ulixee.org/docs/hero/basic-client/hero) instance.

### runAfter _(functionContext)_ {#runafter-hero}

- functionContext.heroReplay `HeroReplay`. Readonly access to a pre-initialized [HeroReplay](https://ulixee.org/docs/hero/basic-client/hero-replay) instance.
- functionContext.Hero `Hero`. [Hero](https://ulixee.org/docs/hero/basic-client/hero) constructor that is automatically connected and cleaned up.
- functionContext.HeroReplay `HeroReplay`. [HeroReplay](https://ulixee.org/docs/hero/basic-client/hero-replay) constructor that's automatically connected and cleaned up.
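
As a rough illustration of what "automatically connected and cleaned up" could mean, the sketch below wraps a class so that every instance receives default connection options and is tracked for later cleanup. `FakeHero`, `createBoundConstructor`, and the `'local'` connection value are hypothetical stand-ins for this example; the plugin's real implementation is not shown in this commit.

```js
// Stand-in for a real client class; it only records its options and whether close() ran.
class FakeHero {
  constructor(options = {}) {
    this.options = options;
    this.isClosed = false;
  }
  close() {
    this.isClosed = true;
  }
}

// Hypothetical helper: returns a subclass that pre-fills default options and
// registers each instance so a plugin can close leftovers once `run` finishes.
function createBoundConstructor(BaseClass, defaults, instances) {
  return class Bound extends BaseClass {
    constructor(options = {}) {
      super({ ...defaults, ...options });
      instances.add(this);
    }
  };
}

const instances = new Set();
const context = {
  Hero: createBoundConstructor(FakeHero, { connectionToCore: 'local' }, instances),
};

// Inside `run`, the script just calls `new Hero(...)`.
const hero = new context.Hero({ showChrome: true });

// After `run` completes, the plugin closes whatever is still tracked.
for (const instance of instances) instance.close();
console.log(hero.options, hero.isClosed);
```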

## Constructor

@@ -85,55 +88,4 @@ The HeroFunctionPlugin modifies the Function constructor with the following chan

#### **Arguments**:

- run: `function`(functionContext): `Promise<any>`. Adds a hero instance to the run function as per [above](#run-hero).
- runAfter: `function`(functionContext): `Promise<any>`. An optional function where you can transform collected assets into your desired output structure. The only difference between this callback and `run` is that the FunctionContext supplies a [heroReplay](https://ulixee.org/docs/hero/basic-client/hero-replay) instance instead of `hero`.
- defaultHeroOptions [`IHeroCreateOptions`](https://ulixee.org/docs/hero/basic-client/hero#constructor). Configure Hero with any default options.

```js
import Databox, { Function } from '@ulixee/databox';
import { HeroFunctionPlugin } from '@ulixee/databox-plugins-hero';

export default new Databox({
functions: {
hero: new Function(
{
async run({ Hero }) {
const hero = new Hero();
const page = await hero.goto('https://ulixee.org');
await page.$addToDetachedResources('default');
},
async afterRun({ heroReplay }) {
const collected = await heroReplay.detachedResources.get('default');
},
defaultHeroOptions: {
showChrome: true,
},
},
HeroFunctionPlugin,
),
},
});
```

## Passing In Hero-Specific Configuration

You can configure the supplied [Hero](https://ulixee.org/docs/hero) instance through the defaultHeroOptions added to the Function constructor. This can be helpful to supply common configurations to your :

```js
import { Function, HeroFunctionPlugin } from '@ulixee/databox-plugins-hero';

export default new Function(
{
defaultHeroOptions: {
locale: 'en-GB,en',
},
async run(databox) {
const { hero, input } = databox;
await hero.goto(input.url);
// expect en-GB
const locale = await hero.getJsValue('navigator.language');
},
},
HeroFunctionPlugin,
);
```
- run: `function`(functionContext): `Promise<any>`. Adds Hero and HeroReplay constructors to the run function as per [above](#run-hero).
44 changes: 11 additions & 33 deletions databox/docs/advanced/plugins.md
@@ -38,25 +38,22 @@ The following method is called during Databox Function setup:

Called when a Databox Function instance starts execution. This function gives you access to the Function lifecycle.

A plugin can manipulate the lifecycle [FunctionContext](../basics/function-context.md) of each phase of a Function (`beforeRun`, `run` and `afterRun`). For instance, the [Hero plugin](./hero-plugin.md) initializes and adds a [Hero](https://ulixee.org/docs/hero/basic-client/hero) instance to the `run` context and a [HeroReplay](https://ulixee.org/docs/hero/basic-client/hero-replay) instance to the `afterRun` callback.

The lifecycle object passed in will indicate if a Function has defined a callback for each phase by marking the phase as `isEnabled`. Each plugin can choose to activate or deactivate a phase, so long as the Function has a callback to run.
A plugin can enhance the [FunctionContext](../basics/function-context.md) passed to a Function's `run` callback. For instance, the [Hero plugin](./hero-plugin.md) adds a [Hero](https://ulixee.org/docs/hero/basic-client/hero) and a [HeroReplay](https://ulixee.org/docs/hero/basic-client/hero-replay) constructor that automatically connect to the local Core.

A plugin _MUST_ call the `next()` callback provided. This callback will allow all other plugins to run to their `next()` callbacks. At that point, the Function will execute all phases. The output will then be returned to the waiting `next()` promise. At that point, each plugin will be allowed to complete the rest of its `run()` callback before the Databox Function will be closed. The flow is shown below:

```js
// 1. for each plugin, call run
for (const plugin of plugins) {
plugin.run(functionInternal, lifecycle, next);
plugin.run(functionInternal, context, next);
}

// 2. wait for every plugin "next" to be called
await waitForAllNextsCalled();

// 3. run Function phases
for (const phase of phases) {
if (phase.isEnabled) await func[phase](phase.context);
}
// 3. run Function `run`
func.run();

// 4. resolve nexts
resolveNexts(functionInternal.output);

@@ -71,19 +68,9 @@ class Plugin {
name = pkg.name;
version = pkg.version;

async run(functionInternal, lifecycle, next) {
async run(functionInternal, context, next) {
try {
// modify lifecycle enablement
lifecycle.run.isEnabled = this.isVariableSet();

// initialize context variables as needed
if (lifecycle.run.isEnabled) {
lifecycle.run.context.runVar = await this.getRunVar();
}

if (lifecycle.afterRun.isEnabled) {
lifecycle.afterRun.context.runVar = await this.getAfterRunVar();
}
context.Hero = createBoundHeroConstructor();
// wait for next to complete
const output = await next();
} finally {
@@ -100,25 +87,16 @@ class Plugin {
Arguments provided to the callback are as follows:
- `functionInternal`: An object providing the internal holder of the configuration of the Databox instance.
- `lifecycle`: An object to control the activation, and context variables of each Function phase (`run`, `beforeRun`, `afterRun`).
- beforeRun
- context `IBeforeContext`. The context that will be injected into the `beforeRun` callback.
- isEnabled `boolean`. Did the Function include a `beforeRun` callback, and is it still enabled.
- run
- context `IContext`. The context that will be injected into the `run` callback.
- isEnabled `boolean` Did the Function include a `run` callback, and is it still enabled.
- afterRun
- context `IAfterContext`. The context that will be injected into the `afterRun` callback.
- isEnabled `boolean` Did the Function include a `afterRun` callback, and is it still enabled.
- `context`: The Function Context object containing the state of the Function and Parameters.
- `next`: A callback that allows a plugin to wait for a Function to complete. It will resolve with the output of the Function.
#### Returns Promise<any>. The function may return any promise.
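
The `next()` handshake described for plugins can be simulated end to end. The following is a minimal sketch of that flow under stated assumptions: `runWithPlugins` is a hypothetical runner written for illustration and does not reflect the actual Databox internals; only the `run(functionInternal, context, next)` signature is taken from the documentation.

```js
// Hypothetical runner (not the real Databox code) mirroring the documented steps.
async function runWithPlugins(plugins, func, context, functionInternal = {}) {
  let nextsCalled = 0;
  let allNextsCalled;
  const waitForAllNexts = new Promise(resolve => (allNextsCalled = resolve));
  let resolveOutput;
  const outputPromise = new Promise(resolve => (resolveOutput = resolve));

  // Each plugin calls next() once and then waits on the shared output promise.
  const next = () => {
    nextsCalled += 1;
    if (nextsCalled === plugins.length) allNextsCalled();
    return outputPromise;
  };

  // 1. Start every plugin; each pauses at `await next()`.
  const pluginsDone = Promise.all(plugins.map(p => p.run(functionInternal, context, next)));
  if (plugins.length === 0) allNextsCalled();

  // 2. Wait for every next() call (or surface a plugin that failed before calling it).
  await Promise.race([waitForAllNexts, pluginsDone]);

  // 3. Run the Function with the enhanced context.
  const output = await func(context);

  // 4. Resolve every next() so plugins can finish their cleanup.
  resolveOutput(output);
  await pluginsDone;
  return output;
}

const events = [];
const examplePlugin = {
  async run(functionInternal, context, next) {
    context.greeting = 'hello'; // enhance the context before `run`
    events.push('plugin:before');
    const output = await next();
    events.push(`plugin:after:${output.title}`);
  },
};

const finished = runWithPlugins(
  [examplePlugin],
  async context => {
    events.push(`run:${context.greeting}`);
    return { title: 'Ulixee' };
  },
  {},
);
finished.then(() => console.log(events.join(' > ')));
// prints "plugin:before > run:hello > plugin:after:Ulixee"
```

A plugin that never calls `next()` would leave its `next` promise unresolved, which is why the documentation insists that the call is mandatory.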
## Typescript Support
Your plugin can be configured so that a Typescript developer using your plugin will receive typing support for:
- Additional configuration allowed in a Function constructor.
- Variables added onto the `run`, `beforeRun` and `afterRun` phases.
- Additional configuration enabled in `Function.exec`.
- Variables added onto the `run` callback.
- Additional configuration enabled in `Function.stream`.
If you implement the [FunctionPluginStatics](https://github.com/ulixee/platform/tree/main/databox/client/interfaces/IFunctionPluginStatics.ts), this typing will be activated by simply adding your plugin to a new Function `new Function(..., YourPlugin)`. The typing for these functions is somewhat complex. It's recommended to copy an existing plugin (`https://github.com/ulixee/platform/tree/main/databox/plugins`).
51 changes: 9 additions & 42 deletions databox/docs/advanced/puppeteer-plugin.md
@@ -11,12 +11,15 @@ import { Function, PuppeteerFunctionPlugin } from '@ulixee/databox-plugins-puppe
export default new Databox({
functions: {
pupp: new Function(async ctx => {
const { input, output, browser } = ctx;
const { input, Output, launchBrowser } = ctx;

const browser = await launchBrowser();
const page = await browser.newPage();
await page.goto(`https://en.wikipedia.org/wiki/${input.pageSlug || 'Web_scraping'}`);
output.title = await page.evaluate(() => {
return document.querySelector('#firstHeading').textContent;
Output.emit({
title: await page.evaluate(() => {
return document.querySelector('#firstHeading').textContent;
}),
});
}, PuppeteerFunctionPlugin),
},
@@ -29,45 +32,9 @@ The PuppeteerFunctionPlugin adds a single property to the [FunctionContext](../b

### run _(functionContext)_ {#run-hero}

- functionContext.browser `Puppeteer`. Readonly access to a pre-initialize [Puppeteer](https://pptr.dev/api) instance.

## Changes to Function Components

PuppeteerFunctionPlugin adds an optional parameter to the Function Components [object](../basics/function#constructor)) to configure Puppeteer options.

### new Function _(runCallback | functionComponents)_ {#constructor}

#### **Added Arguments**:
- defaultPuppeteerOptions [LaunchOptions](https://pptr.dev/api/puppeteer.launchoptions). Configure the [Puppeteer](https://pptr.dev/api) instance with [LaunchOptions](https://pptr.dev/api/puppeteer.launchoptions).

```js
import Databox from '@ulixee/databox';
import { Function, PuppeteerFunctionPlugin } from '@ulixee/databox-plugins-puppeteer';

export default new Databox({
functions: {
pupp: new Function(
{
async run(ctx) {
const { input, output, browser } = ctx;

const page = await browser.newPage();
await page.goto(`https://en.wikipedia.org/wiki/${input.pageSlug || 'Web_scraping'}`);
output.title = await page.evaluate(() => {
return document.querySelector('#firstHeading').textContent;
});
},
defaultPuppeteerOptions: {
timeout: 60e3,
},
},
PuppeteerFunctionPlugin,
),
},
});
```
- functionContext.launchBrowser: () => Promise<`Puppeteer`>. Function to launch a new [Puppeteer](https://pptr.dev/api) Browser instance.

### Function.exec(... puppeteerLaunchArgs)
### Function.stream(... puppeteerLaunchArgs)

Configure the [Puppeteer](https://pptr.dev/api) instance with [LaunchOptions](https://pptr.dev/api/puppeteer.launchoptions).

@@ -85,5 +52,5 @@ const databox = new Databox({
}, PuppeteerFunctionPlugin),
},
});
await databox.functions.pupp.exec({ waitForInitialPage: false });
await databox.functions.pupp.stream({ waitForInitialPage: false });
```
