feat(datastore): shorten version hash length
blakebyrnes committed Jan 16, 2023
1 parent e56a24d commit 0e36ead
Showing 19 changed files with 196 additions and 80 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -7,7 +7,7 @@ This repository is the development home to several of the tools that make it eas
## Projects

- Hero `/hero`. The Automated Browser Engine built for scraping. (repository home - https://github.com/ulixee/hero).
- Datastore `/datastore`. Discrete, composable units for interconnected data extraction scripts.
- Datastore `/datastore`. Packaged "database" containing API access to crawler functions and data extraction functions.
- Miner `/miner`. Run Ulixee tooling on a remote machine.
- Stream `/stream`. Query, transform and compose Datastores running on any machine.
- ChromeAlive! `/apps/chromealive*`. Supercharge scraper script development using the Chrome browser.
90 changes: 47 additions & 43 deletions datastore/README.md
@@ -1,68 +1,72 @@
# Datastore
# Introduction

Datastore is a simple wrapper for your scraper script that converts it into a discrete, composable, and deployable unit.
> Datastores are deployable "databases" that have functions and tables, support native payment, and can be cloned and expanded as you see fit. Datastore Functions contain data retrieval functions, like Hero scraper scripts, with structured input and output. Deploying a Datastore provides you with a structured Data API that can be privately consumed, or sold "out of the box".
- [x] **Production Proof Your Script** - Production proof your script a thousand different ways.
- [x] **Breaking Notifications** - Get notified when your scripts break.
- [x] **Runs Anywhere** - Containerize your scripts to run everywhere.
- [x] **Works with Chrome Alive!** - Progressively build your scripts with Chrome Alive!
- [x] **Easy Management** - Manage your datastores like a boss.
## What is a Datastore?

## Installation
Datastores create databases of specialized data structures - for instance, a Travel database, or a Jobs database. They're a combination of static metadata tables, dynamically retrieved web data, and cached aggregated data that make up a data category. They support payment out of the box, so a client can pay per query without any setup or contracts. Datastores also support PostgreSQL natively (including payment), so they can be tested out and integrated across programming languages.

```shell script
npm install @ulixee/datastore-plugins-hero
```
## Datastore Functions

or
Datastore Functions create structure -- boundaries -- around a single "scrape", which makes your scripts far easier to test, retry, scale, and compose. This allows us to do things like:

```shell script
yarn add @ulixee/datastore-plugins-hero
```
- Restart a script during development as you change it.
- Rotate inputs to try out a variety of IPs, parameters, and more to make sure you can handle edge cases.
- Test the extraction of 100s of different potential results pages and ensure your Output follows the same structure.
- Spawn new Functions from the current one if you need to parallelize following links.
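
The restart/rotate benefits listed above can be sketched in plain JavaScript (a hypothetical helper for illustration, not part of the Datastore API):

```js
// Hypothetical sketch: run one extraction "Function" with retries,
// rotating through candidate inputs (e.g. proxies or parameters).
async function runWithRetries(extract, inputs, maxAttempts = 3) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    const input = inputs[attempt % inputs.length]; // rotate per attempt
    try {
      return await extract(input);
    } catch (error) {
      lastError = error;
    }
  }
  throw lastError;
}
```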

## Usage
## Datastore Crawlers

Wrapping your script in a Datastore gives it instant access to the input and output objects, along with a Hero instance:
Datastore Crawlers allow you to write specialized Functions that only output a "cached" scrape. They come with built-in caching, so you can automatically reuse results that have been recently recorded.
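
The built-in caching described here can be approximated with a recency map (illustrative only; the real Crawler API is not shown in this diff):

```js
// Illustrative recency cache: re-run the crawl only when the cached
// result for the same input is older than maxAgeMs.
function createCachingCrawler(crawl, maxAgeMs) {
  const cache = new Map();
  return async input => {
    const key = JSON.stringify(input);
    const hit = cache.get(key);
    if (hit && Date.now() - hit.savedAt <= maxAgeMs) return hit.result;
    const result = await crawl(input);
    cache.set(key, { result, savedAt: Date.now() });
    return result;
  };
}
```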

script.ts
## Datastore Tables

```js
const { Function, HeroFunctionPlugin } = require('@ulixee/datastore-plugins-hero');

new Function(async context => {
const { input, Output, Hero } = context;
const hero = new Hero();
await hero.goto('https://example.org');
Output.emit({ text: `I went to example.org. Your input was: ${input.name}` });
}, HeroFunctionPlugin);
```
Datastore Tables allow you to manage and deploy database tables as part of your "api". This can be useful to enhance your functions with metadata or cached data.
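
The idea of enhancing a function's output with table metadata can be sketched as a plain lookup (hypothetical data, for illustration only):

```js
// Hypothetical metadata "table" joined against a function's output record.
const airports = [
  { code: 'SFO', city: 'San Francisco' },
  { code: 'JFK', city: 'New York' },
];

function enrichWithCity(record, table = airports) {
  const row = table.find(r => r.code === record.airportCode);
  return { ...record, city: row ? row.city : null };
}
```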

## How Datastores Work

Each Datastore is a wrapper for defining a composable scraper script. You can run datastores directly from the command line or upload them to a [Miner](https://ulixee.org/docs/miner).

You can call your script in several ways.
## Installation

1. Directly from the command line:
To get started using Datastore in your project, use the following commands:

```shell script
% node script.js --input.name=Alfonso
```bash
npm i --save @ulixee/datastore
```

2. Through Stream:
or

**COMING SOON**
```bash
yarn add @ulixee/datastore
```

```js
import Stream from '@ulixee/stream';
It's your responsibility to ensure your Ulixee development environment is set up, such as installing and running [`@ulixee/miner`](https://ulixee.org/docs/miner).

const stream = new Stream('');
## Usage Example

const output = await stream.query({ input: { name: 'Alfonso' } });
The simplest Datastore is initialized with a single Function:

```js
export default new Datastore({
functions: {
default: new Function(ctx => {
ctx.output = `Hello ${ctx.input.firstName}`;
}),
},
});
```

Browse the [full API docs](https://docs.ulixee.org/datastore).
Save that script to your filesystem (e.g., simple.js), and run it as a regular node script:

## Contributing
```bash
node ./simple.js --input.firstName=Me
```
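
The `--input.*` flags map command-line arguments onto the Function's input object; a rough sketch of that parsing (illustrative, not the actual Datastore CLI code):

```js
// Illustrative: parse ['--input.firstName=Me'] into { firstName: 'Me' }.
function parseInputArgs(argv) {
  const input = {};
  for (const arg of argv) {
    const match = /^--input\.([^=]+)=(.*)$/.exec(arg);
    if (match) input[match[1]] = match[2];
  }
  return input;
}
```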

We'd love your help making `Datastore for Hero` a better tool. Please don't hesitate to send a pull request.
However, this Datastore structure also allows us to load it onto a Miner and run it on demand:

## License
```bash
npx @ulixee/datastore deploy ./simple.js
npx @ulixee/datastore run simple.js --input.firstName=Me

[MIT](LICENSE.md)
```
8 changes: 5 additions & 3 deletions datastore/client/cli/index.ts
@@ -13,14 +13,14 @@ export default function datastoreCommands(): Command {
const identityPrivateKeyPathOption = cli
.createOption(
'-i, --identity-path <path>',
'A path to a Ulixee Identity. Necessary for signing if a Miner has restricted allowed Uploaders.',
'A path to an Admin Identity. Necessary for actions restricting access to Admins of a Datastore.',
)
.env('ULX_IDENTITY_PATH');

const identityPrivateKeyPassphraseOption = cli
.createOption(
'-p, --identity-passphrase <path>',
'A decryption passphrase to the Ulixee identity (only necessary if specified during key creation).',
'A decryption passphrase to the Ulixee Admin Identity (only necessary if specified during key creation).',
)
.env('ULX_IDENTITY_PASSPHRASE');

@@ -164,9 +164,11 @@ export default function datastoreCommands(): Command {

cli
.command('upload')
.description('Upload a Datastore package to a miner.')
.description('Upload a Datastore package to a Miner.')
.argument('<dbxPath>', 'The path to the .dbx package.')
.addOption(uploadHostOption)
.addOption(identityPrivateKeyPathOption)
.addOption(identityPrivateKeyPassphraseOption)
.option(
'-a, --allow-new-version-history',
'Allow uploaded Datastore to create a new version history for the script entrypoint.',
27 changes: 27 additions & 0 deletions datastore/client/lib/CreditsTable.ts
@@ -10,6 +10,7 @@ const postgresFriendlyNanoid = customAlphabet(

export default class CreditsTable extends Table<typeof CreditsSchema> {
public static tableName = 'ulx_credits';

constructor() {
super({
name: CreditsTable.tableName,
@@ -24,6 +25,17 @@ export default class CreditsTable extends Table<typeof CreditsSchema> {
microgons: number,
secret?: string,
): Promise<{ id: string; secret: string; remainingCredits: number }> {
const upstreamBalance = await this.getUpstreamCreditLimit();
if (upstreamBalance !== undefined) {
const allocated = (await this.query<{ total: number }>(
'SELECT SUM(issuedCredits) as total FROM self',
)) as any;
if (allocated.total + microgons > upstreamBalance) {
throw new Error(
`This credit amount would exceed the balance of the embedded Credits, which would make use of the Credits unstable. Please increase the limit of the credits.`,
);
}
}
const salt = nanoid(16);
const id = `cred${postgresFriendlyNanoid(8)}`;
secret ??= postgresFriendlyNanoid(12);
@@ -74,6 +86,21 @@ export default class CreditsTable extends Table<typeof CreditsSchema> {
if (result === undefined) throw new Error('Could not finalize payment for the given Credits.');
return result.balance;
}

private async getUpstreamCreditLimit(): Promise<number | undefined> {
const embeddedCredits = this.datastoreInternal.metadata.remoteDatastoreEmbeddedCredits ?? {};
if (!Object.keys(embeddedCredits).length) return undefined;
let issuedCredits = 0;
for (const [source, credit] of Object.entries(embeddedCredits)) {
const url = this.datastoreInternal.metadata.remoteDatastores[source];
      if (!url) continue;
const client = this.datastoreInternal.createApiClient(url);
const datastoreHash = url.split('/').pop();
const balance = await client.getCreditsBalance(datastoreHash, credit.id);
if (Number.isInteger(balance.issuedCredits)) issuedCredits += balance.issuedCredits;
}
return issuedCredits;
}
}

export const CreditsSchema = {
17 changes: 10 additions & 7 deletions datastore/client/lib/Datastore.ts
@@ -1,6 +1,6 @@
import IDatastoreManifest from '@ulixee/specification/types/IDatastoreManifest';
import { SqlParser } from '@ulixee/sql-engine';
import { ExtractSchemaType } from '@ulixee/schema';
import { ExtractSchemaType, ISchemaAny } from '@ulixee/schema';
import ConnectionToDatastoreCore from '../connections/ConnectionToDatastoreCore';
import IDatastoreComponents, {
TCrawlers,
@@ -11,6 +11,7 @@ import DatastoreInternal from './DatastoreInternal';
import IDatastoreMetadata from '../interfaces/IDatastoreMetadata';
import ResultIterable from './ResultIterable';
import ICrawlerOutputSchema from '../interfaces/ICrawlerOutputSchema';
import DatastoreApiClient from './DatastoreApiClient';

export default class Datastore<
TTable extends TTables = TTables,
@@ -91,24 +92,24 @@ export default class Datastore<
const datastoreVersionHash = this.#datastoreInternal.manifest?.versionHash;

const sqlParser = new SqlParser(sql);
const inputSchemas = Object.keys(this.#datastoreInternal.functions).reduce((obj, k) => {
return Object.assign(obj, { [k]: this.#datastoreInternal.functions[k].schema.input });
}, {});
const inputSchemas: { [functionName: string]: ISchemaAny } = {};
for (const [key, func] of Object.entries(this.functions)) {
if (func.schema) inputSchemas[key] = func.schema.input;
}
const inputByFunctionName = sqlParser.extractFunctionInputs(inputSchemas, boundValues);
const outputByFunctionName: { [name: string]: any[] } = {};

for (const functionName of Object.keys(inputByFunctionName)) {
const input = inputByFunctionName[functionName];
const output = await this.#datastoreInternal.functions[functionName].stream({ input });
outputByFunctionName[functionName] = Array.isArray(output) ? output : [output];
outputByFunctionName[functionName] = await this.functions[functionName].stream({ input });
}

const recordsByVirtualTableName: { [name: string]: Record<string, any>[] } = {};
for (const tableName of sqlParser.tableNames) {
if (!this.#datastoreInternal.metadata.tablesByName[tableName].remoteSource) continue;

const sqlInputs = sqlParser.extractTableQuery(tableName, boundValues);
recordsByVirtualTableName[tableName] = await this.#datastoreInternal.tables[tableName].query(
recordsByVirtualTableName[tableName] = await this.tables[tableName].query(
sqlInputs.sql,
sqlInputs.args,
);
@@ -132,8 +133,10 @@ export default class Datastore<
public addConnectionToDatastoreCore(
connectionToCore: ConnectionToDatastoreCore,
manifest?: IDatastoreManifest,
apiClientLoader?: (url: string) => DatastoreApiClient,
): void {
this.#datastoreInternal.manifest = manifest;
this.#datastoreInternal.connectionToCore = connectionToCore;
if (apiClientLoader) this.#datastoreInternal.createApiClient = apiClientLoader;
}
}
10 changes: 4 additions & 6 deletions datastore/client/lib/DatastoreApiClient.ts
@@ -30,10 +30,9 @@ export default class DatastoreApiClient {
protected activeIterableByStreamId = new Map<string, ResultIterable<any, any>>();

constructor(host: string) {
if (host.startsWith('ulx://')) {
host = `ws://${host.slice('ulx://'.length)}`;
}
const transport = new WsTransportToCore(`${host}/datastore`);
if (!host.includes('://')) host = `ulx://${host}`;
const url = new URL(host);
const transport = new WsTransportToCore(`ws://${url.host}/datastore`);
this.connectionToCore = new ConnectionToCore(transport);
this.connectionToCore.on('event', this.onEvent.bind(this));
}
@@ -214,7 +213,7 @@ export default class DatastoreApiClient {
public async getCreditsBalance(
datastoreVersionHash: string,
creditId: string,
): Promise<{ balance: number }> {
): Promise<IDatastoreApiTypes['Datastore.creditsBalance']['result']> {
return await this.runRemote('Datastore.creditsBalance', {
datastoreVersionHash,
creditId,
@@ -285,7 +284,6 @@ export default class DatastoreApiClient {
args = await DatastoreApiSchemas[command].args.parseAsync(args);
}
} catch (error) {
console.error(error);
throw ValidationError.fromZodValidation(
`The API parameters for ${command} have some issues`,
error,
5 changes: 5 additions & 0 deletions datastore/client/lib/DatastoreInternal.ts
@@ -17,6 +17,7 @@ import IDatastoreMetadata from '../interfaces/IDatastoreMetadata';
import type PassthroughFunction from './PassthroughFunction';
import PassthroughTable from './PassthroughTable';
import CreditsTable from './CreditsTable';
import DatastoreApiClient from './DatastoreApiClient';

const pkg = require('../package.json');

@@ -113,6 +114,10 @@ export default class DatastoreInternal<
}));
}

public createApiClient(host: string): DatastoreApiClient {
return new DatastoreApiClient(host);
}

public close(): Promise<void> {
return (this.#isClosingPromise ??= new Promise(async (resolve, reject) => {
try {
5 changes: 2 additions & 3 deletions datastore/client/lib/PassthroughFunction.ts
@@ -142,9 +142,8 @@ export default class PassthroughFunction<
assert(remoteDatastore, `A remote datastore source could not be found for ${remoteSource}`);

try {
const url = new URL(remoteDatastore);
this.datastoreVersionHash = url.pathname.slice(1);
this.upstreamClient = new DatastoreApiClient(url.host);
      this.datastoreVersionHash = remoteDatastore.split('/').pop();
      this.upstreamClient = this.datastoreInternal.createApiClient(remoteDatastore);
} catch (error) {
throw new Error(
'A valid url was not supplied for this remote datastore. Format should be ulx://<host>/<datastoreVersionHash>',
11 changes: 5 additions & 6 deletions datastore/client/lib/PassthroughTable.ts
@@ -43,14 +43,14 @@ export default class PassthroughTable<
this.remoteSource = source;
}

public override async query<T = TRecords>(
public override async query<T = TRecords[]>(
sql: string,
boundValues: any[] = [],
options: Omit<
IDatastoreApiTypes['Datastore.query']['args'],
'sql' | 'boundValues' | 'versionHash'
> = {},
): Promise<T[]> {
): Promise<T> {
this.createApiClient();
if (this.name !== this.remoteTable) {
const sqlParser = new SqlParser(sql, {}, { [this.name]: this.remoteTable });
@@ -60,7 +60,7 @@ export default class PassthroughTable<
boundValues,
...options,
});
return result.outputs;
return result.outputs as any;
}

protected createApiClient(): void {
@@ -72,9 +72,8 @@ export default class PassthroughTable<
assert(remoteDatastore, `A remote datastore source could not be found for ${remoteSource}`);

try {
const url = new URL(remoteDatastore);
this.datastoreVersionHash = url.pathname.slice(1);
this.upstreamClient = new DatastoreApiClient(url.host);
this.datastoreVersionHash = remoteDatastore.split('/').pop();
this.upstreamClient = this.datastoreInternal.createApiClient(remoteDatastore);
} catch (error) {
throw new Error(
'A valid url was not supplied for this remote datastore. Format should be ulx://<host>/<datastoreVersionHash>',
2 changes: 1 addition & 1 deletion datastore/client/lib/Table.ts
@@ -49,7 +49,7 @@ export default class Table<
return this.#datastoreInternal;
}

public async query<T = TRecords>(sql: string, boundValues: any[] = []): Promise<T[]> {
public async query<T = TRecords[]>(sql: string, boundValues: any[] = []): Promise<T> {
await this.datastoreInternal.ensureDatabaseExists();
const name = this.components.name;
const datastoreInstanceId = this.datastoreInternal.instanceId;
4 changes: 2 additions & 2 deletions datastore/client/test/types.test.ts
@@ -48,8 +48,8 @@ it('can install multiple schemas', async () => {
nothing: boolean;
};
}`;
const id1 = encodeBuffer(sha3('schema1'), 'dbx');
const id2 = encodeBuffer(sha3('schema2'), 'dbx');
const id1 = encodeBuffer(sha3('schema1'), 'dbx').substring(0, 22);
const id2 = encodeBuffer(sha3('schema2'), 'dbx').substring(0, 22);
installDatastoreSchema(schema1, id1);
installDatastoreSchema(schema2, id2);

2 changes: 1 addition & 1 deletion datastore/core/endpoints/Datastore.creditsBalance.ts
@@ -13,6 +13,6 @@ export default new DatastoreApiHandler('Datastore.creditsBalance', {
);
const datastore = await DatastoreVm.open(datastoreVersion.path, datastoreVersion);
const credits = await datastore.tables[CreditsTable.tableName].get(request.creditId);
return { balance: credits?.remainingCredits ?? 0 };
return { balance: credits?.remainingCredits ?? 0, issuedCredits: credits?.issuedCredits ?? 0 };
},
});
