Abort crawling

I have looked at #293 and #289, but those issues are slightly different. We have a crawler library based on `node-crawler` that performs computationally intensive crawling tasks and writes to different output sources, depending on the current task.  
Due to the nature of our tasks, it is crucial that crawlers can be interrupted, resume at the same point later on, and still write consistent output files.  
Think of a JSON output file, for example, to which an array of objects is written, one object per URL. If the crawler stops, a final closing bracket must be written to the output file to ensure the file is valid JSON.
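
To make that requirement concrete, here is a minimal sketch of such a writer. The `JsonArrayWriter` class and its in-memory sink are hypothetical names made up for illustration; they are neither part of `node-crawler` nor our actual implementation, which writes to real output targets.

```typescript
// Hypothetical helper: streams records into a JSON array, one object per URL.
type Sink = (chunk: string) => void;

class JsonArrayWriter {
    private readonly sink: Sink;
    private count = 0;
    private closed = false;

    constructor(sink: Sink) {
        this.sink = sink;
        this.sink("["); // open the array right away
    }

    // Append one record; a comma separates it from the previous one.
    append(record: unknown): void {
        if (this.closed) {
            throw new Error("writer is already closed");
        }
        this.sink((this.count > 0 ? "," : "") + JSON.stringify(record));
        this.count += 1;
    }

    // Write the closing bracket exactly once, even if called repeatedly.
    close(): void {
        if (!this.closed) {
            this.closed = true;
            this.sink("]");
        }
    }
}

// Usage: collect chunks in memory; a real crawler would pass a file stream's write function.
const chunks: string[] = [];
const writer = new JsonArrayWriter((chunk) => chunks.push(chunk));
writer.append({ url: "https://example.com/a" });
writer.append({ url: "https://example.com/b" });
writer.close(); // must run on normal completion and on abort alike
const output = chunks.join("");
```

The important part is that `close()` has to run no matter how the crawl ends, which is why we need a reliable abort path.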

We achieved this using the `preRequest` hook and a flag:
```ts
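import Crawler from 'crawler';

let aborted = false;

const crawler = new Crawler({
    // Runs before each request is sent.
    // `cleanup()` is our own helper, not part of node-crawler.
    async preRequest(options, done): Promise<void> {
        if (!aborted) {
            await cleanup() && done();
        }
    },
    // Runs once a response (or an error) arrives.
    async callback(error, response, done): Promise<void> {
        if (error || aborted) {
            await cleanup() && done();
        }

        // ...
    },
});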
```

This works, but it's not optimal: requests may still be queued while we're in an aborted state, depending on the implementation. Additionally, there's no way to abort in-flight requests.

To tackle this issue, I'd like to suggest implementing support for the [`AbortController` API](https://nodejs.org/api/globals.html#globals_class_abortcontroller), which can be used in browsers to abort ongoing fetch requests and has been implemented in recent Node.js versions ([with a polyfill available, too](https://www.npmjs.com/package/node-abort-controller)). Implementation-wise, one could steal from [`node-fetch`](https://www.npmjs.com/package/node-fetch):
```js
// Wrap http.request into fetch
const send       = ( options.protocol === 'https:' ? https : http ).request;
const { signal } = request;
let response     = null;

const abort = () => {
    const error = new AbortError( 'The operation was aborted.' );
    reject( error );
    if ( request.body && request.body instanceof Stream.Readable ) {
        request.body.destroy( error );
    }

    if ( !response || !response.body ) {
        return;
    }

    response.body.emit( 'error', error );
};

if ( signal && signal.aborted ) {
    abort();
    return;
}
```
([See full source](https://github.com/node-fetch/node-fetch/blob/ffef5e3c2322e8493dd75120b1123b01b106ab23/src/index.js#L50-L96))
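
To illustrate the behaviour I have in mind, here is a rough sketch of a queue that honours an `AbortSignal`. This is not `node-crawler`'s API: `runQueue` and `fakeRequest` are made-up names, the queue is strictly sequential, and the "request" is just a timer standing in for an HTTP call. It assumes a Node.js version with a global `AbortController`.

```typescript
// Hypothetical sketch: a task receives the signal so it can cancel itself.
type Task<T> = (signal: AbortSignal) => Promise<T>;

async function runQueue<T>(tasks: Task<T>[], signal: AbortSignal): Promise<T[]> {
    const results: T[] = [];
    for (const task of tasks) {
        // Stop dequeuing as soon as the signal has fired.
        if (signal.aborted) break;
        try {
            results.push(await task(signal));
        } catch (error) {
            if (signal.aborted) break; // the in-flight task was cancelled
            throw error;               // any other failure is a real error
        }
    }
    return results;
}

// Stand-in for an HTTP request: resolves with `url` after `ms`, rejects on abort.
function fakeRequest(url: string, ms: number): Task<string> {
    return (signal) =>
        new Promise<string>((resolve, reject) => {
            if (signal.aborted) {
                reject(new Error("The operation was aborted."));
                return;
            }
            const onAbort = () => {
                clearTimeout(timer);
                reject(new Error("The operation was aborted."));
            };
            const timer = setTimeout(() => {
                signal.removeEventListener("abort", onAbort);
                resolve(url);
            }, ms);
            signal.addEventListener("abort", onAbort, { once: true });
        });
}

// Usage: "a" finishes, "b" is aborted in flight, "c" is never started.
const controller = new AbortController();
const pending = runQueue(
    [fakeRequest("a", 10), fakeRequest("b", 1000), fakeRequest("c", 10)],
    controller.signal,
);
setTimeout(() => controller.abort(), 50);
```

Once `pending` settles, the caller knows that nothing is in flight anymore and can safely finalize its output files.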

I'm happy to help with this.

Abort crawling #380
