yusuftaufiq/cli-website-crawler

Non-blocking, CLI-based application to recursively crawl entire pages on websites in parallel and save the results as HTML output. Built with Node.js, TypeScript, NestJS, and Playwright.

Table of Contents

  • Description
  • Overview
    • How it works
    • Technical details
  • Installation
  • Usage
  • TODO

Description

Non-blocking, CLI-based application that recursively crawls entire pages on websites in parallel and saves the results as HTML output.

Overview

How it works

  • Open pages using Playwright.
  • On each page, find new links (HTML a elements).
  • Keep only links that point to the same domain and are allowed by robots.txt.
  • Add those links to the request queue.
  • Skip duplicate URLs.
  • Visit the newly queued links.
  • Repeat the process.
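The ./storage/key_value_stores output path suggests the crawler is built on Crawlee's PlaywrightCrawler. The sketch below illustrates the loop above under that assumption; it is not the project's actual source, and the key-derivation line in particular is hypothetical:

    import { PlaywrightCrawler, EnqueueStrategy, KeyValueStore } from 'crawlee';

    const crawler = new PlaywrightCrawler({
      maxConcurrency: 15,            // --max-concurrency
      maxRequestsPerCrawl: 50,       // --max-requests
      requestHandlerTimeoutSecs: 30, // --timeout
      headless: true,                // --headful would flip this
      async requestHandler({ request, page, enqueueLinks }) {
        // Save the rendered HTML under a key derived from the URL
        // (store keys may only contain a-zA-Z0-9!-_.'() characters).
        const key = request.url.replace(/[^a-zA-Z0-9!\-_.'()]/g, '_');
        await KeyValueStore.setValue(key, await page.content(), { contentType: 'text/html' });
        // Queue <a> links on the same domain; the request queue
        // de-duplicates URLs by itself. robots.txt filtering is omitted here.
        await enqueueLinks({ strategy: EnqueueStrategy.SameDomain });
      },
    });

    await crawler.run(['https://books.toscrape.com/']);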

Technical details

Installation

  • Requirements

    • Node.js >= 18
  • Clone this repository

    $ git clone https://github.com/yusuftaufiq/cmlabs-backend-crawler-freelance-test.git
  • Change to the cloned directory and install all required dependencies (may take a while)

    $ npm install
  • Build the application

    $ npm run build
  • Start the CLI application; all available features are described in the following section

    $ npm run start:prod -- crawl

    If the command completes successfully, all results will be available in ./storage/key_value_stores
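    Assuming Crawlee's default storage conventions (an assumption about the implementation, as above), each crawled page would be written to the default key-value store as an .html file; the file names below are purely illustrative:

    storage/
    └── key_value_stores/
        └── default/
            ├── https___books_toscrape_com_.html
            └── https___quotes_toscrape_com_.html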

Usage

  • Show all available commands
    $ npm run start:prod -- --help
    $ npm run start:prod -- crawl --help
  • Customize targets to be crawled. (default: https://cmlabs.co/ https://www.sequence.day/ https://yusuftaufiq.com)
    $ npm run start:prod -- crawl https://books.toscrape.com/ https://quotes.toscrape.com/
  • Control the verbosity of log messages (choices: "off", "error", "soft_fail", "warning", "info", "debug", "perf", default: "info")
    $ npm run start:prod -- crawl --log-level warning
  • Set the maximum concurrency (parallelism) for the crawl (default: 15)
    $ npm run start:prod -- crawl --max-concurrency 100
  • Limit the maximum number of pages the crawler will open; the crawl stops when this limit is reached. (default: 50)
    $ npm run start:prod -- crawl --max-requests 1000
  • Set the timeout, in seconds, within which each request handler must complete. (default: 30)
    $ npm run start:prod -- crawl --timeout 10
  • Run the browser in headful mode. (default: false)
    $ npm run start:prod -- crawl --headful
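Since the CLI is built with NestJS, option parsing like the above is commonly wired up with nest-commander. The following is a minimal sketch of how the crawl command could be declared under that assumption; it is not the project's actual code:

    import { Command, CommandRunner, Option } from 'nest-commander';

    interface CrawlOptions {
      maxConcurrency: number;
      maxRequests: number;
    }

    @Command({
      name: 'crawl',
      arguments: '[urls...]',
      description: 'Recursively crawl websites and save each page as HTML',
    })
    export class CrawlCommand extends CommandRunner {
      // Positional arguments arrive as `urls`; parsed flags arrive as `options`.
      async run(urls: string[], options: CrawlOptions): Promise<void> {
        // Hand the URLs and options to the crawler service here.
      }

      @Option({ flags: '--max-concurrency <number>', description: 'Maximum parallelism', defaultValue: 15 })
      parseMaxConcurrency(value: string): number {
        return Number.parseInt(value, 10);
      }

      @Option({ flags: '--max-requests <number>', description: 'Maximum pages to open', defaultValue: 50 })
      parseMaxRequests(value: string): number {
        return Number.parseInt(value, 10);
      }
    }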

TODO

  • Prioritize sitemap.xml
  • Add proxy support
  • Watch out for honeypots
  • Adopt a CAPTCHA-solving service
