Website-scraper on Windows 11 throws error while scraping an academic archive site. #498

Closed
rrgerber opened this issue Jun 27, 2022 · 3 comments

@rrgerber

Configuration

version: [result of npm ls website-scraper --depth 0 command]

`-- website-scraper@5.2.0

options: [provide your full options object]

var datadir = './scraped';
var url = 'https://www.isca-speech.org/archive';
  var options = {
    urls: [url],
    directory: datadir,
    defaultFilename: "index.html",
    ignoreErrors: false,
    recursive: true,
    maxRecursiveDepth: 2,
  };

scrape(options, function (err, t)
  {
  });

Description

When I try to scrape https://www.isca-speech.org/archive, an error is thrown from the "request" dependency early in the run, and the program dies.

Expected behavior: [What you expect to happen]

I expect at least the top-level HTML files to be saved to my filesystem.

Actual behavior: [What actually happens]

Here is what actually happens:

$ node --max-old-space-size=24000 main.js

/home/rich/node_modules/request/request.js:1147
response.body = strings.join('')
^

RangeError: Invalid string length
at Array.join (native)
at Request.<anonymous> (/home/rich/node_modules/request/request.js:1147:31)
at Request.emit (events.js:198:13)
at IncomingMessage.<anonymous> (/home/rich/node_modules/request/request.js:1083:12)
at Object.onceWrapper (events.js:286:20)
at IncomingMessage.emit (events.js:203:15)
at endReadableNT (_stream_readable.js:1129:12)
at process._tickCallback (internal/process/next_tick.js:63:19)

Additional Information

[Any additional information, configuration or data that might be necessary to reproduce the issue]

The behavior is at least repeatable on Windows 11.

@phawxby
Contributor

phawxby commented Jul 1, 2022

@rrgerber can you provide a reproduction repo including the config and, importantly, the lockfile? The request package is not a dependency of this project, so there is no reason it should be getting called.

paul@Pauls-MacBook-Pro node-website-scraper % npm ls request
website-scraper@5.2.0 /Users/paul/Documents/GitHub/node-website-scraper
└── (empty)

@s0ph1e
Member

s0ph1e commented Jul 6, 2022

Hi @rrgerber

In addition to @phawxby's suggestion, you can try taking a look at the logs.
Please also note that callbacks have not been supported since version 5.0.0; please use async/await or promises instead (see the usage examples).
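
For illustration, here is a rough sketch of how the options from the report might be run with async/await instead of a callback. It assumes website-scraper 5.x is installed, that it is loaded as an ES module via a dynamic import(), and that the resolved value is an array of saved resources. The helper names buildOptions, scrapeSite and main are hypothetical and only exist for this example, and enabling logs through the DEBUG variable is likewise an assumption about how the package's logging is switched on.

```javascript
// Options copied from the original report.
function buildOptions(url, directory) {
  return {
    urls: [url],
    directory,
    defaultFilename: 'index.html',
    ignoreErrors: false,
    recursive: true,
    maxRecursiveDepth: 2,
  };
}

// Hypothetical helper: `scrape` is passed in so it can be swapped for a stub.
// Awaits the promise returned by scrape() rather than passing a callback.
async function scrapeSite(scrape, url, directory) {
  const resources = await scrape(buildOptions(url, directory));
  return resources.length;
}

async function main() {
  // Assumption: logging is enabled through the DEBUG environment variable,
  // which has to be set before the package is loaded.
  process.env.DEBUG = process.env.DEBUG || 'website-scraper*';

  // Load the package with a dynamic import (assumes version 5.x is installed).
  const { default: scrape } = await import('website-scraper');

  try {
    const count = await scrapeSite(
      scrape,
      'https://www.isca-speech.org/archive',
      './scraped'
    );
    console.log(`Saved ${count} top-level resource(s)`);
  } catch (err) {
    // With ignoreErrors: false, a failed request rejects the promise.
    console.error('Scrape failed:', err);
    process.exitCode = 1;
  }
}

// Uncomment to run against the real site:
// main();
```

If the stack trace still points at request/request.js after switching to this style, that would suggest an older copy of the scraper is being resolved from a parent node_modules directory, which is what the lockfile request above is meant to confirm.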

@rrgerber
Author

rrgerber commented Jul 6, 2022

OK, I will check my code. Thanks to both of you for your responses.

@rrgerber rrgerber closed this as completed Jul 6, 2022