Description
Summary:
While crawling millions of domains I hit an issue where a request to a streaming domain can permanently block the queue.
The timeout never fires in this case, and RAM leaks continuously.
I found two such domains: https://goldfm.nu/ and https://rsradio.online/.
It's a really nice radio station 😄 but it completely blocks my crawler.
Current behavior
I'm setting a timeout, but it doesn't seem to work correctly; the callback is never fired in this case:
const Crawler = require('crawler')

const _crawler = new Crawler({
    timeout: 9000,
    retries: 1,
    retryTimeout: 1000,
    debug: true,
    callback: (error, res, done) => {
        // ... process the response
        done()
    }
})

_crawler.queue([{ uri: 'https://goldfm.nu' }])
Issue
This is definitely because the request starts a media stream and node-crawler tries to fetch all of it, so the request stays in a pending state forever.
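A minimal sketch with plain Node.js https (no node-crawler; the exact stream behavior of these hosts is my assumption) shows why a response-based timeout has nothing to fire on: the headers arrive almost instantly, only the body never ends.

const https = require('https')

// The server answers immediately with headers; the body is an endless
// audio stream, so 'end' is never emitted and the request stays pending.
https.get('https://goldfm.nu/', (res) => {
  console.log('status:', res.statusCode)                    // arrives right away
  console.log('content-type:', res.headers['content-type'])

  let received = 0
  res.on('data', (chunk) => {
    received += chunk.length                                // grows without bound
    console.log(`received ${received} bytes so far`)
  })
  res.on('end', () => console.log('end'))                   // never reached
})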
Side issues
As the stream keeps arriving, RAM usage grows and will eventually lead to an 'out of memory' exception.
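As a defensive workaround (a sketch with plain Node.js; as far as I can tell node-crawler does not expose such an option), the body can be capped and the response destroyed once it grows past a limit:

const https = require('https')

const MAX_BODY_BYTES = 5 * 1024 * 1024 // 5 MB, an arbitrary limit

https.get('https://goldfm.nu/', (res) => {
  const chunks = []
  let received = 0

  res.on('data', (chunk) => {
    received += chunk.length
    if (received > MAX_BODY_BYTES) {
      // Drop the stream before it can exhaust RAM.
      res.destroy(new Error('body exceeded byte limit'))
      return
    }
    chunks.push(chunk)
  })
  res.on('error', (err) => console.error('aborted:', err.message))
  res.on('end', () => console.log('done,', Buffer.concat(chunks).length, 'bytes'))
})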
Attempts to fix
I also tried setting the Accept header to HTML only, but it has no effect:
headers: { Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9' },
Currently I just skip this URL as a special case, but I suspect it isn't unique; a more general check is sketched below.
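One way to generalize the special case (again a plain-Node sketch, not node-crawler API): inspect the Content-Type as soon as the headers arrive and abort anything that isn't HTML, instead of keeping a hard-coded blocklist.

const https = require('https')

const req = https.get('https://goldfm.nu/', (res) => {
  const type = res.headers['content-type'] || ''
  if (!type.includes('text/html')) {
    console.log('skipping non-HTML response:', type)
    res.destroy() // drop the stream before it starts consuming RAM
    return
  }
  // ... hand the HTML response to the normal crawling pipeline
})
req.on('error', (err) => console.error(err.message))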
Expected behavior
The timeout should fire an error when the full response has not been received within the allotted time.
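In other words, a wall-clock deadline for the whole transfer. A hedged sketch of that behavior with setTimeout plus destroy (getWithDeadline is an illustrative helper of mine, not node-crawler's API):

const https = require('https')

// Illustrative helper: enforce a wall-clock deadline on the whole
// transfer, so even a live stream errors out after timeoutMs.
function getWithDeadline (url, timeoutMs, cb) {
  let settled = false
  const finish = (err, body) => {
    if (settled) return
    settled = true
    clearTimeout(deadline)
    cb(err, body)
  }

  const req = https.get(url, (res) => {
    const chunks = []
    res.on('data', (chunk) => chunks.push(chunk))
    res.on('end', () => finish(null, Buffer.concat(chunks)))
  })
  const deadline = setTimeout(() => {
    req.destroy(new Error(`deadline of ${timeoutMs} ms exceeded`))
  }, timeoutMs)
  req.on('error', (err) => finish(err))
}

getWithDeadline('https://goldfm.nu/', 9000, (err, body) => {
  if (err) return console.error('failed:', err.message)
  console.log('got', body.length, 'bytes')
})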
Related issues
This issue is definitely related to the request package.
- the same behavior is described here: Timeout ignored for json request and streaming body request/request#3341
Question
Do you have any ideas on how to resolve this case?