
Some requests (stream-based) never end and block the queue #371

Open
@Verhov

Description


Summary:

While crawling millions of domains, I ran into an issue where some stream-based requests can permanently block the queue.
In this case the timeout never fires and RAM leaks constantly.
I found two such domains: https://goldfm.nu/ and https://rsradio.online/.
It's really nice radio 😄 but it totally blocks my crawler pattern))

Current behavior

I'm using a timeout, but it doesn't seem to work correctly; the callback is never fired in this case:

const Crawler = require('crawler');

const _crawler = new Crawler({
    timeout: 9000,
    retries: 1,
    retryTimeout: 1000,
    debug: true,
    callback: (error, res, done) => {
        // ... handle the response ...
        done();
    }
});

_crawler.queue([{ uri: 'https://goldfm.nu' }]);


Issue

This is definitely because the request starts a media stream and node-crawler tries to download all of it; the request stays in a pending state forever.
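
As far as I can tell, request's timeout only covers connecting, waiting for response headers, and socket inactivity; a live stream that keeps sending data never trips any of those. A minimal sketch of a hard whole-transfer deadline, using the request package directly (this bypasses node-crawler, so it illustrates the underlying behavior rather than being a drop-in fix):

const request = require('request');

// The built-in timeout only guards until data starts flowing.
const req = request({ uri: 'https://goldfm.nu', timeout: 9000 });

// Watchdog: hard deadline for the whole transfer, headers and body included.
const watchdog = setTimeout(() => req.abort(), 9000);

req.on('response', (res) => console.log('status:', res.statusCode));
req.on('abort', () => console.log('aborted: total deadline exceeded'));
req.on('end', () => clearTimeout(watchdog));
req.on('error', (err) => {
    clearTimeout(watchdog);
    console.error(err.message);
});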

Side issues

Also, as the stream keeps arriving it increases RAM usage, and it seems an 'out of memory' exception will eventually be thrown.
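
A possible mitigation for the memory growth, again sketched with request directly (MAX_BYTES is an arbitrary illustrative cap): count body bytes as they arrive and abort once the cap is exceeded, so an endless stream cannot exhaust RAM.

const request = require('request');

const MAX_BYTES = 5 * 1024 * 1024; // arbitrary 5 MB cap for illustration
let received = 0;

const req = request('https://goldfm.nu');
req.on('data', (chunk) => {
    received += chunk.length;
    if (received > MAX_BYTES) {
        req.abort(); // stop the never-ending stream before memory blows up
    }
});
req.on('error', (err) => console.error(err.message));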

Attempts to fix

I also tried setting the Accept header to HTML only, but it has no effect:
headers: { Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9' },

Currently I just skip this URL as a special case, but I doubt it's the only one.
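
Since the server evidently ignores the Accept header, another sketch is to inspect the Content-Type of the response itself and abort anything that is not HTML (again with request directly; whether node-crawler exposes a hook for this is an open question):

const request = require('request');

const req = request('https://goldfm.nu');
req.on('response', (res) => {
    const type = res.headers['content-type'] || '';
    if (!type.includes('text/html')) {
        req.abort(); // e.g. audio/mpeg from an internet radio stream
    }
});
req.on('error', (err) => console.error(err.message));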

Expected behavior

The timeout should fire an error when we do not receive a complete response within the allotted time.
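
Until the timeout behaves that way, a hedged queue-level guard could enforce the deadline from the outside. In this sketch, fetchPage is a hypothetical helper that performs one crawl and returns a promise; Promise.race guarantees the caller settles even if the underlying request never does:

// fetchPage is hypothetical: any function returning a promise for one crawl.
const withDeadline = (promise, ms) =>
    Promise.race([
        promise,
        new Promise((_, reject) =>
            setTimeout(() => reject(new Error('hard deadline exceeded')), ms)
        ),
    ]);

// withDeadline(fetchPage('https://goldfm.nu'), 9000)
//     .catch((err) => console.error(err.message));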

Related issues

This issue is definitely related to the request package.

Question

Do you have any ideas on how to resolve this case?)
