Stop overriding default port configurations #40

Merged
merged 1 commit into simplecrawler:master from distracteddev/master on Jun 3, 2013

@distracteddev
Contributor

distracteddev commented May 31, 2013

I was getting internal client errors because the Queue was being populated with port: 80 queueItems even though the protocol was listed as https. This caused a call to https.get({port: 80}), which throws an OpenSSL error:

[Error: 140735148347776:error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol:../deps/openssl/openssl/ssl/s23_clnt.c:766:
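
For illustration, a minimal sketch of the mismatch (the queue-item shape and this standalone https.get call are assumptions for demonstration, not simplecrawler's exact internals):

var https = require("https");

// A queue item whose port was forced to 80 even though its protocol is https.
var queueItem = {
  protocol: "https",
  host: "example.com",
  port: 80,
  path: "/"
};

// https.get attempts a TLS handshake against a plain-HTTP port, so the
// server's non-TLS reply trips OpenSSL's "unknown protocol" error.
https.get({
  host: queueItem.host,
  port: queueItem.port,
  path: queueItem.path
}, function (res) {
  console.log(res.statusCode);
}).on("error", function (err) {
  console.log(err.message); // SSL routines ... unknown protocol
});

// Omitting the port entirely lets the protocol module pick its own default
// (443 for https, 80 for http), which matches the intent of this PR's title.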
@cgiffard
Collaborator

cgiffard commented Jun 3, 2013

Sorry I've been slow off the mark with this - I've been bedridden with a terrible flu. This looks good to me! Sorry it was causing you trouble.

cgiffard added a commit that referenced this pull request Jun 3, 2013

Merge pull request #40 from distracteddev/master
Stop overriding default port configurations

cgiffard merged commit 845faca into simplecrawler:master on Jun 3, 2013

1 check passed

default: The Travis CI build passed
@cgiffard
Collaborator

cgiffard commented Jun 3, 2013

I'll push to npm soonish, but I'm trying to resolve an error/semantic problem (#39) before I do so. :)

@distracteddev
Contributor

distracteddev commented Jun 3, 2013

No worries, and thanks for the great crawler :) Autodiscovery saved me a ton of time.

One issue I'm still having, however: when I try to discover 50+ links, the app ends up in a state where

if (crawler._openRequests >= crawler.maxConcurrency)

always evaluates to true. The open requests never time out (even after setting the timeout option to something small like 5 or 10 seconds), and the app is left spinning in cycles, waiting on requests that never seem to end.

I've tried to fix it for hours to no avail. As such, I was thinking of porting the Queue to use Async's built-in queue implementation to handle making and receiving requests. Any thoughts on this?
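
To make the stall concrete, here is a minimal sketch of the pattern described above (the tick function is illustrative; only the _openRequests and maxConcurrency names come from the comment):

// crawler._openRequests is incremented when a request is fired and only
// decremented when that request completes. If requests never complete
// (and never time out), the counter stays pinned at maxConcurrency:
function crawlTick(crawler) {
  if (crawler._openRequests >= crawler.maxConcurrency) {
    return; // every tick bails out here, so the crawl spins forever
  }
  // ...otherwise dequeue the next item and fire a request,
  // incrementing crawler._openRequests...
}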

@cgiffard
Collaborator

cgiffard commented Jun 3, 2013

(BTW I totally forgot to tell you, but 0.2.7 is on npm now. :))

In regards to that issue, perhaps I need to be tighter with the timeouts. I'm pretty sure the problem does not relate to the queue, and I'd like it to retain a relatively general interface so that a queue can be implemented, for example, in Redis, and shared between multiple machines running the same crawl.

I'll have a look into whether there's a simple way to fix this problem for you - that's most definitely 100% a bug. :)
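
As a sketch of what such a general interface could look like backed by Redis (the method names and callback shapes here are illustrative assumptions, not simplecrawler's actual queue API):

var redis = require("redis");

// Illustrative queue that stores items in a shared Redis list, so several
// machines could consume the same crawl queue.
function RedisQueue(key) {
  this.key = key;
  this.client = redis.createClient();
}

// Append a queue item, serialized as JSON.
RedisQueue.prototype.add = function (queueItem, callback) {
  this.client.rpush(this.key, JSON.stringify(queueItem), callback);
};

// Report the current queue length.
RedisQueue.prototype.getLength = function (callback) {
  this.client.llen(this.key, callback);
};

// Fetch and deserialize the item at a given index.
RedisQueue.prototype.get = function (index, callback) {
  this.client.lindex(this.key, index, function (err, raw) {
    if (err) return callback(err);
    callback(null, JSON.parse(raw));
  });
};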

@distracteddev
Contributor

distracteddev commented Jun 3, 2013

I looked into it more and I think the Queue is functioning properly; it's just that some requests were taking 150+ seconds to respond. I also realized you aren't using the timeout anywhere, so I added:

// Around Line 700 of crawler.js: abort requests that exceed crawler.timeout
clientRequest.setTimeout(crawler.timeout, function () {
  console.log('TIMEOUT REACHED', queueItem.url);
  clientRequest.abort();
});

and that fixed my problem right up once I set the timeout to 10 seconds.
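
One caveat worth noting alongside that patch: abort() can surface on the request as an 'error' event (e.g. "socket hang up"), and an unhandled 'error' event will crash a Node process, so a handler next to the timeout is prudent. A sketch, reusing the clientRequest and queueItem from the snippet above (the handler body is illustrative):

clientRequest.on("error", function (err) {
  // Aborted or stalled requests end up here; log and treat the queue item
  // as failed instead of letting an uncaught 'error' crash the crawl.
  console.log("REQUEST ERROR", queueItem.url, err.code);
});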

@cgiffard
Collaborator

cgiffard commented Jun 4, 2013

Sure, the missing timer is the bug. :)

I've got a bit of stuff on right now but I'll see if I can get to this later today. Thanks!

@cgiffard
Collaborator

cgiffard commented Jun 4, 2013

I've made issue #41 for managing the timer bug described here. :)

SaltwaterC referenced this pull request in SaltwaterC/http-request on Jul 3, 2013

Closed

Strange SSL error when trying to get this URL #19
