More workers = More "Failed to Retrieve" #33
Comments
Hi @earwickerh, my guess is that when you increase the number of workers, you reach the limit of your bandwidth, and this results in some hosts taking longer to respond. Are you using the source code release? If so, you can try to increase the hardcoded value. If that changes things, I can make it an option.
Great insight, this seemed to do the trick. However, I am running more tests to make sure I'm not introducing new variables into the experiments. I'll follow up with something more conclusive.
I don't think that's it after all. After further testing, I believe it is related to CSV output (probably indirectly). A resource limit of some sort is being hit; once reached, I get "Failed to Retrieve" exclusively for all remaining URLs, even though those same URLs resolve properly when running with fewer workers, a smaller set of URLs, or an individual host. The issue occurs particularly when combining more workers AND CSV output. When I run with stdout as the output, the issue goes away or is significantly reduced even with 100+ workers, which leads me to believe CSV isn't the root cause but rather a contributing factor in reaching a resource limit. A small note for which I can create a pull request/other issue for tracking:
- "search" argument: it's great that it goes through subdomains, but SSL vs. non-SSL isn't taken into account. My hosts file contains the http://, https://, and https://www. versions of the URLs I want to scan to get around this. Having the tool test SSL/non-SSL versions could prove useful.
Hi, regarding your other suggestions: I'm happy about every contribution :) Greetings
@earwickerh can you patch the code at this line (webanalyze.go, Line 208 in fb291b5), print out the real error (`fmt.Printf("error retrieving: %v\n", err)`), and run it again?
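A sketch of what the patched error handler might look like. The surrounding fetch code is assumed for illustration, not taken from webanalyze; only the `fmt.Printf` line is from the suggestion above:

```go
package main

import (
	"fmt"
	"net/http"
)

// fetch mimics the retrieval step: on failure it prints the real
// underlying error instead of a generic "Failed to Retrieve".
func fetch(url string) error {
	resp, err := http.Get(url)
	if err != nil {
		// The patch under discussion: surface the wrapped error.
		fmt.Printf("error retrieving: %v\n", err)
		return err
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	// An unroutable local port produces an error for demonstration.
	_ = fetch("http://127.0.0.1:1")
}
```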
Great, thank you! I'll do this and report back shortly. Thanks again |
I gave this a shot, but after making the change I get the following compile error: `webanalyze/webanalyze.go:208:33: multiple-value fmt.Printf() in single-value context`
I just committed the improved error reporting; you can pull the changes and then test again.
whoa, that was quick, thanks! It's running. I'll let you know what I find. Thanks again |
After implementing this, I only see one new error being displayed (and only one instance of it), but it doesn't seem related, as it showed up a good few minutes before my issue began to reoccur:

`Unsolicited response received on idle HTTP channel starting with "HTTP/1.1 100 Continue\r\n\r\n"; err=`

Here's the command I'm using:

`webanalyze -hosts crm-url-cleaner.txt -worker 200 -crawl 12 -output csv > results-200-c12-csvout.csv 2> err-w200-c12-csvout.txt`

The same command without `-output csv` works fine. Should we be trying to report errors for CSV specifically, around line 144? (My apologies for my ignorance.) Thanks for taking a look.
The error handler for retrieving does not depend on the output method; as you can see, it runs before the output is handled. Maybe writing CSV is failing because it's done from multiple goroutines. It's odd, because we just write to os.Stdout. We can catch the error from csv.Write and see if anything is wrong there. Can you try this:
Any update on this? Otherwise I will close this due to inactivity. |
When I use more workers, I get "Failed to Retrieve" on a lot of URLs that worked with a lower number of workers. The more workers I add, the more "Failed to Retrieve" errors I see. Any ideas as to why this may be happening?