More workers = More "Failed to Retrieve" #33

Closed
earwickerh opened this issue Jul 24, 2020 · 12 comments

Comments

@earwickerh

When I use more workers, I get "Failed to Retrieve" on a lot of URLs that worked with a lower number of workers. The more workers I added, the more "Failed to Retrieve" errors I got. Any ideas as to why this may be happening?

@rverton
Owner

rverton commented Jul 26, 2020

Hi @earwickerh,
there is a hardcoded timeout for retrieving content, currently 8 seconds (https://github.com/rverton/webanalyze/blob/master/webanalyze.go#L21). Why 8? To be honest, I don't know. It has worked very well for me until now.

My guess is that when you increase the number of workers, you are reaching the limit of your bandwidth, and this results in some hosts taking more time to respond.

Are you using the source code release? If yes, you can try to increase the hardcoded value. If that changes things, I can make it an option.
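
For illustration, a minimal, self-contained sketch of what such a timeout controls, assuming it ultimately ends up in an http.Client; the constant name and the 20-second value here are only examples, not the project's actual source:

package main

import (
	"fmt"
	"net/http"
	"time"
)

// timeout stands in for the hardcoded 8-second value mentioned above;
// raising it gives slow hosts more time to answer before a request is
// counted as "Failed to Retrieve".
const timeout = 20 * time.Second

func main() {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get("https://example.com")
	if err != nil {
		fmt.Printf("error retrieving: %v\n", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}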

@earwickerh
Author

Great insight, this seemed to do the trick. However, I am running more tests to make sure I'm not introducing new variables into the experiment. I'll update more conclusively.

@earwickerh
Author

I don't think that's it after all. After further testing, I believe it is related to CSV output (probably indirectly). A resource limit of some sort is being hit; once it is reached, I get "Failed to Retrieve" exclusively for all remaining URLs, even though those same URLs resolve properly when running with fewer workers, a smaller set of URLs, or an individual host.

The issue occurs particularly when using the combination of more workers AND csv output. When I run with stdout as the output, the issue seems to go away or be significantly reduced even with 100+ workers, which leads me to believe CSV isn't the root cause, but rather a contributing factor in reaching a resource limit.
I hope this won't be too difficult to replicate on your end.

Small notes, for which I can create pull requests/other issues for tracking:

  • stdout provides output when a site is successfully loaded but no matches are found. This would be a nice addition to the csv output as well, signaling the site was reached but no matches were found.

  • The "search" argument: it's great that it goes through subdomains, but SSL vs. non-SSL isn't taken into account. To get around this, my hosts file contains the http://, https://, and https://www. versions of the URLs I want to scan. Having the tool test SSL/non-SSL versions could prove useful (see the sketch after this list).

@rverton
Owner

rverton commented Aug 7, 2020

Hi,
yeah, I'm not quite sure how I can debug this. I'll ask around and let you know when I find a good way to benchmark/debug this.

Regarding your other suggestions: I'm happy for every contribution :)

Greetings

@rverton
Owner

rverton commented Aug 7, 2020

@earwickerh can you patch the code at this line:

return nil, links, fmt.Errorf("Failed to retrieve")

and print out the real error (fmt.Printf("error retrieving: %v\n", err)) and run it again?

@earwickerh
Author

Great, thank you! I'll do this and report back shortly. Thanks again

@earwickerh
Author

earwickerh commented Aug 9, 2020

I gave this a shot, but I get the compile error "webanalyze/webanalyze.go:208:33: multiple-value fmt.Printf() in single-value context" after changing fmt.Errorf("Failed to retrieve") to fmt.Printf("error retrieving: %v\n", err). Thanks for your help.
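
The compile error comes from fmt.Printf returning two values (a byte count and an error), so it can't stand in for the single error value inside the return statement; the idea is to print the underlying error first and then return. A self-contained sketch of that pattern, with fetch as an illustrative stand-in for the real function:

package main

import (
	"fmt"
	"net/http"
)

// fetch mimics the shape of the function quoted above: log the real cause,
// then return the generic "Failed to retrieve" error as before.
func fetch(url string) (*http.Response, error) {
	resp, err := http.Get(url)
	if err != nil {
		fmt.Printf("error retrieving: %v\n", err) // print the real error
		return nil, fmt.Errorf("Failed to retrieve")
	}
	return resp, nil
}

func main() {
	resp, err := fetch("http://localhost:1") // unreachable port to force an error
	if err != nil {
		fmt.Println(err)
		return
	}
	resp.Body.Close()
}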

@rverton
Owner

rverton commented Aug 9, 2020

I just committed the improved error reporting; you can pull the changes and then test again.

5c9aebf

@earwickerh
Author

Whoa, that was quick, thanks! It's running. I'll let you know what I find. Thanks again.

@earwickerh
Author

After implementing this, I only see one new error being displayed (and only one instance of it), but it doesn't seem related, as it showed up a good few minutes before my issue began to reoccur: 'Unsolicited response received on idle HTTP channel starting with "HTTP/1.1 100 Continue\r\n\r\n"; err=

Here's the command I'm using: webanalyze -hosts crm-url-cleaner.txt -worker 200 -crawl 12 -output csv > results-200-c12-csvout.csv 2> err-w200-c12-csvout.txt
After a while, the results seem to just stop being written to the csv file, while the errors continue to be written to the error file.

The same command without "-output csv" works fine...

Should we be trying to report errors for csv specifically, around line 144 here? (My apologies for my ignorance.)

Thanks for taking a look

@rverton
Owner

rverton commented Aug 9, 2020

The error handler for retrieving is not dependent on the output method; as you can see, it's done before the output is handled.

Maybe writing csv is failing because it's done from multiple goroutines. That would be odd, because we just write to os.Stdout. We can catch the error from csv.Write and see if there is anything wrong here.

Can you try this:

err := outWriter.Write(
	[]string{
		result.Host,
		strings.Join(m.CatNames, ","),
		m.AppName,
		m.Version,
	},
)
if err != nil {
	log.Printf("error writing csv: %v\n", err)
}
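
One more place worth checking, though this is an assumption rather than something shown above: encoding/csv buffers its output, so a write error may only surface when the writer is flushed. A minimal sketch of checking both Write and Flush/Error:

package main

import (
	"encoding/csv"
	"log"
	"os"
)

func main() {
	w := csv.NewWriter(os.Stdout)

	if err := w.Write([]string{"example.com", "CMS", "WordPress", "5.4"}); err != nil {
		log.Printf("error writing csv: %v\n", err)
	}

	// csv.Writer buffers internally; Flush pushes the buffered rows to
	// os.Stdout, and Error reports any error from a previous Write or Flush.
	w.Flush()
	if err := w.Error(); err != nil {
		log.Printf("error flushing csv: %v\n", err)
	}
}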

@rverton
Owner

rverton commented Sep 8, 2020

Any update on this? Otherwise I will close this due to inactivity.

@rverton rverton closed this as completed Sep 25, 2020