More workers = More "Failed to Retrieve" #33

Closed
earwickerh opened this issue Jul 24, 2020 · 12 comments

Comments

@earwickerh

When I use more workers, I get "Failed to Retrieve" on a lot of URLs that worked with a lower number of workers. The more workers I added, the more "Failed to Retrieve" errors I got. Any ideas as to why this may be happening?

@rverton
Owner

rverton commented Jul 26, 2020

Hi @earwickerh,
there is a hardcoded timeout for retrieving content, currently 8 seconds (https://github.com/rverton/webanalyze/blob/master/webanalyze.go#L21). Why 8? To be honest, I don't know. It has worked very well for me until now.

My guess is that when you increase the number of workers, you are reaching the limit of your bandwidth, and this results in some hosts taking more time to respond.

Are you using the source code release? If yes, you can try to increase the hardcoded value. If that changes things, I can make it an option.
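
For illustration, a minimal, self-contained sketch of what such a timeout controls, assuming it ultimately ends up in an http.Client; the constant name and the 20-second value here are only examples, not the project's actual source:

package main

import (
	"fmt"
	"net/http"
	"time"
)

// timeout stands in for the hardcoded 8-second value mentioned above;
// raising it gives slow hosts more time to answer before a request is
// counted as "Failed to Retrieve".
const timeout = 20 * time.Second

func main() {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get("https://example.com")
	if err != nil {
		fmt.Printf("error retrieving: %v\n", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}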

@earwickerh
Author

Great insight, this seemed to do the trick. However, I am running more tests to make sure I'm not introducing new variables into the experiment. I'll update more conclusively.

@earwickerh
Author

I don't think that's it after all. After further testing, I believe it is related to CSV output (probably indirectly). A resource limit of some sort is being hit; once it is reached, I get "Failed to Retrieve" exclusively for all remaining URLs, even though those same URLs resolve properly when running with fewer workers, a smaller set of URLs, or an individual host.

The issue occurs particularly when using the combination of more workers AND csv output. When I run with stdout as the output, the issue seems to go away or be significantly reduced even with 100+ workers, which leads me to believe CSV isn't the root cause, but rather a contributing factor in reaching a resource limit.
I hope this won't be too difficult to replicate on your end.

Small notes, for which I can create pull requests/other issues for tracking:

  • stdout provides output when a site is successfully loaded but no matches are found. This would be a nice addition to the csv output as well, signaling the site was reached but no matches were found.

  • The "search" argument: it's great that it goes through subdomains, but SSL vs. non-SSL isn't taken into account. To get around this, my hosts file contains the http://, https://, and https://www. versions of the URLs I want to scan. Having the tool test SSL/non-SSL versions could prove useful (see the sketch after this list).

@rverton
Owner

rverton commented Aug 7, 2020

Hi,
yeah, I'm not quite sure how I can debug this. I'll ask around and let you know when I find a good way to benchmark/debug this.

Regarding your other suggestions: I'm happy for every contribution :)

Greetings

@rverton
Owner

rverton commented Aug 7, 2020

@earwickerh can you patch the code at this line:

return nil, links, fmt.Errorf("Failed to retrieve")

and print out the real error (fmt.Printf("error retrieving: %v\n", err)) and run it again?

@earwickerh
Author

Great, thank you! I'll do this and report back shortly. Thanks again

@earwickerh
Author

earwickerh commented Aug 9, 2020

I gave this a shot, but I get the compile error "webanalyze/webanalyze.go:208:33: multiple-value fmt.Printf() in single-value context" after changing fmt.Errorf("Failed to retrieve") to fmt.Printf("error retrieving: %v\n", err). Thanks for your help.
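
The compile error comes from fmt.Printf returning two values (a byte count and an error), so it can't stand in for the single error value inside the return statement; the idea is to print the underlying error first and then return. A self-contained sketch of that pattern, with fetch as an illustrative stand-in for the real function:

package main

import (
	"fmt"
	"net/http"
)

// fetch mimics the shape of the function quoted above: log the real cause,
// then return the generic "Failed to retrieve" error as before.
func fetch(url string) (*http.Response, error) {
	resp, err := http.Get(url)
	if err != nil {
		fmt.Printf("error retrieving: %v\n", err) // print the real error
		return nil, fmt.Errorf("Failed to retrieve")
	}
	return resp, nil
}

func main() {
	resp, err := fetch("http://localhost:1") // unreachable port to force an error
	if err != nil {
		fmt.Println(err)
		return
	}
	resp.Body.Close()
}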

@rverton
Owner

rverton commented Aug 9, 2020

I just committed the improved error reporting; you can pull the changes and then test again.

5c9aebf

@earwickerh
Author

Whoa, that was quick, thanks! It's running. I'll let you know what I find. Thanks again.

@earwickerh
Author

After implementing this, I only see one new error being displayed (and only one instance of it), but it doesn't seem related, as it showed up a good few minutes before my issue began to reoccur: 'Unsolicited response received on idle HTTP channel starting with "HTTP/1.1 100 Continue\r\n\r\n"; err=

Here's the command I'm using: webanalyze -hosts crm-url-cleaner.txt -worker 200 -crawl 12 -output csv > results-200-c12-csvout.csv 2> err-w200-c12-csvout.txt
After a while, the results seem to just stop being written to the csv file, while the errors continue to be written to the error file.

The same command without "-output csv" works fine...

Should we be trying to report errors for csv specifically, around line 144 here? (My apologies for my ignorance.)

Thanks for taking a look

@rverton
Owner

rverton commented Aug 9, 2020

The error handler for retrieving is not dependent on the output method; as you can see, it's done before the output is handled.

Maybe writing csv is failing because it's done from multiple goroutines. That would be odd, because we just write to os.Stdout. We can catch the error from csv.Write and see if there is anything wrong here.

Can you try this:

err := outWriter.Write(
	[]string{
		result.Host,
		strings.Join(m.CatNames, ","),
		m.AppName,
		m.Version,
	},
)
if err != nil {
	log.Printf("error writing csv: %v\n", err)
}
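
One more place worth checking, though this is an assumption rather than something shown above: encoding/csv buffers its output, so a write error may only surface when the writer is flushed. A minimal sketch of checking both Write and Flush/Error:

package main

import (
	"encoding/csv"
	"log"
	"os"
)

func main() {
	w := csv.NewWriter(os.Stdout)

	if err := w.Write([]string{"example.com", "CMS", "WordPress", "5.4"}); err != nil {
		log.Printf("error writing csv: %v\n", err)
	}

	// csv.Writer buffers internally; Flush pushes the buffered rows to
	// os.Stdout, and Error reports any error from a previous Write or Flush.
	w.Flush()
	if err := w.Error(); err != nil {
		log.Printf("error flushing csv: %v\n", err)
	}
}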

@rverton
Owner

rverton commented Sep 8, 2020

Any update on this? Otherwise I will close this due to inactivity.

@rverton rverton closed this as completed Sep 25, 2020