
Retrieve and parse header before requesting the full page #59

Open
vezaynk opened this issue Nov 5, 2017 · 20 comments

Comments

@vezaynk
Owner

vezaynk commented Nov 5, 2017

Doing so would fix a wide array of issues, namely that the script currently downloads huge PDFs needlessly just to check their HTTP status code. Absolutely pointless.

An issue to open later would be to add a flag for "max file size" to avoid downloading any files larger than a given threshold.
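
Roughly, such a flag could lean on cURL's CURLOPT_MAXFILESIZE option, as in the sketch below. The $max_file_size name and the 1 MB limit are placeholders, and the option only helps when the server reports a Content-Length up front.

$max_file_size = 1024 * 1024; // e.g. 1 MB cap
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_MAXFILESIZE, $max_file_size);
$body = curl_exec($ch);
if (curl_errno($ch) !== 0) {
    // cURL aborts with CURLE_FILESIZE_EXCEEDED (error 63) when the limit is hit.
}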

@vezaynk
Owner Author

vezaynk commented Nov 5, 2017

This will also address #26

@ghost

ghost commented Nov 5, 2017

Running just
$header = get_headers($url);
should be way more efficient.
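
For example, the status line and media type can be read off that call alone, along the lines of the sketch below (assuming the associative format; redirect handling is simplified):

$headers = get_headers($url, 1);   // 1 = associative array of header fields
if ($headers !== false) {
    $status = $headers[0];         // e.g. "HTTP/1.1 200 OK"
    $type = isset($headers['Content-Type']) ? $headers['Content-Type'] : '';
    if (is_array($type)) {         // redirects can yield several values
        $type = end($type);
    }
    if (strpos($status, '200') !== false && stripos($type, 'text/html') === 0) {
        // Only now download the full page for crawling.
    }
}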

@vezaynk
Owner Author

vezaynk commented Nov 5, 2017

It would be, that is true. I want to do it even more efficiently, however (if possible, and it probably is). If I use get_headers, I would be requesting the headers twice, which will actually end up hurting performance on more light-weight sites. Splitting the request into two parts would be ideal.
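
To illustrate the overhead: the two-request alternative with a reused cURL handle would look roughly like the sketch below, and the headers still travel over the wire twice. The checks shown are placeholders.

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);      // first pass: HEAD request, headers only
curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
if ($code == 200 && stripos((string) $type, 'text/html') === 0) {
    curl_setopt($ch, CURLOPT_NOBODY, false);
    curl_setopt($ch, CURLOPT_HTTPGET, true); // second pass: full GET for the body
    $body = curl_exec($ch);
}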

@ghost

ghost commented Nov 5, 2017

Then you will have to use cURL, which should be way less efficient than the PHP function. This might be something you want to test.

@vezaynk
Owner Author

vezaynk commented Nov 5, 2017

It's using cURL anyways, no performance loss there.

@ghost

ghost commented Nov 5, 2017

@vezaynk
Owner Author

vezaynk commented Nov 5, 2017

I am sure that it is currently using cURL. You asserted that using cURL would make things slower (it may, it may not, I have no idea). I responded that I am using cURL anyway, and that using more of its functions would not slow it down further, not that I am sure cURL is more performant than a given alternative.

@ghost

ghost commented Nov 5, 2017

Oh, OK, a misunderstanding. I thought you said that get_headers($url) uses cURL internally.

I think it should be way faster to use just get_headers() and run cURL only when needed.

@vezaynk
Owner Author

vezaynk commented Nov 5, 2017

As I said, it will certainly be faster for some cases, but for well-built websites it will be slower because it's going to be sending more requests than necessary.

I define a well-built website as one that has no broken links and no over-sized pages. I wouldn't want to punish websites that do it right in favor of websites that have plenty of broken links and such.

I am, however, having a hard time finding resources explaining how to achieve what I am looking for, so I may very well end up going with the alternative.

@ghost

ghost commented Nov 5, 2017

Are you currently crawling CSS, JS, and all other files for URLs? Using headers should eliminate going through unnecessary files when you know the media types in advance, and I believe there are more useless than useful files on an average website.

@vezaynk
Owner Author

vezaynk commented Nov 5, 2017

Unless someone links a CSS/JS/whatever file with an a[href], it wouldn't even look at it.

@ghost

ghost commented Nov 5, 2017

CSS and JS should not be directly linked, but there are often various downloads like PDFs etc. that are href-linked.

@vezaynk
Owner Author

vezaynk commented Nov 5, 2017

Yes, they are href linked and can be optionally indexed since #40. Images however are a better example of wasteful downloads.

@smurfiez

Could always do an is_html() function that uses get_headers, then retrieve the page if necessary.
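
Something along those lines, perhaps (everything beyond the is_html() name is an assumption):

// Check the Content-Type via get_headers() before fetching the body.
function is_html($url) {
    $headers = get_headers($url, 1);
    if ($headers === false || !isset($headers['Content-Type'])) {
        return false;
    }
    $type = $headers['Content-Type'];
    if (is_array($type)) {       // redirects can yield several values
        $type = end($type);
    }
    return stripos($type, 'text/html') === 0;
}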

@vezaynk
Owner Author

vezaynk commented Nov 13, 2017

The method I was initially proposing would be done with something like this:

curl_setopt( $ch, CURLOPT_WRITEFUNCTION, "my_callback" );

// Return the number of bytes handled to keep receiving, or -1 to abort the transfer.
// isValid() stands in for whatever check decides the download is worth continuing.
function my_callback( $ch, $data ) {
    return isValid( $data ) ? strlen( $data ) : -1;
}

It is not exactly what I was looking for, but it works for all intents and purposes. The way it works can be read in detail here, but in short, it can abort a cURL operation as soon as some data has been received.

This doesn't sound amazingly efficient to me, and as such I feel compelled to actually test how well it stacks up against just getting the headers and optionally making the full request afterwards, as @2globalnomads was proposing.

@vezaynk
Owner Author

vezaynk commented Nov 13, 2017

I lied. The better way to do this is with CURLOPT_HEADERFUNCTION.
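
A rough sketch of that approach (the content-type check is an assumption; returning a length other than the one received makes cURL abort before any body data is downloaded):

curl_setopt($ch, CURLOPT_HEADERFUNCTION, function ($ch, $header) {
    // Inspect each header line as it arrives; bail out on non-HTML responses.
    if (stripos($header, 'Content-Type:') === 0 &&
        stripos($header, 'text/html') === false) {
        return -1;              // abort the transfer (reported as CURLE_WRITE_ERROR)
    }
    return strlen($header);     // keep receiving
});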

@vezaynk
Owner Author

vezaynk commented Nov 14, 2017

I created a branch to test that implementation; it's not the cleanest thing in the world, but it works. https://github.com/knyzorg/Sitemap-Generator-Crawler/tree/single-request

The performance improvements are predictably obvious with websites that have a lot of PDFs such as this one: http://rolf-herbold.de

The effect is ideal: performance for standard sites is unaffected, while the performance of sites like the one above is improved (by how much is determined by internet speed).

@smurfiez

On a side note, I'm actually working on a multi-threaded(?) version of your crawler via curl_multi_exec.

I have some non-database scraped sites that I crawl to build up the site, and single-threaded crawling takes forever.
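
For reference, the general shape of a curl_multi based fetch looks something like this (URLs and options are placeholders, not the crawler's actual code):

$urls = array('http://example.com/a', 'http://example.com/b');
$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}
do {
    curl_multi_exec($mh, $running);      // drive all transfers concurrently
    curl_multi_select($mh);              // wait for activity instead of busy-looping
} while ($running > 0);
$pages = array();
foreach ($handles as $url => $ch) {
    $pages[$url] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);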

@vezaynk
Owner Author

vezaynk commented Nov 14, 2017

What is this curl_multi_exec version you speak of?

@vezaynk
Owner Author

vezaynk commented Dec 27, 2017

After more testing, what I did is definitely not merge-able as-is. On some systems, handling a failed cURL transfer is more expensive than just downloading whatever it is.
