
Retrieve and parse header before requesting the full page #59

Open
vezaynk opened this issue Nov 5, 2017 · 20 comments

Comments

@vezaynk
Owner

vezaynk commented Nov 5, 2017

Doing so would fix a wide array of issues, namely that the script currently downloads huge PDFs needlessly just to check their HTTP status code. Absolutely pointless.

An issue to open later would be to add a flag for "max file size" to avoid downloading any files larger than a given threshold.
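
Roughly, such a flag could lean on cURL's CURLOPT_MAXFILESIZE option, as in the sketch below. The $max_file_size name and the 1 MB limit are placeholders, and the option only helps when the server reports a Content-Length up front.

$max_file_size = 1024 * 1024; // e.g. 1 MB cap
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_MAXFILESIZE, $max_file_size);
$body = curl_exec($ch);
if (curl_errno($ch) !== 0) {
    // cURL aborts with CURLE_FILESIZE_EXCEEDED (error 63) when the limit is hit.
}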

@vezaynk
Owner Author

vezaynk commented Nov 5, 2017

This will also address #26

@ghost

ghost commented Nov 5, 2017

Running just
$header = get_headers($url);
should be way more efficient.
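
For example, the status line and media type can be read off that call alone, along the lines of the sketch below (assuming the associative format; redirect handling is simplified):

$headers = get_headers($url, 1);   // 1 = associative array of header fields
if ($headers !== false) {
    $status = $headers[0];         // e.g. "HTTP/1.1 200 OK"
    $type = isset($headers['Content-Type']) ? $headers['Content-Type'] : '';
    if (is_array($type)) {         // redirects can yield several values
        $type = end($type);
    }
    if (strpos($status, '200') !== false && stripos($type, 'text/html') === 0) {
        // Only now download the full page for crawling.
    }
}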

@vezaynk
Owner Author

vezaynk commented Nov 5, 2017

It would be, that is true. I want to do it even more efficiently, however (if possible, and it probably is). If I use get_headers, I would be requesting the headers twice, which will actually end up hurting performance on more light-weight sites. Splitting the request into two parts would be ideal.
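
To illustrate the overhead: the two-request alternative with a reused cURL handle would look roughly like the sketch below, and the headers still travel over the wire twice. The checks shown are placeholders.

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);      // first pass: HEAD request, headers only
curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
if ($code == 200 && stripos((string) $type, 'text/html') === 0) {
    curl_setopt($ch, CURLOPT_NOBODY, false);
    curl_setopt($ch, CURLOPT_HTTPGET, true); // second pass: full GET for the body
    $body = curl_exec($ch);
}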

@ghost

ghost commented Nov 5, 2017

Then you will have to use cURL, which should be way less efficient than the PHP function. This might be something you want to test.

@vezaynk
Owner Author

vezaynk commented Nov 5, 2017

It's using cURL anyways, no performance loss there.

@ghost

ghost commented Nov 5, 2017

@vezaynk
Owner Author

vezaynk commented Nov 5, 2017

I am sure that it is currently using cURL. You asserted that using cURL would make things slower (it may, it may not, I have no idea). I responded that I am using cURL anyway, and that using more of its functions would not slow it down further, not that I am sure cURL is more performant than a given alternative.

@ghost

ghost commented Nov 5, 2017

Oh, OK, a misunderstanding. I thought you said that get_headers($url) uses cURL internally.

I think it should be way faster to use just get_headers() and run cURL only when needed.

@vezaynk
Owner Author

vezaynk commented Nov 5, 2017

As I said, it will certainly be faster for some cases, but for well-built websites it will be slower because it's going to be sending more requests than necessary.

I define a well-built website as one that has no broken links and no over-sized pages. I wouldn't want to punish websites that do it right in favor of websites that have plenty of broken links and such.

I am, however, having a hard time finding resources explaining how to achieve what I am looking for, so I may very well end up going with the alternative.

@ghost

ghost commented Nov 5, 2017

Are you currently crawling CSS, JS, and all other files for URLs? Using headers should eliminate going through unnecessary files when you know the media types in advance, and I believe there are more useless than useful files on an average website.

@vezaynk
Owner Author

vezaynk commented Nov 5, 2017

Unless someone links a CSS/JS/whatever file with an a[href], it wouldn't even look at it.

@ghost

ghost commented Nov 5, 2017

CSS and JS should not be directly linked, but there are often various downloads like PDFs etc. that are href-linked.

@vezaynk
Owner Author

vezaynk commented Nov 5, 2017

Yes, they are href linked and can be optionally indexed since #40. Images however are a better example of wasteful downloads.

@smurfiez

Could always do an is_html() function that uses get_headers, then retrieve the page if necessary.
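
Something along those lines, perhaps (everything beyond the is_html() name is an assumption):

// Check the Content-Type via get_headers() before fetching the body.
function is_html($url) {
    $headers = get_headers($url, 1);
    if ($headers === false || !isset($headers['Content-Type'])) {
        return false;
    }
    $type = $headers['Content-Type'];
    if (is_array($type)) {       // redirects can yield several values
        $type = end($type);
    }
    return stripos($type, 'text/html') === 0;
}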

@vezaynk
Owner Author

vezaynk commented Nov 13, 2017

The method I was initially proposing would be done with something like this:

curl_setopt( $ch, CURLOPT_WRITEFUNCTION, "my_callback" );

// Return the number of bytes handled to keep receiving, or -1 to abort the transfer.
// isValid() stands in for whatever check decides the download is worth continuing.
function my_callback( $ch, $data ) {
    return isValid( $data ) ? strlen( $data ) : -1;
}

It is not exactly what I was looking for, but it works for all intents and purposes. The way it works can be read in detail here, but in short, it can abort a cURL operation as soon as some data has been received.

This doesn't sound amazingly efficient to me, and as such I feel compelled to actually test how well it stacks up against just getting the headers and optionally making the full request afterwards, as @2globalnomads was proposing.

@vezaynk
Owner Author

vezaynk commented Nov 13, 2017

I lied. The better way to do this is with CURLOPT_HEADERFUNCTION.
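
A rough sketch of that approach (the content-type check is an assumption; returning a length other than the one received makes cURL abort before any body data is downloaded):

curl_setopt($ch, CURLOPT_HEADERFUNCTION, function ($ch, $header) {
    // Inspect each header line as it arrives; bail out on non-HTML responses.
    if (stripos($header, 'Content-Type:') === 0 &&
        stripos($header, 'text/html') === false) {
        return -1;              // abort the transfer (reported as CURLE_WRITE_ERROR)
    }
    return strlen($header);     // keep receiving
});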

@vezaynk
Owner Author

vezaynk commented Nov 14, 2017

I created a branch to test that implementation; it's not the cleanest thing in the world, but it works. https://github.com/knyzorg/Sitemap-Generator-Crawler/tree/single-request

The performance improvements are predictably obvious with websites that have a lot of PDFs such as this one: http://rolf-herbold.de

The effect is ideal: performance for standard sites is unaffected, while the performance of sites like the one above is improved (by how much is determined by internet speed).

@smurfiez

On a side note, I'm actually working on a multi-threaded(?) version of your crawler via curl_multi_exec.

I have some non-database scraped sites that I crawl to build up the site, and single-threaded crawling takes forever.
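
For reference, the general shape of a curl_multi based fetch looks something like this (URLs and options are placeholders, not the crawler's actual code):

$urls = array('http://example.com/a', 'http://example.com/b');
$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}
do {
    curl_multi_exec($mh, $running);      // drive all transfers concurrently
    curl_multi_select($mh);              // wait for activity instead of busy-looping
} while ($running > 0);
$pages = array();
foreach ($handles as $url => $ch) {
    $pages[$url] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);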

@vezaynk
Owner Author

vezaynk commented Nov 14, 2017

What is this curl_multi_exec version you speak of?

@vezaynk
Owner Author

vezaynk commented Dec 27, 2017

After more testing, what I did is definitely not merge-able as-is. On some systems, handling a failed cURL transfer is more expensive than just downloading whatever it is.
