This repository has been archived by the owner on Feb 13, 2021. It is now read-only.

Crawler stopping too soon #33

Closed
andresriancho opened this issue Feb 2, 2017 · 1 comment

andresriancho commented Feb 2, 2017

I'm trying to run the crawler to extract links from a simple page using the following command:

phantomjs --ssl-protocol=any --ignore-ssl-errors=true --proxy=127.0.0.1:8080 --proxy-type=http render.js http://192.168.0.40:8899/pages/11.php

11.php is part of wivet, a web crawler test application. The response generated when browsing to 11.php is:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
	<head>
		<meta http-equiv="content-type" content="text/html; charset=windows-1250">
		<link type="text/css" rel="stylesheet" href="/style.css" />
  <script type="text/javascript" src="../js/jquery/jquery.js"></script>
  <script type="text/javascript" >
    $(document).ready(function(){
      $("#link").each(function(){this.href = "../innerpages/11_1f2e4.php";});
    });
  </script>
	</head>
	<body  class="body">
    <center>
      <a id="link" href="" target="body">click me</a>
      <a href="javascript:window.open('../innerpages'+'/11_2d3ff.php', 'windowopen', 'resizable=yes,width=500,height=400');">click me 2</a>
    </center>
	</body>
</html>

I see this response in the proxy I'm running on 127.0.0.1:8080 (see the --proxy parameter in the phantomjs call). I also see jquery.js being requested.

The output seen in stdout when running the command is:

{"response":{"headers":{"Date":["Thu, 02 Feb 2017 21:14:30 GMT"],"Server":["Apache/2.4.10 (Debian) PHP/5.6.11"],"X-Powered-By":["PHP/5.6.11"],"Set-Cookie":["PHPSESSID=2a81288b4a514a017c4d79bd89de6c51; path=/"],"Expires":["Thu, 19 Nov 1981 08:52:00 GMT"],"Cache-Control":["no-store, no-cache, must-revalidate, post-check=0, pre-check=0"],"Pragma":["no-cache"],"Vary":["Accept-Encoding"],"Content-Length":["726"],"Connection":["close"],"Content-Type":["text/html; charset=UTF-8"]},"contentType":"text/html; charset=UTF-8","status":200,"url":"http://192.168.0.40:8899/pages/11.php","body":"<html><head>\n\t\t<meta http-equiv=\"content-type\" content=\"text/html; charset=windows-1250\">\n\t\t<link type=\"text/css\" rel=\"stylesheet\" href=\"/style.css\">\n  <script type=\"text/javascript\" src=\"../js/jquery/jquery.js\"></script>\n  <script type=\"text/javascript\">\n    $(document).ready(function(){\n      $(\"#link\").each(function(){this.href = \"../innerpages/11_1f2e4.php\";});\n    });\n  </script>\n\t</head>\n\t<body class=\"body\">\n    <center>\n      <a id=\"link\" href=\"../innerpages/11_1f2e4.php\" target=\"body\">click me</a>\n      <a href=\"javascript:window.open('../innerpages'+'/11_2d3ff.php', 'windowopen', 'resizable=yes,width=500,height=400');\">click me 2</a>\n    </center>\n\t\n\n</body></html>","details":{"links":[{"text":"click me","url":"http://192.168.0.40:8899/innerpages/11_1f2e4.php"}],"forms":[],"jsLinkFeedback":true}},"elasped":531,"ok":1,"msgType":"domSteady","signature":"==lXlKfYWch7H9VdJgPCmJ=="}


{"action":"element.triggered","events":["click"],"keyChain":["root","body/center[1]/a[2]"],"childFrames":[{"headers":[{"name":"Accept","value":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"},{"name":"Referer","value":"http://192.168.0.40:8899/pages/11.php"},{"name":"User-Agent","value":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) PhantomJS/2.1.1 Safari/538.1"}],"id":1,"method":"GET","time":"2017-02-02T21:14:30.560Z","url":"http://192.168.0.40:8899/innerpages/11_2d3ff.php","fromMainFrame":true,"navigationType":"Other"}],"msgType":"domChanged","signature":"==lXlKfYWch7H9VdJgPCmJ=="}

The first message is the response to the initial GET request. The second appears to report a click on one of these links:

      <a id="link" href="" target="body">click me</a>
      <a href="javascript:window.open('../innerpages'+'/11_2d3ff.php', 'windowopen', 'resizable=yes,width=500,height=400');">click me 2</a>
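To make the two messages easier to compare, here is a minimal sketch that parses trimmed copies of them and prints the message type and the URLs each one carries. The `summarize` helper is hypothetical and not part of gryffin or render.js; the field names (`msgType`, `response.details.links`, `keyChain`, `childFrames`) are taken only from the output shown above, and other message types may well exist.

```javascript
// Trimmed copies of the two stdout messages from this report (most fields omitted).
const lines = [
  '{"response":{"status":200,"url":"http://192.168.0.40:8899/pages/11.php","details":{"links":[{"text":"click me","url":"http://192.168.0.40:8899/innerpages/11_1f2e4.php"}],"forms":[],"jsLinkFeedback":true}},"msgType":"domSteady"}',
  '{"action":"element.triggered","events":["click"],"keyChain":["root","body/center[1]/a[2]"],"childFrames":[{"method":"GET","url":"http://192.168.0.40:8899/innerpages/11_2d3ff.php"}],"msgType":"domChanged"}'
];

// summarize() is a hypothetical helper for illustration, not a gryffin API.
function summarize(line) {
  const msg = JSON.parse(line);
  if (msg.msgType === "domSteady") {
    // Links found in the DOM once the page settled.
    return { type: msg.msgType, urls: msg.response.details.links.map(l => l.url) };
  }
  if (msg.msgType === "domChanged") {
    // Element that was triggered, plus the requests reported for it.
    return {
      type: msg.msgType,
      element: msg.keyChain[msg.keyChain.length - 1],
      urls: msg.childFrames.map(f => f.url)
    };
  }
  // Any other message type is passed through without URLs.
  return { type: msg.msgType, urls: [] };
}

const summaries = lines.map(summarize);
summaries.forEach(s => console.log(JSON.stringify(s)));
```

Read this way, the domSteady message carries 11_1f2e4.php and the domChanged message carries 11_2d3ff.php, triggered from `body/center[1]/a[2]`, the second anchor.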

My questions are:

  • Did you guys run gryffin against WIVET? Any results you can share?
  • Why is only one of the links clicked?
  • Why am I not seeing the HTTP request to 11_2d3ff.php in my proxy?
  • In the JSON printed in stdout I see:
"details":{"links":[{"text":"click me","url":"http://192.168.0.40:8899/innerpages/11_1f2e4.php"}]

If the crawler did not click on the second link, how was that URL extracted?
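One observation that may be relevant here: the URL in `links` matches the href of the first anchor, which the `$(document).ready` handler fills in and which already appears in the serialized `body` of the domSteady message. The sketch below only shows that resolving that relative href against the page URL yields the listed value; whether gryffin extracts it this way is an assumption, not something confirmed in this thread.

```javascript
// Page that was crawled and the href the ready handler assigns to a#link,
// both copied from the report above.
const pageUrl = "http://192.168.0.40:8899/pages/11.php";
const hrefAfterReady = "../innerpages/11_1f2e4.php";

// Standard URL resolution (WHATWG URL API): no click is needed to get the absolute URL.
const resolved = new URL(hrefAfterReady, pageUrl).href;
console.log(resolved);
```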

Is there something I'm doing wrong? I'm using PhantomJS 2.1.1.

@r-andrew-dev
Contributor

The project is being archived.
