Crawler stopping too soon #33

andresriancho · 2017-02-02T21:27:03Z

I'm trying to run the crawler to extract links from a simple page using the following command:

phantomjs --ssl-protocol=any --ignore-ssl-errors=true --proxy=127.0.0.1:8080 --proxy-type=http render.js http://192.168.0.40:8899/pages/11.php

11.php is part of WIVET, a web crawler test application. The response generated when browsing to 11.php is:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
	<head>
		<meta http-equiv="content-type" content="text/html; charset=windows-1250">
		<link type="text/css" rel="stylesheet" href="/style.css" />
  <script type="text/javascript" src="../js/jquery/jquery.js"></script>
  <script type="text/javascript" >
    $(document).ready(function(){
      $("#link").each(function(){this.href = "../innerpages/11_1f2e4.php";});
    });
  </script>
	</head>
	<body  class="body">
    <center>
      <a id="link" href="" target="body">click me</a>
      <a href="javascript:window.open('../innerpages'+'/11_2d3ff.php', 'windowopen', 'resizable=yes,width=500,height=400');">click me 2</a>
    </center>
	</body>
</html>

I see this response pass through 127.0.0.1:8080 (see the --proxy parameter in the phantomjs call). I also see jquery.js being requested there.

The output seen in stdout when running the command is:

{"response":{"headers":{"Date":["Thu, 02 Feb 2017 21:14:30 GMT"],"Server":["Apache/2.4.10 (Debian) PHP/5.6.11"],"X-Powered-By":["PHP/5.6.11"],"Set-Cookie":["PHPSESSID=2a81288b4a514a017c4d79bd89de6c51; path=/"],"Expires":["Thu, 19 Nov 1981 08:52:00 GMT"],"Cache-Control":["no-store, no-cache, must-revalidate, post-check=0, pre-check=0"],"Pragma":["no-cache"],"Vary":["Accept-Encoding"],"Content-Length":["726"],"Connection":["close"],"Content-Type":["text/html; charset=UTF-8"]},"contentType":"text/html; charset=UTF-8","status":200,"url":"http://192.168.0.40:8899/pages/11.php","body":"<html><head>\n\t\t<meta http-equiv=\"content-type\" content=\"text/html; charset=windows-1250\">\n\t\t<link type=\"text/css\" rel=\"stylesheet\" href=\"/style.css\">\n  <script type=\"text/javascript\" src=\"../js/jquery/jquery.js\"></script>\n  <script type=\"text/javascript\">\n    $(document).ready(function(){\n      $(\"#link\").each(function(){this.href = \"../innerpages/11_1f2e4.php\";});\n    });\n  </script>\n\t</head>\n\t<body class=\"body\">\n    <center>\n      <a id=\"link\" href=\"../innerpages/11_1f2e4.php\" target=\"body\">click me</a>\n      <a href=\"javascript:window.open('../innerpages'+'/11_2d3ff.php', 'windowopen', 'resizable=yes,width=500,height=400');\">click me 2</a>\n    </center>\n\t\n\n</body></html>","details":{"links":[{"text":"click me","url":"http://192.168.0.40:8899/innerpages/11_1f2e4.php"}],"forms":[],"jsLinkFeedback":true}},"elasped":531,"ok":1,"msgType":"domSteady","signature":"==lXlKfYWch7H9VdJgPCmJ=="}


{"action":"element.triggered","events":["click"],"keyChain":["root","body/center[1]/a[2]"],"childFrames":[{"headers":[{"name":"Accept","value":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"},{"name":"Referer","value":"http://192.168.0.40:8899/pages/11.php"},{"name":"User-Agent","value":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) PhantomJS/2.1.1 Safari/538.1"}],"id":1,"method":"GET","time":"2017-02-02T21:14:30.560Z","url":"http://192.168.0.40:8899/innerpages/11_2d3ff.php","fromMainFrame":true,"navigationType":"Other"}],"msgType":"domChanged","signature":"==lXlKfYWch7H9VdJgPCmJ=="}

The first message is the response to the initial GET request. The second seems to report a click on one of the links:

      <a id="link" href="" target="body">click me</a>
      <a href="javascript:window.open('../innerpages'+'/11_2d3ff.php', 'windowopen', 'resizable=yes,width=500,height=400');">click me 2</a>
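To make the two messages easier to compare, here is a minimal Node.js sketch that summarizes this kind of output. It assumes one JSON object per non-empty stdout line and relies only on the fields visible above (`msgType`, `response.details.links`, `keyChain`, `childFrames`). The `summarize` helper and the trimmed `sample` messages are hypothetical and not part of gryffin.

```javascript
// Sketch: summarize render.js stdout, assuming one JSON message per non-empty line.
// Field names are taken from the output shown above; summarize() is a hypothetical helper.
function summarize(lines) {
  const summary = { extractedLinks: [], triggeredRequests: [] };
  for (const line of lines) {
    const text = line.trim();
    if (text === "") continue; // skip the blank separator lines
    const msg = JSON.parse(text);
    if (msg.msgType === "domSteady") {
      // Links the crawler reports from the rendered DOM.
      const details = (msg.response && msg.response.details) || {};
      for (const link of details.links || []) {
        summary.extractedLinks.push(link.url);
      }
    } else if (msg.msgType === "domChanged") {
      // Requests recorded after an event was triggered on an element.
      const keyChain = msg.keyChain || [];
      for (const frame of msg.childFrames || []) {
        summary.triggeredRequests.push({
          element: keyChain[keyChain.length - 1],
          method: frame.method,
          url: frame.url,
        });
      }
    }
  }
  return summary;
}

// Trimmed copies of the two messages above (most fields omitted for brevity).
const sample = [
  JSON.stringify({
    response: {
      url: "http://192.168.0.40:8899/pages/11.php",
      details: {
        links: [{ text: "click me", url: "http://192.168.0.40:8899/innerpages/11_1f2e4.php" }],
        forms: [],
      },
    },
    msgType: "domSteady",
  }),
  "",
  JSON.stringify({
    action: "element.triggered",
    events: ["click"],
    keyChain: ["root", "body/center[1]/a[2]"],
    childFrames: [{ method: "GET", url: "http://192.168.0.40:8899/innerpages/11_2d3ff.php" }],
    msgType: "domChanged",
  }),
];

console.log(JSON.stringify(summarize(sample), null, 2));
```

Read this way, the `domSteady` message carries the link to 11_1f2e4.php, while the `domChanged` message ties the request for 11_2d3ff.php to the element at `body/center[1]/a[2]`, which is the second anchor.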

My questions are:

1. Did you guys run gryffin against WIVET? Any results you can share?
2. Why is only one of the links clicked?
3. Why am I not seeing the HTTP request to 11_2d3ff.php in my proxy?
4. In the JSON printed to stdout I see:

   "details":{"links":[{"text":"click me","url":"http://192.168.0.40:8899/innerpages/11_1f2e4.php"}]

   If the crawler did not click on the second link, how was that URL extracted?
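One possible reading, not confirmed by the gryffin maintainers: the "body" field in the first message already shows `href="../innerpages/11_1f2e4.php"` on the first anchor, so the jQuery ready handler had run before the DOM was serialized, and that link could have been read from the rendered markup without any click. The Node.js sketch below illustrates the idea under that assumption. `extractLinks` is a hypothetical helper, the regular expression is a simplification that only handles double-quoted `href` attributes, and none of this is taken from gryffin's code.

```javascript
// Sketch: collect http(s) links from already-rendered HTML without clicking anything.
// extractLinks() is hypothetical; it is not how gryffin is known to work.
function extractLinks(renderedHtml, pageUrl) {
  const links = [];
  // Simplified: matches <a ... href="..."> with double-quoted values only.
  const hrefPattern = /<a\b[^>]*?\bhref="([^"]*)"/g;
  let match;
  while ((match = hrefPattern.exec(renderedHtml)) !== null) {
    const href = match[1];
    if (href === "") continue; // empty href, as in the original server response
    const resolved = new URL(href, pageUrl); // resolve relative to the page URL
    // javascript: URLs are not fetchable links; they need an event to run.
    if (resolved.protocol === "http:" || resolved.protocol === "https:") {
      links.push(resolved.href);
    }
  }
  return links;
}

const pageUrl = "http://192.168.0.40:8899/pages/11.php";

// The two anchors as sent by the server (first href is empty).
const beforeScripts = `
  <a id="link" href="" target="body">click me</a>
  <a href="javascript:window.open('../innerpages'+'/11_2d3ff.php', 'windowopen', 'resizable=yes,width=500,height=400');">click me 2</a>`;

// The same anchors as they appear in the "body" field of the first message.
const afterScripts = `
  <a id="link" href="../innerpages/11_1f2e4.php" target="body">click me</a>
  <a href="javascript:window.open('../innerpages'+'/11_2d3ff.php', 'windowopen', 'resizable=yes,width=500,height=400');">click me 2</a>`;

console.log(extractLinks(beforeScripts, pageUrl)); // no links yet
console.log(extractLinks(afterScripts, pageUrl));  // only the first anchor qualifies
```

Under this reading, the second anchor never yields a static link because its `href` is a `javascript:` URL, which would explain why its target only shows up after a click is triggered.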

Is there something I'm doing wrong? I'm using PhantomJS 2.1.1.

r-andrew-dev · 2021-02-12T21:05:37Z

The project is being archived.

andresriancho mentioned this issue Feb 2, 2017

Javascript crawler andresriancho/w3af#1796

r-andrew-dev closed this as completed Feb 12, 2021
