PhantomCurl

Python wrapper around the PhantomJS headless browser for advanced web scraping. In addition to the page content, it can record all requests and responses made by the web page and collect the content of all its iframes.

If used as a command line tool, it returns data in JSON format.

Installation

PhantomCurl is a wrapper around PhantomJS, so first install PhantomJS from the project's page.

The PhantomJS binary should be visible system-wide:

which phantomjs

If the binary is not visible system-wide, you should set the environment variable PHANTOMJS_BIN to point to the PhantomJS binary.
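For example (the path below is only an illustration; point it at your own PhantomJS binary):

export PHANTOMJS_BIN=/usr/local/bin/phantomjs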

Now, build and install the Python egg:

make && make install

Command line tool

You can use the script as a command line tool with:

python -mphantomcurl --help

The tool prints its results in JSON format.
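For example, to fetch a page and save the JSON output (the positional URL argument is an assumption here; run --help to confirm the actual options):

python -mphantomcurl http://example.com > result.json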

Returned values

fetch() returns a dictionary with the following fields (a short usage sketch follows the list):

url             - URL fed to the fetch function
requests        - all requests captured
responses       - all responses captured
content         - content of the web page
timestamps      - [start, end], in seconds
version         - version of the JS script
command_line    - command line arguments passed to the JS script
frames          - iframes found on the page; `frames` can contain other frames recursively
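A minimal usage sketch in Python. The import path and constructor are assumptions (check the package's own code for the exact entry point); only fetch() and the fields listed above are documented here:

# Minimal sketch -- the import path and constructor are assumptions;
# only fetch() and its returned fields are documented above.
from phantomcurl import PhantomCurl  # hypothetical entry point

pc = PhantomCurl()
result = pc.fetch('http://example.com')

print(result['url'])             # URL fed to fetch()
print(len(result['requests']))   # number of captured requests
print(len(result['responses']))  # number of captured responses
print(result['timestamps'])      # [start, end], in seconds
print(result['content'][:200])   # first 200 characters of the page content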

IFrames inspection

The script allows deep iframe inspection (the -f option). For each iframe it reports its src, id, and content. It then checks each frame for nested iframes and reports them recursively.
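A small sketch of walking the nested frames structure, using the result dictionary from the usage sketch above. It assumes each frame entry is a dict with src and a nested frames list, as described above; verify the exact keys against the JSON output:

# Recursively print the src of every iframe found on the page.
def walk_frames(frames, depth=0):
    for frame in frames or []:
        print('  ' * depth + (frame.get('src') or '<no src>'))
        walk_frames(frame.get('frames'), depth + 1)

walk_frames(result.get('frames'))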
