Skip to content
This repository

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

DEFUNCT: PHP Web Spider I started in 2011, accepts regex, css selectors

branch: master
Octocat-spinner-32 config First Commit April 28, 2012
Octocat-spinner-32 css First Commit April 28, 2012
Octocat-spinner-32 includes Fixing simple html dom location April 29, 2012
Octocat-spinner-32 js First Commit April 28, 2012
Octocat-spinner-32 screenshots screenshots April 28, 2012
Octocat-spinner-32 spider First Commit April 28, 2012
Octocat-spinner-32 README.md screenshots in readme April 28, 2012
Octocat-spinner-32 execute.php First Commit April 28, 2012
Octocat-spinner-32 index.php First Commit April 28, 2012
README.md

Spider Monkey

PHP Web Scraping Engine

I started this project in January 2011. It was going to be an easy to use web scraper that anyone could configure. It has an attractive GUI interface using jQuery UI elements.

Capture Screen

The selectors can be entered using three different methods; the first is the tried and true regular expression method. The second, which is the easiest and most powerful to use, is CSS selectors. Someone could visit the pages they want to scrape, use their developer console to debug some selectors, and then toss them into the engine. The third method is a simplified regex syntax I call asterisk, where the user enteres the start string and the end string and an asterisk in the middle.

Capture Screen

There is also an easy to use configuration screen for data storage. The engine was going to be smart enough to build different mysql tables and even build the relationshipts between the data. Or, data could be stored in XML or JSON documents, and a hierarchy would be maintained.

Capture Screen

But, I never finished the project.

This repo is a hodge podge of code, rather unorganized. I haven't been through it in a while, there might actually be two distinct code bases in here.

It is now released under the BSD license.

Something went wrong with that request. Please try again.