A web crawler that indexes relations between a user's different online profiles and ties them together into an identity graph. It's similar to a social graph, but instead of mapping relations between different identities it maps the relations between different representations of the same identity.
Things left to be done in the refactoring:
- Remove the added Bookshelf code and just go with pure Knex.js instead
- Ensure that the README is up to date
- There's likely going to be a GitHub milestone created for this, so check that one as well
- Crawls link-tags and a-tags that have a "me"-relation in HTML pages. The "me"-relation is defined by the mother of all Microformats – the XFN – and is widely supported by big sites like Twitter, Google+, GitHub, Lanyrd etc. (see the parsing sketch further below)
- RelSpider uses a combination of PostgreSQL and Neo4j to save all relations it finds.
- Thanks to Neo4j, RelSpider supports firing WebHooks when the graph of a certain identity has been fully crawled.
- Thanks to Neo4j, all relations are indexed as one-directional, and thus profiles A and B might be part of the same identity graph when a lookup is made on profile A, while they're not when the lookup is made on profile B.
- Only crawls sites scheduled for crawling through some of the methods in the API.
- RelSpider fully supports robots.txt to check whether it's allowed to index a page or not.
- Robots.txt files are cached for a day in either memory or Memcached.
- RelSpider throttles the number of requests made to each host so that a request is never made more often than once every 10 seconds - that way it avoids getting banned.
- Thanks to PostgreSQL and Memcached, multiple RelSpider workers can be spawned without them fetching the same pages multiple times. Using PostgreSQL, a worker always reserves a page for itself for 10 minutes prior to fetching it, and thanks to Memcached it doesn't have to refetch a robots.txt file if another worker has already fetched it within the last day.
- Supports a configurable number of parallel fetches and, whenever not all fetch slots are being utilized, scales down accordingly to go easy on the database.
- Modular - the crawler can be used separately from the API and the web, and one can easily replace those with one's own creations.
- Supports canonical links and permanent redirects - saves the original URLs as aliases of their targets and doesn't include the aliases in results
- Refreshes nodes in its index every 24 hours
- Nodes that haven't themselves been requested in the last week and that have no incoming relations are cleaned up and removed to keep the size of the index down
- Better recrawl mechanism
- Parse and save more interesting data from pages
- Investigate optional social graph parsing (not a primary focus, as social graphs are so interconnected that you easily end up crawling half of the internet)
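As a rough illustration of the "me"-relation crawling mentioned above, the sketch below extracts rel="me" targets from an already fetched HTML page. This is not RelSpider's actual implementation - it assumes the cheerio module and Node's built-in URL class, and is only meant to show the basic idea:

// Sketch only - not RelSpider's actual code. Assumes the cheerio module (npm install cheerio).
const cheerio = require('cheerio');

function findMeLinks (html, pageUrl) {
  const $ = cheerio.load(html);
  const targets = [];

  // Both <a rel="me"> and <link rel="me"> count as XFN "me"-relations
  $('a[rel~="me"], link[rel~="me"]').each(function () {
    const href = $(this).attr('href');
    if (href) {
      // Resolve relative links against the page they were found on
      targets.push(new URL(href, pageUrl).toString());
    }
  });

  return targets;
}

// Example: a profile page linking to another representation of the same identity
console.log(findMeLinks(
  '<a href="https://twitter.com/voxpelli" rel="me">Twitter</a>',
  'http://voxpelli.com/'
));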
RelSpider is built to work well with a Heroku-like setup and therefore uses Foreman
to start itself. First install Foreman if you don't have it installed already, then set the required RelSpider configuration as outlined below and lastly start RelSpider by typing:
foreman start
Running on Heroku is easy - you basically just push the code up there and you're off. You can read more about that in their general quick start guide and then their Node.js quick start guide.
To avoid having to configure anything it is recommended to use the PostgreSQL and GrapheneDB add-ons. It's also recommended to use the Memcache add-on - at least if you ever want to run more than one process.
This script can be run on Heroku for free on a small scale - even with all of the recommended add-ons added.
To configure Foreman locally, create a .env file in the top folder of RelSpider and add all the required options below as well as any optional ones you would like to use.
When used with Heroku it will work automatically if the recommended add-ons are used, but of course all configuration options can be specified there as well.
DATABASE_URL="postgres://foo@localhost/relspider" - how to connect to your PostgreSQL database. Provided by the PostgreSQL Heroku add-on.
NEO4J_URL - how to connect to your Neo4j database. Defaults to http://localhost:7474. Provided by the GrapheneDB Heroku add-on.
RELSPIDER_API_USER="foo" - used with RELSPIDER_API_PASS to lock down the API with HTTP Authentication. Default is to require no authentication.
RELSPIDER_API_PASS="bar" - see RELSPIDER_API_USER.
RELSPIDER_PARALLEL="30" - the number of parallel fetches per process; never will more fetches than these be made. Defaults to 30 parallel fetches.
RELSPIDER_CACHE="memcached" - if set to memcached then MemJS will be used for caching, see that module for additional configuration details. Defaults to a memory cache unless MEMCACHE_USERNAME (provided by the Memcache Heroku add-on) or MEMCACHIER_USERNAME (provided by the Memcachier Heroku add-on) is set - if either of them is set, MemJS is instead auto-configured to use them.
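As an example, a local .env file for Foreman might look something like the following. The values are just placeholders taken from the option descriptions above - adjust them to your own setup, and leave out anything you want to keep at its default:

DATABASE_URL="postgres://foo@localhost/relspider"
NEO4J_URL="http://localhost:7474"
RELSPIDER_API_USER="foo"
RELSPIDER_API_PASS="bar"
RELSPIDER_PARALLEL="30"
RELSPIDER_CACHE="memcached"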
Used to fetch the identity graph of a URL. If a URL isn't yet crawled then it will be scheduled to be so.
url - required - the URL to do the lookup on
callback - a URL, a "WebHook", to which to POST the resulting identity graph when it has been fully crawled. Only used if the identity graph isn't yet fully crawled. The format of the POSTed body is the same as the JSON in the response of this request.
HTTP 202 response if the identity graph isn't yet fully crawled, otherwise an HTTP 200 response with a JSON body like:
{
"url": "http://github.com/voxpelli",
"related": [
"http://twitter.com/voxpelli",
"http://github.com/voxpelli",
"http://voxpelli.com/",
"http://kodfabrik.se/"
],
"incomplete": true
}
The url key in the response shows the URL that the lookup was made on. The related key includes the full identity graph, including the URL used in the lookup. The incomplete key is sometimes included - it then shows that pages have been found in the graph that RelSpider for some reason hasn't been able to crawl and that therefore the graph might not show its true extent.
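For illustration, a lookup could be made from Node as in the sketch below. This is not taken from RelSpider's documentation - it assumes a local instance listening on port 5000 and that /api/lookup accepts its parameters in the query string, so adjust it to your actual deployment (requires Node 18+ for the built-in fetch):

// Sketch only - assumes a local RelSpider instance on port 5000 and that /api/lookup
// takes its parameters as query string parameters.
(async () => {
  const endpoint = 'http://localhost:5000/api/lookup?' + new URLSearchParams({
    url: 'http://github.com/voxpelli'
    // callback: 'http://example.com/relspider-hook' - optional WebHook to POST the finished graph to
  });

  const response = await fetch(endpoint);

  if (response.status === 202) {
    console.log('Identity graph not fully crawled yet - retry later or use the callback parameter');
  } else {
    const graph = await response.json();
    console.log('Identity graph for', graph.url, ':', graph.related, 'incomplete:', !!graph.incomplete);
  }
})();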
Used to schedule a site for crawling. Often you want /api/lookup instead.
url - required - the URL to schedule for crawling.
HTTP 202 with a message of success!
Used to force the refresh of a node. Useful when debugging, when you don't want to wait 24 hours for the next scheduled refresh. Won't ever refresh more often than once a minute though.
url - required - the URL to schedule for refresh.
HTTP 202 with a message of success!
MIT http://voxpelli.mit-license.org
Sometimes there is an open demo up and running on a free Heroku instance with all the above recommended add-ons: http://relspider.herokuapp.com/
Big refactoring, among other things:
- Separated the logic into more files
- Moved from the pg library to the knex library for interacting with Postgres
- Added some simple tests
- Added some linting
New features:
- Added throttling based on IP in addition to host names
- Experimental social graph indexing and feed finding