RelSpider

A web crawler that indexes relations between a user's different profiles online and ties them together into an identity graph. An identity graph is similar to a social graph, but instead of mapping relations between different identities it maps the relations between different representations of the same identity.

Tech stuff (or just a big huge list of different cool aspects of this script)

  • Crawls link tags and a tags with a "me" relation in HTML pages (see the sketch after this list). The "me" relation is defined by the mother of all microformats – XFN – and is widely supported by big sites like Twitter, Google+, GitHub, Lanyrd etc.
  • RelSpider uses a combination of PostgreSQL and Neo4j to save all relations it finds.
  • Thanks to Neo4j, RelSpider supports firing WebHooks when the graph of an identity has been fully crawled.
  • Thanks to Neo4j, all relations are indexed as one-directional, so profiles A and B might be part of the same identity graph when a lookup is made on profile A, but not when the lookup is made on profile B.
  • Only crawls sites that have been scheduled for crawling through one of the API methods.
  • RelSpider fully respects robots.txt, checking whether it's allowed to index a page or not.
  • Robots.txt files are cached for a day in either memory or Memcached.
  • RelSpider throttles the requests made to each host so that a host is never hit more often than once every 10 seconds - that way it keeps well away from being banned.
  • Thanks to PostgreSQL and Memcached, multiple RelSpider workers can be spawned without them going nuts and fetching the same pages multiple times. Using PostgreSQL, a worker always reserves a page for itself for 10 minutes prior to fetching it, and thanks to Memcached it doesn't have to refetch a robots.txt file if another worker has already fetched it within the last day.
  • Supports a configurable number of parallel fetches, and whenever all fetch slots aren't being utilized it scales down accordingly to go easy on the database.
  • Modular - the crawler can be used separately from the API and web parts, and one can easily replace those with one's own creations.
  • Supports canonical links and permanent redirects - saves the original URLs as aliases of their targets and doesn't include the aliases in results.
  • Refreshes nodes in its index every 24 hours
  • Nodes that haven't themselves been requested in the last week and that have no incoming relations are cleaned up and removed to keep the size of the index down.
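
To illustrate what the crawler looks for, here is a minimal sketch of extracting "me" relations from a fetched HTML page. This is not the actual RelSpider implementation - the cheerio module and the extractMeRelations function are assumptions used for illustration only:

// Sketch only: pull out rel="me" targets from an HTML page.
// cheerio and the function name are illustrative assumptions, not RelSpider internals.
var cheerio = require('cheerio');
var urlModule = require('url');

function extractMeRelations(html, pageUrl) {
  var $ = cheerio.load(html);
  var targets = [];

  // Both <a rel="me" href="..."> and <link rel="me" href="..."> count,
  // and the rel attribute may contain several space-separated values.
  $('a[rel~="me"], link[rel~="me"]').each(function () {
    var href = $(this).attr('href');
    if (href) {
      // Resolve relative URLs against the page they were found on
      targets.push(urlModule.resolve(pageUrl, href));
    }
  });

  return targets;
}

// Example: a Twitter profile page pointing back at a personal site
var html = '<a rel="me nofollow" href="http://voxpelli.com/">My website</a>';
console.log(extractMeRelations(html, 'http://twitter.com/voxpelli'));
// => [ 'http://voxpelli.com/' ]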

Roadmap

Currently not sure what the next step will be here. I will perhaps join up with a few others to make this into a properly hosted service, or I will see it adopted, maintained and hosted by others in the community. I myself will now start studying at university and will thus, at least in the short term, have no time to invest heavily in this all by myself.

Run it

Locally

RelSpider is built to work well with a Heroku-like setup and therefore uses Foreman to start itself. First install Foreman if you haven't already got it installed, then set the required RelSpider configuration as outlined below and lastly start RelSpider by typing:

foreman start

On Heroku

Running on Heroku is easy - you basically just push the code up there and you're off. You can read more about that in their general quick start guide and then their Node.js quick start guide.

To avoid having to configure anything it is recommended to use the PostgreSQL and Neo4j add-ons. It's also recommended to use the Memcache add-on - at least if you ever want to run more than one process.

This script can be run on Heroku for free at a small scale - even with all the recommended add-ons added.

Configuration

To configure Foreman locally, create a .env file in the top folder of RelSpider and add all required options below as well as any optional ones you would like to use (an example .env is shown after the option lists).

When run on Heroku it will work automatically if the recommended add-ons are used, but of course all configuration options can be specified there as well.

Required

  • DATABASE_URL="postgres://foo@localhost/relspider" - how to connect to your PostgreSQL database. Provided by PostgreSQL Heroku Add-on.

Optional

  • NEO4J_URL - how to connect to your Neo4j database. Defaults to http://localhost:7474. Provided by Neo4j Heroku Add-on.
  • RELSPIDER_API_USER="foo" - used with RELSPIDER_API_PASS to lock down the API with HTTP Authentication. Default is to require no authentication.
  • RELSPIDER_API_PASS="bar" - see RELSPIDER_API_USER
  • RELSPIDER_PARALLEL="30" - the maximum number of parallel fetches per process; no more fetches than this will ever be made at the same time. Defaults to 30 parallel fetches.
  • RELSPIDER_CACHE="memcached" - if set to memcached then MemJS will be used for caching; see that module for additional configuration details. Defaults to an in-memory cache unless MEMCACHE_USERNAME (provided by the Memcache Heroku add-on) or MEMCACHIER_USERNAME (provided by the Memcachier Heroku add-on) is set - if either of them is set, MemJS is instead auto-configured to use it.
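
As an example, a local .env file could look something like this (all values are placeholders - adjust them to your own setup):

DATABASE_URL="postgres://foo@localhost/relspider"
NEO4J_URL="http://localhost:7474"
RELSPIDER_API_USER="foo"
RELSPIDER_API_PASS="bar"
RELSPIDER_PARALLEL="30"
RELSPIDER_CACHE="memcached"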

API methods

/api/lookup

Used to fetch the identity graph of a URL. If the URL hasn't been crawled yet, it will be scheduled for crawling.

Parameters

  • url - required. The URL to do the lookup on.
  • callback - a URL, a "WebHook", to which the resulting identity graph is POSTed when it has been fully crawled (see the sketch after this list). Only used if the identity graph isn't yet fully crawled. The format of the POSTed body is the same as the JSON in the response of this request.
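
To illustrate the callback mechanism, a minimal WebHook receiver could look like the sketch below. Express, the port and the /relspider-callback path are assumptions used for illustration - all RelSpider needs is an endpoint that accepts the POSTed JSON body:

// Sketch only: a tiny endpoint that receives the identity graph POSTed by RelSpider.
// Express, the port and the path are illustrative assumptions, not part of RelSpider.
var express = require('express');
var app = express();

// Assumes the callback body is sent as application/json
app.use(express.json());

app.post('/relspider-callback', function (req, res) {
  // req.body has the same shape as the JSON response documented below,
  // e.g. { url: '...', related: [...], incomplete: true }
  console.log('Got identity graph for', req.body.url, 'with', req.body.related.length, 'profiles');
  res.sendStatus(200);
});

app.listen(3000);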

Response

An HTTP 202 response if the identity graph isn't yet fully crawled, otherwise an HTTP 200 response with a JSON body like:

{
  "url": "http://github.com/voxpelli",
  "related": [
    "http://twitter.com/voxpelli",
    "http://github.com/voxpelli",
    "http://voxpelli.com/",
    "http://kodfabrik.se/"
  ],
  "incomplete": true
}

The url key in the response shows the URL that the lookup was made on. The related key includes the full identity graph, including the URL used in the lookup. The incomplete key is sometimes included - it then shows that there are pages in the graph that RelSpider for some reason hasn't been able to crawl and that therefore the graph might not show its true extent.
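
As a usage sketch, the snippet below asks a RelSpider instance for the identity graph of a profile URL and handles the two documented response codes. The base URL and the use of a GET request with a url query-string parameter are assumptions about the deployment - adjust them to your own instance (and add an Authorization header if RELSPIDER_API_USER/RELSPIDER_API_PASS are configured):

// Sketch only: the base URL and query-string format are assumptions about the deployment.
var http = require('http');

var base = 'http://localhost:5000'; // assumed local Foreman port
var lookupUrl = base + '/api/lookup?url=' + encodeURIComponent('http://github.com/voxpelli');

http.get(lookupUrl, function (res) {
  var body = '';
  res.on('data', function (chunk) { body += chunk; });
  res.on('end', function () {
    if (res.statusCode === 202) {
      // Not fully crawled yet - poll again later or supply a callback parameter instead
      console.log('Identity graph not complete yet, try again in a while');
    } else if (res.statusCode === 200) {
      var graph = JSON.parse(body);
      console.log('Profiles in the identity graph:', graph.related);
    }
  });
});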

/api/add

Used to schedule a site for crawling. Often you want /api/lookup instead.

Parameters

  • url - required. The URL to schedule for crawling.

Response

HTTP 202 with a message of success!

/api/refresh

Used to force the refresh of a node. Useful when debugging, when you don't want to wait 24 hours for the next scheduled refresh. Won't ever refresh more often than once every minute though.

Parameters

  • url - required. The URL to schedule for refresh.

Response

HTTP 202 with a message of success!

License

MIT http://voxpelli.mit-license.org

Demo

For the moment there is an open demo up and running on a free Heroku instance with all the recommended add-ons mentioned above: http://relspider.herokuapp.com/
