
URL Crawler

A simple web crawler.

Getting Started

Given a starting URL, it will start crawling and create a simple text sitemap in JSON format, showing the links between URLs. It is intended to be run using AWS serverless services, and the output will be uploaded to an S3 bucket.
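The exact sitemap schema isn't documented here, but as a purely illustrative sketch, an output mapping each crawled URL to the links found on it might look like:

{
  "http://books.toscrape.com": [
    "http://books.toscrape.com/index.html",
    "http://books.toscrape.com/catalogue/page-2.html"
  ]
}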

Prerequisites

Python 3

  • The crawler uses Scrapy, a free and open-source web-crawling framework written in Python.

AWS Account

  • The crawler is intended to be run in AWS using serverless services.

Node.js

  • The crawler is intended to be run using AWS serverless services (API Gateway, Lambda, Step Functions, S3). Deployment is done via Serverless Framework which can be installed as a node package.

Serverless Framework

  • Installed globally as an npm package (see Installation and usage below).

Serverless Framework plugins:

  • serverless-python-requirements
  • serverless-pseudo-parameters

Installation and usage

  • Get Python 3; this was tested against 3.7, but any 3.x should do.
  • Install the Serverless Framework: npm install -g serverless
  • Change directory to the project folder and install the Serverless plugins:
    - sls plugin install -n serverless-python-requirements
    - sls plugin install -n serverless-pseudo-parameters
  • Deploy to AWS: sls deploy

Once deployed, sls will output a URL similar to:

Serverless StepFunctions OutPuts
endpoints:
  POST - https://7hy0y72rb1.execute-api.eu-west-1.amazonaws.com/dev/startCrawl

Use the URL provided and post a payload similar to the ones below. Simple payload:

{
  "spiderConfig": {
    "url": "http://books.toscrape.com"
  }
}
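For example, you can POST the simple payload with a short Python script (a minimal sketch; the endpoint URL is a placeholder, so replace it with the one from your own sls deploy output):

import requests

# Placeholder endpoint; use the URL printed by `sls deploy`.
endpoint = "https://<api-id>.execute-api.<region>.amazonaws.com/dev/startCrawl"
payload = {"spiderConfig": {"url": "http://books.toscrape.com"}}

# Send the payload and fail loudly on a non-2xx response.
response = requests.post(endpoint, json=payload)
response.raise_for_status()
print(response.status_code, response.text)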

More complex example:

{
  "spiderConfig": {
    "url": "http://books.toscrape.com",
    "dry_run": "no",
    "scrapy_settings": {
      "LOG_LEVEL": "ERROR",
      "CONCURRENT_ITEMS": "400",
      "CONCURRENT_REQUESTS": "64",
      "CONCURRENT_REQUESTS_PER_DOMAIN": "32",
      "CONCURRENT_REQUESTS_PER_IP": "0",
      "DNSCACHE_ENABLED": "True"
    }
  }
}
  • Required: url, the URL to be used as a starting point.
  • Optional: dry_run, whether the spider should do a dry run. Values: yes or no (defaults to no if not set).
  • Optional: scrapy_settings can be used to set spider settings. For a list of possible settings, see Scrapy's Built-in settings reference.
  • Optional: spiderConfig.scrapy_settings.FEED_URI can override the S3 location the results are uploaded to. Defaults to s3://url-crawler-#{AWS::AccountId} (you can use /tmp/results.json when testing locally); see the example payload below.
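For instance, to redirect the results to a different location (a sketch based on the options above; my-custom-bucket is a placeholder):

{
  "spiderConfig": {
    "url": "http://books.toscrape.com",
    "scrapy_settings": {
      "FEED_URI": "s3://my-custom-bucket/results.json"
    }
  }
}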

The results will be uploaded to an S3 bucket named url-crawler-#{AWS::AccountId}.
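Once a crawl has finished, you can fetch the results with boto3 (a minimal sketch; the object key results.json is hypothetical and depends on the FEED_URI in use):

import boto3

# Derive the bucket name from the current account,
# matching the default url-crawler-#{AWS::AccountId}.
account_id = boto3.client("sts").get_caller_identity()["Account"]
bucket = "url-crawler-" + account_id

# "results.json" is a hypothetical key; list the bucket to find the actual file.
s3 = boto3.client("s3")
s3.download_file(bucket, "results.json", "results.json")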

Running it locally

To run it locally you'll have to:
- virtualenv venv --python=python3
- source venv/bin/activate
- pip install Scrapy
- sls invoke local -f crawl -p payload.json

There is a payload.json file to use as an example; adjust it as you see fit. Results will be sent to the FEED_URI set in payload.json.
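A payload.json for local testing might look like the following (a sketch; the file shipped with the repository may differ), pointing FEED_URI at a local file as mentioned above:

{
  "spiderConfig": {
    "url": "http://books.toscrape.com",
    "scrapy_settings": {
      "FEED_URI": "/tmp/results.json"
    }
  }
}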

Deployment

Just run sls deploy, adding parameters if needed (e.g. -r region -s stage).
