Scrapy project to scrape public web directories (educational)
This is a Scrapy project to scrape websites from public web directories.

This project is only meant for educational purposes.


The items scraped by this project are websites, and the item is defined in the class:


See the source code for more details.


This project contains one spider called dmoz that you can see by running:

scrapy list

Spider: dmoz

The dmoz spider scrapes the Open Directory Project, and it's based on the dmoz spider described in the Scrapy tutorial.

This spider doesn't crawl the entire site but only a few pages by default (defined in the start_pages attribute). These pages are:

So, if you run the spider the usual way (with scrapy crawl dmoz), it will scrape only those two pages. However, you can scrape any other page of the site by passing its URL instead of the spider name. Scrapy internally resolves the spider to use by looking at the allowed domains of each spider.

For example, to scrape a different URL use:

scrapy crawl

You can scrape any URL from the Open Directory Project site using this spider.
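To make the idea of resolving a spider from a URL more concrete, here is a minimal sketch of matching a URL against each spider's allowed domains. It is not Scrapy's actual implementation: the helper functions, the OtherSpider class, and the placeholder domains (directory.example, other.example) are all assumptions made for illustration, and the spider classes are plain stand-ins that carry only a name and an allowed_domains list.

```python
from urllib.parse import urlparse


class DmozSpider:
    # Stand-in for the project's spider; the domain is a hypothetical placeholder.
    name = "dmoz"
    allowed_domains = ["directory.example"]


class OtherSpider:
    # Hypothetical second spider, present only to show that it is not selected.
    name = "other"
    allowed_domains = ["other.example"]


def url_matches_domains(url, domains):
    """Return True if the URL's host equals, or is a subdomain of, any domain."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in domains)


def find_spiders_for_url(url, spiders):
    """Return the names of all spiders whose allowed domains cover the URL."""
    return [s.name for s in spiders if url_matches_domains(url, s.allowed_domains)]


if __name__ == "__main__":
    spiders = [DmozSpider, OtherSpider]
    print(find_spiders_for_url("http://www.directory.example/Some/Category/", spiders))
```

Matching on the host name with a leading dot, rather than a plain substring test, keeps an unrelated host such as notdirectory.example from being treated as part of directory.example.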


This project uses a pipeline to filter out websites containing certain forbidden words in their description. This pipeline is defined in the class:


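As an illustration of how such a filter could work, here is a minimal sketch of a pipeline that drops items whose description contains a forbidden word. It is not the project's actual class: the name FilterWordsSketch and the word list are hypothetical, and a local DropItem exception stands in for scrapy.exceptions.DropItem so that the sketch runs without Scrapy installed. The process_item(item, spider) method follows the interface Scrapy expects from item pipelines.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem (assumption, for a Scrapy-free demo)."""


class FilterWordsSketch:
    """Hypothetical pipeline: rejects items whose description has a forbidden word."""

    # Illustrative word list; the project's real list is not shown in this README.
    forbidden_words = ["politics", "religion"]

    def process_item(self, item, spider):
        # Compare case-insensitively; a missing description is treated as empty.
        description = str(item.get("description", "")).lower()
        for word in self.forbidden_words:
            if word in description:
                raise DropItem("Contains forbidden word: %s" % word)
        # Items that pass the check are returned unchanged for later pipelines.
        return item


if __name__ == "__main__":
    pipeline = FilterWordsSketch()
    print(pipeline.process_item({"description": "A book about Python"}, spider=None))
```

In a real Scrapy project the pipeline would raise scrapy.exceptions.DropItem instead, and it would take effect only once listed in the ITEM_PIPELINES setting.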