This custom spider scrapes and lists all the URLs from a website, given its base URL.
- Install Scrapy: https://docs.scrapy.org/en/latest/intro/install.html (a typical install command is shown after this list).
- Clone the repo.
- Go through the item definition (`/items.py`) and the spider (`/url_extractor/spiders/url_extractor.py`). Both files should be straightforward; a hedged sketch of how they might fit together appears after this list.
- I have scraped 3 websites and saved their respective URLs into 3 separate CSV files. You can find the scraped result data in the `/output` folder.
- Open a terminal.
- Go to the `url_extractor` directory.
- Run `scrapy crawl urlextractor -a start_urls="http://fundrazr.com/,https://www.anandabazar.com,https://www.data-blogger.com/" -o output/links_multi.csv -t csv`. By default the extracted results are stored in `links_multi.csv` in the `/output` directory. You can change the CSV file name and location at will.
- This spider accepts multiple comma-separated links through a single argument named `start_urls` and extracts the list of URLs from all of them in one go (see the sketch below).
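On most setups, installing Scrapy comes down to a single pip command; the linked guide covers platform-specific details:

```bash
pip install scrapy
```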
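For orientation, here is a minimal sketch of how the item and the spider might fit together. This is not the repository's exact code: the `UrlItem` class, its `link` field, and the use of `LinkExtractor` are illustrative assumptions; only the spider name `urlextractor` and the `start_urls` argument are taken from the command above.

```python
import scrapy
from scrapy.linkextractors import LinkExtractor


class UrlItem(scrapy.Item):
    # Hypothetical item with a single field; the real /items.py may differ.
    link = scrapy.Field()


class UrlExtractorSpider(scrapy.Spider):
    name = "urlextractor"  # matches the `scrapy crawl urlextractor` command

    def __init__(self, start_urls="", *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Scrapy passes `-a start_urls="..."` to the constructor as one
        # string, so it is split on commas into a list of seed URLs.
        self.start_urls = [u.strip() for u in start_urls.split(",") if u.strip()]

    def parse(self, response):
        # Extract every link on the page and yield one item per URL, which
        # the `-o ... -t csv` flags then serialize into the output CSV.
        for link in LinkExtractor().extract_links(response):
            item = UrlItem()
            item["link"] = link.url
            yield item
```

The key detail is the `__init__` override: the `-a` flag delivers `start_urls` as a single comma-separated string, so the spider splits it itself before Scrapy starts crawling.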
Please see the changelog to learn about the updates.