This custom spider scrapes and lists all the URLs from a website, given its base URL.
- Install Scrapy: https://docs.scrapy.org/en/latest/intro/install.html (a typical install command is shown after this list).
- Clone the repo.
- Go through the item definition (`/items.py`) and the spider (`/url_extractor/spiders/url_extractor.py`). Both files should be straightforward; a hedged sketch of how they might fit together appears after this list.
- I have scraped 3 websites and saved their respective URLs into 3 separate CSV files. You can find the scraped result data in the `/output` folder.
- Open a terminal.
- Go to the `url_extractor` directory.
- Run `scrapy crawl urlextractor -a start_urls="http://fundrazr.com/,https://www.anandabazar.com,https://www.data-blogger.com/" -o output/links_multi.csv -t csv`. By default the extracted results are stored in `links_multi.csv` in the `/output` directory. You can change the CSV file name and location at will.
- This spider accepts multiple comma-separated links through a single argument named `start_urls` and extracts the list of URLs from all of them in one go (see the sketch below).
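On most setups, installing Scrapy comes down to a single pip command; the linked guide covers platform-specific details:

```bash
pip install scrapy
```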
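For orientation, here is a minimal sketch of how the item and the spider might fit together. This is not the repository's exact code: the `UrlItem` class, its `link` field, and the use of `LinkExtractor` are illustrative assumptions; only the spider name `urlextractor` and the `start_urls` argument are taken from the command above.

```python
import scrapy
from scrapy.linkextractors import LinkExtractor


class UrlItem(scrapy.Item):
    # Hypothetical item with a single field; the real /items.py may differ.
    link = scrapy.Field()


class UrlExtractorSpider(scrapy.Spider):
    name = "urlextractor"  # matches the `scrapy crawl urlextractor` command

    def __init__(self, start_urls="", *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Scrapy passes `-a start_urls="..."` to the constructor as one
        # string, so it is split on commas into a list of seed URLs.
        self.start_urls = [u.strip() for u in start_urls.split(",") if u.strip()]

    def parse(self, response):
        # Extract every link on the page and yield one item per URL, which
        # the `-o ... -t csv` flags then serialize into the output CSV.
        for link in LinkExtractor().extract_links(response):
            item = UrlItem()
            item["link"] = link.url
            yield item
```

The key detail is the `__init__` override: the `-a` flag delivers `start_urls` as a single comma-separated string, so the spider splits it itself before Scrapy starts crawling.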
Please see the changelog to learn about the updates.