...add your introduction...
- Keyword-based Searches: The spider can be configured to search based on a list of custom keywords (see the sketch after this list).
- Extensive Data Collection: Captures a wide range of data, including ..., and more.
- Proxies & Headless Browsing: Built-in support for using ScrapeOps or custom proxies, as well as Firefox for headless browsing.
- Pagination Support: Automatically navigates through pages up to a specified limit.
- Output: Stores the scraped data in a structured JSON format (or optionally CSV).
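As referenced in the keyword feature above, here is a minimal sketch of how a keyword list might be wired into the spider. The class name, attribute name, and search URL are hypothetical; check `spiders/spider_name.py` for the actual implementation.

```python
# spiders/spider_name.py -- hypothetical sketch, not the template's exact code
import scrapy


class SpiderName(scrapy.Spider):
    name = "spider_name"
    keywords = ["first keyword", "second keyword"]  # your custom keywords

    def start_requests(self):
        for keyword in self.keywords:
            # one search request per keyword (URL pattern is an assumption)
            yield scrapy.Request(f"https://example.com/search?q={keyword}")

    def parse(self, response):
        ...  # extraction logic lives here
```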
- Python 3.x
- `chromedriver.exe` in the same directory as `scrapy.cfg` (and the Google Chrome browser)
- Proxy credentials or an API key, e.g. Bright Data or ScrapeOps (make sure to use a service that renders JavaScript)
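If you are unsure about the driver placement, here is a quick hypothetical sanity check you can run from the project root:

```python
# hypothetical check that chromedriver.exe sits next to scrapy.cfg
from pathlib import Path

root = Path.cwd()
assert (root / "scrapy.cfg").exists(), "run this from the project root"
assert (root / "chromedriver.exe").exists(), "chromedriver.exe is missing"
```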
First, clone the repository:

```bash
git clone https://github.com/emilrueh/scrapy-template
```

Navigate to the project directory and install the required packages:

```bash
cd mine
pip install -r requirements.txt
```
- Set up proxies by using `.env.template` and adding your proxy credentials.
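A minimal sketch of how those credentials could then be loaded, assuming python-dotenv is used and with a hypothetical key name (the real keys are listed in `.env.template`):

```python
# settings.py -- sketch; the key name SCRAPEOPS_API_KEY is an assumption
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the project root
SCRAPEOPS_API_KEY = os.getenv("SCRAPEOPS_API_KEY")
```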
- Choose JSON, CSV, or both by setting the feeds in the spider or in your `settings.py` (if you use `settings.py`, disable the feed settings in the spider):
```python
# settings.py
FEEDS = {
    "output.json": {"format": "json", "overwrite": True},
    "backup.csv": {"format": "csv", "overwrite": False},
}
```
```python
# spiders/spider_name.py
custom_settings = {
    "FEEDS": {
        "data.json": {"format": "json", "overwrite": True},
        # "data.csv": {"format": "csv", "overwrite": True},
    },
    ...
}
```
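Note that Scrapy gives a spider's `custom_settings` precedence over the project-wide `settings.py`, which is why only one of the two should define `FEEDS`.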
- Set the proxy country to keep the package prices in a uniform currency:
```python
# for scrapeops
SCRAPEOPS_PROXY_SETTINGS = {"country": "us", "render_js": True}

# or for brightdata
proxy_country = "us"
```
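With Bright Data, the country is typically encoded in the proxy username (e.g. a `-country-us` suffix), while ScrapeOps takes it directly via `SCRAPEOPS_PROXY_SETTINGS`; check your provider's dashboard for the exact format.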
- Make sure to manually check how many pages of results your keyword returns and enter that number in the spider's pagination settings!
```python
# line 178 in spider_name.py
if self.page <= 20:
```
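For context, here is a hedged sketch of how such a guard usually drives the next request in a Scrapy spider; `next_page_url` and the exact increment are assumptions, not the template's code:

```python
# sketch of the usual pagination pattern; next_page_url is hypothetical
if self.page <= 20:  # replace 20 with the page count for your keyword
    self.page += 1
    yield scrapy.Request(next_page_url, callback=self.parse)
```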
Navigate to the `mine` directory (`cd mine`) and run:

```bash
scrapy crawl spider_name
```
This will generate a JSON (and/or CSV) file in the project directory containing the scraped data.
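You can also override the configured feeds from the command line, e.g. `scrapy crawl spider_name -O data.json` (Scrapy's `-O` flag overwrites the output file, while `-o` appends).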
If you'd like to contribute, please fork the repository and make changes as you'd like. Pull requests are welcome.
This project is licensed under the MIT License - see the LICENSE.md file for details.