# Scrapy

Scrapy is a Python framework for large-scale web scraping. It gives you the tools to efficiently extract data from websites, process it as you want, and store it in your preferred structure and format. It is used for a wide range of applications, from data mining and data processing to historical archival, text mining, and information retrieval.

You can find the documentation here: http://doc.scrapy.org/en/latest/

NOTE: You may find Scrapy easier to work with than the other approaches listed below.

In summary, we have three general approaches to web scraping:
* use `requests` and BeautifulSoup
* use `requests` and regex
* use Scrapy
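
To make the contrast concrete, here is a minimal sketch of the regex approach using only the standard library. The HTML string is a made-up stand-in for what a call such as `requests.get(url).text` would return, and `extract_quotes` is a hypothetical helper introduced only for illustration.

```python
import re

# Hypothetical sample markup standing in for a downloaded page body
# (in practice this would come from something like requests.get(url).text).
SAMPLE_HTML = """
<div class="quote"><span class="text">First quote</span></div>
<div class="quote"><span class="text">Second quote</span></div>
"""

# Non-greedy match on the contents of each <span class="text"> element.
QUOTE_PATTERN = re.compile(r'<span class="text">(.*?)</span>', re.DOTALL)


def extract_quotes(html: str) -> list[str]:
    """Return the text of every quote span found in the HTML string."""
    return [match.strip() for match in QUOTE_PATTERN.findall(html)]


if __name__ == "__main__":
    print(extract_quotes(SAMPLE_HTML))
```

A pattern like this is quick to write for small, predictable pages, but it tends to break when the markup changes, which is one reason a parser or a framework such as Scrapy is often the more robust choice.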

### Scrapy Project Tutorial

### 1. Create a New Scrapy Project

Open a terminal and run the following:

```bash
scrapy startproject tutorial
```

### 2. Create a Scrapy Spider

Create a new Python file called `quotes_spider.py` inside the project's `tutorial/spiders` folder and add the following code:

```python
from pathlib import Path

import scrapy


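class QuotesSpider(scrapy.Spider):
    # Unique name used to launch the spider with `scrapy crawl quotes`.
    name = "quotes"

    def start_requests(self):
        # Yield the initial requests; each response is passed to self.parse.
        urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The URL ends in ".../page/<n>/", so the second-to-last segment is the page number.
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        # Save the raw response body to a local HTML file.
        Path(filename).write_bytes(response.body)
        self.log(f"Saved file {filename}")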
```
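
The `parse` callback derives each file name from the response URL. As a quick illustration of that one step outside Scrapy, the sketch below applies the same string operations to the two start URLs. `filename_for` is a hypothetical helper written for this example, not part of Scrapy.

```python
def filename_for(url: str) -> str:
    """Mirror the spider's naming logic: use the second-to-last URL segment."""
    # "https://quotes.toscrape.com/page/1/".split("/") ends with
    # [..., "page", "1", ""], so index -2 is the page number.
    page = url.split("/")[-2]
    return f"quotes-{page}.html"


if __name__ == "__main__":
    for url in (
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
    ):
        print(filename_for(url))
```

This relies on the trailing slash in the URL. Without it, index `-2` would pick up `page` instead of the page number.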


### 3. Run the Spider

Go to your terminal. Make sure you are in the top-level folder of the Scrapy project (in this case, `tutorial`), then run the following command:
    
```bash
scrapy crawl quotes
```

For more information on Scrapy, see the [Scrapy Tutorial](https://docs.scrapy.org/en/latest/intro/tutorial.html).