# Lecture 10-2

# Webscraping with scrapy

## Week 10 Wednesday

## Miles Chen, PhD

# Resources

- Scrapy documentation: https://docs.scrapy.org/en/latest/
- Scrapy overview: https://docs.scrapy.org/en/latest/intro/overview.html
- Scrapy tutorial: https://docs.scrapy.org/en/latest/intro/tutorial.html

- GitHub repository from the Scrapy workshop at PyCon US 2024: https://github.com/rennerocha/pyconus2024-tutorial

## Installing Scrapy

From the command line, run:

```
conda install -c conda-forge scrapy
```

## Beginning a project

Scrapy is controlled through the `scrapy` command-line tool.

First, you will have to set up a new Scrapy project. 

From the command line, run:
```
scrapy startproject scrapydemo
```

Here, `scrapydemo` is the name of your Scrapy project.

After running this command, Scrapy will create a folder called `scrapydemo` containing files and additional folders.

### Scrapy Project Structure

The main components of a Scrapy project:

+ `spiders/`: Directory containing spider definitions.
+ `items.py`: Define data structures.
+ `middlewares.py`: Handle requests and responses.
+ `pipelines.py`: Post-process data (e.g., save to database).
+ `settings.py`: Project settings and configurations.
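
To make the role of `pipelines.py` more concrete, here is a minimal sketch of an item pipeline. The class name `CleanQuotePipeline`, the sample item, and the cleaning rule are hypothetical and chosen only for illustration. The part Scrapy relies on is a `process_item(self, item, spider)` method that returns the item.

```python
# pipelines.py (sketch): CleanQuotePipeline is a hypothetical name,
# not something the generated project contains.


class CleanQuotePipeline:
    """Strip surrounding whitespace and curly quotation marks from the 'text' field."""

    def process_item(self, item, spider):
        # Assumes items are dictionaries with an optional 'text' key.
        text = item.get("text")
        if text is not None:
            item["text"] = text.strip().strip("\u201c\u201d")
        # Returning the item passes it on to the next pipeline or the exporter.
        return item


# Outside Scrapy, the pipeline can be tried on a plain dictionary (made-up data).
pipeline = CleanQuotePipeline()
cleaned = pipeline.process_item(
    {"text": " \u201cAn example quote.\u201d ", "author": "Example Author"},
    spider=None,
)
print(cleaned["text"])
```

To enable a pipeline, list it in the `ITEM_PIPELINES` setting in `settings.py`, for example `ITEM_PIPELINES = {"scrapydemo.pipelines.CleanQuotePipeline": 300}`. The number controls the order in which pipelines run.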

Inside the `scrapydemo` folder, you will find another folder also called `scrapydemo`, and inside that, a folder called `spiders`. You will write your Python spider scripts inside the `spiders` folder.

### Using Scrapy

To use scrapy, you will need to create a spider.

Change into the project directory you created (here, `scrapydemo`).

Run the command to create a spider script:

`scrapy genspider <spidername> <domain to scrape>`

We will scrape the site at http://quotes.toscrape.com and call our spider "quotes".

```
scrapy genspider quotes quotes.toscrape.com
```

## Edit the spider file

Now modify the spider file.

A spider class must have a `parse` method. Scrapy calls it with the response object, which holds the HTML returned by the server.

Within the `parse` method, there should be a `yield` which produces a dictionary that defines the content to be stored.
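
If `yield` is unfamiliar, this small sketch (plain Python, no Scrapy needed) shows the idea. The function name `parse_rows` and the sample rows are made up for illustration. A function containing `yield` is a generator: it hands back one dictionary at a time instead of returning a single value and exiting.

```python
def parse_rows(rows):
    """Yield one dictionary per (text, author) pair."""
    for text, author in rows:
        # Execution pauses here and resumes when the next item is requested.
        yield {"text": text, "author": author}


# Made-up sample data standing in for quotes found on a page.
sample_rows = [
    ("First example quote", "Author A"),
    ("Second example quote", "Author B"),
]

# Collecting the generator's output gives one dictionary per row.
items = list(parse_rows(sample_rows))
for item in items:
    print(item)
```

In a spider, Scrapy consumes the generator in the same way and stores each yielded dictionary as one scraped item.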

```python
# A commented version of the quotes.py script.
import scrapy


class QuotesSpider(scrapy.Spider):
    # QuotesSpider inherits from scrapy.Spider and contains the logic for our spider.

    # The name used to refer to this spider when running it from the command line.
    name = "quotes"

    # Restricts the spider to URLs from these domains, which prevents it
    # from accidentally crawling other websites.
    allowed_domains = ["quotes.toscrape.com"]

    # The URLs where the spider begins crawling: here, the first page of quotes.
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        # Called with the response object of each request made.
        # This is where the main parsing logic of the spider resides.

        # Each quote on the page sits in a div element with the class "quote".
        for quote in response.css('div.quote'):
            # yield hands one dictionary of extracted data per quote to Scrapy
            # without stopping the spider.
            yield {
                # Text content of the span element with the class "text".
                'text': quote.css('span.text::text').get(),
                # Text content of the small element with the class "author".
                'author': quote.css('small.author::text').get(),
                # Text of every a element with the class "tag" inside the
                # div element with the class "tags", returned as a list.
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # The href attribute of the a element inside the li element
        # with the class "next" gives the URL of the next page.
        next_page = response.css('li.next a::attr(href)').get()

        # If there is a next page, follow the link and call parse on its
        # response, so the spider continues scraping subsequent pages.
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```
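
The pagination step relies on relative links such as `/page/2/`. As a rough illustration using only the standard library (not Scrapy's actual implementation), the sketch below shows how a relative `href` resolves against the current page URL, along with a simplified version of the kind of check that `allowed_domains` implies. The helper name `is_allowed` and the example `href` value are assumptions made for this sketch.

```python
from urllib.parse import urljoin, urlparse

current_url = "http://quotes.toscrape.com/page/1/"
next_href = "/page/2/"  # assumed example of what the li.next selector returns

# A relative link is resolved against the URL of the page it was found on.
next_url = urljoin(current_url, next_href)
print(next_url)


def is_allowed(url, allowed_domains):
    """Simplified check: the host equals an allowed domain or is a subdomain of it."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


print(is_allowed(next_url, ["quotes.toscrape.com"]))
print(is_allowed("http://example.com/", ["quotes.toscrape.com"]))
```

Passing the relative link to `response.follow` means the spider does not have to build the absolute URL by hand.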


### Running the spider

Once you have the script written, you can execute it from the command line with:

`scrapy crawl <spidername> -o <output_file_name>`

We will run our spider called "quotes" and store the results in `quotes.csv`:

`scrapy crawl quotes -o quotes.csv`

A lowercase `-o` appends the scraped items to an existing file (or creates a new file).

An uppercase `-O` replaces (overwrites) an existing file (or creates a new file).
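
Once the crawl finishes, the CSV file can be read back in Python. The sketch below uses the standard `csv` module on a small made-up sample with the same three columns (`text`, `author`, `tags`), so it runs without the real file. It assumes that multiple tags are stored in one cell separated by commas. To read the real output, open `quotes.csv` instead of the in-memory sample.

```python
import csv
import io

# Made-up sample in the assumed layout of quotes.csv (not real scraped data).
sample_csv = io.StringIO(
    "text,author,tags\n"
    'First example quote,Author A,"tag-one,tag-two"\n'
    "Second example quote,Author B,tag-three\n"
)

rows = []
for row in csv.DictReader(sample_csv):
    # Turn the comma-separated tags cell back into a list.
    row["tags"] = row["tags"].split(",") if row["tags"] else []
    rows.append(row)

for row in rows:
    print(row["author"], row["tags"])
```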