# 06. Crawling by Following Links

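## Modifying the Spider
- Set `DEPTH_LIMIT = 1` in `my_project/my_project/settings.py`
- `my_project/my_project/spiders/quotes-3.py`
    - In `start_requests`, drop the `callback` argument from `scrapy.Request()`; with an explicit callback the response bypasses the `Rule`, so no links are followed
- urls: index pages that contain hierarchical links to the pages to crawl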
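As a minimal sketch of the first bullet, the relevant line in `settings.py` might look like the following. The comments are added here for illustration; the rest of the generated settings file is assumed to stay as it is.

```python
# my_project/my_project/settings.py (fragment)

# Follow links at most one hop away from the start URLs.
# Scrapy's default is 0, which means no depth limit.
DEPTH_LIMIT = 1
```

If the limit should apply to this spider only, the same key can instead go into the spider's `custom_settings` dict.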

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from my_project.items import Quote

class QuotesSpider(CrawlSpider):
    """Crawler that collects Quote items."""
    
    name = 'quotes-3'
    allowed_domains = ['quotes.toscrape.com']

    rules = (
        Rule(LinkExtractor(allow=r'.*'), callback='parse_start_url', follow=True),
    )
    
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/'
        ]

        return [scrapy.Request(url=url) for url in urls]
        # Alternatively, as a generator (still without a callback,
        # so that the response is handled by the Rule):
        # for url in urls:
        #     yield scrapy.Request(url=url)

    def parse_start_url(self, response):
        """Scrape the pages reached from the start URLs as well."""
        return self.parse(response)

    def parse(self, response):
        """Scrape Items from a crawled page (first two quotes only)."""
        items = []
        for i, quote_html in enumerate(response.css('div.quote')):
            if i > 1:
                return items
            item = Quote()
            item['author'] = quote_html.css('small.author::text').get()
            item['text'] = quote_html.css('span.text::text').get()
            item['tags'] = quote_html.css('div.tags a.tag::text').getall()
            items.append(item)
        # Without this, a page with two or fewer quotes would return None.
        return items
```
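The rule above uses `allow=r'.*'`, which accepts every link on the page. The sketch below approximates that filtering step with the standard `re` module, assuming the pattern is applied with search semantics against the absolute URL. `filter_links`, the sample URLs, and the stricter `/page/\d+/` pattern are hypothetical illustrations, not part of Scrapy or of this project.

```python
import re


def filter_links(urls, allow):
    """Keep the URLs in which the pattern matches anywhere."""
    pattern = re.compile(allow)
    return [url for url in urls if pattern.search(url)]


# Illustrative URLs only; they are not taken from a real crawl.
urls = [
    "http://quotes.toscrape.com/page/2/",
    "http://quotes.toscrape.com/tag/simile/",
    "http://quotes.toscrape.com/login",
]

everything = filter_links(urls, r".*")           # same pattern as the rule above
pages_only = filter_links(urls, r"/page/\d+/")   # hypothetical stricter pattern
```

A narrower `allow` pattern is one way to keep the crawl away from pages such as `/login` that contain no quotes.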

### Checking the Log
```
(...omitted...)
2021-10-06 20:20:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
2021-10-06 20:20:29 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': 'Albert Einstein',
 'tags': ['change', 'deep-thoughts', 'thinking', 'world'],
 'text': '“The world as we have created it is a process of our thinking. It '
         'cannot be changed without changing our thinking.”'}
2021-10-06 20:20:29 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': 'J.K. Rowling',
 'tags': ['abilities', 'choices'],
 'text': '“It is our choices, Harry, that show what we truly are, far more '
         'than our abilities.”'}
2021-10-06 20:20:29 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://quotes.toscrape.com/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2021-10-06 20:20:29 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.goodreads.com': <GET https://www.goodreads.com/quotes>
2021-10-06 20:20:29 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'scrapinghub.com': <GET https://scrapinghub.com>
2021-10-06 20:20:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/simile/> (referer: http://quotes.toscrape.com/)
2021-10-06 20:20:30 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/simile/>
{'author': 'Steve Martin',
 'tags': ['humor', 'obvious', 'simile'],
 'text': '“A day without sunshine is like, you know, night.”'}
2021-10-06 20:20:30 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/simile/>
{'author': 'Albert Einstein',
 'tags': ['life', 'simile'],
 'text': '“Life is like riding a bicycle. To keep your balance, you must keep '
         'moving.”'}
2021-10-06 20:20:30 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 1): http://quotes.toscrape.com/
2021-10-06 20:20:30 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 1): http://quotes.toscrape.com/login 
2021-10-06 20:20:30 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 1): http://quotes.toscrape.com/tag/simile/page/1/
(...omitted...)
```

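- `'downloader/request_count': 56` in the stats at the end of the log means that 56 requests were sent. Links logged with `DEBUG: Ignoring link (depth > 1)` were ignored, not requested, because they lie beyond `DEPTH_LIMIT`.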
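The three kinds of filtering visible in the log (duplicate, offsite, depth) can be sketched with a small standard-library simulation. This is not Scrapy's implementation: `LINKS` is a hypothetical link graph, `simulate_crawl` is a made-up helper, and the domain check is simplified to an exact hostname match.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real pages (not actual crawl data).
LINKS = {
    "http://quotes.toscrape.com/": [
        "http://quotes.toscrape.com/",             # link back to itself
        "http://quotes.toscrape.com/tag/simile/",
        "https://www.goodreads.com/quotes",        # different domain
    ],
    "http://quotes.toscrape.com/tag/simile/": [
        "http://quotes.toscrape.com/login",
        "http://quotes.toscrape.com/tag/simile/page/1/",
    ],
}


def simulate_crawl(start_url, allowed_domain, depth_limit):
    """Breadth-first walk that records why each dropped link was dropped.

    Unlike Scrapy's DEPTH_LIMIT, a depth_limit of 0 here follows no links.
    """
    crawled, events = [], []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        crawled.append(url)
        for link in LINKS.get(url, []):
            if urlparse(link).hostname != allowed_domain:
                events.append(("offsite", link))
            elif depth + 1 > depth_limit:
                events.append(("too deep", link))
            elif link in seen:
                events.append(("duplicate", link))
            else:
                seen.add(link)
                queue.append((link, depth + 1))
    return crawled, events


if __name__ == "__main__":
    crawled, events = simulate_crawl(
        "http://quotes.toscrape.com/", "quotes.toscrape.com", depth_limit=1
    )
    for reason, link in events:
        print(f"{reason}: {link}")
```

With a limit of 1, links found on the start page are followed, while links found on those pages are dropped as too deep, which mirrors the `Ignoring link (depth > 1)` lines above.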