### Scrapy
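- A web data collection framework written in Python
    - Framework vs. library (package)
    - A framework ships with the code for a specific purpose already laid out, so you write your part by filling in the blanks
    - A package is code someone else wrote that you import and call from your own code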
- scrapy
    - pip install scrapy
- tree
    - sudo apt install tree
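
The framework/package distinction above can be sketched in a few lines. This is only an illustration: `MiniFramework` and `TitleSpider` are hypothetical names, not part of Scrapy, and the price list is made-up sample data.

```python
import statistics

# Library style: our code is in control and calls into the library.
prices = [27750, 45000, 21600]
average = statistics.mean(prices)


# Framework style: the framework owns the loop and calls our code.
# `MiniFramework` is a hypothetical stand-in, not part of Scrapy.
class MiniFramework:
    def run(self, pages):
        results = []
        for page in pages:
            # The "blank" that the user fills in.
            results.append(self.parse(page))
        return results

    def parse(self, page):
        raise NotImplementedError("subclasses fill in parse()")


class TitleSpider(MiniFramework):
    def parse(self, page):
        return page["title"].strip()


titles = TitleSpider().run([{"title": " item A "}, {"title": "item B"}])
print(average, titles)
```

A Scrapy spider follows the second pattern: you supply `parse()` and the framework decides when to call it.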

#### Index
- xpath : a query syntax that plays the same role as CSS selectors
- Scrapy's structure
- Crawling Gmarket bestseller product data

In [1]:
import scrapy
import requests
from scrapy.http import TextResponse  # for practicing xpath

#### 1. Using xpath
- Naver and Daum real-time search keyword data
- xpath for a Naver search keyword

```
//*[@id="PM_ID_ct"]/div[1]/div[2]/div[2]/div[1]/div/ul/li[20]/a/span[2]
```

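- `//` : searches from the document root through all descendants, at any depth (like the descendant combinator in "div .txt")
- `*` : matches an element with any tag name
- `[@id="PM_ID_ct"]` : a condition; here, elements whose id is PM_ID_ct
- `/` : looks only at direct children (like "div > .txt")
- `div[1]` : selects the first div element (indexing starts at 1)
- `.` : selects the current element
- `not` : not(condition), negates a condition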

In [2]:
# Connect to the web page
req = requests.get("https://www.naver.com/")

# Create a Scrapy response object from the downloaded HTML
response = TextResponse(req.url, body=req.text, encoding="utf-8")

In [3]:
# Fetch the Naver keyword ranking data
# xpath : the xpath selector expression
# data : the element selected by that xpath selector
response.xpath('//*[@id="PM_ID_ct"]\
/div[1]/div[2]/div[2]/div[1]/div/ul/li[20]/a/span')

[]

In [4]:
# Select the text node so that data holds the text
response.xpath('//*[@id="PM_ID_ct"]\
/div[1]/div[2]/div[2]/div[1]/div/ul/li[20]/a/span/text()')

[]

In [5]:
# Pull out only the data values from the selector list
response.xpath('//*[@id="PM_ID_ct"]\
/div[1]/div[2]/div[2]/div[1]/div/ul/li[20]/a/span/text()').extract()

[]

In [6]:
response.xpath('//*[@id="PM_ID_ct"]\
/div[1]/div[2]/div[2]/div[1]/div/ul/li/a/span[2]/text()').extract()[:3]

[]

In [7]:
response.xpath("//*[@class='list_nav NM_FAVORITE_LIST']/li/a/text()")

[<Selector xpath="//*[@class='list_nav NM_FAVORITE_LIST']/li/a/text()" data='사전'>,
 <Selector xpath="//*[@class='list_nav NM_FAVORITE_LIST']/li/a/text()" data='뉴스'>,
 <Selector xpath="//*[@class='list_nav NM_FAVORITE_LIST']/li/a/text()" data='증권'>,
 <Selector xpath="//*[@class='list_nav NM_FAVORITE_LIST']/li/a/text()" data='부동산'>,
 <Selector xpath="//*[@class='list_nav NM_FAVORITE_LIST']/li/a/text()" data='지도'>,
 <Selector xpath="//*[@class='list_nav NM_FAVORITE_LIST']/li/a/text()" data='VIBE'>,
 <Selector xpath="//*[@class='list_nav NM_FAVORITE_LIST']/li/a/text()" data='책'>,
 <Selector xpath="//*[@class='list_nav NM_FAVORITE_LIST']/li/a/text()" data='웹툰'>]

In [8]:
response.xpath("//*[@class='list_nav NM_FAVORITE_LIST']/li/a/text()").extract()

['사전', '뉴스', '증권', '부동산', '지도', 'VIBE', '책', '웹툰']

#### 2. Scrapy Project
- Creating a Scrapy project
- Scrapy's structure
- Collect Gmarket bestseller product links, then the detail information behind each link

In [9]:
# Create the project

In [10]:
!rm -rf crawler

In [11]:
!python3 -m scrapy startproject crawler

/bin/bash: scrapy: command not found


In [14]:
!tree ~/crawler

/home/ubuntu/crawler
├── crawler
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

2 directories, 7 files


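#### Scrapy's structure
- spiders
    - Code describing which web service to crawl and how (written as .py files)
- items.py
    - The model code: defines the data structure of what gets stored, i.e. which fields (columns) to save
- pipelines.py
    - Code that processes the scraped results once they have been assembled into items
    - Configure here what happens to each item
- settings.py
    - Configuration values for the scraping run
    - robots.txt : whether to obey it or not

- Scrapy issues requests concurrently out of the box. It does this asynchronously on top of Twisted's event loop, not with multi-threading.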
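
As a rough illustration of the settings mentioned above, a `settings.py` fragment might look like the following. `ROBOTSTXT_OBEY`, `CONCURRENT_REQUESTS`, and `DOWNLOAD_DELAY` are standard Scrapy setting names; the values for the last two are assumptions chosen for the example, not values taken from this project.

```python
# settings.py (fragment)

# Whether to respect the target site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Upper bound on the number of requests Scrapy keeps in flight at once.
# The value 8 is illustrative only.
CONCURRENT_REQUESTS = 8

# Seconds to wait between requests to the same site.
# The value 0.5 is illustrative only.
DOWNLOAD_DELAY = 0.5
```

Lowering concurrency or adding a delay trades crawl speed for a lighter load on the target site.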

#### Collecting Gmarket bestseller products
- Product name, detail page URL, original price, sale price, discount rate
- Check the xpath
- items.py
- spider.py
- Run the crawler

##### 1. Check the xpath

In [15]:
req = requests.get("http://corners.gmarket.co.kr/Bestsellers")
response = TextResponse(req.url, body=req.text, encoding="utf-8")

In [17]:
links = response.xpath(
    '//*[@id="gBestWrap"]/div/div[3]/div[2]/ul/li/div[1]/a/@href').extract()
len(links)

200

In [26]:
links[0]

'http://item.gmarket.co.kr/Item?goodscode=1791565626&ver=637676174529795740'

In [33]:
req = requests.get(links[1])
response = TextResponse(req.url, body=req.text, encoding="utf-8")
title = response.xpath('//*[@id="itemcase_basic"]/div[1]/h1/text()')[0].extract()
s_price = response.xpath(
    '//*[@id="itemcase_basic"]/div[1]/p/span/strong/text()')[0]\
.extract().replace(",", "")
o_price = response.xpath(
    '//*[@id="itemcase_basic"]/div[1]/p/span/span/text()')[0]\
.extract().replace(",", "")
discount_rate = str(round((1 - int(s_price) / int(o_price))*100, 2)) + "%"
title, s_price, o_price, discount_rate

('유통기한임박  1+등급 대관령 한우국밥 특가전 400gX10팩 ', '27750', '28900', '3.98%')

In [21]:
response.xpath('//*[@id="itemcase_basic"]/div/p/span/strong/text()').extract()[0]

'27,750'
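
The price handling above (strip the thousands separator, then compute the discount) can be pulled into small helpers. This is a sketch, assuming prices always arrive as comma-grouped digit strings; `parse_price` and `discount_rate` are names introduced here, and the `"28,900"` input is assumed from the `o_price` shown earlier.

```python
def parse_price(text):
    """Turn a price string such as '27,750' into an int."""
    return int(text.replace(",", "").strip())


def discount_rate(sale_price, original_price):
    """Return the discount as a percentage string, e.g. '3.98%'."""
    if original_price <= 0:
        raise ValueError("original price must be positive")
    rate = (1 - sale_price / original_price) * 100
    return str(round(rate, 2)) + "%"


s_price = parse_price("27,750")
o_price = parse_price("28,900")
print(s_price, o_price, discount_rate(s_price, o_price))
```

Keeping the arithmetic in one function makes it easy to check by hand: 27750 / 28900 is about 0.9602, so the discount is about 3.98%.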

##### 2. Writing items.py

In [39]:
!cat ~/crawler/crawler/items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class CrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


In [41]:
%%writefile ~/crawler/crawler/items.py
import scrapy

class CrawlerItem(scrapy.Item):
    title = scrapy.Field()
    s_price = scrapy.Field()
    o_price = scrapy.Field()
    discount_rate = scrapy.Field()
    link = scrapy.Field()

Overwriting /home/ubuntu/crawler/crawler/items.py


##### 3. Writing spider.py

In [42]:
%%writefile ~/crawler/crawler/spiders/spider.py
import scrapy
from crawler.items import CrawlerItem

class Spider(scrapy.Spider):
    # The spider is launched later by this name.
    name = "GmarketBestsellers"
    # Only crawl gmarket domains; a link to e.g. naver is not followed.
    allowed_domains = ["gmarket.co.kr"]
    # Initial requests; if several URLs are listed, they all start together.
    start_urls = ["http://corners.gmarket.co.kr/Bestsellers"]

    # A request is sent to each URL above; parse() runs first when the response arrives.
    def parse(self, response):
        links = response.xpath('//*[@id="gBestWrap"]/div/div[3]/div[2]/ul/li/div[1]/a/@href').extract()
        for link in links[:10]:
            yield scrapy.Request(link, callback=self.page_content)

    def page_content(self, response):
        item = CrawlerItem()
        item["title"] = response.xpath('//*[@id="itemcase_basic"]/div[1]/h1/text()')[0].extract()
        item["s_price"] = response.xpath('//*[@id="itemcase_basic"]/div[1]/p/span/strong/text()')[0].extract().replace(",", "")
        try:
            item["o_price"] = response.xpath('//*[@id="itemcase_basic"]/div[1]/p/span/span/text()')[0].extract().replace(",", "")
        except IndexError:
            # No original price on the page: treat the sale price as the original price.
            item["o_price"] = item["s_price"]
        item["discount_rate"] = str(round((1 - int(item["s_price"]) / int(item["o_price"]))*100, 2)) + "%"
        item["link"] = response.url
        yield item

Writing /home/ubuntu/crawler/crawler/spiders/spider.py
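
To make the request/callback flow of the spider easier to follow, here is a toy, standard-library-only model of it. This is not how Scrapy is implemented (Scrapy downloads concurrently, while this loop is sequential). The `Request` class, `crawl` function, and `PAGES` data are all hypothetical stand-ins.

```python
from collections import deque

# Fake site: URL -> page content. All URLs here are made up.
PAGES = {
    "list": {"links": ["item/1", "item/2"]},
    "item/1": {"title": "A", "price": "1,000"},
    "item/2": {"title": "B", "price": "2,500"},
}


class Request:
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def parse(url, page):
    # Listing page: follow every link with a different callback.
    for link in page["links"]:
        yield Request(link, callback=page_content)


def page_content(url, page):
    # Detail page: emit one item as a plain dict.
    price = int(page["price"].replace(",", ""))
    yield {"title": page["title"], "price": price, "link": url}


def crawl(start_url):
    queue = deque([Request(start_url, callback=parse)])
    items = []
    while queue:
        request = queue.popleft()
        page = PAGES[request.url]  # stands in for the download step
        for result in request.callback(request.url, page):
            if isinstance(result, Request):
                queue.append(result)  # new request: schedule it
            else:
                items.append(result)  # item: hand it on for output
    return items


items = crawl("list")
print(items)
```

The point of the sketch is the dispatch rule: whatever a callback yields is either scheduled as another request or collected as an item.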


##### 4. Running Scrapy

In [58]:
%%writefile ~/crawler/run.sh
cd ~/crawler/crawler
python3 -m scrapy crawl GmarketBestsellers

Overwriting /home/ubuntu/crawler/run.sh


Add execute permission

In [59]:
!chmod +x ~/crawler/run.sh

In [60]:
!~/crawler/run.sh

2021-09-18 18:30:10 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: crawler)
2021-09-18 18:30:10 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 17.9.0, Python 3.6.9 (default, Jan 26 2021, 15:33:00) - [GCC 8.4.0], pyOpenSSL 17.5.0 (OpenSSL 1.1.1  11 Sep 2018), cryptography 2.1.4, Platform Linux-5.4.0-1054-aws-x86_64-with-Ubuntu-18.04-bionic
2021-09-18 18:30:10 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2021-09-18 18:30:10 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'crawler',
 'NEWSPIDER_MODULE': 'crawler.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['crawler.spiders']}
2021-09-18 18:30:10 [scrapy.extensions.telnet] INFO: Telnet Password: 52b5c70dd9b6a7ea
2021-09-18 18:30:10 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scr

2021-09-18 18:30:12 [scrapy.core.scraper] DEBUG: Scraped from <200 http://item.gmarket.co.kr/Item?goodscode=2216788632&ver=637676190105523671>
{'discount_rate': '3.98%',
 'link': 'http://item.gmarket.co.kr/Item?goodscode=2216788632&ver=637676190105523671',
 'o_price': '28900',
 's_price': '27750',
 'title': '유통기한임박  1+등급 대관령 한우국밥 특가전 400gX10팩 '}
2021-09-18 18:30:12 [scrapy.core.engine] INFO: Closing spider (finished)
2021-09-18 18:30:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4268,
 'downloader/request_count': 13,
 'downloader/request_method_count/GET': 13,
 'downloader/response_bytes': 482271,
 'downloader/response_count': 13,
 'downloader/response_status_count/200': 13,
 'elapsed_time_seconds': 2.284865,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 9, 18, 18, 30, 12, 489108),
 'httpcompression/response_bytes': 2780175,
 'httpcompression/response_count': 13,
 'item_scraped_count': 10,
 'log_count/DE

- Save the results as CSV

In [106]:
%%writefile ~/crawler/run.sh
cd ~/crawler/crawler
python3 -m scrapy crawl GmarketBestsellers -o GmarketBestsellers.csv

Overwriting /home/ubuntu/crawler/run.sh


In [107]:
!chmod +x ~/crawler/run.sh

In [108]:
!~/crawler/run.sh

2021-09-18 18:45:01 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: crawler)
2021-09-18 18:45:01 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 17.9.0, Python 3.6.9 (default, Jan 26 2021, 15:33:00) - [GCC 8.4.0], pyOpenSSL 17.5.0 (OpenSSL 1.1.1  11 Sep 2018), cryptography 2.1.4, Platform Linux-5.4.0-1054-aws-x86_64-with-Ubuntu-18.04-bionic
2021-09-18 18:45:01 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2021-09-18 18:45:01 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'crawler',
 'NEWSPIDER_MODULE': 'crawler.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['crawler.spiders']}
2021-09-18 18:45:01 [scrapy.extensions.telnet] INFO: Telnet Password: f52a77493dc6bf18
2021-09-18 18:45:01 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scr

2021-09-18 18:45:03 [scrapy.core.scraper] DEBUG: Scraped from <200 http://item.gmarket.co.kr/Item?goodscode=2216788632&ver=637676199019005045>
{'discount_rate': '3.98%',
 'link': 'http://item.gmarket.co.kr/Item?goodscode=2216788632&ver=637676199019005045',
 'o_price': '28900',
 's_price': '27750',
 'title': '유통기한임박  1+등급 대관령 한우국밥 특가전 400gX10팩 '}
2021-09-18 18:45:03 [scrapy.core.engine] INFO: Closing spider (finished)
2021-09-18 18:45:03 [scrapy.extensions.feedexport] INFO: Stored csv feed (10 items) in: GmarketBestsellers.csv
2021-09-18 18:45:03 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4268,
 'downloader/request_count': 13,
 'downloader/request_method_count/GET': 13,
 'downloader/response_bytes': 481822,
 'downloader/response_count': 13,
 'downloader/response_status_count/200': 13,
 'elapsed_time_seconds': 2.063709,
 'feedexport/success_count/FileFeedStorage': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2

In [109]:
!ls /home/ubuntu/crawler/crawler/

GmarketBestsellers.csv	__pycache__  middlewares.py  settings.py
__init__.py		items.py     pipelines.py    spiders


In [110]:
import pandas as pd

In [112]:
files = !ls /home/ubuntu/crawler/crawler/
files

['GmarketBestsellers.csv',
 '__init__.py',
 '__pycache__',
 'items.py',
 'middlewares.py',
 'pipelines.py',
 'settings.py',
 'spiders']

In [115]:
df = pd.read_csv("/home/ubuntu/crawler/crawler/{}".format(files[0]))
df.tail(2)

| | discount_rate | link | o_price | s_price | title |
|---|---|---|---|---|---|
| 8 | 28.0% | http://item.gmarket.co.kr/Item?goodscode=19986... | 30000 | 21600 | 키즈 맨디 플리스 트레이닝복 세트 (ML3CWKRL511/512/513) |
| 9 | 3.98% | http://item.gmarket.co.kr/Item?goodscode=22167... | 28900 | 27750 | 유통기한임박 1+등급 대관령 한우국밥 특가전 400gX10팩 |


##### 5. Configuring pipelines
- Define the code that runs on each item before it is output

In [130]:
import requests
import json

def send_slack(msg):
    WEBHOOK_URL = "https://hooks.slack.com/services/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"
    payload = {
        "text": msg,
    }
    requests.post(WEBHOOK_URL, json.dumps(payload))

In [132]:
send_slack("안녕하세요")
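
A variation on the function above, sketched with the standard library only: it reads the webhook URL from an environment variable instead of hard-coding it, and builds the request without sending it so the payload can be inspected. `SLACK_WEBHOOK_URL` is an assumed variable name, `build_slack_request` is a name introduced here, and the fallback URL is a placeholder.

```python
import json
import os
import urllib.request


def build_slack_request(webhook_url, msg):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    body = json.dumps({"text": msg}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# SLACK_WEBHOOK_URL is an assumed variable name; the fallback is a placeholder.
url = os.environ.get("SLACK_WEBHOOK_URL", "https://hooks.slack.com/services/EXAMPLE")
request = build_slack_request(url, "hello")
print(request.get_method(), request.data)
# To actually send it: urllib.request.urlopen(request)
```

Keeping the URL out of the source means the notebook can be shared without exposing the webhook.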

In [133]:
!cat ~/crawler/crawler/pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class CrawlerPipeline:
    def process_item(self, item, spider):
        return item


In [134]:
%%writefile ~/crawler/crawler/pipelines.py
import requests
import json

class CrawlerPipeline(object):
    
    def __send_slack(self, msg):
        WEBHOOK_URL = "https://hooks.slack.com/services/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"
        payload = {
            "text": msg,
        }  
        requests.post(WEBHOOK_URL, json.dumps(payload))
        
    def process_item(self, item, spider):
        keyword = "세트"
        print("="*100)
        print(item["title"], keyword)
        print("="*100)
        if keyword in item["title"]:
            self.__send_slack("{},{},{}".format(
                item["title"], item["s_price"], item["link"]))
        return item

Overwriting /home/ubuntu/crawler/crawler/pipelines.py


- Pipeline configuration: settings.py
```
ITEM_PIPELINES = {
    'crawler.pipelines.CrawlerPipeline': 300,
}
```

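The number 300 has no meaning on its own; it only sets the order. When several pipelines are registered, those with lower numbers run earlier (values are conventionally kept in the 0-1000 range).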
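
The ordering rule can be illustrated with plain Python. Only `CrawlerPipeline` exists in this project; the other two pipeline paths are hypothetical and are there just to show how the numbers are compared.

```python
# Hypothetical pipeline paths; only the numbers matter for ordering.
ITEM_PIPELINES = {
    "crawler.pipelines.CrawlerPipeline": 300,
    "crawler.pipelines.CleanPricePipeline": 100,
    "crawler.pipelines.SaveToDbPipeline": 800,
}

# Items pass through the pipelines in ascending order of their number.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(ITEM_PIPELINES[path], path)
```

Leaving gaps between the numbers (100, 300, 800) makes it easy to slot a new pipeline in between later.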

In [135]:
!echo "ITEM_PIPELINES = {" >> ~/crawler/crawler/settings.py
!echo "    'crawler.pipelines.CrawlerPipeline': 300,"  >> ~/crawler/crawler/settings.py
!echo "}"  >> ~/crawler/crawler/settings.py

In [136]:
!tail -n 3 ~/crawler/crawler/settings.py

ITEM_PIPELINES = {
    'crawler.pipelines.CrawlerPipeline': 300,
}


In [137]:
!../crawler/run.sh

2021-09-18 19:22:00 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: crawler)
2021-09-18 19:22:00 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 17.9.0, Python 3.6.9 (default, Jan 26 2021, 15:33:00) - [GCC 8.4.0], pyOpenSSL 17.5.0 (OpenSSL 1.1.1  11 Sep 2018), cryptography 2.1.4, Platform Linux-5.4.0-1054-aws-x86_64-with-Ubuntu-18.04-bionic
2021-09-18 19:22:00 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2021-09-18 19:22:00 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'crawler',
 'NEWSPIDER_MODULE': 'crawler.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['crawler.spiders']}
2021-09-18 19:22:00 [scrapy.extensions.telnet] INFO: Telnet Password: a790433657cad008
2021-09-18 19:22:00 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scr

2021-09-18 19:22:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://item.gmarket.co.kr/Item?goodscode=1791565626&ver=637676221213293861> (referer: http://corners.gmarket.co.kr/Bestsellers)
2021-09-18 19:22:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://item.gmarket.co.kr/Item?goodscode=1998637788&ver=637676221213293861> (referer: http://corners.gmarket.co.kr/Bestsellers)
(카드가능)(에그머니) 온라인 게임상품권 5만원  세트
2021-09-18 19:22:03 [scrapy.core.scraper] DEBUG: Scraped from <200 http://item.gmarket.co.kr/Item?goodscode=1791565626&ver=637676221213293861>
{'discount_rate': '10.0%',
 'link': 'http://item.gmarket.co.kr/Item?goodscode=1791565626&ver=637676221213293861',
 'o_price': '50000',
 's_price': '45000',
 'title': '(카드가능)(에그머니) 온라인 게임상품권 5만원 '}
키즈 맨디 플리스 트레이닝복 세트 (ML3CWKRL511/512/513)  세트
2021-09-18 19:22:03 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): hooks.slack.com:443
2021-09-18 19:22:03 [urllib3.connectionpool] DEBUG: https://hooks.slack.com:443 "POST /