### Scrapy  
- 파이썬 언어를 이용한 웹 데이터 수집 프레임워크
    - 프레임워크 vs 라이브러리 또는 패키지
    - 프레임워크 : 특정목적을 가진 기능의 코드가 미리 설정되어서 빈칸채우기 식으로 코드를 작성
    - 패키지는 다른 사람이 작성해 놓은 코드를 가져다가 사용하는 방법
 

- scrapy
    - `$pip install scrapy`
- tree
    - `$sudo apt install tree`  
    
서버에서 설치하기~~  

scrapy는 자체적으로 강력한(?) xpath를 이용하기 때문에.
Beautifulsoup을 굳이 쓸 필요없다.-> BeautifulSoup 쓰는 이유는 css-selector 쓰기 위해서.

### 2. Scrapy Project
- scrapy 프로젝트 생성
- scrapy 구조
- gmarket 베스트 상품 링크 수집, 링크 안에 있는 상세 정보 수집

In [2]:
# 프로젝트 생성

In [3]:
!scrapy startproject crawler

Error: scrapy.cfg already exists in /home/ubuntu/02.Crawling/06.scrapy/crawler


In [4]:
!ls

06-01_iterator_generator.ipynb	06-03_xpath.ipynb  run.sh
06-02_scrapy_gmarket.ipynb	crawler		   scrapy.ipynb


In [5]:
!tree crawler

[01;34mcrawler[00m
├── [01;34mcrawler[00m
│   ├── __init__.py
│   ├── [01;34m__pycache__[00m
│   │   ├── __init__.cpython-36.pyc
│   │   └── settings.cpython-36.pyc
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── [01;34mspiders[00m
│       ├── __init__.py
│       └── [01;34m__pycache__[00m
│           └── __init__.cpython-36.pyc
└── scrapy.cfg

4 directories, 10 files


#### scrapy의 구조
- spiders
    - 어떤 웹서비스를 어떻게 크롤링할것인지에 대한 코드를 작성(.py 파일로 작성)
- items.py
    - 모델에 해당하는 코드, 저장하는 데이터의 자료구조를 설정
- piplines.py
    - 스크래핑한 결과물을 items 형태로 구성하고 처리하는 방법에 대한 코드
- settings.py
    - 스크래핑 할때의 설정값을 지정
    - robots.txt : 따를지, 안따를지

#### gmarket 베스트 셀러 상품 수집
- 상품명, 상세페이지 URL, 원가, 판매가, 할인율
- xpath 확인
- items.py
- spider.py
- 크롤러 실행

### 1. xpath 확인

In [6]:
import requests
import pandas as pd
import scrapy
from scrapy.http import TextResponse

In [7]:
req = requests.get("http://corners.gmarket.co.kr/Bestsellers")
response = TextResponse(req.url, body=req.text, encoding="utf-8")

In [8]:
items = response.xpath('//*[@id="gBestWrap"]/div/div[3]/div[2]/ul/li')
len(items)

200

In [9]:
links = response.xpath(
    '//*[@id="gBestWrap"]/div/div[3]/div[2]/ul/li/div[1]/a/@href').extract()
len(links)

200

In [10]:
links[0]

'http://item.gmarket.co.kr/Item?goodscode=1840147374&ver=637405265711220348'

In [11]:
## 결과데이터가 무조건 list라서 [0] 써준다.

req = requests.get(links[0])
response = TextResponse(req.url, body=req.text, encoding="utf-8")
title = response.xpath('//*[@id="itemcase_basic"]/h1/text()')[0].extract()
s_price = response.xpath('//*[@id="itemcase_basic"]/p/span/strong/text()')[0].extract().replace(",", "")
o_price = response.xpath('//*[@id="itemcase_basic"]/p/span/span/text()')[0].extract().replace(",", "")
discount_rate = str(round((1 - int(s_price) / int(o_price))*100, 2)) + "%"
title, s_price, o_price, discount_rate

IndexError: list index out of range

### 2. items.py 작성

- 크롤링할 데이터의 모델을 정해줌.

In [12]:
!cat crawler/crawler/items.py

import scrapy
from crawler.items import CrawlerItem

class Spider(scrapy.Spider):
    name = "GmarketBestsellers"
    allow_domain = ["gmarket.co.kr"]
    start_urls = ["http://corners.gmarket.co.kr/BestSellers"]
    
    def parse(self, response):
        links = response.xpath("//*[@id="gBestWrap"]/div/div[3]/div[2]/ul/li")
        for link in links[:10]:   ##실행하는데 시간이 너무많이 걸려서 links 10개까지만 하자.
            yield scrapy.Request(link, callback=self.page_content)
    
    def page_content(self, response):
        item = CrawlerItem()
        
        item["title"] = response.xpath('//*[@id="itemcase_basic"]/h1/text()')[0].extract()
        item["s_price"] = response.xpath('//*[@id="itemcase_basic"]/p/span/strong/text()')[0].extract().replace(",", "")
        try:
            item["o_price"] = response.xpath('//*[@id="itemcase_basic"]/p/span/span/text()')[0].extract().replace(",", "")
        except:
            item["o_price"] = response.xpath('//*[@id="itemcase_bas

In [13]:
%%writefile crawler/crawler/items.py
import scrapy


class CrawlerItem(scrapy.Item):
    title = scrapy.Field()
    s_price = scrapy.Field()
    o_price = scrapy.Field()
    discount_rate = scrapy.Field()
    link = scrapy.Field()

Overwriting crawler/crawler/items.py


### 3. spider.py 작성

실제로 작성하는건 요정도밖에 안된다~~

In [14]:
%%writefile crawler/crawler/items.py
import scrapy
from crawler.items import CrawlerItem

class Spider(scrapy.Spider):
    name = "GmarketBestsellers"
    allow_domain = ["gmarket.co.kr"]
    start_urls = ["http://corners.gmarket.co.kr/BestSellers"]
    
    def parse(self, response):
        links = response.xpath("//*[@id="gBestWrap"]/div/div[3]/div[2]/ul/li")
        for link in links[:10]:   ##실행하는데 시간이 너무많이 걸려서 links 10개까지만 하자.
            yield scrapy.Request(link, callback=self.page_content)
    
    def page_content(self, response):
        item = CrawlerItem()
        
        item["title"] = response.xpath('//*[@id="itemcase_basic"]/h1/text()')[0].extract()
        item["s_price"] = response.xpath('//*[@id="itemcase_basic"]/p/span/strong/text()')[0].extract().replace(",", "")
        try:
            item["o_price"] = response.xpath('//*[@id="itemcase_basic"]/p/span/span/text()')[0].extract().replace(",", "")
        except:
            item["o_price"] = response.xpath('//*[@id="itemcase_basic"]/p/span/span/text()')[0].extract().replace(",", "")
        item["discount_rate"] = str(round((1 - int(s_price) / int(o_price))*100, 2)) + "%"
        item["link"] = response.url
        yield item

Overwriting crawler/crawler/items.py


### 4. Scrapy 실행

In [15]:
!ls crawler

crawler  scrapy.cfg


In [16]:
## scrapy.cfg 파일이 있는 디렉토리 안에서 다음의 커맨드를 실행해야한다

%%writefile run.sh
cd crawler           # crawler 디렉토리로 이동
scrapy crawl GmarketBestsellers     # shell script를 실행한다.

SyntaxError: invalid syntax (<ipython-input-16-08c2c89694a7>, line 4)

In [17]:
# +x : 파일에 대한 실행권한을 추가해준다
!chmod +x run.sh

In [18]:
!./run.sh

2020-11-09 14:01:05 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: crawler)
2020-11-09 14:01:05 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.9 (default, Oct 19 2020, 01:13:33) - [GCC 7.5.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.2.1, Platform Linux-5.4.0-1029-aws-x86_64-with-debian-buster-sid
2020-11-09 14:01:05 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
Traceback (most recent call last):
  File "/home/ubuntu/.pyenv/versions/3.6.9/envs/python3/lib/python3.6/site-packages/scrapy/spiderloader.py", line 75, in load
    return self._spiders[spider_name]
KeyError: 'GmarketBestsellers'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/.pyenv/versions/python3/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/home/ubuntu/.pyenv/v

- 결과를 CSV로 저장

In [19]:
%%writefile run.sh
cd crawler
scrapy crawl GmarketBestsellers -o GmarketBestsellers.csv

Overwriting run.sh


In [20]:
! ./run.sh

2020-11-09 14:24:44 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: crawler)
2020-11-09 14:24:44 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.9 (default, Oct 19 2020, 01:13:33) - [GCC 7.5.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.2.1, Platform Linux-5.4.0-1029-aws-x86_64-with-debian-buster-sid
2020-11-09 14:24:44 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
Traceback (most recent call last):
  File "/home/ubuntu/.pyenv/versions/3.6.9/envs/python3/lib/python3.6/site-packages/scrapy/spiderloader.py", line 75, in load
    return self._spiders[spider_name]
KeyError: 'GmarketBestsellers'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/.pyenv/versions/python3/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/home/ubuntu/.pyenv/v

In [21]:
!ls crawler/

crawler  scrapy.cfg


In [None]:
## 정상적으로 크롤링되어 데이터가 저장되어 있는지 확인

In [None]:
import pandas as pd

In [None]:
path = !pwd
files = !ls crawler/
files

In [None]:
files = !ls crawler/  # crawler프로젝트 안의 파일을 불러옴.
files

In [None]:
"crawler/{}".format(files[0])

In [None]:
df = pd.read_csv("GmarketBestsellers.csv")
df.tail(2)

### 5. Pipelines 설정
- item 을 출력하기 전에 실행되는 코드를 정의

In [None]:
import requests
import json

class send_slack(object):
    
    def __send_slack(self, msg):
        WEBHOOK_URL = ""
        payload = {
            "channel": "",
            "username": "",
            "text": msg,
        }
        requests.post(WEBHOOK_URL, json.dumps(payload))
    
    def process_item(self, item, spider):
        keyword = "세트",
        print("="*100)
        print(item["title"], keyword)
        print("="*100)
        if keyword in item["title"]:
            self.__send_slack("{},{},{}".format(
                item["title"], item["s_price"], item["link"]))
            
        return item     #return된 item이 출력된당

##실행해서 Overwriting 해준당

- pipeline 설정 : settings.py
```
ITEM_PIPELINES = {
    'crawler.pipelines.CrawlerPipeline': 300,   
}
```

- 숫자 : 1~1000까지 가능. 파이프라인의 순서 정해줌. 숫자 낮을수록 더 일찍 실행된다.

In [None]:
# echo : 뒤에오는 문자열을 출력시켜주는 shell 커맨드

!echo "ITEM_PIPELINES = {" >> crawler/crawler/settings.py
!echo "    'crawler.pipelines.CrawlerPipeline': 300," >> crawler/crawler/settings.py
!echo "}" >> crawler/crawler/settings.py

In [None]:
!tail -n 3 cralwer/crawler/settings.py

In [None]:
#slack?? 으로 전송되는지 확인~~~

!./run.sh