### playnomore
- http://playnomore.co.kr/
- scrapy에서 fake-useragent 사용
- scrapy를 실행할때 아규먼트를 설정해서 실행
- pipelines에서 데이터 베이스로 데이터를 저장

In [3]:
import scrapy
import requests
from scrapy.http import TextResponse

#### 1. 프로젝트 생성

In [4]:
!rm -rf playnomore
!scrapy startproject playnomore

New Scrapy project 'playnomore', using template directory '/Users/sanghyuk/anaconda/envs/py38/lib/python3.8/site-packages/scrapy/templates/project', created in:
    /Users/sanghyuk/Documents/0B-python/crawling_selenium/4_crawling_scrapy/playnomore

You can start your first spider with:
    cd playnomore
    scrapy genspider example example.com


#### 2. items.py
- title, price, img, link

In [5]:
%%writefile playnomore/playnomore/items.py
import scrapy

class PlaynomoreItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    img = scrapy.Field()
    link = scrapy.Field()

Overwriting playnomore/playnomore/items.py


#### 3. xpath 확인
- 링크
- 링크 -> 상세페이지(제목, 이미지URL, 가격)
- fake_useragent 설치
    - pip install fake_useragent

In [8]:
from fake_useragent import UserAgent
url = "http://playnomore.co.kr/category/bag/24/"
headers = { "User-Agent": UserAgent().chrome }
#headers = { "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36" }
req = requests.get(url, headers=headers)
response = TextResponse(req.url, body=req.text, encoding="utf-8") 
response

<200 http://playnomore.co.kr/category/bag/24/>

In [12]:
# 링크
links = response.xpath(
    '//*[@id="contents"]/div[2]/div/ul/li/div/a/@href'
).extract()

In [13]:
links

['/product/detail.html?product_no=536&cate_no=24&display_group=1',
 '/product/detail.html?product_no=575&cate_no=24&display_group=1',
 '/product/detail.html?product_no=580&cate_no=24&display_group=1',
 '/product/detail.html?product_no=581&cate_no=24&display_group=1',
 '/product/detail.html?product_no=552&cate_no=24&display_group=1',
 '/product/detail.html?product_no=535&cate_no=24&display_group=1',
 '/product/detail.html?product_no=534&cate_no=24&display_group=1',
 '/product/detail.html?product_no=550&cate_no=24&display_group=1',
 '/product/detail.html?product_no=506&cate_no=24&display_group=1',
 '/product/detail.html?product_no=532&cate_no=24&display_group=1',
 '/product/detail.html?product_no=551&cate_no=24&display_group=1',
 '/product/detail.html?product_no=507&cate_no=24&display_group=1',
 '/product/detail.html?product_no=493&cate_no=24&display_group=1',
 '/product/detail.html?product_no=539&cate_no=24&display_group=1',
 '/product/detail.html?product_no=540&cate_no=24&display_group

In [14]:
links = list(map(response.urljoin, links))

In [25]:
# 상세페이지 : 제목, 가격, 이미지URL
url = links[0]
headers = { "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36" }
req = requests.get(url, headers=headers)
response = TextResponse(req.url, body=req.text, encoding="utf-8") 
response

<200 http://playnomore.co.kr/product/preorder10off-180-micro-candy-midnight/536/?cate_no=24&display_group=1>

In [None]:
//*[@id="contents"]/div[1]/div[1]/div[2]/div[1]/

In [31]:
title1 = response.xpath(
        '//*[@id="contents"]/div[1]/div[1]/div[2]/div[1]/font/text()'
    ).extract()
title2 = response.xpath(
        '//*[@id="contents"]/div[1]/div[1]/div[2]/div[1]/text()'
    ).extract()
title = "".join(title1) + "".join(title2)
price = response.xpath(
        '//*[@id="contents"]/div[1]/div[1]/div[2]/div[2]/text()'
    ).extract()[0]
img = response.xpath(
        '//*[@id="contents"]/div[1]/div[1]/div[1]/div[1]/img/@src'
    ).extract()[0]
title, price, img

('[preorder/10%off]  MICRO CANDY midnight',
 '$ 162',
 'https://cafe24img.poxo.com/playnomore/web/product/big/202007/13c6846e51721f079e2fd501e71d042f.jpg')

#### 4. spider.py
- scrapy-fake-useragent 설치
    - pip install scrapy-fake-useragent

In [2]:
!pip3 list | grep fake

fake-useragent                     0.1.11


In [10]:
%%writefile playnomore/playnomore/spiders/spider.py
import scrapy
from playnomore.items import PlaynomoreItem

class PlaynomoreSpider(scrapy.Spider):
    name = "Playnomore"
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            # Remove Default User Agent
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            # Set New User Agent
            'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
        }
    }
    
    # 우선적으로 실행됨
    def start_requests(self):
        url = "http://playnomore.co.kr/category/bag/24/"
        yield scrapy.Request(url, callback=self.parse)
        
    def parse(self, response):
        links = response.xpath('//*[@id="contents"]/div[2]/div/ul/li/div/a/@href').extract()
        links = list(map(response.urljoin, links))
        for link in links:
            yield scrapy.Request(link, callback=self.page_parse)
    
    def page_parse(self, response):
        item = PlaynomoreItem()
        title1 = response.xpath('//*[@id="contents"]/div[1]/div[1]/div[2]/div[1]/font/text()').extract()
        title2 = response.xpath('//*[@id="contents"]/div[1]/div[1]/div[2]/div[1]/text()').extract()
        item["title"] = "".join(title1) + "".join(title2)
        item["price"] = response.xpath('//*[@id="contents"]/div[1]/div[1]/div[2]/div[2]/text()').extract()[0]
        item["img"] = response.xpath('//*[@id="contents"]/div[1]/div[1]/div[1]/div[1]/img/@src').extract()[0]
        item["link"] = response.url
        yield item

Overwriting playnomore/playnomore/spiders/spider.py


참고로, csv또 하면, 현재 있던 파일에 append가 되는 것.

In [11]:
%%writefile playnomore/run.sh
cd playnomore
python3 -m scrapy crawl Playnomore -o playnomore.csv

Overwriting playnomore/run.sh


In [12]:
!chmod +x playnomore/run.sh

In [13]:
!playnomore/run.sh

2021-09-20 00:39:59 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: playnomore)
2021-09-20 00:39:59 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.8.5 (default, Sep  4 2020, 02:22:02) - [Clang 10.0.0 ], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform macOS-10.16-x86_64-i386-64bit
2021-09-20 00:39:59 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-09-20 00:39:59 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'playnomore',
 'NEWSPIDER_MODULE': 'playnomore.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['playnomore.spiders']}
2021-09-20 00:39:59 [scrapy.extensions.telnet] INFO: Telnet Password: d2bf28ff0fe8f850
2021-09-20 00:39:59 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.

2021-09-20 00:40:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://playnomore.co.kr/product/bag-micro-candy-tweed-black/534/?cate_no=24&display_group=1> (referer: http://playnomore.co.kr/category/bag/24/)
2021-09-20 00:40:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://playnomore.co.kr/product/preorder10off-180-micro-candy-midnight/536/?cate_no=24&display_group=1> (referer: http://playnomore.co.kr/category/bag/24/)
2021-09-20 00:40:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://playnomore.co.kr/product/bag-micro-candy-pink/575/?cate_no=24&display_group=1> (referer: http://playnomore.co.kr/category/bag/24/)
2021-09-20 00:40:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://playnomore.co.kr/product/bag-micro-moon-chocolate/540/?cate_no=24&display_group=1> (referer: http://playnomore.co.kr/category/bag/24/)
2021-09-20 00:40:02 [scrapy.core.scraper] DEBUG: Scraped from <200 http://playnomore.co.kr/product/bag-micro-candy-tweed-black/534/?cate_no=24&display_

In [14]:
import pandas as pd
df = pd.read_csv("playnomore/playnomore.csv")
df.tail(1)

Unnamed: 0,img,link,price,title
13,https://cafe24img.poxo.com/playnomore/web/prod...,http://playnomore.co.kr/product/bag-micro-cand...,$ 180,[BAG] MICRO CANDY natural


실행시킬때, shoes 이런 argument를 받아서, url을 만들어서 scraping을 해보자. 

#### 5. argument 설정

In [26]:
%%writefile playnomore/playnomore/spiders/spider.py
import scrapy
from playnomore.items import PlaynomoreItem

class PlaynomoreSpider(scrapy.Spider):
    name = "Playnomore"
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
        }
    }
    
    def __init__(self, category1="bag", category2=24, **kwargs):
        self.start_url = "http://playnomore.co.kr/category/{}/{}".format(category1, category2)
        super().__init__(**kwargs)
        
    def start_requests(self):
        url = self.start_url
        yield scrapy.Request(url, callback=self.parse)
        
    def parse(self, response):
        links = response.xpath('//*[@id="contents"]/div[2]/div/ul/li/div/a/@href').extract()
        links = list(map(response.urljoin, links))
        for link in links:
            yield scrapy.Request(link, callback=self.page_parse)
    
    def page_parse(self, response):
        item = PlaynomoreItem()
        title1 = response.xpath('//*[@id="contents"]/div[1]/div[1]/div[2]/div[1]/font/text()').extract()
        title2 = response.xpath('//*[@id="contents"]/div[1]/div[1]/div[2]/div[1]/text()').extract()
        item["title"] = "".join(title1) + "".join(title2)
        item["price"] = response.xpath('//*[@id="contents"]/div[1]/div[1]/div[2]/div[2]/text()').extract()[0]
        item["img"] = "http:" + response.xpath('//*[@id="contents"]/div[1]/div[1]/div[1]/div[1]/img/@src').extract()[0]
        item["link"] = response.url
        yield item

Overwriting playnomore/playnomore/spiders/spider.py


In [29]:
%%writefile playnomore/run.sh
cd playnomore
scrapy crawl Playnomore -o playnomore2.csv -a category1=apparel -a category2=26

Overwriting playnomore/run.sh


In [30]:
!playnomore/run.sh

2021-09-20 00:50:24 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: playnomore)
2021-09-20 00:50:24 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.8.5 (default, Sep  4 2020, 02:22:02) - [Clang 10.0.0 ], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform macOS-10.16-x86_64-i386-64bit
2021-09-20 00:50:24 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-09-20 00:50:24 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'playnomore',
 'NEWSPIDER_MODULE': 'playnomore.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['playnomore.spiders']}
2021-09-20 00:50:24 [scrapy.extensions.telnet] INFO: Telnet Password: b0c7532c38346864
2021-09-20 00:50:24 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.

2021-09-20 00:50:27 [scrapy.core.scraper] DEBUG: Scraped from <200 http://playnomore.co.kr/product/shy-sweat-ivory/59/?cate_no=26&display_group=1>
{'img': 'http:https://cafe24img.poxo.com/playnomore/web/product/big/201702/59_shop7_739643.jpg',
 'link': 'http://playnomore.co.kr/product/shy-sweat-ivory/59/?cate_no=26&display_group=1',
 'price': '$ 222',
 'title': 'SHY SWEAT ivory'}
2021-09-20 00:50:27 [scrapy.core.engine] INFO: Closing spider (finished)
2021-09-20 00:50:27 [scrapy.extensions.feedexport] INFO: Stored csv feed (6 items) in: playnomore2.csv
2021-09-20 00:50:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 5843,
 'downloader/request_count': 14,
 'downloader/request_method_count/GET': 14,
 'downloader/response_bytes': 143473,
 'downloader/response_count': 14,
 'downloader/response_status_count/200': 8,
 'downloader/response_status_count/301': 6,
 'elapsed_time_seconds': 2.623476,
 'feedexport/success_count/FileFeedStorage': 1,
 'finish_reas

In [32]:
import pandas as pd
df = pd.read_csv("playnomore/playnomore2.csv")
df.tail(5)

Unnamed: 0,img,link,price,title
1,http:https://cafe24img.poxo.com/playnomore/web...,http://playnomore.co.kr/product/soldout-knight...,$ 237,[soldout] KNIGHT HOODIE khaki
2,http:https://cafe24img.poxo.com/playnomore/web...,http://playnomore.co.kr/product/shy-sweat-grey...,$ 222,SHY SWEAT grey
3,http:https://cafe24img.poxo.com/playnomore/web...,http://playnomore.co.kr/product/shy-bubble-swe...,$ 222,SHY BUBBLE SWEAT grey
4,http:https://cafe24img.poxo.com/playnomore/web...,http://playnomore.co.kr/product/soldout-shy-fr...,$ 222,[soldout] SHY FRIENDS SWEAT oatmeal
5,http:https://cafe24img.poxo.com/playnomore/web...,http://playnomore.co.kr/product/shy-sweat-ivor...,$ 222,SHY SWEAT ivory


#### 6. Mongodb에 저장
- pymongo를 pipelines.py에 적용
- pip install pymongo==2.8.1

In [34]:
!pip3 list | grep pymongo

pymongo                            3.12.0


In [35]:
import pymongo

In [52]:
client = pymongo.MongoClient('mongodb://[EC2IP]:27017/')
client

MongoClient(host=['ec2ip:27017'], document_class=dict, tz_aware=False, connect=True)

In [53]:
db = client.playnomore
collection = db.apparel
collection

Collection(Database(MongoClient(host=['ec2ip:27017'], document_class=dict, tz_aware=False, connect=True), 'playnomore'), 'apparel')

In [41]:
data = {"title":"의류"}
collection.insert(data)

  collection.insert(data)


ObjectId('61476ba59a84349c5a285342')

##### Mongodb 모듈 파일 생성

In [42]:
%%writefile playnomore/playnomore/mongodb.py
import pymongo

client = pymongo.MongoClient('mongodb://[EC2IP]:27017/')
db = client.playnomore
collection = db.apparel

Writing playnomore/playnomore/mongodb.py


In [43]:
%%writefile playnomore/playnomore/pipelines.py
from .mongodb import collection

class PlaynomorePipeline(object):
    
    def process_item(self, item, spider):
        
        data = { "title": item["title"], 
                 "price": item["price"],
                 "img": item["img"], 
                 "link": item["link"],
               }
        
        collection.insert(data)
        
        return item

Overwriting playnomore/playnomore/pipelines.py


In [44]:
!echo "ITEM_PIPELINES = {"  >> playnomore/playnomore/settings.py

In [45]:
!echo "   'playnomore.pipelines.PlaynomorePipeline': 300," >> playnomore/playnomore/settings.py

In [46]:
!echo "}" >> playnomore/playnomore/settings.py

In [47]:
!tail -n 5 playnomore/playnomore/settings.py

#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
ITEM_PIPELINES = {
   'playnomore.pipelines.PlaynomorePipeline': 300,
}


In [50]:
!cat ./playnomore/run.sh

cd playnomore
scrapy crawl Playnomore -o playnomore2.csv -a category1=apparel -a category2=26


In [51]:
!./playnomore/run.sh

2021-09-20 01:02:20 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: playnomore)
2021-09-20 01:02:20 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.8.5 (default, Sep  4 2020, 02:22:02) - [Clang 10.0.0 ], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform macOS-10.16-x86_64-i386-64bit
2021-09-20 01:02:20 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-09-20 01:02:20 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'playnomore',
 'NEWSPIDER_MODULE': 'playnomore.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['playnomore.spiders']}
2021-09-20 01:02:20 [scrapy.extensions.telnet] INFO: Telnet Password: 6a91e225afefd692
2021-09-20 01:02:20 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.

2021-09-20 01:02:24 [scrapy.extensions.feedexport] INFO: Stored csv feed (6 items) in: playnomore2.csv
2021-09-20 01:02:24 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 5726,
 'downloader/request_count': 14,
 'downloader/request_method_count/GET': 14,
 'downloader/response_bytes': 143485,
 'downloader/response_count': 14,
 'downloader/response_status_count/200': 8,
 'downloader/response_status_count/301': 6,
 'elapsed_time_seconds': 2.726557,
 'feedexport/success_count/FileFeedStorage': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 9, 19, 17, 2, 24, 105448),
 'httpcompression/response_bytes': 453411,
 'httpcompression/response_count': 7,
 'item_scraped_count': 6,
 'log_count/DEBUG': 22,
 'log_count/INFO': 12,
 'memusage/max': 68018176,
 'memusage/startup': 68018176,
 'request_depth_max': 1,
 'response_received_count': 8,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/respo