### Naver Movie
- Crawling is blocked by the site's robots.txt rules (see step 6).
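
Whether a URL is allowed can be checked before any spider is written. The following is a minimal sketch using the standard library's `urllib.robotparser`. The robots.txt body, the `is_allowed` helper, and the `naver_movie` user agent string are assumptions for illustration; the real file served by the site may contain different rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body for illustration only; the real file may differ.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /movie/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://movie.naver.com/movie/running/current.nhn"
print(is_allowed(SAMPLE_ROBOTS_TXT, "naver_movie", url))
```

Under these assumed rules the listing URL is reported as not allowed, which is the situation Scrapy's `ROBOTSTXT_OBEY` setting reacts to.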

#### 1. Create the project

In [1]:
!scrapy startproject naver_movie

New Scrapy project 'naver_movie', using template directory 'c:\users\challenge\anaconda3\lib\site-packages\scrapy\templates\project', created in:
    C:\Users\challenge\Downloads\updated_documents\09_[추가강의] 크롤링\크롤링4\Scrapy_20191122_01\naver_movie

You can start your first spider with:
    cd naver_movie
    scrapy genspider example example.com


#### 2. Define the Items
- https://movie.naver.com/movie/running/current.nhn
- Title, audience count, rating

In [None]:
# !cat naver_movie/naver_movie/items.py

In [2]:
%%writefile naver_movie/naver_movie/items.py
import scrapy

class NaverMovieItem(scrapy.Item):
    title = scrapy.Field()
    count = scrapy.Field()
    star = scrapy.Field()

Overwriting naver_movie/naver_movie/items.py


#### 3. Check the XPath expressions

In [3]:
import requests
import scrapy
from scrapy.http import TextResponse

In [4]:
# Links to the detail pages
req = requests.get("https://movie.naver.com/movie/running/current.nhn")
response = TextResponse(req.url, body=req.text, encoding="utf-8")

In [5]:
links = response.xpath(
    '//*[@id="content"]/div[1]/div[1]/div[3]/ul/li/dl/dt/a/@href').extract()
len(links), links[0]

(126, '/movie/bi/mi/basic.nhn?code=192620')
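
The shape of that link query can be tried offline. The sketch below applies a shortened path of the same form to a hypothetical, heavily simplified fragment using the standard library's `xml.etree.ElementTree`. ElementTree supports only a subset of XPath (no `/@href` step), so the attribute is read with `.get("href")` instead. The fragment and its `code` values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified fragment; the real page is far more nested.
SAMPLE_HTML = """
<html><body>
  <div id="content">
    <ul>
      <li><dl><dt><a href="/movie/bi/mi/basic.nhn?code=1">Movie A</a></dt></dl></li>
      <li><dl><dt><a href="/movie/bi/mi/basic.nhn?code=2">Movie B</a></dt></dl></li>
    </ul>
  </div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE_HTML)

# Same idea as the notebook query: start at id="content", walk down to <a>.
anchors = root.findall(".//*[@id='content']/ul/li/dl/dt/a")
hrefs = [a.get("href") for a in anchors]
print(len(hrefs), hrefs[0])
```

Real HTML is rarely well-formed XML, so this only illustrates the path structure; the notebook's `TextResponse.xpath` remains the tool for the live page.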

In [6]:
link = response.urljoin(links[0])
link

'https://movie.naver.com/movie/bi/mi/basic.nhn?code=192620'
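
The extracted `href` is relative, so it has to be resolved against the page URL. As a sketch of that step, the standard library's `urllib.parse.urljoin` gives the same result when called with the listing URL as the base. The variable names are illustrative.

```python
from urllib.parse import urljoin

# Base is the listing page; href is the relative link extracted above.
base = "https://movie.naver.com/movie/running/current.nhn"
href = "/movie/bi/mi/basic.nhn?code=192620"

absolute = urljoin(base, href)
print(absolute)
```

Because the `href` starts with `/`, only the scheme and host of the base are kept and the path is replaced.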

In [7]:
# Collect the detail-page data
req = requests.get(link)
response = TextResponse(req.url, body=req.text, encoding="utf-8")

In [8]:
title = response.xpath(
    '//*[@id="content"]/div[1]/div[2]/div[1]/h3/a[1]/text()').extract()[0]
count = response.xpath(
    '//*[@id="content"]/div[1]/div[2]/div[1]/dl/dd[5]/div/p[2]/text()').extract()[0]
star = response.xpath(
    '//*[@id="actualPointPersentBasic"]/div/em/text()').extract()
star = "".join(star)
title, count, star

('비와 당신의 이야기', '31,400명', '9.47')
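
All three values come back as strings. If numeric values are needed later, a conversion step along the following lines could be applied. This is a sketch: the helper names `parse_count` and `parse_star` are hypothetical, and treating an empty rating as `None` is an assumption about how missing ratings should be handled.

```python
from typing import Optional


def parse_count(text: str) -> int:
    """Convert an audience string such as '31,400명' to an integer."""
    digits = text.replace(",", "").replace("명", "").strip()
    return int(digits) if digits else 0


def parse_star(text: str) -> Optional[float]:
    """Convert a rating string such as '9.47' to float; empty means no rating."""
    text = text.strip()
    return float(text) if text else None


print(parse_count("31,400명"), parse_star("9.47"))
```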

#### 4. Write the spider

In [9]:
%%writefile naver_movie/naver_movie/spiders/spider.py
import scrapy
from naver_movie.items import NaverMovieItem


class MovieSpider(scrapy.Spider):
    name = "NaverMovie"
    # Scrapy reads `allowed_domains` (plural) and expects bare domains, not URLs.
    allowed_domains = ["movie.naver.com"]
    start_urls = ["https://movie.naver.com/movie/running/current.nhn"]

    def parse(self, response):
        # Collect the detail-page link of every movie on the listing page.
        links = response.xpath('//*[@id="content"]/div[1]/div[1]/div[3]/ul/li/dl/dt/a/@href').extract()
        for link in links:
            link = response.urljoin(link)
            yield scrapy.Request(link, callback=self.parse_page_contents)

    def parse_page_contents(self, response):
        item = NaverMovieItem()
        item["title"] = response.xpath('//*[@id="content"]/div[1]/div[2]/div[1]/h3/a[1]/text()').extract()[0]
        try:
            item["count"] = response.xpath('//*[@id="content"]/div[1]/div[2]/div[1]/dl/dd[5]/div/p[2]/text()').extract()[0]
        except IndexError:
            # No audience figure matched on this page; fall back to "0명" (0 people).
            item["count"] = "0명"
        # extract() returns a list of text pieces, so join them into one string.
        star = response.xpath('//*[@id="actualPointPersentBasic"]/div/em/text()').extract()
        item["star"] = "".join(star)
        yield item

Writing naver_movie/naver_movie/spiders/spider.py
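
The fallback used for `count` in `parse_page_contents` is a "first match or default" pattern. A minimal sketch of that pattern as a plain helper follows. `first_or_default` is a hypothetical name, not part of Scrapy. Scrapy selector lists also provide `.get(default=...)`, which may serve the same purpose without a `try` block.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def first_or_default(values: Sequence[T], default: T) -> T:
    """Return the first element, or `default` when the sequence is empty."""
    return values[0] if values else default


# A matched XPath yields a non-empty list; an unmatched one yields [].
print(first_or_default(["31,400명"], "0명"))
print(first_or_default([], "0명"))
```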


#### 5. Run Scrapy

In [28]:
%%writefile run.sh
cd naver_movie
scrapy crawl NaverMovie -o naver_movie.csv

Overwriting run.sh


In [29]:
!chmod +x run.sh

In [30]:
!./run.sh

2019-11-22 07:39:48 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: naver_movie)
2019-11-22 07:39:48 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.6.9 (default, Oct 24 2019, 05:23:48) - [GCC 7.4.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-4.15.0-1054-aws-x86_64-with-debian-buster-sid
2019-11-22 07:39:48 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'naver_movie', 'FEED_FORMAT': 'csv', 'FEED_URI': 'naver_movie.csv', 'NEWSPIDER_MODULE': 'naver_movie.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['naver_movie.spiders']}
2019-11-22 07:39:48 [scrapy.extensions.telnet] INFO: Telnet Password: 72e3a700d3ddecf4
2019-11-22 07:39:48 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',

#### 6. Modify settings.py
- The first run is rejected with "Forbidden by robots.txt", so `ROBOTSTXT_OBEY` is switched to `False`.

In [10]:
!head -n 25 naver_movie/naver_movie/settings.py | tail -n 5


# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)


In [11]:
!sed -i 's/ROBOTSTXT_OBEY = True/ROBOTSTXT_OBEY = False/' naver_movie/naver_movie/settings.py
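
`sed -i` may not be available in every shell (for example a plain Windows prompt). As a sketch of the same edit in Python, the helper below rewrites the `ROBOTSTXT_OBEY` line of a settings file. `set_robotstxt_obey` is a hypothetical name, and the demonstration runs on a throwaway file rather than the project's real `settings.py`.

```python
import re
import tempfile
from pathlib import Path


def set_robotstxt_obey(settings_path: Path, obey: bool) -> bool:
    """Rewrite the ROBOTSTXT_OBEY line in place; return True if it changed."""
    text = settings_path.read_text(encoding="utf-8")
    new_text = re.sub(
        r"^ROBOTSTXT_OBEY[ \t]*=[ \t]*(?:True|False)[ \t]*$",
        f"ROBOTSTXT_OBEY = {obey}",
        text,
        flags=re.MULTILINE,
    )
    if new_text == text:
        return False
    settings_path.write_text(new_text, encoding="utf-8")
    return True


# Demonstrate on a temporary file with assumed contents.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "settings.py"
    demo.write_text("BOT_NAME = 'naver_movie'\nROBOTSTXT_OBEY = True\n", encoding="utf-8")
    changed = set_robotstxt_obey(demo, False)
    print(changed, demo.read_text(encoding="utf-8").splitlines()[-1])
```

For the notebook's project, the path would be `naver_movie/naver_movie/settings.py`.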

In [39]:
!./run.sh

2019-11-22 07:44:50 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: naver_movie)
2019-11-22 07:44:50 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.6.9 (default, Oct 24 2019, 05:23:48) - [GCC 7.4.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-4.15.0-1054-aws-x86_64-with-debian-buster-sid
2019-11-22 07:44:50 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'naver_movie', 'FEED_FORMAT': 'csv', 'FEED_URI': 'naver_movie.csv', 'NEWSPIDER_MODULE': 'naver_movie.spiders', 'SPIDER_MODULES': ['naver_movie.spiders']}
2019-11-22 07:44:50 [scrapy.extensions.telnet] INFO: Telnet Password: a058c0feefb11b97
2019-11-22 07:44:50 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.log

2019-11-22 07:44:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://movie.naver.com/movie/bi/mi/basic.nhn?code=26756>
{'count': '0명', 'star': '8.83', 'title': '모리스'}
2019-11-22 07:44:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://movie.naver.com/movie/bi/mi/basic.nhn?code=179307>
{'count': '0명', 'star': '9.01', 'title': '벌새'}
2019-11-22 07:44:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.naver.com/movie/bi/mi/basic.nhn?code=174797> (referer: https://movie.naver.com/movie/running/current.nhn)
2019-11-22 07:44:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://movie.naver.com/movie/bi/mi/basic.nhn?code=174748>
{'count': '0명', 'star': '10.0', 'title': '프란치스코 교황: 맨 오브 히스 워드'}
2019-11-22 07:44:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://movie.naver.com/movie/bi/mi/basic.nhn?code=155123>
{'count': '0명', 'star': '9.36', 'title': '미스 사이공: 25주년 특별 공연'}
2019-11-22 07:44:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.naver.

2019-11-22 07:44:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.naver.com/movie/bi/mi/basic.nhn?code=10526> (referer: https://movie.naver.com/movie/running/current.nhn)
2019-11-22 07:44:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://movie.naver.com/movie/bi/mi/basic.nhn?code=174903>
{'count': '0명', 'star': '9.00', 'title': '엑시트'}
2019-11-22 07:44:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://movie.naver.com/movie/bi/mi/basic.nhn?code=170824>
{'count': '0명', 'star': '8.75', 'title': '에브리데이'}
2019-11-22 07:44:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://movie.naver.com/movie/bi/mi/basic.nhn?code=155310>
{'count': '0명', 'star': '8.93', 'title': '비치온더비치'}
2019-11-22 07:44:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://movie.naver.com/movie/bi/mi/basic.nhn?code=136890>
{'count': '0명', 'star': '9.12', 'title': '에이미'}
2019-11-22 07:44:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.naver.com/movie/bi/mi/basic.nhn?

2019-11-22 07:44:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.naver.com/movie/bi/mi/basic.nhn?code=177381> (referer: https://movie.naver.com/movie/running/current.nhn)
2019-11-22 07:44:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.naver.com/movie/bi/mi/basic.nhn?code=185192> (referer: https://movie.naver.com/movie/running/current.nhn)
2019-11-22 07:44:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://movie.naver.com/movie/bi/mi/basic.nhn?code=119591>
{'count': '0명', 'star': '10.0', 'title': '사선의 끝'}
2019-11-22 07:44:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://movie.naver.com/movie/bi/mi/basic.nhn?code=182756>
{'count': '0명', 'star': '', 'title': '늑대의 아이들'}
2019-11-22 07:44:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.naver.com/movie/bi/mi/basic.nhn?code=175878> (referer: https://movie.naver.com/movie/running/current.nhn)
2019-11-22 07:44:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.naver.co

2019-11-22 07:44:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.naver.com/movie/bi/mi/basic.nhn?code=183132> (referer: https://movie.naver.com/movie/running/current.nhn)
2019-11-22 07:44:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.naver.com/movie/bi/mi/basic.nhn?code=179398> (referer: https://movie.naver.com/movie/running/current.nhn)
2019-11-22 07:44:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://movie.naver.com/movie/bi/mi/basic.nhn?code=180385>
{'count': '0명', 'star': '9.00', 'title': '니나 내나'}
2019-11-22 07:44:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://movie.naver.com/movie/bi/mi/basic.nhn?code=182205>
{'count': '0명', 'star': '8.57', 'title': '가장 보통의 연애'}
2019-11-22 07:44:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.naver.com/movie/bi/mi/basic.nhn?code=10676> (referer: https://movie.naver.com/movie/running/current.nhn)
2019-11-22 07:44:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.nav

2019-11-22 07:44:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://movie.naver.com/movie/bi/mi/basic.nhn?code=163831>
{'count': '166,024명', 'star': '9.26', 'title': '엔젤 해즈 폴른'}
2019-11-22 07:44:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://movie.naver.com/movie/bi/mi/basic.nhn?code=154298>
{'count': '5,030명', 'star': '9.20', 'title': '아이리시맨'}
2019-11-22 07:44:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://movie.naver.com/movie/bi/mi/basic.nhn?code=182387>
{'count': '52,764명', 'star': '9.42', 'title': '윤희에게'}
2019-11-22 07:44:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://movie.naver.com/movie/bi/mi/basic.nhn?code=167605>
{'count': '2,360,029명', 'star': '8.60', 'title': '터미네이터: 다크 페이트'}
2019-11-22 07:44:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://movie.naver.com/movie/bi/mi/basic.nhn?code=179159>
{'count': '1,990,626명', 'star': '8.91', 'title': '신의 한 수: 귀수편'}
2019-11-22 07:44:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https:

In [12]:
import pandas as pd

In [13]:
df = pd.read_csv("naver_movie/naver_movie.csv")
df.tail(2)

| | count | star | title |
|---|---|---|---|
| 124 | 965,067명 | 8.32 | 미나리 |
| 125 | 28,276명 | 9.54 | 더 스파이 |
