# Introduction
In this tutorial we're going to write a (very minimal) web crawler using Scrapy, a popular and convenient Python scraping framework. I will go into more detail than the official documentation does, because I include some industrial tricks for using this tool: the whole scraping process, encoding issues, XPath skills, and some tricks for dealing with IP bans.

Note that Jupyter Notebook is not the first choice for running a Scrapy crawler, so here I only show how to run it. If you want to start your own project, create a local Scrapy project and copy my code into it.

The example website is: http://www.imdb.com/chart/top?ref_=nv_mv_250_6

# My understanding of the Scrapy mechanism

Scrapy is built on **Twisted**, an event-driven networking framework that is well suited to asynchronous code. Operations that block a thread include accessing files, databases, or the Web, spawning new processes and consuming their output (such as running shell commands), and executing system-level code (such as waiting on system queues). **Twisted** provides ways to perform these operations without blocking code execution.
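To make the idea of non-blocking execution concrete, here is a minimal sketch. It uses Python's standard `asyncio` module rather than Twisted, so it only illustrates the event-driven style and is not how Scrapy is implemented. The function names and delays are made up, and `asyncio.sleep` stands in for network latency.

```python
import asyncio


async def fake_download(name, delay, finished):
    # asyncio.sleep stands in for network latency: it hands control back to
    # the event loop instead of blocking the thread.
    await asyncio.sleep(delay)
    finished.append(name)


async def crawl_all():
    finished = []
    # All three "downloads" are in flight at the same time.
    await asyncio.gather(
        fake_download("page-a", 0.3, finished),
        fake_download("page-b", 0.1, finished),
        fake_download("page-c", 0.2, finished),
    )
    return finished


completion_order = asyncio.run(crawl_all())
print(completion_order)
```

The pages finish in order of their delay, not in the order they were requested, because no download blocks the others while it waits.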

Here is the processing sequence as I understand it:
    1. The engine gets the initial requests to start crawling.
    2. The engine schedules the requests in the scheduler and asks for the next request to crawl.
    3. The scheduler returns the next request to the engine.
    4. The engine sends the request to the downloader, passing through the downloader middlewares.
    5. Once the downloader finishes downloading the page, it returns the response to the engine.
    6. The engine passes the downloader's response to the spider for processing, through the spider middlewares.
    7. The spider processes the response and returns scraped items and new requests to the engine, through the spider middlewares.
    8. The engine sends the scraped items to the item pipelines, sends the new requests to the scheduler, and asks for the next request to crawl.
    9. The process repeats (continue from step 1) until no requests are left to crawl.
    
![Scrapy architecture diagram](scrapy.png "Scrapy architecture")
*image citation: https://blog.csdn.net/yancey_blog/article/details/53888473*
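
The loop described above can be sketched as a toy, single-threaded simulation. This is not Scrapy code: the real engine is asynchronous and far more involved. Every name here (`FAKE_SITE`, `download`, `spider_parse`, `run_engine`) is hypothetical and exists only to show how requests, responses and items move between the components.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (page text, outgoing links).
FAKE_SITE = {
    "/top": ("chart", ["/movie/1", "/movie/2"]),
    "/movie/1": ("Movie One", []),
    "/movie/2": ("Movie Two", ["/movie/1"]),
}


def download(url):
    """Downloader: turn a request (here just a URL) into a response."""
    text, links = FAKE_SITE[url]
    return {"url": url, "text": text, "links": links}


def spider_parse(response):
    """Spider: yield an item for movie pages and a new request per link."""
    if response["url"].startswith("/movie/"):
        yield {"item": response["text"]}
    for link in response["links"]:
        yield {"request": link}


def run_engine(start_url):
    scheduler = deque([start_url])  # pending requests
    seen = {start_url}              # duplicate filter
    pipeline_output = []            # what the item pipeline received
    while scheduler:
        url = scheduler.popleft()               # steps 2-3
        response = download(url)                # steps 4-5
        for result in spider_parse(response):   # steps 6-7
            if "item" in result:
                pipeline_output.append(result["item"])  # step 8: items
            elif result["request"] not in seen:
                seen.add(result["request"])
                scheduler.append(result["request"])     # step 8: requests
    return pipeline_output


items = run_engine("/top")
print(items)
```

The loop ends when the scheduler is empty, which corresponds to step 9.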

## Set up Scrapy
First of all, you need to install Scrapy by following the official website. You can use plain *pip*, but I strongly recommend installing it through *Anaconda*:
    > $ conda install -c anaconda scrapy
    
After installation, we can initialize a project:
    > $ scrapy startproject [tutorial] #[tutorial] is the name of your project.  
    
You will then find a directory structure like the following, with my explanations attached:
    tutorial/
        scrapy.cfg            # deploy configuration file

        tutorial/             # project's Python module, you'll import your code from here
            __init__.py

            items.py          # project items definition file

            middlewares.py    # project middlewares file

            pipelines.py      # project pipelines file

            settings.py       # project settings file

            spiders/          # a directory where you'll later put your spiders
                __init__.py


## Xpath introduction
To write a web crawler, we need to locate elements in an HTML page. Here we use XPath, a language for selecting nodes in XML documents that also works with HTML. Scrapy selectors also accept CSS selectors, the same expressions that stylesheets use to associate styles with specific HTML elements. Scrapy selectors are built over the lxml library, which means they're very similar to it in speed and parsing accuracy.

Here we go through a Scrapy selector example to get familiar with XPath operations in Scrapy.

In [8]:
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

body = '<html><body>Hello<span>good</span>World</body></html>'
print (Selector(text=body).xpath('//span/text()').extract())
print (Selector(text=body).xpath('//body/text()').extract())
print (Selector(text=body).xpath('//span/text()').extract_first())
print (Selector(text=body).xpath('string(//body)').extract())
print (Selector(text=body).xpath('string(//span)').extract_first())

['good']
['Hello', 'World']
good
['HellogoodWorld']
good


From the example above, there are some tricks:
    1. text() retrieves the text nodes separately, while string() concatenates them into one string.
    2. extract() returns a list, while extract_first() returns only the first matched element.
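
The difference between separate text nodes and concatenated text comes from the document tree itself, not from Scrapy. As a rough illustration, the sketch below rebuilds both results with the standard library's `xml.etree.ElementTree` instead of Scrapy selectors. ElementTree supports only a small subset of XPath, so the `text()` and `string()` behaviour is imitated with `.text`, `.tail` and `itertext()`.

```python
import xml.etree.ElementTree as ET

body = ET.fromstring("<body>Hello<span>good</span>World</body>")

# Direct text nodes of <body>, comparable to //body/text(): the text before
# the first child plus the "tail" text that follows each child element.
direct_text = [body.text] + [child.tail for child in body]

# All descendant text joined together, comparable to string(//body).
joined_text = "".join(body.itertext())

print(direct_text)
print(joined_text)
```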

## Write a spider
OK, after setting up the Scrapy project, we need to write a spider to scrape something:
    > $ scrapy genspider [name] [domain]
Here we generate a spider named **crawl_data**, saved as **crawl_data.py**; the Python script is as follows.
We want to crawl the information of all top-rated movies, including title, url, year, rating, director and introduction, and then write it to JSON files, one file per movie. (As written, the spider extracts the year but does not store it in the item.)

In [10]:
# -*- coding: utf-8 -*-
import scrapy
import re

# ImdbItem is a class of dictionary, its object is used to transmit data from spider to pipeline.
from imdb.items import ImdbItem


class CrawlDataSpider(scrapy.Spider):
    name = 'crawl_data'
    
    # Requests to domains outside this list are filtered out.
    # Use bare domain names here, not full URLs.
    allowed_domains = ['imdb.com']
    
    # We can define initial starting urls here, it is a list, so we may have a list of entries.
    start_urls = ['http://www.imdb.com/chart/top?ref_=nv_mv_250_6'
                  ]

    def parse(self, response):
        for url in response.selector.xpath('//tbody[@class="lister-list"]/tr/td/a/@href'):
            # print url
            url = response.urljoin(url.extract())
            # Follow the link to the movie's detail page.
            yield scrapy.Request(url, callback=self.parse_next)

    def parse_next(self, response):
        item = ImdbItem()
        
        url = response.url
        
        year = response.selector.xpath(
            'string(//div[@class="titleBar"]/div[@class="title_wrapper"]/h1[@class=""]/span/a)').extract_first()
        
        introduction = response.selector.xpath(
            'string(//div[@class="plot_summary_wrapper"]/div[@class="plot_summary "]/div[@class="summary_text"])').extract_first()
        
        title = response.selector.xpath(
            '//div[@class="titleBar"]/div[@class="title_wrapper"]/h1[@class=""]/text()').extract_first()
        
        rate_points = response.selector.xpath(
            'string(//div[@class="ratingValue"]/strong/span)').extract_first()
        
        director = response.selector.xpath(
            'string(//div[@class="plot_summary "]/div[@class="credit_summary_item"]/span[@itemprop="director"]/a/span)').extract_first()
        
        item['url'] = response.url
        item['title'] = title
        item['introduction'] = CrawlDataSpider.cleanup(introduction)
        item['rate_points'] = rate_points
        item['director'] = director
        
        return item


    # This static method is used to clean the raw text, eliminating space or new_line_break.
    @staticmethod
    def cleanup(text):
        if text is None:
            return ''
        tmp0 = re.sub(' +', ' ', text)
        tmp1 = tmp0.strip().replace('\t', '\n').replace('\r', '\n')
        tmp2 = re.sub('\n +', '\n', tmp1)
        tmp3 = re.sub('\n+', '\n', tmp2)
        return tmp3
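
To see what `cleanup` does to scraped text, here is a small standalone check. The function body is copied from the spider above so that the snippet runs on its own. The sample string is made up to resemble the whitespace-padded text that `string(...)` often returns.

```python
import re


def cleanup(text):
    # Same steps as CrawlDataSpider.cleanup, copied so this runs standalone.
    if text is None:
        return ''
    tmp0 = re.sub(' +', ' ', text)                               # collapse runs of spaces
    tmp1 = tmp0.strip().replace('\t', '\n').replace('\r', '\n')  # trim; tabs and CRs become newlines
    tmp2 = re.sub('\n +', '\n', tmp1)                            # drop spaces right after a newline
    tmp3 = re.sub('\n+', '\n', tmp2)                             # collapse consecutive newlines
    return tmp3


raw = "\n    Two imprisoned   men bond\n\n    over a number of years.\n  "
cleaned = cleanup(raw)
print(repr(cleaned))
```

Note that a missing value (`None`, which `extract_first()` returns when nothing matches) becomes an empty string instead of raising an error.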


## items.py
Now that the spider is defined, we need to define the data it sends to the pipeline. The fields of the item in **items.py** should be consistent with those used in the spider file.

In [11]:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ImdbItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()
    title = scrapy.Field()
    introduction = scrapy.Field()
    director = scrapy.Field()
    rate_points = scrapy.Field()

## pipelines.py
We also need to define our pipeline operations. After the data is scraped and pre-processed in the spider, we can store it in **pipelines.py**, either in a database or as local files. When writing text to JSON files, pay attention to the encoding:
    1. Use utf-8 as the file encoding.
    2. Turn off ASCII escaping (ensure_ascii=False); otherwise, even with utf-8 set, non-ASCII characters are still written as ASCII escape sequences.

In [None]:
# -*- coding: utf-8 -*-
import codecs
import json
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


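class ImdbPipeline(object):
    def process_item(self, item, spider):
        # One JSON file per movie, named after its title.
        # The json_file/ directory must already exist.
        name = item['title']
        path = 'json_file/' + name + '.json'
        with codecs.open(path, 'w+', encoding='utf-8') as f:
            f.write(json.dumps(dict(item), ensure_ascii=False))
        # Return the item so that later pipelines still receive it.
        return item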
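
The effect of the two encoding points can be checked outside Scrapy with a minimal sketch. The item below is made-up sample data, and `safe_filename` is a hypothetical helper (not part of the pipeline above) that I add because a title such as `Face/Off` contains a character that is not allowed in a file name.

```python
import json
import os
import re
import tempfile


def safe_filename(title):
    # Hypothetical helper: replace characters that are unsafe in file names.
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip() or 'untitled'


# Made-up sample item containing non-ASCII characters.
item = {'title': 'Face/Off', 'introduction': 'A naïve résumé of the plot.'}

escaped = json.dumps(item, ensure_ascii=True)    # non-ASCII becomes \uXXXX escapes
readable = json.dumps(item, ensure_ascii=False)  # non-ASCII characters are kept

out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, safe_filename(item['title']) + '.json')
with open(path, 'w', encoding='utf-8') as f:
    f.write(readable)

# Reading the file back with the same encoding restores the original item.
with open(path, encoding='utf-8') as f:
    restored = json.load(f)

print(escaped)
print(readable)
```

Both strings decode to the same data; the difference is only whether a person opening the file sees the original characters or escape sequences.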

## settings.py
The Scrapy settings allow you to customize the behaviour of all Scrapy components, including the core, extensions, pipelines and the spiders themselves. Here are some frequently used settings:
    1. ROBOTSTXT_OBEY: whether to obey the site's robots.txt rules; set it to False only if you deliberately want the crawler to ignore them.
    2. DOWNLOAD_DELAY: the delay between two consecutive requests to the same website, in seconds. Around 0.5 s is a common choice; if requests are too frequent, your IP may be banned by the server.
    3. COOKIES_ENABLED, DEFAULT_REQUEST_HEADERS: whether the cookies middleware is enabled, and the default headers sent with every request. Both matter when avoiding bans; for example, you can keep multiple cookies and send a randomly chosen one with each request.
    4. DOWNLOADER_MIDDLEWARES: which downloader middlewares are enabled, including self-defined ones, and in what order.
    5. ITEM_PIPELINES: which item pipelines are enabled; needed if you want storage operations to run.
    6. HTTPCACHE_ENABLED: whether to cache the downloaded HTML files; once a page is cached, later runs read the local copy instead of downloading it again.

In [13]:
# -*- coding: utf-8 -*-

# Scrapy settings for imdb project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'imdb'

SPIDER_MODULES = ['imdb.spiders']
NEWSPIDER_MODULE = 'imdb.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'imdb (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'imdb.middlewares.ImdbSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'imdb.middlewares.RotateUserAgentMiddleware': 543,
}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'imdb.pipelines.ImdbPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

## middlewares.py
Middlewares in Scrapy come in two separate kinds: spider middlewares and downloader middlewares. Spider middlewares customize how the responses sent to the spider, and the items and requests it returns, are processed; downloader middlewares customize the requests sent to the downloader and the responses it returns. Most of the time we do not need custom middlewares.

### user-agent
The user-agent is an important header for simulating a browser, mainly to prevent the crawler from being banned. As the settings.py section shows, the user-agent can be set in settings.py, such as:

In [14]:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'

However, a single fixed user-agent can still get the crawler banned. Increasing the delay between requests also reduces the risk. These are relatively simple camouflage techniques, but they are enough for a basic crawler.
To go further, we define several user-agents and pick one at random for each download, which makes a ban less likely.

In [15]:
# -*-coding:utf-8-*-


import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            # print(ua)
            request.headers.setdefault('User-Agent', ua)

    # The default user_agent_list can include Chrome, IE, Firefox, Mozilla, Opera and Netscape strings
    # (only Chrome ones are listed here). More user agent strings can be found at
    # http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [ \
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1", \
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11", \
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6", \
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6", \
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1", \
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5", \
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5", \
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", \
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", \
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3", \
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24", \
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]

Then we need to change the middleware settings in **settings.py** as follows:

In [17]:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'imdb.middlewares.RotateUserAgentMiddleware': 543,
}
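
Setting the built-in `UserAgentMiddleware` to `None` disables it, and that matters because `process_request` above uses `setdefault`, which only writes the header when it is absent. As far as I understand, the built-in middleware would otherwise run earlier in the chain and fill the header first. The sketch below shows the `setdefault` behaviour with a plain dict standing in for `request.headers`; the dict and the shortened agent names are assumptions made for illustration.

```python
import random

# Made-up, shortened stand-ins for real user-agent strings.
USER_AGENTS = ["agent-a", "agent-b", "agent-c"]


def rotate_user_agent(headers, rng=random):
    # Mirrors process_request above: fill the header only when it is absent.
    headers.setdefault("User-Agent", rng.choice(USER_AGENTS))
    return headers


# Case 1: nothing set the header earlier, so the random choice is used.
fresh = rotate_user_agent({})

# Case 2: an earlier middleware already set the header, so setdefault keeps it.
preset = rotate_user_agent({"User-Agent": "default-agent"})

print(fresh["User-Agent"] in USER_AGENTS)
print(preset["User-Agent"])
```

In the second case the rotation has no effect at all, which is the situation the `None` entry avoids.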

# Final Run
Finally, in the terminal we type:
    > $ scrapy crawl [spider_name]

Now our spider starts crawling web pages!

## How to run Scrapy on Jupyter Notebook?

Create a new notebook and use the CrawlerProcess or CrawlerRunner class to run the spider in a cell:

In [19]:
# from scrapy.crawler import CrawlerProcess
# from scrapy.utils.project import get_project_settings

# process = CrawlerProcess(get_project_settings())
# 
# process.crawl('your-spider')
# process.start() # the script will block here until the crawling is finished
# # citation: https://stackoverflow.com/a/45341285/8299533

## 2. Deploy your web crawler to cloud

### Distributing crawlers: scrapy-redis-AWS

Combining Scrapy with Redis across multiple hosts gives a distributed crawling environment. For advanced crawler development, a distributed crawler is often necessary.

### Scrapinghub

To deploy your web crawler to the cloud, most people think of AWS or GCP. However, there is a dedicated scraping cloud service: Scrapinghub, https://scrapinghub.com/scrapy-cloud. Scrapy Cloud removes the need to set up and monitor servers and provides a nice UI to manage spiders and review scraped items, logs and stats. You can also use IP proxies to avoid IP bans from the server.

*citation: https://doc.scrapy.org/en/latest/topics/logging.html*



## 3. How to avoid IP from being banned

    1. Construct reasonable HTTP request headers.
    2. Learn to set cookies.
    3. Follow a normal access path and delay your requests.
    4. Pay attention to hidden input field values.
    5. Avoid honeypots, such as links that a human visitor would never see or follow.
    6. Use varying remote IP addresses through proxies.
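
Items 1, 3 and 6 can be sketched with the standard library alone, outside Scrapy. Everything concrete here is a placeholder: the header values, the proxy addresses and the helper names are assumptions for illustration, and the 0.5x to 1.5x range for the delay is simply one reasonable way to randomise it.

```python
import random
import urllib.request

# All values below are placeholders for illustration only.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en",
}
PROXIES = ["http://127.0.0.1:8080", "http://127.0.0.1:8081"]


def build_request(url):
    # Item 1: send browser-like headers with every request.
    return urllib.request.Request(url, headers=HEADERS)


def polite_delay(base=0.5, rng=random):
    # Item 3: a randomised wait between 0.5x and 1.5x of the base delay.
    return base * rng.uniform(0.5, 1.5)


def build_opener_with_proxy(rng=random):
    # Item 6: route traffic through a randomly chosen proxy.
    proxy = rng.choice(PROXIES)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return proxy, urllib.request.build_opener(handler)


request = build_request("http://www.imdb.com/chart/top")
delay = polite_delay()
proxy, opener = build_opener_with_proxy()
# opener.open(request) would perform the actual download; it is left out so
# that this sketch runs without network access.
print(request.get_header("User-agent"), round(delay, 2), proxy)
```

Inside a Scrapy project the same ideas map onto DEFAULT_REQUEST_HEADERS, DOWNLOAD_DELAY and a downloader middleware, as covered in the sections above.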