# How  to write a Web Crawler in Python3

## Introduction

In data science, the very basic is to collect data. In some cases, the data is well documented and we know the quantity of the data well, like the reviews we collected from Yelp, which only requires multiple requests. However, when we want to collect data from social network, we don't know how much data is there or where exactly we can find data we want. In this case, we need web crawlers to scan across the Internet to  collect all the data we want. 

This tutorial will first give a basic understanding of a web crawler with writeup and easy Python 3 code. We will then introduce several important issues in writing a crawler. At last we will introduce a popular Python sraping library/framework 'Scrapy', which helps us to solve the problems mentioned and buidl a web crawler/scraper with little work. 

### prerequisites

#### Basic knowlegde of HTTP requests

Due to the limit of words, people who are intereted can read NTU's introduction to HTTP, which is very helpful. 

- [HTTP (HyperText Transfer Protocol)](https://www.ntu.edu.sg/home/ehchua/programming/webprogramming/HTTP_Basics.html)

#### Basic knowlegde of HTML 

If you don't know this topic well, you can refer to w3school's HTML tutorial which is given below.

- [HTML5 Tutorial](https://www.w3schools.com/html)

#### Some Python libraries

You should have experience with Python `Requests` and `Beautiful Soup` libraries. Here are the official documents:
- [Requests: HTTP for Humans](http://docs.python-requests.org/en/master/#requests-http-for-humans)
- [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#)

#### Xpath language

Xpath is a language used to search for elements in XML document. You will need Xpath basics when using Scrapy. Here is a good tutorial provided by W3Schools.

- [XPath Tutorial](https://www.w3schools.com/xml/xpath_intro.asp)

### Environment specification: 
- Python: 3.6.4
- Beautiful Soup: 4.6.0
- Requests: 2.18.4
- Scrapy: 1.5.0
- Linux Shell is used in this tutorial, thus run this jupyter notebook on Linux/MacOS only. Yet Scrapy command is not OS-denpendent.

### Tutorial content

- [What is a web crawler](#What-is-a-web-crawler)
- [Write your first web crawler](#Write-your-first-web-crawler)
- [Issues to consider in a crawler](#Issues-to-consider-in-a-crawler)
- [Using Scrapy to build a web crawler](#Using-Scrapy-to-build-a-web-crawler)
- [Reference](#Reference)

## What is a web crawler

A [web crawler](https://en.wikipedia.org/wiki/Web_crawler), also refered as a `Web spider`, or an `ant`. It is a program that sends requests to websites and then downloads web content. Usually, it starts with a list/queue of URLs to visit. When the crawler visits a website, it collects page content and identify the hyperlinks in the web page, add them (or first filter them by user-specified rules) to the list/queue of URLs to visit. Doing visiting websites and adding new urls recursively, web crawlers traverse the websites across the Internet or on a specific host, collecting web page data.

[<img src="500px-WebCrawlerArchitecture.png">](https://upload.wikimedia.org/wikipedia/commons/d/df/WebCrawlerArchitecture.svg)
<center>Figure 1: Web Crawler Architecture (from [WikiPedia](https://en.wikipedia.org/wiki/Web_crawler))</center>

Here is some cool stuff achieved with a web crawler:

###  Google search engine
Google uses crawlers to record the info of web pages and make an index of them and store the index locally.
[<img src="hqdefault.jpg">](https://youtu.be/BNHR6IQJGZs "How Search Works")
<center>Video 1: How Search Works</center>
Or you can read Google's introduction if you are interested:
- [Google search engine|Crawling and indexing](https://www.google.com/intl/ALL/search/howsearchworks/crawling-indexing) 
### More web crawler case studies
This website collected web crawler use cases in ecommerce, travel, restaurant, etc, which should be a good resource if you are very interested in web crawler application and need more information. 

- [Web crawlers use cases](http://promptcloud.dev.onpressidium.com/web-crawl-use-cases)


## Write your first web crawler

Let's first implement a very basic version to consolidate our basic understanding of a web crawler. 

In [14]:
import queue
import requests
from bs4 import BeautifulSoup
import time
import re
from types import MethodType 

In [20]:


# This naive crawler class
# @url_queue:   the queue which stores urls to be crawled
# @visited_url: the set which stores urls already added to url_queue
# @pg_cng:      the max count of web pages to be downloaded and stored 
# @saved_pg:    the dictionary storing downloaded pages, format:{url:content}
# @max_qsize:   the max size of url_queue
class NaiveCrawler:
    url_queue = queue.Queue()
    visited_url = set()
    pg_cng = 0
    saved_pg = {}
    max_qsize = 0
    
    # @init_pg:     list of initial pages for crawling
    def __init__(self, init_pg, max_qsize=10000):
        self.pg_cng = 0
        self.url_queue = queue.Queue()
        self.saved_pg = {}
        for url in init_pg:
            self.url_queue.put(url)
            self.visited_url.add(url)
        self.max_qsize = max_qsize
        
    def __del__(self):
        self.saved_pg.clear()


First we define a spider class `NaiveCrawler`, which contains basic parameters for a crawler, a queue storing urls to be scraped, a set storing urls already visited, web page contents.

In [21]:
# crawling function
# @pg_limit:  max number of pages to crawl
# @max_qsize: max number of pages stored in the url_queue
# @headers:   the http GET request headers
# @params:    the http GET request params
# @interval:  interval between two consecutive requests
def crawl(self, pg_limit=100, 
    headers={}, params={},
    interval=0.1):
    #only download pg_limit number of web pages
    while(self.pg_cng < pg_limit):
        time.sleep(interval)
        if self.url_queue.qsize()>0:
            #get the first url in the queue
            current_url = self.url_queue.get() 
            try:
                #request web pages  
                response = requests.get(current_url, headers=headers, params=params)
                html = response.text
                self.process_page(html,current_url)
            except Exception as e:
                print(e)
            pass
        else:
            break

The `crawl` function is the main function called by our crawler. In each cycle it picks the first url in `url_queue`, sends a request to download the corresponding web page contents with `requests.get`, store the web page contents & add new url links to `url_queue` with the function `process_page`. The loop stops when there are enough pages downloaded or url_queue is full (This is an exception). 

In [22]:
# process the downloaded web page, including store/filter content 
# and add new url to url_queue
# @html: the html doc downloaded
# @url:  the url of the downloaded web page 
def process_page(self,html,url):
    try:
        soup = BeautifulSoup(html, 'html.parser')
        #simply store the whole html doc
        self.saved_pg[url] = html
        self.pg_cng += 1
        #identify all possible externel links appearing in the html doc
        #example externel link: "//facebook.com", "www.google.com"
        for a in soup.find_all("a", href=re.compile("^(http|www|//)*$")):
            #only if one url is not seen before and queue size does
            #not exceeed max limit, we add it to url_queue
            ex_url = a['href']
            if ex_url is not None and ex_url[0]=='/':
                if self.url_queue.qsize() < self.max_qsize:
                    if ex_url not in self.visited_url:
                        self.url_queue.put(ex_url)
                        #once we add a url to url_queue, we consider it 'visited'
                        self.visited_url.add(ex_url)

        #identify all possible internel links 
        #example internel links: "/ebooks/56580"
        for a in soup.find_all("a", href=re.compile("^/[^/].*$")):
            # concatenate the present web page's url with the relative url 
            in_url = a['href']
            if in_url is not None:
                in_url = url + in_url
                if self.url_queue.qsize() < self.max_qsize:
                    if in_url not in self.visited_url:
                        self.url_queue.put(in_url)
                        #once we add a url to url_queue, we consider it 'visited'
                        self.visited_url.add(in_url)

    except Exception as e:
        print(e)



Function `process_page` first parse the downloaded html file with BeautifulSoup library, which we are quite familiar with. It then adds the web page content to the dictionary `saved_pg` and extract all hyperlinks in the html file and store them into `url_queue`, the queue storing urls to be crawled. Here we deal with internel links and externel links separately.

You may need some knowledge of internel link and externel link in HTML (similar to relative path and absolute path in OS), and here is a good introduction if you don't know them before.

[Creating Internal & External HTML Links](https://clearlydecoded.com/creating-internal-external-html-links)

In [23]:
#Test case
naive_crawler = NaiveCrawler(["http://www.gutenberg.org/"])
# add defined functions to NaiveCrawler instance
# this is not a good code style, I only do this to
# make it easier to cut the code into chunks to 
# insert writeup between chunks
naive_crawler.crawl = MethodType(crawl,naive_crawler)
naive_crawler.process_page = MethodType(process_page,naive_crawler)

naive_crawler.crawl(pg_limit=20) 
print("crawler dump result: ",len(naive_crawler.saved_pg)," pages in total")
for url in naive_crawler.saved_pg:
    print(url)
    
key = list(naive_crawler.saved_pg.keys())[0] 
print(key, naive_crawler.saved_pg[key][:500], "....")

del naive_crawler
#print(naive_crawler.saved_pg)

crawler dump result:  20  pages in total
http://www.gutenberg.org/
http://www.gutenberg.org//wiki/Category:Bookshelf
http://www.gutenberg.org//wiki/Gutenberg:Contact_Information#Electronic_Mail
http://www.gutenberg.org//wiki/Gutenberg:Terms_of_Use
http://www.gutenberg.org//wiki/Gutenberg:Project_Gutenberg_Needs_Your_Donation
http://www.gutenberg.org//wiki/Gutenberg:Feeds
http://www.gutenberg.org//wiki/Gutenberg:Offline_Catalogs
http://www.gutenberg.org//wiki/Gutenberg:MobileReader_Devices_How-To
http://www.gutenberg.org//wiki/Category:How-To
http://www.gutenberg.org//wiki/Gutenberg:Contact_Information
http://www.gutenberg.org//wiki/Gutenberg:Readers%27_FAQ#R.26._I.27ve_found_some_obvious_typos_in_a_Project_Gutenberg_text._How_should_I_report_them.3F
http://www.gutenberg.org//wiki/Category:Volunteering
http://www.gutenberg.org//wiki/Gutenberg:Promote_Project_Gutenberg
http://www.gutenberg.org//wiki/Gutenberg:About
http://www.gutenberg.org//wiki/Gutenberg:No_Cost_or_Freedom%3F
http://www

## Issues to consider in a crawler

#### Avoid visiting the same urls: 

If you do not exclude the urls already visited from the request/url queue, you r program may get very slow or even generate request loops. However, in the code above, we can't recognize "https://www.gutenberg.org/ebooks/56857", "www.gutenberg.org/ebooks/56857", "//gutenberg.org/ebooks/56857", "/ebooks/56857"(embeded on www.gutenberg.org) as one url.


#### Avoid being blocked: 

Website hosts always have many techniques to prevent cyberattacks. To avoid a server crash, hosts will usually block an IP if it sends a flood of requests in a short time. This is why we should add an interval between requests. Besides, if you want to scrape trillions of bytes of data from one host (e.g. You want all reviews on Yelp), you should use many different IPs to scrape data, like using proxy servers.







#### User agent:

In http request packet, the [UserAgent](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent) is a key-value pair in a request header that allows hosts to identify the application type, operating system, software vendor, version of the software user agent sending the request.

Sometimes website hosts will parse the 'UserAgent' in a request header and deny the request if it is not sent by a browser. In this case, we should add 'UserAgent' entry in our request headr to disguise the request packet we send. 

#### Others:
- cookies
- fast method to find duplicate urls (Bloom Filter)
- parallel spiders, distributed spiders
- ...

## Using Scrapy to build a web crawler

If we write our own code to deal with all the issues mentioned above, that will be a big project. Luckily, there is a Python web scraping library, which is able to handle framwork and generates part of spider code with little configuration code. 

Here I'd like to introduce to you how to build a web crawler with Scrapy. BTW, Installation info is given here:
- [Installation Guide](https://doc.scrapy.org/en/latest/intro/install.html#installation-guide)


The framework/pipeline of Scrapy is a little bit different from the code of `NaiveCrawler` shown above though the basic ideas are the same. We can still use Scrapy to implement our web crawler without knowing its inner data flow. If you are quite interested in its structure and data flow, you can refer to Scrapy's offical document:

- [Architecture overview](https://doc.scrapy.org/en/latest/topics/architecture.html#architecture-overview)

### Creating a project

Srapy has a very powerful command-line tool which provides several easy but very powerful commands. With several lines of these commands, you can create your crawler framework project, generate simple spider code, launch a crawlers and write output files in the specified format. The command usage is `scrapy <command> [options] [args]`. If you want to know more about Scrapy's command-line tool, you can refer to its official document:

- [Command line tool](https://doc.scrapy.org/en/latest/topics/commands.html#command-line-tool)

Now we first create a project using its `startproject` command.

In [27]:
%%bash
#rm -r ./web_crawler
scrapy startproject web_crawler

Error: scrapy.cfg already exists in /home/vincent/Documents/practical_data/tutorial/web_crawler


With command `scrapy startproject web_crawler`, scrapy automatically generates a working directory `web_crawler`, which contains a scraping framework. The directory structure is like this:
<img src="create_project.png">
In the working directory, `scrapy.cfg` is theproject configuration file and the inner `spiders` directory is where we should put our web crawler files, which I will cover later.

### Create a spider

In Scrapy, spiders are classes that Scrapy framework uses to scrape information from a website. A spider class you create must inherits from `scrapy.Spider`. You must define the initial requests/urls in the spider class.
You should put your spider class in the `spiders` directory. 

Optionally, you can define how your crawler will follow links in the pages, and how to parse the downloaded page content/how to extract data in your spider class.

You can of course write your own spider or you can use Scrapy commands to generate a spider and adjust the code generated. Let's first generate a spider automatically and see how to write a spider class.

You can refer to official document for more information:
- [Scrapy commands | genspider](https://doc.scrapy.org/en/latest/topics/commands.html#genspider)

In [46]:
%%bash
cd ./web_crawler

#scrapy genspider [-t template] <name> <domain>
scrapy genspider guntenberg_crawler www.gutenberg.org
cat ./web_crawler/spiders/guntenberg_crawler.py

cd ..

Spider 'guntenberg_crawler' already exists in module:
  web_crawler.spiders.guntenberg_crawler
# -*- coding: utf-8 -*-
import scrapy


class GuntenbergCrawlerSpider(scrapy.Spider):
    name = 'guntenberg_crawler'
    allowed_domains = ['www.gutenberg.org']
    start_urls = ['http://www.gutenberg.org/']

    def parse(self, response):
        pass


With command `scrapy genspide guntenberg_crawler www.gutenberg.org`, we generate a spider class  `guntenberg_crawler`, which is located in `spiders` directory. In class `GuntenbergCrawlerSpider`, `name` is the   name of the spider, which is needed when we execute `crawl` command. `allowed_domains` is an optional list of strings containing domains that this spider is allowed to crawl. `start_urls` is a list of urls where the spider  begins to crawl from by default.

For more spider attributes and details about how scraping cycle goes, you can refer to this page:
- [Spiders](https://doc.scrapy.org/en/latest/topics/spiders.html)


### Run a spider 
Now we can run the spider just generated easily with Scrapy command `scrapy crawl <name_spider>`, but pay attention that all Scrapy commands except from `startproject` should run in a Scrapy working directory.

In [None]:
%%bash
cd ./web_crawler
scrapy crawl guntenberg_crawler
cd ..

When we launch a spider, Scrapy will generate an initial queue of requests from `start_urls` or you can specify your own initial queue of requests with function `start_requests()`.

These two classes have same initial queue of requests:

In [48]:
import scrapy

class Spider_1(scrapy.Spider):
    name = 'spider_1'
    start_urls = ['http://quotes.toscrape.com']
    
    def parse(self, response):
        pass    
    
class Spider_2(scrapy.Spider):
    name = 'spider_2'
    
    def start_requests(self):
        urls = ['http://quotes.toscrape.com']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
        
    def parse(self, response):
        pass

`start_requests()` provides us with a more flexible initialization strategy, but in our tutorial simply using `start_urls` is enough. 

If you want to know more about `start_requests()`, you can refer to:
- [start_requests()](https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.start_requests)

Each time a Scrapy receives a response, it will execute `parse()` by default to parse the response. Yet `parse()` has a unique feature that you can generate an iterator of `scrapy.Request` object in `parse()`, which will be added to Scrapy's requests queue. With this feature, after we extract urls in `parse()`, we can easily add them to queue of urls waiting to be crawled. 

### Write a web crawler

The following code is in file `web_crawler.py`

In [69]:
#web_crawler.py

import scrapy


class WebCrawler(scrapy.Spider):
    name = "web_crawler"
    #allowed_domains = ['http://quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com']
    
    def parse(self, response):
        # record the scraped urls
        filename = 'web_crawler_log.txt'
        with open(filename, 'a+') as f:
            f.writeln(response.url)
        #extract all urls and crawl them later
        next_pages = response.xpath('//a/@href').extract()
        for next_page in next_pages:
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

In `parse()` function, we use [XPATH SELECTOR](https://doc.scrapy.org/en/latest/topics/selectors.html#selectors) `response.xpath` to select all `href`s in tag `a` in the html document just downloaded. Then we extract the texts of hrefs using `.extract()` method. 

We use `response.follow(next_page, callback=self.parse)` to generate a Scrapy request. Remember we have to parse internal links and external links separately when we use `Beautiful Soup` and `Requests`, but now Scrapy will deal with them both automatically. At last, we generate an iterator of Scrapy requests, which will be caught by Scrapy framework and be sent to requests queue.

See there are around 20 lines of Python code for a crawler, much shorter than our 'original version'.

In [70]:
%%bash
cd ./web_crawler
scrapy crawl web_crawler --set CLOSESPIDER_PAGECOUNT=20
cat ./web_crawler_log.txt
cd ..

http://quotes.toscrape.com/author/Andre-Gide/tein/eedbackct-privacy/http://quotes.toscrape.comhttp://quotes.toscrape.com/tag/value/page/1/http://quotes.toscrape.com/tag/success/page/1/http://quotes.toscrape.com/tag/adulthood/page/1/https://scrapinghub.comhttp://quotes.toscrape.com/tag/be-yourself/page/1/https://www.goodreads.com/quoteshttps://scrapinghub.com/privacy-policyhttps://scrapinghub.com/abuse-reporthttp://quotes.toscrape.com/tag/humor/page/1/https://scrapinghub.com/terms-of-servicehttp://quotes.toscrape.com/tag/classic/page/1/https://www.goodreads.com/about/privacyhttps://www.goodreads.com/helphttps://scrapinghub.wufoo.com/forms/m1f2hw8b00tckmh/http://quotes.toscrape.com/tag/books/page/1/https://www.goodreads.com/about/privacy1http://quotes.toscrape.com/tag/aliteracy/page/1/https://www.goodreads.com/questions/guidelineshttp://quotes.toscrape.com/tag/miracles/page/1/https://www.goodreads.com/help/show/108-how-do-i-add-a-goodreads-tab-to-my-facebook-pagehttp://quotes.toscrape.co

2018-03-30 22:50:53 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: web_crawler)
2018-03-30 22:50:53 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.5.0, Python 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 18:10:19) - [GCC 7.2.0], pyOpenSSL 17.5.0 (OpenSSL 1.0.2o  27 Mar 2018), cryptography 2.1.4, Platform Linux-4.13.0-37-generic-x86_64-with-debian-stretch-sid
2018-03-30 22:50:53 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'web_crawler', 'CLOSESPIDER_PAGECOUNT': '20', 'DOWNLOAD_DELAY': 0.25, 'NEWSPIDER_MODULE': 'web_crawler.spiders', 'SPIDER_MODULES': ['web_crawler.spiders']}
2018-03-30 22:50:53 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.closespider.CloseSpider',
 'scrapy.extensions.logstats.LogStats']
2018-03-30 22:50:53 [scrapy.middlewar

We use `--set CLOSESPIDER_PAGECOUNT=20` arguments to set the page count limit, but the exact page count is not 20. This is because when scraped page count is 20, Scrapy starts its `CloseSpider` process, and Scrapy first waits all requests in the queue to finish and then terminates. Thus there will few more pages downloaded.

### Using UserAgent with Scrapy

As has been mentioned, some websites identify robots/scripts by UserAgent and deny/drop requests sent by them. Now let's see how we can easily handle this with scrapy.

I wrote a naive scraper called `zhihu.py`, which only requests a web page from 'www.zhihu.com' and write the returned content to `zhihu.html`. Zhihu is a Quora-like forum website where people raise and answer questions. Zhihu uses UserAgent to identify robots/scripts.

In [71]:
%%bash
#example WITHOUT user_agent 
cd ./web_crawler
cat ./web_crawler/spiders/zhihu.py
echo -e '\n\n'
scrapy crawl zhihu
echo -e '\n\n'
cat ./zhihu.html
cd ..

# -*- coding: utf-8 -*-
import scrapy


class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    #allowed_domains = ['www.zhihu.com']
    start_urls = ['http://www.zhihu.com']

    def parse(self, response):
        filename = 'zhihu.html'
        with open(filename, 'wb') as f:
            f.write(response.body)








2018-03-30 22:56:24 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: web_crawler)
2018-03-30 22:56:24 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.5.0, Python 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 18:10:19) - [GCC 7.2.0], pyOpenSSL 17.5.0 (OpenSSL 1.0.2o  27 Mar 2018), cryptography 2.1.4, Platform Linux-4.13.0-37-generic-x86_64-with-debian-stretch-sid
2018-03-30 22:56:24 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'web_crawler', 'DOWNLOAD_DELAY': 0.25, 'NEWSPIDER_MODULE': 'web_crawler.spiders', 'SPIDER_MODULES': ['web_crawler.spiders']}
2018-03-30 22:56:24 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2018-03-30 22:56:24 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpa

The log lines `'downloader/response_status_count/500': 3` shows that we have HTTP 500 Error (Internal Server Error) in scraping. Since there is no content downloaded, there is no `zhihu.html` file.

To use user agent, we can set the default headers in `settings.py`:
   
    DEFAULT_REQUEST_HEADERS = {
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:59.0) Gecko/',
        'Accept-Language': 'en',
    }

    

In [72]:
%%bash 
cd ./web_crawler
# write DEFAULT_REQUEST_HEADERS to settings.py
echo -e "DEFAULT_REQUEST_HEADERS = {\n'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:59.0) Gecko/',\n'Accept-Language': 'en',\n}" >> ./web_crawler/settings.py

#execute crawl command again!
scrapy crawl zhihu 
echo -e '\n\n'
cat ./zhihu.html
#scrapy crawl zhihu 
cd ..




<!doctype html>
<html lang="zh" data-hairline="true" data-theme="light"><head><meta charset="utf-8"/><title data-react-helmet="true">知乎 - 发现更大的世界</title><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1"/><meta name="renderer" content="webkit"/><meta name="force-rendering" content="webkit"/><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"/><meta name="google-site-verification" content="FTeR0c8arOPKh8c5DYh_9uu98_zJbaWw53J-Sch9MTg"/><link rel="shortcut icon" type="image/x-icon" href="https://static.zhihu.com/static/favicon.ico"/><link rel="dns-prefetch" href="//static.zhimg.com"/><link rel="dns-prefetch" href="//pic1.zhimg.com"/><link rel="dns-prefetch" href="//pic2.zhimg.com"/><link rel="dns-prefetch" href="//pic3.zhimg.com"/><link rel="dns-prefetch" href="//pic4.zhimg.com"/><link href="https://static.zhihu.com/heifetz/main.app.a427cd79f99bfefcc7d0.css" rel="stylesheet"/></head><body class="EntrySign-body"><div id="root"><div data-zop-user

2018-03-30 22:57:00 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: web_crawler)
2018-03-30 22:57:00 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.5.0, Python 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 18:10:19) - [GCC 7.2.0], pyOpenSSL 17.5.0 (OpenSSL 1.0.2o  27 Mar 2018), cryptography 2.1.4, Platform Linux-4.13.0-37-generic-x86_64-with-debian-stretch-sid
2018-03-30 22:57:00 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'web_crawler', 'DOWNLOAD_DELAY': 0.25, 'NEWSPIDER_MODULE': 'web_crawler.spiders', 'SPIDER_MODULES': ['web_crawler.spiders']}
2018-03-30 22:57:00 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2018-03-30 22:57:00 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpa

Now with user agent, we get the page content we want. LOL

## Reference
[1] https://en.wikipedia.org/wiki/Web_crawler  
[2] Typical Uses For Web Crawlers:https://blog.datafiniti.co/typical-uses-for-web-crawlers-c0860c5863ca  
[3] Scrapy 1.5 documentation:https://doc.scrapy.org/en/latest/index.html