Here, we will define three types of tools:

1. Web scraper (bs4)
2. Web crawler (general)
3. Web spider (scrapy)

In [1]:
import requests

In [2]:
url = 'http://www.ftchinese.com/channel/ce.html'

In [3]:
response = requests.get(url)

In [4]:
response.text

'<!DOCTYPE html>\n<html class="no-js core" data-next-app="front-page">\n<head>\n<meta charset="utf-8" />\n<meta http-equiv="X-UA-Compatible" content="IE=edge" />\n<title>双语阅读 - FT中文网</title>\n<meta http-equiv="Content-Language" content="zh-CN"/>\n<meta content="英国《金融时报》每日精选文章，中英文对照" name="description"/>\n<meta name="apple-mobile-web-app-status-bar-style" content="black" />\n<link rel="apple-touch-icon-precomposed" href="http://static.ftchinese.com/img/ipad_icon.png" />\n<meta name="viewport" content="width=device-width, initial-scale=1.0" />\n<link rel="preconnect" href="http://static.ftchinese.com" />\n<link rel="preconnect" href="http://i.ftimg.net" />\n<script type="text/javascript">\nwindow.errorBuffer = window.errorBuffer || [];\nfunction beaconCssError(e) {\nwindow.errorBuffer.push({\nerror: e ? e : new Error(\'CSS failed to load.\'),\ncontext: {\nisMobileNetork: document.cookie.replace(/(?:(?:^|.*;\\s*)h2_isMobile\\s*\\=\\s*([^;]*).*$)|^.*$/, "$1") === \'\' ? false : true\n}\n})

In [5]:
import urllib.request

In [8]:
request = urllib.request.Request(url)

In [10]:
response = urllib.request.urlopen(request)

In [11]:
html = response.read()

In [12]:
print(html)

b'<!DOCTYPE html>\n<html class="no-js core" data-next-app="front-page">\n<head>\n<meta charset="utf-8" />\n<meta http-equiv="X-UA-Compatible" content="IE=edge" />\n<title>\xe5\x8f\x8c\xe8\xaf\xad\xe9\x98\x85\xe8\xaf\xbb - FT\xe4\xb8\xad\xe6\x96\x87\xe7\xbd\x91</title>\n<meta http-equiv="Content-Language" content="zh-CN"/>\n<meta content="\xe8\x8b\xb1\xe5\x9b\xbd\xe3\x80\x8a\xe9\x87\x91\xe8\x9e\x8d\xe6\x97\xb6\xe6\x8a\xa5\xe3\x80\x8b\xe6\xaf\x8f\xe6\x97\xa5\xe7\xb2\xbe\xe9\x80\x89\xe6\x96\x87\xe7\xab\xa0\xef\xbc\x8c\xe4\xb8\xad\xe8\x8b\xb1\xe6\x96\x87\xe5\xaf\xb9\xe7\x85\xa7" name="description"/>\n<meta name="apple-mobile-web-app-status-bar-style" content="black" />\n<link rel="apple-touch-icon-precomposed" href="http://static.ftchinese.com/img/ipad_icon.png" />\n<meta name="viewport" content="width=device-width, initial-scale=1.0" />\n<link rel="preconnect" href="http://static.ftchinese.com" />\n<link rel="preconnect" href="http://i.ftimg.net" />\n<script type="text/javascript">\nwindo

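As we can see above, requests and urllib return the page in different forms: response.text from requests is an already decoded str, while response.read() from urllib returns raw bytes that still need to be decoded.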
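
The difference between the two outputs comes down to one decode step. As a small offline sketch, the bytes below are copied from the title tag in the urllib output above; decoding them as UTF-8 (the charset the page declares in its meta tag) gives the same title that requests showed.

```python
# Bytes copied from the <title> tag in the urllib output above
raw = b'\xe5\x8f\x8c\xe8\xaf\xad\xe9\x98\x85\xe8\xaf\xbb - FT\xe4\xb8\xad\xe6\x96\x87\xe7\xbd\x91'

# Decoding turns the raw bytes into a str, which is what requests' .text gives us
text = raw.decode('utf-8')

print(type(raw), type(text))
print(text)
```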

For the remaining parts, I'll use urllib.request to build the web scraper.


A basic download function looks like this:

In [13]:
def download(url): 
    return urllib.request.urlopen(url).read() 

To make it more robust, I'm adding some exception handling.

In [14]:
import urllib.request

from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url):
    print('Downloading:', url)
    try:
        html = urllib.request.urlopen(url).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
    return html
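
A side note on the except clause above: as far as I know, HTTPError and ContentTooShortError are both subclasses of URLError, so listing all three is harmless but catching URLError alone would cover them. A quick check, with no network needed:

```python
from urllib.error import URLError, HTTPError, ContentTooShortError

# Both of the more specific errors inherit from URLError
http_is_url_error = issubclass(HTTPError, URLError)
short_is_url_error = issubclass(ContentTooShortError, URLError)

print(http_is_url_error, short_is_url_error)
```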

Sometimes we may hit a server-side error such as 503. For these errors, we can retry the download after a short time. For client-side errors such as 404, we won't retry, since the same request would fail again.

Now, let's make our code capable of handling these scenarios.

In [15]:
def download(url, number_retries=2):
    print("Downloading:", url)
    try:
        html = urllib.request.urlopen(url).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if number_retries > 0:
            # retry only on 5xx server errors; 4xx client errors are final
            if hasattr(e, "code") and 500 <= e.code < 600:
                # recursive download process
                return download(url, number_retries - 1)
    return html

In [16]:
download('http://httpstat.us/500')

Downloading: http://httpstat.us/500
Download error: Internal Server Error
Downloading: http://httpstat.us/500
Download error: Internal Server Error
Downloading: http://httpstat.us/500
Download error: Internal Server Error
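
Note that the retries above fire immediately, one right after another. If we want to actually wait "a short time" between attempts, one option is to sleep and double the delay on each retry. The sketch below is only an illustration: retry_with_delay, the fetch callable (assumed to return a (status_code, body) pair) and fake_fetch are all hypothetical names, and the fake server stands in for a real one so the example runs offline.

```python
import time

def retry_with_delay(fetch, url, number_retries=2, delay=1.0, sleep=time.sleep):
    """Call fetch(url); on a 5xx status, wait and try again.

    fetch is a hypothetical callable returning (status_code, body).
    sleep is injectable so the demo below does not really wait.
    """
    status, body = fetch(url)
    while 500 <= status < 600 and number_retries > 0:
        sleep(delay)          # pause before the next attempt
        delay *= 2            # wait twice as long next time
        number_retries -= 1
        status, body = fetch(url)
    # anything below 400 counts as success here; everything else gives None
    return body if status < 400 else None

# Fake server: fails twice with 503, then answers 200
calls = []
def fake_fetch(url):
    calls.append(url)
    return (503, None) if len(calls) < 3 else (200, 'ok')

waits = []  # record the delays instead of sleeping
result = retry_with_delay(fake_fetch, 'http://example.com', sleep=waits.append)
print(result, waits)
```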


In [17]:
download(url)

Downloading: http://www.ftchinese.com/channel/ce.html


b'<!DOCTYPE html>\n<html class="no-js core" data-next-app="front-page">\n<head>\n<meta charset="utf-8" />\n<meta http-equiv="X-UA-Compatible" content="IE=edge" />\n<title>\xe5\x8f\x8c\xe8\xaf\xad\xe9\x98\x85\xe8\xaf\xbb - FT\xe4\xb8\xad\xe6\x96\x87\xe7\xbd\x91</title>\n<meta http-equiv="Content-Language" content="zh-CN"/>\n<meta content="\xe8\x8b\xb1\xe5\x9b\xbd\xe3\x80\x8a\xe9\x87\x91\xe8\x9e\x8d\xe6\x97\xb6\xe6\x8a\xa5\xe3\x80\x8b\xe6\xaf\x8f\xe6\x97\xa5\xe7\xb2\xbe\xe9\x80\x89\xe6\x96\x87\xe7\xab\xa0\xef\xbc\x8c\xe4\xb8\xad\xe8\x8b\xb1\xe6\x96\x87\xe5\xaf\xb9\xe7\x85\xa7" name="description"/>\n<meta name="apple-mobile-web-app-status-bar-style" content="black" />\n<link rel="apple-touch-icon-precomposed" href="http://static.ftchinese.com/img/ipad_icon.png" />\n<meta name="viewport" content="width=device-width, initial-scale=1.0" />\n<link rel="preconnect" href="http://static.ftchinese.com" />\n<link rel="preconnect" href="http://i.ftimg.net" />\n<script type="text/javascript">\nwindo

However, a website may block scrapers based on their user agent, so we need to change ours from the default "Python-urllib/3.x" to whatever we want. I'm using "naruto" as the new user agent. Let's update our function so that it builds a Request object carrying that header.

In [20]:
def download(url, user_agent='naruto', number_retries=2):
    print("Downloading:", url)
    request = urllib.request.Request(url)
    request.add_header("User-agent", user_agent)
    try:
        # pass the Request object (not the bare url) so the header is sent
        html = urllib.request.urlopen(request).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if number_retries > 0:
            if hasattr(e, "code") and 500 <= e.code < 600:
                # recursive download process; keep the same user agent
                return download(url, user_agent, number_retries - 1)
    return html


Voilà, now it works.
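
To check that the header is really attached, we can inspect the Request object before sending anything. This is a small offline sketch: it only builds the request, and it also reads the default user agent from a fresh opener for comparison.

```python
import urllib.request

request = urllib.request.Request('http://www.ftchinese.com/channel/ce.html')
request.add_header('User-agent', 'naruto')

# The header we set is stored on the Request object
custom_agent = request.get_header('User-agent')

# The default that urllib would send if we set nothing
default_agent = dict(urllib.request.build_opener().addheaders)['User-agent']

print(custom_agent)
print(default_agent)
```

The header only reaches the server if this Request object, not the bare url string, is passed to urlopen.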

In [21]:
# extra headers to send with the request, as a name -> value dict
header = {"xiaowenhao":"xiaowenhao1234@yeah.net"}
def download_with_new_header(url, headers=header, number_retries=2):
    print("Downloading:", url)
    # Request accepts a dict of headers directly
    request = urllib.request.Request(url, headers=headers)
    try:
        html = urllib.request.urlopen(request).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if number_retries > 0:
            if hasattr(e, "code") and 500 <= e.code < 600:
                # recursive download process; keep the same headers
                return download_with_new_header(url, headers, number_retries - 1)
    return html

Now, let's use the sitemap listed in the website's robots.txt to download all the web pages.

Normally robots.txt sits at the root of the website.
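
As a rough sketch of how the sitemap can be found, the standard library's urllib.robotparser can read a robots.txt and report its Sitemap entries. The robots.txt content and the example.com URLs below are made up for illustration, and site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site
sample_robots = """\
User-agent: *
Disallow: /private/
Sitemap: http://example.com/sitemap.xml
"""

parser = RobotFileParser()
# parse() takes the file as a list of lines, so no download is needed here
parser.parse(sample_robots.splitlines())

sitemaps = parser.site_maps()  # list of Sitemap URLs, or None if there are none
allowed = parser.can_fetch('naruto', 'http://example.com/index.html')
blocked = parser.can_fetch('naruto', 'http://example.com/private/page.html')

print(sitemaps, allowed, blocked)
```

For a live site, set_url() followed by read() would fetch the file instead of parse().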

In the previous example, we explained the CSS selector. Now let's focus on a simpler example.

In [25]:
import re

def download(url, user_agent='naruto', number_retries=2, charset='utf-8'):
    print("Downloading:", url)
    request = urllib.request.Request(url)
    request.add_header("User-agent", user_agent)
    try:
        resp = urllib.request.urlopen(request)
        # use the charset declared in the response headers; if the server
        # declares none, fall back to the charset argument (utf-8 by default)
        # and hope it fits
        # read more on pypi.python.org/pypi/chardet to implement a more robust decoder
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if number_retries > 0:
            if hasattr(e, "code") and 500 <= e.code < 600:
                # recursive download process; keep the same settings
                return download(url, user_agent, number_retries - 1, charset)
    return html

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    if sitemap is None:
        # the download failed, so there is nothing to crawl
        return
    # extract the sitemap links
    links = re.findall(r'<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here
        # ...

What we've done here is use the sitemap to reach the pages we want, but a sitemap may not list every page on the site.
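
To see the link-extraction step on its own, without touching the network, here is the same regular expression applied to a tiny sitemap. The sitemap text and its example.com URLs are made up for illustration.

```python
import re

# Hypothetical sitemap content in the usual <urlset>/<url>/<loc> layout
sample_sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/page1.html</loc></url>
  <url><loc>http://example.com/page2.html</loc></url>
</urlset>"""

# The non-greedy (.*?) stops at the first closing </loc> after each <loc>
links = re.findall(r'<loc>(.*?)</loc>', sample_sitemap)

for link in links:
    print(link)
```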

Next, we'll look at a simple example that uses another way of solving the problem.