<hr>
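# Lecture 5 - Introduction to Distributed Crawlers

1. Basic concepts
2. Python multithreading
3. Python multiprocessing
4. Introduction to cluster-based crawling

- Preface

> There are three kinds of bugs in a program: your bugs, my bugs, and multithreading

- A distributed crawler usually means several crawler workers pulling tasks from one central task queue (the master) and crawling them in parallel

- Crawlers are I/O-bound programs (network I/O and disk I/O); their bottleneck is mostly network and disk read/write speed
- Multithreading can speed up a crawler to some extent, but never beyond the outbound bandwidth or the disk write speed
- Python multithreading is constrained by the GIL and is hard to write correctly

- GIL stands for Global Interpreter Lock, a mutex that prevents multiple threads from executing Python bytecode at the same time
- ref: http://cenalulu.github.io/python/gil-in-python/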
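
The claim that threads help I/O-bound work despite the GIL can be checked with a small experiment. The sketch below is only an illustration: `fake_download` is a hypothetical stand-in that sleeps instead of touching the network, on the assumption that waiting on a socket releases the GIL in the same way `time.sleep` does.

```python
import threading
import time

def fake_download(delay):
    # Stand-in for a network request: sleeping releases the GIL,
    # much like waiting for a server response does.
    time.sleep(delay)

def run_sequential(n, delay):
    start = time.perf_counter()
    for _ in range(n):
        fake_download(delay)
    return time.perf_counter() - start

def run_threaded(n, delay):
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_download, args=(delay,))
               for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

sequential = run_sequential(5, 0.1)
threaded = run_threaded(5, 0.1)
print("sequential: %.2fs, threaded: %.2fs" % (sequential, threaded))
```

The sequential run waits for each delay in turn, while the threaded run overlaps the waits, so it should finish in roughly the time of a single delay.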

## 1. Basic concepts

#### Concurrency vs parallelism

![](./dataTm/work_pic/para.png)

#### Threads vs processes

![](./dataTm/work_pic/mult1.png)

![](./dataTm/work_pic/mult2.png)

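**Beginners are advised to use a framework**

- Queues

`queue.Queue` is the thread-safe FIFO queue in the Python standard library. It provides a first-in, first-out data structure suited to **multithreaded programming**, used to pass information between producer and consumer threads.

- Basic FIFO queue

class queue.Queue(maxsize=0)

FIFO means first in, first out. `Queue` is a basic FIFO container and is simple to use. `maxsize` is an integer giving the upper bound on the number of items the queue can hold. Once that bound is reached, insertion blocks until items are consumed. If `maxsize` is less than or equal to 0, the queue size is unlimited.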

In [1]:
import queue
# The queue.Queue class
q = queue.Queue()

for i in range(5):
    q.put(i)

while not q.empty():
    print(q.get())

0
1
2
3
4
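
The effect of `maxsize` can be observed without blocking the interpreter. This is a minimal sketch that uses `put_nowait()`, which raises `queue.Full` where a plain `put()` would block; the item values are arbitrary.

```python
import queue

q = queue.Queue(maxsize=2)   # at most two items may wait in the queue
q.put("a")
q.put("b")

try:
    # put() would block here; put_nowait() raises queue.Full instead
    q.put_nowait("c")
    overflowed = False
except queue.Full:
    overflowed = True

print(q.full(), overflowed)

first = q.get()              # consuming an item frees a slot
q.put_nowait("c")            # now succeeds
print(first, q.qsize())
```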


- LIFO queue

class queue.LifoQueue(maxsize=0)

LIFO means last in, first out, like a stack.

In [2]:
import queue
# The queue.LifoQueue class
q = queue.LifoQueue()

for i in range(5):
    q.put(i)

# empty() returns True if the queue is empty, False otherwise
while not q.empty():
    print(q.get())

4
3
2
1
0


- Priority queue

class queue.PriorityQueue(maxsize=0)

Constructs a priority queue, which returns the lowest-valued item first. `maxsize` works as above.

In [9]:
import queue
import threading

class Task(object):
    def __init__(self, priority, description):
        self.priority = priority
        self.description = description
        print('Task:', description)

    # PriorityQueue orders items with "<", so Task must define __lt__
    def __lt__(self, other):
        return self.priority < other.priority

q = queue.PriorityQueue()
# put() adds an item to the queue
q.put(Task(3, 'Level 3 task'))
q.put(Task(10, 'Level 10 task'))
q.put(Task(1, 'Level 1 task'))

def process_task(q):
    while True:
        # get() removes and returns an item from the queue
        next_task = q.get()
        print('For:', next_task.description)
        # Called by the consumer thread to signal that a queued task is done
        q.task_done()

workers = [threading.Thread(target=process_task, args=(q,)),
           threading.Thread(target=process_task, args=(q,))]

for w in workers:
    # Daemon threads do not keep the program alive once the main thread exits
    w.daemon = True
    w.start()

# Block the calling thread until every item in the queue has been processed
q.join()

Task: Level 3 task
Task: Level 10 task
Task: Level 1 task
For: Level 1 task
For: Level 3 task
For: Level 10 task
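
For a crawler, a lighter alternative to a custom class is to queue tuples. The sketch below is an assumption about how one might prioritise URLs, not part of the original notebook: the URLs are placeholders, and the counter is added so that entries with equal priority come out in insertion order.

```python
import queue

pq = queue.PriorityQueue()

# (priority, counter, url): tuples compare element by element, so the
# counter decides between entries that share the same priority.
entries = [
    (2, "http://example.com/list"),
    (1, "http://example.com/"),
    (2, "http://example.com/about"),
]
for counter, (priority, url) in enumerate(entries):
    pq.put((priority, counter, url))

order = []
while not pq.empty():
    priority, _, url = pq.get()
    order.append(url)

print(order)
```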


Using threading.Thread():

In [None]:
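import threading
from selenium import webdriver  # PhantomJS support requires Selenium < 4

# Approach 1: a single PhantomJS process shared by all threads
driver = webdriver.PhantomJS()
def test(url):
    driver.get(url)

url_list = ["http://www.baidu.com"] * 10
threads = [threading.Thread(target=test, args=(url,)) for url in url_list]
for t in threads:
    t.start()
# Wait for every thread before shutting the shared browser down
for t in threads:
    t.join()
driver.quit()

# Approach 2: each call starts its own PhantomJS process, 10 processes in total
def test(url):
    driver = webdriver.PhantomJS()
    driver.get(url)
    driver.quit()

url_list = ["http://www.baidu.com"] * 10
for url in url_list:
    threading.Thread(target=test, args=(url,)).start()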

## 2. Python multithreading

In [3]:
# Without a mutex, a thread switch between the read and the write below
# could scramble the shared counter; the lock prevents that.
import threading

global_value = 0

def run(threadName, lock):
    global global_value
    # Acquire the lock; it is released automatically when the block exits
    with lock:
        local_copy = global_value
        print("%s with value %s" % (threadName, local_copy))
        global_value = local_copy + 1

lock = threading.Lock()

for i in range(10):
    t = threading.Thread(target=run, args=("Thread-" + str(i), lock))
    t.start()

Thread-0 with value 0
Thread-1 with value 1Thread-2 with value 2Thread-3 with value 3

Thread-4 with value 4Thread-5 with value 5Thread-6 with value 6Thread-7 with value 7Thread-8 with value 8
Thread-9 with value 9
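
The same locking pattern scales to many updates per thread. The following is a minimal sketch with an arbitrary thread count and iteration count; it joins every thread before reading the counter so the final value is well defined.

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        # The with-statement acquires the lock and always releases it,
        # even if the body raises an exception.
        with lock:
            counter += 1

threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```

Because each increment happens while the lock is held, the total equals the number of threads times the iterations per thread.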







In [6]:
import threading
import time

class MyThread(threading.Thread):

    def __init__(self, count):
        threading.Thread.__init__(self)
        self.total = count

    def run(self):

        for i in range(self.total):
            time.sleep(1)
            print("Thread: %s - %s" % (self.name, i))

t = MyThread(4)
t2 = MyThread(6)

t.start()
t.join()
t2.start()
# What if there were file operations here?
#t.join()
t2.join()

print("This program has finished")


Thread: Thread-20 - 0
Thread: Thread-20 - 1
Thread: Thread-20 - 2
Thread: Thread-20 - 3
Thread: Thread-21 - 0
Thread: Thread-21 - 1
Thread: Thread-21 - 2
Thread: Thread-21 - 3
Thread: Thread-21 - 4
Thread: Thread-21 - 5This program has finished



### Traversing a list of websites with threads - 1 (sequential baseline)

In [2]:
import urllib

In [7]:
#%%timeit
import urllib
import time
from urllib.request import urlopen

sites = [
        "http://www.fudan.edu.cn",
        "http://www.douban.com",
       "http://zimp.zju.edu.cn/",
       "http://www.pku.edu.cn/",
    "http://www.tsinghua.edu.cn",
    "http://www.ruc.edu.cn/"
        ]

def check_http_status(url):
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}

    request = urllib.request.Request(url = url, headers=headers)
    response = urllib.request.urlopen(request)
    
    return response.getcode()

http_status = {}

start = time.time()

for url in sites:
    http_status[url] = check_http_status(url)

    
for url in http_status:
    print("%s: %s" % (url, http_status[url]))
end = time.time()
print(end - start)


http://www.fudan.edu.cn: 200
http://zimp.zju.edu.cn/: 200
http://www.ruc.edu.cn/: 200
http://www.pku.edu.cn/: 200
http://www.douban.com: 200
http://www.tsinghua.edu.cn: 200
2.5551462173461914


### Traversing a list of websites with threads - 2

In [1]:
from urllib.request import urlopen
import urllib
import threading
import time


sites = [
        "http://www.fudan.edu.cn",
        "http://www.douban.com",
       "http://zimp.zju.edu.cn/",
       "http://www.pku.edu.cn/",
    "http://www.tsinghua.edu.cn",
    "http://www.ruc.edu.cn/"
        ]


class HTTPStatusChecker(threading.Thread):
    def __init__(self, url):
        threading.Thread.__init__(self)
        self.url = url
        self.status = None

    def getURL(self):
        return self.url

    def getStatus(self):
        return self.status

    def run(self):
        headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
        # Use this thread's own URL, not the global loop variable `url`
        request = urllib.request.Request(url=self.url, headers=headers)
        response = urllib.request.urlopen(request)
        self.status = response.getcode()


threads = []
start = time.time()
for url in sites:
    t = HTTPStatusChecker(url)
    t.start()  # start the thread
    threads.append(t)


# The main thread blocks until all the other threads finish
for t in threads:
    t.join()

for  t in threads:
    print("%s: %s" % (t.url, t.status))
end = time.time()

print(end - start)
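
The start/join bookkeeping above can also be delegated to the standard library's `concurrent.futures.ThreadPoolExecutor`. This is a sketch under stated assumptions: `check_status` is a hypothetical stand-in that sleeps and always reports 200 rather than making a real request, and the URLs are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def check_status(url):
    # Hypothetical stand-in for a real HTTP check: pretend each
    # request takes 0.1 s and always succeeds.
    time.sleep(0.1)
    return url, 200

urls = ["http://example.com/%d" % i for i in range(6)]

# The with-block waits for all submitted work before exiting
with ThreadPoolExecutor(max_workers=3) as pool:
    # map() yields results in the same order as the input URLs
    results = list(pool.map(check_status, urls))

for url, status in results:
    print("%s: %s" % (url, status))
```

Each worker returns its result instead of writing to shared state, which avoids the need for explicit locking in this simple case.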

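### Thread safety: using a queue

- Synchronization and mutual exclusion between threads, and sharing data between threads (all involve thread safety)
- Synchronization and mutual-exclusion mechanisms:
  - mutex
  - condition
  - event

- Deadlock and thread safety

- Use a queue (to keep threads safe)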

In [10]:
import os
import queue
import threading
from urllib.request import urlopen
import urllib
import time

class DownloadThread(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            url = self.queue.get()
            print(self.name + "begin download"+url+"...")
            self.download_file(url)
            self.queue.task_done()
            print(self.name + "download completed!")
    def download_file(self, url):
        
        
        headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
        request = urllib.request.Request(url = url, headers=headers)
        response = urllib.request.urlopen(request)
        
        fname = os.path.basename(url) + ".html"
        with open(fname, "wb") as f:
            while True:
                chunk = response.read(1024)
                if not chunk:
                    break
                f.write(chunk)

if __name__ == '__main__':
    sites = ["http://wiki.python.org/moin/WebProgramming",
            "http://wiki.python.org/moin/Documentation",
            "https://wiki.python.org/moin/WebFrameworks",
            "https://wiki.python.org/moin/WebApplications",
            "https://wiki.python.org/moin/BeginnersGuide",
            "https://wiki.python.org/moin/BeginnersGuide/Overview",
            "https://book.douban.com/subject/25862578",
            "https://book.douban.com/subject/26698660",
            "https://book.douban.com/subject/26957760",
            "https://book.douban.com/subject/6082808",
            "https://book.douban.com/subject/26878124"
            ]
    
    urls = [
        "http://www.fudan.edu.cn",
        "http://www.douban.com",
       "http://zimp.zju.edu.cn",
       "http://www.pku.edu.cn",
    "http://www.tsinghua.edu.cn",
    "http://www.ruc.edu.cn"
        ]
    
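    start = time.time()
    url_queue = queue.Queue()  # avoid rebinding the name of the queue module

    # Build a pool of threads that share one queue
    for i in range(5):
        t = DownloadThread(url_queue)  # start 5 threads
        # daemon=True lets the main thread exit without waiting for this thread;
        # the default is False, in which case the program only exits once all
        # non-daemon threads have finished.
        # The low-level thread module (_thread in Python 3) has no daemon threads.
        t.daemon = True
        t.start()

    for url in urls:
        url_queue.put(url)

    # join() is left commented out, so the elapsed time printed below only
    # measures enqueueing; uncomment it to wait for the downloads to finish.
    #url_queue.join()
    end = time.time()
    print(end - start)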

0.0009999275207519531Thread-35begin downloadhttp://www.fudan.edu.cn...Thread-36begin downloadhttp://www.pku.edu.cn...Thread-34begin downloadhttp://www.tsinghua.edu.cn...Thread-38begin downloadhttp://www.douban.com...Thread-37begin downloadhttp://zimp.zju.edu.cn...





Thread-37download completed!
Thread-37begin downloadhttp://www.ruc.edu.cn...
Thread-34download completed!
Thread-36download completed!
Thread-35download completed!
Thread-38download completed!
Thread-37download completed!
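
The elapsed time printed first in the output above is tiny because the main thread never waits for the queue. A variant that does wait, and then stops its workers cleanly, might look like the sketch below. The sleep is a placeholder for the real download, and the `None` sentinel is one common convention rather than anything the original code uses.

```python
import queue
import threading
import time

def worker(tasks, results):
    while True:
        url = tasks.get()
        try:
            if url is None:          # sentinel: no more work for this thread
                return
            time.sleep(0.05)         # placeholder for the real download
            results.put(url)
        finally:
            tasks.task_done()        # always balance the get()

tasks = queue.Queue()
results = queue.Queue()
workers = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(3)]
for w in workers:
    w.start()

start = time.perf_counter()
for i in range(6):
    tasks.put("http://example.com/page%d" % i)
tasks.join()                         # blocks until every URL is processed
elapsed = time.perf_counter() - start

for _ in workers:                    # one sentinel per worker
    tasks.put(None)
for w in workers:
    w.join()

print(results.qsize(), round(elapsed, 2))
```

With sentinels the threads do not need to be daemons, so nothing is cut off mid-download when the program exits.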


-----------

### Thread safety: using a thread-pool module

In [9]:
#pip install threadpool
# cannot be installed with conda

In [2]:
import threadpool

In [3]:
import urllib
from urllib.request import urlopen
import os
import time
import threadpool

def download_file(url):
    print("Starting download", url)

    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
    request = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(request)

    fname = os.path.basename(url) + ".html"
    with open(fname, "wb") as f:
        while True:
            chunk = response.read(1024)
            if not chunk:
                break
            f.write(chunk)

urls = [
        "http://wiki.python.org/moin/WebProgramming",
        "http://wiki.python.org/moin/Documentation"
        ]

pool_size = 2
pool = threadpool.ThreadPool(pool_size)

# Create the work requests
requests = threadpool.makeRequests(download_file, urls)
# Put the work requests into the queue
for req in requests:
    pool.putRequest(req)

print("Putting requests into the thread pool")
pool.putRequest(threadpool.WorkRequest(download_file, args=["http://zimp.zju.edu.cn"]))
pool.putRequest(threadpool.WorkRequest(download_file, args=["http://www.pku.edu.cn"]))

# Handle new requests in the queue
pool.poll()

# Block until all results have arrived
pool.wait()
print("Dismissing all threads before exit")
# Tell pool_size worker threads to exit after finishing their current task
pool.dismissWorkers(pool_size, do_join=True)

Putting requests into the thread poolStarting download
 http://wiki.python.org/moin/WebProgramming
Starting download http://wiki.python.org/moin/Documentation
Starting download http://zimp.zju.edu.cn
Starting download http://www.pku.edu.cn
Dismissing all threads before exit


-----------

- Crawler multithreading optimization example (I)

In [4]:
from bs4 import BeautifulSoup
import requests
import json
import time


SO_URL = "http://scifi.stackexchange.com"
QUESTION_LIST_URL = SO_URL + "/questions"
MAX_PAGE_COUNT = 2

global_results = []
# Initial page: the first page
initial_page = 1

def get_author_name(body):
    link_name = body.select(".user-details a")
    if len(link_name) == 0:
        text_name = body.select(".user-details")
        return text_name[0].text if len(text_name) > 0 else 'N/A'
    else:
        return link_name[0].text

def get_question_answers(body):
    answers = body.select(".answer")
    a_data = []
    if len(answers) == 0:
        return a_data

    for a in answers:
        data = {
            'body': a.select(".post-text")[0].get_text(),
            'author': get_author_name(a)
        }
        a_data.append(data)
    return a_data

def get_question_data ( url ): 
    print("Getting data from question page: %s " % (url))
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'}
    resp = requests.get(url, headers=headers)
    if resp.status_code != 200:
        print("Error while trying to scrape url: %s" % (url))
        return
    body_soup = BeautifulSoup(resp.text, 'lxml')
    # Build the output as a JSON-serializable dict
    q_data = {
        'title': body_soup.select('#question-header .question-hyperlink')[0].text,
        'body': body_soup.select('#question .post-text')[0].get_text(),
        'author': get_author_name(body_soup.select(".post-signature.owner")[0]),
        'answers': get_question_answers(body_soup)
    }
    return q_data


def get_questions_page ( page_num, partial_results ):
    print("=====================================================")
    print(" Getting list of questions for page %s" % (page_num))
    print("=====================================================")

    url = QUESTION_LIST_URL + "?sort=newest&page=" + str(page_num)
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'}
    resp = 	requests.get(url, headers=headers)
    if resp.status_code != 200:
        print("Error while trying to scrape url: %s" % (url))
        return
    body = resp.text
    main_soup = BeautifulSoup(body, 'lxml')

    # Get the URL of each question
    questions = main_soup.select('.question-summary .question-hyperlink')
    urls = [ SO_URL + x['href'] for x in questions]
    for url in urls:
        q_data = get_question_data(url)
        partial_results.append(q_data)
    if page_num < MAX_PAGE_COUNT:
        get_questions_page(page_num + 1, partial_results)


get_questions_page(initial_page, global_results)
with open('scrapping-results.json', 'w') as outfile:
    json.dump(global_results, outfile, indent=4)

print('----------------------------------------------------')
print('Results saved')


![](./dataTm/work_pic/io1.png)

![](./dataTm/work_pic/io2.png)

- Crawler multithreading optimization example (II)

In [15]:
from bs4 import BeautifulSoup
import requests
import json
import threading
import time


SO_URL = "http://scifi.stackexchange.com"
QUESTION_LIST_URL = SO_URL + "/questions"
MAX_PAGE_COUNT = 2


class ThreadManager:
    instance = None
    final_results = []
    threads_done = 0
    # Number of threads running in parallel
    totalConnections = 2
    # Protects the shared counter and result list across threads
    lock = threading.Lock()

    @staticmethod
    def notify_connection_end( partial_results ):
        print("==== Thread is done! =====")
        with ThreadManager.lock:
            ThreadManager.threads_done += 1
            ThreadManager.final_results += partial_results
            if ThreadManager.threads_done == ThreadManager.totalConnections:
                print("==== Saving data to file! ====")
                with open('scrapping-results-optimized.json', 'w') as outfile:
                    json.dump(ThreadManager.final_results, outfile, indent=4)


def get_author_name(body):
    link_name = body.select(".user-details a")
    if len(link_name) == 0:
        text_name = body.select(".user-details")
        return text_name[0].text if len(text_name) > 0 else 'N/A'
    else:
        return link_name[0].text

def get_question_answers(body):
    answers = body.select(".answer")
    a_data = []
    if len(answers) == 0:
        return a_data

    for a in answers:
        data = {
            'body': a.select(".post-text")[0].get_text(),
            'author': get_author_name(a)
        }
        a_data.append(data)
    # Return only after every answer has been collected
    return a_data



def get_question_data ( url ):
    print("Getting data from question page: %s " % (url))
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'}
    
    resp = requests.get(url, headers=headers)
    if resp.status_code != 200:
        print("Error while trying to scrape url: %s" % (url))
        return
    
    body_soup = BeautifulSoup(resp.text, 'lxml')
    # Build a JSON-serializable dict
    q_data = {
        'title': body_soup.select('#question-header .question-hyperlink')[0].text,
        'body': body_soup.select('#question .post-text')[0].get_text(),
        'author': get_author_name(body_soup.select(".post-signature.owner")[0]),
        'answers': get_question_answers(body_soup)
    }
    return q_data


def get_questions_page ( page_num, end_page, partial_results  ):
    print("=====================================================")
    print(" Getting list of questions for page %s" % (page_num))
    print("=====================================================")

    url = QUESTION_LIST_URL + "?sort=newest&page=" + str(page_num)
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'}
    
    resp = 	requests.get(url, headers=headers)
    if resp.status_code != 200:
        print("Error while trying to scrape url: %s" % (url))
    else:
        body = resp.text
        main_soup = BeautifulSoup(body, 'lxml')

        # Get the URL of each question
        questions = main_soup.select('.question-summary .question-hyperlink')
        urls = [ SO_URL + x['href'] for x in questions]
        for url in urls:
            q_data = get_question_data(url)
            partial_results.append(q_data)
    if page_num + 1 < end_page:
        get_questions_page(page_num + 1,  end_page, partial_results)
    else:
        ThreadManager.notify_connection_end(partial_results)

# Integer division keeps page numbers whole (in Python 3, "/" returns a float)
pages_per_connection = MAX_PAGE_COUNT // ThreadManager.totalConnections


for i in range(ThreadManager.totalConnections):
    # Pages are numbered from 1; each thread covers [init_page, end_page)
    init_page = 1 + i * pages_per_connection
    end_page = init_page + pages_per_connection
    t = threading.Thread(target=get_questions_page, args=(init_page, end_page, []), name='connection-%s' % (i))
    t.start()




Getting data from question page: http://scifi.stackexchange.com/questions/157909/trying-to-identify-an-old-series-of-scifi-adventure-puzzle-picture-books 
Getting data from question page: http://scifi.stackexchange.com/questions/157912/what-are-the-tv-movie-discworld-adaptations 
Getting data from question page: http://scifi.stackexchange.com/questions/157922/how-many-pok%c3%a9mons-can-a-pok%c3%a9-ball-keep-at-a-time 
Getting data from question page: http://scifi.stackexchange.com/questions/157900/apparent-earth-centricity-of-the-federation 
Getting data from question page: http://scifi.stackexchange.com/questions/157909/trying-to-identify-an-old-series-of-scifi-adventure-puzzle-picture-books 
Getting data from question page: http://scifi.stackexchange.com/questions/157922/how-many-pok%c3%a9mons-can-a-pok%c3%a9-ball-keep-at-a-time 
Getting data from question page: http://scifi.stackexchange.com/questions/157912/what-are-the-tv-movie-discworld-adaptations 
Getting data from question p
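
Dividing pages among threads is easy to get wrong when the page count is not a multiple of the thread count. The helper below is a hypothetical sketch (the name `split_pages` is not from the original code) that hands out contiguous, non-overlapping ranges, with 1-based page numbers and an exclusive end.

```python
def split_pages(first_page, last_page, n_threads):
    """Split the inclusive range first_page..last_page into n_threads
    contiguous (start, end) chunks, where end is exclusive."""
    total = last_page - first_page + 1
    base, extra = divmod(total, n_threads)
    chunks = []
    start = first_page
    for i in range(n_threads):
        # The first `extra` chunks take one additional page each
        size = base + (1 if i < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks

print(split_pages(1, 2, 2))
print(split_pages(1, 10, 3))
```

Each `(start, end)` pair can be passed to a thread as its page range, so no page is fetched twice and none is skipped.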

------------

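## 3. Python multiprocessing

### Threads run into the GIL, so use multiple processes

Use threads for:
1. Frequent I/O operations
2. Parallel tasks that can be handled through concurrency
3. GUI development

Avoid threads for:
1. Frequent CPU-bound operations
2. Taking advantage of multiple cores

--------

- Advantages of multiprocessing:

  - Can use multiple cores
  - Each process has its own memory space, which avoids race conditions
  - Sidesteps the GIL
  
- Disadvantages of multiprocessing:

  - Higher memory consumption
  - Sharing data between processes is difficult
  - Inter-process communication is harder to handle than communication between threads

- The examples below do not run under Jupyter
- Create a .py source file and run it from a terminal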

1 A simple process

In [25]:
import multiprocessing

def run(pname):
    print(pname)
    
if __name__ == '__main__':
    for i in range(10):
        p = multiprocessing.Process(target=run, args=("Process-" + str(i), ))
        p.start()
        p.join()

2 Inter-process communication: queue, pipe

In [None]:
# Two processes communicating through a Queue

from multiprocessing import Queue, Process
import random

def generate(q):
    while True:
        value = random.randrange(10)
        q.put(value)
        print("Value added to queue: %s" % (value))

def reader(q):
    while True:
        value = q.get()
        print("Value from queue: %s" % (value))

if __name__ == '__main__':
    queue = Queue()
    p1 = Process(target=generate, args=(queue,))
    p2 = Process(target=reader, args=(queue,))
    p1.start()
    p2.start()



-----------------

In [None]:
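# Two processes communicating through a Pipe

from multiprocessing import Pipe, Process
import random

def generate(pipe):
    while True:
        value = random.randrange(10)
        pipe.send(value)
        print("Value sent: %s" % (value))

def reader(pipe):
    # The with-block closes the file if the loop ever stops;
    # flush() pushes each value out to the file right away
    with open("output.txt", "w") as f:
        while True:
            value = pipe.recv()
            f.write(str(value))
            f.flush()
            print("... ...")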

if __name__ == '__main__':
    input_p, output_p = Pipe()
    p1 = Process(target=generate, args=(input_p,))
    p2 = Process(target=reader, args=(output_p,))
    p1.start()
    p2.start()
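
Both examples above loop forever. A bounded variant that shuts down cleanly might use a sentinel value, as in the sketch below. The `None` sentinel and the second queue for the result are assumptions made for the illustration, and the script is meant to be run as a .py file.

```python
from multiprocessing import Process, Queue

def generate(q, n):
    for value in range(n):
        q.put(value)
    q.put(None)                # sentinel: tells the reader to stop

def reader(q, out):
    total = 0
    while True:
        value = q.get()
        if value is None:      # producer is finished
            break
        total += value
    out.put(total)             # send the result back to the parent

if __name__ == '__main__':
    q, out = Queue(), Queue()
    p1 = Process(target=generate, args=(q, 5))
    p2 = Process(target=reader, args=(q, out))
    p1.start()
    p2.start()
    result = out.get()         # read the result before joining
    p1.join()
    p2.join()
    print(result)
```

Reading from `out` before calling `join()` avoids waiting on a child process that is still trying to flush its queue.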



- Scraping comics with the requests library (single-threaded)

In [16]:
### Comic-scraping example using the requests library
import requests
import os
from bs4 import BeautifulSoup

url = 'http://xkcd.com'
os.makedirs('./dataTm/xkcd', exist_ok = True)
while not url.endswith('#'):
    # Download the page
    print('Downloading page %s' % url)
    res = requests.get(url)
    # Raise an exception and stop the program if the download failed
    res.raise_for_status()

    soup = BeautifulSoup(res.text, "html.parser")
    # Find the URL of the comic image
    comicElement = soup.select('#comic img')
    if comicElement == []:
        print('Could not find the image.')
    else:
        comicURL = 'http:' + comicElement[0].get('src')
        # Download the comic
        print('Downloading comic %s' % (comicURL))
        res = requests.get(comicURL)
        res.raise_for_status()

        # Save the image to the xkcd folder; this sits inside the else branch
        # because comicURL only exists when an image was found
        with open(os.path.join('./dataTm/xkcd', os.path.basename(comicURL)), 'wb') as imageFile:
            # Unlike the earlier file-saving code, write the response in chunks
            for chunk in res.iter_content(10000):
                imageFile.write(chunk)

    # After downloading, find the URL of the previous comic
    PrevLink = soup.select('a[rel="prev"]')[0]
    url = 'http://xkcd.com' + PrevLink.get('href')

Downloading page http://xkcd.com
Downloading comic http://imgs.xkcd.com/comics/survivorship_bias.png
Downloading page http://xkcd.com/1826/
Downloading comic http://imgs.xkcd.com/comics/birdwatching.png
Downloading page http://xkcd.com/1825/
Downloading comic http://imgs.xkcd.com/comics/7_eleven.png
Downloading page http://xkcd.com/1824/
Downloading comic http://imgs.xkcd.com/comics/identification_chart.png
Downloading page http://xkcd.com/1823/
Downloading comic http://imgs.xkcd.com/comics/hottest_editors.png
Downloading page http://xkcd.com/1822/
Downloading comic http://imgs.xkcd.com/comics/existential_bug_reports.png
Downloading page http://xkcd.com/1821/
Downloading comic http://imgs.xkcd.com/comics/incinerator.png
Downloading page http://xkcd.com/1820/
Downloading comic http://imgs.xkcd.com/comics/security_advice.png
Downloading page http://xkcd.com/1819/
Downloading comic http://imgs.xkcd.com/comics/sweet_16.png
Downloading page http://xkcd.com/1818/


KeyboardInterrupt: 

Tip:

The run above was interrupted manually.

- Scraping comics with the requests library (multithreaded)

In [17]:
### Comic-scraping example using the requests library
import requests
import os
from bs4 import BeautifulSoup
import threading

os.makedirs('xkcd', exist_ok = True)

def downloadXkcd(startComic, endComic):
    # Download comics numbered startComic .. endComic - 1
    for urlNumber in range(startComic, endComic):
        # Download the page
        print('Downloading page http://xkcd.com/%s' % (urlNumber))
        res = requests.get('http://xkcd.com/%s' % (urlNumber))
        res.raise_for_status()

        soup = BeautifulSoup(res.text, 'lxml')

        # Find the URL of the comic image
        comicElem = soup.select('#comic img')
        if comicElem == []:
            print('Could not find the comic image')
        else:
            comicUrl = comicElem[0].get('src')
            comicUrl = 'http:'+comicUrl
            # Download the image
            print('Downloading image %s' % (comicUrl))
            res = requests.get(comicUrl)
            res.raise_for_status()

            # Save the image
            with open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb') as imageFile:
                for chunk in res.iter_content(100000):
                    imageFile.write(chunk)

# Create and start the threads
# A list of Thread objects
downloadThreads = []
for i in range(1, 100, 10):
    # range() excludes its end, so pass i + 10 to cover comics i .. i + 9
    downloadThread = threading.Thread(target = downloadXkcd, args=(i, i+10))
    downloadThreads.append(downloadThread)
    downloadThread.start()



# Wait for all threads to finish
for downloadThread in downloadThreads:
    downloadThread.join()
print('Done')

Downloading page http://xkcd.com/1
Downloading page http://xkcd.com/11
Downloading page http://xkcd.com/21
Downloading page http://xkcd.com/31
Downloading page http://xkcd.com/41
Downloading page http://xkcd.com/51
Downloading page http://xkcd.com/71
Downloading page http://xkcd.com/61Downloading page http://xkcd.com/81

Downloading page http://xkcd.com/91
Downloading image http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg
Downloading image http://imgs.xkcd.com/comics/barrel_mommies.jpg
Downloading image http://imgs.xkcd.com/comics/kepler.jpg
Downloading image http://imgs.xkcd.com/comics/pwned.png
Downloading image http://imgs.xkcd.com/comics/attention_shopper.jpg
Downloading image http://imgs.xkcd.com/comics/in_the_trees.jpg
Downloading page http://xkcd.com/2
Downloading image http://imgs.xkcd.com/comics/malaria.jpg
Downloading image http://imgs.xkcd.com/comics/staceys_dad.jpg
Downloading image http://imgs.xkcd.com/comics/barrel_part_5.jpg
Downloading page http://xkcd.com/12
Downloading page http://xkcd.com/92
Downloading image http://imgs.xkcd.com/comics/unspeakable_pun.jpg
Downloading image http://imgs.xkcd.com/comics/tree_cropped_(1).jpg
Downloading page http://xkcd.com/22
Downloading page http://xkcd.com/52
Downloading page http://xkcd.com/72
Downloading page http://xkcd.com/82
Downloading image http://imgs.xkcd.com/comics/poisson.j

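- Notes on concurrency problems in multithreading:

- Creating several threads and running them at the same time is easy
- But threads that read and write at the same time can interfere with one another, causing concurrency bugs that are hard to debug
- Avoid having multiple threads read or write the same variables: when you create a new Thread object, make sure its target function uses only local variables of that function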
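
One way to follow the last note while still getting results back is to give each thread its own slot in a pre-sized list. The sketch below uses made-up data; `word_lengths` is a hypothetical target function that keeps all intermediate values in local variables.

```python
import threading

def word_lengths(words, results, index):
    # Everything computed here lives in local variables; the only shared
    # object is `results`, and each thread writes to its own slot.
    lengths = [len(w) for w in words]
    results[index] = sum(lengths)

chunks = [["a", "bb"], ["ccc"], ["dddd", "eeeee"]]
results = [None] * len(chunks)

threads = [threading.Thread(target=word_lengths, args=(chunk, results, i))
           for i, chunk in enumerate(chunks)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)
```

Since no two threads touch the same slot, no lock is needed, and the main thread reads the list only after every `join()` has returned.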

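## 4. Introduction to cluster-based crawling

- One host stores the queue; the other machines do the fetching
- Every machine in the cluster can make full, efficient use of the queue
- MongoDB or Redis serves as the queue

A basic crawler: maintain a queue

1. Set up a queue (a `queue`, or a database such as MongoDB or Redis)
2. Put the start page into the queue
3. Take a URL from the queue and crawl it
4. Extract all page URLs found on that page and put them into the queue
5. Take the next URL from the queue and keep crawling
6. Repeat the process above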
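
The queue-driven loop in the steps above can be sketched without any network access. In this illustration `LINKS` is a hypothetical in-memory link graph standing in for real pages, and the `seen` set is an addition that keeps the same URL from being queued twice.

```python
from collections import deque

# Hypothetical link graph: URL -> URLs found on that page
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}

def crawl(start):
    pending = deque([start])   # steps 1-2: the queue, seeded with the start page
    seen = {start}             # URLs that have already been queued
    visited = []
    while pending:             # step 6: repeat until the queue is empty
        url = pending.popleft()            # step 3: take a URL
        visited.append(url)                # "crawl" it
        for link in LINKS.get(url, []):    # step 4: queue newly found URLs
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return visited

print(crawl("/"))
```

Using a FIFO queue makes this a breadth-first traversal; swapping in a priority queue would change the visiting order but not the overall structure.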

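Clustering approach:
1. One machine maintains the queue
2. The other machines fetch URLs from that machine over the network and crawl them

- Repeat the process above
- The more machines, the faster
- Open question: is the purpose of a multi-machine cluster speed, or getting around anti-crawler measures?

Writing it for a cluster:

1. Set up a queue on the master machine (a `queue`, or a database such as MongoDB or Redis)
2. The worker machines (the distributed cluster) fetch URLs from the master's queue over the network and crawl them
3. Extract all page URLs found on each page and put them into the queue
4. Take the next URL from the queue and keep crawling
5. Repeat the process above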
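
The master/worker protocol can be simulated inside one process before moving to real machines. In the sketch below, threads stand in for worker machines and a `queue.Queue` stands in for the networked queue; the `Master` class, `LINKS` graph, and `None` sentinel are all hypothetical names chosen for the illustration, not part of any framework.

```python
import queue
import threading

# Hypothetical site: URL -> URLs found on that page
LINKS = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c"], "/c": []}

class Master:
    """Holds the shared task queue and the set of URLs already queued."""
    def __init__(self, seed):
        self.tasks = queue.Queue()
        self.seen = {seed}
        self.lock = threading.Lock()
        self.tasks.put(seed)

    def add_urls(self, urls):
        with self.lock:                      # check-and-add must be atomic
            for url in urls:
                if url not in self.seen:
                    self.seen.add(url)
                    self.tasks.put(url)

def worker(master, crawled):
    while True:
        url = master.tasks.get()
        if url is None:                      # sentinel from the master: stop
            master.tasks.task_done()
            return
        crawled.put(url)                     # placeholder for fetching the page
        master.add_urls(LINKS.get(url, []))  # report discovered URLs back
        master.tasks.task_done()

master = Master("/")
crawled = queue.Queue()
workers = [threading.Thread(target=worker, args=(master, crawled))
           for _ in range(3)]
for w in workers:
    w.start()
master.tasks.join()                          # every queued URL was processed
for _ in workers:
    master.tasks.put(None)
for w in workers:
    w.join()

result = []
while not crawled.empty():
    result.append(crawled.get())
print(sorted(result))
```

New URLs are queued before the parent task is marked done, so `join()` cannot return while work is still being discovered. In a real cluster the queue and the seen-set would live in a shared store such as Redis.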

- Many issues are involved, so a mature framework or library is recommended

For example: https://scrapy-redis.readthedocs.io/en/stable/
See the chapter on Scrapy