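## Python Coroutines
* A coroutine is a special kind of function that can pause during execution and resume later.
* Coroutines are a core building block of asynchronous programming and are especially well suited to I/O-bound work such as network requests and file operations.
* Python 3.5 introduced the async and await keywords, which make coroutines more direct and concise to write.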

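### Coroutine functions
* A function defined with async def is called a coroutine function.
* Calling a coroutine function does not run its body; it returns a coroutine object, which runs only when it is awaited or scheduled on the event loop.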
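To make the distinction between a coroutine function and a coroutine object concrete, here is a minimal sketch. The names `add` and `describe` are hypothetical and exist only for this illustration.

```python
import asyncio
import inspect


async def add(a, b):
    # A trivial coroutine function: yields to the event loop once, then returns.
    await asyncio.sleep(0)
    return a + b


def describe():
    coro = add(1, 2)  # creates a coroutine object; no body code has run yet
    facts = (
        inspect.iscoroutinefunction(add),  # add is a coroutine function
        inspect.iscoroutine(coro),         # coro is a coroutine object
    )
    result = asyncio.run(coro)  # the event loop actually executes the body
    return facts, result


if __name__ == "__main__":
    # In a Jupyter cell a loop is already running, so asyncio.run() would fail;
    # there one would write `await add(1, 2)` instead.
    print(describe())
```

The body of `add` runs only once the coroutine object is handed to the event loop.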
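### Coroutine objects
* The coroutine object returned by an async def function exposes send(), throw(), and close(); unlike a generator, it does not support next().
* Calling send(None) starts or resumes the coroutine, and its return value arrives as the value of the StopIteration raised when it finishes. Application code rarely does this by hand; the event loop does it.
* Inside a native coroutine, await is the suspension point. yield and yield from belong to the older generator-based coroutines, where yield from delegated to a nested generator; await is its modern counterpart.
* In practice, a coroutine object is started and managed through await or the event loop.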
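As an illustration of what the event loop does on your behalf, the sketch below drives a coroutine object by hand with send(None). The helper `drive` and the coroutine `answer` are hypothetical names, and the approach only works here because `answer` never actually has to wait for anything.

```python
async def answer():
    # No await inside, so the coroutine finishes on the first send().
    return 42


def drive(coro):
    """Run a coroutine that never suspends, without an event loop."""
    try:
        coro.send(None)  # start the coroutine
    except StopIteration as stop:
        return stop.value  # the coroutine's return value
    # Reaching this point means the coroutine suspended and needs a real loop.
    coro.close()
    raise RuntimeError("coroutine suspended; it needs an event loop")


if __name__ == "__main__":
    print(drive(answer()))
```

Real coroutines suspend on I/O or timers, which is why they are normally run with await or by the event loop rather than driven like this.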

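### The await keyword
* await suspends the current coroutine until the awaited object completes, then evaluates to that object's result.
* await can only be used inside a coroutine function, not in an ordinary function. Jupyter cells are an exception: they accept top-level await because an event loop is already running.
* The expression after await must be an awaitable: a coroutine object, a Task, a Future, or any object that defines __await__().
* When a coroutine object is awaited, it runs as part of the current coroutine, and control returns to the event loop only when something inside it actually has to wait.
* When a Future (or a Task, which is a Future subclass) is awaited, the current coroutine is suspended until that Future has a result or an exception, and the event loop runs other tasks in the meantime.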
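The following sketch shows await applied to a Future and to a coroutine object. The helper `set_later` and the string values are assumptions made up for this example.

```python
import asyncio


async def set_later(fut, value):
    # Fill in the Future's result a little later.
    await asyncio.sleep(0.01)
    fut.set_result(value)


async def main():
    loop = asyncio.get_running_loop()
    fut = loop.create_future()  # a bare Future with no result yet
    setter = asyncio.create_task(set_later(fut, "ready"))

    from_future = await fut  # main() is suspended until set_result() is called
    await setter             # awaiting a Task (it has already finished here)

    # Awaiting a coroutine object: asyncio.sleep() returns its `result` argument.
    from_coro = await asyncio.sleep(0, result="coro")
    return from_future, from_coro


if __name__ == "__main__":
    # In a Jupyter cell, use `await main()` instead of asyncio.run().
    print(asyncio.run(main()))
```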
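### Event loop
* The event loop is the core of asyncio: it manages the lifecycle of tasks, deciding which one runs next and resuming each when whatever it was waiting for is ready.
* The asyncio module provides the event loop along with tools such as Future, Task, Event, and Semaphore for coordinating coroutines.
* asyncio.run() creates a new event loop, runs a coroutine to completion, and closes the loop; it is the usual entry point. asyncio.get_event_loop() is the older, lower-level way to obtain a loop.
* asyncio.create_task() wraps a coroutine in a Task and schedules it on the running loop; asyncio.ensure_future() does the same and also accepts Futures and other awaitables.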
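As a rough illustration of one of the coordination tools listed above, this sketch uses asyncio.Semaphore to cap how many coroutines run a section at the same time. The names `limited_job`, `state`, `n_jobs`, and `limit` are hypothetical.

```python
import asyncio


async def limited_job(sem, state):
    async with sem:  # at most `limit` jobs are inside this block at once
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
        await asyncio.sleep(0.01)  # stand-in for real I/O
        state["running"] -= 1


async def main(n_jobs=6, limit=2):
    sem = asyncio.Semaphore(limit)
    state = {"running": 0, "peak": 0}
    # gather() wraps each coroutine in a Task and waits for all of them.
    await asyncio.gather(*(limited_job(sem, state) for _ in range(n_jobs)))
    return state["peak"]  # highest number of jobs observed running together


if __name__ == "__main__":
    # In a Jupyter cell, use `await main()` instead of asyncio.run().
    print(asyncio.run(main()))
```

A cap like this is a common way to avoid opening too many connections at once in a crawler.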

In [5]:
# Simple example
import asyncio

async def main():
    print('hello')
    await asyncio.sleep(1)
    print('world')

await main()
# Calling asyncio.run(main()) directly in a Jupyter notebook raises an error,
# because Jupyter already has an event loop running
# asyncio.run(main())

hello
world


In [8]:
# Run the coroutine as a task
# nest_asyncio allows the already running event loop to be re-entered (nested)
import nest_asyncio
import asyncio

nest_asyncio.apply()
async def my_coroutine():
    print("hello")
    await asyncio.sleep(1)
    print("World!")

# Get the current event loop
loop = asyncio.get_event_loop()

# Create a task and wait for it to finish
task = loop.create_task(my_coroutine())
loop.run_until_complete(task)

hello
World!


In [9]:
## Serial crawler example
import time

def crawl_page(url):
    print('crawling {}'.format(url))
    sleep_time = int(url.split('_')[-1])
    time.sleep(sleep_time)
    print('OK {}'.format(url))

def main(urls):
    for url in urls:
        crawl_page(url)

%time main(['url_1', 'url_2', 'url_3', 'url_4'])

########## Output ##########

crawling url_1
OK url_1
crawling url_2
OK url_2
crawling url_3
OK url_3
crawling url_4
OK url_4
CPU times: total: 0 ns
Wall time: 10 s


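* From the caller's point of view, await behaves like an ordinary call: the current coroutine stops at that line, the awaited coroutine runs to completion, and execution then continues. Only the current coroutine is suspended, not the whole program; but when nothing else has been scheduled, as in the next example, the pages are still crawled one after another.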

In [10]:
## Sequential execution with coroutines: each await finishes before the next one starts
import nest_asyncio
import asyncio

nest_asyncio.apply()

async def crawl_page(url):
    print('crawling {}'.format(url))
    sleep_time = int(url.split('_')[-1])

    await asyncio.sleep(sleep_time)
    print('OK {}'.format(url))

async def main(urls):
    for url in urls:
        await crawl_page(url)

%time asyncio.run(main(['url_1', 'url_2', 'url_3', 'url_4']))

########## Output ##########


crawling url_1
OK url_1
crawling url_2
OK url_2
crawling url_3
OK url_3
crawling url_4
OK url_4
CPU times: total: 0 ns
Wall time: 10 s


* Concurrent coroutines

In [11]:
import nest_asyncio
import asyncio

nest_asyncio.apply()

async def crawl_page(url):
    print('crawling {}'.format(url))
    sleep_time = int(url.split('_')[-1])
    await asyncio.sleep(sleep_time)
    print('OK {}'.format(url))
# Wrap each coroutine in a task inside main so that they run concurrently
async def main(urls):
    tasks = [asyncio.create_task(crawl_page(url)) for url in urls]
    for task in tasks:
        await task
%time asyncio.run(main(['url_1', 'url_2', 'url_3', 'url_4']))

########## Output ##########


crawling url_1
crawling url_2
crawling url_3
crawling url_4
OK url_1
OK url_2
OK url_3
OK url_4
CPU times: total: 0 ns
Wall time: 4.03 s


In [12]:
import nest_asyncio
import asyncio

nest_asyncio.apply()

async def crawl_page(url):
    print('crawling {}'.format(url))
    sleep_time = int(url.split('_')[-1])
    await asyncio.sleep(sleep_time)
    print('OK {}'.format(url))

async def main(urls):
    # A more concise way to write it
    tasks = [asyncio.create_task(crawl_page(url)) for url in urls]
    await asyncio.gather(*tasks)

%time asyncio.run(main(['url_1', 'url_2', 'url_3', 'url_4']))

########## Output ##########


crawling url_1
crawling url_2
crawling url_3
crawling url_4
OK url_1
OK url_2
OK url_3
OK url_4
CPU times: total: 0 ns
Wall time: 4.02 s


In [13]:
import nest_asyncio
import asyncio
# Understanding the execution flow: awaiting coroutines directly (no tasks) runs them one after another
nest_asyncio.apply()

async def worker_1():
    print('worker_1 start')
    await asyncio.sleep(1)
    print('worker_1 done')

async def worker_2():
    print('worker_2 start')
    await asyncio.sleep(2)
    print('worker_2 done')

async def main():
    print('before await')
    await worker_1()
    print('awaited worker_1')
    await worker_2()
    print('awaited worker_2')

%time asyncio.run(main())

########## Output ##########


before await
worker_1 start
worker_1 done
awaited worker_1
worker_2 start
worker_2 done
awaited worker_2
CPU times: total: 0 ns
Wall time: 3.01 s


In [15]:
# Using tasks: concurrent execution
import nest_asyncio
import asyncio

nest_asyncio.apply()

async def worker_1():
    print('worker_1 start')
    await asyncio.sleep(2)
    print('worker_1 done')

async def worker_2():
    print('worker_2 start')
    await asyncio.sleep(1)
    print('worker_2 done')

async def main():
    task1 = asyncio.create_task(worker_1())
    task2 = asyncio.create_task(worker_2())
    print('before await')
    # Both tasks were already scheduled on the event loop by create_task();
    # awaiting them here only waits for their results
    await task1
    print('awaited worker_1')
    await task2
    print('awaited worker_2')

%time asyncio.run(main())

########## Output ##########


before await
worker_1 start
worker_2 start
worker_2 done
worker_1 done
awaited worker_1
awaited worker_2
CPU times: total: 0 ns
Wall time: 1.99 s


In [16]:
## Task cancellation
import nest_asyncio
import asyncio

nest_asyncio.apply()

async def worker_1():
    await asyncio.sleep(1)
    return 1

async def worker_2():
    await asyncio.sleep(2)
    return 2 / 0

async def worker_3():
    await asyncio.sleep(3)
    return 3

async def main():
    task_1 = asyncio.create_task(worker_1())
    task_2 = asyncio.create_task(worker_2())
    task_3 = asyncio.create_task(worker_3())

    await asyncio.sleep(2)
    task_3.cancel()
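    # Wait for several tasks at once; with return_exceptions=True,
    # exceptions are returned as results instead of being raised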
    res = await asyncio.gather(task_1, task_2, task_3, return_exceptions=True)
    print(res)

%time asyncio.run(main())

########## Output ##########

[1, ZeroDivisionError('division by zero'), CancelledError('')]
CPU times: total: 0 ns
Wall time: 1.99 s
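Cancellation is delivered to a task as a CancelledError raised at the await where it is currently suspended. The sketch below, with the hypothetical names `worker` and `log`, shows one way a task might clean up before it stops.

```python
import asyncio


async def worker(log):
    try:
        await asyncio.sleep(10)  # cancellation is raised at this await
    except asyncio.CancelledError:
        log.append("cleanup")    # release resources here
        raise                    # re-raise so the task is marked as cancelled


async def main():
    log = []
    task = asyncio.create_task(worker(log))
    await asyncio.sleep(0)  # let the worker start and reach its await
    task.cancel()           # request cancellation
    try:
        await task
    except asyncio.CancelledError:
        log.append("cancelled")
    return log, task.cancelled()


if __name__ == "__main__":
    # In a Jupyter cell, use `await main()` instead of asyncio.run().
    print(asyncio.run(main()))
```

Re-raising matters: a task that swallows CancelledError is not reported as cancelled.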


#### Producer-consumer pattern with coroutines

In [17]:
import nest_asyncio
import asyncio
import random

nest_asyncio.apply()
# Consumer coroutine
async def consumer(queue, id):
    while True:
        val = await queue.get()
        print('{} get a val: {}'.format(id, val))
        await asyncio.sleep(1)
# Producer coroutine
async def producer(queue, id):
    for i in range(5):
        val = random.randint(1, 10)
        await queue.put(val)
        print('{} put a val: {}'.format(id, val))
        await asyncio.sleep(1)

async def main():
    queue = asyncio.Queue()

    consumer_1 = asyncio.create_task(consumer(queue, 'consumer_1'))
    consumer_2 = asyncio.create_task(consumer(queue, 'consumer_2'))

    producer_1 = asyncio.create_task(producer(queue, 'producer_1'))
    producer_2 = asyncio.create_task(producer(queue, 'producer_2'))

    await asyncio.sleep(10)
    consumer_1.cancel()
    consumer_2.cancel()
    
    await asyncio.gather(consumer_1, consumer_2, producer_1, producer_2, return_exceptions=True)

%time asyncio.run(main())

########## Output ##########

producer_1 put a val: 6
producer_2 put a val: 3
consumer_1 get a val: 6
consumer_2 get a val: 3
producer_1 put a val: 3
producer_2 put a val: 5
consumer_2 get a val: 3
consumer_1 get a val: 5
producer_1 put a val: 2
producer_2 put a val: 3
consumer_1 get a val: 2
consumer_2 get a val: 3
producer_1 put a val: 10
producer_2 put a val: 10
consumer_2 get a val: 10
consumer_1 get a val: 10
producer_1 put a val: 1
producer_2 put a val: 2
consumer_1 get a val: 1
consumer_2 get a val: 2
CPU times: total: 15.6 ms
Wall time: 10 s
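The example above stops the consumers after a fixed 10-second sleep. A possible alternative, sketched here as an assumption rather than as part of the original notebook, is to have consumers call queue.task_done() and to wait on queue.join(), so shutdown happens as soon as every item has been processed. The `seen` list and the fixed item lists are hypothetical.

```python
import asyncio


async def producer(queue, items):
    for item in items:
        await queue.put(item)


async def consumer(queue, seen):
    while True:
        item = await queue.get()
        seen.append(item)
        queue.task_done()  # tell the queue this item is fully processed


async def main():
    queue = asyncio.Queue()
    seen = []
    consumers = [asyncio.create_task(consumer(queue, seen)) for _ in range(2)]

    # Run both producers to completion, then wait until the queue is drained.
    await asyncio.gather(producer(queue, [1, 2, 3]), producer(queue, [4, 5, 6]))
    await queue.join()

    # The consumers loop forever, so cancel them once the work is done.
    for task in consumers:
        task.cancel()
    await asyncio.gather(*consumers, return_exceptions=True)
    return sorted(seen)


if __name__ == "__main__":
    # In a Jupyter cell, use `await main()` instead of asyncio.run().
    print(asyncio.run(main()))
```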


#### A simple crawler

In [22]:
import nest_asyncio
import requests
from bs4 import BeautifulSoup

nest_asyncio.apply()

def main():
    url = "https://movie.douban.com/"
    init_page = requests.get(url).content
    init_soup = BeautifulSoup(init_page, 'lxml')

    all_movies = init_soup.find('div', id="showing-soon")
    for each_movie in all_movies.find_all('div', class_="item"):
        all_a_tag = each_movie.find_all('a')
        all_li_tag = each_movie.find_all('li')

        movie_name = all_a_tag[1].text
        url_to_fetch = all_a_tag[1]['href']
        movie_date = all_li_tag[0].text

        response_item = requests.get(url_to_fetch).content
        soup_item = BeautifulSoup(response_item, 'lxml')
        img_tag = soup_item.find('img')

        print('{} {} {}'.format(movie_name, movie_date, img_tag['src']))

%time main()

########## Output ##########

FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
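Separately from the parser error, the crawler above fetches pages one at a time with blocking calls. One way it might be made concurrent without changing the HTTP library is asyncio.to_thread (Python 3.9+), which runs a blocking function in a worker thread. In this sketch `blocking_fetch` is a hypothetical stand-in for requests.get, so no network access is involved.

```python
import asyncio
import time


def blocking_fetch(url):
    # Hypothetical stand-in for a blocking call such as requests.get(url).
    time.sleep(0.05)
    return "body of {}".format(url)


async def main(urls):
    # Each blocking call runs in its own worker thread; gather() waits for
    # all of them and returns the results in the same order as `urls`.
    return await asyncio.gather(
        *(asyncio.to_thread(blocking_fetch, url) for url in urls)
    )


if __name__ == "__main__":
    # In a Jupyter cell, use `await main([...])` instead of asyncio.run().
    print(asyncio.run(main(["url_1", "url_2", "url_3"])))
```

An async HTTP client would avoid threads altogether, but that is a larger change than this sketch assumes.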

In [28]:
import requests

from bs4 import BeautifulSoup

# Send the HTTP request
url = 'https://example.com'
response = requests.get(url)

# Create a BeautifulSoup object with the lxml parser
# (this requires the lxml package to be installed)
soup = BeautifulSoup(response.content, 'lxml')

# Print the parsed HTML
print(soup.prettify())

FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

In [27]:
import bs4
print(bs4.__version__)

4.12.3
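The FeatureNotFound error above indicates that the lxml package is not installed in this environment; the bs4 version itself is not the problem. Installing lxml resolves it. As a fallback, the sketch below picks Python's built-in html.parser when lxml is missing. The helper name `pick_parser` is hypothetical.

```python
import importlib.util


def pick_parser():
    """Return "lxml" if the package is installed, else the built-in parser."""
    # find_spec() returns None when a top-level module cannot be found.
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"  # ships with the Python standard library


if __name__ == "__main__":
    # Intended use: BeautifulSoup(response.content, pick_parser())
    print(pick_parser())
```

The two parsers can build slightly different trees for malformed HTML, so results may not be identical across environments.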
