
## 4. Iterators and Generators

Iterators and generators.

Iteration is one of Python's core features.


### 4.1 Manually Consuming an Iterator

In [1]:
print('-' * 50)
with open('../_import.py') as f:
    try:
        while True:
            line = next(f)      # advance the iterator manually with next()
            print(line, end='')
    except StopIteration:       # raised when the iterator is exhausted
        pass

print('\n' + '-'*50 + '\n')
items = [1, 2, 3]
it = iter(items)
print(it)
for i in it:
    print(i)
    
it = iter(items)
next(it)
next(it)
next(it)
next(it)  # raises StopIteration

--------------------------------------------------


import datetime
--------------------------------------------------

<list_iterator object at 0x110d506a0>
1
2
3


StopIteration: 
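
When the exception is unwanted, `next` also accepts a default value that is returned once the iterator is exhausted. The sketch below assumes `None` never appears as a real item; the names `items` and `results` are only for illustration.

```python
items = [1, 2, 3]
it = iter(items)

results = []
while True:
    value = next(it, None)   # None is returned instead of raising StopIteration
    if value is None:
        break
    results.append(value)

print(results)
```

If `None` can be a legitimate item, a dedicated sentinel object (for example `object()`) is a safer default.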

### 4.2 Defining Iteration for an Object (Treating the Object as a Container)

Define the `__iter__` method.

In [2]:
class Node:
    def __init__(self, value):
        self._value = value
        self._children = []

    def __repr__(self):
        return 'Node({!r})'.format(self._value)

    def add_child(self, node):
        self._children.append(node)

    def __iter__(self):
        return iter(self._children)    # delegate to the list's iterator

# Example
if __name__ == '__main__':
    root = Node(0)
    child1 = Node(1)
    child2 = Node(2)
    root.add_child(child1)
    root.add_child(child2)
    for ch in root:
        print(ch)
    # Outputs: Node(1), Node(2)


Node(1)
Node(2)


### 4.3 Writing Custom Generators, Similar to `range` or `reversed`

In [51]:
def frange(start, stop, increment):
    x = start
    while x < stop:
        yield x
        x += increment

for n in frange(0, 4, 0.5):
    print(n, end=',')


0,0.5,1.0,1.5,2.0,2.5,3.0,3.5,

In [4]:
def countdown(n):
    print('Starting to count from', n)
    while n > 0:
        yield n
        n -= 1
    print('Done!')
    
c = countdown(3)
print(c)   # c is a generator object

for i in c:
    print(i)
    

c = countdown(3)
print(next(c))
print(next(c))
print(next(c))
print(next(c))



<generator object countdown at 0x110e009e8>
Starting to count from 3
3
2
1
Done!
Starting to count from 3
3
2
1
Done!


StopIteration: 


### 4.4 Custom Iteration Methods (Iterator Protocol)

You can define your own iteration methods, for example a depth-first or breadth-first traversal of a tree.

In [5]:
class Node:
    def __init__(self, value):
        self._value = value
        self._children = []

    def __repr__(self):
        return 'Node({!r})'.format(self._value)

    def add_child(self, node):
        self._children.append(node)

    def __iter__(self):
        return iter(self._children)

    def depth_first(self):   # a custom generator method
        yield self
        for c in self:   # relies on the object's own __iter__
            yield from c.depth_first()

# Example
if __name__ == '__main__':
    root = Node(0)
    child1 = Node(1)
    child2 = Node(2)
    root.add_child(child1)
    root.add_child(child2)
    child1.add_child(Node(3))
    child1.add_child(Node(4))
    child2.add_child(Node(5))

    for ch in root.depth_first():
        print(ch)


Node(0)
Node(1)
Node(3)
Node(4)
Node(2)
Node(5)
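
The breadth-first traversal mentioned above can be written in the same style. This is a sketch only: the method name `breadth_first` is hypothetical and does not appear in the original example, and the `Node` class is repeated so that the block runs on its own.

```python
from collections import deque

class Node:
    def __init__(self, value):
        self._value = value
        self._children = []

    def __repr__(self):
        return 'Node({!r})'.format(self._value)

    def add_child(self, node):
        self._children.append(node)

    def __iter__(self):
        return iter(self._children)

    def breadth_first(self):
        # Hypothetical counterpart to depth_first: visit nodes level by level
        queue = deque([self])
        while queue:
            node = queue.popleft()
            yield node
            queue.extend(node)    # uses Node.__iter__ to enqueue the children

root = Node(0)
child1, child2 = Node(1), Node(2)
root.add_child(child1)
root.add_child(child2)
child1.add_child(Node(3))
child1.add_child(Node(4))
child2.add_child(Node(5))

order = [repr(n) for n in root.breadth_first()]
print(order)
```

The queue replaces the recursion used by `depth_first`, so the nodes come out level by level rather than branch by branch.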


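### 4.5 Iterating in Reverse

Use the built-in `reversed`. Note that `reversed` only works on objects whose size is known (sequences that support `__len__` and `__getitem__`) or that define `__reversed__`, so it cannot be applied directly to a generator.

Alternatively, define a `__reversed__` method.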

In [6]:
class Countdown:
    def __init__(self, start):
        self.start = start

    # Forward iterator
    def __iter__(self):
        n = self.start
        while n > 0:
            yield n
            n -= 1

    # Reverse iterator
    def __reversed__(self):
        n = 1
        while n <= self.start:
            yield n
            n += 1

c = Countdown(5)
print("Forward:")
for x in c:
    print(x)

print("Reverse:")
for x in reversed(c):
    print(x)


Forward:
5
4
3
2
1
Reverse:
1
2
3
4
5
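
The limitation on generators can be checked directly. The sketch below reuses a small `countdown` generator and shows that `reversed` rejects it, while materializing the values into a list first works at the cost of holding them all in memory.

```python
def countdown(n):
    while n > 0:
        yield n
        n -= 1

try:
    reversed(countdown(3))
    error = None
except TypeError as e:
    # Generators have neither __reversed__ nor __len__/__getitem__
    error = e

# Building a list first gives reversed() a real sequence to work with
backwards = list(reversed(list(countdown(3))))
print(backwards)
```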


### 4.6 Generators with Extra State (Omitted)

This amounts to recording some state in the iterator object (or generator), which is straightforward to implement.
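
As a rough illustration, the state can live on a class whose `__iter__` is a generator. The class name `LineHistory` and its `histlen` parameter are assumptions made for this sketch, not part of the notes.

```python
from collections import deque

class LineHistory:
    """Iterate over lines while remembering the most recent ones (hypothetical helper)."""

    def __init__(self, lines, histlen=3):
        self.lines = lines
        self.history = deque(maxlen=histlen)   # extra state kept on the instance

    def __iter__(self):
        for lineno, line in enumerate(self.lines, 1):
            self.history.append((lineno, line))
            yield line

text = ['hello world', 'this is a test', 'python is fun', 'goodbye']
lines = LineHistory(text, histlen=2)
for line in lines:
    if 'python' in line:
        # The recorded state is reachable through the instance during iteration
        recent = list(lines.history)
        print(recent)
```

Keeping the state on the instance means the caller can inspect it while the loop is running, which a bare generator function does not allow.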

### 4.7 Slicing an Iterator

A plain iterator yields items one at a time, so ordinary slicing is not available. Use `itertools.islice` instead.

In [7]:
def count(n):
    while True:
        yield n
        n += 1

c = count(0)
c[10:20]

TypeError: 'generator' object is not subscriptable

In [8]:
from itertools import islice
for x in islice(c, 10, 20):
    print(x, end=',')

10,11,12,13,14,15,16,17,18,19,
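
One point worth checking is that `islice` consumes the underlying iterator: the skipped items are gone afterwards. A small self-contained sketch:

```python
from itertools import islice

def count(n):
    while True:
        yield n
        n += 1

c = count(0)
first = list(islice(c, 10, 20))   # discards items 0..9, then takes 10..19
after = next(c)                   # the generator has advanced past the slice
print(first, after)
```

If the skipped items are needed later, they have to be stored somewhere (for example in a list) before slicing.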

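### 4.8 Skipping the First Lines (Based on a Condition)

Use `itertools.dropwhile`.

Note that only the leading run of consecutive lines that satisfy the condition is dropped. Lines in the middle that also satisfy the condition are kept.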

In [9]:
from itertools import dropwhile
with open('/etc/passwd') as f:
    for line in dropwhile(lambda line: line.startswith('#'), f):
        print(line, end='')

nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1:System Services:/var/root:/usr/bin/false
_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico
_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false
_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false
_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false
_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false
_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false
_scsd:*:31:31:Service Configuration Service:/var/empty:/usr/bin/false
_ces:*:32:32:Certificate Enrollment Service:/var/empty:/usr/bin/false
_appstore:*:33:33:Mac App Store Service:/var/empty:/usr/bin/false
_mcxalr:*:54:54:MCX AppLaunch:/var/empty:/usr/bin/false
_appleevents:*:55:55:AppleEvents Daemon:/var/empty:/usr/bin/false
_geod:*:56:56:Geo Services Daemon:/var/db/geod:/usr/bin/false
_serialnumberd:*:58:58:Serial Number Daemon:/va
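
The "leading run only" behavior is easier to see on a small in-memory list. The sample lines below are made up for illustration.

```python
from itertools import dropwhile

lines = [
    '# header comment\n',
    '# another comment\n',
    'root:*:0:0\n',
    '# comment in the middle\n',
    'daemon:*:1:1\n',
]

# Only the first two comment lines are dropped; the later one survives
kept = list(dropwhile(lambda line: line.startswith('#'), lines))
print(kept)
```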


### 4.9 Iterating over All Permutations and Combinations

In [10]:
items = ['a', 'b', 'c']

from itertools import permutations   # permutations

for p in permutations(items):   # A(3, 3) = 3! = 6
    print(p)
    
for p in permutations(items, 2):  # A(3, 2) = 3!/(3-2)! = 6
    print(p)

('a', 'b', 'c')
('a', 'c', 'b')
('b', 'a', 'c')
('b', 'c', 'a')
('c', 'a', 'b')
('c', 'b', 'a')
('a', 'b')
('a', 'c')
('b', 'a')
('b', 'c')
('c', 'a')
('c', 'b')


In [11]:

from itertools import combinations   # combinations

for p in combinations(items, 3):   # C(3, 3) = 1
    print(p)
    
for p in combinations(items, 2):  # C(3, 2) = 3
    print(p)

('a', 'b', 'c')
('a', 'b')
('a', 'c')
('b', 'c')


In [12]:

from itertools import combinations_with_replacement   # combinations with replacement

for p in combinations_with_replacement(items, 3):   # C(3+3-1, 3) = C(5, 3) = 10
    print(p)

('a', 'a', 'a')
('a', 'a', 'b')
('a', 'a', 'c')
('a', 'b', 'b')
('a', 'b', 'c')
('a', 'c', 'c')
('b', 'b', 'b')
('b', 'b', 'c')
('b', 'c', 'c')
('c', 'c', 'c')
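
The counts in the comments can be cross-checked against the closed-form formulas. This sketch assumes Python 3.8 or later, where `math.perm` and `math.comb` are available.

```python
from itertools import permutations, combinations, combinations_with_replacement
from math import comb, perm   # available since Python 3.8

items = ['a', 'b', 'c']
n, r = len(items), 2

n_perm = len(list(permutations(items, r)))
n_comb = len(list(combinations(items, r)))
n_cwr = len(list(combinations_with_replacement(items, r)))

print(n_perm, perm(n, r))          # ordered selections: n! / (n - r)!
print(n_comb, comb(n, r))          # unordered selections: n! / (r! (n - r)!)
print(n_cwr, comb(n + r - 1, r))   # unordered selections with repetition
```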


### 4.10 Iterating over a Sequence with Its Index

This one is simple: use `enumerate`.

In [13]:
# Example of iterating over lines of a file with an extra lineno attribute
def parse_data(filename):
    with open(filename, 'rt') as f:
        for lineno, line in enumerate(f, 1):
            fields = line.split()
            try:
                count = int(fields[1])
            except ValueError as e:
                print('Line {}: Parse error: {}'.format(lineno, e))

parse_data('sample.dat')


Line 3: Parse error: invalid literal for int() with base 10: 'N/A'
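
A common stumbling block is combining `enumerate` with a sequence of tuples. A minimal sketch with made-up data:

```python
data = [(1, 2), (3, 4), (5, 6)]

rows = []
for n, (x, y) in enumerate(data, 1):   # the parentheses around (x, y) are required
    rows.append((n, x + y))

print(rows)
```

Writing `for n, x, y in enumerate(data)` fails, because each item produced by `enumerate` is a pair of the form `(index, tuple)`.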


### 4.11 Iterating over Multiple Sequences Simultaneously

Use `zip`.

There is also `zip_longest`, which pads the shorter sequences with `fillvalue`.

In [14]:
a = [1, 2, 3]
b = ['w', 'x', 'y', 'z']

for i in zip(a, b):
    print(i)
    

(1, 'w')
(2, 'x')
(3, 'y')


In [15]:

from itertools import zip_longest

for i in zip_longest(a, b, fillvalue=0):
    print(i)

(1, 'w')
(2, 'x')
(3, 'y')
(0, 'z')
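
`zip` is also handy for pairing up two sequences into a dictionary. Note that it returns a one-shot iterator rather than a list. The sample `headers` and `values` are made up for this sketch.

```python
headers = ['name', 'shares', 'price']
values = ['ACME', 100, 490.1]

pairs = zip(headers, values)
record = dict(pairs)       # consumes the iterator
leftover = list(pairs)     # nothing is left on a second pass

print(record, leftover)
```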


### 4.12 Iterating over Multiple Sequences One After Another

Use `chain`.

In [16]:
a = [1, 2, 3]
b = ['w', 'x', 'y', 'z']

from itertools import chain
for i in chain(a, b):
    print(i, end=',')

1,2,3,w,x,y,z,
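
Compared with `a + b`, `chain` does not build a new combined sequence and only requires that its inputs be iterable, so the inputs may have different types. A small sketch with made-up names:

```python
from itertools import chain

active = {'alice'}
inactive = ['bob', 'carol']

# active + inactive would raise TypeError (set + list); chain only needs iterables
names = sorted(chain(active, inactive))
print(names)
```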

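### 4.13 Data Processing Pipelines

In the example below, every function whose name starts with `gen` returns a generator object.

`gen_concatenate` combines multiple file handles into a single iterator that yields every line.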

In [17]:
import os
import fnmatch
import gzip
import bz2
import re

# Generator of matching file names
def gen_find(filepat, top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            print(os.path.join(path, name))
            yield os.path.join(path, name)

# Opens each file and yields the file handle
def gen_opener(filenames):
    for filename in filenames:
        if filename.endswith('.gz'):
            f = gzip.open(filename, 'rt')
        elif filename.endswith('.bz2'):
            f = bz2.open(filename, 'rt')
        else:
            f = open(filename, 'rt')
        yield f
        f.close()

# Note the use of `yield from`
def gen_concatenate(iterators):
    '''
    Chain a sequence of iterators together into a single sequence.
    '''
    for it in iterators:
        print(type(it))
        yield from it

# Yields the lines that match a pattern
def gen_grep(pattern, lines):
    '''
    Look for a regex pattern in a sequence of lines
    '''
    pat = re.compile(pattern)
    for line in lines:
        if pat.search(line):
            yield line

if __name__ == '__main__':

    # Example 1
    lognames = gen_find('access-log*', 'www')
    files = gen_opener(lognames)
    lines = gen_concatenate(files)
    pylines = gen_grep('(?i)python', lines)
    for line in pylines:
        pass
#         print(line, end='')

    # Example 2
    lognames = gen_find('access-log*', 'www')
    files = gen_opener(lognames)
    lines = gen_concatenate(files)
    pylines = gen_grep('(?i)python', lines)
    bytecolumn = (line.rsplit(None,1)[1] for line in pylines)
    _bytes = (int(x) for x in bytecolumn if x != '-')
    print('Total', sum(_bytes))


www/foo/access-log
<class '_io.TextIOWrapper'>
www/foo/access-log-0208.gz
<class '_io.TextIOWrapper'>
www/foo/access-log-0108.gz
<class '_io.TextIOWrapper'>
www/bar/access-log
<class '_io.TextIOWrapper'>
www/bar/access-log-0208.bz2
<class '_io.TextIOWrapper'>
www/bar/access-log-0108.bz2
<class '_io.TextIOWrapper'>
www/foo/access-log
<class '_io.TextIOWrapper'>
www/foo/access-log-0208.gz
<class '_io.TextIOWrapper'>
www/foo/access-log-0108.gz
<class '_io.TextIOWrapper'>
www/bar/access-log
<class '_io.TextIOWrapper'>
www/bar/access-log-0208.bz2
<class '_io.TextIOWrapper'>
www/bar/access-log-0108.bz2
<class '_io.TextIOWrapper'>
Total 18159780
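
The same pipeline idea can be tried without any log files on disk. The sketch below reuses `gen_grep` on an in-memory list; the log lines are invented for illustration and only mimic the shape of an access log, with the byte count in the last column.

```python
import re

def gen_grep(pattern, lines):
    pat = re.compile(pattern)
    for line in lines:
        if pat.search(line):
            yield line

log = [
    '1.2.3.4 - - "GET /python.html" 200 7587',
    '1.2.3.4 - - "GET /index.html" 200 100',
    '5.6.7.8 - - "GET /Python/faq" 304 -',
    '5.6.7.8 - - "GET /python/doc" 200 13',
]

pylines = gen_grep('(?i)python', log)                        # stage 1: filter lines
bytecolumn = (line.rsplit(None, 1)[1] for line in pylines)   # stage 2: last column
sizes = (int(x) for x in bytecolumn if x != '-')             # stage 3: parse numbers
total = sum(sizes)                                           # pulls data through all stages
print('Total', total)
```

Nothing is computed until `sum` starts pulling values, so each line flows through all stages before the next one is read.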


### 4.14 Flattening a Nested Sequence

Iterate over a nested sequence as if it were flat. This uses `yield from`, as in the previous example.

In [19]:
# Example of flattening a nested sequence using subgenerators

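from collections.abc import Iterable   # the ABC aliases were removed from `collections` in Python 3.10


def flatten(items, ignore_types=(str, bytes)):
    for x in items:
        if isinstance(x, Iterable) and not isinstance(x, ignore_types):
            yield from flatten(x, ignore_types)   # pass ignore_types down to nested levels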
        else:
            yield x

items = [1, 2, [3, 4, [5, 6], 7], 8]

# Produces 1 2 3 4 5 6 7 8
for x in flatten(items):
    print(x)

items = ['Dave', 'Paula', ['Thomas', 'Lewis']]
for x in flatten(items):
    print(x)


1
2
3
4
5
6
7
8
Dave
Paula
Thomas
Lewis


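### 4.15 Iterating in Sorted Order over Two Sorted Iterables

This resembles the merge step of merge sort.

The first idea: take one item from each iterable, compare them, and yield the smaller one first, until both are exhausted.

How does this generalize to more than two sorted sequences?

`heapq.merge` also handles this. See <https://github.com/python/cpython/blob/master/Lib/heapq.py>

From a quick read of the source, it works roughly like this:

1. Take the first item from each iterable and put it into a heap. (The heap size therefore never exceeds the number of sequences. Note that this is not the total number of elements.)
2. Yield the first item (the root), then replace it with the next item from the same iterable and restore the heap. If the iterable at the root has no next item, remove that iterable from the heap.
3. Repeat step 2 until all iterables are exhausted. When only one iterable is left, its remaining items can be iterated directly, which is faster.


Because the heap size depends only on the number of sequences, the running time is $O(n \log m)$, where $m$ is the number of sequences and $n$ is the total number of elements. When $m$ is a small constant, this is effectively $O(n)$.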

In [23]:
import heapq
a = [1, 4, 7, 10]
b = [2, 5, 6, 11]
c = [3, 7, 9, 12, 33]
for x in heapq.merge(a, b, c):
    print(x, end=', ')



1, 2, 3, 4, 5, 6, 7, 7, 9, 10, 11, 12, 33, 
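
The three steps described above can be sketched in a few lines. This is a simplified illustration, not the actual `heapq.merge` source: the function name `merge_sorted` is made up, and the fast path for the last remaining iterable is left out.

```python
import heapq

def merge_sorted(*iterables):
    """Simplified k-way merge of already sorted inputs (illustrative sketch)."""
    heap = []
    for order, it in enumerate(map(iter, iterables)):
        try:
            # `order` breaks ties, so the iterators themselves are never compared
            heap.append([next(it), order, it])
        except StopIteration:
            pass                      # skip empty inputs
    heapq.heapify(heap)               # step 1: one entry per input

    while heap:
        value, order, it = heap[0]
        yield value                   # step 2: yield the root
        try:
            # Replace the root with the next item from the same input
            heapq.heapreplace(heap, [next(it), order, it])
        except StopIteration:
            heapq.heappop(heap)       # this input is exhausted, drop it

a = [1, 4, 7, 10]
b = [2, 5, 6, 11]
c = [3, 7, 9, 12, 33]
merged = list(merge_sorted(a, b, c))
print(merged)
```

The heap never holds more than one entry per input, which is where the $\log m$ factor comes from.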

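### 4.16 Replacing an Infinite Loop with `iter`

The second argument of `iter` acts as a termination condition (a sentinel), see <https://docs.python.org/3/library/functions.html#iter>. In this form the first argument must be a callable.

This seems to be used mostly for I/O. I could not find uses elsewhere.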

In [38]:
import sys

# Written as an infinite loop
def reader(f):
    while True:
        data = f.read(10)
        if data == '':
            break
        n = sys.stdout.write(data)
        
# Written with iter:
def reader2(f):
    for data in iter(lambda: f.read(10), ''):
        n = sys.stdout.write(data)

with open('somefile.txt') as f:    # close the file deterministically
    reader(f)
with open('somefile.txt') as f:
    reader2(f)

hello world
this is a test
of iterating over lines with a history
python is fun
hello world
this is a test
of iterating over lines with a history
python is fun
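
The `lambda` is only there to turn `f.read(10)` into a zero-argument callable, and `functools.partial` does the same job. The sketch below uses an in-memory `io.StringIO` in place of a real file so that it runs anywhere.

```python
import io
from functools import partial

text = 'hello world\nthis is a test\n'
f = io.StringIO(text)

# partial(f.read, 10) is a zero-argument callable, equivalent to lambda: f.read(10)
chunks = list(iter(partial(f.read, 10), ''))
print(chunks)
```

For files opened in binary mode the sentinel would be `b''` instead of `''`.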


In [50]:

def gen(n=0):
    while True:
        yield n
        n += 1
g = gen()
print(g)
for i in islice(g, 2, 10):
    print(i, end=',')
print()

g = gen()
for d in iter(lambda: next(g), 10):   # stops when the sentinel value is reached
    print(d, end=',')

<generator object gen at 0x110e00db0>
2,3,4,5,6,7,8,9,
0,1,2,3,4,5,6,7,8,9,