
## 5. Files and IO




### 5.1 Reading and Writing Text Data

This is plain file reading and writing, which is straightforward.

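Pay attention to the file encoding. When `encoding` is omitted, `open` uses a platform-dependent default (the locale's preferred encoding), which is often but not always UTF-8, so pass `encoding='utf-8'` explicitly when it matters.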
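
Since the encoding matters, here is a minimal sketch of checking the locale default and passing `encoding` explicitly. The temporary directory and the file name `note.txt` are assumptions made for this illustration and are not part of the original notebook.

```python
import locale
import os
import tempfile

# The locale's preferred encoding, which open() has traditionally used
# when no encoding argument is given (platform-dependent).
default_encoding = locale.getpreferredencoding(False)
print('Locale default encoding:', default_encoding)

# Hypothetical scratch file, used only for this illustration.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'note.txt')

    # An explicit encoding makes the bytes on disk independent of the platform.
    with open(path, 'wt', encoding='utf-8') as f:
        f.write('Spicy Jalapeño\n')

    with open(path, 'rt', encoding='utf-8') as f:
        text = f.read()

    # Reading in binary mode shows the encoded bytes as stored.
    with open(path, 'rb') as f:
        raw = f.read()

print(repr(text))
print(raw)
```

The `ñ` is stored as the two bytes `\xc3\xb1`, which is why decoding the same file as ASCII fails or needs an `errors` policy.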

See <https://www.programiz.com/python-programming/file-operation> for reference.


Mode | Description
-- | --
'r' | Open a file for reading. (default)
'w' | Open a file for writing. Creates a new file if it does not exist or truncates the file if it exists.
'x' | Open a file for exclusive creation. If the file already exists, the operation fails.
'a' | Open for appending at the end of the file without truncating it. Creates a new file if it does not exist.
't' | Open in text mode. (default)
'b' | Open in binary mode.
'+' | Open a file for updating (reading and writing).
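
The modes in the table can be tried out with a short sketch like the following. The file name `modes.txt` and the temporary directory are assumptions made for illustration only.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'modes.txt')  # hypothetical file name

    # 'x' creates the file, but only if it does not exist yet.
    with open(path, 'x', encoding='utf-8') as f:
        f.write('first\n')

    # 'a' keeps the existing content and appends to the end.
    with open(path, 'a', encoding='utf-8') as f:
        f.write('second\n')

    with open(path, 'r', encoding='utf-8') as f:
        after_append = f.read()

    # 'w' truncates the existing content before writing.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('third\n')

    # 'r+' opens an existing file for both reading and writing.
    with open(path, 'r+', encoding='utf-8') as f:
        after_truncate = f.read()
        f.write('fourth\n')  # read() left the position at the end of the file

    with open(path, 'r', encoding='utf-8') as f:
        final = f.read()

print(repr(after_append))
print(repr(after_truncate))
print(repr(final))
```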

In [37]:


# (a) Reading a simple text file (UTF-8)
print("Reading a simple text file (UTF-8)")
with open('./samples/sample.txt', 'rt', encoding='utf-8') as f:
    for line in f:
        print(repr(line))

# (b) Reading a text file with universal newlines turned off
print("Reading text file with universal newlines off")
with open('./samples/sample.txt', 'rt', newline='') as f:
    for line in f:
        print(repr(line))

# (c) Reading text file as ASCII with replacement error handling
print("Reading text as ASCII with replacement error handling")
with open('./samples/sample.txt', 'rt', encoding='ascii', errors='replace') as f:
    for line in f:
        print(repr(line))

# (d) Reading text file as ASCII with ignore error handling
print("Reading text as ASCII with ignore error handling")
with open('./samples/sample.txt', 'rt', encoding='ascii', errors='ignore') as f:
    for line in f:
        print(repr(line))


Reading a simple text file (UTF-8)
'Hello World\n'
'Spicy Jalapeño\n'
Reading text file with universal newlines off
'Hello World\r\n'
'Spicy Jalapeño\r\n'
Reading text as ASCII with replacement error handling
'Hello World\n'
'Spicy Jalape��o\n'
Reading text as ASCII with ignore error handling
'Hello World\n'
'Spicy Jalapeo\n'


### 5.2 Redirecting print Output to a File

Just pass the `file` argument to `print`.

This seems of limited use; calling `write` directly works just as well.
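
For comparison, the sketch below shows what `print(..., file=...)` does beyond `write`: it converts its arguments to strings, joins them, and appends a newline. An `io.StringIO` object stands in for a real text file here, which is an assumption made to keep the example self-contained.

```python
import io

buf = io.StringIO()  # in-memory stand-in for a real text file

# print() converts each argument with str(), joins them with sep,
# and appends end ('\n' by default).
print('ACME', 50, 91.5, file=buf)

# write() accepts exactly one str, so the conversion and the newline are manual.
buf.write(' '.join(str(v) for v in ('ACME', 50, 91.5)) + '\n')

lines = buf.getvalue().splitlines()
print(lines)
```

Both calls produce the same line, so the choice is mostly about convenience.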

In [38]:
with open('./samples/somefile2.txt', 'wt') as f:
    print('Hello World!', file=f)

### 5.3 Print Separator and Line Ending

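`print` takes two relevant parameters:

- `sep`: the separator placed between arguments, `' '` by default
- `end`: the line ending appended after the last argument, `'\n'` by default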
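
As a small sketch of why `sep` is handy: `str.join` only accepts strings, while `print` converts its arguments itself. The tuple `row` is sample data chosen for illustration.

```python
row = ('ACME', 50, 91.5)

# str.join() only accepts strings, so mixed values must be converted first.
joined = ','.join(str(v) for v in row)
print(joined)

# print() does the str() conversion itself; * unpacks the tuple into arguments.
print(*row, sep=',')

# end=' ' replaces the newline, so consecutive calls stay on one line.
for i in range(3):
    print(i, end=' ')
print()
```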

In [7]:
print('ACME', 50, 91.5)
print('ACME', 50, 91.5, sep=',')
print('ACME', 50, 91.5, sep=',', end='!!\n')

ACME 50 91.5
ACME,50,91.5
ACME,50,91.5!!


### 5.4 Reading and Writing Binary Data Files



In [40]:
with open('./samples/somefile.bin', 'wb') as f:
    f.write(b'Hello world')

with open('./samples/somefile.bin', 'rb') as f:
    d = f.read()
    print(d.decode('utf-8'))

Hello world


### 5.5 Writing a File Without Overwriting an Existing One

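As we know, opening a file in 'w' mode creates the file if it does not exist and truncates (overwrites) it if it already exists.

The overwriting is what needs to be avoided. Use 'x' mode for that: it fails if the file already exists.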


In [41]:
with open('./samples/somefile.bin', 'xb') as f:
    # A bytes literal has no \u escape, so encode the str to get the bytes for '中'.
    f.write('\u4e2d'.encode('utf-8'))

FileExistsError: [Errno 17] File exists: './samples/somefile.bin'
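
The `FileExistsError` raised by 'x' mode can be caught to report the conflict instead of crashing. The helper name `write_new` and the temporary directory below are hypothetical, introduced only for this sketch.

```python
import os
import tempfile

def write_new(path, data):
    """Write data to path only if path does not exist yet; return True on success."""
    try:
        with open(path, 'xb') as f:
            f.write(data)
        return True
    except FileExistsError:
        # The existing file is left untouched.
        return False

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'somefile.bin')
    first = write_new(target, b'original')
    second = write_new(target, b'overwrite attempt')
    with open(target, 'rb') as f:
        content = f.read()

print(first, second, content)
```

Letting `open` perform the existence check avoids the gap between a separate `os.path.exists` test and the actual write.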

### 5.6 I/O Operations on Strings

In [21]:
import io
s = io.StringIO()
s.write('Hello world\n')
print('This is a test', file=s)

print(s.getvalue())

Hello world
This is a test



In [23]:
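# Alternatively, use BytesIO for binary data
import io
b = io.BytesIO()
b.write(b'Hello\n')
# print() always writes str, so printing to a BytesIO raises TypeError:
# print(b'This is a test', file=b)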

print(b.getvalue())

b'Hello\n'


### 5.7 Working with Compressed Files (gzip, bz2)


In [42]:
import gzip
import bz2

with gzip.open('./samples/somefile.gz', 'wt') as f:
    f.write('Hello, gzip')

with gzip.open('./samples/somefile.gz', 'rt') as f:
    print(f.read())
    
with bz2.open('./samples/somefile.bz2', 'wt') as f:
    f.write('Hello, bz2')

with bz2.open('./samples/somefile.bz2', 'rt') as f:
    print(f.read())

Hello, gzip
Hello, bz2


### 5.8 Iterating Over Fixed-Sized Records

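This is the same idea as the earlier `iter` example: read chunks of a fixed size and process each one.

Note that the first argument to `iter` is a callable and the second is the sentinel value that ends the iteration.

The callable can be built with `partial`. In text mode `read(10)` counts characters and returns `''` at end of file; for true fixed-size binary records, open the file in 'rb' mode and use `b''` as the sentinel.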
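
A minimal sketch of the binary variant follows. The record size of 4 and the in-memory `io.BytesIO` stream, standing in for a file opened with 'rb', are assumptions made for illustration.

```python
import io
from functools import partial

RECORD_SIZE = 4  # assumed record length for this illustration

stream = io.BytesIO(b'AAAABBBBCCCCDD')  # stand-in for a file opened with 'rb'

# iter(callable, sentinel) keeps calling the callable until it returns
# the sentinel. In binary mode read() returns b'' at end of file.
records = list(iter(partial(stream.read, RECORD_SIZE), b''))
print(records)
```

The last record is shorter than `RECORD_SIZE` when the data length is not a multiple of it, so real code may need to check for a truncated tail.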

In [27]:
from functools import partial
with open('somefile.txt', 'rt') as f:
    records = iter(partial(f.read, 10), '')
    for r in records:
        print(r)

hello worl
d
this is 
a test
of 
iterating 
over lines
 with a hi
story
pyth
on is fun



### 5.9 Reading Binary Data into a Mutable Buffer

This reads the content into a preallocated buffer; a `bytearray` works for that.
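
`readinto` also returns the number of bytes it stored, which allows one buffer to be reused across reads. The sketch below illustrates that; the helper name `read_chunks`, the chunk size, and the `io.BytesIO` stream are hypothetical choices for this example.

```python
import io

def read_chunks(stream, chunk_size=4):
    """Read a binary stream through one reusable buffer.

    Returns the total byte count and a list with a copy of each chunk.
    """
    buf = bytearray(chunk_size)  # allocated once, reused for every read
    view = memoryview(buf)       # slicing a memoryview does not copy
    total = 0
    chunks = []
    while True:
        n = stream.readinto(buf)  # number of bytes actually placed in buf
        if not n:                 # 0 means end of stream
            break
        chunks.append(bytes(view[:n]))  # only the first n bytes are valid
        total += n
    return total, chunks

total, chunks = read_chunks(io.BytesIO(b'Hello world'))
print(total, chunks)
```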

In [46]:
import os.path

def read_into_buffer(filename):
    buf = bytearray(os.path.getsize(filename))
    with open(filename, 'rb') as f:
        f.readinto(buf)
    return buf

buf = read_into_buffer('./samples/somefile.bin')
print(buf)
buf[0:5] = b'Hallo'
print(buf)

bytearray(b'Hello world')
bytearray(b'Hallo world')


### 5.10 Memory Mapping Binary Files

Map a binary file into memory so that its content can be modified directly, as if it were a mutable byte array.

In [52]:
import os
import mmap

def memory_map(filename, access=mmap.ACCESS_WRITE):
    size = os.path.getsize(filename)
    fd = os.open(filename, os.O_RDWR)
    try:
        return mmap.mmap(fd, size, access=access)
    finally:
        # mmap keeps its own duplicate of the descriptor, so close ours
        os.close(fd)


# Create the file
size = 1000    # a 1000-byte (about 1 KB) file
with open('./samples/data', 'wb') as f:
    f.seek(size-1)
    f.write(b'\x00')
    
# Memory-map the file
with memory_map('./samples/data') as m:
    print(len(m))
    m[0:11] = b'Hello World'

# Read the file back
with open('./samples/data', 'rb') as f:
    print(f.read(11))

1000
b'Hello World'
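
As a variant, a regular file object's `fileno()` can be passed to `mmap.mmap`, and a length of 0 maps the whole file. The sketch below assumes a 16-byte scratch file in a temporary directory; those names are illustrative only.

```python
import mmap
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data')  # hypothetical scratch file

    # Create a 16-byte file filled with zero bytes.
    with open(path, 'wb') as f:
        f.write(b'\x00' * 16)

    with open(path, 'r+b') as f:
        # Length 0 maps the entire file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as m:
            mapped_len = len(m)
            m[0:5] = b'Hello'      # slice assignment writes through to the file
            found = m.find(b'llo')  # mmap objects also support searching

    with open(path, 'rb') as f:
        data = f.read()

print(mapped_len, found, data)
```

Slice assignment must keep the length unchanged, because a mapping cannot grow or shrink the file this way.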


### 5.11 Working with File Paths

In [60]:
import os

path = '/Users/yangdong/Study/Github/notebooks/PythonCookbook/samples/sample.txt'

print('basename', os.path.basename(path))
print('dirname', os.path.dirname(path))
print('join', os.path.join('tmp', 'data', os.path.basename(path)))

path = '~/Study/Github/notebooks/PythonCookbook/samples/sample.txt'
print('expanduser', os.path.expanduser(path))
print('splitext', os.path.splitext(path))

basename sample.txt
dirname /Users/yangdong/Study/Github/notebooks/PythonCookbook/samples
join tmp/data/sample.txt
expanduser /Users/yangdong/Study/Github/notebooks/PythonCookbook/samples/sample.txt
splitext ('~/Study/Github/notebooks/PythonCookbook/samples/sample', '.txt')


### 5.12 Testing Files

In [66]:
import os
import time

assert os.path.exists('/etc/passwd')
assert os.path.isfile('/etc/passwd')
assert not os.path.isdir('/etc/passwd')
# Machine-specific: this path is a symlink on the author's system
assert os.path.islink('/usr/local/bin/python2')

# Size in bytes, then the last modification time
print(os.path.getsize('/etc/passwd'))
print(time.ctime(os.path.getmtime('/etc/passwd')))

6774
Tue Oct  3 08:36:26 2017


### 5.13 Listing Directory Contents

The `glob` module matches file names against a pattern (`fnmatch.fnmatch` is an alternative).
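
The `fnmatch` alternative can be sketched as follows. It filters a list of names that is already in memory, whereas `glob` reads the directory itself. The `names` list is sample data made up for this illustration.

```python
import fnmatch

# Sample names, as os.listdir() might return them.
names = [
    '1. Data Structures and Algorithms.ipynb',
    'somefile.txt',
    'sample.dat',
    '.DS_Store',
    '5. Files and IO.ipynb',
]

# fnmatch.fnmatch() tests a single name against a shell-style pattern.
notebooks = [n for n in names if fnmatch.fnmatch(n, '*.ipynb')]

# fnmatch.filter() applies the pattern to a whole list at once.
text_files = fnmatch.filter(names, '*.txt')

print(notebooks)
print(text_files)
```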

In [74]:
import os

names = os.listdir('.')
# print(names)

# All regular files in the current directory:
all_files = [name for name in os.listdir('.') if os.path.isfile(os.path.join('.', name))]
print('\n'.join(all_files))

# All subdirectories of the current directory
all_dirs = [name for name in os.listdir('.') if os.path.isdir(os.path.join('.', name))]
print(all_dirs)

# Search for files matching a pattern
import glob
all_ipynb = glob.glob('./*.ipynb')
print(all_ipynb)

.DS_Store
somefile.txt
1. Data Structures and Algorithms.ipynb
3. Numbers Dates and Times.ipynb
4. Iterators and Generators.ipynb
sample.dat
5. Files and IO.ipynb
2. Strings and Text.ipynb
['www', 'samples', '.ipynb_checkpoints']
['./1. Data Structures and Algorithms.ipynb', './3. Numbers Dates and Times.ipynb', './4. Iterators and Generators.ipynb', './5. Files and IO.ipynb', './2. Strings and Text.ipynb']


### 5.14 File Encoding

In [98]:
samples_path = './samples/'
with open(samples_path + 'encoding.txt', 'w', encoding='utf-16') as f:
    print(f)

<_io.TextIOWrapper name='./samples/encoding.txt' mode='w' encoding='utf-16'>


In [104]:
# See http://graphemica.com/%E4%B8%AD

def print_code(s, encoding, formats):
    # Encode s, then print the raw bytes followed by each byte in the given format
    b = s.encode(encoding)
    print(b)
    for byte in b:
        print(format(byte, formats), end=',')
    print()
print_code('中', 'utf-8', '02x')
print_code('中', 'utf-16', '02x')
print_code('中', 'utf-32', '02x')

b'\xe4\xb8\xad'
e4,b8,ad,
b'\xff\xfe-N'
ff,fe,2d,4e,
b'\xff\xfe\x00\x00-N\x00\x00'
ff,fe,00,00,2d,4e,00,00,
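
The leading `ff,fe` in the UTF-16 and UTF-32 output above is the byte order mark (BOM) that the generic `utf-16` and `utf-32` codecs prepend. A short sketch, using only standard codec names, shows that the endian-specific codecs omit it:

```python
import codecs

ch = '中'  # U+4E2D

# Endian-specific codecs write no BOM.
le = ch.encode('utf-16-le')
be = ch.encode('utf-16-be')

# The generic codec uses the machine's native byte order and prepends a BOM.
with_bom = ch.encode('utf-16')

print(le, be)
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
```

The bytes `2d 4e` are the ASCII codes for `-` and `N`, which is why the little-endian form prints as `b'-N'`.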
