<a name='filechap'></a>
# Files and file system utilities

## Persistence

Most of the programs so far encountered are "transient" in that they run for a short time and produce some output, but when they end, their data disappears. If you run the program again, it starts with a clean slate.

Other programs are [persistent](glossary.ipynb#persistent): they run for long periods of time
(or all the time) and keep at least some of their data in permanent storage (a hard drive, for example).  If persistent programs shut down and restart, they pick up where they left off.

Examples of persistent programs are operating systems, which run pretty much whenever a computer is on, and web servers, which run all the time, waiting for requests to come in on the network.

One of the simplest ways for programs to maintain their data is by reading and writing files. This notebook covers examples of how to read and write several file formats.

## Text files

A text file is a sequence of characters stored on a permanent medium like a hard drive, flash memory, or CD-ROM.

### Reading 

Use the built-in `open()` function to create an object for reading data from a text file. The object can be used as an iterator to process the lines of the file in order. For example:

In [None]:
fh = open('aux/memo.txt', 'r')
for line in fh:
    print(line.rstrip())

The optional mode argument `'r'` indicates that the file is to be opened for reading (the default).

### Line breaks

Iterating through the file object does not strip the line break at the end of each line, which is why the `rstrip()` method was applied to each line (otherwise, an empty line would be printed after each line). Generally speaking, line breaks are indicated by special characters and are operating-system dependent. Some systems (Linux, OS X, FreeBSD, AIX, Xenix, BeOS, and others) use the "line feed", represented by `\n`. Others (Apple II family, Mac OS up to version 9 and OS-9) use a "carriage return", represented by `\r`. Still others (Windows, DOS, and others) use a carriage return followed by a line feed (`\r\n`). If you move files between different systems, these inconsistencies might cause problems.

For most systems, there are applications to convert from one format to another. You can find them (and read more about this issue) at [Newline](http://en.wikipedia.org/wiki/Newline). Or, of course, you could write one yourself.
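As a minimal sketch of writing such a converter yourself, the function below rewrites Windows and old Mac OS line breaks as line feeds. The name `normalize_newlines` is made up for this illustration. Note that Python's text mode already performs a similar translation when reading, so a helper like this matters mostly for text obtained some other way.

```python
def normalize_newlines(text):
    """Convert CRLF and CR line breaks to LF."""
    # Replace the two-character sequence first so that it does not
    # turn into two separate line feeds.
    return text.replace('\r\n', '\n').replace('\r', '\n')

mixed = 'first\r\nsecond\rthird\n'
print(repr(normalize_newlines(mixed)))
```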

Alternatively, the file object's `readlines()` method reads all of the lines in a file and returns them as a list of strings, each still ending with its line break:

In [None]:
print(open('aux/memo.txt', 'r').readlines())

`readlines()` reads the entire contents of a file into memory. Very large files can instead be read in fixed-size "chunks":

In [None]:
chunk = 128  # characters per read (the file is opened in text mode)
fh = open('aux/memo.txt', 'r')
while True:
    lines = fh.read(chunk)
    if not lines:
        break
    print(lines)
    print('*** end of chunk ***')

### Writing

To write a file, open it in write mode by setting the mode argument to `'w'`:

In [None]:
fout = open('aux/output.txt', 'w')
print(fout)

If the file already exists, opening it in write mode clears out the old data and starts fresh, so be careful! If the file doesn’t exist, a new one is created.

The ``write()`` method puts data into the file.

In [None]:
line1 = "This here's the wattle,\n"
fout.write(line1)

The file object keeps track of where it is, so if you call ``write()`` again, it adds the new data to the end of the file.

In [None]:
line2 = "the emblem of our land.\n"
fout.write(line2)

Unlike `print`, the `write()` method does not add a line break, so any intended line breaks must be included explicitly in the string.

When you are done writing, you must close the file.

In [None]:
fout.close()
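Forgetting to close a file is an easy mistake to make. One common way to avoid it, sketched below, is the `with` statement, which closes the file when the block ends, even if an exception is raised inside it. The scratch directory and the filename `wattle.txt` are assumptions made so the example does not overwrite `aux/output.txt`.

```python
import os
import tempfile

# Hypothetical scratch location for this illustration
path = os.path.join(tempfile.mkdtemp(), 'wattle.txt')

with open(path, 'w') as fout:
    fout.write("This here's the wattle,\n")
    fout.write("the emblem of our land.\n")
# fout has been closed at this point

with open(path, 'r') as fin:
    lines = [line.rstrip() for line in fin]

print(fout.closed)
print(lines)
```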

<a name='format'></a>
## String formatting

The argument of ``write()`` must be a string, meaning that any non-string value must first be converted to a string before writing it to a file. The easiest way to do that is with ``str``:

In [None]:
f = open('aux/output.txt', 'w')
x = 52
f.write(str(x))
f.close()  # closing the file makes sure the data is written out

### format method

An alternative is to use the [format method](glossary.ipynb#format_method), ``.format()``.  The string on which `.format()` is called can contain literal text or replacement fields delimited by braces `{}`. Each replacement field contains either the numeric index of a positional argument, or the name of a keyword argument. The method returns a copy of the string in which each replacement field is replaced with the string value of the corresponding argument.

For example, the replacement field `'{0:d}'` means that the first argument to `.format()` should be formatted as an integer (`d` stands for “decimal”):

In [None]:
camels = 42
'{0:d}'.format(camels)

The result is the string `'42'`, which is not to be confused with the integer value ``42``.

A replacement field can appear anywhere in the string, so you can embed a value in a sentence:

In [None]:
camels = 42
'I have spotted {0:d} camels.'.format(camels)

`format()` is a versatile method. The following examples (using `'d'` to format an integer, `'g'` to format a floating-point number, and `'s'` to format a string) all return the same string:

In [None]:
'In {0:d} years I have spotted {1:g} {2:s}.'.format(3, 0.1, 'camels')

In [None]:
'In {:d} years I have spotted {:g} {:s}.'.format(3, 0.1, 'camels')

In [None]:
'In {three:d} years I have spotted {pt1:g} {camels:s}.'.format(three=3, pt1=0.1, camels='camels')

If implicit (unnumbered) replacement fields are used, `.format()` must be given at least as many arguments as there are replacement fields:

In [None]:
'{:d} {:d} {:d}'.format(1, 2)

The format code of a replacement field must be compatible with the type of the corresponding argument:

In [None]:
'{:d}'.format('dollars')
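The two failing calls above raise different exceptions. The sketch below wraps `.format()` in a small helper, `try_format` (a name invented for this illustration), that reports which exception occurred instead of stopping the program.

```python
def try_format(template, *args):
    """Return the formatted string, or a short description of the error."""
    try:
        return template.format(*args)
    except (IndexError, ValueError) as err:
        # IndexError: too few positional arguments
        # ValueError: format code does not suit the argument's type
        return '{}: {}'.format(type(err).__name__, err)

print(try_format('{:d} {:d} {:d}', 1, 2, 3))
print(try_format('{:d} {:d} {:d}', 1, 2))
print(try_format('{:d}', 'dollars'))
```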

Older code may use the `'%'` [format operator](glossary.ipynb#format_operator), but new code should use the `.format()` method, when possible.

<a name='os.path'></a><a name='os'></a>
## Filenames and paths

Files are organized into [directories](glossary.ipynb#directory) (also called
“folders”).  The collection of directories and files is referred to as the computer's file system.  The `os` module provides functions for working with the file system (“os” stands for “operating system”). `os.getcwd()` returns the name of the current directory:

In [None]:
import os
cwd = os.getcwd()
print(cwd)

``cwd`` stands for “current working directory.”

A string like ``cwd`` that identifies a file is called a [path](glossary.ipynb#path). A [relative path](glossary.ipynb#relative_path) starts from the current directory; an [absolute path](glossary.ipynb#absolute_path) starts from the topmost directory in the file system.

The paths encountered so far, such as `aux/memo.txt`, are relative to the current directory. To find the absolute path to a file, you can use ``os.path.abspath``:

In [None]:
os.path.abspath('aux/memo.txt')

The `os.path` module provides platform-independent functions for working with file names. To test whether or not a file (or directory) exists, use ``os.path.exists``:

In [None]:
os.path.exists('aux/memo.txt')

If it exists, ``os.path.isdir`` checks whether it’s a directory:

In [None]:
os.path.isdir('aux/memo.txt')

In [None]:
os.path.isdir('music')

Similarly, ``os.path.isfile`` checks whether it’s a file.

``os.listdir`` returns a list of the files (and other directories) in the given directory:

In [None]:
os.listdir(cwd)

### Path parsing

Path parsing depends on a few variables defined in `os`:

- `os.sep` - The separator between portions of the path (e.g., “/” or “\”).
- `os.extsep` - The separator between a filename and the file “extension” (e.g., “.”).
- `os.pardir` - The path component that means traverse the directory tree up one level (e.g., “..”).
- `os.curdir` - The path component that refers to the current directory (e.g., “.”).
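The following sketch prints these four values on the current platform and uses them to build a deliberately roundabout path. It then tidies the path with `os.path.normpath()`, a standard function that is not otherwise covered in this notebook.

```python
import os

print(repr(os.sep), repr(os.extsep), repr(os.pardir), repr(os.curdir))

# Build 'memo.txt' and a path that steps into aux, back out, and in again
name = 'memo' + os.extsep + 'txt'
roundabout = os.path.join(os.curdir, 'aux', os.pardir, 'aux', name)

# normpath collapses the redundant '.' and '..' components
tidy = os.path.normpath(roundabout)
print(roundabout)
print(tidy)
```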

`os.path.split()` breaks the path into 2 separate parts and returns the tuple. The second element is the last component of the path, and the first element is everything that comes before it.

In [None]:
filename = os.path.abspath('aux/memo.txt')
os.path.split(filename)

`os.path.basename()` returns a value equivalent to the second part of the `os.path.split()` value.

In [None]:
os.path.basename(filename)

`os.path.dirname()` returns a value equivalent to the first part of the `os.path.split()` value.

In [None]:
os.path.dirname(filename)

`os.path.splitext()` works like `os.path.split()` but divides the path on the extension separator, rather than the directory separator:

In [None]:
os.path.splitext(filename)
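The parsing functions can be combined. The sketch below uses a made-up relative path (the `archive` directory is hypothetical) so that the results do not depend on where the notebook is run, and also shows that `os.path.splitext()` only splits off the last extension.

```python
import os

path = os.path.join('aux', 'archive', 'memo.txt')

head, tail = os.path.split(path)     # directory part, final component
root, ext = os.path.splitext(tail)   # name without extension, extension

print(head, tail, root, ext)

# Only the last extension is split off
print(os.path.splitext('backup.tar.gz'))
```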

### Path building

To combine several path components into a single value, use `os.path.join()`:

In [None]:
filename = os.path.join(os.getcwd(), 'aux', 'memo.txt')
filename

It’s also easy to work with paths that include “variable” components that can be expanded automatically. For example, `os.path.expanduser()` converts the tilde (`~`) character to a user’s home directory.

In [None]:
os.path.expanduser('~')

`os.path.expandvars()` is more general, and expands any shell environment variables present in the path.

In [None]:
os.environ['MYVAR'] = 'VALUE'
print(os.path.expandvars('/path/to/$MYVAR'))

### Path globbing

"Globbing" is useful in situations where your program needs to look for a list of files on the file system with names matching a pattern. If you need a list of filenames that all have a certain extension, prefix, or any common string in the middle, use `glob` instead of writing code to scan the directory contents yourself.

The pattern rules for `glob` are not regular expressions. Instead, they follow standard Unix path expansion rules. There are only a few special characters: two different wildcards, and character ranges are supported. The pattern rules are applied to segments of the filename (stopping at the path separator, `/`). Paths in the pattern can be relative or absolute. Shell variable names and tilde (`~`) are not expanded.

#### Wildcards

An asterisk (`*`) matches zero or more characters in a segment of a name. For example:

In [None]:
import glob
glob.glob('*.ipynb')

#### Single Character Wildcard

The other wildcard character supported is the question mark (`?`). It matches any single character in that position in the name. For example,

In [None]:
glob.glob('numpy_?.ipynb')   

matches all of the filenames which begin with “numpy_”, have one more character of any type, then end with “.ipynb”.

#### Character Ranges

When you need to match a specific character, use a character range instead of a question mark. For example, to find all of the files which have a digit in the name before the extension:

In [None]:
glob.glob('numpy_[0-9].ipynb')

The character range `[0-9]` matches any single digit. The range is ordered based on the character code for each letter/digit, and the dash indicates an unbroken range of sequential characters. The same range value could be written `[0123456789]`.
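To compare the three kinds of pattern side by side, the sketch below creates a few empty files in a temporary directory and globs them. The filenames and the helper `names` are assumptions chosen for this illustration.

```python
import glob
import os
import tempfile

# Create some empty files to match against
workdir = tempfile.mkdtemp()
for name in ['numpy_1.ipynb', 'numpy_2.ipynb', 'numpy_a.ipynb',
             'numpy_10.ipynb', 'notes.txt']:
    open(os.path.join(workdir, name), 'w').close()

def names(pattern):
    """Return the sorted base names in workdir that match pattern."""
    matches = glob.glob(os.path.join(workdir, pattern))
    return sorted(os.path.basename(m) for m in matches)

print(names('*.ipynb'))            # any name ending in .ipynb
print(names('numpy_?.ipynb'))      # exactly one character after numpy_
print(names('numpy_[0-9].ipynb'))  # exactly one digit after numpy_
```

Note that `numpy_10.ipynb` is matched only by the asterisk pattern, because `?` and `[0-9]` each stand for exactly one character.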

## Catching exceptions

A lot of things can go wrong when you try to read and write files. If you try to open a file that doesn’t exist, you get a ``FileNotFoundError`` (a subclass of ``OSError``, for which ``IOError`` is an alias in Python 3):

In [None]:
fin = open('bad_file')

If you don’t have permission to access a file, you get a ``PermissionError``:

In [None]:
fout = open('/etc/passwd', 'w')

And if you try to open a directory for reading, you get an ``IsADirectoryError``:

In [None]:
fin = open('/home')

To avoid these errors, you could use functions like ``os.path.exists`` and ``os.path.isfile``, but it would take a lot of time and code to check all the possibilities (if
“``Errno 21``” is any indication, there are at least 21 things that can go wrong).

It is better to go ahead and try—and deal with problems if they happen—which is exactly what the ``try`` statement does. The syntax is similar to an ``if`` statement:

In [None]:
try:
    fin = open('bad_file')
    for line in fin:
        print(line)
    fin.close()
except OSError:
    print('Something went wrong.')

Python starts by executing the ``try`` clause. If all goes well, it skips the ``except`` clause and proceeds. If an exception occurs, it jumps out of the ``try`` clause and executes the ``except`` clause.

Handling an exception with a ``try`` statement is called [catching](glossary.ipynb#catch) an exception. In this example, the ``except`` clause prints an error message that is not very helpful. In general, catching an exception gives you a chance to fix the problem, or try again, or at least end the program gracefully.
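A more helpful handler names the exception it expects and reports what went wrong. The sketch below is one possible approach, not the only one: `read_lines` is a hypothetical helper, and the temporary directory stands in for real data so the example can run anywhere.

```python
import os
import tempfile

def read_lines(path):
    """Return the lines of a text file, or None if it cannot be read."""
    try:
        with open(path) as fin:
            return [line.rstrip() for line in fin]
    except FileNotFoundError:
        print('No such file:', path)
    except OSError as err:
        # Any other I/O failure, such as a permission problem
        # or a path that names a directory
        print('Could not read {}: {}'.format(path, err))
    return None

scratch = tempfile.mkdtemp()
good = os.path.join(scratch, 'memo.txt')
with open(good, 'w') as fout:
    fout.write('line one\nline two\n')

print(read_lines(good))
print(read_lines(os.path.join(scratch, 'bad_file')))  # does not exist
print(read_lines(scratch))                            # a directory
```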

<div style="background-color: #FFFFFF; margin-right: 10px; padding-bottom: 8px; padding-left: 8px; padding-right: 8px; padding-top: 8px; border: 2px solid black;">Write a function called ``sed`` that takes as arguments a pattern string, a replacement string, and two filenames; it should read the first file and write the contents into the second file (creating it if necessary). If the pattern string appears anywhere in the file, it should be replaced with the replacement string.</div>

If an error occurs while opening, reading, writing or closing files, your program should catch the exception, print an error message, and exit.

## CSV files

CSV stands for "comma separated values".  A CSV file is a file in which fields are separated by a comma.  CSV files are useful for working with data exported from spreadsheets into text files formatted with fields and records.  The `csv` module provides convenient functions for working with CSV files.

### Reading

Use `reader()` to create an object for reading data from a CSV file. The reader can be used as an iterator to process the rows of the file in order. For example:

In [None]:
import csv

# newline='' lets the csv module handle line endings itself
f = open('aux/file.csv', 'rt', newline='')
try:
    reader = csv.reader(f)
    for row in reader:
        print(row)
finally:
    f.close()

The first argument to `reader()` is the source of text lines.  Other optional arguments can be given to control how the input data is parsed.

As the text lines are read, each row of the input data is parsed and converted to a list of strings.
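As an example of such an optional argument, the sketch below parses pipe-separated text by passing `delimiter='|'`. The sample data is invented, and `io.StringIO` stands in for an open file so the example needs nothing on disk.

```python
import csv
import io

# Invented sample data; the quoted field contains the delimiter
text = 'name|note\nAda|"first|programmer"\n'

reader = csv.reader(io.StringIO(text), delimiter='|')
rows = list(reader)
for row in rows:
    print(row)
```

The quoted field keeps its embedded `|` because the reader treats the double quotes as delimiters of a single field.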

### Writing

Writing CSV files is just as easy as reading them. Use `writer()` to create an object for writing, then iterate over the rows, using `writerow()` to write them.

In [None]:
f = open('aux/output.csv', 'wt', newline='')  # newline='' avoids extra blank lines on some platforms
try:
    writer = csv.writer(f)
    writer.writerow( ('Title 1', 'Title 2', 'Title 3') )
    for i in range(10):
        writer.writerow( (i+1, chr(ord('a') + i), '08/%02d/07' % (i+1)) )
finally:
    f.close()

print(open('aux/output.csv', 'rt').read())

The output does not look exactly like the exported data used in the reader example.  The default quoting behavior is different for the writer, so the string column is not quoted. That is easy to change by adding a quoting argument to quote non-numeric values:

In [None]:
f = open('aux/output.csv', 'wt', newline='')  # newline='' avoids extra blank lines on some platforms
try:
    writer = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC)
    writer.writerow( ('Title 1', 'Title 2', 'Title 3') )
    for i in range(10):
        writer.writerow( (i+1, chr(ord('a') + i), '08/%02d/07' % (i+1)) )
finally:
    f.close()

print(open('aux/output.csv', 'rt').read())

### Quoting

The `csv` module defines its quoting options as constants. The four original options are listed below (Python 3.12 added `QUOTE_STRINGS` and `QUOTE_NOTNULL`).

- `QUOTE_ALL`

  Quote everything, regardless of type.

- `QUOTE_MINIMAL`

  Quote fields with special characters (anything that would confuse a parser configured with the same dialect and options). This is the default.

- `QUOTE_NONNUMERIC`

  Quote all fields that are not integers or floats. When used with the reader, input fields that are not quoted are converted to floats.

- `QUOTE_NONE`

  Do not quote anything on output. When used with the reader, quote characters are included in the field values (normally, they are treated as delimiters and stripped).
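The options can be compared by writing the same row with each of them. In this sketch, `write_row` is a hypothetical helper that writes to an in-memory buffer. `QUOTE_NONE` is given an `escapechar` because the row contains a comma that would otherwise have to be quoted.

```python
import csv
import io

def write_row(row, **options):
    """Write one row with the given writer options and return the text."""
    buf = io.StringIO()
    csv.writer(buf, **options).writerow(row)
    return buf.getvalue().rstrip('\r\n')

row = [1, 'a b', 'x,y']
for name in ('QUOTE_ALL', 'QUOTE_MINIMAL', 'QUOTE_NONNUMERIC'):
    print(name, write_row(row, quoting=getattr(csv, name)))
print('QUOTE_NONE', write_row(row, quoting=csv.QUOTE_NONE, escapechar='\\'))

# With the reader, QUOTE_NONNUMERIC turns unquoted fields into floats
parsed = next(csv.reader(['1,"a b",2.5'], quoting=csv.QUOTE_NONNUMERIC))
print(parsed)
```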

## Pickling

The `pickle` module implements an algorithm for turning an arbitrary Python object into a series of bytes. This process is also called "serializing" the object. The byte stream representing the object can then be transmitted or stored, and later reconstructed to create a new object with the same characteristics.

### Importing

In Python 2, it was common to first try to import `cPickle` under the alias `pickle` and to fall back on the pure-Python `pickle` module if that import failed. In Python 3 this is no longer necessary: the `pickle` module uses its faster C implementation automatically when it is available and the portable Python implementation otherwise, so a plain import is enough.

In [None]:
import pickle

### Encoding and Decoding Data in Strings

``pickle.dumps`` takes an object as a parameter and returns its serialized representation as a ``bytes`` object (``dumps`` is short for “dump string”, a name that predates Python 3):

In [None]:
import pickle
t = [1, 2, 3]
pickle.dumps(t)

The format isn’t obvious to human readers; it is meant to be easy for ``pickle`` to interpret. 

``pickle.loads`` (“load string”) reconstitutes the object:

In [None]:
t1 = [1, 2, 3]
s = pickle.dumps(t1)
t2 = pickle.loads(s)
print(t2)

Although the new object has the same value as the old, it is not (in general) the same object:

In [None]:
t1 == t2

In [None]:
t1 is t2

In other words, pickling and then unpickling has the same effect as copying the object.
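More precisely, the result behaves like a deep copy: nested objects are rebuilt as well, so changing them in the copy leaves the original untouched. A small sketch with an invented nested list:

```python
import pickle

nested = [[1, 2], {'k': [3]}]
clone = pickle.loads(pickle.dumps(nested))

print(clone == nested)         # same value
print(clone is nested)         # different outer object
print(clone[0] is nested[0])   # different inner object too

# Changing the copy's inner list does not affect the original
clone[0].append(99)
print(nested[0], clone[0])
```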

### Working with file (like) objects

In addition to `dumps()` and `loads()`, pickle provides convenience functions for working with file-like streams. It is possible to write multiple objects to a stream, and then read them from the stream without knowing in advance how many objects are written or how big they are.

In [None]:
import pickle

data = [list(range(5))]
letters = 'abcdefghijklmnopqrstuvwxyz'
data.append(letters)
data.append(dict(zip(range(26), letters)))

fout = open('aux/output.pkl', 'wb')

# Write to the stream
for o in data:
    pickle.dump(o, fout)
    fout.flush()
fout.close()

# Set up a read-able stream
fin = open('aux/output.pkl', 'rb')

# Read the data
while True:
    try:
        o = pickle.load(fin)
        print(o)
    except EOFError:
        break
fin.close()