# Files

# Paths

* Files are identified by paths.
* Depending on the platform, paths use different separators. Examples:
  * `/usr/bin/python` (Unix)
  * `C:\Program Files\Python34\python.exe` (Windows)
* The `os.path` and `os` modules help to manage paths in a platform-independent way.

## Computing paths

Get the current working directory:

In [1]:
import os
os.getcwd()

'/home/roskakori/workspace/talks/python_for_testers'

To compute a path from different parts (folders and optional filename), use `os.path.join()` instead of string concatenation because it automatically takes care of the different path separators: slash (`/`) under Unix, backslash (`\`) under Windows.

In [2]:
import os.path
text_path = os.path.join(os.getcwd(), 'examples', 'der_rote_komet.txt')
text_path

'/home/roskakori/workspace/talks/python_for_testers/examples/der_rote_komet.txt'
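
To see the effect of the different separators without switching machines, the following sketch calls the two platform-specific implementations behind `os.path` directly: `posixpath` (Unix) and `ntpath` (Windows). The relative path parts are made up for illustration; in everyday code you would simply keep using `os.path.join()`.

```python
import ntpath      # Windows flavor of os.path
import posixpath   # Unix flavor of os.path

# Illustrative path parts, not taken from a real folder.
parts = ('talks', 'examples', 'der_rote_komet.txt')

unix_style_path = posixpath.join(*parts)
windows_style_path = ntpath.join(*parts)

print(unix_style_path)     # talks/examples/der_rote_komet.txt
print(windows_style_path)  # talks\examples\der_rote_komet.txt
```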

Paths obtained from outside (command line arguments, config files, etc.) can contain references to environment variables. Use `os.path.expandvars()` to resolve them:

In [3]:
config_path = os.path.join('$HOME', '.some.cfg')
print(config_path)
print(os.path.expandvars(config_path))

$HOME/.some.cfg
/home/roskakori/.some.cfg
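
One detail worth knowing: `os.path.expandvars()` leaves references to undefined variables unchanged instead of raising an error. The sketch below shows this with two hypothetical variable names (`DEMO_CONFIG_DIR` and `DEMO_UNDEFINED_VARIABLE`) that exist only for this example.

```python
import os

# Hypothetical variables, set or removed here only to make the example reproducible.
os.environ['DEMO_CONFIG_DIR'] = os.path.join('some', 'folder')
os.environ.pop('DEMO_UNDEFINED_VARIABLE', None)

known = os.path.expandvars(os.path.join('$DEMO_CONFIG_DIR', '.some.cfg'))
unknown = os.path.expandvars(os.path.join('$DEMO_UNDEFINED_VARIABLE', '.some.cfg'))

print(known)    # the variable has been replaced by its value
print(unknown)  # still starts with $DEMO_UNDEFINED_VARIABLE
```

Code that depends on the variable being set may therefore want to check whether the result still contains a `$`.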


## Parts of a path

In [4]:
os.path.dirname(text_path)  # folder part

'/home/roskakori/workspace/talks/python_for_testers/examples'

In [5]:
os.path.basename(text_path)  # name part

'der_rote_komet.txt'

In [6]:
os.path.split(text_path)  # tuple with folder and name

('/home/roskakori/workspace/talks/python_for_testers/examples',
 'der_rote_komet.txt')

In [7]:
text_name = os.path.basename(text_path)
os.path.splitext(text_name)  # tuple with name and suffix

('der_rote_komet', '.txt')
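
The functions above can be combined, for example to derive the path of a related file that has the same name but a different suffix. The helper `with_suffix()` is a hypothetical name chosen for this sketch, not part of `os.path`.

```python
import os.path

def with_suffix(path, new_suffix):
    """Return path with its suffix replaced by new_suffix (for example '.bak')."""
    base_path, _ = os.path.splitext(path)
    return base_path + new_suffix

source_path = os.path.join('examples', 'der_rote_komet.txt')
backup_path = with_suffix(source_path, '.bak')
print(backup_path)
```

Note that `os.path.splitext()` only splits off the last suffix, so `'archive.tar.gz'` becomes `('archive.tar', '.gz')`.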

# Text files

## Read a text line by line

In [8]:
with open(text_path, 'r', encoding='utf-8') as text_file:
    for line in text_file:
        print(line.rstrip('\n'))

"Siehst du die purpurne Röte, die in gerader Linie sich herab auf
die Erde senkt?" fragte Romulus Futurus in größter Aufregung seinen
Freund John Crofton, den berühmten Berichterstatter des "New York
Herald" in Berlin. "Bist du nun überzeugt, daß ich die Wahrheit
gesprochen habe? Noch kannst du den roten Kometen nicht erkennen,
und niemand wird imstande sein, ihn mit bloßem Auge zu sehen. Aber
jetzt gibst du zu, daß meine Diagnose richtig war?"
                            (Robert Heymann, 1909, "Der rote Komet")



(The full text is available from http://www.gutenberg.org/ebooks/37991.)

## Text files in Python

* Text files can be interpreted as a sequence of lines.
* The [`open()`](https://docs.python.org/3/library/functions.html#open) function takes a path and a mode with `'r'` for read.
* Always specify the `encoding` parameter; otherwise Python falls back to a platform-dependent default (the preferred encoding of the current locale), so the same code can work on one machine and fail on another.
* By default, all kinds of newlines are converted to `'\n'`; use the `newline` parameter to change this.
* The `with` clause takes care of calling `close()` on the file once the `with` block is finished. This even works in case of errors.
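
The newline handling can be observed with a small experiment. This sketch writes a file with Windows line endings as raw bytes and reads it twice: once with the default settings and once with `newline=''`, which turns the translation off. The file name and the variable names are made up for this example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as demo_folder:
    demo_path = os.path.join(demo_folder, 'windows_newlines.txt')
    # Write raw bytes so the file definitely contains '\r\n' line endings.
    with open(demo_path, 'wb') as demo_file:
        demo_file.write('Röte\r\nErde\r\n'.encode('utf-8'))

    # Default: all kinds of newlines are converted to '\n'.
    with open(demo_path, 'r', encoding='utf-8') as demo_file:
        translated_lines = list(demo_file)

    # newline='': lines are still split, but the endings are kept as they are.
    with open(demo_path, 'r', encoding='utf-8', newline='') as demo_file:
        untranslated_lines = list(demo_file)

print(translated_lines)    # ['Röte\n', 'Erde\n']
print(untranslated_lines)  # ['Röte\r\n', 'Erde\r\n']
```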

## Write a text file

In [9]:
import tempfile
tempdir = tempfile.gettempdir()
target_path = os.path.join(tempdir, 'some.txt')
print('Write', target_path)
with open(target_path, 'w', encoding='utf-8') as target_file:
    target_file.write('"Siehst du die purpurne Röte, die in gerader Linie \n')
    target_file.write('sich herab auf die Erde senkt?"\n')

Write /tmp/some.txt


Note: in production code, we would either remove the temporary file here or use `tempfile.TemporaryFile()` instead of `open()` to automatically remove the file on `close()`.
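
A third option, not used in the cell above, is `tempfile.TemporaryDirectory()`: it creates a fresh folder and removes it together with its contents when the `with` block ends. The following is a minimal sketch of that approach; the name `scratch_path` is chosen only for this example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as temp_folder:
    scratch_path = os.path.join(temp_folder, 'some.txt')
    with open(scratch_path, 'w', encoding='utf-8') as scratch_file:
        scratch_file.write('sich herab auf die Erde senkt?"\n')
    existed_inside_block = os.path.exists(scratch_path)

# The folder and the file in it are gone once the with block has finished.
exists_after_block = os.path.exists(scratch_path)
print(existed_inside_block, exists_after_block)  # True False
```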

# Codecs

## Encoding text

* Unicode defines 1,114,112 code points (U+0000 to U+10FFFF); only a fraction of them are assigned to characters.
* Storing every code point as a fixed-size number takes 4 bytes per character.
* Files are a sequence of bytes.
* The encoding specifies how to translate a character to one or more bytes (and vice versa).
* Some encodings support all Unicode characters, some only a part (e.g. 256 characters).
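
The translation in both directions can be tried out with `str.encode()` and `bytes.decode()`. The sketch below also shows what happens when bytes are decoded with a different encoding than the one used to write them; the sample word is taken from the text above.

```python
text = 'Röte'

encoded = text.encode('utf-8')     # characters -> bytes
decoded = encoded.decode('utf-8')  # bytes -> characters

print(len(text), len(encoded))  # 4 5 (the 'ö' takes two bytes in UTF-8)
print(decoded == text)          # True

# Decoding with the wrong encoding does not fail here but yields garbage.
garbled = encoded.decode('cp1252')
print(garbled)                  # RÃ¶te
```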

## Important encodings (single byte)

* ASCII provides 128 common characters, but no umlauts or Euro sign. On its own it is rarely sufficient for non-English text.
* Latin-1 / ISO-8859-1 extends ASCII and provides many western characters including umlauts but no Euro sign (because it is older than the Euro).
* **CP1252** ("code page 1252", "Windows ANSI") extends Latin-1 and (among others) adds the Euro sign.
* There are many other regional code pages, e.g. CP1251 for Cyrillic script.
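
The differences between these single byte encodings can be checked directly. The helper `can_encode()` is a hypothetical name used only in this sketch.

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

for encoding in ('ascii', 'latin-1', 'cp1252'):
    print(encoding, can_encode('ü', encoding), can_encode('€', encoding))
# ascii False False
# latin-1 True False
# cp1252 True True
```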

## Important encodings (multiple bytes)

* **UTF-8** can represent all Unicode characters. ASCII characters require a single byte, other characters require 2 to 4 bytes. It never produces 0 bytes (except for the NUL character itself) and therefore works well with programs written in C (where 0 marks the end of a string).
* UTF-16 and UTF-32 also can represent all Unicode characters but typically use a lot of 0 bytes.
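
To get a feeling for the sizes involved, this sketch counts the bytes a few sample characters need in UTF-8 and UTF-32. It uses `'utf-32-le'` instead of `'utf-32'` so that no byte order mark is added to each result; the sample characters are chosen for illustration.

```python
# An ASCII letter, an umlaut, the Euro sign and an emoji (U+1F600).
sample_characters = ['f', 'ü', '€', '\U0001F600']

utf_8_lengths = [len(character.encode('utf-8')) for character in sample_characters]
utf_32_lengths = [len(character.encode('utf-32-le')) for character in sample_characters]

print(utf_8_lengths)   # [1, 2, 3, 4]
print(utf_32_lengths)  # [4, 4, 4, 4]
```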

## Example encoding

Here are a few examples of the bytes that encoding the text "für 3 €" produces:

In [10]:
'für 3 €'.encode('cp1252')

b'f\xfcr 3 \x80'

In [11]:
'für 3 €'.encode('utf-8')

b'f\xc3\xbcr 3 \xe2\x82\xac'

In [12]:
'für 3 €'.encode('utf-32')

b'\xff\xfe\x00\x00f\x00\x00\x00\xfc\x00\x00\x00r\x00\x00\x00 \x00\x00\x003\x00\x00\x00 \x00\x00\x00\xac \x00\x00'

## Encoding errors

Attempting to encode a character that is not defined within a certain encoding results in a `UnicodeEncodeError`:

In [13]:
'für 3 €'.encode('ascii')

UnicodeEncodeError: 'ascii' codec can't encode character '\xfc' in position 1: ordinal not in range(128)
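
If losing information is acceptable, the `errors` parameter of `encode()` can turn the exception into a substitution. Whether this is appropriate depends on the use case; the sketch below only shows two of the available error handlers.

```python
text = 'für 3 €'

# Replace every character that cannot be encoded with a question mark.
replaced = text.encode('ascii', errors='replace')
# Replace it with a Python style escape sequence instead.
escaped = text.encode('ascii', errors='backslashreplace')

print(replaced)  # b'f?r 3 ?'
print(escaped)   # b'f\\xfcr 3 \\u20ac'
```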

# Working with files and folders

## Helpful modules

The [`os`](https://docs.python.org/3/library/os.html) and [`shutil`](https://docs.python.org/3/library/shutil.html) ("shell utilities") modules provide many functions to operate on files and folders. The [`tempfile`](https://docs.python.org/3/library/tempfile.html) module simplifies working with temporary files.

In [None]:
import os
import shutil
import tempfile

## File operations

Copy a file and preserve its timestamps and as many other attributes as the platform allows:

In [None]:
some_txt_path = os.path.join(tempdir, 'some.txt')
copy_of_some_txt_path = os.path.join(tempdir, 'copy_of_some.txt')

shutil.copy2(some_txt_path, copy_of_some_txt_path)

Remove a file:

In [None]:
os.remove(copy_of_some_txt_path)
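
The two operations can be tried out safely in a throwaway folder. This sketch is self-contained and does not depend on the `some.txt` written earlier; it sets an old modification time on the source (an arbitrary value chosen for this example) to make it visible that `shutil.copy2()` carries the timestamp over.

```python
import os
import shutil
import tempfile

OLD_TIMESTAMP = 1_000_000_000  # arbitrary point in time in 2001, for illustration

with tempfile.TemporaryDirectory() as work_folder:
    source_path = os.path.join(work_folder, 'some.txt')
    copy_path = os.path.join(work_folder, 'copy_of_some.txt')
    with open(source_path, 'w', encoding='utf-8') as source_file:
        source_file.write('some text\n')
    # Pretend the source file was last accessed and modified long ago.
    os.utime(source_path, (OLD_TIMESTAMP, OLD_TIMESTAMP))

    shutil.copy2(source_path, copy_path)
    copy_exists_after_copy = os.path.exists(copy_path)
    # Allow for file systems that store timestamps with a coarse resolution.
    timestamp_was_preserved = abs(os.path.getmtime(copy_path) - OLD_TIMESTAMP) < 2

    os.remove(copy_path)
    copy_exists_after_remove = os.path.exists(copy_path)

print(copy_exists_after_copy, timestamp_was_preserved, copy_exists_after_remove)
```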

## Folder operations

Create a folder and all necessary intermediate folders; if the folder already exists, do nothing:

In [None]:
nested_folder = os.path.join(tempdir, 'some', 'nested', 'folder')
os.makedirs(nested_folder, exist_ok=True)

Remove a folder and all files and folders in it:

In [None]:
folder_to_remove = os.path.join(tempdir, 'some')
shutil.rmtree(folder_to_remove)

Other useful functions:
* Copy a folder (including its contents): `shutil.copytree(source, target)`
* Move a file or folder (including its contents): `shutil.move(source, target)`
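
The folder operations above can be combined into one self-contained experiment. All folder and file names in this sketch are made up; everything happens inside a temporary folder that is removed at the end.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as work_folder:
    some_folder = os.path.join(work_folder, 'some')
    nested_folder = os.path.join(some_folder, 'nested', 'folder')
    os.makedirs(nested_folder, exist_ok=True)
    os.makedirs(nested_folder, exist_ok=True)  # already exists: does nothing
    with open(os.path.join(nested_folder, 'some.txt'), 'w', encoding='utf-8') as some_file:
        some_file.write('some text\n')

    # Copy the whole tree; by default the target folder must not exist yet.
    copied_folder = os.path.join(work_folder, 'copy_of_some')
    shutil.copytree(some_folder, copied_folder)

    # Move (here: rename) the copy including its contents.
    moved_folder = os.path.join(work_folder, 'moved')
    shutil.move(copied_folder, moved_folder)
    copied_folder_exists = os.path.exists(copied_folder)
    moved_file_exists = os.path.exists(
        os.path.join(moved_folder, 'nested', 'folder', 'some.txt'))

    # Remove the moved folder and everything in it.
    shutil.rmtree(moved_folder)
    moved_folder_exists = os.path.exists(moved_folder)

print(copied_folder_exists, moved_file_exists, moved_folder_exists)  # False True False
```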


# StringIO

## What is StringIO?

* `StringIO` is a "filelike object".
* Most functions that take a text file as parameter (but not a path to a file) also work with `StringIO`.
* Data are stored in memory and purged upon calling `close()`.
* Particularly useful for writing to a test "file" and comparing the contents afterwards.


## Read from a string

In [14]:
import io
text_to_read = """The quick brown
fox jumps over the
lazy white dog."""

with io.StringIO(text_to_read) as source:
    for line in source:
        print(line.rstrip('\n'))

The quick brown
fox jumps over the
lazy white dog.


## Writing to a string

In [15]:
import io
with io.StringIO() as output:
    output.write('The quick brown fox\n')
    output.write('jumps over the lazy white dog.\n')
    content = output.getvalue()
print(content)

The quick brown fox
jumps over the lazy white dog.
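
As mentioned, `StringIO` is particularly useful in tests. The following sketch tests a function that writes to an already opened text file without touching the disk. The function `write_greetings()` is a hypothetical example, not part of any library.

```python
import io

def write_greetings(target_file, names):
    """Write one greeting line per name to an already opened text file."""
    for name in names:
        target_file.write('Hello {}!\n'.format(name))

with io.StringIO() as target_file:
    write_greetings(target_file, ['Alice', 'Bob'])
    # Fetch the content before the with block closes and purges the buffer.
    actual_content = target_file.getvalue()

assert actual_content == 'Hello Alice!\nHello Bob!\n'
```

Because the function takes a file object instead of a path, the same code works with a real file opened by `open()` in production and with `StringIO` in tests.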



# Summary

* Use `os.path` for platform-independent path operations such as `join()` and `splitext()`.
* Text files can be read as a sequence of lines.
* When opening a text file, always specify an encoding.
* Use the `with` clause to ensure a file gets closed automatically, or call `close()` explicitly.
* The `os` and `shutil` modules provide many functions to work with files and folders.