# Reading and Writing Files

This chapter is about reading and writing files of different types.

To learn how to interact with the filesystem (for example, creating, moving, or referring to files), see [09: Filesystem](../09_filesystem/09_filesystem.ipynb).

## Opening files and file objects

In Python, you open and read a file using the built-in `open` function and various built-in reading operations.

Python uses the `file` object returned by `open` to keep track of a file and expose file operations.

For example, the following snippet reads in one line from a text file named `myfile`:

In [2]:
from pathlib import Path

myfile_path = Path.cwd() / "sample_files" / "01_myfile.txt"

with open(myfile_path, "r") as file_obj:
    line = file_obj.readline()

line

'This is some text written to 01_myfile.txt.'

Note that the first argument to `open` is a pathname. You could also have used `"./sample_files/01_myfile.txt"` or `os.path.join(os.getcwd(), "sample_files", "01_myfile.txt")`.

The second argument of the `open` command is a string denoting how the file should be opened, with `"r"` meaning that you want to open the file for reading.

Also, the object returned by `open()` can be used with the `with` keyword. The file is then managed by a context manager, which ensures that it is closed when the block exits.

With the `file_obj` in place, we call `readline()`, which reads and returns the first line of the file, that is, everything up to and including the newline character, if there is one.

The next call to `readline()` will return the second line, if it exists, and so on.
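
The behavior of successive `readline()` calls can be sketched as follows. This is a minimal illustration that assumes a throwaway two-line file created in a temporary directory; the file name `two_lines.txt` is hypothetical and not part of the sample files.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical two-line file, created only for this example.
    demo_path = Path(tmp_dir) / "two_lines.txt"
    demo_path.write_text("first\nsecond\n")

    with open(demo_path, "r") as file_obj:
        first = file_obj.readline()   # first line, newline included
        second = file_obj.readline()  # second line, newline included
        at_end = file_obj.readline()  # empty string: nothing left to read

print(repr(first), repr(second), repr(at_end))
```

Each call advances the file position, so the same method returns a different line every time until the file is exhausted.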

## Closing files

A file must be closed after all the data has been read from or written to a `file` object. Closing a file allows the underlying file to be read or written to by other code.

You close a `file` by using the `close` method when the `file` object is no longer needed.

In [3]:
from pathlib import Path

myfile_path = Path.cwd() / "sample_files" / "01_myfile.txt"

file_obj = open(myfile_path, "r")
line = file_obj.readline()
file_obj.close()

When you use the `with` keyword, the context manager automatically closes the file when it is no longer needed.

In [4]:
from pathlib import Path

myfile_path = Path.cwd() / "sample_files" / "01_myfile.txt"

with open(myfile_path, "r") as file_obj:
    line = file_obj.readline()
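
One way to confirm that the file really is closed is to inspect the `closed` attribute of the file object. The sketch below assumes a temporary file with a hypothetical name, and also shows `try`/`finally` as a way to guarantee the `close()` call when `with` is not used.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file, created only for this example.
    demo_path = Path(tmp_dir) / "demo.txt"
    demo_path.write_text("some text\n")

    # With a context manager: the file is closed when the block exits.
    with open(demo_path, "r") as file_obj:
        open_inside = not file_obj.closed
        line = file_obj.readline()
    closed_after = file_obj.closed

    # Without a context manager: try/finally makes sure close() runs
    # even if reading raises an exception.
    file_obj = open(demo_path, "r")
    try:
        line_again = file_obj.readline()
    finally:
        file_obj.close()

print(open_inside, closed_after)
```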

## Opening the file in write or other modes

The most common values for the second argument of `open()`, which indicates how you want the file to be opened, are:

+ `r` &mdash; open the file for reading.
+ `w` &mdash; open the file for writing, erasing its previous content.
+ `a` &mdash; open the file for appending.

The following snippet writes a message into a file, erasing its previous content:

In [5]:
from pathlib import Path

myfile_path = Path.cwd() / "sample_files" / "02_hello.txt"

file_obj = open(myfile_path, "w")
file_obj.write("Hello, world!")
file_obj.close()

Depending on the OS, `open` may also have access to additional file modes.

Additionally, `open` can take an optional third argument which defines how reads or writes for that file are buffered and flushed to disk.

Other parameters control the encoding for text files, the handling of newline characters in text files, etc.
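
The difference between the `w` and `a` modes, together with the `encoding` parameter, can be sketched like this. The example assumes a temporary file named `log.txt`, which is hypothetical and used only for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file, created only for this example.
    log_path = Path(tmp_dir) / "log.txt"

    # "w" creates the file (or truncates it if it already exists).
    with open(log_path, "w", encoding="utf-8") as f:
        f.write("first entry\n")

    # "a" keeps the existing content and writes at the end.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write("second entry\n")

    with open(log_path, "r", encoding="utf-8") as f:
        after_append = f.read()

    # Opening with "w" again erases everything written so far.
    with open(log_path, "w", encoding="utf-8") as f:
        f.write("fresh start\n")

    with open(log_path, "r", encoding="utf-8") as f:
        after_rewrite = f.read()

print(repr(after_append))
print(repr(after_rewrite))
```

Passing `encoding` explicitly avoids depending on the platform's default text encoding.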

## Functions to read and write text or binary data

The function `File.readline()` reads and returns a single line from a `File` object, including any newline character on the end of the line.

When there's nothing more to be read from the `file`, `readline()` returns an empty string.

The following example counts the lines on a file.

In [6]:
from pathlib import Path

base_path = Path.cwd() / "sample_files"

file_obj = open(base_path / "03_count_lines_linux.txt", "r")
count = 0
while file_obj.readline() != "":
    count += 1
file_obj.close()
print(f"{count} line(s) in the file.")

3 line(s) in the file.


In [7]:
base_path = Path.cwd() / "sample_files"

file_obj = open(base_path / "04_count_lines.txt", "r")
count = 0
while file_obj.readline() != "":
    count += 1
file_obj.close()
print(f"{count} line(s) in the file.")

3 line(s) in the file.


Note how the program does not count the final `"\n<EOF>"` found in Linux files as a data line.

The method `File.readlines()` reads all the lines in a file and returns them as a list of strings, one string per line, with trailing newlines included:

In [8]:
from pathlib import Path

base_path = Path.cwd() / "sample_files"

file_obj = open(base_path / "03_count_lines_linux.txt", "r")
lines = file_obj.readlines()
file_obj.close()
print(f"{len(lines)} line(s) in the file.")

3 line(s) in the file.


See how the final `"\n<EOF>"` is not included.

Because `File.readlines()` materializes all the file contents in a list, it is not appropriate for large files.

Another great way to iterate over all the lines of a text file is to treat the `file` object returned by `open()` as an iterator:

In [9]:
from pathlib import Path

base_path = Path.cwd() / "sample_files"

file_obj = open(base_path / "03_count_lines_linux.txt", "r")
count = 0
for line in file_obj:
    count += 1
file_obj.close()
print(f"{count} line(s) in the file.")

3 line(s) in the file.


### Platform-dependent line endings

A possible problem with reading text files is that the lines may be terminated by different characters, depending on the OS they were created in.

Classic Mac OS (before OS X) terminated lines with `\r`, Windows uses `\r\n`, and Linux and modern macOS use `\n`.

By default, Python normalizes the lines read from files translating line endings to `\n`.

This might create a problem if those lines are then written to another text file: by default, each `\n` is written using the default line ending of the OS running the program, which may differ from the line endings used in the original file.

Python allows you to tailor this behavior by using the `newline` parameter of `open()` and setting it to `\r`, `\r\n`, or `\n`.

For example, the following snippet forces only `\n` to be used as a newline.

```python
input_file = open("myfile", newline="\n")
```

Passing `newline=""` will accept all of the various options as line endings and will return whatever was used in the file with no translation.

In [23]:
from pathlib import Path

base_path = Path.cwd() / "sample_files"

file_obj = open(
    base_path / "05_winfile.txt",
    "r",
    newline="")
line = file_obj.readline()
file_obj.close()

print(line)
print(line[-1])
print(line[-2])

This is windows line, which ends in cr-lf, instead of only newline
e
n


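Note that the last two characters of the string returned by `readline()` are `e` and `n`, not `\r` and `\n`. The output indicates that the line in this sample file isn't followed by a line terminator. Had the line ended in `\r\n`, `newline=""` would have left both characters untouched in the returned string.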
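
The effect of the `newline` parameter is easier to see with a file that is known to contain `\r\n` terminators. The following sketch writes such a file in binary mode to a temporary directory (the file names are hypothetical) and then compares the default translation with `newline=""`, and also shows `newline` being used when writing.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file with Windows-style line endings, written as raw bytes
    # so that no translation happens on the way in.
    win_path = Path(tmp_dir) / "windows_style.txt"
    win_path.write_bytes(b"one\r\ntwo\r\n")

    # Default behavior: line endings are translated to "\n" when reading.
    with open(win_path, "r") as f:
        translated = f.readline()

    # newline="": line endings are recognized but returned untranslated.
    with open(win_path, "r", newline="") as f:
        untranslated = f.readline()

    # When writing, newline="\r\n" turns every "\n" into "\r\n" on disk.
    out_path = Path(tmp_dir) / "forced_crlf.txt"
    with open(out_path, "w", newline="\r\n") as f:
        f.write("one\ntwo\n")
    raw_output = out_path.read_bytes()

print(repr(translated), repr(untranslated), raw_output)
```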

The write methods that correspond to `File.readline()` and `File.readlines()` are `File.write()` and `File.writelines()`.

`File.write()` can span multiple lines if newline characters are embedded within the string:

In [15]:
from pathlib import Path

base_path = Path.cwd() / "sample_files"

file_obj = open(base_path / "06_foobar.txt", "w")
file_obj.write("foo\nbar\nfoobar")
file_obj.close()


See how a final `"\n"` is not automatically appended to the file. If you want the file to terminate on a `\n<EOF>` you must terminate the string yourself with a `\n`.

`File.writelines()` takes a list of strings as an argument and writes them, one after the other, to the given `file` object without writing newlines.

If the strings in the list end with newlines, they're written as lines, otherwise, they'll be just concatenated in the file.


In [18]:
from pathlib import Path

base_path = Path.cwd() / "sample_files"

file_obj = open(base_path / "07_lines.txt", "w")
strs = ["string1", "string2", "string3"]
file_obj.writelines(strs)
file_obj.close()
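
The two outcomes described above can be compared side by side. This is a small sketch that assumes a temporary file with a hypothetical name; it writes the same list once as-is and once with a newline appended to each string.

```python
import tempfile
from pathlib import Path

strs = ["string1", "string2", "string3"]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file, created only for this example.
    demo_path = Path(tmp_dir) / "lines_demo.txt"

    # No newlines in the strings: they end up concatenated.
    with open(demo_path, "w") as f:
        f.writelines(strs)
    concatenated = demo_path.read_text()

    # Newline appended to each string: one string per line.
    with open(demo_path, "w") as f:
        f.writelines(f"{s}\n" for s in strs)
    as_lines = demo_path.read_text()

print(repr(concatenated))
print(repr(as_lines))
```

Note that `writelines()` accepts any iterable of strings, which is why the generator expression works in the second call.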



The following example illustrates how `writelines()` is the counterpart of `readlines()` in the sense that it can handle the list returned by `readlines()` to create an identical file to the given one.

In [20]:
from pathlib import Path

base_path = Path.cwd() / "sample_files"

infile = open(base_path / "08_src-file.txt", "r")
lines = infile.readlines()
infile.close()

outfile = open(base_path / "09_dst-file.txt", "w")
outfile.writelines(lines)
outfile.close()

## Dealing with binary files

On some occasions, you may want to read all the data in a file into a single `bytes` object, especially if the data in the file isn't a string.

You might find two use cases:
+ read all the file contents into memory and treat it as a bytes sequence.
+ read a portion of the data file into a `bytes` object of a fixed size.

### A quick intro about `bytes`

A `bytes` object is an immutable sequence of integers whose values range from 0 to 255. They're especially useful when dealing with binary data, such as when reading from and writing to binary data files.

You can transform a Unicode string to a `bytes` object with `str.encode()`.

Similarly, the `bytes.decode()` method converts a `bytes` object to the corresponding string representation.
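
As a quick illustration of the round trip between `str` and `bytes`, the sketch below encodes a short string with UTF-8 (the encoding is chosen explicitly here as an assumption for the example) and decodes it back.

```python
text = "café"

# Encoding turns the string into a sequence of byte values (0 to 255).
data = text.encode("utf-8")

# The non-ASCII character takes two bytes in UTF-8, so the lengths differ.
text_length = len(text)
data_length = len(data)

# Indexing a bytes object returns an integer, not a one-character string.
first_values = list(data[:3])

# Decoding with the same encoding restores the original string.
round_trip = data.decode("utf-8")

print(data, text_length, data_length, first_values, round_trip)
```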

### Reading files in binary mode with `rb`

To open a file for reading in binary mode, use `rb`. Then, you can use the `File.read()` method.

The `File.read()` method reads all of a file from the current position and returns that data as a `bytes` object.

If you pass an integer, as in `File.read(n)`, it reads at most that number of bytes from the file (fewer if there isn't enough data left to satisfy the request) and returns them as a `bytes` object:


In [24]:
from pathlib import Path

base_path = Path.cwd() / "sample_files"

infile = open(base_path / "05_winfile.txt", "rb")
data_4_bytes = infile.read(4)
remaining_data = infile.read()
infile.close()

print(data_4_bytes)
print(remaining_data[-2:])


b'This'
b'ne'
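
A common pattern built on `read(n)` is to process a binary file in fixed-size chunks. The following sketch assumes a small temporary file with a hypothetical name and a chunk size of 4 bytes chosen only for illustration.

```python
import tempfile
from pathlib import Path

CHUNK_SIZE = 4  # arbitrary size, chosen only for this example

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical ten-byte file, created only for this example.
    demo_path = Path(tmp_dir) / "chunks.bin"
    demo_path.write_bytes(b"abcdefghij")

    chunks = []
    with open(demo_path, "rb") as infile:
        while True:
            chunk = infile.read(CHUNK_SIZE)
            if not chunk:  # empty bytes object: end of file reached
                break
            chunks.append(chunk)

print(chunks)
```

The last chunk is shorter than `CHUNK_SIZE` because only two bytes were left in the file.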


The `File.write()` method is the counterpart of `File.read()`:

In [28]:
from pathlib import Path

base_path = Path.cwd() / "sample_files"

outfile = open(base_path / "10_binfile.txt", "wb")
# Named `data` rather than `bytes` to avoid shadowing the built-in type.
data = "foobar".encode()
data += "\r\n".encode()
outfile.write(data)
outfile.close()


## Reading and writing files with `pathlib`

`pathlib` exposes methods to read and write text and binary files. This can be quite convenient, as you don't need to open or close the files yourself. However, these methods can't append data to an existing file (for that, open the file in `a` mode); write operations always replace the existing content:

In [29]:
from pathlib import Path

base_path = Path.cwd() / "sample_files"
p_text = base_path / "11_pathlibfile.txt"
p_text.write_text("foo\nbar\nfoobar\n")

p_text.read_text()

'foo\nbar\nfoobar\n'

In [30]:
from pathlib import Path

base_path = Path.cwd() / "sample_files"
p_bin = base_path / "12_pathlibfile.bin"
p_bin.write_bytes(b"foo\nbar\nfoobar\n")

p_bin.read_bytes()

b'foo\nbar\nfoobar\n'
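
If appending is needed while still working with `Path` objects, one option is `Path.open()`, which accepts the same modes as the built-in `open()`. The sketch below assumes a temporary file with a hypothetical name.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file, created only for this example.
    notes_path = Path(tmp_dir) / "notes.txt"

    # write_text() creates the file or replaces its content.
    notes_path.write_text("foo\n")

    # Path.open() in "a" mode appends instead of replacing.
    with notes_path.open("a") as f:
        f.write("bar\n")

    contents = notes_path.read_text()

print(repr(contents))
```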

## Terminal input/output and redirection

The built-in `input()` function prompts for and reads an input string from the command line:

In [31]:
user_input = input("Enter the filename to delete:")
print(f"{user_input} file will be deleted!")

root file will be deleted!


The user's input is terminated by pressing the Enter key, but the newline at the end of the input line is stripped off.

`input()` always returns a string, so you're responsible for converting it to the appropriate type as needed:

In [32]:
x = int(input("Enter your age:"))
print(f"{x=}; type={type(x)}")

x=50; type=<class 'int'>


`input()` reads from stdin and writes to stdout.

Lower-level access to stdin, stdout, and stderr can be obtained by using the `sys` module through the `sys.stdin`, `sys.stdout`, and `sys.stderr` attributes.

Those attributes can be treated as *special* `file` objects that support the `File` methods already seen.

For example, you can use `readline()` on `sys.stdin`. Similarly, you can use the `write()` method on `sys.stdout` and `sys.stderr`.

```python
import sys

print("Enter your name:", end=" ")
user_input = sys.stdin.readline()

sys.stdout.write(f"User input was {user_input}\n")
```

You can redirect standard input to read from a file, and standard output to write to a file.

Once you're done, you can restore their original values using `sys.__stdin__`, `sys.__stdout__`, and `sys.__stderr__`.

In [36]:
import sys
from pathlib import Path

base_path = Path.cwd() / "sample_files"
infile = open(base_path / "13_input.txt", "r")
sys.stdin = infile

name = sys.stdin.readline()
age = int(sys.stdin.readline())

outfile = open(base_path / "14_output.txt", "w")
sys.stdout = outfile
sys.stdout.write(f"The name was {name.strip()}, with {age} years old\n")
print("That was what the user typed")

sys.stdin = sys.__stdin__
sys.stdout = sys.__stdout__

infile.close()
outfile.close()


Redirecting the output of `print()` can be useful, since `print()` has a simpler and more familiar syntax.

You can use that technique to temporarily redirect standard output to a file to capture what would otherwise be sent to the terminal, and possibly lost off the screen.
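
For temporary redirection, the standard library also offers `contextlib.redirect_stdout()`, which restores the previous `sys.stdout` automatically when the block exits. The sketch below is one possible way to capture `print()` output in a file; the file name is hypothetical and a temporary directory is assumed.

```python
import contextlib
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical output file, created only for this example.
    out_path = Path(tmp_dir) / "captured.txt"

    with open(out_path, "w") as outfile:
        # Inside this block, print() writes to the file instead of the terminal.
        with contextlib.redirect_stdout(outfile):
            print("This line goes to the file")

    # Standard output has been restored at this point.
    print("This line goes to the terminal again")

    captured = out_path.read_text()

print(repr(captured))
```

Compared with assigning to `sys.stdout` by hand, this approach restores the original stream even if the code inside the block raises an exception.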

## Handling structured binary data with the `struct` module

For sophisticated applications, Python provides the ability to easily read and write arbitrary binary data generated by external programs.

| NOTE: |
| :---- |
| To read/write Python objects written to file, use [pickling](#pickling-python-objects-to-files) instead of `struct`. |

This is done using the `struct` module.

To use it, you start by defining a *format string* understandable to the `struct` module. This will tell `struct` how the records are packed in the file.

For example:
+ `h` &mdash; a single C short integer.
+ `d` &mdash; a single double-precision floating-point number.
+ `s` &mdash; a string of bytes.

Any of these characters can be preceded by an integer count. For most format characters, this is a repeat count (`3h` means three shorts). For `s`, it is the length of the string in bytes, so `7s` indicates a single string of seven bytes.

As a result, the string `"hd7s"` indicates a short, followed by a double, followed by a seven-byte string.

The function `struct.pack()` can take Python values and transform them to their corresponding byte sequences to satisfy the given format string:

In [38]:
import struct
from pathlib import Path

record_format = "hd7s"
data_record = struct.pack(record_format, 42, 3.14, b"goodbye")

base_path = Path.cwd() / "sample_files"
outfile = open(base_path / "15_struct.bin", "wb")

outfile.write(data_record)
outfile.close()

data_record

b'*\x00\x00\x00\x00\x00\x00\x00\x1f\x85\xebQ\xb8\x1e\t@goodbye'

To read from a binary file created by an external program, you need to know how many bytes you need to read at a time.

`struct` includes a `calcsize()` function, which takes your format string as an argument and returns the number of bytes used to contain data in that format.

Then, `struct.unpack()` is used to parse each record read from the file and return a Python representation of the data record as a tuple:

In [2]:
import struct
from pathlib import Path


record_format = "hd7s"
record_size = struct.calcsize(record_format)
records = []

base_path = Path.cwd() / "sample_files"
with open(base_path / "15_struct.bin", "rb") as infile:
    while True:
        record = infile.read(record_size)
        if not record:
            break
        records.append(struct.unpack(record_format, record))

print(records)

[(42, 3.14, b'goodbye')]


Note that `File.read()` returns an empty `bytes` object when you're at the end of the file.

If `struct.unpack()` receives a data record whose size doesn't match the format string, the function raises `struct.error`.

The `struct` module lets you configure whether the data should be read in big-endian/little-endian/machine-native-endian format.
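
Byte order is selected with a prefix character at the start of the format string: `<` for little-endian, `>` for big-endian, and no prefix (or `@`) for the machine's native order and alignment. The sketch below reuses the `hd7s` record layout from the earlier examples and shows how the prefix affects both the size of the record and the bytes produced. It also shows the error raised for a truncated record.

```python
import struct

# Native mode may insert padding bytes for alignment, so its size is
# platform-dependent. The "<" prefix uses standard sizes and no padding.
native_size = struct.calcsize("hd7s")
packed_size = struct.calcsize("<hd7s")  # 2 + 8 + 7 bytes

# The same short integer in little-endian and big-endian byte order.
little = struct.pack("<h", 42)
big = struct.pack(">h", 42)

# Round trip with an explicit byte order.
record = struct.pack("<hd7s", 42, 3.14, b"goodbye")
fields = struct.unpack("<hd7s", record)

# A record of the wrong size is rejected with struct.error.
error_message = None
try:
    struct.unpack("<hd7s", record[:-1])
except struct.error as exc:
    error_message = str(exc)

print(native_size, packed_size, little, big, fields)
print(error_message)
```

Using an explicit prefix is generally a safer choice for files exchanged between machines, because the native layout can differ from one platform to another.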

## Pickling Python objects to files

Python can write almost any data structure into a file, read that data structure back out of a file, and materialize it in your program via the `pickle` module.

| NOTE: |
| :---- |
| To read/write information from arbitrary binary files, use [`struct`](#handling-structured-binary-data-with-the-struct-module) instead of `pickle`. |


Pickling is the process whereby a Python object hierarchy is converted into a byte stream, and unpickling is the inverse operation, whereby a byte stream is converted back into an object hierarchy.

In [6]:
import pickle
from pathlib import Path

a = 42
b = 3.14
c = "test"


base_path = Path.cwd() / "sample_files"
with open(base_path / "16_python.bin", "wb") as outfile:
    pickle.dump(a, outfile)
    pickle.dump(b, outfile)
    pickle.dump(c, outfile)

To unpickle the file, you use `pickle.load()`:

In [7]:
import pickle
from pathlib import Path

base_path = Path.cwd() / "sample_files"
with open(base_path / "16_python.bin", "rb") as infile:
    a = pickle.load(infile)
    b = pickle.load(infile)
    c = pickle.load(infile)

print(f"{a=} {b=} {c=!r}")


a=42 b=3.14 c='test'


The `pickle` module can handle lists, tuples, numbers, strings, dictionaries, and any object made of these types of objects, including class instances.

It also handles shared objects, cyclic references, and other complex memory structures correctly.

However, code objects and system resources such as files and sockets can't be pickled.
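
The handling of shared objects and cyclic references can be checked with a small in-memory round trip. The sketch below uses `pickle.dumps()` and `pickle.loads()`, the in-memory counterparts of `dump()` and `load()`, so no file is needed. The data structure is made up for the example.

```python
import pickle

# One list referenced twice, plus a dictionary that contains itself.
shared = [1, 2, 3]
original = {"first": shared, "second": shared}
original["self"] = original

restored = pickle.loads(pickle.dumps(original))

# The restored structure is a new object...
is_copy = restored is not original
# ...in which the shared list is still a single object referenced twice...
still_shared = restored["first"] is restored["second"]
# ...and the cycle points back to the restored dictionary.
still_cyclic = restored["self"] is restored

print(is_copy, still_shared, still_cyclic)
```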

### Reasons not to pickle

+ Pickling is neither particularly fast nor space-efficient as a means of serialization. Depending on the data, storing serialized objects in a format such as JSON can be faster and result in smaller files on disk.

+ Pickling isn't secure, and loading a pickle with malicious content can result in the execution of arbitrary code on your machine. You should avoid pickling if there's a chance that the pickle file will be accessible to anyone who might alter it.

## Shelving objects

You can think of a `shelve` object as a dictionary that stores its data in a file on disk rather than in memory. This allows you to overcome any memory limitations your system might have.

Let's explore the `shelve` module with an example of an address book.

Each entry of the address book consists of a tuple of three elements (first name, phone number, and address), indexed by the person's last name.

Let's start by creating the address book file using the `shelve` module.

In [13]:
import shelve
from pathlib import Path

base_path = Path.cwd() / "sample_files"


book = shelve.open(base_path / "17_addresses")

With the `book` object created, we can start adding entries.

The `book` object is similar to a dictionary, but the keys must be strings:

In [14]:
book["pugh"] = ("florence", "555-1234", "123 Hollywood blvd")
book["isaacs"] = ("jason", "123-456", "456 Main st.")

Once you're done with the `book` you can close the file and end the session:

In [15]:
book.close()

Then, we can open the same address book again:

In [16]:
import shelve
from pathlib import Path

base_path = Path.cwd() / "sample_files"


book = shelve.open(base_path / "17_addresses")

And we can retrieve data from the `book` object as if it were a dictionary:

In [17]:
book["isaacs"]

('jason', '123-456', '456 Main st.')

As you can see, the address book file created by `shelve.open()` behaves as a persistent dictionary.

More generally, `shelve.open` returns a `shelf` object that permits basic dictionary operations such as key assignment, lookup, `del`, `in`, and the `keys` method.
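
These dictionary-style operations can be sketched as follows. The example assumes a throwaway shelf in a temporary directory (the name `demo_addresses` is hypothetical) and uses the shelf as a context manager so that it is closed automatically. The filename is passed as a string, since older Python versions may not accept `Path` objects here.

```python
import shelve
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical shelf file, created only for this example.
    shelf_path = str(Path(tmp_dir) / "demo_addresses")

    # Key assignment, as with a regular dictionary.
    with shelve.open(shelf_path) as book:
        book["pugh"] = ("florence", "555-1234", "123 Hollywood blvd")
        book["isaacs"] = ("jason", "123-456", "456 Main st.")

    # Reopen the shelf: the entries were persisted on disk.
    with shelve.open(shelf_path) as book:
        has_pugh = "pugh" in book           # membership test
        del book["isaacs"]                  # deletion
        remaining_keys = sorted(book.keys())
        missing = book.get("isaacs")        # None: the key is gone

print(has_pugh, remaining_keys, missing)
```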

The main restriction of `shelf` objects is that their keys have to be strings.

It's also important to understand that `shelf` objects are not materialized into memory. Instead, only the needed information is brought into memory, and the rest remains on disk.

Additionally, they provide no control for concurrent access, so `shelf` objects are not appropriate for multiuser databases. Similarly, you might find that while lookups are very fast, adding and updating keys can be quite slow.