# Chapter 7. Files

In this chapter we will cover how to read and write files. We will also cover a topic that is rarely discussed in introductory programming books, but should be understood by every working programmer: *encodings*.

## Read a text file

Let's create a text file `example.txt` with the following content:

```
Roses are red.
Violets aren't blue.
It's literally in the name.
They're called violets.
```

We can read the file using the `open` function. This function returns a **file object** which allows access to the underlying file:

In [None]:
file = open("example.txt")

In [None]:
file

We can read the content of the file using the `read` method of the `file` object. The `read` method returns the entire content of the file as a regular string:

In [None]:
content = file.read()

In [None]:
print(content)

In [None]:
type(content)

We also need to *close* the file to free up the resources consumed by the file object:

In [None]:
file.close()

Let's check that the file is really closed by inspecting the `closed` attribute:

In [None]:
file.closed

Trying to call `read` on a closed file object will result in an error (a `ValueError`):

In [None]:
file.read()
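As a minimal sketch of what that error looks like in practice, the following snippet catches it instead of letting the cell fail. The file name `closed_demo.txt` and the temporary directory are assumptions made only so the example is self-contained:

```python
import os
import tempfile

# Hypothetical scratch file, so the example does not depend on example.txt.
path = os.path.join(tempfile.mkdtemp(), "closed_demo.txt")
with open(path, "w") as scratch:
    scratch.write("Roses are red.\n")

file = open(path)
file.close()

try:
    file.read()
    error_message = None
except ValueError as error:
    # Reading from a closed file object raises a ValueError.
    error_message = str(error)

print(error_message)
```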

We will now look at this workflow in more detail.

## The "with" statement

When working with files, we can use the `with` statement which will automatically close the file:

In [None]:
with open("example.txt", "r") as file:
    content = file.read()
    
    # The file is automatically closed, i.e. there is no need to call file.close()

In [None]:
print(content)
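To see what `with` saves us, here is a rough sketch of the same read written by hand with `try`/`finally`, next to the `with` version. This is only an approximation of what `with` does for files, and the file name `with_demo.txt` is a placeholder chosen for this example:

```python
import os
import tempfile

# Hypothetical scratch file used only for this comparison.
path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")
with open(path, "w") as scratch:
    scratch.write("Violets aren't blue.")

# Manual version: the finally block closes the file even if read() raises.
file = open(path, "r")
try:
    manual_content = file.read()
finally:
    file.close()

# The with statement does the same cleanup for us.
with open(path, "r") as file2:
    with_content = file2.read()

print(file.closed, file2.closed)
```

Both file objects end up closed, but the `with` version is shorter and harder to get wrong.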

## Writing to a file

Writing a file is very similar to reading a file. You can write to a file by passing the `"w"` mode (which stands for "write") to `open` and calling the `write` method on the file object:

In [None]:
with open("somefile.txt", "w") as file:
    file.write("Some content")

This will create a file `somefile.txt` with the content `"Some content"`. Note that if that file *already exists, its content will be completely overwritten by the new content*. If you wish to avoid that and instead *append* the new content to the existing content, you need to use the `"a"` (append) mode.
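The difference between the two modes can be sketched as follows. The file name `modes_demo.txt` and the temporary directory are assumptions of this example, used so that no real file is overwritten:

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

with open(path, "w") as file:
    file.write("first\n")

# "a" keeps the existing content and adds to the end of the file.
with open(path, "a") as file:
    file.write("second\n")

with open(path) as file:
    appended = file.read()

# "w" throws the existing content away before writing.
with open(path, "w") as file:
    file.write("third\n")

with open(path) as file:
    overwritten = file.read()

print(repr(appended))
print(repr(overwritten))
```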

## Escape characters

Some characters are **non-printable**, i.e. they aren't used for displaying content, but instead perform auxiliary functions. The most important example of this is the **newline character**.

Let's get the *representation* of `content`:

In [None]:
content

As you can see newline characters are represented using `\n`:

In [None]:
print("This is a line.\nThis is another line.")

Another important non-printable character is the **tab character** which is represented using `\t`:

In [None]:
print("\tfirst column\tsecond column")

Note that `\t` and `\n` are also called **escape characters** (or, more precisely, *escape sequences*). Because a backslash is used for *escaping* characters, if you want to display backslashes in your strings, you might need to escape them as well, resulting in `\\`:

In [None]:
print("\\t means tab")

Escape characters also allow you to use quotes inside a string:

In [None]:
print(""this will not work"")

In [None]:
print("\"this will work\"")
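One detail that is easy to miss: an escape takes two characters to type, but it produces a single character in the resulting string. The short sketch below checks this with `len`. It also shows single-quoted strings as one alternative way to get double quotes into a string; all variable names are made up for this example:

```python
newline = "\n"        # typed as two characters, stored as one
backslash_n = "\\n"   # an escaped backslash followed by the letter n

print(len(newline))
print(len(backslash_n))

# Escaped double quotes and a single-quoted string give the same result.
quoted = "\"this will work\""
same_quoted = '"this will work"'
print(quoted == same_quoted)
```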

## Encodings

Now that we have the basics out of the way, we need to have a look at how the content of a file is actually stored. To accomplish that we will again use the second argument of `open`, which describes the mode the file should be opened in. Here we will pass `rb` which means "read the file as a binary file" (`r` = read and `b` = binary):

In [None]:
with open("example.txt", "rb") as file:
    content = file.read()

First of all we note that the content is no longer a string, but a `bytes` object:

In [None]:
content

In [None]:
type(content)

This object contains the actual *bytes* of the file. A **byte** is simply the smallest addressable unit of storage on a computer and can (usually) hold values from 0 to 255.

For example, we can access the first byte of the file like this:

In [None]:
content[0]

Wait, why do we suddenly have numbers when a file contains *characters*? The answer to this question is that the computer *deceptively lies to you*. Computers don't *really* store characters. They can *only* store bytes which are numbers from 0 to 255. This means that the file *actually contains a sequence of numbers*.

However computers maintain mappings from those numbers to characters, so that they can *interpret those numbers as characters*. The simplest such mapping is the **ASCII table**. Here is an excerpt from that table:

| Byte value | Character |
| ---------- | --------- |
| ...        | ...       |
| 80         | P         |
| 81         | Q         |
| 82         | R         |
| 83         | S         |
| 84         | T         |
| 85         | U         |
| 86         | V         |
| 87         | W         |
| 88         | X         |
| 89         | Y         |
| 90         | Z         |
| ...        | ...       |
| 97         | a         |
| 98         | b         |
| 99         | c         |
| 100        | d         |
| 101        | e         |
| ...        | ...       |
| 110        | n         |
| 111        | o         |
| 112        | p         |
| 113        | q         |
| 114        | r         |
| 115        | s         |
| ...        | ...       |

Now we can make some sense of the `content` variable:

In [None]:
content[0]

If we look at the ASCII table, we can see that the number `82` corresponds to the character `R`. Therefore the first byte of the file contains the number `82` which represents the character `R`. The next few characters should be `o`, `s`, `e` and `s`, i.e. the following bytes should be `111`, `115`, `101` and `115`:

In [None]:
content[1]

In [None]:
content[2]

In [None]:
content[3]

In [None]:
content[4]

Note that spaces, newlines, etc. are also simply stored as bytes:

In [None]:
content[5]
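We do not have to look these values up by hand. As a small sketch, Python's built-in `ord` and `chr` functions convert between a character and its number, and `list` shows all byte values of a `bytes` object at once. The literal `b"Roses"` stands in for the first five bytes of `example.txt`:

```python
first_word = b"Roses"  # stand-in for the first five bytes of example.txt

byte_values = list(first_word)
print(byte_values)

# ord goes from character to number, chr goes back.
print(ord("R"), chr(82))

# Spaces and newlines have byte values too.
space_byte = ord(" ")
newline_byte = ord("\n")
print(space_byte, newline_byte)
```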

The ASCII table worked fine for a while until programmers suddenly noticed that there are languages in the world other than English. This was a truly *shocking* discovery that fundamentally changed the way programmers thought about the world. The **Unicode** standard was born.

> Note that this is a heavily simplified history of the Unicode standard. The reality was much more complicated.

The most important concept of the Unicode standard is the code point. A **code point** is a numerical value that maps to a specific character. This is very similar to the ASCII table, except that Unicode is extremely large and contains such characters as:

* the German umlaut `ä` which has the code point 228
* the checkmark `✅` which has the code point 9989
* the emoji `😀` which has the code point 128512

You can think of Unicode as a giant extension of the ASCII table.
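The code points listed above can be checked with `ord`, which works for any Unicode character and not just for ASCII. In this sketch the characters are written as escapes (`\u00e4` is `ä`, `\u2705` is the checkmark, `\U0001F600` is the emoji) so that the example does not depend on how the source file itself is encoded:

```python
characters = ["R", "\u00e4", "\u2705", "\U0001F600"]

# ord returns the code point of a single character.
code_points = [ord(character) for character in characters]

for character, code_point in zip(characters, code_points):
    print(character, code_point)
```

Note that `R` keeps the same number it has in the ASCII table.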

However, we can no longer store every character in a single byte. A fixed-width scheme that fits every Unicode character uses *four bytes* per character. This would be extremely wasteful for e.g. English texts, since we would rarely *actually* need all four bytes.
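To get a feeling for the waste, the sketch below encodes the first line of `example.txt` with UTF-32, one real encoding that spends four bytes on every character, and compares the size to UTF-8. The `-le` suffix picks a byte order explicitly so that no extra marker bytes are added at the front:

```python
text = "Roses are red."

# Four bytes for every character, even plain English letters.
utf32_size = len(text.encode("utf-32-le"))

# One byte per character for this particular text.
utf8_size = len(text.encode("utf-8"))

print(len(text), utf32_size, utf8_size)
```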

Therefore there are multiple **encodings** which govern how code points are converted to bytes. For example, an encoding can decide to represent some characters as a single byte, some characters as two bytes, etc. We will not dive into encodings in this chapter since this is not essential. However it *is essential* to realize that the *same code point* can be *converted to a different sequence of bytes depending on the encoding*.

For example the encoding `UTF-8` (which is the most popular encoding on the internet) represents the German umlaut `ä` using the following sequence of bytes:

In [None]:
utf8_umlaut = "ä".encode("utf-8")

In [None]:
len(utf8_umlaut)

In [None]:
utf8_umlaut[0]

In [None]:
utf8_umlaut[1]

However the `Windows-1252` encoding (called `cp1252` for short) which is commonly used on Windows systems represents the same character completely differently:

In [None]:
cp1252_umlaut = "ä".encode("cp1252")

In [None]:
len(cp1252_umlaut)

In [None]:
cp1252_umlaut[0]
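The same comparison can be written as one small loop. This is only a sketch; `utf-16-le` is added here as a third encoding purely for illustration and is not discussed further in this chapter:

```python
umlaut = "\u00e4"  # the character ä

# Encode the same character with several encodings.
encoded = {name: list(umlaut.encode(name)) for name in ["utf-8", "cp1252", "utf-16-le"]}

for name, byte_values in encoded.items():
    print(name, byte_values)
```

One code point, three different byte sequences.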

All of this has an extremely important practical consequence:

**If you want to know what string a sequence of bytes represents, you need to know the encoding of the string.**

Consider the following sequence of bytes: 

In [None]:
b = bytes([195, 164])

If that sequence of bytes has the encoding `cp1252` it represents the following string:

In [None]:
b.decode("cp1252")

However if that sequence of bytes has the encoding `utf-8` it represents a completely different string:

In [None]:
b.decode("utf-8")

> It should be noted that if you don't know the encoding of a string there are certain statistical methods that can be used to guess that encoding using common patterns. However this can get very hacky and is not always accurate. Therefore you should never rely on these methods when writing production code.

This also means that if you write a file using one encoding and then try to read it using a different encoding, you will either get scrambled content or maybe even fail to read the file completely. This is actually a fairly common occurrence if a file was created on an operating system that uses one encoding by default and then read on another operating system that uses another encoding by default.

Let's see this in action. Create a text file `german.txt` with the following content:

```
A file with umlauts: ÄÖÜäöü
```

The encoding of the file should be `UTF-8`:

In [None]:
with open("german.txt", "w", encoding="utf-8") as german_file:
    german_file.write("A file with umlauts: ÄÖÜäöü")

Let's now try to read the same file using a different encoding:

In [None]:
with open("german.txt", "r", encoding="cp1252") as german_file:
    content = german_file.read()

In [None]:
content

Uh-oh! The content of this file is completely scrambled! This is because we tried to read it in an encoding that is different from the encoding it was originally written in.
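For comparison, here is a self-contained sketch that writes the text once and then reads it back twice: once with the matching encoding and once with the wrong one. The file name `german_demo.txt` and the temporary directory are assumptions of this example, chosen so that `german.txt` is left alone:

```python
import os
import tempfile

text = "A file with umlauts: ÄÖÜäöü"

# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "german_demo.txt")

with open(path, "w", encoding="utf-8") as german_file:
    german_file.write(text)

# Same encoding as when writing: the text comes back unchanged.
with open(path, "r", encoding="utf-8") as german_file:
    recovered = german_file.read()

# Different encoding: every umlaut turns into two unrelated characters.
with open(path, "r", encoding="cp1252") as german_file:
    scrambled = german_file.read()

print(recovered == text)
print(len(text), len(scrambled))
```

Passing `encoding` explicitly on both sides avoids depending on the operating system's default.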

Depending on the encoding, the read may even fail *completely*:

In [None]:
with open("german.txt", "r", encoding="utf-16") as german_file:
    content = german_file.read()

This is actually better than scrambled content, which illustrates a general principle in programming: *It's better to crash than to proceed with invalid data*. The reason for that is simple: If you crash, then at least you know you have an error. If you proceed with invalid data, then you may never know that you have an error until something really bad happens much later. Consider a file that contains bank transactions. If you fail to read this file, then you know that your software has an error and you may try to fix it. But if you proceed with scrambled content, you may process completely wrong transactions, resulting in *a lot* of headaches.
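Python's decoding functions follow this principle by default. As a hedged sketch, the snippet below decodes a byte that is not valid UTF-8 in two ways: the default strict mode raises a `UnicodeDecodeError`, while `errors="replace"` silently substitutes a placeholder character and carries on. The variable names are made up for this example:

```python
data = "ä".encode("cp1252")  # a single byte, 228, which is not valid UTF-8 on its own

# Default (strict) behaviour: crash loudly.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lenient behaviour: proceed with a replacement character instead of the real data.
lenient = data.decode("utf-8", errors="replace")

print(strict_failed)
print(repr(lenient))
```

The lenient result no longer contains the original character, and nothing in the program signals that something went wrong.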