# Accessing files from Python code
One of the most common tasks in a developer's job is to `process data stored in files`. The files themselves are usually kept on physical storage devices - hard, optical, network, or solid-state disks.

It's easy to imagine a program that sorts 20 numbers, and it's equally easy to imagine the user of this program entering these twenty numbers directly from the keyboard.

It's much harder to imagine the same task when there are 20,000 numbers to be sorted, and there isn't a single user who is able to enter these numbers without making a mistake.

It's much easier to imagine that these numbers are stored in the disk file which is read by the program. The program sorts the numbers and doesn't send them to the screen, but instead creates a new file and saves the sorted sequence of numbers there.

If we want to implement a simple database, the only way to store the information between program runs is to save it into a file (or files if your database is more complex).

In principle, any non-simple programming problem relies on the use of files, whether it processes images (stored in files), multiplies matrices (stored in files), or calculates wages and taxes (reading data stored in files).

You may ask why we have waited until now to show you these issues.

The answer is very simple - Python's way of accessing and processing files is implemented using a consistent set of objects. There is no better moment to talk about it.

# File names
Different operating systems can treat the files in different ways. For example, Windows uses a different naming convention than the one adopted in Unix/Linux systems.

If we use the notion of a canonical file name (a name which uniquely defines the location of the file regardless of its level in the directory tree) we can realize that these names look different in Windows and in Unix/Linux:
#### Windows
```s
C:\directory\file
```

#### Linux
```s
/directory/file
```

As you can see, systems derived from Unix/Linux don't use the disk drive letter (e.g., `C:`) and all the directories grow from one root directory called `/`, while Windows systems recognize the root directory as `\`.

In addition, Unix/Linux system file names are case-sensitive. Windows systems store the case of letters used in the file name, but don't distinguish between their cases at all.

This means that these two strings: `ThisIsTheNameOfTheFile` and `thisisthenameofthefile` describe two different files in Unix/Linux systems, but are the same name for just one file in Windows systems.

The main and most striking difference is that you have to use `two different separators for the directory names`: `\` in Windows, and `/` in Unix/Linux.

This difference is not very important to the normal user, but is `very important when writing programs in Python`.

To understand why, try to recall the very specific role played by the `\` inside Python strings.

# File names: continued
Suppose you're interested in a particular file located in the directory `dir`, and named `file`.

Suppose also that you want to assign a string containing the name of the file to a variable.

In Unix/Linux systems, it may look as follows:
```py
name = "/dir/file"
```

But if you try to code it for the Windows system:
```py
name = "\dir\file"
```

you'll get an unpleasant surprise: either Python will generate an error, or the execution of the program will behave strangely, as if the file name has been distorted in some way.

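In fact, it's not strange at all, but quite obvious and natural. Python uses the `\` as an escape character (as in `\n`), so a backslash together with the character that follows it may be read as one special character instead of two ordinary ones.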

This means that Windows file names must be written as follows:
```py
name = "\\dir\\file"
```
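A quick way to see the distortion is to compare string lengths. The sketch below uses the shorter name `\file` (our own choice, not taken from the text above) because `\f` is a valid escape sequence meaning "form feed", so the literal is silently changed rather than rejected. The raw-string notation `r"..."` is an additional standard Python spelling, shown here only as an alternative to doubling the backslashes.

```py
# "\f" is read as one form-feed character, so the name is distorted.
distorted = "\file"
# Doubling each backslash keeps it as a literal character.
escaped = "\\dir\\file"
# A raw string (r"...") switches escape processing off altogether.
raw = r"\dir\file"

print(len(distorted))    # 4 characters: form feed, i, l, e
print(escaped == raw)    # True
print(escaped)           # \dir\file
```

Both the doubled-backslash form and the raw-string form produce exactly the same nine-character string.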

Fortunately, there is one more solution. Forward slashes are also accepted as directory separators when a Python program runs on Windows, so you don't have to write backslashes at all.

This means that any of the following assignments:
```py
name = "/dir/file"
name = "c:/dir/file"
```

will work with Windows, too.
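If you would rather not think about separators at all, the standard `pathlib` module can help. It is not mentioned above, so treat it as a suggested alternative and not as part of this course's approach. The sketch below uses the pure path classes, which only manipulate names and never touch the disk, so it behaves the same way on any system.

```py
from pathlib import PurePosixPath, PureWindowsPath

# The same forward-slash spelling, interpreted by two naming conventions.
windows_name = PureWindowsPath("c:/dir/file")
unix_name = PurePosixPath("/dir/file")

print(windows_name)    # c:\dir\file
print(unix_name)       # /dir/file

# A path can also be assembled piece by piece with the / operator.
assembled = PureWindowsPath("c:/") / "dir" / "file"
print(assembled)       # c:\dir\file
```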

Any program written in Python (and not only in Python, because that convention applies to virtually all programming languages) does not communicate with the files directly, but through some abstract entities that are named differently in different languages or environments - the most-used terms are `handles` or `streams` (we'll use them as synonyms here).

The programmer, having a more or less rich set of functions/methods, is able to perform certain operations on the stream, which affect the real files using mechanisms contained in the operating system kernel.

In this way, you can implement the process of accessing any file, even when the name of the file is unknown at the time of writing the program.

The operations performed with the abstract stream reflect the activities related to the physical file.

<img src="img/pro-file.png">

To connect (bind) the stream with the file, it's necessary to perform an explicit operation.

The operation of connecting the stream with a file is called `opening the file`, while disconnecting this link is named `closing the file`.

Hence, the conclusion is that the very first operation performed on the stream is always `open` and the last one is `close`. The program, in effect, is free to manipulate the stream between these two events and to handle the associated file.

This freedom is limited, of course, by the physical characteristics of the file and the way in which the file has been opened.
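The open - use - close sequence might look like the minimal sketch below. The file name `example.txt` and the temporary directory are hypothetical, chosen only so the example can run anywhere; the strings `"w"` and `"r"` are arguments of Python's `open()` function that select writing and reading respectively.

```py
import os
import tempfile

# Hypothetical file placed in a throwaway directory, for illustration only.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "example.txt")

stream = open(path, "w")    # first operation: open (here, for writing)
stream.write("hello\n")     # the program manipulates the stream...
stream.close()              # last operation: close

stream = open(path, "r")    # open again, this time for reading
content = stream.read()
stream.close()

print(repr(content))        # 'hello\n'
print(stream.closed)        # True
```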

Note that the opening of the stream can fail, and this may happen for several reasons: the most common is that no file with the specified name exists.

It can also happen that the physical file exists, but the program is not allowed to open it. There's also the risk that the program has opened too many streams, and the specific operating system may not allow the simultaneous opening of more than n files (e.g., 200).

A well-written program should detect these failed openings, and react accordingly.
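One possible way to detect a failed opening is to catch the `OSError` exception, which Python raises for operating-system level problems such as a missing file or insufficient permissions. The sketch below is only an illustration: the file name `no_such_file.txt` is hypothetical, and it is placed in a freshly created empty directory so that the failure is guaranteed.

```py
import errno
import os
import tempfile

# Hypothetical name of a file that cannot exist in a fresh, empty directory.
missing = os.path.join(tempfile.mkdtemp(), "no_such_file.txt")

try:
    stream = open(missing, "r")
except OSError as exc:
    # OSError covers a missing file, lack of permission, too many open files, etc.
    failed = True
    reason = exc.errno
    if reason == errno.ENOENT:
        print("No such file:", missing)
    else:
        print("The file could not be opened:", os.strerror(reason))
else:
    failed = False
    stream.close()
```

How the program reacts (asking for another name, creating the file, or giving up) depends on the application; the point is that the failure is noticed instead of being ignored.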

# File streams
The opening of the stream is not only associated with the file, but should also declare the manner in which the stream will be processed. This declaration is called an `open mode`.

If the opening is successful, the `program will be allowed to perform only the operations which are consistent with the declared open mode`.

There are two basic operations performed on the stream:

  - `read` from the stream: the portions of the data are retrieved from the file and placed in a memory area managed by the program (e.g., a variable);
  - `write` to the stream: the portions of the data from the memory (e.g., a variable) are transferred to the file.

There are three basic modes used to open the stream:

  - `read mode`: a stream opened in this mode allows `read operations only`; trying to write to the stream will cause an exception (the exception is named `UnsupportedOperation`, which inherits from both `OSError` and `ValueError`, and comes from the `io` module);
  - `write mode`: a stream opened in this mode allows `write operations only`; attempting to read the stream will cause the exception mentioned above;
  - `update mode`: a stream opened in this mode allows `both writes and reads`.
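The restriction imposed by the open mode can be observed directly. In the sketch below, the file name `modes.txt` is hypothetical, and `"w"` and `"r"` are the mode strings Python's `open()` uses for write mode and read mode. The program first creates a small file, then opens it in read mode and deliberately attempts a write.

```py
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")  # hypothetical file

stream = open(path, "w")     # "w" = write mode
stream.write("abc")
stream.close()

stream = open(path, "r")     # "r" = read mode
try:
    stream.write("xyz")      # not consistent with read mode
    write_rejected = False
except io.UnsupportedOperation:
    write_rejected = True
finally:
    stream.close()

print(write_rejected)                                    # True
print(issubclass(io.UnsupportedOperation, OSError))      # True
print(issubclass(io.UnsupportedOperation, ValueError))   # True
```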

Before we discuss how to manipulate the streams, we owe you some explanation. `The stream behaves almost like a tape recorder`.

When you read something from a stream, a virtual head moves over the stream according to the number of bytes transferred from the stream.

When you write something to the stream, the same head moves along the stream recording the data from the memory.

Whenever we talk about reading from and writing to the stream, try to imagine this analogy. The programming books refer to this mechanism as the `current file position`, and we'll also use this term.

<img src="img/pro-file1.png">
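The movement of the virtual head can be watched with the stream method `tell()`, which reports the current file position. This is a minimal sketch under a few assumptions: the file name `tape.bin` is hypothetical, and the stream is opened in binary mode (`"wb"` and `"rb"`) so that positions are counted in plain bytes.

```py
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "tape.bin")  # hypothetical file

stream = open(path, "wb")
stream.write(b"0123456789")    # ten bytes
stream.close()

stream = open(path, "rb")
start = stream.tell()          # 0: the head is at the beginning
first = stream.read(4)         # b"0123"
after_read = stream.tell()     # 4: the head moved by the 4 bytes read
rest = stream.read()           # b"456789": everything up to the end
at_end = stream.tell()         # 10
stream.close()

print(start, first, after_read, rest, at_end)
```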

It's necessary now to show you the object responsible for representing streams in programs.

# File handles
Python assumes that `every file is hidden behind an object of an adequate class`.

Of course, it's hard not to ask how to interpret the word adequate.

Files can be processed in many different ways - some of them depend on the file's contents, some on the programmer's intentions.

In any case, different files may require different sets of operations, and behave in different ways.

An object of an adequate class is `created when you open the file and annihilated at the time of closing`.

Between these two events, you can use the object to specify what operations should be performed on a particular stream. The operations you're allowed to use are imposed by `the way in which you've opened the file`.

In general, the object comes from one of the classes shown here:

<img src="img/pro-file2.png">

Note: you never use constructors to bring these objects to life. The only way you `obtain them is to invoke the function named open()`.

The function analyses the arguments you've provided, and automatically creates the required object.

If you want to `get rid of the object, you invoke the method named close()`.

The invocation will sever the connection between the object and the file, and will remove the object.

For our purposes, we'll concern ourselves only with streams represented by `BufferedIOBase` and `TextIOBase` objects. You'll understand why soon.
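You can check which class stands behind a stream with `isinstance()`. The sketch below assumes a hypothetical file named `handle.txt` and uses the class names as they are spelled in the `io` module, `io.TextIOBase` and `io.BufferedIOBase`. The mode `"w"` opens a text stream, while `"rb"` opens a binary one.

```py
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "handle.txt")  # hypothetical file

text_stream = open(path, "w")       # text stream
is_text = isinstance(text_stream, io.TextIOBase)
text_stream.close()

binary_stream = open(path, "rb")    # binary stream
is_buffered = isinstance(binary_stream, io.BufferedIOBase)
binary_stream.close()

print(type(text_stream).__name__, is_text)          # TextIOWrapper True
print(type(binary_stream).__name__, is_buffered)    # BufferedReader True
```

In both cases the object was produced by `open()`, not by calling a constructor directly.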



# File handles: continued
Due to the type of the stream's contents, `all the streams are divided into text and binary streams`.

The text streams are structured in lines; that is, they contain typographical characters (letters, digits, punctuation, etc.) arranged in rows (lines), as seen with the naked eye when you look at the contents of the file in the editor.

Such a file is written (or read) mostly character by character, or line by line.

The binary streams don't contain text but a sequence of bytes of any value. This sequence can be, for example, an executable program, an image, an audio or a video clip, a database file, etc.

Because these files don't contain lines, the reads and writes relate to portions of data of any size. Hence the data is read/written byte by byte, or block by block, where the size of the block usually ranges from one to an arbitrarily chosen value.
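Reading block by block could be sketched as follows. The file name `data.bin` and the block size of 4 bytes are arbitrary assumptions made for the example. Each call to `read()` returns at most one block, and an empty result signals that the end of the file has been reached.

```py
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.bin")  # hypothetical file

stream = open(path, "wb")
stream.write(bytes(range(10)))    # ten bytes with the values 0..9
stream.close()

BLOCK_SIZE = 4                    # arbitrarily chosen block size
blocks = []
stream = open(path, "rb")
while True:
    block = stream.read(BLOCK_SIZE)
    if not block:                 # an empty result means end of file
        break
    blocks.append(block)
stream.close()

print([len(block) for block in blocks])   # [4, 4, 2]
```

The last block is shorter than the others because the file size is not a multiple of the block size.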

Then comes a subtle problem. In Unix/Linux systems, the line ends are marked by a single character named `LF` (ASCII code 10) designated in Python programs as `\n`.

Other operating systems, especially those derived from the prehistoric CP/M system (which applies to Windows family systems, too) use a different convention: the end of line is marked by a pair of characters, `CR` and `LF` (ASCII codes 13 and 10), which can be encoded as `\r\n`.

This ambiguity can cause various unpleasant consequences.

If you create a program responsible for processing a text file, and it is written for Windows, you can recognize the ends of the lines by finding the `\r\n` characters, but the same program running in a Unix/Linux environment will be completely useless, and vice versa: the program written for Unix/Linux systems might be useless in Windows.

Such undesirable features of the program, which prevent or hinder the use of the program in different environments, are called `non-portability`.

Similarly, the trait of the program allowing execution in different environments is called `portability`. A program endowed with such a trait is called a `portable program`.
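Python's text streams already help with this kind of portability: when a file is opened for reading in text mode with the default settings, both line-ending conventions are presented to the program as a single `\n`. The sketch below illustrates this with a hypothetical file named `lines.txt`, which is first written in binary mode so that both conventions end up in the same file.

```py
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "lines.txt")  # hypothetical file

# Write raw bytes so that both line-ending conventions are stored in the file.
stream = open(path, "wb")
stream.write(b"windows line\r\nunix line\n")
stream.close()

# Text mode with default newline handling translates both endings to "\n".
stream = open(path, "r")
lines = stream.readlines()
stream.close()

print(lines)    # ['windows line\n', 'unix line\n']
```

A program that looks only for `\n` in text mode therefore sees the same lines regardless of which system produced the file.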