# Module 3 - Reading and writing files  
--------------------------------------------------------

## Table of Contents <a id='toc'></a>


&nbsp;&nbsp;&nbsp;&nbsp;[**Introduction**](#1)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Where is my file](#1.1)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[File opening modes](#1.2)

&nbsp;&nbsp;&nbsp;&nbsp;[**Reading from files**](#3)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Reading lines manually](#3.1)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[End-of-line characters](#3.2)  

&nbsp;&nbsp;&nbsp;&nbsp;[**Writing to files**](#4)

&nbsp;&nbsp;&nbsp;&nbsp;[**Exercises 3.1 and 3.2**](#5)

&nbsp;&nbsp;&nbsp;&nbsp;[**Additional Material**](#6)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Reading a file's entire content at once](#6.1)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Easier reading of .csv formatted file](#6.2)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Opening files without context managers](#6.3)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Reading files using a while loop](#6.4)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Some new cool syntax for Python >= 3.8](#6.5)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[What does the `flush` argument of `print()` do?](#6.6)  

<br>


## Introduction <a id='1'></a>
-------------------

In many use cases, you will want your Python code to **read from and write to files** stored on your local hard drive.

### Code design aspects

Here are a few important points to consider when working with files:

* Where is my file?
* Do I need to read the entire dataset/file into memory?
    * Accessing the hard drive is among the slower operations a computer performs. 
      Reading an entire file when you only need the first few lines is wasteful.
    * If you are reading a very large file, having the entire file in memory at 
      once may overburden your computer's memory (RAM).
* Are there concurrency issues?
    * If another program (or even your own code, if it contains a bug) writes to a file you 
      are currently reading, you could run into trouble.


### Practical aspects

* Whether it is for reading or for writing, operations on files occur using **file objects**
  (sometimes also referred to as **file handles**).
* In modern Python, files are opened with a so-called *context manager* statement, which takes
  care of opening the file and of properly closing it once you are done with it.
* This context manager takes the form of a **`with open(...) as ...`** statement. Note that context
  managers are code blocks, and their content must therefore be indented:

    ```python
    # Open a file in a given "mode" (e.g. read, write or append).
    with open(filename, mode) as file_handle:

        # Do something with the file... (note the indentation)
        # ...

    # When you are outside of the context manager block, the file is automatically closed.

    ```

<br>

### `filename` argument: where is my file located?  <a id='1.1'></a>

The first argument that `open()` takes is the **file name**, which **may include the file path** (relative or absolute) if the file is not located in the current working directory.

To pass the correct value to `filename`, you will need to know:
 1. where your file is,
 2. and where your *code is executing from* (i.e. what the working directory is when your code runs).
 
We leave question 1 to you, but with Jupyter Notebook, the second question has an easy answer: your code is executed where the Jupyter notebook is saved (with a more classical Python script, the code executes from the directory where you called `python`).
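
If you are ever unsure about the working directory, you can ask Python directly. The short sketch below uses the standard `pathlib` module; the path `data/myFile.txt` is only a hypothetical example and does not need to exist for the code to run.

```python
from pathlib import Path

# The working directory: relative paths passed to open() are resolved from here.
working_dir = Path.cwd()
print("Working directory:", working_dir)

# "data/myFile.txt" is a hypothetical relative path used for illustration only.
relative_path = Path("data/myFile.txt")

# .resolve() shows the absolute location that the relative path points to.
absolute_path = relative_path.resolve()
print("Resolves to:", absolute_path)
print("Does it exist?", relative_path.exists())
```
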

Now, we need to make sure that our code can find the file:
* **If the file is in the same folder as the code**, then you can just use the name of the file;
  no further modification is needed.
* **If the file is elsewhere**, you will have to specify a path to the file.
  This can be either a:
 
    * **Absolute path:** from the root of the computer to the file. e.g.,
      - `"C:\Users\JohnDoe\Desktop\ProjectP\data\myFile.txt"` (Windows)
      - `"/home/JaneDoe/Documents/ProjectP/data/myFile.txt"` (Linux, Mac)
    
    * **Relative path:** from your code to the file. e.g.,
      - `"data/myFile.txt"` - here the file is in a subdirectory named `data`.
      - `"../otherProject/myFile.txt"` - more complex: the file is in another subdirectory of the parent directory.
      - *Note:* `..` indicates the parent directory of the current directory.

<br>

The last case - `"../otherProject/myFile.txt"` - depicts a situation like this:
```
parentFolder:
 | 
 ├─ ProjectA
 |     └─ myCode.ipynb
 |
 └─ otherProject
       └─ myFile.txt
```


<br>

[Back to ToC](#toc)

### `mode` argument: the file opening mode  <a id='1.2'></a>

When using the `open()` function, the optional **`mode`** argument specifies the type of access you will have to the file. For instance, the `"r"` mode only allows reading the content of a file, not writing to it (this is useful to avoid accidentally modifying the file).

There are several possible modes when opening files:
* **`"r"`**: open file in read-only mode. This is the **default value** of the `mode` argument.
* **`"w"`**: open file in write-only mode, **overwriting** an existing file with the same name
  (otherwise the file is created).
* **`"a"`**: open file in write-only mode, **appending** to an existing file with the same name
  (otherwise the file is created).
* **`"rb"`**, **`"wb"`**, **`"ab"`**: same as `"r"`, `"w"` and `"a"`, but for reading from/writing to binary files (such as `.zip` or `.bmp` image files). The content is read/written as `bytes` objects without any decoding.

See `help(open)` or the [python online documentation](https://docs.python.org/3/library/functions.html#open) for a full list of modes and details about them.
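
The differences between these modes are easiest to see by trying them out. The following sketch works in a temporary directory (the file name `modes_demo.txt` is made up for the example) so that no real file gets overwritten.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modes_demo.txt")  # Hypothetical file name.

    with open(path, mode="w") as f:   # "w" creates the file...
        f.write("first\n")
    with open(path, mode="w") as f:   # ...and overwrites it if it already exists.
        f.write("second\n")
    with open(path, mode="a") as f:   # "a" adds to the end of the existing content.
        f.write("third\n")

    with open(path, mode="r") as f:   # "r" allows reading only.
        text_content = f.read()
        write_error = None
        try:
            f.write("forbidden\n")
        except OSError as error:      # io.UnsupportedOperation is a subclass of OSError.
            write_error = error

    with open(path, mode="rb") as f:  # "rb" returns raw bytes instead of str.
        raw_content = f.read()

print(repr(text_content))             # "first" is gone: it was overwritten.
print(type(raw_content))
print("Writing in 'r' mode failed with:", type(write_error).__name__)
```
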

<br>
<br>

[Back to ToC](#toc)


## Reading from files  <a id='3'></a>
--------------------------

To start reading a file, one creates a **file object** using the `open()` function with `mode="r"`.

```python
with open("path/to/file", mode="r") as file_handle:
    ...
```

<br>

### Reading lines manually <a id='3.1'></a>

Now that we have a **file object** (file handle), we can use it to read from the file.  
When reading a file with Python, you can think of your file object as a cursor that starts at the beginning of the file and progresses toward its end (it can go backward, but doing so is often a bit hacky).

<br>

<img src="img/file_pointer2.png" alt="a file pointer at opening" style="height:200px;" />

<br>


You can read elements (i.e. make the cursor advance) using the following methods:
 * **`.readline()`**: the most common, reads a single line.
 * **`.read()`**: reads the remainder of the file (from the current cursor position) in one go.
 * **`.readlines()`**: reads the remainder of the file (from the current cursor position) in one go,
   and returns a `list` of `str` (string), where each element of the list is a line of the file that
   was read: `["line 1...", "line 2...", "line 3...", ...]`.

> All these methods return the text they read as `str` (string objects). This means that if you read
  a number and want to use it as a `float` (e.g. to do math with it), you will need to convert it
  from `str` to `float` yourself (with `float(x)`).
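
As a small illustration of why this conversion matters, the sketch below writes two numbers to a temporary file (the file name `prices.txt` and its values are invented for the example), reads them back, and compares what happens with and without `float()`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "prices.txt")  # Hypothetical file name.
    with open(path, mode="w") as f:
        f.write("1.50\n2.25\n")

    with open(path, mode="r") as f:
        first = f.readline()
        second = f.readline()

print(type(first))  # The values are read as str, not as numbers.

# Using "+" on strings concatenates them...
concatenated = first.strip() + second.strip()
print(concatenated)

# ...while converting to float first gives the actual sum.
# float() ignores surrounding whitespace, including the trailing "\n".
total = float(first) + float(second)
print(total)
```
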

<br>

Let's focus on **`.readline()`**, the most used method. When we call it, it returns the current line as a `str` object and the file pointer progresses to the next line.
Note how the returned strings contain the **end of line character `\n`**.

<img src="img/file_pointer3.png" alt="a file pointer at opening" style="height:200px;" />

Each time we call `.readline()`, a new line is returned and the pointer progresses to the next line of the file.

<img src="img/file_pointer4.png" alt="a file pointer at opening" style="height:200px;" />

Once there are no more lines to read, `.readline()` **returns an empty string (`""`)**. This is how we know that the end of the file was reached.

<img src="img/file_pointer5.png" alt="a file pointer at the end of the file" style="height:200px;" />

<br>

**Let's see an example with actual code:**

In [None]:
with open("data/fresh_fruits.txt" , mode="r") as reading_handle:
    
    # Read and print the first line in the file.
    line = reading_handle.readline()  # Reads a single line from the file.
    print("line 1:", line)            # Print the line to the screen.
    
    # Keep reading and printing lines.
    # Problem: how many times should I do this?
    line = reading_handle.readline()
    print("line 2:", line)
    line = reading_handle.readline()
    print("line 3:", line)
    line = reading_handle.readline()
    print("line 4:", line)
    line = reading_handle.readline()
    print("line 5:", line)
    
    # There are only 5 lines in this file.
    # Once there are no more lines to read, .readline() returns an empty string.
    line = reading_handle.readline()
    print("line 6:", line)
    line = reading_handle.readline()
    print("line 7:", line)
    line = reading_handle.readline()
    print("line 8:", line)


<br>

Luckily for us, file objects are **iterable**, and we can thus use a **`for` loop** to read through files:
* Each iteration reads 1 line.
* The `for` loop ends when the entire file has been read.

In [None]:
with open("data/fresh_fruits.txt" , mode="r") as reading_handle:
    for line in reading_handle:
        print(line)


<br>

### End-of-line characters  <a id='3.2'></a>
As you can see in the example above, there are additional empty lines in between our prints. This is because the lines are read from the file with their **end-of-line** character, which is generally `\n`. Since `print()` adds its own `\n` after each call, every line ends up followed by two line breaks.  

To avoid this issue, one typically uses the **`.strip()`** method of strings, which removes any whitespace or end-of-line characters at the start and end of the string.

**Example:** using `.strip()` when reading content from a file.

In [None]:
with open("data/fresh_fruits.txt", 'r') as reading_handle:
    
    # Reminder: "enumerate()" is our friend that automatically enumerates items
    # and creates tuples of the form "(index, element)".
    for i, line in enumerate(reading_handle):
        print("line", i, ":", line.strip())
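
Keep in mind that `.strip()` removes *all* leading and trailing whitespace, not just the end-of-line character. If the spaces or tabs around your text matter, `.rstrip("\n")` is a more targeted alternative. The sketch below compares both on a made-up string, so no file is needed.

```python
# A made-up line, as it could be returned when reading a file.
line = "  indented item \n"

stripped = line.strip()              # Removes spaces AND the "\n" on both sides.
newline_removed = line.rstrip("\n")  # Removes only the trailing "\n".

print(repr(stripped))         # 'indented item'
print(repr(newline_removed))  # '  indented item '

# Another option is to keep the line as-is and tell print() not to add its own "\n".
print(line, end="")
```
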



<div class="alert alert-block alert-success">
    
### Micro Exercise 1 - reading a file with a for loop

* Read the content of `data/titanic_head.csv` and print it. Make sure that no empty lines are printed between the lines of the file.
    
</div>

<br>
<br>


[Back to ToC](#toc)

## Writing to files  <a id='4'></a>
----------------------

Writing to a file is achieved in pretty much the same way as reading from it, but the opening mode is now **`"w"`**.  
And instead of reading lines, we now `print()` them to the file.

* To print to a file - instead of standard-output - we will need to use the optional `file` argument.

In [None]:
help(print)

> *Note*: for people wondering what the `flush` argument does, please see the additional material.  
> Spoiler alert - it's a fairly minor argument, only useful in some edge cases.

In [None]:
with open("shopping_list.txt", mode="w") as f:
    print("onion", file=f)
    print(34, "potato", file=f)
    print("shrubbery", file=f)
    print("tomato sauce", file=f)


By passing the file object (file handle) to the **`file` argument** of the `print()` function, we now print to the file rather than to our terminal.
> **Reminder**: the **`"w"`** mode **overwrites** the opened file - if you use it on an existing file,
> its original content is lost.
>
> **Pro tip:** you can open more than one file using a single `with` statement by using multiple
> context managers in the same code block:
> * Either on the same line:
>      
>  ```python
>     with open("input.txt", 'r') as in_file, open("output.txt", 'w') as out_file:
>         do_something()
>  ```
>  
> * Or by having one context manager per line, and enclosing them all in parentheses `()`
>   as shown below (better for readability) - *note: this only works with python >= 3.10* :
>  ```python
>    with (
>        open("input.txt", 'r') as in_file,
>        open("output.txt", 'w') as out_file,
>    ):
>        do_something()
>  ```

<div class="alert alert-block alert-info">

#### Additional material

You might sometimes see some Python code - especially older code - that uses the **`.write()`** method of the **file object** to write content to a file.

* There are some differences between the `print()` function and `.write()`; the most important being
  that `.write()` only accepts a single string, does not do any formatting, and even the end-of-line
  (`\n`) characters need to be manually written.

</div>

In [None]:
with open("shopping_list2.txt", mode="w") as f:
    f.write("onion\n")
    f.write("{} potato\n".format(34))
    f.write("shrubbery\n")
    f.write("tomato sauce\n")
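
To see these differences side by side, here is a short sketch that writes the same line once with `print()` and once with `.write()`. It uses a temporary directory and a made-up file name (`write_vs_print.txt`) so that it does not touch the shopping lists created above.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "write_vs_print.txt")  # Hypothetical file name.

    with open(path, mode="w") as f:
        # print() converts its arguments to str, joins them with spaces
        # and appends "\n" for us.
        print(34, "potato", file=f)

        # .write() only accepts a single str: passing an int raises a TypeError.
        write_error = None
        try:
            f.write(34)
        except TypeError as error:
            write_error = error

        # With .write(), conversion, spacing and "\n" are all done by hand.
        # It returns the number of characters written.
        chars_written = f.write(str(34) + " potato\n")

    with open(path, mode="r") as f:
        lines = f.readlines()

print(lines)  # Both approaches produced the same line.
print("Characters written by .write():", chars_written)
print("Error raised by f.write(34):", type(write_error).__name__)
```
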
    

<br>

<div class="alert alert-block alert-success">

### Micro Exercise 2 - copy a file's content

* Write some code to read, print and copy the content of the `shopping_list.txt` file we just created.  
  Specifically, your code should:
  * Print the content of `shopping_list.txt` to screen.
  * Make a copy of the content in a new file `shopping_list_copy.txt`.
  * Make sure that no empty lines are printed between the lines of the file.
</div>

<br>
<br>

## Exercises 3.1 and 3.2   <a id='5'></a>
------------------------------

* Exercises are found in a separate Jupyter Notebook.
* If you have time, feel free to try the **additional exercises**.

<br>
<br>
<br>

[Back to ToC](#toc)

<div class="alert alert-block alert-info">

# Additional Material  <a id='6'></a>
-------------------------------------

</div>


### `readlines()` - reading a file's entire content at once <a id="6.1"></a>

Sometimes it is necessary to load the entire content of a file into memory at once. One way to achieve this is using the `.readlines()` method (note the plural marker "s" in the method's name).

As its name suggests, `readlines()` reads more than one line at a time (by default, all lines in the file).

In [None]:
with open("data/fresh_fruits.txt", mode="r") as f:

    entire_file = f.readlines()                      # The whole content of the file is now in memory.
    print(f.name, "has", len(entire_file), "lines:") # We can check how many lines there are
                                                     # before we start looping over them.
    for i, line in enumerate(entire_file):
        print("line", i, ":", line.strip())

print(entire_file)
print("The file has", len(entire_file), "lines.")

**Question:** while in our examples both `readline()` and `readlines()` work equally well, choosing one or the other can have important implications, especially when dealing with large files. Can you think of a drawback of using `readlines()`?

<br>
<br>
<br>

**Answer:** <font color='white'>using readlines() will (by default) load the entire file in memory, and this can be problematic when working with large files as is often the case in bioinformatics. Always consider the file sizes you are dealing with when using readlines().</font> (select to reveal)

<br>

### Easier reading of .csv formatted file using modules  <a id='6.2'></a>

`csv` (**C**omma **S**eparated **V**alue) is one of the most common file formats when it comes to storing tabular data. In this format, each line contains a fixed number of values (columns), separated by a specific character (typically `","`).


Classically, when reading these files, we want to create some form of structure which reflects their tabular structure.  
For instance, we can create a `list` where each row is a dictionary whose keys are the column names (found in the file's first line):


In [None]:
# List where we will store the content of the ".csv" file.
# Each line of the file will be stored as a separate element (dict)
# in the list.
data = []

with open("data/titanic_head.csv") as f:
    
    line = f.readline()
    
    # The column names are in the first line
    columnNames = line.strip().split(',')  # .split(',') is our best ally here : it cuts a str into a list. 
    
    for line in f:
        # Split the line in its different fields.
        sl = line.strip().split(',')
        
        # Now we map the fields onto their constituent columns.
        row = {}
        for i in range(len(sl)):
            row[columnNames[i]] = sl[i]

        # Add the dictionary for the current row (line) to the list.
        data.append(row)
        

print("Full data:")
for row in data:
    print(row)

print("***")
print("Name of passenger 4 : ", data[4]["Name"])
print("Age of passenger 4 : ", data[4]["Age"])

Sure, this works, but it is also fairly tedious to write.

Because csv is such a widespread format, Python comes with a module that can help us out:

In [None]:
import csv   # Imports a module from the standard library. Ignore this for now, we'll talk about it in the next notebook.

data = []

with open("data/titanic_head.csv") as f:
    
    reader = csv.DictReader(f)  # Returns a DictReader object.
    for row in reader:
        data.append(row)        # Row is a dictionary whose keys correspond to the columns!

for row in data:
    print(row)

In [None]:
print("full data:")
for row in data:
    print(row)

print("***")
print("Name of passenger 4 : " , data[4]["Name"])
print("Age of passenger 4 : " , data[4]["Age"])

That is much simpler.
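
Beyond saving a few lines of code, the `csv` module also handles cases that a plain `.split(',')` gets wrong, such as values that themselves contain a comma and are therefore enclosed in quotes. The sketch below illustrates this on a single made-up line (it is not taken from `titanic_head.csv`), wrapped in `io.StringIO` so that it behaves like an opened file.

```python
import csv
import io

# A made-up csv line whose second field contains a comma, and is therefore quoted.
raw_line = '1,"Doe, Jane",29\n'

# Naive approach: the quoted field is wrongly cut in two.
naive_fields = raw_line.strip().split(",")
print(len(naive_fields), "fields:", naive_fields)

# csv.reader understands the quotes and keeps the field in one piece.
parsed_fields = next(csv.reader(io.StringIO(raw_line)))
print(len(parsed_fields), "fields:", parsed_fields)
```
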

If the file uses **another field delimiter** (e.g. `';'`), you can specify it when creating the `DictReader`:

   ```python
   reader = csv.DictReader( readingHandle , delimiter=';')
   ```
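
Here is a small self-contained sketch of the `delimiter` argument in action. The data is invented for the example and stored in an `io.StringIO` object, which stands in for a file opened with `open()`.

```python
import csv
import io

# Made-up, semicolon-separated content; io.StringIO mimics an opened file.
semicolon_data = io.StringIO("Name;Age\nJane;29\nJohn;41\n")

reader = csv.DictReader(semicolon_data, delimiter=";")
rows = list(reader)  # Each row is a dictionary keyed by the column names.

for row in rows:
    print(row["Name"], "is", row["Age"], "years old")
```

Note that the values are still strings: `row["Age"]` is `"29"`, not `29`.
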

Additionally, libraries dedicated to data analysis often have functions that read directly from a csv file and create their specific data structure.

For instance, with `pandas` (that's a sneak peek into the day 3 modules ;-) ):

In [None]:
import pandas as pd  # ignore this, we'll talk about it in the next notebook.

# Reading the csv file as a pandas.DataFrame, their custom type for tabular data.
df = pd.read_csv("data/titanic_head.csv") 
df

<br>

### Opening files without context managers  <a id='6.3'></a>

So far we have always used the built-in `open()` function as a *context manager* (i.e. `with open() as ...`).
* But `open()` can also be used as a regular function.

**Don't forget to close the file!** (usually the `with` statement takes care of that for you)

In [None]:
file_handle = open("data/fresh_fruits.txt", "r")

for i, line in enumerate(file_handle):
    print("line", i, ":", line.strip())
        
# Don't forget to close the file!
file_handle.close()
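
One weakness of the code above is that if an error occurs while reading, the call to `.close()` is never reached. When a context manager cannot be used, a common pattern is to wrap the work in `try ... finally`, as sketched below. The example creates its own small temporary file (its name and content are made up) so that it can run anywhere.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fruits_demo.txt")  # Hypothetical file name.
    with open(path, mode="w") as f:
        f.write("apple\nbanana\n")

    file_handle = open(path, "r")
    try:
        lines = [line.strip() for line in file_handle]
    finally:
        # This runs whether or not an error occurred in the "try" block.
        file_handle.close()

    closed_after = file_handle.closed  # The .closed attribute tells us the state of the file.

print(lines)
print("File closed?", closed_after)
```

This is essentially what the `with` statement does for you behind the scenes, which is why it is the preferred form.
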

<br>

### Reading files using a while loop  <a id='6.4'></a>

Here is an example of file reading where, instead of a `for` loop, we use a `while` loop and `.readline()`.

In [None]:
reading_handle = open("data/fresh_fruits.txt", "r")
line = reading_handle.readline()
i = 0

# When the file has been entirely read, readline() returns an empty string and the while loop will end.
# In python a non-empty string evaluates to "True", and therefore we can use "while line" as a shortcut 
# for "while line != '' ".
while line:
    print("line", i, ":", line.strip())    # Note: we use the "strip()" method of "str" to remove the 
                                           # trailing "\n" (newline) of each line.
    line = reading_handle.readline()       # Don't forget this or you will have an infinite loop.
    i += 1

reading_handle.close()

<br>

### The walrus operator: a new syntax for Python >= 3.8  <a id='6.5'></a>

Starting with Python 3.8, a new operator **`:=`** (a.k.a. the **walrus operator**) makes it possible to assign a value to a variable (`line` in the example below) as part of a larger expression.

This can be used when reading a file to reduce the number of lines of code, as shown below:

In [None]:
with open("data/fresh_fruits.txt", "r") as f:
    i = 0
    while (line := f.readline()):             # := assigns values to variables as part of a larger expression. 
        print("line", i, ":", line.strip())   # It is known as the "walrus operator" and it works really well
        i += 1                                # together with the while-loop
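
The same pattern is handy when a file is not organized in lines, or when lines could be very long: `.read()` accepts an optional size, so a file can be processed in chunks of a fixed number of characters. The sketch below (Python >= 3.8) uses a tiny made-up file and a chunk size of 4 purely for illustration; real code would typically use a much larger chunk size.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "chunks_demo.txt")  # Hypothetical file name.
    with open(path, mode="w") as f:
        f.write("abcdefghij")

    chunks = []
    with open(path, mode="r") as f:
        # .read(4) returns at most 4 characters, and "" once the end of the
        # file is reached, which ends the loop.
        while (chunk := f.read(4)):
            chunks.append(chunk)

print(chunks)  # ['abcd', 'efgh', 'ij']
```
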


<br>

[Back to ToC](#toc)


### What does the `flush` argument of `print()` do? <a id='6.6'></a>

By default (i.e. when `flush=False`), the output of `print()` is buffered until either:
* The buffer is full, or
* a `\n` is printed to a terminal (or some other triggering event occurs, such as the output file being closed).

By passing `flush=True` to `print()`, the value passed to `print()` is printed immediately (along with everything that is still in the buffer at that time).

<br>

**Example:**
* To observe the difference between `flush=False` and `flush=True` in practice, you can copy/paste
  the following code into a python interpreter.
* **Important:** this **will not work as expected in Jupyter Notebook**, as Jupyter Notebook always flushes the buffer.

<br>

```python
import time

# Default behavior: the buffer is flushed when a `\n` gets printed at the end of the loop.
# Note that there are no `\n` printed inside the loop as we manually specify `end=" "`.
for i in range(5):
    print(i, end=" ")
    time.sleep(0.5)
print("end")

# With `flush=True` the values passed to `print()` are immediately printed.
for i in range(5):
    print(i, end=" ", flush=True)
    time.sleep(0.5)
print("end")
```
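
Buffering also applies when printing to a file, and there the effect can be observed without a terminal. The sketch below is an illustration under the assumption of a standard CPython setup with default buffering: it checks the size of the file on disk after a regular `print()` and again after a `print()` with `flush=True`. The file name is made up and the exact sizes may vary between systems, so none are shown here.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "flush_demo.txt")  # Hypothetical file name.

    with open(path, mode="w") as f:
        # Without flush, a short text usually stays in an in-memory buffer,
        # so the file on disk is typically still empty at this point.
        print("hello", file=f)
        size_without_flush = os.path.getsize(path)

        # With flush=True, everything buffered so far is written out immediately.
        print("world", file=f, flush=True)
        size_with_flush = os.path.getsize(path)

print("Size on disk before flushing:", size_without_flush)
print("Size on disk after flushing: ", size_with_flush)
```

This matters mostly when another program (or a person watching a log file) needs to see the output while your code is still running.
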