# Table of Contents <a id='toc'></a>


&nbsp;&nbsp;&nbsp;&nbsp;[Module 3 - Reading and writing files](#0)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Where is my file](#1)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[File opening modes](#2)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Reading from files](#3)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Reading lines manually](#4)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[End-of-line characters](#5)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Easier reading of .csv formatted file](#6)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Writing to files](#7)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Exercises 3.1, 3.2 and 3.3](#8)

&nbsp;&nbsp;&nbsp;&nbsp;[Additional Theory](#9)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Opening files without context managers](#10)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Reading files using a while loop](#11)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Some new cool syntax for Python >= 3.8](#12)

<br>

# Module 3 - Reading and writing files  <a id='0'></a>
--------------------------------------------------------


In many use cases, you will want your python code to **read/write from/to files** stored on your local hard drive.  
Here are a few important points to consider when working with files:

* Where is my file ?

* Do I need to read the entire dataset/file into memory?
    * remember the table about the time a computer takes to perform various actions. 
      Accessing the hard drive is among the slower operations. 
      Reading an entire file when you only need the first few lines will be costly.
    * if you are reading a very large file, then having the entire file in memory at 
      once may overburden your computer.
      
* Are there concurrency issues?
    * if another program (or even your own code, if you have messed up) writes to a file you 
      are currently reading, you could run into trouble.

Whether it is for reading or for writing, operations with files occur using **file objects** (sometimes also referred to as file **handles**).

In modern python, files are opened with a *context manager* statement, which takes care of properly opening the file and of closing it once you are done with it.

This context manager takes the form of a **with** statement:


```python
# Open a file in a given "mode" (e.g. read, write or append).
with open(filename, mode) as fileHandle:

    # Do something with the file... (note the indentation)
    # ...

# When you are outside the block of the with statement, it closes the file

```
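
To see the closing behaviour in action, here is a minimal sketch. The file name `demo.txt` and the temporary folder are assumptions made only for this illustration; the `.closed` attribute of a file object tells whether the file is currently closed.

```python
import os
import tempfile

# Hypothetical file, created in a temporary folder only for this demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(demo_path, mode="w") as fileHandle:
    print("hello", file=fileHandle)
    closed_inside = fileHandle.closed    # False: the file is open inside the block

closed_outside = fileHandle.closed       # True: leaving the block closed the file

print("closed inside the block:", closed_inside)
print("closed after the block:", closed_outside)
```

The variable `fileHandle` still exists after the block, but any attempt to read from or write to it would raise an error because the file is closed.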


<br>

[back to the toc](#toc)

<br>

## Where is my file  <a id='1'></a>

This is the very first step. 
Without a good idea of
 1. where your file is,
 2. and where your *code is executing from*,

you will get nowhere.

We leave question 1 to you.

With a jupyter notebook, the second question has an easy answer: your code is executed from the folder where the notebook is saved (with a more classical python script, the code executes from the folder where you called `python`).

Now, we need to make sure that our code can find the file.

If the file is in the same folder as the code, then you can just use the name of the file, with no need for further modification.

If the file is elsewhere, you will have to specify a path to the file, either:
 * absolute path: from the root of the computer to the file, e.g.
    - `'C:\Users\JohnDoe\Desktop\ProjectP\myFile.txt'` (Windows; in Python code, prefix the string with `r` or double the backslashes, since a backslash otherwise starts an escape sequence)
    - `'/home/JaneDoe/Documents/ProjectP/data/myFile.txt'` (Linux, Mac)
 * relative path: from your code to the file, e.g.
     - `'data/myFile.txt'` (the file is in the subfolder `data`)
     - `'../otherProject/myFile.txt'` (more complex: the file is in the `otherProject` subfolder of the parent folder)


This last case depicts a situation like this:
```

parentFolder:
 | 
 |- ProjectA:
 |     |
 |     |- myCode.ipynb
 |
 |- otherProject:
       |
       |- myFile.txt

```
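
When a file cannot be found, it often helps to print where the code is executing from and what a relative path resolves to. Here is a small sketch using the standard `os` module (imports are covered in a later notebook). The path is the hypothetical one from the tree above, so on your machine the last line will most likely report that no file was found.

```python
import os

# Folder the code is currently executing from (the "current working directory").
print("executing from:", os.getcwd())

# Hypothetical relative path, matching the folder tree above.
relative_path = os.path.join("..", "otherProject", "myFile.txt")

# Absolute path that this relative path resolves to from the current folder.
absolute_path = os.path.abspath(relative_path)
print("looking for:", absolute_path)

# True only if a file actually exists at that location.
print("file found:", os.path.isfile(relative_path))
```

Building the path with `os.path.join()` rather than typing the separators by hand lets the same code work on Windows, Linux and Mac.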



[back to the toc](#toc)

<br>

## File opening modes  <a id='2'></a>
When using the `open()` function, a **mode** can be passed as argument to the function. This specifies the type of access you will have on the file. For instance, the `'r'` mode only allows reading the content of a file and does not allow writing to it (this is useful to avoid accidentally modifying the file).

There are several possible modes when opening files:
* `'r'`: open file in read-only mode.
* `'w'`: open file in write-only mode, **overwriting** an existing file with the same name.
* `'a'`: open file in write-only mode, **appending** to an existing file with the same name
  (otherwise the file is created).
* `'rb'`, `'wb'`, `'ab'`: same as `'r'`, `'w'` and `'a'`, but for reading from or writing to binary files (such as `.zip` or `.bmp` image files). 
  The content is read/written as `bytes` objects without any decoding.

See `help(open)` or the [python online documentation](https://docs.python.org/3/library/functions.html#open) for a full list of modes and details about them.
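
The difference between these modes is easiest to see on a small throwaway file. The sketch below assumes a hypothetical file `modes_demo.txt` created in a temporary folder; it uses the `.write()` and `.read()` methods of file objects.

```python
import os
import tempfile

# Hypothetical file, created in a temporary folder only for this demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

with open(demo_path, mode="w") as handle:   # "w": the file is created
    handle.write("first\n")

with open(demo_path, mode="w") as handle:   # "w" again: the previous content is erased
    handle.write("second\n")

with open(demo_path, mode="a") as handle:   # "a": new content is added at the end
    handle.write("third\n")

with open(demo_path, mode="r") as handle:   # "r": content is returned as text (str)
    text_content = handle.read()

with open(demo_path, mode="rb") as handle:  # "rb": content is returned as raw bytes
    raw_content = handle.read()

print(repr(text_content))   # 'second\nthird\n' : "first" was overwritten
print(type(raw_content))    # <class 'bytes'>
```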

<br>


[back to toc](#toc)

<br>

## Reading from files  <a id='3'></a>
To start reading a file, one creates a **file object** using the `open()` function with `mode='r'`.


[back to the toc](#toc)

<br>

### Reading lines manually <a id='4'></a>

When reading a file with python, you have to consider your **file object** a little bit like a cursor which starts at the very beginning of your file, and progresses toward the end of the file (it can go backward, but it is often a bit hacky to do so).

You can read elements (ie. make the cursor advance) using the following methods:
 * `.readline()` : the most common, reads a single line
 * `.read()` : reads the rest of the file in one go
 * `.readlines()` : reads the rest of the file in one go, and puts each line as an element in a list
 

`.readline()` and `.read()` return the text they read as a `str`; `.readlines()` returns a `list` of `str`.

> the text is always read as `str` objects: this means that if you read a number and want to use it as a `float` (ie. do math with it), you will need to convert it from `str` to `float` first (with `float(x)`)
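
As an illustration of this conversion, here is a minimal sketch. It assumes a hypothetical file `prices.txt` containing one number per line, which the first lines of the example create in a temporary folder.

```python
import os
import tempfile

# Hypothetical file with one number per line, created only for this demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "prices.txt")
with open(demo_path, mode="w") as handle:
    handle.write("26.55\n7.5\n41.58\n")

with open(demo_path, mode="r") as handle:
    first = handle.readline()    # '26.55\n' : a str, not a number
    rest = handle.readlines()    # ['7.5\n', '41.58\n'] : a list of str

total = float(first)             # float() ignores the surrounding whitespace and "\n"
for text in rest:
    total += float(text)

print("type of what was read:", type(first))
print("sum of the values:", total)
```

Without the calls to `float()`, the `+` operator would glue the pieces of text together instead of adding numbers.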

In [9]:


with open("data/fresh_fruits.txt" , mode='r') as reading_handle:

    line = reading_handle.readline() # this function reads a single line from the file 
    print(line) # I print the line 

    line = reading_handle.readline()
    print('line:') 
    print(line) 
    line = reading_handle.readline()
    print('line:') 
    print(line) 
    line = reading_handle.readline()
    print('line:') 
    print(line) 
    line = reading_handle.readline()
    print('line:') 
    print(line) 
    # problem : how many times should I do this?

    line = reading_handle.readline()
    print('line:') 
    print(line) 
    line = reading_handle.readline()
    print('line:') 
    print(line) 
    line = reading_handle.readline()
    print('line:') 
    print(line) 
    
    
# no need to call reading_handle.close() here: the with statement has already closed the file


passionfruit

line:
oranges

line:
apples

line:
grapefruit (whole and segments)

line:
pointed sticks
line:

line:

line:



Once there are no more lines to read, `.readline()` returns an empty string (`''`).
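
The cursor behaviour can be checked on a tiny file. In this sketch, the two-line file `two_lines.txt` is a hypothetical stand-in created in a temporary folder; `.seek(0)` is the file object method that moves the cursor back to the start of the file.

```python
import os
import tempfile

# Hypothetical two-line file, created only for this demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "two_lines.txt")
with open(demo_path, mode="w") as handle:
    handle.write("oranges\napples\n")

with open(demo_path, mode="r") as reading_handle:
    first = reading_handle.readline()     # 'oranges\n'
    second = reading_handle.readline()    # 'apples\n'
    at_end = reading_handle.readline()    # '' : nothing left to read

    reading_handle.seek(0)                # move the cursor back to the start of the file
    again = reading_handle.readline()     # 'oranges\n' once more

print(repr(first), repr(second), repr(at_end), repr(again))
```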

Looping over the file object with a `for` loop makes for much tidier code: each iteration reads one line.

In [10]:
with open("data/fresh_fruits.txt" , mode='r') as reading_handle:
    i = 0
    for line in reading_handle:
        print("line", i, ":", line)
        i += 1


line 0 : passionfruit

line 1 : oranges

line 2 : apples

line 3 : grapefruit (whole and segments)

line 4 : pointed sticks
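
Because a `for` loop reads the file one line at a time, it can also stop early, which ties back to the earlier point about not going through an entire file when only its beginning is needed. A minimal sketch, assuming a hypothetical 1000-line file `many_lines.txt` that the first lines of the example create:

```python
import os
import tempfile

# Hypothetical 1000-line file, created only for this demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "many_lines.txt")
with open(demo_path, mode="w") as out_handle:
    for n in range(1000):
        print("record", n, file=out_handle)

first_lines = []
with open(demo_path, mode="r") as reading_handle:
    for line in reading_handle:
        if len(first_lines) == 3:
            break                  # stop looping: the remaining lines are never processed
        first_lines.append(line)

print(first_lines)   # ['record 0\n', 'record 1\n', 'record 2\n']
```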


<br>


[back to toc](#toc)

<br>

### End-of-line characters  <a id='5'></a>
As you can see in the example above, there are additional empty lines in between our prints. This is because the lines are read from the file with their **end-of-line** character, which generally is `\n`, and `print()` then adds a second one.  

To avoid this issue, one typically uses the **`.strip()`** method of strings, which removes any whitespace or end-of-line character at the start or end of the string.

**Example:** using `.strip()` when reading content from a file.

In [11]:
with open("data/fresh_fruits.txt", 'r') as reading_handle:

    for i, line in enumerate(reading_handle):  # enumerate() yields (index, item) pairs, so no manual counter is needed
        print("line", i, ":", line.strip())    # Note: we use the "strip()" method of "str" to remove the 
                                               # trailing "\n" (newline character) of each line.



line 0 : passionfruit
line 1 : oranges
line 2 : apples
line 3 : grapefruit (whole and segments)
line 4 : pointed sticks
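
Note that `.strip()` removes whitespace on both sides of the string. When leading or trailing spaces are meaningful, the related `.rstrip()` method gives finer control. The string below is a made-up example, not a line from the course file:

```python
# Made-up line with spaces on both sides and a final newline character.
line = "  grapefruit (whole and segments)  \n"

stripped = line.strip()             # whitespace removed on both sides
without_newline = line.rstrip("\n") # only the final "\n" removed
right_stripped = line.rstrip()      # all whitespace removed on the right side only

print(repr(stripped))          # 'grapefruit (whole and segments)'
print(repr(without_newline))   # '  grapefruit (whole and segments)  '
print(repr(right_stripped))    # '  grapefruit (whole and segments)'
```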


<br>

### Reading a file's entire content at once

Here is another way to read the fruity content of our file: the **`readlines()`** method (note the "s" in the name).

As its name suggests, `readlines()` reads more than one line at a time (by default, all lines in the file).

In [None]:
with open("data/fresh_fruits.txt" , 'r') as reading_handle:

    entire_file = reading_handle.readlines()                      # the whole content of the file is in memory
    print(reading_handle.name, "has", len(entire_file), "lines:") # and we can check how many lines there are
    for i, line in enumerate(entire_file):                        # before we start looping over them
        print("line", i, ":", line.strip())

print(entire_file)
print("The file has", len(entire_file), "lines.")

**Question:** while in our examples both `readline()` and `readlines()` work equally well, there can be important implications in using one or the other, especially when dealing with large files. Can you think of a drawback of using `readlines()`?

<br>
<br>
<br>

**Answer:** <font color='white'>using readlines() will (by default) load the entire file in memory, and this can be problematic when working with large files as is often the case in bioinformatics. Always consider the file sizes you are dealing with when using readlines().</font> (select to reveal)


#### Micro Exercise

* read the content of `data/titanic_head.csv` and print it. Make sure that no empty line is printed between lines.



[back to the toc](#toc)

<br>

### Easier reading of .csv formatted file <a id='6'></a>

csv (**C**omma **S**eparated **V**alues) is one of the most common file formats when it comes to storing tabular data.

In it, each line contains a fixed number of values (columns), separated by a specific character (typically `','`).


Classically, when reading these files, we want to create some form of data structure which reflects this tabular layout.

Here is for instance how to create a `list` where each row is a dictionary whose keys are the column names (found in the first line of the file):


In [28]:
data = []

with open('data/titanic_head.csv') as IN:
    
    line = IN.readline()
    # the column names are in the first line
    columnNames = line.strip().split(',')  # .split(',') is our best ally here : it cuts a str into a list 
    
    for line in IN:
        sl = line.strip().split(',')  # split the line into its different fields
        
        # now we map the fields onto their corresponding columns
        row = {}
        for i in range( len(sl) ):
            row[ columnNames[i] ] = sl[i]


        data.append(row) # store the row dictionary
        

print('full data:')
for row in data:
    print(row)

print('***')
print('Name of passenger 4 : ' , data[4]['Name'])
print('Age of passenger 4 : ' , data[4]['Age'])

full data:
{'Name': 'Bjornstrom-Steffansson Mr. Mauritz Hakan', 'Sex': 'male', 'Age': '28', 'Pclass': '1', 'Survived': '1', 'Family': '0', 'Fare': '26.55', 'Embarked': 'S'}
{'Name': 'Coleff Mr. Peju', 'Sex': 'male', 'Age': '36', 'Pclass': '3', 'Survived': '0', 'Family': '0', 'Fare': '7.5', 'Embarked': 'S'}
{'Name': 'Laroche Miss. Simonne Marie Anne Andree', 'Sex': 'female', 'Age': '3', 'Pclass': '2', 'Survived': '1', 'Family': '1', 'Fare': '41.58', 'Embarked': 'C'}
{'Name': 'Smith Miss. Marion Elsie', 'Sex': 'female', 'Age': '40', 'Pclass': '2', 'Survived': '1', 'Family': '0', 'Fare': '13', 'Embarked': 'S'}
{'Name': 'Dooley Mr. Patrick', 'Sex': 'male', 'Age': '32', 'Pclass': '3', 'Survived': '0', 'Family': '0', 'Fare': '7.75', 'Embarked': 'Q'}
{'Name': 'Kantor Mr. Sinai', 'Sex': 'male', 'Age': '34', 'Pclass': '2', 'Survived': '0', 'Family': '1', 'Fare': '26', 'Embarked': 'S'}
{'Name': 'Goodwin Miss. Lillian Amy', 'Sex': 'female', 'Age': '16', 'Pclass': '3', 'Survived': '0', 'Family': '

OK, this works, but it is a bit tedious to write.

Because csv is such a classical format, python actually contains things that can help us out:

In [29]:
import csv  # ignore this, we'll talk about it in the next notebook

data = []

with open('data/titanic_head.csv') as IN:
    
    reader = csv.DictReader(IN)
    for row in reader:
        # row is a dictionary whose keys correspond to the columns!
        data.append(row)


print('full data:')
for row in data:
    print(row)

print('***')
print('Name of passenger 4 : ' , data[4]['Name'])
print('Age of passenger 4 : ' , data[4]['Age'])

full data:
{'Name': 'Bjornstrom-Steffansson Mr. Mauritz Hakan', 'Sex': 'male', 'Age': '28', 'Pclass': '1', 'Survived': '1', 'Family': '0', 'Fare': '26.55', 'Embarked': 'S'}
{'Name': 'Coleff Mr. Peju', 'Sex': 'male', 'Age': '36', 'Pclass': '3', 'Survived': '0', 'Family': '0', 'Fare': '7.5', 'Embarked': 'S'}
{'Name': 'Laroche Miss. Simonne Marie Anne Andree', 'Sex': 'female', 'Age': '3', 'Pclass': '2', 'Survived': '1', 'Family': '1', 'Fare': '41.58', 'Embarked': 'C'}
{'Name': 'Smith Miss. Marion Elsie', 'Sex': 'female', 'Age': '40', 'Pclass': '2', 'Survived': '1', 'Family': '0', 'Fare': '13', 'Embarked': 'S'}
{'Name': 'Dooley Mr. Patrick', 'Sex': 'male', 'Age': '32', 'Pclass': '3', 'Survived': '0', 'Family': '0', 'Fare': '7.75', 'Embarked': 'Q'}
{'Name': 'Kantor Mr. Sinai', 'Sex': 'male', 'Age': '34', 'Pclass': '2', 'Survived': '0', 'Family': '1', 'Fare': '26', 'Embarked': 'S'}
{'Name': 'Goodwin Miss. Lillian Amy', 'Sex': 'female', 'Age': '16', 'Pclass': '3', 'Survived': '0', 'Family': '

That is much simpler.
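
The `csv` module is also more robust than a manual `.split(',')`: in many csv files, a value that itself contains a comma is wrapped in double quotes, and the `csv` module understands this convention. The sketch below uses made-up content (the quoted name is an assumption for illustration, not taken from the course file) and `io.StringIO`, which lets a string be read as if it were an open file.

```python
import csv
import io

# Made-up csv content: the name contains a comma, so it is wrapped in double quotes.
raw_text = 'Name,Age\n"Smith, Miss. Marion Elsie",40\n'

# Manual splitting cuts the name in two and keeps the quote characters.
naive_fields = raw_text.splitlines()[1].split(",")

# csv.reader treats the quoted text as a single field.
parsed_rows = list(csv.reader(io.StringIO(raw_text)))

print(naive_fields)     # ['"Smith', ' Miss. Marion Elsie"', '40']
print(parsed_rows[1])   # ['Smith, Miss. Marion Elsie', '40']
```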

If the file uses **another field delimiter** (eg, `';'`), you can specify it when creating the DictReader : 
```python
reader = csv.DictReader( readingHandle , delimiter=';')
```
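
Here is a self-contained sketch of the `delimiter` argument in use. The semicolon-separated content is made up for the illustration (it reuses two passenger names from above), and `io.StringIO` stands in for an open file. It also shows that the values come back as `str` and need converting before doing math with them.

```python
import csv
import io

# Made-up semicolon-separated content; io.StringIO lets us read a string like an open file.
raw_text = "Name;Age;Fare\nColeff Mr. Peju;36;7.5\nDooley Mr. Patrick;32;7.75\n"

reader = csv.DictReader(io.StringIO(raw_text), delimiter=";")

rows = []
for row in reader:
    row["Age"] = int(row["Age"])       # values are read as str: convert before doing math
    row["Fare"] = float(row["Fare"])
    rows.append(row)

mean_age = sum(row["Age"] for row in rows) / len(rows)

print(rows[0]["Name"], "paid", rows[0]["Fare"])   # Coleff Mr. Peju paid 7.5
print("mean age:", mean_age)                      # mean age: 34.0
```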

In addition, libraries dedicated to data analysis often have functions that read directly from a csv file and create their own specific data structure.

For instance, for `pandas` (that's a sneak peek into day3 modules ;-) )

In [32]:
import pandas as pd  # ignore this, we'll talk about it in the next notebook

df = pd.read_csv( 'data/titanic_head.csv' ) 
# reading the csv file as a pandas.DataFrame, their custom type for tabular data
df

| Unnamed: 0 | Name | Sex | Age | Pclass | Survived | Family | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | Bjornstrom-Steffansson Mr. Mauritz Hakan | male | 28.0 | 1 | 1 | 0 | 26.55 | S |
| 1 | Coleff Mr. Peju | male | 36.0 | 3 | 0 | 0 | 7.5 | S |
| 2 | Laroche Miss. Simonne Marie Anne Andree | female | 3.0 | 2 | 1 | 1 | 41.58 | C |
| 3 | Smith Miss. Marion Elsie | female | 40.0 | 2 | 1 | 0 | 13.0 | S |
| 4 | Dooley Mr. Patrick | male | 32.0 | 3 | 0 | 0 | 7.75 | Q |
| 5 | Kantor Mr. Sinai | male | 34.0 | 2 | 0 | 1 | 26.0 | S |
| 6 | Goodwin Miss. Lillian Amy | female | 16.0 | 3 | 0 | 5 | 46.9 | S |
| 7 | Olsen Mr. Karl Siegwart Andreas | male | 42.0 | 3 | 0 | 0 | 8.4 | S |
| 8 | Fleming Miss. Margaret | female |  | 1 | 1 | 0 | 110.88 | C |


<br>


[back to the toc](#toc)

<br>

## Writing to files  <a id='7'></a>
Writing to a file is achieved in pretty much the same way as reading from it, but the opening mode is now `'w'`.  
And instead of reading lines, we now `print()` them to the file.

In [None]:
with open("shopping_list.txt", mode="w") as f:
    print("onion", file=f)
    print(34, "potato", file=f)
    print("shrubbery", file=f)
    print("tomato sauce", file=f)


By passing the file object (or file handle) to the **`file` argument** of the `print()` function, we now print to the file rather than to our terminal.
> **Reminder**: the **`"w"`** mode **overwrites** the opened file - if you use it on an existing file,
  its original content is lost.  
> **Pro tip:** you can open more than one file using a single `with` statement:
```python
with open("input.txt", 'r') as in_file, open("output.txt", 'w') as out_file:
    do_something()
```

#### Additional info

You might sometimes see some Python code - especially older code - that uses the **`.write()`** method of the **file object**.

There are some differences between the `print()` function and the `.write()` method; the most important one is that `.write()` will not do any formatting, and even the end-of-line (newline, `\n`) characters need to be written manually.
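
To make these differences concrete, here is a small side-by-side sketch. The file name `print_vs_write.txt` and the temporary folder are assumptions made only for this illustration; `sep` and `end` are standard arguments of `print()`.

```python
import os
import tempfile

# Hypothetical file, created in a temporary folder only for this demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "print_vs_write.txt")

with open(demo_path, mode="w") as f:
    print(34, "potato", file=f)                      # converts 34 to str, adds a space and a final "\n"
    print(34, "potato", sep="-", end="!\n", file=f)  # separator and line ending can be customised
    f.write(str(34) + " potato\n")                   # write() only accepts a single str
    n_written = f.write("onion\n")                   # and returns the number of characters written

with open(demo_path, mode="r") as f:
    content = f.read()

print(repr(content))   # '34 potato\n34-potato!\n34 potato\nonion\n'
print(n_written)       # 6
```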

In [None]:
with open("shopping_list2.txt", mode="w") as f:
    f.write("onion\n")
    f.write("{} potato\n".format(34))
    f.write("shrubbery\n")
    f.write("tomato sauce\n")
    

<br>

### Micro Exercise - copy a file's content

* Write some code to read the content of the `shopping_list.txt` file we just created, in order to check
  that the writing worked properly. More specifically, you should:
    * Print the content to screen.
    * Make a copy of the content in a new file `shopping_list_copy.txt`.
    * Make sure that no empty line is printed between lines.

<br>
<br>

## Exercises 3.1 and 3.2   <a id='8'></a>


<br>

[back to toc](#toc)

<br>

# Additional Theory  <a id='9'></a>
-----------------------------



[back to the toc](#toc)

<br>

## Opening files without context managers  <a id='10'></a>
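You may also encounter code, especially older code, that opens files without a context manager. It works, but you then have to close the file yourself with `.close()` once you are done with it. The `with` statement shown earlier remains the recommended, "pythonic" way to deal with files. Here is what the explicit version looks like: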

In [13]:
fileHandle = open("data/fresh_fruits.txt", 'r')
for i, line in enumerate(fileHandle):
    print("line", i, ":", line.strip())
        
# don't forget to close the file!
fileHandle.close()

line 0 : passionfruit
line 1 : oranges
line 2 : apples
line 3 : grapefruit (whole and segments)
line 4 : pointed sticks
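
One weakness of the explicit form is that, if an error occurs before `.close()` is reached, the file stays open. A common way around this, sketched below, is a `try` / `finally` block, which is roughly what the `with` statement does for you. The file `fruits_demo.txt` and its content are assumptions made for the illustration; the conversion to `float` is there only to trigger an error.

```python
import os
import tempfile

# Hypothetical file, created in a temporary folder only for this demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "fruits_demo.txt")
setup_handle = open(demo_path, "w")
setup_handle.write("oranges\napples\n")
setup_handle.close()

error_seen = False
fileHandle = open(demo_path, "r")
try:
    for line in fileHandle:
        value = float(line)      # fails on the first line: "oranges" is not a number
except ValueError as error:
    error_seen = True
    print("could not convert:", error)
finally:
    fileHandle.close()           # runs whether or not an error occurred

print("file closed:", fileHandle.closed)
```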


<br>


[back to the toc](#toc)

<br>

## Reading files using a while loop  <a id='11'></a>

Here is an example of file reading where, instead of a for loop, we use a while loop and `.readline()`.

In [None]:
reading_handle = open("data/fresh_fruits.txt", 'r')
i = 0
line = reading_handle.readline()

# When the file has been entirely read, readline() returns an empty string and the while loop will end.
# In python a non-empty string evaluates to "True", and therefore we can use "while line" as a shortcut 
# for "while line != '' ".
while line:
    print("line", i, ":", line.strip())    # Note: we use the "strip()" method of "str" to remove the 
                                           # trailing "\n" (newline character) of each line.
    line = reading_handle.readline()       # Don't forget this or you will have an infinite loop.
    i += 1
    
reading_handle.close()

<br>

### Some new cool syntax for Python >= 3.8  <a id='12'></a>

Starting with Python 3.8, a new operator `:=` (a.k.a. the "walrus operator") allows assigning a value to a variable (here `line`) while at the same time evaluating it as part of an expression.

This can be used when reading a file to reduce the number of lines of code, as shown below:

In [None]:
reading_handle = open("data/fresh_fruits.txt", 'r')
i = 0
while (line := reading_handle.readline()):  # := assigns values to variables as part of a larger expression. 
    print("line", i, ":", line.strip())     # It is known as “the walrus operator” and it works really well
    i += 1                                  # together with the while-loop
reading_handle.close()
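
The same pattern is handy for reading a file in fixed-size pieces rather than line by line, since `.read(n)` returns at most `n` characters and an empty string once the end is reached. This is a minimal sketch that requires Python >= 3.8; `io.StringIO` stands in for an open file, and the chunk size of 10 characters is an arbitrary choice for the illustration.

```python
import io

# io.StringIO stands in for an open file here; the content is made up.
original_text = "passionfruit\noranges\napples\n"
stream = io.StringIO(original_text)

chunks = []
while (chunk := stream.read(10)):   # read at most 10 characters; '' ends the loop
    chunks.append(chunk)

print(len(chunks), "chunks read")
print("".join(chunks) == original_text)   # True: nothing was lost
```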