# Reading and writing files

In [None]:
# run this cell to play back an audio file, type Esc-o to hide player
from IPython.display import Audio
Audio("media/fl-intro.mp3")

Entering input via the keyboard and reading the output from a console is not a convenient way of dealing with large amounts of data. Most of the time, we will want to operate on files (e.g. TXT, CSV or JSON files, to mention a few). Most commonly, files are used to save data entered by the user (think of a .doc file). However, files can also be downloaded from an online API or database, used to store intermediate results of a computation for further processing, or used to store large amounts of output during batch processing. Luckily, dealing with files is straightforward in Python.

Note that the files we will work with sit in the same directory (and on the same machine) as the notebook. Depending on where you are running the notebook, you may therefore need to view and edit them through the Jupyter Notebook interface rather than through your local text editor.

# Text vs Binary files

There are essentially two types of data files: *text* and *binary*. 

**Text files** can be opened and modified with a plain text editor such as gedit, emacs or notepad. They are not necessarily written in plain English: a Python program, an [HTML](https://en.wikipedia.org/wiki/HTML) page, a [JSON](https://en.wikipedia.org/wiki/JSON) file, an [XML](https://en.wikipedia.org/wiki/XML) file, a [CSV](https://en.wikipedia.org/wiki/Comma-separated_values) file, a [FASTA](https://en.wikipedia.org/wiki/FASTA_format) file or a [PDB](https://en.wikipedia.org/wiki/Protein_Data_Bank_(file_format)) file are all text files. While not always meant for human consumption, text files are generally somewhat human-readable and are portable across different operating systems and editors, with very minor niggles. The disadvantage is that they tend to take up more disk space than an equivalent binary representation.

**Binary files** contain data in the internal machine representation. They are less portable and are generally used either to talk directly to the machine, or to store large amounts of data efficiently, or to protect intellectual property. Python bytecode (.pyc) files, compressed (.zip) files, Microsoft Word (.doc) files, executable files (.exe on Windows systems), image files (.jpg, .gif), and music files (.wav, .mp3) are all examples of binary files. An attempt at opening them in a text editor will only show gibberish on the screen.

Happily, many of the file types used in web programming, bioinformatics and other fields are text files with open format specifications, so we will not worry about binary files here (specialised libraries exist for the most common file types anyway). It is important to remember in your future professional life, however, that sometimes what you really need is not a bigger disk, but only a more compact data representation.
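
As a rough illustration of that last point, the sketch below stores the same numbers once as text and once in a binary encoding, using the standard `struct` module. The data and the variable names are made up for the example, and the exact text size will depend on the values being stored.

```python
import struct

# Hypothetical data: one thousand floating-point measurements
values = [i / 7 for i in range(1000)]

# Text representation: one number per line, as it would appear in a TXT file
as_text = "".join(repr(v) + "\n" for v in values).encode("ascii")

# Binary representation: each value packed as an 8-byte double
as_binary = struct.pack("<1000d", *values)

print("Text size:  ", len(as_text), "bytes")
print("Binary size:", len(as_binary), "bytes")  # 8 bytes per value
```

The binary version is smaller, but it can no longer be inspected in a text editor, which is the trade-off described above.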


# Reading files

Unless they just contain plain text meant for humans (e.g. a README file), text files normally follow detailed specifications that describe how the information they contain is formatted, and make it possible to retrieve and process it automatically. One of the simplest text file specifications is the [FASTA](https://en.wikipedia.org/wiki/FASTA_format) format, extremely common in bioinformatics. It is used to store the sequence of proteins (i.e. their chemical composition). Essentially, a FASTA file looks like this:
```
> Accession code | Name of protein
THEENTIRESEQUENCEARRANGEDONSEVERALLI
NESWITHOUTSPACESANDBROKENUPWITHNEWLI
NESINARBITRARYPLACESTOMAKEITLOOKTIDY
```
To see an actual example, run the cell below and click on the link it creates. If your browser won't open it, go to File->Open in the Jupyter interface or use a text editor of your choice (note: this example requires the file to exist and to be in the same directory as this notebook).

In [None]:
# This is just an IPython trick that creates a link to the file below.
# Click on the link to view the file
from IPython.display import FileLink
FileLink('P04637.fas')

Note the ```>``` sign that marks the header line and the ```|``` sign that separates the accession code from the name of the protein. The challenge is to read the file retrieving the accession code, the protein name and the entire sequence as one string, discarding the newlines that are just pretty-printing.

Reading this file in Python first requires a call to ```open()```, which returns a *file handle*:

In [None]:
FASTA=open("P04637.fas", "r")
print(FASTA)

The ```"r"``` indicates that we are opening this file for *reading*. The *handle* that is returned is a convenience object used by Python to keep track of all data relative to the file (location, type, position, etc.). Just treat it as you would any other Python object and assign it to a variable.
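
To get a feel for what the handle keeps track of, here is a minimal sketch that inspects a few of its attributes. It first writes a small throwaway file (the path and its contents are made up) so that it does not depend on *P04637.fas* being present.

```python
import os
import tempfile

# Create a small throwaway file; name and contents are invented for the example
path = os.path.join(tempfile.mkdtemp(), "example.fas")
OUT = open(path, "w")
OUT.write("> P00000 | Hypothetical protein\nMKTAYIAK\n")
OUT.close()

handle = open(path, "r")
print(handle.name)    # the path of the file on disk
print(handle.mode)    # 'r', the mode we asked for
print(handle.closed)  # False: the file is currently open
handle.close()
print(handle.closed)  # True once close() has been called
```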

### Using ```readlines()```

An easy way to read the whole file is now to use ```readlines()```:

In [None]:
# this won't work twice in a row, see below
everything=FASTA.readlines()
print(everything)

As you can see, ```readlines()``` returns a list of strings, each corresponding to a line in the file. This exhausts the contents of the file. All that remains to do now is to close the file to free the associated system resources:

In [None]:
FASTA.close()

Note that you cannot run ```readlines()``` on the same file twice without "rewinding" it (for example by closing it and reopening).
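
Another way of rewinding an open file is the ```seek()``` method of the file handle, which moves the current position back to a given offset. The sketch below demonstrates this on a small throwaway file whose name and contents are invented for the example.

```python
import os
import tempfile

# Write a three-line throwaway file to experiment with
path = os.path.join(tempfile.mkdtemp(), "example.fas")
OUT = open(path, "w")
OUT.write("> P00000 | Hypothetical protein\nMKTAYIAK\nQRQISFVK\n")
OUT.close()

FASTA = open(path, "r")
first = FASTA.readlines()   # reads all three lines
second = FASTA.readlines()  # nothing left to read: an empty list
FASTA.seek(0)               # move back to the beginning of the file
third = FASTA.readlines()   # all three lines again
FASTA.close()

print(len(first), len(second), len(third))
```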

In [None]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import Video; from ipywidgets import interactive, IntSlider
def _play(resize): display(Video(filename="media/fl-readlines.webm",data="",width=resize))
interactive(_play, resize=IntSlider(min=150, max=900, step=50, value=600, continuous_update=False, readout=False))

### Using ```readline()```

Using ```readlines()``` is less than ideal, as it slurps up the entire file in one go. This is inconvenient because the file may contain different types of data (in this case the header and the protein proper), or may be too large to fit entirely in memory. In general, it is better to process the file one line at a time, using repeated calls to ```readline()```. In this case, this also gives us the chance to assemble the protein into a single string. Here is how it works:

In [None]:
FASTA=open("P04637.fas", "r")
header=FASTA.readline()
protein="" # build up the sequence here
while True:
    ll=FASTA.readline()
    if ll=="": break
    protein+=ll.rstrip() # remove trailing '\n'
FASTA.close()
# Done. This is just pretty-printing
(code, name)= header.split('|')
print("Accession code:")
print(code)
print("\nName:")
print(name)
print("Protein:")
print(protein)
print("\nNumber of residues:")
print(len(protein))


In [None]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import Video; from ipywidgets import interactive, IntSlider
def _play(resize): display(Video(filename="media/fl-readline.webm",data="",width=resize))
interactive(_play, resize=IntSlider(min=150, max=900, step=50, value=600, continuous_update=False, readout=False))

### Using a ```for``` loop:

There is a more "Pythonic" way of doing this that uses the fact that a file is an *iterable* and therefore can be read directly using a ```for``` loop:

In [None]:
FASTA=open("P04637.fas", "r")
header=FASTA.readline()
protein="" 
for ll in FASTA: # couldn't be easier!
    protein+=ll.rstrip()
FASTA.close()
# Done. This is just pretty-printing
(code, name)= header.split('|')
print("Accession code:")
print(code)
print("\nName:")
print(name)
print("Protein:")
print(protein)
print("\nNumber of residues:")
print(len(protein))

In [None]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import Video; from ipywidgets import interactive, IntSlider
def _play(resize): display(Video(filename="media/fl-forloop.webm",data="",width=resize))
interactive(_play, resize=IntSlider(min=150, max=900, step=50, value=600, continuous_update=False, readout=False))

Reading other text file formats obviously requires knowledge of the relevant format specification. Common formats such as JSON and CSV (see below) are supported by Python through the standard library (see for example [here](https://docs.python.org/3/library/json.html)); others are supported through third-party libraries (e.g. [Biopython](https://biopython.org/)). In the remaining cases, the general strategies described above, coupled with some experimentation, should hopefully see you through.
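
As a small illustration of the standard library support, the sketch below saves a dictionary to a JSON file with the ```json``` module and reads it back. The record and the file name are made up for the example.

```python
import json
import os
import tempfile

# Hypothetical record to be saved
record = {"accession": "PXXXX", "name": "Hypothetical protein", "length": 8}
path = os.path.join(tempfile.mkdtemp(), "protein.json")

# json.dump() takes care of formatting the data as JSON text
OUTF = open(path, "w")
json.dump(record, OUTF, indent=2)
OUTF.close()

# json.load() parses the text back into Python objects
INF = open(path, "r")
loaded = json.load(INF)
INF.close()

print(loaded["accession"], loaded["length"])
```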

# Writing files

Writing files is not much more of a hassle than reading them. In fact, the main steps are the same:
* open a file
* write the content
* close it

The only differences are that the file must be opened in write mode, and that the ```write()``` method must be used to actually write data to it.

Example:

In [None]:
OUTF=open("greetings.txt", "wt")
# strings can be written directly
OUTF.write("Hello World!\n") # \n means newline
value=('The Answer', 42) # other stuff needs to be converted to a string
OUTF.write(str(value))
OUTF.close() # makes sure the buffer gets flushed to the disk

In [None]:
# This is the usual IPython trick that creates a link to the file below.
# Click on the link to view the file
from IPython.display import FileLink
FileLink('greetings.txt')

Try running the example above twice, and check the *greetings.txt* file each time. You will see that the content gets overwritten. If you do not like that behaviour, change ```"wt"``` to ```"at"``` in the ```open``` statement to open the file for *appending*. See for yourself what that does. Appending to a non-existent file simply creates it.
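
The difference between the two modes can also be checked programmatically. The following sketch opens the same throwaway file (the name is made up) twice in write mode and then twice in append mode, and finally reads back what is left.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "log.txt")

# "wt" truncates the file every time, "at" adds to the end of it
for mode in ("wt", "wt", "at", "at"):
    OUTF = open(path, mode)
    OUTF.write("opened with " + mode + "\n")
    OUTF.close()

INF = open(path, "r")
lines = INF.readlines()
INF.close()
print(lines)
```

Only one of the two lines written in ```"wt"``` mode survives, while both lines written in ```"at"``` mode are kept.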

In [None]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import Video; from ipywidgets import interactive, IntSlider
def _play(resize): display(Video(filename="media/fl-theanswer.webm",data="",width=resize))
interactive(_play, resize=IntSlider(min=150, max=900, step=50, value=600, continuous_update=False, readout=False))

So for instance, writing your imaginary protein out in FASTA format can be done this way:

In [None]:
# Your imaginary protein
accession="PXXXX"
description="My candidate protein - Homo programmaticus (Programmer)" 
sequence="""QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAEKMK
    ILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
    VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
	FLFLIKHNPTNTIVYFGRYWSP""" # whatever
# get rid of tabs, newlines and spaces in the above string
sequence=sequence.replace(' ','')
sequence=sequence.replace('\t', '')
sequence=sequence.replace('\n', '')

# Ok, now let's write it
OUTF=open(accession+".fas", "w")
# first the header
header="> " + accession + " | " + description+"\n"
OUTF.write(header)
# output in 60 char lines for convenience
linew=60
pos=0
while sequence[pos:pos+linew]!='':
    OUTF.write(sequence[pos:pos+linew]+"\n")
    pos+=linew
OUTF.close() # Done!
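
The line-splitting step can also be written with a ```for``` loop over ```range()```, using the line width as the step. This is only a sketch of an alternative; ```wrap_sequence``` is a hypothetical helper name, not part of any library.

```python
def wrap_sequence(sequence, width=60):
    """Split a sequence into chunks of at most `width` characters."""
    return [sequence[pos:pos + width] for pos in range(0, len(sequence), width)]

# A made-up sequence of 130 residues gives two full lines and a shorter one
chunks = wrap_sequence("A" * 130, width=60)
print([len(chunk) for chunk in chunks])  # [60, 60, 10]
```

Each chunk can then be written to the file followed by a newline, exactly as in the ```while``` loop above.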


In [None]:
# ...the same old trick... check the file out!
from IPython.display import FileLink
FileLink(accession+".fas")

In [None]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import Video; from ipywidgets import interactive, IntSlider
def _play(resize): display(Video(filename="media/fl-hprogrammaticus.webm",data="",width=resize))
interactive(_play, resize=IntSlider(min=150, max=900, step=50, value=600, continuous_update=False, readout=False))

# CSV Files

The [CSV (Comma-Separated Values)](https://en.wikipedia.org/wiki/Comma-separated_values) file format is a popular text file format that lists each record on a separate line. Data fields for the same entry are separated by commas (or occasionally semicolons or tabs). Many popular packages, Excel among others, can output data as CSV files. Reading and writing CSV files in Python is easy, either directly or via the ```csv``` module.

Example - the file *marksheet.csv* contains the following text:

```
Name, Surname, Mark
John, Smith, 50
Anne, Larsson, 65
Emiliano, Zapata, 95
Donald, Duck, 40
```

When you click on a link to this file, the browser is likely to suggest opening it as a spreadsheet:

In [None]:
# ...and again
from IPython.display import FileLink
FileLink('marksheet.csv')

We can, however, treat this as a normal text file. Suppose that we want to compute the average mark:

In [None]:
FILE=open("marksheet.csv","r")
FILE.readline() # skip header
total=0.0
students=0
for line in FILE:
    entries=line.split(',')
    total+=float(entries[2])
    students+=1
FILE.close()

print("Average: ", total/students)

In [None]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import Video; from ipywidgets import interactive, IntSlider
def _play(resize): display(Video(filename="media/fl-csvdiy.webm",data="",width=resize))
interactive(_play, resize=IntSlider(min=150, max=900, step=50, value=600, continuous_update=False, readout=False))

The ```csv``` module offers a slightly more convenient way of accessing the data:

In [None]:
import csv # import the csv module
FILE=open("marksheet.csv","r")

total=0.0
students=0
marksheet=csv.reader(FILE)
next(marksheet) # skip first line
for line in marksheet: # already split for us
    total+=float(line[2])
    students+=1
FILE.close()

print("Average: ", total/students)

For such a simple file it is hardly worth the trouble; however, the ```csv``` module can handle several variants of the format (essentially different separators and quoting conventions) and can try to guess automatically which variant a file uses. See the [online help](https://docs.python.org/3/library/csv.html) for details.
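
For instance, a different separator can be selected through the ```delimiter``` argument of ```csv.reader```. The sketch below uses an in-memory string with made-up data in place of a real file, and also passes ```skipinitialspace=True``` so that the space following each separator is dropped.

```python
import csv
import io

# In-memory stand-in for a semicolon-separated file (made-up data)
data = io.StringIO("Name; Surname; Mark\nJohn; Smith; 50\nAnne; Larsson; 65\n")

reader = csv.reader(data, delimiter=";", skipinitialspace=True)
header = next(reader)  # first line: the column names
rows = list(reader)    # remaining lines, already split into fields

print(header)
print(rows)
```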

The ```csv``` module also provides a ```writer``` object. For simple data, writing a ```csv``` file is essentially only a matter of putting the commas in the right places; the ```writer``` additionally takes care of quoting fields that themselves contain a comma.
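
Here is a minimal sketch of the ```writer``` object in use. The rows and the file name are invented for the example; note that one field contains a comma, which the ```writer``` protects with quotes.

```python
import csv
import os
import tempfile

# Made-up data; note the comma inside one of the fields
rows = [["Name", "Surname", "Mark"],
        ["John", "Smith", 50],
        ["Anne", "Larsson, Jr.", 65]]

path = os.path.join(tempfile.mkdtemp(), "marks_out.csv")
OUTF = open(path, "w", newline="")  # newline="" is recommended by the csv documentation
writer = csv.writer(OUTF)
writer.writerows(rows)              # one call writes all the rows
OUTF.close()

# Read the raw text back to see what was written
INF = open(path, "r", newline="")
text = INF.read()
INF.close()
print(text)
```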

In [None]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import Video; from ipywidgets import interactive, IntSlider
def _play(resize): display(Video(filename="media/fl-csvwcsv.webm",data="",width=resize))
interactive(_play, resize=IntSlider(min=150, max=900, step=50, value=600, continuous_update=False, readout=False))

# The ```with``` context manager

Python provides a context management control structure, introduced by the ```with``` keyword. The context manager takes care of initialising and mopping up resources your program is using - in this case, a file. In practice, instead of writing:

In [None]:
INFILE=open("marksheet.csv","r")
for line in INFILE:
    print(line, end='') # newline is already included in line
INFILE.close()

you can simply write:

In [None]:
with open("marksheet.csv","r") as INFILE:
    for line in INFILE:
        print(line, end='')
        
# the context manager automatically closes the file

The file is closed automatically for you at the end of the ```with``` block, when the context manager exits. This happens even if the block is left because of an error. The same applies if the file is opened for writing.

This is a very Pythonic way of doing things, and it is very effective at fixing a common problem, namely files being left open when they are no longer needed. Context managers also have applications in other areas, but these are mostly beyond the scope of this module.
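
The sketch below checks this behaviour through the ```closed``` attribute of the file handle, both for a file opened for writing and for a block that is interrupted by an error. The file name and the error are made up for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "notes.txt")

# Writing: the file is closed as soon as the block ends
with open(path, "w") as OUTF:
    OUTF.write("first line\n")
print(OUTF.closed)  # True

# Reading: the file is closed even if the block is left through an error
try:
    with open(path, "r") as INF:
        line = INF.readline()
        raise ValueError("something went wrong while processing")
except ValueError:
    pass
print(INF.closed)  # True
```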

In [None]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import Video; from ipywidgets import interactive, IntSlider
def _play(resize): display(Video(filename="media/fl-oldvsnew.webm",data="",width=resize))
interactive(_play, resize=IntSlider(min=150, max=900, step=50, value=600, continuous_update=False, readout=False))

**(C) 2014,2020 Fabrizio Smeraldi** ([f.smeraldi@qmul.ac.uk](mailto:f.smeraldi@qmul.ac.uk) - [web](http://www.eecs.qmul.ac.uk/~fabri/)), all rights reserved. In: "Computer Programming", School of Electronic Engineering and Computer Science, Queen Mary University of London.