## 9. Files
*Read and write*

## 9.1 Introduction

More often than not the data you need for your program will come from somewhere else - either from user input or a file. Especially for more complex data it becomes essential to be able to read in data files, do something with the data, and write out a new file with modified information or a set of analysis results.

## 9.2 Reading files
 
To read in a file you have to create a *file handle*. This is a sort of connection to the file that you can use to pull data from it. You create a connection to a file by using the **open()** function. Whenever you're done using the file, it's good practice to close the file handle. 

In [1]:
# Open the file
fileHandle = open("data/readfile.txt")  
# Close the file
fileHandle.close()
# Nothing happened...

All this does is create the connection; the file has not been read yet. To actually read the file, there are a couple of possibilities:
- `readline()` - read the next line of the file as one string (the first line on the first call).
- `readlines()` - read all of the lines in the file. Each line is one string. The lines are combined as a list of lines (strings). 
- `read()` - read the whole file as one string. 

In [2]:
fileHandle = open("data/readfile.txt")  
fileHandle.read()

'Now we have a fourth line \nBecause Python always appends the string'

In [3]:
fileHandle.close()

In [4]:
fileHandle = open("data/readfile.txt")   
fileHandle.readline()

'Now we have a fourth line \n'

In [5]:
fileHandle.close()

In [6]:
fileHandle = open("data/readfile.txt")   
fileHandle.readlines()

['Now we have a fourth line \n', 'Because Python always appends the string']

In [7]:
fileHandle.close()
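
The open/close pairs above can also be written with a `with` block, which closes the file handle automatically when the block ends, even if an error occurs inside it. The sketch below is only an illustration: it creates its own small file in a temporary directory (the `path` variable and the file contents are assumptions, chosen so the example runs anywhere) and also shows that repeated `readline()` calls walk through the file one line at a time.

```python
import os
import tempfile

# Assumption: a throwaway two-line file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as handle:
    handle.write("first line\nsecond line\n")

# The handle is closed automatically at the end of the with block.
with open(path) as handle:
    first = handle.readline()    # the first line, including its newline
    second = handle.readline()   # the next call continues where the last one stopped
    rest = handle.readline()     # an empty string signals the end of the file

print(repr(first), repr(second), repr(rest))
print("Closed after the block:", handle.closed)

# A file handle can also be looped over directly, one line at a time.
with open(path) as handle:
    stripped = [line.strip() for line in handle]
print(stripped)
```

Looping over the handle directly avoids loading the whole file into a list first, which can matter for large files.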

Knowing this, we can move on to more complex examples. First make sure to find the PDB file *TestFile.pdb* in your data folder or download [this fake PDB coordinate file for a 5 residue peptide](http://wiki.bits.vib.be/images/3/3a/TestFile.pdb) and save it in the data directory. 

In the example below we will read all the lines in the file (as separated by a newline character), and store them in the variable *lines*. Each element in this list corresponds to one line of the file! When this is done, we close the file. 

In [8]:
# Read in the file per line
fileHandle = open("data/TestFile.pdb")
lines = fileHandle.readlines()
 
# Close the file
fileHandle.close()
 
# Print number of lines in the file
print("There are:", len(lines), "lines in the file")

# Loop over the lines, and do some basic string manipulations
for line in lines:
    line = line.strip()  # Remove starting and trailing spaces/tabs/newlines
    print(line)

There are: 263 lines in the file
HEADER    IMMUNE SYSTEM                           09-APR-01   1ABC
TITLE     SOLUTION STRUCTURE OF A NON-EXISTENT PEPTIDE
REMARK   1 THIS IS A FAKE PDB FILE
SEQRES   1 A  5  ASP VAL GLN LEU GLN
MODEL        1
ATOM      1  N   ASP A   1     -10.341  -9.922   9.398  1.00  0.00           N
ATOM      2  CA  ASP A   1     -11.156  -9.786   8.164  1.00  0.00           C
ATOM      3  C   ASP A   1     -10.288  -9.894   6.915  1.00  0.00           C
ATOM      4  O   ASP A   1     -10.429 -10.831   6.127  1.00  0.00           O
ATOM      5  CB  ASP A   1     -11.868  -8.431   8.188  1.00  0.00           C
ATOM      6  CG  ASP A   1     -11.756  -7.739   9.533  1.00  0.00           C
ATOM      7  OD1 ASP A   1     -10.726  -7.077   9.778  1.00  0.00           O
ATOM      8  OD2 ASP A   1     -12.701  -7.858  10.342  1.00  0.00           O
ATOM      9  HA  ASP A   1     -11.893 -10.575   8.149  1.00  0.00           H
ATOM     10  HB2 ASP A   1     -11.430  -7.790 

In [9]:
line = lines[10]
line = line.strip().split()
line[-1]

'C'
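
Splitting on whitespace works for this file, but PDB is a fixed-width format, so neighbouring values can run together when they fill their columns completely. A possible alternative is to slice each field by its character position. The sketch below assumes the standard PDB column layout for ATOM records and uses one line copied from the output above; the variable names are made up for this illustration.

```python
# One ATOM record copied from the file output shown earlier.
sample = "ATOM      1  N   ASP A   1     -10.341  -9.922   9.398  1.00  0.00           N"

# Python slices start counting at 0, PDB columns start at 1.
serial = int(sample[6:11])          # columns 7-11: atom serial
name = sample[12:16].strip()        # columns 13-16: atom name
residue = sample[17:20].strip()     # columns 18-20: residue name
chain = sample[21]                  # column 22: chain code
number = int(sample[22:26])         # columns 23-26: residue number
x = float(sample[30:38])            # columns 31-38: x coordinate
y = float(sample[38:46])            # columns 39-46: y coordinate
z = float(sample[46:54])            # columns 47-54: z coordinate
element = sample[76:78].strip()     # columns 77-78: element symbol

print(serial, name, residue, chain, number)
print(x, y, z, element)
```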

Now you can do many other things with the data in the file, for example counting the number of times a carbon atom appears in it. 

In [10]:
# Open the file
fileHandle = open("data/TestFile.pdb")
 
# Read all the lines in the file (as separated by a newline character), and store them in the lines list
# Each element in this list corresponds to one line of the file!
lines = fileHandle.readlines()
 
# Close the file
fileHandle.close()
 
# Initialise the line counter
lineCount = 0
 
# Loop over the lines
for line in lines:
    columns = line.strip().split()
    if columns and columns[-1] == 'C':   # Skips blank lines. Alternatively, use "if ' C ' in line:"
        print(line, end='')     # Using the 'end' argument in the print because the line already contains a newline at the end,
                                # otherwise we would get double spacing.
        lineCount += 1

print("Number of lines with ' C ': {}".format(lineCount))

ATOM      2  CA  ASP A   1     -11.156  -9.786   8.164  1.00  0.00           C  
ATOM      3  C   ASP A   1     -10.288  -9.894   6.915  1.00  0.00           C  
ATOM      5  CB  ASP A   1     -11.868  -8.431   8.188  1.00  0.00           C  
ATOM      6  CG  ASP A   1     -11.756  -7.739   9.533  1.00  0.00           C  
ATOM     16  CA  VAL A   2      -8.505  -8.908   5.584  1.00  0.00           C  
ATOM     17  C   VAL A   2      -7.043  -8.954   6.011  1.00  0.00           C  
ATOM     19  CB  VAL A   2      -8.732  -7.652   4.719  1.00  0.00           C  
ATOM     20  CG1 VAL A   2      -8.420  -7.943   3.260  1.00  0.00           C  
ATOM     21  CG2 VAL A   2     -10.157  -7.142   4.872  1.00  0.00           C  
ATOM     32  CA  GLN A   3      -4.943 -10.183   5.959  1.00  0.00           C  
ATOM     33  C   GLN A   3      -4.103 -10.294   4.693  1.00  0.00           C  
ATOM     35  CB  GLN A   3      -4.735 -11.419   6.834  1.00  0.00           C  
ATOM     36  CG  GLN A   3  

You should find 75 lines. For those who know the PDB format a bit: the last column in this file is the element symbol, so you're finding all carbon atoms.
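
The same loop can be extended to count every element at once. The sketch below uses `collections.Counter` from the standard library on a handful of ATOM lines copied from the output above. These lines stand in for the list returned by `readlines()`, so the example runs without the file, and the code again assumes that the element symbol is the last whitespace-separated column.

```python
from collections import Counter

# Assumption: a few records standing in for the list returned by readlines().
lines = [
    "ATOM      1  N   ASP A   1     -10.341  -9.922   9.398  1.00  0.00           N\n",
    "ATOM      2  CA  ASP A   1     -11.156  -9.786   8.164  1.00  0.00           C\n",
    "ATOM      3  C   ASP A   1     -10.288  -9.894   6.915  1.00  0.00           C\n",
    "ATOM      4  O   ASP A   1     -10.429 -10.831   6.127  1.00  0.00           O\n",
    "\n",  # a blank line, to show that it is skipped
]

counts = Counter()
for line in lines:
    columns = line.strip().split()
    # Only look at atom records; this also skips blank lines.
    if columns and columns[0] == "ATOM":
        counts[columns[-1]] += 1

for element, count in counts.items():
    print(element, count)
```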

## 9.3 Writing a file
Writing a file is very similar, except that you have to let Python know you are writing this time by passing `'w'` as the second argument to `open()`. This second argument is the *mode*. If you only give one argument (the file name), Python assumes the mode is `'r'`, which stands for *reading* mode. 

For the sake of the example, we're writing some text to a new file, `writefile.txt`:

In [20]:
f = open('data/writefile.txt','w')
f.write('Now we have a new file \n')
f.write('Because Python automatically makes this file and writes some text to it.')
f.write('Btw, if you don\'t specify the newline characters, it will append the string at the end of the last line')
f.close()
f = open('data/writefile.txt')
text = f.read()
print(text)
f.close()

Now we have a new file 
Because Python automatically makes this file and writes some text to it.Btw, if you don't specify the newline characters, it will append the string at the end of the last line


**Be careful** - if the file exists already it will be overwritten without warning!
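
To avoid overwriting a file by accident, `open()` also accepts other modes: `'a'` appends to the end of an existing file, and `'x'` creates a new file but raises a `FileExistsError` if the file is already there. A small sketch, using a scratch file in a temporary directory (an assumption, so that nothing in your data folder is touched); the `with` blocks close each file automatically:

```python
import os
import tempfile

# Assumption: a scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "modes.txt")

with open(path, "w") as f:   # 'w' creates the file (or empties an existing one)
    f.write("first\n")

with open(path, "w") as f:   # 'w' again: the previous content is lost
    f.write("second\n")

with open(path, "a") as f:   # 'a' keeps the content and adds to the end
    f.write("third\n")

with open(path) as f:
    text = f.read()
print(text)

try:
    open(path, "x")          # 'x' refuses to open a file that already exists
    refused = False
except FileExistsError:
    refused = True
print("Refused to overwrite:", refused)
```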

The path is relative to the directory you're executing the program in, so the file ends up in the `data` folder there - have a look!

Now we will read in a file, collect all the lines that contain "VAL" in a new variable, and then write those lines out to a new file. 

----
### 9.3.1 Exercise
Read in the file from the previous example, and write out all lines that contain 'VAL' to a new file.

----

## 9.4 Advanced file reading and interpretation exercise 
Read in the TestFile.pdb atom coordinate file, print out the title of the file, and find all atoms whose coordinates are closer than 2 angstrom to the (x,y,z) coordinate (-8.7,-7.7,4.7). Print out the model number, residue number, atom name and atom serial for each. The model is indicated by lines like:
```
MODEL     1
```
The atom coordinate information is in lines like:
```
ATOM      1  N   ASP A   1     -10.341  -9.922   9.398  1.00  0.00           N
```
Here column 1 is always ATOM, column 2 is the atom serial, column 3 the atom name, column 4 the residue name, column 5 the chain code and column 6 the residue number, followed by the x, y and z coordinates in angstrom in columns 7, 8 and 9.

Note that the distance between two coordinates is calculated as the square root of (x1-x2)²+(y1-y2)²+(z1-z2)².
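
As a sketch, the formula translates to Python as shown below. The variable names are choices made for this illustration, and the second point is the CB atom of VAL 2 (atom serial 19) taken from the output shown earlier. The rest of the exercise is left to you.

```python
import math

# Reference coordinate from the exercise.
x1, y1, z1 = -8.7, -7.7, 4.7
# Coordinates of ATOM 19 (CB of VAL 2), copied from the file output above.
x2, y2, z2 = -8.732, -7.652, 4.719

# Square root of the summed squared differences along x, y and z.
distance = math.sqrt((x1 - x2)**2 + (y1 - y2)**2 + (z1 - z2)**2)

print("Distance:", distance)
print("Closer than 2 angstrom:", distance < 2.0)
```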

## 9.5 Next session

Go to our [next chapter](10_Functions.ipynb).