## 9. Files

More often than not the data you need for your program will come from somewhere else. Often "somewhere else" will be a file. Especially for more complex data, it becomes essential to be able to read in data files, do something with the data, and write out a new file with modified information or a set of analysis results.

## 9.1 Reading files
 
To read in a file you have to create a *file handle*. This is a sort of connection to the file that you can use to pull data from. You create a connection to a file by using the `open()` function.

In [None]:
# Open the file
with open("data/readfile.txt") as fileHandle:
    # Work with the file handle in this code block
    pass  # placeholder so that the with block is valid Python

# Outside the with block the file is closed, so fileHandle can no longer be read from

---

The above method of working with file handles is _modern_ Python. The `with` keyword tells Python to _manage a resource_
(in this case a file handle). Python will automatically `close()` the file when the `with` block is exited. However,
you may come across _legacy_ Python code in which you have to remember to `close()` the file handle yourself:

In [None]:
# Open the file
fileHandle = open("data/readfile.txt")  
# Close the file
fileHandle.close()
# Nothing happened...

---
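
The difference between the two styles can be checked with the `closed` attribute of a file handle. The following is a minimal sketch, assuming a throwaway file in a temporary directory; the file name `demo.txt` and all variable names are made up for this illustration.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained
# (the path and contents are hypothetical, not part of the course data).
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.txt")
with open(demo_path, "w") as fileHandle:
    fileHandle.write("just one line\n")

# Modern style: the file is closed as soon as the with block ends
with open(demo_path) as fileHandle:
    inside_with = fileHandle.closed    # still open inside the block
after_with = fileHandle.closed         # closed automatically afterwards

# Legacy style: the file stays open until close() is called
legacyHandle = open(demo_path)
before_close = legacyHandle.closed     # still open
legacyHandle.close()
after_close = legacyHandle.closed      # closed by hand

print(inside_with, after_with, before_close, after_close)
```

Note that the name `fileHandle` is still defined after the `with` block; it just refers to a file that has been closed.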

`open()` only creates a connection to a file; the file has not been read yet. In order to read in a file,
there are a couple of possibilities:
- `readline()` - read the next line of the file as one string (the first line on the first call).
- `readlines()` - read all of the lines in the file. Each line is one string. The lines are combined as a list of lines (strings). 
- `read()` - read the whole file as one string. 

Each method has its trade-offs and you will have to consider which is most appropriate for your use case.

**Example use cases**

1. If you're searching for the presence of a word or string in a file, given that the file is not too big, you can use `read()`.

1. If you want to process an enormously big file and from each line you need to extract, process and save the information, then it's better to read line by line, with `readline()` or within a for-each loop.
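
The two use cases above can be sketched as follows. This is only an illustration under stated assumptions: the file `big.txt` and its contents are made up, and the for-each loop runs over the file handle itself, which hands out one line per iteration much like repeated calls to `readline()`.

```python
import os
import tempfile

# Write a small stand-in for a "big" file (hypothetical content)
tmp_dir = tempfile.mkdtemp()
big_path = os.path.join(tmp_dir, "big.txt")
with open(big_path, "w") as fileHandle:
    for number in range(1, 6):
        fileHandle.write(f"line {number}\n")

# Use case 2: looping over the file handle yields one line at a time,
# so only the current line has to be kept in memory.
matching = []
with open(big_path) as fileHandle:
    for line in fileHandle:
        if "3" in line:
            matching.append(line.strip())

# Use case 1: read() loads the whole file as one string,
# which is fine for searching in a small file.
with open(big_path) as fileHandle:
    found = "line 4" in fileHandle.read()

print(matching, found)
```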

Try to understand the difference of these methods while you go through this section. 

Find the file `readfile.txt` in the data-folder:

```
This is the first line.
Here is a second one. 
And there is also a third line. 
```


1. Using `read()`:

Note that the three different lines are read in as one long string, newline characters included. This is how the `read()` method works. 

In [None]:
with open("data/readfile.txt") as fileHandle:
    data = fileHandle.read()

print(data)


2. Using `readline()`:
`readline()` reads in lines one-per-call. It starts with the first one. When you call the method again, it will read the second line. It's important to understand this as you can exploit this method in a for-each loop to access each line separately.

In [None]:
with open("data/readfile.txt") as fileHandle:  
    line1 = fileHandle.readline()
    line2 = fileHandle.readline()

print(f"{line1=} {line2=}")


3. Using `readlines()`:
Instead of reading the lines of a file one by one, you can also do it in one go. As explained above, each line is one string and all of the lines/strings are stored in a list. 

In [None]:
with open("data/readfile.txt") as fileHandle:
    all_lines = fileHandle.readlines()

all_lines

Knowing this we can move on to more complex examples. First make sure to find the PDB file `TestFile.pdb` in your data folder or download [this fake PDB coordinate file for a 5 residue peptide](http://wiki.bits.vib.be/images/3/3a/TestFile.pdb) and save it in the data directory. 

In the example below we will read all the lines in the file (as separated by a newline character), and store them in the variable *lines*. Each element in this list corresponds to one line of the file! Once the `with` block ends, the file is closed automatically. 

In [None]:
# Read in the file per line
with open("data/TestFile.pdb") as fileHandle:
    lines = fileHandle.readlines()
 
# Print number of lines in the file
print("There are:", len(lines), "lines in the file")

# Loop over the lines, and do some basic string manipulations
for line in lines:
    line = line.strip()  # Remove starting and trailing spaces/tabs/newlines
    print(line)

One line can be extracted by indexing that line out of the list. After removing the surrounding whitespace from that line, it is possible to split the line into elements with `split()`, which without arguments splits on any run of whitespace (spaces or tabs). This method allows us to extract values from a column in a file. 
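
To see what `strip()` and `split()` do to a single line before trying them on the file, here is a small sketch. It uses the example ATOM line shown in the exercise at the end of this chapter, with a newline added for illustration; the variable names are made up.

```python
# An ATOM line in the style of the PDB file, with a trailing newline as it would
# come out of readlines()
line = "ATOM      1  N   ASP A   1     -10.341  -9.922   9.398  1.00  0.00           N\n"

# No argument: split on any run of whitespace, so repeated spaces are not a problem
columns = line.strip().split()

# Splitting on a tab only works if the file really uses tabs as separators;
# this line contains none, so the result is a single element
tab_columns = line.strip().split("\t")

print(columns)
print(columns[-1])        # last column: the element symbol
print(len(tab_columns))
```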

In [None]:
line = lines[10]
line = line.strip().split()
line[-1]

Now you can do many other things with the data in the file, for example counting the number of times a carbon element appears in the file. 

In [None]:
# Open the file
with open("data/TestFile.pdb") as fileHandle:
    # Read all the lines in the file (as separated by a newline character), and store them in the lines list
    # Each element in this list corresponds to one line of the file!
    lines = fileHandle.readlines()

# Initialise the line counter
lineCount = 0
 
# Loop over the lines
for line in lines:
    columns = line.strip().split()
    if columns and columns[-1] == 'C':   # Skips blank lines; alternatively, use "if ' C ' in line:"
        print(line, end='')     # Using the 'end' argument in the print because the line already contains a newline at the end
                                # otherwise will get double spacing.
        lineCount += 1

print(f"Number of lines with ' C ': {lineCount}")

You should find 75 lines - note that in this case, for those who know the PDB format a bit, you're finding all carbon atoms.

## 9.3 Writing a file
Writing a file is very similar, except that you have to let Python know you are writing this time by passing `'w'` as the second argument of the `open()` function. This second argument is the *mode*. It is optional: if you only give one argument (the file that it has to read), the mode defaults to `'r'`, which stands for *reading* mode. 
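
The default mode can be checked on the file handle itself. The sketch below assumes a throwaway file in a temporary directory (the name `modes.txt` is made up) and reads the `mode` attribute of the handle returned by `open()`.

```python
import os
import tempfile

# Throwaway location so that no course data is touched (hypothetical file name)
tmp_dir = tempfile.mkdtemp()
mode_path = os.path.join(tmp_dir, "modes.txt")

# Writing: the mode has to be given explicitly
with open(mode_path, "w") as f:
    write_mode = f.mode
    f.write("first line\n")

# Reading: leaving out the mode is the same as passing 'r'
with open(mode_path) as f:
    default_mode = f.mode
    default_content = f.read()

with open(mode_path, "r") as f:
    explicit_content = f.read()

print(write_mode, default_mode, default_content == explicit_content)
```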

For the sake of the example, we're writing a new file and call it `writefile.txt`:

In [None]:
with open('data/writefile.txt','w') as f:
    f.write('Now we have a new file \n')
    f.write('Because Python automatically makes this file and writes some text to it.')
    f.write('Btw, if you don\'t specify the newline characters, it will append the string at the end of the last line')


with open('data/writefile.txt') as f:
    text = f.read()

print(text)

**Be careful** - if the file exists already it will be overwritten without warning!
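
If overwriting would be a problem, one option is the `'x'` mode of `open()`, which creates a new file and raises a `FileExistsError` when the file is already there. A minimal sketch, assuming a throwaway file in a temporary directory (the name `precious.txt` is made up):

```python
import os
import tempfile

# Throwaway location for the demonstration (hypothetical file name)
tmp_dir = tempfile.mkdtemp()
safe_path = os.path.join(tmp_dir, "precious.txt")

# First call: the file does not exist yet, so 'x' creates it
with open(safe_path, "x") as f:
    f.write("original content\n")

# Second call: the file exists, so 'x' refuses to open it
try:
    with open(safe_path, "x") as f:
        f.write("this would have replaced the original\n")
    overwritten = True
except FileExistsError:
    overwritten = False

with open(safe_path) as f:
    kept = f.read()

print(overwritten, repr(kept))
```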

The file is written to the `data` folder inside the directory you're executing the program in - have a look!

----
### 9.3.1 Exercise
Read in the file from the previous example, and write out all lines that contain 'VAL' to a new file.

----

## 9.4 Advanced file reading and interpretation exercise 
Read in the TestFile.pdb file, print out the title of the file, and find all atoms that have coordinates closer than 2 angstrom to the (x,y,z) coordinate (-8.7,-7.7,4.7). Print out the model number, residue number, atom name and atom serial for each; the model is indicated by:
```
MODEL     1
```
lines, the atom coordinate information is in:
```
ATOM      1  N   ASP A   1     -10.341  -9.922   9.398  1.00  0.00           N
```
lines, where column 1 is always ATOM, column 2 is the atom serial, column 3 the atom name, column 4 the residue name, column 5 the chain code, column 6 the residue number, followed by the x, y and z coordinates in angstrom in columns 7, 8 and 9.
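
As a starting point, the column description above can be applied to the single example line like this. It is only a sketch: the variable names are made up, and it relies on the columns being separated by whitespace as in the example line. Real PDB files use fixed-width columns, where neighbouring values can run together.

```python
# The example ATOM line from above
line = "ATOM      1  N   ASP A   1     -10.341  -9.922   9.398  1.00  0.00           N"

columns = line.split()

# "Column 1" in the text is index 0 in the list, and so on
record = columns[0]
atom_serial = int(columns[1])
atom_name = columns[2]
residue_name = columns[3]
chain_code = columns[4]
residue_number = int(columns[5])

# Coordinates are text in the file, so convert them to floats before calculating
x, y, z = (float(value) for value in columns[6:9])

print(record, atom_serial, atom_name, residue_name, chain_code, residue_number, x, y, z)
```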

Note that the distance between two coordinates is calculated as the square root of (x1-x2)²+(y1-y2)²+(z1-z2)².
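
The distance formula can be written as a small helper. This is one possible sketch, not the only way to do it: the function name `distance` and the variable names are made up, and the atom coordinates are those of the example ATOM line above.

```python
import math

def distance(coord1, coord2):
    """Return the distance between two (x, y, z) coordinates."""
    x1, y1, z1 = coord1
    x2, y2, z2 = coord2
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2)

reference = (-8.7, -7.7, 4.7)        # reference point from the exercise
atom = (-10.341, -9.922, 9.398)      # coordinates of the example ATOM line

d = distance(atom, reference)
is_close = d < 2                     # closer than 2 angstrom?

print(f"{d:.2f}", is_close)
```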

## 9.5 Next session

Go to our [next chapter](10_Functions.ipynb).