Processing Multiple Files and Writing Files
==================================

<div class="overview-this-is-a-title overview">
<p class="overview-title">Overview</p>
<p>Questions</p>
    <ul>
        <li> How do I analyze multiple files at once?
    </ul>
<p>Objectives:</p>
    <ul>
        <li> Import a Python library.
        <li> Use Python library functions.
        <li> Process multiple files using a `for` loop.
        <li> Print output to a new text file.
    </ul>
<p>Key Points:</p>
    <ul>
        <li> Use the `glob` function in the Python library `glob` to find all the files you want to analyze.
        <li> You can have multiple `for` loops nested inside each other.
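        <li> The `write` method accepts only strings, so convert numbers to strings before writing them to a file.
        <li> You must close a file to make sure Python actually writes everything to it. You can do this manually with `close()` or automatically with a `with` block.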
    </ul>    
</div>

## Processing multiple files



### Importing libraries

We are going to import two libraries. One is the `os` library, which provides functions for interacting with your computer's operating system; we used it in the last lesson to handle file paths. The other is the `glob` library, which contains functions for finding files whose names match a pattern. To analyze multiple files, we first need to specify where those files are located. The general syntax for importing a library and calling one of its functions is:

```python
import library_name
output = library_name.function_name(input)
```
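
As a concrete instance of this pattern, the short sketch below imports `os` and calls two of its functions. The variable names are chosen for illustration only.

```python
import os

# Pattern: output = library_name.function_name(input)
current_folder = os.getcwd()                    # takes no input, returns the working directory as a string
folder_exists = os.path.exists(current_folder)  # takes a path, returns True or False

print(current_folder)
print(folder_exists)
```

Note that `os.path` is a module inside `os`, so its functions are reached with two dots, as in `os.path.exists`.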

**Check your understanding** 

How would you use the `os.path` module to point to the directory where your PDB files are located?

```python
import os
os.getcwd()
```

```
'/workspaces/iqb-2024'
```

```python
os.listdir()
```

```
['environment.yml',
 'extras',
 '.gitignore',
 'LICENSE',
 'molecule_manipulation.ipynb',
 'iqb-101',
 'docking_single_ligand.ipynb',
 '.git',
 '.github',
 '.virtual_documents',
 'EC_class_ligands_search.ipynb',
 '.devcontainer',
 'README.md',
 'docking_prepration.ipynb',
 'images',
 '.ipynb_checkpoints',
 'filled_notebooks',
 'ligands',
 'protein_structures',
 'ligands_to_dock']
```

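```python
# %cd is an IPython/Jupyter magic command, not plain Python.
# In a regular Python script, use os.chdir('iqb-101') instead.
%cd iqb-101/
```

```
/workspaces/iqb-2024/iqb-101
```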


```python
folderpath = os.path.join("PDB_files")
print(folderpath)
```

```
PDB_files
```

```python
file_location = os.path.join("PDB_files","*.pdb")
print(file_location)
```

```
PDB_files/*.pdb
```


```python
import glob
filenames = glob.glob(file_location)
print(filenames)
```

```
['PDB_files/5eu9.pdb', 'PDB_files/3iva.pdb', 'PDB_files/3fgu.pdb', 'PDB_files/3vnd.pdb', 'PDB_files/6zt7.pdb', 'PDB_files/1ddo.pdb', 'PDB_files/4eyr.pdb', 'PDB_files/5veu.pdb', 'PDB_files/2pkr.pdb', 'PDB_files/7tim.pdb']
```

```python
print(type(filenames))
```

```
<class 'list'>
```
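
The order of the list returned by `glob.glob` depends on the file system, which is why the file names above are not alphabetical. If a reproducible order matters, wrapping the call in `sorted` is one option. The sketch below does not touch your own folders: it builds a temporary directory with three empty files whose names are made up for illustration.

```python
import glob
import os
import tempfile

# Build a throwaway folder containing two .pdb files and one .txt file
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["b.pdb", "a.pdb", "notes.txt"]:
        with open(os.path.join(tmp_dir, name), "w") as handle:
            handle.write("")

    # The * wildcard matches any run of characters, so this pattern
    # selects every file in tmp_dir whose name ends with .pdb
    pattern = os.path.join(tmp_dir, "*.pdb")
    matches = sorted(glob.glob(pattern))

    # Keep only the file names for display
    match_names = [os.path.basename(m) for m in matches]

print(match_names)
```

The `.txt` file is left out because it does not match the pattern, and `sorted` puts the two matches in alphabetical order.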


### Reading multiple files with nested for loops

```python
# Loop over every file path in the list
for f in filenames:
    # Open the file in read mode ('r') and read all of its lines into a list
    with open(f, 'r') as infile:
        data = infile.readlines()

    # Search the lines of this file for the resolution record
    for line in data:
        if 'RESOLUTION.' in line:
            res_line = line
            # print(f, res_line)  # uncomment to print the file name and its resolution line
            words = res_line.split()
            resolution = float(words[3])  # element 3 holds the resolution; float() converts it to a number
            print(resolution)
```

```
2.05
2.7
2.15
2.6
1.85
3.1
1.8
2.91
2.4
1.9
```
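
To see why `words[3]` holds the resolution, it can help to split a single line by hand. The line below is typed in for illustration and is assumed to mimic the layout of the resolution record in these files; compare it with a real file before relying on the index.

```python
# Illustrative line, typed in by hand rather than read from a file
sample_line = "REMARK   2 RESOLUTION.    2.05 ANGSTROMS."

# split() with no argument breaks the string at any run of whitespace
words = sample_line.split()
print(words)

# Counting from zero, element 3 is the resolution, still stored as a string
resolution = float(words[3])
print(resolution, type(resolution))
```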


Next, we want to print the PDB ID alongside each resolution. We can get the ID from the file path, because each file is named after its PDB ID.

```python
first_file = filenames[0]
print(first_file)

file_name = os.path.basename(first_file)
print(file_name)
```

```
PDB_files/5eu9.pdb
5eu9.pdb
```


**Check your understanding** 

How would you extract the PDB ID from the example above?

`file_name.split(".")` splits the string into a list. The default delimiter is whitespace, but here we split at the period, so the PDB ID is the first element of the resulting list.
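
An alternative worth knowing is `os.path.splitext`, which separates a file name from its extension and returns both pieces as a tuple. This is a sketch of a different approach, not the one used in the rest of this lesson.

```python
import os

file_name = "5eu9.pdb"

# splitext returns a (root, extension) tuple, splitting at the last period
root, extension = os.path.splitext(file_name)
print(root)
print(extension)
```

Because it splits only at the last period, this approach also works for file names that contain more than one period.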

```python
# print(filenames)

for f in filenames:
    # Get the PDB ID from the file name
    file_base = os.path.basename(f)
    split_file_base = file_base.split(".")
    PDB_ID = split_file_base[0]
    #print(PDB_ID)

    # Read the data
    with open(f, 'r') as infile:
        data = infile.readlines()

    # Find the resolution record and print it with the PDB ID
    for line in data:
        if 'RESOLUTION.' in line:
            res_line = line
            words = res_line.split()
            resolution = float(words[3])
            print(PDB_ID, ": ", resolution, " Angstroms")
```

```
5eu9 :  2.05  Angstroms
3iva :  2.7  Angstroms
3fgu :  2.15  Angstroms
3vnd :  2.6  Angstroms
6zt7 :  1.85  Angstroms
1ddo :  3.1  Angstroms
4eyr :  1.8  Angstroms
5veu :  2.91  Angstroms
2pkr :  2.4  Angstroms
7tim :  1.9  Angstroms
```


## Printing to a File



```python
with open('file_name.txt', 'w') as filehandle:
    # take some actions here, then use filehandle.write() to add content to the file
    filehandle.write('content')
```
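
The `write` method accepts only strings, so numbers must be converted first, for example with `str()`. The sketch below shows both the error and the fix. It writes to a temporary folder, and the file name `demo.txt` is a placeholder.

```python
import os
import tempfile

resolution = 2.05

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")

    with open(demo_path, "w") as filehandle:
        try:
            filehandle.write(resolution)  # a float is not a string, so this fails
        except TypeError as error:
            print("write() rejected the float:", error)

        # Converting to a string first works; \n ends the line
        filehandle.write(str(resolution) + "\n")

    # Read the file back to confirm what was written
    with open(demo_path, "r") as filehandle:
        contents = filehandle.read()

print(repr(contents))
```

Unlike `print`, `write` does not add a line ending on its own, which is why the `\n` is appended explicitly.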

```python
# 'w' opens the file for writing: it creates the file if it does not
# already exist and overwrites it if it does
with open('resolutions.txt', 'w') as datafile:
    for f in filenames:

        # Get PDB ID
        file_base = os.path.basename(f)
        split_file_base = file_base.split(".")
        PDB_ID = split_file_base[0]
        #print(PDB_ID)

        # Read the data
        with open(f, 'r') as infile:
            data = infile.readlines()

        for line in data:
            if 'RESOLUTION.' in line:
                res_line = line
                words = res_line.split()
                resolution = float(words[3])
                datafile.write(F'{PDB_ID} \t {resolution} \t Angstroms \n')
                # print(PDB_ID, ": ", resolution, " Angstroms")
```

```python
os.listdir()  # resolutions.txt should now appear in the directory listing

# Read the new file back and print each line as a list of words
with open('resolutions.txt', mode='r', encoding='utf-8') as f:
    for line in f.readlines():
        print(line.split())
# No f.close() is needed: the with block closes the file automatically
```

```
['5eu9', '2.05', 'Angstroms']
['3iva', '2.7', 'Angstroms']
['3fgu', '2.15', 'Angstroms']
['3vnd', '2.6', 'Angstroms']
['6zt7', '1.85', 'Angstroms']
['1ddo', '3.1', 'Angstroms']
['4eyr', '1.8', 'Angstroms']
['5veu', '2.91', 'Angstroms']
['2pkr', '2.4', 'Angstroms']
['7tim', '1.9', 'Angstroms']
```


## A note about F string formatting
The `F'string'` notation, which works with both `print` and `write`, lets you format strings in many ways. You could include other words or whole sentences. For example, we could change the file-writing line to

`datafile.write(F'For the PDB ID {PDB_ID} the resolution is {resolution} Angstroms.\n')`

where anything inside the braces is a Python expression, usually a variable name, that is replaced by its value.
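
F-strings also accept an optional format specification after a colon inside the braces, which controls details such as the number of decimal places. A small sketch, with the two values typed in by hand:

```python
PDB_ID = "3iva"
resolution = 2.7

# {resolution:.2f} formats the number with exactly two decimal places
line = F'For the PDB ID {PDB_ID} the resolution is {resolution:.2f} Angstroms.\n'
print(line)
```

With a fixed number of decimal places, the values line up more neatly when the output file is opened in a text editor.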

**Project**

You can complete this project to test your skills. It should be completed when this material is used in a long workshop, or if you are working through this material independently.
    
The goal of this exercise is to extract the Enzyme Commission Class for a series of enzyme structures in PDB files and write them to a text file. The files are located in the `PDB_files` folder. If you open any of these files in a text editor and search for the term "EC:" you will find a listing that looks like this: 
    
    COMPND   6 EC: 1.2.1.13;
    
You are probably familiar with these numbers, but just in case - the Enzyme Commission class tells you the function of an enzyme in a hierarchical format. You can learn more at the [BRENDA EC Explorer](https://www.brenda-enzymes.org/ecexplorer.php?browser=1&f[nodes]=21&f[action]=close&f[change]=21#21).  
  
**Your assignment** is to parse the files in the `PDB_files` folder and write a new file named `EC_class.txt` that contains the PDB ID and EC class for each of these enzymes. When you open the file in your text editor, it should look like this:

7tim 	  5.3.1.1  
6zt7 	  3.2.1.55  
5eu9 	  4.2.1.11  
3iva 	  2.1.1.13  
2pkr 	  1.2.1.13  
3vnd 	  4.2.1.20  
5veu 	  1.14.14.1
    
*Hint*

It helps when you are writing code to break up what you have to do into steps. Overall, we want to get information from the file. How do we do that?
 
Thinking through the steps needed for this assignment, you might come up with a list like this:

1. Open the file for reading.
1. Read the data in the file.
1. Loop through the lines in the file.
    1. Check each line for the information we want.
    1. Extract the desired information and write it to a file.

It can be helpful when you code to write out these steps and work on it in pieces. Try to write the code using these steps. Note that as you write the code, you may come up with other steps!  
  
First, think about what you have to do for step 1, and write the code for that. Next, think about how you would do step 2 and write the code for that. You can troubleshoot each step using print statements.
  
The steps build on each other, so you can work on getting each piece written before moving on to the next.  
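
As a starting point for the extraction step, the sketch below splits the example `COMPND` line from the project description and strips the trailing semicolon. It handles only that one hand-typed line and assumes the EC number is the last word on the line; looping over the files and writing the output is left to you.

```python
# The example line from the project description, typed in by hand
sample_line = "COMPND   6 EC: 1.2.1.13;"

words = sample_line.split()
print(words)

# The EC number is the last word; rstrip(";") removes the trailing semicolon
ec_class = words[-1].rstrip(";")
print(ec_class)
```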