File Parsing
============

<div class="overview-this-is-a-title overview">
<p class="overview-title">Overview</p>
<p>Questions</p>
    <ul>
        <li> How do I sort through all the information in a text file and extract particular pieces of information?
    </ul>
<p>Objectives:</p>
    <ul>
        <li> Open a file and read in its contents line by line.
        <li> Search for a particular string in a file.
        <li> Manipulate strings and change data types.
        <li> Print to a new file.
    </ul>
<p>Key Points:</p>
    <ul>
        <li> You should use the os.path module to work with file paths.
        <li> One of the most flexible ways to read in the lines of a file is the `readlines()` method of a file object.
        <li> An <ital>if</ital> statement can be used to find a particular string within a file.
        <li> The split() method can be used to separate the elements of a string.
        <li> You will often need to recast data into a different data type when it was read in as a string.
    </ul>
</div>

## Working with file paths - the `os.path` module

In [None]:
import os

In [74]:
# Show the folder I am in
os.getcwd()

'/workspaces/iqb-2024/iqb-101'

In [None]:
# Change into the desired directory (cd is an IPython magic command, not plain Python)
cd iqb-101/

In [None]:
# List the files in the current directory
os.listdir()

In [None]:
os.listdir('PDB_files')

## Absolute and relative paths

In [None]:
# Build the file path in a platform-independent way (this path is relative to the current directory)
filepath = os.path.join('PDB_files', '4eyr.pdb')
print(filepath)
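
The path built by `os.path.join('PDB_files', '4eyr.pdb')` is a relative path: Python resolves it against the current working directory. As a minimal sketch (the names `relative_path` and `absolute_path` are illustrative and not part of the lesson), `os.path.abspath` converts it to an absolute path and `os.path.isabs` reports which kind you have:

```python
import os

# Relative path: interpreted relative to the current working directory.
relative_path = os.path.join('PDB_files', '4eyr.pdb')

# Absolute path: the current working directory is prepended.
absolute_path = os.path.abspath(relative_path)

print(os.path.isabs(relative_path))   # False
print(os.path.isabs(absolute_path))   # True
print(absolute_path)
```

A relative path only works while you stay in the directory it was written for. An absolute path keeps working after you change directories, but it is specific to your machine.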

In [None]:
# Open the file and read its lines into a list
with open(filepath) as f:
    data = f.readlines()

In [None]:
print(type(data))
print(data[1])
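
`readlines()` returns a list with one string per line, and each string usually still ends with its newline character. Because list indices start at zero, `data[1]` is the second line of the file. The sketch below shows this on a small throwaway file, so it does not depend on `PDB_files`. The names `demo_path` and `demo_lines` are assumptions made for illustration.

```python
import os
import tempfile

# Write a small temporary file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, 'demo.txt')
    with open(demo_path, 'w') as f:
        f.write('first line\nsecond line\n')

    with open(demo_path) as f:
        demo_lines = f.readlines()

print(type(demo_lines))        # <class 'list'>
print(len(demo_lines))         # 2
print(repr(demo_lines[1]))     # 'second line\n'
print(demo_lines[1].strip())   # second line
```

The trailing newline is why `print(line)` leaves a blank line after each line of the file; `strip()` removes it.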

**Check Your Understanding**

Check that your file was read in correctly by determining how many lines are in the file.

In [None]:
# Read in the file and count its lines
with open(filepath) as f:
    data = f.readlines()

data_length = len(data)
print('There are', data_length, 'lines in the file.')

## Searching for a pattern in your file

In [None]:
for line in data:
    print(line)

In [None]:
for line in data:
    if "HETNAM" in line:
        print(line)

In [None]:
for line in data:
    if "HETNAM" in line:
        HETNAM_line = line
        words = HETNAM_line.split()
        print(words)

In [None]:
abbrev = words[1]
print(abbrev)
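
With no arguments, `split()` breaks a string at any run of whitespace and returns the pieces as a list, so the second element (`[1]`) is the word that follows the record name. Here is a minimal sketch on a single HETNAM-style line; `sample_line` is an illustrative string typed in by hand, not a line read from 4eyr.pdb.

```python
# Illustrative HETNAM-style line (typed by hand, not read from 4eyr.pdb).
sample_line = 'HETNAM     FAD FLAVIN-ADENINE DINUCLEOTIDE\n'

# split() with no arguments splits on whitespace and drops the newline.
sample_words = sample_line.split()

print(sample_words)      # ['HETNAM', 'FAD', 'FLAVIN-ADENINE', 'DINUCLEOTIDE']
print(sample_words[1])   # FAD
```

Note that in the loop above, `words` is overwritten on every matching line, so after the loop it holds only the last match.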

**Check Your Understanding** 

Some PDB files contain more than one heterogen. For example, the structure of D-amino acid oxidase found in PDB entry 1ddo contains three heterogens. Can you think of a way to keep all of the lines using syntax we have already learned?  

In [None]:
filepath2 = os.path.join('PDB_files', '1ddo.pdb')
print(filepath2)

In [None]:
with open(filepath2) as f:
    data2 = f.readlines()
    print(data2[0])


In [71]:
hetnams = []
for line in data2:
    if "HETNAM" in line:
        HETNAM_line2 = line
        words2 = HETNAM_line2.split()
        # print(words2)
        abbrev2 = words2[1]
        print(abbrev2)
        hetnams.append(abbrev2)


FAD
ITR
DTR
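
The test `"HETNAM" in line` is true wherever the word appears in the line. If you only want lines that begin with the record name, `str.startswith` is a stricter alternative. The sketch below uses a hypothetical list, `example_lines`, made up for illustration and not taken from 1ddo.pdb.

```python
# Hypothetical lines for illustration (not taken from 1ddo.pdb).
example_lines = [
    'HETNAM     AAA FIRST EXAMPLE COMPOUND\n',
    'REMARK     THIS LINE MENTIONS HETNAM IN PASSING\n',
    'HETNAM     BBB SECOND EXAMPLE COMPOUND\n',
]

example_hetnams = []
for line in example_lines:
    # startswith() matches only when the line begins with 'HETNAM'.
    if line.startswith('HETNAM'):
        example_hetnams.append(line.split()[1])

print(example_hetnams)   # ['AAA', 'BBB']
```

With `in` instead of `startswith`, the second line would also match and `'THIS'` would be appended to the list.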


**Exercise on file parsing**

Use skills from this lesson and the previous lesson to extract the experimental method and temperature for determining the structure of 4eyr.pdb. Try to complete this exercise without opening and reading the PDB file in your text editor.

```
EXPERIMENT TYPE : X-RAY DIFFRACTION
TEMPERATURE (KELVIN) : 298
```
    
*Hint*
- Remember that `readlines()` reads to the end of the file, so calling it again on the same open file returns an empty list. Reopen the file to read it again, or reuse the list you already stored.
- To find the lines with the keywords, search for them and print the matching lines to see their content. Then refine your search and split the lines as needed to get the desired output.
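
The two hints can be sketched on a throwaway file. The file content and the names `demo_path`, `first_read`, `second_read`, `third_read`, and `mass` are all hypothetical and unrelated to any PDB entry, so this shows the technique and not the solution to the exercise.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, 'demo.txt')
    with open(demo_path, 'w') as f:
        # Hypothetical content, unrelated to any PDB entry.
        f.write('SAMPLE NAME : WATER\nMASS (GRAMS) : 12.5\n')

    with open(demo_path) as f:
        first_read = f.readlines()    # reads to the end of the file
        second_read = f.readlines()   # nothing left, so this is empty

    # Reopening the file starts again from the beginning.
    with open(demo_path) as f:
        third_read = f.readlines()

print(len(first_read), len(second_read), len(third_read))   # 2 0 2

for line in third_read:
    if 'MASS' in line:
        print(line)                       # look at the raw line first
        value_text = line.split(':')[1]   # ' 12.5\n', still a string
        mass = float(value_text)          # float() ignores surrounding whitespace
        print(mass * 2)                   # 25.0
```

The value read from the file is a string, so it has to be converted with `float()` (or `int()`) before you can do arithmetic with it.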


## Searching for a particular line number in your file

**Check Your Understanding** 

What would be printed if you entered the following?

    print(data[311])
    print(data[312])
    print(data[313])
    print(data[314])
    print(data[315])
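
Each index selects one element of the list, counting from zero, so `data[311]` is the 312th line of the file. When the lines you want are consecutive, a slice or a loop over `range` selects the same elements with less typing. Here is a minimal sketch, where the hypothetical list `demo_data` stands in for `data`:

```python
# Hypothetical stand-in for the list returned by readlines().
demo_data = ['line ' + str(i) + '\n' for i in range(10)]

# Three ways to select the elements at indices 3, 4 and 5.
one_by_one = [demo_data[3], demo_data[4], demo_data[5]]
sliced = demo_data[3:6]                        # the end index is excluded
looped = [demo_data[i] for i in range(3, 6)]

print(sliced)                 # ['line 3\n', 'line 4\n', 'line 5\n']
print(sliced == one_by_one)   # True
print(looped == one_by_one)   # True
```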


## A final note about regular expressions
Sometimes you will need to match something more complex than a particular word or phrase in your file. You might need to match a word only when it appears at the beginning of a line, or match a pattern such as a capital letter followed by a number without knowing the exact letter and number in advance. These situations are handled with *regular expressions*, which are available through the Python module `re`. Regular expressions are outside the scope of this tutorial, but they are very useful and worth learning in the future. A tutorial can be found in the book [Automate the Boring Stuff with Python](https://automatetheboringstuff.com/2e/chapter7/). A good site for testing regular expressions is [regex101](https://regex101.com/).
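
As a brief taste, the sketch below uses two functions from the standard `re` module on a hypothetical list, `example_lines`, made up for illustration. `re.match` succeeds only if the pattern matches at the beginning of the string, while `re.search` looks for the pattern anywhere in it.

```python
import re

# Hypothetical lines for illustration.
example_lines = [
    'HETNAM     AAA FIRST EXAMPLE COMPOUND',
    'REMARK     HETNAM APPEARS LATER IN THIS LINE',
    'ATOM      1  N   ALA A1',
]

# re.match() only matches at the beginning of the string.
starts_with_hetnam = [line for line in example_lines if re.match(r'HETNAM', line)]
print(starts_with_hetnam)   # ['HETNAM     AAA FIRST EXAMPLE COMPOUND']

# [A-Z][0-9] means one capital letter followed directly by one digit.
found = []
for line in example_lines:
    match = re.search(r'[A-Z][0-9]', line)
    found.append(match.group() if match else None)
print(found)   # [None, None, 'A1']
```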