# Reading and writing FASTA files

This section shows how to read, parse, and write FASTA files of biological sequences using lists and for loops in python.

#### Prerequisites for this section
* Have Anaconda python installed
* Be familiar with how to run python
* Be familiar with the Central Dogma of Molecular Biology, and how to represent DNA, RNA, and protein sequences using python strings
* Know how to use 'for loops' to simplify repetitive code
* Know how to use 'if statements' to let your code handle different conditions

#### In this section you will learn
* How to read and write text files in python
* How sequence data are represented in the FASTA file format
* How to work with special tab ('\t') and newline ('\n') characters
* How to write FASTA files using python
* How to combine if statements and for loops to read a FASTA file in python


## The FASTA file format

As we have seen, sequence data is ubiquitous in computational biology and bioinformatics. One of the most straightforward methods of representing that data is as a FASTA file. FASTA files are simply text files. They may have various file extensions including .txt, .fa, .fasta, .fna or .faa. Typically, .fna indicates a nucleotide fasta file, whereas .faa indicates an amino acid fasta file.

Let's explore an example FASTA file in python. In doing so, we will learn how to read and write text files in python. Then we'll think more carefully about how these files are put together.


## How to open files in python

Opening files in python is fairly straightforward: the [open](https://docs.python.org/3/library/functions.html#open) function takes the path to a file (either relative to your current working directory, or absolute) and returns a file object. You can then iterate over the lines of that file object in a for loop, and you will be given each line as a python string object.

To test out the `open` function for yourself, first download an example fasta file for the Human FMR1 protein and ensure that it is saved in your current working directory with the filename `Human_FMR1_Protein_UniProt.fasta`.

At first, we will simply open the file and print out each of its lines. I find this is a relatively low-stress way to start working on a new type of file format, rather than trying to immediately jump into writing more complex code.

Here's one way to do this in python:

In [8]:
infile = open("./Human_FMR1_Protein_UniProt.fasta")
for line in infile:
    print(line)


>tr|R9WNI0|R9WNI0_HUMAN Fragile X mental retardation 1 OS=Homo sapiens OX=9606 GN=FMR1 PE=1 SV=1

MEELVVEVRGSNGAFYKAFVKDVHEDSITVAFENNWQPDRQIPFHDVRFPPPVGYNKDIN

ESDEVEVYSRANEKEPCCWWLAKVRMIKGEFYVIEYAACDATYNEIVTIERLRSVNPNKP

ATKDTFHKIKLDVPEDLRQMCAKEAAHKDFKKAVGAFSVTYDPENYQLVILSINEVTSKR

AHMLIDMHFRSLRTKLSLIMRNEEASKQLESSRQLASRFHEQFIVREDLMGLAIGTHGAN

IQQARKVPGVTAIDLDEDTCTFHIYGEDQDAVKKARSFLEFAEDVIQVPRNLVGKVIGKN

GKLIQEIVDKSGVVRVRIEAENEKNVPQEEGMVPFVFVGTKDSIANATVLLDYHLNYLKE

VDQLRLERLQIDEQLRQIGASSRPPPNRTDKEKSYVTDDGQGMGRGSRPYRNRGHGRRGP

GYTSGTNSEASNASETESDHRDELSDWSLAPTEEERESFLRRGDGRRRGGGGRGQGGRGR

GGGFKGNDDHSRTDNRPRNPREAKGRTTDGSLQIPPVKVVGCARVKIVTRRKRSQTAWMV

SNHS



**Stop and Consider**. Given just what you've seen so far, is this sequence DNA, RNA, or protein? If you can't tell, review the material on how these sequences are represented before proceeding.

#### Trying to read a file that does not exist triggers a FileNotFoundError
Just so that we can fix any errors that occur, let's test what happens if we try to open a file that IS NOT in the place we say it is. Let's try to open the file 'does_not_exist.txt', which you presumably will not have in your current working directory:

In [10]:
infile = open('does_not_exist.txt')

FileNotFoundError: [Errno 2] No such file or directory: 'does_not_exist.txt'

As expected, we get an error. Specifically, we get `FileNotFoundError: [Errno 2] No such file or directory: 'does_not_exist.txt'`. If you see an error like this one when trying to read from a file, the most likely reason is that the file you tried to open isn't in your current working directory, or is in that directory under another name. Minor differences in capitalization, spelling, or the file extension (e.g. '.text' instead of '.txt') are common reasons you might get this error even if you think the file is there. So if you get this error, you should (a) double-check that the file is where it should be, and (b) check that the filename is specified correctly.
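
One way to carry out those two checks is to ask python where it is looking and which files it can see there. The sketch below uses the standard-library `os` module. The helper name `check_file` is made up for this illustration and is not part of python itself.

```python
import os


def check_file(filename):
    """Return True if filename exists; otherwise print troubleshooting hints.

    check_file is a hypothetical helper written for this example.
    """
    if os.path.exists(filename):
        return True

    print("Could not find:", filename)
    # Show where python is looking for relative paths
    print("Current working directory:", os.getcwd())
    # List what is actually there, so spelling differences stand out
    print("Files in this directory:")
    for name in sorted(os.listdir(".")):
        print("   ", name)
    return False


found = check_file("does_not_exist.txt")
```

Comparing the printed file list against the name you typed is usually the quickest way to spot a capitalization or file extension mismatch.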

## The structure of FASTA files

If everything worked you should now see each line of the FASTA file printed out one by one. Here's how FASTA files are structured:

- FASTA files can contain one or more sequences
- Each sequence entry has two main parts, an **identifier line** and one or more **sequence lines**
- The **identifier line** *must* begin with a `>` (greater than) character. This line says what the sequence is. The identifier line also indicates that all following lines are sequence, until you reach either another identifier line or the end of the file. In some cases, identifier lines *may* also contain additional information about the sequence, like its putative function, but this depends entirely on the specific database or research group that wrote the FASTA file. In our example file, the identifier line is: `>tr|R9WNI0|R9WNI0_HUMAN Fragile X mental retardation 1 OS=Homo sapiens OX=9606 GN=FMR1 PE=1 SV=1`

- The **sequence line(s)**, which are the line(s) of a FASTA file after an identifier line, contain the actual sequence data. In some cases, long sequences will simply be placed on a single very long line. Your text editor may wrap this around so it looks like multiple lines, but as far as the computer is concerned it's just one. In other FASTA files, like this one, long sequences are broken up into multiple lines, each with no more than a certain number of characters (here 60).
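
To make this structure concrete, here is a small sketch that counts the identifier lines and sequence lines in a tiny two-entry FASTA file. The identifiers and sequences are invented for illustration, and the file contents are held in a python string so the example runs without any file on disk.

```python
# A made-up FASTA file with two entries, stored in a triple-quoted string.
example_fasta = """>seq_1 first example sequence
ATGCTG
GGTACG
>seq_2 second example sequence
ATCGCG
"""

n_identifier_lines = 0
n_sequence_lines = 0

# splitlines() breaks the string into a list of lines
for line in example_fasta.splitlines():
    if line.startswith(">"):
        # Identifier lines always begin with '>'
        n_identifier_lines += 1
    else:
        # Anything else belongs to the most recent identifier
        n_sequence_lines += 1

print("Identifier lines:", n_identifier_lines)
print("Sequence lines:", n_sequence_lines)
```

Here `seq_1` is spread over two sequence lines while `seq_2` fits on one, so the number of sequence lines tells you nothing about the number of sequences. Only the identifier lines do.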

One difficulty with multiline FASTA files is that each line ends in a special newline character. You might notice that when we ran our code, there was an extra blank line after each line of sequence data. This is due to hidden newline characters at the end of each line. These newline characters indicate where to end the line, and are roughly equivalent to hitting return on your keyboard when typing. When working with python strings, newline characters are represented with `'\n'`. If you add newline characters to a string, it will print out on more than one line. Here's a quick example:

In [12]:
print("First let's print a sequence without newline characters")
print("AATAAA")
print("Now let's add newline characters:")
print("AA\nTA\nAA\n")

First let's print a sequence without newline characters
AATAAA
Now let's add newline characters:
AA
TA
AA



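As you can see, adding `\n` caused our string to be broken up across several lines of text when printed to the screen. This is why our FASTA file printed with a blank line after every line: each line read from the file already ends in `\n`, and `print` then adds a second newline of its own.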
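
If you want to see these hidden characters for yourself, the built-in `repr` function shows a string with its special characters written out. The sketch below uses `io.StringIO` from the standard library, which lets a string stand in for an open file. The sequence name and contents are made up for this example.

```python
import io

# io.StringIO lets us treat a string as if it were an open file,
# so this example runs without needing a file on disk.
fake_file = io.StringIO(">example_seq\nAATAAA\n")

raw_lines = []
for line in fake_file:
    raw_lines.append(line)
    # repr() shows special characters such as '\n' explicitly
    print(repr(line))
```

Each line is printed in quotes with a visible `\n` at the end, confirming that the newline is part of the string python hands back when reading a file.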

### Removing newline characters from a FASTA file

If you want a multiline string to print out on just one line, you have to remove newline characters. To do this, we have to remember two things: first, the `replace` string method lets us replace part of a string with something else, and second, the newline character is just a normal part of a python string. Therefore, we can replace it with an empty string (`''`) in order to remove the newline from each line of our file.

Let's revise our code for reading FASTA files so that it strips newline characters:


In [13]:
infile = open("./Human_FMR1_Protein_UniProt.fasta")
for line in infile:
    line = line.replace('\n','')
    print(line)

>tr|R9WNI0|R9WNI0_HUMAN Fragile X mental retardation 1 OS=Homo sapiens OX=9606 GN=FMR1 PE=1 SV=1
MEELVVEVRGSNGAFYKAFVKDVHEDSITVAFENNWQPDRQIPFHDVRFPPPVGYNKDIN
ESDEVEVYSRANEKEPCCWWLAKVRMIKGEFYVIEYAACDATYNEIVTIERLRSVNPNKP
ATKDTFHKIKLDVPEDLRQMCAKEAAHKDFKKAVGAFSVTYDPENYQLVILSINEVTSKR
AHMLIDMHFRSLRTKLSLIMRNEEASKQLESSRQLASRFHEQFIVREDLMGLAIGTHGAN
IQQARKVPGVTAIDLDEDTCTFHIYGEDQDAVKKARSFLEFAEDVIQVPRNLVGKVIGKN
GKLIQEIVDKSGVVRVRIEAENEKNVPQEEGMVPFVFVGTKDSIANATVLLDYHLNYLKE
VDQLRLERLQIDEQLRQIGASSRPPPNRTDKEKSYVTDDGQGMGRGSRPYRNRGHGRRGP
GYTSGTNSEASNASETESDHRDELSDWSLAPTEEERESFLRRGDGRRRGGGGRGQGGRGR
GGGFKGNDDHSRTDNRPRNPREAKGRTTDGSLQIPPVKVVGCARVKIVTRRKRSQTAWMV
SNHS


Notice that the text now prints out much more nicely. If we wanted to, we could have used `line = line.strip()` instead of our call to `replace`. The `strip` method removes any whitespace characters from the beginning and end of a string, including newline characters, spaces, and tabs. To remove whitespace from the end only, use `rstrip` instead.
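
The differences between these methods are easiest to see side by side. In this short sketch, the example line is invented and deliberately padded with spaces and a tab so that each method gives a visibly different result.

```python
# A made-up line with leading spaces, a trailing space, a tab, and a newline
line = "  MEELVV \t\n"

# replace removes only the newline character
replaced = line.replace("\n", "")
# rstrip() removes all whitespace from the end of the string
right_stripped = line.rstrip()
# strip() removes whitespace from both the beginning and the end
stripped = line.strip()

print(repr(replaced))
print(repr(right_stripped))
print(repr(stripped))
```

For sequence lines in a FASTA file, `strip` is usually the convenient choice, since stray spaces around a sequence are rarely meaningful.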

## Building a FASTA file parser

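The parser below combines a for loop with an if statement. It loops over the lines of the file and checks each one: if the line starts with `>`, it is an identifier line, so we save whatever sequence we were building (if any) and start a new, empty one under the new identifier. Otherwise the line is sequence data, and we add it to the current sequence. After the loop ends, the last sequence still has to be saved, because no further identifier line comes along to trigger that.

### Example code that processes a FASTA file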

In [17]:
# Normally this would be determined by user input via argparse.
# Hard-coded for now.
input_file = './Human_FMR1_Protein_UniProt.fasta'


def parse_fasta_file(fasta_file):
    """Return a dict of {id: sequence} pairs for an input FASTA file

    fasta_file -- an open file handle for an input FASTA file
    """
    parsed_seqs = {}
    curr_seq_id = None
    curr_seq = []

    for line in fasta_file:
        # Remove newline characters ("\n") and surrounding whitespace
        line = line.strip()

        # Skip blank lines
        if not line:
            continue

        if line.startswith(">"):
            # Save the previous sequence (if any) before starting a new one
            if curr_seq_id is not None:
                parsed_seqs[curr_seq_id] = ''.join(curr_seq)

            # Everything after the '>' is the identifier
            curr_seq_id = line[1:]
            # Empty out the current sequence
            curr_seq = []
            continue

        # If we got here we've got sequence!
        curr_seq.append(line)

    # Add the final sequence to the dict (unless the file had no sequences)
    if curr_seq_id is not None:
        parsed_seqs[curr_seq_id] = ''.join(curr_seq)
    return parsed_seqs


# Open a file object and parse it; the 'with' block closes the file for us
with open(input_file) as f:
    parsed_seqs = parse_fasta_file(f)

# Write the output to a tab-delimited file, one sequence per line
output_filepath = 'parsed_seqs.txt'
with open(output_filepath, 'w') as output_file:
    for seq_id, seq in parsed_seqs.items():
        print(seq_id, seq)
        output_file.write("\t".join([seq_id, seq]) + "\n")

tr|R9WNI0|R9WNI0_HUMAN Fragile X mental retardation 1 OS=Homo sapiens OX=9606 GN=FMR1 PE=1 SV=1 MEELVVEVRGSNGAFYKAFVKDVHEDSITVAFENNWQPDRQIPFHDVRFPPPVGYNKDINESDEVEVYSRANEKEPCCWWLAKVRMIKGEFYVIEYAACDATYNEIVTIERLRSVNPNKPATKDTFHKIKLDVPEDLRQMCAKEAAHKDFKKAVGAFSVTYDPENYQLVILSINEVTSKRAHMLIDMHFRSLRTKLSLIMRNEEASKQLESSRQLASRFHEQFIVREDLMGLAIGTHGANIQQARKVPGVTAIDLDEDTCTFHIYGEDQDAVKKARSFLEFAEDVIQVPRNLVGKVIGKNGKLIQEIVDKSGVVRVRIEAENEKNVPQEEGMVPFVFVGTKDSIANATVLLDYHLNYLKEVDQLRLERLQIDEQLRQIGASSRPPPNRTDKEKSYVTDDGQGMGRGSRPYRNRGHGRRGPGYTSGTNSEASNASETESDHRDELSDWSLAPTEEERESFLRRGDGRRRGGGGRGQGGRGRGGGFKGNDDHSRTDNRPRNPREAKGRTTDGSLQIPPVKVVGCARVKIVTRRKRSQTAWMVSNHS
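
The output file written above puts a tab (`'\t'`) between each identifier and its sequence, and a newline (`'\n'`) after each sequence. Reading such a file back in is a matter of reversing those two steps. The following is a minimal sketch of that idea, not the only way to do it. So that it runs on its own, it first writes a tiny tab-delimited file, and the file name `example_parsed_seqs.txt` and its contents are invented for illustration.

```python
# Write a tiny tab-delimited file so this example is self-contained.
# The file name and its contents are invented for illustration.
example_path = "example_parsed_seqs.txt"
out = open(example_path, "w")
out.write("seq_1 first example\tATGCTGGGTACG\n")
out.write("seq_2 second example\tATCGCG\n")
out.close()

# Read it back: remove the newline, then split each line on the tab
seqs_from_table = {}
infile = open(example_path)
for line in infile:
    line = line.rstrip("\n")
    seq_id, seq = line.split("\t")
    seqs_from_table[seq_id] = seq
infile.close()

print(seqs_from_table)
```

Note that this sketch uses `rstrip("\n")` rather than `strip()`, so that only the newline is removed and the tab separating the two columns is never at risk of being stripped from a line with an empty sequence.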


## Exercises

**Write an example FASTA file using python**. Your file should contain three sequences, and be in the FASTA format. Don't write it by hand in a text editor! The point is to practice writing things like this into files using python. Seq_1 has sequence 'ATGCTGGGTACGATCGTACGTACGTACG', Seq_2 has sequence 'ATCGCGGGGCTATATCTGGATTTTTAAACGGATCG', and Seq_3 has sequence 'ATCGACGATCGTACGATCGTACGGGGGGGGTACTATCTATTATATATTAAAAAAAATGCTAGCTACGATCTACGTACGTACGATCGTACGGCTAGCTACGATCGTACGATGCTATATATACGCGAGGCTACGTACGATCGTACGAT'.