In [7]:
# In Class Oct. 20
# Zeke Van Dehy
# Oct 20

# Opening files for reading #

Today's assignment comes with a zip file containing a few small files that you will be able to work with in Jupyter. Put the zip in your Notebooks directory and 'unzip Oct1files.zip'. 

You will have three files:

- justafile.txt: contains some random text
- dnaseqs.txt: contains several DNA sequences of varying length and GC content, with string “TACG” at random locations in forward and reverse, with no headers
- mockfaa.txt: contains several protein sequences in real FASTA sequence format with header lines for each sequence

You also have a couple of other files from previous exercises -- NC_007898.txt and sequences.fastq. Keep those around -- we might use them!

To open a file for reading, we use the built-in function open(). Calling open() does not read the file out, though. It returns a file object, and we need to assign that file object to a variable.

```fileobject = open("filename.txt")```

Open the file justafile.txt and assign it to the file object justanobject. Now try to print justanobject. What happens?

In [1]:
justanobject = open("InClass-Files-Oct20/justafile.txt")
print(justanobject)

<_io.TextIOWrapper name='InClass-Files-Oct20/justafile.txt' mode='r' encoding='UTF-8'>


# The .read() file method #

WHAT JUST HAPPENED? We made a file object, but the contents of a file object are NOT the contents of the file. Instead, the file object contains a pointer to the file on your system. It refers to the file the same way a list object name refers to a complete list. If you actually want to get into a list and use it, you need to use keywords (like "for") or methods (like .append()) to work with the list content.

You need to use FILE METHODS, together with the open() function and the with keyword, to work with the file object content. The first method we'll learn is .read()

```filecontents = file.read()```

Work with the justafile.txt and justanobject, same names as you used in the cell above. Use .read() on the *file object* to read the contents of the file. You can assign them to a variable and print the variable. Or you can use them directly, e.g. in a print statement.

In the cell below, print the contents of justafile.txt.

In [2]:
justaread = justanobject.read()
print(justaread)

Hi, I am just a file
With my contents in no particular format
And now you have opened and read it.




# Common errors #

We just saw what happens when you try to print or use a file object when you actually mean to use its contents. You print the pointer, not the file itself. What happens when you apply a file method, like read, directly to the filename?

```"filename.txt".read()```

Try this with justafile.txt. Learn to recognize these common errors.

In [3]:
"justafile.txt".read()

AttributeError: 'str' object has no attribute 'read'

If you try to access a file name without opening it and assigning it to a file object, the interpreter just reads it as a string -- or as a variable that's not defined, depending on which way you mess it up.
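Both ways of messing it up can be sketched in a few lines. This is only an illustration: the try/except blocks are there so the cell keeps running and we can look at the error messages, and the unquoted name justafile is deliberately left undefined.

```python
filename = "justafile.txt"  # just a string; nothing has been opened

# Mistake 1: calling a file method on the filename string.
try:
    filename.read()
except AttributeError as err:
    first_error = str(err)
    print("AttributeError:", first_error)

# Mistake 2: forgetting the quotes, so Python looks for a variable
# called justafile (which we never defined).
try:
    justafile.read()
except NameError as err:
    second_error = str(err)
    print("NameError:", second_error)
```

In both cases the fix is the same: call open() on the quoted filename and use the file object it returns.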

# The open function #

You're just always going to have to use open() if you want to get at the contents of a file. Open usually takes two arguments -- the filename itself and a string that encodes the opening mode.  Mode can be either "r", "w", or "a" -- read, write or append. By default files are opened in "r" mode so you don't even need it if you're just reading. 

Any of these modes can also be binary if you add a b (for example "rb" or "wb") -- but we are not going to do that in this course. There's also a third argument you can use, to open a file with buffering or without, but again, that's more advanced than we're going to get right now.

- "r" mode puts the pointer at the beginning of the file and starts reading there
- "w" mode puts the pointer at the beginning of the file and starts writing there, overwriting any content
- "a" mode puts the pointer at the end of the file and starts adding on from there
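Here is a small sketch of how the three modes behave on the same file. It assumes a throwaway file in a temporary directory (modes_demo.txt is a made-up name), and it uses .write() and .close(), which are covered in the sections that follow.

```python
import os
import shutil
import tempfile

tmpdir = tempfile.mkdtemp()                    # scratch directory for the demo
path = os.path.join(tmpdir, "modes_demo.txt")  # hypothetical filename

f = open(path, "w")      # "w" creates the file and starts writing at the beginning
f.write("first\n")
f.close()

f = open(path, "a")      # "a" starts at the end, so existing content is kept
f.write("second\n")
f.close()

f = open(path, "r")      # "r" starts reading at the beginning
after_append = f.read()
f.close()

f = open(path, "w")      # "w" again: the old content is overwritten
f.write("third\n")
f.close()

f = open(path)           # no mode given, so this is "r" by default
after_overwrite = f.read()
f.close()

shutil.rmtree(tmpdir)    # remove the scratch directory

print(repr(after_append))
print(repr(after_overwrite))
```

The second print shows only the last line written, because reopening in "w" mode threw away what was there before.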

In the cell below, open the file mockfaa.txt as file object fasta, in "r" mode.

In [4]:
mockFile = open("InClass-Files-Oct20/mockfaa.txt", "r")
print(mockFile)

<_io.TextIOWrapper name='InClass-Files-Oct20/mockfaa.txt' mode='r' encoding='UTF-8'>


# What gets opened must get closed #

The .close() method closes a file once you're done. Typically if you are done with the contents of the file you can use fileobject.close() to close it.
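If you are not sure whether a file object is still open, you can check its .closed attribute. The sketch below assumes a scratch file in a temporary directory (close_demo.txt is a made-up name) and shows what happens if you try to use a file object after closing it.

```python
import os
import shutil
import tempfile

tmpdir = tempfile.mkdtemp()                    # scratch directory for the demo
path = os.path.join(tmpdir, "close_demo.txt")  # hypothetical filename

fileobject = open(path, "w")
fileobject.write("some text\n")
open_before = fileobject.closed    # False: the file is still open
fileobject.close()
closed_after = fileobject.closed   # True: the file has been closed

# Using a closed file object raises a ValueError.
try:
    fileobject.write("more text\n")
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message)

shutil.rmtree(tmpdir)              # remove the scratch directory
```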

Close the fasta file object that you opened in the previous cell.

In [5]:
mockFile.close()

# Opening a file for writing #

We can open a file for writing and write something to it. Assign a file called outfile.txt to a file object, using the second argument "w" (yes, in quotes) like so:

```fileobject = open("filename.txt","w")```

Then use fileobject.write() to write something to the file. I suggest that you put some text in quotes inside those parentheses, and make sure that your text ends with an escaped newline, \n. This makes sure that you have whitespace that you need in your file so you can write another line later without it all becoming one big line.

```fileobject.write("Hi I am a string in a file \n")```

Then close the file with fileobject.close()

Do this below. The file will automatically be created the first time you write to it. Then go out to your Unix shell window and cd into your Notebooks directory (or wherever you are putting this notebook). See whether the file is there, and use the Unix command "cat" to see what its contents are. 

In [9]:
newFile = open("InClass-Files-Oct20/newFile.txt","w")
newFile.write("Hi I am a string in a file \n")
newFile.close()
newFile = open("InClass-Files-Oct20/newFile.txt","r")
print(newFile.read())
newFile.close()

Hi I am a string in a file 



# Opening a file for appending #

Now use the same filename (outfile.txt) and open it for appending instead of writing.

```fileobject = open("filename.txt","a")```

Use another fileobject.write command to add a second line of text to the file. Go out and look at your file and make sure it worked!

In [10]:
newFile = open("InClass-Files-Oct20/newFile.txt","a")
newFile.write("This is a second line of text \n")
newFile.close()
newFile = open("InClass-Files-Oct20/newFile.txt","r")
print(newFile.read())
newFile.close()

Hi I am a string in a file 
This is a second line of text 



# Creating a with block #

The exception to the "you have to close it" rule is if you open the file inside a with statement, initiating a with block. with is a reserved keyword like def or for that initiates a particular type of block.

with statements in python ensure that clean-up code is executed. There are other contexts for with in python that you may need as you get more advanced, but the one place we're commonly going to use with is to open files -- because it ensures that the file is closed without explicitly using a .close() statement. In general, a with block guarantees that its clean-up step runs even if an exception is raised inside the block. For files, that means the file still gets closed, although the exception itself is still raised. But you don't usually need them in simple code, other than for opening files.
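A minimal sketch of that clean-up guarantee, assuming a throwaway file in a temporary directory (with_demo.txt and the RuntimeError are made up for illustration). The error is raised on purpose inside the with block, and we check afterwards that the file was closed anyway.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "with_demo.txt")  # hypothetical filename
    try:
        with open(path, "w") as fileobject:
            fileobject.write("partial line\n")
            # Simulate something going wrong in the middle of the block.
            raise RuntimeError("something went wrong inside the block")
    except RuntimeError as err:
        # The exception still reaches us; with did not hide it.
        caught = str(err)

    # The file was closed on the way out of the with block.
    closed_after_error = fileobject.closed

print(caught)
print(closed_after_error)
```

Note that the temporary directory here is also managed by a with block, which is one of those "other contexts" for with.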

```with open("filename.txt","r") as fileobject:
    filecontents = fileobject.read()
    print(filecontents)```

In the cell below, open justafile.txt in a with block, get the file contents into a variable, and print them out.

In [11]:
with open("InClass-Files-Oct20/justafile.txt","r") as justanobject:
    filecontents = justanobject.read()
    print(filecontents)

Hi, I am just a file
With my contents in no particular format
And now you have opened and read it.




# The .readlines() method #

What if instead of getting the file as a big chunk of text, we want to get it one line at a time? We use .readlines() instead of .read(). There's also a .readline() method that gets just one line per call (the next line each time you call it), but honestly I just learned that one existed when I was working on these notes because I have NEVER had call to use it.
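To compare the three methods side by side, here is a small sketch. It uses io.StringIO as an in-memory stand-in for an open file, so it runs without any files on disk; the three-line text is made up.

```python
import io

text = "line one\nline two\nline three\n"  # made-up file contents

# .read() returns everything as one string.
whole = io.StringIO(text).read()

# .readlines() returns a list with one string per line (newlines included).
lines = io.StringIO(text).readlines()

# .readline() returns one line per call, moving forward each time.
handle = io.StringIO(text)
first = handle.readline()
second = handle.readline()

print(repr(whole))
print(lines)
print(repr(first), repr(second))
```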

Try .readlines() with the mockfaa.txt fasta file. First create a file object that points to mockfaa.txt, and then apply .readlines() to that file object. 

In [12]:
with open("InClass-Files-Oct20/mockfaa.txt") as mockfaaFile:
    mockfaaLines = mockfaaFile.readlines()  # list with one string per line
    print(mockfaaLines)

# Filtering .readlines() output #

The thing about .readlines() is that it returns the file contents as a list. If you set a variable = somefileo.readlines(), it will end up containing each line in the file as an item in that list, with the newline still attached to the end of each line.

A list is an iterable object, though. So a very typical way to use .readlines() is in a code block like so:

```with open("filename.txt","r") as fileobject:
    for line in fileobject.readlines():
        print(line)```
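Two details are worth knowing about this pattern. Each line still ends with its newline character, so print(line) leaves a blank line between lines, and a file object can also be looped over directly without calling .readlines() first. The sketch below uses io.StringIO as an in-memory stand-in for an open file, with made-up sequences.

```python
import io

fileobject = io.StringIO("ACGT\nTTGA\nCCAT\n")  # stand-in for an open file

cleaned = []
for line in fileobject:                 # looping over the file object gives one line at a time
    cleaned.append(line.rstrip("\n"))   # remove the trailing newline

print(cleaned)
```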
        
Using .readlines(), open up the dnaseqs.txt file. Loop through the lines in .readlines() and print out just the first 10 characters in the line.

In [15]:
with open("InClass-Files-Oct20/dnaseqs.txt") as dnaseqsFile:
    for line in dnaseqsFile.readlines():
        print(line[0:10])

CTATGTATAC
ACCCCTTATT
TACGCGGTTG
GTCAAAATTT
TCATACGAGG
TTCCGTACGT
TTCCGTACGT
TGTTTCCGAG
TTCCGTTAAG
TTCCGTTACG
TGTTTCCGAG
TTCCGTTAAG


# Filtering .readlines() output #

(5 points) Using .readlines() again, open up the dnaseqs.txt file. That file contains the pattern "TACG" somewhere on each line, in either the forward or the reverse of the sequence. Use a loop, conditional blocks, and the find method to find where TACG is either in the forward or the reverse, and print out the position of the first instance. You don't have to build this as a function -- just do it straight through.
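Before you start, it helps to know how .find() differs from the similar .index() method when the pattern is not there. A quick sketch with made-up sequences:

```python
seq = "GGCATTT"  # made-up sequence with TACG only in the reverse

found = "TTACGA".find("TACG")   # position of the first match
missing = seq.find("TACG")      # .find() returns -1 when there is no match

# .index() raises a ValueError instead of returning -1.
try:
    seq.index("TACG")
except ValueError as err:
    message = str(err)

# seq[::-1] is the reversed string, so we can search it the same way.
reversed_pos = seq[::-1].find("TACG")

print(found, missing, reversed_pos)
print("ValueError:", message)
```

So a test like "if position != -1" only makes sense with .find(). With .index() the code stops with an error before the test is ever reached.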

In [16]:
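with open("InClass-Files-Oct20/dnaseqs.txt") as dnaseqsFile:
    for line in dnaseqsFile.readlines():
        seq = line.strip()  # drop the trailing newline before searching or reversing
        index = seq.find("TACG")  # .find() returns -1 if the pattern is absent
        if index != -1:
            print("FORWARD ", index)
        else:
            # not in the forward direction, so search the reversed sequence
            print("REVERSE ", seq[::-1].find("TACG"))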

# The .writelines() method #

Python also has a .writelines() method, which will potentially be useful if you want to write many things to a file. Writelines has to receive an iterable. It could be a list that you already have. It could be a generator call. For now let's use it with a simple list.

```table2 = ["Amit\n", "Shruti\n", "Maya\n"]
with open("table2.txt","w") as fo:
    fo.writelines(table2)```
    
writes a file that contains

```Amit
Shruti
Maya```
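Notice that the newlines in that file come from the list items themselves: .writelines() does not add anything between items. Here is a sketch of what that means, assuming a scratch file in a temporary directory (table2_demo.txt is a made-up name). The second write passes a generator expression instead of a list.

```python
import os
import tempfile

names = ["Amit", "Shruti", "Maya"]  # same names, but without the "\n"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "table2_demo.txt")  # hypothetical filename

    # Items are written back to back, with nothing in between.
    with open(path, "w") as fo:
        fo.writelines(names)
    with open(path) as fo:
        run_together = fo.read()

    # A generator expression that adds the newline to each item as it is written.
    with open(path, "w") as fo:
        fo.writelines(name + "\n" for name in names)
    with open(path) as fo:
        separate = fo.read()

print(repr(run_together))
print(repr(separate))
```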
    
In the cell below, do the following. Create a list from the alphabet string, and write the list to the file with writelines. Go out of jupyter and into your Unix directory, and see what ended up in the file.

Now think about how you can use a for loop to modify the items in your list so that they print out on separate lines. See if you can make it work!

# Screening lines for a pattern that tells what type they are #

The format of a FASTA file is different than a fastq. FASTA entries come in pairs. First a description line, which begins with '>'. Then a line or a series of lines that contain the sequence. There can be multiple individual lines in between description lines, but all those lines are part of the one sequence that matches the description.
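As a sketch of that layout, here are two made-up records held in a list of lines (the names and sequences are hypothetical, not taken from mockfaa.txt). Checking the first character is enough to tell a description line from a sequence line.

```python
# Hypothetical FASTA content: two records, the first with a two-line sequence.
fasta_lines = [
    ">seq1 hypothetical protein\n",
    "MKTAYIAKQR\n",
    "QISFVKSHFS\n",
    ">seq2 another hypothetical protein\n",
    "MSDNELKQAV\n",
]

# Description lines start with ">"; everything else is sequence.
headers = [line.strip() for line in fasta_lines if line.startswith(">")]
sequence_lines = [line.strip() for line in fasta_lines if not line.startswith(">")]

print(headers)
print(sequence_lines)
```

Here two description lines go with three sequence lines, which is why counting lines is not enough to pair them up.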

Let's do the easy part of this first.

In the cell below, open mockfaa.txt in a with block, iterate through its lines, and print the line only if the first character of the line is '>'.

# Triggering an event when parsing a file #

Last time we saw some example code for reading a FASTQ file, where the lines come in a simple four-line pattern:

```def read_fastq(fastq):
	"""generator: reads a fastq file and yields (name, seq, qual) one record at a time"""
	name,seq,qual = None,None,None # initialize variables
	for i,line in enumerate(fastq): #gets the lines plus a counter
		line = line.strip() #cleans up the whitespace
		if i % 4 == 0: #decides what to do based on the index -- is it line 0,1,2 or 3 in the pattern
			name = line # if it's an 0, it's a header
		elif i % 4 == 1: # if it's a 1, it's a sequence
			seq = line
		elif i % 4 == 3: # if it's a 3, it's a quality string
			qual = line 
			yield name,seq,qual # done with line 3, yield all values and go back and get the next line
			name,seq,qual = None,None,None #set values to None after yield just to be sure
		else:
			pass #this happens if it's line 2```
            
This was the code we used to create infile and pass it to the fastq parser:

```with open("sequences.fastq") as infile: #this call turns the open file into an object
    for n,s,q in read_fastq(infile): #this is how we call read_fastq on the object and get the values back
        print(n,s,q) #do something with each record, e.g. print it```
    
The trigger in this code is that when we get to the fourth line in the pattern (i % 4 == 3) then we are done with the pattern, and we have to yield the values from that set of four lines. Then we clear the values out by setting them to None, and start on to the next set of four lines. The counter goes up four times, and then the yield is triggered again. 

Note: Files do not have to be parsed with generators. You can parse them with the kind of functions that you have already used, that give up their values all at once with "return". The yield here could easily be replaced by a statement appending each variable to a group of parallel lists, or anything else you wanted to do with the values to package them up to be returned all at once.
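A minimal sketch of that non-generator version, assuming the same four-line FASTQ pattern. The function name read_fastq_lists and the two sample reads are made up for illustration, and io.StringIO stands in for an open file.

```python
import io

def read_fastq_lists(fastq):
    """Read a FASTQ file object and return three parallel lists."""
    names, seqs, quals = [], [], []
    for i, line in enumerate(fastq):
        line = line.strip()
        if i % 4 == 0:      # header line
            names.append(line)
        elif i % 4 == 1:    # sequence line
            seqs.append(line)
        elif i % 4 == 3:    # quality line
            quals.append(line)
        # i % 4 == 2 is the "+" separator line, which we skip
    return names, seqs, quals  # everything is handed back at once

# Two made-up reads, as an in-memory stand-in for a FASTQ file.
sample = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nHHHH\n")
names, seqs, quals = read_fastq_lists(sample)

print(names)
print(seqs)
print(quals)
```

The trade-off is memory: the lists hold every record at the same time, while the generator version only ever holds one.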

# Triggering an event #

(10 points) Let's think about how to solve the actual problem of the FASTA file, which is to get the sequence out in description/sequence pairs. We need a trigger that can tell us when to gather up all the lines in a description/sequence pair, save them, clear everything and move on to the next iteration.

Because each sequence is divided up into multiple lines, that all need to be joined together before they match up with a description line, we can't just go based on the count of lines like the FASTQ example I showed above.

So what could we do? Let's discuss this in class first, and then put your solution below. Get your values into either two lists, names and seqs, or if you want to be fancy, into a single list of tuples (name,seq), and return the list(s) from your function.

*Getting this parser to work and putting it in the form of a function is IMPORTANT. We will use this function again in future weeks, we'll turn it into a generator, and you may even find a way to use it in other classes in the future. So don't let this one get past you. Come see us if you need a walkthrough. Do it until you can do it.*