# Working with files in Python

Many of the files that bioinformaticians work with are flat text files. That means that reading and writing most of the files you will be dealing with can be achieved using the same tools. To read in special filetypes like Microsoft Office files or images, you will need to use additional packages which offer that functionality. For the purposes of this course, we will be sticking with reading and writing text files.

## Opening files with `open()`

Python comes with a built-in function to open files stored on your computer: `open()`. `open()` returns a file object that has several methods for interacting with the file. Here, we're going to go over the `.read()`, `.readlines()`, `.write()`, and `.close()` methods. Before we can use one of those methods, though, we first need to open a file. I have included an example file, "plantgrowth.txt", along with these notebook files for you to work with.

You can open a file using `open()` by specifying the filepath and the opening mode. Opening mode refers to whether you are going to read ("r"), write ("w"), or append ("a") to the file. The default mode is read. There are also some other modes that we won't cover here.

In [1]:
# If you don't specify mode, then the file is opened with read mode. Equivalent to open(<filepath>, 'r')
file = open("plantgrowth.txt")
print(file)

<_io.TextIOWrapper name='plantgrowth.txt' mode='r' encoding='UTF-8'>
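
As a rough sketch of how the three modes differ, the following writes to, appends to, and then reads a throwaway file. The temporary directory and the file name `demo.txt` are assumptions made for illustration and are not part of the course files. It also uses `.write()` and `.close()`, which are covered further down.

```python
import os
import tempfile

# Work in a temporary directory so no real files are touched
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

f = open(demo_path, "w")   # "w" creates the file, or empties it if it already exists
f.write("first line\n")
f.close()

f = open(demo_path, "a")   # "a" adds to the end, keeping the existing contents
f.write("second line\n")
f.close()

f = open(demo_path, "r")   # "r" is the default mode, so open(demo_path) would also work
demo_contents = f.read()
f.close()

print(demo_contents)
```

If the second `open()` call had used "w" instead of "a", the file would have ended up containing only "second line".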


## `.read()`

Now that we have opened our file, we can interact with it using the variable `file` to get access to the contents of the file. The simplest way to get the contents of a file is using the `.read()` method. `.read()` returns the entire file contents as a `str`.

In [2]:
contents = file.read()
print(contents[:50])

weight	group
4.17	ctrl
5.58	ctrl
5.18	ctrl
6.11	ct


Note that `.read()` consumes the file. Once we have read to the end, calling `.read()` again returns an empty string, so we need to open the file again if we want to reread it. We also still need to remember to close the file using `.close()`. If we don't close files then they keep taking up system resources. If you are opening many files you might run into an error complaining that you have too many files open.

In [3]:
file.close()
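
To see the "read once" behavior for yourself, here is a minimal sketch. It uses `io.StringIO` from the standard library as an in-memory stand-in for an open text file (an assumption made so the example runs without any file on disk); a file returned by `open()` behaves the same way for these calls.

```python
import io

# io.StringIO behaves like an open text file but lives in memory
demo_file = io.StringIO("weight\tgroup\n4.17\tctrl\n")

first_read = demo_file.read()    # returns everything in the file
second_read = demo_file.read()   # nothing is left to read, so we get an empty str

print(len(first_read), len(second_read))

demo_file.close()
print(demo_file.closed)          # the .closed attribute reports whether the file is closed
```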

Even though our file is now closed, we still have access to the contents that we read into our `contents` variable. We can now work with the data within.

The data in this file are the weights of plants treated with either a control condition or one of two treatments. Each line represents a different plant and includes its weight and treatment condition. Above, we printed the first 50 characters of the file contents to see what it looked like. We can see an indication that there are newlines as the data are not all on one line. The columns also look aligned, which suggests tabs, but they could be spaces. We can view whitespace characters as literals (e.g., "\n", "\t" etc) using the Python built-in function `repr()`. `repr()` returns a string representation of an object in such a way that you could copy and paste the output into a Python script to recreate the printed object ([See the docs here](https://docs.python.org/3/library/functions.html#repr)).

In [4]:
print(repr(contents[:50]))

'weight\tgroup\n4.17\tctrl\n5.58\tctrl\n5.18\tctrl\n6.11\tct'


If we want to extract the data in each column of each line, then we need to handle the delimiters separating those data. There are two delimiters in this file: newlines ("\n") and tabs ("\t"). Newlines separate each sample, while tabs separate the data associated with each sample. Fortunately, the contents of the file are stored in a `str` instance and `str` has a method specifically for dealing with situations like this: `.split()`. 

We covered `.split()` in an earlier notebook. `.split()` takes a delimiter as input and splits a `str` into a `list` using that delimiter. The default delimiter used by `.split()` is any whitespace character. That default would certainly split up our data, but we would lose the structure provided by the two delimiters. Therefore, we should instead split twice: first on newlines, then on tabs.

In [5]:
file_lines = contents.split("\n")
print(file_lines[:3])

['weight\tgroup', '4.17\tctrl', '5.58\tctrl']


Now we have a list of our lines. We have separated our samples from one another, but have retained the association of the weight and treatment for each sample.

Next we need to decide how to store our data in a useful way for further processing. We have a few options. We could further unpack our data by splitting each element of the `file_lines` list so that we have a list of lists. However, if we ever want to retrieve all of the treatment 1 samples, we would then have to iterate over every sample and use an `if` statement to identify the ones with the desired treatment. Instead, let's use a `dict`. Whenever you are in a situation where you might want to look up data using some other associated data (in this case the treatment), a `dict` should be your first thought.

I think a sensible `dict` structure to use is one in which treatment types are the keys and the list of weights of samples given each treatment are the values. We could represent that with pseudocode as `{<treatment>: <list of weights>}`. It can be helpful to add pseudocode like that as a comment when you create a `dict` so that you have a reference later of what the structure of the `dict` is.

For now, we're going to use a cumbersome approach to build this `dict`. However, once we cover importing modules, we'll start using an easier approach to make `dict`s like this. 

In [6]:
treatment_weights_dict = {}
for line in file_lines[1:]: # [1:] skips the first one as we don't need to store the headers
    weight, treatment = line.split("\t")
    if treatment in treatment_weights_dict: # does the key already exist
        treatment_weights_dict[treatment].append(weight) # lookup the list associated with treatment and append this weight to that list
    else: # if not, add it with a list as the value
        treatment_weights_dict[treatment] = [weight]

# Now we have a nicely organized object storing our data
print(treatment_weights_dict["ctrl"])

ValueError: not enough values to unpack (expected 2, got 1)

**What went wrong???**

The above error is a little confusing. We saw above that we can split our file on newlines and get a list of tab-delimited columns like `['weight\tgroup', '4.17\tctrl', '5.58\tctrl']`. So why is there only one element when we then split those on tabs?

If we try to run the above code on just the first three lines that we've looked at, it works fine:

In [7]:
for line in file_lines[:3]:
    weight, treatment = line.split("\t")
    print(weight, treatment)

weight group
4.17 ctrl
5.58 ctrl


So the columns of each line are unpacking into our two variables just fine. So what is the problem?

The issue here is a quirk of how `.split()` behaves. `.split()` searches a `str` for every instance of the delimiter and, every time it finds one, it stores the string to the left and to the right of that delimiter. The above error is occurring because every line in our file ends in a newline, which means there is a newline (i.e., our delimiter) at the very end of the file as well. In that case, `.split()` includes the string to the right of that final newline in the returned `list`. In other words, there is an empty string at the end of our list.

We can see that if we look at the end of our `file_lines` object.

In [8]:
print(file_lines[-2:])

['5.26\ttrt2', '']
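
The same behavior can be reproduced on a tiny string, which makes it easier to see where the empty string comes from. This is just an illustrative sketch; the strings are made up.

```python
with_trailing = "a\nb\n".split("\n")     # the text ends in the delimiter
without_trailing = "a\nb".split("\n")    # the text does not end in the delimiter

print(with_trailing)      # ['a', 'b', '']
print(without_trailing)   # ['a', 'b']

# str.splitlines() is not used in this notebook, but it is a built-in str method
# that splits on line boundaries without producing the trailing empty string
no_empty = "a\nb\n".splitlines()
print(no_empty)           # ['a', 'b']
```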


**What can we do about it?**

We have two options to handle the blank lines. We can remove them from the `list` we are iterating over using something like the `filter()` function ([docs here](https://docs.python.org/3/library/functions.html#filter)), or we can add code to our loop to handle blank lines. Let's use `filter()` for now. We will cover how to better control our loops in a later class.

In [9]:
treatment_weights_dict = {}
for line in filter(None, file_lines[1:]): # filter(None, <iterable>) ignores any blank elements in the iterable
    weight, treatment = line.split("\t")
    if treatment in treatment_weights_dict: # does the key already exist
        treatment_weights_dict[treatment].append(weight) # lookup the list associated with treatment and append this weight to that list
    else: # if not, add it with a list as the value
        treatment_weights_dict[treatment] = [weight]

# Now we have a nicely organized object storing our data
print(treatment_weights_dict["ctrl"])

['4.17', '5.58', '5.18', '6.11', '4.5', '4.61', '5.17', '4.53', '5.33', '5.14']
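
If it isn't obvious what `filter(None, ...)` is doing in that loop, here is a small sketch on a made-up list. Wrapping the result in `list()` is only there so that we can print it.

```python
lines = ["4.17\tctrl", "", "5.58\tctrl", ""]   # made-up lines, two of them blank

# With None as the first argument, filter() keeps only the elements
# that Python treats as true. An empty str counts as false, so it is dropped.
kept = list(filter(None, lines))

print(kept)   # ['4.17\tctrl', '5.58\tctrl']
```

Be aware that this drops every "falsy" element, not just empty strings. For example, it would also drop the number `0` from a list of numbers.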


Now it is working and we have a nicely organized object that we can use to analyze the data in our file.

That was quite a lot of work, though. Most of the files we work with have lines in them, and it is common to want to process a file line by line. It would be a pain to always have to read in the whole file, split on newlines, and have to remember to remove blank lines. Fortunately, there is a method that handles that for us: `.readlines()`.

## `.readlines()`

`.readlines()` handles a lot of what we just did above for us. Specifically, it returns a `list` containing each line of the file in order, which we can loop over or slice like any other `list`. It doesn't return a blank line at the end (though it will still return blank lines in the middle of the file if there are any). However, as we aren't splitting the `str`, the newline delimiter is not removed. We can easily get rid of trailing newlines with `.strip()`.

As we have already read our file, we need to open it again to use `.readlines()`. However, as we are able to use `.readlines()` to iterate over the lines of our file, we don't need to immediately split our file into a `list`. Instead, we can go straight into our loop.

In [10]:
file = open("plantgrowth.txt")
treatment_weights_dict = {}
for line in file.readlines()[1:]: # .readlines() is a method of our open file object. [1:] to skip the header line
    weight, treatment = line.strip().split("\t") # strip away trailing newlines, then split the str
    if treatment in treatment_weights_dict: # does the key already exist
        treatment_weights_dict[treatment].append(weight) # lookup the list associated with treatment and append this weight to that list
    else: # if not, add it with a list as the value
        treatment_weights_dict[treatment] = [weight]

# close the file once we are done reading it
file.close()
# Now we have a nicely organized object storing our data
print(treatment_weights_dict["ctrl"])

['4.17', '5.58', '5.18', '6.11', '4.5', '4.61', '5.17', '4.53', '5.33', '5.14']


That was much easier than reading the whole file and handling splitting ourselves. The only extra thing to remember is that we need to strip the trailing newline from each line as `.readlines()` does not remove the newlines for us.
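
To look at exactly what `.readlines()` hands back, here is a minimal sketch. It uses `io.StringIO` as an in-memory stand-in for an open file (an assumption so the example runs without "plantgrowth.txt"); the contents are the first three lines of our file.

```python
import io

# io.StringIO stands in for an open file so this example needs no file on disk
demo_file = io.StringIO("weight\tgroup\n4.17\tctrl\n5.58\tctrl\n")
lines = demo_file.readlines()
demo_file.close()

print(type(lines))   # <class 'list'>
print(lines)         # ['weight\tgroup\n', '4.17\tctrl\n', '5.58\tctrl\n']
print(lines[1:])     # slicing works because lines is a list
```

Each element still ends in "\n", and there is no empty string at the end of the list.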

## Splitting and chaining methods

There's actually a simple way to not have to do that `.strip()` operation, but I did it that way to introduce a new idea: chaining methods together like `line.strip().split("\t")`. Before we get into that, let's quickly go over the way we could have done the above without `.strip()`.

### `.split()` behaviors

Perhaps the most common use case of `.split()` is to break up lines of text into columns. Depending on the formatting of the text, you will use `.split()` in different ways. The default mode for `.split()` uses any number of whitespace characters as a delimiter. This is useful when your file is column-based, but the columns are padded with space characters to align them. Depending on the length of the column contents, it takes a different number of spaces to align the following column. See for example the following string.

In [11]:
"column1 column2                    column3\n".split()

['column1', 'column2', 'column3']

Even though there was an inconsistent number of spaces between the three columns, `.split()` was able to correctly identify which column was which. It also didn't return an empty string at the end of the list. The reason for this is the algorithm used by `.split()` when operating in its default mode.

If you think about how you could go about writing a program that would identify elements separated by an unknown number of whitespace characters, one solution you might come up with is to split on every individual whitespace character and then remove any empty strings. All that would remain then is the non-empty elements you were looking for. The result is the same as what `.split()` gives you when no delimiter is specified: runs of consecutive whitespace are treated as a single separator, and whitespace at the start or end of the `str` is ignored. Therefore, no empty strings are present at the end of the returned list.

The alternative mode for `.split()` is when a delimiter is specified. This mode is useful when you are reading a csv or other format where there is a fixed delimiter. When a csv has an empty column, that is represented by consecutive commas. The same is true in other similar formats with fixed delimiters. In that case, when you use `.split()` to identify the columns, you want to get back an empty string for any that are missing so that your 4th column is still 4th in your list. If empty strings were removed, then you would have no idea which column is which for lines missing one or more columns.

In this way, `.split()` is kind of like two methods in one. It behaves differently depending on how you run it. It is important to be aware of these two modes of operation when using it as you may need to process the outputs differently in each case.
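
Here is a short side-by-side sketch of the two modes. The example rows are made up for illustration.

```python
row = "sample1,,4.17,ctrl"        # made-up csv row with an empty 2nd column
csv_columns = row.split(",")
print(csv_columns)                # ['sample1', '', '4.17', 'ctrl']

padded = "sample1    4.17  ctrl\n"     # made-up row padded with spaces
default_columns = padded.split()       # default mode: no delimiter given
single_space_columns = padded.split(" ")   # fixed delimiter: one space

print(default_columns)            # ['sample1', '4.17', 'ctrl']
print(single_space_columns)       # ['sample1', '', '', '', '4.17', '', 'ctrl\n']
```

With a fixed delimiter the empty column keeps its place in the list, which is what you want for a csv. Using a fixed delimiter on space-padded text, though, gives you an empty string for every extra space and leaves the newline attached to the last column.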

As for the data we have been working with here, each line only has whitespace between the columns and has no whitespace within a column. (That would not be the case if, for example, you had tab-separated columns in which a single column could contain multiple words.) We can therefore use `.split()` with its default mode instead of splitting on tabs. That will remove the trailing newline for us.

In [12]:
file = open("plantgrowth.txt")
treatment_weights_dict = {}
for line in file.readlines()[1:]:
    weight, treatment = line.split() # lines look like "weight\tgroup\n"
    if treatment in treatment_weights_dict:
        treatment_weights_dict[treatment].append(weight)
    else: 
        treatment_weights_dict[treatment] = [weight]

file.close()
print(treatment_weights_dict["ctrl"])

['4.17', '5.58', '5.18', '6.11', '4.5', '4.61', '5.17', '4.53', '5.33', '5.14']


### Chaining methods

In the above example (before talking about the details of `.split()`), we used a syntax that we haven't covered yet: `line.strip().split("\t")`. If you've never seen that kind of syntax before, it might look complicated. However, the way it is working is very simple. When reading code with chained methods like that, you just need to keep in mind two things: the order of operations, and the class being returned by each method.

#### Order of operations

Chains of methods are executed in the same direction you read them: left to right. You can think about what's going on as being the same idea as the command substitution we used in Bash (i.e., the syntax `$(<command>)`). Let's walk through the order of operations of the line `line.strip().split("\t")` to illustrate what is happening.

The first component of the statement is the variable `line`. `line` will contain different data on each line of our file, but let's consider the first line for this example: "weight\tgroup\n". What Python does can be thought of as replacing the word "line" in that statement with the contents of the `line` variable. We could therefore rewrite the statement as `"weight\tgroup\n".strip().split("\t")`.

In [13]:
"weight\tgroup\n".strip().split("\t")

['weight', 'group']

The next thing to execute is the `.strip()` method. `.strip()` removes whitespace characters from either end of a `str`. In this case, it returns the following (we can check by just running that portion of the statement):

In [14]:
"weight\tgroup\n".strip()

'weight\tgroup'

The output of `.strip()` is what `.split()` is then run on. You can think of the statement up to that point as having been replaced by its output, i.e.:

In [15]:
'weight\tgroup'.split("\t")

['weight', 'group']

As you can see, that output is exactly the same as the output of our original statement.

#### Return types

As we just saw, when you have a chain of methods, they are executed in sequence from left to right. Each method is executed on the output of the previous. In the example we just considered, we started with a `str`, used `.strip()` which returned a new `str` and then used `.split()` which returned a list. But what if we wanted to add another method to the chain? Which methods can we use?

Both of the methods used in the example above are `str` methods. We could use `str` methods because at each step in the chain we were working with a `str`. However, once a `list` is returned by `.split()`, if we try to use a `str` method, we will get an error.

In [16]:
"weight\tgroup\n".strip().split("\t").replace("group", "treatment")

AttributeError: 'list' object has no attribute 'replace'

As the error states, once a method in the chain has returned a `list`, it is no longer possible to use `str` methods, because we are no longer working with a `str`. As `.split()` returns a `list`, we can now only use `list` methods. For example:

In [17]:
"weight\tgroup\n".strip().split("\t").index("group")

1

You can chain together methods in any case where each method returns an instance of some class or another. If you use a method like `list.append()`, which performs a modification in-place and doesn't return anything, then you can't add another method afterwards.
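
To illustrate the point about in-place methods, here is a minimal sketch. The extra column name "replicate" is made up for the example.

```python
columns = "weight\tgroup\n".strip().split("\t")

# .append() changes the list in place and returns None
result = columns.append("replicate")   # hypothetical extra column name

print(result)    # None
print(columns)   # ['weight', 'group', 'replicate']
```

Because `.append()` returns `None`, anything chained after it would be run on `None` rather than on the list, and would raise an `AttributeError`.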

For example, we can get a list of the columns in the first line of a file using a single line by chaining together methods used in the example earlier.

In [18]:
open("plantgrowth.txt").read().split("\n")[0].split("\t")

['weight', 'group']

That single line works, but it is much more difficult to read than performing the same operations over multiple lines. You should use short chains of methods whenever you think it is a clearer way to write code to perform a certain operation. If you start writing long chains like the line above, you might want to reconsider if you are writing your code in the clearest way.

## Converting data

When you read data from a file using `.read()` or `.readlines()`, the file contents are always read as `str`s. In many cases, the data you are reading from the file are actually numbers. In those cases you must convert the data into the appropriate type. When we used `awk` during the Bash portion of this course, numbers in text files were handled automatically. However, Python doesn't work that way. Python doesn't try to guess what you are trying to do. Instead, it requires that you are explicit.

The data in the "plantgrowth.txt" file we have been working with include numerical weights. As those weights include decimal places, we will need to convert those data to `float`s. If we didn't care about the decimals, we could convert them to `int`s instead, but the decimal part would be dropped (so positive numbers are rounded **down** to the nearest integer). Note that `int()` can't convert a `str` like "4.17" directly; you would need to convert it to a `float` first.

We can modify our file reading approach to convert the data during file reading.


In [19]:
file = open("plantgrowth.txt")
treatment_weights_dict = {}
for line in file.readlines()[1:]:
    weight, treatment = line.split() # lines look like "weight\tgroup\n"
    weight = float(weight) # perhaps simplest/clearest way is reassigning to the same variable
    if treatment in treatment_weights_dict:
        treatment_weights_dict[treatment].append(weight)
    else: 
        treatment_weights_dict[treatment] = [weight]

file.close()
print(treatment_weights_dict["ctrl"])

[4.17, 5.58, 5.18, 6.11, 4.5, 4.61, 5.17, 4.53, 5.33, 5.14]


As you can see, the numbers are no longer printed in quotes. In addition, we can confirm that they are now `float`s using `type()`.

In [20]:
type(treatment_weights_dict["ctrl"][0])

float
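
Here is a small sketch of why the conversion matters, using two of the weights from our file. The variable names are made up for the example.

```python
weight_a = "4.17"
weight_b = "5.58"

joined = weight_a + weight_b                # str + str glues the text together
total = float(weight_a) + float(weight_b)   # float + float adds the numbers

print(joined)   # 4.175.58
print(total)

# int() cannot parse "4.17" directly (it raises a ValueError),
# so convert to float first; int() then drops the decimal part
whole = int(float(weight_a))
print(whole)    # 4
```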

## `.write()`

Once we have read the data in a file and performed some sort of operation on it, it is common to want to next write the output of that operation to a file. To do this, we can use the `.write()` method of the open file object we have been working with. The syntax looks a lot like the `.read()` method, except that instead of reading the file contents into a single string, we instead write a single string to a file.

Let's say, for example, that we wanted to write the mean weight of plants in each group to an output file. We could do that as follows:

In [21]:
# First make a variable to contain the averages. A dict seems like a sensible object for that
# We could store these data in something else, but a dict would be useful in case we want to look up a mean later in the script
mean_weights_dict = {}

# next, we can iterate over the keys and values in our treatment_weights_dict to calculate the mean of each treatment
for treatment, weights in treatment_weights_dict.items(): # dict.items() returns each key and value one by one as an iterator
    # count the number of measurements for later use
    num_weights = len(weights)
    # Python has a built-in sum function to add up the weights
    sum_weights = sum(weights)
    # then we can add the mean to our dict
    mean_weights_dict[treatment] = sum_weights/num_weights

print(mean_weights_dict)

{'ctrl': 5.032, 'trt1': 4.661, 'trt2': 5.526}


Now that we have our data, we need to organize it into a writeable format (i.e., a `str`). The most straightforward way to do that would be with string concatenation. That would look like this:

In [22]:
# First make a str to store our output. We can start it off with a header line
out_contents = "Treatment\tMean_Weight\n"
for treatment, mean_weight in mean_weights_dict.items():
    out_contents += treatment + "\t" + str(mean_weight) + "\n"

print(out_contents)

Treatment	Mean_Weight
ctrl	5.032
trt1	4.661
trt2	5.526



That output looks exactly how our file should look. Let's write that file and then we will talk about a couple of best practices when building large `str` objects.

In [23]:
# Open the file you want to write to. The file doesn't need to exist. this operation is like ">" in Bash
file = open("mean_weights.txt", 'w') # 'w' mode means we will write to the file
file.write(out_contents)
file.close()

That's all it takes. There is now a file called mean_weights.txt in your current directory.
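
One thing worth knowing about `.write()` is that, unlike `print()`, it writes exactly the string you give it and does not add a newline for you. Here is a minimal sketch that writes to a throwaway file and reads it back to check. The temporary directory and the file name `demo_out.txt` are assumptions made for illustration.

```python
import os
import tempfile

out_path = os.path.join(tempfile.mkdtemp(), "demo_out.txt")  # hypothetical path

out_file = open(out_path, "w")
out_file.write("ctrl\t5.032")      # no "\n" here, so the next write continues this line
out_file.write("trt1\t4.661\n")
out_file.close()

# Read the file back in to check what was written
check_file = open(out_path)
written = check_file.read()
check_file.close()

print(repr(written))   # 'ctrl\t5.032trt1\t4.661\n'
```

That is why each line we added to `out_contents` above ended in "\n".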

## f-strings

Creating the above `str` to write to our output file was fairly straightforward. However, it was more complicated than it needed to be (at least in terms of code elements written). Since Python version 3.6, Python has supported a better way: f-strings. An f-string is a string literal prefixed with `f` that supports what we called variable substitution in Bash. Basically, f-strings let you put your variables within a string, and the result is an ordinary `str`. Additionally, while we had to convert our `float` `mean_weight` variable into a `str`, f-strings handle the conversion for us. f-strings look like this:

`f"some string stuff {<variable>} more string stuff {<another variable>}"`

In Bash you could specify your variables with `$` and perhaps put curly braces around them if you had other characters following your variable name. Python doesn't use `$` to mark variables, so in an f-string you always need to put curly braces ("{}") around your variable names.

In [24]:
a = "some_string" # str
b = 1 # int
c = 3.14 # float
d = [1,2,3] # list

result = f"a is {a}, b is {b}, c is {c}, d is {d}, the length of d is {len(d)}"
print(result)

a is some_string, b is 1, c is 3.14, d is [1, 2, 3], the length of d is 3


f-strings can look a lot cleaner and more readable than having lots of type conversion and concatenation operations. Let's look at how f-strings would look in the code we used to create our output file earlier.

In [25]:
out_contents = "Treatment\tMean_Weight\n"
for treatment, mean_weight in mean_weights_dict.items():
    out_contents += f"{treatment}\t{mean_weight}\n"
    
print(out_contents)

Treatment	Mean_Weight
ctrl	5.032
trt1	4.661
trt2	5.526



We get the same output, but the characters written in your code are closer to how your final output will look. You can decide whether you want to use f-strings in your code.
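
f-strings can also control how a number is displayed, which goes slightly beyond what we need here but is handy when writing results to a file. As a minimal sketch, adding `:.2f` after a variable name inside the curly braces formats a `float` with two decimal places. The values below are the "trt1" mean from earlier.

```python
treatment = "trt1"
mean_weight = 4.661

default_line = f"{treatment}\t{mean_weight}\n"
rounded_line = f"{treatment}\t{mean_weight:.2f}\n"   # .2f = two decimal places

print(repr(default_line))   # 'trt1\t4.661\n'
print(repr(rounded_line))   # 'trt1\t4.66\n'
```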

## String-building best-practice

As we saw in Bash, string concatenation is slow. The same is true in Python. In Python that is because when you add two `str`s together, you are copying all of the data from each `str` to a new location in memory. You aren't just modifying an existing `str` instance. For short strings like what we have been working with here, that isn't an issue. However, if you wanted to write a file with thousands of lines, then string concatenation could give you much longer run times than a faster method. Specifically, you should instead make a `list` and then join the completed `list` together using `str.join()`. Our current code for creating our output contents would look like this if we use a `list` instead of a `str` to build our output.

In [26]:
out_list = ["Treatment\tMean_Weight\n"]
for treatment, mean_weight in mean_weights_dict.items():
    out_list.append(f"{treatment}\t{mean_weight}\n")

out_contents = "".join(out_list)
print(out_contents)

Treatment	Mean_Weight
ctrl	5.032
trt1	4.661
trt2	5.526



As you can see, the output is the same. If you are curious to see the difference in time, you can run the following code, which builds a 100,000-line string first by repeated `str` concatenation and then by appending to a `list` and calling `str.join()`. The times you get will depend on your computer, so they won't match mine exactly.

In [27]:
import time # We'll cover this syntax below
str_start = time.time() # We'll cover this later too
string = ""
# We can use the built-in function range() to loop over a range of numbers
for _ in range(100_000): # You can use underscores to break up numbers into readable groups like how commas are used in written numbers
    string += "something\n"   
print(f"string concatenation took {time.time()-str_start} seconds")

list_start = time.time()
lst = []
for _ in range(100_000):
    lst.append("something\n")
string = "".join(lst)
print(f"list building and joining took {time.time()-list_start} seconds")

string concatenation took 7.171992063522339 seconds
list building and joining took 0.004171848297119141 seconds


As you can see, string concatenation was more than a thousand times slower in my test, even when adding a short string. If we had used longer line lengths, then the run time would have been even longer for string concatenation, while list joining would not be noticeably affected.

In [28]:
import time
str_start = time.time()
string = ""
for _ in range(100_000):
    string += "something even longer\n" 
print(f"string concatenation took {time.time()-str_start} seconds")

list_start = time.time()
lst = []
for _ in range(100_000):
    lst.append("something even longer\n")
string = "".join(lst)
print(f"list building and joining took {time.time()-list_start} seconds")

string concatenation took 19.156351804733276 seconds
list building and joining took 0.004466056823730469 seconds


## Reading command line input

It is rarely valuable to hard-code file paths in your scripts. Doing so means that you can only run a script if it and its input files are in the expected locations. Furthermore, you can't then run your script on other files without editing the script for every file.

Python provides the ability to read information from commandline input. However, you need to import the "sys" module to gain access to that functionality.

### What is a module?

A module is a set of classes and/or functions which provide some sort of functionality. Typically, a module's classes and functions will all be useful for related tasks. For example, the "sys" module provides functionality for interacting with and accessing system parameters. The "time" module provides functionality for recording the time and date and converting between date and time formats.

Python includes several built-in functions that you always have access to in a Python environment. However, there is lots of functionality you might want to have access to when writing Python code. [Python comes with a lot of optional modules called the "standard library"](https://docs.python.org/3/library/index.html). You can import any module in the standard library without needing to install anything extra and can assume that anyone else running your script will have the standard library on their system as well.

### Why are there modules instead of always having everything loaded?

Organizing functionality into modules in the standard library rather than having everything available all the time has a few benefits. 

First, having more modules loaded increases the time taken for your script to run. If you aren't going to use most of the modules, it makes sense to not slow your execution time needlessly. 

Second, keeping things organized into modules leaves you with a cleaner namespace. We'll talk more about namespace later, but basically, by organizing functions and classes into modules, you don't have to worry so much about having two functions with the same name. To refer to the `time()` function in the "time" module above, we used `time.time()`. If we have another time function in another module we would refer to it as `other_module.time()`. If both functions were loaded directly into our script, it would be difficult to be clear about which time function you wanted to run.
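
As a concrete sketch of that namespace point, the standard library has two modules, `math` and `cmath`, that each contain a function called `sqrt()`. Because we refer to each one through its module name, there is no confusion about which one we mean. (These two modules are just an example I picked; they aren't otherwise used in this notebook.)

```python
import math    # functions for real numbers
import cmath   # functions for complex numbers

real_root = math.sqrt(4)        # the sqrt() from the math module
complex_root = cmath.sqrt(-4)   # the sqrt() from the cmath module

print(real_root)      # 2.0
print(complex_root)   # 2j
```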

### Reading commandline input with the `sys` module

To read input provided in the commandline, you can use the `sys` module. Specifically, `sys` stores each input in a list that you can access using `sys.argv` (short for "argument vector"). The numbering of elements in the list is the same as we saw in Bash. That is, the element at index 1 is your first input, while the element at index 0 is your script name. You can access each element by indexing the list. Therefore, the first input given in the commandline is `sys.argv[1]`.
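
Here is a minimal sketch of a script that reports what it was given on the commandline. If you saved it as, say, `show_args.py` (a made-up name) and ran `python show_args.py plantgrowth.txt`, the first input would be "plantgrowth.txt".

```python
import sys

print(f"number of elements in sys.argv: {len(sys.argv)}")
print(f"script name: {sys.argv[0]}")

inputs = sys.argv[1:]   # everything after the script name
if inputs:              # an empty list counts as false
    print(f"first input: {inputs[0]}")
else:
    print("no inputs were given")
```

Checking that an input was actually given before indexing avoids an `IndexError` when the script is run with no inputs. Also note that every element of `sys.argv` is a `str`, so numbers given on the commandline need converting just like numbers read from a file.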