# Lesson 1.9: File input and output

<hr>

In [1]:
import os
import glob

<hr/>

Reading data in from files and then writing your results out again is one of the most common practices in scientific computing. In this tutorial, we will learn about some of Python's File I/O capabilities. We will use a .txt file as an example. 

## File objects

To open a file, we use the built-in `open()` function. When opening files, we should do this using **context management**. I will demonstrate how to open a file and then describe the syntax.

In [2]:
with open('data/input.txt', 'r') as f:
    print(type(f))

<class '_io.TextIOWrapper'>


Python has a wonderful keyword, `with`, which enables **context management**. Upon entry into a `with` block, the variable named after `as` takes on a specific meaning. In this case, the variable `f` is an open file, an instance of the `_io.TextIOWrapper` class. Upon exit from the block, certain cleanup operations take place. For file objects, the file is automatically closed upon exit, **even if there is an error**. This is important. Without context management, if your program raises an exception before you have a chance to close the file, the file will not get closed and you could be in trouble. With context management, the file will still get closed. So here is an important tip:

<div style="color: dodgerblue; text-align: center; font-weight: bold;">

Use context management using <tt>with</tt> when working with files.
    
</div>
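
To make the tip concrete, here is a minimal sketch of what context management buys us. It compares a manual `try`/`finally` with a `with` block in which an error is raised on purpose. The scratch file `demo.txt` and the variable names are assumptions made for this illustration only; they are not part of the lesson's data.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as f:
    f.write('some text\n')

# Roughly what a `with` block does for us: close the file no matter what
f_manual = open(path, 'r')
try:
    contents = f_manual.read()
finally:
    # Runs whether or not the read raised an exception
    f_manual.close()

# With context management, the file is closed even when an error occurs
try:
    with open(path, 'r') as f_with:
        raise ValueError('something went wrong while the file was open')
except ValueError:
    pass

print(f_manual.closed, f_with.closed)
```

Both file objects report that they are closed, but the `with` version needs no explicit call to `close()`.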

Let's focus for a moment on the variable `f` in the above code cell. It is a Python file object, which has methods and attributes, just like any other object. We'll explore those in a moment, but first, let's look at how we opened the file. The first argument to `open()` is a string containing the name of the file, including its path if necessary. The second argument is a string that says what we will be doing with the file; that is, whether we are reading from or writing to it. The possible strings for this second argument are

|string | meaning|
|:------|:-------|
|`'r'` | open a text file for reading|
|`'w'` | create and open a text file for writing|
|`'a'` | open a text file for appending, creating it if it does not exist|
|`'r+'`| open a text file for reading and writing|
|append `'b'` to any of the above | same as above, except for binary files|

We will mostly be working with text files in the bootcamp, so the first three are the most useful. A big warning, though:


<div style="color: tomato; text-align: center; font-weight: bold;">

Trying to open an existing file with <tt>'w'</tt> will wipe it out and create a new file.
    
</div>
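
The difference between `'w'` and `'a'` is easy to see in a small experiment. The sketch below writes to a hypothetical scratch file, `modes_demo.txt`, in a temporary directory, so nothing of value can be wiped out.

```python
import os
import tempfile

# Hypothetical scratch file used only for this demonstration
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

with open(path, 'w') as f:
    f.write('first\n')

# Opening with 'w' again discards the existing contents
with open(path, 'w') as f:
    f.write('second\n')

# Opening with 'a' keeps the existing contents and adds to the end
with open(path, 'a') as f:
    f.write('third\n')

with open(path, 'r') as f:
    result = f.read()

print(repr(result))
```

The line `first` is gone because the second `open()` with `'w'` truncated the file, while `third` was added after `second`.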


### Reading data out of the file with file object methods

We will focus on the methods `f.read()` and `f.readlines()`. What do they do?

|method | task|
|:------|:-------|
|`f.read()` | Read the entire contents of the file into a string|
|`f.readlines()` | Read the entire file into a list with each item being a string representing a line|

First, we'll try using the first method to get a single string with the entire contents of the file.

In [3]:
# Read file into string
with open('data/input.txt', 'r') as f:
    f_str = f.read()

# Let's look at the first 1000 characters
f_str[:1000]

'"Big data" refers to data sets that are too large or complex to be dealt with by traditional data processing application software.\nData with many fields (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate.\nBig data analysis challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data source.\nBig data was originally associated with three key concepts: volume, variety, and velocity.\nThe analysis of big data presents challenges in sampling, and thus previously allowing for only observations and sampling.\nTherefore, big data often includes data with sizes that exceed the capacity of traditional software to process within an acceptable time and value.'

We see lots of `\n`, which signifies a new line. The backslash is known as an **escape character**, meaning that the `n` after it does not signify the letter n, but that `\n` together means a new line.

Now, let's try reading it in as a list.

In [4]:
# Read contents of the file in as a list
with open('data/input.txt', 'r') as f:
    f_list = f.readlines()

# Look at the list (first ten entries)
f_list[:10]

['"Big data" refers to data sets that are too large or complex to be dealt with by traditional data processing application software.\n',
 'Data with many fields (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate.\n',
 'Big data analysis challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data source.\n',
 'Big data was originally associated with three key concepts: volume, variety, and velocity.\n',
 'The analysis of big data presents challenges in sampling, and thus previously allowing for only observations and sampling.\n',
 'Therefore, big data often includes data with sizes that exceed the capacity of traditional software to process within an acceptable time and value.']

We see that each entry is a line, including the newline character. To look at lines in files, the `rstrip()` method for strings can come in handy. It strips all whitespace, including newlines, from the end of a string.

In [5]:
f_list[0].rstrip()

'"Big data" refers to data sets that are too large or complex to be dealt with by traditional data processing application software.'
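
Because `rstrip()` with no arguments removes all trailing whitespace, it can remove more than the newline. The following sketch, using made-up strings, contrasts it with `rstrip('\n')`, which removes only trailing newlines, and with `splitlines()`, which splits a string into lines without keeping the newline characters.

```python
# A made-up line with a trailing space, a tab, and a newline
line = 'volume, variety, and velocity. \t\n'

stripped_all = line.rstrip()          # removes the space, tab, and newline
stripped_newline = line.rstrip('\n')  # removes only the newline

# splitlines() breaks a multi-line string into a list without newlines
text = 'line one\nline two\nline three\n'
lines = text.splitlines()

print(repr(stripped_all))
print(repr(stripped_newline))
print(lines)
```

Which one to use depends on whether trailing spaces or tabs are meaningful in your data.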

### Reading line-by-line

What if we do not want to read the entire file into a list? For example, if a file is several gigabytes, we do not want to spend all of our RAM storing a list. Instead, we can read it line-by-line. Conveniently, the file object can be used as an iterator.

In [6]:
# Print the first eight lines of the file (fewer if the file is shorter)
with open('data/input.txt', 'r') as f:
    for i, line in enumerate(f):
        print(line.rstrip())
        if i >= 7:
            break

"Big data" refers to data sets that are too large or complex to be dealt with by traditional data processing application software.
Data with many fields (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate.
Big data analysis challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data source.
Big data was originally associated with three key concepts: volume, variety, and velocity.
The analysis of big data presents challenges in sampling, and thus previously allowing for only observations and sampling.
Therefore, big data often includes data with sizes that exceed the capacity of traditional software to process within an acceptable time and value.
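
As an alternative to counting with `enumerate()` and `break`, the standard library's `itertools.islice()` can take the first few items from any iterator, including a file object. This is a sketch with a hypothetical scratch file, `many_lines.txt`, generated on the fly.

```python
import itertools
import os
import tempfile

# Hypothetical scratch file with 100 short lines
path = os.path.join(tempfile.mkdtemp(), 'many_lines.txt')
with open(path, 'w') as f:
    for i in range(100):
        f.write(f'line {i}\n')

# Read only the first n_lines lines; the rest of the file is never loaded
n_lines = 3
with open(path, 'r') as f:
    first_lines = [line.rstrip() for line in itertools.islice(f, n_lines)]

print(first_lines)
```

If the file has fewer than `n_lines` lines, `islice()` simply stops at the end of the file.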


Alternatively, we can use the method `f.readline()` to read a single line in the file and return it as a string.

In [7]:
# Call readline() seven times; past the end of the file it returns ''
with open('data/input.txt', 'r') as f:
    i = 0
    while i < 7:
        print(f.readline().rstrip())
        i += 1

"Big data" refers to data sets that are too large or complex to be dealt with by traditional data processing application software.
Data with many fields (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate.
Big data analysis challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data source.
Big data was originally associated with three key concepts: volume, variety, and velocity.
The analysis of big data presents challenges in sampling, and thus previously allowing for only observations and sampling.
Therefore, big data often includes data with sizes that exceed the capacity of traditional software to process within an acceptable time and value.



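Each subsequent call to `f.readline()` reads in the next line of the file. Once the end of the file is reached, `f.readline()` returns an empty string, which is why a blank line is printed at the end of the output above. (As we read through a file, we keep moving forward in the bytes of the file, and we have to use `f.seek()` to rewind.)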
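
Here is a minimal sketch of moving through a file and rewinding it. It uses a hypothetical two-line scratch file, and `f.seek(0)` to return to the beginning of the file.

```python
import os
import tempfile

# Hypothetical scratch file with two lines
path = os.path.join(tempfile.mkdtemp(), 'seek_demo.txt')
with open(path, 'w') as f:
    f.write('alpha\nbeta\n')

with open(path, 'r') as f:
    first = f.readline()        # 'alpha\n'
    second = f.readline()       # 'beta\n'
    at_end = f.readline()       # '' because we are at the end of the file
    f.seek(0)                   # rewind to the beginning
    first_again = f.readline()  # 'alpha\n' once more

print(repr(first), repr(second), repr(at_end), repr(first_again))
```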

## Writing to a file

Writing to a file has similar syntax. We already saw how to open a file for writing. Again, context management is useful. However, before opening a file for writing, we should check that a file of the same name does not already exist. The `os.path` module is useful for this. The function `os.path.isfile()` checks to see if a file exists.

In [8]:
os.path.isfile('data/input.txt')

True
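
As a sketch of an alternative that this lesson does not otherwise use, `open()` also accepts the mode `'x'`, which creates a new file and raises a `FileExistsError` if the file already exists. The check and the creation then happen in a single step. The scratch file name below is hypothetical.

```python
import os
import tempfile

# Hypothetical scratch file used only for this demonstration
path = os.path.join(tempfile.mkdtemp(), 'exclusive_demo.txt')

# The first attempt succeeds because the file does not exist yet
with open(path, 'x') as f:
    f.write('created once\n')

# The second attempt fails instead of overwriting the file
try:
    with open(path, 'x') as f:
        f.write('this would clobber the file\n')
    overwritten = True
except FileExistsError:
    overwritten = False

with open(path, 'r') as f:
    contents = f.read()

print(overwritten, repr(contents))
```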

Now that we know how to check for the existence of a file so we do not overwrite it, we can open a file and write to it.

In [9]:
if os.path.isfile('mastery.txt'):
    raise RuntimeError('File mastery.txt already exists.')

with open('mastery.txt', 'w') as f:
    f.write('This is my file.')
    f.write('There are many like it, but this one is mine.')
    f.write('I must master my file like I must master my life.')

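Note that we can use the `f.write()` method to write strings to a file. If `mastery.txt` already exists, for example because this cell was run before, the `RuntimeError` is raised and the existing file is left untouched. Assuming the file did not exist and the write went through, let's look at the file contents.

In [10]:
!cat mastery.txt

This is my file.There are many like it, but this one is mine.I must master my file like I must master my life.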


Ah!  There are no newlines!  When writing to a file, unlike when you use the `print()` function, you must include the newline characters.  Let's try again, intentionally obliterating our first attempt.

In [11]:
with open('mastery.txt', 'w') as f:
    f.write('This is my file.\n')
    f.write('There are many like it, but this one is mine.\n')
    f.write('I must master my file like I must master my life.\n')
    
!cat mastery.txt

This is my file.
There are many like it, but this one is mine.
I must master my file like I must master my life.
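
If you would rather not type the newline characters yourself, one option is to use `print()` with its `file` keyword argument, which writes to the file and appends a newline just as it does on the screen. This is a sketch using a hypothetical scratch file rather than `mastery.txt`.

```python
import os
import tempfile

# Hypothetical scratch file used only for this demonstration
path = os.path.join(tempfile.mkdtemp(), 'print_demo.txt')

lines_to_write = [
    'This is my file.',
    'There are many like it, but this one is mine.',
    'I must master my file like I must master my life.',
]

with open(path, 'w') as f:
    for line in lines_to_write:
        # print() adds the newline for us
        print(line, file=f)

with open(path, 'r') as f:
    contents = f.read()

print(contents)
```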


That's better. Note also that for a file opened in text mode, `f.write()` **only** takes strings as arguments. You cannot pass numbers. They must be converted to strings first.

In [12]:
# This will result in an exception
with open('gimme_phi.txt', 'w') as f:
    f.write('The golden ratio is φ = ')
    f.write(1.61803398875)

TypeError: write() argument must be str, not float

Yup.  It must be a string.  Let's try again.

In [13]:
with open('gimme_phi.txt', 'w') as f:
    f.write('The golden ratio is φ = ')
    f.write('{phi:.8f}'.format(phi=1.61803398875))

!cat gimme_phi.txt

The golden ratio is φ = 1.61803399

That works!
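
The same formatting can be written more compactly with an f-string. The sketch below also passes `encoding='utf-8'` to `open()`, which is an addition not used in the cells above; it makes the handling of non-ASCII characters such as φ independent of the platform's default encoding. The scratch file name is hypothetical.

```python
import os
import tempfile

# Hypothetical scratch file used only for this demonstration
path = os.path.join(tempfile.mkdtemp(), 'phi_demo.txt')

# Compute the golden ratio instead of typing it in
phi = (1 + 5 ** 0.5) / 2

# The format specifier .8f works in f-strings just as it does in format()
with open(path, 'w', encoding='utf-8') as f:
    f.write(f'The golden ratio is φ = {phi:.8f}\n')

with open(path, 'r', encoding='utf-8') as f:
    contents = f.read()

print(contents)
```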