# Reading and writing files

- [Overview](#overview)
- [Opening and closing files](#opening-and-closing-files)
- [Reading a file](#reading-a-file)
- [Read lines from file](#read-lines-from-file)
- [Read lines efficiently](#read-lines-efficiently)
- [Writing files](#writing-files)
- [Tying it together with read and write](#tying-it-together-with-read-and-write)
- [Further reading](#further-reading)

<h2 id="overview">Overview</h2>

Reading and writing files is a fundamental programming skill.

Below are some examples intended to demonstrate some basic techniques. 

As you become more experienced with Python, you'll likely graduate to tools such as Python's built-in `csv` module or `pandas.read_csv` to simplify the ingestion (or export) of data.

But it's still helpful to get a grounding in the fundamentals of how file input/output (aka file I/O) works in Python. It may come in handy when you're dealing with very large files, and more generally will help you understand the underlying Python functionality that such libraries rely on.

> Most examples below use [files/data/animals.csv](files/data/animals.csv)

<h2 id="opening-and-closing-files">Opening and closing files</h2>

Files are typically read from (or written to) in Python using the built-in [open](https://docs.python.org/3/library/functions.html#open) function.

This function allows us to open a file in various modes.

For example, to open a file in _read_ mode:

In [None]:
f1 = open('files/data/animals.csv', 'r') # 'r' is for read
f1.close()

Opening a file in _append_ mode allows you to add lines to pre-existing content.

In [None]:
f2 = open('files/data/animals.csv', 'a') # 'a' is for append
f2.close()

Opening a file in write mode will overwrite pre-existing content.

In [None]:
f3 = open('files/data/fake_file.csv', 'w') # 'w' is for write
f3.close()

The `open` function has a few other modes, but the above read, append and write modes are the most useful to learn at the outset.
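To make the difference between these modes concrete, here is a minimal sketch. It assumes a hypothetical scratch file created in a temporary directory (via the standard `tempfile` module) rather than the tutorial's data files, and `read_all` is a small helper invented here for illustration.

```python
import os
import tempfile

# Hypothetical scratch file in a temp directory, so no real data is touched
path = os.path.join(tempfile.mkdtemp(), 'scratch.txt')

def read_all(file_path):
    """Illustrative helper: return the full contents of a text file."""
    f = open(file_path, 'r')
    contents = f.read()
    f.close()
    return contents

f = open(path, 'w')  # 'w' creates the file (or empties an existing one)
f.write('cat\n')
f.close()

f = open(path, 'a')  # 'a' keeps what is there and adds to the end
f.write('dog\n')
f.close()
after_append = read_all(path)

f = open(path, 'w')  # 'w' again: the earlier lines are discarded
f.write('snake\n')
f.close()
after_write = read_all(path)

print(repr(after_append))  # 'cat\ndog\n'
print(repr(after_write))   # 'snake\n'
```

Appending preserved the first line, while reopening in write mode threw both earlier lines away.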

Note that we made a point of closing all of the files. Failing to close a file can lead to [memory leaks](https://en.wikipedia.org/wiki/Memory_leak) and other unexpected behavior. For example, when working with files in a Jupyter notebook, content you've written may not get "flushed" to the file until you call the `close` method on the open file.

<h3 id="with-idiom">The "with" idiom</h3>

As you get more familiar with Python, you'll notice the use of the `with` statement to open files. This is a common idiom which ensures that a file is properly closed after the `with` block of code completes execution.

> Here's some [helpful background](https://jeffknupp.com/blog/2016/03/07/python-with-context-managers/) on why we should always use `with` to open files.

In [None]:
with open('files/data/animals.csv', 'r') as myfile:
    text = myfile.read()
# At this point, we're outside the "with" block
# and the file has been automatically closed
print(text)

This idiom can feel strange at first, but using it can help avoid memory leaks or other code problems.
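One way to convince yourself that `with` really closes the file is to check the file object's `closed` attribute inside and outside the block. The sketch below assumes a hypothetical scratch file in a temporary directory rather than `animals.csv`.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), 'scratch.csv')

with open(path, 'w') as outfile:
    outfile.write('animal\ncat\n')
    inside = outfile.closed  # still open while we are inside the block

after = outfile.closed       # closed as soon as the block ends

print(inside)  # False
print(after)   # True
```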

<h2 id="reading-a-file">Reading a file</h2>

The simplest way to read data from a file is the [read](https://www.w3schools.com/python/ref_file_read.asp) method on an open file [handle](https://en.wikipedia.org/wiki/Handle_(computing)).

For example, to read data from [files/data/animals.csv](files/data/animals.csv):

In [None]:
with open('files/data/animals.csv', 'r') as f:
    text = f.read()
print(text)

<h2 id="read-lines-from-file">Read lines from file</h2>

When we read data from a file, often the most useful way to access that data is line by line. Unfortunately, the `.read` method mentioned above brings the data in as one large blob of text, leaving us on the hook for splitting that text into separate lines.
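For reference, the manual approach might look like the sketch below. The `text` value is a made-up stand-in for what `f.read()` could return; the real contents of `animals.csv` may differ.

```python
# Made-up stand-in for the blob of text returned by f.read()
text = 'animal\ncat\ncougar\ndog\n'

# splitlines() breaks the text on line boundaries and drops the newlines
lines = text.splitlines()
print(lines)  # ['animal', 'cat', 'cougar', 'dog']

# split('\n') also works, but leaves an empty string after the final newline
rough = text.split('\n')
print(rough)  # ['animal', 'cat', 'cougar', 'dog', '']
```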

It's not a ton of extra work, but why go to the trouble when Python gives us several ways to more quickly process individual rows?

Below is an example using the [readlines](https://www.w3schools.com/python/ref_file_readlines.asp) method on open files.


In [None]:
with open('files/data/animals.csv', 'r') as f:
    animals = f.readlines()
print(animals)

Note above that the items in the list contain a `\n` character. This is a [newline](https://en.wikipedia.org/wiki/Newline), an "invisible" character that is used to indicate the end of a line of text on Mac/Unix systems.

When processing data read from files, we typically want to remove newlines using the [strip method](https://www.w3schools.com/python/ref_string_strip.asp).

In [None]:
animals[1] # here the first animal has a newline

In [None]:
animals[1].strip() # here we strip it
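One detail worth knowing: called with no arguments, `strip` removes all leading and trailing whitespace, not only the newline. If surrounding spaces matter in your data, `rstrip('\n')` removes only trailing newlines. A small sketch using a made-up string:

```python
line = '  cat \n'  # made-up value with extra spaces around the name

print(repr(line.strip()))       # 'cat'
print(repr(line.rstrip('\n')))  # '  cat '
```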

Stripping newlines helps ensure that programming logic such as name-based matching or filtering doesn't accidentally fail due to the presence of newlines. The example below illustrates how newlines might trip you up.

Notice the animal rows have newlines:

In [None]:
print(animals)

Let's create a new list and attempt to store just the cat in the list:

In [None]:
filtered_animals = []
for animal in animals:
    if animal == 'cat':
        filtered_animals.append(animal)
print(filtered_animals)

**Note:** The list is empty; our attempt to match "cat" failed!

Let's try again, this time also matching the newline character.

In [None]:
for animal in animals:
    if animal == 'cat\n':
        filtered_animals.append(animal)
print(filtered_animals)

This time the code worked as expected since we checked for the newline.

A better alternative here would be to strip the newline and _then_ check `if animal == 'cat':`.

The code not only becomes less confusing, but you've also performed a standard data cleaning operation on the data.

Try updating the code above to strip the newline and check for the name.
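If you get stuck, here is one possible solution. To keep the sketch self-contained, it assumes a hard-coded `animals` list that mimics what `readlines` might return; in the notebook you would reuse the list read from the file.

```python
# Assumed stand-in for the list returned by f.readlines()
animals = ['animal\n', 'cat\n', 'cougar\n', 'dog\n', 'snake\n', 'narwhal\n']

filtered_animals = []
for animal in animals:
    name = animal.strip()  # clean the value first...
    if name == 'cat':      # ...then compare against the plain name
        filtered_animals.append(name)

print(filtered_animals)  # ['cat']
```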

<h2 id="read-lines-efficiently">Read lines efficiently</h2>

The `readlines` method is handy, but Python provides an even simpler idiom for reading the lines of a file: just step through them using a [for loop](https://www.w3schools.com/python/python_for_loops.asp).


In [None]:
with open('files/data/animals.csv') as f:
    for line in f:
        print(line.strip())

Unlike `read` or `readlines`, the "for loop" method above reads each line from the file in a step-wise fashion, one by one. 

> This method of data ingestion is particularly handy when dealing with large files. It helps us avoid overwhelming our system's memory when dealing with larger data sets, by allowing us to process data row by row in a so-called "stream".
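As a rough illustration of streaming, the sketch below counts the data rows in a file without ever building a list of all its lines. It assumes a small hypothetical file written to a temporary directory; the same loop would work unchanged on a much larger file.

```python
import os
import tempfile

# Hypothetical sample file standing in for a (potentially huge) data file
path = os.path.join(tempfile.mkdtemp(), 'animals.csv')
with open(path, 'w') as f:
    f.write('animal\ncat\ncougar\ndog\nsnake\nnarwhal\n')

row_count = 0
with open(path) as f:
    next(f)  # skip the header row
    for line in f:
        # Only the current line is held in a variable at any moment
        row_count += 1

print(row_count)  # 5
```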

<h2 id="writing-files">Writing files</h2>

Let's say that we want to create a new file containing a filtered list of animals. Specifically, we just want animals whose names do not start with the letter "c".

Let's start with a hard-coded list of animals (plus the column header `animal`).


In [None]:
animals = ['animal', 'cat', 'cougar', 'dog', 'snake', 'narwhal']

For our list, that means we want to _exclude_ `cat` and `cougar`.

Below, we create an empty list (`animals_filtered`) to store the filtered list.

In [None]:
animals_filtered = []
for animal in animals:
    if animal not in ['cat', 'cougar']:
        animals_filtered.append(animal)
print(animals_filtered)
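The filter above names the excluded animals explicitly. A variation, sketched here as one possible alternative, tests the first letter with the string method `startswith`, so any new animal beginning with "c" would be excluded without editing the list.

```python
animals = ['animal', 'cat', 'cougar', 'dog', 'snake', 'narwhal']

animals_filtered = []
for animal in animals:
    # Keep only names that do not begin with a lowercase "c"
    if not animal.startswith('c'):
        animals_filtered.append(animal)

print(animals_filtered)  # ['animal', 'dog', 'snake', 'narwhal']
```

Note that `startswith` is case-sensitive, so a name such as `Cat` would not be excluded by this check.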

Now we're ready to write the filtered data to a new file. 

In this example, we once again use the `open` function. But this time we use the `w`, or "write" mode.

Also note that we add a newline to ensure that each item in our list appears on a separate row.


In [None]:
with open('animals_filtered.csv', 'w') as outfile:
    for animal in animals_filtered:
        # Add a newline so that each animal
        # is written on its own row
        outfile.write(animal + '\n')
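As an alternative to calling `write` in a loop, file objects also provide a `writelines` method. It does not add newlines for you, so they still have to be supplied. The sketch below writes to a hypothetical file in a temporary directory and then reads it back to confirm the result.

```python
import os
import tempfile

animals_filtered = ['animal', 'dog', 'snake', 'narwhal']

# Hypothetical output location, chosen so the example leaves no stray files
path = os.path.join(tempfile.mkdtemp(), 'animals_filtered.csv')

with open(path, 'w') as outfile:
    # writelines writes each string as-is, so we append '\n' ourselves
    outfile.writelines(animal + '\n' for animal in animals_filtered)

# Read the file back to check what was written
with open(path, 'r') as infile:
    written = infile.read()

print(repr(written))  # 'animal\ndog\nsnake\nnarwhal\n'
```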

<h2 id="tying-it-together-with-read-and-write">Tying it together with read and write</h2>

So far we've learned how to read from and write to files separately, along with how to create filtered lists of data based on some conditional logic. We've also touched on the need to carefully handle the newline character.

Now let's tie those skills together with a final example. Once again, we'll exclude animals whose names start with "c" (`cat`, `cougar`).

We start by reading the data from [files/data/animals.csv](files/data/animals.csv) and creating a filtered list of animals.

Note that we provide the `r` option, for "read", to the `open` function, and we strip the newline character from each line before checking it against our list of animals to exclude.


In [None]:
animals_filtered = []
with open('files/data/animals.csv', 'r') as infile:
    for line in infile:
        animal = line.strip()
        if animal not in ['cat', 'cougar']:
            animals_filtered.append(animal)

print(animals_filtered)

Now we can write the filtered list to a new file. 

In this example, we once again use the `open` function with the `w`, or "write", option.

In [None]:
with open('animals_filtered.csv', 'w') as outfile:
    for animal in animals_filtered:
        # Note we have to add the newline that we
        # stripped above
        outfile.write(animal + '\n')

Go ahead and open the `animals_filtered.csv` you just generated (it should be in the same folder as this notebook).

You should see the column header (`animal`) along with rows for _dog_, _snake_ and _narwhal_.

It's worth noting that above, we created an extra bit of work for ourselves by stripping newlines when we read the source data. When we wrote the filtered data to a new file, we were forced to add the newline to each row.

If we had not restored the newline, the data would have been jumbled into a single row in the file: `animaldogsnakenarwhal`.
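You can reproduce that jumbled result with the short sketch below, which deliberately leaves out the newline. It assumes a hypothetical output file in a temporary directory so that the real `animals_filtered.csv` is not overwritten.

```python
import os
import tempfile

animals_filtered = ['animal', 'dog', 'snake', 'narwhal']

# Hypothetical scratch file, kept separate from the real output file
path = os.path.join(tempfile.mkdtemp(), 'jumbled.csv')

with open(path, 'w') as outfile:
    for animal in animals_filtered:
        outfile.write(animal)  # no newline added, on purpose

with open(path, 'r') as infile:
    jumbled = infile.read()

print(jumbled)  # animaldogsnakenarwhal
```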

<h2 id="further-reading">Further reading</h2>

For more info on reading and writing files, check out:

* The W3Schools chapters on file handling, starting with [Python file handling](https://www.w3schools.com/python/python_file_handling.asp).
* [Chapter 9 - Reading and Writing Files](https://automatetheboringstuff.com/2e/chapter9/) of *Automate the Boring Stuff*