# Files

The data we have worked with so far has been entered directly
into the Python code.  For programs that run on relatively small
amounts of data, this is often a feasible approach.  However,
more often than not, we write code to perform
computations on more than just a few values.  In many cases
the data comes from another
source, such as a science experiment, a
survey, or the internet.  In these
cases, typing all of the data directly into the
Python code does not make much sense.  Often the data
was already recorded in a file by someone else, and retyping
it into a list would duplicate that work.

When we wish to work with data that is already in a file,
we perform what's known as file I/O (input/output).
In Python, interacting with a file takes the following
steps:
* open the file - this gives us what's known as a handle to the
  file so we can interact with it
* read/write - this is where we either access the data in the
  file or add data to a file (or both)
* close the file - our way of saying we are done modifying
  the file and all of the changes should be saved
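
The three steps above can be sketched as follows.  This is a minimal illustration, not the only way to do it: the file name `example.txt` is hypothetical, and the file is created in a temporary folder so the example does not depend on any existing data.

```python
import os
import tempfile

# Hypothetical file in a throwaway folder, so nothing real is touched.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.txt")

f = open(path, "w")      # 1. open the file (here, for writing)
f.write("first line\n")  # 2. write to it
f.close()                # 3. close it so the text is saved

f = open(path, "r")      # 1. open the same file (now for reading)
text = f.read()          # 2. read its contents into a string
f.close()                # 3. close it

print(text)
```

Each of these steps is covered in more detail in the sections that follow.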

## Opening Files

The first step, opening a file, is done with the function
`open(filename, mode)`, where
* `filename` is a string containing the name of the file. 
  It can also be what's known as a *path* to the file and
  include folders that the file is in.
* `mode` is a string indicating how we plan to interact
  with the file.  If `mode` is omitted, it defaults to `r`.
  The primary options are:
  * `r` = read
  * `w` = write, this will overwrite a file that already exists
  * `a` = append, this will add to the end of a file that already exists
  * `r+` = read and write (the file must already exist)
  
  There are a few other modes for files that
  don't contain text, such as an mp3 file for a song.  For
  these, the modes have a `b` at the end, e.g. `rb`, `wb`, etc.
  However, we'll focus on reading/writing text-based files, as that
  is the most common use with base Python (if you
  were going to read in an audio file or an image file, there's a good
  chance you'd use a separate Python package).
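
To make the difference between `w` and `a` concrete, here is a small sketch.  The file name `modes.txt` is hypothetical and the file lives in a temporary folder; the variable names are only for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")  # hypothetical file

f = open(path, "w")   # "w" creates the file (or empties an existing one)
f.write("one\n")
f.close()

f = open(path, "a")   # "a" keeps the existing text and adds to the end
f.write("two\n")
f.close()

f = open(path, "r")
after_append = f.read()
f.close()

f = open(path, "w")   # opening with "w" again discards the old contents
f.write("three\n")
f.close()

f = open(path, "r")
after_overwrite = f.read()
f.close()

print(repr(after_append))
print(repr(after_overwrite))
```

After the append, the file holds both lines; after reopening with `w`, only the newly written line remains.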

If the file can be opened, we get what is known as
a *file handle*, essentially a connection to the file so
that we can access it.  Note that this isn't the file itself.
Sometimes a file cannot be opened, in which case the code
will raise an error.  For instance, if we try to open a file
for reading that doesn't exist, we will get a `FileNotFoundError`.

For example:

In [None]:
fr = open('data/blahblahreading.txt', 'r')
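
If we would rather respond to a missing file than have the program stop, one option is to catch the error.  The sketch below assumes some familiarity with `try`/`except`; the file name is hypothetical and is placed in a freshly created temporary folder so that it is guaranteed not to exist.

```python
import os
import tempfile

# A path inside a brand-new empty folder, so the file cannot exist.
missing = os.path.join(tempfile.mkdtemp(), "no-such-file.txt")

try:
    f = open(missing, "r")
except FileNotFoundError as err:
    f = None  # no handle was created
    print("Could not open the file:", err)
```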

Note that opening a file that doesn't exist for writing
is not an issue; Python will simply create the file.

In [None]:
fw = open('data/blahblahwriting.txt', 'w')

Note that we can still get errors while opening for writing,
for example a `PermissionError` if we try to open a file we do not have
permission to edit (hard to simulate on notebooks running
on CoCalc or other cloud-based servers).

Let's open the file "robert-frost.txt" in the "data" folder,
which contains the text of the poem "Stopping by Woods on a
Snowy Evening" by Robert Frost.
Because this file is nested in a folder, we need to give `open`
the path, which is just the folder name, followed by a "/",
followed by the filename, i.e. "data/robert-frost.txt".
Once we have a file handle, it's just another variable in Python
and we can do things like print it.

In [1]:
ffrost = open('data/robert-frost.txt')
print(ffrost)

<_io.TextIOWrapper name='data/robert-frost.txt' mode='r' encoding='UTF-8'>


Note that when we print it, we get a printed version of the file
handle; we don't get the file contents.  This is an important
distinction -- a file handle isn't the contents of the file, and we
need to read from the file in order to get the contents.

## Reading/Writing Files

Once a file is open and we have a file handle, we can either read from
or write to the file (depending on the mode with which we opened it).

### Reading Files

To read from a file, we call one of the reading methods on the
file handle for that file.  The
available reading methods are:

* `read()` - returns a string of the entire input file
* `readlines()` - returns a list with each line in the file as a string
  in the list
* `readline()` - returns a string of the next line in the file

The first two options store the entire file in a variable at once,
so while convenient, they are often not appropriate for very large files.

Let's look at some examples of reading the "robert-frost.txt" file containing
one of Robert Frost's poems.  Note that once we reach the end of the file,
subsequent calls (even to a different reading method) will return
an empty string or list.  So, in order to illustrate the difference,
we reopen the file each time.

In [4]:
ffrost = open('data/robert-frost.txt')
fulltext = ffrost.read()
print(len(fulltext))
print(fulltext)

544
Whose woods these are I think I know.
His house is in the village though;
He will not see me stopping here
To watch his woods fill up with snow.

My little horse must think it queer
To stop without a farmhouse near
Between the woods and frozen lake
The darkest evening of the year.

He gives his harness bells a shake
To ask if there is some mistake.
The only other sound’s the sweep
Of easy wind and downy flake.

The woods are lovely, dark and deep,
But I have promises to keep,
And miles to go before I sleep,
And miles to go before I sleep.


In [7]:
ffrost = open('data/robert-frost.txt')
lines = ffrost.readlines()
print(lines)

['Whose woods these are I think I know.\n', 'His house is in the village though;\n', 'He will not see me stopping here\n', 'To watch his woods fill up with snow.\n', '\n', 'My little horse must think it queer\n', 'To stop without a farmhouse near\n', 'Between the woods and frozen lake\n', 'The darkest evening of the year.\n', '\n', 'He gives his harness bells a shake\n', 'To ask if there is some mistake.\n', 'The only other sound’s the sweep\n', 'Of easy wind and downy flake.\n', '\n', 'The woods are lovely, dark and deep,\n', 'But I have promises to keep,\n', 'And miles to go before I sleep,\n', 'And miles to go before I sleep.']


Note that in the above example, each string in the list (except
possibly the last) ends with `\n`, the newline character.  This is
because when reading from a file, the newline characters are read as well.
If you go on to process the strings in this list after reading, you
will likely first want to remove the newline.  This can be
done with slicing or (more commonly) with the `strip()` method,
which removes all whitespace from both ends of a string.

In [None]:
print(len(lines[0]))
print(len(lines[0].strip()))
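
The sketch below compares the two approaches on strings copied from the poem output above, so it runs without the data file.  It also shows `rstrip("\n")`, a standard string method not used elsewhere in this section, which removes only trailing newlines.

```python
first = "Whose woods these are I think I know.\n"
last = "And miles to go before I sleep."  # the final line has no newline

# Slicing always drops the last character, whatever it is.
sliced_first = first[:-1]
sliced_last = last[:-1]   # this removes the period, not a newline

# strip() only removes whitespace, so it is safe on both lines.
stripped_first = first.strip()
stripped_last = last.strip()

# rstrip("\n") removes trailing newlines but keeps other whitespace.
only_newline = "  indented line\n".rstrip("\n")

print(repr(sliced_last))
print(repr(stripped_last))
print(repr(only_newline))
```

This is one reason `strip()` is usually preferred over slicing: slicing silently removes a real character when a line has no trailing newline.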

#### Reading and Looping

For big files, it's more efficient to process
the file line-by-line.  The `readline()` method is one
option that does this.  Since it only reads a single
line of the file at a time, we would typically need
to put it in a loop to process the whole file (until we
get an empty string, which signals we've reached the end of the
file; a blank line in the file is returned as `'\n'`, not as an
empty string).  However, there is an even easier way: we can
simply loop on the file handle, similar to how we
can loop through a list:
```
for line in f:
    # do something with line
```

For instance, suppose we just wanted to count the lines
in the file:

In [None]:
ffrost = open('data/robert-frost.txt')
count = 0
for line in ffrost:
    count += 1
print(count)
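
For comparison, here is a sketch of the `readline()` loop described earlier.  It writes a small hypothetical file to a temporary folder first so that it is self-contained; note that the blank line in the middle still counts as a line because it is read as `"\n"`.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "three-lines.txt")  # hypothetical file
f = open(path, "w")
f.write("line one\n\nline three\n")  # the middle line is blank
f.close()

f = open(path)
count = 0
line = f.readline()
while line != "":        # "" means end of file; a blank line is "\n"
    count += 1
    line = f.readline()
f.close()

print(count)  # 3
```

Looping directly on the file handle gives the same count with less code, which is why it is usually the preferred form.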

With any of the reading methods, what we get back is simply
a string (or list of strings) that can be processed like
any other string: we can break it up into words, convert it to another
type to perform computations, etc.
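
As a brief sketch of that kind of processing, the strings below stand in for lines that were read from a file (the numeric line is made up for illustration):

```python
# A line of text, as it might come back from readline() or a loop.
text_line = "Whose woods these are I think I know.\n"
words = text_line.split()   # split on whitespace into a list of words

# A hypothetical line of comma-separated numbers.
number_line = "3.5, 4.0, 2.25\n"
values = []
for piece in number_line.strip().split(","):
    values.append(float(piece))   # convert each string to a float

total = sum(values)
print(len(words))
print(total)
```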

### Writing Files

We can write to a file by calling the `write(textstr)` method
on the file handle.  This method writes the text in the string
`textstr` to the file.

For example, using the `fw` file handle above:

In [None]:
fw.write("hello world\n")
fw.write("hello world line 2")
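
Two details are worth illustrating: `write()` does not add a newline for us, and it only accepts strings, so other values must be converted first.  The sketch below uses a hypothetical file in a temporary folder rather than the `fw` handle above.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "numbers.txt")  # hypothetical file

out = open(path, "w")
n = out.write("hello world\n")    # write() returns the number of characters written
for value in [1, 2, 3]:
    out.write(str(value) + "\n")  # convert to a string and add the newline ourselves
out.close()

check = open(path)
contents = check.read()
check.close()

print(n)
print(repr(contents))
```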

## Closing Files

Once we are finished working with a file (i.e. we no longer need to read from
or write to the file), we must close the file to indicate to the computer
that we will no longer access or modify it.  Closing also
ensures that any text written to the file is actually saved to it.

We close files in Python by calling the `close()` method on the specific
file handle we wish to close.  If we wish to close more than one
file, we must call `close()` for each.  For instance, suppose we want to close
`fw` and `ffrost`:

In [None]:
fw.close()
ffrost.close()
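
A file handle keeps track of whether it has been closed through its `closed` attribute, and trying to use a closed handle raises a `ValueError`.  The sketch below illustrates both with a hypothetical file in a temporary folder:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "closing.txt")  # hypothetical file

f = open(path, "w")
f.write("saved\n")
before = f.closed   # False: the file is still open
f.close()
after = f.closed    # True: the handle can no longer be used

try:
    f.write("more text")   # writing to a closed file is an error
    failed = False
except ValueError:
    failed = True

print(before, after, failed)
```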

## Alternative to Open/Close

The downside of opening files is having to remember to close them later.
However, the code for which the file needs to be open
is often relatively small (because the desired file contents may be stored
in a variable to use at a later point in the code).  When this is the
case, there is a popular shorter form that closes
the file automatically using the `with` statement.  The `with`
statement executes a block of code within a context, and cleans up
when the block ends.  For files, we can put the call
to `open` in the `with` statement, so anything executed within that block
of code is executed with the file open, and the file is automatically closed
at the end, even if an error occurs.  This sort of block looks like:

```
with open(filename, mode) as f:
    # code to process file
```

Inside the with block, we would still put the desired calls to read/write
as appropriate, but we no longer need to worry about closing the file.
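
As a concrete sketch, the following writes and then reads a small hypothetical file (in a temporary folder) using `with` for both steps.  No call to `close()` appears anywhere.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "with-example.txt")  # hypothetical file

with open(path, "w") as f:
    f.write("a\nb\n")
# the file is closed here, so the text has been saved

with open(path) as f:
    lines = f.readlines()   # store the contents for use after the block

print(lines)
print(f.closed)   # the handle still exists, but it is closed
```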

## Example - Groundhog Day Data

The file "data/groundhog.csv" contains the historical data for Punxsutawney Phil on Groundhog Day.
Unlike the Robert Frost poem, this data is more structured -- the first 2 lines are header information
and the remaining lines contain multiple pieces of data separated by commas.

Suppose we wish to count the number of years
in which Phil saw a partial shadow.  At a high level, this involves:
* looping through the lines of the file
* stripping excess whitespace from each line and splitting it into a list of strings at the commas (since
  commas separate the columns)
* checking whether the line has a 2nd entry and whether that entry is "Partial Shadow";
  if so, incrementing the count of partial shadows

Let's look at what this looks like in code:

In [11]:
count = 0
with open('data/groundhog.csv', 'r') as f:
    for line in f:
        # split the line into its comma-separated columns
        line = line.strip().split(',')
        # only look at the 2nd entry if the line has one
        if len(line) > 1:
            if line[1] == 'Partial Shadow':
                count += 1

print(count)

1


## Reading CSV Files

Above we saw how to read in files by either
* reading in all of the lines as a list of strings with `readlines()`
* reading in the entire file with `read()`
* looping through the file handle itself (which loops through the lines
  of the file, each line as a string).

For formatted files (where each line is made
up of several columns), you would often use string methods
to split apart each line and extract the desired column information.
For instance, consider a file, `books-small.csv`, with lines like:
```
author,book title
Harper Lee,"To Kill a Mockingbird"
John Steinbeck,"Of Mice and Men"
```

Such a file might be processed with code like:
```
with open('books-small.csv') as f:
    for line in f:
        book_info = line.split(',')
```

This works well for many scenarios.  However, imagine
the file contained more lines, and some of the
book titles had commas in the title.
For example,
```
author,book title
John Steinbeck,"Of Mice and Men"
Harper Lee,"To Kill a Mockingbird"
Iris Murdoch,"The Sea, the Sea"
James Kelman,"How Late it Was, How Late"
Philip Hoare,"Leviathan or, the Whale"
```
While we could come up with a solution for
this file, if there were multiple columns that
may have commas in them, it quickly becomes
a more tedious task.
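
To see the problem concretely, the sketch below applies `split(',')` to one of the lines above, written as a string so that no file is needed:

```python
line = 'Iris Murdoch,"The Sea, the Sea"\n'

pieces = line.strip().split(',')
print(pieces)
# ['Iris Murdoch', '"The Sea', ' the Sea"']
```

The line has two columns, but splitting on every comma produces three pieces and leaves the quotation marks in place.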

Since CSV (comma-separated values) files are so
common, Python includes a library designed solely to
help read from and write to CSV files, called `csv`.
To use this library for reading CSV files, we first
create a csv reader object with
```
reader = csv.reader(filehandle, delimiter=',')
```
Then, instead of looping through
the file handle, we loop through `reader`.
Whereas with the standard way of handling
files we would need to split up the line
ourselves, when we loop through `reader`
the lines will already be split into
a list of strings, where each entry is one of the
columns.  This will intelligently handle commas
that occur inside a column (for well formatted
csv files).
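
As a small sketch of this behavior, the example below runs `csv.reader` on text held in memory.  It uses `io.StringIO` from the standard library as a stand-in for an open file handle, which is an assumption made only to keep the example self-contained.

```python
import csv
import io

text = 'author,book title\nIris Murdoch,"The Sea, the Sea"\n'
filehandle = io.StringIO(text)   # behaves like a file opened for reading

reader = csv.reader(filehandle, delimiter=',')
rows = list(reader)              # each row is already a list of columns

print(rows)
# [['author', 'book title'], ['Iris Murdoch', 'The Sea, the Sea']]
```

The comma inside the quoted title is kept as part of the column, and the surrounding quotation marks are removed.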

Let's take a look at an example.  The file
`books.csv` in the `data` folder
contains the books as listed above.
We first start by importing the library:

In [None]:
import csv

Then, we open the file and create our
csv reader.
We could either call `open` and `close` ourselves or use the
`with` statement.  In this example, we'll
just print out the book titles, but you could
do any additional processing desired for each line.

In [None]:
with open('data/books.csv') as f:
    reader = csv.reader(f, delimiter=',')
    firstline = True
    for line in reader:
        if firstline:
            firstline = False
        else:
            print(line[1])
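
An alternative way to skip the header, sketched below, is to call the built-in `next()` on the reader once before looping.  The sketch first writes a small hypothetical `books.csv` to a temporary folder so it can run on its own, and it passes `newline=''` to `open`, as the `csv` documentation recommends for files used with this library.

```python
import csv
import os
import tempfile

# Hypothetical copy of the books data in a throwaway folder.
path = os.path.join(tempfile.mkdtemp(), "books.csv")
with open(path, "w", newline="") as f:
    f.write('author,book title\n')
    f.write('Harper Lee,"To Kill a Mockingbird"\n')
    f.write('Iris Murdoch,"The Sea, the Sea"\n')

titles = []
with open(path, newline="") as f:
    reader = csv.reader(f, delimiter=',')
    header = next(reader)        # read (and skip) the first row
    for line in reader:
        titles.append(line[1])   # the 2nd column is the book title

print(header)
print(titles)
```

This avoids the `firstline` flag, at the cost of assuming the file has at least one row.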