# Data Has to Go Somewhere

## The open and close commands

On your personal computer and mine, the operating system stores data in files.  In many data analysis applications, your first step will be to read data from a file.  Likewise, you will often need to write data or results to an output file.  In this section, we will show you how to read and write files using Python.

Let's first look at the fundamental command that will enable you to access a file: open.

`fileHandle = open(path_to_file, mode)`

The open function returns a Python file object; in this case we assign that object to the variable `fileHandle`.  This is the variable we will use to read or manipulate the contents of the file.

The `path_to_file` variable specifies where the file is located.  It can be a fully qualified path and file name, or a path relative to the current working directory, and it tells Python where to look for the file on your computer.
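When a path is not fully qualified, Python resolves it against the current working directory. The short sketch below shows where a bare file name such as `shakespeare.txt` would end up; it only computes the path and does not create or open anything.

```python
import os

relative_name = 'shakespeare.txt'   # no directory part, so this is a relative path

# abspath() combines the name with the current working directory.
resolved_path = os.path.abspath(relative_name)

print(os.getcwd())
print(resolved_path)
```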

The mode parameter specifies what kind of operations we want to perform on the file and what kind of file it is.

The first letter indicates the operation:
- r means read
- w means write, and overwrite the file if it exists
- x means write, but only if file does not already exist
- a means write (append): write at the end of the file if it exists

The second letter of mode indicates the file type:
- t (or nothing) means text
- b means binary
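As a rough illustration of how the mode letters combine, the sketch below works in a temporary directory (the file name `demo.txt` is made up for this example). It shows that `x` refuses to touch an existing file, while `w` silently replaces its contents.

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'demo.txt')  # hypothetical file name

# 'xt': create a new text file; this fails if the file already exists.
handle = open(demo_path, 'xt')
handle.write('first version')
handle.close()

# A second 'x' open raises FileExistsError because the file is now there.
try:
    open(demo_path, 'xt')
    exclusive_create_failed = False
except FileExistsError:
    exclusive_create_failed = True

# 'wt' truncates the existing file and starts over.
handle = open(demo_path, 'wt')
handle.write('second version')
handle.close()

# 'rt' reads the text back.
handle = open(demo_path, 'rt')
demo_text = handle.read()
handle.close()

print(exclusive_create_failed)
print(demo_text)
```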

After opening a file and manipulating it in some way, you should remember to close the file as well. Closing the file flushes any buffered writes to disk and ensures that Python does not hold on to file handles it no longer needs, which frees the associated system resources.
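One common way to make sure a file always gets closed is Python's `with` statement, which closes the file automatically when the block ends, even if an error occurs inside it. The examples in this section call close() explicitly instead; the following is only a minimal sketch of the alternative, using a temporary file whose name is made up for illustration.

```python
import os
import tempfile

note_path = os.path.join(tempfile.mkdtemp(), 'note.txt')  # hypothetical file

# The file is closed automatically when the with block exits.
with open(note_path, 'wt') as note_file:
    note_file.write('closed for us')

# The handle still exists as a variable, but it is no longer open.
print(note_file.closed)
```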

## Writing a text file using write()

Let's write a file using Python. First we'll need some content:

In [1]:
file_content = '''To be or not to be
That is a question.'''

We want to create a new file using the contents in our multi-line string.  To do that, we'll follow three steps: open the file, write to it, then close it.

In [2]:
poem_file = open('shakespeare.txt', 'wt')
poem_file.write(file_content)
poem_file.close()

After you execute that code, you should see a file named "shakespeare.txt" in the folder containing this IPython notebook. Let's take a look:

In [3]:
!ls -al

total 576
drwxr-xr-x+ 17 kashaolu  staff     578 Oct 28 09:53 .
drwxr-xr-x+ 19 kashaolu  staff     646 Oct 20 13:28 ..
-rw-r--r--@  1 kashaolu  staff    6148 Oct 28 07:43 .DS_Store
drwxr-xr-x+  4 Personal  staff     136 Oct 28 09:31 .ipynb_checkpoints
-rw-r--r--+  1 kashaolu  staff  100184 Oct 20 13:28 8.1 - Encoding Text.pptx
-rw-r--r--+  1 kashaolu  staff    8825 Sep 19 11:26 8.2 - Unicode Strings.ipynb
-rw-r--r--+  1 kashaolu  staff   15496 Sep 19 11:26 8.3 - Encoding.ipynb
-rw-r--r--+  1 kashaolu  staff   16894 Sep 19 11:26 8.4 - Formatting.ipynb
-rw-r--r--+  1 kashaolu  staff   23439 Sep 19 11:26 8.5 - Regular Expressions.ipynb
-rw-r--r--+  1 kashaolu  staff   11738 Sep 19 11:26 8.6 - Binary Data.ipynb
-rw-r--r--+  1 kashaolu  staff   19483 Oct 28 09:14 8.7 - File Input and Output.ipynb
-rw-r--r--+  1 kashaolu  staff    4887 Oct 28 09:53 8.8 - Structured Text Files.ipynb
-rw-r--r--+  1 kashaolu  staff      72 Sep 19 11:26 Week 8 Assign

We can inspect the contents using the cat command:

In [4]:
!cat shakespeare.txt

To be or not to be
That is a question.

We programmatically created the file shakespeare.txt with the given content.  If the file already existed, we overwrote it with our new content.

Next, let's append more text to the file. We will need to open it using the "a" mode to append:

In [5]:
poem_file = open('shakespeare.txt', 'at')
poem_file.write("\n--Written By Shakespeare")
poem_file.close()

Let's see what the file looks like now.

In [6]:
!cat shakespeare.txt

To be or not to be
That is a question.
--Written By Shakespeare

Notice the newline character "\n" in the string we passed to the write method.  The write() function writes all characters verbatim, so we have to explicitly include a newline if we want to move to the next line.
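To make the verbatim behavior concrete, here is a small sketch that calls write() several times in a row. It uses a temporary file with a made-up name. Consecutive writes run together unless a "\n" is included, and write() returns the number of characters it wrote.

```python
import os
import tempfile

lines_path = os.path.join(tempfile.mkdtemp(), 'lines.txt')  # hypothetical file

out_file = open(lines_path, 'wt')
out_file.write('one')
out_file.write('two\n')                   # the explicit newline ends the first line
chars_written = out_file.write('three')   # write() returns the character count
out_file.close()

in_file = open(lines_path, 'rt')
written_text = in_file.read()
in_file.close()

print(repr(written_text))
print(chars_written)
```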

## Read a text file with read(), readline(), or readlines()

Reading files is easy in Python, but there are a few important things to keep in mind:

First, the read() function with no arguments will load the entire file into memory.  This is acceptable for small files, but files can get quite large.  If a file is large enough, it can cause your application to run out of memory. Still, if you know you have a small file, this is an easy way to read its contents:

In [7]:
poem_file_read = open('shakespeare.txt', 'rt')
poem_read = poem_file_read.read()
poem_file_read.close()
print(poem_read)

To be or not to be
That is a question.
--Written By Shakespeare


You can also use the readlines() function, which likewise loads the whole file into memory but conveniently returns a list in which each item is a line of the file.

In [8]:
poem_file_read = open('shakespeare.txt', 'rt')
poem_read_array = poem_file_read.readlines()
poem_file_read.close()
print(poem_read_array)

['To be or not to be\n', 'That is a question.\n', '--Written By Shakespeare']


In [9]:
print(poem_read_array[0])

To be or not to be



Notice that the contents of the file are now in a list. You can now process the file line by line like this:

In [10]:
for line in poem_read_array:
    print(line)

To be or not to be

That is a question.

--Written By Shakespeare


We have a for loop that lets us access each line of the file one by one and process it.  This is a very common procedure when working with text. Even though our file is divided into separate lines, however, our code still loads the entire file into memory when we call readlines().

The readline() function saves us from doing this. This method reads a single line from the file and then stops.  If we call it again later, it will continue by returning the next line in the file.

In [11]:
poem_file_read = open('shakespeare.txt', 'rt')

while True:
    line = poem_file_read.readline()
    if not line:
        break
        
    print(line)

poem_file_read.close()

To be or not to be

That is a question.

--Written By Shakespeare


Notice that we use an infinite loop to read lines one at a time using the readline() function.  When we reach the end of the file, readline() will return an empty string.  When this happens, we break out of the loop.

The advantage of this technique is that we read our file one line at a time, and do not have to worry about memory issues (unless we encounter an unusually long line).

There is an even easier way to read files. The file handle itself is an iterator so you can pass it directly to a for statement:

In [12]:
poem_file_read = open('shakespeare.txt', 'rt')

for line in poem_file_read:
    print(line)
    
poem_file_read.close()

To be or not to be

That is a question.

--Written By Shakespeare
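The blank lines between the printed lines in the outputs above appear because each line read from the file keeps its trailing "\n", and print() then adds a newline of its own. A minimal sketch of one way to avoid this is to strip the newline from each line as it is read. The file below is a temporary copy of the poem with a made-up name, so the example does not depend on shakespeare.txt.

```python
import os
import tempfile

poem_path = os.path.join(tempfile.mkdtemp(), 'poem_copy.txt')  # hypothetical file

poem_out = open(poem_path, 'wt')
poem_out.write('To be or not to be\nThat is a question.\n--Written By Shakespeare')
poem_out.close()

clean_lines = []
poem_in = open(poem_path, 'rt')
for line in poem_in:
    # Each line still ends with "\n" (except possibly the last one), so remove it.
    clean_lines.append(line.rstrip('\n'))
poem_in.close()

for line in clean_lines:
    print(line)
```

Passing `end=''` to print() would be another option if you prefer to leave the lines untouched.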


## Writing a binary file using write()

If we want to write to a binary file instead of a text file, we must open it with a "b" as the second character in the `mode` parameter.

In [13]:
binary_data = bytes(range(0,255)) # Generating some arbitrary binary data

binary_file_write = open('binary', 'wb')
binary_file_write.write(binary_data)
binary_file_write.close()

## Reading a binary file 

As with text files, you can create a file handle that has read access to the binary file. Since there are typically no newline characters in binary data, you should use the read() function to read binary files.  To avoid reading the entire file into memory at once, you can specify how many bytes you want as an argument to read().  If you don't supply a number of bytes, this method will read all the way to the end of the file.

In [6]:
binary_file_read = open('binary', 'rb')
print("First 5 bytes of the file:", binary_file_read.read(5))
print("Next 5 bytes of the file:", binary_file_read.read(5))
binary_file_read.close()

First 5 bytes of the file: b'\x00\x01\x02\x03\x04'
Next 5 bytes of the file: b'\x05\x06\x07\x08\t'
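Repeating read() with a fixed size gives a simple way to walk through a binary file of any length without loading it all at once. The sketch below assumes a chunk size of 100 bytes, chosen arbitrarily, and writes the same 255 bytes as the example above to a temporary file with a made-up name.

```python
import os
import tempfile

blob_path = os.path.join(tempfile.mkdtemp(), 'blob.bin')  # hypothetical file

blob_out = open(blob_path, 'wb')
blob_out.write(bytes(range(0, 255)))  # 255 bytes with the values 0 through 254
blob_out.close()

chunk_size = 100  # arbitrary size chosen for this illustration
chunk_lengths = []

blob_in = open(blob_path, 'rb')
while True:
    chunk = blob_in.read(chunk_size)
    if not chunk:        # read() returns b'' at the end of the file
        break
    chunk_lengths.append(len(chunk))
blob_in.close()

print(chunk_lengths)  # [100, 100, 55]
```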


## Find position with tell(), change position with seek()

As you read and write, Python keeps track of where you are in the file. There is a set of functions that allow you to find out and modify your current position in that file.

- The `tell()` function returns your location from the beginning of the file in bytes.
- The `seek()` function allows you to jump to another location in the file.

For example, let's look at our binary file that we created:

In [8]:
binary_file_read = open('binary', 'rb')

Let's find out our current position in the file:

In [9]:
binary_file_read.tell()

0

Now let's jump to offset 65, which holds the byte value for a capital "A" (ASCII code 65), and print just that byte.

In [10]:
binary_file_read.seek(65)
print(binary_file_read.read(1))

b'A'


You can also seek from your current position. Let's say I want to move five bytes from my current position. I would pass a 1 as a second argument to the seek function.  A value of 0 measures from the beginning of the file, 1 uses the current file position, and 2 uses the end of the file as the reference point.

In [11]:
binary_file_read.seek(65)    # Go back to offset 65
binary_file_read.seek(5, 1)  # Move five bytes forward from the current position, to offset 70
print(binary_file_read.read(5))
binary_file_read.close()     # We are done with the file, so close it

b'FGHIJ'
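The reference-point values 0, 1, and 2 are also available as the named constants `os.SEEK_SET`, `os.SEEK_CUR`, and `os.SEEK_END`. As a sketch of seeking relative to the end of the file, the example below writes the same 255 bytes used earlier to a temporary file with a made-up name, then reads only the last three bytes.

```python
import os
import tempfile

seek_path = os.path.join(tempfile.mkdtemp(), 'seek_demo.bin')  # hypothetical file

seek_out = open(seek_path, 'wb')
seek_out.write(bytes(range(0, 255)))  # 255 bytes with the values 0 through 254
seek_out.close()

seek_in = open(seek_path, 'rb')
seek_in.seek(-3, os.SEEK_END)      # same as seek(-3, 2): three bytes before the end
position_before = seek_in.tell()
last_bytes = seek_in.read()        # read from here to the end of the file
position_after = seek_in.tell()
seek_in.close()

print(position_before, last_bytes, position_after)  # 252 b'\xfc\xfd\xfe' 255
```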


Please note: the seek() and tell() functions work best with binary files, because you move back and forth in units of bytes. These functions also work with text files, but in a variable-width encoding one character can occupy more bytes than another, which can lead to unexpected results.
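The gap between characters and bytes is easy to see with a single accented letter. This sketch assumes UTF-8, requested through the `encoding` argument of open(), which is not used elsewhere in this section, and a temporary file with a made-up name. The word has four characters but occupies five bytes on disk, so a byte offset does not correspond directly to a character position.

```python
import os
import tempfile

text_path = os.path.join(tempfile.mkdtemp(), 'accent.txt')  # hypothetical file

text_out = open(text_path, 'wt', encoding='utf-8')
text_out.write('caf\u00e9')   # "café"; the final letter needs two bytes in UTF-8
text_out.close()

text_in = open(text_path, 'rt', encoding='utf-8')
word = text_in.read()
text_in.close()

char_count = len(word)                    # number of characters in the string
byte_count = os.path.getsize(text_path)   # number of bytes in the file
print(char_count, byte_count)             # 4 5
```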