## Working with plain text files

Plain text files, usually saved with a .txt extension, are one of the most common formats for storing text. There are several different ways to read one of these files in Python. One of the most basic uses a with statement.

In [2]:
filename = 'corpus/1915_the_voyage_out.txt'
with open(filename, 'r') as file_in:
    text = file_in.read()
print(text[0:100])

FileNotFoundError: [Errno 2] No such file or directory: 'corpus/1915_the_voyage_out.txt'

Notice that we open the file and assign it a new, temporary name for the duration of the statement. This ensures that the file is opened, dealt with, and then closed safely. Once we un-indent, the file has been closed, and if we tried to read from the same file object again we would get a ValueError for trying to work with a closed file. The 'as file_in' bit assigns the open file to a variable so as to help us organize what is happening (we might have another file that we are writing to). The text variable now holds one long string, which is fine in certain cases, but we could also read the contents in line by line. Here is a variation on the same approach:

In [3]:
filename = 'corpus/1915_the_voyage_out.txt'
with open(filename, 'r') as file_in:
    text = file_in.readlines()
print(text[0:10])

FileNotFoundError: [Errno 2] No such file or directory: 'corpus/1915_the_voyage_out.txt'
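
Before looking at readlines() more closely, the closing behavior of the with statement can be checked directly. The following is a minimal sketch rather than part of the original walkthrough: it writes a small throwaway file with the tempfile module (an assumption made so that the example runs without the corpus folder), reads it inside a with block, and then tries to read from the same file object after the block has ended.

```python
import os
import tempfile

# Write a small throwaway file so the example does not depend on the corpus folder.
# The sample sentence is a placeholder, not a line from the corpus.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write('A short line of sample text.\n')
    sample_path = tmp.name

with open(sample_path, 'r', encoding='utf-8') as file_in:
    text = file_in.read()

# The with block has ended, so the file object is already closed.
print(file_in.closed)

try:
    file_in.read()
except ValueError as error:
    # Reading from a closed file raises a ValueError.
    print('ValueError:', error)

# Clean up the throwaway file.
os.remove(sample_path)
```

The contents read inside the block are still available in the text variable afterwards; only the file object itself is no longer usable.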

The "readlines()" method allows us to take an open file and read it line by line, returning a list of the lines. We assign that list to the text variable here, which we can now use to examine particular parts of the text. Note that here the line breaks do not correspond to sentences. Dividing longer chunks of text into sentences is a separate technique entirely, one called segmentation, that we'll get into later. For now, though, note that the steps required to process your data depend heavily on how the file was formatted. In some cases, line breaks can be quite useful, say, when working with poetry where the line breaks are especially meaningful:

In [None]:
filename = 'corpus/sonnet_one.txt'
with open(filename, 'r') as file_in:
    poetry = file_in.readlines()
print(poetry[:12])
print('=====')
print(poetry[12:])

Calling readlines() on a piece of poetry gives us access to the whole poem as a list of lines, so we can manipulate it to chunk the poem into pieces that we care about. Above, I separated the poem into two pieces at the volta, the turn in the sonnet that occurs before the couplet.

Those '\n' characters might appear to be a mistake at first, but worry not! They are actually the computer's representation of a newline character, a way of knowing when a line break happens. Before we analyze these lines, we would want to strip those characters out. One way would be to search each string and remove the character:

In [None]:
cleaned_poem = []
for line in poetry:
    clean_line = line.replace('\n', '')
    cleaned_poem.append(clean_line)

print(cleaned_poem)
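
The loop above is one option among several. As a hedged alternative sketch, the same cleanup can be written with rstrip() in a list comprehension, or by calling splitlines() on the whole text. The short poetry list below is a hypothetical stand-in for what readlines() returns, not the contents of the sonnet file.

```python
# A stand-in for the list that readlines() returns (placeholder lines, not the corpus file).
poetry = ['first line\n', 'second line\n', 'last line']

# rstrip('\n') removes newline characters only from the end of each line.
cleaned_poem = [line.rstrip('\n') for line in poetry]
print(cleaned_poem)

# splitlines() on the whole text splits at line boundaries and drops them in one step.
whole_text = ''.join(poetry)
print(whole_text.splitlines())
```

Unlike replace(), rstrip('\n') only touches the end of the string, which makes no difference for lines produced by readlines() but is a slightly narrower operation.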

This points to an important underlying problem in natural language processing: these texts are not formatted in such a way that they are computer ready right away. That's what puts the natural in natural language! In the case of prose, you can expect a few different categories:

1. The text is one continuous string with no line breaks.
2. The text has line breaks that correspond to the ends of the lines as laid out on a page.
3. The text has line breaks that correspond to meaningful categories.
4. Some combination of 2 and 3 (most likely).

In most cases, as when working with prose, the line breaks will be used to shape the legibility of a text. That is, they are meant to assist with the typographical layout, but they have no underlying interpretive meaning. This means that if you wish to preserve the underlying structure of a text you will need to parse the text in a more sophisticated way than just reading it in either as a lump or line by line.
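
As one illustration of what more sophisticated parsing might look like, the sketch below rebuilds paragraphs from hard-wrapped prose. It rests on an assumption that will not hold for every file: that a blank line marks a paragraph boundary while single line breaks are only layout. The sample text is invented for the example.

```python
import re

# Hypothetical sample: hard-wrapped prose with a blank line between paragraphs.
raw_text = (
    "This is the first paragraph, and it\n"
    "has been wrapped across two lines.\n"
    "\n"
    "This is the second paragraph.\n"
)

# Split wherever a blank line occurs (the assumed paragraph boundary).
blocks = re.split(r'\n\s*\n', raw_text.strip())

# Within each paragraph, collapse runs of whitespace, including the
# layout-only line breaks, into single spaces.
paragraphs = [' '.join(block.split()) for block in blocks]

for paragraph in paragraphs:
    print(paragraph)
```

If a file uses some other convention, such as indentation at the start of each paragraph, the splitting rule would need to change accordingly.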

Of course, nothing will be as reliable as separating things individually by hand. In the case of a book of poetry, you might, for example, separate each poem into its own text file. The sonnets folder has a set of five Shakespearean sonnets in it. Combining what we've learned already, we can read the filenames from the folder, read them each in, and then store them in a variable for manipulation.

In [None]:
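import glob

# glob returns paths in arbitrary order, so sort them for a predictable sequence
filenames = sorted(glob.glob('corpus/sonnets/*.txt'))
sonnets = []
for filename in filenames:
    with open(filename, 'r') as file_in:
        # each sonnet is stored as its own list of lines
        sonnets.append(file_in.readlines())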

We now have a list called sonnets, but it's more properly understood as a list of lists, or a list in which each item is itself a list of more items:

* List level one: sonnet level.
* List level two (sub-list): line level.

And we can manipulate this hierarchy to access different elements of the list. The first cell below gives us the first sonnet, and the second gives us the first five lines of the third sonnet.

In [None]:
print(sonnets[0])

In [None]:
print(sonnets[2][0:5])
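
The two levels of indexing can also be tried without the corpus files. The sketch below uses a small hypothetical stand-in for the sonnets list (two invented "poems" of placeholder lines) to show how the first index selects a poem, the second selects a line, and how nested loops visit every line in turn.

```python
# Hypothetical stand-in for the sonnets list: two short "poems", each a list of lines.
sonnets = [
    ['poem one, line one\n', 'poem one, line two\n'],
    ['poem two, line one\n', 'poem two, line two\n', 'poem two, line three\n'],
]

print(len(sonnets))     # number of poems
print(len(sonnets[1]))  # number of lines in the second poem
print(sonnets[1][0])    # first line of the second poem

# Loop over both levels to visit every line along with its position.
for sonnet_number, sonnet in enumerate(sonnets, start=1):
    for line_number, line in enumerate(sonnet, start=1):
        print(sonnet_number, line_number, line.rstrip('\n'))
```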

Organizing things manually like this gives you great control over the files you're working with and their underlying structure. When working with a large corpus you might not have such an option. Separating files by hand is feasible when working with a few texts, but when working with thousands of documents you either have to work with the files as you received them or develop some computational way of recovering the structure of your text. In these cases, you might rely on textual markers to pinpoint sections of a text.

One example is to pinpoint sections by their chapter markers.
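
A minimal sketch of splitting on chapter markers follows. It assumes, purely for illustration, that each chapter begins with a line consisting of the word "Chapter" and a Roman numeral; real files mark chapters in many different ways, so the pattern would need to be adapted. The sample text and the chapter_pattern name are invented for the example.

```python
import re

# Hypothetical sample text; real files may mark chapters differently.
book = (
    "Chapter I\n"
    "Opening text of the first chapter.\n"
    "Chapter II\n"
    "Opening text of the second chapter.\n"
)

# Assumed marker: a line consisting only of "Chapter" plus a Roman numeral.
chapter_pattern = re.compile(r'^Chapter [IVXLC]+$', flags=re.MULTILINE)

# split() returns the text before the first marker, then the text after each marker.
pieces = chapter_pattern.split(book)

# Drop empty pieces and surrounding whitespace.
chapters = [piece.strip() for piece in pieces if piece.strip()]

for number, chapter in enumerate(chapters, start=1):
    print(number, chapter)
```

The result is a list with one string per chapter, which can then be treated much like the list of sonnets above.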
