## Working with a folder of files

Many common tutorials tell you how to work with a single file, but a new student might quickly want to scale up beyond a single text. In this folder we have a folder 'corpus' with three texts by Virginia Woolf in it. If you're comfortable with wildcard patterns (which look a little like regular expressions), you could use the glob library.

In [1]:
import glob

folder_path = 'corpus/'
extension = '*.txt'
print(folder_path)
print(glob.glob(folder_path + extension))

corpus/
['corpus/1922_jacobs_room.txt', 'corpus/sonnet_one.txt', 'corpus/1915_the_voyage_out.txt', 'corpus/1919_night_and_day.txt']


If you want to work with a larger array of files (or you don't want to deal with wildcard patterns), you can define your own function to return all the files in a particular folder and customize it as you would like. The os library allows you to do such things.

In [2]:
import os

def all_files(folder_name):
    """given a directory, return the filenames in it"""
    texts = []
    for (root, _, files) in os.walk(folder_name):
        for fn in files:
            path = os.path.join(root, fn)
            texts.append(path)
    return texts

# now we try it
print(all_files('corpus'))

['corpus/1922_jacobs_room.txt', 'corpus/.muck', 'corpus/sonnet_one.txt', 'corpus/1915_the_voyage_out.txt', 'corpus/1919_night_and_day.txt', 'corpus/sonnets/sonnet_two.txt', 'corpus/sonnets/sonnet_five.txt', 'corpus/sonnets/sonnet_four.txt', 'corpus/sonnets/sonnets_three.txt', 'corpus/sonnets/sonnet_one.txt']


While glob.glob() and os.walk might appear to do similar things, there is an important distinction to be made between them. os.walk will walk a directory recursively, meaning that, if the given folder contains other folders, your script will crawl through every subfolder to pull out every contained file. glob.glob() will only give you the files that match the given pattern, so with a pattern like the one above it will _not_ navigate the directory structure to find the contents of subfolders.
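
To make the difference concrete, here is a minimal sketch that builds a small throwaway folder and compares the two approaches. The folder layout and file names are made up for illustration and are not the 'corpus' folder above. It also shows that glob can be asked to descend into subfolders by using `**` in the pattern together with `recursive=True`.

```python
import glob
import os
import tempfile

# Build a throwaway folder so the example does not depend on 'corpus' existing.
# All file and folder names here are hypothetical.
base = tempfile.mkdtemp()
os.mkdir(os.path.join(base, 'sonnets'))
for rel_path in ['novel.txt', os.path.join('sonnets', 'sonnet_one.txt')]:
    with open(os.path.join(base, rel_path), 'w') as file_out:
        file_out.write('placeholder text')

# A plain pattern only matches files directly inside the folder.
top_level = glob.glob(os.path.join(base, '*.txt'))

# os.walk descends into every subfolder.
walked = []
for root, _, files in os.walk(base):
    for fn in files:
        walked.append(os.path.join(root, fn))

# With '**' and recursive=True, glob also descends into subfolders.
recursive = glob.glob(os.path.join(base, '**', '*.txt'), recursive=True)

print(len(top_level), len(walked), len(recursive))
```

The plain pattern finds one file, while os.walk and the recursive pattern both find two.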

## Dot files

But wait: what is that .muck file doing there? A file whose name begins with a dot is hidden from the graphical user interface (GUI), i.e. you won't see it when browsing the folder on your desktop. Normally this is fine, but hidden files can cause unexpected problems when you work with the file structure from within Python. In particular, .DS_Store is a hidden file that Macs create to store information about the current view (the computer has to store the fact that you want to view a window as icons somewhere!). You'll want to ignore files whose names start with a dot so that you only work with the ones you care about. In this case, we can add an if statement to the function we defined to take care of this for us.

In [13]:
import os

def all_files(folder_name):
    """given a directory, return the filenames in it"""
    texts = []
    for (root, _, files) in os.walk(folder_name):
        for fn in files:
            if fn[0] == '.': # a new addition!
                pass
            else:
                path = os.path.join(root, fn)
                texts.append(path)
    return texts

# now we try it
print(all_files('corpus'))

['corpus/iliad.tei', 'corpus/brothers_karamazov.tei', 'corpus/1922_jacobs_room.txt', 'corpus/sonnet_one.txt', 'corpus/1915_the_voyage_out.txt', 'corpus/16663-tei.tei.xml', 'corpus/1919_night_and_day.txt', 'corpus/sonnets/sonnet_two.txt', 'corpus/sonnets/sonnet_five.txt', 'corpus/sonnets/sonnet_four.txt', 'corpus/sonnets/sonnets_three.txt', 'corpus/sonnets/sonnet_one.txt']


Remember, a filename is just a string in this case. Strings in Python are sequences, meaning we can treat a filename much like a list of characters and index into it. So take the line that I've marked 'a new addition.' It checks the first character of each filename. If we've got a period in that slot, we're looking at a hidden dot file that we don't care about, so we move along to the next item in the collection. If there is no period in that first slot, we append the file's path to the list of files.
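
The first-character check can be tried out on its own, without touching the file system. The sketch below uses a hand-written list of filenames (chosen for illustration) and also shows `str.startswith()`, an alternative way to perform the same test.

```python
# A hand-written list standing in for what os.walk might hand us.
filenames = ['.DS_Store', '.muck', 'sonnet_one.txt', '1922_jacobs_room.txt']

# Indexing pulls out the first character of each string.
first_characters = [fn[0] for fn in filenames]
print(first_characters)

# str.startswith() performs the same check and also copes with an empty
# string, where fn[0] would raise an IndexError.
visible = [fn for fn in filenames if not fn.startswith('.')]
print(visible)
```

Either form works for the function above; `startswith()` is simply a little more forgiving.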

## Creating and Deleting Files

Besides reading in files, you often have to create new files for your results, analysis, or processed data. While you can, of course, make new files using the terminal or Finder, it is often preferable to make files from inside your Python script. After all, you might want, say, to name your new files dynamically based on the results of your analysis. We can do this, again, by using the `os` library in conjunction with some basic Python functionality.

Making files is something you can already do with Python's built-in `open()` function. As long as we have our data in a string we can write it into the file:

In [14]:
data_to_write = "We can write this data to a new file by storing it as a string."
with open('a_test.txt', 'w') as file_out:
    file_out.write(data_to_write)

# we can check that it worked by reading the file back in
with open('a_test.txt', 'r') as file_in:
    print(file_in.read())

We can write this data to a new file by storing it as a string.


We can manipulate the filenames to create these new files dynamically. So, for example, we might want to create a series of closely related file names based on the connections among the documents. This might be called for when chunking a larger text document into a series of pieces. So let's use some code from the "Chunking" section of this book. The code will be explained later, but for now suffice it to say that the first function will take a text and divide it into a specified number of units:

In [15]:
import math
# will be explained further in the chunking section.
def get_chunks(text, num_chunks):
    text_length = len(text)
    text_chunks = []
    number_of_chunks = num_chunks
    for i in range(number_of_chunks):
        chunk_size = text_length/number_of_chunks
        chunk_start = math.floor(chunk_size * i)
        chunk_end = math.floor(chunk_size * (i +1))
        text_chunks.append(text[chunk_start:chunk_end])
    return text_chunks

# reads in the iliad and breaks it into 100 pieces.
with open('iliad.txt', 'r') as file_in:
    text = file_in.read()
chunks = get_chunks(text, 100)

# takes the pieces of the iliad and writes each of them to a new file
# with a filename based on a counter. We collect these into the output folder
counter = 0
for chunk in chunks:
    # Note that when we use the counter in the filename we have to change it into a string.
    with open('output/iliad-' + str(counter) + '.txt', 'w') as file_out:
        file_out.write(chunk)
    counter += 1


FileNotFoundError: [Errno 2] No such file or directory: 'output/iliad-0.txt'

The os library can help us avoid overwriting files that already exist and, conversely, make sure that the folders our code needs are in place. For example, the previous code block cannot run unless an 'output' folder already exists; without one it fails with the FileNotFoundError shown above. The os library can check that the required folders exist and, if not, create them for us within the program itself. We can use os.path.isfile() or os.path.isdir() to check for a particular kind of thing, but os.path.exists() will check for both.
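
Here is a minimal sketch of how the three checks differ, run inside a temporary folder so that it does not touch your own files. The names 'output', 'notes.txt', and 'not_here' are hypothetical. The last few lines show one possible way to avoid overwriting a file that is already there.

```python
import os
import tempfile

# Work inside a temporary folder; all names below are made up.
base = tempfile.mkdtemp()
folder = os.path.join(base, 'output')
file_path = os.path.join(folder, 'notes.txt')
missing = os.path.join(base, 'not_here')

os.mkdir(folder)
with open(file_path, 'w') as file_out:
    file_out.write('some results')

# Record (isdir, isfile, exists) for a folder, a file, and a missing path.
checks = {}
for path in [folder, file_path, missing]:
    name = os.path.basename(path)
    checks[name] = (os.path.isdir(path), os.path.isfile(path), os.path.exists(path))
    print(name, checks[name])

# One way to avoid clobbering earlier results: only write if nothing is there yet.
if os.path.exists(file_path):
    print('Leaving the existing file alone.')
else:
    with open(file_path, 'w') as file_out:
        file_out.write('new results')
```

The folder passes isdir() and exists(), the file passes isfile() and exists(), and the missing path fails all three.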

In [18]:
import os
if os.path.exists('output'):
    print("No worries - output exists")
else:
    print("Need to make the output folder. Making it now.")
    os.mkdir('output')

No worries - output exists


So we might put this together with the previous example to chunk the text, check whether the output folder exists, and, if not, make the folder for us.

In [20]:
import math
import os
# will be explained further in the chunking section.
def get_chunks(text, num_chunks):
    text_length = len(text)
    text_chunks = []
    number_of_chunks = num_chunks
    for i in range(number_of_chunks):
        chunk_size = text_length/number_of_chunks
        chunk_start = math.floor(chunk_size * i)
        chunk_end = math.floor(chunk_size * (i +1))
        text_chunks.append(text[chunk_start:chunk_end])
    return text_chunks

# reads in the iliad and breaks it into 100 pieces.
with open('iliad.txt', 'r') as file_in:
    text = file_in.read()
chunks = get_chunks(text, 100)

if os.path.exists('output'):
    print("No worries - output exists")
else:
    print("Need to make the output folder. Making it now.")
    os.mkdir('output')

# takes the pieces of the iliad and writes each of them to a new file
# with a filename based on a counter. We collect these into the output folder
counter = 0
for chunk in chunks:
    # Note that when we use the counter in the filename we have to change it into a string.
    with open('output/iliad-' + str(counter) + '.txt', 'w') as file_out:
        file_out.write(chunk)
    counter += 1

Need to make the output folder. Making it now.
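
As a possible variation on the writing loop, the sketch below uses `os.makedirs()` with `exist_ok=True`, which creates the folder if it is missing and does nothing otherwise, and `enumerate()` in place of a manual counter. It also pads the counter with zeros so that the files sort in chunk order. The short list of chunks and the temporary output location are stand-ins chosen for illustration, not part of the example above.

```python
import os
import tempfile

# Stand-in data: in the example above these would come from get_chunks().
chunks = ['first piece', 'second piece', 'third piece']

# A temporary location so the sketch does not touch your real 'output' folder.
output_dir = os.path.join(tempfile.mkdtemp(), 'output')

# Creates the folder if needed; does nothing if it already exists.
os.makedirs(output_dir, exist_ok=True)

for counter, chunk in enumerate(chunks):
    # '{:03d}' pads the counter with zeros (000, 001, ...) so that the
    # files sort in the same order as the chunks.
    filename = 'iliad-{:03d}.txt'.format(counter)
    with open(os.path.join(output_dir, filename), 'w') as file_out:
        file_out.write(chunk)

print(sorted(os.listdir(output_dir)))
```

Without the padding, a plain alphabetical sort would place 'iliad-10.txt' before 'iliad-2.txt'.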


Manipulating the file structure like this is often one of the first steps in making the leap from introductory programming exercises to a full-on natural language processing project. 