## Working with a folder of files

Many tutorials show you how to work with a single file, but a new student will quickly want to scale up beyond a single text. Here we have a folder, 'corpus', with three texts by Virginia Woolf in it (plus a few sonnet files, as the output below shows). If you're comfortable with wildcard patterns, such as the '*' in '*.txt', you can use the glob library.

In [1]:
import glob

folder_path = 'corpus/'
extension = '*.txt'
print(folder_path)
print(glob.glob(folder_path + extension))

corpus/
['corpus/1922_jacobs_room.txt', 'corpus/sonnet_one.txt', 'corpus/1915_the_voyage_out.txt', 'corpus/1919_night_and_day.txt']
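
Strictly speaking, the patterns that glob accepts are shell-style wildcards ('*' for any run of characters, '?' for a single character), not full regular expressions. As a minimal sketch, the standard library's fnmatch module, which glob relies on for matching names, lets you try a pattern against plain strings. The list of names and the '19??_*.txt' pattern are made up for illustration.

```python
import fnmatch

# A made-up list of filenames; nothing is read from disk here.
names = ['1922_jacobs_room.txt', 'sonnet_one.txt', 'notes.md']

# '*' matches any run of characters, so this keeps every .txt name.
txt_only = fnmatch.filter(names, '*.txt')

# '?' matches exactly one character, so '19??_' matches a year like 1922_.
dated = fnmatch.filter(names, '19??_*.txt')

print(txt_only)
print(dated)
```

Because the matching works on strings alone, this is a cheap way to check a pattern before pointing glob at a real folder.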


If you want to work with a larger array of files (or you don't want to deal with wildcard patterns), you can define your own function to return all the files in a particular folder and customize it as you like. The os library makes this possible.

In [2]:
import os

def all_files(folder_name):
    """given a directory, return the filenames in it"""
    texts = []
    for (root, _, files) in os.walk(folder_name):
        for fn in files:
            path = os.path.join(root, fn)
            texts.append(path)
    return texts

# now we try it
print(all_files('corpus'))

['corpus/1922_jacobs_room.txt', 'corpus/.muck', 'corpus/sonnet_one.txt', 'corpus/1915_the_voyage_out.txt', 'corpus/1919_night_and_day.txt', 'corpus/sonnets/sonnet_two.txt', 'corpus/sonnets/sonnet_five.txt', 'corpus/sonnets/sonnet_four.txt', 'corpus/sonnets/sonnets_three.txt', 'corpus/sonnets/sonnet_one.txt']


While glob.glob() and os.walk() might appear to do similar things, there is an important distinction between them. os.walk() walks a directory recursively: if the given folder contains other folders, your script will crawl through every subfolder and pull out every file it contains. glob.glob(), as called above, only gives you the paths that match the given wildcard pattern in that one folder, so it will _not_ navigate the directory structure to find the contents of subfolders.
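
That said, glob can be asked to descend into subfolders if you opt in: a '**' in the pattern, combined with recursive=True, matches any number of nested directories. The sketch below is only an illustration, not part of the corpus workflow above. It builds a throwaway folder with made-up file names (the helper names are hypothetical too) so the two behaviours can be compared side by side.

```python
import glob
import os
import tempfile

def build_demo_corpus(base):
    """Create a tiny, made-up tree: one top-level file and one nested file."""
    os.makedirs(os.path.join(base, 'sonnets'))
    for rel in ('novel.txt', os.path.join('sonnets', 'sonnet.txt')):
        with open(os.path.join(base, rel), 'w') as f:
            f.write('placeholder text')

def flat_txt(base):
    """Return .txt files directly inside base only."""
    return sorted(glob.glob(os.path.join(base, '*.txt')))

def recursive_txt(base):
    """Return .txt files at any depth, via '**' and recursive=True."""
    pattern = os.path.join(base, '**', '*.txt')
    return sorted(glob.glob(pattern, recursive=True))

with tempfile.TemporaryDirectory() as demo:
    build_demo_corpus(demo)
    # Store paths relative to the temporary folder so they are easy to read.
    flat = [os.path.relpath(p, demo) for p in flat_txt(demo)]
    deep = [os.path.relpath(p, demo) for p in recursive_txt(demo)]

print(flat)  # only the top-level file
print(deep)  # the top-level file and the nested one
```

Which approach to prefer is a judgment call: the recursive glob is compact when a single pattern describes what you want, while os.walk() leaves room for custom filtering inside the loop.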

## Dot files

But wait! What is that .muck file doing there? On macOS and Linux, a file whose name begins with a dot is hidden from the graphical user interface (GUI), i.e. you won't see it on the desktop or in the file browser. Normally this is fine, but hidden files can unexpectedly cause issues when you work with the file structure from within Python. In particular, .DS_Store is a hidden file that Macs create to store information about the current view of a folder (the computer has to store the fact that you want to view a window as icons somewhere!). You'll want to ignore files whose names start with a dot so that you only work with the ones you care about. In this case, we can add an if statement to the function we defined to take care of this for us.

In [3]:
import os

def all_files(folder_name):
    """given a directory, return the filenames in it"""
    texts = []
    for (root, _, files) in os.walk(folder_name):
        for fn in files:
            if fn[0] == '.':  # a new addition!
                continue  # skip hidden dot files
            path = os.path.join(root, fn)
            texts.append(path)
    return texts

# now we try it
print(all_files('corpus'))

['corpus/1922_jacobs_room.txt', 'corpus/sonnet_one.txt', 'corpus/1915_the_voyage_out.txt', 'corpus/1919_night_and_day.txt', 'corpus/sonnets/sonnet_two.txt', 'corpus/sonnets/sonnet_five.txt', 'corpus/sonnets/sonnet_four.txt', 'corpus/sonnets/sonnets_three.txt', 'corpus/sonnets/sonnet_one.txt']


Remember, a filename is just a string in this case. Strings are sequences in Python, meaning we can index into a filename as if it were a list of characters. Take the line that I've marked 'a new addition'. It checks the first character of each filename. If there's a period in that slot, we're looking at a hidden dot file that we don't care about, so we move along to the next item in the collection. If there is no period in that first slot, we append the file's path to the list of files.
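
The same check can be tried on plain strings, without touching the file system. This is a minimal sketch: is_hidden is a hypothetical helper name, and the paths are copied from the output above purely as sample strings. It uses str.startswith() rather than indexing, which behaves the same for ordinary filenames and also copes with an empty string.

```python
import os

def is_hidden(path):
    """Return True if the last component of the path starts with a dot."""
    return os.path.basename(path).startswith('.')

# Sample strings only; nothing is read from disk.
paths = [
    'corpus/1922_jacobs_room.txt',
    'corpus/.muck',
    'corpus/sonnets/sonnet_one.txt',
]

# Indexing a string gives back a single character.
first_char = '.muck'[0]

# Keep only the paths whose filename does not start with a dot.
visible = [p for p in paths if not is_hidden(p)]

print(first_char)
print(visible)
```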