# Filesystem IO in Python

Often, the text we want to work with comes in raw `.txt` files. Take, for example, the [Cornell movie review dataset](http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz).

We can download the above archive, extract it and read a single review very easily.

In [None]:
with open("Datasets/review_polarity/txt_sentoken/neg/cv001_19502.txt") as f:
    text = f.read()
    
print(text)

Reading a whole file in a single operation may not be practical if the file is large or you wish to operate on it iteratively (i.e. line by line). A Python file object is itself a lazy iterator: looping over it directly yields one line at a time, so you can work on a per-line basis without reading the whole file into memory at once. Note that the `readlines()` method does not behave this way; it reads every line into a list before returning.

In [None]:
with open("Datasets/review_polarity/txt_sentoken/neg/cv001_19502.txt") as f:
    # iterating over the file object reads one line at a time
    for line in f:
        # each line keeps its trailing newline, so stop print adding another
        print(line, end="")
    

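As a rough illustration of the difference, the sketch below writes a small temporary file (the file and its contents are made up for this example) and reads it back both ways. Iterating over the file object hands back one line per loop step, whereas `readlines()` returns a complete list of lines.

```python
import os
import tempfile

# Create a small throwaway file to read back (hypothetical contents).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    tmp_path = tmp.name

# Lazy: the file object yields one line per loop iteration.
lazy_count = 0
with open(tmp_path) as f:
    for line in f:
        lazy_count += 1

# Eager: readlines() builds a list holding every line at once.
with open(tmp_path) as f:
    all_lines = f.readlines()

os.remove(tmp_path)  # clean up the throwaway file
print(lazy_count, len(all_lines))
```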
This is great for reading one or two files at a time. However, what if you have a whole directory or subdirectory structure to process? You don't want to be typing in all the file names.

That's where the `os` module, and specifically the `os.walk` function, can help. `os` provides interfaces for interacting with the filesystem and with processes running on the machine. `os.walk` allows us to recursively navigate a directory tree and inspect all files within that structure, yielding a `(root, dirs, files)` tuple for every directory it visits. All we need is to specify where to start.

In [None]:
import os

for root, dirs, files in os.walk("Datasets/review_polarity/"):
    # this outer loop visits each directory in the tree in turn
    # and sets 'root' to the directory currently being navigated.
    print("\n--> Current root: ", root)

    # 'dirs' lists the subdirectories that sit directly inside the current root
    for directory in dirs:
        print("DIR  ", directory)

    # 'files' lists the files that sit directly inside the current root
    for file in files:
        print("FILE ", file)
        

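If the dataset is not to hand, the same traversal can be tried on a throwaway tree. The following is a minimal sketch that assumes nothing about the real data: the directory and file names (`pos`, `neg`, `a.txt` and so on) are invented purely for illustration. It records what `os.walk` reports for each directory it visits.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    # Build a tiny tree: base/readme.md, base/pos/a.txt, base/neg/b.txt
    with open(os.path.join(base, "readme.md"), "w") as f:
        f.write("example\n")
    for sub, name in [("pos", "a.txt"), ("neg", "b.txt")]:
        os.makedirs(os.path.join(base, sub))
        with open(os.path.join(base, sub, name), "w") as f:
            f.write("example\n")

    # Each iteration describes one directory: its path, its subdirectories, its files.
    visited = {}
    for root, dirs, files in os.walk(base):
        rel = os.path.relpath(root, base)  # "." for the starting directory
        visited[rel] = (sorted(dirs), sorted(files))

print(visited)
```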
What if we are only interested in a specific set of files or directories? We can filter the filenames as we process them and match them with rules. Let's assume that we only want text files.

In [None]:
for root, dirs, files in os.walk("Datasets/review_polarity/"):    
    # although we have root and dirs variables, doing anything with them is optional.
    # we're just interested in files, specifically .txt files
    
    for file in files:
        if file.endswith(".txt"):
            print(file)

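Suffix tests are only one kind of rule. As a hedged sketch of a few alternatives, the example below filters a hand-written list of names (standing in for the `files` list that `os.walk` provides; apart from the two review filenames mentioned in this notebook, the names are hypothetical) using `str.endswith` and the standard library's `fnmatch` module.

```python
import fnmatch

# Hypothetical filenames, standing in for the `files` list from os.walk.
files = ["cv001_19502.txt", "README", "notes.md", "cv723_8648.txt", "cv9.TXT"]

# Rule 1: a simple suffix test, as in the cell above.
txt_files = [name for name in files if name.endswith(".txt")]

# Rule 2: a shell-style pattern; fnmatchcase is case-sensitive on every platform.
cv_files = [name for name in files if fnmatch.fnmatchcase(name, "cv*_*.txt")]

# Rule 3: endswith also accepts a tuple of suffixes; lower() ignores case.
text_like = [name for name in files if name.lower().endswith((".txt", ".md"))]

print(txt_files, cv_files, text_like)
```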
That's great. Now how do we get access to the data in these files? We know only the bare filename, not the full path (i.e. we know `cv723_8648.txt`, but not that it is actually `Datasets/review_polarity/txt_sentoken/neg/cv723_8648.txt`).

That's where `os.path.join` comes into play. It joins path components in an OS-independent way and takes care of where to put the separator characters for you.

We know that the current value of `root` holds the path of the directory containing the file, and that `file` holds the filename, so we can do this:

In [None]:
for root, dirs, files in os.walk("Datasets/review_polarity/"):    
    # although we have root and dirs variables, doing anything with them is optional.
    # we're just interested in files, specifically .txt files
    
    for file in files:
        if file.endswith(".txt"):
            print("\nRoot: ", root)
            print(os.path.join(root,file))

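To see what `os.path.join` does in isolation, here is a small sketch that needs no files on disk. It reuses the example path from the text above; the `fake.txt` name is only there to show how a trailing slash is handled.

```python
import os

root = os.path.join("Datasets", "review_polarity", "txt_sentoken", "neg")
full_path = os.path.join(root, "cv723_8648.txt")

# The separator used is os.sep: "/" on Linux and macOS, "\\" on Windows.
print(full_path)

# A trailing separator on the first part is not doubled.
joined = os.path.join("Datasets/review_polarity/", "fake.txt")
print(joined)

# The pieces can be recovered again with dirname and basename.
print(os.path.dirname(full_path) == root, os.path.basename(full_path))
```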
Now we can read in all of the lines in the files. Let's find out how many lines there are in total across all of the reviews.

In [None]:
import os
linecount = 0
for root, dirs, files in os.walk("Datasets/review_polarity/"):    
    # although we have root and dirs variables, doing anything with them is optional.
    # we're just interested in files, specifically .txt files
    
    for file in files:
        if file.endswith(".txt"):
            with open(os.path.join(root,file)) as f:
                for line in f:
                    linecount += 1
                    
print("Total lines in all txt files", linecount)

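The same count can also be written with the standard library's `pathlib` module, whose `Path.rglob` method combines the recursive walk and the filename filter. This is offered as one possible alternative rather than a replacement for the cell above; `count_lines` is a hypothetical helper name, and the demonstration runs on a throwaway tree rather than the real dataset.

```python
import tempfile
from pathlib import Path


def count_lines(start_dir, pattern="*.txt"):
    """Count lines in every file under start_dir whose name matches pattern."""
    total = 0
    for path in Path(start_dir).rglob(pattern):
        if path.is_file():
            with path.open() as f:
                total += sum(1 for _ in f)  # iterate lazily, line by line
    return total


# Demonstration on a throwaway tree with made-up files.
with tempfile.TemporaryDirectory() as base:
    (Path(base) / "neg").mkdir()
    (Path(base) / "neg" / "a.txt").write_text("one\ntwo\n")
    (Path(base) / "b.txt").write_text("three\n")
    (Path(base) / "ignored.md").write_text("not counted\n")
    demo_total = count_lines(base)

print("Total lines in all txt files:", demo_total)
```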
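### Writing a file

Writing works much like reading: pass the mode `'w'` to `open` and call `write` on the resulting file object.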

In [None]:
with open('Datasets/review_polarity/fake.txt', 'w') as f:  # 'w' creates the file, or overwrites it if it already exists
    f.write('Fake review!')

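The mode passed to `open` decides what happens to any existing content. As a brief sketch (using a temporary directory so that nothing in the dataset folder is touched; the review text is invented), the example below contrasts `"w"`, which truncates, with `"a"`, which appends. Note that `write` does not add a newline for you.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    path = os.path.join(base, "fake.txt")

    with open(path, "w") as f:   # "w" creates the file or truncates an existing one
        f.write("Fake review!\n")

    with open(path, "w") as f:   # opening with "w" again discards the earlier text
        f.write("Another fake review!\n")

    with open(path, "a") as f:   # "a" appends to whatever is already there
        f.write("One more line.\n")

    with open(path) as f:
        contents = f.read()

print(contents)
```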
## Conclusion

We are now able to process directory trees containing multiple text files and filter out any files that do not contain relevant information. We can read a file in one chunk or incrementally, line by line, and write text back to disk.