# Filesystem IO in Python

Often, the text we want to work with comes as raw `.txt` files. Take, for example, the [Cornell movie review dataset](http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz).

I have downloaded the above archive and extracted it into an `assets` folder in the same directory as the IPython notebook.

We can read in a single review very easily.

In [2]:
with open("assets/review_polarity/txt_sentoken/neg/cv001_19502.txt") as f:
    text = f.read()
    
print(text)

the happy bastard's quick movie review 
damn that y2k bug . 
it's got a head start in this movie starring jamie lee curtis and another baldwin brother ( william this time ) in a story regarding a crew of a tugboat that comes across a deserted russian tech ship that has a strangeness to it when they kick the power back on . 
little do they know the power within . . . 
going for the gore and bringing on a few action sequences here and there , virus still feels very empty , like a movie going for all flash and no substance . 
we don't know why the crew was really out in the middle of nowhere , we don't know the origin of what took over the ship ( just that a big pink flashy thing hit the mir ) , and , of course , we don't know why donald sutherland is stumbling around drunkenly throughout . 
here , it's just " hey , let's chase these people around with some robots " . 
the acting is below average , even from the likes of curtis . 
you're more likely to get a kick out of her work in hallow

Reading a whole file in a single operation may not be practical if the file is large or you wish to operate on it iteratively (i.e. line by line). Python file objects are themselves lazy iterators: looping over `f` directly yields one line at a time, so you can operate on the file on a per-line basis without reading the whole file into memory at once. Note that `readlines()` is not lazy; it reads every line into a list before returning.

In [4]:
with open("assets/review_polarity/txt_sentoken/neg/cv001_19502.txt") as f:
    for line in f:
        print(line)
    

the happy bastard's quick movie review 

damn that y2k bug . 

it's got a head start in this movie starring jamie lee curtis and another baldwin brother ( william this time ) in a story regarding a crew of a tugboat that comes across a deserted russian tech ship that has a strangeness to it when they kick the power back on . 

little do they know the power within . . . 

going for the gore and bringing on a few action sequences here and there , virus still feels very empty , like a movie going for all flash and no substance . 

we don't know why the crew was really out in the middle of nowhere , we don't know the origin of what took over the ship ( just that a big pink flashy thing hit the mir ) , and , of course , we don't know why donald sutherland is stumbling around drunkenly throughout . 

here , it's just " hey , let's chase these people around with some robots " . 

the acting is below average , even from the likes of curtis . 

you're more likely to get a kick out of her work i
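
The blank lines in the output above appear because each line keeps its trailing newline and `print` adds another. The following is a minimal sketch of the difference between `readlines()`, which returns a list of every line, and iterating over the file object, which yields lines one at a time. The helper name `iter_stripped_lines`, the throwaway sample file and the UTF-8 encoding are assumptions made for illustration, not part of the dataset.

```python
import os
import tempfile


def iter_stripped_lines(path, encoding="utf-8"):
    """Yield each line of the file at `path` without its trailing newline."""
    with open(path, encoding=encoding) as f:
        for line in f:  # the file object hands back one line per iteration
            yield line.rstrip("\n")


# Build a small throwaway file so the sketch runs without the dataset.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("first line\nsecond line\nthird line\n")

    with open(sample_path, encoding="utf-8") as f:
        all_lines = f.readlines()  # a list holding every line at once

    stripped = list(iter_stripped_lines(sample_path))

print(type(all_lines).__name__, len(all_lines))
print(stripped)
```

Stripping the newline (or calling `print(line, end="")`) avoids the double spacing.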

This is great for reading one or two files at a time. However, what if you have a whole directory or subdirectory structure to process? You don't want to be typing in all the file names.

That's where the `os` module, and specifically the `os.walk` function, can help. `os` provides interfaces for interacting with the filesystem and processes running on the machine. `os.walk` allows us to recursively navigate a directory tree and inspect all files within that structure. All we need is to specify where to start.

In [5]:
import os

for root, dirs, files in os.walk("assets/review_polarity/"):
    # this outer loop iterates over each subdirectory
    # and updates 'root' with the current directory being navigated.
    print("Current root: ", root)

    # inside each subdirectory we get a list of sub-subdirectories (dirs that reside in the current root)
    for directory in dirs:
        print("DIR  ", directory)

    # we also get a list of files in each subdirectory (files inside the current root dir)
    for file in files:
        print("FILE ", file)
        

Current root:  assets/review_polarity/
DIR   txt_sentoken
FILE  poldata.README.2.0
Current root:  assets/review_polarity/txt_sentoken
DIR   neg
DIR   pos
Current root:  assets/review_polarity/txt_sentoken/neg
FILE  cv315_12638.txt
FILE  cv425_8603.txt
FILE  cv131_11568.txt
FILE  cv060_11754.txt
FILE  cv642_29788.txt
FILE  cv538_28485.txt
FILE  cv018_21672.txt
FILE  cv090_0049.txt
FILE  cv730_10729.txt
FILE  cv924_29397.txt
FILE  cv028_26964.txt
FILE  cv571_29292.txt
FILE  cv382_8393.txt
FILE  cv277_20467.txt
FILE  cv344_5376.txt
FILE  cv024_7033.txt
FILE  cv092_27987.txt
FILE  cv151_17231.txt
FILE  cv288_20212.txt
FILE  cv966_28671.txt
FILE  cv000_29416.txt
FILE  cv928_9478.txt
FILE  cv556_16563.txt
FILE  cv608_24647.txt
FILE  cv465_23401.txt
FILE  cv958_13020.txt
FILE  cv268_20288.txt
FILE  cv834_23192.txt
FILE  cv864_3087.txt
FILE  cv162_10977.txt
FILE  cv674_11593.txt
FILE  cv015_29356.txt
FILE  cv031_19540.txt
FILE  cv004_12641.txt
FILE  cv854_18955.txt
FILE  cv021_17313.txt
FILE  
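
Notice that the files above are not listed in alphabetical order: `os.walk` returns entries in whatever order the operating system reports them. When walking top-down (the default), the `dirs` list can be modified in place to control which subdirectories are visited and in what order. Below is a rough sketch of that idea; the wrapper name `walk_sorted`, the `skip_dirs` parameter and the temporary directory layout (including the `cache` folder) are hypothetical and only exist to keep the example self-contained.

```python
import os
import tempfile


def walk_sorted(top, skip_dirs=()):
    """Like os.walk, but in sorted order and without descending into skip_dirs."""
    for root, dirs, files in os.walk(top):
        # Slice assignment edits the very list os.walk consults before descending.
        dirs[:] = sorted(d for d in dirs if d not in skip_dirs)
        yield root, dirs, sorted(files)


# Hypothetical miniature tree standing in for the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "txt_sentoken")
    for sub in ("pos", "neg", "cache"):
        os.makedirs(os.path.join(base, sub))
    for sub, name in (("pos", "b.txt"), ("pos", "a.txt"), ("cache", "x.tmp")):
        with open(os.path.join(base, sub, name), "w") as f:
            f.write("placeholder\n")

    visited = []
    pos_files = []
    for root, dirs, files in walk_sorted(tmp, skip_dirs=("cache",)):
        visited.append(os.path.relpath(root, tmp))
        if os.path.basename(root) == "pos":
            pos_files = files

print(visited)
print(pos_files)
```

Sorting makes runs reproducible across machines, which can matter when results depend on the order files are read.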

What if we are only interested in a specific set of files or directories? We can filter the filenames as we process them and match them with rules. Let's assume that we only want text files.

In [6]:
for root, dirs, files in os.walk("assets/review_polarity/"):    
    #although we have root and dirs variables - doing anything with them is optional.
    #we're just interested in files - specifically .txt files
    
    for file in files:
        if file.endswith(".txt"):
            print(file)

cv315_12638.txt
cv425_8603.txt
cv131_11568.txt
cv060_11754.txt
cv642_29788.txt
cv538_28485.txt
cv018_21672.txt
cv090_0049.txt
cv730_10729.txt
cv924_29397.txt
cv028_26964.txt
cv571_29292.txt
cv382_8393.txt
cv277_20467.txt
cv344_5376.txt
cv024_7033.txt
cv092_27987.txt
cv151_17231.txt
cv288_20212.txt
cv966_28671.txt
cv000_29416.txt
cv928_9478.txt
cv556_16563.txt
cv608_24647.txt
cv465_23401.txt
cv958_13020.txt
cv268_20288.txt
cv834_23192.txt
cv864_3087.txt
cv162_10977.txt
cv674_11593.txt
cv015_29356.txt
cv031_19540.txt
cv004_12641.txt
cv854_18955.txt
cv021_17313.txt
cv492_19370.txt
cv238_14285.txt
cv027_26270.txt
cv551_11214.txt
cv550_23226.txt
cv106_18379.txt
cv779_18989.txt
cv234_22123.txt
cv811_22646.txt
cv750_10606.txt
cv352_5414.txt
cv123_12165.txt
cv831_16325.txt
cv726_4365.txt
cv624_11601.txt
cv947_11316.txt
cv515_18484.txt
cv902_13217.txt
cv045_25077.txt
cv297_10104.txt
cv762_15604.txt
cv509_17354.txt
cv731_3968.txt
cv134_23300.txt
cv584_29549.txt
cv020_9234.txt
cv613_23104.txt
cv0
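
If more than one extension is of interest, `str.endswith` also accepts a tuple of suffixes. The sketch below wraps that in a small helper; `filter_by_suffix` is a hypothetical name, and `NOTES.TXT` and `review.xml` are made-up filenames added to show case-insensitive matching and a second extension.

```python
def filter_by_suffix(filenames, suffixes=(".txt",)):
    """Return the names ending in any of `suffixes`, ignoring case."""
    lowered = tuple(s.lower() for s in suffixes)
    return [name for name in filenames if name.lower().endswith(lowered)]


# The first two names come from the listing above; the last two are invented.
names = ["cv000_29416.txt", "poldata.README.2.0", "NOTES.TXT", "review.xml"]

txt_only = filter_by_suffix(names)
txt_and_xml = filter_by_suffix(names, (".txt", ".xml"))

print(txt_only)
print(txt_and_xml)
```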

That's great. Now how do we get access to the data in these files? We don't know their full paths, only their base filenames (i.e. we know `cv723_8648.txt` but we don't know that it's actually `assets/review_polarity/txt_sentoken/neg/cv723_8648.txt`).

That's where `os.path.join` comes into play. It joins path components in an OS-independent way and takes care of where to put the separator characters for you.

We know that the current value of `root` holds the full directory path to the file, and we know that `file` holds the filename, so we can do this:

In [7]:
for root, dirs, files in os.walk("assets/review_polarity/"):    
    #although we have root and dirs variables - doing anything with them is optional.
    #we're just interested in files - specifically .txt files
    
    for file in files:
        if file.endswith(".txt"):
            print(os.path.join(root,file))

assets/review_polarity/txt_sentoken/neg/cv315_12638.txt
assets/review_polarity/txt_sentoken/neg/cv425_8603.txt
assets/review_polarity/txt_sentoken/neg/cv131_11568.txt
assets/review_polarity/txt_sentoken/neg/cv060_11754.txt
assets/review_polarity/txt_sentoken/neg/cv642_29788.txt
assets/review_polarity/txt_sentoken/neg/cv538_28485.txt
assets/review_polarity/txt_sentoken/neg/cv018_21672.txt
assets/review_polarity/txt_sentoken/neg/cv090_0049.txt
assets/review_polarity/txt_sentoken/neg/cv730_10729.txt
assets/review_polarity/txt_sentoken/neg/cv924_29397.txt
assets/review_polarity/txt_sentoken/neg/cv028_26964.txt
assets/review_polarity/txt_sentoken/neg/cv571_29292.txt
assets/review_polarity/txt_sentoken/neg/cv382_8393.txt
assets/review_polarity/txt_sentoken/neg/cv277_20467.txt
assets/review_polarity/txt_sentoken/neg/cv344_5376.txt
assets/review_polarity/txt_sentoken/neg/cv024_7033.txt
assets/review_polarity/txt_sentoken/neg/cv092_27987.txt
assets/review_polarity/txt_sentoken/neg/cv151_17231.t
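
To see what `os.path.join` does in isolation, here is a short sketch that needs no files on disk. It builds the path to the review read at the start of this notebook from its components, then splits it again with `os.path.split`. The variable names are illustrative only.

```python
import os

# Build the directory path from its components; the separator is chosen by the OS.
root = os.path.join("assets", "review_polarity", "txt_sentoken", "neg")
full_path = os.path.join(root, "cv001_19502.txt")
print(full_path)

# A trailing separator on the left-hand component is not doubled.
joined = os.path.join("assets" + os.sep, "neg")
print(joined)

# os.path.split undoes the last join, giving back (directory, filename).
directory, filename = os.path.split(full_path)
print(directory, filename)
```

On Linux and macOS the separator is `/`; on Windows it is `\`, which is why building paths by hand with string concatenation is best avoided.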

Now we can read in all of the lines in the files. Let's find out how many lines there are in total across all of the reviews.

In [8]:
linecount = 0
for root, dirs, files in os.walk("assets/review_polarity/"):    
    #although we have root and dirs variables - doing anything with them is optional.
    #we're just interested in files - specifically .txt files
    
    for file in files:
        if file.endswith(".txt"):
            with open(os.path.join(root,file)) as f:
                for line in f:
                    linecount += 1
                    
print("Total lines in all txt files", linecount)

Total lines in all txt files 64720
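
The same counting logic can be packaged as a reusable function. This is one possible sketch rather than the only way to do it: `count_lines`, its `suffix` and `encoding` parameters, and the temporary test tree are all assumptions introduced here, and the UTF-8 default in particular may need changing for other data.

```python
import os
import tempfile


def count_lines(top, suffix=".txt", encoding="utf-8"):
    """Count lines across every file under `top` whose name ends with `suffix`."""
    total = 0
    for root, _dirs, files in os.walk(top):
        for name in files:
            if name.endswith(suffix):
                with open(os.path.join(root, name), encoding=encoding) as f:
                    # Iterating the file keeps only one line in memory at a time.
                    total += sum(1 for _ in f)
    return total


# Hypothetical miniature tree: two .txt files (3 + 2 lines) and one ignored README.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "neg"))
    os.makedirs(os.path.join(tmp, "pos"))
    with open(os.path.join(tmp, "neg", "a.txt"), "w", encoding="utf-8") as f:
        f.write("one\ntwo\nthree\n")
    with open(os.path.join(tmp, "pos", "b.txt"), "w", encoding="utf-8") as f:
        f.write("four\nfive\n")
    with open(os.path.join(tmp, "README"), "w", encoding="utf-8") as f:
        f.write("not counted\n")

    demo_total = count_lines(tmp)

print("Total lines in all txt files", demo_total)
```

Calling `count_lines("assets/review_polarity/")` would apply the same logic to the extracted dataset.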


## Conclusion

We are now able to process directory trees containing multiple files of text data and filter out any files that do not contain relevant information. We can read in a file in one chunk or incrementally, line by line.

Next, we take a look at how to process [XML files](XML%20ElementTree.ipynb).