# Building a corpus from individual files
Until now we've used single comma-delimited and tab-delimited files as our source of data. For this project we'll look at 2,000 individual files, where each file contains the text of one review. The labels are determined by the subdirectory that holds the file; that is, positive reviews are stored in a `pos/` directory while negative reviews live under `neg/`. Refer to [moviereviewsREADME.txt](../moviereviews/moviereviewsREADME.txt) for more information about the files.

We'll show two different methods to extract the text of each file in each directory, and build our labeled corpus:
* using Python's **os module** to build a pandas DataFrame
* using an **nltk** tool called `CategorizedPlaintextCorpusReader` 

## Using Python's `os` module to build a DataFrame

In [None]:
# Perform imports:
import numpy as np
import pandas as pd
import os

### Let's look at what os.walk() does:

In [None]:
gen = os.walk('../moviereviews')
next(gen)

`os.walk()` is a generator that yields one tuple for each folder it visits. Each tuple has three items:
1. the name of the current folder
2. a list of names of any subfolders
3. a list of names of any files in the current folder
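
To make the three-item tuple concrete, here is a minimal sketch that walks a small throwaway tree instead of the real `moviereviews` folder. The temporary directory and the `sample.txt` files are hypothetical stand-ins, created only so the example runs anywhere.

```python
import os
import tempfile

# Build a tiny throwaway tree (hypothetical names) so the example runs anywhere
with tempfile.TemporaryDirectory() as root:
    for sub in ("neg", "pos"):
        os.mkdir(os.path.join(root, sub))
        with open(os.path.join(root, sub, "sample.txt"), "w") as f:
            f.write("some review text")

    walked = []
    for folder, subfolders, filenames in os.walk(root):
        subfolders.sort()  # sorting in place fixes the order in which subfolders are visited
        walked.append((os.path.relpath(folder, root), list(subfolders), sorted(filenames)))

for entry in walked:
    print(entry)
```

Each printed tuple holds the folder (shown relative to the root), its subfolders, and its files, in that order.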

In [None]:
next(gen)

The subfolder `../moviereviews/neg` contains 1000 text files. 

In [None]:
next(gen) # this walks the /pos/ subfolder
next(gen)

The final `next(gen)` raises `StopIteration`, because `os.walk()` stops once it has walked all subfolders.
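
As a small illustration of generator exhaustion, the sketch below walks a hypothetical temporary tree with a single subfolder. It shows that calling `next()` on a finished generator raises `StopIteration`, and that passing a default value to `next()` avoids the exception.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "neg"))  # hypothetical one-subfolder tree

    gen = os.walk(root)
    next(gen)  # the root folder
    next(gen)  # the 'neg' subfolder

    try:
        next(gen)  # nothing left to walk
        exhausted = False
    except StopIteration:
        exhausted = True

    # A default value makes next() return it instead of raising
    sentinel = next(gen, None)

print(exhausted, sentinel)
```

In practice a `for` loop over `os.walk()` handles `StopIteration` for you, which is why the loop-based code in this notebook never needs to catch it.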

### Use os.walk() to build a DataFrame

An efficient way to build a DataFrame from individual text files is to first build a list of dictionaries, then convert the whole list to a DataFrame at once, rather than adding rows one at a time.<br>We'll take the following steps to build our list:
1. Start with a list of subdirectory names ('neg' and 'pos')
2. Walk each subdirectory
3. Create a dictionary object for every file in a subdirectory where `label` is either 'neg' or 'pos', and `review` is the text of the file.
4. Handle files that have no text (perhaps a reviewer ranked a movie without commenting on it) by leaving out the `review` key, so that those records are given NaN values in the DataFrame.
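
The NaN behavior in step 4 can be sketched with two hand-written records. The review text here is made up for illustration; the second dictionary has no `review` key, standing in for an empty file.

```python
import pandas as pd

rows = [
    {"label": "neg", "review": "dull and overlong"},  # made-up review text
    {"label": "pos"},  # no 'review' key, standing in for an empty file
]

df_demo = pd.DataFrame(rows)

# pandas fills the missing 'review' value with NaN
missing = df_demo["review"].isna().tolist()
print(missing)
```

Because the missing values are real NaN entries rather than empty strings, they can be found later with `isna()` or removed with `dropna()`.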

In [None]:
row_list = []

for subdir in ['neg','pos']:
    for folder, subfolders, filenames in os.walk('../moviereviews/'+subdir):
        for file in filenames:
            d = {'label':subdir}  # assign the name of the subdirectory to the label field
            # build the path from the folder being walked so it matches the os.walk() root
            with open(os.path.join(folder, file)) as f:
                text = f.read()   # read the file once
            if text:              # empty files get no 'review' key and become NaN on import
                d['review'] = text  # assign the contents of the file to the review field
            row_list.append(d)
        break  # only read the top level of each subdirectory

In [None]:
df = pd.DataFrame(row_list)

In [None]:
df.head()