## How to Prepare Movie Review Text Data for Sentiment Analysis

#### 1. Movie Review Dataset

This is a collection of movie reviews retrieved from the imdb.com website. It has the following properties:
    
    1. The dataset is comprised of only English reviews.
    2. All text has been converted to lowercase.
    3. There is white space around punctuation like periods, commas, and brackets.
    4. Text has been split into one sentence per line.

In [12]:
# Download the data file.
!wget -O Data/review_polarity.tar.gz http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz

--2020-03-28 03:05:06--  http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.20
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.20|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3127238 (3.0M) [application/x-gzip]
Saving to: ‘Data/review_polarity.tar.gz’


2020-03-28 03:05:18 (276 KB/s) - ‘Data/review_polarity.tar.gz’ saved [3127238/3127238]



In [15]:
!ls Data

metamorphosis_clean.txt  metamorphosis.txt  review_polarity.tar


In [14]:
# Uncompress the file
!gunzip Data/review_polarity.tar.gz

In [16]:
# Extract the file
!tar xvf Data/review_polarity.tar

txt_sentoken/neg/cv000_29416.txt
txt_sentoken/neg/cv001_19502.txt
txt_sentoken/neg/cv002_17424.txt
txt_sentoken/neg/cv003_12683.txt
txt_sentoken/neg/cv004_12641.txt
txt_sentoken/neg/cv005_29357.txt
txt_sentoken/neg/cv006_17022.txt
txt_sentoken/neg/cv007_4992.txt
txt_sentoken/neg/cv008_29326.txt
txt_sentoken/neg/cv009_29417.txt
txt_sentoken/neg/cv010_29063.txt
txt_sentoken/neg/cv011_13044.txt
txt_sentoken/neg/cv012_29411.txt
txt_sentoken/neg/cv013_10494.txt
txt_sentoken/neg/cv014_15600.txt
txt_sentoken/neg/cv015_29356.txt
txt_sentoken/neg/cv016_4348.txt
txt_sentoken/neg/cv017_23487.txt
txt_sentoken/neg/cv018_21672.txt
txt_sentoken/neg/cv019_16117.txt
txt_sentoken/neg/cv020_9234.txt
txt_sentoken/neg/cv021_17313.txt
txt_sentoken/neg/cv022_14227.txt
txt_sentoken/neg/cv023_13847.txt
txt_sentoken/neg/cv024_7033.txt
txt_sentoken/neg/cv025_29825.txt
txt_sentoken/neg/cv026_29229.txt
txt_sentoken/neg/cv027_26270.txt
txt_sentoken/neg/cv028_26964.txt
txt_sentoken/neg/c

txt_sentoken/pos/cv006_15448.txt
txt_sentoken/pos/cv007_4968.txt
txt_sentoken/pos/cv008_29435.txt
txt_sentoken/pos/cv009_29592.txt
txt_sentoken/pos/cv010_29198.txt
txt_sentoken/pos/cv011_12166.txt
txt_sentoken/pos/cv012_29576.txt
txt_sentoken/pos/cv013_10159.txt
txt_sentoken/pos/cv014_13924.txt
txt_sentoken/pos/cv015_29439.txt
txt_sentoken/pos/cv016_4659.txt
txt_sentoken/pos/cv017_22464.txt
txt_sentoken/pos/cv018_20137.txt
txt_sentoken/pos/cv019_14482.txt
txt_sentoken/pos/cv020_8825.txt
txt_sentoken/pos/cv021_15838.txt
txt_sentoken/pos/cv022_12864.txt
txt_sentoken/pos/cv023_12672.txt
txt_sentoken/pos/cv024_6778.txt
txt_sentoken/pos/cv025_3108.txt
txt_sentoken/pos/cv026_29325.txt
txt_sentoken/pos/cv027_25219.txt
txt_sentoken/pos/cv028_26746.txt
txt_sentoken/pos/cv029_18643.txt
txt_sentoken/pos/cv030_21593.txt
txt_sentoken/pos/cv031_18452.txt
txt_sentoken/pos/cv032_22550.txt
txt_sentoken/pos/cv033_24444.txt
txt_sentoken/pos/cv034_29647.txt
txt_sentoken/pos/cv

In [18]:
!ls Data

metamorphosis_clean.txt  metamorphosis.txt  review_polarity.tar


In [20]:
# The archive unpacked into the base directory. Move it into the Data folder to keep things tidy.
!mv txt_sentoken Data/txt_sentoken 
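
The uncompress, extract and move steps above can also be done from Python. The following is a minimal sketch, assuming the standard-library `tarfile` module and a hypothetical helper name `extract_archive`; it extracts the archive straight into the target folder, so no separate move is needed.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, target_dir):
    """Extract a .tar or .tar.gz archive into target_dir (hypothetical helper)."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # 'r:*' lets tarfile detect the compression (gzip or none) by itself
    with tarfile.open(archive_path, 'r:*') as archive:
        archive.extractall(path=target)
    return target


# Usage, assuming the archive was downloaded to Data/ as in the cells above:
# extract_archive('Data/review_polarity.tar.gz', 'Data')
```

Only extract archives from sources you trust, since an archive can contain unexpected paths.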

In [21]:
!ls Data/txt_sentoken

neg  pos


We can see two subdirectories within the txt_sentoken folder called neg and pos. They hold the negative and positive classes respectively, and each contains 1000 reviews.

Each file in each folder contains a single review.
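
A quick way to check the number of reviews per class is to count the files in each subfolder. This is a minimal sketch, assuming `pathlib` from the standard library and a hypothetical helper name `count_reviews`.

```python
from pathlib import Path


def count_reviews(root):
    """Return {class_folder_name: number of .txt files} for each subfolder of root."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = len(list(class_dir.glob('*.txt')))
    return counts


# Usage, assuming the data was moved to Data/txt_sentoken as above:
# print(count_reviews('Data/txt_sentoken'))
```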

#### 2. Load Text Data

Now let's open a single file and read the ASCII text.

In [22]:
# Load one file
filename = 'Data/txt_sentoken/pos/cv991_18645.txt'
# Open the file as read only
file = open(filename, 'r')
# Read all text
text = file.read()
# Close the file
file.close()

In [23]:
print(text)

i don't box with kid gloves . 
i don't play nice , i'm not a nice guy , and i never , ever , go easy on a film . 
i consider it to be a breech of some sort of code of ethics for a movie critic . 
however , i do some favors , and these often come in the form of points that i hand to certain groups due to the artistic bravery . 
rigormortis , the production company that has been my prime example of how money does not need to motivate filmmaking , gets several of these points each time . 
i still , however , will not go easy on them . 
they recently sent me a vhs copy of their down with america trilogy ( which begins , quite wittily , with a disclaimer that they are not trying to undermine america with the making of this film . ) and i decided to spend an hour of my day watching it . 
in the famous lines of many martyrs , i have no regrets . 
well , i do have some regrets , but that is not the point in the previous sentence . 
the point of it was that down with america was a film that , f

The file was read successfully, so we can wrap this code in a function and use it to read all the files in a directory in a loop.

In [24]:
def load_doc(filename):
    # Open the file as read only
    file = open(filename, 'r')
    # Read all text
    text = file.read()
    # Close the file
    file.close()
    return text
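
A variation that some readers may prefer is sketched below. It uses a `with` block, which closes the file even if reading fails. The name `load_doc_safe` and the explicit `encoding` argument are assumptions for illustration, not part of the function above; the default of UTF-8 is chosen because ASCII text is also valid UTF-8.

```python
def load_doc_safe(filename, encoding='utf-8'):
    """Read a whole text file and return its contents as a string."""
    # The with block closes the file automatically, even on errors
    with open(filename, 'r', encoding=encoding) as file:
        return file.read()
```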

In [25]:
# Load the os module to list the files in a directory.
import os

In [26]:
# Specify directory to load
directory_pos = 'Data/txt_sentoken/pos'
directory_neg = 'Data/txt_sentoken/neg'
# Walk through all files in the folder.
for filename in os.listdir(directory_pos):
    if filename.endswith('.txt'):
        path = directory_pos + '/' + filename
        document = load_doc(path)
        print('Loaded %s' % filename)

Loaded cv114_18398.txt
Loaded cv471_16858.txt
Loaded cv108_15571.txt
Loaded cv248_13987.txt
Loaded cv879_14903.txt
Loaded cv482_10580.txt
Loaded cv820_22892.txt
Loaded cv405_20399.txt
Loaded cv487_10446.txt
Loaded cv937_9811.txt
Loaded cv892_17576.txt
Loaded cv853_29233.txt
Loaded cv751_15719.txt
Loaded cv792_3832.txt
Loaded cv463_10343.txt
Loaded cv435_23110.txt
Loaded cv852_27523.txt
Loaded cv854_17740.txt
Loaded cv570_29082.txt
Loaded cv575_21150.txt
Loaded cv272_18974.txt
Loaded cv242_10638.txt
Loaded cv547_16324.txt
Loaded cv784_14394.txt
Loaded cv836_12968.txt
Loaded cv609_23877.txt
Loaded cv011_12166.txt
Loaded cv192_14395.txt
Loaded cv181_14401.txt
Loaded cv627_11620.txt
Loaded cv636_15279.txt
Loaded cv658_10532.txt
Loaded cv584_29722.txt
Loaded cv480_19817.txt
Loaded cv371_7630.txt
Loaded cv771_28665.txt
Loaded cv947_10601.txt
Loaded cv104_18134.txt
Loaded cv767_14062.txt
Loaded cv256_14740.txt
Loaded cv911_20260.txt
Loaded cv333_8916.txt
Loaded cv000_29590.txt
Loaded cv313_18

In [27]:
# Another way of doing the above.

# Specify directory to load
directory_pos = 'Data/txt_sentoken/pos'
directory_neg = 'Data/txt_sentoken/neg'
# Walk through all files in the folder.
for filename in os.listdir(directory_pos):
    # Skip files that do not have the right extension
    if not filename.endswith('.txt'):
        continue
    # Create the full path of the file to open
    path = directory_pos + '/' + filename
    document = load_doc(path)
    print('Loaded %s' % filename)

Loaded cv114_18398.txt
Loaded cv471_16858.txt
Loaded cv108_15571.txt
Loaded cv248_13987.txt
Loaded cv879_14903.txt
Loaded cv482_10580.txt
Loaded cv820_22892.txt
Loaded cv405_20399.txt
Loaded cv487_10446.txt
Loaded cv937_9811.txt
Loaded cv892_17576.txt
Loaded cv853_29233.txt
Loaded cv751_15719.txt
Loaded cv792_3832.txt
Loaded cv463_10343.txt
Loaded cv435_23110.txt
Loaded cv852_27523.txt
Loaded cv854_17740.txt
Loaded cv570_29082.txt
Loaded cv575_21150.txt
Loaded cv272_18974.txt
Loaded cv242_10638.txt
Loaded cv547_16324.txt
Loaded cv784_14394.txt
Loaded cv836_12968.txt
Loaded cv609_23877.txt
Loaded cv011_12166.txt
Loaded cv192_14395.txt
Loaded cv181_14401.txt
Loaded cv627_11620.txt
Loaded cv636_15279.txt
Loaded cv658_10532.txt
Loaded cv584_29722.txt
Loaded cv480_19817.txt
Loaded cv371_7630.txt
Loaded cv771_28665.txt
Loaded cv947_10601.txt
Loaded cv104_18134.txt
Loaded cv767_14062.txt
Loaded cv256_14740.txt
Loaded cv911_20260.txt
Loaded cv333_8916.txt
Loaded cv000_29590.txt
Loaded cv313_18

We can also convert any of the above to a function.

In [28]:
def process_docs(directory):
    for filename in os.listdir(directory):
        # Skip files that do not have the right extension
        if not filename.endswith('.txt'):
            continue
        # Create the full path of the file to open
        path = directory + '/' + filename
        document = load_doc(path)
        print('Loaded %s' % filename)
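
As an alternative sketch, the extension check can be folded into the directory listing itself. This assumes `pathlib` from the standard library and a hypothetical helper name `list_review_files`. Sorting is added because `os.listdir()` returns names in arbitrary order, as the earlier output shows.

```python
from pathlib import Path


def list_review_files(directory):
    """Return the .txt files in directory as a sorted list of Path objects."""
    return sorted(Path(directory).glob('*.txt'))


# Usage, assuming load_doc() from the earlier cell:
# for path in list_review_files('Data/txt_sentoken/pos'):
#     document = load_doc(path)
#     print('Loaded %s' % path.name)
```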

In [30]:
# Run function.
process_docs(directory_pos)

Loaded cv114_18398.txt
Loaded cv471_16858.txt
Loaded cv108_15571.txt
Loaded cv248_13987.txt
Loaded cv879_14903.txt
Loaded cv482_10580.txt
Loaded cv820_22892.txt
Loaded cv405_20399.txt
Loaded cv487_10446.txt
Loaded cv937_9811.txt
Loaded cv892_17576.txt
Loaded cv853_29233.txt
Loaded cv751_15719.txt
Loaded cv792_3832.txt
Loaded cv463_10343.txt
Loaded cv435_23110.txt
Loaded cv852_27523.txt
Loaded cv854_17740.txt
Loaded cv570_29082.txt
Loaded cv575_21150.txt
Loaded cv272_18974.txt
Loaded cv242_10638.txt
Loaded cv547_16324.txt
Loaded cv784_14394.txt
Loaded cv836_12968.txt
Loaded cv609_23877.txt
Loaded cv011_12166.txt
Loaded cv192_14395.txt
Loaded cv181_14401.txt
Loaded cv627_11620.txt
Loaded cv636_15279.txt
Loaded cv658_10532.txt
Loaded cv584_29722.txt
Loaded cv480_19817.txt
Loaded cv371_7630.txt
Loaded cv771_28665.txt
Loaded cv947_10601.txt
Loaded cv104_18134.txt
Loaded cv767_14062.txt
Loaded cv256_14740.txt
Loaded cv911_20260.txt
Loaded cv333_8916.txt
Loaded cv000_29590.txt
Loaded cv313_18

#### 3. Clean Text Data

Now that we have functions to load the data, let's look at what data cleaning we might want to do. We will assume that we will be using a bag-of-words (BOW) model or perhaps a word embedding that does not require too much preparation.

3.1. Split into tokens:
    
    First let's load one document and look at the raw tokens split by white space.

In [39]:
# Load one file
filename = 'Data/txt_sentoken/neg/cv000_29416.txt'
text = load_doc(filename)
print(text)

plot : two teen couples go to a church party , drink and then drive . 
they get into an accident . 
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . 
what's the deal ? 
watch the movie and " sorta " find out . . . 
critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . 
which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . 
they seem to have taken this pretty neat concept , but executed it terribly . 
so what are the problems with the movie ? 
well , its main problem is that it's simply too jumbled . 
it starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience member , have no id

In [40]:
# Split into tokens with whitespace
tokens = text.split()
print(tokens[:100])

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.', 'they', 'get', 'into', 'an', 'accident', '.', 'one', 'of', 'the', 'guys', 'dies', ',', 'but', 'his', 'girlfriend', 'continues', 'to', 'see', 'him', 'in', 'her', 'life', ',', 'and', 'has', 'nightmares', '.', "what's", 'the', 'deal', '?', 'watch', 'the', 'movie', 'and', '"', 'sorta', '"', 'find', 'out', '.', '.', '.', 'critique', ':', 'a', 'mind-fuck', 'movie', 'for', 'the', 'teen', 'generation', 'that', 'touches', 'on', 'a', 'very', 'cool', 'idea', ',', 'but', 'presents', 'it', 'in', 'a', 'very', 'bad', 'package', '.', 'which', 'is', 'what', 'makes', 'this', 'review', 'an', 'even', 'harder', 'one', 'to', 'write', ',', 'since', 'i']


Looking at the text, we can see what needs to be done:
    
    1. Remove punctuation from words.
    2. Remove tokens that are just punctuation.
    3. Remove tokens that contain numbers.
    4. Remove tokens that have just one character.
    5. Remove tokens that don't have much meaning.
    
How:
    
    1. Filter out punctuation from tokens using regular expressions.
    2. Remove tokens that are just punctuation or contain numbers by using an isalpha() check on each token.
    3. Remove English stop words using the list loaded from NLTK.
    4. Filter out short tokens, e.g. **a**, by checking their length.
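
The steps above can be sketched on a single made-up sentence, keeping each intermediate result so the effect of every filter is visible. The sentence and the tiny `SAMPLE_STOP_WORDS` set are assumptions for illustration only; the notebook itself uses NLTK's English stop word list.

```python
import re
import string

# Tiny stand-in stop word list, for illustration only
SAMPLE_STOP_WORDS = {'the', 'but', 'a'}

sample = "the film isn't great , but 2 scenes shine ."
tokens = sample.split()

# 1. Strip punctuation characters from every token
re_punctuation = re.compile('[%s]' % re.escape(string.punctuation))
stripped = [re_punctuation.sub('', w) for w in tokens]

# 2. Drop tokens that are now empty or contain non-letters
alphabetic = [w for w in stripped if w.isalpha()]

# 3. Drop stop words
content = [w for w in alphabetic if w not in SAMPLE_STOP_WORDS]

# 4. Drop one-character tokens
cleaned = [w for w in content if len(w) > 1]

print(cleaned)
```

Note that a token consisting only of punctuation becomes an empty string in step 1, and the `isalpha()` check in step 2 removes it because an empty string is not alphabetic.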

In [37]:
# Load modules for the above tasks
from nltk.corpus import stopwords
import string, re

In [41]:
# We have loaded a file
# We have split the file by whitespace
# Prepare regex for character filtering
re_punctuation = re.compile('[%s]'%re.escape(string.punctuation))
# remove punctuation from each word
tokens = [re_punctuation.sub('', w) for w in tokens]
# remove remaining non-alphabetic tokens
tokens = [word for word in tokens if word.isalpha()]
# filter out stopwords
stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if w not in stop_words]
# Filter out short tokens
tokens = [word for word in tokens if len(word) > 1]
print(tokens)

['plot', 'two', 'teen', 'couples', 'go', 'church', 'party', 'drink', 'drive', 'get', 'accident', 'one', 'guys', 'dies', 'girlfriend', 'continues', 'see', 'life', 'nightmares', 'whats', 'deal', 'watch', 'movie', 'sorta', 'find', 'critique', 'mindfuck', 'movie', 'teen', 'generation', 'touches', 'cool', 'idea', 'presents', 'bad', 'package', 'makes', 'review', 'even', 'harder', 'one', 'write', 'since', 'generally', 'applaud', 'films', 'attempt', 'break', 'mold', 'mess', 'head', 'lost', 'highway', 'memento', 'good', 'bad', 'ways', 'making', 'types', 'films', 'folks', 'didnt', 'snag', 'one', 'correctly', 'seem', 'taken', 'pretty', 'neat', 'concept', 'executed', 'terribly', 'problems', 'movie', 'well', 'main', 'problem', 'simply', 'jumbled', 'starts', 'normal', 'downshifts', 'fantasy', 'world', 'audience', 'member', 'idea', 'whats', 'going', 'dreams', 'characters', 'coming', 'back', 'dead', 'others', 'look', 'like', 'dead', 'strange', 'apparitions', 'disappearances', 'looooot', 'chase', 'scen

Since this works, let's put it into a function called clean_doc().

In [42]:
def clean_doc(doc):
    tokens = doc.split()
    re_punctuation = re.compile('[%s]'%re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punctuation.sub('', w) for w in tokens]
    # remove remaining non-alphabetic tokens
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # Filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

In [44]:
# Load all functions and document
text = load_doc(filename)
tokens = clean_doc(text)
print(tokens)

['plot', 'two', 'teen', 'couples', 'go', 'church', 'party', 'drink', 'drive', 'get', 'accident', 'one', 'guys', 'dies', 'girlfriend', 'continues', 'see', 'life', 'nightmares', 'whats', 'deal', 'watch', 'movie', 'sorta', 'find', 'critique', 'mindfuck', 'movie', 'teen', 'generation', 'touches', 'cool', 'idea', 'presents', 'bad', 'package', 'makes', 'review', 'even', 'harder', 'one', 'write', 'since', 'generally', 'applaud', 'films', 'attempt', 'break', 'mold', 'mess', 'head', 'lost', 'highway', 'memento', 'good', 'bad', 'ways', 'making', 'types', 'films', 'folks', 'didnt', 'snag', 'one', 'correctly', 'seem', 'taken', 'pretty', 'neat', 'concept', 'executed', 'terribly', 'problems', 'movie', 'well', 'main', 'problem', 'simply', 'jumbled', 'starts', 'normal', 'downshifts', 'fantasy', 'world', 'audience', 'member', 'idea', 'whats', 'going', 'dreams', 'characters', 'coming', 'back', 'dead', 'others', 'look', 'like', 'dead', 'strange', 'apparitions', 'disappearances', 'looooot', 'chase', 'scen

#### 4. Develop Vocabulary

Now we have to work on the vocabulary. Always remember: the larger the vocab, the sparser the representation of each word or document. As part of preparing text for sentiment analysis, one has to define and tailor the vocab of words supported by the model.

We can keep track of the vocab in a Counter, which is a dictionary of words and their counts with some convenience functions.

We will thus create a new function to process each document and add it to the vocabulary. The function needs to load a document by calling load_doc(), clean it by calling clean_doc(), and then add all tokens to the Counter, updating the counts.
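
To make the Counter behaviour concrete, here is a minimal sketch with two made-up token lists standing in for cleaned documents. The name `vocab_demo` is hypothetical.

```python
from collections import Counter

vocab_demo = Counter()
# update() adds one count per occurrence of each token
vocab_demo.update(['film', 'good', 'film'])
vocab_demo.update(['film', 'bad'])

print(vocab_demo['film'])         # count for a single token
print(vocab_demo.most_common(1))  # most frequent tokens first
print(len(vocab_demo))            # number of distinct tokens
```

Because `update()` accumulates counts across calls, the same Counter can be passed from document to document.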

In [None]:
# Load doc and add to vocab
def add_doc_to_vocab(filename, vocab): 
    """
    input: A document filename and a Counter vocabulary
    output: Counter update
    """
    # load document
    document = load_doc(filename)
    # clean document
    tokens = clean_doc(document)
    # Update counter
    vocab.update(tokens)
    
# Finally, we update our process_docs() function so that it calls
# add_doc_to_vocab() for every document in a directory.

def process_docs(directory, vocab):
    """
    input: A directory containing all documents of a class and a Counter vocabulary
    output: Vocabulary counter update
    """
    for filename in os.listdir(directory):
        # Skip files that do not have the right extension
        if not filename.endswith('.txt'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # add document to vocabulary
        add_doc_to_vocab(path, vocab)

Now we have all these processes outlined, let's put them all together in one script.

In [52]:
"""
Program to create a vocab corpus for NLP tasks. Specifically text classification.
"""

import string, re, os
from collections import Counter
from nltk.corpus import stopwords

# Function to load documents into memory
def load_doc(filename):
    """
    Parameters:
    filename: a filename
    
    Return:
    text: a loaded text
    """
    # Open the file as read only
    file = open(filename, 'r')
    # Read all text
    text = file.read()
    # Close the file
    file.close()
    return text

# Function to clean up a loaded document
def clean_doc(doc):
    """
    Parameters:
    doc: a loaded document
    
    Return
    tokens: a cleaned document"""
    # Split into tokens by white space.
    tokens = doc.split()
    # Prepare regex for character filtering.
    re_punctuation = re.compile('[%s]'%re.escape(string.punctuation))
    # remove punctuation from each word.
    tokens = [re_punctuation.sub('', w) for w in tokens]
    # remove remaining non-alphabetic tokens.
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stopwords.
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # Filter out short tokens.
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# Function to load documents and add to vocabulary
def add_doc_to_vocab(filename, vocab): 
    """
    Parameters:
    filename: A document filename 
    Vocab: Counter vocabulary
    
    Return:
    Counter update
    """
    # load document
    document = load_doc(filename)
    # clean document
    tokens = clean_doc(document)
    # Update counter
    vocab.update(tokens)
    
# Function to load all documents in a directory
def process_docs(directory, vocab):
    """
    Parameters:
    directory: A directory containing all documents of a class 
    vocab: Counter vocabulary
    
    Return:
    Vocabulary counter update
    """
    for filename in os.listdir(directory):
        # Skip files that do not have the right extension
        if not filename.endswith('.txt'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # add document to vocabulary
        add_doc_to_vocab(path, vocab)
        
# Function to truncate vocabulary
def remove_min(occurance, vocab):
    """
    Parameters:
    occurance: a minimum occurrence threshold
    vocab: Counter vocabulary
    
    Return:
    tokens: list of tokens that occur at least `occurance` times
    """
    # Keep tokens whose count is at least the threshold
    tokens = [i for i, j in vocab.items() if j >= occurance]
    return tokens
        
# Function to save final vocabulary
def save_list(lines, filename):
    """
    Parameters:
    lines: list of strings to write, one per line
    filename: Filename to save data to.
    
    """
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()
    print('Done saving to %s ...'%filename)
        
        
# Define a counter vocabulary
vocab = Counter()
# Add all docs to vocab
process_docs(directory_neg, vocab)
process_docs(directory_pos, vocab)
# Keep only tokens that occur at least 5 times
tokens = remove_min(5, vocab)
# Print the vocabulary size before and after filtering
print(len(vocab))
print(len(tokens))
# Save tokens to a vocab file
save_list(tokens, 'Data/vocab.txt')

46557
14803
Done saving to Data/vocab.txt ...


In [47]:
# Print the top words in the vocabulary
print(vocab.most_common(50))

[('film', 8860), ('one', 5521), ('movie', 5440), ('like', 3553), ('even', 2555), ('good', 2320), ('time', 2283), ('story', 2118), ('films', 2102), ('would', 2042), ('much', 2024), ('also', 1965), ('characters', 1947), ('get', 1921), ('character', 1906), ('two', 1825), ('first', 1768), ('see', 1730), ('well', 1694), ('way', 1668), ('make', 1590), ('really', 1563), ('little', 1491), ('life', 1472), ('plot', 1451), ('people', 1420), ('movies', 1416), ('could', 1395), ('bad', 1374), ('scene', 1373), ('never', 1364), ('best', 1301), ('new', 1277), ('many', 1268), ('doesnt', 1267), ('man', 1266), ('scenes', 1265), ('dont', 1210), ('know', 1207), ('hes', 1150), ('great', 1141), ('another', 1111), ('love', 1089), ('action', 1078), ('go', 1075), ('us', 1065), ('director', 1056), ('something', 1048), ('end', 1047), ('still', 1038)]


Nice!

Perhaps the least common words, those that appear only once across all reviews, are not predictive. Perhaps some of the most common words are not that useful either. These are good questions to ask of your data, and they should be tested with a specific predictive model.
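
The list above contains tokens such as `doesnt`, `dont` and `hes`. One possible explanation is that clean_doc() strips punctuation before the stop word check, so a contraction like `don't` becomes `dont` and may no longer match an entry in the stop word list. The sketch below illustrates the effect with a tiny hand-written stop word set, which is an assumption for illustration; NLTK's actual list may differ between versions.

```python
import re
import string

# Hypothetical stop word set, for illustration only
SAMPLE_STOP_WORDS = {'i', "don't", 'the'}
re_punctuation = re.compile('[%s]' % re.escape(string.punctuation))

tokens = "i don't like the ending".split()

# Order used in clean_doc(): strip punctuation first, then filter stop words
strip_first = [re_punctuation.sub('', w) for w in tokens]
strip_first = [w for w in strip_first if w not in SAMPLE_STOP_WORDS]

# Alternative order: filter stop words first, then strip punctuation
filter_first = [w for w in tokens if w not in SAMPLE_STOP_WORDS]
filter_first = [re_punctuation.sub('', w) for w in filter_first]

print(strip_first)
print(filter_first)
```

Whether removing such tokens helps is itself a modeling question, since negations like `dont` may carry sentiment.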

Generally, words that appear only a few times across 2,000 reviews are probably not predictive and can be removed from the vocabulary, greatly cutting down on the tokens we need to model. We can select a threshold and walk through our data, removing words that occur fewer times than the threshold.

In [50]:
vocab.items()



In [51]:
# Keep tokens with at least 5 occurrences
min_occurance = 5
tokens = [i for i, j in vocab.items() if j >= min_occurance]
print(len(tokens))

14803


We can see a significant reduction in the vocabulary size. Is it possible that a minimum of 5 is too aggressive? We can try other values and see.
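
One way to try other values is to compute the vocabulary size for several thresholds at once. This is a minimal sketch with toy counts; the helper name `vocab_size_by_threshold` and the counts in `toy_vocab` are assumptions for illustration. With the real data, the `vocab` Counter built earlier would be passed instead.

```python
from collections import Counter

# Toy counts, for illustration only
toy_vocab = Counter({'film': 9, 'plot': 5, 'neat': 2, 'looooot': 1})


def vocab_size_by_threshold(vocab, thresholds):
    """Return {threshold: number of tokens occurring at least that many times}."""
    return {t: sum(1 for count in vocab.values() if count >= t) for t in thresholds}


sizes = vocab_size_by_threshold(toy_vocab, [1, 2, 5, 10])
print(sizes)
```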

We can then save the chosen vocabulary of words to a new file, preferably as an ASCII text file with one word per line. Let's define a function that will do that, and add it to the previous script.

In [None]:
def save_list(lines, filename):
    """
    Parameters:
    lines: list of strings to write, one per line
    filename: Filename to save data to.
    
    """
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()
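
To check that a list survives the trip to disk and back, a round trip through a temporary file can be sketched as follows. The names `save_list_safe` and `load_list` are hypothetical variants written with `with` blocks, not the functions used in the script.

```python
import os
import tempfile


def save_list_safe(lines, filename):
    """Write lines to filename, one item per line (hypothetical variant of save_list)."""
    with open(filename, 'w', encoding='utf-8') as file:
        file.write('\n'.join(lines))


def load_list(filename):
    """Read a file written by save_list_safe back into a list of lines."""
    with open(filename, 'r', encoding='utf-8') as file:
        return file.read().splitlines()


# Round trip through a temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'vocab_demo.txt')
    save_list_safe(['film', 'plot', 'story'], path)
    restored = load_list(path)

print(restored)
```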

All of this code is already included in the script above. Take time to understand it before running it.

#### 5. Save Prepared Data

We can use the data cleaning and chosen vocab to prepare each movie review and save the prepared versions of the reviews ready for modeling.

This is usually a good practice as it decouples the data preparation from modeling, allowing you to focus on modeling and circle back to data prep if you have new ideas.

Let's start by loading vocab.txt.

In [53]:
# Load vocabulary
vocab_filename = 'Data/vocab.txt'
vocab = load_doc(vocab_filename)
print(vocab)
vocab = vocab.split()
vocab = set(vocab)

part
buddy
comedy
fishoutofwater
story
nature
tale
meet
deedles
nearly
interesting
archetypes
fact
invitation
ought
disregard
phil
deedle
paul
walker
steve
van
twin
sons
famous
millionaire
elton
eric
founder
enterprises
wants
perfect
fortune
instead
two
careless
surf
bums
set
straight
strict
wyoming
boot
camp
pair
inevitably
go
several
misadventures
duo
stunned
discover
theres
brothers
stumble
upon
routine
mistaken
identity
plot
arrive
yellowstone
national
park
believed
new
ranger
recruits
rather
back
home
disappointing
dad
play
along
actually
motives
beautiful
jesse
unfortunately
happens
beloved
stepdaughter
overprotective
captain
douglas
john
problem
week
old
faithful
celebrates
one
birthday
dogs
thousands
assigned
eliminate
dog
menace
knowing
plan
former
head
frank
slater
dennis
hopper
arent
supposed
stupid
like
team
dumb
dumber
bill
ted
brains
operate
different
simpler
realm
accurate
comparison
would
carrot
top
chairman
board
film
resembles
hideous
ways
central
simply
isnt
funny
mo

Next, let's clean the reviews, use the loaded vocab to filter out unwanted tokens, and save the clean reviews in a new file.

One approach would be to save all positive reviews in one file and all negative reviews in another, with one review per line and its filtered tokens separated by white space.

First we can define a function that will process a document, clean it, filter it, and return it as a single line that could be saved in a file.

In [54]:
def doc_to_line(filename, vocab): 
    """
    Parameters:
    filename: A document filename 
    vocab: the vocabulary, as a set of tokens
    
    Return:
    A single string of the kept tokens separated by spaces
    """
    # load document
    document = load_doc(filename)
    # clean document (tokenize)
    tokens = clean_doc(document)
    # filter by vocab
    tokens = [w for w in tokens if w in vocab]
    return ' '.join(tokens)

Next we define a new version of process_docs() to step through all reviews in a folder and convert them to lines by calling doc_to_line() for each document.

A list of lines will then be returned.

In [55]:
# Load all docs in a directory
def process_docs(directory, vocab):
    """
    Parameters:
    directory: a directory to traverse
    vocab: set of vocabularies
    
    Return:
    lines: cleaned and straightened documents from a directory"""
    lines = list()
    # Walk through all files in the folder
    for filename in os.listdir(directory):
        # Skip files that do not have the right extension
        if not filename.endswith('.txt'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # Load and clean the doc
        line = doc_to_line(path, vocab)
        # Add to list
        lines.append(line)
    return lines

So to complete the process, we detail the whole script for saving processed text data.

In [57]:
"""
Python script to load a vocab, create tokens for both negative and positive reviews and save both positive and 
negative tokens in a directory.
"""

# import modules as in previous script
# load_doc() function
# clean_doc() function
# save_list() function

# Function to load document, clean it and return line of tokens
def doc_to_line(filename, vocab): 
    """
    Parameters:
    filename: A document filename 
    vocab: the vocabulary, as a set of tokens
    
    Return:
    A single string of the kept tokens separated by spaces
    """
    # load document
    document = load_doc(filename)
    # clean document (tokenize)
    tokens = clean_doc(document)
    # filter by vocab
    tokens = [w for w in tokens if w in vocab]
    return ' '.join(tokens)

# Function to load all documents in a directory
def process_docs_full(directory, vocab):
    """
    Parameters:
    directory: a directory to traverse
    vocab: set of vocabularies
    
    Return:
    lines: cleaned and straightened documents from a directory"""
    lines = list()
    # Walk through all files in the folder
    for filename in os.listdir(directory):
        # Skip files that do not have the right extension
        if not filename.endswith('.txt'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # Load and clean the doc
        line = doc_to_line(path, vocab)
        # Add to list
        lines.append(line)
    return lines


# load vocabulary from file
vocab_filename = 'Data/vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)
print('Loaded and set vocabulary with length %s...'%len(vocab))

# Prepare negative reviews
negative_lines = process_docs_full('Data/txt_sentoken/neg', vocab)
save_list(negative_lines, 'Data/negative.txt')

# Prepare positive reviews
positive_lines = process_docs_full('Data/txt_sentoken/pos', vocab)
save_list(positive_lines, 'Data/positive.txt')
print('Cleaned and saved both positive and negative reviews to file...')

Loaded and set vocabulary with length 14803...
Done saving to Data/negative.txt ...
Done saving to Data/positive.txt ...
Cleaned and saved both positive and negative reviews to file...
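
When it is time to model, the prepared files can be read back with one review per line. The following is a minimal sketch, assuming a hypothetical helper name `load_prepared` and 0/1 as class labels; neither is part of the scripts above. It splits on `'\n'` to mirror the `'\n'.join()` used when saving, so a review that lost all its tokens still yields an (empty) entry.

```python
from pathlib import Path


def load_prepared(filename, label):
    """Return a list of (tokens, label) pairs, one per line of a prepared file."""
    text = Path(filename).read_text(encoding='utf-8')
    return [(line.split(), label) for line in text.split('\n')]


# Usage, assuming the files saved above and 0/1 as hypothetical class labels:
# dataset = load_prepared('Data/negative.txt', 0) + load_prepared('Data/positive.txt', 1)
```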


In [58]:
!pip3 freeze > requirements.txt

Now we know:
    
    1. How to load text data and clean it to remove punctuation and unwanted words.
    2. How to develop a vocabulary, tailor it, and save it to file.
    3. How to prepare movie reviews using cleaning and a predefined vocabulary and save them to new files ready for modeling.