<a href="https://colab.research.google.com/github/rahiakela/deep-learning-for-nlp-by-jason-brownlee/blob/part-2-bag-of-words/1_preparing_movie_review_data_for_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preparing Movie Review Data for Sentiment Analysis

Text data preparation differs for each problem. It starts with simple steps, like loading data, but quickly gets difficult with cleaning tasks that are very specific to the data you are working with. It helps to know where to begin and in what order to work through the steps from raw data to data ready for modeling.

## Movie Review Dataset

The Movie Review Data is a collection of movie reviews retrieved from the imdb.com website in the early 2000s by Bo Pang and Lillian Lee. The reviews were collected and made available as part of their research on natural language processing.

The dataset is comprised of 1,000 positive and 1,000 negative movie reviews drawn from an archive of the rec.arts.movies.reviews newsgroup hosted at IMDB. The authors refer to this dataset as the polarity dataset.

The data has been cleaned up somewhat, for example:
* The dataset is comprised of only English reviews.
* All text has been converted to lowercase.
* There is white space around punctuation like periods, commas, and brackets.
* Text has been split into one sentence per line.

The data has been used for a few related natural language processing tasks. For classification, the performance of classical models (such as Support Vector Machines) on the data ranges from the high 70s to the low 80s in percent (e.g. 78% to 82%). More sophisticated data preparation may see results as high as 86% with 10-fold cross-validation.


After unzipping the file, you will have a directory called txt_sentoken with two sub-directories, neg and pos, containing the negative and positive reviews respectively. Reviews are stored one per file, with a naming convention from cv000 to cv999 for each of neg and pos.
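The layout described above can be checked with a short sketch like the one below. It does not touch the real dataset: it builds a tiny stand-in directory tree in a temporary folder (the file names and the `count_reviews` helper are made up for illustration) and counts the `.txt` files in each sub-directory. Pointing `count_reviews` at the real txt_sentoken folder should, going by the description above, report 1,000 files for each of neg and pos.

```python
import os
import tempfile

def count_reviews(root):
  """Return a dict mapping each sub-directory name to its number of .txt files."""
  counts = {}
  for label in sorted(os.listdir(root)):
    folder = os.path.join(root, label)
    if os.path.isdir(folder):
      counts[label] = sum(1 for f in os.listdir(folder) if f.endswith('.txt'))
  return counts

# Build a tiny stand-in for txt_sentoken (hypothetical file names, not the real data)
demo_root = tempfile.mkdtemp()
demo_files = {'neg': ['cv000_a.txt', 'cv001_b.txt'], 'pos': ['cv000_c.txt']}
for label, names in demo_files.items():
  os.makedirs(os.path.join(demo_root, label))
  for name in names:
    with open(os.path.join(demo_root, label, name), 'w') as f:
      f.write('a sample review .\n')

layout_counts = count_reviews(demo_root)
print(layout_counts)
```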


## Load Text Data

We will look at loading individual text files, then processing the directories of files. We will fetch the data from the GitHub repository where this Movie Review Polarity Dataset is stored. After fetching, the data will be available inside the cloned repository, in the folder txt_sentoken.

We can load an individual text file by opening it, reading
in the ASCII text, and closing the file. This is standard file handling stuff.

In [2]:
# fetch dataset from github
! git clone https://github.com/rahiakela/machine-learning-datasets -b movie-review-polarity-dataset

Cloning into 'machine-learning-datasets'...
remote: Enumerating objects: 2010, done.

In [3]:
! ls

gdrive	machine-learning-datasets  sample_data	txt_sentoken


In [6]:
# load doc into memory
def load_doc(filename):
  # open the file as read only; the with block closes it automatically
  with open(filename, 'r') as file:
    # read all text
    text = file.read()

  return text

# load one file
filename = 'machine-learning-datasets/movie-review-polarity-dataset/txt_sentoken/neg/cv000_29416.txt'
text = load_doc(filename)

# show the first 5 characters
text[:5]

'plot '

We have two directories, each with 1,000 documents. We can process each directory in turn by first getting a list of files in the directory using the listdir() function, then loading each file in turn.

In [0]:
from os import listdir

# load all docs in a directory
def process_docs(directory):
  # walk through all files in the folder
  for filename in listdir(directory):
    # skip files that do not have the right extension
    if not filename.endswith('.txt'):
      continue

    # create the full path of the file to open
    path = directory + '/' + filename

    # load document
    doc = load_doc(path)
    print(f'Loaded {filename}')

# specify directory to load
directory = 'machine-learning-datasets/movie-review-polarity-dataset/txt_sentoken/neg'
process_docs(directory)
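The function above only prints file names. One possible extension, sketched below, is to collect the text of each review together with a label taken from its directory name. The helper name `load_labeled_docs` and the demo files are assumptions made for illustration, and the sketch runs against a small temporary directory rather than the real dataset.

```python
import os
import tempfile

def load_doc(filename):
  # read the whole file; the with block closes it automatically
  with open(filename, 'r') as file:
    return file.read()

def load_labeled_docs(root, labels=('neg', 'pos')):
  """Return a list of (text, label) pairs, one per .txt file under root/label."""
  docs = []
  for label in labels:
    directory = os.path.join(root, label)
    for filename in sorted(os.listdir(directory)):
      # skip files that do not have the right extension
      if not filename.endswith('.txt'):
        continue
      docs.append((load_doc(os.path.join(directory, filename)), label))
  return docs

# Tiny stand-in for txt_sentoken (hypothetical contents, not the real reviews)
demo_root = tempfile.mkdtemp()
demo_files = {
  'neg': {'cv000_demo.txt': 'a very bad package .', 'notes.md': 'not a review'},
  'pos': {'cv000_demo.txt': 'a very cool idea .'},
}
for label, files in demo_files.items():
  os.makedirs(os.path.join(demo_root, label))
  for name, content in files.items():
    with open(os.path.join(demo_root, label, name), 'w') as f:
      f.write(content)

labeled_docs = load_labeled_docs(demo_root)
for text, label in labeled_docs:
  print(label, '->', text)
```

Keeping the label next to the text means both directories can be processed by one call, and the non-.txt file in the demo shows that the extension check actually skips unwanted files.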

## Clean Text Data

In this section, we will look at what data cleaning we might want to do to the movie review
data. We will assume that we will be using a bag-of-words model or perhaps a word embedding
that does not require too much preparation.

### Split into Tokens

We can use the split() function to split the loaded document into tokens separated by white space.

In [10]:
# split into tokens by white space
tokens = text.split()
print(tokens)

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.', 'they', 'get', 'into', 'an', 'accident', '.', 'one', 'of', 'the', 'guys', 'dies', ',', 'but', 'his', 'girlfriend', 'continues', 'to', 'see', 'him', 'in', 'her', 'life', ',', 'and', 'has', 'nightmares', '.', "what's", 'the', 'deal', '?', 'watch', 'the', 'movie', 'and', '"', 'sorta', '"', 'find', 'out', '.', '.', '.', 'critique', ':', 'a', 'mind-fuck', 'movie', 'for', 'the', 'teen', 'generation', 'that', 'touches', 'on', 'a', 'very', 'cool', 'idea', ',', 'but', 'presents', 'it', 'in', 'a', 'very', 'bad', 'package', '.', 'which', 'is', 'what', 'makes', 'this', 'review', 'an', 'even', 'harder', 'one', 'to', 'write', ',', 'since', 'i', 'generally', 'applaud', 'films', 'which', 'attempt', 'to', 'break', 'the', 'mold', ',', 'mess', 'with', 'your', 'head', 'and', 'such', '(', 'lost', 'highway', '&', 'memento', ')', ',', 'but', 'there', 'are', 'good', 'and', 'bad', 'ways', 'of'
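Because the reviews are stored one sentence per line, it matters that split() is called without arguments. The small sketch below uses a made-up sample string to contrast `split()` with `split(' ')`: the former treats any run of whitespace, including newlines, as a single separator, while the latter splits only on single spaces.

```python
# Made-up sample with a newline and a double space, mimicking the one-sentence-per-line files
sample = 'plot : two teen couples\ngo to a church party ,  drink and then drive .\n'

# No argument: any run of whitespace (spaces, tabs, newlines) separates tokens
tokens_default = sample.split()

# Explicit single space: newlines stay inside tokens and double spaces yield empty strings
tokens_space = sample.split(' ')

print(tokens_default)
print(tokens_space)
```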

Just looking at the raw tokens can give us a lot of ideas of things to try, such as:
* Removing punctuation from words (e.g. "what's").
* Removing tokens that are just punctuation (e.g. "-").
* Removing tokens that contain numbers (e.g. "10/10").
* Removing tokens that have one character (e.g. "a").
* Removing tokens that don't have much meaning (e.g. "and").

Some ideas for how to do this:
* We can filter out punctuation from tokens using regular expressions.
* We can remove tokens that are just punctuation or contain numbers by using an isalpha() check on each token.
* We can remove English stop words using the stop word list provided by NLTK.
* We can filter out short tokens by checking their length.
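To see what each of these steps does on its own, here is a minimal sketch that applies them one at a time to a made-up sentence. The sample text and the tiny `demo_stop_words` set are assumptions chosen so the example runs without downloading anything; they stand in for a real review and for the NLTK stop word list.

```python
import re
import string

# Made-up sample sentence in the same lowercase, space-separated style as the dataset
sample = "what's the deal ? i give plan b a 10/10 - and a mind-blowing finale ."
tokens = sample.split()

# Step 1: strip punctuation characters from inside each token
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
no_punc = [re_punc.sub('', w) for w in tokens]

# Step 2: keep only purely alphabetic tokens (drops '' and '1010')
alpha_only = [w for w in no_punc if w.isalpha()]

# Step 3: drop stop words (hypothetical mini list standing in for the NLTK list)
demo_stop_words = {'the', 'i', 'a', 'and'}
no_stop = [w for w in alpha_only if w not in demo_stop_words]

# Step 4: drop tokens of length one
cleaned = [w for w in no_stop if len(w) > 1]

for name, step in [('no_punc', no_punc), ('alpha_only', alpha_only),
                   ('no_stop', no_stop), ('cleaned', cleaned)]:
  print(name, step)
```

Note that stripping punctuation turns tokens such as "?" into empty strings and "10/10" into "1010", which is why the isalpha() check has to come after the regular expression step.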

In [15]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
from nltk.corpus import stopwords
import string
import re

# turn a doc into clean tokens
def clean_doc(doc):
  # split into tokens by white space
  tokens = doc.split()
  # prepare regex for char filtering
  re_punc = re.compile('[%s]' % re.escape(string.punctuation))
  # remove punctuation from each word
  tokens = [re_punc.sub('', w) for w in tokens]
  # remove remaining tokens that are not alphabetic
  tokens = [word for word in tokens if word.isalpha()]
  # filter out stop words
  stop_words = set(stopwords.words('english'))
  tokens = [w for w in tokens if w not in stop_words]
  # filter out short tokens
  tokens = [word for word in tokens if len(word) > 1]
  return tokens

In [16]:
tokens = clean_doc(text)
print(tokens)

['plot', 'two', 'teen', 'couples', 'go', 'church', 'party', 'drink', 'drive', 'get', 'accident', 'one', 'guys', 'dies', 'girlfriend', 'continues', 'see', 'life', 'nightmares', 'whats', 'deal', 'watch', 'movie', 'sorta', 'find', 'critique', 'mindfuck', 'movie', 'teen', 'generation', 'touches', 'cool', 'idea', 'presents', 'bad', 'package', 'makes', 'review', 'even', 'harder', 'one', 'write', 'since', 'generally', 'applaud', 'films', 'attempt', 'break', 'mold', 'mess', 'head', 'lost', 'highway', 'memento', 'good', 'bad', 'ways', 'making', 'types', 'films', 'folks', 'didnt', 'snag', 'one', 'correctly', 'seem', 'taken', 'pretty', 'neat', 'concept', 'executed', 'terribly', 'problems', 'movie', 'well', 'main', 'problem', 'simply', 'jumbled', 'starts', 'normal', 'downshifts', 'fantasy', 'world', 'audience', 'member', 'idea', 'whats', 'going', 'dreams', 'characters', 'coming', 'back', 'dead', 'others', 'look', 'like', 'dead', 'strange', 'apparitions', 'disappearances', 'looooot', 'chase', 'scen

## Develop Vocabulary