IMDB review data can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/. After downloading and unzipping the archive, we read the reviews into a pandas DataFrame to prepare for the later training process.
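
The unzip step can also be scripted. The following is a minimal sketch using only the standard library, assuming the download is a gzip-compressed tar archive saved as `./data/aclImdb_v1.tar.gz`. That file name and the helper `extract_archive` are assumptions for illustration, not something stated above.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return the entry names found there afterwards."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, 'r:gz') as archive:
        archive.extractall(dest)
    return sorted(p.name for p in dest.iterdir())


# Assumed archive location; adjust it to wherever the download was saved.
archive = Path('./data/aclImdb_v1.tar.gz')
if archive.exists():
    print(extract_archive(archive, './data'))
```

If the archive unpacks into an `aclImdb` folder, `./data/aclImdb` then matches the `base_path` used in the next cell.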

In [1]:
import pandas as pd
import os

In [10]:
base_path = './data/aclImdb'
labels = {'pos': 1, 'neg': 0}  # we will assign 1 to positive reviews and 0 to negative.
rows = []

# go through the directory and read in all reviews with their corresponding labels
for sample in ('train', 'test'):
    for label in ('pos', 'neg'):
        path = os.path.join(base_path, sample, label)
        for filename in os.listdir(path):
            # the with block closes the file as soon as the read finishes
            with open(os.path.join(path, filename), 'r', encoding='utf-8') as review_file:
                review_text = review_file.read()
            # collect the review and its label; DataFrame.append was removed in pandas 2.0
            rows.append([review_text, labels[label]])

# build the dataframe once, which is also much faster than growing it row by row
df_data = pd.DataFrame(rows, columns=['review', 'sentiment'])

In [12]:
df_data.shape

(50000, 2)

In [11]:
df_data.head()

                                              review  sentiment
0  Bromwell High is a cartoon comedy. It ran at t...          1
1  Homelessness (or Houselessness as George Carli...          1
2  Brilliant over-acting by Lesley Ann Warren. Be...          1
3  This is easily the most underrated film inn th...          1
4  This is not the typical Mel Brooks film. It wa...          1


After reading in the data, we shuffle the records so that the reviews are in random order for learning. As loaded, they are grouped by split (train, test) and by label (pos, neg).

In [13]:
import numpy as np

In [14]:
np.random.seed(0)
df_data = df_data.reindex(np.random.permutation(df_data.index))
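
As a small check of what this shuffle does, the sketch below applies the same `reindex` call to a toy dataframe (the four rows are made up for illustration). It shows that whole rows move together, so each review keeps its label, and that the original index labels survive the shuffle, which is why a label lookup with `.loc` still finds the same review afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the 50,000 reviews.
toy = pd.DataFrame({'review': ['good', 'bad', 'fine', 'awful'],
                    'sentiment': [1, 0, 1, 0]})

np.random.seed(0)
shuffled = toy.reindex(np.random.permutation(toy.index))

# Each (review, sentiment) pair is unchanged; only the row order differs.
pairs_before = set(zip(toy['review'], toy['sentiment']))
pairs_after = set(zip(shuffled['review'], shuffled['sentiment']))
print(pairs_before == pairs_after)

# .loc looks rows up by their original index label, not by position.
print(shuffled.loc[2, 'review'])
```

Saving with `index=False` and reading the file back, as done next, discards these labels and numbers the shuffled rows from 0 again.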

The data set is now prepared. We save it to a CSV file for ease of access.

In [17]:
df_data.to_csv('./data/Imdb_reviews_data.csv', index=False)

In [18]:
df_imdb_reviews = pd.read_csv('./data/Imdb_reviews_data.csv')

In [20]:
df_imdb_reviews.head()

                                              review  sentiment
0  Often tagged as a comedy, The Man In The White...          1
1  After Chaplin made one of his best films: Doug...          0
2  I think the movie was one sided I watched it r...          0
3  I have fond memories of watching this visually...          1
4  This episode had potential. The basic premise ...          0


Bag-of-words model to represent each document as a vector of word counts

In [22]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

In [23]:
count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

In [24]:
print(count.vocabulary_)

{'the': 6, 'two': 7, 'shining': 3, 'weather': 8, 'is': 1, 'sun': 4, 'and': 0, 'one': 2, 'sweet': 5}


In [25]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]
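
To make the count matrix above easier to read, here is a plain-Python sketch that rebuilds the same vocabulary and counts by hand. It assumes tokens are lowercased runs of two or more word characters, which mirrors `CountVectorizer`'s default settings. The helper names `tokenize` and `count_vector` are hypothetical.

```python
import re
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]


def tokenize(doc):
    # lowercase, then keep runs of at least two word characters
    return re.findall(r'\b\w\w+\b', doc.lower())


# Column index of each term: terms sorted alphabetically, numbered from 0.
terms = sorted({token for doc in docs for token in tokenize(doc)})
vocabulary = {term: index for index, term in enumerate(terms)}


def count_vector(doc):
    # one count per vocabulary term, in column order
    counts = Counter(tokenize(doc))
    return [counts[term] for term in terms]


bag = [count_vector(doc) for doc in docs]
print(vocabulary)
for row in bag:
    print(row)
```

Column 0 is 'and', column 1 is 'is', and so on, so the 3 in the last row says that 'is' occurs three times in the third document.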


Tf-idf weighting to downweight words that appear in most documents

In [29]:
from sklearn.feature_extraction.text import TfidfTransformer

In [30]:
np.set_printoptions(precision=2)
tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[ 0.    0.43  0.    0.56  0.56  0.    0.43  0.    0.  ]
 [ 0.    0.43  0.    0.    0.    0.56  0.43  0.    0.56]
 [ 0.5   0.45  0.5   0.19  0.19  0.19  0.3   0.25  0.19]]
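
The numbers above can be reproduced by hand from the count matrix. The sketch below assumes the settings used in the cell (`use_idf=True`, `smooth_idf=True`, `norm='l2'`): the raw count is the term frequency, the smoothed inverse document frequency is ln((1 + n) / (1 + df)) + 1, and each row is then scaled to unit Euclidean length. The names `counts`, `idf` and `tfidf_row` are illustrative only.

```python
import math

# Count matrix copied from the bag-of-words output.
counts = [
    [0, 1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 1],
    [2, 3, 2, 1, 1, 1, 2, 1, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents does each term occur?
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]


def tfidf_row(row):
    # weight each count by its idf, then scale the row to unit L2 norm
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]


tfidf_matrix = [tfidf_row(row) for row in counts]
for row in tfidf_matrix:
    print([round(v, 2) for v in row])
```

Words such as 'is' and 'the' occur in all three documents, so their idf is the minimum value of 1. In the first two rows they score 0.43, below the 0.56 of the rarer words, although every word there occurs once.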


Text Data Cleaning

In [37]:
df_data.loc[10000, 'review'][-50:]

' it 8 out 10.<br /><br />(music-wise 10 out of 10)'

In [38]:
import re

In [43]:
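# a helper function to remove HTML markup while keeping emoticons such as =)
def text_cleaner(text):
    # strip HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # collect emoticons before the non-word characters are removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # lowercase, replace non-word runs with a space, then append the emoticons without noses
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    return text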

In [44]:
text_cleaner(df_data.loc[10000, 'review'][-50:])

' it 8 out 10 music wise 10 out of 10 '

In [45]:
text_cleaner('</a>This :) is :( a test :-)!')

'this is a test :) :( :)'

Word stemming to transform a word into its root form

In [46]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# function to split a text at whitespace characters
def tokenizer(text):
    return text.split()

def tokenizer_stemmer(text):
    return [stemmer.stem(word) for word in tokenizer(text)]

In [47]:
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [48]:
tokenizer_stemmer('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

Stop-word removal with downloaded stop words from NLTK

In [50]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/StevenYu/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [54]:
from nltk.corpus import stopwords

stop_words = stopwords.words('english')  # the corpus file name is lowercase
[w for w in tokenizer_stemmer('a runner likes running and runs a lot')[-10:] if w not in stop_words]

['runner', 'like', 'run', 'run', 'lot']
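
The cleaning, tokenizing and stop-word steps above can be chained into a single helper. The sketch below is one possible arrangement, not the notebook's own pipeline: `preprocess` is a hypothetical function, `DEMO_STOP_WORDS` is a tiny stand-in for NLTK's full English list so that the example runs without any download, and stemming is optional through the `stem` argument (for instance `PorterStemmer().stem`).

```python
import re

# Small illustrative stop-word set (an assumption; the notebook uses NLTK's list).
DEMO_STOP_WORDS = {'a', 'and', 'is', 'the', 'this'}


def text_cleaner(text):
    # same cleaning as above: drop HTML tags, keep emoticons at the end
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    return re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')


def preprocess(text, stop_words=DEMO_STOP_WORDS, stem=None):
    # clean, split on whitespace, optionally stem, then drop stop words
    tokens = text_cleaner(text).split()
    if stem is not None:
        tokens = [stem(token) for token in tokens]
    return [token for token in tokens if token not in stop_words]


print(preprocess('</a>This :) is :( a test :-)!'))
```

Stemming runs before the stop-word filter here, in the same order as the cell above.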