Many real-world problems involve mining data from text content such as blogs and forums. Text is non-numeric, yet most data mining algorithms are designed to process numeric values. As a result, a text-processing task needs one extra step: converting text to numbers. This notebook is the first to explore this avenue.

In this notebook, we use the IMDB sentiment dataset from [Stanford](http://ai.stanford.edu/~amaas/data/sentiment/). On the first run, unzip the downloaded data and then run the next cell, which builds a single CSV file from the individual reviews. On later runs, skip ahead to the second cell.
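The unzip step can also be done from Python. The following is a minimal sketch, assuming the download is a gzip-compressed tar archive; the helper name `extract_reviews` and the paths in the commented call are hypothetical and should be adapted to the local setup.

```python
import tarfile
from pathlib import Path

def extract_reviews(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return dest_dir as a Path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:gz" opens a gzip-compressed tar archive for reading
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    return dest

# Hypothetical local paths, shown only as an illustration:
# extract_reviews("C:/Dataset/aclImdb_v1.tar.gz", "C:/Dataset")
```

Only extract archives from a source you trust, since `extractall` writes whatever paths the archive contains.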

In [None]:
import os
import pandas as pd

os.chdir("C:/Dataset/aclImdb")

basepath = ''

labels = {'pos': 1, 'neg': 0}
# Collect rows in a list first: DataFrame.append was removed in pandas 2.0,
# and appending row by row is slow for 50,000 files anyway.
rows = []
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in os.listdir(path):
            # Python 3: open() accepts encoding directly (Python 2 would need io.open)
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])

df = pd.DataFrame(rows, columns=['review', 'sentiment'])
df.to_csv("IMDBSentiment.csv", index=False)

#### Once IMDBSentiment.csv is available, start the experiment from here

In [5]:
import pandas as pd
# The CSV is written with pandas' default UTF-8 encoding, so read it back the same way
df = pd.read_csv("C:/Dataset/aclImdb/IMDBSentiment.csv", encoding='utf-8')
df.head()

Unnamed: 0,review,sentiment
0,I went and saw this movie last night after bei...,1
1,Actor turned director Bill Paxton follows up h...,1
2,As a recreational golfer with some knowledge o...,1
3,"I saw this film in a sneak preview, and it is ...",1
4,Bill Paxton has taken the true story of the 19...,1


In [6]:
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)
vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
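To make the text-to-number step concrete, here is a small sketch of what a `CountVectorizer` produces once it is fitted. The three toy sentences are made up for illustration and are not taken from the IMDB data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (illustrative only, not from the dataset)
docs = ["the movie was great",
        "the movie was bad",
        "great great acting"]

toy_vectorizer = CountVectorizer()
bag = toy_vectorizer.fit_transform(docs)  # sparse document-term matrix

# vocabulary_ maps each token to its column index; order tokens by column
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
counts = bag.toarray()

print(vocab)
print(counts)
```

Each row of `counts` is one document and each column is one vocabulary word, so the entry is the number of times that word occurs in that document.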

In [8]:
#  display the last 500 characters
df.loc[0, 'review'][-500:]

' Ben Randall with such professionalism. The sign of a good movie is that it can toy with our emotions. This one did exactly that. The entire theater (which was sold out) was overcome by laughter during the first half of the movie, and were moved to tears during the second half. While exiting the theater I not only saw many women in tears, but many full grown men as well, trying desperately not to let anyone see them crying. This movie was great, and I suggest that you go see it before you judge.'

In [10]:
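def preprocessor(text):
    # Strip HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse runs of non-word characters into single spaces,
    # then append the emoticons with the "nose" hyphen removed
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    return text
preprocessor("</a>This :) is :( a test :-)!")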

'this is a test :) :( :)'
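One edge case worth noting: `preprocessor` concatenates the cleaned text and the emoticons without a separator, so the first emoticon is glued to the last word whenever the text ends in a word character. The variant below is a sketch of one possible fix, not the version used in the rest of this notebook; the name `preprocessor_spaced` is hypothetical.

```python
import re

def preprocessor_spaced(text):
    """Like preprocessor, but always separates words and emoticons by a space."""
    text = re.sub(r'<[^>]*>', '', text)                           # strip HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)  # collect emoticons
    words = re.sub(r'\W+', ' ', text.lower()).strip()             # cleaned, lowercased words
    # Join the word string and each emoticon (hyphen removed) with single spaces
    return ' '.join([words] + [e.replace('-', '') for e in emoticons])

spaced = preprocessor_spaced("good :) movie")
same_as_before = preprocessor_spaced("</a>This :) is :( a test :-)!")
```

For the test sentence above, this variant returns the same string as `preprocessor`; the two differ only when the input ends in a word character.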

In [11]:
df['review'] = df['review'].apply(preprocessor)

In [12]:
def tokenizer(text):
    return text.split()
tokenizer('People recognises your success  missing you must remember what you learn from your failures before.')

['People',
 'recognises',
 'your',
 'success',
 'missing',
 'you',
 'must',
 'remember',
 'what',
 'you',
 'learn',
 'from',
 'your',
 'failures',
 'before.']

In [13]:
# Porter stemming
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

tokenizer_porter('People recognize your achievement but you should think it as a past and look into future')

['peopl',
 'recogn',
 'your',
 'achiev',
 'but',
 'you',
 'should',
 'think',
 'it',
 'as',
 'a',
 'past',
 'and',
 'look',
 'into',
 'futur']

In [14]:
# SnowballStemmer
from nltk.stem import SnowballStemmer
snowball = SnowballStemmer('english')

def tokenizer_snowball(text):
    return [snowball.stem(word) for word in text.split()]

tokenizer_snowball('Feeling hungry about what you do not know is a sign of success ')

['feel',
 'hungri',
 'about',
 'what',
 'you',
 'do',
 'not',
 'know',
 'is',
 'a',
 'sign',
 'of',
 'success']

In [None]:
# LancasterStemmer
from nltk.stem.lancaster import LancasterStemmer
lancaster = LancasterStemmer()

def tokenizer_lancaster(text):
    return [lancaster.stem(word) for word in text.split()]

tokenizer_lancaster('Feeling hungry about what you do not know is a sign of success.')
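To see how the three stemmers differ, it can help to run them side by side on the same sentence. This is a minimal sketch; the helper `compare_stemmers` and the `stemmers` dictionary are hypothetical names introduced here for illustration.

```python
from nltk.stem import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.porter import PorterStemmer

stemmers = {
    'porter': PorterStemmer(),
    'snowball': SnowballStemmer('english'),
    'lancaster': LancasterStemmer(),
}

def compare_stemmers(text):
    """Return {word: {stemmer_name: stem}} for each whitespace-separated word."""
    return {
        word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
        for word in text.split()
    }

comparison = compare_stemmers(
    'Feeling hungry about what you do not know is a sign of success')
for word, stems in comparison.items():
    print(f"{word:10s} {stems['porter']:10s} "
          f"{stems['snowball']:10s} {stems['lancaster']:10s}")
```

Rows where the three columns disagree show where the stemmers apply different rules, which is a quick way to judge which one suits the vocabulary at hand.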