https://www.kaggle.com/c/word2vec-nlp-tutorial#part-1-for-beginners-bag-of-words

# Overview summary, tell whole story in a single left to right flow. what's the journey?

## what wil they be able to do after the journey? important in any training

### how this is relevant.

break up into chapters

dependencies

# installing dependencies. why choose them? 3 to 5 words
# free! even for commercial

In [1]:
import numpy as np
import pandas as pd
import re
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from tqdm import tqdm

import helper

In [2]:
nb_top_words = 5000  # Number of words to keep in the vocabulary

In [3]:
data_path = '/media/mike/tera/data/nlp/techvalley/' # Point this to the path to where IMDB data is stored

# where did data come from?


In [4]:
train = pd.read_csv(data_path + 'imdb_train.csv')

In [5]:
train.head()

Unnamed: 0,review,sentiment
0,After watching the Next Action Star reality TV...,1.0
1,I'm a bit conflicted over this. The show is on...,1.0
2,I originally reviewed this film on Amazon abou...,1.0
3,The violent and rebel twenty-five years old sa...,1.0
4,hello. i just watched this movie earlier today...,1.0


In [6]:
train['review'].iloc[-4]

'This is one of the most irritating, nonsensical movies I\'ve ever had the misfortune to sit through. Every time it started to look like it might be getting good, out come more sepia tone flashbacks, followed by paranoid idiocy masquerading as social commentary. The main character, Maddox, is a manipulative, would-be rebel who lives in a mansion seemingly without any parents or responsibility. The supporting cast are all far more likeable and interesting, but are unfortunately never developed. Nor do we ever really understand the John Stanton character supposedly influencing Maddox to commit the acts of rebellion. At one point, I thought "Aha! Maddox is just nuts and is secretly making up all those communications from escaped mental patient Stanton! Now we\'re getting somewhere!" but of course, that ends up to not be the case and the whole movie turns out to be pointless, both from Maddox\'s perspective and the viewer\'s. Where\'s Ferris Bueller when we need him?'

# what are stopwords?

In [7]:
nltk.download('stopwords') # Download the stopwords
stopwords = nltk.corpus.stopwords.words('english')
# stopwords.append('br') 
stopwords

[nltk_data] Downloading package stopwords to /home/mike/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 'her',
 'hers',
 'herself',
 'it',
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 '

In [8]:
def unstopper(toklist, stoplist=None):
    toklist = [w for w in toklist if not w in stoplist]
    wordstr = ' '.join(toklist)
    return wordstr

In [9]:
# Unleash the power of PANDAS! This is technically one line of code. But since I dislike perl-esque unreadable one-liners, I've split each operation to its own line

train_phrases = train['review']\
                .str.replace(r'<br \/>', ' ')\
                .str.replace(r'[^a-zA-Z]', ' ')\
                .str.lower()\

if False:
    train_phrases = train_phrases.str.split()\
                    .apply(unstopper, stoplist=stopwords)
    
# Note the use of backslash to split to multiple lines for readability
# Remove linebreak <br /> tags
# Replace all non-alphabetic with spaces
# Lowercase only
# Tokenize
# Remove all stopwords


#.str.replace(r' +', ' ')

In [10]:
train.head()['review'].str.split().str.join(' ')

0    After watching the Next Action Star reality TV...
1    I'm a bit conflicted over this. The show is on...
2    I originally reviewed this film on Amazon abou...
3    The violent and rebel twenty-five years old sa...
4    hello. i just watched this movie earlier today...
Name: review, dtype: object

In [11]:
train_phrases[0]

'after watching the next action star reality tv series  i was pleased to see the winners  movie right away  i was leery of such a showcase of new talent  but i was pleasantly surprised and thrilled  billy zane  of course  was his usual great self  but corinne and sean held their own beside him  it was also nice to see jared and jeanne  also from the competition  in their cameo roles  sean s character  not billy s  is the hunted  and his frustration at discovering new rules in the game is well played  corinne walks the tightrope well between her character liking sean s and only being in it for the money  i loved how the game was played right to the last second  and then beyond  not a great movie  but an entertaining one all the way and a great showcase for two folks on their first time out of the gate '

In [12]:
# Now we need to concatenate all the reviews into a single list so we can apply Bag of Words. 
big_list_train_phrases = []
for sentence in train_phrases.values:
    big_list_train_phrases.append(sentence)
print(len(big_list_train_phrases))
# big_list_train_phrases

25000


In [13]:
# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.  

vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             max_features = nb_top_words - 1) 

# We have to subtract one in order to make room for the null character. 


# explain vectorizing

In [14]:
train_data_features = vectorizer.fit_transform(big_list_train_phrases)

In [15]:
type(train_data_features)

scipy.sparse.csr.csr_matrix

In [16]:
train_data_features = train_data_features.toarray()
train_data_features.shape

(25000, 4999)

In [17]:
vocab = vectorizer.get_feature_names()

In [18]:
# Sum up the counts of each vocabulary word
freq = np.sum(train_data_features, axis=0)

# For each, print the vocabulary word and the number of times it 
# appears in the training set

In [19]:
df_vocab = pd.DataFrame(list(zip(vocab, freq)), columns=['vocab', 'freq'])

In [20]:
df_vocab

Unnamed: 0,vocab,freq
0,abandoned,187
1,abc,125
2,abilities,108
3,ability,454
4,able,1259
5,about,17375
6,above,819
7,abraham,85
8,absence,116
9,absent,83


In [21]:
df_vocab = df_vocab.sort_values(by='freq', ascending=False)
df_vocab.reset_index(drop=True, inplace=True)
df_vocab.index = df_vocab.index + 1   # We need to increase this to make room for our null character
df_vocab.head(10)

Unnamed: 0,vocab,freq
1,the,336758
2,and,164143
3,of,145867
4,to,135724
5,is,107337
6,it,96472
7,in,93981
8,this,76007
9,that,73287
10,was,48209


In [22]:
# Invert word/int pairs to get our lookup
vocab_idx = {key:value for (key, value) in zip(df_vocab['vocab'], df_vocab.index)}

In [23]:
def words_to_index(wordlist, vocab=None):
    """Minifunction for pandas.apply(). Replaces each word with respective index"""
    return [vocab[word] if word in vocab else 0 for word in wordlist]

In [24]:
train_idx = train_phrases.str.split().apply(words_to_index, vocab=vocab_idx)
train_idx.head()

0    [96, 142, 1, 363, 196, 311, 606, 235, 190, 0, ...
1    [0, 0, 0, 217, 0, 114, 8, 1, 116, 5, 18, 25, 4...
2    [0, 1786, 0, 8, 16, 18, 0, 39, 37, 149, 579, 2...
3    [1, 1079, 2, 4137, 1744, 651, 149, 148, 0, 467...
4    [4591, 0, 38, 286, 8, 14, 881, 493, 12, 1, 805...
Name: review, dtype: object

In [25]:
def get_vocab_index(data_phrases, verbose=False):
    """ Proccess an array-like of strings and generate Bag of Words vocab index
    """
    big_list_phrases = []
    for sentence in data_phrases.values:
        big_list_phrases.append(sentence)

    vectorizer = CountVectorizer(analyzer = "word",   \
                         tokenizer = None,    \
                         preprocessor = None, \
                         stop_words = None,   \
                         max_features = 5000) 

    if verbose: print('Vectorizing')
    data_features = vectorizer.fit_transform(big_list_phrases)
    data_features = data_features.toarray()
    freq = np.sum(train_data_features, axis=0)

    vocab = vectorizer.get_feature_names()
    df_vocab = pd.DataFrame(list(zip(vocab, freq)), columns=['vocab', 'freq'])
    df_vocab = df_vocab.sort_values(by='freq', ascending=False)
    df_vocab.reset_index(drop=True, inplace=True)
    df_vocab.index = df_vocab.index + 1   # We need to increase this to make room for our null character
    vocab_idx = {key:value for (key, value) in zip(df_vocab['vocab'], df_vocab.index)}
    return vocab_idx
    

def load_and_process_imdb_csv(file, vocab_idx=None, stopwords=None, header='infer', delimiter=None, quoting=0, show_proccessed=False, verbose=False):
    if verbose: print('Loading')
    data = pd.read_csv(file, header=header, delimiter=delimiter, quoting=quoting)
    if verbose: print('Preprocesing')
    data_phrases = data['review'].str.replace(r'<br \/>', ' ')\
                    .str.replace(r'[^a-zA-Z]', ' ').str.lower()

    if stopwords:
        data_phrases = data_phrases.str.split()\
                        .apply(unstopper, stoplist=stopwords)
            
    if vocab_idx is None:
        vocab_idx = get_vocab_index(data_phrases, verbose=verbose)

    if verbose: print('Indexing')
    data_idx = data_phrases.str.split().apply(words_to_index, vocab=vocab_idx)
    data['vectors'] = data_idx
    if show_proccessed:
        data['proccessed'] = data_phrases
    return data, vocab_idx
    
    

In [26]:
train_data, vocab_idx = load_and_process_imdb_csv(data_path + 'imdb_train.csv', verbose=1)

Loading
Preprocesing
Vectorizing
Indexing


In [27]:
train_data

Unnamed: 0,review,sentiment,vectors
0,After watching the Next Action Star reality TV...,1.0,"[96, 1038, 723, 317, 196, 4905, 897, 3767, 603..."
1,I'm a bit conflicted over this. The show is on...,1.0,"[0, 0, 0, 217, 0, 4734, 2007, 723, 4584, 1057,..."
2,I originally reviewed this film on Amazon abou...,1.0,"[0, 4796, 0, 2007, 785, 274, 0, 39, 4837, 4528..."
3,The violent and rebel twenty-five years old sa...,1.0,"[723, 3077, 2, 2226, 1402, 4245, 4528, 894, 0,..."
4,hello. i just watched this movie earlier today...,1.0,"[329, 0, 1306, 3590, 2007, 94, 385, 3853, 3652..."
5,Possibly the best movie ever created in the hi...,1.0,"[1163, 723, 112, 94, 169, 1044, 4398, 723, 549..."
6,"okay, but just plain dumb. Not bad for a horro...",1.0,"[148, 15, 1306, 1298, 3634, 2817, 71, 3652, 0,..."
7,"I remember seeing this film years ago on, I th...",1.0,"[0, 1984, 2671, 2007, 785, 4528, 579, 274, 0, ..."
8,One of director Miike Takashi's very best. It'...,1.0,"[637, 118, 151, 1824, 0, 0, 2430, 112, 1073, 0..."
9,I love this show! It's like watching a mini mo...,1.0,"[0, 438, 2007, 4584, 1073, 0, 411, 1038, 0, 36..."


In [28]:
test_data, _ = load_and_process_imdb_csv(data_path + 'imdb_test.csv', vocab_idx=vocab_idx, show_proccessed=1,  verbose=1)

Loading
Preprocesing
Indexing


In [29]:
rotten_data, _ = load_and_process_imdb_csv(data_path + 'rotten.csv', vocab_idx=vocab_idx, verbose=1)

Loading
Preprocesing
Indexing


In [30]:
test_data.head()

Unnamed: 0,review,sentiment,vectors,proccessed
0,After watching the Next Action Star reality TV...,1.0,"[96, 1038, 723, 317, 196, 4905, 897, 3767, 603...",after watching the next action star reality tv...
1,I'm a bit conflicted over this. The show is on...,1.0,"[0, 0, 0, 217, 0, 4734, 2007, 723, 4584, 1057,...",i m a bit conflicted over this the show is on...
2,I originally reviewed this film on Amazon abou...,1.0,"[0, 4796, 0, 2007, 785, 274, 0, 39, 4837, 4528...",i originally reviewed this film on amazon abou...
3,The violent and rebel twenty-five years old sa...,1.0,"[723, 3077, 2, 2226, 1402, 4245, 4528, 894, 0,...",the violent and rebel twenty five years old sa...
4,hello. i just watched this movie earlier today...,1.0,"[329, 0, 1306, 3590, 2007, 94, 385, 3853, 3652...",hello i just watched this movie earlier today...


In [31]:
rotten_data.head()

Unnamed: 0.1,Unnamed: 0,PhraseId,SentenceId,review,sentiment5,sentiment,vectors
0,0,1,1,A series of escapades demonstrating the adage ...,1,0,"[0, 603, 118, 0, 0, 723, 0, 1544, 816, 1057, 4..."
1,1,64,2,"This quiet , introspective and entertaining in...",4,1,"[2007, 4923, 0, 2, 692, 2667, 1057, 3497, 4748]"
2,2,82,3,"Even fans of Ismail Merchant 's work , I suspe...",1,0,"[2064, 752, 118, 0, 0, 0, 921, 0, 2984, 566, 7..."
3,3,117,4,A positively thrilling combination of ethnogra...,3,1,"[0, 0, 3429, 2176, 118, 0, 2, 26, 723, 3657, 0..."
4,4,157,5,Aggressive self-glorification and a manipulati...,1,0,"[0, 3849, 0, 2, 0, 4633, 0]"


In [32]:
# Shuffle the dataframe
rotten_data = rotten_data.sample(frac=1)
rotten_data.head()

Unnamed: 0.1,Unnamed: 0,PhraseId,SentenceId,review,sentiment5,sentiment,vectors
228,281,7037,282,"First , for a movie that tries to be smart , i...",1,0,"[4546, 3652, 0, 94, 1544, 2290, 493, 24, 1787,..."
3527,4384,84867,4389,A grand fart coming from a director beginning ...,0,0,"[0, 4289, 0, 563, 980, 0, 151, 442, 493, 0, 13..."
6462,8027,147785,8041,Given the fact that virtually no one is bound ...,0,0,"[396, 723, 2274, 1544, 3260, 3018, 637, 1057, ..."
5140,6389,119645,6399,There 's a lot of tooth in Roger Dodger .,3,1,"[1587, 0, 0, 751, 118, 338, 4398, 2788, 0]"
4265,5286,100741,5292,So aggressively cheery that Pollyana would rea...,1,0,"[1809, 0, 0, 1544, 0, 566, 3785, 3652, 0, 0, 3..."


In [33]:
m = len(rotten_data) // 2

In [34]:
rotten_train = rotten_data.iloc[:m]
rotten_test = rotten_data.iloc[m:]
print(rotten_train.shape, rotten_test.shape)

(3437, 7) (3437, 7)


In [35]:
# Make a numpy file so we can easily load it into the other notebook
helper.package_dataset(data_path + 'imdb.npz', np.array(train_data['vectors']), np.array(train_data['sentiment'], dtype='int16'), 
                                              np.array(test_data['vectors']), np.array(test_data['sentiment'], dtype='int16'))

helper.package_dataset(data_path + 'rotten.npz', np.array(rotten_train['vectors']), np.array(rotten_train['sentiment'], dtype='int16'), 
                                              np.array(rotten_test['vectors']), np.array(rotten_test['sentiment'], dtype='int16'))


In [36]:
data =np.load(data_path + 'imdb.npz')

In [37]:
type(data)

numpy.lib.npyio.NpzFile

In [38]:
data.keys()

['arr_0']

In [39]:
data['arr_0'].shape

(2, 2, 25000)

In [41]:
(a, b), (c, d) = helper.unpackage_dataset(data_path + 'imdb.npz')#data['arr_0']

In [None]:
b