https://www.kaggle.com/c/word2vec-nlp-tutorial#part-1-for-beginners-bag-of-words

# Natural Language Processing in Keras, Part 1

Welcome! In this tutorial, I am going to cover a few approaches to NLP in the Keras framework. We have a **lot** of ground to cover, and I will try to cater to as broad a range of skill levels as I can manage. If you have any questions, do not hesitate to ask! I'm kind of making this up as I go along, so I am not sure how balanced the difficulty of each section is. We should have plenty of time to cover with ample questions. 

This is a living document, so pardon the rough edges!

# What is NLP?

Natural language processing, or NLP, is a branch of computer science concerned with getting machines to understand and produce human language. Human language can be very tricky. Take the following [garden path sentences](https://en.wikipedia.org/wiki/Garden_path_sentence):

<br><center>**Time flies like an arrow. Fruit flies like a banana**</center>

In the first sentence, *flies* is a verb. In the second, the exact same word is a noun! That is why we humans find it humorous - it throws our brain in a direction it did not expect. To get this into machine-speak, we would have to do something like:

<br><center>**Time is similar to an arrow in that both move swiftly. Drosophila melanogaster enjoy consuming Musa acuminata**</center>

Kinda takes the fun out of it, huh? But that is exactly what we have to do when it comes to processing natural language with computers. We can understand language, in spoken or written form, often with ambiguous context, thanks to the exaflop processor sitting in our skull. Computers have one one-millionth of that to work with, so we are going to have to be clever. 

# The goal

Our primary task is fairly straightforward: given a written review of a movie, can we rate it as positive or negative? I am also working on a secondary challenge, which may or may not make it into the final tutorial. That challenge is, given a model trained on reviews from one source (IMDB), can we use it to accurately rate reviews from another source(Rotten Tomatoes)? The former is a toy example, which the latter has been a bit of a challenge for me! I'm experimenting as I go along, so even as I write this, I'm not sure how it'll turn out!

## Hurdles

There are two main hurdles in our way, and I will be covering some mainstream approaches to tackling them. The first is vocabulary: How do we take the 170,000+ words in the English language, and render that into a number which makes sense to a computer, and does not take a huge amount of information? This process is known as **embedding**. The second is context: As demonstrated in our above phrase, a word by itself is not guaranteed to have a single meaning. We have to look to the *context*, that is, the words and sentences around it. We will tackle this with **Convolutional networks** and **Recurrent networks**. 

## what wil they be able to do after the journey? important in any training

### how this is relevant.

break up into chapters

dependencies

![sequence](https://image.slidesharecdn.com/2-sentimentav0-140117023306-phpapp02/95/big-data-sentiment-analysis-3-638.jpg?cb=1389926058)

# How are computers going to understand sentiment? 

Computers only "understand" numbers. We need a way of converting words, concepts, ideas, and abstractions, into numerical form. Ideally, these numbers are not arbitrary, but convey something about the thing they represent. Just like the points (11,23) and (12,24) are close, wouldn't it be be great if similar concepts were "close"? This is the goal of the thought vector. 

![wat is thought vector](images/morpheus_thought.jpg)

## Thought vectors are a way of representing (encoding) ideas as a vector (collection of scalars). Computers readily process vectors (graphics, video game physics, etc) so this will be useful for processing. 

![thought vector](images/thought_vector.png)

> ...there's way too much information to decode the Matrix. You get used to it, though. Your brain does the translating. I don't even see the code. All I see is blonde, brunette, redhead. 

## I'm going to make the argument that the 'digital rain' in the Matrix is actually a representation of thought vectors. These vectors are very dense, whereas language is quite sparse. We will start off with a sparse representation, but later, our model will condense it for us.

![digital rain](images/matrix_rain.png)

## Ideally, these vectors will have this 'closeness' property we are after. 

![Socher-Bilingual-tsne](http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/img/Socher-BillingualTSNE.png)
<center> t-SNE visualization of the bilingual word embedding. Green is Chinese, Yellow is English. (Socher et al. (2013a))</center>

# The first part of this demo will convert the document to a (sparse) numerical representation

![flow 1](images/nlp01.png)

# The second part will use that numerical represntation to train a model, which will be used to predict sentiment from other reviews

![flow 2](images/nlp02.png)

# Dependencies

This tutorial requires the following dependencies:

#### Necessary
- Python 3.5
- pip - Python package manager
- Jupyter - Web based Notebook GUI
- Numpy - Numerical operations on arrays
- Pandas - Manipulation of DataFrames (think excel sheets)
- Scikit-Learn (sklearn) - Machine learning - document processing
- Keras - Deep learning models made easy
  - Keras will automatically install Tensorflow and Theano, underlying deep learning processing


#### Optional
- Virtualenv - Keep your python environments isolated
- NLTK - Natural language toolkit - for stopword dictionary. Also useful all-around for NLP
- tqdm  - Progress bars (*taqadum* is Arabic for "progress")
- keras-tqdm - Progress bars for Keras

#### Really basic setup 

Ideally, first you want to create a virtualenv. You can do this from a console (*nix*, OSX, sorry Windows! Not sure how to do this on Win)
This will create a folder 've', create a virtual env named 'keras' in that, and activate the virtual environment.

```bash
mkdir ~/ve && cd ~/ve
virtualenv -p `which python3` keras
source ~/ve/keras/bin/activate
```

#### Install essentials
With keras virtualenv active, 
```bash
pip install numpy scipy pandas jupyter scikit-learn tensorflow keras nltk tqdm keras-tqdm
```

If you have CUDA set up, you can try `tensorflow-gpu` instead of `tensorflow`. But this can be a headache so I do not want to get caught up on this. 

### All of these packages are free! Even for commercial use! Isn't technology awesome?

We will also need the source data, which I have mirrored here:
[NLP Data (dropbox)](https://www.dropbox.com/s/hu6mjaca9zkgmr8/nlp_data.zip?dl=0)

#### Let's import some libraries and get to coding!

In [1]:
import numpy as np
import pandas as pd
import re
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from tqdm import tqdm

import helper

In [2]:
nb_top_words = 5000  # Number of words to keep in the vocabulary

In [3]:
data_path = '/media/mike/tera/data/nlp/techvalley/' # Point this to the path to where the CSV data is unzipped to

# where did data come from?


In [4]:
train = pd.read_csv(data_path + 'imdb_train.csv')

In [5]:
train.head()

Unnamed: 0,review,sentiment
0,After watching the Next Action Star reality TV...,1.0
1,I'm a bit conflicted over this. The show is on...,1.0
2,I originally reviewed this film on Amazon abou...,1.0
3,The violent and rebel twenty-five years old sa...,1.0
4,hello. i just watched this movie earlier today...,1.0


In [6]:
train['review'].iloc[-4]

'This is one of the most irritating, nonsensical movies I\'ve ever had the misfortune to sit through. Every time it started to look like it might be getting good, out come more sepia tone flashbacks, followed by paranoid idiocy masquerading as social commentary. The main character, Maddox, is a manipulative, would-be rebel who lives in a mansion seemingly without any parents or responsibility. The supporting cast are all far more likeable and interesting, but are unfortunately never developed. Nor do we ever really understand the John Stanton character supposedly influencing Maddox to commit the acts of rebellion. At one point, I thought "Aha! Maddox is just nuts and is secretly making up all those communications from escaped mental patient Stanton! Now we\'re getting somewhere!" but of course, that ends up to not be the case and the whole movie turns out to be pointless, both from Maddox\'s perspective and the viewer\'s. Where\'s Ferris Bueller when we need him?'

# what are stopwords?

In [7]:
nltk.download('stopwords') # Download the stopwords
stopwords = nltk.corpus.stopwords.words('english')
# stopwords.append('br') 
stopwords

[nltk_data] Downloading package stopwords to /home/mike/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 'her',
 'hers',
 'herself',
 'it',
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 '

In [43]:
def unstopper(toklist, stoplist=None):
    """Remove all stopwords from the sentence. Takes a list of tokens (a split string)
    Make a list of words only if they are not in the stoplist, and then join it back together."""
    toklist = [w for w in toklist if not w in stoplist]
    wordstr = ' '.join(toklist)
    return wordstr

In [9]:
# Unleash the power of PANDAS! This is technically one line of code. But since I dislike perl-esque unreadable one-liners, I've split each operation to its own line

train_phrases = train['review']\
                .str.replace(r'<br \/>', ' ')\
                .str.replace(r'[^a-zA-Z]', ' ')\
                .str.lower()\

if False:
    train_phrases = train_phrases.str.split()\
                    .apply(unstopper, stoplist=stopwords)
    
# Note the use of backslash to split to multiple lines for readability
# Remove linebreak <br /> tags
# Replace all non-alphabetic with spaces
# Lowercase only
# Tokenize
# Remove all stopwords


#.str.replace(r' +', ' ')

In [11]:
# Look at the data after processing
train_phrases[0]

'after watching the next action star reality tv series  i was pleased to see the winners  movie right away  i was leery of such a showcase of new talent  but i was pleasantly surprised and thrilled  billy zane  of course  was his usual great self  but corinne and sean held their own beside him  it was also nice to see jared and jeanne  also from the competition  in their cameo roles  sean s character  not billy s  is the hunted  and his frustration at discovering new rules in the game is well played  corinne walks the tightrope well between her character liking sean s and only being in it for the money  i loved how the game was played right to the last second  and then beyond  not a great movie  but an entertaining one all the way and a great showcase for two folks on their first time out of the gate '

In [12]:
# Now we need to concatenate all the reviews into a single list so we can apply Bag of Words. 
big_list_train_phrases = []
for sentence in train_phrases.values:
    big_list_train_phrases.append(sentence)
print(len(big_list_train_phrases))
# big_list_train_phrases

25000


## Numerical representation  (Vectorization)



In [13]:
# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.  

vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             max_features = nb_top_words - 1) 

# We have to subtract one in order to make room for the null character. 


In [49]:
train_data_features = vectorizer.fit_transform(big_list_train_phrases)
train_data_features = train_data_features.toarray() # Convert to array for easier handling (this may be intense on memory)
train_data_features.shape

(25000, 4999)

In [46]:
# Get the vocabulary
vocab = vectorizer.get_feature_names()

# Sum up the counts of each vocabulary word
freq = np.sum(train_data_features, axis=0)

# For each, print the vocabulary word and the number of times it 
# appears in the training set

In [50]:
df_vocab = pd.DataFrame(list(zip(vocab, freq)), columns=['vocab', 'freq'])
df_vocab.head()

Unnamed: 0,vocab,freq
0,abandoned,187
1,abc,125
2,abilities,108
3,ability,454
4,able,1259


In [21]:
# Sort by frequency rank
df_vocab = df_vocab.sort_values(by='freq', ascending=False)
df_vocab.reset_index(drop=True, inplace=True)
df_vocab.index = df_vocab.index + 1   # We need to increase this to make room for our null character
df_vocab.head(10)

Unnamed: 0,vocab,freq
1,the,336758
2,and,164143
3,of,145867
4,to,135724
5,is,107337
6,it,96472
7,in,93981
8,this,76007
9,that,73287
10,was,48209


## Replace words with integers
Now that we have our vocabulary, we have to create a lookup table (simple dictionary) to replace each word with an integer. Use '0' for words not in the vocab. 

In [51]:
# Invert word/int pairs to get our lookup with word as the key
vocab_idx = {key:value for (key, value) in zip(df_vocab['vocab'], df_vocab.index)}

def words_to_index(wordlist, vocab=None):
    """Minifunction for pandas.apply(). Replaces each word with respective index. If it's not in the vocab, replace with 0"""
    return [vocab[word] if word in vocab else 0 for word in wordlist]

In [52]:
# Split each review and replace each word with an integer
train_idx = train_phrases.str.split().apply(words_to_index, vocab=vocab_idx)
train_idx.head()

0    [82, 4818, 4428, 2987, 39, 4152, 3544, 4608, 3...
1    [0, 0, 0, 437, 0, 3133, 4454, 4428, 3950, 2322...
2    [0, 3114, 0, 4454, 1676, 3088, 0, 5, 3107, 498...
3    [4428, 4749, 166, 3554, 4610, 1704, 4981, 3082...
4    [2045, 0, 2400, 4816, 4454, 2902, 1358, 4507, ...
Name: review, dtype: object

In [53]:
# Here is the whole process wrapped up in functions for convenience

def get_vocab_index(data_phrases, verbose=False):
    """ Proccess an array-like of strings and generate Bag of Words vocab index
    """
    big_list_phrases = []
    for sentence in data_phrases.values:
        big_list_phrases.append(sentence)

    vectorizer = CountVectorizer(analyzer = "word",   \
                         tokenizer = None,    \
                         preprocessor = None, \
                         stop_words = None,   \
                         max_features = 5000) 

    if verbose: print('Vectorizing')
    data_features = vectorizer.fit_transform(big_list_phrases)
    data_features = data_features.toarray()
    freq = np.sum(train_data_features, axis=0)

    vocab = vectorizer.get_feature_names()
    df_vocab = pd.DataFrame(list(zip(vocab, freq)), columns=['vocab', 'freq'])
    df_vocab = df_vocab.sort_values(by='freq', ascending=False)
    df_vocab.reset_index(drop=True, inplace=True)
    df_vocab.index = df_vocab.index + 1   # We need to increase this to make room for our null character
    vocab_idx = {key:value for (key, value) in zip(df_vocab['vocab'], df_vocab.index)}
    return vocab_idx
    

def load_and_process_imdb_csv(file, vocab_idx=None, stopwords=None, header='infer', delimiter=None, quoting=0, show_proccessed=False, verbose=False):
    if verbose: print('Loading')
    data = pd.read_csv(file, header=header, delimiter=delimiter, quoting=quoting)
    if verbose: print('Preprocesing')
    data_phrases = data['review'].str.replace(r'<br \/>', ' ')\
                    .str.replace(r'[^a-zA-Z]', ' ').str.lower()

    if stopwords:
        data_phrases = data_phrases.str.split()\
                        .apply(unstopper, stoplist=stopwords)
            
    if vocab_idx is None:
        vocab_idx = get_vocab_index(data_phrases, verbose=verbose)

    if verbose: print('Indexing')
    data_idx = data_phrases.str.split().apply(words_to_index, vocab=vocab_idx)
    data['vectors'] = data_idx
    if show_proccessed:
        data['proccessed'] = data_phrases
    return data, vocab_idx
    
    

# Process IMDB train/test, and Rotten Tomatoes
Now that we have our vocabulary, we can process all of our datasets. We need to use the same vocabulary in order to get reliable results. I really should do the bag of words on the full dataset, but I am running low on time!

In [26]:
train_data, vocab_idx = load_and_process_imdb_csv(data_path + 'imdb_train.csv', verbose=1)

Loading
Preprocesing
Vectorizing
Indexing


In [54]:
train_data.head()

Unnamed: 0,review,sentiment,vectors
0,After watching the Next Action Star reality TV...,1.0,"[96, 1038, 723, 317, 196, 4905, 897, 3767, 603..."
1,I'm a bit conflicted over this. The show is on...,1.0,"[0, 0, 0, 217, 0, 4734, 2007, 723, 4584, 1057,..."
2,I originally reviewed this film on Amazon abou...,1.0,"[0, 4796, 0, 2007, 785, 274, 0, 39, 4837, 4528..."
3,The violent and rebel twenty-five years old sa...,1.0,"[723, 3077, 2, 2226, 1402, 4245, 4528, 894, 0,..."
4,hello. i just watched this movie earlier today...,1.0,"[329, 0, 1306, 3590, 2007, 94, 385, 3853, 3652..."


In [28]:
test_data, _ = load_and_process_imdb_csv(data_path + 'imdb_test.csv', vocab_idx=vocab_idx, show_proccessed=1,  verbose=1)

Loading
Preprocesing
Indexing


In [29]:
rotten_data, _ = load_and_process_imdb_csv(data_path + 'rotten.csv', vocab_idx=vocab_idx, verbose=1)

Loading
Preprocesing
Indexing


In [30]:
test_data.head()

Unnamed: 0,review,sentiment,vectors,proccessed
0,After watching the Next Action Star reality TV...,1.0,"[96, 1038, 723, 317, 196, 4905, 897, 3767, 603...",after watching the next action star reality tv...
1,I'm a bit conflicted over this. The show is on...,1.0,"[0, 0, 0, 217, 0, 4734, 2007, 723, 4584, 1057,...",i m a bit conflicted over this the show is on...
2,I originally reviewed this film on Amazon abou...,1.0,"[0, 4796, 0, 2007, 785, 274, 0, 39, 4837, 4528...",i originally reviewed this film on amazon abou...
3,The violent and rebel twenty-five years old sa...,1.0,"[723, 3077, 2, 2226, 1402, 4245, 4528, 894, 0,...",the violent and rebel twenty five years old sa...
4,hello. i just watched this movie earlier today...,1.0,"[329, 0, 1306, 3590, 2007, 94, 385, 3853, 3652...",hello i just watched this movie earlier today...


In [31]:
rotten_data.head()

Unnamed: 0.1,Unnamed: 0,PhraseId,SentenceId,review,sentiment5,sentiment,vectors
0,0,1,1,A series of escapades demonstrating the adage ...,1,0,"[0, 603, 118, 0, 0, 723, 0, 1544, 816, 1057, 4..."
1,1,64,2,"This quiet , introspective and entertaining in...",4,1,"[2007, 4923, 0, 2, 692, 2667, 1057, 3497, 4748]"
2,2,82,3,"Even fans of Ismail Merchant 's work , I suspe...",1,0,"[2064, 752, 118, 0, 0, 0, 921, 0, 2984, 566, 7..."
3,3,117,4,A positively thrilling combination of ethnogra...,3,1,"[0, 0, 3429, 2176, 118, 0, 2, 26, 723, 3657, 0..."
4,4,157,5,Aggressive self-glorification and a manipulati...,1,0,"[0, 3849, 0, 2, 0, 4633, 0]"


In [32]:
# Shuffle the dataframe
rotten_data = rotten_data.sample(frac=1)
rotten_data.head()

Unnamed: 0.1,Unnamed: 0,PhraseId,SentenceId,review,sentiment5,sentiment,vectors
228,281,7037,282,"First , for a movie that tries to be smart , i...",1,0,"[4546, 3652, 0, 94, 1544, 2290, 493, 24, 1787,..."
3527,4384,84867,4389,A grand fart coming from a director beginning ...,0,0,"[0, 4289, 0, 563, 980, 0, 151, 442, 493, 0, 13..."
6462,8027,147785,8041,Given the fact that virtually no one is bound ...,0,0,"[396, 723, 2274, 1544, 3260, 3018, 637, 1057, ..."
5140,6389,119645,6399,There 's a lot of tooth in Roger Dodger .,3,1,"[1587, 0, 0, 751, 118, 338, 4398, 2788, 0]"
4265,5286,100741,5292,So aggressively cheery that Pollyana would rea...,1,0,"[1809, 0, 0, 1544, 0, 566, 3785, 3652, 0, 0, 3..."


In [34]:
# Split the data in half to get training and test sets. 
m = len(rotten_data) // 2
rotten_train = rotten_data.iloc[:m]
rotten_test = rotten_data.iloc[m:]
print(rotten_train.shape, rotten_test.shape)

(3437, 7) (3437, 7)


In [35]:
# Make a numpy file so we can easily load it into the other notebook
helper.package_dataset(data_path + 'imdb.npz', np.array(train_data['vectors']), np.array(train_data['sentiment'], dtype='int16'), 
                                              np.array(test_data['vectors']), np.array(test_data['sentiment'], dtype='int16'))

helper.package_dataset(data_path + 'rotten.npz', np.array(rotten_train['vectors']), np.array(rotten_train['sentiment'], dtype='int16'), 
                                              np.array(rotten_test['vectors']), np.array(rotten_test['sentiment'], dtype='int16'))


# And that's it for preprocessing! Move on to the TVMLAI_CNN_LSTM for Part 2!