## 1. Movie Review Dataset

The movie data is a collection of movie reviews retreived from the IMDB.com websit in the early 2000s by Bo Pang and Lee. The reviews were collected and made available as part of their research on NLP.

The dataset is comprised of 1,000 positive and 1,000 negative movie reviews drawn from an archive of the rec.arts.movies.reviews newsgroup hosted at imdb.com. The authors refer to this dataset as the “polarity dataset.



Download data from here: https://raw.githubusercontent.com/jbrownlee/Datasets/master/review_polarity.tar.gz

### Directory set up
- texttoken
- - pos (Contains the postive review in txt format)
- - neg (Contains the negative review in txt format)


In [3]:
from nltk.corpus import stopwords
import string

In [4]:
# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, "r")
	text = file.read()
	file.close()

	return text

In [10]:
# turn a doc into clean tokens
def clean_doc(doc):
    # split into token by white space
    tokens =doc.split()
    # remove punctuation from each token
    table = str.maketrans("","", string.punctuation)

    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic

    tokens =  [word for word in tokens if word.isalpha()]

    # filter out the stop words
    stop_words = set(stopwords.words("english"))

    tokens = [w for w in tokens if w not in stop_words]
    # filter out the short tokens
    tokens = [word for word in tokens if len(word)>1]


    return tokens


In [11]:
# load the document
filename = 'txt_sentoken/pos/cv000_29590.txt'
text = load_doc(filename)
tokens = clean_doc(text)
print(tokens)

['films', 'adapted', 'comic', 'books', 'plenty', 'success', 'whether', 'theyre', 'superheroes', 'batman', 'superman', 'spawn', 'geared', 'toward', 'kids', 'casper', 'arthouse', 'crowd', 'ghost', 'world', 'theres', 'never', 'really', 'comic', 'book', 'like', 'hell', 'starters', 'created', 'alan', 'moore', 'eddie', 'campbell', 'brought', 'medium', 'whole', 'new', 'level', 'mid', 'series', 'called', 'watchmen', 'say', 'moore', 'campbell', 'thoroughly', 'researched', 'subject', 'jack', 'ripper', 'would', 'like', 'saying', 'michael', 'jackson', 'starting', 'look', 'little', 'odd', 'book', 'graphic', 'novel', 'pages', 'long', 'includes', 'nearly', 'consist', 'nothing', 'footnotes', 'words', 'dont', 'dismiss', 'film', 'source', 'get', 'past', 'whole', 'comic', 'book', 'thing', 'might', 'find', 'another', 'stumbling', 'block', 'hells', 'directors', 'albert', 'allen', 'hughes', 'getting', 'hughes', 'brothers', 'direct', 'seems', 'almost', 'ludicrous', 'casting', 'carrot', 'top', 'well', 'anythi

In [2]:
from string import punctuation
from os import listdir
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D

In [15]:
# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
	# load doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# update counts
	vocab.update(tokens)

In [16]:
def process_docs(directory, vocab, is_trian:bool):
	documents = list()
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip any reviews in the test set
		if is_trian and filename.startswith('cv9'): # only consdering first 900 data for trainig
			continue
		if not is_trian and not filename.startswith('cv9'):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		
		add_doc_to_vocab(path, vocab)

## 2. Define Vocabulary

In [17]:
from collections import Counter
vocab = Counter()
process_docs('txt_sentoken/neg', vocab, True)
process_docs('txt_sentoken/pos', vocab, True)

In [20]:
print("Most Common 50 words from this dataset of IMBD reviews")
print(vocab.most_common(50))

Most Common 50 words from this dataset of IMBD reviews
[('film', 7983), ('one', 4946), ('movie', 4826), ('like', 3201), ('even', 2262), ('good', 2080), ('time', 2041), ('story', 1907), ('films', 1873), ('would', 1844), ('much', 1824), ('also', 1757), ('characters', 1735), ('get', 1724), ('character', 1703), ('two', 1643), ('first', 1588), ('see', 1557), ('way', 1515), ('well', 1511), ('make', 1418), ('really', 1407), ('little', 1351), ('life', 1334), ('plot', 1288), ('people', 1269), ('bad', 1248), ('could', 1248), ('scene', 1241), ('movies', 1238), ('never', 1201), ('best', 1179), ('new', 1140), ('scenes', 1135), ('man', 1131), ('many', 1130), ('doesnt', 1118), ('know', 1092), ('dont', 1086), ('hes', 1024), ('great', 1014), ('another', 992), ('action', 985), ('love', 977), ('us', 967), ('go', 952), ('director', 948), ('end', 946), ('something', 945), ('still', 936)]


In [23]:
# keep tokens with a min occurrece

main_occurance =2
tokens = [k for k,c in vocab.items() if c >= main_occurance]

len(tokens)

25767

In [26]:
tokens

['plot',
 'two',
 'teen',
 'couples',
 'go',
 'church',
 'party',
 'drink',
 'drive',
 'get',
 'accident',
 'one',
 'guys',
 'dies',
 'girlfriend',
 'continues',
 'see',
 'life',
 'nightmares',
 'whats',
 'deal',
 'watch',
 'movie',
 'sorta',
 'find',
 'critique',
 'mindfuck',
 'generation',
 'touches',
 'cool',
 'idea',
 'presents',
 'bad',
 'package',
 'makes',
 'review',
 'even',
 'harder',
 'write',
 'since',
 'generally',
 'applaud',
 'films',
 'attempt',
 'break',
 'mold',
 'mess',
 'head',
 'lost',
 'highway',
 'memento',
 'good',
 'ways',
 'making',
 'types',
 'folks',
 'didnt',
 'snag',
 'correctly',
 'seem',
 'taken',
 'pretty',
 'neat',
 'concept',
 'executed',
 'terribly',
 'problems',
 'well',
 'main',
 'problem',
 'simply',
 'jumbled',
 'starts',
 'normal',
 'downshifts',
 'fantasy',
 'world',
 'audience',
 'member',
 'going',
 'dreams',
 'characters',
 'coming',
 'back',
 'dead',
 'others',
 'look',
 'like',
 'strange',
 'apparitions',
 'disappearances',
 'chase',
 'scen

In [24]:
# save list to file
def save_list(lines, filename):
	# convert lines to a single blob of text
	data = '\n'.join(lines)
	# open file
	file = open(filename, 'w')
	# write text
	file.write(data)
	# close file
	file.close()

In [25]:
save_list(tokens, 'vocab.txt') # We will use this later for training the embedding

## 3. Train Embedding layer

In [5]:
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)

In [6]:
# update clean_doc 
def clean_doc(doc, vocab):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', punctuation)
	tokens = [w.translate(table) for w in tokens]
	# filter out tokens not in vocab
	tokens = [w for w in tokens if w in vocab]
	tokens = ' '.join(tokens)
	return tokens

In [7]:
# update process_docs
def process_docs(directory, vocab, is_trian):
    document=[]

    for filename in listdir(directory):
        if is_trian and filename.startswith("cv9"):
            continue
        if not is_trian and not filename.startswith("cv9"):
            continue
        path = directory + "/" + filename

        doc = load_doc(path)
        tokens = clean_doc(doc, vocab)

        document.append(tokens)
        return document


In [8]:
# load all training reviews
positive_docs = process_docs('txt_sentoken/pos', vocab, True)
negative_docs = process_docs('txt_sentoken/neg', vocab, True)
train_docs = negative_docs + positive_docs

**The Keras Embedding layer requires integer inputs where each integer maps to a single token that has a specific real-valued vector representation within the embedding. These vectors are random at the beginning of training, but during training become meaningful to the network.**


We can encode the training documents as sequences of integers using the Tokenizer class in the Keras API.

In [68]:
tokenizer = Tokenizer()

In [69]:
tokenizer.fit_on_texts(train_docs)

In [70]:
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(train_docs)

#### We also need to ensure that all documents have the same length.

This is a requirement of Keras for efficient computation. We could truncate reviews to the smallest size or zero-pad (pad with the value ‘0’) reviews to the maximum length, or some hybrid. In this case, we will pad all reviews to the length of the longest review in the training dataset.

In [71]:
## pad sequence
max_length = max([len(s.split()) for s in train_docs])

Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding="post")

In [72]:
# define training labels (This may not work if you load and add negative docs first)
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])

In [73]:
len(ytrain)

1800

In [74]:
len(Xtrain)

2

In [75]:
# load the test reviews
positive_docs = process_docs('txt_sentoken/pos', vocab, False)
negative_docs = process_docs('txt_sentoken/neg', vocab, False)
test_docs = negative_docs + positive_docs

In [76]:
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(test_docs)
#max_length = max([len(s.split()) for s in train_docs])
# pad sequences
Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define test labels
ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])

## We are now ready to define our neural network model.

In [77]:
# define vocabulary size (largest integer value)
vocab_size = len(tokenizer.word_index) + 1

In [78]:
# vector space 
vec_space = 100  # One can use 50 , 150 etc

In [102]:
## embedding NN model
model = Sequential()
model.add(Embedding(vocab_size,
                    100,
                    input_length=max_length
                    )) 
model.add(Conv1D(filters=32, kernel_size=8, activation="relu"))

model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())

model.add(Dense(10,activation="relu"))
model.add(Flatten())
model.add(Dense(1, activation="sigmoid"))



In [103]:
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 387, 100)          54000     
                                                                 
 conv1d_3 (Conv1D)           (None, 380, 32)           25632     
                                                                 
 max_pooling1d_3 (MaxPooling  (None, 190, 32)          0         
 1D)                                                             
                                                                 
 flatten_3 (Flatten)         (None, 6080)              0         
                                                                 
 dense_6 (Dense)             (None, 10)                60810     
                                                                 
 flatten_4 (Flatten)         (None, 10)                0         
                                                      

We use a binary cross entropy loss function because the problem we are learning is a binary classification problem. 

In [104]:
#compile network
model.compile(loss= "binary_crossentropy", optimizer = "adam", metrics=["accuracy"])

In [50]:
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])

In [51]:
ytrain.shape

(1800,)

In [93]:
x= Xtrain
y= ytrain.reshape(2,-1)

In [101]:
x= x.reshape(2,387)

In [99]:
x.shape

(1, 774)

In [94]:
y.shape

(2, 900)

In [105]:
# fit network

model.fit(Xtrain, y, epochs=10, verbose=2)

Epoch 1/10


ValueError: in user code:

    File "C:\Users\temp\anaconda3\envs\tf_gpu\lib\site-packages\keras\engine\training.py", line 1021, in train_function  *
        return step_function(self, iterator)
    File "C:\Users\temp\anaconda3\envs\tf_gpu\lib\site-packages\keras\engine\training.py", line 1010, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "C:\Users\temp\anaconda3\envs\tf_gpu\lib\site-packages\keras\engine\training.py", line 1000, in run_step  **
        outputs = model.train_step(data)
    File "C:\Users\temp\anaconda3\envs\tf_gpu\lib\site-packages\keras\engine\training.py", line 860, in train_step
        loss = self.compute_loss(x, y, y_pred, sample_weight)
    File "C:\Users\temp\anaconda3\envs\tf_gpu\lib\site-packages\keras\engine\training.py", line 918, in compute_loss
        return self.compiled_loss(
    File "C:\Users\temp\anaconda3\envs\tf_gpu\lib\site-packages\keras\engine\compile_utils.py", line 201, in __call__
        loss_value = loss_obj(y_t, y_p, sample_weight=sw)
    File "C:\Users\temp\anaconda3\envs\tf_gpu\lib\site-packages\keras\losses.py", line 141, in __call__
        losses = call_fn(y_true, y_pred)
    File "C:\Users\temp\anaconda3\envs\tf_gpu\lib\site-packages\keras\losses.py", line 245, in call  **
        return ag_fn(y_true, y_pred, **self._fn_kwargs)
    File "C:\Users\temp\anaconda3\envs\tf_gpu\lib\site-packages\keras\losses.py", line 1932, in binary_crossentropy
        backend.binary_crossentropy(y_true, y_pred, from_logits=from_logits),
    File "C:\Users\temp\anaconda3\envs\tf_gpu\lib\site-packages\keras\backend.py", line 5247, in binary_crossentropy
        return tf.nn.sigmoid_cross_entropy_with_logits(labels=target, logits=output)

    ValueError: `logits` and `labels` must have the same shape, received ((None, 1) vs (None, 900)).
