Word embeddings are a technique for representing text where different words with similar meaning have a similar real-valued vector representation. 

They are a key breakthrough that has led to great performance of neural network models on a suite of challenging natural language processing problems. In this project, we will discover how to develop word embedding models with convolutional neural networks to classify movie reviews. 
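To make "similar vector representation" concrete, here is a minimal sketch that compares word vectors with cosine similarity. The three-dimensional vectors are invented for illustration and do not come from any trained model; real embeddings are learned and have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings, made up for this example only.
toy_embeddings = {
    'good':  [0.9, 0.1, 0.3],
    'great': [0.8, 0.2, 0.4],
    'awful': [-0.7, 0.9, 0.1],
}

# Words with similar meaning should score higher than words with opposite meaning.
sim_close = cosine_similarity(toy_embeddings['good'], toy_embeddings['great'])
sim_far = cosine_similarity(toy_embeddings['good'], toy_embeddings['awful'])
print(sim_close, sim_far)
```

With these toy values, 'good' and 'great' point in nearly the same direction, while 'good' and 'awful' do not.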

**Steps:** 

1. Prepare movie review text data for classification with deep learning methods.

2. Develop a neural classification model with word embedding and convolutional layers.

3. Evaluate the developed neural classification model.

In [None]:
# importing libraries

import tensorflow as tf

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd

In [None]:
import re
import string

In [None]:
# download the data from the website and unzip it using keras/tensorflow
zip_path=tf.keras.utils.get_file(origin='https://raw.githubusercontent.com/jbrownlee/Datasets/master/review_polarity.tar.gz',
                                 fname='review_polarity.tar.gz',
                                 extract=True)

In [None]:
!ls /root/.keras/datasets/txt_sentoken

neg  pos


In [None]:
!ls /root/.keras/datasets/txt_sentoken/neg/

cv000_29416.txt  cv250_26462.txt  cv500_10722.txt  cv750_10606.txt
cv001_19502.txt  cv251_23901.txt  cv501_12675.txt  cv751_17208.txt
cv002_17424.txt  cv252_24974.txt  cv502_10970.txt  cv752_25330.txt
cv003_12683.txt  cv253_10190.txt  cv503_11196.txt  cv753_11812.txt
cv004_12641.txt  cv254_5870.txt   cv504_29120.txt  cv754_7709.txt
cv005_29357.txt  cv255_15267.txt  cv505_12926.txt  cv755_24881.txt
cv006_17022.txt  cv256_16529.txt  cv506_17521.txt  cv756_23676.txt
cv007_4992.txt	 cv257_11856.txt  cv507_9509.txt   cv757_10668.txt
cv008_29326.txt  cv258_5627.txt   cv508_17742.txt  cv758_9740.txt
cv009_29417.txt  cv259_11827.txt  cv509_17354.txt  cv759_15091.txt
cv010_29063.txt  cv260_15652.txt  cv510_24758.txt  cv760_8977.txt
cv011_13044.txt  cv261_11855.txt  cv511_10360.txt  cv761_13769.txt
cv012_29411.txt  cv262_13812.txt  cv512_17618.txt  cv762_15604.txt
cv013_10494.txt  cv263_20693.txt  cv513_7236.txt   cv763_16486.txt
cv014_15600.txt  cv264_14108.txt  cv514_12173.txt  cv764_12701.txt

# **Loading and Cleaning Reviews**

1. Split tokens on white space.

2. Remove all punctuation from words.

3. Remove all words that are not purely comprised of alphabetical characters.

4. Remove all words that are known stop words.

5. Remove all words that have a length ≤ 1.
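The five steps above can be sketched on a single sentence as follows. To keep the example self-contained, it uses a tiny hypothetical stop-word set (`TOY_STOP_WORDS`) instead of the NLTK list, and the function name `clean_tokens` is illustrative only.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list so the sketch runs without NLTK.
TOY_STOP_WORDS = {'the', 'a', 'is', 'it', 'and', 'of', 'was'}

def clean_tokens(text):
    # 1. split on white space
    tokens = text.split()
    # 2. strip punctuation characters from each token
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    tokens = [re_punc.sub('', t) for t in tokens]
    # 3. keep only purely alphabetic tokens (drops '' and '2hour')
    tokens = [t for t in tokens if t.isalpha()]
    # 4. drop stop words
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]
    # 5. drop tokens of length <= 1
    tokens = [t for t in tokens if len(t) > 1]
    return tokens

sample = "the plot was thin , and it's a 2-hour film i can't recommend !"
cleaned = clean_tokens(sample)
print(cleaned)
```

Note that removing punctuation turns "it's" into "its" and "can't" into "cant", which is why tokens such as 'doesnt' and 'dont' appear in the vocabulary later on.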

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from collections import Counter

# load doc into memory
def load_doc(filename):
  # open the file as read only; the with block closes it automatically
  with open(filename, 'r') as file:
    # read all text
    text = file.read()
  return text



# turn a doc into clean tokens
def clean_doc(doc): 

  # split into tokens by white space 
  tokens = doc.split() 
    
  # prepare regex for char filtering
  re_punc = re.compile('[%s]' % re.escape(string.punctuation))

  # remove punctuation from each word 
  tokens = [re_punc.sub('', w) for w in tokens]

  # remove remaining tokens that are not alphabetic
  tokens = [word for word in tokens if word.isalpha()] 

  # filter out stop words 
  stop_words = set(stopwords.words('english'))
  tokens = [w for w in tokens if w not in stop_words]

  # filter out short tokens 
  tokens = [word for word in tokens if len(word) > 1]
  return tokens

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# **Define a Vocabulary:**

It is important to define a vocabulary of known words when using a text model. The more words, the larger the representation of documents, so it is important to constrain the vocabulary to only those words believed to be predictive.

1. This is difficult to know beforehand, and it is often important to test different hypotheses about how to construct a useful vocabulary. We have already seen in the previous step how to remove punctuation and numbers from the vocabulary.

2. We can repeat this for all documents and build a set of all known words. We can develop the vocabulary as a Counter, which is a dictionary mapping words to their counts that we can easily update and query.

3. Each document can be added to the counter (a new function called add_doc_to_vocab()), and we can step over all of the reviews in the negative directory and then the positive directory (a new function called process_docs()).
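As a small illustration of why a Counter is convenient here, the sketch below updates one with two hypothetical token lists (standing in for cleaned reviews) and then queries it. The variable names are illustrative only.

```python
from collections import Counter

vocab_demo = Counter()

# Hypothetical token lists standing in for two cleaned reviews.
review_a = ['film', 'great', 'story', 'film']
review_b = ['film', 'boring', 'story']

# update() adds the counts of each token to the running totals
vocab_demo.update(review_a)
vocab_demo.update(review_b)

# query the most frequent words and filter by a minimum count
top_two = vocab_demo.most_common(2)
frequent = sorted(w for w, c in vocab_demo.items() if c >= 2)
print(top_two, frequent)
```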

In [None]:
# load doc and add to vocab
def add_doc_to_vocab(filename,vocab):
  # load doc
  doc=load_doc(filename)
  #clean doc
  tokens=clean_doc(doc)
  # update Counts
  vocab.update(tokens)

In [None]:

# load all docs in a directory
def process_docs(directory, vocab):
  # walk through all files in the folder
  for filename in os.listdir(directory):
    # skip any reviews in the test set
    if filename.startswith('cv9'):
      continue
    # create the full path of the file to open
    path = os.path.join(directory, filename)
    # add doc to vocab
    add_doc_to_vocab(path, vocab)


In [None]:

#define vocab
vocab=Counter()

#add all docs to vocab
process_docs('/root/.keras/datasets/txt_sentoken/neg/',vocab)
process_docs('/root/.keras/datasets/txt_sentoken/pos/',vocab)

#print size of vocab
print(len(vocab))

# print top words in vocab
print(vocab.most_common(50))


44276
[('film', 7983), ('one', 4946), ('movie', 4826), ('like', 3201), ('even', 2262), ('good', 2080), ('time', 2041), ('story', 1907), ('films', 1873), ('would', 1844), ('much', 1824), ('also', 1757), ('characters', 1735), ('get', 1724), ('character', 1703), ('two', 1643), ('first', 1588), ('see', 1557), ('way', 1515), ('well', 1511), ('make', 1418), ('really', 1407), ('little', 1351), ('life', 1334), ('plot', 1288), ('people', 1269), ('bad', 1248), ('could', 1248), ('scene', 1241), ('movies', 1238), ('never', 1201), ('best', 1179), ('new', 1140), ('scenes', 1135), ('man', 1131), ('many', 1130), ('doesnt', 1118), ('know', 1092), ('dont', 1086), ('hes', 1024), ('great', 1014), ('another', 992), ('action', 985), ('love', 977), ('us', 967), ('go', 952), ('director', 948), ('end', 946), ('something', 945), ('still', 936)]


In [None]:
# keep tokens with a min occurrence

min_occurrence=2
tokens=[k for k,v in vocab.items() if v>=min_occurrence]
print(len(tokens))

25767


In [None]:
# save list to file
def save_list(lines, filename):
  # convert lines to a single blob of text
  data = '\n'.join(lines)
  # open the file and write the text; the with block closes the file
  with open(filename, 'w') as file:
    file.write(data)

# save tokens to a vocabulary file
save_list(tokens, 'vocab.txt')

In [None]:
filename='/root/.keras/datasets/txt_sentoken/pos/cv000_29590.txt'
doc=load_doc(filename)
print(doc)
tokens=clean_doc(doc)
print(tokens)

films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . 
for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . 
to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . 
the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . 
in other words , don't dismiss this film because of its source . 
if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . 
getting the hughes brothers to direct this seems almost as 


In [None]:
stop_words=set(stopwords.words('english'))

# **Train CNN With Embedding Layer**

In this step we will learn a word embedding while training a convolutional neural network on the classification problem.


A word embedding is a way of representing text where each word in the vocabulary is represented by a real-valued vector in a high-dimensional space. The vectors are learned in such a way that words with similar meanings have similar representations (they are close together in the vector space).

This is a more expressive representation for text than more classical methods like bag-of-words, where relationships between words or tokens are ignored, or forced in bigram and trigram approaches.

The real-valued vector representation for words can be learned while training the neural network. We can do this in the Keras deep learning library using the Embedding layer.

The first step is to load the vocabulary. We will use it to filter out words from movie reviews that we are not interested in.

In the previous section, we saved the vocabulary to a local file called vocab.txt with one word per line. We can load that file and build a vocabulary as a set for checking the validity of tokens.

In [None]:

def load_doc(filename):
  file=open(filename,'r')
  text=file.read()
  file.close()
  return text

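def clean_doc(doc,vocab):
  # split into tokens by white space
  tokens=doc.split()
  # remove punctuation from each token
  re_punc=re.compile('[%s]'%re.escape(string.punctuation))
  tokens=[re_punc.sub('',w) for w in tokens]
  # convert to lower case
  tokens=[word.lower() for word in tokens]
  # remove tokens that are not alphabetic
  tokens=[word for word in tokens if word.isalpha()]
  #tokens=[word for word in tokens if not word in stop_words]
  # keep only tokens that are in the vocabulary
  tokens=[word for word in tokens if word in vocab]
  # join the tokens back into a single space-separated string
  tokens=' '.join(tokens)
  return tokens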

In [None]:
def preprocess_doc(directory,vocab,is_train):
  documents=list()

  for filename in os.listdir(directory):

    if is_train and filename.startswith('cv9'):
      continue
    if not is_train and not filename.startswith('cv9'):
      continue
    
    path=directory +'/'+filename

    doc=load_doc(path)
    tokens=clean_doc(doc,vocab)
    documents.append(tokens)
  return documents

In [None]:
filename='vocab.txt'
vocabt=load_doc(filename)
vocab=set(vocabt.split())

In [None]:
import numpy as np


We can call the preprocess_doc() function for both the neg and pos directories and combine the reviews into a single train or test dataset.

We can also define the class labels for the dataset. The load_clean_doc() function below loads all reviews and prepares class labels for the training or test dataset.

In [None]:
def load_clean_doc(vocab,is_train):
  # load documents
  neg=preprocess_doc('/root/.keras/datasets/txt_sentoken/neg/',vocab,is_train)
  pos=preprocess_doc('/root/.keras/datasets/txt_sentoken/pos/',vocab,is_train)
  docs =neg+pos

  # prepare labels
  labels=np.array([0 for _ in range(len(neg))]+[1 for _ in range(len(pos))])
  return docs,labels

                  

In [None]:
model=define_model(vocab_size,max_length)

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_8 (Embedding)      (None, 1317, 100)         2576800   
_________________________________________________________________
conv1d_8 (Conv1D)            (None, 1310, 32)          25632     
_________________________________________________________________
max_pooling1d_8 (MaxPooling1 (None, 655, 32)           0         
_________________________________________________________________
flatten_8 (Flatten)          (None, 20960)             0         
_________________________________________________________________
dense_16 (Dense)             (None, 10)                209610    
_________________________________________________________________
dense_17 (Dense)             (None, 1)                 11        
Total params: 2,812,053
Trainable params: 2,812,053
Non-trainable params: 0
____________________________________________

In [None]:
train_docs,ytrain=load_clean_doc(vocab,True)

In [None]:
train_docs

The next step is to encode each document as a sequence of integers. 

The Keras Embedding layer requires integer inputs where each integer maps to a single token that has a specific real-valued vector representation within the embedding. These vectors are random at the beginning of training, but during training become meaningful to the network. 

We can encode the training documents as sequences of integers using the Tokenizer class in the Keras API. First, we must construct an instance of the class then train it on all documents in the training dataset. In this case, it develops a vocabulary of all tokens in the training dataset and develops a consistent mapping from words in the vocabulary to unique integers. 

We could just as easily develop this mapping ourselves using our vocabulary file. The create_tokenizer() function below will prepare a Tokenizer from the training data.

In [None]:
def create_tokenizer(lines):
  tokenizer=Tokenizer()
  tokenizer.fit_on_texts(lines)
  return tokenizer
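For intuition about what the fitted Tokenizer holds, here is a rough pure-Python sketch of a word-to-integer mapping built by hand. The helper names `build_word_index` and `texts_to_ids` are hypothetical. The sketch assumes that more frequent words receive smaller integers and that 0 is reserved for padding; the alphabetical tie-breaking is a choice made here and may not match how Keras orders equally frequent words.

```python
from collections import Counter

def build_word_index(docs):
    """Map each word to an integer (most frequent first); 0 is left free for padding."""
    counts = Counter(word for doc in docs for word in doc.split())
    # sort by descending count, then alphabetically so ties are deterministic
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: i + 1 for i, (word, _) in enumerate(ordered)}

def texts_to_ids(docs, word_index):
    """Replace each known word with its integer; unknown words are skipped."""
    return [[word_index[w] for w in doc.split() if w in word_index] for doc in docs]

toy_docs = ['good film good story', 'bad film']
toy_index = build_word_index(toy_docs)
toy_sequences = texts_to_ids(toy_docs, toy_index)
print(toy_index)
print(toy_sequences)
```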

In [None]:
tokenizer=create_tokenizer(train_docs)

In [None]:
tokenizer

<keras_preprocessing.text.Tokenizer at 0x7facebf92610>

Now that the mapping of words to integers has been prepared, we can use it to encode the reviews in the training dataset. We can do that by calling the texts_to_sequences() function on the Tokenizer.

We also need to ensure that all documents have the same length. This is a requirement of Keras for efficient computation. We could truncate reviews to the smallest size, zero-pad (pad with the value 0) reviews to the maximum length, or use some hybrid.

In this case, we will pad all reviews to the length of the longest review in the training dataset. First, we can find the longest review using the max() function on the training dataset and take its length. We can then call the Keras function pad_sequences() to pad the sequences to the maximum length by adding 0 values on the end.

In [None]:

# integer encode and pad documents
def encode_docs(tokenizer,max_length,docs):
  # integer encode
  encoded=tokenizer.texts_to_sequences(docs)
  # pad sequences
  padded=pad_sequences(encoded,maxlen=max_length,padding='post')
  return padded
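To show what post-padding does to the integer sequences, here is a simplified pure-Python sketch. The helper `pad_post` is hypothetical and is not the Keras implementation. It assumes that sequences longer than the target length lose items from the front, which mirrors what I understand to be the default truncation behaviour of pad_sequences().

```python
def pad_post(sequences, max_length, value=0):
    """Pad each sequence at the end with `value` up to max_length."""
    padded = []
    for seq in sequences:
        # assumption: over-long sequences keep their last max_length items
        seq = seq[-max_length:]
        padded.append(seq + [value] * (max_length - len(seq)))
    return padded

toy_encoded = [[5, 3], [7, 1, 4, 2, 9], [8]]
toy_padded = pad_post(toy_encoded, 4)
print(toy_padded)
```

In this notebook the target length equals the longest training review, so no training sequence is truncated; only test reviews longer than that would be.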

In [None]:

# calculate the maximum sequence length
max_length=max([len(s.split()) for s in train_docs])
print('Max length is ',max_length)

Max length is  1317


In [None]:
# define vocabulary size
#Calculate the size of the vocabulary for the Embedding layer.
vocab_size=len(tokenizer.word_index)+1
vocab_size

25768

## **Defining model architecture**

We will use a 100-dimensional vector space, but you could try other values, such as 50 or 150. The maximum document length was calculated above in the max_length variable used during padding.

The complete model definition is listed below, including the Embedding layer. We use a Convolutional Neural Network (CNN) because CNNs have proven to be successful at document classification problems.

A conservative CNN configuration is used with 32 filters (parallel fields for processing words) and a kernel size of 8 with a rectified linear (relu) activation function. This is followed by a pooling layer that reduces the output of the convolutional layer by half.

Next, the 2D output from the CNN part of the model is flattened to one long 1D vector to represent the features extracted by the CNN.

The back end of the model is a standard Multilayer Perceptron that interprets the CNN features.

The output layer uses a sigmoid activation function to output a value between 0 and 1 for the negative and positive sentiment in the review.
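The layer shapes and parameter counts printed by model.summary() can be reproduced with simple arithmetic. The sketch below is only a bookkeeping aid; the function name `cnn_summary` is hypothetical, and it assumes a convolution without padding and with stride 1, and a pooling size of 2.

```python
def cnn_summary(vocab_size, max_length, embed_dim=100, filters=32,
                kernel_size=8, pool_size=2, dense_units=10):
    # convolution without padding, stride 1: one output per window position
    conv_len = max_length - kernel_size + 1
    # pooling keeps one value per window of pool_size steps
    pool_len = conv_len // pool_size
    # flatten: time steps times filters
    flat = pool_len * filters
    params = {
        'embedding': vocab_size * embed_dim,
        'conv1d': filters * (kernel_size * embed_dim) + filters,  # weights + biases
        'dense_hidden': flat * dense_units + dense_units,
        'dense_output': dense_units + 1,
    }
    return conv_len, pool_len, flat, params

# values taken from the notebook outputs: vocab_size=25768, max_length=1317
conv_len, pool_len, flat_len, param_counts = cnn_summary(25768, 1317)
total_params = sum(param_counts.values())
print(conv_len, pool_len, flat_len, param_counts, total_params)
```

These numbers agree with the summary printed earlier: sequence lengths of 1310 and 655, a flattened vector of 20960 values, and 2,812,053 parameters in total. The Embedding layer accounts for most of them.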

In [None]:
import tensorflow
from tensorflow.keras.models import Sequential,load_model
from tensorflow.keras.layers import Flatten,Dense,MaxPooling1D,Embedding,Conv1D
from tensorflow.keras.utils import plot_model
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
def define_model(vocab_size,max_length):
  model=Sequential()
  model.add(Embedding(vocab_size,100,input_length=max_length))
  model.add(Conv1D(32,8 ,activation='relu')) 
  # 32 filters, each spanning a window of 8 consecutive words
  # the input length is max_length (1317 words per padded review)
  # parameter count = filters*(kernel_size*embed_dim) + filters = 32*(8*100)+32 = 25632
  model.add(MaxPooling1D())
  model.add(Flatten())
  model.add(Dense(10,activation='relu'))
  model.add(Dense(1,activation='sigmoid'))
  model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
  model.summary()
  return model

In [None]:
Xtrain=encode_docs(tokenizer,max_length,train_docs)

In [None]:
Xtrain

array([[ 1492,     1,  1297, ...,     0,     0,     0],
       [  214,  1284,     1, ...,     0,     0,     0],
       [    3,   263,   681, ...,     0,     0,     0],
       ...,
       [  185, 10975,   716, ...,     0,     0,     0],
       [ 3731,  9258,  1943, ...,     0,     0,     0],
       [ 2564,  5715,   349, ...,     0,     0,     0]], dtype=int32)

Next, we fit the network on the training data. We use a binary cross-entropy loss function because the problem we are learning is a binary classification problem. The efficient Adam implementation of stochastic gradient descent is used, and we keep track of accuracy in addition to loss during training. The model is trained for 10 epochs.
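As a reminder of what the binary cross-entropy loss measures, here is a minimal pure-Python sketch. The function is a hand-written illustration, not the Keras implementation, and the clipping constant `eps` is an assumption added to avoid taking log(0).

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# predicting 0.5 for everything gives ln(2), about 0.693
loss_uninformed = binary_cross_entropy([0, 1], [0.5, 0.5])
# confident, correct predictions give a loss near 0
loss_confident = binary_cross_entropy([0, 1], [0.01, 0.99])
print(loss_uninformed, loss_confident)
```

A loss near 0.693 therefore means the model is doing no better than guessing 0.5 for every review.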

In [None]:
# fit network
model.fit(Xtrain,ytrain,epochs=10,verbose=2)

Epoch 1/10
57/57 - 1s - loss: 0.6934 - accuracy: 0.5178
Epoch 2/10
57/57 - 0s - loss: 0.5922 - accuracy: 0.6750
Epoch 3/10
57/57 - 0s - loss: 0.1572 - accuracy: 0.9483
Epoch 4/10
57/57 - 0s - loss: 0.0124 - accuracy: 0.9983
Epoch 5/10
57/57 - 0s - loss: 0.0022 - accuracy: 1.0000
Epoch 6/10
57/57 - 0s - loss: 0.0013 - accuracy: 1.0000
Epoch 7/10
57/57 - 0s - loss: 9.5671e-04 - accuracy: 1.0000
Epoch 8/10
57/57 - 0s - loss: 7.4839e-04 - accuracy: 1.0000
Epoch 9/10
57/57 - 0s - loss: 6.0934e-04 - accuracy: 1.0000
Epoch 10/10
57/57 - 0s - loss: 5.1145e-04 - accuracy: 1.0000


<keras.callbacks.History at 0x7facecc1a2d0>

In [None]:
model.save('model.h5') # save the model

In [None]:
train_doc,ytrain=load_clean_doc(vocab,True)
test_doc,y_test=load_clean_doc(vocab,False)
Xtest=encode_docs(tokenizer,max_length,test_doc)

In [None]:
# load the model
#model=load_model('model.h5')

# evaluate model on training dataset
_,acc=model.evaluate(Xtrain,ytrain,verbose=0)
print('train accuracy',(acc*100))


# evaluate model on test dataset

_,acc=model.evaluate(Xtest,y_test,verbose=0)
print('Test accuracy',(acc*100))

train accuracy 100.0
Test accuracy 88.49999904632568
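The accuracy figure can be understood as the fraction of reviews whose predicted probability falls on the correct side of a threshold. The sketch below assumes a threshold of 0.5 and uses made-up probabilities and labels; `accuracy_from_probs` is a hypothetical helper, not part of Keras.

```python
def accuracy_from_probs(probs, labels, threshold=0.5):
    """Fraction of examples where the thresholded probability equals the label."""
    preds = [1 if p > threshold else 0 for p in probs]
    correct = sum(int(pred == y) for pred, y in zip(preds, labels))
    return correct / len(labels)

# Hypothetical sigmoid outputs and true labels (0 = negative, 1 = positive).
toy_probs = [0.06, 0.79, 0.99, 0.35]
toy_labels = [0, 0, 1, 1]
toy_acc = accuracy_from_probs(toy_probs, toy_labels)
print(toy_acc)
```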


In [None]:
model.predict(Xtest)

array([[6.3639782e-02],
       [9.4005809e-04],
       [6.4791783e-07],
       [1.5333651e-05],
       [3.9064748e-06],
       [4.3336245e-06],
       [1.3360527e-01],
       [2.3614551e-04],
       [5.0739106e-07],
       [7.8642982e-01],
       [2.1959063e-06],
       [1.6228665e-05],
       [2.1453020e-03],
       [1.5732989e-02],
       [1.4940880e-03],
       [1.2398261e-03],
       [1.9317281e-02],
       [2.6634575e-03],
       [9.9558300e-01],
       [3.5103521e-01],
       [1.2156356e-02],
       [6.4676057e-04],
       [6.1596598e-04],
       [7.3108450e-04],
       [2.5078602e-05],
       [6.7723191e-01],
       [2.0167528e-05],
       [6.0389412e-04],
       [2.1343967e-06],
       [1.5003771e-01],
       [2.0404351e-03],
       [1.7327410e-06],
       [1.7500315e-02],
       [2.9757783e-01],
       [8.5410334e-02],
       [1.0875522e-03],
       [1.5935737e-03],
       [6.9199261e-05],
       [7.5063581e-05],
       [4.8229420e-05],
       [4.1458725e-07],
       [8.892901

## **Function to predict the sentiment for an ad hoc movie review**

In [None]:
# classify a review as negative or positive

def predict_review(review,vocab,tokenizer,max_length,model):
  # clean review
  line=clean_doc(review,vocab)

  # encode and pad review
  padded=encode_docs(tokenizer,max_length,[line])

  # predict sentiment
  yhat=model.predict(padded,verbose=0)

  # retrieve predicted percentage and label
  percent_pos=yhat[0,0]
  if round(percent_pos)==0:
    return (1-percent_pos),'NEG'
  
  return percent_pos, 'POS'

In [None]:
text='This is bad movie. Donot watch it'
percent,sentiment=predict_review(text,vocab,tokenizer,max_length,model)

In [None]:
percent,sentiment

(0.5455211997032166, 'NEG')

In [None]:
text='Everyone will enjoy this film. I love it, recommended!'
percent1,sentiment1=predict_review(text,vocab,tokenizer,max_length,model)

In [None]:
percent1,sentiment1

(0.5075193, 'POS')

# **Conclusion:**


Running the example first prints the skill of the model on the training and test datasets. We can see that the model achieves 100% accuracy on the training dataset and 88.5% on the test dataset, an impressive score.

Next, we can see that the model makes the correct prediction on two contrived movie reviews. The percentage, or confidence, of the prediction is close to 50% for both. This may be because the two contrived reviews are very short and the model is expecting sequences of 1,000 or more words.