## Deep learning
#### First, load the dataset (download it from http://www.cs.cornell.edu/People/pabo/movie-review-data/): open each file, read its content as a string, remove punctuation, and split it into a word list. Then map each word to an integer index using the word index dictionary from https://s3.amazonaws.com/text-datasets/imdb_word_index.json, keeping only the 5000 most frequent words.
#### Next, create an embedding layer and pass the training data through it to get word embeddings (each word is represented by a 32-dimensional vector).
#### Then use Keras to create an LSTM layer with 100 units, followed by a dense layer with a sigmoid activation that produces the final result: an output above 0.5 means positive, otherwise negative. Use the Adam optimizer to train the model.
#### Finally, check the training and testing accuracy.
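As a rough illustration of the first step, the sketch below tokenizes one review and maps its words to integer indices on a toy scale. The `toy_word_index` dictionary, the `encode_review` helper, and the cutoff of 4 are made up for this example. The standard-library call `re.findall(r"\w+", ...)` stands in for NLTK's `RegexpTokenizer(r'\w+')` and should split text the same way for this pattern.

```python
import re

# Hypothetical miniature word index: a lower index means a more frequent word.
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4, "plot": 5}

def encode_review(text, word_index, top_words):
    """Lowercase the text, drop punctuation, and keep indices of known words below top_words."""
    words = re.findall(r"\w+", text.lower())
    return [word_index[w] for w in words if w in word_index and word_index[w] < top_words]

encoded = encode_review("The movie was GREAT, the plot... meh!", toy_word_index, top_words=4)
print(encoded)
```

Words that are missing from the dictionary ("meh") and words whose index is at or above the cutoff ("great", "plot") are dropped rather than replaced by a placeholder, which mirrors the filtering described above.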

### Import libraries

In [1]:
import numpy as np
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
import os
import json
from keras.utils import get_file
from nltk.tokenize import RegexpTokenizer

Using TensorFlow backend.


### Word index dictionary

In [2]:
def get_word_index(path='imdb_word_index.json'):
    path = get_file(path,origin='https://s3.amazonaws.com/text-datasets/imdb_word_index.json',
                    file_hash='bfafd718b763782e994055a2d397834f')
    with open(path) as f:
        return json.load(f)

### Load dataset

In [33]:
word_index = get_word_index(path='imdb_word_index.json')
tokenizer = RegexpTokenizer(r'\w+')

def load_reviews(path):
    # Read every file in `path`, lowercase it, strip punctuation, and
    # map each known word to its index, keeping only indices below 5000.
    reviews = []
    for name in os.listdir(path):
        with open(os.path.join(path, name)) as f:
            words = tokenizer.tokenize(f.read().lower())
        reviews.append([word_index[w] for w in words
                        if w in word_index and word_index[w] < 5000])
    return reviews

X_neg = load_reviews("Desktop/review_polarity/txt_sentoken/neg")
Y_neg = [0] * len(X_neg)  # label 0 = negative
X_pos = load_reviews("Desktop/review_polarity/txt_sentoken/pos")
Y_pos = [1] * len(X_pos)  # label 1 = positive
X = X_neg + X_pos
Y = Y_neg + Y_pos

# Shuffle reviews and labels together with a fixed seed, then split 80/20.
np.random.seed(10)
indices = np.arange(len(X))
np.random.shuffle(indices)
x = [X[i] for i in indices]
y = [Y[i] for i in indices]
idx = int(0.8 * len(X))
# Reviews differ in length, so they stay as lists until they are padded.
X_train, y_train = x[:idx], np.array(y[:idx])
X_test, y_test = x[idx:], np.array(y[idx:])

### Pad X_train and X_test to the same length

In [None]:
max_review_length = 1000
X_train=sequence.pad_sequences(X_train, maxlen = max_review_length)
X_test=sequence.pad_sequences(X_test, maxlen = max_review_length)
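
With its default arguments, `pad_sequences` pads short sequences with zeros at the front and truncates long ones from the front. The pure-Python sketch below mimics that default behavior on toy data; `pad_pre` is a hypothetical helper written for illustration (it assumes `maxlen` is positive) and is not part of Keras.

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in seqs:
        tail = list(seq)[-maxlen:]  # truncate from the front
        padded.append([value] * (maxlen - len(tail)) + tail)
    return padded

toy = [[7, 8, 9], [4]]
print(pad_pre(toy, maxlen=2))
```

Because truncation happens at the front, a review longer than `max_review_length` keeps its final words rather than its opening ones.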

### Build the model with embedding and LSTM layer

In [37]:

top_words=5000  # vocabulary size; matches the index cutoff used when loading the data
embedding_vector_length=32
model=Sequential()
model.add(Embedding(top_words,embedding_vector_length,input_length=max_review_length))
model.add(LSTM(100)) 
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model.summary())
model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=5,batch_size=64)


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_4 (Embedding)      (None, 1000, 32)          160000    
_________________________________________________________________
lstm_4 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 101       
=================================================================
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
None
Train on 1600 samples, validate on 400 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x182c3e4898>
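
The parameter counts in the model summary can be checked by hand. The short calculation below assumes the standard LSTM layout of four weight blocks (input, forget, and output gates plus the cell candidate), each with input weights, recurrent weights, and a bias vector; the variable names are chosen for this example only.

```python
top_words, embed_dim, lstm_units = 5000, 32, 100

# Embedding: one 32-dimensional vector per vocabulary index.
embedding_params = top_words * embed_dim

# LSTM: four weight blocks, each with input weights, recurrent weights, and biases.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

These values agree with the 160000, 53200, 101, and 213,301 reported by `model.summary()`.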

### Test

In [38]:
scores=model.evaluate(X_test,y_test,verbose=0)
print("Accuracy: %.2f%%"%(scores[1]*100))

Accuracy: 64.50%
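
To make the decision rule concrete, the sketch below turns sigmoid outputs into class labels with the 0.5 threshold described at the top and computes accuracy by hand. The probabilities and labels are made-up values for illustration, not outputs of the model trained above.

```python
import numpy as np

probs = np.array([0.91, 0.12, 0.50, 0.67])  # made-up sigmoid outputs
labels = np.array([1, 0, 1, 0])             # made-up true labels

# An output strictly above 0.5 counts as positive (1), anything else as negative (0).
preds = (probs > 0.5).astype(int)
accuracy = (preds == labels).mean()
print(preds, accuracy)
```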
