The goal of this model is to take Amazon food reviews and predict the score each review gives its item, on a scale of 1-5.
The steps we are going to take are:

1. Tokenize every review and convert it into a padded integer sequence
2. Load in the GloVe vector embeddings
3. Map each word index to its associated GloVe embedding
4. Make Training and Testing Set Division
5. Build the model in Keras (Embedding Layer - Convolution Network - LSTM Network - Dense Layer - (1-5))
6. Train the model and Test the model for accuracy

#### Tokenize every review and convert it into a padded integer sequence

In [32]:
import numpy as np
import pandas as pd
from keras.models import Model
import nltk
from tqdm import tqdm
from keras.layers import Input, Embedding, Conv1D, MaxPooling1D, LSTM, Dense
from keras.preprocessing.text import text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from statistics import mean,mode, median, median_high
try:
    import h5py
except ImportError:
    h5py = None

In [33]:
#Load in the reviews file and inspect the reviews

df = pd.read_csv('data/Reviews.csv')

X = df['Text'].values
y = df['Score'].values

df.head(10)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...
5,6,B006K2ZZ7K,ADT0SRK1MGOEU,Twoapennything,0,0,4,1342051200,Nice Taffy,I got a wild hair for taffy and ordered this f...
6,7,B006K2ZZ7K,A1SP2KVKFXXRU1,David C. Sullivan,0,0,5,1340150400,Great! Just as good as the expensive brands!,This saltwater taffy had great flavors and was...
7,8,B006K2ZZ7K,A3JRGQVEQN31IQ,Pamela G. Williams,0,0,5,1336003200,"Wonderful, tasty taffy",This taffy is so good. It is very soft and ch...
8,9,B000E7L2R4,A1MZYO9TZK0BBI,R. James,1,1,5,1322006400,Yay Barley,Right now I'm mostly just sprouting this so my...
9,10,B00171APVA,A21BT40VZCCYT4,Carol A. Reed,0,0,5,1351209600,Healthy Dog Food,This is a very healthy dog food. Good for thei...


In [34]:
#Tokenizing Every Review and Preprocessing the text
MAX_SENTENCE_SIZE = 200

tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
sequences = tokenizer.texts_to_sequences(X)

word_index = tokenizer.word_index

#Adding in the padding

padded_sequences = pad_sequences(sequences, maxlen=MAX_SENTENCE_SIZE)

# X = [text_to_word_sequence(sentence,filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
#                                                lower=True,
#                                                split=" ") for sentence in tqdm(X)]
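
To make the tokenizing and padding step concrete, here is a minimal pure-Python sketch of the idea on two toy reviews. It is not the Keras implementation: `build_word_index` and `pad_pre` are hypothetical helpers written for illustration, punctuation filtering is skipped, and the left-padding and left-truncation behaviour is assumed to match the `pad_sequences` defaults used above.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer rank, most frequent word first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # Index 0 is reserved for padding, so ranks start at 1.
    # sorted() is stable, so ties keep their first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` ids and left-pad shorter sequences with `value`."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

toy_reviews = ["great taffy great price", "not as advertised"]
toy_index = build_word_index(toy_reviews)
toy_sequences = [[toy_index[w] for w in r.lower().split()] for r in toy_reviews]
toy_padded = [pad_pre(s, maxlen=5) for s in toy_sequences]

print(toy_index)
print(toy_padded)
```

Every review ends up with the same length, which is what lets the reviews be stacked into one array and fed to the network in batches.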

In [35]:
print("Total Number of Reviews:",len(sequences))

Total Number of Reviews: 568454


#### Load in the GloVe vector embeddings

In [36]:
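#Loading up the embedding index (word -> vector) from the GloVe file

embeddings_index = {}
#The context manager closes the file even if parsing fails part-way
with open('data/embedding/glove.6B.100d.txt', "r", encoding='utf8') as f:
    for line in f:
        #Each line holds a word followed by its vector components
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs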

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


#### Map each word index to its associated GloVe embedding

In [37]:
#Create the embedding matrix
#The embedding matrix combines the word list (word -> index) with the word embeddings (word -> vector) to give (index -> vector)

EMBEDDING_DIM=100
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))

for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
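
Because words missing from GloVe are left as all-zero rows, it can be worth checking how much of the vocabulary actually receives a pretrained vector. The sketch below is one way to measure that; `embedding_coverage` is a hypothetical helper, and the toy dictionaries only stand in for the notebook's `word_index` and `embeddings_index`.

```python
def embedding_coverage(word_index, embeddings_index):
    """Return (fraction of vocabulary with a pretrained vector, list of missing words)."""
    missing = [word for word in word_index if word not in embeddings_index]
    found = len(word_index) - len(missing)
    fraction = found / len(word_index) if word_index else 0.0
    return fraction, missing

# Hypothetical toy data standing in for the notebook's word_index and embeddings_index.
toy_word_index = {"taffy": 1, "great": 2, "yummmy": 3, "price": 4}
toy_embeddings = {"taffy": [0.1, 0.2], "great": [0.3, 0.4], "price": [0.5, 0.6]}

fraction, missing = embedding_coverage(toy_word_index, toy_embeddings)
print(f"{fraction:.0%} of the vocabulary has a vector; missing: {missing}")
```

Calling the same helper on the real `word_index` and `embeddings_index` would show how many review words (misspellings, product names and so on) the frozen embedding layer sees as zeros.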

#### One Hot Encode the Label Data

In [38]:
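#Convert the labels into one-hot encoded output
#Scores run from 1 to 5 and are used directly as column indices, so 6 columns are needed (column 0 stays unused)
LABEL_CATEGORIES = 6
y = np.eye(LABEL_CATEGORIES)[y.reshape(-1)]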
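
The `np.eye(...)[y]` indexing trick can be seen on a handful of toy scores. The sketch below first reproduces the 6-column encoding used here, then shows an alternative 5-column encoding with the scores shifted to 0-4. The alternative is an assumption about how the encoding could be tightened, not something this notebook does; using it would also mean changing `LABEL_CATEGORIES` and adding 1 back to every prediction.

```python
import numpy as np

scores = np.array([5, 1, 4, 2, 5])  # toy star ratings in the range 1-5

# The notebook's encoding: 6 columns, indexed directly by the score.
one_hot_6 = np.eye(6)[scores]
print(one_hot_6.shape)        # one row per review, one column per index 0-5
print(one_hot_6[:, 0].sum())  # column 0 is never set because no score is 0

# Alternative (assumption): shift scores to 0-4 and use exactly 5 columns.
one_hot_5 = np.eye(5)[scores - 1]
recovered = one_hot_5.argmax(axis=1) + 1  # undo the shift to get stars back
print(recovered)
```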

#### Make Training and Testing Set Division

In [39]:
train_limit = int(len(padded_sequences) * 0.85)
X_train = padded_sequences[:train_limit]
X_test = padded_sequences[train_limit:]
y_train = y[:train_limit]
y_test = y[train_limit:]

In [40]:
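def unison_shuffled_copies(a, b):
    #Shuffle two equal-length arrays with the same random permutation
    assert len(a) == len(b)
    p = np.random.permutation(len(a))
    return a[p], b[p]

def get_training_sets():
    #Shuffling is currently disabled; the training split is returned in file order
    return X_train,y_train #unison_shuffled_copies(X_train, y_train)

def get_testing_sets():
    return X_test, y_test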
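
The split above takes the first 85% of rows as training data in file order. If the row order happens to correlate with product or time (an assumption, not something checked here), the two sets may not follow the same distribution. A minimal sketch of shuffling before splitting is shown below; `shuffled_split` and its `seed` parameter are hypothetical names introduced for illustration.

```python
import numpy as np

def shuffled_split(features, labels, train_fraction=0.85, seed=0):
    """Shuffle features and labels with one permutation, then split them."""
    assert len(features) == len(labels)
    rng = np.random.default_rng(seed)      # fixed seed keeps the split reproducible
    order = rng.permutation(len(features))
    limit = int(len(features) * train_fraction)
    train_idx, test_idx = order[:limit], order[limit:]
    return features[train_idx], labels[train_idx], features[test_idx], labels[test_idx]

# Toy arrays: row i of toy_X is [2i, 2i + 1] and its label is i.
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.arange(20)

X_tr, y_tr, X_te, y_te = shuffled_split(toy_X, toy_y)
print(X_tr.shape, X_te.shape)
```

Indexing both arrays with the same permutation keeps every review paired with its own label, which is the same idea as `unison_shuffled_copies`.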

#### Build the model in Keras (Embedding Layer - Convolution Network - LSTM Network - Dense Layer - (1-5))

#### Create the Neural Network

In [41]:
def create_neural_network():

    input_layer = Input(shape=(MAX_SENTENCE_SIZE,), dtype='int32')
    
    embedding_layer = Embedding(len(word_index)+1, EMBEDDING_DIM, weights=[embedding_matrix],
                               input_length=MAX_SENTENCE_SIZE, trainable=False)
    embedding_layer = embedding_layer(input_layer)
     
    conv_layer = Conv1D(256, 8, activation='relu')(embedding_layer)
    conv_layer = MaxPooling1D(8)(conv_layer)
    
    lstm_layer = LSTM(256)(conv_layer)
    
    #flatten_layer = Flatten()(lstm_layer)
    dense_layer = Dense(256, activation='relu')(lstm_layer)
    
    prediction = Dense(LABEL_CATEGORIES, activation='softmax')(dense_layer)
    
    model = Model(inputs=input_layer,outputs=prediction)
    
    return model
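
It helps to know how many time steps reach the LSTM after the convolution and pooling layers. The arithmetic below is a sketch that assumes the Keras defaults the code above does not override: no padding (`'valid'`), a convolution stride of 1, and a pooling stride equal to the pool size. The two helper functions are hypothetical and exist only to show the calculation.

```python
def conv1d_valid_length(steps, kernel_size, stride=1):
    """Output length of a 1D convolution with no padding."""
    return (steps - kernel_size) // stride + 1

def maxpool1d_valid_length(steps, pool_size, stride=None):
    """Output length of 1D max pooling with no padding; stride defaults to pool_size."""
    stride = pool_size if stride is None else stride
    return (steps - pool_size) // stride + 1

MAX_SENTENCE_SIZE = 200

after_conv = conv1d_valid_length(MAX_SENTENCE_SIZE, kernel_size=8)  # 200 - 8 + 1 = 193
after_pool = maxpool1d_valid_length(after_conv, pool_size=8)        # (193 - 8) // 8 + 1 = 24

print(after_conv, after_pool)
```

Under these assumptions the LSTM processes 24 steps of 256 features each instead of 200 word vectors, which is a large part of why each epoch stays short. Calling `model.summary()` on the built model is a way to confirm the actual shapes.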

In [42]:
model = create_neural_network()
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

#### Train the model and Test the model for accuracy

In [43]:
x_train,y_train = get_training_sets()
x_test,y_test = get_testing_sets()
model.fit(x_train, y_train, validation_data=(x_test,y_test), epochs=25, verbose=2, batch_size=512)

Train on 483185 samples, validate on 85269 samples
Epoch 1/25
 - 55s - loss: 0.9289 - acc: 0.6701 - val_loss: 0.7617 - val_acc: 0.7161
Epoch 2/25
 - 55s - loss: 0.7396 - acc: 0.7244 - val_loss: 0.6819 - val_acc: 0.7453
Epoch 3/25
 - 55s - loss: 0.6487 - acc: 0.7581 - val_loss: 0.6504 - val_acc: 0.7568
Epoch 4/25
 - 55s - loss: 0.5717 - acc: 0.7896 - val_loss: 0.6120 - val_acc: 0.7800
Epoch 5/25
 - 55s - loss: 0.5034 - acc: 0.8171 - val_loss: 0.5998 - val_acc: 0.7895
Epoch 6/25
 - 55s - loss: 0.4428 - acc: 0.8407 - val_loss: 0.6141 - val_acc: 0.8018
Epoch 7/25
 - 55s - loss: 0.3875 - acc: 0.8620 - val_loss: 0.6297 - val_acc: 0.8037
Epoch 8/25
 - 55s - loss: 0.3366 - acc: 0.8806 - val_loss: 0.6411 - val_acc: 0.8030
Epoch 9/25
 - 55s - loss: 0.2914 - acc: 0.8967 - val_loss: 0.6666 - val_acc: 0.8078
Epoch 10/25
 - 54s - loss: 0.2512 - acc: 0.9114 - val_loss: 0.7298 - val_acc: 0.8093
Epoch 11/25
 - 55s - loss: 0.2168 - acc: 0.9233 - val_loss: 0.8227 - val_acc: 0.7870
Epoch 12/25
 - 54s - lo

<keras.callbacks.History at 0x14b10d27cf8>
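
In the log above, the training loss keeps falling while `val_loss` reaches its minimum at epoch 5 and rises afterwards, which suggests the model starts to overfit. The pure-Python sketch below replays the logged `val_loss` values through a simple patience rule to show where early stopping would have halted training. The patience of 3 is an arbitrary choice for illustration, and `early_stopping_epoch` is a hypothetical helper, not a Keras function.

```python
# val_loss values copied from the training log above (epochs 1-11).
val_loss = [0.7617, 0.6819, 0.6504, 0.6120, 0.5998, 0.6141,
            0.6297, 0.6411, 0.6666, 0.7298, 0.8227]

def early_stopping_epoch(losses, patience):
    """Return (best_epoch, stop_epoch), both 1-based; stop_epoch is None if never triggered."""
    best_loss = float("inf")
    best_epoch = None
    waited = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, None

best_epoch, stop_epoch = early_stopping_epoch(val_loss, patience=3)
print(best_epoch, stop_epoch)  # 5 8
```

In Keras the same rule is available as the `EarlyStopping` callback (with `monitor='val_loss'`, a `patience` value and `restore_best_weights=True`), which can be passed to `model.fit` through its `callbacks` argument.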

In [44]:
#Saving the Model
model.save('sentiment_analysis.h5')
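
Once the saved model is loaded again, its predictions come back as one row of 6 softmax probabilities per review. Because the labels were encoded with the score itself as the column index, the position of the largest probability is already the predicted star rating. A small sketch with made-up probabilities follows; `probabilities_to_scores` is a hypothetical helper and the numbers are not model output.

```python
import numpy as np

def probabilities_to_scores(probabilities):
    """Turn (n_reviews, 6) softmax rows into star ratings via the argmax column."""
    probabilities = np.asarray(probabilities)
    return probabilities.argmax(axis=1)

# Made-up probabilities for two reviews; columns correspond to indices 0-5.
toy_probs = np.array([
    [0.00, 0.05, 0.05, 0.10, 0.20, 0.60],
    [0.00, 0.70, 0.15, 0.05, 0.05, 0.05],
])

print(probabilities_to_scores(toy_probs))  # [5 1]
```

Since column 0 is part of the softmax, a prediction of 0 is possible in principle even though no review carries that score.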