# Using GloVe with an RNN

Using the GloVe 6B-token dataset.
GloVe: https://nlp.stanford.edu/projects/glove/

gensim loads the GloVe vectors after they have been converted to word2vec format.

Download the dataset from Kaggle: https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews
Extract the dataset into a `dataset` folder in the same directory as this notebook.

In [1]:
import gensim
import numpy as np
import tensorflow as tf
import pandas as pd
import string
import nltk
import re


In [2]:
# convert GloVe to gensim's word2vec format (only needs to run once)
# from gensim.scripts.glove2word2vec import glove2word2vec
# glove2word2vec('glove.6B.50d.txt', 'gloveWord2vec.txt')
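
For reference, the word2vec text format differs from the raw GloVe text format only by a header line that gives the vocabulary size and the vector dimension. The sketch below shows that idea in plain Python. `glove_to_word2vec_text` and the tiny two-word file are made up for illustration; in practice gensim's own converter (commented out above) does the job.

```python
import os
import tempfile

def glove_to_word2vec_text(glove_path, out_path):
    """Hypothetical helper: copy a GloVe text file and prepend the
    '<vocab_size> <dim>' header that the word2vec text format expects."""
    with open(glove_path, encoding="utf-8") as src:
        lines = [line.rstrip("\n") for line in src if line.strip()]
    dim = len(lines[0].split(" ")) - 1  # the first field is the word itself
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write(f"{len(lines)} {dim}\n")
        for line in lines:
            dst.write(line + "\n")
    return len(lines), dim

# toy two-word "GloVe" file with 3-dim vectors (made-up numbers)
tmp_dir = tempfile.mkdtemp()
toy_glove = os.path.join(tmp_dir, "toy_glove.txt")
toy_w2v = os.path.join(tmp_dir, "toy_w2v.txt")
with open(toy_glove, "w", encoding="utf-8") as f:
    f.write("dress 0.1 0.2 0.3\nshirt 0.4 0.5 0.6\n")

vocab_size, dim = glove_to_word2vec_text(toy_glove, toy_w2v)
with open(toy_w2v, encoding="utf-8") as f:
    header = f.readline().strip()
print(header)
```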

In [3]:
df = pd.read_csv('./dataset/Womens Clothing E-Commerce Reviews.csv', index_col=0)

In [4]:
df

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses
5,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses
6,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,5,1,1,General Petite,Tops,Knits
7,858,39,"Shimmer, surprisingly goes with lots","I ordered this in carbon for store pick up, an...",4,1,4,General Petite,Tops,Knits
8,1077,24,Flattering,I love this dress. i usually get an xs but it ...,5,1,0,General,Dresses,Dresses
9,1077,34,Such a fun dress!,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,1,0,General,Dresses,Dresses


In [5]:
df2 = df.copy(deep=True)

In [6]:
df2.iloc[1, 3]

'Love this dress!  it\'s sooo pretty.  i happened to find it in a store, and i\'m glad i did bc i never would have ordered it online bc it\'s petite.  i bought a petite and am 5\'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.'

In [7]:
df2 = df2[['Review Text', 'Rating', 'Recommended IND']]

In [8]:
df2

Unnamed: 0,Review Text,Rating,Recommended IND
0,Absolutely wonderful - silky and sexy and comf...,4,1
1,Love this dress! it's sooo pretty. i happene...,5,1
2,I had such high hopes for this dress and reall...,3,0
3,"I love, love, love this jumpsuit. it's fun, fl...",5,1
4,This shirt is very flattering to all due to th...,5,1
5,"I love tracy reese dresses, but this one is no...",2,0
6,I aded this in my basket at hte last mintue to...,5,1
7,"I ordered this in carbon for store pick up, an...",4,1
8,I love this dress. i usually get an xs but it ...,5,1
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,1


In [9]:
df2.dropna(how='any', inplace=True)

In [10]:
df2

Unnamed: 0,Review Text,Rating,Recommended IND
0,Absolutely wonderful - silky and sexy and comf...,4,1
1,Love this dress! it's sooo pretty. i happene...,5,1
2,I had such high hopes for this dress and reall...,3,0
3,"I love, love, love this jumpsuit. it's fun, fl...",5,1
4,This shirt is very flattering to all due to th...,5,1
5,"I love tracy reese dresses, but this one is no...",2,0
6,I aded this in my basket at hte last mintue to...,5,1
7,"I ordered this in carbon for store pick up, an...",4,1
8,I love this dress. i usually get an xs but it ...,5,1
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,1


In [11]:
df3 = df2[['Review Text', 'Rating']]

In [12]:
df3

Unnamed: 0,Review Text,Rating
0,Absolutely wonderful - silky and sexy and comf...,4
1,Love this dress! it's sooo pretty. i happene...,5
2,I had such high hopes for this dress and reall...,3
3,"I love, love, love this jumpsuit. it's fun, fl...",5
4,This shirt is very flattering to all due to th...,5
5,"I love tracy reese dresses, but this one is no...",2
6,I aded this in my basket at hte last mintue to...,5
7,"I ordered this in carbon for store pick up, an...",4
8,I love this dress. i usually get an xs but it ...,5
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5


In [13]:
intToStringRatingDict = {1:'vbad', 2:'bad', 3:'meh', 4:'good', 5:'vgood'}

In [14]:
df3['Rating'] = df3['Rating'].apply(lambda x: intToStringRatingDict[x])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [15]:
df3

Unnamed: 0,Review Text,Rating
0,Absolutely wonderful - silky and sexy and comf...,good
1,Love this dress! it's sooo pretty. i happene...,vgood
2,I had such high hopes for this dress and reall...,meh
3,"I love, love, love this jumpsuit. it's fun, fl...",vgood
4,This shirt is very flattering to all due to th...,vgood
5,"I love tracy reese dresses, but this one is no...",bad
6,I aded this in my basket at hte last mintue to...,vgood
7,"I ordered this in carbon for store pick up, an...",good
8,I love this dress. i usually get an xs but it ...,vgood
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",vgood


### Preprocessing the sentences

In [16]:
# take one review as a sample sentence
sampleReview = df3.iloc[1, 0]

In [17]:
sampleReview

'Love this dress!  it\'s sooo pretty.  i happened to find it in a store, and i\'m glad i did bc i never would have ordered it online bc it\'s petite.  i bought a petite and am 5\'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.'

In [18]:
def preprocess(sentence):
    outputSentence = sentence.lower()
    outputSentence = replaceContractions(outputSentence)
    outputSentence = removePunc(outputSentence)
    outputSentence = removeNumbers(outputSentence)
    
    return outputSentence

In [19]:
def replaceContractions(sentence):
    outputSentence = sentence
    outputSentence = outputSentence.replace("won't", "will not")
    outputSentence = outputSentence.replace("can\'t", "can not")
    outputSentence = outputSentence.replace("n\'t", " not")
    outputSentence = outputSentence.replace("\'re", " are")
    outputSentence = outputSentence.replace("\'s", " is")
    outputSentence = outputSentence.replace("\'d", " would")
    outputSentence = outputSentence.replace("\'ll", " will")
    outputSentence = outputSentence.replace("\'t", " not")
    outputSentence = outputSentence.replace("\'ve", " have")
    outputSentence = outputSentence.replace("\'m", " am")
    return outputSentence

In [20]:
def removePunc(sentence):
    removePuncTrans = str.maketrans("", "", string.punctuation)
    outputSentence = sentence.translate(removePuncTrans)
    return outputSentence

In [21]:
def removeNumbers(sentence):
    outputSentence = sentence
    removeDigitsTrans = str.maketrans('', '', string.digits)
    outputSentence = outputSentence.translate(removeDigitsTrans)
    return outputSentence

In [22]:
replaceContractions(sampleReview)

'Love this dress!  it is sooo pretty.  i happened to find it in a store, and i am glad i did bc i never would have ordered it online bc it is petite.  i bought a petite and am 5\'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.'

In [23]:
sampleReview

'Love this dress!  it\'s sooo pretty.  i happened to find it in a store, and i\'m glad i did bc i never would have ordered it online bc it\'s petite.  i bought a petite and am 5\'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.'

In [24]:
preprocess(sampleReview)

'love this dress  it is sooo pretty  i happened to find it in a store and i am glad i did bc i never would have ordered it online bc it is petite  i bought a petite and am   i love the length on me hits just a little below the knee  would definitely be a true midi on someone who is truly petite'
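
The order of the replacements matters: `won't` and `can't` have to be handled before the generic `n't` rule, otherwise they would turn into `wo not` and `ca not`. The sketch below is a compact variant of the same pipeline, not the notebook's exact functions. It uses a shortened contraction list and, as an extra step that the original does not do, collapses the double spaces left behind by removed punctuation.

```python
import string

# order matters: specific contractions first, generic suffixes after
CONTRACTIONS = [
    ("won't", "will not"),
    ("can't", "can not"),
    ("n't", " not"),
    ("'re", " are"),
    ("'s", " is"),
    ("'m", " am"),
]

def expand_contractions(sentence):
    for pattern, replacement in CONTRACTIONS:
        sentence = sentence.replace(pattern, replacement)
    return sentence

def clean(sentence):
    sentence = expand_contractions(sentence.lower())
    drop = str.maketrans("", "", string.punctuation + string.digits)
    # split() + join() collapses runs of whitespace into single spaces
    return " ".join(sentence.translate(drop).split())

cleaned = clean("It's 5'8\" and I won't return it, i'm glad!")
print(cleaned)
```

One caveat that applies to both versions: the `'s` rule also rewrites possessives, so `dress's` becomes `dress is`.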

In [25]:
df3

Unnamed: 0,Review Text,Rating
0,Absolutely wonderful - silky and sexy and comf...,good
1,Love this dress! it's sooo pretty. i happene...,vgood
2,I had such high hopes for this dress and reall...,meh
3,"I love, love, love this jumpsuit. it's fun, fl...",vgood
4,This shirt is very flattering to all due to th...,vgood
5,"I love tracy reese dresses, but this one is no...",bad
6,I aded this in my basket at hte last mintue to...,vgood
7,"I ordered this in carbon for store pick up, an...",good
8,I love this dress. i usually get an xs but it ...,vgood
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",vgood


In [26]:
df3['Review Text'] = df3['Review Text'].apply(preprocess)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [27]:
df3

Unnamed: 0,Review Text,Rating
0,absolutely wonderful silky and sexy and comfo...,good
1,love this dress it is sooo pretty i happened...,vgood
2,i had such high hopes for this dress and reall...,meh
3,i love love love this jumpsuit it is fun flirt...,vgood
4,this shirt is very flattering to all due to th...,vgood
5,i love tracy reese dresses but this one is not...,bad
6,i aded this in my basket at hte last mintue to...,vgood
7,i ordered this in carbon for store pick up and...,good
8,i love this dress i usually get an xs but it r...,vgood
9,i am and lbs i ordered the s petite to make ...,vgood


## word2vec preprocessing

Now load the GloVe vectors, converted to word2vec format earlier, into gensim.

We also need a function that takes a sentence and returns a stack of the vectors for its words, in order.

In [28]:
gloveModel = gensim.models.KeyedVectors.load_word2vec_format('gloveWord2vec.txt')

In [29]:
# try retrieving the GloVe vector of a word, using the 50-dim version
gloveModel.get_vector('test')

array([ 0.13175 , -0.25517 , -0.067915,  0.26193 , -0.26155 ,  0.23569 ,
        0.13077 , -0.011801,  1.7659  ,  0.20781 ,  0.26198 , -0.16428 ,
       -0.84642 ,  0.020094,  0.070176,  0.39778 ,  0.15278 , -0.20213 ,
       -1.6184  , -0.54327 , -0.17856 ,  0.53894 ,  0.49868 , -0.10171 ,
        0.66265 , -1.7051  ,  0.057193, -0.32405 , -0.66835 ,  0.26654 ,
        2.842   ,  0.26844 , -0.59537 , -0.5004  ,  1.5199  ,  0.039641,
        1.6659  ,  0.99758 , -0.5597  , -0.70493 , -0.0309  , -0.28302 ,
       -0.13564 ,  0.6429  ,  0.41491 ,  1.2362  ,  0.76587 ,  0.97798 ,
        0.58507 , -0.30176 ], dtype=float32)

In [30]:
len(gloveModel.vocab)

400000

In [31]:
def getSentenceWordStack(sentence, gensimModel):
    sentenceWordList = sentence.split()
    vectorList = []
    for word in sentenceWordList:
        # handle words that are not in the vocabulary of the model
        if word in gensimModel.vocab:
            vectorList.append(gensimModel.get_vector(word))
        else:
            vectorList.append(np.zeros(50)) # out-of-vocabulary word: use a 50-dim zero vector to match the 50-dim GloVe vectors
    stackedVector = np.array(vectorList)
    return stackedVector
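
A small self-contained sketch of the same stacking idea, using a plain dictionary in place of the gensim model so it runs without the GloVe file. `sentence_to_matrix` and `toy_vectors` are illustrative names, not part of the notebook. It differs from the function above in two ways: the zero vector is created as `float32`, and an empty sentence still returns a 2-D array.

```python
import numpy as np

def sentence_to_matrix(sentence, vectors, dim):
    """Stack one vector per word; unknown words become zero vectors."""
    zero = np.zeros(dim, dtype=np.float32)
    rows = [vectors.get(word, zero) for word in sentence.split()]
    if not rows:
        # keep a 2-D shape even for an empty sentence
        return np.zeros((0, dim), dtype=np.float32)
    return np.stack(rows).astype(np.float32)

# made-up 3-dim vectors standing in for GloVe
toy_vectors = {
    "love": np.array([0.1, 0.2, 0.3], dtype=np.float32),
    "dress": np.array([0.4, 0.5, 0.6], dtype=np.float32),
}
matrix = sentence_to_matrix("love thiss dress", toy_vectors, dim=3)
print(matrix.shape, matrix.dtype)
```

The dtype detail matters because `np.zeros(50)` is `float64` while the GloVe vectors are `float32`, so any review containing an unknown word is upcast to `float64` when stacked. That is probably why some of the printed matrices further down show many more decimal digits than others. Also note that, as far as I know, gensim 4 and later replaced `KeyedVectors.vocab` with `key_to_index`, so the membership test would need adjusting there.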

### Prepare the train, dev and test set

In [32]:
# create df4, which is a shuffled version of df3
df4 = df3.sample(frac=1).reset_index(drop=True)

In [33]:
df4.iloc[0, 0]

'i got these in the neutral color  quickly understood why they are categorized as loungewear if you are going to wear them out of the house wear a tunic or a dress over them i folded the scrunch part up over my calves  wore them like boot socks when i went out i got a ml  they fit about on par with other leggings  tights of that size range\n\nwhen i got home i rolled them down  put the bottoms over my heels the rest of the bottoms scrunched around my ankles i recommend them for around'

In [34]:
df3.iloc[0, 0]

'absolutely wonderful  silky and sexy and comfortable'

In [35]:
df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22641 entries, 0 to 22640
Data columns (total 2 columns):
Review Text    22641 non-null object
Rating         22641 non-null object
dtypes: object(2)
memory usage: 353.8+ KB


In [36]:
traindf = df4.iloc[:17000, :]

In [37]:
devdf = df4.iloc[17000:20000, :]

In [38]:
testdf = df4.iloc[20000:, :]
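
The shuffle above is not seeded, so the three sets change on every run. A minimal sketch of a reproducible 17000 / 3000 / remainder split over row positions, assuming a fixed seed is wanted; the seed value and the variable names are arbitrary choices for illustration:

```python
import numpy as np

n_reviews = 22641                      # row count reported by df4.info()
rng = np.random.default_rng(seed=0)    # the seed value is an arbitrary choice
order = rng.permutation(n_reviews)     # shuffled row positions

train_idx = order[:17000]
dev_idx = order[17000:20000]
test_idx = order[20000:]
print(len(train_idx), len(dev_idx), len(test_idx))
```

With pandas, passing `random_state` to `sample` should achieve the same reproducibility while keeping the rest of the cells unchanged.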

In [39]:
traindf

Unnamed: 0,Review Text,Rating
0,i got these in the neutral color quickly unde...,good
1,this top is very pretty i ordered both xs p an...,vgood
2,visions of sartorial splendor danced in my hea...,vbad
3,i just received this in the mail today first o...,bad
4,i love how pretty and soft this shirt is it fa...,vgood
5,i just ordered this top and i love it it is a ...,vgood
6,when i saw this dress online i immediately pas...,good
7,great stylish swinging dress softly draped fab...,vgood
8,the sweater is very soft it has an interesting...,vgood
9,i purchased the white in the store in a size ...,good


In [40]:
devdf

Unnamed: 0,Review Text,Rating
17000,this is adorable light and breezy i felt it fi...,vgood
17001,i love this shirt so much i bought it nearly f...,vgood
17002,i bought this tank in the blue popsicle print ...,vgood
17003,i just purchased this velvet top in my regular...,vgood
17004,my favorite leggings they are feminine and per...,vgood
17005,i received this dress as a birthday gift and a...,vgood
17006,i tried on this dress in the store and absolut...,vgood
17007,i purchased this jacket in red i have several ...,good
17008,this is a great pair of trousers for work but ...,vgood
17009,this is such a cute design but it is so short ...,meh


In [41]:
testdf

Unnamed: 0,Review Text,Rating
20000,i love this tunic with that said the reason i ...,good
20001,i like the dress on line color flowers cut etc...,good
20002,i tried on a returned p in my store they were ...,good
20003,as stated by others this is the type of unique...,good
20004,i wanted to love this top i really did the col...,meh
20005,this top is very cuteon other people i ordered...,good
20006,i love this dress for fall it is slightly shee...,vgood
20007,i bought this top in white and my usual size l...,good
20008,i have got the postbaby bulge and this dress m...,vgood
20009,this sweater is fine for the casual days i bou...,meh


An RNN can accept input of arbitrary length, because the recurrent units unfold in time. During training, however, every sequence within a batch must have the same length. Here I try a batch size of 1, so that no padding is needed. This is definitely not the best approach, just an experiment: single-sample updates are noisy and could make training unstable.
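
For comparison, this is roughly what the padding alternative looks like: every sequence in a batch is zero-padded to the length of the longest one, and a boolean mask records which time steps are real. This is a sketch in plain NumPy, not what the notebook does; `pad_batch` and the toy sequences are made up for illustration.

```python
import numpy as np

def pad_batch(sequences, dim):
    """Zero-pad (length, dim) arrays to a common length and return a mask."""
    max_len = max(len(seq) for seq in sequences)
    batch = np.zeros((len(sequences), max_len, dim), dtype=np.float32)
    mask = np.zeros((len(sequences), max_len), dtype=bool)
    for i, seq in enumerate(sequences):
        batch[i, :len(seq)] = seq      # real time steps
        mask[i, :len(seq)] = True      # everything after this stays padding
    return batch, mask

# two toy "reviews" of 2 and 5 words, each word a 50-dim vector
toy = [np.ones((2, 50), dtype=np.float32), np.ones((5, 50), dtype=np.float32)]
batch, mask = pad_batch(toy, dim=50)
print(batch.shape, mask.sum(axis=1))
```

The mask is what lets a model ignore the padded steps; Keras offers a `Masking` layer for that purpose, although wiring it in is outside this sketch.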

In [42]:
traindf['Review Text'] = traindf['Review Text'].apply(lambda x: getSentenceWordStack(x, gloveModel))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [43]:
# now we have converted all sentences into stacks of word vectors
traindf

Unnamed: 0,Review Text,Rating
0,"[[0.11891, 0.15255, -0.082073, -0.74144, 0.759...",good
1,"[[0.5307400226593018, 0.4011699855327606, -0.4...",vgood
2,"[[1.1139, 0.86435, -0.47556, -0.33697, 0.08507...",vbad
3,"[[0.11891, 0.15255, -0.082073, -0.74144, 0.759...",bad
4,"[[0.11891, 0.15255, -0.082073, -0.74144, 0.759...",vgood
5,"[[0.11891, 0.15255, -0.082073, -0.74144, 0.759...",vgood
6,"[[0.27062, -0.36596, 0.097193, -0.50708, 0.373...",good
7,"[[-0.0265670008957386, 1.3357000350952148, -1....",vgood
8,"[[0.418, 0.24968, -0.41242, 0.1217, 0.34527, -...",vgood
9,"[[0.11891, 0.15255, -0.082073, -0.74144, 0.759...",good


In [44]:
devdf['Review Text'] = devdf['Review Text'].apply(lambda x: getSentenceWordStack(x, gloveModel))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [45]:
testdf['Review Text'] = testdf['Review Text'].apply(lambda x: getSentenceWordStack(x, gloveModel))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [46]:
# build one dictionary that maps each label class to its one-hot index, and another for the reverse mapping
labelToOneHotIndex = {}
oneHotMaxToLabel = {}
possibleLabel = ['good', 'vgood', 'meh', 'vbad', 'bad']
for i in range(len(possibleLabel)):
    label = possibleLabel[i]
    oneHotMaxToLabel[i] = label
    labelToOneHotIndex[label] = i

In [47]:
def getOneHotVector(index):
    # one-hot vector over the 5 rating classes
    outputVector = np.zeros(5)
    outputVector[index] = outputVector[index] + 1
    return outputVector

In [48]:
labelToOneHotIndex

{'bad': 4, 'good': 0, 'meh': 2, 'vbad': 3, 'vgood': 1}

In [49]:
oneHotMaxToLabel

{0: 'good', 1: 'vgood', 2: 'meh', 3: 'vbad', 4: 'bad'}
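
The two dictionaries and the one-hot helper can be written more compactly. The sketch below keeps the same label order as `possibleLabel`; the snake_case names are illustrative, and `np.eye` is used as a lookup table in which row `i` is the one-hot vector for index `i`.

```python
import numpy as np

labels = ['good', 'vgood', 'meh', 'vbad', 'bad']   # same order as possibleLabel
label_to_index = {label: i for i, label in enumerate(labels)}
index_to_label = dict(enumerate(labels))

one_hot_table = np.eye(len(labels))                # row i is the one-hot vector for index i
vbad_vector = one_hot_table[label_to_index['vbad']]
print(vbad_vector)
print(index_to_label[int(np.argmax(vbad_vector))])
```

Note that the indices follow the order of the list, not the order of the ratings, so index 0 is 'good' rather than 'vbad'. That is fine for a softmax classifier, which treats the classes as unordered.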

### Now a function that builds the model

In [50]:
traindf.iloc[0, 0].shape

(97, 50)

In [51]:
x_train = traindf.iloc[:, 0].values

In [52]:
y_train = traindf.iloc[:, 1].apply(lambda x: labelToOneHotIndex[x])

In [53]:
y_train = y_train.apply(getOneHotVector)

In [54]:
y_train

0        [1.0, 0.0, 0.0, 0.0, 0.0]
1        [0.0, 1.0, 0.0, 0.0, 0.0]
2        [0.0, 0.0, 0.0, 1.0, 0.0]
3        [0.0, 0.0, 0.0, 0.0, 1.0]
4        [0.0, 1.0, 0.0, 0.0, 0.0]
5        [0.0, 1.0, 0.0, 0.0, 0.0]
6        [1.0, 0.0, 0.0, 0.0, 0.0]
7        [0.0, 1.0, 0.0, 0.0, 0.0]
8        [0.0, 1.0, 0.0, 0.0, 0.0]
9        [1.0, 0.0, 0.0, 0.0, 0.0]
10       [0.0, 1.0, 0.0, 0.0, 0.0]
11       [0.0, 1.0, 0.0, 0.0, 0.0]
12       [0.0, 0.0, 0.0, 1.0, 0.0]
13       [0.0, 1.0, 0.0, 0.0, 0.0]
14       [0.0, 0.0, 0.0, 1.0, 0.0]
15       [0.0, 1.0, 0.0, 0.0, 0.0]
16       [0.0, 1.0, 0.0, 0.0, 0.0]
17       [0.0, 1.0, 0.0, 0.0, 0.0]
18       [1.0, 0.0, 0.0, 0.0, 0.0]
19       [0.0, 0.0, 1.0, 0.0, 0.0]
20       [1.0, 0.0, 0.0, 0.0, 0.0]
21       [0.0, 0.0, 0.0, 0.0, 1.0]
22       [0.0, 0.0, 1.0, 0.0, 0.0]
23       [0.0, 0.0, 0.0, 1.0, 0.0]
24       [0.0, 0.0, 1.0, 0.0, 0.0]
25       [0.0, 1.0, 0.0, 0.0, 0.0]
26       [0.0, 0.0, 1.0, 0.0, 0.0]
27       [0.0, 1.0, 0.0, 0.0, 0.0]
28       [0.0, 1.0, 

In [55]:
y_train_list = list(y_train.values)

In [56]:
type(x_train)

numpy.ndarray

In [57]:
x_train.shape

(17000,)

In [58]:
x_train[0]

array([[ 0.11891 ,  0.15255 , -0.082073, ..., -0.57512 , -0.26671 ,
         0.92121 ],
       [-0.4097  , -0.37167 ,  0.38852 , ..., -0.25414 ,  0.040372,
         0.38652 ],
       [ 1.0074  ,  0.18912 , -0.11732 , ...,  0.12912 , -0.3995  ,
        -0.25768 ],
       ...,
       [ 0.64642 , -0.556   ,  0.47038 , ..., -0.35831 , -0.10995 ,
        -0.447   ],
       [ 0.15272 ,  0.36181 , -0.22168 , ...,  0.43382 , -0.084617,
         0.1214  ],
       [ 0.77604 ,  0.22584 ,  0.45044 , ..., -0.97437 , -0.78565 ,
        -0.8177  ]], dtype=float32)

In [59]:
x_train_list = list(x_train)

In [60]:
x_train_list[0]

array([[ 0.11891 ,  0.15255 , -0.082073, ..., -0.57512 , -0.26671 ,
         0.92121 ],
       [-0.4097  , -0.37167 ,  0.38852 , ..., -0.25414 ,  0.040372,
         0.38652 ],
       [ 1.0074  ,  0.18912 , -0.11732 , ...,  0.12912 , -0.3995  ,
        -0.25768 ],
       ...,
       [ 0.64642 , -0.556   ,  0.47038 , ..., -0.35831 , -0.10995 ,
        -0.447   ],
       [ 0.15272 ,  0.36181 , -0.22168 , ...,  0.43382 , -0.084617,
         0.1214  ],
       [ 0.77604 ,  0.22584 ,  0.45044 , ..., -0.97437 , -0.78565 ,
        -0.8177  ]], dtype=float32)

In [61]:
def buildModel(inputShape):
    
    x_input = tf.keras.layers.Input(inputShape)
    
    x = tf.keras.layers.GRU(units=15, dropout=0.2, recurrent_dropout=0.2, return_sequences=True, name='GRU1')(x_input)
    
    x = tf.keras.layers.GRU(units=7, dropout=0.2, recurrent_dropout=0.2, return_sequences=False, name='GRU2')(x)
    
    x = tf.keras.layers.Dense(units=5, activation='softmax', name='softmax_output')(x)
    
    model = tf.keras.Model(inputs=x_input, outputs=x, name='rnn_glove')
    
    return model

In [62]:
rnnModel = buildModel([None, 50]) # arbitrary-length sequences of 50-dim word vectors

In [63]:
rnnModel.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [64]:
rnnModel.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, None, 50)          0         
_________________________________________________________________
GRU1 (GRU)                   (None, None, 15)          2970      
_________________________________________________________________
GRU2 (GRU)                   (None, 7)                 483       
_________________________________________________________________
softmax_output (Dense)       (None, 5)                 40        
Total params: 3,493
Trainable params: 3,493
Non-trainable params: 0
_________________________________________________________________
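
The parameter counts in the summary can be checked by hand. Each GRU has three gates, and each gate has an input kernel, a recurrent kernel and a bias vector. The helper names below are illustrative, and the formula assumes the single-bias GRU variant, which is the one that reproduces the numbers printed above.

```python
def gru_params(input_dim, units):
    # 3 gates, each with an input kernel, a recurrent kernel and one bias vector
    return 3 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units

gru1 = gru_params(50, 15)    # 50-dim word vectors into 15 units
gru2 = gru_params(15, 7)     # 15-dim GRU1 outputs into 7 units
dense = dense_params(7, 5)   # 7-dim GRU2 output into 5 classes
print(gru1, gru2, dense, gru1 + gru2 + dense)
```

As far as I know, newer Keras versions default to a GRU variant with a second bias vector per gate (`reset_after=True`), which would give slightly larger counts for the same layer sizes.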


In [66]:
manualEpoch = 2

In [68]:
for j in range(manualEpoch):
    print('Epoch: ', j)
    
    for i in range(len(x_train_list)):
        if i % 1000 == 0:
            print(i)
        inputMatrix = x_train_list[i].reshape(1, -1, 50)
        labelMatrix = y_train_list[i].reshape(1, -1)
        rnnModel.fit(x=inputMatrix, y=labelMatrix, batch_size=1, epochs=1, verbose=0)

Epoch:  0
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
14000
15000
16000
Epoch:  1
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
14000
15000
16000


In [69]:
tf.keras.models.save_model(rnnModel, 'rnnModel.h5')

In [108]:
# using a batch size of 1 because I just wanted to try variable-size input without having to pad each batch
rnnModel.fit(x=x_train_list[0].reshape(1, -1, 50), y=y_train[0].reshape(1, -1), epochs=20, batch_size=1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras._impl.keras.callbacks.History at 0x7fbf277fe710>

## Loading the model back for evaluation

In [65]:
trainedModel = tf.keras.models.load_model('rnnModel.h5')

In [67]:
x_dev = devdf.iloc[:,0].values

In [68]:
type(x_dev)

numpy.ndarray

In [74]:
x_dev_list = list(x_dev)

In [81]:
y_dev = devdf.iloc[:,1].apply(lambda x: labelToOneHotIndex[x])

In [82]:
y_dev

17000    1
17001    1
17002    1
17003    1
17004    1
17005    1
17006    1
17007    0
17008    1
17009    2
17010    2
17011    1
17012    0
17013    1
17014    2
17015    1
17016    1
17017    1
17018    1
17019    0
17020    1
17021    1
17022    1
17023    1
17024    1
17025    1
17026    1
17027    1
17028    2
17029    1
        ..
19970    0
19971    4
19972    1
19973    1
19974    1
19975    1
19976    1
19977    0
19978    4
19979    1
19980    4
19981    1
19982    2
19983    1
19984    2
19985    1
19986    0
19987    0
19988    0
19989    0
19990    1
19991    1
19992    2
19993    0
19994    1
19995    3
19996    4
19997    1
19998    2
19999    1
Name: Rating, Length: 3000, dtype: int64

In [83]:
y_dev_list = list(y_dev.values)

In [66]:
devdf

Unnamed: 0,Review Text,Rating
17000,"[[0.53074, 0.40117, -0.40785, 0.15444, 0.47782...",vgood
17001,"[[0.11890999972820282, 0.15254999697208405, -0...",vgood
17002,"[[0.11890999972820282, 0.15254999697208405, -0...",vgood
17003,"[[0.11890999972820282, 0.15254999697208405, -0...",vgood
17004,"[[-0.27279, 0.77515, -0.10181, -0.9166, 0.9047...",vgood
17005,"[[0.11891, 0.15255, -0.082073, -0.74144, 0.759...",vgood
17006,"[[0.11891, 0.15255, -0.082073, -0.74144, 0.759...",vgood
17007,"[[0.11890999972820282, 0.15254999697208405, -0...",good
17008,"[[0.5307400226593018, 0.4011699855327606, -0.4...",vgood
17009,"[[0.53074, 0.40117, -0.40785, 0.15444, 0.47782...",meh


In [76]:
len(x_dev_list)

3000

In [80]:
x_dev[4].shape

(33, 50)

In [85]:
correctCount = 0
for i in range(len(x_dev_list)):
    inputMatrix = x_dev_list[i].reshape(1, -1, 50)
    correctLabelIndex = y_dev_list[i]
    result = np.argmax(trainedModel.predict(inputMatrix))
    if result == correctLabelIndex:
        correctCount = correctCount + 1
    if i % 1000 == 0:
        print(i)
print('percentage correct: ', float(correctCount)/len(x_dev_list))

0
1000
2000
percentage correct:  0.615


In [86]:
testdf

Unnamed: 0,Review Text,Rating
20000,"[[0.11890999972820282, 0.15254999697208405, -0...",good
20001,"[[0.11890999972820282, 0.15254999697208405, -0...",good
20002,"[[0.11890999972820282, 0.15254999697208405, -0...",good
20003,"[[0.20782, 0.12713, -0.30188, -0.23125, 0.3017...",good
20004,"[[0.11891, 0.15255, -0.082073, -0.74144, 0.759...",meh
20005,"[[0.5307400226593018, 0.4011699855327606, -0.4...",good
20006,"[[0.11891, 0.15255, -0.082073, -0.74144, 0.759...",vgood
20007,"[[0.11891, 0.15255, -0.082073, -0.74144, 0.759...",good
20008,"[[0.11890999972820282, 0.15254999697208405, -0...",vgood
20009,"[[0.53074, 0.40117, -0.40785, 0.15444, 0.47782...",meh


In [87]:
y_test = testdf.iloc[:, 1].apply(lambda x: labelToOneHotIndex[x])

In [90]:
y_test.describe()

count    2641.000000
mean        1.191215
std         1.039091
min         0.000000
25%         1.000000
50%         1.000000
75%         1.000000
max         4.000000
Name: Rating, dtype: float64

In [93]:
y_test[y_test == 1].count()

1468

In [94]:
y_test[y_test == 2].count()

322

In [95]:
y_test[y_test == 3].count()

98

In [96]:
y_test[y_test == 4].count()

185

In [103]:
y_test[y_test == 0].count()

568
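
The five counts above show how imbalanced the test set is. A short sketch that puts them together and computes the accuracy of always predicting the most common class, which is a useful reference point for figures such as the 0.615 on the dev set. The counts are copied from the cells above; the variable names are illustrative.

```python
# counts copied from the cells above, keyed by one-hot index
test_counts = {0: 568, 1: 1468, 2: 322, 3: 98, 4: 185}
index_to_label = {0: 'good', 1: 'vgood', 2: 'meh', 3: 'vbad', 4: 'bad'}

total = sum(test_counts.values())
majority_index = max(test_counts, key=test_counts.get)
majority_baseline = test_counts[majority_index] / total

print(total)
print(index_to_label[majority_index], round(majority_baseline, 3))
```

A classifier that always answers 'vgood' would therefore be right on roughly 56% of the test reviews. In pandas, `y_test.value_counts()` should give the same counts in a single call.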

In [104]:
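# despite the name, this is a class-balanced subsample: at most 98 reviews per rating (98 is the size of the smallest class, 'vbad')
stratifiedTestDf = testdf.groupby('Rating', group_keys=False).apply(lambda x: x.sample(min(len(x), 98)))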

In [110]:
# shuffle the rows and reset the index
stratifiedTestDf = stratifiedTestDf.sample(frac=1).reset_index(drop=True)

In [111]:
x_test_strat = stratifiedTestDf.iloc[:, 0].values

In [112]:
y_test_strat = stratifiedTestDf.iloc[:, 1].apply(lambda x : labelToOneHotIndex[x])

In [113]:
y_test_strat

0      1
1      3
2      2
3      3
4      4
5      1
6      4
7      2
8      0
9      3
10     2
11     0
12     0
13     0
14     3
15     2
16     3
17     1
18     4
19     4
20     0
21     0
22     2
23     0
24     2
25     0
26     4
27     2
28     3
29     1
      ..
460    3
461    3
462    1
463    4
464    3
465    4
466    1
467    3
468    1
469    1
470    1
471    0
472    4
473    0
474    1
475    4
476    0
477    2
478    4
479    2
480    2
481    4
482    1
483    4
484    2
485    2
486    4
487    2
488    4
489    1
Name: Rating, Length: 490, dtype: int64

In [114]:
correctCount = 0
for i in range(len(x_test_strat)):
    inputMatrix = x_test_strat[i].reshape(1, -1, 50)
    correctLabelIndex = y_test_strat[i]
    result = np.argmax(trainedModel.predict(inputMatrix))
    if result == correctLabelIndex:
        correctCount = correctCount + 1
    if i % 1000 == 0:
        print(i)
print('percentage correct: ', float(correctCount)/len(x_test_strat))

0
percentage correct:  0.3224489795918367


In [116]:
x_test_list = testdf.iloc[:,0].tolist()

In [119]:
y_test_list = list(y_test)

In [120]:
correctCount = 0
for i in range(len(x_test_list)):
    inputMatrix = x_test_list[i].reshape(1, -1, 50)
    correctLabelIndex = y_test_list[i]
    result = np.argmax(trainedModel.predict(inputMatrix))
    if result == correctLabelIndex:
        correctCount = correctCount + 1
    if i % 1000 == 0:
        print(i)
print('percentage correct: ', float(correctCount)/len(x_test_list))

0
1000
2000
percentage correct:  0.6001514577811435


## Should see what happens if we train it for another couple of epochs