# Using GloVe with an RNN

This notebook uses the GloVe 6B-token pre-trained word vectors.
GloVe: https://nlp.stanford.edu/projects/glove/

The vectors are loaded with gensim, after being converted to gensim's word2vec format.

Download the dataset from Kaggle: https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews
Extract it into a `dataset` folder in the same directory as this notebook.

In [1]:
import gensim
import numpy as np
import tensorflow as tf
import pandas as pd
import string
import nltk
import re


In [2]:
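# Convert the GloVe file to gensim's word2vec format (run once, then left commented out)
# from gensim.scripts.glove2word2vec import glove2word2vec
# glove2word2vec('glove.6B.50d.txt', 'gloveWord2vec.txt')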
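
The word2vec text format differs from the GloVe text format only by a header line that gives the vocabulary size and the vector dimension. A minimal standard-library sketch of that conversion, for illustration only (the function name `glove_to_word2vec_text` and the toy file contents are assumptions, not part of gensim; it reads the whole file into memory):

```python
import os
import tempfile


def glove_to_word2vec_text(glove_path, output_path):
    """Copy a GloVe text file to output_path, prepending a 'count dim' header."""
    with open(glove_path, encoding="utf-8") as src:
        lines = [line.rstrip("\n") for line in src if line.strip()]
    # Each GloVe line has the form: word v1 v2 ... vN
    dim = len(lines[0].split(" ")) - 1
    with open(output_path, "w", encoding="utf-8") as dst:
        dst.write(f"{len(lines)} {dim}\n")
        for line in lines:
            dst.write(line + "\n")
    return len(lines), dim


# Toy two-word "GloVe" file with 3-dimensional vectors (made-up values).
tmp_dir = tempfile.mkdtemp()
toy_glove = os.path.join(tmp_dir, "toy_glove.txt")
toy_w2v = os.path.join(tmp_dir, "toy_w2v.txt")
with open(toy_glove, "w", encoding="utf-8") as f:
    f.write("dress 0.1 0.2 0.3\nshirt 0.4 0.5 0.6\n")

count, dim = glove_to_word2vec_text(toy_glove, toy_w2v)
with open(toy_w2v, encoding="utf-8") as f:
    header = f.readline().strip()
```

The header line is what lets `load_word2vec_format` know how many rows and columns to expect before it reads the vectors.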

In [3]:
df = pd.read_csv('./dataset/Womens Clothing E-Commerce Reviews.csv', index_col=0)

In [4]:
df

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses
5,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses
6,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,5,1,1,General Petite,Tops,Knits
7,858,39,"Shimmer, surprisingly goes with lots","I ordered this in carbon for store pick up, an...",4,1,4,General Petite,Tops,Knits
8,1077,24,Flattering,I love this dress. i usually get an xs but it ...,5,1,0,General,Dresses,Dresses
9,1077,34,Such a fun dress!,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,1,0,General,Dresses,Dresses


In [5]:
df2 = df.copy(deep=True)

In [6]:
df2.iloc[1, 3]

'Love this dress!  it\'s sooo pretty.  i happened to find it in a store, and i\'m glad i did bc i never would have ordered it online bc it\'s petite.  i bought a petite and am 5\'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.'

In [7]:
df2 = df2[['Review Text', 'Rating', 'Recommended IND']]

In [8]:
df2

Unnamed: 0,Review Text,Rating,Recommended IND
0,Absolutely wonderful - silky and sexy and comf...,4,1
1,Love this dress! it's sooo pretty. i happene...,5,1
2,I had such high hopes for this dress and reall...,3,0
3,"I love, love, love this jumpsuit. it's fun, fl...",5,1
4,This shirt is very flattering to all due to th...,5,1
5,"I love tracy reese dresses, but this one is no...",2,0
6,I aded this in my basket at hte last mintue to...,5,1
7,"I ordered this in carbon for store pick up, an...",4,1
8,I love this dress. i usually get an xs but it ...,5,1
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,1


In [9]:
df2.dropna(how='any', inplace=True)

In [10]:
df2

Unnamed: 0,Review Text,Rating,Recommended IND
0,Absolutely wonderful - silky and sexy and comf...,4,1
1,Love this dress! it's sooo pretty. i happene...,5,1
2,I had such high hopes for this dress and reall...,3,0
3,"I love, love, love this jumpsuit. it's fun, fl...",5,1
4,This shirt is very flattering to all due to th...,5,1
5,"I love tracy reese dresses, but this one is no...",2,0
6,I aded this in my basket at hte last mintue to...,5,1
7,"I ordered this in carbon for store pick up, an...",4,1
8,I love this dress. i usually get an xs but it ...,5,1
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,1


In [11]:
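# .copy() makes df3 an independent DataFrame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning
df3 = df2[['Review Text', 'Rating']].copy()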

In [12]:
df3

Unnamed: 0,Review Text,Rating
0,Absolutely wonderful - silky and sexy and comf...,4
1,Love this dress! it's sooo pretty. i happene...,5
2,I had such high hopes for this dress and reall...,3
3,"I love, love, love this jumpsuit. it's fun, fl...",5
4,This shirt is very flattering to all due to th...,5
5,"I love tracy reese dresses, but this one is no...",2
6,I aded this in my basket at hte last mintue to...,5
7,"I ordered this in carbon for store pick up, an...",4
8,I love this dress. i usually get an xs but it ...,5
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5


In [13]:
intToStringRatingDict = {1:'vbad', 2:'bad', 3:'meh', 4:'good', 5:'vgood'}

In [14]:
df3['Rating'] = df3['Rating'].apply(lambda x: intToStringRatingDict[x])



In [15]:
df3

Unnamed: 0,Review Text,Rating
0,Absolutely wonderful - silky and sexy and comf...,good
1,Love this dress! it's sooo pretty. i happene...,vgood
2,I had such high hopes for this dress and reall...,meh
3,"I love, love, love this jumpsuit. it's fun, fl...",vgood
4,This shirt is very flattering to all due to th...,vgood
5,"I love tracy reese dresses, but this one is no...",bad
6,I aded this in my basket at hte last mintue to...,vgood
7,"I ordered this in carbon for store pick up, an...",good
8,I love this dress. i usually get an xs but it ...,vgood
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",vgood


### Preprocessing the sentences

In [16]:
# taking one as a sample sentence
sampleReview = df3.iloc[1, 0]

In [17]:
sampleReview

'Love this dress!  it\'s sooo pretty.  i happened to find it in a store, and i\'m glad i did bc i never would have ordered it online bc it\'s petite.  i bought a petite and am 5\'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.'

In [18]:
def preprocess(sentence):
    outputSentence = sentence.lower()
    outputSentence = replaceContractions(outputSentence)
    outputSentence = removePunc(outputSentence)
    outputSentence = removeNumbers(outputSentence)
    
    return outputSentence

In [19]:
def replaceContractions(sentence):
    """Expand common English contractions using plain string replacement."""
    outputSentence = sentence
    outputSentence = outputSentence.replace("won't", "will not")
    outputSentence = outputSentence.replace("can't", "can not")
    outputSentence = outputSentence.replace("n't", " not")
    outputSentence = outputSentence.replace("'re", " are")
    # Note: this rule also rewrites possessives ("dress's" becomes "dress is").
    outputSentence = outputSentence.replace("'s", " is")
    # Note: "'d" can also stand for "had"; it is always expanded to "would" here.
    outputSentence = outputSentence.replace("'d", " would")
    outputSentence = outputSentence.replace("'ll", " will")
    outputSentence = outputSentence.replace("'t", " not")
    outputSentence = outputSentence.replace("'ve", " have")
    outputSentence = outputSentence.replace("'m", " am")
    return outputSentence
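
The order of the replacements matters: the specific forms `won't` and `can't` have to be handled before the generic `n't` rule, otherwise `won't` would become `wo not`. A small sketch of the same idea driven by an ordered list of pairs (the names `CONTRACTION_RULES` and `expand_contractions` are illustrative and not from the notebook):

```python
# Ordered (pattern, replacement) pairs: specific forms first, generic suffixes last.
CONTRACTION_RULES = [
    ("won't", "will not"),
    ("can't", "can not"),
    ("n't", " not"),
    ("'re", " are"),
    ("'s", " is"),
    ("'d", " would"),
    ("'ll", " will"),
    ("'ve", " have"),
    ("'m", " am"),
]


def expand_contractions(sentence, rules=CONTRACTION_RULES):
    """Apply each replacement in order and return the expanded sentence."""
    for pattern, replacement in rules:
        sentence = sentence.replace(pattern, replacement)
    return sentence


in_order = expand_contractions("i won't say it isn't pretty")
# Applying the generic rule first shows why the order matters.
generic_first = expand_contractions(
    "i won't", rules=[("n't", " not"), ("won't", "will not")]
)
```

Keeping the rules in one list makes the ordering explicit and easy to extend.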

In [20]:
def removePunc(sentence):
    removePuncTrans = str.maketrans("", "", string.punctuation)
    outputSentence = sentence.translate(removePuncTrans)
    return outputSentence
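
Deleting punctuation outright joins words that were separated only by a punctuation mark, which may produce tokens that are not in the embedding vocabulary. One possible alternative, shown here only as a sketch, maps each punctuation character to a space instead (the variable names and the sample text are made up):

```python
import string

# Deletes punctuation outright (the behaviour of removePunc in the notebook).
deleteTable = str.maketrans("", "", string.punctuation)
# Alternative: map every punctuation character to a space instead.
spaceTable = str.maketrans(string.punctuation, " " * len(string.punctuation))

text = "well-made,soft fabric"
deleted = text.translate(deleteTable)
spaced = text.translate(spaceTable)
```

Whether this is preferable depends on the data: it keeps `well` and `made` as separate tokens, but it would also split contractions if they have not been expanded first.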

In [21]:
def removeNumbers(sentence):
    outputSentence = sentence
    removeDigitsTrans = str.maketrans('', '', string.digits)
    outputSentence = outputSentence.translate(removeDigitsTrans)
    return outputSentence

In [22]:
replaceContractions(sampleReview)

'Love this dress!  it is sooo pretty.  i happened to find it in a store, and i am glad i did bc i never would have ordered it online bc it is petite.  i bought a petite and am 5\'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.'

In [23]:
sampleReview

'Love this dress!  it\'s sooo pretty.  i happened to find it in a store, and i\'m glad i did bc i never would have ordered it online bc it\'s petite.  i bought a petite and am 5\'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.'

In [24]:
preprocess(sampleReview)

'love this dress  it is sooo pretty  i happened to find it in a store and i am glad i did bc i never would have ordered it online bc it is petite  i bought a petite and am   i love the length on me hits just a little below the knee  would definitely be a true midi on someone who is truly petite'
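
The result above keeps runs of spaces wherever punctuation or digits were removed. `str.split()` ignores them when tokenizing, but if a normalized string is preferred, a whitespace-collapsing step could be appended to `preprocess`. A possible sketch using the `re` module imported earlier (the helper name `collapseWhitespace` is an assumption, not part of the notebook):

```python
import re


def collapseWhitespace(sentence):
    """Replace every run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", sentence).strip()


sample = "i bought a petite and am   i love the length on me"
collapsed = collapseWhitespace(sample)
tokens = collapsed.split()
```

The token list is the same with or without this step, so it only affects how the cleaned text looks when displayed or stored.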

In [25]:
df3

Unnamed: 0,Review Text,Rating
0,Absolutely wonderful - silky and sexy and comf...,good
1,Love this dress! it's sooo pretty. i happene...,vgood
2,I had such high hopes for this dress and reall...,meh
3,"I love, love, love this jumpsuit. it's fun, fl...",vgood
4,This shirt is very flattering to all due to th...,vgood
5,"I love tracy reese dresses, but this one is no...",bad
6,I aded this in my basket at hte last mintue to...,vgood
7,"I ordered this in carbon for store pick up, an...",good
8,I love this dress. i usually get an xs but it ...,vgood
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",vgood


In [26]:
df3['Review Text'] = df3['Review Text'].apply(preprocess)



In [27]:
df3

Unnamed: 0,Review Text,Rating
0,absolutely wonderful silky and sexy and comfo...,good
1,love this dress it is sooo pretty i happened...,vgood
2,i had such high hopes for this dress and reall...,meh
3,i love love love this jumpsuit it is fun flirt...,vgood
4,this shirt is very flattering to all due to th...,vgood
5,i love tracy reese dresses but this one is not...,bad
6,i aded this in my basket at hte last mintue to...,vgood
7,i ordered this in carbon for store pick up and...,good
8,i love this dress i usually get an xs but it r...,vgood
9,i am and lbs i ordered the s petite to make ...,vgood


## Word2vec preprocessing

Next, load the GloVe vectors (converted to word2vec format earlier) into gensim.

We also need a function that takes a sentence and returns a stack of the vectors for its words, in order.

In [28]:
gloveModel = gensim.models.KeyedVectors.load_word2vec_format('gloveWord2vec.txt')

In [32]:
# try retrieving glove vector of word, using 50 dim version
gloveModel.get_vector('test')

array([ 0.13175 , -0.25517 , -0.067915,  0.26193 , -0.26155 ,  0.23569 ,
        0.13077 , -0.011801,  1.7659  ,  0.20781 ,  0.26198 , -0.16428 ,
       -0.84642 ,  0.020094,  0.070176,  0.39778 ,  0.15278 , -0.20213 ,
       -1.6184  , -0.54327 , -0.17856 ,  0.53894 ,  0.49868 , -0.10171 ,
        0.66265 , -1.7051  ,  0.057193, -0.32405 , -0.66835 ,  0.26654 ,
        2.842   ,  0.26844 , -0.59537 , -0.5004  ,  1.5199  ,  0.039641,
        1.6659  ,  0.99758 , -0.5597  , -0.70493 , -0.0309  , -0.28302 ,
       -0.13564 ,  0.6429  ,  0.41491 ,  1.2362  ,  0.76587 ,  0.97798 ,
        0.58507 , -0.30176 ], dtype=float32)

In [44]:
# gensim < 4.0; in gensim 4.x use len(gloveModel.key_to_index)
len(gloveModel.vocab)

400000
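
Even with 400,000 entries, the vocabulary may not cover every token in the reviews, since misspellings and informal abbreviations can be missing. A rough way to measure this before building vectors is to count the tokens that are not found. The sketch below uses a small Python set as a stand-in for the model vocabulary; `vocabularyCoverage` and the toy data are hypothetical:

```python
from collections import Counter


def vocabularyCoverage(sentences, vocab):
    """Return (fraction of tokens found in vocab, Counter of missing tokens)."""
    total = 0
    missing = Counter()
    for sentence in sentences:
        for word in sentence.split():
            total += 1
            if word not in vocab:
                missing[word] += 1
    found = total - sum(missing.values())
    coverage = found / total if total else 0.0
    return coverage, missing


# Stand-in vocabulary; with the real model one would pass its vocabulary
# mapping (gloveModel.vocab in this notebook).
toyVocab = {"i", "love", "this", "dress", "it", "is", "pretty"}
toyReviews = ["i love this dress", "it is sooo pretty"]
coverage, missing = vocabularyCoverage(toyReviews, toyVocab)
```

Inspecting the most frequent missing tokens can suggest extra preprocessing rules, such as spelling fixes, before falling back to zero vectors.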

In [53]:
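def getSentenceWordStack(sentence, gensimModel):
    """Return an array of shape (numWords, vectorSize) with one row per word, in order."""
    vectorSize = gensimModel.vector_size  # 50 for the glove.6B.50d vectors
    vectorList = []
    for word in sentence.split():
        # Words missing from the model vocabulary are represented by a zero vector.
        # (gensim 4.x: use `word in gensimModel.key_to_index` instead of `.vocab`.)
        if word in gensimModel.vocab:
            vectorList.append(gensimModel.get_vector(word))
        else:
            # float32 matches the dtype of the GloVe vectors
            vectorList.append(np.zeros(vectorSize, dtype=np.float32))
    # reshape keeps the result two-dimensional even for an empty sentence
    stackedVector = np.array(vectorList, dtype=np.float32).reshape(-1, vectorSize)
    return stackedVector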
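
Reviews have different lengths, so the stacks returned by `getSentenceWordStack` have different numbers of rows. One common way to batch them for an RNN is to pad or truncate each stack to a fixed number of time steps. A minimal sketch, assuming zero-padding at the end and a hypothetical helper name `padOrTruncate` (this part of the notebook does not say how batching is done):

```python
import numpy as np


def padOrTruncate(stackedVector, maxLength, vectorSize=50):
    """Return a (maxLength, vectorSize) float32 array from a (numWords, vectorSize) stack."""
    padded = np.zeros((maxLength, vectorSize), dtype=np.float32)
    numWords = min(len(stackedVector), maxLength)
    if numWords > 0:
        # Copy the first numWords rows; remaining rows stay zero.
        padded[:numWords] = stackedVector[:numWords]
    return padded


# Toy stacks standing in for getSentenceWordStack output (values are made up).
shortStack = np.ones((3, 50), dtype=np.float32)
longStack = np.ones((12, 50), dtype=np.float32)
batch = np.stack([padOrTruncate(shortStack, 10), padOrTruncate(longStack, 10)])
```

The resulting `batch` has shape `(batchSize, maxLength, vectorSize)`. Keeping the true sequence lengths alongside it would let the model ignore the padded steps.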