## The Yelp Dataset
[**The Yelp Dataset**](https://www.yelp.com/dataset_challenge/) is a dataset published by the business review service [Yelp](http://yelp.com) for academic research and educational purposes. I really like the Yelp dataset as a subject for machine learning and natural language processing demos, because it's big (but not so big that you need your own data center to process it), well-connected, and relatable to anyone: it's largely about food, after all!

**Note:** If you'd like to execute this notebook interactively on your local machine, you'll need to download your own copy of the Yelp dataset. If you're reviewing a static copy of the notebook online, you can skip this step. Here's how to get the dataset:
1. Please visit the Yelp dataset webpage [here](https://www.yelp.com/dataset_challenge/)
1. Click "Get the Data"
1. Please review, agree to, and respect Yelp's terms of use!
1. The dataset downloads as a compressed .tgz file; uncompress it
1. Place the uncompressed dataset files (*yelp_academic_dataset_business.json*, etc.) in a directory named *yelp_dataset_challenge_academic_dataset*
1. Place the *yelp_dataset_challenge_academic_dataset* directory within the *data* directory in the *Modern NLP in Python* project folder

That's it! You're ready to go.

The files are UTF-8 text files with one _JSON object_ per line, each one corresponding to an individual data record.
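
As a minimal sketch of what "one JSON object per line" means in practice, the snippet below parses a small in-memory sample with the standard library. The two sample records and the `iter_records` helper are made up for illustration; they are not taken from the dataset.

```python
import io
import json

# Two made-up records in the same one-object-per-line layout (not real Yelp data).
sample = io.StringIO(
    '{"review_id": "a1", "stars": 5, "text": "Great tacos."}\n'
    '{"review_id": "b2", "stars": 2, "text": "Slow service."}\n'
)

def iter_records(lines):
    """Yield one dict per non-empty line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

records = list(iter_records(sample))
print(len(records), records[0]["stars"])
```

Because each line is parsed on its own, the same loop can run over an open file handle without loading the whole file into memory.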

In [1]:
%pylab inline
# !conda install -y tensorflow
# !conda install -y keras

Populating the interactive namespace from numpy and matplotlib


In [2]:
import os
import re
import json
from sklearn.model_selection import train_test_split
import tensorflow as tf

from keras.preprocessing.sequence import pad_sequences
from keras.models import Model, Sequential, model_from_json
from keras.layers import Dense, Embedding, Input, LSTM
from keras.callbacks import EarlyStopping
from keras.preprocessing.text import Tokenizer
#!conda update -y dask #if pandas version is above approx. 0.19 Keras throws exception without dask update.

def set_seed(seed=42):
    tf.set_random_seed(seed)
    np.random.seed(seed)
set_seed()

Using TensorFlow backend.


The review records are stored as _key, value_ pairs containing information about the reviews.

A few attributes of note on the review records:
- __text__: _the natural language text the user wrote_
- __stars__: _the number of stars the reviewer left_

The _text_ and _stars_ attributes will be our focus today!

In [3]:
json_dir = os.path.join('..', 'data',
                              'yelp_dataset_challenge_academic_dataset', 'dataset')

json_review_filepath = os.path.join(json_dir,
                                    'review.json')

with open(json_review_filepath, encoding='utf_8') as f:
    first_review_record = f.readline()
    
print(first_review_record)

sentiment_data_dir = os.path.join('..','data','sentiment_data')

text_filepath = os.path.join(sentiment_data_dir,'sentiment.txt')
sentiment_filepath = os.path.join(sentiment_data_dir,'number_of_stars.txt')

sentiment_train_dir = os.path.join(sentiment_data_dir, 'train')
sentiment_test_dir = os.path.join(sentiment_data_dir, 'test')

X_train_file_path = os.path.join(sentiment_train_dir, 'X_train.txt')
y_train_file_path = os.path.join(sentiment_train_dir, 'y_train.txt')
X_test_file_path = os.path.join(sentiment_test_dir, 'X_test.txt')
y_test_file_path = os.path.join(sentiment_test_dir, 'y_test.txt')





{"review_id":"VfBHSwC5Vz_pbFluy07i9Q","user_id":"cjpdDjZyprfyDG3RlkVG3w","business_id":"uYHaNptLzDLoV_JZ_MuzUA","stars":5,"date":"2016-07-12","text":"My girlfriend and I stayed here for 3 nights and loved it. The location of this hotel and very decent price makes this an amazing deal. When you walk out the front door Scott Monument and Princes street are right in front of you, Edinburgh Castle and the Royal Mile is a 2 minute walk via a close right around the corner, and there are so many hidden gems nearby including Calton Hill and the newly opened Arches that made this location incredible.\n\nThe hotel itself was also very nice with a reasonably priced bar, very considerate staff, and small but comfortable rooms with excellent bathrooms and showers. Only two minor complaints are no telephones in room for room service (not a huge deal for us) and no AC in the room, but they have huge windows which can be fully opened. The staff were incredible though, letting us borrow umbrellas for t

In [4]:
%%time
# Change `if False` below to `if True` to run the data prep yourself
# once you've saved the Yelp dataset locally.

data_dir = os.path.join('..','data')
stopword_en_filepath = os.path.join(data_dir, 'stopwords-en.txt')


if False:
    #load stopwords
    with open(stopword_en_filepath) as f:
        stopwords_en = set(f.read().split('\n'))
    
    review_count = 0

    # create and open the new output files in write mode
    with open(text_filepath, 'w', encoding='utf_8') as review_txt_file:
        with open(sentiment_filepath, 'w', encoding='utf_8') as review_sentiment_file:

            # open the existing review json file
            with open(json_review_filepath, encoding='utf_8') as review_json_file:
                # loop through all reviews in the existing file and convert to dict
                for review_json in review_json_file:
                    review = json.loads(review_json)
                    # write the review as a line in the new file
                    # escape newline characters in the original review text
                    
                    # TODO: remove stopwords before writing the file
                    # TODO: make sure the review is English, i.e. does not contain German, French or Spanish stopwords
                    review_txt_file.write(review.get('text','NA').replace('\n', r'\n') + '\n')
                    review_sentiment_file.write(str(review.get('stars','NA')) +'\n')
                    review_count += 1

    print('Text from {} reviews written to the new txt file.'.format(review_count))
    
else:
    
    #count the lines in the above files

    from itertools import (takewhile,repeat)

    def rawincount(filename):
        with open(filename, 'rb') as f:
            bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
            return sum( buf.count(b'\n') for buf in bufgen )

    print('Len of review text file:{}\nLen of review sentiment file:{}'.format(rawincount(text_filepath), 
                                                                               rawincount(sentiment_filepath)))



Len of review text file:4736897
Len of review sentiment file:4736897
CPU times: user 2.46 s, sys: 1.43 s, total: 3.89 s
Wall time: 4.49 s
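
The data prep above keeps one review per line by replacing real newlines with the two characters backslash and `n`. Here is a small sketch of that round trip; the helper names and the sample text are made up, and the reverse step assumes the original text contained no literal backslash-n of its own.

```python
def escape_newlines(text):
    """Replace real newlines with the two characters backslash + n."""
    return text.replace('\n', r'\n')

def unescape_newlines(line):
    """Reverse escape_newlines (assumes the text had no literal backslash-n)."""
    return line.replace(r'\n', '\n')

review_text = "Loved it.\n\nWould come back."   # made-up example
stored = escape_newlines(review_text)

print(stored)
```

This is why the line counts of the text file and the sentiment file can match exactly: every review occupies a single line, however many paragraphs it had.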


In [5]:
MAX_NB_WORDS = 20000
NUM_ROWS = 5000 #number of rows to load into memory



def find_sentiment(line): 
    ''' convert sentiment text (1-5 star rating) into positive or negative review'''
    if int(line.rstrip()) >= 3: #three stars or higher is positive review
        return 1
    else:
        return 0


from itertools import islice

# islice stops after NUM_ROWS lines instead of scanning the whole file
with open(sentiment_filepath, 'r', encoding='utf_8') as f:
    y = [find_sentiment(line) for line in islice(f, NUM_ROWS)]

with open(text_filepath, 'r', encoding='utf_8') as f:
    X = [line.rstrip() for line in islice(f, NUM_ROWS)]
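
To make the labelling rule concrete, here is a sketch that applies it to each possible star rating. `find_sentiment` repeats the rule from the cell above. `find_sentiment_or_none` is a hypothetical variant, not part of the notebook, for lines such as `NA`, which the data prep writes when a record has no `stars` field.

```python
def find_sentiment(line):
    """Same rule as the cell above: 3+ stars -> 1 (positive), else 0."""
    return 1 if int(line.rstrip()) >= 3 else 0

def find_sentiment_or_none(line):
    """Hypothetical variant: return None for lines such as 'NA'."""
    try:
        return find_sentiment(line)
    except ValueError:
        return None

lines = ["1\n", "2\n", "3\n", "4\n", "5\n", "NA\n"]
labels = [find_sentiment_or_none(line) for line in lines]
print(labels)
```

Note that three-star reviews land in the positive class under this rule, which is one reason the classes turn out unbalanced.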

In [6]:
#next, vectorize the text samples into a 2D integer tensor based on words
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(X)


X = tokenizer.texts_to_sequences(X)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

# train/test split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=42)

Found 20227 unique tokens.
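
The Keras `Tokenizer` ranks words by frequency and keeps the full ranking in `word_index`; the `num_words` limit only applies when texts are converted to sequences. That is why 20227 tokens can be reported even though `MAX_NB_WORDS` is 20000. The pure-Python sketch below imitates that behaviour in a simplified way. It is not the Keras implementation; `build_word_index` and `to_sequences` are hypothetical names, and the real tokenizer also strips punctuation.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a 1-based rank, most frequent first (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Replace words by their rank, dropping words ranked at or beyond num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and (num_words is None or word_index[w] < num_words)]
        for text in texts
    ]

texts = ["good food good service good", "bad food"]   # made-up examples
word_index = build_word_index(texts)

print(len(word_index))                              # every word is indexed
print(to_sequences(texts, word_index, num_words=3)) # only ranks 1 and 2 survive
```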


In [7]:
from collections import Counter
count = Counter(y)
print('split of positive and neg sentiment of y:',count)

def median(lst):
    quotient, remainder = divmod(len(lst), 2)
    if remainder:
        return sorted(lst)[quotient]
    return sum(sorted(lst)[quotient - 1:quotient + 1]) / 2
#check distribution of lengths of reviews
len_x = [len(item) for item in X]
max_length_count = max(len_x)
median_length_count = median(len_x)


split of positive and neg sentiment of y: Counter({1: 3929, 0: 1071})
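
The counts above show that the two classes are far from balanced. As a quick piece of arithmetic, the sketch below computes the accuracy of a classifier that always predicts the majority class. The figure is over all 5000 loaded rows; the share in the test split may differ slightly.

```python
from collections import Counter

# Label counts copied from the output above
count = Counter({1: 3929, 0: 1071})

total = sum(count.values())
majority_label, majority_count = count.most_common(1)[0]
baseline_accuracy = majority_count / total

print('majority label: {}, baseline accuracy: {:.4f}'.format(majority_label, baseline_accuracy))
```

Any accuracy reported later is worth reading against this reference point of roughly 0.79.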


In [8]:
print(X[0])

[13, 1459, 2, 4, 1572, 38, 11, 154, 1332, 2, 346, 8, 1, 178, 7, 18, 549, 2, 35, 371, 201, 384, 18, 59, 159, 479, 52, 17, 372, 37, 1, 426, 624, 10812, 10813, 2, 3665, 524, 27, 164, 10, 426, 7, 17, 1611, 97, 2, 1, 3666, 2347, 9, 3, 123, 831, 372, 2774, 3, 462, 164, 176, 1, 1098, 2, 40, 27, 25, 214, 1356, 4608, 1161, 678, 10814, 2270, 2, 1, 4203, 606, 8046, 14, 134, 18, 178, 915, 12, 78, 549, 574, 6, 72, 35, 83, 16, 3, 1273, 650, 226, 35, 5098, 127, 2, 126, 15, 775, 1384, 16, 275, 2013, 2, 10815, 69, 144, 2775, 1854, 27, 68, 10816, 10, 329, 11, 329, 51, 22, 3, 388, 479, 11, 95, 2, 68, 6661, 10, 1, 329, 15, 19, 23, 388, 2271, 63, 65, 32, 2435, 606, 1, 127, 26, 915, 187, 2776, 95, 8047, 5718, 11, 1, 3236, 739, 95, 6662, 2, 4204, 2, 72, 52, 21, 24, 1306, 56, 69, 5719, 8048, 11, 2777, 56, 3926, 330, 95, 3, 35, 836, 46, 11, 293, 12, 98, 58, 391, 185, 18, 549, 5, 263, 2, 52, 4, 525, 5, 1611, 63, 4, 211, 121, 67, 4, 67, 32, 1540, 38, 308, 160, 4205]
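
A sequence like the one above can be turned back into words by inverting the word index. The sketch below uses a tiny made-up `word_index`; in the notebook the mapping would be `tokenizer.word_index`. The `decode` helper is hypothetical.

```python
# Toy word index (made up); in the notebook this would be tokenizer.word_index
word_index = {'the': 1, 'and': 2, 'hotel': 3, 'loved': 4}

# Invert the mapping: integer id -> word
index_word = {i: w for w, i in word_index.items()}

def decode(sequence):
    """Turn a list of ids back into a space-joined string; unknown ids become '?'."""
    return ' '.join(index_word.get(i, '?') for i in sequence)

print(decode([4, 1, 3]))
```

Decoding a few rows this way is a cheap sanity check that the tokenizer kept the words you expect.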


In [9]:
MAX_SEQUENCE_LENGTH = 1000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2


# first, build an index mapping words in the embeddings set
# to their embedding vectors

# print('Indexing word vectors.')

glove_dir = os.path.join('..','data', 'glove_data') #location of the word embeddings
embeddings_index = {}
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf_8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs


print('Found %s word vectors.' % len(embeddings_index))



Found 400000 word vectors.
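
Each line of the GloVe text file holds a word followed by its vector components, separated by spaces. The sketch below parses two made-up three-dimensional lines with plain Python lists to stay dependency-free; the notebook itself uses `np.asarray` and 100-dimensional vectors. `parse_glove_line` is a hypothetical helper.

```python
def parse_glove_line(line):
    """Split 'word v1 v2 ... vn' into (word, [floats]) (hypothetical helper)."""
    values = line.split()
    return values[0], [float(v) for v in values[1:]]

# Made-up 3-dimensional lines; the real file has 100 numbers per word
sample_lines = ["hotel 0.1 -0.2 0.3", "pizza 0.5 0.0 -1.5"]
embeddings_index = dict(parse_glove_line(line) for line in sample_lines)

print(embeddings_index["pizza"])
```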


In [10]:
print('Preparing embedding matrix.')
num_words = min(MAX_NB_WORDS, len(word_index))
embedding_matrix = np.zeros((num_words+1, EMBEDDING_DIM))
for word, i in word_index.items():
    if i > MAX_NB_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

# load pre-trained word embeddings into an Embedding layer
# note that we set trainable = False so as to keep the embeddings fixed
embedding_layer = Embedding(num_words+1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            mask_zero=True,
                            trainable=False)

# pad or truncate the training and test sequences to MAX_SEQUENCE_LENGTH
print('Pad sequences (samples x time)')
X_train = pad_sequences(X_train, maxlen=MAX_SEQUENCE_LENGTH)
X_test = pad_sequences(X_test, maxlen=MAX_SEQUENCE_LENGTH)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)



Preparing embedding matrix.
Pad sequences (samples x time)
X_train shape: (4000, 1000)
X_test shape: (1000, 1000)
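
With its default settings, `pad_sequences` pads with zeros at the front and, for sequences that are too long, drops items from the front as well. The pure-Python sketch below mimics that default for a single sequence. `pad_pre` is a hypothetical helper, not the Keras function, and it assumes `maxlen` is at least 1.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

short = pad_pre([5, 8, 2], maxlen=5)            # shorter than maxlen: zeros go in front
long = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)    # longer: the first items are dropped

print(short)
print(long)
```

Since the embedding layer is built with `mask_zero=True`, the padded zeros are treated as "no word" rather than as input.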


In [None]:
model_dir = os.path.join('models', 'yelp_sentiment_analysis_glove_embeddings')
model_json_filepath = os.path.join(model_dir,"model.json" )
model_weights_filepath = os.path.join(model_dir, 'model.h5')

BATCH_SIZE = 10

model = None

def build_model():
    model = Sequential()
    model.add(embedding_layer)
    model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
    model.add(Dense(1, activation='sigmoid'))
    return model

if False:
    try:
        # load model and weights from disk
        with open(model_json_filepath, 'r') as json_file:
            model_json = json_file.read()
            model = model_from_json(model_json)
        # load weights into new model
        model.load_weights(model_weights_filepath)
        print("Loaded model from disk!")
    except Exception as e:
        print('Could not load file from disk', e)

if model is None:
    print('Building model...')
    model = build_model()
    
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Training...')
model.fit(X_train, y_train,
          batch_size=BATCH_SIZE,
          epochs=1,
          callbacks=[EarlyStopping(patience=3, verbose=1)],
          validation_data=(X_test, y_test)
         )

score, acc = model.evaluate(X_test, y_test,
                            batch_size=BATCH_SIZE)
print('Test score:', score)
print('Test accuracy:', acc)

Building model...
Training...
Train on 4000 samples, validate on 1000 samples
Epoch 1/1
  80/4000 [..............................] - ETA: 1530s - loss: 0.5176 - acc: 0.7875
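
To get a feel for the size of this model, the arithmetic below counts its weights. It is a sketch that assumes a standard LSTM layer with biases, and that `num_words` is 20000 (the smaller of `MAX_NB_WORDS` and the 20227 tokens found earlier).

```python
EMBEDDING_DIM = 100
LSTM_UNITS = 128
VOCAB_ROWS = 20000 + 1   # num_words + 1 rows in the embedding matrix

# Embedding weights are frozen (trainable=False), so they are not updated
embedding_params = VOCAB_ROWS * EMBEDDING_DIM

# Standard LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (LSTM_UNITS * (EMBEDDING_DIM + LSTM_UNITS) + LSTM_UNITS)

# Dense(1): one weight per LSTM unit plus one bias
dense_params = LSTM_UNITS * 1 + 1

# 4000 training rows at a batch size of 10
batches_per_epoch = 4000 // 10

print(embedding_params, lstm_params, dense_params, batches_per_epoch)
```

Under these assumptions, only the LSTM and dense weights are trained, while the much larger embedding table stays fixed.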

In [None]:
# make sure the output directory exists before writing to it
os.makedirs(model_dir, exist_ok=True)

# serialize model to JSON
model_json = model.to_json()

with open(model_json_filepath, "w") as json_file:
    json_file.write(model_json)

# serialize weights to HDF5
model.save_weights(model_weights_filepath)
print("Saved model to disk")