## The Yelp Dataset
[**The Yelp Dataset**](https://www.yelp.com/dataset_challenge/) is a dataset published by the business review service [Yelp](http://yelp.com) for academic research and educational purposes. I really like the Yelp dataset as a subject for machine learning and natural language processing demos, because it's big (but not so big that you need your own data center to process it), well-connected, and anyone can relate to it &mdash; it's largely about food, after all!

**Note:** If you'd like to execute this notebook interactively on your local machine, you'll need to download your own copy of the Yelp dataset. If you're reviewing a static copy of the notebook online, you can skip this step. Here's how to get the dataset:
1. Please visit the Yelp dataset webpage [here](https://www.yelp.com/dataset_challenge/)
1. Click "Get the Data"
1. Please review, agree to, and respect Yelp's terms of use!
1. The dataset downloads as a compressed .tgz file; uncompress it
1. Place the uncompressed dataset files (*yelp_academic_dataset_business.json*, etc.) in a directory named *yelp_dataset_challenge_academic_dataset*
1. Place the *yelp_dataset_challenge_academic_dataset* within the *data* directory in the *Modern NLP in Python* project folder

That's it! You're ready to go.

The files are text files (UTF-8) with one _json object_ per line, each one corresponding to an individual data record. 

In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
# !conda install -y tensorflow
# !conda install -y keras

In [3]:
import os
import re
import json
from sklearn.model_selection import train_test_split
import codecs
import tensorflow as tf

SEED = 42
def reset_graph(seed=SEED):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)
reset_graph()

The review records are stored in a similar manner &mdash; _key, value_ pairs containing information about the reviews.

In [4]:
json_dir = os.path.join('..', 'data',
                              'yelp_dataset_challenge_academic_dataset', 'dataset')

json_review_filepath = os.path.join(json_dir,
                                    'review.json')

with open(json_review_filepath, encoding='utf_8') as f:
    first_review_record = f.readline()
    
print(first_review_record)

{"review_id":"VfBHSwC5Vz_pbFluy07i9Q","user_id":"cjpdDjZyprfyDG3RlkVG3w","business_id":"uYHaNptLzDLoV_JZ_MuzUA","stars":5,"date":"2016-07-12","text":"My girlfriend and I stayed here for 3 nights and loved it. The location of this hotel and very decent price makes this an amazing deal. When you walk out the front door Scott Monument and Princes street are right in front of you, Edinburgh Castle and the Royal Mile is a 2 minute walk via a close right around the corner, and there are so many hidden gems nearby including Calton Hill and the newly opened Arches that made this location incredible.\n\nThe hotel itself was also very nice with a reasonably priced bar, very considerate staff, and small but comfortable rooms with excellent bathrooms and showers. Only two minor complaints are no telephones in room for room service (not a huge deal for us) and no AC in the room, but they have huge windows which can be fully opened. The staff were incredible though, letting us borrow umbrellas for t

A few attributes of note on the review records:
- __text__ &mdash; _the natural language text the user wrote_
- __stars__ &mdash; _the number of stars the reviewer left_

The _text_ and the _stars_ attribute will be our focus today!

In [5]:
sentiment_data_dir = os.path.join('..','data','sentiment_data')

text_filepath = os.path.join(sentiment_data_dir,'sentiment.txt')
sentiment_filepath = os.path.join(sentiment_data_dir,'number_of_stars.txt')



In [6]:
%%time
# Make the if statement True
# if you want to execute data prep yourself once you've got the yelp dataset saved.

if False:
    
    review_count = 0

    # create & open a new files in write mode
    with open(text_filepath, 'w', encoding='utf_8') as review_txt_file:
        with open(sentiment_filepath, 'w', encoding='utf_8') as review_sentiment_file:

            # open the existing review json file
            with open(json_review_filepath, encoding='utf_8') as review_json_file:
                # loop through all reviews in the existing file and convert to dict
                for review_json in review_json_file:
                    review = json.loads(review_json)
                    # write the review as a line in the new file
                    # escape newline characters in the original review text
                    review_txt_file.write(review.get('text','NA').replace('\n', r'\n') + '\n')
                    review_sentiment_file.write(str(review.get('stars','NA')) +'\n')
                    review_count =  review_count + 1

    print ('Text from {} reviews written to the new txt file.'.format(review_count))
    
else:
    
    with open(text_filepath, encoding='utf_8') as review_txt_file:
        for review_count, line in enumerate(review_txt_file):
            pass
        
    print('Text from {} reviews in the txt file.'.format(review_count + 1))



Text from 4739863 reviews in the txt file.
CPU times: user 8.78 s, sys: 1.44 s, total: 10.2 s
Wall time: 11 s


In [7]:
# #count the lines in the above files

# from itertools import (takewhile,repeat)

# def rawincount(filename):
#     with open(filename, 'rb') as f:
#         bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
#         return sum( buf.count(b'\n') for buf in bufgen )

# print('Len of review text file:{}\nLen of review sentiment file:{}'.format(rawincount(review_txt_filepath), rawincount(review_sentiment_filepath)))

In [8]:
# Good!  The lengths of the files match!

In [9]:
#lets do train-test split


In [10]:
sentiment_train_dir = os.path.join(sentiment_data_dir, 'train')
sentiment_test_dir = os.path.join(sentiment_data_dir, 'test')

X_train_file_path = os.path.join(sentiment_train_dir, 'X_train.txt')
y_train_file_path = os.path.join(sentiment_train_dir, 'y_train.txt')
X_test_file_path = os.path.join(sentiment_test_dir, 'X_test.txt')
y_test_file_path = os.path.join(sentiment_test_dir, 'y_test.txt')

In [11]:
def find_sentiment(line): 
    ''' convert sentiment text (1-5 star rating) into positive or negative review'''
    try:
        if int(line.rstrip()) >= 3: #three stars or higher is positive review
            return 1
        else:
            return 0
    except:
        pass
    return "NA"

def simple_clean(text):
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    #text = re.sub(r'[^a-zA-Z!.?\,:;]', ' ', text)
    # shorten any extra dead space created above
    text = re.sub(r' {2,}',' ', text)
    return text

#Build subset of X and y in memory
num_rows = 300
with open(sentiment_filepath, 'r') as f:
    y = [find_sentiment(line) for num,line in enumerate(f) if num < num_rows ]

with open(text_filepath, 'r') as f:
    X = [simple_clean(line.rstrip()) for num,line in enumerate(f) if num < num_rows]

In [12]:
X[7]

'This is a fairly new property I think It is a German company and has most of the amenities you would want It is priced on the budget minded side so it won t break your bank nLocation is really good Near the Royal Mile and Waverley station without being too noisy Very easy to walk to everything we wanted to do Has WiFi but we did have to re log in every day '

In [13]:
from collections import Counter

In [14]:
#check distribution of positive and negative sentiments

count = Counter(y)
count

Counter({0: 41, 1: 259})

In [15]:
def median(lst):
    quotient, remainder = divmod(len(lst), 2)
    if remainder:
        return sorted(lst)[quotient]
    return sum(sorted(lst)[quotient - 1:quotient + 1]) / 2

In [16]:
#check distribution of lengths of reviews
len_x = [len(item) for item in X]
max_length_count = max(len_x)
median_length_count = median(len_x)

print(median_length_count, max_length_count)

422.5 3827


In [17]:
print(X[0])

My girlfriend and I stayed here for nights and loved it The location of this hotel and very decent price makes this an amazing deal When you walk out the front door Scott Monument and Princes street are right in front of you Edinburgh Castle and the Royal Mile is a minute walk via a close right around the corner and there are so many hidden gems nearby including Calton Hill and the newly opened Arches that made this location incredible n nThe hotel itself was also very nice with a reasonably priced bar very considerate staff and small but comfortable rooms with excellent bathrooms and showers Only two minor complaints are no telephones in room for room service not a huge deal for us and no AC in the room but they have huge windows which can be fully opened The staff were incredible though letting us borrow umbrellas for the rain giving us maps and directions and also when we had lost our only UK adapter for charging our phones gave us a very fancy one for free n nI would highly recomme

In [18]:
#!conda update -y dask #if pandas version is above approx. 0.19 Keras throws exception without dask update.
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential, Model
from keras.layers import Dense, Embedding, Input, LSTM, Layer
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [19]:
BATCH_SIZE = 10

In [20]:
MAX_SEQUENCE_LENGTH = 1000
MAX_NB_WORDS = 20000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2


# # first, build index mapping words in the embeddings set
# # to their embedding vector

# print('Indexing word vectors.')

glove_dir = os.path.join('..','data', 'glove_data') #location of the word embeddings
embeddings_index = {}
with open(os.path.join(glove_dir, 'glove.6B.100d.txt')) as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs


print('Found %s word vectors.' % len(embeddings_index))

#next, vectorize the text samples into a 2D integer tensor
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(X)
sequences = tokenizer.texts_to_sequences(X)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))



Found 400000 word vectors.
Found 4559 unique tokens.


In [21]:
from keras.utils import to_categorical
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

labels = to_categorical(np.asarray(y))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

## NEED TO TRANSFORM THE INPUT INTO SEQUENCES
if True:
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=SEED)
else:
    pass

print('Preparing embedding matrix.')

# prepare embedding matrix
num_words = min(MAX_NB_WORDS, len(word_index))
embedding_matrix = np.zeros((num_words+1, EMBEDDING_DIM))
for word, i in word_index.items():
    if i > MAX_NB_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

# load pre-trained word embeddings into an Embedding layer
# note that we set trainable = False so as to keep the embeddings fixed
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix[1:]],
                            input_length=MAX_SEQUENCE_LENGTH,
                            mask_zero=True,
                            trainable=False)


Shape of data tensor: (300, 1000)
Shape of label tensor: (300, 2)
Preparing embedding matrix.


AttributeError: 'list' object has no attribute 'shape'

In [24]:
print('Building model...')

# rewrite model according to https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
# see also https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/


sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32', name='seq input')
l = embedding_layer(sequence_input)
l = LSTM(128, dropout=0.2, recurrent_dropout=0.2)(l)
output = Dense(1, activation='sigmoid')(l)

model = Model(sequence_input, output)
# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(X_train, y_train,
          batch_size=BATCH_SIZE,
          epochs=5,
          validation_data=(X_test, y_test))

score, acc = model.evaluate(X_test, y_test,
                            batch_size=BATCH_SIZE)
print('Test score:', score)
print('Test accuracy:', acc)

Building model...
Train...


ValueError: Error when checking input: expected input_1 to have shape (None, 1000) but got array with shape (240, 1)

In [None]:

# MAX_NB_WORDS = 20000
# MAX_SEQUENCE_LENGTH = 100

# maxlen = 80  # cut texts after this number of words (among top max_features most common words)
# batch_size = 32
# finally, vectorize the text samples into a 2D integer tensor
# tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
# tokenizer.fit_on_texts(X_train)
# sequences = tokenizer.texts_to_sequences(X_train)

# word_index = tokenizer.word_index
# print('Found %s unique tokens.' % len(word_index))

# data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)


# print('Pad sequences (samples x time)')
# X_train = pad_sequences(X_train, maxlen=MAX_SEQUENCE_LENGTH)
# X_test = pad_sequences(X_test, maxlen=MAX_SEQUENCE_LENGTH)
# print('x_train shape:', X_train.shape)
# print('x_test shape:', X_test.shape)

# print('Build model...')
# model = Sequential()
# model.add(Embedding(MAX_SEQUENCE_LENGTH, 128))
# model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
# model.add(Dense(1, activation='sigmoid'))

# # try using different optimizers and different optimizer configs
# model.compile(loss='binary_crossentropy',
#               optimizer='adam',
#               metrics=['accuracy'])

# print('Train...')
# model.fit(X_train, y_train,
#           batch_size=batch_size,
#           epochs=5,
#           validation_data=(X_test, y_test))
# score, acc = model.evaluate(X_test, y_test,
#                             batch_size=batch_size)
# print('Test score:', score)
# print('Test accuracy:', acc)