First, mount Google Drive, since this notebook was run on Google Colab.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Just a sanity check over here.

In [0]:
!ls /content/drive/My\ Drive/yelp

'Dataset preparation.ipynb'   my_yelp_data.csv
'LSTM yelp TFIDF.ipynb'       yelp_review.csv


In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split
import gensim
from sklearn.linear_model import LogisticRegression
from gensim.models.doc2vec import TaggedDocument
from sklearn import utils
import numpy as np

The CSV for the Yelp data has been slightly modified. The link to the CSV can be found in the README of the Git repository.

In [0]:
df = pd.read_csv("/content/drive/My Drive/yelp/my_yelp_data.csv")

What we want to do here is a time series prediction based on customer reviews.
<br>
<br>
Let's consider that we have some hotels with unique hotel ids. Some of the hotels have closed and some are still open. For each hotel id we have customer reviews ordered by date. In this dataset, we have one review per date for each hotel id, and 10 such reviews per hotel. We also have the stars given to each hotel. Note that the stars are not the rating of the individual review, but the overall rating of the hotel.
<br>
<br>
The data can also come in a different format. For instance, multiple reviews can be posted on the same date. We just need to keep in mind that the reviews must be sorted by date-time in order to use this approach. You can always sort the reviews by a column in pandas.
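
As a small sketch of that sorting step (the toy frame and its values are made up; only the column names `hotel_id`, `date` and `text` come from this notebook), the reviews can be ordered by hotel and then by date:

```python
import pandas as pd

# Toy frame with the same column names as the notebook's DataFrame (values are invented).
toy = pd.DataFrame({
    "hotel_id": [2, 1, 1, 2],
    "date": ["2016-06-01", "2016-05-29", "2016-05-28", "2016-05-30"],
    "text": ["d", "b", "a", "c"],
})

# Parse the dates so they sort chronologically, then sort within each hotel.
toy["date"] = pd.to_datetime(toy["date"])
toy_sorted = toy.sort_values(["hotel_id", "date"]).reset_index(drop=True)
print(toy_sorted)
```

Resetting the index afterwards keeps row labels equal to row positions, which matters if the labels are later used to index into a feature matrix.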
<br>
<br>
This is what the DataFrame looks like.

In [0]:
df.head()

Unnamed: 0,date,hotel_id,stars,text
0,2016-05-28,1,5,Super simple place but amazing nonetheless. It...
1,2016-05-29,1,5,Small unassuming place that changes their menu...
2,2016-05-30,1,5,Lester's is located in a beautiful neighborhoo...
3,2016-05-31,1,5,Love coming here. Yes the place always needs t...
4,2016-06-01,1,5,Had their chocolate almond croissant and it wa...


We want to predict whether the business will be closed or not based on the reviews. The label for that is derived from the stars. Let's say that the hotels with a rating less than 3 are shut down and those with a rating of 3 or more are open. We'll add another column to the DataFrame that maps this condition from the stars.
<br>
<br>
I use the words 'notes' and 'reviews' interchangeably in this notebook. Both refer to the column "text" in the DataFrame.

In [0]:
df['business_closed'] = df['stars'].map(lambda x : 1 if x < 3 else 0)
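
The same labelling rule can also be written as a vectorised comparison. This is only a sketch on a made-up toy frame, not a change to the cell above:

```python
import pandas as pd

# Toy ratings; in the notebook the column comes from the Yelp CSV.
toy = pd.DataFrame({"stars": [1, 2, 3, 4, 5]})

# True where the rating is below 3, converted to 1/0.
toy["business_closed"] = (toy["stars"] < 3).astype(int)
print(toy)
```

A rating of exactly 3 ends up with label 0, i.e. open, just like in the `map` version.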

This is how the DataFrame looks now.

In [0]:
df.head()

Unnamed: 0,date,hotel_id,stars,text,business_closed
0,2016-05-28,1,5,Super simple place but amazing nonetheless. It...,0
1,2016-05-29,1,5,Small unassuming place that changes their menu...,0
2,2016-05-30,1,5,Lester's is located in a beautiful neighborhoo...,0
3,2016-05-31,1,5,Love coming here. Yes the place always needs t...,0
4,2016-06-01,1,5,Had their chocolate almond croissant and it wa...,0


Cleaning the text here: stripping HTML, newlines and punctuation such as . ; : and lowercasing. Stop words are not removed in this function, and tokenization is left to the CountVectorizer used later.

In [0]:
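from bs4 import BeautifulSoup
import re
import string

def cleanText(text):
    text = BeautifulSoup(text, "lxml").text  # strip any HTML markup
    text = text.replace('\n','')  # drop newline characters
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    text = re.sub(r'\|\|\|', r' ', text)  # has no effect here: '|' was already removed as punctuation
    text = re.sub(r'http\S+', r'<URL>', text)  # replace what is left of a link with a placeholder
    text = text.lower()
    text = text.replace('x', '')  # note: this removes every letter 'x', also inside ordinary words
    return text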

In [0]:
full_cleaned = [cleanText(i) for i in df['text']]

This is what the text looks like before and after cleaning.

In [0]:
df['text'][0]

"Super simple place but amazing nonetheless. It's been around since the 30's and they still serve the same thing they started with: a bologna and salami sandwich with mustard. \n\nStaff was very helpful and friendly."

In [0]:
full_cleaned[0]

'super simple place but amazing nonetheless its been around since the 30s and they still serve the same thing they started with a bologna and salami sandwich with mustard staff was very helpful and friendly'
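
One possible variant of the cleaning step, shown only as a sketch with the standard library: it replaces links before the punctuation is removed and collapses newlines and repeated spaces into a single space. The name `clean_text_sketch` is hypothetical, and HTML stripping with BeautifulSoup is left out here.

```python
import re
import string

def clean_text_sketch(text):
    """Hypothetical variant of cleanText; HTML stripping is omitted."""
    text = re.sub(r'http\S+', ' URL ', text)  # replace links while they are still intact
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    text = re.sub(r'\s+', ' ', text)  # newlines and repeated spaces become one space
    return text.strip().lower()

print(clean_text_sketch("Great food!\nSee http://example.com/menu for details."))
```

Replacing a newline with a space rather than an empty string keeps the words on either side of a line break from being glued together.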

Converting the cleaned text with CountVectorizer and TfidfTransformer. This is how the text is converted to TF-IDF features.

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(max_features = 100)#We are extracting only 100 features for our text.
#You can change this number according to your own dataset.
X_train_counts = count_vect.fit_transform(full_cleaned)

These are the standard steps for getting TFIDF features of a text. For more info regarding this check [this](https://stackoverflow.com/questions/36800654/how-is-the-tfidfvectorizer-in-scikit-learn-supposed-to-work). 
<br>
As we set the number of features to 100 in the previous step, our data has dimensions (10000, 100): there are 10000 reviews, each with 100 dimensions, as shown in the output of this cell.

In [0]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(10000, 100)
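
As a side note, scikit-learn's `TfidfVectorizer` combines the two steps above into one. The sketch below uses three made-up documents and `max_features=5` purely for illustration, and checks that both routes give the same matrix:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Made-up documents, only for illustration.
docs = ["good food good service", "bad food", "friendly staff and good coffee"]

# Two-step route, as in the cells above.
counts = CountVectorizer(max_features=5).fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route with the same settings.
one_step = TfidfVectorizer(max_features=5).fit_transform(docs)

print(two_step.shape)
print(np.allclose(two_step.toarray(), one_step.toarray()))
```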

Converting the sparse matrix to a dense array, for use in Keras.

In [0]:
X_train_tfidf = X_train_tfidf.toarray()

Let's say you have around 100 notes per hotel, arranged as a time series. You want to consider just the latest 10 or 20 notes before the hotel closed. You would then take the last 10 or 20 notes in the DataFrame for each hotel id. The variable
```
notes_per_hotel_taken
```
sets how many notes to use; that many of the latest notes are taken for each hotel.



In [0]:
notes_per_hotel_taken = 8

In our example, we have 10 notes per hotel. Let's say we wish to look at only the latest 8 notes. So, we are now using the latest 8 notes for each hotel id.
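
Taking the latest rows per hotel can also be sketched with `groupby(...).tail(n)`. The toy frame below is made up and assumes the rows are already sorted by date within each hotel:

```python
import pandas as pd

# Toy data: hotel 1 has three notes, hotel 2 has two (already in date order).
toy = pd.DataFrame({
    "hotel_id": [1, 1, 1, 2, 2],
    "text": ["a", "b", "c", "d", "e"],
})

# Keep the last 2 rows of every hotel; the original row order and index are preserved.
latest = toy.groupby("hotel_id").tail(2)
print(latest)
```

A hotel with fewer rows than requested simply keeps all of its rows, so padding is still needed afterwards.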

In [0]:
unique_hotelids = df['hotel_id'].unique()
notes_per_hotel_id = {}  # stores the latest 8 note vectors for each hotel_id
notes_per_hotel_id_full = {}  # stores all note vectors for each hotel_id,
# irrespective of notes_per_hotel_taken

for i in unique_hotelids:
  temp1 = df[df['hotel_id'] == i]  # all notes of this hotel
  temp = temp1.tail(notes_per_hotel_taken)  # the last 8, i.e. the latest 8 notes of the hotel

  # Row labels are used as positions in X_train_tfidf, which assumes df keeps its default 0..n-1 index.
  notes_per_hotel_id[i] = [X_train_tfidf[j] for j in temp.index]
  notes_per_hotel_id_full[i] = [X_train_tfidf[j] for j in temp1.index]

# The dictionary now looks like this for each hotel_id:
#   notes_per_hotel_id = {2: [first note, second note, ... up to 8 notes]}
# It holds the TF-IDF vector of each note, not the text of the note.

Some hotels may not have 8 reviews. They may have just 3 or 4 (in our dataset, there is no such case). In that case we pad with zeros.
<br>
<br>
Consider that for hotel_id = 4, there are just 4 reviews present. But earlier we set notes_per_hotel_taken to 8, that is, we need 8 notes per hotel. So in order to make our input uniform for the LSTM, we add zero padding. The zero padding must match the dimension of the rest of the reviews; remember that we used 100 as the dimension for TF-IDF.
<br>
So now, for hotel_id = 4, our list of notes would be [0, 0, 0, 0, note1, note2, note3, note4], where each 0 stands for a zero vector of length 100.
<br>
In this way, we make our input uniform.

In [0]:
import numpy as np

tfidf_dim = X_train_tfidf.shape[1]  # 100 here; follows the TF-IDF dimension automatically

for i in unique_hotelids:
  # check whether this hotel has fewer notes than notes_per_hotel_taken
  if len(notes_per_hotel_id[i]) < notes_per_hotel_taken:
    # how many zero vectors are required for that hotel_id
    number_of_padded_lists = notes_per_hotel_taken - len(notes_per_hotel_id[i])
    t = [np.zeros(tfidf_dim, dtype = float) for _ in range(number_of_padded_lists)]
    # prepend the zero vectors to make the input uniform
    notes_per_hotel_id[i] = np.array(t + notes_per_hotel_id[i])
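
As a standalone illustration of the front-padding idea, here is a minimal sketch on toy vectors of dimension 3. The helper name `pad_front` is hypothetical and not part of this notebook:

```python
import numpy as np

def pad_front(vectors, target_len, dim):
    """Left-pad a list of 1-D vectors with zero vectors up to target_len (hypothetical helper)."""
    missing = max(target_len - len(vectors), 0)
    padding = [np.zeros(dim) for _ in range(missing)]
    return np.array(padding + list(vectors))

# Two toy "notes" of dimension 3, padded to a sequence of length 4.
notes = [np.ones(3), 2 * np.ones(3)]
padded = pad_front(notes, target_len=4, dim=3)
print(padded.shape)
print(padded)
```

The zeros go in front so that the real notes stay at the end of the sequence, closest to the point where the prediction is made.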

Setting parameters for the model.

In [0]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
import numpy as np

data_dim = 100
number_of_notes_per_hotel = notes_per_hotel_taken
n_notes_train = len(notes_per_hotel_id)
#tunable parameter
batch_size = 50
epochs = 5

Getting the features and the labels for our model.

In [0]:
x_train_lstm = np.array([notes_per_hotel_id[i] for i in notes_per_hotel_id])
y_lstm = []
for i in unique_hotelids:
  temp = df[df['hotel_id'] == i]['business_closed']
  # the label is derived from the hotel's stars, so the first row of the hotel is enough
  y_lstm.append(temp.iloc[0])

y_lstm = np.array(y_lstm)

In [0]:
print("Length of train set is",len(x_train_lstm))
print("Length of label set is",len(y_lstm))

Length of train set is 1000
Length of label set is 1000
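
Keras LSTM layers expect a 3-D input of shape (samples, timesteps, features). The sketch below builds a toy array of that form with NumPy only; the sizes (5 hotels, 8 notes, 100 features) and the random values are assumptions chosen to mirror this notebook:

```python
import numpy as np

n_hotels, timesteps, features = 5, 8, 100  # toy sizes, not the real dataset

# One (timesteps, features) matrix per hotel, filled with random stand-in values.
toy_notes = {hid: np.random.rand(timesteps, features) for hid in range(n_hotels)}

# Stack the per-hotel matrices along a new first axis.
x = np.stack([toy_notes[hid] for hid in toy_notes])
print(x.shape)
```

If `np.stack` raises an error here, at least one hotel has a different number of notes and still needs padding.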


Model architecture

In [0]:
model = Sequential()
model.add(LSTM(100, input_shape=(None, data_dim),return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(200))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics = ['accuracy'])

Holding out the last 200 data points for validation.
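
Instead of a fixed slice, a shuffled and stratified split is another option, which may help if the hotels are stored in some particular order. This is only a sketch on random toy arrays using scikit-learn's `train_test_split`; it is not what this notebook does:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for x_train_lstm and y_lstm (100 hotels, 8 notes, 10 features).
x = np.random.rand(100, 8, 10)
y = np.array([0, 1] * 50)

# 20% validation data, keeping the class balance similar in both parts.
x_tr, x_val, y_tr, y_val = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=0
)
print(x_tr.shape, x_val.shape)
```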

In [0]:
model.fit(x_train_lstm[0:800],y_lstm[0:800], validation_data = (x_train_lstm[800:],y_lstm[800:]), batch_size=batch_size, epochs=epochs)

Train on 800 samples, validate on 200 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f8738f8b0b8>

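For prediction, we gather the note vectors of all hotels into one flat list. The slicing in the prediction loop assumes that the rows of the DataFrame are grouped by hotel id and keep the default index, so that a position in this list matches the DataFrame index.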

In [0]:
all_notevectors = [notes_per_hotel_id_full[i] for i in notes_per_hotel_id_full]
all_notevectors = [j for i in all_notevectors for j in i]

Running the prediction on the same reviews that were used for training the model (as an example).

In [0]:
prediction_array = []

for k, i in enumerate(unique_hotelids):
  print("Now processing ", k, "subject id out of ", len(unique_hotelids))
  t = df[df['hotel_id'] == i]
  first_idx = t.index[0]
  # Exclusive end of each growing window: after 1 note, after 2 notes, ... up to all notes.
  # A hotel with a single note gets exactly one window.
  ends = list(t.index[1:]) + [t.index[-1] + 1]
  for end in ends:
    # shape (1, notes so far, 100): one sample holding the notes seen up to this point
    window = np.array([all_notevectors[first_idx : end]])
    # with a sigmoid output, predict returns the probability directly;
    # predict_proba is not available in recent Keras versions
    preds = model.predict(window)
    prediction_array.append(preds)

Now processing  0 subject id out of  1000
Now processing  1 subject id out of  1000
Now processing  2 subject id out of  1000
...

All of your predictions will be stored in 
```
prediction_array
```
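
To turn the stored probabilities into class labels, one option is to flatten them and apply a threshold. The sketch below uses made-up values in place of `prediction_array`, and the 0.5 cut-off is an assumption:

```python
import numpy as np

# Stand-in for prediction_array: one (1, 1) array per prediction, values invented.
toy_predictions = [np.array([[0.91]]), np.array([[0.12]]), np.array([[0.50]])]

# Flatten to a 1-D vector of probabilities, then threshold at 0.5.
probs = np.concatenate(toy_predictions).ravel()
labels = (probs >= 0.5).astype(int)
print(probs)
print(labels)
```

A label of 1 corresponds to `business_closed = 1` as defined earlier.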
