What this notebook does:

Coffee shop reviews are preprocessed and lemmatized, and three sets of lemmatized reviews are saved: nouns only, nouns + verbs, and nouns + adjectives. Preprocessing includes stopword removal and bigram identification. A fourth version keeps sentence-ending periods so that sentence boundaries survive lemmatization.

Businesses defined as coffee shops are in './ProcessedData/coffeeshops_withcfcutoff.csv', the output of './ProcessingRawYelpData/EDAandDataCleaning_OnlyCoffeeShops'.

A table containing the raw, processed, and lemmatized review text is saved in './ProcessedData/lemmatizedreviews.csv'.

In [1]:
# Step one directory up, to the project root, so the helper_functions package can be imported
import os
print(os.getcwd())
os.chdir('../')
os.getcwd()

/Users/thomasyoung/Dropbox/TYInsightProject/LDA_Fitting


'/Users/thomasyoung/Dropbox/TYInsightProject'

In [2]:
import numpy as np
import pandas as pd

import gensim
import gensim.corpora as corpora
#from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

import spacy
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en.stop_words import STOP_WORDS
import en_core_web_lg

from tqdm import tqdm_notebook as tqdm
from pprint import pprint

import re

# NLTK Stop words
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

from helper_functions.nlp_helpers import sent_to_words, doc_to_words_split, remove_stopwords, make_bigrams, make_trigrams, lemmatization
#from nlp_helpers import sent_to_words

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/thomasyoung/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
shops = pd.read_csv('./ProcessedData/coffeeshops_withcfcutoff.csv')
reviews = pd.read_csv('./ProcessedData/reviews_precovid_txtprocessed.csv')
merged = pd.merge(shops,reviews,how='inner',on = ['alias'])
print(reviews.shape)
print(merged.shape)
print(merged.alias.value_counts().shape)
print(merged.alias.nunique())
print(merged.reviewtxt.nunique())

(46077, 7)
(46077, 30)
(791,)
791
46077


In [4]:
# Use the regular expression library re to strip punctuation (matches are deleted, not replaced with spaces)
merged['mreviewtxt_nopunc'] = merged['mreviewtxt'].map(lambda x: re.sub(r'[,\.!?]', '', x))
# Convert the reviews to lowercase
merged['mreviewtxt_nopunc'] = merged['mreviewtxt_nopunc'].map(lambda x: x.lower())
# Print the first rows of the cleaned reviews
merged['mreviewtxt_nopunc'].head()

0    it was my first time to the little canal  i wa...
1    just moved to the area and although there are ...
2    daytime: cafe nighttime: chillest  coziest bar...
3    i always end up in here after i go to the metr...
4    stopped here sunday  / /  late in the day afte...
Name: mreviewtxt_nopunc, dtype: object
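
The substitution above deletes punctuation rather than replacing it with a space, which can fuse two words when a reviewer leaves out the space after a period or comma. A minimal sketch of the difference, using a made-up review string (`sample` and both function names are illustrative and not part of the notebook):

```python
import re

PUNCT = re.compile(r"[,\.!?]")

def strip_punct(text):
    """Delete punctuation outright, as the cell above does."""
    return PUNCT.sub("", text).lower()

def space_punct(text):
    """Replace punctuation with a space, then collapse repeated whitespace."""
    return " ".join(PUNCT.sub(" ", text).lower().split())

sample = "Great coffee.Friendly staff,fast service!"
print(strip_punct(sample))  # great coffeefriendly stafffast service
print(space_punct(sample))  # great coffee friendly staff fast service
```

Whether the second variant is preferable depends on how often reviews omit the space; the notebook's results were produced with deletion.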

In [5]:


data = merged.mreviewtxt_nopunc.values.tolist()
data_words = list(sent_to_words(data))
print(data_words[:1])

[['it', 'was', 'my', 'first', 'time', 'to', 'the', 'little', 'canal', 'was', 'looking', 'for', 'an', 'iced', 'drink', 'and', 'decided', 'to', 'come', 'in', 'after', 'seeing', 'such', 'positive', 'reviews', 'the', 'interior', 'is', 'small', 'cozy', 'and', 'intimate', 'the', 'person', 'that', 'help', 'me', 'was', 'not', 'he', 'wasn', 'very', 'friendly', 'or', 'personable', 'ordered', 'an', 'iced', 'oat', 'latte', 'the', 'drink', 'was', 'good', 'nothing', 'to', 'rave', 'or', 'go', 'out', 'of', 'my', 'way', 'to', 'if', 'you', 're', 'looking', 'to', 'catch', 'up', 'with', 'friend', 'think', 'this', 'would', 'be', 'great', 'place', 'to', 'go', 'to', 'if', 'there', 'is', 'room']]
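
`sent_to_words` lives in `helper_functions.nlp_helpers` and is not shown here. The output above (lowercase tokens, no punctuation, single-letter words such as "i" and "a" dropped) is consistent with a simple regex tokenizer. The following is a hypothetical stand-in written with the standard library only, not the helper's actual code; the `min_len` and `max_len` defaults are assumptions:

```python
import re

def sent_to_words_sketch(docs, min_len=2, max_len=15):
    """Yield one list of lowercase alphabetic tokens per document."""
    for doc in docs:
        tokens = re.findall(r"[a-z]+", doc.lower())
        # Drop very short and very long tokens (assumed limits)
        yield [t for t in tokens if min_len <= len(t) <= max_len]

docs = ["it was my first time to the little canal  i wasn't sure"]
tokens = list(sent_to_words_sketch(docs))
print(tokens)
```

Under these assumptions "wasn't" becomes 'wasn' because the apostrophe splits the word and the leftover 't' is too short to keep, which matches the 'wasn' token in the output above.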


Bi-grams and tri-grams

In [6]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
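
`Phrases` joins two adjacent tokens when their co-occurrence score exceeds `threshold`. The sketch below illustrates the scoring idea from Mikolov et al. (2013), on which gensim's default scorer is based, using a toy corpus. It is a simplified illustration rather than gensim's implementation: the function name and corpus are made up, and the vocabulary size is assumed to be the number of distinct unigrams.

```python
from collections import Counter

def score_bigrams(sentences, min_count=5):
    """Score adjacent word pairs: (count(a,b) - min_count) * |V| / (count(a) * count(b))."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)  # assumption: distinct unigrams only
    scores = {}
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_count:
            continue  # pairs seen fewer than min_count times are ignored
        scores[(a, b)] = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
    return scores

sentences = [["iced", "latte"]] * 3 + [["iced", "tea"], ["hot", "latte"]]
scores = score_bigrams(sentences, min_count=1)
print(scores[("iced", "latte")])  # 0.5
print(scores[("iced", "tea")])    # 0.0
```

A pair that co-occurs often relative to how common its words are scores higher, so raising `threshold` keeps only the strongest collocations.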

Removing stopwords, making bigrams, and lemmatizing

In [7]:
stop_words = stopwords.words('english')
# Extend NLTK's English stop words with generic review words and coffee-chain names
stop_words.extend(['from', 'subject', 're', 'edu', 'use','good','great','love','go','always','order','get','say','try','nice','need','really','also','but','starbuck','dunkin','gregory','pret','bluestone','la colombe',
                  'starbucks','gregorys','store','area','people','location','drink'])
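
`remove_stopwords` is imported from `helper_functions.nlp_helpers` and is not shown. Assuming it filters token by token, a multi-word entry such as 'la colombe' can never match, because tokenization has already split it into 'la' and 'colombe'. A hypothetical sketch of that behavior (the function name and sample tokens are illustrative):

```python
def remove_stopwords_sketch(docs, stop_words):
    """Drop every token that exactly matches an entry in stop_words."""
    stop_set = set(stop_words)  # set membership is fast per token
    return [[w for w in doc if w not in stop_set] for doc in docs]

docs = [["la", "colombe", "latte", "really", "good"]]

# The two-word entry matches no single token, so the brand name survives
kept_phrase = remove_stopwords_sketch(docs, ["la colombe", "really", "good"])
print(kept_phrase)  # [['la', 'colombe', 'latte']]

# Listing the words separately removes them
kept_split = remove_stopwords_sketch(docs, ["la", "colombe", "really", "good"])
print(kept_split)  # [['latte']]
```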

In [8]:

# Remove Stop Words
data_words_nostops = remove_stopwords(data_words,stop_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops,bigram_mod)

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
nlp = spacy.load("en_core_web_lg", disable=['parser', 'ner'])

# Lemmatize three ways: nouns only, nouns + verbs, and nouns + adjectives
data_lemmatized_nouns = lemmatization(data_words_bigrams, nlp, allowed_postags=['NOUN'])
data_lemmatized_nounsverbs = lemmatization(data_words_bigrams, nlp, allowed_postags=['NOUN','VERB'])
data_lemmatized_nounsadj= lemmatization(data_words_bigrams, nlp, allowed_postags=['NOUN','ADJ'])
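
`lemmatization` is another helper that is not shown. Assuming it tags each token with spaCy and keeps the lemma only when the coarse part-of-speech tag is in `allowed_postags`, the filtering step reduces to the sketch below. The tokens are tagged by hand so the example runs without loading a model; `filter_lemmas` and `tagged` are illustrative names.

```python
def filter_lemmas(tagged_tokens, allowed_postags):
    """Keep the lemma of every (text, lemma, pos) triple whose tag is allowed."""
    allowed = set(allowed_postags)
    return [lemma for _text, lemma, pos in tagged_tokens if pos in allowed]

# Hand-tagged stand-in for what a tagger would produce
tagged = [
    ("baristas", "barista", "NOUN"),
    ("know", "know", "VERB"),
    ("their", "their", "PRON"),
    ("craft", "craft", "NOUN"),
    ("delicious", "delicious", "ADJ"),
    ("coffee", "coffee", "NOUN"),
]

nouns = filter_lemmas(tagged, ["NOUN"])
nouns_verbs = filter_lemmas(tagged, ["NOUN", "VERB"])
nouns_adj = filter_lemmas(tagged, ["NOUN", "ADJ"])
print(nouns)        # ['barista', 'craft', 'coffee']
print(nouns_verbs)  # ['barista', 'know', 'craft', 'coffee']
print(nouns_adj)    # ['barista', 'craft', 'delicious', 'coffee']
```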


In [11]:
print(data_lemmatized_nouns[1001])
print(data_lemmatized_nounsverbs[1001])
print(data_lemmatized_nounsadj[1001])
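# Note: index 42808 differs from the 1001 used above, so the raw review printed below is not the source of these lemma lists
print(merged.reviewtxt[42808])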

['shop', 'cure', 'day', 'service', 'coffee', 'shop', 'barista', 'craft', 'coffee']
['shop', 'cure', 'day', 'service', 'super', 'coffee', 'shop', 'barista', 'know', 'craft', 'show', 'coffee']
['caffeine_fix', 'shop', 'cure', 'tasty', 'perfect', 'warm', 'day', 'service', 'friendly', 'quick', 'small', 'coffee', 'shop', 'barista', 'craft', 'delicious', 'coffee']
I should first say I hope they survive because the neighborhood needs this. That said, they need to tweak things. The seats are really uncomfortable and not functional. Weird, rotating tables that I'm sure looked cool when they bought them, are also not very functional.  Not many plug outlets as well. The message they are sending is, "please stay for 10 minutes and then leave."
Classic rock music? How about some nice jazz? It's only a couple of blocks from one of the best jazz clubs in the city (Smoke) so wouldn't it be nice to hear some Miles or Coltrane while sipping your coffee? Just my two cents.
Coffee is fine and baked goods 

In [12]:
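# Keep only the original review columns (as a copy), then attach each lemmatized variant as a space-joined string
merged = merged[reviews.columns].copy()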
merged['review_lem_noun'] = pd.Series([' '.join(i) for i in data_lemmatized_nouns])
merged['review_lem_nounverb'] = pd.Series([' '.join(i) for i in data_lemmatized_nounsverbs])
merged['review_lem_nounadj'] = pd.Series([' '.join(i) for i in data_lemmatized_nounsadj])

Creating a column of reviews, where sentences are separated by '.'

In [11]:
# Create a column of reviews in which sentences stay separated by '.'; the words are lemmatized in a later cell
merged['reviewtxt_periodonly'] = merged.reviewtxt.copy()
mtext_field = "reviewtxt_periodonly"
df = merged  # alias, not a copy: changes to df also change merged
# Remove URL prefixes, mentions, digits, and punctuation other than '.'
# regex= is explicit: true patterns use regex=True, literal characters such as ( ) ? $ use regex=False
df[mtext_field] = df[mtext_field].str.replace("http", "", regex=False)
df[mtext_field] = df[mtext_field].str.replace(r"@\S+", "", regex=True)
df[mtext_field] = df[mtext_field].str.replace("&", "and", regex=False)
df[mtext_field] = df[mtext_field].str.replace("#", " ", regex=False)
df[mtext_field] = df[mtext_field].str.replace("@", "at ", regex=False)
#df[text_field] = df[text_field].str.replace(r"[^A-Za-z0-9(),!?@\"\_\n]", " ")
df[mtext_field] = df[mtext_field].str.replace(r"\d+", " ", regex=True)
df[mtext_field] = df[mtext_field].str.replace("`", "'", regex=False)
df[mtext_field] = df[mtext_field].str.replace(",", " ", regex=False)
df[mtext_field] = df[mtext_field].str.replace("(", " ", regex=False)
df[mtext_field] = df[mtext_field].str.replace(")", " ", regex=False)
df[mtext_field] = df[mtext_field].str.replace("?", " ", regex=False)
df[mtext_field] = df[mtext_field].str.replace("_", " ", regex=False)
df[mtext_field] = df[mtext_field].str.replace("%", " ", regex=False)
df[mtext_field] = df[mtext_field].str.replace("$", " ", regex=False)
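
Several characters replaced in the cell above, such as `(`, `)`, `?`, and `$`, are regular-expression metacharacters. Treated as patterns, `(` fails to compile and `$` matches the end of the string instead of a dollar sign, so they need to be escaped or matched literally. A small standard-library sketch of the pitfall and one way around it (`replace_literals` and `sample` are illustrative names):

```python
import re

def replace_literals(text, literals, repl=" "):
    """Replace each literal string, escaping any regex metacharacters first."""
    pattern = "|".join(re.escape(s) for s in literals)
    return re.sub(pattern, repl, text)

sample = "Latte (large) costs $5?"

# As a regex, "$" anchors to the end of the string: a space is appended
# and the dollar sign is left in place
anchored = re.sub(r"$", " ", sample)

# An unescaped "(" is not a valid pattern at all
try:
    re.compile("(")
    paren_compiles = True
except re.error:
    paren_compiles = False

cleaned = replace_literals(sample, ["(", ")", "?", "$"])
print(repr(anchored))
print(paren_compiles)
print(cleaned.split())
```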




In [35]:
data = merged.reviewtxt_periodonly.values.tolist()
data_words = list(doc_to_words_split(data))
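
`doc_to_words_split` is not shown either. Judging only from the saved column, where each '.' appears as a separate token, it presumably tokenizes like `sent_to_words` while keeping periods. A hypothetical sketch under that assumption, not the helper's actual code:

```python
import re

def doc_to_words_split_sketch(docs):
    """Yield lowercase tokens per document, keeping each '.' as its own token."""
    for doc in docs:
        yield re.findall(r"[a-z]+|\.", doc.lower())

docs = ["Big fan of Ninth Street. Big. Fan."]
split_tokens = list(doc_to_words_split_sketch(docs))
print(split_tokens)
# [['big', 'fan', 'of', 'ninth', 'street', '.', 'big', '.', 'fan', '.']]
```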

In [43]:

#Form Bigrams
data_words_bigrams = make_bigrams(data_words, bigram_mod)

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
nlp = spacy.load("en_core_web_lg", disable=['parser', 'ner'])

# Lemmatize, keeping PUNCT so that sentence-ending periods survive alongside content and function words
data_lemmatized_withperiod= lemmatization(data_words_bigrams, nlp, allowed_postags=['NOUN','PROPN','VERB','ADJ','PUNCT','ADV','AUX','ADP'])


In [46]:
merged['review_lem_withperiod'] = pd.Series([' '.join(i) for i in data_lemmatized_withperiod])


In [47]:
merged.head(5)

Unnamed: 0,reviewidx,shopidx,alias,date,revrating,reviewtxt,mreviewtxt,review_lem_noun,review_lem_nounverb,review_lem_nounadj,reviewtxt_periodonly,review_lem_withperiod
0,6,0,little-canal-new-york-2,2019-12-21,3.0,It was my first time to the Little Canal. I w...,it was my first time to the little canal. i w...,time canal review person oat latte way catch f...,time canal look ice decide come see review per...,first time little canal positive review interi...,It was my first time to the Little Canal. I w...,be first time to little canal . be look for ic...
1,7,0,little-canal-new-york-2,2019-12-19,5.0,Just moved to the area and although there are ...,just moved to the area and although there are ...,cafe neighborhood spot job staff vibe anxiety ...,move cafe choose stop work feel neighborhood s...,many cafe favorite friendly neighborhood spot ...,Just moved to the area and although there are ...,just move to area be many cafe choose from one...
2,8,0,little-canal-new-york-2,2019-12-14,5.0,"Daytime: cafe. Nighttime: chillest, coziest ba...",daytime: cafe. nighttime: chillest coziest ba...,nighttime bar vibe vibe price bank table wait ...,nighttime bar could want come vibe stay vibe p...,nighttime cozy bar vibe vibe price bank weekni...,Daytime: cafe. Nighttime: chillest coziest ba...,daytime cafe . nighttime chillest cozy bar cou...
3,9,0,little-canal-new-york-2,2019-11-04,4.0,I always end up in here after I go to the Metr...,i always end up in here after i go to the metr...,spritz commissary canal hip atmosphere pretens...,want spend spritz commissary canal chill hip a...,spritz commissary little canal hip atmosphere ...,I always end up in here after I go to the Metr...,always end up in here after go to metrograph d...
4,10,0,little-canal-new-york-2,2019-10-26,5.0,Stopped here Sunday 10/11/19 late in the day a...,stopped here sunday / / late in the day afte...,gallery beer hummus sandwich day sort way service,stop gallery open beer hummus sandwich would e...,gallery nearby fantastic beer hummus sandwich ...,Stopped here Sunday / / late in the day afte...,stop here sunday late in day after gallery ope...


In [48]:

merged.to_csv('./ProcessedData/lemmatizedreviews.csv',index=False)

In [52]:
merged.review_lem_withperiod[100]

'big fan of ninth_street . big . fan . drink be always top_notch expect huge window in front make place . full of natural_light perfect kill time work just enjoy moment with coffee . love have place close by . .'
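
Because every period is kept as a standalone token, a downstream step can recover sentence boundaries from the lemmatized string. One possible way to do so is sketched below on an abridged copy of the output above; `split_sentences` is an illustrative name and not part of the notebook:

```python
def split_sentences(lemmatized):
    """Group whitespace-separated tokens into sentences, splitting on '.' tokens."""
    sentences, current = [], []
    for tok in lemmatized.split():
        if tok == ".":
            if current:  # skip empty sentences produced by repeated periods
                sentences.append(current)
                current = []
        else:
            current.append(tok)
    if current:
        sentences.append(current)
    return sentences

review = "big fan of ninth_street . big . fan . love have place close by . ."
sentences = split_sentences(review)
print(sentences)
# [['big', 'fan', 'of', 'ninth_street'], ['big'], ['fan'], ['love', 'have', 'place', 'close', 'by']]
```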

In [53]:
merged.reviewtxt_periodonly[100]

"Big fan of Ninth Street. Big. Fan. \n\nThe drinks are always top notch like you'd expect  but the huge windows in front make this place. It's full of natural light that's perfect to kill some time working or just enjoying the moment with your coffee. Love having this place close by."