What this notebook does:

The first pass at scraping Yelp reviews ('/GettingYelpData/GettingYelpReviews_Part1.ipynb') returned 0 reviews for a large number of shops. In this notebook, I identify those shops in order to reattempt review scraping. The identified shops are saved in './GettingData/noreviewshops.csv'. An improved Yelp scraping procedure was then used to get reviews for these shops.

In [None]:
#Step one directory up to access the yelp scraping function in the helper_functions module
import os
print(os.getcwd())
os.chdir('../')
os.getcwd()
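
Note that re-running the cell above moves one more directory up each time. A minimal sketch of an idempotent alternative follows. It assumes the project root can be recognized by a marker subdirectory; the helper name `find_project_root` and the choice of `GettingData` as the marker are illustrative assumptions, not part of the original notebook.

```python
import os
from pathlib import Path

def find_project_root(start, marker="GettingData"):
    """Return the closest directory at or above `start` that contains `marker`.

    `marker` is an assumed landmark for the project root (hypothetical choice).
    """
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"no directory containing {marker!r} at or above {start}")

# Usage (commented out so the sketch has no side effects when pasted):
# os.chdir(find_project_root(os.getcwd()))
```

Because the search starts at the current directory, running it a second time after the working directory is already the root leaves the location unchanged.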

In [36]:
import pandas as pd

In [37]:
reviews = pd.read_csv('./ProcessedData/allreviews_txtprocessed.csv')
candidateshops = pd.read_csv('./GettingData/mhcoffeeshops_6_4_20_noduplicates.csv')

In [38]:
#getting the coffee shops for which there were no reviews
reviewcountdf = reviews.groupby(['alias']).count()
print(reviewcountdf.shape)
reviewcountdf.head(20)

(1009, 6)


alias,reviewidx,shopidx,date,rating,reviewtxt,mreviewtxt
11th-street-cafe-new-york,40,40,40,40,40,40
12-corners-new-york-4,60,60,60,60,60,60
2beans-new-york,40,40,40,40,40,40
7-eleven-new-york-16,18,18,18,18,18,18
7-eleven-new-york-17,14,14,14,14,14,14
7-eleven-new-york-2,34,34,34,34,34,34
7-eleven-new-york-21,27,27,27,27,27,27
7-eleven-new-york-23,18,18,18,18,18,18
7-eleven-new-york-27,5,5,5,5,5,5
7-eleven-new-york-29,19,19,19,19,19,19


In [39]:
reviewcountdf = reviewcountdf.sort_values(by='rating')
reviewcountdf.head(100)

alias,reviewidx,shopidx,date,rating,reviewtxt,mreviewtxt
on-off-coffee-new-york,1,1,1,1,1,1
golden-carriage-bakery-new-york,1,1,1,1,1,1
heart-of-tea-manhattan,1,1,1,1,1,1
coffee-truck-new-york,1,1,1,1,1,1
birch-coffee-new-york-19,1,1,1,1,1,1
...,...,...,...,...,...,...
tonys-coffee-cart-manhattan,6,6,6,6,6,6
alidoro-new-york-6,6,6,6,6,6,6
starbucks-new-york-526,6,6,6,6,6,6
787-coffee-new-york-4,6,6,6,6,6,6


In [40]:
#All coffee shops in the grouped counts have at least 1 review, because groupby only lists aliases that appear in the reviews table
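
A grouped count can only contain aliases that occur in the reviews file, so shops with no reviews never show up in it. The sketch below illustrates this on toy data and shows one way to surface the zero-count shops by reindexing against the full candidate list. The frames `reviews_toy` and `candidates_toy` are made-up stand-ins, not the notebook's data.

```python
import pandas as pd

# Toy stand-ins for the notebook's `reviews` and `candidateshops` frames.
reviews_toy = pd.DataFrame({
    "alias": ["shop-a", "shop-a", "shop-b"],
    "rating": [5, 4, 3],
})
candidates_toy = pd.DataFrame({"alias": ["shop-a", "shop-b", "shop-c"]})

# Counting rows per alias only yields aliases present in the reviews.
counts = reviews_toy.groupby("alias").size()

# Reindexing against every candidate alias fills the missing shops with 0.
counts_all = counts.reindex(candidates_toy["alias"].unique(), fill_value=0)

# Shops that were candidates but have no review rows at all.
zero_review = counts_all[counts_all == 0].index.tolist()
print(zero_review)
```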

In [41]:
#Check for candidate shops that don't occur in the list of processed reviews
aliaseswithreviews = reviews.alias.unique()
aliasesincand = candidateshops.alias.unique()

In [42]:
noreviewaliases = list(set(aliasesincand)-set(aliaseswithreviews))

In [43]:
len(noreviewaliases)

522
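
The set difference above returns aliases in arbitrary order, which can change between runs. As a cross-check, the same result can be obtained in sorted order with a left merge and pandas' `indicator` flag. This is a sketch on toy data; the function name `aliases_without_reviews` is hypothetical and is not used elsewhere in the notebook.

```python
import pandas as pd

def aliases_without_reviews(candidates, reviews):
    """Return a sorted list of candidate aliases that never occur in `reviews`."""
    unique_candidates = candidates[["alias"]].drop_duplicates()
    reviewed = reviews[["alias"]].drop_duplicates()
    # indicator=True adds a `_merge` column marking where each row was found.
    merged = unique_candidates.merge(reviewed, on="alias", how="left", indicator=True)
    missing = merged.loc[merged["_merge"] == "left_only", "alias"]
    return sorted(missing)

# Toy data (made up for illustration).
cands_toy = pd.DataFrame({"alias": ["shop-c", "shop-a", "shop-b", "shop-d"]})
revs_toy = pd.DataFrame({"alias": ["shop-a", "shop-a", "shop-b"]})

missing_toy = aliases_without_reviews(cands_toy, revs_toy)

# The merge-based result should match the set difference used above.
assert set(missing_toy) == set(cands_toy["alias"]) - set(revs_toy["alias"])
print(missing_toy)
```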

In [44]:
noreviewaliases

['little-collins-new-york-4',
 'starbucks-new-york-469',
 'starbucks-new-york-120',
 'the-lazy-llama-coffee-bar-new-york-2',
 'bonjour-crepes-and-wine-new-york-2',
 'joe-and-the-juice-new-york-39',
 'orens-daily-roast-new-york-2',
 'chokolat-patisserie-new-york',
 'dunkin-new-york-172',
 'kabisera-new-york-2',
 'dunkin-new-york-88',
 'bangklyn-east-harlem-new-york-2',
 'dr-smood-murray-street-new-york',
 'allegro-coffee-company-new-york',
 'mcdonalds-new-york-111',
 'dunkin-new-york-49',
 'starbucks-new-york-494',
 'joes-coffee-new-york',
 'black-press-coffee-new-york-3',
 'vin-sur-vingt-new-york-5',
 'cafe-aloaf-new-york',
 'coco-moka-cafe-new-york-2',
 'nespresso-boutique-at-bloomingdales-new-york',
 'zuckers-bagels-and-smoked-fish-new-york-7',
 'the-sensuous-bean-new-york',
 'dunkin-new-york-47',
 'proper-food-new-york-3',
 'dunkin-new-york-71',
 'le-moulin-a-cafe-new-york',
 'the-bagel-mill-new-york',
 'dunkin-new-york-12',
 'starbucks-new-york-540',
 'dunkin-new-york-65',
 'starbu

In [45]:
candidateshops = candidateshops[candidateshops.alias.isin(noreviewaliases)]
print(candidateshops[candidateshops.alias == 'mcdonalds-new-york-136'])
print(candidateshops[candidateshops.alias == '7-eleven-new-york-65'])

                         id        name                   alias  is_closed  \
822  VpISwPV93Zw1c4vkoRca7w  McDonald's  mcdonalds-new-york-136      False   

                                            categories  review_count price  \
822  [{'alias': 'burgers', 'title': 'Burgers'}, {'a...            33     $   

     rating  transactions   latitude  longitude  
822     2.0  ['delivery']  40.747569 -73.997088  
                         id      name                 alias  is_closed  \
840  JNoIyXRMVG1NOZbpfgq6WA  7-Eleven  7-eleven-new-york-65      False   

                                            categories  review_count price  \
840  [{'alias': 'convenience', 'title': 'Convenienc...             8     $   

     rating  transactions   latitude  longitude  
840     3.0  ['delivery']  40.744691 -73.997424  


In [46]:
#Remove 7-Eleven and McDonald's locations

In [47]:
candidateshops = candidateshops[candidateshops['name'] != "McDonald's"]
print(candidateshops.shape)

(503, 11)


In [48]:
candidateshops = candidateshops[candidateshops['name'] != '7-Eleven']
print(candidateshops.shape)

(483, 11)
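
The two filters above can also be expressed as a single step with `isin`, which makes it easier to extend the exclusion list later. This is a sketch on toy data. The constant `EXCLUDED_CHAINS` and the helper `drop_chains` are illustrative names, and the match is exact, so a differently spelled name would not be removed.

```python
import pandas as pd

# Chain names to exclude, spelled exactly as in the `name` column.
EXCLUDED_CHAINS = ["McDonald's", "7-Eleven"]

def drop_chains(shops, excluded=EXCLUDED_CHAINS):
    """Return the rows of `shops` whose `name` is not in `excluded`."""
    return shops[~shops["name"].isin(excluded)]

# Toy data (made up for illustration).
shops_toy = pd.DataFrame({
    "name": ["McDonald's", "7-Eleven", "Example Cafe", "7-Eleven"],
    "alias": ["mcd-1", "sev-1", "example-cafe-1", "sev-2"],
})

kept = drop_chains(shops_toy)
print(kept.shape)
```

Bracket access (`shops["name"]`) is used rather than attribute access because it cannot be shadowed by an attribute or method of the same name.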


In [49]:
candidateshops.to_csv('./GettingData/noreviewshops.csv',index=False)