# Summary Tags
We've assigned tags to 295 restaurants centered at Reno (N1 to N3). Now we explore methods to generate a collection of summary tags for those restaurants. Because these tags are generated without preferences or weights, we call them the unweighted summary tags.


In this exploration, we try to find a "transformer" that transforms a (collection of restaurants) into a (collection of tags) for people to vote on.

In [1]:
import pickle
import pandas as pd
import gensim.downloader as api
import re

In [2]:
# load the shortlisted restaurants from Reno as our initial pool of restaurants
with open('df_restaurants_Reno_shortlist,pkl','rb') as f:
    df_Reno_restaurants_shortlist = pickle.load(f)
bids = list(df_Reno_restaurants_shortlist['business_id'])
len(bids)

295

### 1. Process the summary tags, and add them to the pandas dataframe

### 1.1 parse the full summary paragraph into one line per question

In [3]:
# get a parser
def parse_summary_perbid(bid, pattern):
    fname = './restaurant_summaries/res_'+bid+'.pkl'
    with open(fname, 'rb') as f:
        s = pickle.load(f)  
    parsed_s = re.findall(pattern, s['summary'])
    return parsed_s

In [4]:
%%time
# check the format of the summary texts: is each one always parsed into 10 lines, one per question?
difficult_bids = []  # business ids whose summary does not parse into exactly 10 lines
pattern = re.compile(r'\d+\.[^\n]*')
for bid in bids:
    parsed_s = parse_summary_perbid(bid, pattern)
    if len(parsed_s) != 10:
        difficult_bids.append(bid)
        print(bid)

CPU times: user 12.1 ms, sys: 24.6 ms, total: 36.7 ms
Wall time: 108 ms


In [5]:
parsed_s = parse_summary_perbid(bids[0], pattern)
parsed_s

['1. [Brewed beer on site], [Variety of food options]',
 '2. [not mentioned]',
 '3. [yes]',
 '4. [yes]',
 '5. [yes]',
 '6. [Brewery], [Pizza], [Onion Rings], [Wings], [Burgers], [BBQ], [Salads], [Sandwiches], [Meatloaf], [French Dip]',
 '7. [Average], [Roasted], [Balsamic], [Tender], [Juicy], [Dry], [Crispy], [Lean], [Subtle Kick], [Seasoned]',
 "8. [Appetizer of roasted brussels sprouts], [Meatloaf], [French dip with a side of cole slaw], [Bison Burger], [Hot Pastrami Pretzel Roll], [French fries], [Buffalo wings], [Pizza], [Fish 'N Chips], [Taco Salad]",
 '9. [Mac & Cheese], [Fish had no flavor]',
 '10. [not mentioned]']

#### 2. for each question, parse out the tags.
Here are the questions used in the prompt to get the summary:
1. Tell me 2 things about this restaurant that people like about it the most.
2. Good for breakfast/brunch? Answer [yes] or [not mentioned].
3. Good for lunch? Answer [yes] or [not mentioned].
4. Good for dinner? Answer [yes] or [not mentioned].
5. Based on people's reviews, is it good for a group of friends? Answer [yes] or [not mentioned].
6. Use 10 words to describe what categories of foods are offered at the restaurant that are not characterized by the previous questions.
7. Use 10 words to describe the taste categories of the foods offered at the restaurant.
8. Give the top 10 recommended foods.
9. Give the top 2 not recommended foods.
10. Use 3 words to describe the accommodations for food restrictions. If not clearly indicated, say [not mentioned].

In [6]:
question_names =['good aspects (2)', 
                 'good for breakfast or brunch', 
                 'good for lunch',
                 'good for dinner',
                 'good for group',
                 '10 tags for food categories',
                 '10 tags for taste categories',
                 '10 recommended items',
                 '2 not recommended items',
                 'accomodations']
tags_pattern = re.compile(r'\[(.*?)\]')
def parse_tags_for_all_questions(parsed_s, question_names, tags_pattern):
    tags_dict = {}
    for i in range(10):
        s = parsed_s[i]
        o = re.findall(tags_pattern, s)
        tags_dict[question_names[i]] = o
    return tags_dict

In [8]:
%%time
tags_dict_all = []
for ind in range(len(bids)):
    bid = bids[ind]
    parsed_s = parse_summary_perbid(bid, pattern)
    tags_dict = parse_tags_for_all_questions(parsed_s, question_names, tags_pattern)
    tags_dict['business_id'] = bid
    tags_dict_all.append(tags_dict)
# seems to work pretty well!

CPU times: user 20.6 ms, sys: 17.6 ms, total: 38.2 ms
Wall time: 91.4 ms
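
The two-stage parsing above can be tried without the pickled summary files. The sketch below is only an illustration: `sample_summary` is a made-up string shaped like the output shown earlier, and `line_pattern` is simply a local name for the same regular expression stored in `pattern`.

```python
import re

# Hypothetical summary text, shaped like the model output shown above.
sample_summary = (
    "1. [Brewed beer on site], [Variety of food options]\n"
    "2. [not mentioned]\n"
    "3. [yes]\n"
    "6. [Brewery], [Pizza], [Fish 'N Chips]\n"
)

line_pattern = re.compile(r'\d+\.[^\n]*')   # one match per numbered line
tags_pattern = re.compile(r'\[(.*?)\]')     # non-greedy: one match per [tag]

# Stage 1: split the summary into numbered lines.
lines = line_pattern.findall(sample_summary)
# Stage 2: pull the bracketed tags out of each line.
tags_per_line = [tags_pattern.findall(line) for line in lines]

for line, tags in zip(lines, tags_per_line):
    print(line.split('.')[0], tags)
```

Note that the line pattern matches any digits followed by a period, so a line that does not start with a question number but contains something like a rating of 4.5 would also produce a match. That is one reason the 10-line check on every summary is worth keeping.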


In [9]:
# merge the dictionaries of tags into the dataframe for the filtered restaurant list
tags_df = pd.DataFrame(tags_dict_all)
merged_df = pd.merge(df_Reno_restaurants_shortlist, tags_df, on='business_id')
# save the shortlist restaurants from Reno with tags
with open('df_restaurants_Reno_shortlist_wtags.pkl','wb') as f:
    pickle.dump(merged_df, f)

In [10]:
merged_df.shape

(295, 28)

### 1.2. add the category tags for each restaurant to the dataframe

In [11]:
category_tags_pattern = re.compile(r'"(?:[A-Za-z &-])*"')
category_tags = [
                "Asian Fusion", "Mediterranean", "Italian", "American", "Indian", 
                "Mexican", "Japanese", "Chinese", "Thai", "French", "Korean", 
                "Seafood", "Vegetarian", "Middle Eastern", "Latin American", 
                "Spanish", "German", "Caribbean", "Vietnamese", "Burmese", 
                "Fast Food", "Fine Dining", "Casual Dining", "Family Style", 
                "Breakfast & Brunch", "Sandwiches & Wraps", "Bakery & Pastry", 
                "Coffee & Tea", "Bar & Pub", "Grill & Steakhouse", 
                "Street Food & Food Truck", "Buffet", "Sweets & Desserts", 
                "Gluten-Free & Health Food", "Fusion", "Eastern European", 
                "Kosher", "Irish & British", "BBQ & Grill", "Steak house", 
                "Belgian & Dutch", "Hawaiian", "Wine Bar & Lounge"
                ]
    
# get a parser
def parse_category_tags_per_bid(bid, pattern, collections=None):
    fname = './restaurant_category_tags/res_'+bid+'.pkl'
    with open(fname, 'rb') as f:
        s = pickle.load(f)  
    o = re.findall(pattern, s['summary'])
    o = [a[1:-1] for a in o]
    o = [a for a in o if collections is None or a in collections]  # no filtering when no allow-list is given
    return o


category_tags_dict_all = []
for ind in range(len(bids)):
    tags_dict={}
    bid = bids[ind]
    o = parse_category_tags_per_bid(bid, pattern = category_tags_pattern, collections = category_tags)
    tags_dict['business_id'] = bid
    tags_dict['category tags'] = o
    category_tags_dict_all.append(tags_dict)
# seems to work pretty well!

df_category_tags = pd.DataFrame(category_tags_dict_all)
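
To make the category-tag parsing concrete, here is a small self-contained sketch of the same three steps (match quoted strings, strip the quotes, keep only allowed tags). The response text and the shortened allow-list `allowed` are invented for illustration and are not taken from the real files.

```python
import re

category_tags_pattern = re.compile(r'"(?:[A-Za-z &-])*"')

# Hypothetical response text and a shortened allow-list, for illustration only.
sample_response = 'Tags: "American", "Bar & Pub", "Gluten-Free & Health Food", "Gastropub"'
allowed = {"American", "Bar & Pub", "Gluten-Free & Health Food"}

quoted = category_tags_pattern.findall(sample_response)  # matches still carry the quotes
stripped = [q[1:-1] for q in quoted]                     # drop the surrounding quotes
kept = [t for t in stripped if t in allowed]             # discard tags outside the list

print(kept)
```

The pattern only accepts letters, spaces, `&` and `-` between the quotes, so a quoted tag containing digits or an apostrophe would not match at all. Using a set for the allow-list is optional; it just makes the membership test cheaper than scanning a list.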

In [12]:
merged_df2 = pd.merge(merged_df, df_category_tags, on='business_id')
# save the shortlist restaurants from Reno with tags
with open('df_restaurants_Reno_shortlist_wtags_n_categorytags.pkl','wb') as f:
    pickle.dump(merged_df2, f)

### 1.3 think about refining the restaurant pool based on the category tags and votes

In [7]:
# load the shortlisted restaurants from Reno with tags and category tags
with open('df_restaurants_Reno_shortlist_wtags_n_categorytags.pkl','rb') as f:
    merged_df2 = pickle.load(f)
x = list(merged_df2['category tags'])
y = [a for sublist in x for a in sublist]
print('length of total tags for all restaurants is '+str(len(y)))
uniquey = list(set(y))
print('length of unique tags for all restaurants is '+str(len(uniquey)))


FileNotFoundError: [Errno 2] No such file or directory: 'df_restaurants_Reno_shortlist_wtags_n_categorytags.pkl'

In [5]:
# OK, now look at the occurrence counts of these category tags
from collections import Counter
word_counts = Counter(y)
ranked_words = word_counts.most_common()
ranked_words


NameError: name 'y' is not defined

# Now we are supposed to use the voting results to generate a new pool of restaurants

Go through the category tags of every restaurant, keep only the tags that received votes, and give each restaurant a voting score.
The score is the sum, over all voted tags, of:

    presence of tag (0 or 1) * multiplier
    multiplier = # of votes

In [24]:
# now simulate some votes
votes = \
{'Casual Dining': 2,
 'American': 6,
 'Family Style': 5,
 'Bar & Pub': 0,
 'Breakfast & Brunch': 0,
 'Sandwiches & Wraps': 0,
 'Seafood': 0,
 'Fast Food': 0,
 'Vegetarian': 3,
 'Coffee & Tea': 0,
 'Fine Dining': 8,
 'Bakery & Pastry': 0,
 'Mexican': 2,
 'Italian': 7,
 'Wine Bar & Lounge': 0,
 'Sweets & Desserts': 0,
 'Street Food & Food Truck': 0,
 'Buffet': 0,
 'Gluten-Free & Health Food': 10,
 'Steak house': 2
}

In [26]:
def calculate_score(votes, tags_list):
    score = 0
    for tag in tags_list:
        score += votes.get(tag, 0)  # Add the vote value if tag is in votes, otherwise add 0
    return score

In [27]:
df_category_tags['votes']=df_category_tags['category tags'].apply(lambda x: calculate_score(votes, tags_list=x))
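
As a quick sanity check of the scoring rule, the sketch below scores three made-up restaurants in plain Python. `score_by_presence`, `toy_votes` and `toy_restaurants` are hypothetical names for this illustration. It differs from `calculate_score` in one respect: it deduplicates the tag list with `set()`, which is an assumption about what "presence (0 or 1)" should mean when a tag is repeated.

```python
def score_by_presence(votes, tags_list):
    # set() enforces "presence (0 or 1)": a repeated tag is counted once.
    return sum(votes.get(tag, 0) for tag in set(tags_list))

# Hypothetical votes and restaurants, for illustration only.
toy_votes = {'Italian': 7, 'Fine Dining': 8, 'Bar & Pub': 0}
toy_restaurants = {
    'a': ['Italian', 'Fine Dining'],   # 7 + 8
    'b': ['Bar & Pub', 'American'],    # 0 + 0 (American received no votes)
    'c': ['Italian', 'Italian'],       # 7, the duplicate is ignored
}

scores = {bid: score_by_presence(toy_votes, tags) for bid, tags in toy_restaurants.items()}
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

If the tag lists never contain duplicates, this gives the same result as `calculate_score`.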

In [5]:
with open('./df_restaurants_Reno_shortlist_wtags_n_categorytags.pkl','rb') as f:
    df = pickle.load(f)

In [9]:
df['10 tags for food categories'].iloc[0]

['Brewery',
 'Pizza',
 'Onion Rings',
 'Wings',
 'Burgers',
 'BBQ',
 'Salads',
 'Sandwiches',
 'Meatloaf',
 'French Dip']

In [4]:
with open('df_restaurants_Reno_shortlist_wtags_n_categorytags.pkl','rb') as f:
    df = pickle.load(f)
t1 = '10 tags for food categories'
t2 = '10 tags for taste categories'
df['votes'] = df['category tags'].apply(lambda x: calculate_score(votes, tags_list=x))
df = df.sort_values(by = 'votes', ascending = False)
print(df[['votes', 'business_id', t1, t2, 'stars', 'categories']].head(5))


FileNotFoundError: [Errno 2] No such file or directory: 'df_restaurants_Reno_shortlist_wtags_n_categorytags.pkl'

In [40]:
df.keys()

Index(['business_id', 'name', 'address', 'city', 'state', 'postal_code',
       'latitude', 'longitude', 'stars', 'review_count', 'is_open',
       'attributes', 'categories', 'hours', 'group', 'latitude_rad',
       'longitude_rad', 'distance', 'good aspects (2)',
       'good for breakfast or brunch', 'good for lunch', 'good for dinner',
       'good for group', '10 tags for food categories',
       '10 tags for taste categories', '10 recommended items',
       '2 not recommended items', 'accomodations', 'category tags', 'votes'],
      dtype='object')

In [12]:
df['attributes'][0]

{'BusinessAcceptsCreditCards': 'True',
 'GoodForKids': 'False',
 'RestaurantsPriceRange2': '2',
 'Alcohol': "u'full_bar'",
 'BusinessAcceptsBitcoin': 'False',
 'RestaurantsGoodForGroups': 'True',
 'BikeParking': 'False',
 'BusinessParking': "{'garage': True, 'street': True, 'validated': True, 'lot': True, 'valet': True}",
 'OutdoorSeating': 'False',
 'Caters': 'False',
 'WiFi': "'free'",
 'RestaurantsAttire': "'casual'",
 'BestNights': "{'monday': False, 'tuesday': False, 'friday': True, 'wednesday': False, 'thursday': True, 'sunday': False, 'saturday': True}",
 'GoodForMeal': "{'dessert': False, 'latenight': False, 'lunch': True, 'dinner': True, 'brunch': False, 'breakfast': False}",
 'NoiseLevel': "'average'",
 'HasTV': 'True',
 'RestaurantsReservations': 'False',
 'RestaurantsTakeOut': 'True',
 'RestaurantsTableService': 'True',
 'Ambience': "{'touristy': False, 'hipster': False, 'romantic': False, 'divey': False, 'intimate': False, 'trendy': False, 'upscale': False, 'classy': True,

In [29]:
pool2 = df.sort_values(by='votes', ascending=False).head(50) # take top N restaurants for the next round of tags

### 1.4 try to generate the summary tags from the bags of tags over the new groups of restaurants (on hold)

In [21]:
## Explore - take out all the tags for both categories, and see how many unique tags there are

In [None]:
t1 = '10 tags for food categories'
t2 = '10 tags for taste categories'


In [37]:
foods_tags = list(merged_df['10 tags for food categories'])
taste_tags = list(merged_df['10 tags for taste categories'])
flattend_foods_tags = [item.lower() for sublist in foods_tags for item in sublist]
flattend_taste_tags = [item.lower() for sublist in taste_tags for item in sublist]
unique_foods = list(set(flattend_foods_tags))
unique_taste = list(set(flattend_taste_tags))
print('total foods tags: ' + str(len(flattend_foods_tags)) + ', unique tags number ' + str(len(set(flattend_foods_tags))))
print('total taste tags: ' + str(len(flattend_taste_tags)) + ', unique tags number ' + str(len(set(flattend_taste_tags))))


total foods tags: 1973, unique tags number 985
total taste tags: 1985, unique tags number 557
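
With this many unique tags, one possible next step is to look at how often each tag occurs after light normalization. The sketch below is only a suggestion on made-up data; `toy_food_tags` is hypothetical, and stripping whitespace is an extra normalization step that the cell above does not perform.

```python
from collections import Counter

# Hypothetical per-restaurant tag lists, for illustration only.
toy_food_tags = [
    ['Pizza', 'Burgers', 'Salads'],
    ['pizza ', 'Sandwiches'],
    ['Burgers', 'PIZZA', 'Tacos'],
]

# Lowercase and strip whitespace so spelling variants collapse into one tag.
normalized = [tag.strip().lower() for tags in toy_food_tags for tag in tags]
counts = Counter(normalized)

print(len(normalized), len(counts))   # total tags, unique tags
print(counts.most_common(2))          # the most frequent tags first
```

Tags that appear only once are poor candidates for voting, so a frequency cut-off on `counts` could shrink the candidate list before any further grouping.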
