# Sketch Notebook

In [3]:
import json
import numpy as np
import pandas as pd

# Parse the JSON file of reviews into parallel lists of snippets and ratings
reviews = []
ratings = []

with open('./json/butcher-reviews.json') as f:
    reviews_temp = json.load(f)
    for review in reviews_temp:
        reviews.append(review['snippet'])
        ratings.append(review['rating'])

In [10]:
print(len(reviews))
print(len(ratings))
# Preview the first three reviews
for review in reviews[:3]:
    print(review + '\n')

176
176
I don’t know where to begin. The staff is extremely friendly and professional. Very knowledgeable and helpful. We are not very accustomed to this kind of restaurant, and they helped us pick items and the right wine. Steaks were perfectly cooked. Sides were phenomenal. The cheesecake is to die for. Highly recommend for a special night out.

Trendy and chic, incredibly decadent and lavish. Great place to celebrate a special occasion. Of course amazing steaks as you’d expect. The ceviche was amazing, and the cheesecake dessert was a work of art! Amazing cocktail options and incredible variety of bourbon. Lots of rare options that you can’t find elsewhere.

Fantastic Steak and Seafood Restaurant. Service was impeccable, food was outstanding. Drinks and wine were top notch. One of the best restaurants in Toronto. It's a hidden gem in the city. Minutes away from Scotiabank Arena at Bay and Harbour Streets. Great for business meetings, date nights or out with friends. They have an out

In [44]:
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

n_gram_range = (1, 1)
stop_words = "english"

# Extract candidate words/phrases
count = CountVectorizer(ngram_range=n_gram_range, stop_words=stop_words).fit(reviews)
candidates = count.get_feature_names_out()
custom_kws = ['quite', 'intimate', 'dim lighting']  # custom keywords; these could be passed in from the front end (FE) in the future

# Next, convert both the reviews and the candidate keywords/keyphrases to embeddings using a pre-trained DistilBERT sentence model
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
review_embeddings = model.encode(reviews)
candidate_embeddings = model.encode(candidates)

In [45]:
# Use cosine similarity to compare candidate embeddings with all review embeddings (vectorized)
top_n = 10
comparison = {} # kw: mean similarity
# Despite the name, `distances` holds similarities (higher = closer): an ndarray of shape (n_candidates, n_reviews)
distances = cosine_similarity(candidate_embeddings, review_embeddings)

# Compute the mean similarity for each candidate kw (vectorized)
mean_distances = np.mean(distances, axis=1)
print(f'mean_distances shape: {mean_distances.shape}')

# argsort is ascending, so the last top_n indices are the most similar candidates
keywords = [candidates[index] for index in mean_distances.argsort()[-top_n:]]

mean_distances shape: (1264,)
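
The ranking step above can be checked on a toy example. This is a minimal sketch using only NumPy. The helper `cosine_similarity_matrix`, the two-dimensional vectors, and the words are hypothetical stand-ins for the real embeddings, chosen so the result is easy to verify by hand.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

# Hypothetical stand-ins for candidate_embeddings and review_embeddings
toy_words = ["steak", "parking", "dinner"]
toy_candidates = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_reviews = np.array([[1.0, 0.0], [2.0, 1.0]])

sims = cosine_similarity_matrix(toy_candidates, toy_reviews)  # shape (3, 2)
mean_sims = sims.mean(axis=1)                                  # shape (3,)

# argsort is ascending, so reverse it to list the best match first
top_2 = [toy_words[i] for i in mean_sims.argsort()[::-1][:2]]
print(top_2)  # ['steak', 'dinner']
```

Reversing the `argsort` result puts the strongest keyword first, whereas the slice `[-top_n:]` used in the cell above returns the same keywords in ascending order.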


In [None]:
# Diversify the returned keywords (not implemented yet)
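
This cell is still a stub. One common option for diversification is Maximal Marginal Relevance (MMR), which trades relevance against redundancy with keywords already chosen. The sketch below is an assumption about how it could be done here, not the notebook's method. The function `mmr`, its `diversity` parameter, and the toy similarity values are all hypothetical.

```python
import numpy as np

def mmr(doc_sim, cand_sim, words, top_n=5, diversity=0.5):
    """Greedy Maximal Marginal Relevance selection (illustrative sketch).

    doc_sim:  shape (n_candidates,), relevance of each candidate to the reviews
    cand_sim: shape (n_candidates, n_candidates), candidate-to-candidate similarity
    """
    selected = [int(np.argmax(doc_sim))]  # start with the most relevant candidate
    remaining = [i for i in range(len(words)) if i != selected[0]]

    while remaining and len(selected) < top_n:
        # Redundancy: highest similarity to any keyword already selected
        redundancy = cand_sim[np.ix_(remaining, selected)].max(axis=1)
        scores = (1 - diversity) * doc_sim[remaining] - diversity * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)

    return [words[i] for i in selected]

# Hypothetical toy values: "tasty" and "flavorful" are near-duplicates
toy_words = ["tasty", "flavorful", "service"]
toy_doc_sim = np.array([0.9, 0.85, 0.6])
toy_cand_sim = np.array([[1.0, 0.95, 0.2],
                         [0.95, 1.0, 0.3],
                         [0.2, 0.3, 1.0]])

diverse = mmr(toy_doc_sim, toy_cand_sim, toy_words, top_n=2, diversity=0.5)
print(diverse)  # ['tasty', 'service']
```

In this notebook, `doc_sim` would presumably be `mean_distances` and `cand_sim` would be `cosine_similarity(candidate_embeddings, candidate_embeddings)`. With `diversity=0` the selection falls back to plain relevance ranking.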

In [None]:
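# Add custom keywords from the FE to the final extracted keyword list
# (keywords is a plain list, so use list.append; np.append returns a new array and its result was being discarded)
for kw in custom_kws:
    if kw not in keywords:
        keywords.append(kw)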

# Iterate through the concatenated reviews to count keyword occurrences and store the counts in a dict
kw_cnt = {}
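
The counting step is not written yet. A minimal sketch of one way to fill `kw_cnt`, assuming case-insensitive, whole-word matching is what is wanted. The helper `count_keywords` and the sample reviews are hypothetical.

```python
import re
from collections import Counter

def count_keywords(keywords, reviews):
    """Count whole-word, case-insensitive occurrences of each keyword across all reviews."""
    text = " ".join(reviews).lower()
    counts = Counter()
    for kw in keywords:
        # \b word boundaries keep "chef" from matching inside "chefs"
        pattern = r"\b" + re.escape(kw.lower()) + r"\b"
        counts[kw] = len(re.findall(pattern, text))
    return dict(counts)

# Hypothetical sample data for illustration
sample_reviews = ["Tasty steak. The chef was great.", "Chefs here make tasty sides."]
kw_cnt = count_keywords(["tasty", "chef", "dim lighting"], sample_reviews)
print(kw_cnt)  # {'tasty': 2, 'chef': 1, 'dim lighting': 0}
```

Multi-word keywords such as `'dim lighting'` are handled by the same pattern, since the concatenated text is searched as a single string.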


In [46]:
# print(len(candidates))
# print(candidates[:100])
print(candidate_embeddings.shape)
print(review_embeddings.shape)
print(keywords)

(1264, 768)
(176, 768)
['chef', 'appetizers', 'dinner', 'deliciousness', 'culinary', 'chefs', 'dinners', 'flavoursome', 'flavorful', 'tasty']
