In [None]:
!pip install -U -t /kaggle/working/ git+https://github.com/Kaggle/learntools.git@nlp

In [None]:
import sys
sys.path.append('/kaggle/working')

**[Natural Language Processing Home Page](https://www.kaggle.com/learn/natural-language-processing)**

---


# Basic Text Processing with SpaCy

In this exercise, you'll use SpaCy to generate some basic statistics from Yelp reviews. You'll be looking at reviews for specific dishes from an Italian deli, [DeFalco's in Scottsdale, Arizona](https://defalcosdeli.com/index.html). For example, they have meatball subs!

<img src="https://upload.wikimedia.org/wikipedia/commons/0/0a/Meatballs_sandwich10000000041678_000334_%2815638892980%29.jpg" alt="meatball sub">
    
You're a consultant for the restaurant looking to get insight into the quality of their food. You have an idea to use customer ratings from Yelp reviews to measure the quality of specific dishes. Assuming that a customer's rating and the menu items mentioned in the review are correlated, items that consistently appear in reviews with low ratings are likely subpar. Using this analysis, you can provide feedback to the owner.

The goal then is to extract menu items from the review text and find basic statistics on the ratings. For example, you can count how many times specific dishes appear in the reviews.

First you'll load pandas and set up the code checking, then load the data from a JSON file. SpaCy is loaded in the first exercise.

In [None]:
import pandas as pd

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.nlp.ex1 import *
print("\nSetup complete")

In [None]:
# Load in the data from JSON file
data = pd.read_json('../input/nlp-course/restaurant.json')
data.head()

I've provided a list with the menu items and common alternate spellings. This could be improved, but it will be good for this exercise.

In [None]:
menu = ["Cheese Steak", "Cheesesteak", "Steak and Cheese", "Italian Combo", "Tiramisu", "Cannoli",
        "Chicken Salad", "Chicken Spinach Salad", "Meatball", "Pizza", "Pizzas", "Spaghetti",
        "Bruchetta", "Eggplant", "Italian Beef", "Purista", "Pasta", "Calzones",  "Calzone",
        "Italian Sausage", "Chicken Cutlet", "Chicken Parm", "Chicken Parmesan", "Gnocchi",
        "Chicken Pesto", "Turkey Sandwich", "Turkey Breast", "Ziti", "Portobello", "Reuben",
        "Mozzarella Caprese",  "Corned Beef", "Garlic Bread", "Pastrami", "Roast Beef",
        "Tuna Salad", "Lasagna", "Artichoke Salad", "Fettuccini Alfredo", "Chicken Parmigiana",
        "Grilled Veggie", "Grilled Veggies", "Grilled Vegetable", "Mac and Cheese", "Macaroni",  
         "Prosciutto", "Salami"]

### Exercise 1: Find items in one review

First up, you'll use SpaCy to find menu items in a single review. For this you can use `PhraseMatcher`, which matches sequences of tokens against phrase patterns. By comparison, `Matcher` uses patterns that describe individual tokens; since some of the menu items are multi-word phrases, `PhraseMatcher` is the more convenient fit here. Note that while the menu items are in title case, review authors write food items in a variety of cases, so you'll need to tell the `PhraseMatcher` to perform case-insensitive matching with the `attr` keyword argument.

Using the `nlp` model, create a list of phrase docs from the `menu` list. Add the patterns to the `PhraseMatcher` with the key `"MENU"`. Then use the `PhraseMatcher` to find matches in `review_doc`, an example review from the dataset.

In [None]:
import spacy
from spacy.matcher import PhraseMatcher

# Load the SpaCy model
nlp = spacy.load('en_core_web_sm')
# Create the doc object
review_doc = nlp(data.iloc[4].text)

# Create the PhraseMatcher object, be sure to match on lowercase text
matcher = ____

# Create a list of docs for each item in the menu
patterns = ____

# Add the item patterns to the matcher
____

# Find matches in the review_doc
matches = ____

In [None]:
# Uncomment if you need some guidance
# q_1.hint()
# q_1.solution()

In [None]:
# After implementing the above cell, uncomment and run this to print 
# out the matches. Otherwise you'll get an error.

# for match in matches:
#     print(f"At position {match[1]}: {review_doc[match[1]:match[2]]}")

In [None]:
#%%RM_IF(PROD)%%

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')
review_doc = nlp(data.iloc[4].text)

matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
patterns = [nlp(item) for item in menu]
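# Pass the patterns as a list (spaCy 2.2.2 and later);
# older releases used matcher.add("MENU", None, *patterns)
matcher.add("MENU", patterns)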
matches = matcher(review_doc)

for match in matches:
    print(f"At position {match[1]}: {review_doc[match[1]:match[2]]}")
    
# Uncomment when checking code is complete
q_1.assert_check_passed()

### Exercise 2: Matching on the whole dataset

Now run this matcher over the whole dataset and collect ratings for each menu item. Each review has a rating, `review.stars`. For each item that appears in the review text (`review.text`), append the review's rating to a list of ratings for that item. The lists are kept in a dictionary `item_ratings`.

To get the matched phrases, you can reference the PhraseMatcher documentation for the structure of each match object:

>A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern.
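To make the tuple layout concrete, here is a minimal sketch that uses a plain Python list of tokens as a stand-in for a SpaCy `Doc`. The `tokens` list and the hard-coded `matches` values are made up for illustration; a real matcher returns an integer hash as `match_id`, and slicing a real `Doc` gives a `Span` rather than a list.

```python
# Stand-in for a tokenized review (illustration only, not a spaCy Doc)
tokens = ["The", "cheese", "steak", "and", "the", "pizza", "were", "great"]

# Hypothetical matcher output: (match_id, start, end) with token offsets.
# The match_id values are arbitrary placeholders.
matches = [(101, 1, 3), (101, 5, 6)]

found = []
for match_id, start, end in matches:
    # end is exclusive, exactly like an ordinary Python slice
    phrase = " ".join(tokens[start:end])
    found.append(phrase)
    print(f"tokens {start}-{end}: {phrase}")
```

Unpacking the tuple into `match_id, start, end` tends to read more clearly than indexing with `match[1]` and `match[2]`.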

In [None]:
from collections import defaultdict

# item_ratings is a dictionary of lists. If a key doesn't exist in item_ratings,
# the key is added with an empty list as the value.
item_ratings = defaultdict(list)

for idx, review in data.iterrows():
    doc = ____
    # Using the matcher from the previous exercise
    matches = ____
    
    # Create a set of the items found in the review text
    found_items = ____
    
    # Update item_ratings with rating for each item in found_items
    # Transform the item strings to lowercase to make it case insensitive
    ____

q_2.check()

In [None]:
# Uncomment if you need some guidance
#q_2.hint()
#q_2.solution()

In [None]:
#%%RM_IF(PROD)%%

from collections import defaultdict

item_ratings = defaultdict(list)

for idx, review in data.iterrows():
    doc = nlp(review.text)
    matches = matcher(doc)

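    # Deduplicate by lowercased text so an item mentioned several times
    # in one review contributes that review's rating only once
    found_items = {doc[start:end].text.lower() for _, start, end in matches}
    
    for item in found_items:
        item_ratings[item].append(review.stars)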
        
q_2.assert_check_passed()

### Combine Similar Items

Because language is messy, items like Steak and Cheese, Cheesesteak, and Cheese Steak all refer to the same dish but are counted separately. Before doing any analysis, you should combine these items.

In [None]:
similar_items = [('cheesesteak', 'cheese steak'),
                 ('cheesesteak', 'steak and cheese'),
                 ('chicken parmigiana', 'chicken parm'),
                 ('chicken parmigiana', 'chicken parmesan'),
                 ('mac and cheese', 'macaroni'),
                 ('calzone', 'calzones')]

for (destination, source) in similar_items:
    # The [] default avoids a KeyError when a variant never appears in the reviews
    item_ratings[destination].extend(item_ratings.pop(source, []))
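
The merge step can be tried on a toy dictionary first. This sketch uses made-up ratings (the names `toy_ratings` and `toy_similar` are hypothetical) and passes a default to `pop`, so a variant that never appeared in any review does not raise a `KeyError`.

```python
from collections import defaultdict

# Made-up ratings, for illustration only
toy_ratings = defaultdict(list)
toy_ratings["cheesesteak"] = [5, 4]
toy_ratings["cheese steak"] = [3]
toy_ratings["calzone"] = [4]

toy_similar = [("cheesesteak", "cheese steak"),
               ("cheesesteak", "steak and cheese"),  # variant with no reviews
               ("calzone", "calzones")]              # variant with no reviews

for destination, source in toy_similar:
    # pop removes the variant's list; the [] default covers absent variants
    toy_ratings[destination].extend(toy_ratings.pop(source, []))

print(dict(toy_ratings))
```

After the loop, only the destination keys remain, and each holds the ratings from all of its variants.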

### Exercise 3: Which items are the best reviewed?

Using these item ratings, find the mean rating for each item. Then sort the items by mean rating to find the best-reviewed ones.

In [None]:
# Calculate the mean ratings for each menu item as a dictionary
mean_ratings = ____

# Sort the item names by mean rating in descending order; the result should be a list
best_items = ____

q_3.check()

In [None]:
# Uncomment if you need some guidance
# q_3.hint()
# q_3.solution()

In [None]:
# After implementing the above cell, uncomment and run this to print 
# out the best items. Otherwise you'll get an error.

# for item in best_items:
#     print(f"{item:>25}{mean_ratings[item]:>10.3f}")

In [None]:
#%%RM_IF(PROD)%%

mean_ratings = {item: sum(ratings)/len(ratings) for item, ratings in item_ratings.items()}
best_items = sorted(mean_ratings, key=mean_ratings.get, reverse=True)

for item in best_items:
    print(f"{item:>25}{mean_ratings[item]:>10.3f}")
    
q_3.assert_check_passed()

### Which items are the most popular?

Similar to the mean ratings, you can calculate the number of reviews for each item.

In [None]:
counts = {item: len(ratings) for item, ratings in item_ratings.items()}

In [None]:
item_counts = sorted(counts, key=counts.get, reverse=True)
for item in item_counts:
    print(f"{item:>25}{counts[item]:>5}")

### Thought Question: Are counts important here?

Finally, print out the 10 best and 10 worst items. Print the item name, the average rating, and the count. It's important to consider the number of ratings for a specific item when using the mean to make decisions or suggestions. Why is this?

Uncomment the following line after you've decided your answer.

In [None]:
#q_4.solution()

In [None]:
print("Best rated menu items:")
for item in best_items[:10]:
    print(f"{item:20} Average rating: {mean_ratings[item]:.3f} \tcount: {counts[item]}")

In [None]:
print("Worst rated menu items:")
for item in best_items[:-11:-1]:  # last 10 items, worst first
    print(f"{item:20} Average rating: {mean_ratings[item]:.3f} \tcount: {counts[item]}")
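
One way to see why the counts matter is to compare the standard error of the mean for items with few and many ratings. The sketch below uses made-up ratings and a hypothetical `min_count` cutoff; neither comes from the Yelp data, and the cutoff value is an arbitrary choice for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Made-up ratings, for illustration only
toy_ratings = {
    "tiramisu": [5, 4],                        # two reviews
    "pizza": [5, 4, 4, 5, 3, 4, 5, 4, 4, 5],   # ten reviews
}

min_count = 5  # hypothetical cutoff for trusting a mean

summary = {}
for item, ratings in toy_ratings.items():
    n = len(ratings)
    # The standard error of the mean shrinks roughly as 1 / sqrt(n)
    sem = stdev(ratings) / sqrt(n) if n > 1 else float("inf")
    summary[item] = (mean(ratings), n, sem)
    print(f"{item:10} mean: {summary[item][0]:.2f}  count: {n:2}  std. error: {sem:.2f}")

# Keep only the items with enough ratings to take the mean seriously
reliable = [item for item, (avg, n, sem) in summary.items() if n >= min_count]
print("Enough ratings:", reliable)
```

Here the toy tiramisu has the higher mean, but with only two ratings its estimate is far less certain than the pizza's, so a single additional review could change the ranking.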

### Next Up!

In the next tutorial you'll learn how to create a text classification model with SpaCy.

---
**[Natural Language Processing Home Page](https://www.kaggle.com/learn/natural-language-processing)**





*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum) to chat with other Learners.*