# Tagging types of pizza with ddlite: candidate extraction

## Introduction

Here's the pipeline:

1. Obtain and parse input data (Yelp reviews from pizza restaurants)
2. Extract candidates for tagging
3. Generate features

In [1]:
%load_ext autoreload
%autoreload 2

import cPickle, os, sys
sys.path.insert(1, os.path.join(sys.path[0], '..'))

from ddlite import *

## Processing the input data
The following code parses the JSON files of the Yelp academic dataset, keeps up to 50 reviews per pizza restaurant, and writes one text file per restaurant.

In [3]:
import os, codecs, json, shutil, glob

#assign variables to path to the json files
FREVIEW = os.path.join('yelp_data', 'yelp_academic_dataset_review.json')
FBUSINESS = os.path.join('yelp_data', 'yelp_academic_dataset_business.json')

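#delete any existing files in the reviews folder
#(relative path, so it matches the folder the review files are written to below)
files = glob.glob(os.path.join('yelp_pizza_reviews', '*'))
for f in files:
    os.remove(f)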

def getReviews(quantOfRest=100000, quantOfReviewsPerRest=5000000):
    #collect the ids of businesses tagged as both "Restaurants" and "Pizza"
    #(a set keeps the membership test below fast)
    restaurantIDs = set()
    with codecs.open(FBUSINESS,'rU','utf-8') as f:
        for line in f:
            business = json.loads(line)
            if "Restaurants" in business["categories"] and "Pizza" in business["categories"]:
                restaurantIDs.add(business['business_id'])
    print "Pizza restaurantIDs count", len(restaurantIDs)

    #create dictionary of RestaurantID to Reviews
    dictRestaurantIDsToReview = {}
    with codecs.open(FREVIEW,'rU','utf-8') as f:
        for line in f:
            review = json.loads(line)
            ID = review['business_id']
            if ID not in restaurantIDs:
                continue
            if ID in dictRestaurantIDsToReview:
                #known restaurant: append until the per-restaurant cap is hit
                if len(dictRestaurantIDsToReview[ID]) < quantOfReviewsPerRest:
                    dictRestaurantIDsToReview[ID].append(review['text'])
            elif len(dictRestaurantIDsToReview) < quantOfRest:
                dictRestaurantIDsToReview[ID] = [review['text']]
            else:
                #restaurant limit reached: stop reading reviews
                break
    return dictRestaurantIDsToReview

#get reviews in the form of a dictionary
dictRestaurantIDsToReview = getReviews(quantOfReviewsPerRest=50)

#save reviews to folder as text files.  Each restaurant has a separate review file.
count = 0
for restID, restReviews in dictRestaurantIDsToReview.items():
    reviews = ""
    for review in restReviews:
        #drop non-ASCII characters and separate reviews with a space
        reviews += review.encode('ascii', errors='ignore') + " "
        count += 1
    #use a context manager so the file is flushed and closed
    with open(os.path.join("yelp_pizza_reviews", "reviews_" + restID + ".txt"), "w") as out:
        out.write(reviews)

#try to remove the .DS_Store file; otherwise DocParser throws an exception
try:
    os.remove("yelp_pizza_reviews/.DS_Store")
except OSError:
    print "No .DS_Store file"
    
print count

Pizza restaurantIDs count 2223
No .DS_Store file
44806
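
The core of `getReviews` is a two-pass filter over JSON-lines input: first collect the IDs of businesses tagged with both "Restaurants" and "Pizza", then group review texts by ID with a per-restaurant cap. Below is a minimal sketch of that logic on in-memory sample records. The sample records and the helper names `pizza_ids` and `group_reviews` are made up for illustration and are not part of the Yelp dataset or ddlite. The sketch also omits the cap on the number of restaurants.

```python
import json
from collections import defaultdict

# Hypothetical sample lines standing in for the Yelp JSON-lines files.
business_lines = [
    '{"business_id": "a1", "categories": ["Restaurants", "Pizza"]}',
    '{"business_id": "b2", "categories": ["Restaurants", "Sushi Bars"]}',
    '{"business_id": "c3", "categories": ["Pizza", "Restaurants", "Italian"]}',
]
review_lines = [
    '{"business_id": "a1", "text": "Great pepperoni pizza."}',
    '{"business_id": "b2", "text": "Fresh fish."}',
    '{"business_id": "a1", "text": "Crust was soggy."}',
    '{"business_id": "a1", "text": "Third review, over the cap."}',
    '{"business_id": "c3", "text": "Good sausage and onion pizza."}',
]

def pizza_ids(lines):
    """Return the set of business ids tagged as both Restaurants and Pizza."""
    ids = set()
    for line in lines:
        business = json.loads(line)
        categories = business.get("categories") or []
        if "Restaurants" in categories and "Pizza" in categories:
            ids.add(business["business_id"])
    return ids

def group_reviews(lines, ids, max_per_restaurant):
    """Group review texts by business id, keeping at most max_per_restaurant each."""
    grouped = defaultdict(list)
    for line in lines:
        review = json.loads(line)
        rid = review["business_id"]
        # Short-circuit: ids outside the pizza set never create an entry.
        if rid in ids and len(grouped[rid]) < max_per_restaurant:
            grouped[rid].append(review["text"])
    return dict(grouped)

ids = pizza_ids(business_lines)
grouped = group_reviews(review_lines, ids, max_per_restaurant=2)
```

Parsing each line once and storing the IDs in a set keeps the second pass linear in the number of reviews.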


In [2]:
dp = DocParser('yelp_pizza_reviews/')
docs = list(dp.readDocs())

Now we'll use CoreNLP via ddlite's `SentenceParser` to parse each sentence. `DocParser` can handle this step itself through `parseDocSentences()`, so the `readDocs()` call above was not strictly needed. Parsing can take a little while, so if the example has already been run, we reload the saved sentences instead.

In [6]:
docs = None

pkl_f = 'yelp_tag_saved_sents_v4.pkl'
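try:
    with open(pkl_f, 'rb') as f:
        sents = cPickle.load(f)
except (IOError, EOFError, cPickle.UnpicklingError):
    #no usable cache yet: parse the documents and save the result
    %time sents = dp.parseDocSentences()
    #pickles should be written in binary mode
    with open(pkl_f, 'wb') as f:
        cPickle.dump(sents, f)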

print sents[0]

Sentence(words=[u'This', u'restaurant', u'used', u'to', u'be', u'called', u'La', u'Piazza', u'.'], lemmas=[u'this', u'restaurant', u'use', u'to', u'be', u'call', u'La', u'Piazza', u'.'], poses=[u'DT', u'NN', u'VBN', u'TO', u'VB', u'VBN', u'NNP', u'NNP', u'.'], dep_parents=[2, 0, 2, 6, 6, 3, 8, 6, 2], dep_labels=[u'det', u'ROOT', u'acl', u'mark', u'auxpass', u'xcomp', u'compound', u'xcomp', u'punct'], sent_id=0, doc_id=0, text=u'This restaurant used to be called La Piazza.', token_idxs=[0, 5, 16, 21, 24, 27, 34, 37, 43], doc_name='reviews_CwKyfU1JQRd3rHSYORG3hw.txt')
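
Reading the printed `Sentence`, `token_idxs` appears to hold the character offset of each token in `text`, and `dep_parents` appears to use 1-based token positions with 0 marking the root. Both conventions are inferred from this one example rather than taken from ddlite's documentation. A small sketch that checks them against the values printed above:

```python
# Values copied from the printed Sentence above.
text = u'This restaurant used to be called La Piazza.'
words = [u'This', u'restaurant', u'used', u'to', u'be', u'called', u'La', u'Piazza', u'.']
token_idxs = [0, 5, 16, 21, 24, 27, 34, 37, 43]
dep_parents = [2, 0, 2, 6, 6, 3, 8, 6, 2]

# If token_idxs are character offsets, slicing the raw text at each
# offset should give back the corresponding word.
recovered = [text[i:i + len(w)] for i, w in zip(token_idxs, words)]

# Assumed convention: parent positions are 1-based and 0 means ROOT.
heads = [u'ROOT' if p == 0 else words[p - 1] for p in dep_parents]
```

Under these assumptions, `recovered` equals `words`, and `heads` gives the head word of each token, for example `La` attaches to `Piazza`.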


## Extracting candidates with matchers
We use regex matchers to extract candidates. The topping list acts as a small dictionary, which should provide fairly high recall. For each topping, we generate three regex matchers that require the topping to be the first word and "pizza" to be the last word, with one, two, or three words in between.

In [7]:
toppings = ["mushroom","pepperoni","sausage","hawaiian","pineapple","beef","pork","chicken",
            "Italian","salami","meatball","ham","bacon","spinach","tomato","onion","pepper"]

def gen_regex_match(topping):
    pattern = topping + r"\s\w+\spizza"
    m1 = RegexNgramMatch(label=topping+"m1", regex_pattern=pattern, ignore_case=True)
    pattern = topping + r"\s\w+\s\w+\spizza"
    m2 = RegexNgramMatch(label=topping+"m2", regex_pattern=pattern, ignore_case=True)
    pattern = topping + r"\s\w+\s\w+\s\w+\spizza"
    m3 = RegexNgramMatch(label=topping+"m3", regex_pattern=pattern, ignore_case=True)
    return [m1, m2, m3]

args = []
for topping in toppings:
    args += gen_regex_match(topping)

    
# old rules
#pizza_regex1 = RegexNgramMatch(label='Pizza', regex_pattern=r'\w+\spizza', ignore_case=True)
#pizza_regex2 = RegexNgramMatch(label='Pizza', regex_pattern=r'\w+\s\w+\spizza', ignore_case=True)
#pizza_regex3 = RegexNgramMatch(label='Pizza', regex_pattern=r'\w+\s\w+\s\w+\spizza', ignore_case=True)
#pizza_regex4 = RegexNgramMatch(label='Pizza', regex_pattern=r'\w+\s\w+\s\w+\s\w+\spizza', ignore_case=True)


In [8]:
#combine all matchers
CE = Union(*args)
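
To see which phrases the three patterns per topping accept, here is a minimal sketch using only the standard `re` module. It does not reproduce how `RegexNgramMatch` scans n-grams. It assumes the whole phrase must match, which is why the sketch anchors each pattern at both ends. The helper names `gap_patterns` and `matching_gap` are hypothetical.

```python
import re

def gap_patterns(topping, max_gap=3):
    """One pattern per gap size: topping, then 1..max_gap words, then 'pizza'."""
    return [
        re.compile(topping + r"\s\w+" * gap + r"\spizza\Z", re.IGNORECASE)
        for gap in range(1, max_gap + 1)
    ]

def matching_gap(topping, phrase):
    """Return the number of in-between words if a pattern matches, else None."""
    for gap, pattern in enumerate(gap_patterns(topping), start=1):
        # match() anchors at the start; \Z in the pattern anchors at the end.
        if pattern.match(phrase):
            return gap
    return None

three_words = matching_gap("chicken", "Chicken and a Hawaiian pizza")
two_words = matching_gap("pepperoni", "pepperoni deep dish pizza")
no_gap = matching_gap("hawaiian", "Hawaiian pizza")
prefix_only = matching_gap("ham", "hamburger and cheese pizza")
```

Note that every pattern requires at least one word between the topping and "pizza", so a phrase such as "Hawaiian pizza" is not matched on its own. Because the topping must be followed by whitespace, "ham" does not match inside "hamburger".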

## Creating the candidates
We'll use our unioned candidate extractor to extract our candidate entities from the sentences into an `Entities` object. 

In [9]:
E = Entities(sents, CE)

In [10]:
# Number of entities we extracted
len(E)

1021

We can render a parse tree visualization of an extracted entity:

In [11]:
E[0].render()

In [12]:
E[1].mention(attribute='words')

[u'Chicken', u'and', u'a', u'Hawaiian', u'pizza']

Finally, we'll pickle the extracted candidates from our `Entities` object for use in the learning IPython notebook.

In [13]:
E.dump_candidates('yelp_tag_saved_entities_v5.pkl')