# Using Yelp Data to Recommend the Location of the Next Lou Malnati: Preprocessing, Training, and Modeling 
Chicago pizza magnate Lou Malnati is looking to expand his national pizza empire. With 59 locations in Illinois, seven in Arizona, four in Wisconsin, and four in Indiana, Malnati is interested in expanding both within Arizona and Indiana and into other states, in particular Florida, Pennsylvania, New Jersey, and Missouri.
Malnati’s restaurants are known for their deep dish pizza, and his team is looking for locations that either lack deep dish options or have pizza options that are not satisfying consumers. The team believes that they can both introduce deep dish to new customers and lure currently unsatisfied customers with their nationally recognized pizza brand.
Malnati’s team has requested an analysis of the existing landscape in the four new states along with Arizona and Indiana. They want to understand which state holds the most promise for one or more new locations. Ideally, they would like to open multiple locations and want to know whether one of the new states would be a better option than continuing to open restaurants in Arizona and Indiana.

**The purpose of this notebook is to pre-process the data, split it into training and test sets, and build and evaluate several models for predicting the star rating from the review text.**

## Data Sources
All data has been downloaded directly from [Yelp](https://www.yelp.com/dataset):

1. yelp_academic_dataset_business.json: contains business data including location data, attributes, and categories
2. yelp_academic_dataset_review.json: contains full review text data including the user_id that wrote the review and the business_id the review is written for.

The data was loaded and read into pandas dataframes in the 1-ridgway-read-data notebook. The dataframes were filtered to businesses with "pizza" in their categories and then pickled. The pickled datasets were then cleaned and merged, the text feature was prepared (e.g., tokenization, lemmatization), and the result was pickled once again:

- processed.pkl: pickled dataframe containing reviews and select business information for pizza businesses in select states

## Changes
- 03-21-22: Started pre-processing

## Summary of Pre-Processing and Modeling

TBD

## Import Libraries

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

## File Locations

In [11]:
processed_df = '../data/processed/processed.pkl'

## Load Data

In [12]:
df = pd.read_pickle(processed_df)

In [13]:
df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,346139,346140,346141,346142,346143,346144,346145,346146,346147,346148
review_id,XW_LfMv0fV21l9c6xQd_lw,gTBmyv_0E8LaCujbzP_oOw,eaZ4tpxGaZ-_STjRs_Qs8Q,NUI4r6IguPlYbp-hRHDXBg,gJ7fifhiME55IvK6GmYkjA,HrdMGSK09LlPZp8Tn4OAYA,Cvaa0B7d-ZK_v1ji0NAPZw,aZNpuDmIOYEw-WCRFfYPIQ,xAzPyGQN-56cs05h7somVA,qJt8eX18v0qW9ZHULHL_yQ,...,rNckjjKzFbEV6SRCiHJuJQ,YUbj6EeyNB9VJ24_i3uB-w,Mb1s9M-lBrnwHfLhbJHzDg,CHfNYEgvt-4yhrvkEY74hA,jpih2-xPqqaHDUivW8Rdug,HNbKVmQWYXgJeAjFVkA_Lg,lt1cguB7keZNcI3nWuLdyw,F2LxM15Ie7HAIfCxdV0OVg,WGs6wet1daSU-gEisraOlA,mif-uUZ65h7S6n6LW5Cr4A
user_id,9OAtfnWag-ajVxRbUTGIyg,5h9JA231vPilNAIjHxwGng,c1fSI6Dv5lybr0AJh67e7w,st-q1iyW3sJm-v0OCrheoQ,9qBdzBzoDxLFSMhhGrTWJg,TxrljvgguFghGdKpO1uqSA,kX9OMmA9XLpNPWXQtTe33Q,VRvD7-JHdWdTQnh1lRCw_A,kx1AyDQfkPHSkPstBQlDzg,2XFQXe_Ewzj1VA0PiEHG_A,...,XSOdhb9CE747hrmY8cR0NQ,WLaCTpXXdrFKlUTXDsp6Tg,GCgBvm0T1fZINcp1myOKoA,1pQMdzswD8vplCYbj6MZ5w,iS94VPcHINDyrgFkV0J9yQ,F4-aIdXAu86DPoQZRpbBHw,cO5oGRCrztPQ3lhkALF9jQ,3BsZVcNu71Pl4sqcctN2KQ,9XDFZlGs4-QwQx44YODnWQ,3WWEuDYQ3ssQl1cfl5ui8w
business_id,lj-E32x9_FA7GmUrBGBEWg,lj-E32x9_FA7GmUrBGBEWg,lj-E32x9_FA7GmUrBGBEWg,lj-E32x9_FA7GmUrBGBEWg,lj-E32x9_FA7GmUrBGBEWg,lj-E32x9_FA7GmUrBGBEWg,lj-E32x9_FA7GmUrBGBEWg,lj-E32x9_FA7GmUrBGBEWg,lj-E32x9_FA7GmUrBGBEWg,lj-E32x9_FA7GmUrBGBEWg,...,K1SsvIPfFcHniNSPc3IG7g,K1SsvIPfFcHniNSPc3IG7g,K1SsvIPfFcHniNSPc3IG7g,B3JCfkoBQilfyMrYza_Ilg,B3JCfkoBQilfyMrYza_Ilg,B3JCfkoBQilfyMrYza_Ilg,B3JCfkoBQilfyMrYza_Ilg,B3JCfkoBQilfyMrYza_Ilg,B3JCfkoBQilfyMrYza_Ilg,B3JCfkoBQilfyMrYza_Ilg
stars,4.0,5.0,4.0,5.0,4.0,4.0,5.0,4.0,4.0,3.0,...,5.0,5.0,4.0,5.0,5.0,3.0,5.0,5.0,5.0,4.0
useful,0,0,0,2,1,0,0,3,1,1,...,1,0,3,0,1,0,0,0,2,1
funny,0,0,0,0,0,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,1
cool,0,0,0,0,1,0,0,1,0,0,...,0,1,1,0,1,0,1,1,1,1
text,Love going here for happy hour or dinner! Gre...,My friends at work (connoisseurs of good food ...,"Great service, relaxing atmosphere and the foo...",I went to Brio for the first time on Wednesday...,I usually steer clear of the chain restaurants...,Had the Eggplant Parm as an appetizer for the ...,"I love this cute Italian restaurant, it's grea...",Ok. I found it shocking that this is a chain. ...,"Don't be fooled by BRIO being a ""chain"" restau...","It was okay. Not much in terms of style, quali...",...,Ordered here for the first time. I did a peppe...,"When it comes to pizza, the dough is everythin...",I ordered through Uber Eats. The food was prep...,It's absolutely Delicious! All the food was co...,This place is great! My neighbors and I all ve...,"Very nice western themed interior, nice restro...","This bar has a welcoming, cozy atmosphere. The...",Stopped in here on a recommendation. New owner...,Happened in here randomly. Because always on ...,We went to hear one of our favorite bands. Was...
date,2014-06-27 22:44:01,2014-08-24 19:24:26,2015-09-24 15:01:11,2015-10-17 04:56:25,2016-01-04 16:56:32,2017-10-25 01:56:19,2017-02-09 19:39:40,2008-08-19 23:32:49,2011-01-09 23:03:07,2015-09-24 01:26:33,...,2021-10-27 13:55:05,2021-05-19 16:19:46,2021-06-06 19:23:33,2021-08-12 22:36:15,2018-10-31 18:34:32,2019-09-14 18:44:32,2020-01-03 01:54:01,2019-01-24 00:53:16,2021-06-27 16:11:34,2014-12-27 17:50:32
binary_rating,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0


In [None]:
df.describe()

## TODO

1. n-grams: bigrams and trigrams (https://www.askpython.com/python/examples/n-grams-python-nltk)
2. sentiment analysis to prepare for model (https://github.com/cjhutto/vaderSentiment)
3. Model #1: count vectorizer
4. Model #2: MultinomialNB
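
The first TODO item can be sketched without any external library. The block below is a minimal illustration, not the notebook's eventual implementation: the helper names `ngrams` and `ngram_counts` and the sample sentence are hypothetical, and it assumes each review has already been split into a list of tokens.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    # zip stops at the shortest slice, so lists shorter than n yield no n-grams.
    return list(zip(*(tokens[i:] for i in range(n))))


def ngram_counts(tokens, n_values=(1, 2, 3)):
    """Count every n-gram of each requested size in one token list."""
    counts = Counter()
    for n in n_values:
        counts.update(" ".join(gram) for gram in ngrams(tokens, n))
    return counts


# Hypothetical, already-tokenized review used only for illustration.
tokens = "the deep dish crust was the best".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
counts = ngram_counts(tokens)
```

Counting unigrams, bigrams, and trigrams together in one `Counter` mirrors the bag-of-n-grams features that a count vectorizer would produce for a single document, which is why the two TODO items are sketched together here.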

## Sentiment Analysis
Adapted from https://github.com/nhcamp/Yelp-Burrito-Reviews/blob/master/Capstone%202.ipynb

In [None]:
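# Create the analyzer once; building it on every call reloads the lexicon for each review.
analyzer = SentimentIntensityAnalyzer()


def apply_sentiment_intensity_analysis(sentence):
    """Return VADER polarity scores for a sentence as a dict.

    Intended for use with df.apply(); keys are 'neg', 'neu', 'pos', and 'compound'.
    """
    return analyzer.polarity_scores(sentence)


# The loaded dataframe shown above has no 'text_stemmed' column, so score the raw 'text'.
# VADER also uses punctuation and capitalization, which stemming would remove.
df['polarity_score'] = df['text'].apply(apply_sentiment_intensity_analysis)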

In [None]:
df.head()
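
Each value in `polarity_score` is a dictionary with `neg`, `neu`, `pos`, and `compound` keys. One possible next step, sketched below under stated assumptions, is to reduce the `compound` score to a label. The helper name `sentiment_label` and the sample dictionaries are hypothetical; the ±0.05 cutoffs are the conventional thresholds suggested in the vaderSentiment README and may need tuning for pizza reviews.

```python
def sentiment_label(polarity, threshold=0.05):
    """Map a VADER polarity dict to 'positive', 'negative', or 'neutral'.

    `threshold` is an assumption: 0.05 is the cutoff suggested by the
    vaderSentiment README, not a value taken from this notebook.
    """
    compound = polarity["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical polarity dicts shaped like the output of polarity_scores().
samples = [
    {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.8},
    {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.6},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]
labels = [sentiment_label(s) for s in samples]
```

Applied to the dataframe, the same function could be passed to `df['polarity_score'].apply(...)`, and the resulting labels compared against `binary_rating` as a quick sanity check before modeling.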