## Yelp Food recommender

### Problem Statement:


Yelp is a collection of crowd-sourced reviews and has become a key resource when customers choose a restaurant. Customers write reviews based on what they experience during a visit. Their choice of restaurant can depend on a variety of factors, such as the name and quality of a food item and other restaurant attributes like location, cost, and service. Business owners, in turn, want to know which items and which restaurant attributes to improve first, based on the reviews.


This project was done by a team of four. The entire project is available at https://github.com/sreeaurovindh/yelp-dv

We attempted to address the following questions:

1. How do the restaurant perceptions change over time?
2. How do people perceive my restaurant’s food quality when compared to my neighborhood?
3. What factors influence the opinion of restaurants in my neighborhood?
4. How do I choose a restaurant based on my geo-location and category preferences?
5. What restaurants will my friends be interested in?


The project report can be found at https://github.com/sreeaurovindh/sreeaurovindh.github.io/blob/master/yelp-files/visualization_report.pdf


### Project Demo

In [15]:
from IPython.display import HTML

# Youtube
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/cnacb2HXgD8?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')


### My Work


This section contains a description of my work along with the relevant project source code. My work included:
1. Data Extraction 
2. Computation of Polarity Score
3. Visualization to address questions 1 and 2.


### Data Description
The data is provided as JSON files. Each file contains a single object type, with one JSON object per line. Yelp dataset examples can be found at https://github.com/Yelp/dataset-examples

The documentation can be found at https://www.yelp.com/dataset/documentation/json 

The data is stored in five JSON files, plus an auxiliary photos file:
1. business.json => business_id, name, address, city, state, postal code, latitude, longitude, stars, review_count, etc.
2. review.json => review_id, user_id, business_id, stars, date, text, useful, funny, cool
3. user.json => user_id, name, review_count, yelping_since, friends, useful, funny, cool, fans, etc.
4. checkin.json => time, business_id
5. tip.json => text, date, likes, business_id, user_id
6. photos (from auxiliary file)
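
Because each file stores one JSON object per line, it can be streamed record by record rather than loaded whole. A minimal sketch of that reading pattern follows; the helper name `iter_json_lines` is hypothetical and not part of the project code.

```python
import json
from typing import Iterator


def iter_json_lines(path: str) -> Iterator[dict]:
    """Yield one parsed object per non-empty line of a JSON-lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, looping over `iter_json_lines("review.json")` (assuming the file is in the working directory) would give one review dictionary at a time, so the whole file never has to fit in memory.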

## Data Processing Infrastructure


In [11]:
from IPython.display import Image
Image(url='https://github.com/sreeaurovindh/sreeaurovindh.github.io/raw/master/yelp-files/infrastructure.jpg')


#### Wikipedia Scraper
An initial set of cuisines was extracted from the Yelp dataset. For each cuisine, a set of Wikipedia pages was found via Google search. From each page, all the cuisine names were scraped using Scrapy (a Python library). The results were stored in separate JSON files.

#### Query Engine and Apache Solr
The entire review text was indexed in Apache Solr, which is built on Lucene and provides tokenization and full-text search out of the box. The query engine takes the Wikipedia pairs (cuisine, food item) and runs a full-text search for each food item against the Solr index. This returns the set of reviews for each food item. The sentence containing the food item is then extracted and stored in a JSON file.

#### Sentiment Analyser
For each sentence obtained from the Solr output, the sentiment (polarity) of the sentence is computed. These data are then stored in MongoDB for further querying.

#### Aspect Processing
A very simple way to do this is to identify the nouns in a sentence and look for the nearest adjectives around them. This approach has obvious shortcomings, which are overcome by extracting the syntactic dependencies between words and the word forms output by the Stanford CoreNLP tool. We used open-source libraries built on Stanford’s tool to extract noun and adjective pairs, which are stored in JSON. These pairs are called Attribute Pairs; examples include (corn, sweet) and (ambiance, horrible). Natural language processing of the 2.5 GB of data was slow, so we parallelized it across a cluster of Amazon EC2 Linux machines and collected the output back on a single machine.
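
The naive baseline mentioned above (pair each noun with its nearest adjective) can be sketched as follows. This only illustrates the simple approach, not the dependency-based extraction the project used. The part-of-speech tags are supplied by hand, and the function name `nearest_adjective_pairs` and the tag labels are assumptions made for the example.

```python
from typing import List, Tuple


def nearest_adjective_pairs(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Pair every noun with the closest adjective by token distance.

    `tagged` is a list of (word, tag) tuples where tag is "NOUN", "ADJ",
    or anything else. Ties go to the adjective that appears first.
    """
    adjectives = [(i, w) for i, (w, t) in enumerate(tagged) if t == "ADJ"]
    pairs = []
    for i, (word, tag) in enumerate(tagged):
        if tag != "NOUN" or not adjectives:
            continue
        _, nearest = min(adjectives, key=lambda item: abs(item[0] - i))
        pairs.append((word, nearest))
    return pairs


# Hand-tagged example sentence (tags are illustrative, not tagger output)
sentence = [
    ("the", "DET"), ("corn", "NOUN"), ("was", "VERB"), ("sweet", "ADJ"),
    ("but", "CONJ"), ("the", "DET"), ("ambiance", "NOUN"),
    ("was", "VERB"), ("horrible", "ADJ"),
]
print(nearest_adjective_pairs(sentence))
```

Token distance ignores sentence structure, so an adjective that merely sits close to a noun can be paired with it wrongly. That is the kind of shortcoming dependency parsing addresses.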

#### Full Text Search for specific food items
We used full-text search in order to extract food items and their relevant polarity (sentiment) values.


### Perform Full Text Search using Apache Solr and update the MongoDB database

The following source code extracts the sentence that contains the name of the food item and computes its polarity. The output is written to a JSON file, one record per line, for storage in MongoDB.

In [None]:
from SolrClient import SolrClient
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#Imported by easy_install vaderSentiment
#Obtained from https://github.com/cjhutto/vaderSentiment
import json
import iso8601
from pymongo import MongoClient

json_file_name = "D:\\Dropbox\\dv\\cusine_names\\Cusinies\\American.json"
category_name = "American (Traditional)"
json_file_out = "D:\\Dropbox\\dv\\cusine_names\\Cusinies_out\\American_output.json"
solr = SolrClient('http://localhost:8983/solr')

#Mongodb connection Parameters
client = MongoClient('mongodb://localhost:27017/')
db = client['yelp_dv']
food_business = db.food_business

# Open Output file
f = open(json_file_out, 'w')
result = []
analyzer = SentimentIntensityAnalyzer()
index = 0

with open(json_file_name) as json_data:
    #Load Json File
    file_jsons = json.load(json_data)
    for item in file_jsons:
        #Extract Food Item name 
        food_name = item['foodItem'].lower()
        data = '"'+food_name+'"'
        text_data=  '_text_:%s' %data
        #Query the Solr Index
        for res in  solr.paging_query('review_core',{'q':text_data},rows=10000):
        #Get all values of Review Record
            if res.get_results_count() > 0:
                print(food_name,res.get_results_count())
                
                json_doc = json.loads(res.get_json())
                output = json_doc['response']['docs']
                for review_recd in output:
                    review_text = review_recd['text'][0]
                    business_id = review_recd['business_id'][0]
                    review_id = review_recd['review_id'][0]
                    stars = review_recd['stars'][0]
                    timestamp_str = review_recd['date'][0]
                    useful = review_recd['useful'][0]
                    funny = review_recd['funny'][0]
                    try:
                        all_sentences = [sentence for sentence in review_text.lower()
                                         .split('.') if food_name in sentence]
                        if len(all_sentences) == 0:
                            all_sentences = [review_text]
                            
                        max_polarity = -2
                        sentence_review = ""
                        for sentence in all_sentences:
                            vs = analyzer.polarity_scores(sentence)
                            polarity_Score = vs['compound']
                            if polarity_Score > max_polarity:
                                max_polarity = polarity_Score
                            sentence_review = sentence_review + sentence
                            
                        #Get Business Name from business ID
                        business_name=food_business.find_one({"business_id":business_id })['name']
                        
                
                        #print(polarity_Score,max_polarity)   
                        data = {}
                        data['item'] = food_name
                        data['polarity'] = max_polarity
                        data['business_id'] = business_id
                        data['review_id'] = review_id
                        data['stars'] = stars
                        data['date']  = iso8601.parse_date(timestamp_str).strftime('%Y-%m-%d')
                        data['useful'] = useful
                        data['is_review'] = 1
                        data['is_tip'] = 0
                        data['name'] = business_name
                        data['review_sentence'] = all_sentences[0]
                        data['category'] = category_name
                        index += 1  # count the records written so far
                        json_data_out = json.dumps(data)
                        f.write(json_data_out+'\n')
                                
                    except Exception as e:
                        # Report the failure and move on to the next review
                        print(e)
    f.close()
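
The core of the loop above, picking the sentences that mention a food item and keeping the highest polarity, can be isolated as a small function. A minimal sketch: the `score` callable stands in for `analyzer.polarity_scores(sentence)['compound']`, and the word-list scorer `toy_score` is a hypothetical placeholder so the example runs without VADER. Unlike the script, which stores the first matching sentence, this sketch returns the top-scoring one.

```python
from typing import Callable, List, Tuple


def best_food_sentence(review_text: str, food_name: str,
                       score: Callable[[str], float]) -> Tuple[str, float]:
    """Return the highest-scoring sentence that mentions `food_name`.

    Sentences are split on '.', as in the script above; if none mention
    the item, the whole review is scored instead.
    """
    text = review_text.lower()
    sentences: List[str] = [s.strip() for s in text.split(".") if food_name in s]
    if not sentences:
        sentences = [text]
    best = max(sentences, key=score)
    return best, score(best)


def toy_score(sentence: str) -> float:
    """Placeholder scorer: +1 per positive word, -1 per negative word."""
    positive, negative = {"great", "delicious"}, {"bland", "cold"}
    words = sentence.split()
    return float(sum(w in positive for w in words) - sum(w in negative for w in words))


review = "The burger was delicious. Fries were cold. The burger bun was bland."
print(best_food_sentence(review, "burger", toy_score))
```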

## Business Questions

## How do people perceive my restaurant’s food quality when compared to my neighborhood?

In order to understand how food quality compares with the neighborhood, we used a bubble chart showing various food items.

The popularity of the item across the restaurant is denoted by the radius of the circle, and the color of the circle denotes the dish. The colors were chosen for visual clarity when bubbles overlap. The x-axis corresponds to the star rating of the restaurant and the y-axis to its review count, or popularity. The user can explore the region by hovering over a specific bubble, which represents a restaurant.

In [26]:
from IPython.display import Image
Image(url='https://github.com/sreeaurovindh/sreeaurovindh.github.io/raw/master/yelp-files/Food_items.jpg', width=700, height=250)

Hovering over a circle highlights two bubbles on the screen. The first, drawn with a strong blue border, marks the target restaurant. The second, drawn with a dotted border, shows the aggregated polarity score of the neighborhood restaurants. All the other food items of the neighboring restaurants are faded out.

In [25]:
from IPython.display import Image
Image(url='https://github.com/sreeaurovindh/sreeaurovindh.github.io/raw/master/yelp-files/food_items_select.jpg', width=700, height=250)

### MongoDB code used to extract food quality comparisons in the neighborhood

In [21]:
from IPython.display import IFrame
IFrame('https://sreeaurovindh.github.io/yelp-files/food_quality_neighborhood.html', width=700, height=250)
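
The comparison behind this chart can be illustrated in plain Python, independent of the embedded MongoDB query. A sketch, assuming records shaped like the documents written by the extraction script (`item`, `polarity`, `business_id`); the function name, the business IDs, and the use of a simple mean as the neighborhood aggregate are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Set, Tuple


def compare_to_neighborhood(records: Iterable[dict], target_id: str,
                            neighbor_ids: Set[str]) -> Dict[str, Tuple[float, float]]:
    """Map each food item to (target mean polarity, neighborhood mean polarity).

    Only items reviewed at both the target and at least one neighbor are kept.
    """
    target, neighbors = defaultdict(list), defaultdict(list)
    for rec in records:
        if rec["business_id"] == target_id:
            target[rec["item"]].append(rec["polarity"])
        elif rec["business_id"] in neighbor_ids:
            neighbors[rec["item"]].append(rec["polarity"])
    return {item: (mean(target[item]), mean(neighbors[item]))
            for item in target if item in neighbors}


# Made-up records: restaurant "A" is the target, "B" is a neighbor
records = [
    {"item": "burger", "polarity": 0.75, "business_id": "A"},
    {"item": "burger", "polarity": 0.25, "business_id": "A"},
    {"item": "burger", "polarity": 0.25, "business_id": "B"},
    {"item": "fries", "polarity": 0.5, "business_id": "A"},
]
print(compare_to_neighborhood(records, "A", {"B"}))
```

Each resulting pair corresponds to the two highlighted bubbles for one food item: the target restaurant's score and the neighborhood aggregate.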

## How do the restaurant perceptions change over time?

In order to understand how restaurant perceptions change over time, we used a bar chart showing the percentage of reviews with Poor, Fair, Good, and Excellent ratings. The intensity map shows how the average polarity of reviews for a specific restaurant has changed over time. The x-axis denotes time in months, while the y-axis shows the distribution of polarity using a sequential color scheme. Hovering over a column shows the percentage of items with a bad polarity/opinion. The user can explore a specific part of the chart by clicking on it, which leads to the next visualization covering the specific set of food items.
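
The monthly percentages behind such a chart could be computed along these lines. This is a sketch under stated assumptions: records carry a `date` in `YYYY-MM-DD` form and a `polarity` in [-1, 1], as produced by the extraction script, while the cut-offs between Poor, Fair, Good, and Excellent are illustrative and not the thresholds used in the project.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable

LABELS = ("Poor", "Fair", "Good", "Excellent")


def polarity_label(polarity: float) -> str:
    """Bucket a compound polarity in [-1, 1]; cut-offs are illustrative only."""
    if polarity < -0.5:
        return "Poor"
    if polarity < 0.0:
        return "Fair"
    if polarity < 0.5:
        return "Good"
    return "Excellent"


def monthly_distribution(records: Iterable[dict]) -> Dict[str, Dict[str, float]]:
    """Return, per 'YYYY-MM' month, the percentage of reviews in each bucket."""
    counts = defaultdict(Counter)
    for rec in records:
        month = rec["date"][:7]  # 'YYYY-MM-DD' -> 'YYYY-MM'
        counts[month][polarity_label(rec["polarity"])] += 1
    result = {}
    for month, counter in sorted(counts.items()):
        total = sum(counter.values())
        result[month] = {label: 100.0 * counter[label] / total for label in LABELS}
    return result


# Made-up reviews for demonstration
reviews = [
    {"date": "2016-01-05", "polarity": 0.9},
    {"date": "2016-01-20", "polarity": -0.7},
    {"date": "2016-02-11", "polarity": 0.2},
]
for month, shares in monthly_distribution(reviews).items():
    print(month, shares)
```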

In [20]:
from IPython.display import Image
Image(url='https://github.com/sreeaurovindh/sreeaurovindh.github.io/raw/master/yelp-files/restaurant.png')

In [22]:
from IPython.display import IFrame
IFrame('https://sreeaurovindh.github.io/yelp-files/quality_per_restaurant.html', width=700, height=250)

In [16]:
from IPython.display import IFrame
IFrame('https://sreeaurovindh.github.io/yelp-files/food_quality_items.html', width=700, height=250)

### Conclusion:

The visual recommender system provides insights about the Yelp data to both business owners and customers so they can make informed decisions. The system provides recommendations for the selected user through an interactive dashboard that uses a combined dataset from Yelp and Wikipedia. Here, we have tried to answer the questions listed above after analyzing the data.

The sentiment analysis performed on the reviews has given the system more insight into the reviews than the raw, unprocessed ratings. Through this, the system could consolidate and interrelate all the graphs for both business users and customers. The system predominantly uses four graphs to visualize the questions. The initial graph provides a selection guide to various facets based on the distribution of the items inside each category. The food-quality-over-time intensity maps give an overall view of how customers’ perception of food quality changes over time. This is further extended to show the food item quality comparison and the attribute opinion comparison. Finally, the system recommends to the customer a list of restaurants based on their friends’ preferences, which are connected through a concept map.