# Introduction

Today, people routinely look up information online before they visit a place. Yelp provides a platform where people can review businesses using a one-to-five-star rating system. Business owners expect higher ratings to increase revenue, so they need advice for improvement based on existing reviews. In our analysis, we use data-driven methods to generate actionable suggestions that help businesses improve their ratings on Yelp. We also provide a web application that demonstrates our findings and personalizes the advice for each business owner.

# Background

Yelp releases open datasets covering four aspects: business, user, review and tip. The business data includes location data, attributes, and categories. The user data includes the user's friend mapping and all the metadata associated with the user. The review data includes the full review text, the "user_id" of the user who wrote the review and the "business_id" of the business the review is written for. The tip data includes shorter reviews that convey quick suggestions. Because the dataset is huge, we focus on a subset of it. The selected dataset consists of 490 cinemas across 11 states in North America (3 in Canada, 8 in the US), with 28747 reviews and 11356 tips from 24359 users. To provide suggestions for business owners, it is vital to find out how customers feel about the cinemas' performance in key aspects. In our analysis, we first filter for valid reviews and then analyze the sentiment behind them.

# Data Processing

## User Selection
Some users intentionally give extreme star ratings for personal reasons, and it would be unfair to include their reviews in our analysis. We want to identify these spam users from the attributes provided in the data set. We combine all kinds of compliments into a single feature, and we combine fans and friends into followers. We also include the average star rating and the review count. The spam users we expect to find are those who receive very few compliments, have very few followers, give extreme stars (very high or very low) and have written very few reviews.

<img src="./picture/plot1.png" width = 30% height = 30% />

Since no user is labeled as a bad user in the data set, we treat this as an unsupervised learning task. The method we choose is k-means clustering, which partitions n observations into k clusters so that each observation belongs to the cluster with the nearest mean. We tried values from 2 to 10 for the number of clusters and checked the clustering results by drawing plots. The best result is obtained with 5 clusters. [plot2, 3, 4]

Of all 1,637,138 users, 635,755 are recognized as spam users. As shown in the plots, the group of golden points shares the characteristics of extreme stars and low numbers of compliments, followers and reviews. After deleting these users, 25816 reviews remain for analysis.
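
The clustering step can be sketched as follows. This is a minimal pure-Python k-means on toy data, not the code used in the analysis. The feature order, the min-max scaling, the farthest-point seeding (chosen only to keep the example deterministic) and the six toy users are all assumptions made for illustration.

```python
def min_max_scale(rows):
    """Rescale every column to [0, 1] so no feature dominates the distance."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic seeding (assumption): start at the first point, then
    repeatedly add the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        )
    return [list(c) for c in centers]


def k_means(points, k, iterations=100):
    """Assign each point to its nearest mean, then recompute the means."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical users: (compliments, followers, average stars, review count)
users = [
    (0, 1, 5.0, 1),
    (1, 0, 1.0, 2),
    (0, 0, 5.0, 1),
    (120, 85, 3.8, 240),
    (95, 60, 3.6, 180),
    (150, 110, 4.1, 310),
]
labels, centers = k_means(min_max_scale(users), k=2)
print(labels)  # the first three users share one cluster, the rest the other
```

In the toy data the low-activity users with extreme stars end up in one cluster. On the real data, the same loop would be run for each candidate number of clusters from 2 to 10 and the results compared by plotting.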

## Text Processing
Next, we process the reviews and tips so that we can extract the key attributes later. Our goal in this part is to clean the text and split it into words. We take the following steps:
* Convert all letters to lower case;
* Delete stop words ([stop words list](http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words));
* Split the sentences into words;
* Delete punctuation;
* Convert verbs to the present tense;
* Correct spelling errors.
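
The first four steps above can be sketched with the standard library alone. The function name `clean_review` and the short `STOP_WORDS` set are assumptions; the set is only a stand-in for the linked stop words list. Tense normalization and spelling correction need a language toolkit and are left out of this sketch.

```python
import re

# A tiny stand-in for the linked stop words list (assumption).
STOP_WORDS = {"the", "a", "an", "and", "was", "were", "is", "are", "to", "of", "but"}


def clean_review(text, stop_words=STOP_WORDS):
    """Lowercase, split into words, drop punctuation and stop words."""
    lowered = text.lower()
    # Keep runs of letters and apostrophes; punctuation, digits and
    # whitespace all act as separators.
    tokens = re.findall(r"[a-z']+", lowered)
    tokens = [t.strip("'") for t in tokens]
    return [t for t in tokens if t and t not in stop_words]


print(clean_review("The seats were GREAT, but the popcorn was stale!"))
# ['seats', 'great', 'popcorn', 'stale']
```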

## Word Counting
Then we count the words. The frequency with which words occur in the reviews and tips carries a lot of information. High-frequency nouns, such as "seat" and "popcorn", can be used as indicators of the cinemas' performance. High-frequency adjectives may carry sentiment, which helps us analyze the sentiment of a review or tip, especially adjectives with high variance across different ratings. To count the words reasonably, we take the following steps:
* Count the number of times each word appears in the reviews under each rating;
* Weight the counts by the number of compliments the review or tip received (funny, useful, cool) and by the time of the review or tip:
$$(N_1 + N_2)\times w_T$$
where $N_1$ denotes the number of times the word appears in all the reviews; $N_2$ denotes the total number of compliments of those reviews; $w_T$ denotes the time weight ranging from 0 to 1, with 0 for the earliest review date and 1 for the latest.
* Treat “should/would/not+adj.” and “never+verb.” as one word instead of two;
* Scale the counts for comparison.
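
The counting steps above can be sketched as follows, under several assumptions. The formula is applied per review and summed over reviews, the time weight is taken to be linear in the review date, and the negation merge joins a negator with whatever token follows it without checking its part of speech. The function names and the review fields (`tokens`, `compliments`, `date`) are hypothetical.

```python
from collections import Counter
from datetime import date

NEGATORS = {"should", "would", "not", "never"}


def merge_negations(tokens):
    """Join a negator with the following token: ['not', 'clean'] -> ['not_clean'].
    Simplification: no part-of-speech check is done here."""
    merged, i = [], 0
    while i < len(tokens):
        if tokens[i] in NEGATORS and i + 1 < len(tokens):
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def time_weight(day, earliest, latest):
    """Linear weight (assumption): 0 for the earliest date, 1 for the latest."""
    span = (latest - earliest).days
    return (day - earliest).days / span if span else 1.0


def weighted_counts(reviews):
    """Sum (occurrences + compliments) * w_T per word over all reviews."""
    dates = [r["date"] for r in reviews]
    earliest, latest = min(dates), max(dates)
    totals = Counter()
    for r in reviews:
        w_t = time_weight(r["date"], earliest, latest)
        for word, n in Counter(merge_negations(r["tokens"])).items():
            totals[word] += (n + r["compliments"]) * w_t
    return totals


def scale(counts):
    """Divide by the largest count so that values are comparable."""
    top = max(counts.values(), default=0)
    return {w: (c / top if top else 0.0) for w, c in counts.items()}


# Hypothetical, already cleaned reviews.
reviews = [
    {"tokens": ["seat", "not", "clean"], "compliments": 0, "date": date(2015, 1, 1)},
    {"tokens": ["seat", "comfortable"], "compliments": 1, "date": date(2016, 1, 1)},
    {"tokens": ["seat", "seat", "broken"], "compliments": 2, "date": date(2017, 1, 1)},
]
counts = weighted_counts(reviews)
scaled = scale(counts)
```

Note that with a time weight of exactly 0, the earliest review contributes nothing to any count under this reading of the formula.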

## Extracting the Key Attributes
So far, we have counted the frequency of the words appearing in the reviews and generated a list of the most frequent words. The list gives us insight into what people look for and how they feel when they visit a cinema. From the list, we select the business attributes that customers value the most and divide them into 7 aspects:
* Price: the ticket price and the cost of other goods sold in the cinema
* Location: the location of the cinema and the important geographical elements nearby (hotels, restaurants and car parks), etc.
* Facility: the quality of movie, screen, sound and seats, etc.
* Environment: the wait time to buy tickets, the style of the cinema, etc.
* Food and drinks: the concession stand, the food and drinks offered.
* Services: the quality of service from staff, the online services (e.g. online reservation), etc.
* Promotion: special offers for VIPs, free passes and discounts, etc.
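
One simple way to represent these aspects in code is a mapping from each aspect to a set of key words. The sketch below is illustrative only: the key words are taken from the examples named in the list above, the full lists used in the analysis are not given here, and `aspects_of` is a hypothetical helper.

```python
# Illustrative key word sets built from the examples in the list above;
# they are not the complete lists used in the analysis.
ASPECT_KEYWORDS = {
    "price": {"price", "cost"},
    "location": {"location", "hotel", "restaurant", "parking"},
    "facility": {"screen", "sound", "seat"},
    "environment": {"wait", "style"},
    "food&drink": {"food", "drink", "concession", "popcorn"},
    "service": {"service", "staff", "reservation"},
    "promotion": {"vip", "pass", "discount", "offer"},
}


def aspects_of(tokens):
    """Return the aspects whose key words appear among the tokens."""
    found = set(tokens)
    return sorted(a for a, words in ASPECT_KEYWORDS.items() if words & found)


print(aspects_of(["staff", "friendly", "seat", "broken"]))
# ['facility', 'service']
```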

# Sentiment Analysis
To study people's attitudes towards the selected key words, we first find the segments in which each key word appears. Specifically, we extract the text between the punctuation marks immediately before and after the word. We use a value called "polarity", ranging in \[-1, 1\], to denote the sentiment: -1 is the most negative and 1 is the most positive.

Written text can be broadly categorized into two types: facts and opinions. Opinions carry people's sentiments, appraisals and feelings towards the world, and they are what we focus on in this section. Python's Pattern library provides a lexicon of adjectives that occur frequently in product reviews, annotated with scores for sentiment polarity ([en_sentiment](https://github.com/clips/pattern/blob/master/pattern/text/en/en-sentiment.xml)). Important attributes in this lexicon are: "pos" - the part-of-speech tag; "sense" - the context in which the word is used; "polarity" - the same as our definition; "intensity" - the effect of modifier words (e.g. very, little, ...) on sentiment.

For the words appearing in each extracted segment, we take the average of the polarity values. If a word comes with a modifier word, we multiply its polarity by the intensity of the modifier word. If a word comes with a negation (e.g. never, not, ...), we multiply its polarity by -0.5 and divide it by the modifier word's intensity.
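
The segment extraction and the scoring rules can be sketched as follows. This is a simplified stand-in, not the Pattern library's implementation. The lexicon entries and their values are placeholders made up for illustration (they are not copied from the lexicon file), the function names are hypothetical, and modifiers and negations are only recognized when they directly precede the scored word.

```python
import re

# Placeholder lexicon entries; the values are made up for illustration.
POLARITY = {"good": 0.7, "comfortable": 0.4, "dirty": -0.6, "rude": -0.6}
INTENSITY = {"very": 1.3, "really": 1.2, "little": 0.8}
NEGATIONS = {"not", "never"}


def segments_with(text, keyword):
    """Return the punctuation-delimited segments that mention the key word."""
    parts = re.split(r"[.,;:!?()]+", text.lower())
    return [p.strip() for p in parts if keyword in p.split()]


def segment_polarity(segment):
    """Average the adjusted polarity of every lexicon word in the segment."""
    scores = []
    negated, intensity = False, 1.0
    for word in segment.split():
        if word in NEGATIONS:
            negated = True
        elif word in INTENSITY:
            intensity *= INTENSITY[word]
        elif word in POLARITY:
            score = POLARITY[word]
            if negated:
                # Negation rule: multiply by -0.5, divide by the intensity.
                score = score * -0.5 / intensity
            else:
                # Modifier rule: multiply by the intensity.
                score = score * intensity
            scores.append(score)
            negated, intensity = False, 1.0  # modifiers apply to one word only
        else:
            negated, intensity = False, 1.0  # any other word breaks the chain
    return sum(scores) / len(scores) if scores else 0.0


text = "The seats are not very comfortable, but the staff is very good!"
seat_segments = segments_with(text, "seats")
staff_segments = segments_with(text, "staff")
print(seat_segments, segment_polarity(seat_segments[0]))
print(staff_segments, segment_polarity(staff_segments[0]))
```

With these placeholder values, the "seats" segment gets a slightly negative score and the "staff" segment a strongly positive one.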

We then sum the polarities to obtain a score for each aspect, which gives a rating chart like the following:

| id | name | price | location | facility | environment | food\&drink | service | promotion |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| OEQrPxeku4BfHMCSi8UASQ | Chandler Cinemas | 0.000 | 0.000 | 0.686 | 1.005 | 0.225 | -0.063 | 0.313 |
| zRV7bzP_CfTg-_R9U-VsVg | Visulite Theatre | 0.009 | 0.8 | 9.596 | 5.817 | 6.180 | 3.115 | 0.000 |
| pDA8NJUwGl1IoLDeaVfo0Q | AMC Ridge Park Square Cinema 8 | 0.400 | 0.450 | 0.624 | 0.234 | 0.000 | 0.683 | 0.000 |
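
Building one row of such a chart amounts to summing the segment polarities per aspect. A minimal sketch, assuming each scored segment has already been tagged with its aspect; the function name and the sample pairs are hypothetical.

```python
from collections import defaultdict

ASPECTS = ["price", "location", "facility", "environment",
           "food&drink", "service", "promotion"]


def aspect_scores(segment_scores):
    """Sum segment polarities per aspect; aspects never mentioned score 0."""
    totals = defaultdict(float)
    for aspect, polarity in segment_scores:
        totals[aspect] += polarity
    return {aspect: round(totals[aspect], 3) for aspect in ASPECTS}


# Hypothetical (aspect, polarity) pairs for the segments of one cinema.
segments = [("facility", 0.5), ("facility", -0.25), ("service", 0.75)]
row = aspect_scores(segments)
print(row)
```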

The performance of a business is shown in a radar chart. Take the cinema named "Chandler Cinemas" as an example:

<img src="./picture/radar.png" width = 30% height = 30% />

Based on the scores, we first point out the present status of the cinema: "*The rating of food&drinks, promotion of your cinema is higher than the median level of all the cinemas. The rating of *"
We then point out the two biggest weaknesses and strengths, and give suggestions like this: "*Based on the customer reviews, you have to improve your facility, especially seats. Your best strength is service, please hold on to it.*"
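
The comparison behind such messages can be sketched as follows, using only the three rows of the rating chart above. Because the medians here come from three cinemas rather than the full set, the output will differ from the full-data result. Ranking a cinema's own scores to find its strengths and weaknesses is an assumption, and the function names are hypothetical.

```python
from statistics import median

ASPECTS = ["price", "location", "facility", "environment",
           "food&drink", "service", "promotion"]

# The three rows of the rating chart above.
scores = {
    "Chandler Cinemas": [0.000, 0.000, 0.686, 1.005, 0.225, -0.063, 0.313],
    "Visulite Theatre": [0.009, 0.8, 9.596, 5.817, 6.180, 3.115, 0.000],
    "AMC Ridge Park Square Cinema 8": [0.400, 0.450, 0.624, 0.234, 0.000, 0.683, 0.000],
}


def above_median(name, all_scores):
    """Aspects in which the cinema scores strictly above the median."""
    medians = [median(col) for col in zip(*all_scores.values())]
    return [a for a, s, m in zip(ASPECTS, all_scores[name], medians) if s > m]


def strengths_and_weaknesses(name, all_scores, n=2):
    """Rank the cinema's own aspect scores (assumption): the top n are
    strengths and the bottom n are weaknesses."""
    ranked = sorted(zip(ASPECTS, all_scores[name]), key=lambda pair: pair[1])
    weaknesses = [a for a, _ in ranked[:n]]
    strengths = [a for a, _ in ranked[-n:][::-1]]
    return strengths, weaknesses


strengths, weaknesses = strengths_and_weaknesses("Chandler Cinemas", scores)
print(above_median("Chandler Cinemas", scores))  # ['promotion']
print(strengths, weaknesses)
```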

# Strength and Weakness

## Strength
1.  date and compliment


## Weakness
1. 

# Conclusion

# Duties

Yuchen Zeng: Text processing, word counting and shiny app.

Chong Wei: User selection, sentiment analysis.

Jingwen Yan: Extracting the business attribute, sentiment analysis.

# References