<h1 style="text-align:center">Seafood Restaurant Business Analysis with Yelp Data </h1>
<h4 style="text-align:right">Pengfei Hei, Zejin Gao, Siqi Shen, Anne Huen Wai Wong</h4>

# 1 Introduction

## 1.1 Thesis Statement

In this project, we analyse Yelp data with a focus on seafood restaurants. Our goal is to explore factors, extracted from business attributes and reviews, that could influence business ratings, and to give seafood business owners useful, analytical suggestions for improving their Yelp ratings. Our work is divided into two parts: attribute analysis and review analysis. In the first part, we filter the business attributes, impute missing values with decision trees, and fit a linear regression with ANOVA. In the second part, we perform **sentiment analysis** on informative nouns in reviews. We then run **ANOVA** and **t-tests** on customer sentiment for different nouns and compute the **correlation coefficient** between customer sentiment and business stars. Finally, we combine the findings of the two parts and give our conclusions and suggestions.

## 1.2 Data Background

This Yelp data contains about 6.69 million reviews and about 193 thousand businesses from the following cities: Montreal (Canada), Waterloo (Canada), Pittsburgh (U.S.), Charlotte (U.S.), Urbana-Champaign (U.S.), Phoenix (U.S.), Las Vegas (U.S.), Madison (U.S.), Cleveland (U.S.). There are four JSON files.

* **review.json** contains 6,685,900 reviews.
* **business.json** contains information about 192,609 businesses.
* **user.json** contains information about 1,637,138 users.
* **tip.json** contains information about 1,223,094 tips written by users on businesses.

# 2 Data Filtering

In this section, we describe how we extract seafood-related observations from business.json and review.json.

We include seafood restaurants and exclude steakhouses.

We also filter on the length, because we only want restaurants that focus on seafood and do not want other things to interfere.

We require the number of reviews to be greater than 50, because we do not want a few outlier reviews to drive the results.

We save the new data set as a ".csv" file.
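
The filtering rules above could be sketched roughly as follows. This is only an illustration under assumptions: it assumes each line of business.json is one JSON object with `business_id`, `categories` (a comma-separated string) and `review_count` fields, and that the review threshold applies per business. The helper name `is_seafood_business`, the category labels and the sample records are hypothetical, and the length criterion is not covered.

```python
import json

MIN_REVIEWS = 50  # the report requires more than 50 reviews


def is_seafood_business(record):
    """Return True for seafood restaurants that are not steakhouses."""
    # Assumption: `categories` is a comma-separated string and may be null.
    categories = {c.strip() for c in (record.get("categories") or "").split(",")}
    return (
        "Seafood" in categories
        and "Restaurants" in categories
        and "Steakhouses" not in categories
        and record.get("review_count", 0) > MIN_REVIEWS
    )


# Hypothetical sample lines standing in for business.json (one object per line).
sample_lines = [
    '{"business_id": "a1", "categories": "Seafood, Restaurants", "review_count": 120}',
    '{"business_id": "b2", "categories": "Seafood, Steakhouses, Restaurants", "review_count": 300}',
    '{"business_id": "c3", "categories": "Seafood, Restaurants", "review_count": 12}',
    '{"business_id": "d4", "categories": null, "review_count": 80}',
]

seafood_ids = [
    record["business_id"]
    for record in map(json.loads, sample_lines)
    if is_seafood_business(record)
]
print(seafood_ids)
```

The reviews in review.json can then be restricted to the retained `business_id` values in the same way.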

# 3 Attribute Analysis

## 3.1 Filter Attributes
After obtaining the businesses, we count every attribute they contain and the number of businesses that report each attribute. Some attributes are reported by very few businesses and could bias the analysis, so we keep only attributes with enough businesses. As a simple rule, we drop attributes reported by fewer than 50% of the businesses; the remaining attributes are shown in the table below.

Attribute| Counts | Attribute | Counts | Attribute | Counts
------------|------|----------|--------|-----------|--------
RestaurantsTableService|225|BusinessAcceptsCreditCards|347|GoodForMeal|381
Caters|401|BikeParking|414|WiFi|426
NoiseLevel|427|RestaurantsAttire|430|Alcohol|431
HasTV|432|RestaurantsDelivery|432|RestaurantsPriceRange2|432
Ambience|432|OutdoorSeating|432|RestaurantsGoodForGroups|433
RestaurantsReservations|433|BusinessParking|433|RestaurantsTakeOut|433
GoodForKids|433
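
The coverage rule can be illustrated with a small sketch. It assumes each business is represented by a dictionary of attribute names to values, where a missing or `None` value means the attribute is not reported; the function name `attributes_with_coverage` and the toy data are hypothetical.

```python
from collections import Counter


def attributes_with_coverage(businesses, min_share=0.5):
    """Count how many businesses report each attribute and keep the well-covered ones."""
    counts = Counter()
    for attributes in businesses:
        # Only count attributes that actually carry a value.
        counts.update(name for name, value in attributes.items() if value is not None)
    threshold = min_share * len(businesses)
    # Attributes reported by fewer than `min_share` of the businesses are dropped.
    return {name: n for name, n in counts.items() if n >= threshold}


# Hypothetical attribute dictionaries for four businesses.
businesses = [
    {"WiFi": "free", "HasTV": "True", "DogsAllowed": "False"},
    {"WiFi": "no", "HasTV": "False"},
    {"WiFi": None, "HasTV": "True"},
    {"HasTV": "True"},
]
kept = attributes_with_coverage(businesses)
print(kept)
```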

## 3.2 Imputation for Missing Values
After filtering attributes by the number of businesses that report them, many missing values remain. Only 5 attributes are available for all businesses; each of the others has at least one missing value. To better analyze the relationship between attributes and ratings, we therefore impute the missing values first. Since all of our attributes are categorical, we use a decision tree, which is interpretable and easy to implement. Take the attribute 'BusinessAcceptsCreditCards' as an example. It has 85 missing values and takes only the values 'True' and 'False', so a decision tree should work well on it. We use the 5 complete attributes as input and the observed values of 'BusinessAcceptsCreditCards' as output to train the model. To evaluate the performance of this model, we use cross validation to obtain a score. When 80% of the data is used as the training set, the score on the test set is 0.97, which is a very good result. The model can therefore impute this attribute well. We do the same for the other attributes and obtain a complete data set.
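
A minimal sketch of this kind of imputation, assuming scikit-learn and pandas are available, is given below. The data are synthetic, the attributes are encoded as 0/1 for simplicity, and only two complete attributes are used as inputs; it is not the exact model behind the 0.97 score reported above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two complete attributes and one attribute with gaps.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "RestaurantsTakeOut": rng.integers(0, 2, n),
    "GoodForKids": rng.integers(0, 2, n),
})
# The target depends on the complete attributes, so a tree can learn it.
df["BusinessAcceptsCreditCards"] = (
    (df["RestaurantsTakeOut"] + df["GoodForKids"]) > 0
).astype(float)
df.loc[rng.choice(n, size=40, replace=False), "BusinessAcceptsCreditCards"] = np.nan

features = ["RestaurantsTakeOut", "GoodForKids"]
target = "BusinessAcceptsCreditCards"
known = df[df[target].notna()]
missing = df[df[target].isna()]

# Hold out 20% of the known rows to estimate the imputation accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    known[features], known[target].astype(int), train_size=0.8, random_state=0
)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
score = tree.score(X_test, y_test)

# Fill the gaps with the tree's predictions.
df.loc[missing.index, target] = tree.predict(missing[features])
print(f"held-out accuracy: {score:.2f}, remaining gaps: {df[target].isna().sum()}")
```

Attributes with more than two levels would need an explicit encoding (for example one-hot columns) before being passed to the tree.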
## 3.3 Regression and ANOVA
Next, we consider linear regression and ANOVA. We first fit a model that includes interaction terms. After fitting the model, we find that only 'BusinessAcceptsCreditCards', 'GoodForMeal', 'WiFi', 'NoiseLevel', 'RestaurantsAttire' and 'Alcohol' are significant.

# 4 Review Analysis

## 4.1 Data Cleaning

First, we obtain a review data set with the features we need. The review data in "all_review.csv" contains 7 features, of which we keep only 'business_id', 'stars' and 'text'. We then save the new review data set as "review_with_useful_features.csv".

Second, we do **word tokenization**, that is, we convert the text into words. This process has 7 steps: step 1, convert "n't" to "not" and connect "not" with the word after it, such as changing "wouldn't go" to "would not_go"; step 2, break the paragraph into words; step 3, remove punctuation and non-alphabetic strings; step 4, convert numbers to words; step 5, convert words to lower case; step 6, remove stopwords (we import "stopwords.words('english')" from the Python package nltk.corpus); step 7, lemmatize, such as changing "likes" to "like".

An example is shown below.

|       business_id|  stars|                                text|                                 words
|---------------------|------|-----------------------------------------------------|-----------------------------------------------------
6xgcHeLad-VaoTIQewK84A|    3.0|  "Seems old and tired! And I wouldn't come again 3."|  \['seem','old','tired','would','not_come','three'\]
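
The seven steps can be sketched in plain Python as follows. This is a simplified illustration, not the pipeline used for the report: the small `STOPWORDS` set, the digit map `NUMBER_WORDS` and the function `toy_lemmatize` are hypothetical stand-ins for the nltk stopword list, a full number-to-word conversion and a real lemmatizer, and they only cover simple cases such as the example row above.

```python
import re

# Toy stand-ins (assumptions): the report uses nltk's English stopword list and
# a real lemmatizer; these tiny versions only cover simple examples.
STOPWORDS = {"and", "i", "again", "the", "a", "is", "to"}
NUMBER_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
                "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}


def toy_lemmatize(word):
    """Very rough stand-in for lemmatization: strip a trailing 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def tokenize(text):
    # Step 1: expand "n't" to " not" and glue "not" to the following word.
    text = re.sub(r"n't\b", " not", text)
    text = re.sub(r"\bnot\s+(\w+)", r"not_\1", text)
    words = []
    # Step 2: split the text into whitespace-separated tokens.
    for token in text.split():
        # Step 3: strip punctuation and other non-alphanumeric characters.
        token = re.sub(r"[^A-Za-z0-9_]", "", token)
        if not token:
            continue
        # Step 4: convert single digits to words.
        token = NUMBER_WORDS.get(token, token)
        # Step 5: convert to lower case.
        token = token.lower()
        # Step 6: drop stopwords.
        if token in STOPWORDS:
            continue
        # Step 7: lemmatize.
        words.append(toy_lemmatize(token))
    return words


print(tokenize("Seems old and tired! And I wouldn't come again 3."))
```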


## 4.2 Sentiment Analysis

### 4.2.1 Positive and Negative Adjective Classification

We use a **multinomial Naive Bayes classifier** to classify adjectives in the review text as positive or negative.

First, we define ratings of 1 to 3 stars as negative and ratings of 4 to 5 stars as positive, convert the stars into positive/negative tags, and treat the tag as the response variable. Second, we extract adjectives from the tokenized words with the Python function "nltk.pos_tag()" (part-of-speech tagging). Third, we count the frequency of each adjective across all review texts and keep the 1200 most frequent adjectives. Fourth, we count the occurrences of these 1200 adjectives in each review text to obtain a frequency matrix with one row per review and one column per adjective, which we use as the design matrix. Fifth, we fit a multinomial Naive Bayes model with the design matrix and the response variable. Sixth, we run the prediction on each adjective individually, and the predicted positive/negative class is the sentiment tag for that adjective.

Here is an example: {'good': 'positive', 'delicious': 'positive', 'friendly': 'positive', 'bad': 'negative', 'decent': 'negative', 'slow': 'negative'}.

We save the **dictionary for adjectives with positive/negative tags** as "dict_adj.txt".
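
The procedure can be illustrated with a small, self-contained sketch of multinomial Naive Bayes with Laplace smoothing. The six reviews are made-up toy data, the function names `fit_multinomial_nb` and `tag_adjectives` are hypothetical, and the smoothing parameter `alpha=1.0` is an assumption; the report does not state which implementation or settings were used.

```python
import math
from collections import Counter


def fit_multinomial_nb(docs, labels, vocabulary, alpha=1.0):
    """Fit multinomial Naive Bayes on adjective counts with Laplace smoothing."""
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d if w in vocabulary)
        total = sum(counts.values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            w: math.log((counts[w] + alpha) / total) for w in vocabulary
        }
    return log_prior, log_likelihood


def tag_adjectives(log_prior, log_likelihood, vocabulary):
    """Predict a class for a pseudo-review that contains a single adjective."""
    tags = {}
    for w in vocabulary:
        scores = {c: log_prior[c] + log_likelihood[c][w] for c in log_prior}
        tags[w] = max(scores, key=scores.get)
    return tags


# Hypothetical toy data: adjectives already extracted from each review, plus stars.
reviews = [
    (["good", "delicious", "friendly"], 5),
    (["good", "friendly"], 4),
    (["delicious", "good"], 5),
    (["bad", "slow"], 1),
    (["slow", "decent"], 3),
    (["bad", "decent", "good"], 2),
]
docs = [adjectives for adjectives, _ in reviews]
# Stars 4 to 5 are positive, stars 1 to 3 are negative.
labels = ["positive" if stars >= 4 else "negative" for _, stars in reviews]
vocabulary = sorted({w for d in docs for w in d})

log_prior, log_likelihood = fit_multinomial_nb(docs, labels, vocabulary)
dict_adj = tag_adjectives(log_prior, log_likelihood, vocabulary)
print(dict_adj)
```

Because each adjective is scored on its own, its tag reflects whether it appears relatively more often in positive or in negative reviews, which is how a mild word such as "decent" can end up tagged as negative.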

### 4.2.2 Informative Nouns in Reviews

First, we extract nouns from the tokenized words with the Python function "nltk.pos_tag()" (part-of-speech tagging). Second, we count the frequency of each noun across all review texts. Because we assume that informative nouns appear many times in the review text, we only consider nouns with a frequency larger than 4000. Third, we manually pick the informative nouns from this list.

**Informative nouns** are: food, lobster, crab, shrimp, oyster, fish, clam; service, waiter, waitress, chef, manager; price.
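
The frequency filter that precedes the manual selection can be sketched as below. The input is assumed to be a list of (word, tag) pairs in the format returned by `nltk.pos_tag`, with Penn Treebank noun tags starting with "NN"; the function name `frequent_nouns`, the toy pairs and the small threshold are hypothetical (the report uses a threshold of 4000).

```python
from collections import Counter


def frequent_nouns(tagged_words, min_count):
    """Count noun occurrences and keep those seen more than `min_count` times."""
    # Penn Treebank noun tags (NN, NNS, NNP, NNPS) all start with "NN".
    counts = Counter(word for word, tag in tagged_words if tag.startswith("NN"))
    return {word: n for word, n in counts.items() if n > min_count}


# Hypothetical (word, tag) pairs standing in for tagged review text.
tagged = [("good", "JJ"), ("food", "NN"), ("slow", "JJ"), ("service", "NN"),
          ("fresh", "JJ"), ("food", "NN"), ("great", "JJ"), ("food", "NN"),
          ("view", "NN")]
candidates = frequent_nouns(tagged, min_count=2)
print(candidates)
```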

### 4.2.3 Sentiment Analysis for Informative Nouns

From this point on, we count at the restaurant level instead of the review level. That is, for each restaurant we count words across all of its review texts.

Also, we need to obtain stars for each restaurant from "seafood_business.csv".

Then we build a **sentiment table** for each informative noun. First, we count the number of positive and negative adjectives in front of each informative noun. Second, we compute the proportion of positive adjectives among all the adjectives. The **sentiment table for food** is shown below as an example.

business_id|positive count|negative count|positive proportion|stars
-----------|--------------|--------------|-------------------|-----
nsNONDHbV7Vudqh21uicqw|102|15|0.872|3.5
F06m2yQSPHIrb1IT7heYeQ|70|1|0.986|4.0
W7hCuNdn2gzehta6eSHzgQ|9|10|0.474|2.0

We define the proportion of positive adjectives among all the adjectives as **customer sentiment**. This variable ranges from zero to one. Zero means customers are totally unsatisfied, while one means customers are totally satisfied.
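
A sketch of how such a table could be computed is shown below. It assumes that "in front of" means the token immediately preceding the noun, that reviews are available as (business_id, tokenized words) pairs, and that the adjective dictionary maps each adjective to "positive" or "negative"; the function name `sentiment_table` and the toy reviews are hypothetical.

```python
from collections import defaultdict


def sentiment_table(reviews, dict_adj, noun):
    """Per-restaurant counts of positive/negative adjectives directly before `noun`."""
    counts = defaultdict(lambda: {"positive": 0, "negative": 0})
    for business_id, words in reviews:
        # Walk over consecutive word pairs (previous word, current word).
        for previous, current in zip(words, words[1:]):
            if current == noun and previous in dict_adj:
                counts[business_id][dict_adj[previous]] += 1
    table = {}
    for business_id, c in counts.items():
        total = c["positive"] + c["negative"]
        # Customer sentiment: share of positive adjectives among all adjectives.
        table[business_id] = {**c, "sentiment": c["positive"] / total}
    return table


# Hypothetical tokenized reviews and adjective tags.
dict_adj = {"good": "positive", "delicious": "positive", "friendly": "positive",
            "bad": "negative", "slow": "negative"}
reviews = [
    ("A", ["good", "food", "slow", "service"]),
    ("A", ["bad", "food"]),
    ("A", ["delicious", "food"]),
    ("B", ["bad", "food", "friendly", "waiter"]),
]
food_table = sentiment_table(reviews, dict_adj, "food")
print(food_table)
```

Restaurants with no tagged adjective in front of the noun do not appear in the result, so the proportion is always well defined.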

## 4.3 Test and Finding

First of all, we examine three main aspects that could affect the stars of a seafood business: **food**, **service** and **price**. We count the total number of positive and negative adjectives in front of these three words and compute the customer sentiment. To see whether customer sentiment has an influence on stars, we compute the **correlation coefficient** between the restaurant stars and the customer sentiment. The results are shown in the table below.

&nbsp; |  positive count | negative count | customer sentiment | correlation
-------|----------------|----------------|--------------------|-------------
food|24047|1907|0.927|0.547
service|15458|3162|0.830|0.580
price|7869|802|0.908|0.221
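
The correlation column can be illustrated with a short sketch. It assumes that the coefficient is the Pearson correlation between per-restaurant customer sentiment and stars, which the report does not state explicitly. The three data points are the example rows from the sentiment table for food in Section 4.2.3, so the result is only an illustration and not the value reported in the table.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Three example rows from the sentiment table for food (Section 4.2.3).
customer_sentiment = [0.872, 0.986, 0.474]
stars = [3.5, 4.0, 2.0]
r = pearson_correlation(customer_sentiment, stars)
print(round(r, 3))
```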

We treat the first two columns of the table as a contingency table and run a **chi-square test** to check whether the word (food, service or price) and customer attitude are independent. The p-value is less than $2.2\times10^{-16}$, so we reject independence: customers have different attitudes towards food, service and price.
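
The test can be reproduced approximately with a short plain-Python sketch of Pearson's chi-square test of independence, applied to the positive and negative counts for food, service and price from the table above. The function name `chi_square_independence` is hypothetical, and the closed-form p-value used here is valid only for 2 degrees of freedom; the larger tables later in this section would need a general chi-square distribution function.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Positive and negative adjective counts for food, service and price.
counts = [[24047, 1907], [15458, 3162], [7869, 802]]
statistic, dof = chi_square_independence(counts)
# With 2 degrees of freedom the chi-square tail probability is exp(-x / 2).
p_value = math.exp(-statistic / 2)
print(f"chi-square = {statistic:.1f}, dof = {dof}, p-value = {p_value:.3g}")
```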

From the analysis above, we find that food has the highest customer sentiment score, while service has the lowest. In terms of the correlation between customer sentiment and stars, the customer sentiment for food (0.547) and for service (0.580) is clearly correlated with stars, while the correlation for price (0.221) is weak. We therefore believe that food and service need to be examined further.

In the **food category**, we consider lobster, crab, shrimp, oyster, fish and clam. We repeat the same analysis, except for the correlation. The results are shown in the table below.

&nbsp; |  positive count | negative count | customer sentiment
-------|----------------|----------------|-------------------
lobster|6022|374|0.942
crab|3146|479|0.868
shrimp|4379|265|0.942
oyster|802|58|0.948
fish|6646|558|0.923
clam|1154|100|0.920

The p-value of the chi-square contingency table test is less than $2.2\times10^{-16}$, which means that customers have different attitudes towards different kinds of food. Oyster has the highest customer sentiment score, while crab has the lowest.

In the **service category**, we consider waiter, waitress, chef and manager. We repeat the same analysis as above. The results are shown in the table below.

&nbsp; |  positive count | negative count | customer sentiment
-------|----------------|----------------|-------------------
waiter|6022|374|0.846
waitress|3146|479|0.834
chef|4379|265|0.945
manager|802|58|0.690

The p-value of the chi-square contingency table test is less than $2.2\times10^{-16}$, which means that customers have different attitudes towards different kinds of service. Chef has the highest customer sentiment score, while manager has the lowest.


# 5 Conclusion and Suggestion

## 5.1 Conclusion




## 5.2 Suggestion


# 6 Strength and Weakness

## 6.1 Strength

## 6.2 Weakness

# 7 Contribution

# 8 Reference