# Predict Yelp Ratings!

# 1. Introduction

Yelp data contain customers' reviews, which reflect their experience at a restaurant (for example, how they feel about the food, service or environment), along with basic information about each restaurant, such as location and categories. In this project, our group tried to come up with a simple, robust, accurate yet interpretable method to estimate Yelp ratings.

**Thesis Statement**: A logistic regression model with an L2 penalty term is a fast and accurate classifier. To predict Yelp ratings from 1.5 million training observations, a logistic classifier spends less than 3 minutes on training and yields RMSE = 0.63.

# 2. Data pre-processing

## 2.1 Cleaning texts

Step 1: Keep only words and a few punctuation marks (.?!',) from the raw data (both training and test data).

Step 2: Put all the words (about 0.4 million) into one large dictionary and remove the words with frequency below 250, which leaves a dictionary of only about 16k words.

Step 3: Generate new words by combining "not" with the first adjective and the first verb that appear after it in the current sentence.
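The three steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: lowercasing, the `not_` prefix for the generated words, keeping the original "not" token, and the `pos` lookup table (a stand-in for a real part-of-speech tagger) are all assumptions made for the example.

```python
import re
from collections import Counter

NOT_ALLOWED = re.compile(r"[^a-z.?!', ]+")  # everything except letters and .?!',
SENT_END = {".", "?", "!"}

def clean(text):
    """Step 1: keep only words and the punctuation marks .?!', (lowercased)."""
    text = NOT_ALLOWED.sub(" ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Split cleaned text into word tokens and single punctuation tokens."""
    return re.findall(r"[a-z']+|[.?!,]", text)

def build_vocab(texts, min_count):
    """Step 2: keep the words that occur at least min_count times overall."""
    counts = Counter(
        tok
        for t in texts
        for tok in tokenize(clean(t))
        if any(c.isalpha() for c in tok)
    )
    return {w for w, c in counts.items() if c >= min_count}

def merge_negation(tokens, pos):
    """Step 3: attach 'not' to the first adjective and first verb after it.

    pos is a hypothetical word -> "ADJ"/"VERB" lookup; the negation is
    dropped at the end of the sentence.
    """
    out, pending = [], set()
    for tok in tokens:
        if tok in SENT_END:
            pending = set()
            out.append(tok)
        elif tok == "not":
            pending = {"ADJ", "VERB"}
            out.append(tok)
        elif pos.get(tok) in pending:
            pending.discard(pos[tok])
            out.append("not_" + tok)
        else:
            out.append(tok)
    return out

pos = {"good": "ADJ", "great": "ADJ", "was": "VERB", "is": "VERB"}
tokens = tokenize(clean("The food was not good. Service is great"))
print(merge_negation(tokens, pos))
```

In the project the frequency threshold would be 250; the example uses a toy lookup table only to keep the sketch self-contained.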

## 2.2 Special punctuations and category variable

We keep special punctuation that appears in the form "..", "!!" or "??" by counting how many times it occurs in one observation. No matter how many dots, question marks or exclamation marks appear together in one sentence, they are counted only once.

From the star histograms of different categories, we find that categories have an effect on ratings. We therefore generate a sparse matrix for the categories and combine it with the original sparse matrix generated from the text.
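A minimal sketch of this counting rule is shown below. It assumes that "appear together" means a consecutive run of the same mark, so a run of any length of two or more counts once; the function and key names are illustrative only.

```python
import re

def count_repeated_punct(text):
    """Count runs of two or more identical marks; each run counts once."""
    return {
        "dots": len(re.findall(r"\.{2,}", text)),
        "exclaims": len(re.findall(r"!{2,}", text)),
        "questions": len(re.findall(r"\?{2,}", text)),
    }

print(count_repeated_punct("Great food!!! Really?? Hmm... well.."))
# {'dots': 2, 'exclaims': 1, 'questions': 1}
```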

# 3. Logistic Regression as a classifier

## 3.1 TFIDF: Term Frequency–Inverse Document Frequency


TFIDF is a numerical statistic intended to reflect how important a word is to a text. The goal of using TFIDF instead of the raw frequency of a word in a text is to scale down the impact of words that occur frequently across the whole corpus. The formula is as follows, where $w$ denotes a word and $t$ a text:
$$\text{TF}(w,t) = \cfrac{\text{# w in t}}{\text{# words in t}}, \quad \text{IDF}(w) = \log\cfrac{\text{# texts in the corpus}}{\text{# texts that contain w}}, \quad \text{TFIDF}(w,t) = \text{TF}(w,t) \times \text{IDF}(w)$$

A constant 1 is added to the numerator and the denominator of the IDF ratio to prevent division by zero. After the TFIDF values of a text are computed, the resulting vector is normalized to unit length. As an example, suppose the corpus is composed of "This is not good", "This is delicious", and "Good". The word-count matrix and the resulting TFIDF matrix are:

\begin{equation}
\begin{pmatrix} 
1 & 1 & 1 & 1 & 0 \\
1 & 1 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0
\end{pmatrix} \Longrightarrow 
\begin{pmatrix} 
0.46 & 0.46 & 0.60 & 0.46 & 0 \\
0.52 & 0.52 & 0 & 0 & 0.68 \\
0 & 0 & 0 & 1 & 0
\end{pmatrix}
\end{equation}

The columns represent 'this', 'is', 'not', 'good' and 'delicious'. We can see that 'not', 'delicious' and 'good' have the largest TFIDF in the first, second and third row respectively. In short, a higher TFIDF means higher importance.
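The worked example can be reproduced with a few lines of plain Python. This is a sketch under assumptions: it uses a smoothed IDF of the form $\log\frac{1+n}{1+\text{df}} + 1$ (the convention used by scikit-learn's default TF-IDF, which matches the numbers in the matrix) and Euclidean row normalization. The extra `+ 1` outside the logarithm is not stated in the text and is assumed here because it is needed to obtain the values shown.

```python
import math

def tfidf_matrix(docs, vocab):
    """Smoothed TFIDF with each row scaled to unit Euclidean length."""
    n_docs = len(docs)
    tokenized = [d.lower().split() for d in docs]
    # Document frequency: number of texts that contain each word.
    df = {w: sum(w in toks for toks in tokenized) for w in vocab}
    # Assumed smoothing: add 1 to numerator and denominator, then add 1.
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for toks in tokenized:
        raw = [toks.count(w) / len(toks) * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([round(v / norm, 2) for v in raw])
    return rows

docs = ["This is not good", "This is delicious", "Good"]
vocab = ["this", "is", "not", "good", "delicious"]
for row in tfidf_matrix(docs, vocab):
    print(row)
# [0.46, 0.46, 0.6, 0.46, 0.0]
# [0.52, 0.52, 0.0, 0.0, 0.68]
# [0.0, 0.0, 0.0, 1.0, 0.0]
```

Because each row is normalized afterwards, dividing the counts by the text length in TF does not change the final values.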

## 3.2 Model fitting

First, we use the cleaned data to build a large sparse matrix whose columns represent words and whose values are TFIDF. Then we build another sparse matrix whose columns represent categories and whose values are TFIDF. We combine them and get a sparse matrix with dimension $1546379\times 17154$. This is the model matrix we will use in the following steps.
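For reference, one common form of L2-penalized logistic regression for $K = 5$ star classes is the multinomial objective below. This is a sketch under assumptions: the report does not state whether a multinomial or a one-vs-rest formulation was used, nor how the penalty strength (written here as $\lambda$) was chosen. Here $x_i$ denotes row $i$ of the model matrix and $y_i$ its star rating.

$$ P(y_i = k \mid x_i) = \cfrac{\exp(\beta_{k0} + x_i^\text{T}\beta_k)}{\sum_{l=1}^{K}\exp(\beta_{l0} + x_i^\text{T}\beta_l)}, \qquad \min_{\beta}\; -\sum_{i=1}^{N}\log P(y_i \mid x_i) + \lambda\sum_{k=1}^{K}\lVert\beta_k\rVert_2^2 $$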



## 3.3 Result

# 4. Neural Network

## 4.1 Word embedding

## 4.2 FastText

## 4.3 Result

Based on prior knowledge of body fat, we think a linear model is appropriate for this analysis. In order to come up with a simple, robust, and accurate “rule-of-thumb” method, we try to fit a multiple linear model. However, 14 variables seem to be too many in practice. We use different methods for variable selection and compare their performance.

Before modeling, we randomly choose 200 observations as the training set and leave the rest as the validation set. All the models are fitted on the training set only and tested on the validation set to check performance.

Our goal is to find a model that is
* *accurate*: predicts well on both test and validation data.
* *interpretable*: easy to see which group of words has larger coefficients in our logistic model.

## 3.1 Multiple Linear Regression

Based on other published work on this topic, a multiple linear model is likely to fit well:

$$ \text{(Body Fat %)}_i = \beta_0 + \beta_1x_{i1} + \dots + \beta_p x_{ip} + \epsilon_i, \quad \epsilon_i \sim N(0,\sigma^2) $$

However, 14 variables are too many for a model to predict body fat, especially in this case. Some variables may not be useful, and some may be highly collinear, which requires us to remove the redundant ones. For example, common sense suggests that a person who is heavy also tends to have larger body circumferences. Therefore, we need to do variable selection.

### 3.1.1 Variable Selection

We tried stepwise model selection based on criteria such as Mallows's Cp, AIC and BIC, and fitted multiple linear models on the selected variables. We compare each model's performance on the validation set and select the method that has the smallest mean squared error: $MSE = \frac{1}{N}\sum_{i=1}^{N}(y_i-\hat y_i)^2$

**Method**|Mallows's Cp|AIC|BIC
:-----:|:-----:|:-----:|:-----:
**MSE**|15.069|14.702|13.989


We find that the BIC method gives the smallest mean squared error on the validation set. The variables *ABDOMEN*, *WRIST* and *WEIGHT* are selected, as shown below.
At this point, we arrive at a multiple linear regression:

$$\text{BodyFat%}\text{ ~ }\text{Abdomen}+\text{Wrist}+\text{Weight}$$

### 3.1.2 MLR and interpretation of results

All coefficients are significant, which is not surprising. The results mean:
* **Coefficients**: For example, holding the other variables constant, if abdomen circumference increases by 1 cm, the predicted body fat percentage increases by about 0.85 percentage points.
* **MSE**: Measures how accurate the predictions are. For example, the standard error of a prediction on the test data set can be estimated by $\sqrt{\text{MSE}} = 3.74$
* **Multiple R-squared**: Indicates that the variables explain about 72% of all the variation in BodyFat%
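As a quick check of the arithmetic above, the snippet below computes the MSE as defined earlier and takes the square root of the validation MSE reported for the BIC model (13.989). The function name and the toy inputs are illustrative only.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of squared prediction errors."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

# Toy check of the formula: errors are 0, 0 and 2, so MSE = 4 / 3.
print(mse([1, 2, 3], [1, 2, 5]))

# Square root of the reported validation MSE for the BIC model.
print(round(math.sqrt(13.989), 2))  # 3.74
```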

### 3.1.3 Model Diagnosis

Check Gauss-Markov assumptions: the Gauss-Markov assumptions are met. Furthermore, there are no apparent highly influential points.

Check multicollinearity: use the *vif()* function to check for multicollinearity among the selected variables. If every value is less than 10, we assume there is no significant multicollinearity among these variables.
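For reference, the variance inflation factor of the $j$-th selected variable is defined as

$$\text{VIF}_j = \cfrac{1}{1 - R_j^2}$$

where $R_j^2$ is the R-squared obtained by regressing variable $j$ on the other selected variables. A cutoff of 10 therefore corresponds to $R_j^2 = 0.9$.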

## 3.2 LASSO

Lasso is another approach to variable selection. The advantage of this method is that we can minimize the MSE on the validation set.

The model is still a linear model:
$$ \text{(Body Fat %)}_i = \beta_0 + \beta_1x_{i1} + \dots + \beta_p x_{ip}$$

However, instead of minimizing $\sum\limits_{i=1}^N (y_i-\beta_0-\beta_1x_{i1} - \dots - \beta_px_{ip})^2$, which is what multiple linear regression does, Lasso minimizes:

$$\sum\limits_{i=1}^N (y_i-\beta_0-\beta_1x_{i1} - \dots - \beta_px_{ip})^2 +\lambda \sum\limits_{j=1}^p \left|\beta_j \right| $$
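To illustrate how an L1 penalty of the form $\lambda\sum_j|\beta_j|$ sets some coefficients exactly to zero, here is a minimal coordinate-descent sketch in plain Python. It is not the fitting routine used in this analysis; the function names and the tiny synthetic data set (two centered, orthogonal predictors) are assumptions chosen so that the result is easy to verify by hand.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t; return exactly 0.0 when |rho| <= t."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Minimize sum_i (y_i - b0 - x_i.b)^2 + lam * sum_j |b_j| by coordinate descent."""
    n, p = len(X), len(X[0])
    beta0, beta = sum(y) / n, [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho, z = 0.0, 0.0
            for i in range(n):
                # Prediction that leaves out variable j.
                partial = beta0 + sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho, lam / 2) / z
        # The intercept is not penalized.
        beta0 = sum(y[i] - sum(X[i][k] * beta[k] for k in range(p)) for i in range(n)) / n
    return beta0, beta

x1 = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
x2 = [1.0, -2.0, 1.0, 1.0, -2.0, 1.0]
X = [list(row) for row in zip(x1, x2)]
y = [1 + 2 * a + 0.1 * b for a, b in zip(x1, x2)]  # strong x1 effect, weak x2 effect

print(lasso_cd(X, y, lam=0.0))  # no penalty: close to (1, [2, 0.1])
print(lasso_cd(X, y, lam=4.0))  # the weak x2 coefficient becomes exactly 0.0
```

With $\lambda = 4$ the weak predictor is dropped and the strong coefficient is shrunk from 2 to about 1.89, which is the behavior that makes Lasso usable for variable selection.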

We fit the model on the training set for different values $\lambda^{(k)}$ and obtain coefficients $\beta^{(k)} = (\beta_0, \beta_1, \dots, \beta_{14})^{(k)}$. Each fit gives a predictive equation $\hat y = x^\text{T}\beta^{(k)}$, where $x$ is the vector of predictors with a leading 1 for the intercept, which we use to predict BodyFat% on the validation set. The figure below shows the result of the Lasso method.

**Explanation**:
* The x-axis indicates $\log \lambda$, and the y-axis indicates MSE.
* The numbers 0,1,...,14 in the label indicate the number of variables selected in that case. As $\lambda$ increases, the number of variables selected decreases.
* The grey vertical lines indicate where some variables are deleted.

Based on the result, we choose $\lambda$ to be 0.1824 (the highlighted point in the figure), because:
* It selects only 4 variables. The model is *simple*.
* The MSE at this point is relatively small (around 15). The model is *accurate*.

# 5. Conclusion

# 6. Future work

Based on the two different methods discussed before, we can propose two linear models to predict body fat %:

$$
\begin{align} 
\text{(BodyFat %)} &= -23.794+0.852*\text{Abdomen}-1.258*\text{Wrist}-0.073*\text{Weight} \\
\text{(BodyFat %)} &= -8.087+0.662*\text{Abdomen}-1.242*\text{Wrist}-0.186*\text{Height}+0.033*\text{Age}
\end{align}
$$

The MSEs of these two models are 13.99 and 14.32, respectively. Because the first model is simpler and more accurate, we prefer the first predictive equation.

**Possible rule of thumb**: "Multiply your abdomen circumference (cm) by 0.85, subtract your wrist circumference (cm) multiplied by 1.26, subtract your weight (lbs) multiplied by 0.07, and then subtract 24."

**Example Usage**: For a typical male graduate student with Abdomen = 85 cm, Wrist = 18 cm and Weight = 130 lbs, the predicted body fat percentage is around 16.43%, and there is a 95% probability that his body fat is between 8.26% and 24.59%.
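The first predictive equation can be written as a small function. This is only an illustration; the function name is hypothetical and the coefficients are the rounded values printed in the equation. With these rounded coefficients the example evaluates to about 16.49, slightly different from the 16.43% quoted above, presumably because the quoted figure was computed with unrounded coefficients.

```python
def predict_body_fat(abdomen_cm, wrist_cm, weight_lbs):
    """First model: BodyFat% from abdomen, wrist and weight (rounded coefficients)."""
    return -23.794 + 0.852 * abdomen_cm - 1.258 * wrist_cm - 0.073 * weight_lbs

print(round(predict_body_fat(85, 18, 130), 2))  # 16.49
```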

**Strengths**
- Our method is fast: it takes less than 10 minutes, compared with several hours for a neural network.
- The final model is easy to understand.

**Weakness**

- The RMSE is good, but not as small as that of a multi-layer neural network.
- It cannot eliminate multicollinearity, which might cause some information to be missed.


### Each member's contribution
* **Shiwei Cao**: Lasso method; model checking; Summary writing
* **Jiyun Chen**: Variable selection and final model building; Summary writing
* **Jing Guo**: Data preprocessing and presentation materials organizing; Making slides