# Import modules

In [3]:
import pandas as pd

# Before you begin...

Make a `dataset` directory at the root of this project and place your json files in there.

# Convert datasets to CSV

I tried countless methods to read the data into a dataframe directly from `JSON`, but none were successful. The [yelp/dataset-examples](https://github.com/Yelp/dataset-examples/blob/master/json_to_csv_converter.py) repo has a `json_to_csv_converter` script which can be used to convert the data to `csv`.

## Log of attempt to load JSON data

```
** <2018-02-04 Sun>
- Note taken on [2018-02-04 Sun 18:16] \\
  [[https://www.dataquest.io/blog/python-json-tutorial/][blogpost]] on some techniques to deal with large datasets
- Note taken on [2018-02-04 Sun 18:16] \\
  trying to load the =review.json= file but experiencing weird problems.
  Found a [[https://github.com/pandas-dev/pandas/issues/18152][issue]] on the pandas repository which documents the same error
  that I am seeing. The conclusion seems to be that the json file was
  malformed. I need to verify if my dataset has any issues.
** <2018-02-05 Mon>
- Note taken on [2018-02-06 Tue 12:18] \\
  the converter works when executed with python2!
- Note taken on [2018-02-05 Mon 22:43] \\
  found a =json= to =csv= converter at the [[https://github.com/Yelp/dataset-examples/blob/master/json_to_csv_converter.py][yelp/dataset-examples]] repo. The
  code is for python 2 so need to make a few adjustments before it works.
```

**Note:** the following cell takes a while to execute, so you might want to go grab some ☕️. Alternatively, I would recommend manually converting only the files you need in the shell.

In [6]:
%%bash
for file in dataset/*.json
do
    echo "converting $file to csv..."
    # quote the path so file names containing spaces are passed as one argument
    python2 lib/json_to_csv_converter.py "$file"
done
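
For readers without a Python 2 interpreter, a minimal Python 3 sketch of the same idea is shown below. It assumes each line of the input file is one flat JSON object (newline-delimited JSON) and does not attempt to flatten nested objects. `jsonl_to_csv` is a hypothetical helper written for illustration, not part of the yelp/dataset-examples repo.

```python
import csv
import json


def jsonl_to_csv(json_path, csv_path):
    """Convert a newline-delimited JSON file of flat objects to CSV."""
    # First pass: collect the union of keys so every row has the same columns.
    columns = []
    with open(json_path, encoding="utf-8") as src:
        for line in src:
            if line.strip():
                for key in json.loads(line):
                    if key not in columns:
                        columns.append(key)

    # Second pass: write one CSV row per JSON object.
    with open(json_path, encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=columns, restval="")
        writer.writeheader()
        for line in src:
            if line.strip():
                writer.writerow(json.loads(line))
    return columns
```

It could be called as `jsonl_to_csv('dataset/review.json', 'dataset/review.csv')`. Reading the file twice keeps memory use low, which matters for files of this size.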

# Exploration of reviews

## Hypothesis

H_o_: Reviews with higher `stars` should have a higher `useful` vote.

H_a_: Reviews with higher `stars` do not have a higher `useful` vote.

In [2]:
reviews = pd.read_csv('dataset/review.csv')

In [77]:
reviews.sort_values(by='useful')

Unnamed: 0,funny,user_id,review_id,text,business_id,stars,date,useful,cool
4203968,0,S7oukZE-NH_33uuWuR47LQ,hx5oI9l2xXwZMqyoiMXbeg,Cornish Pasty saved our Mothers Day! After hav...,ohEnmKpF7i2_ujme1p_vUQ,5,2017-05-15,-1,-1
0,0,bv2nCi5Qv5vroFiqKGopiw,v0i_UHJMo_hPBq9bxWvW4w,"Love the staff, love the meat, love the place....",0W4lkclzZThpx3V65bVgig,5,2016-05-28,0,0
2953500,0,Jm5h-bDATqRMWs3VahkFPg,SkyZGW3MV5-bK6supogQlQ,Love their Baja fish taco. I love that you can...,iGEvDk6hsizigmXhDKs2Vg,5,2015-12-10,0,0
2953501,0,Jm5h-bDATqRMWs3VahkFPg,fnMg8s-eQoTF6bvSJdUafQ,I love the bamboo drink ~~! it was cute the fi...,RtUvSWO_UZ8V3Wpj0n077w,4,2014-04-18,0,0
2953502,0,Jm5h-bDATqRMWs3VahkFPg,OsT8rbUqjH_8x5EZtSfMAQ,REALLY REALLY GOOD~!~\nTHE option (fried porte...,oWTn2IzrprsRkPfULtjZtQ,5,2012-11-20,0,0
2953506,0,Jm5h-bDATqRMWs3VahkFPg,mZCAdvFpgJpGqZ0v9dlVeQ,"For my birthday dinner dates, my hubby often t...",kKC7UTcHZM3zLfT97x3msg,3,2016-06-08,0,1
2953510,0,Jm5h-bDATqRMWs3VahkFPg,ATuUPgTDRs-HclsJiTF7bA,love the atmosphere -- maybe it was the waterf...,O66Zy8Y13VBm72ZDhS4fIg,4,2014-04-18,0,0
2953511,0,Jm5h-bDATqRMWs3VahkFPg,u3IuAf_xnlo6-N_hD0Z_dg,Food: The selection of salad is amazing. I hav...,XC85BrIIDwxN21K462jsuA,4,2017-01-28,0,0
2953512,0,Jm5h-bDATqRMWs3VahkFPg,5eE_Wmf00orrJHQQBwSaVg,Good bowl of hot soup!!! Perfect for winter ti...,iMoFE2g4kDG4FfKLJvk3Jw,4,2012-11-20,0,0
2953514,0,Jm5h-bDATqRMWs3VahkFPg,NSYO0FZWMAdx04mJ2Dxfdw,AMAZING recommendation by my friend to order t...,N93EYZy9R0sdlEvubu94ig,5,2014-04-18,0,0


In [73]:
reviews.describe()

Unnamed: 0,funny,stars,useful,cool
count,5261669.0,5261669.0,5261669.0,5261669.0
mean,0.509196,3.72774,1.385085,0.5860916
std,2.686168,1.433593,4.528727,2.233706
min,0.0,1.0,-1.0,-1.0
25%,0.0,3.0,0.0,0.0
50%,0.0,4.0,0.0,0.0
75%,0.0,5.0,2.0,1.0
max,1481.0,5.0,3364.0,1105.0


In [31]:
# correlations
reviews.corr()

Unnamed: 0,funny,stars,useful,cool
funny,1.0,-0.048866,0.621663,0.661669
stars,-0.048866,1.0,-0.077122,0.044828
useful,0.621663,-0.077122,1.0,0.677069
cool,0.661669,0.044828,0.677069,1.0
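
Since the hypothesis only concerns `stars` and `useful`, the single coefficient of interest can also be computed directly. The sketch below uses a small made-up frame in place of `reviews`. On recent pandas versions (2.0 and later) `DataFrame.corr()` no longer drops text columns silently, so selecting the numeric columns explicitly, as done here, may also be needed for the cell above.

```python
import pandas as pd

# Toy stand-in for the real `reviews` frame (values are made up).
toy_reviews = pd.DataFrame({
    "text": ["bad", "meh", "ok", "good", "great"],
    "stars": [1, 2, 3, 4, 5],
    "useful": [6, 2, 2, 0, 1],
})

# Pearson correlation between the two columns of interest.
stars_useful_corr = toy_reviews["stars"].corr(toy_reviews["useful"])

# Correlation matrix restricted to the numeric columns.
numeric_corr = toy_reviews[["stars", "useful"]].corr()

print(round(stars_useful_corr, 3))
print(numeric_corr)
```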


We note that there is a weak negative correlation (about -0.08) between `stars` and `useful`, which means that if a review has higher `stars` then it tends to receive a slightly *lower* `useful` vote. This is the opposite of what H_o_ predicts. We can further validate this observation by creating a pivot table of `useful` vs. `stars`.

In [25]:
pd.pivot_table(reviews,values='useful', index='stars' )

stars,useful
1,3.202899
2,2.900901
3,1.969112
4,1.734139
5,2.047826


The pivot table above is inconclusive: the mean `useful` vote falls from 1 to 4 stars but rises again at 5 stars, so further analysis is required. Next, let's obtain a count of reviews per user in the `reviews` df. We can do this using the `user_id` column. Note that `user_counts` obtained below is sorted in descending order. We then take the top 25% of users, who have posted the most reviews, and similarly the bottom 25% of users, who have posted the fewest.

In [42]:
# count of users (sorted highest to lowest)
user_counts = reviews['user_id'].value_counts()

In [60]:
import math

REVIEWS_LEN = reviews.shape[0]
TOP_25 = math.floor(REVIEWS_LEN*0.25)
BOT_25 = -TOP_25

most_frequent_users = user_counts[:TOP_25] # first 25%
least_frequent_users = user_counts[BOT_25:] # last 25%
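
Note that `reviews.shape[0]` is the number of reviews, not the number of distinct users, so the slices above may cover more than a quarter of `user_counts`. A cut based on the number of users could look like the following sketch, which uses a made-up `user_id` column. `n_quarter` is a name introduced here for illustration.

```python
import pandas as pd

# Toy stand-in: eight users with different numbers of reviews (made-up ids).
toy_reviews = pd.DataFrame({
    "user_id": list("aaaaa") + list("bbbb") + list("ccc") + list("dd")
               + ["e", "f", "g", "h"],
})

# value_counts() sorts users from most to fewest reviews.
user_counts = toy_reviews["user_id"].value_counts()

# A quarter of the *users*, not of the reviews.
# (With fewer than four users this would be 0 and the slices below would misbehave.)
n_quarter = len(user_counts) // 4

most_frequent_users = user_counts.iloc[:n_quarter]    # top 25% of users
least_frequent_users = user_counts.iloc[-n_quarter:]  # bottom 25% of users

print(list(most_frequent_users.index))
```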

Next, we obtain a df containing the reviews from the top 25% and bottom 25% users.

In [68]:
most_frequent_user_reviews = reviews.filter(items=most_frequent_users, axis=0)
least_frequent_user_reviews = reviews.filter(items=least_frequent_users, axis=0)
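
`DataFrame.filter(items=..., axis=0)` keeps rows whose *index labels* appear in `items`, so passing `most_frequent_users` (whose values are review counts) appears to match those counts against the row index rather than matching `user_id`s. If the intent is to keep every review written by the selected users, a boolean mask with `Series.isin` is one option. The frame below is made up for illustration.

```python
import pandas as pd

# Toy stand-in for the real `reviews` frame (values are made up).
toy_reviews = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b", "c"],
    "stars": [1, 2, 1, 5, 4, 3],
    "useful": [4, 2, 5, 0, 1, 0],
})

user_counts = toy_reviews["user_id"].value_counts()
most_frequent_users = user_counts.iloc[:1]  # here: just user "a"

# Keep every review whose author is among the selected users.
mask = toy_reviews["user_id"].isin(most_frequent_users.index)
most_frequent_user_reviews = toy_reviews[mask]

print(most_frequent_user_reviews)
```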

In [69]:
most_frequent_user_reviews.corr()

Unnamed: 0,funny,stars,useful,cool
funny,1.0,-0.341695,0.400509,0.31868
stars,-0.341695,1.0,-0.623933,-0.38893
useful,0.400509,-0.623933,1.0,0.368146
cool,0.31868,-0.38893,0.368146,1.0


In [70]:
pd.pivot_table(most_frequent_user_reviews, values='useful', index='stars')

stars,useful
1,5.704199
2,1.97107
3,1.988311
4,0.198111
5,0.072265


In [71]:
least_frequent_user_reviews.corr()

Unnamed: 0,funny,stars,useful,cool
funny,1.0,-0.371052,0.423652,0.314621
stars,-0.371052,1.0,-0.63229,-0.3968
useful,0.423652,-0.63229,1.0,0.357234
cool,0.314621,-0.3968,0.357234,1.0


In [72]:
least_frequent_usefulness = pd.pivot_table(least_frequent_user_reviews, values='useful', index='stars')
least_frequent_usefulness

stars,useful
1,5.902944
2,2.154024
3,2.060877
4,0.180985
5,0.070508


Constructing pivot tables from the reviews of the top 25% and bottom 25% of users, we can clearly see that reviews with lower `stars` receive *higher* `useful` votes (roughly 5.7 to 5.9 on average for 1-star reviews versus about 0.07 for 5-star reviews). This supports H_a_.

**UPDATE**: This proves nothing, since the stars on a review are given by the reviewer to the business they are reviewing, so it makes sense that they have no correlation with the useful votes of a review.

# Are useful votes on a review biased? (Test run)

Do users actually read a review before voting it useful? Or is their decision biased by the reviewer, the star rating, the cool votes, etc.?

Let's do a quick test run! First we will train a model to predict if a review is useful based only on the text of the review. Next we will train a model using the review text and the useful votes. The accuracy of the two models will be compared, the hypothesis being that the latter will perform better since it is closer to real life. To start off, we are going to use `CountVectorizer` for feature extraction and a `Naive Bayes` classifier.
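
As a rough illustration of the text-only approach described above, the sketch below fits `CountVectorizer` and `MultinomialNB` on a handful of made-up reviews. The texts and labels are invented for the example; a real run would use the review text and labels from the dataset instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up reviews and labels (1 = voted useful, 0 = not), for illustration only.
texts = [
    "detailed review of the menu, prices and parking",
    "long wait but the staff explained every dish",
    "good",
    "meh",
    "nice place",
    "thorough notes on service, portions and opening hours",
]
labels = [1, 1, 0, 0, 0, 1]

# Bag-of-words features: one column per token, values are token counts.
vectorizer = CountVectorizer()
X_text = vectorizer.fit_transform(texts)  # sparse matrix, shape (n_reviews, n_tokens)

classifier = MultinomialNB()
classifier.fit(X_text, labels)

# New text must go through the same fitted vectorizer before prediction.
new_features = vectorizer.transform(["detailed notes on prices and service"])
prediction = classifier.predict(new_features)
print(prediction)
```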

## Model 1: Text only

In [1]:
import pandas as pd
import numpy as np
from scipy.sparse import csc_matrix, vstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

In [2]:
reviews = pd.read_csv('dataset/review.csv', usecols=['text', 'useful'], dtype=str)

In [3]:
reviews['length'] = reviews['text'].apply(len)
reviews.head()

Unnamed: 0,text,useful,length
0,"Love the staff, love the meat, love the place....",0,289
1,Super simple place but amazing nonetheless. It...,0,213
2,Small unassuming place that changes their menu...,0,502
3,Lester's is located in a beautiful neighborhoo...,0,373
4,Love coming here. Yes the place always needs t...,0,523


In [18]:
X = reviews[['length', 'useful']]
Y = pd.read_csv('dataset/Y.csv')
Y.head()

Unnamed: 0,label
0,0
1,0
2,0
3,0
4,0


Let's make sure that our data is balanced, i.e. that we have roughly the same number of data points for both labels. This ensures that the model is not biased towards one label and has a similar number of examples of each to learn from.

In [5]:
Y.groupby('label').size()

label
0    2744483
1    2517186
dtype: int64

The labels are mostly balanced so we will not make any changes to the dataset.

In [19]:
# note random_state is not set intentionally, I want to get different splits for now
# later we will use k-fold validation
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4)

print("shape of splits:")
print("X_train: ", X_train.shape)
print("X_test: ", X_test.shape)
print("Y_train: ", Y_train.shape)
print("Y_test: ", Y_test.shape)

shape of splits:
X_train:  (3157001, 2)
X_test:  (2104668, 2)
Y_train:  (3157001, 1)
Y_test:  (2104668, 1)


In [20]:
nb_classifier = MultinomialNB()
# pass the labels as a 1-D Series to avoid sklearn's DataConversionWarning
nb_classifier.fit(X_train['length'].values.reshape(-1, 1), Y_train['label'])


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [21]:
nb_classifier.score(X_test['length'].values.reshape(-1, 1), Y_test)

0.5216761028342712
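
For context, this score can be compared with a majority-class baseline, i.e. the accuracy of always predicting label `0`. The sketch below computes it from the label counts printed earlier. The exact share in any particular test split will differ slightly because the split is random.

```python
# Label counts taken from the `Y.groupby('label').size()` output above.
label_counts = {0: 2744483, 1: 2517186}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)

# Accuracy obtained by always predicting the most common label.
baseline_accuracy = label_counts[majority_label] / total

model_1_score = 0.5216761028342712  # score reported by the cell above

print(f"baseline: {baseline_accuracy:.4f}, model 1: {model_1_score:.4f}")
```

On these numbers the length-only model scores about the same as the baseline, which suggests that review length alone carries little signal.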

## Model 2: With text and useful vote

In order to combine the features obtained from the text and the useful votes, we first need to convert the `useful` series into a sparse matrix and then concatenate the two sparse matrices horizontally (column-wise), so that each row still corresponds to one review.
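
A minimal sketch of that combination step, assuming the text features come from `CountVectorizer` and using invented data: the `useful` column is reshaped to a single-column sparse matrix and appended with `scipy.sparse.hstack`, which joins matrices side by side (column-wise) so each row keeps its review.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Made-up data for illustration only.
texts = ["great food", "slow service", "great service"]
useful = np.array([0, 3, 1])

# Sparse bag-of-words matrix, one row per review.
X_text = CountVectorizer().fit_transform(texts)

# Turn the 1-D vote counts into a sparse column, shape (3, 1).
X_useful = csr_matrix(useful.reshape(-1, 1))

# hstack appends columns, so the row count stays the same.
X_combined = hstack([X_text, X_useful]).tocsr()

print(X_text.shape, X_useful.shape, X_combined.shape)
```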

In [22]:
# pass the labels as a 1-D Series to avoid sklearn's DataConversionWarning
nb_classifier.fit(X_train, Y_train['label'])


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [23]:
nb_classifier.score(X_test, Y_test)

0.9996260692897883
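
A score this close to 1.0 is worth a sanity check. The notebook does not show how `dataset/Y.csv` was built. If `label` was derived from the `useful` count (an assumption, not something stated above), then including `useful` as a feature would let the model recover the label almost directly. One hedged way to check is to cross-tabulate the label against whether `useful` is positive, sketched here on made-up values:

```python
import pandas as pd

# Made-up values standing in for reviews['useful'] and Y['label'].
useful = pd.Series([0, 0, 1, 3, 0, 2], name="useful")
label = pd.Series([0, 0, 1, 1, 0, 1], name="label")

# Hypothetical rule being tested: label == 1 exactly when useful > 0.
has_useful_vote = (useful > 0).rename("useful > 0")

# Rows: rule outcome, columns: actual label.
# Off-diagonal cells count the reviews where the rule and the label disagree.
table = pd.crosstab(has_useful_vote, label)
agreement = (has_useful_vote.astype(int) == label).mean()

print(table)
print(agreement)
```

If the agreement on the real data turned out to be near 1.0, the second model's score would say little about whether votes are biased, since the feature and the label would carry the same information.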