# Predicting the number of stars for a review

Our dataset is quite large (about 6 GB). To keep debugging runs fast, we call [sample](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sample#pyspark.sql.DataFrame.sample) right after reading the JSON, so that we work with only a tiny random fraction of the reviews.
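
To make the behaviour of `sample` concrete: when sampling without replacement, Spark keeps each row independently with probability equal to the given fraction, so the sample size varies around its expected value instead of matching it exactly. The pure-Python sketch below imitates that idea on a plain list. `bernoulli_sample` is a hypothetical helper written for illustration, not a Spark function.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(100_000))

# Same seed, same sample: handy when a bug has to be reproduced.
sample_a = bernoulli_sample(rows, fraction=0.001, seed=42)
sample_b = bernoulli_sample(rows, fraction=0.001, seed=42)

# A different seed gives a different sample of a similar (not identical) size.
sample_c = bernoulli_sample(rows, fraction=0.001, seed=7)

print(len(sample_a), len(sample_c))  # both close to the expected 100 rows
print(sample_a == sample_b)          # True
```

In the notebook itself, a seed can be passed in the same way, for example `sample(fraction=0.000001, seed=42)` (the seed value is arbitrary). This should make debugging runs repeatable as long as the input data and its partitioning do not change.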

In [1]:
reviews_on_hdfs = "/user/borisshminke/data/yelp_academic_dataset_review.json"

In [2]:
reviews = spark.read.json(reviews_on_hdfs).sample(0.000001)
reviews.show(n=5)

+--------------------+----+-------------------+-----+--------------------+-----+--------------------+------+--------------------+
|         business_id|cool|               date|funny|           review_id|stars|                text|useful|             user_id|
+--------------------+----+-------------------+-----+--------------------+-----+--------------------+------+--------------------+
|_iEl9sCLsvXEFHUWP...|   1|2017-08-16 17:57:23|    0|wI9BR4DNU99C_dvY-...|  4.0|Really 3.5

I thi...|     0|f77_FtAlN-8H4bUdu...|
|MUad5l6z0Z3fwdpb4...|   0|2015-09-27 20:45:18|    1|zrIFeuDJhZZ7Ce-K5...|  2.0|Cool concept - an...|     1|yXVhmdBFBmU3DIu9Y...|
|yHejLbG91ThJIn2xp...|   0|2019-01-23 14:38:13|    0|b2DO8cH6ooKQKcCxC...|  5.0|I eat at this pla...|     0|NZYeGIBbwDKYTwYou...|
|P0-zxLhfe9iOgidDG...|   0|2019-05-15 04:51:51|    0|Ryc_Aep4hkl6YBv0Y...|  5.0|This place is wit...|     1|daow2AoiYJGMrbrcl...|
|3kdSl5mo9dWC4clrQ...|   0|2018-12-28 03:58:43|    0|smvKWyooBL-5UFNfd...|  5.0|Great plac

# Transforming Data

Spark has a vast library of feature engineering functions. For example, we can compute a TF-IDF representation of our review corpus. In the following snippet we construct a data preparation pipeline with three stages:
1. we split the review text into words
1. we count the term frequencies in each bag of words
1. we rescale the counts by inverse document frequency

In [3]:
%%time

from pyspark.ml.pipeline import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

data_preparation = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="term_frequency"),
    IDF(inputCol="term_frequency", outputCol="embedding")
])
prepared_reviews = data_preparation.fit(reviews).transform(reviews)
prepared_reviews.select("text", "words", "term_frequency", "embedding").show(n=5)

+--------------------+--------------------+--------------------+--------------------+
|                text|               words|      term_frequency|           embedding|
+--------------------+--------------------+--------------------+--------------------+
|Really 3.5

I thi...|[really, 3.5, , i...|(262144,[14,1998,...|(262144,[14,1998,...|
|Cool concept - an...|[cool, concept, -...|(262144,[15889,16...|(262144,[15889,16...|
|I eat at this pla...|[i, eat, at, this...|(262144,[12888,15...|(262144,[12888,15...|
|This place is wit...|[this, place, is,...|(262144,[9639,136...|(262144,[9639,136...|
|Great place to go...|[great, place, to...|(262144,[1889,231...|(262144,[1889,231...|
+--------------------+--------------------+--------------------+--------------------+
only showing top 5 rows

CPU times: user 105 ms, sys: 20.6 ms, total: 125 ms
Wall time: 20.8 s
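
To see what the three stages compute, here is a minimal pure-Python sketch on a toy corpus. It rests on a few assumptions: the tokenizer is approximated by lowercasing and splitting on single whitespace characters, CRC32 stands in for the hash function Spark uses (MurmurHash3), so the indices will not match Spark's, and the helper names are hypothetical. The IDF formula `log((m + 1) / (df + 1))` is the smoothed one documented for Spark's `IDF`.

```python
import math
import re
import zlib
from collections import Counter

NUM_FEATURES = 262144  # 2**18, the vector size visible in the output above


def tokenize(text):
    # Lowercase and split on single whitespace characters, so repeated
    # whitespace produces empty tokens (as in the `words` column).
    return re.split(r"\s", text.lower())


def hashed_term_frequency(words, num_features=NUM_FEATURES):
    # Hashing trick: map each word to a bucket index and count occurrences.
    # CRC32 is a stand-in here; Spark uses a different hash function.
    counts = Counter(zlib.crc32(w.encode("utf-8")) % num_features for w in words)
    return dict(counts)


def inverse_document_frequency(tf_vectors):
    # df = number of documents in which a bucket index occurs at least once
    m = len(tf_vectors)
    df = Counter(index for tf in tf_vectors for index in tf)
    return {index: math.log((m + 1) / (d + 1)) for index, d in df.items()}


corpus = ["Great place to go", "This place is great", "I eat at this place"]
tf_vectors = [hashed_term_frequency(tokenize(text)) for text in corpus]
idf = inverse_document_frequency(tf_vectors)
tfidf_vectors = [
    {index: count * idf[index] for index, count in tf.items()}
    for tf in tf_vectors
]

# "place" occurs in every document, so its weight is log(4 / 4) = 0.
place_index = zlib.crc32(b"place") % NUM_FEATURES
print(idf[place_index])  # 0.0
```

Words that appear in every review carry no information for telling reviews apart, which is why their weight drops to zero, while rarer words keep a positive weight.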


Let's look into the details of the first row:

In [4]:
prepared_reviews.select("text", "words", "term_frequency", "embedding").head()

Row(text=u"Really 3.5\n\nI think came here in an off night.\nFirst I wanted the beautiful tacos I see in the photos  on Yelp.  The menu I had only had one taco on it.  When I asked the waiter.  He told me that was the only menu they had.  I then ordered the fried burrito and it looked amazing!!! \n\n\nHowever the burrito tasted pretty bad.\n\nI don't think I'll be coming back here.", words=[u'really', u'3.5', u'', u'i', u'think', u'came', u'here', u'in', u'an', u'off', u'night.', u'first', u'i', u'wanted', u'the', u'beautiful', u'tacos', u'i', u'see', u'in', u'the', u'photos', u'', u'on', u'yelp.', u'', u'the', u'menu', u'i', u'had', u'only', u'had', u'one', u'taco', u'on', u'it.', u'', u'when', u'i', u'asked', u'the', u'waiter.', u'', u'he', u'told', u'me', u'that', u'was', u'the', u'only', u'menu', u'they', u'had.', u'', u'i', u'then', u'ordered', u'the', u'fried', u'burrito', u'and', u'it', u'looked', u'amazing!!!', u'', u'', u'', u'however', u'the', u'burrito', u'tasted', u'pretty'

Note how the TF-IDF vectors are represented: they are sparse. Each vector is printed as `(size, [indices], [values])`, listing only its non-zero entries.
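
The `(size, [indices], [values])` layout can be illustrated with a small pure-Python sketch. The helpers `to_sparse` and `to_dense` are hypothetical and only mimic the layout; in PySpark the corresponding type lives in `pyspark.ml.linalg`.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, value in enumerate(dense) if value != 0]
    values = [dense[i] for i in indices]
    return (len(dense), indices, values)


def to_dense(sparse):
    """Expand (size, indices, values) back into a full list."""
    size, indices, values = sparse
    dense = [0.0] * size
    for i, value in zip(indices, values):
        dense[i] = value
    return dense


dense = [0.0, 2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0]
sparse = to_sparse(dense)
print(sparse)  # (8, [1, 4], [2.0, 1.5])
```

A review contains only a handful of distinct words compared with the 262144 available buckets, so storing just the non-zero positions saves a great deal of memory.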

# Do It Yourself

Using [the regression tutorial from the Spark docs](http://spark.apache.org/docs/latest/ml-classification-regression.html#regression) as a guide, try the following:

* calculate `word2vec` embeddings instead of TF-IDF
* build a linear regression model that predicts stars from the review text
* split data into train and validation sets and evaluate your model
* compare the quality of the models (TF-IDF vs word2vec, linear regression vs random forest vs gradient-boosted trees)
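
As a starting point for the evaluation step, here is a minimal pure-Python sketch of a train/validation split and of root-mean-square error (RMSE), one common regression metric. It is an illustration under assumptions, not the tutorial's method: the helper names are hypothetical, the labels are the five star ratings shown in the sample at the top, and the constant "predict the mean" baseline is scored on the same five ratings purely for demonstration. In Spark, `DataFrame.randomSplit` and `RegressionEvaluator` play the corresponding roles.

```python
import math
import random


def train_validation_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and cut it into two parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def rmse(predictions, labels):
    """Root-mean-square error between predictions and true labels."""
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


stars = [4.0, 2.0, 5.0, 5.0, 5.0]

# Constant baseline: always predict the mean rating.
mean_stars = sum(stars) / len(stars)
baseline_rmse = rmse([mean_stars] * len(stars), stars)
print(round(baseline_rmse, 4))  # 1.1662

train, validation = train_validation_split(stars, train_fraction=0.8, seed=0)
print(len(train), len(validation))  # 4 1
```

A useful sanity check when comparing models is that each of them should reach a lower validation RMSE than such a constant baseline.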