In [2]:
!gsutil cp gs://yelp-dataset-bucket/yelp_academic_dataset_review.json /home/user

Copying gs://yelp-dataset-bucket/yelp_academic_dataset_review.json...
- [1 files][  5.9 GiB/  5.9 GiB]   50.8 MiB/s                                   
Operation completed over 1 objects/5.9 GiB.                                      


In [10]:
!hdfs dfs -mkdir /user/data

In [11]:
!hdfs dfs -put /home/user/yelp_academic_dataset_review.json /user/data

# Predicting the number of stars for a review

Our dataset is quite large, about 6 GB. To keep debugging runs fast, we work on a small random subset, drawn with [sample](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sample#pyspark.sql.DataFrame.sample) right after reading the JSON.
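
A note on what the fraction means: without replacement, `sample` keeps each row independently with the given probability, so the size of the result is only approximately `fraction * count` and changes from run to run. The plain-Python sketch below illustrates that behaviour; `bernoulli_sample` is a hypothetical helper written for this illustration and is not part of Spark.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = range(100_000)
first = bernoulli_sample(rows, 0.0001, seed=1)
second = bernoulli_sample(rows, 0.0001, seed=2)

# Both sizes should be close to 10, but they are not guaranteed to be equal.
print(len(first), len(second))
```

`DataFrame.sample` also accepts a `seed` argument, which is worth setting when you want the same debugging subset on every run.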

In [12]:
reviews_on_hdfs = "/user/data/yelp_academic_dataset_review.json"

In [13]:
reviews = spark.read.json(reviews_on_hdfs).sample(0.000001)
reviews.show(n=5)

+--------------------+----+-------------------+-----+--------------------+-----+--------------------+------+--------------------+
|         business_id|cool|               date|funny|           review_id|stars|                text|useful|             user_id|
+--------------------+----+-------------------+-----+--------------------+-----+--------------------+------+--------------------+
|-lCSC0-seRf1KZUeL...|   0|2011-12-19 20:18:57|    0|7eUgRTX-Y5Fudkgod...|  4.0|My favourite room...|     0|HFItzRohDHZvcKDrM...|
|KEaCHdsY7w7CBsZ6h...|   0|2007-10-13 14:13:32|    0|FbI1y7TPgnQd_xCL_...|  4.0|We went to the Gr...|     0|d0LUROgBb3R5eAEio...|
|FCP5hYaTtn6dkpmZ_...|   0|2010-07-04 09:25:51|    0|KAqyziQ_VXNlgRDiL...|  4.0|1st, not the best...|     0|ijExLQtBHr4FXb3yf...|
|p8HvhJZ-_EHhmUVmZ...|   0|2014-06-26 03:05:47|    0|Az0KZ1E4GXmS7PJ-Z...|  4.0|This place was aw...|     0|BNosARG4V6JBJlXe0...|
|faPVqws-x-5k2CQKD...|   0|2018-04-07 18:16:21|    0|kEwQ2ljpqTpPnJIbQ...|  4.0|It is good

# Transforming Data

Spark has a vast library of feature engineering functions. For example, we can get a TF-IDF representation of our review corpus. In the following snippet we construct a data preparation pipeline with three stages:
1. we split the review text into words
1. we count the term frequencies of the resulting bags of words
1. we rescale the counts by inverse document frequency

In [14]:
%%time

from pyspark.ml.pipeline import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

data_preparation = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="term_frequency"),
    IDF(inputCol="term_frequency", outputCol="embedding")
])
prepared_reviews = data_preparation.fit(reviews).transform(reviews)
prepared_reviews.select("text", "words", "term_frequency", "embedding").show(n=5)

+--------------------+--------------------+--------------------+--------------------+
|                text|               words|      term_frequency|           embedding|
+--------------------+--------------------+--------------------+--------------------+
|My favourite room...|[my, favourite, r...|(262144,[2916,538...|(262144,[2916,538...|
|We went to the Gr...|[we, went, to, th...|(262144,[3340,538...|(262144,[3340,538...|
|1st, not the best...|[1st,, not, the, ...|(262144,[14,2315,...|(262144,[14,2315,...|
|This place was aw...|[this, place, was...|(262144,[5630,156...|(262144,[5630,156...|
|It is good.  The ...|[it, is, good., ,...|(262144,[9639,158...|(262144,[9639,158...|
+--------------------+--------------------+--------------------+--------------------+
only showing top 5 rows

CPU times: user 95.4 ms, sys: 33.5 ms, total: 129 ms
Wall time: 36.2 s
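
To make the arithmetic behind the last two stages concrete, here is a small plain-Python sketch on a hypothetical three-document corpus. It uses the smoothed inverse document frequency from the Spark MLlib documentation, `log((N + 1) / (df + 1))`, and skips the hashing step, so it illustrates the formula rather than reproducing the pipeline above.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenised.
docs = [
    ["good", "food", "good", "service"],
    ["bad", "food"],
    ["good", "place"],
]


def smoothed_idf(docs):
    """IDF per term: log((N + 1) / (df + 1)), with df = documents containing the term."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}


def tf_idf(doc, idf):
    """Raw term count multiplied by the term's IDF."""
    return {term: count * idf[term] for term, count in Counter(doc).items()}


idf = smoothed_idf(docs)
for doc in docs:
    print(tf_idf(doc, idf))
```

Terms that occur in most documents ("good", "food") get a small weight, while a rare term such as "bad" gets a larger one.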


Let's look into the details of the first row:

In [15]:
prepared_reviews.select("text", "words", "term_frequency", "embedding").head()

Row(text=u"My favourite room in the house (other than my bedroom because I sleep like a champ), is the kitchen!  \n\nThey have a lot of neat stuff.  I prefer to come here for all my kitchen essentials because the stuff they have is of good quality, and its trendy and modern.  The store is pretty easy to navigate through.  \n\nThe prices are quite good, especially around the holidays and unlike some other places (Home Outfitters, Homesense), you can browse their product offerings online to cut down on time wasted in the store.  They will even tell you if something is in stock (not down to the very last one, but that's good enough for me!).", words=[u'my', u'favourite', u'room', u'in', u'the', u'house', u'(other', u'than', u'my', u'bedroom', u'because', u'i', u'sleep', u'like', u'a', u'champ),', u'is', u'the', u'kitchen!', u'', u'', u'', u'they', u'have', u'a', u'lot', u'of', u'neat', u'stuff.', u'', u'i', u'prefer', u'to', u'come', u'here', u'for', u'all', u'my', u'kitchen', u'essential

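Note how the TF-IDF vectors are represented: they are sparse. Each vector is printed as `(size, [indices], [values])`, so only the non-zero entries out of 262144 dimensions are stored.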
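
The sketch below shows, in plain Python, how a list of words can end up as such a `(size, indices, values)` triple. It is an illustration under assumptions: `zlib.crc32` stands in for the MurmurHash3 function Spark uses, so the indices will not match the output above, and the helper names are hypothetical.

```python
import zlib
from collections import Counter

NUM_FEATURES = 262144  # 2**18, the vector size seen in the output above


def bucket(term, num_features=NUM_FEATURES):
    # Stand-in hash function; Spark's HashingTF uses MurmurHash3 instead.
    return zlib.crc32(term.encode("utf-8")) % num_features


def hashed_term_frequency(words, num_features=NUM_FEATURES):
    """Count words per hash bucket and return (size, indices, values)."""
    counts = Counter(bucket(word, num_features) for word in words)
    indices = sorted(counts)
    values = [float(counts[i]) for i in indices]
    return num_features, indices, values


def to_dense(size, indices, values):
    """Expand a sparse triple into a full list, zeros included."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


size, indices, values = hashed_term_frequency(["my", "favourite", "room", "my"])
print(size, indices, values)
```

Storing three or four numbers instead of 262144 is what makes it practical to keep one such vector per review.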

# Do It Yourself

Following [a tutorial from Spark docs](http://spark.apache.org/docs/latest/ml-classification-regression.html#regression), try to:

* calculate `word2vec` embeddings instead of TF-IDF
* build a linear regression (predict stars by text)
* split data into train and validation sets and evaluate your model
* compare the quality of models (TF-IDF vs word2vec, linear regression vs random forest vs gradient-boosted trees)

In [42]:
from pyspark.ml.feature import Word2Vec
from pyspark.ml.regression import LinearRegression

In [48]:
data_preparation = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    Word2Vec(inputCol="words", outputCol="model")
])

prepared_reviews = data_preparation.fit(reviews).transform(reviews)
prepared_reviews.select("text", "words", "model").show(n=5)

regression = LinearRegression(featuresCol="model", labelCol="stars", maxIter=100, regParam=0.2, elasticNetParam=0.5)
model = regression.fit(prepared_reviews)

print("coefficients: " + str(model.coefficients))
print("intercept: " + str(model.intercept))

+--------------------+--------------------+--------------------+
|                text|               words|               model|
+--------------------+--------------------+--------------------+
|My favourite room...|[my, favourite, r...|[0.00131861418102...|
|We went to the Gr...|[we, went, to, th...|[6.01509346811593...|
|1st, not the best...|[1st,, not, the, ...|[0.00119749398695...|
|This place was aw...|[this, place, was...|[6.51844831882044...|
|It is good.  The ...|[it, is, good., ,...|[0.00146789905875...|
+--------------------+--------------------+--------------------+
only showing top 5 rows

coefficients: [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1237.4554721391437,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-431.49024523906286,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1077.5112016550845,0.0,0.0,-929.5960582214364,0.0,0.0,0.0,0.0,0.0,0.0,1002.9030688934004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-131.79415085627096,557.871876985645,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.

In [49]:
prepared_reviews.columns

['business_id',
 'cool',
 'date',
 'funny',
 'review_id',
 'stars',
 'text',
 'useful',
 'user_id',
 'words',
 'model']

In [50]:
trainingSummary = model.summary
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)

RMSE: 0.460701
r2: 0.839328
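
For reference, the two metrics can be computed by hand as in the plain-Python sketch below. The labels and predictions are hypothetical values chosen to keep the arithmetic easy to follow; they are not taken from the model above.

```python
import math


def rmse(labels, predictions):
    """Root of the mean squared difference between labels and predictions."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(labels))


def r2(labels, predictions):
    """1 - (residual sum of squares / total sum of squares around the mean label)."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


labels = [4.0, 2.0, 5.0, 1.0]        # hypothetical star ratings
predictions = [3.5, 2.5, 4.5, 1.5]   # hypothetical model outputs

print("RMSE: %f" % rmse(labels, predictions))
print("r2: %f" % r2(labels, predictions))
```

Here every prediction is off by half a star, so the RMSE is 0.5 and R² is about 0.9. Keep in mind that `model.summary` reports these metrics on the data the model was fitted on, so they tend to be optimistic compared with a held-out validation set.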


In [None]:
train_predict = model.transform(prepared_reviews)
train_predict.select("prediction", "stars", "model").show(35)

+------------------+-----+--------------------+
|        prediction|stars|               model|
+------------------+-----+--------------------+
| 3.978619162700394|  4.0|[0.00131861418102...|
|3.7942751679944546|  4.0|[6.01509346811593...|
|     4.13363320459|  4.0|[0.00119749398695...|
|3.8609924920248204|  4.0|[6.51844831882044...|
| 3.784364208530939|  4.0|[0.00146789905875...|
| 3.938090791198865|  4.0|[0.00102792119258...|
| 2.768740683860572|  2.0|[0.00102986608414...|
| 4.627303976338861|  5.0|[0.00147076718646...|
| 3.749330720878864|  4.0|[0.00148871509222...|
| 4.997540225979817|  5.0|[-7.7833037115245...|
| 4.453156920063716|  5.0|[2.88112896669190...|
| 4.894827564276035|  5.0|[6.32315336371816...|
| 3.298712629100418|  2.0|[9.10666916063386...|
|   3.1703814718937|  3.0|[9.29591479785532...|
| 4.427272387978683|  5.0|[7.81678155261842...|
| 4.726151281977669|  5.0|[0.00150755843857...|
| 3.756729222872821|  4.0|[6.10695605504630...|
|1.6398778877393632|  1.0|[0.00209529235

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

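lr_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="stars", metricName="r2")

# train_predict holds predictions on the data the model was fitted on,
# so this is a training score, not a test score.
print("R Squared (R2) on training data = %g" % lr_evaluator.evaluate(train_predict))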