# Amazon Review Analysis

Data source: https://snap.stanford.edu/data/web-FineFoods.html

Useful resources: 
- PySpark cheat sheet: http://web.utk.edu/~wfeng1/doc/cheatSheet_pyspark.pdf
- MLlib document: https://spark.apache.org/docs/latest/ml-guide.html
- SparkR document: https://spark.apache.org/docs/latest/sparkr.html
- SparkR tutorial: https://rpubs.com/wendyu/sparkr

## Before We Start...

Basic concepts of Spark: 
- RDD (Resilient Distributed Dataset): the fundamental data structure for distributing data across cluster nodes. RDDs are immutable.
- Transformation: an operation on an RDD that returns a new RDD, such as map, filter, and reduceByKey. Transformations are evaluated lazily.
- Action: an operation on an RDD that returns a non-RDD value to the driver, such as collect, count, and reduce.
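
Spark does not run a transformation until an action asks for a result. The sketch below is only an analogy in plain Python, not Spark code: generators stand in for transformations and `list()` stands in for an action. The names `double`, `calls`, and `data` are made up for illustration.

```python
calls = []  # records every time the function actually runs

def double(x):
    calls.append(x)
    return x * 2

data = [1, 2, 3, 4]

# "Transformations": build a lazy pipeline; no element is processed yet.
evens = (x for x in data if x % 2 == 0)
doubled = (double(x) for x in evens)
calls_before_action = list(calls)

# "Action": materialising the result forces the whole pipeline to run.
result = list(doubled)
print(calls_before_action, result, calls)
```

Nothing is computed while the pipeline is being described; the work happens only when the result is requested.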

We will mainly use the Spark DataFrame API rather than the RDD API, to simplify development.
- Spark DataFrames are similar to tables in a relational database: they have a schema, and most operations on them resemble SQL queries. You can think of a Spark DataFrame as a wrapper on top of RDDs.

## Loading Data

In [4]:
%run "./Data Lake Connection"

In [5]:
# Reading data from Azure Data Lake

from pyspark.sql.types import (ArrayType, DateType, IntegerType, StringType,
                               StructField, StructType)

amazon_schema = StructType([StructField('Id',IntegerType(),True), 
                            StructField('ProductId',StringType(),True),
                            StructField('UserId',StringType(),True),
                            StructField('ProfileName',StringType(),True),
                            StructField('HelpfulnessNumerator',IntegerType(),True),
                            StructField('HelpfulnessDenominator',IntegerType(),True),
                            StructField('Score',IntegerType(),True),
                            StructField('Time',IntegerType(),True),
                            StructField('Summary',StringType(),True),
                            StructField('Text',StringType(),True)])

amazon_review = spark.read.csv("abfss://data@ssbdatalakegen2.dfs.core.windows.net/Reviews.csv", header=True, schema=amazon_schema)
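
Review text in this dataset contains commas, doubled quotes, and line breaks inside quoted fields, so it helps to know how standard CSV quoting resolves them. The sketch below uses Python's built-in `csv` module on a made-up miniature row (it is not taken from the real file). Spark's CSV reader has `quote`, `escape`, and `multiLine` options for the same cases. Whether this file needs them is an assumption to verify, for example by comparing row counts after loading.

```python
import csv
import io

# Hypothetical sample: one header row and one record whose last field
# spans two lines and whose middle field contains doubled quotes.
raw = 'Id,Summary,Text\n3,"""Delight"" says it all","first line\nsecond line"\n'

rows = list(csv.reader(io.StringIO(raw, newline="")))

header, record = rows
print(len(rows))   # number of logical rows, not physical lines
print(record[1])   # doubled quotes collapse to a single quote character
print(record[2])   # the embedded line break stays inside the field
```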

In [6]:
display(amazon_review)

Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
1.0,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1.0,1.0,5.0,1303862400.0,Good Quality Dog Food,I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.
2.0,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0.0,0.0,1.0,1346976000.0,Not as Advertised,"""Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as """"Jumbo""""."""
3.0,B000LQOCH0,ABXLMWJIXXAIN,"""Natalia Corres """"Natalia Corres""""""",1.0,1.0,4.0,1219017600.0,"""""""Delight"""" says it all""","""This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar. And it is a tiny mouthful of heaven. Not too chewy, and very flavorful. I highly recommend this yummy treat. If you are familiar with the story of C.S. Lewis' """"The Lion"
4.0,B000UA0QIQ,A395BORC6FGVXV,Karl,3.0,3.0,2.0,1307923200.0,Cough Medicine,If you are looking for the secret ingredient in Robitussin I believe I have found it. I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda. The flavor is very medicinal.
5.0,B006K2ZZ7K,A1UQRSCLF8GW1T,"""Michael D. Bigham """"M. Wassir""""""",0.0,0.0,5.0,1350777600.0,Great taffy,"Great taffy at a great price. There was a wide assortment of yummy taffy. Delivery was very quick. If your a taffy lover, this is a deal."
6.0,B006K2ZZ7K,ADT0SRK1MGOEU,Twoapennything,0.0,0.0,4.0,1342051200.0,Nice Taffy,"I got a wild hair for taffy and ordered this five pound bag. The taffy was all very enjoyable with many flavors: watermelon, root beer, melon, peppermint, grape, etc. My only complaint is there was a bit too much red/black licorice-flavored pieces (just not my particular favorites). Between me, my kids, and my husband, this lasted only two weeks! I would recommend this brand of taffy -- it was a delightful treat."
7.0,B006K2ZZ7K,A1SP2KVKFXXRU1,David C. Sullivan,0.0,0.0,5.0,1340150400.0,Great! Just as good as the expensive brands!,"This saltwater taffy had great flavors and was very soft and chewy. Each candy was individually wrapped well. None of the candies were stuck together, which did happen in the expensive version, Fralinger's. Would highly recommend this candy! I served it at a beach-themed party and everyone loved it!"
8.0,B006K2ZZ7K,A3JRGQVEQN31IQ,Pamela G. Williams,0.0,0.0,5.0,1336003200.0,"Wonderful, tasty taffy",This taffy is so good. It is very soft and chewy. The flavors are amazing. I would definitely recommend you buying it. Very satisfying!!
9.0,B000E7L2R4,A1MZYO9TZK0BBI,R. James,1.0,1.0,5.0,1322006400.0,Yay Barley,Right now I'm mostly just sprouting this so my cats can eat the grass. They love it. I rotate it around with Wheatgrass and Rye too
10.0,B00171APVA,A21BT40VZCCYT4,Carol A. Reed,0.0,0.0,5.0,1351209600.0,Healthy Dog Food,This is a very healthy dog food. Good for their digestion. Also good for small puppies. My dog eats her required amount at every feeding.


In [7]:
# Spark SQL

amazon_review.createOrReplaceTempView("amazon_review")
display(spark.sql("SELECT * FROM amazon_review"))

Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
1.0,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1.0,1.0,5.0,1303862400.0,Good Quality Dog Food,I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.
2.0,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0.0,0.0,1.0,1346976000.0,Not as Advertised,"""Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as """"Jumbo""""."""
3.0,B000LQOCH0,ABXLMWJIXXAIN,"""Natalia Corres """"Natalia Corres""""""",1.0,1.0,4.0,1219017600.0,"""""""Delight"""" says it all""","""This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar. And it is a tiny mouthful of heaven. Not too chewy, and very flavorful. I highly recommend this yummy treat. If you are familiar with the story of C.S. Lewis' """"The Lion"
4.0,B000UA0QIQ,A395BORC6FGVXV,Karl,3.0,3.0,2.0,1307923200.0,Cough Medicine,If you are looking for the secret ingredient in Robitussin I believe I have found it. I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda. The flavor is very medicinal.
5.0,B006K2ZZ7K,A1UQRSCLF8GW1T,"""Michael D. Bigham """"M. Wassir""""""",0.0,0.0,5.0,1350777600.0,Great taffy,"Great taffy at a great price. There was a wide assortment of yummy taffy. Delivery was very quick. If your a taffy lover, this is a deal."
6.0,B006K2ZZ7K,ADT0SRK1MGOEU,Twoapennything,0.0,0.0,4.0,1342051200.0,Nice Taffy,"I got a wild hair for taffy and ordered this five pound bag. The taffy was all very enjoyable with many flavors: watermelon, root beer, melon, peppermint, grape, etc. My only complaint is there was a bit too much red/black licorice-flavored pieces (just not my particular favorites). Between me, my kids, and my husband, this lasted only two weeks! I would recommend this brand of taffy -- it was a delightful treat."
7.0,B006K2ZZ7K,A1SP2KVKFXXRU1,David C. Sullivan,0.0,0.0,5.0,1340150400.0,Great! Just as good as the expensive brands!,"This saltwater taffy had great flavors and was very soft and chewy. Each candy was individually wrapped well. None of the candies were stuck together, which did happen in the expensive version, Fralinger's. Would highly recommend this candy! I served it at a beach-themed party and everyone loved it!"
8.0,B006K2ZZ7K,A3JRGQVEQN31IQ,Pamela G. Williams,0.0,0.0,5.0,1336003200.0,"Wonderful, tasty taffy",This taffy is so good. It is very soft and chewy. The flavors are amazing. I would definitely recommend you buying it. Very satisfying!!
9.0,B000E7L2R4,A1MZYO9TZK0BBI,R. James,1.0,1.0,5.0,1322006400.0,Yay Barley,Right now I'm mostly just sprouting this so my cats can eat the grass. They love it. I rotate it around with Wheatgrass and Rye too
10.0,B00171APVA,A21BT40VZCCYT4,Carol A. Reed,0.0,0.0,5.0,1351209600.0,Healthy Dog Food,This is a very healthy dog food. Good for their digestion. Also good for small puppies. My dog eats her required amount at every feeding.


## Cleaning Data

In [9]:
# Drop duplicates

amazon_review = amazon_review.dropDuplicates(['UserId','ProductId'])

In [10]:
# Convert Unix timestamp to readable date

from pyspark.sql.functions import from_unixtime, to_date

amazon_review = amazon_review.withColumn("Date", to_date(from_unixtime(amazon_review.Time)))

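For comparison, with a pandas DataFrame you would use .apply() to apply a function to a column. See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html

For example: amz_review['Date'] = amz_review['Time'].apply(to_date), where to_date is your own Python function that converts a Unix timestamp to a date (not the PySpark function imported above).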
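
A minimal sketch of such a conversion function, using only the standard library. The name `unix_to_date` is hypothetical, and the conversion assumes UTC, whereas Spark's `from_unixtime` uses the session time zone.

```python
from datetime import datetime, timezone

def unix_to_date(ts):
    """Convert a Unix timestamp in seconds to a calendar date (UTC assumed)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).date()

# Two timestamps taken from the Time column shown in this notebook.
times = [1303862400, 1344902400]
dates = [unix_to_date(t) for t in times]
print([d.isoformat() for d in dates])

# With pandas, the same function could be passed to .apply():
# amz_review['Date'] = amz_review['Time'].apply(unix_to_date)
```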

In [12]:
# Tokenization

from pyspark.sql.functions import udf

def tokenize(col):
  if col is None:
    return []
  return col.split()

tokenize_udf = udf(tokenize, ArrayType(StringType()))
amazon_review = amazon_review.withColumn("WordList", tokenize_udf(amazon_review.Text))
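
To see what this tokenizer produces, here is the same function run outside Spark on a fragment of one review. `str.split()` with no arguments splits on any run of whitespace and leaves punctuation attached to the words, which is why a separate cleaning step follows.

```python
def tokenize(text):
    # Null review text becomes an empty token list
    if text is None:
        return []
    return text.split()

sample = "The kcups looked okay,  but many of them sprayed coffee."
tokens = tokenize(sample)
print(tokens)
print(tokenize(None))
```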

In [13]:
# Remove special characters and partially scraped html

# Multi-character HTML fragments come first so they are removed before '<' is stripped
special_char_list = ['<span','<br','\\','`','"',"'",'*','_','{','}','[',']','(',')',';','@','^','&','>','#','+',':','-','=','|','<','~','.','!','$','/',',','?','%','0','1','2','3','4','5','6','7','8','9']

def remove_specialchar(col):
  result = []
  for word in col:
    tmp_word = word.lower()
    for special_char in special_char_list:
      tmp_word = tmp_word.replace(special_char, "")
    result.append(tmp_word)
  return result

remove_specialchar_udf = udf(remove_specialchar, ArrayType(StringType()))
amazon_review = amazon_review.withColumn("WordListCleaned", remove_specialchar_udf(amazon_review.WordList))
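
Because digits are stripped too, purely numeric tokens such as "400" or "1/3" become empty strings and remain in the list. A possible variant, sketched here in plain Python with a shortened, hypothetical character list (`SPECIAL_CHARS`), drops those empty tokens. It is an optional refinement, not what the cell above does.

```python
# Shortened stand-in for special_char_list, for illustration only
SPECIAL_CHARS = ['<br', '(', ')', '/', ',', '.', '!'] + list('0123456789')

def clean_tokens(words):
    cleaned = []
    for word in words:
        token = word.lower()
        for ch in SPECIAL_CHARS:
            token = token.replace(ch, "")
        if token:  # skip tokens that were reduced to the empty string
            cleaned.append(token)
    return cleaned

tokens = ["1/3", "cup", "(around", "400", "calories)"]
cleaned = clean_tokens(tokens)
print(cleaned)
```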

In [14]:
# Remove stop words

from stop_words import get_stop_words

# A set gives constant-time membership tests inside the UDF
stop_words = set(get_stop_words('en'))
remove_stopword_udf = udf(lambda col: [w for w in col if w not in stop_words], ArrayType(StringType()))
amazon_review = amazon_review.withColumn("WordListCleaned", remove_stopword_udf(amazon_review.WordListCleaned))
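
The filtering logic can be tried outside Spark. The sketch below uses a tiny, hypothetical stop-word set (`STOP_WORDS`) rather than the full list from the `stop_words` package. It stores the words in a set because a membership test on a set is a hash lookup, while a list is scanned element by element.

```python
# Hypothetical miniature stop-word list, for illustration only
STOP_WORDS = {"this", "for", "a", "of", "the"}

def remove_stopwords(words):
    # Keep only the tokens that are not stop words, preserving order
    return [w for w in words if w not in STOP_WORDS]

words = ["been", "buying", "this", "water", "for", "a", "couple", "of", "years"]
kept = remove_stopwords(words)
print(kept)
```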

In [15]:
# Stemming

from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()
stemming_udf = udf(lambda col: [p_stemmer.stem(w) for w in col], ArrayType(StringType()))
amazon_review = amazon_review.withColumn("WordListCleaned", stemming_udf(amazon_review.WordListCleaned))

In [16]:
display(amazon_review)

Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,Date,WordList,WordListCleaned
290915,B005HG9ESG,#oc-R1OE1OPY34LOC8,D,0,0,3,1344902400,Not the same,"Been buying this water for a couple of years. They recently changed the packaging and for some reason or other, the water seems different. The same clean taste is just NOT there. Not buying it again.",2012-08-14,"List(Been, buying, this, water, for, a, couple, of, years., They, recently, changed, the, packaging, and, for, some, reason, or, other,, the, water, seems, different., The, same, clean, taste, is, just, NOT, there., Not, buying, it, again.)","List(buy, water, coupl, year, recent, chang, packag, reason, water, seem, differ, clean, tast, just, buy)"
136337,B006Q820X0,#oc-R39HI2LQ9LHV32,stephanie,1,2,1,1339632000,Kcups leaking and coffee tasted old,"The kcups looked okay, but many of them sprayed coffee on the counter, had coffee grounds coming out. I won't buy this again.",2012-06-14,"List(The, kcups, looked, okay,, but, many, of, them, sprayed, coffee, on, the, counter,, had, coffee, grounds, coming, out., I, won't, buy, this, again.)","List(kcup, look, okay, mani, spray, coffe, counter, coffe, ground, come, wont, buy)"
124400,B005ZBZM52,#oc-R8EZLM74R071X,jalien,0,0,1,1350172800,"Great design, not so great flavor","I wanted to try something new and this is new alright. While I love and appreciate the design, the flavor is less than desirable. It's too bitter. I drink my coffee black and really enjoy a strong cup of dark roast, but after discovering I didn't care for the flavor, I tried to disguise it with the flavor of creamer I knew I'd like (coconut cream, yum) and still the flavor of the bitter coffee was just too much.",2012-10-14,"List(I, wanted, to, try, something, new, and, this, is, new, alright., While, I, love, and, appreciate, the, design,, the, flavor, is, less, than, desirable., It's, too, bitter., I, drink, my, coffee, black, and, really, enjoy, a, strong, cup, of, dark, roast,, but, after, discovering, I, didn't, care, for, the, flavor,, I, tried, to, disguise, it, with, the, flavor, of, creamer, I, knew, I'd, like, (coconut, cream,, yum), and, still, the, flavor, of, the, bitter, coffee, was, just, too, much.)","List(want, tri, someth, new, new, alright, love, appreci, design, flavor, less, desir, bitter, drink, coffe, black, realli, enjoy, strong, cup, dark, roast, discov, didnt, care, flavor, tri, disguis, flavor, creamer, knew, id, like, coconut, cream, yum, still, flavor, bitter, coffe, just, much)"
345714,B000ED9L9E,A10103MJIKKIFE,Dozer,1,1,5,1339459200,Yummi!!!!,"Love this hot cereal! I use 1/3 cup with 1 cup of almond milk, 1 tbsp of flax oil and 1 banana for breakfast (around 400 calories) and it helps me feel full all day! It has definitely decreased my cravings for sugary things.",2012-06-12,"List(Love, this, hot, cereal!, I, use, 1/3, cup, with, 1, cup, of, almond, milk,, 1, tbsp, of, flax, oil, and, 1, banana, for, breakfast, (around, 400, calories), and, it, helps, me, feel, full, all, day!, It, has, definitely, decreased, my, cravings, for, sugary, things.)","List(love, hot, cereal, use, , cup, , cup, almond, milk, , tbsp, flax, oil, , banana, breakfast, around, , calori, help, feel, full, day, definit, decreas, crave, sugari, thing)"
422681,B001LNL0MC,A101RJ22EBXAID,Disappointed Santa,5,6,1,1295913600,Holeymoley did I order crushed peppermint,"I ordered 3 boxes of Bob's candy canes because I could not find them in stores near me and I wanted the real thing, Bob's,not Spanglers, to help in the celebration of my father's last Christmas. We wanted an old fashioned, basic heart of America Christmas with him. We were willing to pay almost any price to get original Bob's candy canes for this purpose, hence ordering them at such an inflated price on-line from Amazon. More than half came smashed to smithereens. I sent a note to customer service alerting them to this and heard nothing back.",2011-01-25,"List(I, ordered, 3, boxes, of, Bob's, candy, canes, because, I, could, not, find, them, in, stores, near, me, and, I, wanted, the, real, thing,, Bob's,not, Spanglers,, to, help, in, the, celebration, of, my, father's, last, Christmas., We, wanted, an, old, fashioned,, basic, heart, of, America, Christmas, with, him., We, were, willing, to, pay, almost, any, price, to, get, original, Bob's, candy, canes, for, this, purpose,, hence, ordering, them, at, such, an, inflated, price, on-line, from, Amazon., More, than, half, came, smashed, to, smithereens., I, sent, a, note, to, customer, service, alerting, them, to, this, and, heard, nothing, back.)","List(order, , box, bob, candi, cane, find, store, near, want, real, thing, bobsnot, spangler, help, celebr, father, last, christma, want, old, fashion, basic, heart, america, christma, will, pay, almost, price, get, origin, bob, candi, cane, purpos, henc, order, inflat, price, onlin, amazon, half, came, smash, smithereen, sent, note, custom, servic, alert, heard, noth, back)"
201480,B001EQ4M6M,A101WAUONUXTNI,high altitude baking,0,4,1,1258761600,These Don't Work!,These make more than 10 cupcakes. Don't fill the muffin cups full (unless you are using jumbo muffin tins) because they will over flow. There should be some modification for high altitude..,2009-11-21,"List(These, make, more, than, 10, cupcakes., Don't, fill, the, muffin, cups, full, (unless, you, are, using, jumbo, muffin, tins), because, they, will, over, flow., There, should, be, some, modification, for, high, altitude..)","List(make, , cupcak, dont, fill, muffin, cup, full, unless, use, jumbo, muffin, tin, will, flow, modif, high, altitud)"
461927,B000ES1R1Y,A102T4DKZN8UFC,"""Marina Thakkar """"marinathakkar""""""",0,1,5,1279843200,Great tea,I have been drinking Ahmad teas for quite some time now and love all the flavors...so good!,2010-07-23,"List(I, have, been, drinking, Ahmad, teas, for, quite, some, time, now, and, love, all, the, flavors...so, good!)","List(drink, ahmad, tea, quit, time, now, love, flavorsso, good)"
19543,B000084ETV,A105BOR5D5S7CJ,ARL,0,0,2,1235606400,"After feeding for 5 years, we're done","We fed Canidae to our two dogs for almost 5 years--they always ate it eagerly, never had stomach upset, and got raves from the vet for their muscle, weight, and coats. It *was* a 5-star food. But last summer we opened a fresh bag of kibble (same exact packaging bought from our usual store) and immediately noticed it was much lighter in color than previous bags, but we fed it anyway. Within a day our Aussie, who has a bulletproof stomach with absolutely no digestive issues, began to have horrible gas, softer stools, and really loud, liquidy gurgling noises in his abdomen. We continued to feed the entire bag, thinking that he'd adjust to it eventually because our other dog was fine. Well, our Aussie never did adjust to it; he spent a month having those symptoms, and we dreaded mealtimes because of the stench and sounds he'd create. The kibble in the next bag we bought was the old normal shade of brown, and our Aussie's stomach noises, soft stools, and awful gas went away practically overnight. But by then I was hearing and reading that Canidae had developed a new formula that was making dogs sick. The stories about people whose dogs got *far* sicker than ours, and the dodgy responses that those long-time customers got from Canidae when they called to complain about the unannounced change, made us decide it was time for a new dog food. We went with Healthwise, a premium kibble made by the same company that makes Innova, California Natural, and Evo. It's not worth it to continue to be loyal to Canidae and to risk our dogs perhaps getting as sick as many others have. 
Read Canidae's new formula horror stories at consumeraffairs.com.",2009-02-26,"List(We, fed, Canidae, to, our, two, dogs, for, almost, 5, years--they, always, ate, it, eagerly,, never, had, stomach, upset,, and, got, raves, from, the, vet, for, their, muscle,, weight,, and, coats., It, *was*, a, 5-star, food., But, last, summer, we, opened, a, fresh, bag, of, kibble, (same, exact, packaging, bought, from, our, usual, store), and, immediately, noticed, it, was, much, lighter, in, color, than, previous, bags,, but, we, fed, it, anyway., Within, a, day, our, Aussie,, who, has, a, bulletproof, stomach, with, absolutely, no, digestive, issues,, began, to, have, horrible, gas,, softer, stools,, and, really, loud,, liquidy, gurgling, noises, in, his, abdomen., We, continued, to, feed, the, entire, bag,, thinking, that, he'd, adjust, to, it, eventually, because, our, other, dog, was, fine., Well,, our, Aussie, never, did, adjust, to, it;, he, spent, a, month, having, those, symptoms,, and, we, dreaded, mealtimes, because, of, the, stench, and, sounds, he'd, create.The, kibble, in, the, next, bag, we, bought, was, the, old, normal, shade, of, brown,, and, our, Aussie's, stomach, noises,, soft, stools,, and, awful, gas, went, away, practically, overnight., But, by, then, I, was, hearing, and, reading, that, Canidae, had, developed, a, new, formula, that, was, making, dogs, sick., The, stories, about, people, whose, dogs, got, *far*, sicker, than, ours,, and, the, dodgy, responses, that, those, long-time, customers, got, from, Canidae, when, they, called, to, complain, about, the, unannounced, change,, made, us, decide, it, was, time, for, a, new, dog, food., We, went, with, Healthwise,, a, premium, kibble, made, by, the, same, company, that, makes, Innova,, California, Natural,, and, Evo., It's, not, worth, it, to, continue, to, be, loyal, to, Canidae, and, to, risk, our, dogs, perhaps, getting, as, sick, as, many, others, have., Read, Canidae's, new, formula, horror, 
stories, at, consumeraffairs.com.)","List(fed, canida, two, dog, almost, , yearsthey, alway, ate, eagerli, never, stomach, upset, got, rave, vet, muscl, weight, coat, star, food, last, summer, open, fresh, bag, kibbl, exact, packag, bought, usual, store, immedi, notic, much, lighter, color, previou, bag, fed, anyway, within, day, aussi, bulletproof, stomach, absolut, digest, issu, began, horribl, ga, softer, stool, realli, loud, liquidi, gurgl, nois, abdomen, continu, feed, entir, bag, think, hed, adjust, eventu, dog, fine, well, aussi, never, adjust, spent, month, symptom, dread, mealtim, stench, sound, hed, creat, , kibbl, next, bag, bought, old, normal, shade, brown, aussi, stomach, nois, soft, stool, aw, ga, went, away, practic, overnight, hear, read, canida, develop, new, formula, make, dog, sick, stori, peopl, whose, dog, got, far, sicker, dodgi, respons, longtim, custom, got, canida, call, complain, unannounc, chang, made, us, decid, time, new, dog, food, went, healthwis, premium, kibbl, made, compani, make, innova, california, natur, evo, worth, continu, loyal, canida, risk, dog, perhap, get, sick, mani, other, read, canida, new, formula, horror, stori, consumeraffairscom)"
24876,B000G0EP78,A105DN5CYUR89W,W. Vernon,2,13,2,1211846400,Carbquik tastes like bisquik,"I suppose some people might rate this product higher as it is a great substitute for Bisquik, however I didn't like it because everything I made from it tasted like Bisquik.",2008-05-27,"List(I, suppose, some, people, might, rate, this, product, higher, as, it, is, a, great, substitute, for, Bisquik,, however, I, didn't, like, it, because, everything, I, made, from, it, tasted, like, Bisquik.)","List(suppos, peopl, might, rate, product, higher, great, substitut, bisquik, howev, didnt, like, everyth, made, tast, like, bisquik)"
419554,B0029ZAOW8,A105Y40R0K3ZIZ,Wei H. Ho,0,0,5,1321660800,good,it arrived on time and the quality is good as i expected. i will purchase again and recommend to other people.,2011-11-19,"List(it, arrived, on, time, and, the, quality, is, good, as, i, expected., i, will, purchase, again, and, recommend, to, other, people.)","List(arriv, time, qualiti, good, expect, will, purchas, recommend, peopl)"


## Saving & Reloading Data

In [18]:
# Save cleaned dataframe back to Azure Data Lake

import json

def array_to_string(my_list):
    return json.dumps(my_list)

array_to_string_udf = udf(array_to_string, StringType())

amazon_review.withColumn('WordList',array_to_string_udf(amazon_review.WordList)) \
             .withColumn('WordListCleaned',array_to_string_udf(amazon_review.WordListCleaned)) \
             .coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true") \
             .save('abfss://workspace@ssbdatalakegen2.dfs.core.windows.net/reviews_cleaned')

# Move the single part file out of the output folder and rename it

readPath = "abfss://workspace@ssbdatalakegen2.dfs.core.windows.net/reviews_cleaned"
writePath = "abfss://workspace@ssbdatalakegen2.dfs.core.windows.net"
fname = "reviews_cleaned.csv"

# coalesce(1) produces one part file; locate it among Spark's bookkeeping files
read_name = None
for f in dbutils.fs.ls(readPath):
    if f.name.startswith("part-00000"):
        read_name = f.name

if read_name is None:
    raise FileNotFoundError("No part-00000 file found in " + readPath)

dbutils.fs.mv(readPath+"/"+read_name, writePath+"/"+fname)

# Delete the output folder and any remaining bookkeeping files

dbutils.fs.rm(readPath, recurse=True)
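
Spark's CSV writer does not support array columns, which is why the token lists are serialised to JSON strings before saving and parsed again after loading. The round trip can be checked in plain Python. The `None` guard in `string_to_array` is an added assumption here, covering cells that come back empty from the file.

```python
import json

def array_to_string(items):
    # Encode a list of tokens as a single JSON string
    return json.dumps(items)

def string_to_array(text):
    # Pass null cells through instead of failing in json.loads
    if text is None:
        return None
    return json.loads(text)

words = ["buy", "water", "coupl"]
encoded = array_to_string(words)
decoded = string_to_array(encoded)
print(encoded)
print(decoded == words)
```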

In [19]:
# Read the saved dataframe back

amazon_cleaned_schema = StructType([StructField('Id',IntegerType(),True), 
                            StructField('ProductId',StringType(),True),
                            StructField('UserId',StringType(),True),
                            StructField('ProfileName',StringType(),True),
                            StructField('HelpfulnessNumerator',IntegerType(),True),
                            StructField('HelpfulnessDenominator',IntegerType(),True),
                            StructField('Score',IntegerType(),True),
                            StructField('Time',IntegerType(),True),
                            StructField('Summary',StringType(),True),
                            StructField('Text',StringType(),True),
                            StructField('Date',DateType(),True),
                            StructField('WordList',StringType(),True),
                            StructField('WordListCleaned',StringType(),True)])

amazon_review_cleaned = spark.read.csv("abfss://workspace@ssbdatalakegen2.dfs.core.windows.net/reviews_cleaned.csv", header=True, schema=amazon_cleaned_schema)

In [20]:
def string_to_array(my_string):
    # Null cells are read from the CSV as None; pass them through
    if my_string is None:
        return None
    return json.loads(my_string)

string_to_array_udf = udf(string_to_array, ArrayType(StringType()))

amazon_review_cleaned = amazon_review_cleaned.withColumn("WordList", string_to_array_udf(amazon_review_cleaned.WordList)) \
                                             .withColumn("WordListCleaned", string_to_array_udf(amazon_review_cleaned.WordListCleaned))

display(amazon_review_cleaned)

Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,Date,WordList,WordListCleaned
290915,B005HG9ESG,#oc-R1OE1OPY34LOC8,D,0,0,3,1344902400,Not the same,"Been buying this water for a couple of years. They recently changed the packaging and for some reason or other, the water seems different. The same clean taste is just NOT there. Not buying it again.",2012-08-14,"List(Been, buying, this, water, for, a, couple, of, years., They, recently, changed, the, packaging, and, for, some, reason, or, other,, the, water, seems, different., The, same, clean, taste, is, just, NOT, there., Not, buying, it, again.)","List(buy, water, coupl, year, recent, chang, packag, reason, water, seem, differ, clean, tast, just, buy)"
136337,B006Q820X0,#oc-R39HI2LQ9LHV32,stephanie,1,2,1,1339632000,Kcups leaking and coffee tasted old,"The kcups looked okay, but many of them sprayed coffee on the counter, had coffee grounds coming out. I won't buy this again.",2012-06-14,"List(The, kcups, looked, okay,, but, many, of, them, sprayed, coffee, on, the, counter,, had, coffee, grounds, coming, out., I, won't, buy, this, again.)","List(kcup, look, okay, mani, spray, coffe, counter, coffe, ground, come, wont, buy)"
124400,B005ZBZM52,#oc-R8EZLM74R071X,jalien,0,0,1,1350172800,"Great design, not so great flavor","I wanted to try something new and this is new alright. While I love and appreciate the design, the flavor is less than desirable. It's too bitter. I drink my coffee black and really enjoy a strong cup of dark roast, but after discovering I didn't care for the flavor, I tried to disguise it with the flavor of creamer I knew I'd like (coconut cream, yum) and still the flavor of the bitter coffee was just too much.",2012-10-14,"List(I, wanted, to, try, something, new, and, this, is, new, alright., While, I, love, and, appreciate, the, design,, the, flavor, is, less, than, desirable., It's, too, bitter., I, drink, my, coffee, black, and, really, enjoy, a, strong, cup, of, dark, roast,, but, after, discovering, I, didn't, care, for, the, flavor,, I, tried, to, disguise, it, with, the, flavor, of, creamer, I, knew, I'd, like, (coconut, cream,, yum), and, still, the, flavor, of, the, bitter, coffee, was, just, too, much.)","List(want, tri, someth, new, new, alright, love, appreci, design, flavor, less, desir, bitter, drink, coffe, black, realli, enjoy, strong, cup, dark, roast, discov, didnt, care, flavor, tri, disguis, flavor, creamer, knew, id, like, coconut, cream, yum, still, flavor, bitter, coffe, just, much)"
345714,B000ED9L9E,A10103MJIKKIFE,Dozer,1,1,5,1339459200,Yummi!!!!,"Love this hot cereal! I use 1/3 cup with 1 cup of almond milk, 1 tbsp of flax oil and 1 banana for breakfast (around 400 calories) and it helps me feel full all day! It has definitely decreased my cravings for sugary things.",2012-06-12,"List(Love, this, hot, cereal!, I, use, 1/3, cup, with, 1, cup, of, almond, milk,, 1, tbsp, of, flax, oil, and, 1, banana, for, breakfast, (around, 400, calories), and, it, helps, me, feel, full, all, day!, It, has, definitely, decreased, my, cravings, for, sugary, things.)","List(love, hot, cereal, use, , cup, , cup, almond, milk, , tbsp, flax, oil, , banana, breakfast, around, , calori, help, feel, full, day, definit, decreas, crave, sugari, thing)"
422681,B001LNL0MC,A101RJ22EBXAID,Disappointed Santa,5,6,1,1295913600,Holeymoley did I order crushed peppermint,"I ordered 3 boxes of Bob's candy canes because I could not find them in stores near me and I wanted the real thing, Bob's,not Spanglers, to help in the celebration of my father's last Christmas. We wanted an old fashioned, basic heart of America Christmas with him. We were willing to pay almost any price to get original Bob's candy canes for this purpose, hence ordering them at such an inflated price on-line from Amazon. More than half came smashed to smithereens. I sent a note to customer service alerting them to this and heard nothing back.",2011-01-25,"List(I, ordered, 3, boxes, of, Bob's, candy, canes, because, I, could, not, find, them, in, stores, near, me, and, I, wanted, the, real, thing,, Bob's,not, Spanglers,, to, help, in, the, celebration, of, my, father's, last, Christmas., We, wanted, an, old, fashioned,, basic, heart, of, America, Christmas, with, him., We, were, willing, to, pay, almost, any, price, to, get, original, Bob's, candy, canes, for, this, purpose,, hence, ordering, them, at, such, an, inflated, price, on-line, from, Amazon., More, than, half, came, smashed, to, smithereens., I, sent, a, note, to, customer, service, alerting, them, to, this, and, heard, nothing, back.)","List(order, , box, bob, candi, cane, find, store, near, want, real, thing, bobsnot, spangler, help, celebr, father, last, christma, want, old, fashion, basic, heart, america, christma, will, pay, almost, price, get, origin, bob, candi, cane, purpos, henc, order, inflat, price, onlin, amazon, half, came, smash, smithereen, sent, note, custom, servic, alert, heard, noth, back)"
201480,B001EQ4M6M,A101WAUONUXTNI,high altitude baking,0,4,1,1258761600,These Don't Work!,These make more than 10 cupcakes. Don't fill the muffin cups full (unless you are using jumbo muffin tins) because they will over flow. There should be some modification for high altitude..,2009-11-21,"List(These, make, more, than, 10, cupcakes., Don't, fill, the, muffin, cups, full, (unless, you, are, using, jumbo, muffin, tins), because, they, will, over, flow., There, should, be, some, modification, for, high, altitude..)","List(make, , cupcak, dont, fill, muffin, cup, full, unless, use, jumbo, muffin, tin, will, flow, modif, high, altitud)"
461927,B000ES1R1Y,A102T4DKZN8UFC,"""Marina Thakkar """"marinathakkar""""""",0,1,5,1279843200,Great tea,I have been drinking Ahmad teas for quite some time now and love all the flavors...so good!,2010-07-23,"List(I, have, been, drinking, Ahmad, teas, for, quite, some, time, now, and, love, all, the, flavors...so, good!)","List(drink, ahmad, tea, quit, time, now, love, flavorsso, good)"
19543,B000084ETV,A105BOR5D5S7CJ,ARL,0,0,2,1235606400,"After feeding for 5 years, we're done","We fed Canidae to our two dogs for almost 5 years--they always ate it eagerly, never had stomach upset, and got raves from the vet for their muscle, weight, and coats. It *was* a 5-star food. But last summer we opened a fresh bag of kibble (same exact packaging bought from our usual store) and immediately noticed it was much lighter in color than previous bags, but we fed it anyway. Within a day our Aussie, who has a bulletproof stomach with absolutely no digestive issues, began to have horrible gas, softer stools, and really loud, liquidy gurgling noises in his abdomen. We continued to feed the entire bag, thinking that he'd adjust to it eventually because our other dog was fine. Well, our Aussie never did adjust to it; he spent a month having those symptoms, and we dreaded mealtimes because of the stench and sounds he'd create. The kibble in the next bag we bought was the old normal shade of brown, and our Aussie's stomach noises, soft stools, and awful gas went away practically overnight. But by then I was hearing and reading that Canidae had developed a new formula that was making dogs sick. The stories about people whose dogs got *far* sicker than ours, and the dodgy responses that those long-time customers got from Canidae when they called to complain about the unannounced change, made us decide it was time for a new dog food. We went with Healthwise, a premium kibble made by the same company that makes Innova, California Natural, and Evo. It's not worth it to continue to be loyal to Canidae and to risk our dogs perhaps getting as sick as many others have. 
Read Canidae's new formula horror stories at consumeraffairs.com.",2009-02-26,"List(We, fed, Canidae, to, our, two, dogs, for, almost, 5, years--they, always, ate, it, eagerly,, never, had, stomach, upset,, and, got, raves, from, the, vet, for, their, muscle,, weight,, and, coats., It, *was*, a, 5-star, food., But, last, summer, we, opened, a, fresh, bag, of, kibble, (same, exact, packaging, bought, from, our, usual, store), and, immediately, noticed, it, was, much, lighter, in, color, than, previous, bags,, but, we, fed, it, anyway., Within, a, day, our, Aussie,, who, has, a, bulletproof, stomach, with, absolutely, no, digestive, issues,, began, to, have, horrible, gas,, softer, stools,, and, really, loud,, liquidy, gurgling, noises, in, his, abdomen., We, continued, to, feed, the, entire, bag,, thinking, that, he'd, adjust, to, it, eventually, because, our, other, dog, was, fine., Well,, our, Aussie, never, did, adjust, to, it;, he, spent, a, month, having, those, symptoms,, and, we, dreaded, mealtimes, because, of, the, stench, and, sounds, he'd, create.The, kibble, in, the, next, bag, we, bought, was, the, old, normal, shade, of, brown,, and, our, Aussie's, stomach, noises,, soft, stools,, and, awful, gas, went, away, practically, overnight., But, by, then, I, was, hearing, and, reading, that, Canidae, had, developed, a, new, formula, that, was, making, dogs, sick., The, stories, about, people, whose, dogs, got, *far*, sicker, than, ours,, and, the, dodgy, responses, that, those, long-time, customers, got, from, Canidae, when, they, called, to, complain, about, the, unannounced, change,, made, us, decide, it, was, time, for, a, new, dog, food., We, went, with, Healthwise,, a, premium, kibble, made, by, the, same, company, that, makes, Innova,, California, Natural,, and, Evo., It's, not, worth, it, to, continue, to, be, loyal, to, Canidae, and, to, risk, our, dogs, perhaps, getting, as, sick, as, many, others, have., Read, Canidae's, new, formula, horror, 
stories, at, consumeraffairs.com.)","List(fed, canida, two, dog, almost, , yearsthey, alway, ate, eagerli, never, stomach, upset, got, rave, vet, muscl, weight, coat, star, food, last, summer, open, fresh, bag, kibbl, exact, packag, bought, usual, store, immedi, notic, much, lighter, color, previou, bag, fed, anyway, within, day, aussi, bulletproof, stomach, absolut, digest, issu, began, horribl, ga, softer, stool, realli, loud, liquidi, gurgl, nois, abdomen, continu, feed, entir, bag, think, hed, adjust, eventu, dog, fine, well, aussi, never, adjust, spent, month, symptom, dread, mealtim, stench, sound, hed, creat, , kibbl, next, bag, bought, old, normal, shade, brown, aussi, stomach, nois, soft, stool, aw, ga, went, away, practic, overnight, hear, read, canida, develop, new, formula, make, dog, sick, stori, peopl, whose, dog, got, far, sicker, dodgi, respons, longtim, custom, got, canida, call, complain, unannounc, chang, made, us, decid, time, new, dog, food, went, healthwis, premium, kibbl, made, compani, make, innova, california, natur, evo, worth, continu, loyal, canida, risk, dog, perhap, get, sick, mani, other, read, canida, new, formula, horror, stori, consumeraffairscom)"
24876,B000G0EP78,A105DN5CYUR89W,W. Vernon,2,13,2,1211846400,Carbquik tastes like bisquik,"I suppose some people might rate this product higher as it is a great substitute for Bisquik, however I didn't like it because everything I made from it tasted like Bisquik.",2008-05-27,"List(I, suppose, some, people, might, rate, this, product, higher, as, it, is, a, great, substitute, for, Bisquik,, however, I, didn't, like, it, because, everything, I, made, from, it, tasted, like, Bisquik.)","List(suppos, peopl, might, rate, product, higher, great, substitut, bisquik, howev, didnt, like, everyth, made, tast, like, bisquik)"
419554,B0029ZAOW8,A105Y40R0K3ZIZ,Wei H. Ho,0,0,5,1321660800,good,it arrived on time and the quality is good as i expected. i will purchase again and recommend to other people.,2011-11-19,"List(it, arrived, on, time, and, the, quality, is, good, as, i, expected., i, will, purchase, again, and, recommend, to, other, people.)","List(arriv, time, qualiti, good, expect, will, purchas, recommend, peopl)"


## Review Score Prediction

For comparison, outside Spark we would typically use scikit-learn for machine learning in Python (read more: https://scikit-learn.org/stable/user_guide.html) and NLTK for natural language processing (read more: https://www.nltk.org/).

In [23]:
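# Extract bigrams from the cleaned word list

from pyspark.ml.feature import NGram
from pyspark.sql.functions import array_union

ngram = NGram(n=2, inputCol="WordListCleaned", outputCol="bigram")
amazon_review = ngram.transform(amazon_review)

# Combine unigrams and bigrams into one column.
# Note: array_union drops duplicates, so a token repeated within a review appears only once in "ngrams".
amazon_review = amazon_review.withColumn("ngrams", array_union(amazon_review.WordListCleaned, amazon_review.bigram))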
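
To make the shape of the `bigram` and `ngrams` columns concrete, here is a plain-Python sketch of the same two steps on a single review. It does not use Spark; the helper names `bigrams` and `union_without_duplicates` are made up for illustration, and the token list is copied from one of the reviews in this dataset.

```python
def bigrams(tokens):
    """Join each pair of adjacent tokens with a space, as NGram(n=2) does."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def union_without_duplicates(first, second):
    """Concatenate two lists, keeping only the first occurrence of each item."""
    seen = set()
    merged = []
    for item in first + second:
        if item not in seen:
            seen.add(item)
            merged.append(item)
    return merged


# Cleaned tokens of one review; "coffe" occurs twice.
tokens = ["kcup", "look", "okay", "mani", "spray", "coffe",
          "counter", "coffe", "ground", "come", "wont", "buy"]

bigram_list = bigrams(tokens)
ngrams = union_without_duplicates(tokens, bigram_list)

print(bigram_list[:3])                             # first three bigrams
print(len(tokens), len(bigram_list), len(ngrams))  # 12 11 22
```

Because the union removes the second `coffe`, any term frequency computed later from `ngrams` counts each distinct term at most once per review.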

In [24]:
# Extract 5-star and 1-star reviews for prediction

prediction_df = amazon_review.where((amazon_review.Score == 1) | (amazon_review.Score == 5))
display(prediction_df)

Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,Date,WordList,WordListCleaned,bigram,ngrams
136337,B006Q820X0,#oc-R39HI2LQ9LHV32,stephanie,1,2,1,1339632000,Kcups leaking and coffee tasted old,"The kcups looked okay, but many of them sprayed coffee on the counter, had coffee grounds coming out. I won't buy this again.",2012-06-14,"List(The, kcups, looked, okay,, but, many, of, them, sprayed, coffee, on, the, counter,, had, coffee, grounds, coming, out., I, won't, buy, this, again.)","List(kcup, look, okay, mani, spray, coffe, counter, coffe, ground, come, wont, buy)","List(kcup look, look okay, okay mani, mani spray, spray coffe, coffe counter, counter coffe, coffe ground, ground come, come wont, wont buy)","List(kcup, look, okay, mani, spray, coffe, counter, ground, come, wont, buy, kcup look, look okay, okay mani, mani spray, spray coffe, coffe counter, counter coffe, coffe ground, ground come, come wont, wont buy)"
124400,B005ZBZM52,#oc-R8EZLM74R071X,jalien,0,0,1,1350172800,"Great design, not so great flavor","I wanted to try something new and this is new alright. While I love and appreciate the design, the flavor is less than desirable. It's too bitter. I drink my coffee black and really enjoy a strong cup of dark roast, but after discovering I didn't care for the flavor, I tried to disguise it with the flavor of creamer I knew I'd like (coconut cream, yum) and still the flavor of the bitter coffee was just too much.",2012-10-14,"List(I, wanted, to, try, something, new, and, this, is, new, alright., While, I, love, and, appreciate, the, design,, the, flavor, is, less, than, desirable., It's, too, bitter., I, drink, my, coffee, black, and, really, enjoy, a, strong, cup, of, dark, roast,, but, after, discovering, I, didn't, care, for, the, flavor,, I, tried, to, disguise, it, with, the, flavor, of, creamer, I, knew, I'd, like, (coconut, cream,, yum), and, still, the, flavor, of, the, bitter, coffee, was, just, too, much.)","List(want, tri, someth, new, new, alright, love, appreci, design, flavor, less, desir, bitter, drink, coffe, black, realli, enjoy, strong, cup, dark, roast, discov, didnt, care, flavor, tri, disguis, flavor, creamer, knew, id, like, coconut, cream, yum, still, flavor, bitter, coffe, just, much)","List(want tri, tri someth, someth new, new new, new alright, alright love, love appreci, appreci design, design flavor, flavor less, less desir, desir bitter, bitter drink, drink coffe, coffe black, black realli, realli enjoy, enjoy strong, strong cup, cup dark, dark roast, roast discov, discov didnt, didnt care, care flavor, flavor tri, tri disguis, disguis flavor, flavor creamer, creamer knew, knew id, id like, like coconut, coconut cream, cream yum, yum still, still flavor, flavor bitter, bitter coffe, coffe just, just much)","List(want, tri, someth, new, alright, love, appreci, design, flavor, less, desir, bitter, drink, coffe, black, realli, enjoy, strong, 
cup, dark, roast, discov, didnt, care, disguis, creamer, knew, id, like, coconut, cream, yum, still, just, much, want tri, tri someth, someth new, new new, new alright, alright love, love appreci, appreci design, design flavor, flavor less, less desir, desir bitter, bitter drink, drink coffe, coffe black, black realli, realli enjoy, enjoy strong, strong cup, cup dark, dark roast, roast discov, discov didnt, didnt care, care flavor, flavor tri, tri disguis, disguis flavor, flavor creamer, creamer knew, knew id, id like, like coconut, coconut cream, cream yum, yum still, still flavor, flavor bitter, bitter coffe, coffe just, just much)"
345714,B000ED9L9E,A10103MJIKKIFE,Dozer,1,1,5,1339459200,Yummi!!!!,"Love this hot cereal! I use 1/3 cup with 1 cup of almond milk, 1 tbsp of flax oil and 1 banana for breakfast (around 400 calories) and it helps me feel full all day! It has definitely decreased my cravings for sugary things.",2012-06-12,"List(Love, this, hot, cereal!, I, use, 1/3, cup, with, 1, cup, of, almond, milk,, 1, tbsp, of, flax, oil, and, 1, banana, for, breakfast, (around, 400, calories), and, it, helps, me, feel, full, all, day!, It, has, definitely, decreased, my, cravings, for, sugary, things.)","List(love, hot, cereal, use, , cup, , cup, almond, milk, , tbsp, flax, oil, , banana, breakfast, around, , calori, help, feel, full, day, definit, decreas, crave, sugari, thing)","List(love hot, hot cereal, cereal use, use , cup, cup , cup, cup almond, almond milk, milk , tbsp, tbsp flax, flax oil, oil , banana, banana breakfast, breakfast around, around , calori, calori help, help feel, feel full, full day, day definit, definit decreas, decreas crave, crave sugari, sugari thing)","List(love, hot, cereal, use, , cup, almond, milk, tbsp, flax, oil, banana, breakfast, around, calori, help, feel, full, day, definit, decreas, crave, sugari, thing, love hot, hot cereal, cereal use, use , cup, cup , cup almond, almond milk, milk , tbsp, tbsp flax, flax oil, oil , banana, banana breakfast, breakfast around, around , calori, calori help, help feel, feel full, full day, day definit, definit decreas, decreas crave, crave sugari, sugari thing)"
422681,B001LNL0MC,A101RJ22EBXAID,Disappointed Santa,5,6,1,1295913600,Holeymoley did I order crushed peppermint,"I ordered 3 boxes of Bob's candy canes because I could not find them in stores near me and I wanted the real thing, Bob's,not Spanglers, to help in the celebration of my father's last Christmas. We wanted an old fashioned, basic heart of America Christmas with him. We were willing to pay almost any price to get original Bob's candy canes for this purpose, hence ordering them at such an inflated price on-line from Amazon. More than half came smashed to smithereens. I sent a note to customer service alerting them to this and heard nothing back.",2011-01-25,"List(I, ordered, 3, boxes, of, Bob's, candy, canes, because, I, could, not, find, them, in, stores, near, me, and, I, wanted, the, real, thing,, Bob's,not, Spanglers,, to, help, in, the, celebration, of, my, father's, last, Christmas., We, wanted, an, old, fashioned,, basic, heart, of, America, Christmas, with, him., We, were, willing, to, pay, almost, any, price, to, get, original, Bob's, candy, canes, for, this, purpose,, hence, ordering, them, at, such, an, inflated, price, on-line, from, Amazon., More, than, half, came, smashed, to, smithereens., I, sent, a, note, to, customer, service, alerting, them, to, this, and, heard, nothing, back.)","List(order, , box, bob, candi, cane, find, store, near, want, real, thing, bobsnot, spangler, help, celebr, father, last, christma, want, old, fashion, basic, heart, america, christma, will, pay, almost, price, get, origin, bob, candi, cane, purpos, henc, order, inflat, price, onlin, amazon, half, came, smash, smithereen, sent, note, custom, servic, alert, heard, noth, back)","List(order , box, box bob, bob candi, candi cane, cane find, find store, store near, near want, want real, real thing, thing bobsnot, bobsnot spangler, spangler help, help celebr, celebr father, father last, last christma, christma want, want old, old fashion, fashion basic, basic heart, 
heart america, america christma, christma will, will pay, pay almost, almost price, price get, get origin, origin bob, bob candi, candi cane, cane purpos, purpos henc, henc order, order inflat, inflat price, price onlin, onlin amazon, amazon half, half came, came smash, smash smithereen, smithereen sent, sent note, note custom, custom servic, servic alert, alert heard, heard noth, noth back)","List(order, , box, bob, candi, cane, find, store, near, want, real, thing, bobsnot, spangler, help, celebr, father, last, christma, old, fashion, basic, heart, america, will, pay, almost, price, get, origin, purpos, henc, inflat, onlin, amazon, half, came, smash, smithereen, sent, note, custom, servic, alert, heard, noth, back, order , box, box bob, bob candi, candi cane, cane find, find store, store near, near want, want real, real thing, thing bobsnot, bobsnot spangler, spangler help, help celebr, celebr father, father last, last christma, christma want, want old, old fashion, fashion basic, basic heart, heart america, america christma, christma will, will pay, pay almost, almost price, price get, get origin, origin bob, cane purpos, purpos henc, henc order, order inflat, inflat price, price onlin, onlin amazon, amazon half, half came, came smash, smash smithereen, smithereen sent, sent note, note custom, custom servic, servic alert, alert heard, heard noth, noth back)"
201480,B001EQ4M6M,A101WAUONUXTNI,high altitude baking,0,4,1,1258761600,These Don't Work!,These make more than 10 cupcakes. Don't fill the muffin cups full (unless you are using jumbo muffin tins) because they will over flow. There should be some modification for high altitude..,2009-11-21,"List(These, make, more, than, 10, cupcakes., Don't, fill, the, muffin, cups, full, (unless, you, are, using, jumbo, muffin, tins), because, they, will, over, flow., There, should, be, some, modification, for, high, altitude..)","List(make, , cupcak, dont, fill, muffin, cup, full, unless, use, jumbo, muffin, tin, will, flow, modif, high, altitud)","List(make , cupcak, cupcak dont, dont fill, fill muffin, muffin cup, cup full, full unless, unless use, use jumbo, jumbo muffin, muffin tin, tin will, will flow, flow modif, modif high, high altitud)","List(make, , cupcak, dont, fill, muffin, cup, full, unless, use, jumbo, tin, will, flow, modif, high, altitud, make , cupcak, cupcak dont, dont fill, fill muffin, muffin cup, cup full, full unless, unless use, use jumbo, jumbo muffin, muffin tin, tin will, will flow, flow modif, modif high, high altitud)"
461927,B000ES1R1Y,A102T4DKZN8UFC,"""Marina Thakkar """"marinathakkar""""""",0,1,5,1279843200,Great tea,I have been drinking Ahmad teas for quite some time now and love all the flavors...so good!,2010-07-23,"List(I, have, been, drinking, Ahmad, teas, for, quite, some, time, now, and, love, all, the, flavors...so, good!)","List(drink, ahmad, tea, quit, time, now, love, flavorsso, good)","List(drink ahmad, ahmad tea, tea quit, quit time, time now, now love, love flavorsso, flavorsso good)","List(drink, ahmad, tea, quit, time, now, love, flavorsso, good, drink ahmad, ahmad tea, tea quit, quit time, time now, now love, love flavorsso, flavorsso good)"
419554,B0029ZAOW8,A105Y40R0K3ZIZ,Wei H. Ho,0,0,5,1321660800,good,it arrived on time and the quality is good as i expected. i will purchase again and recommend to other people.,2011-11-19,"List(it, arrived, on, time, and, the, quality, is, good, as, i, expected., i, will, purchase, again, and, recommend, to, other, people.)","List(arriv, time, qualiti, good, expect, will, purchas, recommend, peopl)","List(arriv time, time qualiti, qualiti good, good expect, expect will, will purchas, purchas recommend, recommend peopl)","List(arriv, time, qualiti, good, expect, will, purchas, recommend, peopl, arriv time, time qualiti, qualiti good, good expect, expect will, will purchas, purchas recommend, recommend peopl)"
414477,B002PHOHZ0,A107JX0JQRQUUP,Kevin J. Malantic,0,0,5,1327622400,Delicious and fresh alternative to the old fruitbasket standby,"Let's face it, flowers get boring after awhile and there's only so much chocolate you can buy as presents before people start calling you the 'Proflowers' guy behind your back. This fruit and treat gift is a fantastic alternative to the norm. I purchased 3 of these for relatives and they all loved the freshness of the fruit and the variety of the treats. Highly recommended and, what's more, the company shipped everything on time and it all arrive on the days that they stated.",2012-01-27,"List(Let's, face, it,, flowers, get, boring, after, awhile, and, there's, only, so, much, chocolate, you, can, buy, as, presents, before, people, start, calling, you, the, 'Proflowers', guy, behind, your, back., This, fruit, and, treat, gift, is, a, fantastic, alternative, to, the, norm., I, purchased, 3, of, these, for, relatives, and, they, all, loved, the, freshness, of, the, fruit, and, the, variety, of, the, treats., Highly, recommended, and,, what's, more,, the, company, shipped, everything, on, time, and, it, all, arrive, on, the, days, that, they, stated.)","List(let, face, flower, get, bore, awhil, there, much, chocol, can, buy, present, peopl, start, call, proflow, guy, behind, back, fruit, treat, gift, fantast, altern, norm, purchas, , rel, love, fresh, fruit, varieti, treat, highli, recommend, what, compani, ship, everyth, time, arriv, day, state)","List(let face, face flower, flower get, get bore, bore awhil, awhil there, there much, much chocol, chocol can, can buy, buy present, present peopl, peopl start, start call, call proflow, proflow guy, guy behind, behind back, back fruit, fruit treat, treat gift, gift fantast, fantast altern, altern norm, norm purchas, purchas , rel, rel love, love fresh, fresh fruit, fruit varieti, varieti treat, treat highli, highli recommend, recommend what, what compani, compani ship, ship everyth, everyth 
time, time arriv, arriv day, day state)","List(let, face, flower, get, bore, awhil, there, much, chocol, can, buy, present, peopl, start, call, proflow, guy, behind, back, fruit, treat, gift, fantast, altern, norm, purchas, , rel, love, fresh, varieti, highli, recommend, what, compani, ship, everyth, time, arriv, day, state, let face, face flower, flower get, get bore, bore awhil, awhil there, there much, much chocol, chocol can, can buy, buy present, present peopl, peopl start, start call, call proflow, proflow guy, guy behind, behind back, back fruit, fruit treat, treat gift, gift fantast, fantast altern, altern norm, norm purchas, purchas , rel, rel love, love fresh, fresh fruit, fruit varieti, varieti treat, treat highli, highli recommend, recommend what, what compani, compani ship, ship everyth, everyth time, time arriv, arriv day, day state)"
497417,B001SB0ZQ4,A108M02UNZ7545,Stephen Mills,1,1,1,1304208000,Disappointment with Amazon/Nutricity,"I am furious that I put in my first order for something other than books, dvds and cds from an Amazon based Seller, Nutricity, and they sent the wrong order. And, it's not that it's just the wrong order, but the title/headline on the product and the order itself specifically said DECAF!! I needed and wanted and ordered DECAF and, DECAFFEINATED is printed on the cans of the Product I was SUPPOSED to get and ordered - and they STILL sent me the wrong product with Caffeine. Unbelievable.",2011-05-01,"List(I, am, furious, that, I, put, in, my, first, order, for, something, other, than, books,, dvds, and, cds, from, an, Amazon, based, Seller,, Nutricity,, and, they, sent, the, wrong, order., And,, it's, not, that, it's, just, the, wrong, order,, but, the, title/headline, on, the, product, and, the, order, itself, specifically, said, DECAF!!, I, needed, and, wanted, and, ordered, DECAF, and,, DECAFFEINATED, is, printed, on, the, cans, of, the, Product, I, was, SUPPOSED, to, get, and, ordered, -, and, they, STILL, sent, me, the, wrong, product, with, Caffeine., Unbelievable.)","List(furiou, put, first, order, someth, book, dvd, cd, amazon, base, seller, nutric, sent, wrong, order, just, wrong, order, titleheadlin, product, order, specif, said, decaf, need, want, order, decaf, decaffein, print, can, product, suppos, get, order, , still, sent, wrong, product, caffein, unbeliev)","List(furiou put, put first, first order, order someth, someth book, book dvd, dvd cd, cd amazon, amazon base, base seller, seller nutric, nutric sent, sent wrong, wrong order, order just, just wrong, wrong order, order titleheadlin, titleheadlin product, product order, order specif, specif said, said decaf, decaf need, need want, want order, order decaf, decaf decaffein, decaffein print, print can, can product, product suppos, suppos get, get order, order , still, still sent, sent wrong, wrong 
product, product caffein, caffein unbeliev)","List(furiou, put, first, order, someth, book, dvd, cd, amazon, base, seller, nutric, sent, wrong, just, titleheadlin, product, specif, said, decaf, need, want, decaffein, print, can, suppos, get, , still, caffein, unbeliev, furiou put, put first, first order, order someth, someth book, book dvd, dvd cd, cd amazon, amazon base, base seller, seller nutric, nutric sent, sent wrong, wrong order, order just, just wrong, order titleheadlin, titleheadlin product, product order, order specif, specif said, said decaf, decaf need, need want, want order, order decaf, decaf decaffein, decaffein print, print can, can product, product suppos, suppos get, get order, order , still, still sent, wrong product, product caffein, caffein unbeliev)"
291053,B005HG9ESG,A10LDAXFO052F7,Alexandra,4,6,5,1335225600,Amazing water,"This water is very good for the immune system... Love it, it's done wonders for my family.. Will always purchase this water.",2012-04-24,"List(This, water, is, very, good, for, the, immune, system..., Love, it,, it's, done, wonders, for, my, family.., Will, always, purchase, this, water.)","List(water, good, immun, system, love, done, wonder, famili, will, alway, purchas, water)","List(water good, good immun, immun system, system love, love done, done wonder, wonder famili, famili will, will alway, alway purchas, purchas water)","List(water, good, immun, system, love, done, wonder, famili, will, alway, purchas, water good, good immun, immun system, system love, love done, done wonder, wonder famili, famili will, will alway, alway purchas, purchas water)"


### TF-IDF with Hashing Trick + Random Forest

In [26]:
# Copy prediction data

prediction_tfidf_hash = prediction_df.select('*')

In [27]:
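# Compute TF-IDF values for unigrams and bigrams, hashed into 2**12 = 4096 buckets
# Note: the IDF model is fit on all selected reviews, before any train/test split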

from pyspark.ml.feature import HashingTF, IDF

hashtf = HashingTF(numFeatures=2**12, inputCol="ngrams", outputCol='TF')
tf = hashtf.transform(prediction_tfidf_hash)
idf = IDF(minDocFreq=3, inputCol="TF", outputCol="TF-IDF")
idfModel = idf.fit(tf)
prediction_tfidf_hash = idfModel.transform(tf)
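
The hashing trick and the IDF weighting can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Spark's implementation: `zlib.crc32` stands in for the MurmurHash3 function that `HashingTF` uses, the IDF formula `log((m + 1) / (df + 1))` is the one given in the Spark MLlib documentation, and the three tiny documents are invented.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 2 ** 12  # same bucket count as the cell above


def bucket(term, num_features=NUM_FEATURES):
    """Map a term to a bucket index (crc32 is a stand-in hash function)."""
    return zlib.crc32(term.encode("utf-8")) % num_features


def hashed_tf(terms):
    """Sparse term-frequency vector: {bucket index: count}."""
    return Counter(bucket(t) for t in terms)


def fit_idf(tf_vectors, min_doc_freq=0):
    """IDF per bucket: log((m + 1) / (df + 1)).

    Buckets seen in fewer than min_doc_freq documents get weight 0.
    """
    m = len(tf_vectors)
    doc_freq = Counter(idx for tf in tf_vectors for idx in tf)
    return {
        idx: math.log((m + 1) / (df + 1)) if df >= min_doc_freq else 0.0
        for idx, df in doc_freq.items()
    }


def tfidf(tf, idf):
    """Multiply each term count by the IDF weight of its bucket."""
    return {idx: count * idf[idx] for idx, count in tf.items()}


# Hypothetical documents, already expanded to unigrams + bigrams.
docs = [
    ["good", "tea", "good tea"],
    ["good", "coffe", "good coffe"],
    ["bad", "coffe", "bad coffe"],
]
tf_vectors = [hashed_tf(d) for d in docs]
idf = fit_idf(tf_vectors, min_doc_freq=1)
weighted = [tfidf(tf, idf) for tf in tf_vectors]
print(weighted[0])
```

Different terms can land in the same bucket, so a smaller `numFeatures` trades memory for more collisions. No vocabulary has to be built or stored, which is the main appeal of the hashing approach.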

In [28]:
# Random Forest

from pyspark.ml import Pipeline
from pyspark.ml.feature import IndexToString, StringIndexer
from pyspark.ml.classification import RandomForestClassifier

# Index the labels; by default StringIndexer orders labels by descending frequency
labelIndexer = StringIndexer(inputCol="Score", outputCol="indexedScore").fit(prediction_tfidf_hash)
rf = RandomForestClassifier(labelCol="indexedScore", featuresCol="TF-IDF", numTrees=40)
# Map predicted indices back to the original Score labels
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)

pipeline = Pipeline(stages=[labelIndexer, rf, labelConverter])

# 70/30 train/test split (no seed is set, so the split changes between runs)
(trainingData, testData) = prediction_tfidf_hash.randomSplit([0.7, 0.3])

rf_model = pipeline.fit(trainingData)
predictions = rf_model.transform(testData)
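
The label handling in this pipeline can be pictured with a small plain-Python sketch. It mimics `StringIndexer`'s default ordering (most frequent label gets index 0) and the reverse lookup done by `IndexToString`. The `scores` list is hypothetical, and breaking ties alphabetically is an assumption made only for this sketch.

```python
from collections import Counter


def fit_label_index(labels):
    """Return distinct labels ordered by descending frequency.

    The position of a label in the returned list is its index.
    Ties are broken alphabetically (an assumption for this sketch).
    """
    counts = Counter(str(label) for label in labels)
    return sorted(counts, key=lambda lab: (-counts[lab], lab))


scores = [5, 5, 1, 5, 5, 1, 5]  # hypothetical Score column

index_to_label = fit_label_index(scores)
label_to_index = {lab: float(i) for i, lab in enumerate(index_to_label)}

# Forward mapping (StringIndexer) and reverse mapping (IndexToString).
indexed = [label_to_index[str(s)] for s in scores]
restored = [index_to_label[int(i)] for i in indexed]

print(index_to_label)  # ['5', '1']
print(indexed)
```

This is why a `Score` of 5 shows up as `indexedScore` 0.0 in the prediction output: the more frequent label receives the lower index.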


In [29]:
display(predictions.select("Score", "indexedScore", "rawPrediction", "probability", "prediction", "predictedLabel"))

Score,indexedScore,rawPrediction,probability,prediction,predictedLabel
5,0.0,"List(1, 2, List(), List(34.25834695787432, 5.741653042125674))","List(1, 2, List(), List(0.856458673946858, 0.14354132605314185))",0.0,5
5,0.0,"List(1, 2, List(), List(35.311492102774324, 4.6885078972256675))","List(1, 2, List(), List(0.8827873025693582, 0.11721269743064171))",0.0,5
5,0.0,"List(1, 2, List(), List(34.902303655943456, 5.097696344056532))","List(1, 2, List(), List(0.8725575913985867, 0.12744240860141334))",0.0,5
5,0.0,"List(1, 2, List(), List(34.89124611759117, 5.108753882408826))","List(1, 2, List(), List(0.8722811529397794, 0.12771884706022066))",0.0,5
1,1.0,"List(1, 2, List(), List(34.85099831831011, 5.1490016816898985))","List(1, 2, List(), List(0.8712749579577527, 0.12872504204224744))",0.0,5
5,0.0,"List(1, 2, List(), List(35.03872813353971, 4.961271866460284))","List(1, 2, List(), List(0.8759682033384928, 0.1240317966615071))",0.0,5
5,0.0,"List(1, 2, List(), List(35.05945390113855, 4.940546098861447))","List(1, 2, List(), List(0.8764863475284639, 0.12351365247153619))",0.0,5
5,0.0,"List(1, 2, List(), List(34.35419185833013, 5.645808141669868))","List(1, 2, List(), List(0.8588547964582534, 0.1411452035417467))",0.0,5
5,0.0,"List(1, 2, List(), List(35.01350216054717, 4.986497839452833))","List(1, 2, List(), List(0.8753375540136792, 0.12466244598632084))",0.0,5
5,0.0,"List(1, 2, List(), List(35.313246150942064, 4.686753849057935))","List(1, 2, List(), List(0.8828311537735516, 0.11716884622644837))",0.0,5


In [30]:
# Calculate AUC on the held-out test set

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol="indexedScore", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print("AUC = %g" % auc)
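
As a reminder of what the metric measures, the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force over all pairs in plain Python. It is not how Spark computes the value internally, and the labels and scores are invented.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical labels and classifier scores.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3]
print("AUC = %g" % roc_auc(labels, scores))  # AUC = 0.75
```

A model that scores every review identically gets 0.5 under this definition, which is worth keeping in mind when the predicted probabilities are as similar across rows as in the sample output above.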

In [31]:
# Performance evaluation with 10-fold cross validation

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
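
# An empty grid yields a single parameter combination, so cross validation
# only estimates the performance of the pipeline as configured above
paramGrid = ParamGridBuilder().build()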
cv = CrossValidator(estimator=pipeline, evaluator=evaluator, estimatorParamMaps=paramGrid, numFolds=10)
cvModel = cv.fit(prediction_tfidf_hash)

In [32]:
print("Average AUC = %g" % cvModel.avgMetrics[0])
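
The averaging behind `avgMetrics` can be sketched in plain Python as below. This is a simplified picture under assumptions: rows are dealt into equal-sized folds here, whereas Spark assigns rows to folds randomly, and `dummy_fit_and_score` is a placeholder for fitting the pipeline and evaluating AUC.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and deal them into k roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n_rows, k, fit_and_score):
    """Average the metric over k train/validation splits."""
    folds = k_fold_indices(n_rows, k)
    metrics = []
    for i, validation in enumerate(folds):
        # Every fold except the i-th is used for training.
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        metrics.append(fit_and_score(training, validation))
    return sum(metrics) / k


def dummy_fit_and_score(training, validation):
    """Placeholder metric: the share of rows held out for validation."""
    return len(validation) / (len(training) + len(validation))


average = cross_validate(n_rows=50, k=10, fit_and_score=dummy_fit_and_score)
print("Average metric = %g" % average)
```

Each row is used for validation exactly once and for training in the other nine rounds, so the pipeline is fit ten times in total.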

### Doc2Vec + Random Forest

In [34]:
# Copy prediction data

prediction_doc2vec = prediction_df.select('*')

In [35]:
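# Calculate document vectors
# Spark ML has no Doc2Vec estimator; Word2VecModel.transform() represents each
# document as the average of its word vectors (vectorSize defaults to 100)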

from pyspark.ml.feature import Word2Vec

word2Vec = Word2Vec(inputCol="WordListCleaned", outputCol="doc2vec")
w2v_model = word2Vec.fit(prediction_doc2vec)

prediction_doc2vec = w2v_model.transform(prediction_doc2vec)
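
The document vector produced here is an average of word vectors, which the following plain-Python sketch illustrates. The two-dimensional `word_vectors` are invented, and dividing by the total token count (so that out-of-vocabulary tokens pull the average toward zero) is an assumption about Spark's behavior rather than something shown in this notebook.

```python
def document_vector(tokens, word_vectors, size):
    """Average the word vectors of a document.

    Tokens missing from the vocabulary contribute nothing to the sum but
    still count in the denominator (an assumption in this sketch).
    """
    if not tokens:
        return [0.0] * size
    total = [0.0] * size
    for token in tokens:
        vector = word_vectors.get(token)
        if vector is not None:
            total = [t + v for t, v in zip(total, vector)]
    return [t / len(tokens) for t in total]


# Hypothetical 2-dimensional word vectors.
word_vectors = {"good": [1.0, 0.0], "tea": [0.0, 1.0]}

print(document_vector(["good", "tea"], word_vectors, size=2))      # [0.5, 0.5]
print(document_vector(["good", "unknown"], word_vectors, size=2))  # [0.5, 0.0]
```

Averaging discards word order, so two reviews with the same tokens in a different sequence map to the same vector.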

In [36]:
display(prediction_doc2vec)

Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,Date,WordList,WordListCleaned,bigram,ngrams,doc2vec
136337,B006Q820X0,#oc-R39HI2LQ9LHV32,stephanie,1,2,1,1339632000,Kcups leaking and coffee tasted old,"The kcups looked okay, but many of them sprayed coffee on the counter, had coffee grounds coming out. I won't buy this again.",2012-06-14,"List(The, kcups, looked, okay,, but, many, of, them, sprayed, coffee, on, the, counter,, had, coffee, grounds, coming, out., I, won't, buy, this, again.)","List(kcup, look, okay, mani, spray, coffe, counter, coffe, ground, come, wont, buy)","List(kcup look, look okay, okay mani, mani spray, spray coffe, coffe counter, counter coffe, coffe ground, ground come, come wont, wont buy)","List(kcup, look, okay, mani, spray, coffe, counter, ground, come, wont, buy, kcup look, look okay, okay mani, mani spray, spray coffe, coffe counter, counter coffe, coffe ground, ground come, come wont, wont buy)","List(1, 100, List(), List(-0.11070514687647422, 0.07480503991246223, -0.09966146717003235, -0.05415507995833953, 0.09616886296619971, 0.04276601869302491, 0.001795660345427071, -0.011994701509441558, -0.001392077732210358, -0.06035017354103426, 0.06006408979495366, 0.06617037571656206, 0.04775558963107566, 0.0875061034845809, -0.0203114782149593, 0.02688704834630092, -0.02852808559934298, 0.008171051376848482, -0.00893517614652713, -0.0013350746594369411, -0.0871660765260458, -0.03036282939137891, -0.01140535343438387, -0.12211513488243023, 0.014471503750731547, -0.06954166983875135, -0.030523703123132385, -0.035681207974751786, -0.06845997211833794, 0.1751752517496546, -0.003685681770245234, -0.10145162728925546, 0.04248712809445957, -0.008931353125565995, -0.009937205914563188, 0.1514523564837873, 0.08226896659471095, 0.07908314187079668, 0.02311551348248031, -0.03882682118273806, -0.060955917462706566, 0.02854286708558599, 0.07569438715775807, -0.03359428715581695, 0.08153154204289118, 0.0771238449960947, 0.03548233529242376, 0.0148899732157588, 0.09837541806822021, -0.009523687496160466, 0.059004973309735455, -0.0509311967374136, 
0.13392736731717983, 0.0011397929241259892, -0.023425499520575006, 0.007224394008517265, 0.16732169974905747, -0.12265889534804349, -0.10192612818597505, -0.11927961573625603, 0.09015046712011099, -0.13455304130911827, 0.01785769717146953, 0.0785536131200691, -0.0651134136132896, -0.003634004659640292, 0.18758404805945852, 0.047738587561373905, 0.04151505387077729, 0.053369783563539386, -0.2371798498982874, -0.17362718128909665, 0.07997869079311688, -0.006412364232043425, -0.06205862512191136, 0.02315822437716027, 0.05182298528961837, 0.010251031257212162, -0.06294083238268891, 0.02299553466339906, 0.08857928883905211, 0.17570667279263336, -0.1250551516811053, 0.013432381713452438, 0.0243382448485742, -0.0057703068911602404, 0.05804337561130524, 0.06014510989189148, -0.10078359449592729, -0.023347922775428742, 0.14351562620140612, 0.08410788055819769, -0.18434194196015596, -2.619049822290738E-4, -0.15106081838409105, 0.06801097933202982, 0.025421484373509884, -0.00783106591552496, -0.11268715199548751, -0.02298047083119551))"
124400,B005ZBZM52,#oc-R8EZLM74R071X,jalien,0,0,1,1350172800,"Great design, not so great flavor","I wanted to try something new and this is new alright. While I love and appreciate the design, the flavor is less than desirable. It's too bitter. I drink my coffee black and really enjoy a strong cup of dark roast, but after discovering I didn't care for the flavor, I tried to disguise it with the flavor of creamer I knew I'd like (coconut cream, yum) and still the flavor of the bitter coffee was just too much.",2012-10-14,"List(I, wanted, to, try, something, new, and, this, is, new, alright., While, I, love, and, appreciate, the, design,, the, flavor, is, less, than, desirable., It's, too, bitter., I, drink, my, coffee, black, and, really, enjoy, a, strong, cup, of, dark, roast,, but, after, discovering, I, didn't, care, for, the, flavor,, I, tried, to, disguise, it, with, the, flavor, of, creamer, I, knew, I'd, like, (coconut, cream,, yum), and, still, the, flavor, of, the, bitter, coffee, was, just, too, much.)","List(want, tri, someth, new, new, alright, love, appreci, design, flavor, less, desir, bitter, drink, coffe, black, realli, enjoy, strong, cup, dark, roast, discov, didnt, care, flavor, tri, disguis, flavor, creamer, knew, id, like, coconut, cream, yum, still, flavor, bitter, coffe, just, much)","List(want tri, tri someth, someth new, new new, new alright, alright love, love appreci, appreci design, design flavor, flavor less, less desir, desir bitter, bitter drink, drink coffe, coffe black, black realli, realli enjoy, enjoy strong, strong cup, cup dark, dark roast, roast discov, discov didnt, didnt care, care flavor, flavor tri, tri disguis, disguis flavor, flavor creamer, creamer knew, knew id, id like, like coconut, coconut cream, cream yum, yum still, still flavor, flavor bitter, bitter coffe, coffe just, just much)","List(want, tri, someth, new, alright, love, appreci, design, flavor, less, desir, bitter, drink, coffe, black, realli, enjoy, strong, 
cup, dark, roast, discov, didnt, care, disguis, creamer, knew, id, like, coconut, cream, yum, still, just, much, want tri, tri someth, someth new, new new, new alright, alright love, love appreci, appreci design, design flavor, flavor less, less desir, desir bitter, bitter drink, drink coffe, coffe black, black realli, realli enjoy, enjoy strong, strong cup, cup dark, dark roast, roast discov, discov didnt, didnt care, care flavor, flavor tri, tri disguis, disguis flavor, flavor creamer, creamer knew, knew id, id like, like coconut, coconut cream, cream yum, yum still, still flavor, flavor bitter, bitter coffe, coffe just, just much)","List(1, 100, List(), List(-0.1057189389885891, 0.15611619555247236, 0.043049396527099555, -0.024266765630316166, 0.033761991054884026, 0.03484702222270979, -0.02296642857115893, -0.008959645948683221, 0.058505518921828335, -0.014355461589903348, 0.00460923525194327, 0.10968321062890546, -0.03274600516029057, 0.07996849294397093, -0.06639442987562645, 0.0632239862988215, -0.07083157852147927, -0.05318153805897704, 0.06170275641073073, -0.08603889523961004, -0.04020947066047007, 0.07105378062087332, 0.07694123044521325, -0.12939560136735617, 0.056539090477772766, 0.04057502146765944, 0.035489624102843835, 0.022176622912021618, 0.03965054977951305, 0.07623614328691647, 0.012450105632610974, -0.12367673646215172, 0.07741056409265314, -0.005823900546168997, 0.04361878374281029, 0.08160311338447389, 0.09749067311973443, -0.04009373806461337, -0.06694141389536006, -0.125296211269285, -0.06174385737782965, 0.05523710884153843, 0.10569632736345132, -0.056667223860997526, -0.06430546127791915, -0.0026954159367873908, -0.025165878635432036, 0.046351677366709775, 0.06781229289758595, -0.013319686204860253, 0.075463159536455, -0.052751447273684395, 0.08344332594285896, -0.13744441489689052, 0.022272728710612723, 0.05442621934344061, 0.10338698862497472, -0.029921570981276176, -0.07387576575967528, -0.06003032742245566, 0.06397399184338394, 
-0.08245268532828916, 0.011579177999616755, 0.009448502833644548, -0.07819536922311054, -0.012256741906250162, 0.04064082689694173, 0.011840088764161226, 0.023694826955241815, 0.027414040954595074, -0.14951751416637782, -0.08646715157443569, 0.021314315431352173, 0.03434578975152579, -0.04930967420694374, -0.02868135205686225, -0.020541942257079338, -0.03851808425748632, 0.058531123612608225, -0.019232201740600256, -0.07326669159478374, -0.006036205660729181, -0.037581643366831397, 0.0902483391324668, 0.04045405749985505, 0.05232411931224522, 0.0674782061001419, 0.00438646901775861, 0.08253312019986056, -0.002005691340725337, 0.07163025879500699, 0.015011233931762121, -0.10569760296493769, -0.011571215975674844, 0.040959547739475965, -0.05397742543192136, 0.030984822614257043, 0.013806378557568505, -0.026021521538496017, -0.032212742451866644))"
345714,B000ED9L9E,A10103MJIKKIFE,Dozer,1,1,5,1339459200,Yummi!!!!,"Love this hot cereal! I use 1/3 cup with 1 cup of almond milk, 1 tbsp of flax oil and 1 banana for breakfast (around 400 calories) and it helps me feel full all day! It has definitely decreased my cravings for sugary things.",2012-06-12,"List(Love, this, hot, cereal!, I, use, 1/3, cup, with, 1, cup, of, almond, milk,, 1, tbsp, of, flax, oil, and, 1, banana, for, breakfast, (around, 400, calories), and, it, helps, me, feel, full, all, day!, It, has, definitely, decreased, my, cravings, for, sugary, things.)","List(love, hot, cereal, use, , cup, , cup, almond, milk, , tbsp, flax, oil, , banana, breakfast, around, , calori, help, feel, full, day, definit, decreas, crave, sugari, thing)","List(love hot, hot cereal, cereal use, use , cup, cup , cup, cup almond, almond milk, milk , tbsp, tbsp flax, flax oil, oil , banana, banana breakfast, breakfast around, around , calori, calori help, help feel, feel full, full day, day definit, definit decreas, decreas crave, crave sugari, sugari thing)","List(love, hot, cereal, use, , cup, almond, milk, tbsp, flax, oil, banana, breakfast, around, calori, help, feel, full, day, definit, decreas, crave, sugari, thing, love hot, hot cereal, cereal use, use , cup, cup , cup almond, almond milk, milk , tbsp, tbsp flax, flax oil, oil , banana, banana breakfast, breakfast around, around , calori, calori help, help feel, feel full, full day, day definit, definit decreas, decreas crave, crave sugari, sugari thing)","List(1, 100, List(), List(-0.11215953939947589, 0.19652471911354824, 0.1083142462182501, -0.028884253175608044, 0.03959108747798821, 0.06861086009905255, 0.04137226072107923, -0.1398608187325941, 0.028429636795972955, -0.10505966418262186, -0.05766447842249583, 0.04126739970023005, -0.11135072993307278, -0.03817337520163635, -0.03862246278480723, -0.01859901794071855, -0.049140017865032984, 0.04403928127782098, 0.0602422763657724, -0.097490636214357, 
0.02597613555604014, -0.07940409278320469, -0.0026243011274471365, -0.03128950703278954, 0.036029411881262886, 0.019634896391553098, -0.05472070085494939, 0.034693366012953475, 0.10279750339044579, 0.11004968899591215, -0.0831304510477288, -0.10973192125173478, 0.0935081205017672, 0.07778901581106515, 0.049835258259855464, -0.024779639624316115, 0.08047891504548747, -0.0713880331223381, -0.04291156668151761, -0.06838056212291121, 0.12514599197512044, 0.033400334610507405, -0.028272234240611052, -0.01959520732534343, -0.08159556077664783, 0.08214042830313074, -0.012711293369146257, 0.14013826233688098, 0.06257774623046661, 0.18159719777357733, -0.004780705518928228, 0.026894782059665383, -0.002697614105900043, -0.023027787081383425, 0.0027383956747081386, -0.005862921000667045, -0.07496836261245711, 0.10934142209589481, 0.003833745766816468, -0.0268465641500621, -0.07480765694495418, 0.056169876796675136, 0.02137713115020045, 0.05925091418662461, -0.006261440957414693, -0.0033295184919803307, 0.01952822570656908, -0.019302409174370354, -0.043483091743084894, 0.00948902725338422, -0.04517627015292387, -0.055813614254945826, 0.004383695697219208, 0.040191919446505346, 0.062313620322223366, -0.04367429817673461, 0.0902649341472264, 0.046676143325447776, 0.06255348360744016, 0.023878321854461884, 0.14392635611624552, 0.04441613083352046, -0.06119875820225169, 0.02758135057279262, 0.07506437679945395, 0.01896875936152606, -0.09152929889487808, 0.05876499491518941, 0.13375790404348536, 0.04745303842255139, -0.021119412628869558, -0.048334930750445045, -0.020353248740289487, -0.02005613388108282, -0.011718123830084142, 0.013725161504257342, 0.05931006044004883, -0.049581883178124654, -0.10282776491909192, -0.07145142523122244))"
422681,B001LNL0MC,A101RJ22EBXAID,Disappointed Santa,5,6,1,1295913600,Holeymoley did I order crushed peppermint,"I ordered 3 boxes of Bob's candy canes because I could not find them in stores near me and I wanted the real thing, Bob's,not Spanglers, to help in the celebration of my father's last Christmas. We wanted an old fashioned, basic heart of America Christmas with him. We were willing to pay almost any price to get original Bob's candy canes for this purpose, hence ordering them at such an inflated price on-line from Amazon. More than half came smashed to smithereens. I sent a note to customer service alerting them to this and heard nothing back.",2011-01-25,"List(I, ordered, 3, boxes, of, Bob's, candy, canes, because, I, could, not, find, them, in, stores, near, me, and, I, wanted, the, real, thing,, Bob's,not, Spanglers,, to, help, in, the, celebration, of, my, father's, last, Christmas., We, wanted, an, old, fashioned,, basic, heart, of, America, Christmas, with, him., We, were, willing, to, pay, almost, any, price, to, get, original, Bob's, candy, canes, for, this, purpose,, hence, ordering, them, at, such, an, inflated, price, on-line, from, Amazon., More, than, half, came, smashed, to, smithereens., I, sent, a, note, to, customer, service, alerting, them, to, this, and, heard, nothing, back.)","List(order, , box, bob, candi, cane, find, store, near, want, real, thing, bobsnot, spangler, help, celebr, father, last, christma, want, old, fashion, basic, heart, america, christma, will, pay, almost, price, get, origin, bob, candi, cane, purpos, henc, order, inflat, price, onlin, amazon, half, came, smash, smithereen, sent, note, custom, servic, alert, heard, noth, back)","List(order , box, box bob, bob candi, candi cane, cane find, find store, store near, near want, want real, real thing, thing bobsnot, bobsnot spangler, spangler help, help celebr, celebr father, father last, last christma, christma want, want old, old fashion, fashion basic, basic heart, 
heart america, america christma, christma will, will pay, pay almost, almost price, price get, get origin, origin bob, bob candi, candi cane, cane purpos, purpos henc, henc order, order inflat, inflat price, price onlin, onlin amazon, amazon half, half came, came smash, smash smithereen, smithereen sent, sent note, note custom, custom servic, servic alert, alert heard, heard noth, noth back)","List(order, , box, bob, candi, cane, find, store, near, want, real, thing, bobsnot, spangler, help, celebr, father, last, christma, old, fashion, basic, heart, america, will, pay, almost, price, get, origin, purpos, henc, inflat, onlin, amazon, half, came, smash, smithereen, sent, note, custom, servic, alert, heard, noth, back, order , box, box bob, bob candi, candi cane, cane find, find store, store near, near want, want real, real thing, thing bobsnot, bobsnot spangler, spangler help, help celebr, celebr father, father last, last christma, christma want, want old, old fashion, fashion basic, basic heart, heart america, america christma, christma will, will pay, pay almost, almost price, price get, get origin, origin bob, cane purpos, purpos henc, henc order, order inflat, inflat price, price onlin, onlin amazon, amazon half, half came, came smash, smash smithereen, smithereen sent, sent note, note custom, custom servic, servic alert, alert heard, heard noth, noth back)","List(1, 100, List(), List(0.010369860368815285, 0.047791643753751285, 0.02116372825333167, 0.00812714211470275, 0.030156329317294336, -0.03050771279312256, 0.02318017765086282, -0.0887353802875926, 0.0492685903036208, -0.030262864786318795, 0.0017587846809032338, 0.1456325672118476, 0.11627727077790984, -0.030575024188254717, 0.11042622431974719, 0.044785771598172784, -0.05374610349449708, -0.049090280758079, 0.013368108799719872, 0.011582396958989124, -0.002885315127463804, -0.1275600098198521, -0.002342427994504019, -0.051602171570131625, -0.07437391553281082, -0.04504386004905596, -0.07000916320356299, 
-0.0021283641108311713, -0.015875986363324854, 0.03204391651390189, -0.003771916635472465, -0.07143469955405668, 0.16140592769348855, -0.09702295793599829, 0.01308337171320562, 0.031556277760063055, 0.0385377856005949, 0.07976793458133384, 0.056293592089787126, -0.03559194739770006, -0.03248154183987666, 0.06569279366953264, -0.08573170929836729, 0.04103329967431448, -0.045735423315178464, 0.08716261864605325, 0.0656674917899417, 0.1347713953366986, -0.07438945219662316, 0.008062774147320952, 0.012027621467563289, 0.06255545619017168, 0.011051503630975882, -0.05024104474836753, -0.01672913152207103, -0.030810846676575708, 0.010397614238576757, -0.05011252093956702, -0.026565422709272413, 0.005192968476977613, -0.027912407458104468, 0.030875998100748765, -0.05328877027689789, 0.012027719827731036, -0.04705001639754132, 0.09980948469644688, 0.0902407316327164, 0.0665242675879401, -0.02071189218097263, -0.012597424564538178, -0.06562925727296344, 0.01151050431912558, -0.06839157660336544, 0.062407478189992684, 0.03301300815278578, -0.0810363820788485, 0.04780213175884758, -0.005278638253609339, 0.01419025879456765, 0.07733331629747732, 0.01748038295449482, -0.00248609681148082, 0.002357351460011193, -0.05970394625587182, -0.02118441226236798, -0.05905859507792802, -0.007836508147496108, 0.051044222070939005, -0.16649493176265862, 0.048363753527357604, 0.09346369885046173, 0.011596130327907976, -0.019465936414942712, 0.008707476693584962, 0.08097581811145775, -0.0433935362379998, -0.1320980318360617, 0.03467075854401897, 0.016607691982278117, 0.049030946843602034))"
201480,B001EQ4M6M,A101WAUONUXTNI,high altitude baking,0,4,1,1258761600,These Don't Work!,These make more than 10 cupcakes. Don't fill the muffin cups full (unless you are using jumbo muffin tins) because they will over flow. There should be some modification for high altitude..,2009-11-21,"List(These, make, more, than, 10, cupcakes., Don't, fill, the, muffin, cups, full, (unless, you, are, using, jumbo, muffin, tins), because, they, will, over, flow., There, should, be, some, modification, for, high, altitude..)","List(make, , cupcak, dont, fill, muffin, cup, full, unless, use, jumbo, muffin, tin, will, flow, modif, high, altitud)","List(make , cupcak, cupcak dont, dont fill, fill muffin, muffin cup, cup full, full unless, unless use, use jumbo, jumbo muffin, muffin tin, tin will, will flow, flow modif, modif high, high altitud)","List(make, , cupcak, dont, fill, muffin, cup, full, unless, use, jumbo, tin, will, flow, modif, high, altitud, make , cupcak, cupcak dont, dont fill, fill muffin, muffin cup, cup full, full unless, unless use, use jumbo, jumbo muffin, muffin tin, tin will, will flow, flow modif, modif high, high altitud)","List(1, 100, List(), List(-0.08712528025110562, 0.12039564200676978, 0.0726782777864072, 0.004739899860901965, -0.001062935735616419, 0.013364736197723283, 0.04208725323486659, -0.21856986630397537, -0.024545496536625754, -0.03617085159445802, -0.03237503182349934, 0.04364887090762042, -0.060103439156793885, -1.44349488740166E-4, -0.0364598987572309, -0.10527612162857419, 0.00229622665149994, 0.027363015131817922, 0.017160869306988187, -0.04509747373069533, 0.031680792819113575, -0.024615284010198794, -0.07414622956679927, 0.1016853619140521, -0.0031183249213629295, -0.090456398203969, -0.05723355369021495, 0.02381974613914887, -0.03439670263065232, 0.12460367236700322, 0.03364252002858039, 0.0011880406075053744, 0.10278702899813652, -0.11856475001614954, 0.08181283220700505, -0.03075260224027766, 0.02731564506474468, 
-0.053769053412704826, -0.08310352524535523, -0.016480943136331107, 0.06612309710019164, -0.07115060611007114, 0.03495801685170995, 0.06984757942457993, -0.13294246627224815, 0.09666274358621901, 0.035933573782030076, 0.0705433563950161, 0.03725119845734702, 0.16927158581610355, -0.0018656044267117977, -0.03609744148949782, -0.04211504711161574, 0.00565694086253643, 0.08949320256296131, 0.012424443166107973, -0.028110493233220443, -0.05351932584825489, 0.0519964543895589, -0.029441708685933713, 0.03611484123393893, -0.07277244986552331, 0.02114809521784385, 0.10418679510864118, -0.1092223629497716, 0.0025209260962179136, 0.12103912503355078, 0.09719460312690999, 0.03604174695081181, -0.006007193018578821, -0.08215641085472371, -0.09483115387976997, -0.04529120117270698, 0.11301381901527444, 0.031451445900731616, 0.041884786915034056, 0.07464434030569261, 0.10187743723185526, -0.055623521200484694, 0.015851715320928227, 0.12377035798918869, 0.14774599672657332, -0.02076893682695097, -0.0899456907581124, -0.007073929381375718, -0.01982996330803467, -0.07224485071169005, 0.17030452262972376, -0.0526263899066382, 0.06959408190515305, 0.04317398969497945, -0.1054593846719298, -0.046510032688577965, 0.054184084628812135, -0.014479277888312936, -0.03242459585372772, -0.05550236264631773, -0.01691624941304326, -0.07458021222717232, -0.04594901876731051))"
461927,B000ES1R1Y,A102T4DKZN8UFC,"Marina Thakkar ""marinathakkar""",0,1,5,1279843200,Great tea,I have been drinking Ahmad teas for quite some time now and love all the flavors...so good!,2010-07-23,"List(I, have, been, drinking, Ahmad, teas, for, quite, some, time, now, and, love, all, the, flavors...so, good!)","List(drink, ahmad, tea, quit, time, now, love, flavorsso, good)","List(drink ahmad, ahmad tea, tea quit, quit time, time now, now love, love flavorsso, flavorsso good)","List(drink, ahmad, tea, quit, time, now, love, flavorsso, good, drink ahmad, ahmad tea, tea quit, quit time, time now, now love, love flavorsso, flavorsso good)","List(1, 100, List(), List(-0.05331267168124516, 0.09298020746144983, 0.00323694861597485, -0.020412969506449167, 0.01791489662395583, -0.020937028237515025, 0.03467680886387825, 0.006243498788939581, 0.10737207045571671, 0.11841328649057281, 0.09092248065604104, 0.08980865786886877, 0.056543335111604795, 0.06496564836965667, -0.05267915150357617, 0.04977247615655263, 0.034636886040162705, -0.04588542588882976, 0.06993839072270526, -0.09431846470882495, -0.027387323996259105, -0.015420094235903686, 0.09483637060556147, -0.0021509118378162384, 0.023072870965633128, -0.0912530060029692, 0.03489374679823716, 0.017943425931864314, 0.15810100263398555, -0.024460386174420513, -0.14954434904373354, -0.14201688890655834, 0.014449882010618845, 0.08133700396865606, 0.04591972975888186, 0.09559054672718048, -0.07539195546673404, -0.02843025947610537, 0.019338038594772417, -0.07480923417541716, -0.1005502519611683, 0.041785469382173486, 0.1000433052993483, -0.1155399117204878, -0.0752835943777528, -0.11565633097456561, 0.05848444998264313, 0.04780099582340982, 0.15254200653483468, 4.489074150721232E-4, 0.08429317776527669, 0.03920780494809151, 0.05238232306308216, -0.10591656994074583, -0.06020668852660391, 0.02799889565600703, -0.09065481026967366, -0.07943464991533093, -0.08388209425740771, -0.03663308835691876, 0.024967985020743474, 
-0.004910628621776898, -0.11347112846043374, 0.020904229126042787, 0.009610644373525348, -0.014628999142183198, 0.03135597788625293, 0.017316227727052238, 0.14282539238532382, 0.08065644030769666, -0.11197078846291535, 0.10610909656518035, -0.0715361318240563, 0.042634268808696, 0.012829915723866886, -0.05953472050734692, 0.018485749140381813, -0.048122309782128364, 0.23478756472468376, 0.009955466621451907, -0.037472665723827146, -0.029807157814502716, -0.0278691366677069, -0.09069690864998847, 0.00955475028604269, -0.027016195572084848, -0.02311091207795673, 0.054801548520723976, -0.01103521366086271, -0.11238533755143483, 0.023850743503620226, -0.08342869579792023, -0.03358358713901705, 0.0789148203200764, 0.006327453586790296, -0.001827459575401412, -0.13853606022894382, 0.01961614434710807, -0.15499148486802974, 0.03886766814523273))"
419554,B0029ZAOW8,A105Y40R0K3ZIZ,Wei H. Ho,0,0,5,1321660800,good,it arrived on time and the quality is good as i expected. i will purchase again and recommend to other people.,2011-11-19,"List(it, arrived, on, time, and, the, quality, is, good, as, i, expected., i, will, purchase, again, and, recommend, to, other, people.)","List(arriv, time, qualiti, good, expect, will, purchas, recommend, peopl)","List(arriv time, time qualiti, qualiti good, good expect, expect will, will purchas, purchas recommend, recommend peopl)","List(arriv, time, qualiti, good, expect, will, purchas, recommend, peopl, arriv time, time qualiti, qualiti good, good expect, expect will, will purchas, purchas recommend, recommend peopl)","List(1, 100, List(), List(0.006723782668511072, 0.0595461626847585, -0.06263581198355597, 0.09904975195725758, 0.014570568584733538, 0.05565779604431655, 0.04765311235355006, 0.0020541946093241372, 0.13680672500696447, -0.05255261270536316, -0.09550544578168127, 0.12494235765188932, 0.06146039844801028, 0.007194836934407552, 0.032599604378143944, 0.0361706233686871, -0.0183930786119567, 0.010314667390452491, 0.10938382086654504, 0.07945574993371135, -0.025096982717514038, -0.03869825270440843, 0.08367633695403734, 0.013647712456683317, -0.1547414544555876, -0.013484211224648686, -0.056663770642545484, -0.009033699623412555, 0.0296546614004506, 0.025667966673305877, 0.020240615846382246, -0.09565699837791422, 0.1945505527158578, -0.21298406604263517, -0.06409517820510599, 0.014407505591710407, 0.01810921790699164, 0.1717164512309763, -0.06504072931905587, 0.09406981410251723, -0.13800391968753603, 0.013010852866702609, 0.02365104771322674, -0.06309570796373817, -0.04236572861878408, 0.011900144318739573, -0.04303263779729605, 0.0647858613067203, 0.0923005428372158, -0.09667481906298134, 0.14006545053174096, 0.030313623862134084, -0.1063300110399723, -0.0029464632065759762, -0.1368834803967426, -0.12024536811643176, 0.07078235844771066, 0.07554677459928724, 
-0.012904807925224304, 0.030364019175370533, 0.0349554104420046, 0.07096418448620372, 0.1443679341011577, 0.007020208363731702, -0.02227534477909406, 0.06155985531707604, 0.11701710946444008, 0.05997176037635654, 0.05074448146236439, -0.02390731767647796, -0.0628723402818044, -0.07558776686588922, -0.02484747374223338, 0.07072605671257608, -0.013036566062106026, -0.00868170956770579, 0.1604278501537111, -0.08068104274570942, 0.10848066417707336, -0.048315534575117954, 0.07113068198992147, 0.1354682660765118, -0.02962641231715679, -0.058928428010808095, 0.014960536112387974, -0.11635572080396943, 0.08545165457245375, -0.010674597488509284, -0.11644106461770004, -0.10199245520763926, 0.061832856179939374, -0.09203014253742164, -0.0826412724951903, 0.01418953678674168, 0.1411928462071551, -0.10116216245417793, -0.18375914605955282, 0.041217557051115565, -0.024303638188737545, 0.06245360213021437))"
414477,B002PHOHZ0,A107JX0JQRQUUP,Kevin J. Malantic,0,0,5,1327622400,Delicious and fresh alternative to the old fruitbasket standby,"Let's face it, flowers get boring after awhile and there's only so much chocolate you can buy as presents before people start calling you the 'Proflowers' guy behind your back. This fruit and treat gift is a fantastic alternative to the norm. I purchased 3 of these for relatives and they all loved the freshness of the fruit and the variety of the treats. Highly recommended and, what's more, the company shipped everything on time and it all arrive on the days that they stated.",2012-01-27,"List(Let's, face, it,, flowers, get, boring, after, awhile, and, there's, only, so, much, chocolate, you, can, buy, as, presents, before, people, start, calling, you, the, 'Proflowers', guy, behind, your, back., This, fruit, and, treat, gift, is, a, fantastic, alternative, to, the, norm., I, purchased, 3, of, these, for, relatives, and, they, all, loved, the, freshness, of, the, fruit, and, the, variety, of, the, treats., Highly, recommended, and,, what's, more,, the, company, shipped, everything, on, time, and, it, all, arrive, on, the, days, that, they, stated.)","List(let, face, flower, get, bore, awhil, there, much, chocol, can, buy, present, peopl, start, call, proflow, guy, behind, back, fruit, treat, gift, fantast, altern, norm, purchas, , rel, love, fresh, fruit, varieti, treat, highli, recommend, what, compani, ship, everyth, time, arriv, day, state)","List(let face, face flower, flower get, get bore, bore awhil, awhil there, there much, much chocol, chocol can, can buy, buy present, present peopl, peopl start, start call, call proflow, proflow guy, guy behind, behind back, back fruit, fruit treat, treat gift, gift fantast, fantast altern, altern norm, norm purchas, purchas , rel, rel love, love fresh, fresh fruit, fruit varieti, varieti treat, treat highli, highli recommend, recommend what, what compani, compani ship, ship everyth, everyth 
time, time arriv, arriv day, day state)","List(let, face, flower, get, bore, awhil, there, much, chocol, can, buy, present, peopl, start, call, proflow, guy, behind, back, fruit, treat, gift, fantast, altern, norm, purchas, , rel, love, fresh, varieti, highli, recommend, what, compani, ship, everyth, time, arriv, day, state, let face, face flower, flower get, get bore, bore awhil, awhil there, there much, much chocol, chocol can, can buy, buy present, present peopl, peopl start, start call, call proflow, proflow guy, guy behind, behind back, back fruit, fruit treat, treat gift, gift fantast, fantast altern, altern norm, norm purchas, purchas , rel, rel love, love fresh, fresh fruit, fruit varieti, varieti treat, treat highli, highli recommend, recommend what, what compani, compani ship, ship everyth, everyth time, time arriv, arriv day, day state)","List(1, 100, List(), List(0.03512646508034925, 0.08502471422895702, 0.03978026181941332, 0.04128421466191148, 0.016958158441581005, -0.04055767835031242, 0.004677584979596526, -0.012345474527412376, 0.05975855107224265, -0.02833334654894506, -0.0032290652026097442, 0.10232084346285393, 0.045062873831971786, 0.014472080437942993, 0.018003343669481055, 0.04265539402272119, 0.04610688824119956, 0.007832485500203315, 0.10277838288091642, -0.0035743479708989345, -0.01832662561778412, -0.03365965785328732, 0.09273986830267796, 0.010543247576543065, -0.05658038863154172, -0.04948984396773889, -0.05024790207227302, -0.009166686053842653, -0.014039755452337653, 0.01605853229235901, 0.031153663646343144, -0.07617731647143616, 0.13115257720008144, -0.09019053443659876, -0.025967421161747255, 0.00585891808881316, 0.039112361728722705, 0.10294248313957088, 0.018739158417596373, -9.301698788307434E-4, -0.015350839837865773, 0.07339558206487794, -0.024562580099459302, -0.030504126404953556, -0.10284566147074233, 0.023666505907597238, 0.003626935644248544, 0.051106327364957606, -0.03884956879584595, 0.01041331043640195, 
0.10829321269844767, -0.0033429562910135056, -0.02917527725790129, -0.02848718241723471, -0.011427562994487194, -0.017395176946423774, 0.018603925246658715, -0.0759309320927186, 0.007102417618815982, 0.011174710813996404, -0.0143085066229105, 0.013904406277592792, -0.026201851235507705, -0.006597208898774413, -0.06365880912689623, 0.005793436896055937, 0.07466338109225035, 0.005797476499601332, 0.0024851776475390028, 0.005942473556239937, -0.021860152520889114, -1.961716916412115E-4, -0.054542369360840594, 0.11538415730302763, -0.015397601755509195, -0.07454442397453064, 0.07757117692245777, 0.007910776577435087, 0.055272545795454535, 0.011976582019828087, 0.023180797474112273, 0.03223466048515294, -0.003678415392980326, -1.1516165673234608E-4, 0.052514092108711254, -0.1056032549233858, 5.674325796060784E-4, 0.02301447673938995, -0.08740909897917232, 0.02699486062277195, 0.019289454423584217, -0.03902745807846618, -0.009819155488553088, -0.028439707772503067, 0.04803448732465852, -0.0016624641951260178, 0.010818045827300223, 0.01794124982417236, -0.043356000186894415, 0.02852271878442099))"
497417,B001SB0ZQ4,A108M02UNZ7545,Stephen Mills,1,1,1,1304208000,Disappointment with Amazon/Nutricity,"I am furious that I put in my first order for something other than books, dvds and cds from an Amazon based Seller, Nutricity, and they sent the wrong order. And, it's not that it's just the wrong order, but the title/headline on the product and the order itself specifically said DECAF!! I needed and wanted and ordered DECAF and, DECAFFEINATED is printed on the cans of the Product I was SUPPOSED to get and ordered - and they STILL sent me the wrong product with Caffeine. Unbelievable.",2011-05-01,"List(I, am, furious, that, I, put, in, my, first, order, for, something, other, than, books,, dvds, and, cds, from, an, Amazon, based, Seller,, Nutricity,, and, they, sent, the, wrong, order., And,, it's, not, that, it's, just, the, wrong, order,, but, the, title/headline, on, the, product, and, the, order, itself, specifically, said, DECAF!!, I, needed, and, wanted, and, ordered, DECAF, and,, DECAFFEINATED, is, printed, on, the, cans, of, the, Product, I, was, SUPPOSED, to, get, and, ordered, -, and, they, STILL, sent, me, the, wrong, product, with, Caffeine., Unbelievable.)","List(furiou, put, first, order, someth, book, dvd, cd, amazon, base, seller, nutric, sent, wrong, order, just, wrong, order, titleheadlin, product, order, specif, said, decaf, need, want, order, decaf, decaffein, print, can, product, suppos, get, order, , still, sent, wrong, product, caffein, unbeliev)","List(furiou put, put first, first order, order someth, someth book, book dvd, dvd cd, cd amazon, amazon base, base seller, seller nutric, nutric sent, sent wrong, wrong order, order just, just wrong, wrong order, order titleheadlin, titleheadlin product, product order, order specif, specif said, said decaf, decaf need, need want, want order, order decaf, decaf decaffein, decaffein print, print can, can product, product suppos, suppos get, get order, order , still, still sent, sent wrong, wrong 
product, product caffein, caffein unbeliev)","List(furiou, put, first, order, someth, book, dvd, cd, amazon, base, seller, nutric, sent, wrong, just, titleheadlin, product, specif, said, decaf, need, want, decaffein, print, can, suppos, get, , still, caffein, unbeliev, furiou put, put first, first order, order someth, someth book, book dvd, dvd cd, cd amazon, amazon base, base seller, seller nutric, nutric sent, sent wrong, wrong order, order just, just wrong, order titleheadlin, titleheadlin product, product order, order specif, specif said, said decaf, decaf need, need want, want order, order decaf, decaf decaffein, decaffein print, print can, can product, product suppos, suppos get, get order, order , still, still sent, wrong product, product caffein, caffein unbeliev)","List(1, 100, List(), List(0.0713003507283117, 0.06815460294906404, -0.012361046026593872, -0.05399561334135276, 0.0030562838295563346, -0.06193372366639475, 0.05166923813521862, 0.05141113466760587, -0.0049488070099392815, -0.004381148172320709, 0.0034767610486596823, 0.19667378022512863, 0.09010193182621151, -0.032965296702015964, 0.09538178411977631, 0.07681146301772622, 0.030271811566005148, -0.0452708502388781, 0.050617544002653586, -0.014298684761992522, -0.02804297860711813, -0.062438944969991486, 0.07653664287534498, -0.04720570676450041, -0.13055675334873654, -0.04319412486317257, -0.05709675608280425, 0.038078201615939, -0.004331658938012662, 0.08956729241513779, 0.04142766524871279, -0.11958587692961806, 0.15579399269162897, -0.05734277498863992, -0.019824195010144086, 0.1181049074283302, -0.04570172948851472, 0.04301383248752071, 0.08799222597320165, -0.028518328454256767, -0.08182346206601886, 0.02744422006903083, -2.138483825893629E-4, 0.006985532713033968, -0.0836530286774394, 0.0749190123973503, 0.03082982197936092, 0.10408222331621107, -0.023519619301493676, -0.10873101256965172, 0.04941134791200359, 0.08509510926281412, 0.09876898228235187, 0.012506001562412295, 
-0.02012198848561162, -0.023851489559525534, 0.0774042897946423, 0.0043601770885288715, -0.01077403263410642, 0.025364477032174666, 0.016624870998341413, 0.026267086532676502, -0.0012067388826315956, 0.05126159209369992, -5.179954958813531E-4, 0.040920082230253944, 0.008571606001905386, 0.04184312024430948, 0.020818400085859355, -0.07264564532254424, -0.10221465419800509, -0.007771025234389872, -0.1339634514990307, 0.11724345065054616, -0.04471959314486455, -0.056335500873891366, 0.08835601112583563, -0.07865655202130299, -0.04937536051840565, 0.03810410173693007, 0.033606677223967076, 0.009018387684288124, -0.01758820612338327, 0.046354664533975576, -0.05509027477265114, -0.13234242943248578, -0.003150219455294843, 0.12999508592544035, -0.09788644863736061, -0.041768995429655266, 0.17410201022756241, 0.015408837934955955, 0.006966074196887867, -0.009383629275751966, 0.10334399362493837, -0.05100112787858095, -0.018471561002938654, 0.020127996120468845, -0.033869649638377484, 0.06418654788285494))"
291053,B005HG9ESG,A10LDAXFO052F7,Alexandra,4,6,5,1335225600,Amazing water,"This water is very good for the immune system... Love it, it's done wonders for my family.. Will always purchase this water.",2012-04-24,"List(This, water, is, very, good, for, the, immune, system..., Love, it,, it's, done, wonders, for, my, family.., Will, always, purchase, this, water.)","List(water, good, immun, system, love, done, wonder, famili, will, alway, purchas, water)","List(water good, good immun, immun system, system love, love done, done wonder, wonder famili, famili will, will alway, alway purchas, purchas water)","List(water, good, immun, system, love, done, wonder, famili, will, alway, purchas, water good, good immun, immun system, system love, love done, done wonder, wonder famili, famili will, will alway, alway purchas, purchas water)","List(1, 100, List(), List(0.005267522530630231, 0.12741794507019222, -0.01992430173171063, 0.09278246036653096, 0.03998839253714929, 0.1277338833315298, -0.13400513709833223, 0.07084168462703624, -0.05168495886027813, -5.173911340534687E-4, -0.024712991124639906, 0.13051359875438112, 0.03128258422172318, -0.015069574893762667, -0.12345274615411957, 0.037252628089239195, 0.012277644127607346, -0.013245741836726665, 0.05941549860290252, -0.08411478142564495, 0.03425289123939971, -0.03551530960248783, 0.018040590453892946, 0.048148467515905694, -0.0511362172740822, -0.020015141616264977, 0.039606242130200066, -0.022146266574660935, 0.049801824924846486, 0.11187494996314247, -0.11268158505360285, 0.006500920280814171, 0.15499168184275428, -0.050754154644285635, -0.010333988505105177, 0.10274656613667806, -0.02515732659958303, 0.1025400971993804, -0.07170402305200696, -0.034087662623884775, -0.10599686958206196, 0.031598678790032864, -0.04591123868400852, 0.08797392466415961, -0.0670536688218514, 0.011512095729509989, -0.031188294679547347, 0.11048809407899776, 0.048936693385864295, 0.09283056139247492, 0.011501168676962454, 0.0749804318572084, 
-0.09673666985084613, -0.04924138037798305, -0.06862022720936996, -0.15017283521592617, -0.023543592969266076, -0.002658323384821415, -0.03459940509249766, -0.15807603113353252, 0.09972351230680943, 0.043920228723436594, 0.060690030145148434, -0.04346581358307351, -0.08024700746561089, -0.11900231149047613, 0.044738102083404854, -0.019811829396833975, -0.030980312963947654, -0.05613200971856713, -0.020930136068879314, -0.06953930947929621, -0.05058949425195654, 0.03145948766420285, 0.09679067724694808, 0.03866466072698434, 0.012925770599395037, 0.04229510962613858, 0.052003163844347, -0.07537014351692051, -0.05762449838221073, 0.04634823029239972, -0.05627802138527234, -0.09155684651341289, 0.17394756184269983, 0.07989029846309373, 0.007092517102137208, 0.08968049132575591, -0.0611260668374598, -0.08973855649431546, -0.015436013073970873, -0.033633859983334936, 0.09375719206097224, -0.09821398463100195, -0.03521862470855315, -0.07499664882197976, -0.029159248961756624, -0.026542755930374064, -0.053836660071586565, -0.18354528856192093))"


In [37]:
# Random Forest

from pyspark.ml import Pipeline
from pyspark.ml.feature import IndexToString, StringIndexer
from pyspark.ml.classification import RandomForestClassifier

labelIndexer = StringIndexer(inputCol="Score", outputCol="indexedScore").fit(prediction_doc2vec)
rf = RandomForestClassifier(labelCol="indexedScore", featuresCol="doc2vec", numTrees=40)
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)

pipeline = Pipeline(stages=[labelIndexer, rf, labelConverter])

(trainingData, testData) = prediction_doc2vec.randomSplit([0.7, 0.3])

rf_model = pipeline.fit(trainingData)
predictions = rf_model.transform(testData)

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol="indexedScore", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print("AUC = %g" % auc)

In [38]:
# Performance evaluation with 10-fold cross-validation

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

paramGrid = ParamGridBuilder().build()
cv = CrossValidator(estimator=pipeline, evaluator=evaluator, estimatorParamMaps=paramGrid, numFolds=10)
cvModel = cv.fit(prediction_doc2vec)

print("Average AUC = %g" % cvModel.avgMetrics[0])
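
As a rough illustration of what the cross-validation cell above computes: with `numFolds=10` and an empty parameter grid, the data is split into 10 folds, the pipeline is fitted 10 times (each time holding out one fold for evaluation), and the 10 metrics are averaged into `avgMetrics[0]`. The sketch below is plain Python, not Spark's implementation. `k_fold_indices` and `toy_score` are hypothetical helpers, the round-robin fold assignment is a simplification (Spark assigns rows to folds randomly), and the score is a made-up stand-in for "fit, then evaluate AUC".

```python
def k_fold_indices(n_rows, k):
    """Split row positions 0..n_rows-1 into k (train, test) pairs.

    Hypothetical helper: rows are assigned to folds round-robin here,
    whereas Spark's CrossValidator assigns them randomly.
    """
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        splits.append((train, test))
    return splits


def toy_score(train, test):
    # Stand-in for "fit on train, evaluate on test". It returns a made-up
    # metric (the fraction of rows used for training) just to have a number.
    return len(train) / (len(train) + len(test))


splits = k_fold_indices(n_rows=20, k=10)
fold_metrics = [toy_score(train, test) for train, test in splits]

# Analogous to cvModel.avgMetrics[0]: the mean of the per-fold metrics.
average_metric = sum(fold_metrics) / len(fold_metrics)
print("folds:", len(splits), "average metric:", average_metric)
```

Every row is held out exactly once, so the averaged metric uses all of the data for evaluation without ever scoring a row with a model that was trained on it.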

## Model Interpretation

In [40]:
# Calculating TF-IDF without hashing; limit vocabulary to top 2^12 (4096) ngrams

from pyspark.ml.feature import CountVectorizer, IDF

interpret_tfidf = prediction_df.select('*')

tf = CountVectorizer(inputCol="ngrams", outputCol='TF', minDF=2.0, vocabSize=2**12)
tf_model = tf.fit(interpret_tfidf)
tf_transformed = tf_model.transform(interpret_tfidf)
idf = IDF(minDocFreq=3, inputCol="TF", outputCol="TF-IDF")
idfModel = idf.fit(tf_transformed)
interpret_tfidf = idfModel.transform(tf_transformed)
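
For intuition about the numbers that end up in the `TF-IDF` column, here is a minimal pure-Python sketch on a made-up corpus. It assumes the smoothed formula given in the Spark MLlib guide, `idf = log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` is the number of documents containing the term, with raw counts as term frequencies. The function `tf_idf` and its parameters `min_df` and `min_doc_freq` are hypothetical names that mirror `minDF` and `minDocFreq`. The sketch omits the `vocabSize` cap and orders the vocabulary alphabetically, whereas `CountVectorizer` orders it by corpus frequency.

```python
import math
from collections import Counter

# Toy corpus standing in for the "ngrams" column (made-up data).
docs = [
    ["wast", "money", "wast money"],
    ["love", "tast", "love"],
    ["wast", "tast"],
]


def tf_idf(docs, min_df=2, min_doc_freq=1):
    m = len(docs)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Keep only terms seen in at least min_df documents (mirrors minDF).
    vocabulary = sorted(t for t, df in doc_freq.items() if df >= min_df)
    weights = []
    for doc in docs:
        counts = Counter(doc)
        row = {}
        for term in vocabulary:
            df = doc_freq[term]
            # Terms below min_doc_freq get an IDF of 0 (mirrors minDocFreq).
            idf = math.log((m + 1) / (df + 1)) if df >= min_doc_freq else 0.0
            if counts[term]:
                row[term] = counts[term] * idf
        weights.append(row)
    return vocabulary, weights


vocabulary, weights = tf_idf(docs)
print(vocabulary)
print(weights)
```

Terms that occur in only one document ("money", "love", "wast money") never enter the vocabulary here, which is the same effect `minDF=2.0` has in the cell above.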

In [41]:
# Building a full Random Forest model with all the data, using TF-IDF embedding without hashing

from pyspark.ml import Pipeline
from pyspark.ml.feature import IndexToString, StringIndexer
from pyspark.ml.classification import RandomForestClassifier

labelIndexer = StringIndexer(inputCol="Score", outputCol="indexedScore").fit(interpret_tfidf)
rf = RandomForestClassifier(labelCol="indexedScore", featuresCol="TF-IDF", numTrees=40)
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)

pipeline = Pipeline(stages=[labelIndexer, rf, labelConverter])

rf_model = pipeline.fit(interpret_tfidf)

In [42]:
# Getting feature importance from the Random Forest model

feature_importance = rf_model.stages[-2].featureImportances
print(feature_importance)

In [43]:
# Get the indices of the 20 most important features and their importance scores

import numpy as np
import pandas as pd

top20_indice = np.flip(np.argsort(feature_importance.toArray()))[:20].tolist()
# Cast to plain Python floats so that pandas and Spark infer a double column
top20_importance = [float(feature_importance[index]) for index in top20_indice]

top20_df = spark.createDataFrame(pd.DataFrame(list(zip(top20_indice, top20_importance)), columns=['index', 'importance']))

display(top20_df)

index,importance
930,0.095536671305655
510,0.0771450245716013
715,0.0758230265252156
218,0.0357526135759054
1027,0.0330005637210955
440,0.0314459798847877
1706,0.0313921096234848
1229,0.0305992639776906
54,0.0218362657511012
1120,0.020851544707246


In [44]:
# Create a map between each ngram and its index

from pyspark.sql.functions import explode, udf

make_list_udf = udf(lambda col: [col], ArrayType(StringType()))
remove_list_udf = udf(lambda col: col[0], StringType())

def get_index(col):
  if len(col.indices) == 0:
    return -1   # Mark the index as -1 if the ngram is not in the fitted vocabulary (the top 2^12 ngrams)
  else:
    return int(col.indices[0])
get_index_udf = udf(get_index, IntegerType())

ngram_index = interpret_tfidf.select(explode(interpret_tfidf.ngrams).alias("ngrams")).distinct() \
                             .withColumn("ngrams", make_list_udf("ngrams"))
ngram_index = tf_model.transform(ngram_index)
ngram_index = ngram_index.withColumn("ngrams", remove_list_udf("ngrams")) \
                         .withColumn("index", get_index_udf("TF")) \
                         .select("ngrams", "index")

In [45]:
display(ngram_index.where(ngram_index.index > -1))

ngrams,index
still,100
yummi,352
dri,145
habanero,3821
gloria,2971
small amount,1702
far,3398
everyday,879
earl,1266
buy one,2032
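
A possibly simpler route to the same mapping, assuming `tf_model` is the fitted `CountVectorizerModel` from the earlier cell: its `vocabulary` attribute lists the terms in feature-index order, so position `i` in that list is the ngram for index `i`. The sketch below uses a made-up list in place of `tf_model.vocabulary` and made-up values in place of `feature_importance.toArray()`, so it runs without a Spark session. `top_k_terms` is a hypothetical helper.

```python
# Stand-in for tf_model.vocabulary (made-up terms; in the notebook this
# would be a list of up to 4096 ngrams ordered by feature index).
vocabulary = ["good", "tast", "worst", "love", "wast money"]

# Stand-in for feature_importance.toArray() (made-up values).
importances = [0.10, 0.05, 0.40, 0.15, 0.30]


def top_k_terms(vocabulary, importances, k):
    """Return (index, term, importance) for the k largest importances."""
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i],
                    reverse=True)
    return [(i, vocabulary[i], importances[i]) for i in ranked[:k]]


for index, term, importance in top_k_terms(vocabulary, importances, k=3):
    print(index, term, importance)
```

Because the lookup is a plain list index, it avoids the explode, UDF, and transform round trip, at the cost of collecting the vocabulary to the driver (small here, since it is capped at 4096 entries).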


In [46]:
# Find the ngrams that map to the top 20 most important features

# Note: with HashingTF, hash collisions can map several ngrams to the same index.
# Those ngrams would then share a single importance score, and there would be no way
# to separate their individual contributions to it.
# To avoid such collisions (exactly one index per ngram), CountVectorizer was used
# instead of HashingTF for the encoding.

import pyspark.sql.functions as f

top20_ngram = top20_df.join(ngram_index, on="index", how="left_outer")
display(top20_ngram.groupby("importance").agg(f.collect_list(top20_ngram.ngrams).alias("ngram")).orderBy("importance", ascending=False))

importance,ngram
0.095536671305655,List(worst)
0.0771450245716013,List(return)
0.0758230265252156,List(horribl)
0.0357526135759054,List(money)
0.0330005637210955,List(wast money)
0.0314459798847877,List(wast)
0.0313921096234848,List(never buy)
0.0305992639776906,List(disgust)
0.0218362657511012,List(delici)
0.020851544707246,List(china)
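
To make the collision point from the cell above concrete, here is a small pure-Python sketch. It uses `zlib.crc32` purely as a deterministic stand-in hash (Spark's `HashingTF` uses MurmurHash3) and a deliberately tiny bucket count, so that at least one collision is guaranteed by the pigeonhole principle. `hashed_index` is a hypothetical helper, and the term list is taken from the output above.

```python
import zlib
from collections import defaultdict

terms = ["worst", "return", "horribl", "money", "wast", "delici"]


def hashed_index(term, num_buckets):
    # crc32 is only a deterministic stand-in; HashingTF uses MurmurHash3.
    return zlib.crc32(term.encode("utf-8")) % num_buckets


# Hashing: with fewer buckets than terms, some bucket must hold several
# terms, and a per-index importance score could not tell them apart.
buckets = defaultdict(list)
for term in terms:
    buckets[hashed_index(term, num_buckets=4)].append(term)
collisions = {i: ts for i, ts in buckets.items() if len(ts) > 1}

# Explicit vocabulary (the CountVectorizer approach): one index per term.
vocab_index = {term: i for i, term in enumerate(terms)}

print("hashed buckets:", dict(buckets))
print("colliding buckets:", collisions)
print("vocabulary index:", vocab_index)
```

With a realistic number of hash buckets collisions are much rarer than in this toy setup, but any that do occur are invisible in the feature importances, which is why an explicit vocabulary is the safer choice for interpretation.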
