# Homework 4: Big Data


This homework assignment builds on the in-class work we did with Spark.
You will be using the [Yelp Academic Dataset](https://www.kaggle.com/yelp-dataset/yelp-dataset) and focusing primarily on the text of the reviews (i.e., the review.json.gz file).

**We suggest that you work in groups to make a plan to tackle this homework assignment.**

Here are the two questions that comprise the assignment:

1. List the 50 most common non-stopword words that are unique to *positive* reviews.
2. List the 50 most common non-stopword words that are unique to *negative* reviews.

As an example, consider the following two reviews:

* Positive: The meal was great, and the service was the best we ever experienced.
* Negative: The meal was awful.  It was the worst thing we ever experienced.

Assume our stopwords are {'the', 'was', 'and', 'we', 'it'}.

* Positive unique: {'great', 'service', 'best'}

* Negative unique: {'awful', 'worst', 'thing'}

In this example, each unique word occurs just once, so the concept of "top 50" doesn't make sense.  For your data, you'll need to count the number of times each unique word occurs.
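The toy example above can be checked with a few lines of plain Python. This is only a sketch: the `content_words` helper and the regular expression used to strip punctuation are assumptions made for illustration, not part of the assignment.

```python
import re

# Stopwords from the toy example above.
STOPWORDS = {"the", "was", "and", "we", "it"}

def content_words(text):
    """Lowercase the text, pull out word-like tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}

positive = content_words(
    "The meal was great, and the service was the best we ever experienced."
)
negative = content_words(
    "The meal was awful.  It was the worst thing we ever experienced."
)

# A word is "unique" to one class if it never appears in the other class.
positive_unique = positive - negative
negative_unique = negative - positive

print(sorted(positive_unique))  # ['best', 'great', 'service']
print(sorted(negative_unique))  # ['awful', 'thing', 'worst']
```

Words such as "meal", "ever", and "experienced" drop out because they occur in both reviews.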

Because this is the final homework assignment in this course, we are leaving it up to you to operationalize most of the details.  For example, you will need to determine what constitutes a positive or a negative review.

**You should take care to document your work, preferably using markdown blocks. In-code commenting is also a good idea.**

You will also need to generate a list of stopwords.  Neither spaCy nor NLTK is available on AWS EMR, so you'll need to be creative in how you get a good list of stopwords into Spark.
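One possible approach, sketched here under assumptions, is to keep the stopwords in a plain text file with one word per line and parse it into a set. The `parse_stopwords` helper and the sample lines are hypothetical. On the cluster the lines would come from wherever you store the file, and the resulting set is small enough to be referenced directly inside the functions passed to `filter`.

```python
def parse_stopwords(lines):
    """Build a stopword set from an iterable of text lines (one word per line)."""
    words = set()
    for line in lines:
        word = line.strip().lower()
        # Skip blank lines and comment lines that start with '#'.
        if word and not word.startswith("#"):
            words.add(word)
    return frozenset(words)

# Hypothetical file contents; in practice these lines would be read from a file.
sample_lines = ["# common English stopwords", "The", "was", "", "and ", "we", "it"]
stopwords = parse_stopwords(sample_lines)

print(sorted(stopwords))  # ['and', 'it', 'the', 'was', 'we']
```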

Finally, you will notice that there are a **lot** of reviews.  You might want to work off a small sample (e.g., use the rdd.sample() function in Spark) so that you have a reduced-size dataset while you're developing your solution.
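When sampling without replacement, Spark keeps each row independently with probability equal to the requested fraction, so the sample size is only approximately fraction × N. The plain-Python sketch below mimics that behaviour on a list to show what a given fraction yields; `bernoulli_sample` is a hypothetical helper written for illustration, not a Spark function.

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(100_000))
sample = bernoulli_sample(rows, fraction=0.01, seed=42)

# The size is close to 0.01 * 100,000 = 1,000 but usually not exactly that.
print(len(sample))
```

Fixing the seed makes the sample repeatable, which helps when comparing results between development runs.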

### REMEMBER TO TERMINATE YOUR AWS CLUSTER(S) WHEN YOU'RE DONE (OR WHEN YOU TAKE A BREAK)!

Please download your work in HTML and IPYNB formats and submit both to Canvas.

In [1]:
hello = ''


Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
1,application_1556068665957_0002,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


In [2]:
business = spark.read.json('s3://umsi-data-science/data/yelp/business.json')


In [119]:
# business.printSchema()


In [120]:
review = spark.read.json('s3://umsi-data-science/data/yelp/review.json.gz')


In [121]:
# review.printSchema()


In [122]:
# review.take(5)


Filtering by stars: 4 or more stars counts as positive, 3 or fewer as negative

In [125]:
positive = review.filter(review['stars'] >= 4)


In [126]:
negative = review.filter(review['stars'] <= 3)

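The cutoffs used above treat 3-star reviews as negative. That is one possible choice; leaving them out as neutral is another. The sketch below makes the thresholds explicit so they are easy to change. `label_review` and its default thresholds are illustrative assumptions, not something the assignment prescribes.

```python
def label_review(stars, positive_min=4, negative_max=3):
    """Map a star rating to 'positive', 'negative', or 'neutral'."""
    if stars >= positive_min:
        return "positive"
    if stars <= negative_max:
        return "negative"
    return "neutral"

# Same split as the filters above: 1-3 stars negative, 4-5 stars positive.
print([label_review(s) for s in [1, 2, 3, 4, 5]])

# Alternative: treat 3-star reviews as neutral and leave them out.
print([label_review(s, negative_max=2) for s in [1, 2, 3, 4, 5]])
```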

In [123]:
# positive.take(1)


In [124]:
# negative.take(1)


Taking a 0.01% sample of each (fraction = 0.0001)

In [127]:
sample_positive = positive.sample(False, 0.0001, None)


In [128]:
sample_negative = negative.sample(False, 0.0001, None)


Converting the positive sample to lower case and splitting each review into words

In [147]:
pos_func = sample_positive.select('text').rdd.flatMap(lambda x: x)
# pos_func.take(2)


In [149]:
pos_lower = pos_func.map(lambda x: x.lower())
# pos_lower.take(2)


In [152]:
split_positive = pos_lower.flatMap(lambda x: x.split(' '))
# split_positive.take(20)


Converting the negative sample to lower case and splitting each review into words

In [153]:
neg_func = sample_negative.select('text').rdd.flatMap(lambda x: x)


In [154]:
neg_lower = neg_func.map(lambda x: x.lower())


In [155]:
split_negative = neg_lower.flatMap(lambda x: x.split(' '))

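Splitting on a single space leaves punctuation attached to words and lets bare symbols through as tokens. A regular-expression tokenizer is one possible alternative. The pattern below is an assumption about what should count as a word (letters, with optional internal apostrophes), and `tokenize` is a hypothetical helper that could be passed to `flatMap` in place of the `split(' ')` lambda.

```python
import re

# Letters, optionally joined by apostrophes (e.g. "don't"); an assumed definition of a word.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def tokenize(text):
    """Lowercase the text and return its word tokens without punctuation."""
    return WORD_RE.findall(text.lower())

review = "Loved it. Great food & service - don't miss it!"

print(review.lower().split(' '))
# ['loved', 'it.', 'great', 'food', '&', 'service', '-', "don't", 'miss', 'it!']

print(tokenize(review))
# ['loved', 'it', 'great', 'food', 'service', "don't", 'miss', 'it']
```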

Defining stop words, based on the Python STOP_WORDS list from HW2

In [167]:
stop_words = ['perhaps', 'about', 'bottom', 'else', 'also', 'afterwards', 'might', 'along', 'none', 'of', 'themselves', 'beforehand', 'therein', 'yourselves', 'against', 'various', 'often', 'already', 'being', 'out', 'does', 'full', 'is', 'few', 'must', 'myself', 'thereupon', 'these', 'but', 'this', 'we', 'within', 'cannot', 'over', 'show', 'would', 'becoming', 'something', 'whereas', 'give', 'serious', 'rather', 'although', 'either', 'front', 'himself', 'his', 'it', 'through', 'via', 'so', 'whoever', 'an', 'wherever', 'keep', 'somewhere', 'last', 're', 'both', 'you', 'becomes', 'done', 'make', 'latter', 'many', 'other', 'hence', 'doing', 'moreover', 'am', 'everyone', 'someone', 'among', 'empty', 'whence', 'yourself', 'least', 'thru', 'how', 'beside', 'mostly', 'as', 'former', 'name', 'ten', 'any', 'what', 'amongst', 'ourselves', 'hereafter', 'its', 'without', 'amount', 'from', 'anyone', 'nevertheless', 'nobody', 'did', 'whose', 'alone', 'back', 'still', 'whereafter', 'just', 'behind', 'quite', 'besides', 'say', 'most', 'third', 'thereby', 'side', 'three', 'onto', 'was', 'eleven', 'on', 'below', 'why', 'and', 'put', 'anyhow', 'are', 'same', 'twenty', 'fifty', 'yet', 'beyond', 'be', 'elsewhere', 'whatever', 'part', 'enough', 'five', 'hundred', 'their', 'where', 'once', 'thereafter', 'anything', 'such', 'call', 'unless', 'between', 'regarding', 'or', 'six', 'move', 'upon', 'due', 'around', 'itself', 'i', 'well', 'toward', 'whether', 'therefore', 'made', 'indeed', 'used', 'across', 'for', 'anyway', 'though', 'together', 'others', 'to', 'there', 'thus', 'than', 'throughout', 'whenever', 'him', 'all', 'however', 'ever', 'us', 'only', 'whereupon', 'had', 'one', 'hers', 'off', 'my', 'those', 'whereby', 'who', 'above', 'a', 'mine', 'she', 'whole', 'become', 'ours', 'several', 'nor', 'some', 'seemed', 'hereby', 'he', 'now', 'before', 'everything', 'do', 'next', 'always', 'never', 'seems', 'should', 'own', 'formerly', 'here', 'not', 'can', 'nowhere', 'could', 'really', 
'sometime', 'take', 'first', 'them', 'top', 'twelve', 'whom', 'with', 'then', 'go', 'when', 'which', 'will', 'towards', 'your', 'latterly', 'under', 'anywhere', 'since', 'if', 'up', 'further', 'until', 'sometimes', 'using', 'down', 'while', 'see', 'herein', 'ca', 'eight', 'meanwhile', 'yours', 'has', 'after', 'no', 'her', 'have', 'except', 'every', 'again', 'seem', 'into', 'much', 'thence', 'in', 'very', 'became', 'forty', 'nine', 'two', 'otherwise', 'fifteen', 'by', 'the', 'too', 'sixty', 'wherein', 'at', 'each', 'get', 'during', 'whither', 'me', 'somehow', 'because', 'please', 'almost', 'even', 'noone', 'less', 'may', 'more', 'neither', 'another', 'been', 'namely', 'our', 'nothing', 'four', 'hereupon', 'seeming', 'that', 'per', 'they', 'were', 'everywhere', 'herself', '']


Filtering out stop words, then giving each remaining word a key and a count

In [169]:
counts_positive = split_positive.filter(lambda x: x not in stop_words) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
# counts_positive.take(5)


In [170]:
counts_negative = split_negative.filter(lambda x: x not in stop_words) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
# counts_negative.take(5)

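Because `stop_words` is a list, `x not in stop_words` scans it for every token. Converting the list to a set makes each lookup constant time on average, which matters once the full dataset is used. The plain-Python sketch below shows the same filter-and-count logic with a set; the tiny stopword list and token stream are made up for illustration.

```python
from collections import Counter

# Tiny, made-up stand-ins for the real stop word list and token stream.
stop_words_list = ["the", "was", "and", "we", "it", ""]
stop_words_set = frozenset(stop_words_list)  # constant-time membership tests

tokens = ["the", "meal", "was", "great", "", "and", "the", "service", "was", "great"]

# Equivalent of filter -> map(word, 1) -> reduceByKey(add) on a local list.
counts = Counter(t for t in tokens if t not in stop_words_set)

print(counts["great"], counts["meal"], counts["service"])  # 2 1 1
```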

Sorting the positive and negative words by their count

In [171]:
count_positive_top = counts_positive.sortBy(lambda x: x[1], ascending=False)


In [172]:
count_negative_top = counts_negative.sortBy(lambda x: x[1], ascending=False)


In [173]:
# count_positive_top.take(50)


In [174]:
# count_negative_top.take(50)


Creating dataframe to see top 50 for positive and negative

In [175]:
spark.createDataFrame(count_positive_top).show(50)



+----------+---+
|        _1| _2|
+----------+---+
|     great|167|
|     place|123|
|      like|117|
|      good|110|
|      it's| 97|
|      food| 94|
|      time| 74|
|      i've| 71|
|   service| 70|
|      love| 66|
|      best| 62|
|    little| 60|
|definitely| 52|
|      went| 51|
|       got| 51|
| recommend| 48|
|     staff| 48|
|      come| 47|
|      nice| 47|
|       try| 46|
|       i'm| 43|
|     don't| 43|
|      came| 40|
|         -| 40|
|    pretty| 36|
|  friendly| 34|
|      want| 33|
|      menu| 33|
|    highly| 33|
|       new| 33|
|    prices| 33|
|      feel| 33|
|       it.| 31|
|   ordered| 31|
|     think| 31|
|       bit| 31|
|    people| 30|
|     night| 30|
|      sure| 30|
|         &| 30|
|   chicken| 30|
|       lot| 29|
|restaurant| 29|
|    places| 28|
|   getting| 27|
|     order| 27|
|     right| 26|
|   amazing| 26|
|    better| 26|
|   looking| 25|
+----------+---+
only showing top 50 rows

In [176]:
spark.createDataFrame(count_negative_top).show(50)


+----------+---+
|        _1| _2|
+----------+---+
|      food| 86|
|      like| 85|
|     place| 72|
|      good| 69|
|     don't| 62|
|   service| 58|
|   ordered| 55|
|         -| 46|
|    didn't| 46|
|       got| 45|
|      it's| 45|
|      came| 40|
|     order| 38|
|     asked| 37|
|      time| 37|
|    people| 34|
|     great| 34|
|       i'm| 33|
|     think| 33|
|      know| 33|
|      said| 32|
|      nice| 32|
|    pretty| 32|
|      took| 31|
|   chicken| 30|
|    little| 30|
|      went| 29|
|restaurant| 28|
|     going| 27|
|      i've| 27|
|      want| 26|
|       way| 26|
|   minutes| 25|
|       it.| 24|
|  customer| 24|
|     sauce| 24|
|      left| 23|
|   quality| 23|
|         2| 22|
|     staff| 22|
|     right| 21|
|       hot| 21|
|   waiting| 20|
|      away| 19|
|    that's| 19|
|         &| 19|
|    drinks| 18|
|      sure| 18|
|    called| 18|
|    served| 18|
+----------+---+
only showing top 50 rows
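The two tables above share many words (food, like, place, good), so neither is yet a list of words unique to one class. One way to finish, sketched here, is to drop every word that also appears in the other class's counts before taking the top 50. With the RDDs defined above, `counts_positive.subtractByKey(counts_negative)` would be the Spark counterpart of this step. The example uses plain dictionaries with made-up counts, and `top_unique` is a hypothetical helper.

```python
def top_unique(counts_a, counts_b, n=50):
    """Return the n most frequent words in counts_a that never occur in counts_b."""
    unique = {word: count for word, count in counts_a.items() if word not in counts_b}
    # Sort by count (descending), breaking ties alphabetically for stable output.
    return sorted(unique.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Made-up counts for illustration only.
positive_counts = {"great": 5, "food": 4, "amazing": 3, "friendly": 3}
negative_counts = {"food": 6, "great": 1, "rude": 4, "cold": 2}

print(top_unique(positive_counts, negative_counts))  # [('amazing', 3), ('friendly', 3)]
print(top_unique(negative_counts, positive_counts))  # [('rude', 4), ('cold', 2)]
```

Note that with a very small sample, many words will look unique simply because they are rare, so the result is more meaningful on a larger fraction of the data.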