# Spark DataFrames continued

Let's read in two of the three data files from the Yelp academic dataset (https://www.kaggle.com/yelp-dataset/yelp-dataset) and examine the schema of each one. We're skipping the review file (review.json.gz) for this class:

In [124]:
business = spark.read.json('s3://umsi-data-science/data/yelp/business.json')

In [125]:
business.printSchema()

root
 |-- address: string (nullable = true)
 |-- attributes: struct (nullable = true)
 |    |-- AcceptsInsurance: boolean (nullable = true)
 |    |-- AgesAllowed: string (nullable = true)
 |    |-- Alcohol: string (nullable = true)
 |    |-- Ambience: struct (nullable = true)
 |    |    |-- casual: boolean (nullable = true)
 |    |    |-- classy: boolean (nullable = true)
 |    |    |-- divey: boolean (nullable = true)
 |    |    |-- hipster: boolean (nullable = true)
 |    |    |-- intimate: boolean (nullable = true)
 |    |    |-- romantic: boolean (nullable = true)
 |    |    |-- touristy: boolean (nullable = true)
 |    |    |-- trendy: boolean (nullable = true)
 |    |    |-- upscale: boolean (nullable = true)
 |    |-- BYOB: boolean (nullable = true)
 |    |-- BYOBCorkage: string (nullable = true)
 |    |-- BestNights: struct (nullable = true)
 |    |    |-- friday: boolean (nullable = true)
 |    |    |-- monday: boolean (nullable = true)
 |    |    |-- saturday: boolean (nullab

In [126]:
# review = spark.read.json('s3://umsi-data-science/data/yelp/review.json.gz')

In [127]:
# review.printSchema()

In [128]:
tip = spark.read.json('s3://umsi-data-science/data/yelp/tip.json')

In [129]:
tip.printSchema()

root
 |-- business_id: string (nullable = true)
 |-- date: string (nullable = true)
 |-- likes: long (nullable = true)
 |-- text: string (nullable = true)
 |-- user_id: string (nullable = true)

### Let's try to find the name of the business that has the highest number of "tips":

In [130]:
# Count tips per business and sort so the most-tipped businesses come first.
most_tips = tip.groupBy('business_id').count().sort('count', ascending=False)

In [131]:
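from pyspark.sql.functions import col
# Copy 'count' to 'the_count': on a collected Row, `.count` resolves to the built-in tuple method, not the column.
most_tips = most_tips.withColumn('the_count', col('count'))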

In [132]:
most_tips.show()

+--------------------+-----+---------+
|         business_id|count|the_count|
+--------------------+-----+---------+
|FaHADZARwnY4yvlvp...| 3517|     3517|
|JmI9nslLD7KZqRr__...| 2382|     2382|
|DkYS3arLOhA8si5uU...| 1474|     1474|
|5LNZ67Yw9RD6nf4_U...| 1436|     1436|
|K7lWdNUhCbcnEvI0N...| 1346|     1346|
|hihud--QRriCYZw1z...| 1287|     1287|
|RESDUcs7fIiihp38-...| 1149|     1149|
|yfxDa8RFOvJPQh0rN...| 1062|     1062|
|4JNXUYY8wbaaDmk3B...| 1038|     1038|
|iCQpiavjjPzJ5_3gP...| 1033|     1033|
|SMPbvZLSMMb7KU76Y...|  996|      996|
|7sPNbCx7vGAaH7SbN...|  981|      981|
|UPIYuRaZvknINOd1w...|  959|      959|
|eoHdUeQDNgQ6WYEnP...|  940|      940|
|yQab5dxZzgBLTEHCw...|  900|      900|
|JyxHvtj-syke7m9rb...|  888|      888|
|LNGBEEelQx4zbfWnl...|  854|      854|
|WUq8HJHIZU4uteB15...|  831|      831|
|f4x1YBxkLrZg652xt...|  800|      800|
|El4FC8jcawUVgw_0E...|  759|      759|
+--------------------+-----+---------+
only showing top 20 rows

In [133]:
# Left join keeps every business_id from most_tips and attaches the matching business details.
joined = most_tips.join(business, 'business_id', 'left').sort('the_count', ascending=False)

In [134]:
most_tips_joined = joined.select("name","the_count").filter(joined['the_count'] > 1000).collect()

In [135]:
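for b in most_tips_joined:
    # b.count is the Row's built-in tuple method, not a column, so it prints as a method object;
    # b.the_count holds the actual number of tips.
    print(b.name,b.count,b.the_count)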

(u'McCarran International Airport', <built-in method count of Row object at 0x7f534ee41af8>, 3517)
(u'Phoenix Sky Harbor International Airport', <built-in method count of Row object at 0x7f534ee41aa0>, 2382)
(u'Earl of Sandwich', <built-in method count of Row object at 0x7f534ee41a48>, 1474)
(u'The Cosmopolitan of Las Vegas', <built-in method count of Row object at 0x7f534ee41838>, 1436)
(u'Wicked Spoon', <built-in method count of Row object at 0x7f534ee41cb0>, 1346)
(u'Gangnam Asian BBQ Dining', <built-in method count of Row object at 0x7f534ee41d08>, 1287)
(u'Bacchanal Buffet', <built-in method count of Row object at 0x7f534ee41d60>, 1149)
(u'Pho Kim Long', <built-in method count of Row object at 0x7f534ee41db8>, 1062)
(u'Mon Ami Gabi', <built-in method count of Row object at 0x7f534ee41e10>, 1038)
(u'Secret Pizza', <built-in method count of Row object at 0x7f534ee41e68>, 1033)
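
The `<built-in method count ...>` entries above come from attribute lookup rather than from the data: a PySpark `Row` is a tuple subclass, so `b.count` finds the inherited `tuple.count` method before any column named `count` is considered. The sketch below reproduces that behaviour with `MiniRow`, a hypothetical and much-simplified stand-in written for illustration (it is not PySpark's class). It assumes only that field access by attribute is a fallback that runs after normal lookup fails.

```python
class MiniRow(tuple):
    """Hypothetical, simplified stand-in for a Spark Row (a tuple with named fields)."""

    def __new__(cls, **fields):
        row = super().__new__(cls, fields.values())
        row._names = list(fields)  # remember the field order
        return row

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal attribute lookup fails,
        # so inherited tuple methods such as count() and index() win.
        names = self.__dict__.get("_names", [])
        if name in names:
            return tuple.__getitem__(self, names.index(name))
        raise AttributeError(name)

    def __getitem__(self, key):
        # Allow lookup by field name as well as by position.
        if isinstance(key, str):
            return tuple.__getitem__(self, self._names.index(key))
        return tuple.__getitem__(self, key)


b = MiniRow(name="Wicked Spoon", count=1346, the_count=1346)

print(b.name)             # field lookup through __getattr__
print(callable(b.count))  # the inherited tuple.count method, not the field
print(b["count"])         # lookup by key avoids the name clash
print(b.the_count)        # the renamed copy is reachable as an attribute
```

With a real `Row`, indexing by name (for example `b['the_count']`) is another way around such clashes, provided the column was selected before `collect()`.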


## Your turn
Use a combination of Spark and plain old Python code to answer the following questions. Include code and a written response in English for each question.

### Q1. How many businesses in the data set are located in the state of Ohio (OH)?

In [136]:
business.filter(business.state == 'OH').count()

12609

### Q2. How many Pennsylvania-based businesses have a hipster ambience?

In [137]:
business.filter(business.state == 'PA').filter(business.attributes.Ambience.hipster == True).count()

71

### Q3. Which Nevada-based business has the most liked tip, and what is the text of the tip?

In [138]:
tip.join(business,'business_id','inner').sort(tip.likes,ascending=False).select('name','likes','text','state').show()

+--------------------+-----+--------------------+-----+
|                name|likes|                text|state|
+--------------------+-----+--------------------+-----+
| A Peaceful Farewell|   15|My kitty Rocky is...|   AZ|
|1st Pet Veterinar...|   12|1st Pet was very ...|   AZ|
|Department of Mot...|   11|License photograp...|   NV|
|        Baladie Café|    9|Heads up.... The ...|   NV|
|             Bomboba|    7|Don't plan on com...|   AZ|
|  Southwest Airlines|    7|Tuesday is not a ...|   NV|
|           Burgh'ers|    7|Just average! Bet...|   PA|
| Mastro's Ocean Club|    6|Request table in ...|   NV|
|      Pineapple Park|    6|Do yourself a fav...|   NV|
|KJ Dim Sum & Seafood|    6|Ordered a plate o...|   NV|
|   Let's Eat Noodles|    6|4 health code vio...|   AZ|
|           Starbucks|    6|The warmed up "bu...|   AZ|
|    Tremont Taphouse|    6|Upper-middle clas...|   OH|
|Las Vegas Day School|    6|Time to pick up j...|   NV|
|     Costco Gasoline|    5|Did you know you ...
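
The table above is sorted by likes across all states, so the Nevada answer is the first row whose state is NV. As a plain-Python cross-check, the sketch below applies that rule to a handful of rows copied from the output (names and text are truncated exactly as `show()` printed them). The `rows` list and the `top_tip_in_state` helper are illustrative names, not part of the notebook.

```python
# Rows copied from the show() output above: (name, likes, text, state).
# Strings ending in "..." were truncated by show() and are kept as printed.
rows = [
    ("A Peaceful Farewell", 15, "My kitty Rocky is...", "AZ"),
    ("1st Pet Veterinar...", 12, "1st Pet was very ...", "AZ"),
    ("Department of Mot...", 11, "License photograp...", "NV"),
    ("Baladie Café", 9, "Heads up.... The ...", "NV"),
    ("Bomboba", 7, "Don't plan on com...", "AZ"),
    ("Southwest Airlines", 7, "Tuesday is not a ...", "NV"),
]


def top_tip_in_state(rows, state):
    """Return the row with the most likes in `state`, or None if there is none."""
    in_state = [r for r in rows if r[3] == state]
    return max(in_state, key=lambda r: r[1], default=None)


best = top_tip_in_state(rows, "NV")
print(best[0], best[1], best[2])
```

In Spark the same restriction is a filter on the state column before sorting, for example `.filter(business.state == 'NV')`.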

### Q4. Excluding businesses in the state of Nevada, list the 10 businesses with the highest number of tips.

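In [139]:
# Count tips per business, attach business details, drop Nevada, and show the 10 most-tipped.
tip_counts = tip.groupBy('business_id').count()
tip_counts.join(business, 'business_id').filter(business.state != 'NV').sort('count', ascending=False).select('name', 'state', 'count').show(10)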
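
The ranking Q4 asks for has three steps: count tips per business, drop businesses in Nevada, and keep the ten largest counts. A minimal plain-Python sketch of those steps follows. The business IDs, states, and tips in it are made-up toy data, and `top_tipped_outside` is a hypothetical helper name.

```python
from collections import Counter

# Toy data (hypothetical): business_id -> state, and one business_id per tip.
business_state = {"b1": "NV", "b2": "AZ", "b3": "OH", "b4": "PA"}
tip_business_ids = ["b1", "b1", "b1", "b2", "b2", "b3", "b4", "b4", "b4", "b4"]


def top_tipped_outside(tip_ids, states, excluded_state, n=10):
    """Rank businesses by tip count, skipping those in `excluded_state`."""
    # Tips whose business is missing from `states` are kept, because their
    # (unknown) state is not the excluded one.
    counts = Counter(
        bid for bid in tip_ids if states.get(bid) != excluded_state
    )
    return counts.most_common(n)


ranking = top_tipped_outside(tip_business_ids, business_state, "NV")
for business_id, n_tips in ranking:
    print(business_id, n_tips)
```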

### Q5. List the names of the divey businesses from Ohio that have an overall rating of 4 or more stars and have at least 1000 tips.
You might want to do this in several steps.

In [140]:
# Inner join so that only tips for businesses passing the filters are counted;
# a right join onto tip would keep every tip row regardless of the filters.
divey_counts = business.filter(business.attributes.Ambience.divey == True).filter(business.state == 'OH').filter(business.stars >= 4.0).join(tip,"business_id","inner").groupBy('business_id').count()

In [141]:
divey_counts = divey_counts.withColumn('the_count',col('count'))

In [142]:
good_dives = divey_counts.filter(divey_counts.the_count >= 1000).join(business,'business_id').select('name').collect()

In [143]:
for b in good_dives:
    print(b.name)
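
The join type matters for this question. A right join onto `tip` keeps every tip row, whether or not its business survived the state, ambience, and star filters, so a subsequent `groupBy('business_id').count()` counts tips for all businesses. An inner join keeps only tips whose business passed the filters. The toy plain-Python sketch below mimics both behaviours; all IDs are invented, and `count_tips` is a hypothetical helper, not a Spark function.

```python
from collections import Counter

# Toy data (hypothetical): IDs of businesses that passed the filters,
# and the business_id attached to each tip.
filtered_business_ids = {"oh_dive"}
tip_business_ids = ["oh_dive", "nv_airport", "nv_airport", "nv_buffet"]


def count_tips(tip_ids, kept_ids, how):
    """Count tips per business_id after joining the filtered businesses to tips."""
    if how == "right":
        # Right join onto tips: every tip row survives, matched or not.
        joined = list(tip_ids)
    elif how == "inner":
        # Inner join: only tips whose business passed the filters survive.
        joined = [bid for bid in tip_ids if bid in kept_ids]
    else:
        raise ValueError("how must be 'inner' or 'right'")
    return dict(Counter(joined))


right_counts = count_tips(tip_business_ids, filtered_business_ids, "right")
inner_counts = count_tips(tip_business_ids, filtered_business_ids, "inner")
print(right_counts)
print(inner_counts)
```

In the toy data the right join still reports counts for the two Nevada IDs, while the inner join reports only the filtered Ohio business.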