# Personal Dataset Analysis

## Yelp Dataset
https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset

In [1]:
from pyspark.sql.functions import *
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, BooleanType, LongType, DecimalType
from operator import add

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
89,application_1671409217564_0111,pyspark3,idle,Link,Link,✔


SparkSession available as 'spark'.


In [2]:
yelp_directory = "/yelp/"

business_file = yelp_directory + "yelp_academic_dataset_business.json"
checkin_file = yelp_directory + "yelp_academic_dataset_checkin.json"
review_file = yelp_directory + "yelp_academic_dataset_review.json"
tip_file = yelp_directory + "yelp_academic_dataset_tip.json"
user_file = yelp_directory + "yelp_academic_dataset_user.json"

In [3]:
business_df = spark.read.json(business_file)
checkin_df = spark.read.json(checkin_file)
tip_df = spark.read.json(tip_file)
user_df = spark.read.json(user_file)

In [4]:
business_df.printSchema()

root
 |-- address: string (nullable = true)
 |-- attributes: struct (nullable = true)
 |    |-- AcceptsInsurance: string (nullable = true)
 |    |-- AgesAllowed: string (nullable = true)
 |    |-- Alcohol: string (nullable = true)
 |    |-- Ambience: string (nullable = true)
 |    |-- BYOB: string (nullable = true)
 |    |-- BYOBCorkage: string (nullable = true)
 |    |-- BestNights: string (nullable = true)
 |    |-- BikeParking: string (nullable = true)
 |    |-- BusinessAcceptsBitcoin: string (nullable = true)
 |    |-- BusinessAcceptsCreditCards: string (nullable = true)
 |    |-- BusinessParking: string (nullable = true)
 |    |-- ByAppointmentOnly: string (nullable = true)
 |    |-- Caters: string (nullable = true)
 |    |-- CoatCheck: string (nullable = true)
 |    |-- Corkage: string (nullable = true)
 |    |-- DietaryRestrictions: string (nullable = true)
 |    |-- DogsAllowed: string (nullable = true)
 |    |-- DriveThru: string (nullable = true)
 |    |-- GoodForDancing: str

In [5]:
checkin_df.printSchema()

root
 |-- business_id: string (nullable = true)
 |-- date: string (nullable = true)

In [7]:
tip_df.printSchema()

root
 |-- business_id: string (nullable = true)
 |-- compliment_count: long (nullable = true)
 |-- date: string (nullable = true)
 |-- text: string (nullable = true)
 |-- user_id: string (nullable = true)

In [8]:
user_df.printSchema()

root
 |-- average_stars: double (nullable = true)
 |-- compliment_cool: long (nullable = true)
 |-- compliment_cute: long (nullable = true)
 |-- compliment_funny: long (nullable = true)
 |-- compliment_hot: long (nullable = true)
 |-- compliment_list: long (nullable = true)
 |-- compliment_more: long (nullable = true)
 |-- compliment_note: long (nullable = true)
 |-- compliment_photos: long (nullable = true)
 |-- compliment_plain: long (nullable = true)
 |-- compliment_profile: long (nullable = true)
 |-- compliment_writer: long (nullable = true)
 |-- cool: long (nullable = true)
 |-- elite: string (nullable = true)
 |-- fans: long (nullable = true)
 |-- friends: string (nullable = true)
 |-- funny: long (nullable = true)
 |-- name: string (nullable = true)
 |-- review_count: long (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)
 |-- yelping_since: string (nullable = true)

# Reservation Required
## Is it worth it to make a reservation?

When you think of restaurants that require a reservation, you might think "Damn, this place must be good since I need to make a reservation just to get in!"

But could the required reservation just be a mind game? Are restaurants that require a reservation better than restaurants that you can walk into? Are they more expensive? Is it worth it to make a reservation?

Let's find out!

### First, let's filter so we focus on the restaurants. It makes sense for other businesses like hotels and doctors to be by appointment only. We are only interested in restaurants.

In [14]:
restaurant_df = business_df.filter(business_df.categories.contains("Restaurants"))
restaurant_df = restaurant_df.select(restaurant_df.name, restaurant_df.review_count, restaurant_df.stars, restaurant_df.attributes)

# The version of PySpark on Azure does not have the isNotNull() function, instead use the negation of isnull

restaurant_df = restaurant_df.filter(~isnull(restaurant_df.attributes.RestaurantsReservations))
restaurant_df = restaurant_df.filter(~isnull(restaurant_df.attributes.RestaurantsPriceRange2))
restaurant_df = restaurant_df.filter(~restaurant_df.attributes.RestaurantsReservations.isin("None"))

In [15]:
restaurant_df.show()

+--------------------+------------+-----+--------------------+
|                name|review_count|stars|          attributes|
+--------------------+------------+-----+--------------------+
|      Sonic Drive-In|           6|  2.0|{null, null, u'no...|
|Tsevi's Pub And G...|          19|  3.0|{null, null, u'fu...|
|      Sonic Drive-In|          10|  1.5|{null, null, u'no...|
|             Denny's|          28|  2.5|{null, null, 'non...|
|Zio's Italian Market|         100|  4.5|{null, null, u'no...|
|            Tuna Bar|         245|  4.0|{null, null, 'ful...|
|                 BAP|         205|  4.5|{null, null, u'no...|
|Roast Coffeehouse...|          40|  4.0|{null, null, u'be...|
|Romano's Macaroni...|         339|  2.5|{null, null, 'ful...|
|           Super Dog|           6|  4.0|{null, null, u'be...|
|             Bar One|          65|  4.0|{null, null, u'fu...|
|    DeSandro on Main|          41|  3.0|{null, null, u'no...|
|       Ardmore Pizza|         109|  3.5|{null, null, u

### Notice that some restaurants have a very small review count, which might skew the data. Let's filter out restaurants with fewer than 10 reviews.

In [6]:
restaurant_df = restaurant_df.filter(restaurant_df.review_count >= 10)

## Do the best restaurants on Yelp tend to require reservations?

While it's difficult to define "best", we can look at the stars and the number of reviews. A restaurant with a small number of reviews can have a very high star score but not be the best. Let's filter for restaurants which have at least 1,000 reviews, which means they are very well known. Then let's rank them by star rating.

In [26]:
restaurant_df_star_sort = restaurant_df.select(restaurant_df.name, restaurant_df.review_count, restaurant_df.stars, restaurant_df.attributes.RestaurantsReservations, restaurant_df.attributes.RestaurantsPriceRange2)
restaurant_df_star_sort = restaurant_df_star_sort.filter(restaurant_df_star_sort.review_count >= 1000)
restaurant_df_star_sort = restaurant_df_star_sort.sort(desc(restaurant_df.stars))


restaurant_df_star_sort.show()

+--------------------+------------+-----+----------------------------------+---------------------------------+
|                name|review_count|stars|attributes.RestaurantsReservations|attributes.RestaurantsPriceRange2|
+--------------------+------------+-----+----------------------------------+---------------------------------+
| Anthonino's Taverna|        1007|  4.5|                              True|                                2|
|         Noble Crust|        1259|  4.5|                              True|                                2|
|          Two Chicks|        1489|  4.5|                              True|                                2|
|         Bakersfield|        1215|  4.5|                             False|                                2|
|  Commander's Palace|        4876|  4.5|                              True|                                3|
| Bogart's Smokehouse|        1421|  4.5|                             False|                                2|
|

In these top 20 restaurants, we can see that half require reservations and half do not. The ones that do require reservations also tend to be more expensive. It seems the highest star rating that a restaurant on Yelp with more than 1000 reviews has is 4.5. Let's focus on this set of restaurants, which have 1000+ reviews and 4.5 stars, and see what percentage of them require reservations.

In [44]:
total_true_count = restaurant_df.filter(col("attributes.RestaurantsReservations") == "True").count()
total_false_count = restaurant_df.filter(col("attributes.RestaurantsReservations") == "False").count()

best_df = restaurant_df.filter(restaurant_df.review_count >= 1000)
best_df = best_df.filter(best_df.stars == 4.5)

best_true_count = best_df.filter(col("attributes.RestaurantsReservations") == "True").count()
best_false_count = best_df.filter(col("attributes.RestaurantsReservations") == "False").count()

print("Total true: " + str(total_true_count))
print("Total false: " + str(total_false_count))
print("Total proportion required: " + str(total_true_count / (total_true_count + total_false_count)))
print("\n")
print("Best true: " + str(best_true_count))
print("Best false: " + str(best_false_count))
print("Best proportion required: " + str(best_true_count / (best_true_count + best_false_count)))

Total true: 13343
Total false: 26660
Total proportion required: 0.33354998375121864


Best true: 48
Best false: 56
Best proportion required: 0.46153846153846156

## Answer: Yes, the best restaurants are more likely to require reservations.

According to our analysis, while almost exactly a third (33%) of restaurants on Yelp require reservations, 46% of the "best" restaurants (as defined above) require a reservation. This is evidence for the argument that "Yes, making a reservation is worth it."
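
One way to gauge whether the gap between 46% and 33% is more than noise is a two-proportion z-test. The sketch below is our own addition under stated assumptions, not part of the analysis above: it reuses the counts printed in the previous cell, compares the 104 "best" restaurants against all remaining restaurants, and treats the two groups as independent samples. The function name `two_proportion_z` is hypothetical.

```python
import math

def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Return the z statistic for the difference between two proportions."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    # Pooled proportion under the null hypothesis that both groups share one rate
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / std_err

# Counts copied from the output above
total_true, total_false = 13343, 26660
best_true, best_false = 48, 56

# Remove the "best" restaurants from the totals so the two groups do not overlap
rest_true = total_true - best_true
rest_false = total_false - best_false

z = two_proportion_z(best_true, best_true + best_false,
                     rest_true, rest_true + rest_false)
print(f"z = {z:.2f}")
```

Under these assumptions, a z value above roughly 1.96 corresponds to a difference that is significant at the 5% level (two-sided).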

## Do the most popular restaurants on Yelp tend to require reservations?

In the previous analysis, we looked at what we defined as the "best" restaurants on Yelp, which was a combination of star rating and popularity. What if we just looked at popularity (number of reviews) instead?

In [17]:
restaurant_df_count_sort = restaurant_df.select(restaurant_df.name, restaurant_df.review_count, restaurant_df.stars, restaurant_df.attributes.RestaurantsReservations, restaurant_df.attributes.RestaurantsPriceRange2)
restaurant_df_count_sort = restaurant_df_count_sort.sort(desc(restaurant_df.review_count))

restaurant_df_count_sort.show()

+--------------------+------------+-----+----------------------------------+---------------------------------+
|                name|review_count|stars|attributes.RestaurantsReservations|attributes.RestaurantsPriceRange2|
+--------------------+------------+-----+----------------------------------+---------------------------------+
|   Acme Oyster House|        7568|  4.0|                             False|                                2|
|        Oceana Grill|        7400|  4.0|                              True|                                2|
|Hattie B’s Hot Ch...|        6093|  4.5|                             False|                                2|
|Reading Terminal ...|        5721|  4.5|                             False|                                2|
|Ruby Slipper - Ne...|        5193|  4.5|                             False|                                2|
| Mother's Restaurant|        5185|  3.5|                             False|                                2|
|

### Answer: Looking at the 20 most reviewed restaurants on Yelp, we see that most of them do not require reservations. The most popular restaurant does not require reservations, but the second most popular restaurant does. Only 5 out of the 20 most popular restaurants on Yelp require reservations.

## Are restaurants that require reservations better than restaurants that do not?

So far, we have looked at the best and most popular restaurants on Yelp. But realistically, unless you live in the city where these restaurants are, you won't go out of your way to check them out. Now let's do an analysis on every restaurant in the dataset to see if, on average, restaurants which require reservations have a higher star rating than those that you can just walk into.

In [7]:
restaurant_df_grouped = restaurant_df.groupBy(restaurant_df.attributes.RestaurantsReservations)

star_average = restaurant_df_grouped.avg("stars")

star_average.show()

+-----------------------------------+------------------+
|attributes[RestaurantsReservations]|        avg(stars)|
+-----------------------------------+------------------+
|                              False| 3.407629579028264|
|                               True|3.6450354039266175|
+-----------------------------------+------------------+

## Answer: Yes, restaurants that require reservations have an average star rating of 3.65 while restaurants that do not have an average star rating of 3.41

## Are restaurants that require reservations more expensive than restaurants that do not?

Sure, restaurants that require reservations are slightly better on average. That does not necessarily mean that they are worth making a reservation for if they are much more expensive. Let's compare restaurants that do and don't require reservations on price.

In [9]:
restaurant_price_df = restaurant_df.withColumn("price", col("attributes.RestaurantsPriceRange2").cast("integer"))
restaurant_df_grouped_res = restaurant_price_df.groupBy(restaurant_df.attributes.RestaurantsReservations)
price_average = restaurant_df_grouped_res.avg("price")
price_average.show()

+-----------------------------------+------------------+
|attributes[RestaurantsReservations]|        avg(price)|
+-----------------------------------+------------------+
|                              False|1.4293522353644477|
|                               True| 2.021891348088531|
+-----------------------------------+------------------+

## Answer: Yes, restaurants that require reservations have an average price rating of 2.02 while restaurants that do not have an average price rating of 1.43

#### Let's take the average review count of each price range

I had originally taken the average of review_count instead of stars by mistake, but the result is interesting. It shows that, among restaurants which require you to book in advance, the lower the price range, the fewer people review (and presumably book) them.

In [57]:
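# NOTE: sortedByPrice is not defined in the cells shown here. Judging by the surrounding text,
# it holds the restaurants that require reservations, with a price_range column.
averageByPriceRange = sortedByPrice.groupBy(sortedByPrice.price_range)\
                            .agg(avg(sortedByPrice.review_count))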

averageByPriceRange.show()

+-----------+------------------+
|price_range| avg(review_count)|
+-----------+------------------+
|          3| 108.9635761589404|
|          1|             52.88|
|          4|109.41379310344827|
|          2| 60.86011191047162|
+-----------+------------------+

#### Let's take the average star rating of each price range

Surprisingly, customers tend to be more satisfied with the cheaper restaurants, at least among restaurants where they have to reserve a table.

It makes sense: when people book a table at an expensive restaurant, they have much higher expectations. Not only should the food look nice, but the restaurant itself should look a certain way, and so on.
Plus, since they are paying more, they might feel more entitled.

In [58]:
averageStarsByPriceRange = sortedByPrice.groupBy(sortedByPrice.price_range)\
                            .agg(avg(sortedByPrice.stars))

averageStarsByPriceRange.show()

+-----------+------------------+
|price_range|        avg(stars)|
+-----------+------------------+
|          4| 3.896551724137931|
|          3|3.8609271523178808|
|          2| 4.167466027178257|
|          1|             4.136|
+-----------+------------------+

# Bitcoin Accepted

Let's look at businesses which accept Bitcoin as payment. Let's find out the most common types of businesses which accept Bitcoin as a payment method, and do some further analysis on them: how well businesses in those categories are rated, how many people have reviewed them, and so on.

In [97]:
bitcoinAcceptedBusinesses = business_df.select(business_df.name, business_df.categories, business_df.review_count, business_df.stars, business_df.state)\
                                .filter(business_df.attributes.BusinessAcceptsBitcoin == "True")\
                                .cache()
bitcoinAcceptedBusinesses.show()

+--------------------+--------------------+------------+-----+-----+
|                name|          categories|review_count|stars|state|
+--------------------+--------------------+------------+-----+-----+
|Phone Repair Phil...|Shopping, Electro...|         106|  4.5|   PA|
|One Hour Heating ...|Heating & Air Con...|          20|  2.5|   FL|
|Taj Mahal Homesty...|Vegan, Pakistani,...|         280|  4.0|   ID|
|         Molly Mover|Movers, Local Ser...|          15|  2.0|   ID|
|   Basimo Beach Cafe|Breakfast & Brunc...|         339|  4.0|   FL|
|          Gizmo Pros|Shopping, Mobile ...|          13|  4.5|   FL|
|Mechanic/Bicycle ...|Bikes, Shopping, ...|           7|  5.0|   PA|
|      Concord Steaks|Pizza, Barbeque, ...|          33|  3.5|   PA|
|     Wizard Concrete|Masonry/Concrete,...|           5|  2.0|   FL|
|Bill Mullen Electric|Generator Install...|           9|  4.5|   PA|
|   Apple Repair Shop|Mobile Phone Acce...|         108|  3.0|   LA|
|    Satsuma Realtors|Real Estate 

## Which states have the most businesses that accept Bitcoin?



In [77]:
# count() already names its output column "count", so no alias is needed
mostCryptoFriendlyStates = bitcoinAcceptedBusinesses.groupBy(bitcoinAcceptedBusinesses.state)\
                                .count()\
                                .sort(desc(col("count")))
mostCryptoFriendlyStates.show()

+-----+-----+
|state|count|
+-----+-----+
|   PA|  122|
|   FL|  118|
|   CA|   38|
|   LA|   34|
|   TN|   30|
|   MO|   29|
|   IN|   25|
|   NV|   20|
|   ID|   17|
|   NJ|   17|
|   AZ|   16|
|   DE|    4|
+-----+-----+

## Answer: Pennsylvania and Florida

This answer is surprising because we expected California to have the most businesses that accept Bitcoin. Not only is California the most populous state, but we are also the tech capital of the world. It is possible that California has more businesses that accept Bitcoin but that it is not reflected on Yelp.
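
To put these counts in perspective, the sketch below turns them into shares of all Bitcoin-accepting businesses in the dataset. It only uses the numbers from the table above. It does not account for how many businesses each state has in the dataset overall, which we would need in order to say which state is the most Bitcoin-friendly per business.

```python
# Counts copied from the table above (Bitcoin-accepting businesses per state)
state_counts = {
    "PA": 122, "FL": 118, "CA": 38, "LA": 34, "TN": 30, "MO": 29,
    "IN": 25, "NV": 20, "ID": 17, "NJ": 17, "AZ": 16, "DE": 4,
}

total = sum(state_counts.values())
shares = {state: count / total for state, count in state_counts.items()}

# Print each state's share, largest first
for state, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{state}: {share:.1%}")

top_two_share = shares["PA"] + shares["FL"]
print(f"PA + FL combined: {top_two_share:.1%}")
```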

#### Let's see the average rating of businesses in Pennsylvania and Florida

PA does not look promising, but we do not know about FL or other states

In [78]:
avgRatingPa =  business_df.select(business_df.name, business_df.review_count, business_df.stars)\
                                .filter(business_df.state == "PA")\
                                .agg(avg(business_df.stars))

avgRatingPa.show()

+------------------+
|        avg(stars)|
+------------------+
|3.5730191838773173|
+------------------+

#### Surprisingly, businesses in FL are rated slightly higher than businesses in PA

In [79]:
avgRatingFl =  business_df.select(business_df.name, business_df.review_count, business_df.stars)\
                                .filter(business_df.state == "FL")\
                                .agg(avg(business_df.stars))

avgRatingFl.show()

+------------------+
|        avg(stars)|
+------------------+
|3.6109570831750855|
+------------------+

#### What about the rest?

Again, we expected the average rating elsewhere to be slightly higher. We should not have judged FL and PA too soon!

In [80]:
avgRatingRest =  business_df.select(business_df.name, business_df.review_count, business_df.stars)\
                                .filter(business_df.state != "FL")\
                                .filter(business_df.state != "PA")\
                                .agg(avg(business_df.stars))

avgRatingRest.show()

+------------------+
|        avg(stars)|
+------------------+
|3.6015259455194104|
+------------------+

### Coming back to all of the businesses accepting Bitcoin

Let's first get the average number of reviews for these places.

It doesn't seem very high: an average of 52.58 reviews per business.

In [83]:
avgNumberOfReviewsForBusinessesAcceptingBitcoin = bitcoinAcceptedBusinesses\
                                .agg(avg(bitcoinAcceptedBusinesses.review_count))\
                                .cache()

avgNumberOfReviewsForBusinessesAcceptingBitcoin.show()

+-----------------+
|avg(review_count)|
+-----------------+
|52.58297872340425|
+-----------------+

#### Let's see the average rating

The average rating is surprisingly high

In [84]:
avgRatingForBusinessesAcceptingBitcoin = bitcoinAcceptedBusinesses\
                                .agg(avg(bitcoinAcceptedBusinesses.stars))\
                                .cache()

avgRatingForBusinessesAcceptingBitcoin.show()

+-----------------+
|       avg(stars)|
+-----------------+
|4.140425531914894|
+-----------------+

#### What are the states with the highest ratings? Which states are keeping people's faith in crypto as a payment method?

California tops this list with Tennessee in second.

In [93]:
averageRatingPerState = bitcoinAcceptedBusinesses.groupBy(bitcoinAcceptedBusinesses.state)\
                                .agg(avg(bitcoinAcceptedBusinesses.stars).alias("avg"))\
                                .sort(desc(col("avg")))
        
averageRatingPerState.show()

+-----+------------------+
|state|               avg|
+-----+------------------+
|   CA| 4.605263157894737|
|   TN| 4.416666666666667|
|   NJ| 4.294117647058823|
|   LA| 4.279411764705882|
|   ID| 4.264705882352941|
|   PA| 4.200819672131147|
|   AZ|             4.125|
|   IN|               4.1|
|   NV|             4.025|
|   FL|3.9322033898305087|
|   DE|              3.75|
|   MO| 3.689655172413793|
+-----+------------------+
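
These per-state averages rest on very different sample sizes, so the ranking is easier to read with the business counts next to it. The sketch below joins the two tables above by hand and checks that the count-weighted mean of the state averages reproduces the overall average of about 4.14 reported earlier. All numbers are copied from the outputs; nothing new is computed from the dataset.

```python
# (state, number of Bitcoin-accepting businesses, average stars), copied from the tables above
per_state = [
    ("CA", 38, 4.605263157894737),
    ("TN", 30, 4.416666666666667),
    ("NJ", 17, 4.294117647058823),
    ("LA", 34, 4.279411764705882),
    ("ID", 17, 4.264705882352941),
    ("PA", 122, 4.200819672131147),
    ("AZ", 16, 4.125),
    ("IN", 25, 4.1),
    ("NV", 20, 4.025),
    ("FL", 118, 3.9322033898305087),
    ("DE", 4, 3.75),
    ("MO", 29, 3.689655172413793),
]

# Show the sample size behind each average
for state, n, avg_stars in per_state:
    print(f"{state}: {avg_stars:.2f} stars over {n} businesses")

total_businesses = sum(n for _, n, _ in per_state)
weighted_avg = sum(n * avg_stars for _, n, avg_stars in per_state) / total_businesses
print(f"count-weighted average: {weighted_avg:.4f}")
```

Delaware's average, for example, comes from only 4 businesses, so its position in the ranking should be read with caution.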

### What are the most common types of businesses that accept Bitcoin?

In [116]:
bitcoinAcceptedBusinessesCategoryMap = bitcoinAcceptedBusinesses\
                                            .select(split(col("categories"), ",")\
                                            .alias("category_array"), col("stars"))\
                                            .drop(bitcoinAcceptedBusinesses.categories)\
                                            .cache()
                                                                            
bitcoinAcceptedBusinessesCategoryMap.show()

+--------------------+-----+
|      category_array|stars|
+--------------------+-----+
|[Shopping,  Elect...|  4.5|
|[Heating & Air Co...|  2.5|
|[Vegan,  Pakistan...|  4.0|
|[Movers,  Local S...|  2.0|
|[Breakfast & Brun...|  4.0|
|[Shopping,  Mobil...|  4.5|
|[Bikes,  Shopping...|  5.0|
|[Pizza,  Barbeque...|  3.5|
|[Masonry/Concrete...|  2.0|
|[Generator Instal...|  4.5|
|[Mobile Phone Acc...|  3.0|
|[Real Estate Serv...|  5.0|
|[Specialty Food, ...|  4.5|
|[Home Services,  ...|  5.0|
|[Food,  Street Ve...|  4.0|
|[Health & Medical...|  4.5|
|[IT Services & Co...|  2.5|
|[Beauty & Spas,  ...|  3.5|
|[Home Services,  ...|  4.5|
|[Automotive,  Sel...|  1.5|
+--------------------+-----+
only showing top 20 rows

#### We now want to explode this dataframe

In [117]:
explodedDf = bitcoinAcceptedBusinessesCategoryMap\
                    .withColumn("categories", explode("category_array"))\
                    .drop(bitcoinAcceptedBusinessesCategoryMap.category_array)\
                    .cache()
explodedDf.show()

+-----+--------------------+
|stars|          categories|
+-----+--------------------+
|  4.5|            Shopping|
|  4.5|  Electronics Repair|
|  4.5|      Local Services|
|  4.5| IT Services & Co...|
|  4.5| Mobile Phone Repair|
|  4.5| Mobile Phone Acc...|
|  4.5|       Mobile Phones|
|  2.5|Heating & Air Con...|
|  2.5|       Home Services|
|  4.0|               Vegan|
|  4.0|           Pakistani|
|  4.0|              Indian|
|  4.0|          Vegetarian|
|  4.0|         Restaurants|
|  2.0|              Movers|
|  2.0|      Local Services|
|  2.0|       Home Services|
|  2.0|    Packing Services|
|  2.0|        Self Storage|
|  4.0|  Breakfast & Brunch|
+-----+--------------------+
only showing top 20 rows

In [125]:
# count() already names its output column "count", so no alias is needed
groupedByCategories = explodedDf.groupBy(col("categories"))\
                                .count()\
                                .cache()
groupedByCategories.show()

+--------------------+-----+
|          categories|count|
+--------------------+-----+
|    Furniture Stores|    3|
|              Korean|    2|
| Boudoir Photography|    1|
|       Data Recovery|    3|
|            Day Spas|    3|
| Children's Clothing|    1|
|             Noodles|    2|
|               Reiki|    1|
|         Hobby Shops|    1|
|            Handyman|    2|
|  Damage Restoration|    2|
|       Beauty & Spas|   19|
|         Hobby Shops|    2|
|            Japanese|    3|
|  Convenience Stores|    1|
|               Salad|    3|
|        IV Hydration|    1|
|    Shipping Centers|    4|
|          Bookstores|    1|
| Cannabis Dispens...|    1|
+--------------------+-----+
only showing top 20 rows

In [126]:
mostFrequentCategories = groupedByCategories\
                            .sort(desc(col("count")))\
                            .cache()

mostFrequentCategories.show()

+--------------------+-----+
|          categories|count|
+--------------------+-----+
|       Home Services|  117|
|      Local Services|  101|
|            Shopping|  100|
|         Restaurants|   94|
|                Food|   59|
| Professional Ser...|   40|
|       Beauty & Spas|   38|
| IT Services & Co...|   37|
|           Nightlife|   36|
|       Home Services|   35|
| Event Planning &...|   35|
|         Contractors|   32|
|      Local Services|   32|
|                Bars|   29|
|  Electronics Repair|   29|
|         Restaurants|   29|
| Mobile Phone Repair|   28|
|       Home Cleaning|   22|
|        Coffee & Tea|   22|
|          Sandwiches|   20|
+--------------------+-----+
only showing top 20 rows
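
Note that `Home Services`, `Local Services`, and `Restaurants` each appear twice in the table above. A likely cause is that the category strings are separated by a comma followed by a space, so splitting on `","` alone leaves a leading space on every category after the first (the double spaces in the `category_array` output point the same way). The pure-Python sketch below reproduces the effect on hypothetical sample strings and shows that stripping whitespace merges the duplicates. The sample data and the helper `count_categories` are illustrative assumptions.

```python
from collections import Counter

# Hypothetical category strings in the same "A, B, C" shape as the dataset
sample_categories = [
    "Home Services, Movers, Local Services",
    "Heating & Air Conditioning/HVAC, Home Services",
    "Shopping, Home Services",
]

def count_categories(rows, strip_whitespace):
    """Split each row on commas and count categories, optionally trimming spaces."""
    counts = Counter()
    for row in rows:
        for part in row.split(","):
            counts[part.strip() if strip_whitespace else part] += 1
    return counts

raw_counts = count_categories(sample_categories, strip_whitespace=False)
clean_counts = count_categories(sample_categories, strip_whitespace=True)

# Without trimming, "Home Services" and " Home Services" are separate keys
print(raw_counts["Home Services"], raw_counts[" Home Services"])
# With trimming, they collapse into a single key
print(clean_counts["Home Services"])
```

In PySpark, a similar result could probably be obtained by splitting on a pattern that also consumes the whitespace, such as `",\\s*"`, though we have not run that variant here.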

## Average rating per category of business which accepts crypto

Removing (category, star rating) combinations which have a count less than 20.

20 is chosen arbitrarily.

We can easily check for the worst cases as well, that is, categories where the rating is pretty bad.

In [142]:
# Count each (category, stars) pair, then keep only pairs seen at least 20 times
groupedByCategoriesAndRating = explodedDf.groupBy(col("categories"), col("stars"))\
                                .count()\
                                .sort(desc(col("count")))\
                                .filter(col("count") >= 20)\
                                .cache()
groupedByCategoriesAndRating.show()

+---------------+-----+-----+
|     categories|stars|count|
+---------------+-----+-----+
|  Home Services|  5.0|   38|
| Local Services|  5.0|   35|
|    Restaurants|  4.0|   34|
|       Shopping|  4.5|   31|
|  Home Services|  4.0|   29|
|  Home Services|  4.5|   27|
|    Restaurants|  4.5|   26|
| Local Services|  4.5|   25|
|       Shopping|  5.0|   25|
|       Shopping|  4.0|   23|
| Local Services|  4.0|   23|
|           Food|  4.5|   20|
+---------------+-----+-----+

In [143]:
# good businesses in crypto

sortedByRating = groupedByCategoriesAndRating\
                    .groupBy(col("categories"))\
                    .agg(avg(col("stars")).alias("avg_rating"))\
                    .sort(desc(col("avg_rating")))\
                    .cache()
                
sortedByRating.show()

+---------------+----------+
|     categories|avg_rating|
+---------------+----------+
|       Shopping|       4.5|
| Local Services|       4.5|
|  Home Services|       4.5|
|           Food|       4.5|
|    Restaurants|      4.25|
+---------------+----------+
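
One caveat about `avg_rating`: it averages the distinct star values that survived the filter, so each (category, stars) row counts once regardless of how many businesses it represents. The sketch below recomputes the figures from the filtered table above and compares them with a count-weighted average. The helper `averages` is hypothetical, and even the weighted figure still ignores star ratings that fell below the threshold of 20.

```python
# (category, stars, count) rows copied from the filtered table above
rows = [
    ("Home Services", 5.0, 38),
    ("Local Services", 5.0, 35),
    ("Restaurants", 4.0, 34),
    ("Shopping", 4.5, 31),
    ("Home Services", 4.0, 29),
    ("Home Services", 4.5, 27),
    ("Restaurants", 4.5, 26),
    ("Local Services", 4.5, 25),
    ("Shopping", 5.0, 25),
    ("Shopping", 4.0, 23),
    ("Local Services", 4.0, 23),
    ("Food", 4.5, 20),
]

def averages(category):
    """Return (unweighted, count-weighted) average stars for one category."""
    selected = [(stars, count) for name, stars, count in rows if name == category]
    unweighted = sum(stars for stars, _ in selected) / len(selected)
    weighted = (sum(stars * count for stars, count in selected)
                / sum(count for _, count in selected))
    return unweighted, weighted

for category in ["Shopping", "Local Services", "Home Services", "Food", "Restaurants"]:
    unweighted, weighted = averages(category)
    print(f"{category}: unweighted {unweighted:.2f}, weighted {weighted:.2f}")
```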

## Hidden Gems
### Find the highest rated businesses with the fewest number of check-ins.

In [None]:
# Filter out businesses with fewer than 10 reviews because the rating is not as trustworthy

filtered_business_df = business_df.filter(business_df.review_count >= 10)

# Left join so that businesses without a check-in record are kept.
# Note: count('*') below counts joined rows, not individual check-in timestamps.

hidden_gems_df = filtered_business_df.join(checkin_df, on='business_id', how='left')

hidden_gems_df = hidden_gems_df.groupBy('business_id').agg(
    first('name').alias('name'),
    first('city').alias('city'),
    first('stars').alias('stars'),
    count('*').alias('checkin_count')
)

filtered_hidden_gems_df = hidden_gems_df.filter(hidden_gems_df['checkin_count'] < 100)

filtered_hidden_gems_df = filtered_hidden_gems_df.sort(hidden_gems_df['stars'].desc())

filtered_hidden_gems_df.show()
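
A note on `checkin_count`: the check-in schema printed earlier has only two string columns, `business_id` and `date`. If, as we assume here, each business has a single row whose `date` field packs all of its check-in timestamps into one comma-separated string, then `count('*')` after the join counts rows rather than check-ins, and every business ends up with a count of 1. The pure-Python sketch below shows how the timestamps could be counted instead. The sample records and the helper `count_checkins` are hypothetical.

```python
def count_checkins(date_field):
    """Count timestamps in a comma-separated check-in string; None or empty means zero."""
    if not date_field:
        return 0
    return len([stamp for stamp in date_field.split(",") if stamp.strip()])

# Hypothetical records in the assumed format
sample = {
    "business_a": "2016-04-26 19:49:16, 2016-08-30 18:36:57, 2016-10-15 02:45:18",
    "business_b": "2019-01-05 12:00:00",
    "business_c": None,  # no matching check-in row after a left join
}

checkin_counts = {business: count_checkins(dates) for business, dates in sample.items()}
print(checkin_counts)
```

In PySpark, something like `size(split(col("date"), ","))` might serve the same purpose, with extra handling for businesses that have no check-in row. We have not run that here.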

## San Francisco Hidden Gems?

### That's an interesting list of hidden gems, but not very useful unless we further filter by city. Let's see if we can find some hidden gems here...

In [15]:
sf_hidden_gems_df = filtered_hidden_gems_df.filter(hidden_gems_df.city == "San Francisco")

In [16]:
sf_hidden_gems_df.show()

+-----------+----+----+-----+-------------+
|business_id|name|city|stars|checkin_count|
+-----------+----+----+-----+-------------+
+-----------+----+----+-----+-------------+

### Sadly, it looks like we can't find the hidden gems here. Our theory is that Yelp is more popular in San Francisco than it is in other places, so there are more people checking in to businesses on Yelp here. Let's try to find some hidden gems in Santa Barbara based on our definition of a hidden gem (fewer than 100 check-ins), because we see Santa Barbara in the table above.

In [45]:
sb_hidden_gems_df = filtered_hidden_gems_df.filter(hidden_gems_df.city == "Santa Barbara")

In [46]:
sb_hidden_gems_df.show()

+--------------------+--------------------+-------------+-----+-------------+
|         business_id|                name|         city|stars|checkin_count|
+--------------------+--------------------+-------------+-----+-------------+
|C5C2oLWx_plBLHaTW...|The Poppy Pod Flo...|Santa Barbara|  5.0|            1|
|-6L_z3ftD1iepJb0F...|Channel Islands O...|Santa Barbara|  5.0|            1|
|rpekzyLlfjr6eyxty...|         Aveda Store|Santa Barbara|  5.0|            1|
|vfaVKfR9bOZpVniOd...|  Kathryn Waters, OD|Santa Barbara|  5.0|            1|
|e9MlroCGKfrYOoWjK...|Michelle K Montec...|Santa Barbara|  5.0|            1|
|pS8SpbPGNV5-asskD...|Pedego Electric B...|Santa Barbara|  5.0|            1|
|jNWePdDkWzwt4oQ5U...| Faitell Attractions|Santa Barbara|  5.0|            1|
|snknZPBoWlp_bcOhS...|Pretty Please Beauty|Santa Barbara|  5.0|            1|
|x3bLTM7G6bX0rJlzm...|  Santa Barbara Dojo|Santa Barbara|  5.0|            1|
|S3lmzX0wJcTYf9q1e...|Coast 2 Coast Col...|Santa Barbara|  5.0| 

### Nice! Now I know to go to The Plumbing Factory if I ever need a plumber in Santa Barbara! But I still want to find some San Francisco hidden gems. Let's relax the definition of a hidden gem (it's not so easy to hide in a 7mi x 7mi city)

In [54]:
sf_filtered_hidden_gems_df = hidden_gems_df.filter(hidden_gems_df['checkin_count'] < 500)

sf_filtered_hidden_gems_df = sf_filtered_hidden_gems_df.sort(sf_filtered_hidden_gems_df['stars'].desc())

sf_filtered_hidden_gems_df = sf_filtered_hidden_gems_df.filter(sf_filtered_hidden_gems_df.city == "San Francisco")

sf_filtered_hidden_gems_df.show()

+-----------+----+----+-----+-------------+
|business_id|name|city|stars|checkin_count|
+-----------+----+----+-----+-------------+
+-----------+----+----+-----+-------------+

In [57]:
sf_df = business_df.filter(business_df.city == "San Francisco")
sf_df.show()

+-------+----------+-----------+----------+----+-----+-------+--------+---------+----+-----------+------------+-----+-----+
|address|attributes|business_id|categories|city|hours|is_open|latitude|longitude|name|postal_code|review_count|stars|state|
+-------+----------+-----------+----------+----+-----+-------+--------+---------+----+-----------+------------+-----+-----+
+-------+----------+-----------+----------+----+-----+-------+--------+---------+----+-----------+------------+-----+-----+

After further analysis, we find that this dataset does not include businesses in San Francisco. The Yelp dataset page says that this is a subset of the entire Yelp data, and it seems San Francisco businesses were not included in the dataset. Maybe Yelp does not want to give away its data in this form for free? A bit disappointing, but fair enough.