## Tune Filters

This notebook analyzes frequencies and statistics across the entire dataset in order to tune filters accurately. The filters will then be implemented in a script (``filter_city_data.py``) to perform sentiment analysis on the data.

In [4]:
# Import necessary libraries
import json
import numpy as np
from collections import Counter

Iterate through the full business dataset (``yelp_academic_dataset_business.json``) and gather frequencies for all categories, cities, and states. Collect each business's review count as well, so summary statistics can be computed.

In [5]:
# Create counters for categories, cities, states, and a list for review counts
category_counter = Counter()
city_counter = Counter()
state_counter = Counter()
review_counts = []

# Open the Yelp business dataset and read each line
with open("./Yelp JSON/yelp_dataset/yelp_academic_dataset_business.json", "r") as businesses:
    for line in businesses:
        business = json.loads(line)

        # Count every category a business lists (businesses with no categories are skipped)
        if business["categories"]:
            categories = business["categories"].split(", ")
            for category in categories:
                category_counter[category] += 1
        
        # Count the occurrences of cities and states, and collect review counts
        city_counter[business["city"]] += 1
        state_counter[business["state"]] += 1
        review_counts.append(business["review_count"])

In [3]:
# Print the top 20 most common categories
print("CATEGORIES")
for category, count in category_counter.most_common(20):
    print(f"{category}: {count}")

CATEGORIES
Restaurants: 52268
Food: 27781
Shopping: 24395
Home Services: 14356
Beauty & Spas: 14292
Nightlife: 12281
Health & Medical: 11890
Local Services: 11198
Bars: 11065
Automotive: 10773
Event Planning & Services: 9895
Sandwiches: 8366
American (Traditional): 8139
Active Life: 7687
Pizza: 7093
Coffee & Tea: 6703
Fast Food: 6472
Breakfast & Brunch: 6239
American (New): 6097
Hotels & Travel: 5857


In [None]:
# Print the top 50 most common cities
print("CITIES")
for city, count in city_counter.most_common(50):
    print(f"{city}: {count}")

CITIES
Philadelphia: 14569
Tucson: 9250
Tampa: 9050
Indianapolis: 7540
Nashville: 6971
New Orleans: 6209
Reno: 5935
Edmonton: 5054
Saint Louis: 4827
Santa Barbara: 3829
Boise: 2937
Clearwater: 2221
Saint Petersburg: 1663
Metairie: 1643
Sparks: 1624
Wilmington: 1446
Franklin: 1414
St. Louis: 1255
St. Petersburg: 1185
Meridian: 1043
Brandon: 1033
Largo: 1002
Carmel: 967
Cherry Hill: 960
West Chester: 838
Goleta: 798
Brentwood: 767
Palm Harbor: 665
Greenwood: 649
New Port Richey: 604
Lutz: 591
Riverview: 588
Kenner: 584
Fishers: 570
Wesley Chapel: 560
King of Prussia: 560
Doylestown: 539
Pinellas Park: 512
Dunedin: 490
Hendersonville: 484
Bensalem: 454
Norristown: 448
Exton: 419
Marlton: 415
Spring Hill: 402
Tarpon Springs: 398
St Petersburg: 387
Springfield: 384
Lansdale: 378
Ardmore: 376
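
The city output lists some cities under several spellings ("Saint Louis" and "St. Louis"; "Saint Petersburg", "St. Petersburg", and "St Petersburg"), so a filter on one exact city string would miss part of that city's businesses. A minimal sketch of merging such variants before counting follows; `normalize_city` is a hypothetical helper, not part of the original notebook, and it only handles the leading "St"/"St." pattern seen in this output.

```python
import re
from collections import Counter


def normalize_city(name):
    """Hypothetical helper: expand a leading 'St' or 'St.' to 'Saint'."""
    return re.sub(r"^St\.?\s+", "Saint ", name.strip())


# Counts copied from the CITIES output for the affected spellings
raw_counts = Counter({
    "Saint Louis": 4827,
    "St. Louis": 1255,
    "Saint Petersburg": 1663,
    "St. Petersburg": 1185,
    "St Petersburg": 387,
})

# Re-aggregate the counts under the normalized city names
merged = Counter()
for city, count in raw_counts.items():
    merged[normalize_city(city)] += count

for city, count in merged.most_common():
    print(f"{city}: {count}")
```

Other inconsistencies, such as differences in capitalization, would need their own rules.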


In [None]:
# Print the top 20 most common states
print("STATES")
for state, count in state_counter.most_common(20):
    print(f"{state}: {count}")

STATES
PA: 34039
FL: 26330
TN: 12056
IN: 11247
MO: 10913
LA: 9924
AZ: 9912
NJ: 8536
NV: 7715
AB: 5573
CA: 5203
ID: 4467
DE: 2265
IL: 2145
TX: 4
CO: 3
WA: 2
HI: 2
MA: 2
NC: 1


In [None]:
# Calculate statistics for review counts
review_counts = np.array(review_counts)
review_stats = {
    "mean": np.mean(review_counts),
    "median": np.median(review_counts),
    "std_dev": np.std(review_counts),
    "min": np.min(review_counts),
    "max": np.max(review_counts)
}

print("REVIEW COUNTS")
for stat, value in review_stats.items():
    print(f"{stat}: {value:.2f}")

REVIEW COUNTS
mean: 44.87
median: 15.00
std_dev: 121.12
min: 5.00
max: 7568.00
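
The mean (44.87) sits well above the median (15.00), which suggests a right-skewed distribution, so percentiles may be more informative than the mean when choosing review-count thresholds. The sketch below shows one way to inspect them and to estimate how much data a candidate window keeps. The array here is a small stand-in with placeholder values, not the real data; in the notebook, the existing `review_counts` array would be used instead.

```python
import numpy as np

# Hypothetical stand-in for the notebook's review_counts array; these values
# are illustrative placeholders, not taken from the Yelp dataset.
review_counts = np.array([5, 6, 8, 10, 15, 15, 22, 40, 75, 120, 300, 1500])

# Percentiles show where candidate thresholds sit in a skewed distribution
percentiles = [50, 75, 90, 95, 99]
percentile_values = np.percentile(review_counts, percentiles)
for p, value in zip(percentiles, percentile_values):
    print(f"p{p}: {value:.2f}")

# Share of businesses that a candidate window (50, 1000) would keep
in_window = (review_counts > 50) & (review_counts < 1000)
share_kept = in_window.mean()
print(f"share with 50 < review_count < 1000: {share_kept:.2%}")
```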


Now, iterate through the full business dataset again and tune the filters until they select a suitable number of businesses.

In [None]:
# Tune filters to find optimal number of total businesses
# A set gives constant-time membership checks when scanning the reviews later
selected_business_ids = set()

with open("./Yelp JSON/yelp_dataset/yelp_academic_dataset_business.json", "r") as businesses:
    for line in businesses:
        business = json.loads(line)

        if business["categories"]:
            categories = business["categories"].split(", ")

            if "Restaurants" in categories and business["state"] == "CA" and 50 < business["review_count"] < 1000:
                selected_business_ids.add(business["business_id"])

print(f"Restaurants in CA with more than 50 but less than 1,000 reviews: {len(selected_business_ids)}")

Restaurants in CA with more than 50 but less than 1,000 reviews: 684


After tuning the filters, there are 684 restaurants in CA with more than 50 but fewer than 1,000 reviews, which seems like a reasonable dataset size. Next, iterate through the full review dataset (``yelp_academic_dataset_review.json``) and count the total number of reviews for the selected businesses.

In [None]:
# Count the total number of reviews for the selected businesses
review_count = 0

with open("./Yelp JSON/yelp_dataset/yelp_academic_dataset_review.json", "r") as reviews:
    for line in reviews:
        review = json.loads(line)

        if review["business_id"] in selected_business_ids:
            review_count += 1

print(f"Total reviews for selected businesses: {review_count}")

Total reviews for selected businesses: 157916
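
One way the tuned filter could be carried into a script is as a single predicate with the thresholds exposed as parameters. This is only a sketch under assumptions: `keep_business` and its parameter names are hypothetical, the sample records are made up for illustration, and nothing here reflects the actual contents of ``filter_city_data.py``.

```python
import json


def keep_business(business, state="CA", min_reviews=50, max_reviews=1000):
    """Hypothetical predicate mirroring the filter tuned in this notebook."""
    categories = business.get("categories")
    if not categories:
        return False
    return (
        "Restaurants" in categories.split(", ")
        and business["state"] == state
        and min_reviews < business["review_count"] < max_reviews
    )


# Made-up sample records in the same JSON-lines shape as the business file
sample_lines = [
    '{"business_id": "a1", "categories": "Restaurants, Pizza", "state": "CA", "review_count": 120}',
    '{"business_id": "b2", "categories": "Restaurants", "state": "PA", "review_count": 300}',
    '{"business_id": "c3", "categories": "Shopping", "state": "CA", "review_count": 200}',
    '{"business_id": "d4", "categories": null, "state": "CA", "review_count": 80}',
    '{"business_id": "e5", "categories": "Restaurants", "state": "CA", "review_count": 50}',
]

# Collect the IDs of the businesses that pass the filter
selected = {
    business["business_id"]
    for business in map(json.loads, sample_lines)
    if keep_business(business)
}
print(selected)
```

Keeping the thresholds as parameters makes it straightforward to rerun the same filter for another state or review-count window.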
