<a href="https://colab.research.google.com/github/root-git/stratascratch-sql-challenges/blob/main/1_Top_5_States_with_the_Most_5_Star_Businesses.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Top 5 States with the Most 5-Star Businesses

**Find the top 5 states with the most 5-star businesses.**  

Output the state name along with the number of 5-star businesses and order records by the number of 5-star businesses in descending order.  
In case there are ties in the number of businesses, return all the unique states.  
If two states have the same result, sort them in alphabetical order.

**Original Question Link:**  
[StrataScratch ID10046 – Top 5 States with 5-Star Businesses](https://platform.stratascratch.com/coding/10046-top-5-states-with-5-star-businesses?code_type=1)



----

## Table Schema

**Table Name**: `yelp_business`

| Column Name    | Data Type         |
|----------------|-------------------|
| `address`      | text              |
| `business_id`  | text              |
| `categories`   | text              |
| `city`         | text              |
| `is_open`      | bigint            |
| `latitude`     | double precision  |
| `longitude`    | double precision  |
| `name`         | text              |
| `neighborhood` | text              |
| `postal_code`  | text              |
| `review_count` | bigint            |
| `stars`        | double precision  |
| `state`        | text              |

----

## Thought Process

- Only include businesses with exactly **5-star** ratings.
- If multiple states tie for the 5th place, include **all** tied states.
- Sort ties by state name (alphabetically).
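
The three points above can be sketched in pandas as a rough cross-check on the SQL approach. This is a minimal sketch, not the notebook's solution: the `toy` frame and the `top_states` helper are hypothetical names introduced here, and the dense ranking mirrors the `DENSE_RANK()` used in the SQL query.

```python
import pandas as pd

# Hypothetical toy data (not the notebook's mock table)
toy = pd.DataFrame({
    "business_id": ["a1", "a2", "a3", "a4", "a5", "a6"],
    "state": ["CA", "CA", "NY", "TX", "TX", "NY"],
    "stars": [5.0, 5.0, 5.0, 5.0, 4.0, 3.0],
})

def top_states(businesses, top_n=5):
    """Count 5-star businesses per state and keep states whose dense rank is <= top_n."""
    # Keep only businesses rated exactly 5 stars
    five_star = businesses[businesses["stars"] == 5]
    # Count distinct businesses per state
    counts = (
        five_star.groupby("state")["business_id"]
        .nunique()
        .reset_index(name="n_businesses")
    )
    # Dense rank: tied counts share a rank, and ranks have no gaps
    counts["business_rank"] = counts["n_businesses"].rank(method="dense", ascending=False)
    kept = counts[counts["business_rank"] <= top_n]
    # Sort by count (descending), then state name (ascending) to break ties
    ordered = kept.sort_values(["n_businesses", "state"], ascending=[False, True])
    return ordered[["state", "n_businesses"]].reset_index(drop=True)

print(top_states(toy))
```

Here CA has two 5-star businesses while NY and TX have one each, so NY and TX share a rank and appear in alphabetical order.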

----

In [1]:
import pandas as pd

# Create mock data for yelp_business
data = [
    # 5-star businesses
    {'business_id': 'b1', 'state': 'CA', 'stars': 5.0},
    {'business_id': 'b2', 'state': 'CA', 'stars': 5.0},
    {'business_id': 'b3', 'state': 'NY', 'stars': 5.0},
    {'business_id': 'b4', 'state': 'NY', 'stars': 5.0},
    {'business_id': 'b5', 'state': 'TX', 'stars': 5.0},
    {'business_id': 'b6', 'state': 'FL', 'stars': 5.0},
    {'business_id': 'b7', 'state': 'FL', 'stars': 5.0},
    {'business_id': 'b8', 'state': 'WA', 'stars': 5.0},
    {'business_id': 'b9', 'state': 'NV', 'stars': 5.0},  # NV ends up with 2, tied with CA, NY, and FL
    {'business_id': 'b10', 'state': 'NV', 'stars': 5.0},

    # GA also has 2, so it ties with NV and exercises the alphabetical tie-breaker
    {'business_id': 'b11', 'state': 'GA', 'stars': 5.0},
    {'business_id': 'b12', 'state': 'GA', 'stars': 5.0},

    # Lower-star businesses to verify exclusion
    {'business_id': 'b13', 'state': 'CA', 'stars': 4.0},
    {'business_id': 'b14', 'state': 'TX', 'stars': 3.5},
    {'business_id': 'b15', 'state': 'NY', 'stars': 2.0},
]

# Convert to DataFrame
df = pd.DataFrame(data)

In [2]:
# Show the mock data
print("Mock Yelp Business Data:")
print(df)

Mock Yelp Business Data:
   business_id state  stars
0           b1    CA    5.0
1           b2    CA    5.0
2           b3    NY    5.0
3           b4    NY    5.0
4           b5    TX    5.0
5           b6    FL    5.0
6           b7    FL    5.0
7           b8    WA    5.0
8           b9    NV    5.0
9          b10    NV    5.0
10         b11    GA    5.0
11         b12    GA    5.0
12         b13    CA    4.0
13         b14    TX    3.5
14         b15    NY    2.0


In [3]:
import sqlite3

# Create an in-memory SQLite database (data will not persist after session ends)
conn = sqlite3.connect(":memory:")

# Load the DataFrame into a table named 'yelp_business' within the SQLite database
df.to_sql("yelp_business", conn, index=False)

query = """SELECT * FROM yelp_business"""  # Preview the table

pd.read_sql(query, conn)

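|   | business_id | state | stars |
|---|-------------|-------|-------|
| 0 | b1          | CA    | 5.0   |
| 1 | b2          | CA    | 5.0   |
| 2 | b3          | NY    | 5.0   |
| 3 | b4          | NY    | 5.0   |
| 4 | b5          | TX    | 5.0   |
| 5 | b6          | FL    | 5.0   |
| 6 | b7          | FL    | 5.0   |
| 7 | b8          | WA    | 5.0   |
| 8 | b9          | NV    | 5.0   |
| 9 | b10         | NV    | 5.0   |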


In [25]:
# Replace with your SQL query below
query = """ SELECT * FROM yelp_business"""

result_df = pd.read_sql(query, conn)

In [12]:
query = """
WITH business_rank AS
(
  SELECT
    state,
    COUNT(DISTINCT business_id) as n_businesses,
    DENSE_RANK() OVER(ORDER BY COUNT(DISTINCT business_id) DESC) as business_rank
  FROM yelp_business
  WHERE stars=5
  GROUP BY state
 )

 SELECT
  state,
  n_businesses
 FROM business_rank
 WHERE business_rank <= 5
 ORDER BY n_businesses DESC,state """

solution = pd.read_sql(query, conn)

In [26]:
# Compare the two results
are_equal = result_df.equals(solution)

# Print result based on the comparison
if are_equal:
    print("Correct!")
else:
    print("Try again!")

Try again!
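
`DataFrame.equals` only reports `True` or `False`, so it gives no hint about why an attempt differs from the solution. One possible way to get a more descriptive message is `pandas.testing.assert_frame_equal`, which raises an `AssertionError` describing a mismatch. The sketch below assumes two small hypothetical frames, `expected` and `attempt`, and a hypothetical helper `explain_difference`; it is not part of the original notebook.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical frames standing in for `solution` and `result_df`
expected = pd.DataFrame({"state": ["CA", "NY"], "n_businesses": [2, 1]})
attempt = pd.DataFrame({"state": ["CA", "NY"], "n_businesses": [2, 2]})

def explain_difference(left, right):
    """Return None when the frames match, otherwise the mismatch message."""
    try:
        # Reset the index so only column labels, dtypes, and values are compared
        assert_frame_equal(left.reset_index(drop=True), right.reset_index(drop=True))
    except AssertionError as err:
        return str(err)
    return None

print(explain_difference(attempt, expected))
```

In the notebook, the same helper could be called as `explain_difference(result_df, solution)` in place of the plain `equals` check.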


### Problem Explanation

### 1. Filter businesses that have a 5-star rating.
   ```sql
   SELECT
     state
   FROM yelp_business
   WHERE stars = 5  
   ```

### 2. Count how many 5-star businesses each state has.
   ```sql
   SELECT
     state,
     COUNT(DISTINCT business_id) AS n_businesses
   FROM yelp_business
   WHERE stars = 5
   GROUP BY state
   ```


### 3. Rank states by the number of 5-star businesses in **descending** order.
`DENSE_RANK()` assigns the same rank to states with the same number of 5-star businesses and leaves no gaps in the ranking numbers. States that tie on the count therefore share a rank, so filtering on the rank keeps every tied state.

   ```sql
    SELECT
      state,
      DENSE_RANK() OVER(ORDER BY COUNT(DISTINCT business_id) DESC) AS business_rank
    FROM yelp_business
    WHERE stars = 5
    GROUP BY state
   ```
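
To make the "no gaps" behavior concrete, the sketch below compares `RANK()` and `DENSE_RANK()` side by side. It assumes a hypothetical pre-aggregated table `state_counts` with made-up states and counts, and a SQLite build with window-function support (3.25 or later); none of these names come from the original notebook.

```python
import sqlite3

# Hypothetical pre-aggregated counts, used only to contrast the two functions
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE state_counts (state TEXT, n_businesses INTEGER)")
demo_conn.executemany(
    "INSERT INTO state_counts VALUES (?, ?)",
    [("AA", 3), ("BB", 3), ("CC", 2), ("DD", 1)],
)

rank_rows = demo_conn.execute(
    """
    SELECT
      state,
      n_businesses,
      RANK() OVER (ORDER BY n_businesses DESC) AS rnk,
      DENSE_RANK() OVER (ORDER BY n_businesses DESC) AS dense_rnk
    FROM state_counts
    ORDER BY n_businesses DESC, state
    """
).fetchall()
demo_conn.close()

for row in rank_rows:
    print(row)
# ('AA', 3, 1, 1)
# ('BB', 3, 1, 1)
# ('CC', 2, 3, 2)   <- RANK() skips 2 after the tie, DENSE_RANK() does not
# ('DD', 1, 4, 3)
```

With a cutoff of 2, filtering on `rnk` would keep only AA and BB, while filtering on `dense_rnk` would also keep CC. The choice of function therefore changes how many states pass the `<= 5` filter whenever ties occur above the cutoff.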

### 4. Combine Steps 1 to 3 into a CTE.

  ```sql
    WITH business_rank AS
    (
      SELECT
        state,
        COUNT(DISTINCT business_id) as n_businesses,
        DENSE_RANK() OVER(ORDER BY COUNT(DISTINCT business_id) DESC) as business_rank
      FROM yelp_business
      WHERE stars=5
      GROUP BY state
    )
  ```

### 5. Return all states ranked within the top 5, ordered by the number of 5-star businesses (descending) and state name (ascending) to break ties.

   ```sql
    SELECT
      state,
      n_businesses
    FROM business_rank
    WHERE business_rank <= 5
    ORDER BY n_businesses DESC,state
  ```

### 6. Combine Steps 4 and 5.

   ```sql
    WITH business_rank AS
    (
      SELECT
        state,
        COUNT(DISTINCT business_id) AS n_businesses,
        DENSE_RANK() OVER(ORDER BY COUNT(DISTINCT business_id) DESC) AS business_rank
      FROM yelp_business
      WHERE stars = 5
      GROUP BY state
    )
    SELECT
      state,
      n_businesses
    FROM business_rank
    WHERE business_rank <= 5
    ORDER BY n_businesses DESC, state
   ```
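
As a quick check, the combined query can be run against the notebook's mock rows. The sketch below rebuilds those 15 rows (only the three columns the query uses) in a fresh in-memory SQLite database; `MOCK_ROWS`, `FINAL_QUERY`, and `run_final_query` are names introduced here for illustration, and a SQLite build with window-function support (3.25 or later) is assumed.

```python
import sqlite3

# The notebook's mock rows as (business_id, state, stars)
MOCK_ROWS = [
    ("b1", "CA", 5.0), ("b2", "CA", 5.0), ("b3", "NY", 5.0),
    ("b4", "NY", 5.0), ("b5", "TX", 5.0), ("b6", "FL", 5.0),
    ("b7", "FL", 5.0), ("b8", "WA", 5.0), ("b9", "NV", 5.0),
    ("b10", "NV", 5.0), ("b11", "GA", 5.0), ("b12", "GA", 5.0),
    ("b13", "CA", 4.0), ("b14", "TX", 3.5), ("b15", "NY", 2.0),
]

FINAL_QUERY = """
WITH business_rank AS
(
  SELECT
    state,
    COUNT(DISTINCT business_id) AS n_businesses,
    DENSE_RANK() OVER(ORDER BY COUNT(DISTINCT business_id) DESC) AS business_rank
  FROM yelp_business
  WHERE stars = 5
  GROUP BY state
)
SELECT
  state,
  n_businesses
FROM business_rank
WHERE business_rank <= 5
ORDER BY n_businesses DESC, state
"""

def run_final_query(rows):
    """Load rows into an in-memory yelp_business table and run the combined query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE yelp_business (business_id TEXT, state TEXT, stars REAL)"
        )
        conn.executemany("INSERT INTO yelp_business VALUES (?, ?, ?)", rows)
        return conn.execute(FINAL_QUERY).fetchall()
    finally:
        conn.close()

result = run_final_query(MOCK_ROWS)
for state, n_businesses in result:
    print(state, n_businesses)
```

On these rows only two distinct counts occur (2 and 1), so every state has a dense rank of at most 2 and all seven states are returned, ordered by count and then alphabetically.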