# Yelp Dataset Data Model

This notebook explores the structure and relationships of the five Yelp dataset tables:
- **Business**: Information about businesses (restaurants, shops, etc.)
- **Reviews**: User reviews of businesses
- **Users**: User profile information
- **Tips**: Short tips from users about businesses
- **Checkins**: Check-in timestamps for businesses

## Setup and Data Loading

In [1]:
# Import required libraries
import json
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
import matplotlib.pyplot as plt
import seaborn as sns

# Load credentials
with open("creds.json", "r") as f:
    creds = json.load(f)
    f.close()

print("Libraries and credentials loaded successfully!")



Libraries and credentials loaded successfully!


## stop spark

In [2]:
# Stop any existing Spark session
if 'spark' in locals():
    spark.stop()

In [3]:
# Initialize Spark Session
try:
    spark = SparkSession.builder \
        .appName("YelpDataModel") \
        .master("spark://spark-master:7077") \
        .config("spark.driver.memory", "2g") \
        .config("spark.executor.memory", "2g") \
        .config("spark.executor.cores", "4") \
        .config("spark.worker.memory", "2g") \
        .config("spark.cores.max", "4") \
        .config("spark.hadoop.fs.s3a.access.key", creds["aws_client"]) \
        .config("spark.hadoop.fs.s3a.secret.key", creds["aws_secret"]) \
        .config("spark.jars.packages", 
                "org.apache.hadoop:hadoop-aws:3.3.4," + 
                "org.apache.hadoop:hadoop-common:3.3.4," +
                "com.amazonaws:aws-java-sdk-bundle:1.12.261," +
                "org.apache.logging.log4j:log4j-slf4j-impl:2.17.2," +
                "org.apache.logging.log4j:log4j-api:2.17.2," +
                "org.apache.logging.log4j:log4j-core:2.17.2," + 
                "org.apache.hadoop:hadoop-client:3.3.4," + 
                "io.delta:delta-core_2.12:2.4.0," + 
                "org.postgresql:postgresql:42.2.18") \
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
        .config("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
        .getOrCreate()
    
    print("Spark session initialized successfully!")
    
except Exception as e:
    print(f"Error initializing Spark: {str(e)}")

:: loading settings :: url = jar:file:/usr/local/lib/python3.7/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
org.apache.hadoop#hadoop-common added as a dependency
com.amazonaws#aws-java-sdk-bundle added as a dependency
org.apache.logging.log4j#log4j-slf4j-impl added as a dependency
org.apache.logging.log4j#log4j-api added as a dependency
org.apache.logging.log4j#log4j-core added as a dependency
org.apache.hadoop#hadoop-client added as a dependency
io.delta#delta-core_2.12 added as a dependency
org.postgresql#postgresql added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-3a19c2ad-28f1-4260-a55d-bef755375baf;1.0
	confs: [default]
	found org.apache.hadoop#hadoop-aws;3.3.4 in central
	found com.amazonaws#aws-java-sdk-bundle;1.12.262 in central
	found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
	found org.apache.hadoop#hadoop-common;3.3.4 in central
	found org.apache.hadoop.thirdparty#hadoop-shaded-pr

Spark session initialized successfully!


In [5]:
# Load data from Delta tables
def read_delta(path: str):
    """Read a Delta table from S3 path"""
    try:
        df = spark.read \
            .format("delta") \
            .option("inferSchema", "true") \
            .load(path)
            
        print(f"Successfully read delta table from: {path}")
        print(f"Number of rows: {df.count():,}")
        return df
        
    except Exception as e:
        print(f"Error reading delta table from {path}")
        print(f"Error: {str(e)}")
        return None

# Load all tables
bucket = "yelp-stevenhurwitt-2"

business_df = read_delta(f"s3a://{bucket}/business")
review_df = read_delta(f"s3a://{bucket}/reviews")
user_df = read_delta(f"s3a://{bucket}/users")
checkin_df = read_delta(f"s3a://{bucket}/checkins")
tip_df = read_delta(f"s3a://{bucket}/tips")

                                                                                

Successfully read delta table from: s3a://yelp-stevenhurwitt-2/business


25/09/02 22:01:38 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

Number of rows: 150,346


                                                                                

Successfully read delta table from: s3a://yelp-stevenhurwitt-2/reviews


                                                                                

Number of rows: 6,990,280
Successfully read delta table from: s3a://yelp-stevenhurwitt-2/users


                                                                                

Number of rows: 1,987,897
Successfully read delta table from: s3a://yelp-stevenhurwitt-2/checkins


                                                                                

Number of rows: 131,930
Successfully read delta table from: s3a://yelp-stevenhurwitt-2/tips
Number of rows: 908,915


## Table Schemas and Structure

### 1. Business Table

Contains information about businesses including restaurants, shops, services, etc.

In [6]:
print("=== BUSINESS TABLE SCHEMA ===")
business_df.printSchema()

print("\n=== BUSINESS SAMPLE DATA ===")
business_df.show(3, truncate=False)

print(f"\nTotal Businesses: {business_df.count():,}")
print(f"Columns: {len(business_df.columns)}")

=== BUSINESS TABLE SCHEMA ===
root
 |-- business_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- address: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- postal_code: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- stars: double (nullable = true)
 |-- review_count: integer (nullable = true)
 |-- is_open: integer (nullable = true)
 |-- attributes: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- categories: string (nullable = true)
 |-- hours: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)


=== BUSINESS SAMPLE DATA ===


                                                                                

+----------------------+-----------------------------+----------------------+------------+-----+-----------+----------+-----------+-----+------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------

### 2. Review Table

Contains user reviews for businesses with ratings, text, and metadata.

In [7]:
print("=== REVIEW TABLE SCHEMA ===")
review_df.printSchema()

print("\n=== REVIEW SAMPLE DATA ===")
review_df.select("review_id", "user_id", "business_id", "stars", "date", "useful", "funny", "cool").show(3, truncate=False)

print(f"\nTotal Reviews: {review_df.count():,}")
print(f"Columns: {len(review_df.columns)}")

=== REVIEW TABLE SCHEMA ===
root
 |-- review_id: string (nullable = true)
 |-- user_id: string (nullable = true)
 |-- business_id: string (nullable = true)
 |-- stars: float (nullable = true)
 |-- useful: integer (nullable = true)
 |-- funny: integer (nullable = true)
 |-- cool: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)


=== REVIEW SAMPLE DATA ===
+----------------------+----------------------+----------------------+-----+-------------------+------+-----+----+
|review_id             |user_id               |business_id           |stars|date               |useful|funny|cool|
+----------------------+----------------------+----------------------+-----+-------------------+------+-----+----+
|gSBCrfXhCSEoSbH0o72u_w|BI4lPhrUpmEySIJUywjIjQ|Aes-0Q_guDeYewMapFs_vg|3.0  |2005-11-11 06:29:59|0     |1    |0   |
|2Gm9_x5b2sQGSVy8HWweAQ|F4cbJPyoF1nd3S7Rqz72vw|TBmbB10q

### 3. User Table

Contains user profile information and statistics.

In [8]:
print("=== USER TABLE SCHEMA ===")
user_df.printSchema()

print("\n=== USER SAMPLE DATA ===")
user_df.select("user_id", "name", "review_count", "yelping_since", "useful", "funny", "cool", "fans", "average_stars").show(3, truncate=False)

print(f"\nTotal Users: {user_df.count():,}")
print(f"Columns: {len(user_df.columns)}")

=== USER TABLE SCHEMA ===
root
 |-- user_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- review_count: integer (nullable = true)
 |-- yelping_since: timestamp (nullable = true)
 |-- friends: string (nullable = true)
 |-- useful: integer (nullable = true)
 |-- funny: integer (nullable = true)
 |-- cool: integer (nullable = true)
 |-- fans: integer (nullable = true)
 |-- elite: string (nullable = true)
 |-- average_stars: float (nullable = true)
 |-- compliment_hot: integer (nullable = true)
 |-- compliment_more: integer (nullable = true)
 |-- compliment_profile: integer (nullable = true)
 |-- compliment_cute: integer (nullable = true)
 |-- compliment_list: integer (nullable = true)
 |-- compliment_note: integer (nullable = true)
 |-- compliment_plain: integer (nullable = true)
 |-- compliment_cool: integer (nullable = true)
 |-- compliment_funny: integer (nullable = true)
 |-- compliment_writer: integer (nullable = true)
 |-- compliment_photos: integer (nullable = 

                                                                                

+----------------------+---------+------------+-------------------+------+-----+----+----+-------------+
|user_id               |name     |review_count|yelping_since      |useful|funny|cool|fans|average_stars|
+----------------------+---------+------------+-------------------+------+-----+----+----+-------------+
|XACigsMQP4VYX970eML8XQ|Al       |3           |2019-06-30 19:07:49|1     |0    |0   |0   |2.25         |
|f7bENcNzAPWTCTfpqWxHeg|Kacey    |41          |2013-10-28 19:17:12|48    |8    |2   |0   |3.27         |
|mg0pSgLvuiXKqLsd_uLadQ|Magdaline|114         |2008-09-29 23:51:09|90    |21   |12  |1   |3.5          |
+----------------------+---------+------------+-------------------+------+-----+----+----+-------------+
only showing top 3 rows


Total Users: 1,987,897
Columns: 22


### 4. Tip Table

Contains short tips/advice from users about businesses.

In [9]:
print("=== TIP TABLE SCHEMA ===")
tip_df.printSchema()

print("\n=== TIP SAMPLE DATA ===")
tip_df.show(5, truncate=False)

print(f"\nTotal Tips: {tip_df.count():,}")
print(f"Columns: {len(tip_df.columns)}")

=== TIP TABLE SCHEMA ===
root
 |-- user_id: string (nullable = true)
 |-- business_id: string (nullable = true)
 |-- text: string (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- compliment_count: integer (nullable = true)
 |-- year: integer (nullable = true)


=== TIP SAMPLE DATA ===


                                                                                

+----------------------+----------------------+----------------------------------------------------------------------------------------------------------------------------------+-------------------+----------------+----+
|user_id               |business_id           |text                                                                                                                              |date               |compliment_count|year|
+----------------------+----------------------+----------------------------------------------------------------------------------------------------------------------------------+-------------------+----------------+----+
|veuUOGS0bbOeQzu71gVxEQ|aurSXIlX86Ob94kYvSSDPw|Awesome                                                                                                                           |2021-01-01 23:19:41|0               |2021|
|JfAqGalRYKo3Byw735-eOw|mP9dVul2VKgVIs_kZlPAqw|SO quick and easy for a dryer vent cleaning--and a very reasonable pr

### 5. Checkin Table

Contains check-in timestamps for businesses.

In [10]:
print("=== CHECKIN TABLE SCHEMA ===")
checkin_df.printSchema()

print("\n=== CHECKIN SAMPLE DATA ===")
checkin_df.show(5, truncate=False)

print(f"\nTotal Checkin Records: {checkin_df.count():,}")
print(f"Columns: {len(checkin_df.columns)}")

=== CHECKIN TABLE SCHEMA ===
root
 |-- business_id: string (nullable = true)
 |-- date: string (nullable = true)


=== CHECKIN SAMPLE DATA ===


                                                                                

+----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|business_id           |date                                                                                                                                                                                                                                                                                                                                                                                                                                       

## Entity Relationship Diagram (ERD)

The Yelp dataset follows this relationship structure:

```
┌─────────────────┐    ┌─────────────────┐
│     BUSINESS    │    │      USERS      │
│                 │    │                 │
│ • business_id   │    │ • user_id       │
│ • name          │    │ • name          │
│ • address       │    │ • review_count  │
│ • city          │    │ • yelping_since │
│ • state         │    │ • useful        │
│ • postal_code   │    │ • funny         │
│ • latitude      │    │ • cool          │
│ • longitude     │    │ • fans          │
│ • stars         │    │ • average_stars │
│ • review_count  │    │ • elite         │
│ • is_open       │    │ • friends       │
│ • attributes    │    │ • compliments   │
│ • categories    │    └─────────────────┘
│ • hours         │              │
└─────────────────┘              │
         │                       │
         │                       │
    ┌────▼───────────────────────▼────┐
    │          REVIEWS               │
    │                               │
    │ • review_id                   │
    │ • user_id     (FK)            │
    │ • business_id (FK)            │
    │ • stars                       │
    │ • useful                      │
    │ • funny                       │
    │ • cool                        │
    │ • text                        │
    │ • date                        │
    └───────────────────────────────┘
         │                       │
         │                       │
    ┌────▼─────────┐       ┌─────▼──────┐
    │   CHECKINS   │       │    TIPS    │
    │              │       │            │
    │ • business_id│       │ • user_id  │
    │   (FK)       │       │   (FK)     │
    │ • date       │       │ • business_│
    │              │       │   id (FK)  │
    └──────────────┘       │ • text     │
                           │ • date     │
                           │ • complim. │
                           │   _count   │
                           └────────────┘
```

## Key Relationships and Constraints

In [11]:
# Analyze key relationships
print("=== KEY RELATIONSHIP ANALYSIS ===")

# 1. Business-Review relationship
businesses_with_reviews = business_df.join(review_df, "business_id", "inner").select("business_id").distinct().count()
total_businesses = business_df.count()
print(f"Businesses with reviews: {businesses_with_reviews:,} out of {total_businesses:,} ({businesses_with_reviews/total_businesses*100:.1f}%)")

# 2. User-Review relationship
users_with_reviews = user_df.join(review_df, "user_id", "inner").select("user_id").distinct().count()
total_users = user_df.count()
print(f"Users with reviews: {users_with_reviews:,} out of {total_users:,} ({users_with_reviews/total_users*100:.1f}%)")

# 3. Business-Checkin relationship
businesses_with_checkins = business_df.join(checkin_df, "business_id", "inner").select("business_id").distinct().count()
print(f"Businesses with checkins: {businesses_with_checkins:,} out of {total_businesses:,} ({businesses_with_checkins/total_businesses*100:.1f}%)")

# 4. Business-Tip relationship
businesses_with_tips = business_df.join(tip_df, "business_id", "inner").select("business_id").distinct().count()
print(f"Businesses with tips: {businesses_with_tips:,} out of {total_businesses:,} ({businesses_with_tips/total_businesses*100:.1f}%)")

# 5. User-Tip relationship
users_with_tips = user_df.join(tip_df, "user_id", "inner").select("user_id").distinct().count()
print(f"Users with tips: {users_with_tips:,} out of {total_users:,} ({users_with_tips/total_users*100:.1f}%)")

=== KEY RELATIONSHIP ANALYSIS ===


25/09/02 22:29:36 ERROR TaskSchedulerImpl: Lost executor 0 on 172.18.0.5: worker lost
25/09/02 22:29:36 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_90_41 !
25/09/02 22:29:36 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_18_48 !
25/09/02 22:29:36 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_66_22 !
25/09/02 22:29:36 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_90_32 !
25/09/02 22:29:36 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_42_6 !
25/09/02 22:29:36 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_114_42 !
25/09/02 22:29:36 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_14_43 !
25/09/02 22:29:36 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_90_44 !
25/09/02 22:29:36 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_42_15 !
25/09/02 22:29:36 WARN BlockManagerMasterEndpoint: No more replicas 

Businesses with reviews: 150,346 out of 150,346 (100.0%)


                                                                                

Users with reviews: 1,987,897 out of 1,987,897 (100.0%)


                                                                                

Businesses with checkins: 131,930 out of 150,346 (87.8%)


                                                                                

Businesses with tips: 106,193 out of 150,346 (70.6%)


[Stage 166:>                                                        (0 + 4) / 4]

Users with tips: 301,758 out of 1,987,897 (15.2%)


                                                                                

## Data Quality and Distribution Analysis

In [12]:
# Data quality analysis
print("=== DATA QUALITY ANALYSIS ===")

# Check for null values in key columns
tables = {
    'Business': business_df,
    'Reviews': review_df,
    'Users': user_df,
    'Tips': tip_df,
    'Checkins': checkin_df
}

for table_name, df in tables.items():
    print(f"\n{table_name.upper()} TABLE:")
    total_rows = df.count()
    
    # Check null values for each column
    for col_name in df.columns[:5]:  # Check first 5 columns to keep output manageable
        null_count = df.filter(col(col_name).isNull()).count()
        null_percentage = (null_count / total_rows) * 100
        print(f"  {col_name}: {null_count:,} nulls ({null_percentage:.1f}%)")

=== DATA QUALITY ANALYSIS ===

BUSINESS TABLE:
  business_id: 0 nulls (0.0%)
  name: 0 nulls (0.0%)
  address: 0 nulls (0.0%)
  city: 0 nulls (0.0%)
  state: 0 nulls (0.0%)

REVIEWS TABLE:
  review_id: 0 nulls (0.0%)
  user_id: 0 nulls (0.0%)
  business_id: 0 nulls (0.0%)
  stars: 0 nulls (0.0%)
  useful: 0 nulls (0.0%)

USERS TABLE:
  user_id: 0 nulls (0.0%)
  name: 0 nulls (0.0%)
  review_count: 0 nulls (0.0%)
  yelping_since: 0 nulls (0.0%)
  friends: 0 nulls (0.0%)

TIPS TABLE:
  user_id: 0 nulls (0.0%)
  business_id: 0 nulls (0.0%)
  text: 0 nulls (0.0%)
  date: 0 nulls (0.0%)
  compliment_count: 0 nulls (0.0%)

CHECKINS TABLE:
  business_id: 0 nulls (0.0%)
  date: 0 nulls (0.0%)


In [13]:
# Distribution analysis
print("=== DATA DISTRIBUTION ANALYSIS ===")

# Business star ratings distribution
print("\nBusiness Star Ratings Distribution:")
business_df.groupBy("stars").count().orderBy("stars").show()

# Review star ratings distribution
print("\nReview Star Ratings Distribution:")
review_df.groupBy("stars").count().orderBy("stars").show()

# Top states by business count
print("\nTop 10 States by Business Count:")
business_df.groupBy("state").count().orderBy(desc("count")).show(10)

# Business categories (top 10)
print("\nBusiness Status (Open vs Closed):")
business_df.groupBy("is_open").count().show()

=== DATA DISTRIBUTION ANALYSIS ===

Business Star Ratings Distribution:


                                                                                

+-----+-----+
|stars|count|
+-----+-----+
|  1.0| 1986|
|  1.5| 4932|
|  2.0| 9527|
|  2.5|14316|
|  3.0|18453|
|  3.5|26519|
|  4.0|31125|
|  4.5|27181|
|  5.0|16307|
+-----+-----+


Review Star Ratings Distribution:


                                                                                

+-----+-------+
|stars|  count|
+-----+-------+
|  1.0|1069561|
|  2.0| 544240|
|  3.0| 691934|
|  4.0|1452918|
|  5.0|3231627|
+-----+-------+


Top 10 States by Business Count:


                                                                                

+-----+-----+
|state|count|
+-----+-----+
|   PA|34039|
|   FL|26330|
|   TN|12056|
|   IN|11247|
|   MO|10913|
|   LA| 9924|
|   AZ| 9912|
|   NJ| 8536|
|   NV| 7715|
|   AB| 5573|
+-----+-----+
only showing top 10 rows


Business Status (Open vs Closed):




+-------+------+
|is_open| count|
+-------+------+
|      1|119698|
|      0| 30648|
+-------+------+



                                                                                

## Primary and Foreign Key Analysis

In [14]:
# Primary key uniqueness check
print("=== PRIMARY KEY UNIQUENESS CHECK ===")

# Business table
business_total = business_df.count()
business_unique = business_df.select("business_id").distinct().count()
print(f"Business table: {business_total:,} total rows, {business_unique:,} unique business_ids ({'✓' if business_total == business_unique else '✗'})")

# Users table
users_total = user_df.count()
users_unique = user_df.select("user_id").distinct().count()
print(f"Users table: {users_total:,} total rows, {users_unique:,} unique user_ids ({'✓' if users_total == users_unique else '✗'})")

# Reviews table
reviews_total = review_df.count()
reviews_unique = review_df.select("review_id").distinct().count()
print(f"Reviews table: {reviews_total:,} total rows, {reviews_unique:,} unique review_ids ({'✓' if reviews_total == reviews_unique else '✗'})")

print("\n=== FOREIGN KEY INTEGRITY CHECK ===")

# Check if all review business_ids exist in business table
review_business_ids = review_df.select("business_id").distinct()
valid_review_businesses = review_business_ids.join(business_df.select("business_id"), "business_id", "inner").count()
total_review_businesses = review_business_ids.count()
print(f"Review business_id integrity: {valid_review_businesses:,}/{total_review_businesses:,} ({'✓' if valid_review_businesses == total_review_businesses else '✗'})")

# Check if all review user_ids exist in users table
review_user_ids = review_df.select("user_id").distinct()
valid_review_users = review_user_ids.join(user_df.select("user_id"), "user_id", "inner").count()
total_review_users = review_user_ids.count()
print(f"Review user_id integrity: {valid_review_users:,}/{total_review_users:,} ({'✓' if valid_review_users == total_review_users else '✗'})")

# Check tips foreign key integrity
tip_business_ids = tip_df.select("business_id").distinct()
valid_tip_businesses = tip_business_ids.join(business_df.select("business_id"), "business_id", "inner").count()
total_tip_businesses = tip_business_ids.count()
print(f"Tip business_id integrity: {valid_tip_businesses:,}/{total_tip_businesses:,} ({'✓' if valid_tip_businesses == total_tip_businesses else '✗'})")

# Check checkins foreign key integrity
checkin_business_ids = checkin_df.select("business_id").distinct()
valid_checkin_businesses = checkin_business_ids.join(business_df.select("business_id"), "business_id", "inner").count()
total_checkin_businesses = checkin_business_ids.count()
print(f"Checkin business_id integrity: {valid_checkin_businesses:,}/{total_checkin_businesses:,} ({'✓' if valid_checkin_businesses == total_checkin_businesses else '✗'})")

=== PRIMARY KEY UNIQUENESS CHECK ===


                                                                                

Business table: 150,346 total rows, 150,346 unique business_ids (✓)


                                                                                

Users table: 1,987,897 total rows, 1,987,897 unique user_ids (✓)


                                                                                

Reviews table: 6,990,280 total rows, 6,990,280 unique review_ids (✓)

=== FOREIGN KEY INTEGRITY CHECK ===


                                                                                

Review business_id integrity: 150,346/150,346 (✓)


                                                                                

Review user_id integrity: 1,987,897/1,987,929 (✗)


                                                                                

Tip business_id integrity: 106,193/106,193 (✓)


                                                                                

Checkin business_id integrity: 131,930/131,930 (✓)


## Table Statistics Summary

In [15]:
# Create a comprehensive summary table
summary_data = []

tables_info = [
    ('Business', business_df, 'business_id', 'Contains business information (restaurants, shops, services)'),
    ('Reviews', review_df, 'review_id', 'User reviews with ratings and text content'),
    ('Users', user_df, 'user_id', 'User profiles and statistics'),
    ('Tips', tip_df, None, 'Short tips/advice about businesses'),
    ('Checkins', checkin_df, None, 'Business check-in timestamps')
]

for table_name, df, pk, description in tables_info:
    row_count = df.count()
    col_count = len(df.columns)
    summary_data.append({
        'Table': table_name,
        'Rows': f"{row_count:,}",
        'Columns': col_count,
        'Primary Key': pk if pk else 'Composite',
        'Description': description
    })

# Create and display summary DataFrame
summary_df = pd.DataFrame(summary_data)
print("=== YELP DATASET SUMMARY ===")
print(summary_df.to_string(index=False))

# Calculate total storage estimates
total_rows = sum([business_df.count(), review_df.count(), user_df.count(), tip_df.count(), checkin_df.count()])
print(f"\nTotal Records Across All Tables: {total_rows:,}")
print(f"Average Records per Table: {total_rows/5:,.0f}")

=== YELP DATASET SUMMARY ===
   Table      Rows  Columns Primary Key                                                  Description
Business   150,346       14 business_id Contains business information (restaurants, shops, services)
 Reviews 6,990,280       11   review_id                   User reviews with ratings and text content
   Users 1,987,897       22     user_id                                 User profiles and statistics
    Tips   908,915        6   Composite                           Short tips/advice about businesses
Checkins   131,930        2   Composite                                 Business check-in timestamps


TypeError: Invalid argument, not a string or column: [150346, 6990280, 1987897, 908915, 131930] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

25/09/03 21:03:37 ERROR TaskSchedulerImpl: Lost executor 2 on 172.18.0.6: Command exited with code 137
25/09/03 21:03:37 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_90_41 !
25/09/03 21:03:37 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_18_48 !
25/09/03 21:03:37 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_66_22 !
25/09/03 21:03:37 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_90_32 !
25/09/03 21:03:37 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_42_6 !
25/09/03 21:03:37 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_114_42 !
25/09/03 21:03:37 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_14_43 !
25/09/03 21:03:37 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_90_44 !
25/09/03 21:03:37 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_42_15 !
25/09/03 21:03:37 WARN BlockManagerMasterEndpoint: 

## Data Model Insights

### Key Observations:

1. **Central Entities**: Business and Users are the core entities, with Reviews connecting them
2. **Relationship Patterns**: 
   - One-to-Many: Business → Reviews, Users → Reviews
   - Many-to-Many: Business ↔ Users (through Reviews and Tips)
3. **Data Completeness**: Most tables maintain good referential integrity
4. **Scale**: The dataset represents a comprehensive view of business-customer interactions

### Recommended Indexes for PostgreSQL:
- `business(business_id)` - Primary key
- `users(user_id)` - Primary key  
- `reviews(business_id, user_id)` - Foreign key indexes
- `business(state, city)` - Geographic queries
- `reviews(date)` - Temporal queries
- `business(categories)` - Category-based searches