# Yelp Dataset Machine Learning Analysis

This notebook applies various machine learning models to the Yelp dataset for:
1. **Business Rating Prediction** - Predict business star ratings
2. **Review Sentiment Analysis** - Classify review sentiment from text
3. **User Behavior Analysis** - Cluster users by engagement metrics
4. **Business Success Classification** - Classify businesses as successful or not
5. **Recommendation Systems** - Build collaborative filtering models

## Setup and Data Loading

In [1]:
# Import required libraries
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVR, SVC
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings('ignore')

# PySpark imports
from pyspark.sql import SparkSession
# Note: this wildcard import shadows Python built-ins such as sum, min, max, round, and abs
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.ml.feature import VectorAssembler, StringIndexer, StandardScaler as SparkStandardScaler
from pyspark.ml.regression import RandomForestRegressor as SparkRandomForestRegressor
from pyspark.ml.classification import RandomForestClassifier as SparkRandomForestClassifier
from pyspark.ml.evaluation import RegressionEvaluator, BinaryClassificationEvaluator
from pyspark.ml import Pipeline

print("Libraries imported successfully!")

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)



Libraries imported successfully!


In [2]:
# Load credentials (the with-block closes the file automatically)
with open("creds.json", "r") as f:
    creds = json.load(f)

print("Credentials loaded successfully!")

Credentials loaded successfully!
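
The Spark configuration later in this notebook reads `creds["aws_client"]` and `creds["aws_secret"]`, so `creds.json` presumably holds at least those two keys. Below is a minimal sketch of a loader that fails early when one is missing. The function name `load_creds` and the placeholder values are assumptions for illustration, not part of the original notebook.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("aws_client", "aws_secret")  # keys the Spark config reads


def load_creds(path):
    """Load a JSON credentials file and check that the expected keys exist."""
    with open(path, "r") as f:
        creds = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in creds]
    if missing:
        raise KeyError(f"creds file is missing keys: {missing}")
    return creds


# Demonstration with a throwaway file and placeholder values (hypothetical).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "creds.json")
    with open(demo_path, "w") as f:
        json.dump({"aws_client": "EXAMPLE_KEY_ID", "aws_secret": "EXAMPLE_SECRET"}, f)
    demo_creds = load_creds(demo_path)

print(sorted(demo_creds))
```

Failing at load time gives a clearer error than a later S3 authentication failure inside Spark.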


In [3]:
# Initialize Spark Session
try:
    spark = SparkSession.builder \
        .appName("YelpMachineLearning") \
        .master("spark://spark-master:7077") \
        .config("spark.driver.memory", "4g") \
        .config("spark.executor.memory", "4g") \
        .config("spark.executor.cores", "4") \
        .config("spark.worker.memory", "4g") \
        .config("spark.cores.max", "8") \
        .config("spark.hadoop.fs.s3a.access.key", creds["aws_client"]) \
        .config("spark.hadoop.fs.s3a.secret.key", creds["aws_secret"]) \
        .config("spark.jars.packages", 
                "org.apache.hadoop:hadoop-aws:3.3.4," + 
                "org.apache.hadoop:hadoop-common:3.3.4," +
                "com.amazonaws:aws-java-sdk-bundle:1.12.261," +
                "org.apache.logging.log4j:log4j-slf4j-impl:2.17.2," +
                "org.apache.logging.log4j:log4j-api:2.17.2," +
                "org.apache.logging.log4j:log4j-core:2.17.2," + 
                "org.apache.hadoop:hadoop-client:3.3.4," + 
                "io.delta:delta-core_2.12:2.4.0," + 
                "org.postgresql:postgresql:42.2.18") \
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
        .config("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
        .getOrCreate()
    
    print("Spark session initialized successfully!")
    
except Exception as e:
    print(f"Error initializing Spark: {str(e)}")
    raise  # later cells need `spark`, so do not continue without a session

:: loading settings :: url = jar:file:/usr/local/lib/python3.7/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
org.apache.hadoop#hadoop-common added as a dependency
com.amazonaws#aws-java-sdk-bundle added as a dependency
org.apache.logging.log4j#log4j-slf4j-impl added as a dependency
org.apache.logging.log4j#log4j-api added as a dependency
org.apache.logging.log4j#log4j-core added as a dependency
org.apache.hadoop#hadoop-client added as a dependency
io.delta#delta-core_2.12 added as a dependency
org.postgresql#postgresql added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-d6813aa7-6518-432e-9166-a3fb2dbfc173;1.0
	confs: [default]
	found org.apache.hadoop#hadoop-aws;3.3.4 in central
	found com.amazonaws#aws-java-sdk-bundle;1.12.262 in central
	found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
	found org.apache.hadoop#hadoop-common;3.3.4 in central
	found org.apache.hadoop.thirdparty#hadoop-shaded-pr

Spark session initialized successfully!


In [4]:
# Load data from Delta tables
def read_delta(path: str):
    """Read a Delta table from an S3 path; return None if the read fails."""
    try:
        # Delta tables carry their own schema in the transaction log,
        # so no schema-inference option is needed.
        df = spark.read.format("delta").load(path)

        print(f"Successfully read delta table from: {path}")
        # Note: count() triggers a full scan of the table.
        print(f"Number of rows: {df.count():,}")
        return df

    except Exception as e:
        print(f"Error reading delta table from {path}")
        print(f"Error: {str(e)}")
        return None

# Load all tables
bucket = "yelp-stevenhurwitt-2"

business_df = read_delta(f"s3a://{bucket}/business")
review_df = read_delta(f"s3a://{bucket}/reviews")
user_df = read_delta(f"s3a://{bucket}/users")
checkin_df = read_delta(f"s3a://{bucket}/checkins")
tip_df = read_delta(f"s3a://{bucket}/tips")

25/09/03 21:02:37 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties

Successfully read delta table from: s3a://yelp-stevenhurwitt-2/business

25/09/03 21:03:55 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

Number of rows: 150,346
Successfully read delta table from: s3a://yelp-stevenhurwitt-2/reviews
Number of rows: 6,990,280
Successfully read delta table from: s3a://yelp-stevenhurwitt-2/users
Number of rows: 1,987,897
Successfully read delta table from: s3a://yelp-stevenhurwitt-2/checkins
Number of rows: 131,930
Successfully read delta table from: s3a://yelp-stevenhurwitt-2/tips
Number of rows: 908,915


## Feature Engineering

In [5]:
# Create comprehensive features for machine learning
print("Creating feature dataset...")

# Business features with aggregated metrics
business_features = business_df.select(
    "business_id",
    "stars",
    "review_count",
    "is_open",
    "state",
    "city",
    "categories",
    "latitude",
    "longitude"
).withColumn(
    "has_categories", when(col("categories").isNotNull(), 1).otherwise(0)
).withColumn(
    "is_restaurant", when(col("categories").contains("Restaurant"), 1).otherwise(0)
).withColumn(
    "category_count", when(col("categories").isNotNull(), size(split(col("categories"), ","))).otherwise(0)
)

# Add user aggregated features to reviews
user_agg = user_df.select(
    "user_id",
    "review_count",
    "useful",
    "funny",
    "cool",
    "fans",
    "average_stars"
).withColumnRenamed("review_count", "user_review_count") \
 .withColumnRenamed("useful", "user_useful") \
 .withColumnRenamed("funny", "user_funny") \
 .withColumnRenamed("cool", "user_cool")

# Create review features dataset
review_features = review_df.select(
    "review_id",
    "user_id",
    "business_id",
    "stars",
    "useful",
    "funny",
    "cool",
    "text",
    "date"
).withColumn("text_length", length(col("text"))) \
 .withColumn("year", year(col("date"))) \
 .withColumn("month", month(col("date"))) \
 .withColumn("total_votes", col("useful") + col("funny") + col("cool"))

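# Join datasets for comprehensive feature set.
# Rename the business-level "stars" first so it does not collide with the
# review-level "stars" column (an ambiguous reference in later aggregations).
ml_dataset = review_features \
    .join(business_features.withColumnRenamed("stars", "business_stars"), "business_id", "inner") \
    .join(user_agg, "user_id", "inner")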

print(f"ML dataset created with {ml_dataset.count():,} rows and {len(ml_dataset.columns)} columns")

Creating feature dataset...

25/09/03 21:28:37 ERROR TaskSchedulerImpl: Lost executor 0 on 172.18.0.5: Command exited with code 137
25/09/03 21:28:37 WARN TaskSetManager: Lost task 2.0 in stage 46.0 (TID 744) (172.18.0.5 executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Command exited with code 137
25/09/03 21:28:37 WARN TaskSetManager: Lost task 1.0 in stage 46.0 (TID 743) (172.18.0.5 executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Command exited with code 137
25/09/03 21:28:37 WARN TaskSetManager: Lost task 0.0 in stage 46.0 (TID 742) (172.18.0.5 executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Command exited with code 137
25/09/03 21:28:37 WARN TaskSetManager: Lost task 3.0 in stage 46.0 (TID 745) (172.18.0.5 executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Command exited with code 137
25/09/03 21:28:37 WARN BlockManagerMaster

ML dataset created with 6,990,247 rows and 30 columns
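
The three category-derived columns (`has_categories`, `is_restaurant`, `category_count`) are easier to reason about on a single value. The sketch below mirrors the Spark expressions in plain Python for one business. The helper name `category_features` and the sample strings are assumptions for illustration.

```python
def category_features(categories):
    """Mirror the three category-derived columns for a single business.

    `categories` is the raw comma-separated string, or None when missing.
    """
    if categories is None:
        return {"has_categories": 0, "is_restaurant": 0, "category_count": 0}
    return {
        "has_categories": 1,
        # Case-sensitive substring match, like col("categories").contains("Restaurant")
        "is_restaurant": int("Restaurant" in categories),
        # Number of comma-separated pieces, like size(split(col("categories"), ","))
        "category_count": len(categories.split(",")),
    }


# Hypothetical category strings in the Yelp style.
print(category_features("Restaurants, Mexican, Bars"))
print(category_features("Hair Salons"))
print(category_features(None))
```

Because the match is a substring test, a value such as "Restaurants" also counts as a restaurant, while a lowercase "restaurant" would not.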


                                                                                

In [None]:
# Sample data for local ML processing (to manage memory)
sample_fraction = 0.1  # keep each row with probability 0.1 (about 10% of rows)
ml_sample = ml_dataset.sample(fraction=sample_fraction, seed=42)

print(f"Sampled {ml_sample.count():,} rows for ML analysis")
ml_sample.show(5)

25/09/04 20:53:27 ERROR TaskSchedulerImpl: Lost executor 55 on 172.18.0.6: Command exited with code 137
25/09/04 20:53:27 WARN TaskSetManager: Lost task 0.0 in stage 116.0 (TID 5296) (172.18.0.6 executor 55): ExecutorLostFailure (executor 55 exited caused by one of the running tasks) Reason: Command exited with code 137
25/09/04 20:53:27 WARN TaskSetManager: Lost task 4.0 in stage 116.0 (TID 5298) (172.18.0.6 executor 55): ExecutorLostFailure (executor 55 exited caused by one of the running tasks) Reason: Command exited with code 137
25/09/04 20:53:27 WARN TaskSetManager: Lost task 2.0 in stage 116.0 (TID 5297) (172.18.0.6 executor 55): ExecutorLostFailure (executor 55 exited caused by one of the running tasks) Reason: Command exited with code 137
25/09/04 20:53:27 WARN TaskSetManager: Lost task 5.0 in stage 116.0 (TID 5299) (172.18.0.6 executor 55): ExecutorLostFailure (executor 55 exited caused by one of the running tasks) Reason: Command exited with code 137
25/09/04 20:53:27 WARN B

Sampled 699,039 rows for ML analysis


25/09/04 21:12:52 ERROR TaskSchedulerImpl: Lost executor 57 on 172.18.0.6: Command exited with code 137
25/09/04 21:12:52 WARN TaskSetManager: Lost task 41.0 in stage 128.0 (TID 5899) (172.18.0.6 executor 57): ExecutorLostFailure (executor 57 exited caused by one of the running tasks) Reason: Command exited with code 137
25/09/04 21:12:52 WARN TaskSetManager: Lost task 44.0 in stage 128.0 (TID 5902) (172.18.0.6 executor 57): ExecutorLostFailure (executor 57 exited caused by one of the running tasks) Reason: Command exited with code 137
25/09/04 21:12:52 WARN TaskSetManager: Lost task 46.0 in stage 128.0 (TID 5904) (172.18.0.6 executor 57): ExecutorLostFailure (executor 57 exited caused by one of the running tasks) Reason: Command exited with code 137
25/09/04 21:12:52 WARN TaskSetManager: Lost task 45.0 in stage 128.0 (TID 5903) (172.18.0.6 executor 57): ExecutorLostFailure (executor 57 exited caused by one of the running tasks) Reason: Command exited with code 137
25/09/04 21:12:52 WA
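
`DataFrame.sample` with a fraction keeps each row independently with that probability, so the sample size is random rather than exactly 10% of the input. The sketch below checks that the reported 699,039 rows are consistent with that behavior, using the row counts printed above. The simulation at the end is plain Python, not Spark's sampler.

```python
import math
import random

n_rows = 6_990_247   # rows in ml_dataset, from the output above
fraction = 0.1       # sampling fraction used in the cell above

# Each row is kept independently, so the sample size is Binomial(n, p).
expected = n_rows * fraction
std_dev = math.sqrt(n_rows * fraction * (1 - fraction))
print(f"expected sample size: {expected:,.1f} (std dev about {std_dev:,.0f})")

observed = 699_039   # sample size reported by the notebook
z_score = (observed - expected) / std_dev
print(f"observed {observed:,} is {z_score:.2f} standard deviations from expected")

# Small simulation of the same idea on 100,000 hypothetical rows.
rng = random.Random(42)
kept = len([1 for _ in range(100_000) if rng.random() < fraction])
print(f"simulated: kept {kept:,} of 100,000 rows")
```

When an exact sample size matters, ordering by a random column and taking a fixed `limit` is one alternative, at the cost of a full sort.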

## 1. Business Rating Prediction

In [None]:
# Prepare data for business rating prediction
print("=== BUSINESS RATING PREDICTION ===")

# Aggregate features at business level
business_agg = ml_sample.groupBy("business_id").agg(
    avg("stars").alias("avg_rating"),
    count("review_id").alias("total_reviews"),
    avg("text_length").alias("avg_text_length"),
    sum("useful").alias("total_useful"),
    sum("funny").alias("total_funny"),
    sum("cool").alias("total_cool"),
    stddev("stars").alias("rating_variance")  # sample standard deviation, despite the "variance" alias
).fillna(0)

# Join with business features
business_ml = business_agg.join(
    business_features.select("business_id", "review_count", "is_open", "category_count", "is_restaurant"),
    "business_id",
    "inner"
)

# Convert to Pandas for sklearn
business_pd = business_ml.toPandas()

print(f"Business dataset shape: {business_pd.shape}")
business_pd.head()

In [None]:
# Prepare features and target for business rating prediction
feature_columns = ['total_reviews', 'avg_text_length', 'total_useful', 'total_funny', 'total_cool', 
                   'rating_variance', 'review_count', 'is_open', 'category_count', 'is_restaurant']

# Remove rows with missing values
business_clean = business_pd.dropna(subset=feature_columns + ['avg_rating'])

X_business = business_clean[feature_columns]
y_business = business_clean['avg_rating']

print(f"Clean dataset shape: {X_business.shape}")
print(f"Target range: {y_business.min():.2f} - {y_business.max():.2f}")

# Split the data
X_train_bus, X_test_bus, y_train_bus, y_test_bus = train_test_split(
    X_business, y_business, test_size=0.2, random_state=42
)

# Scale features
scaler_bus = StandardScaler()
X_train_bus_scaled = scaler_bus.fit_transform(X_train_bus)
X_test_bus_scaled = scaler_bus.transform(X_test_bus)

print(f"Training set: {X_train_bus_scaled.shape}")
print(f"Test set: {X_test_bus_scaled.shape}")
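
The scaler above is fitted on the training split only and then applied to the test split, so test data never influences the scaling statistics. A minimal sketch of the same idea for one column in plain Python follows. The helper names and the numbers are assumptions for illustration. `StandardScaler` uses the population standard deviation, which the sketch mirrors.

```python
import math


def fit_standardizer(column):
    """Return the mean and population standard deviation of a column."""
    # math.fsum is used because the notebook's wildcard PySpark import shadows sum
    mean = math.fsum(column) / len(column)
    variance = math.fsum((v - mean) ** 2 for v in column) / len(column)
    return mean, math.sqrt(variance)


def standardize(column, mean, std):
    """Apply previously fitted statistics to any column."""
    return [(v - mean) / std for v in column]


train = [10.0, 20.0, 30.0, 40.0]   # hypothetical training values
test = [25.0, 100.0]               # hypothetical test values

mean, std = fit_standardizer(train)           # statistics come from train only
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)    # test reuses the train statistics

print(f"mean={mean:.2f}, std={std:.4f}")
print("train:", [f"{v:.3f}" for v in train_scaled])
print("test: ", [f"{v:.3f}" for v in test_scaled])
```

The scaled test values need not have zero mean or unit variance, which is expected: they are expressed in units of the training distribution.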

In [None]:
# Train multiple regression models for business rating prediction
models_regression = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
    'SVR': SVR(kernel='rbf', C=1.0)
}

results_regression = {}

for name, model in models_regression.items():
    print(f"\nTraining {name}...")
    
    # Train model
    model.fit(X_train_bus_scaled, y_train_bus)
    
    # Make predictions
    y_pred_train = model.predict(X_train_bus_scaled)
    y_pred_test = model.predict(X_test_bus_scaled)
    
    # Calculate metrics
    train_rmse = np.sqrt(mean_squared_error(y_train_bus, y_pred_train))
    test_rmse = np.sqrt(mean_squared_error(y_test_bus, y_pred_test))
    train_r2 = r2_score(y_train_bus, y_pred_train)
    test_r2 = r2_score(y_test_bus, y_pred_test)
    
    results_regression[name] = {
        'Train RMSE': train_rmse,
        'Test RMSE': test_rmse,
        'Train R²': train_r2,
        'Test R²': test_r2,
        'Model': model
    }
    
    print(f"Train RMSE: {train_rmse:.4f}, Test RMSE: {test_rmse:.4f}")
    print(f"Train R²: {train_r2:.4f}, Test R²: {test_r2:.4f}")

# Display results summary
results_df = pd.DataFrame({k: {metric: v[metric] for metric in ['Train RMSE', 'Test RMSE', 'Train R²', 'Test R²']} 
                          for k, v in results_regression.items()}).T
print("\n=== BUSINESS RATING PREDICTION RESULTS ===")
print(results_df.round(4))
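
The two metrics reported by the loop above can be written out by hand, which makes their scale easier to interpret: RMSE is in star units, and R² compares the model against always predicting the mean rating. The sketch below uses hypothetical ratings, not notebook results.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(math.fsum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = math.fsum(y_true) / len(y_true)
    ss_res = math.fsum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = math.fsum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 4.5, 2.0, 5.0]   # hypothetical average ratings
y_pred = [3.5, 4.0, 2.5, 4.5]   # hypothetical model predictions

model_rmse = rmse(y_true, y_pred)
model_r2 = r_squared(y_true, y_pred)

# Baseline: always predict the mean rating, which gives R² = 0 by construction.
mean_rating = math.fsum(y_true) / len(y_true)
baseline_r2 = r_squared(y_true, [mean_rating] * len(y_true))

print(f"RMSE = {model_rmse:.4f}, R² = {model_r2:.4f}, baseline R² = {baseline_r2:.4f}")
```

A test R² near zero therefore means the features add little beyond the mean rating, whatever the RMSE looks like.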

## 2. Review Sentiment Classification

In [None]:
# Prepare data for sentiment classification
print("=== REVIEW SENTIMENT CLASSIFICATION ===")

# Create sentiment labels (1-2 stars = negative, 4-5 stars = positive, skip 3 for clear distinction)
sentiment_sample = ml_sample.filter((col("stars") <= 2) | (col("stars") >= 4)) \
                           .withColumn("sentiment", when(col("stars") >= 4, 1).otherwise(0)) \
                           .select("text", "sentiment", "stars", "text_length")

# Sample smaller subset for text processing
text_sample = sentiment_sample.sample(0.2, seed=42).toPandas()

print(f"Sentiment dataset shape: {text_sample.shape}")
print(f"Positive samples: {text_sample['sentiment'].sum()}")
print(f"Negative samples: {len(text_sample) - text_sample['sentiment'].sum()}")

text_sample.head()

In [None]:
# Text preprocessing and feature extraction
print("Processing text features...")

# Clean text data
text_clean = text_sample.dropna(subset=['text', 'sentiment'])
text_clean = text_clean[text_clean['text'].str.len() > 10]  # Remove very short texts

# Create TF-IDF features
tfidf = TfidfVectorizer(
    max_features=5000,
    stop_words='english',
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.95
)

X_text_tfidf = tfidf.fit_transform(text_clean['text'])
y_text = text_clean['sentiment']

print(f"TF-IDF matrix shape: {X_text_tfidf.shape}")
print(f"Vocabulary size: {len(tfidf.vocabulary_)}")

# Reduce dimensionality for faster processing
svd = TruncatedSVD(n_components=100, random_state=42)
X_text_reduced = svd.fit_transform(X_text_tfidf)

print(f"Reduced feature matrix shape: {X_text_reduced.shape}")
print(f"Explained variance ratio: {svd.explained_variance_ratio_.sum():.3f}")
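
To make the TF-IDF step less opaque, the sketch below computes the weights for a toy corpus in plain Python, using scikit-learn's default weighting (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization). It is a simplification: it omits the stop-word removal, bigrams, `min_df`, `max_df`, and `max_features` settings used in the cell above, and the three documents are hypothetical.

```python
import math
from collections import Counter

docs = [
    "great food great service",
    "terrible service",
    "great place",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_vector(tokens):
    """L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(tokens)
    weights = {
        # Raw term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, tf in counts.items()
    }
    norm = math.sqrt(math.fsum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


first = tfidf_vector(tokenized[0])
for term, weight in sorted(first.items()):
    print(f"{term:10s} {weight:.4f}")
```

In the first document, "great" gets the largest weight because it occurs twice, and "food" outweighs "service" because it appears in fewer documents.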

In [None]:
# Train sentiment classification models
X_train_text, X_test_text, y_train_text, y_test_text = train_test_split(
    X_text_reduced, y_text, test_size=0.2, random_state=42, stratify=y_text
)

models_classification = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(kernel='rbf', random_state=42)
}

results_classification = {}

for name, model in models_classification.items():
    print(f"\nTraining {name}...")
    
    # Train model
    model.fit(X_train_text, y_train_text)
    
    # Make predictions
    y_pred_train = model.predict(X_train_text)
    y_pred_test = model.predict(X_test_text)
    
    # Calculate accuracy
    train_acc = (y_pred_train == y_train_text).mean()
    test_acc = (y_pred_test == y_test_text).mean()
    
    results_classification[name] = {
        'Train Accuracy': train_acc,
        'Test Accuracy': test_acc,
        'Model': model,
        'Predictions': y_pred_test
    }
    
    print(f"Train Accuracy: {train_acc:.4f}")
    print(f"Test Accuracy: {test_acc:.4f}")

# Display classification results
class_results_df = pd.DataFrame({k: {metric: v[metric] for metric in ['Train Accuracy', 'Test Accuracy']} 
                                for k, v in results_classification.items()}).T
print("\n=== SENTIMENT CLASSIFICATION RESULTS ===")
print(class_results_df.round(4))

# Show classification report for best model
best_model_name = class_results_df['Test Accuracy'].idxmax()
best_predictions = results_classification[best_model_name]['Predictions']

print(f"\n=== CLASSIFICATION REPORT ({best_model_name}) ===")
print(classification_report(y_test_text, best_predictions, target_names=['Negative', 'Positive']))
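
The per-class precision, recall, and F1 figures in a classification report come from simple counts. The sketch below recomputes them by hand for a small set of hypothetical labels (1 = positive, 0 = negative). The function name `binary_report` is an assumption for illustration.

```python
def binary_report(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = len([1 for t, p in pairs if t == positive and p == positive])
    fp = len([1 for t, p in pairs if t != positive and p == positive])
    fn = len([1 for t, p in pairs if t == positive and p != positive])
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


y_true = [1, 1, 1, 1, 0, 0, 1, 0]   # hypothetical true sentiment
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]   # hypothetical predictions

positive_scores = binary_report(y_true, y_pred, positive=1)
negative_scores = binary_report(y_true, y_pred, positive=0)

print("Positive:", {k: f"{v:.3f}" for k, v in positive_scores.items()})
print("Negative:", {k: f"{v:.3f}" for k, v in negative_scores.items()})
```

Looking at the negative class separately matters here because positive reviews are usually the majority, so overall accuracy can hide weak recall on negative reviews.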

## 3. Business Success Classification

In [None]:
# Define business success based on multiple criteria
print("=== BUSINESS SUCCESS CLASSIFICATION ===")

# Create success metric (high rating + many reviews + still open)
business_success = business_features.withColumn(
    "successful", 
    when((col("stars") >= 4.0) & (col("review_count") >= 50) & (col("is_open") == 1), 1)
    .otherwise(0)
)

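# Select features for success prediction.
# Caveat: review_count is also part of the "successful" label definition above,
# so models can partly read the label off this feature.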
success_features = business_success.select(
    "business_id",
    "review_count",
    "is_restaurant",
    "category_count",
    "has_categories",
    "successful"
).fillna(0)

# Convert to pandas
success_pd = success_features.toPandas()

print(f"Business success dataset shape: {success_pd.shape}")
print(f"Successful businesses: {success_pd['successful'].sum()} ({success_pd['successful'].mean()*100:.1f}%)")
print(f"Unsuccessful businesses: {len(success_pd) - success_pd['successful'].sum()} ({(1-success_pd['successful'].mean())*100:.1f}%)")

success_pd.head()

In [None]:
# Train business success classification model
feature_cols_success = ['review_count', 'is_restaurant', 'category_count', 'has_categories']

X_success = success_pd[feature_cols_success]
y_success = success_pd['successful']

# Split data
X_train_suc, X_test_suc, y_train_suc, y_test_suc = train_test_split(
    X_success, y_success, test_size=0.2, random_state=42, stratify=y_success
)

# Scale features
scaler_suc = StandardScaler()
X_train_suc_scaled = scaler_suc.fit_transform(X_train_suc)
X_test_suc_scaled = scaler_suc.transform(X_test_suc)

# Train classification models
success_models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(kernel='rbf', random_state=42)
}

success_results = {}

for name, model in success_models.items():
    print(f"\nTraining {name} for business success...")
    
    # Train model
    model.fit(X_train_suc_scaled, y_train_suc)
    
    # Make predictions
    y_pred_train_suc = model.predict(X_train_suc_scaled)
    y_pred_test_suc = model.predict(X_test_suc_scaled)
    
    # Calculate accuracy
    train_acc = (y_pred_train_suc == y_train_suc).mean()
    test_acc = (y_pred_test_suc == y_test_suc).mean()
    
    success_results[name] = {
        'Train Accuracy': train_acc,
        'Test Accuracy': test_acc,
        'Model': model,
        'Predictions': y_pred_test_suc
    }
    
    print(f"Train Accuracy: {train_acc:.4f}")
    print(f"Test Accuracy: {test_acc:.4f}")

# Display results
success_results_df = pd.DataFrame({k: {metric: v[metric] for metric in ['Train Accuracy', 'Test Accuracy']} 
                                  for k, v in success_results.items()}).T
print("\n=== BUSINESS SUCCESS CLASSIFICATION RESULTS ===")
print(success_results_df.round(4))
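
Because the "successful" label requires three conditions at once, it is likely to be the minority class, and accuracy alone can then look good for a model that never predicts success. A quick reference point is the accuracy of always predicting the most common label. The sketch below uses a hypothetical 15% / 85% split, not the notebook's actual class balance.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Hypothetical labels: 15 successful (1) and 85 not successful (0).
labels = [1] * 15 + [0] * 85
baseline = majority_baseline_accuracy(labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

A model is only informative to the extent that its test accuracy exceeds this baseline; precision and recall on the successful class give a clearer picture.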

## 4. User Clustering Analysis

In [None]:
# User behavior clustering
print("=== USER CLUSTERING ANALYSIS ===")

# Prepare user features for clustering
user_features = user_df.select(
    "user_id",
    "review_count",
    "useful",
    "funny",
    "cool",
    "fans",
    "average_stars"
).fillna(0)

# Sample for clustering
user_cluster_sample = user_features.sample(0.05, seed=42).toPandas()

print(f"User clustering sample: {user_cluster_sample.shape}")

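# Prepare features for clustering (copy so the clipping below does not
# modify a view of user_cluster_sample)
cluster_features = ['review_count', 'useful', 'funny', 'cool', 'fans', 'average_stars']
X_cluster = user_cluster_sample[cluster_features].copy()

# Cap outliers at the 99th percentile (values are clipped, not dropped).
# The loop variable is not named `col`, which would overwrite PySpark's col().
for feature in cluster_features:
    q99 = X_cluster[feature].quantile(0.99)
    X_cluster[feature] = X_cluster[feature].clip(upper=q99)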

# Scale features for clustering
scaler_cluster = StandardScaler()
X_cluster_scaled = scaler_cluster.fit_transform(X_cluster)

print(f"Features for clustering: {X_cluster_scaled.shape}")
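
The clipping step above caps extreme values instead of dropping rows, which keeps every user in the sample while limiting the pull of a few very active accounts on the scaled features. A minimal sketch of the same operation in plain Python follows, using linear interpolation between order statistics (the pandas default for `quantile`). The helper names and data are assumptions for illustration.

```python
import math


def quantile_linear(values, q):
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def clip_upper(values, cap):
    """Replace every value above `cap` with `cap`."""
    return [v if v < cap else cap for v in values]


# Hypothetical review counts: 100 ordinary users plus one extreme account.
review_counts = list(range(1, 101)) + [10_000]
cap = quantile_linear(review_counts, 0.99)
clipped = clip_upper(review_counts, cap)

print(f"99th percentile cap: {cap:.1f}")
print(f"largest value before: {sorted(review_counts)[-1]}, after: {sorted(clipped)[-1]:.1f}")
```

The row count is unchanged after clipping; only the extreme value is pulled down to the cap.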

In [None]:
# Perform K-means clustering
# Find optimal number of clusters using elbow method
inertias = []
k_range = range(2, 11)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_cluster_scaled)
    inertias.append(kmeans.inertia_)

# Plot elbow curve
plt.figure(figsize=(10, 6))
plt.plot(k_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.grid(True, alpha=0.3)
plt.show()

# Use k=5 (fixed here; confirm the choice against the elbow plot above)
optimal_k = 5
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_cluster_scaled)

# Add cluster labels to dataframe
user_cluster_sample['cluster'] = cluster_labels

print(f"\nCluster distribution:")
print(user_cluster_sample['cluster'].value_counts().sort_index())
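
The inertia plotted in the elbow curve is the sum of squared distances from each point to its nearest cluster center. The sketch below computes it by hand for four hypothetical 2-D points, with fixed centroids rather than centroids fitted by K-means, to show why inertia drops as clusters are added.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return math.fsum((x - y) ** 2 for x, y in zip(a, b))


def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for point in points:
        distances = sorted(squared_distance(point, c) for c in centroids)
        total += distances[0]  # distance to the nearest centroid
    return total


# Two well-separated groups of hypothetical points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_cluster = [(5.0, 1.0)]                  # a single centroid at the overall mean
two_clusters = [(0.0, 1.0), (10.0, 1.0)]    # one centroid per group

inertia_k1 = inertia(points, one_cluster)
inertia_k2 = inertia(points, two_clusters)
print(f"k=1 inertia: {inertia_k1:.1f}")
print(f"k=2 inertia: {inertia_k2:.1f}")
```

Inertia always decreases as k grows, so the elbow method looks for the point where the decrease flattens rather than for a minimum.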

In [None]:
# Analyze cluster characteristics
cluster_analysis = user_cluster_sample.groupby('cluster')[cluster_features].mean()

print("=== CLUSTER CHARACTERISTICS ===")
print(cluster_analysis.round(2))

# Visualize clusters
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

for i, feature in enumerate(cluster_features):
    for cluster_id in range(optimal_k):
        cluster_data = user_cluster_sample[user_cluster_sample['cluster'] == cluster_id][feature]
        axes[i].hist(cluster_data, alpha=0.7, label=f'Cluster {cluster_id}', bins=20)
    
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('Frequency')
    axes[i].set_title(f'Distribution of {feature} by Cluster')
    axes[i].legend()
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

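# Interpret clusters
# NOTE: K-means cluster ids are arbitrary, so these names are placeholders;
# check them against cluster_analysis above before relying on them.
print("\n=== CLUSTER INTERPRETATION ===")
cluster_names = {
    0: "Casual Users",
    1: "Active Reviewers",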
    2: "Social Influencers",
    3: "Power Users",
    4: "Elite Critics"
}

for i in range(optimal_k):
    cluster_stats = cluster_analysis.loc[i]
    print(f"\nCluster {i} ({cluster_names.get(i, f'Cluster {i}')}):") 
    print(f"  Avg Reviews: {cluster_stats['review_count']:.1f}")
    print(f"  Avg Rating: {cluster_stats['average_stars']:.2f}")
    print(f"  Fans: {cluster_stats['fans']:.1f}")
    print(f"  Useful Votes: {cluster_stats['useful']:.1f}")
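
Because K-means assigns cluster ids arbitrarily, a fixed id-to-name mapping can attach the wrong label to a segment. One possible alternative, sketched here with hypothetical per-cluster means (`demo_means` stands in for `cluster_analysis`, and the label list is illustrative), is to rank clusters by a chosen statistic and assign names in that order.

```python
import pandas as pd

# Hypothetical per-cluster means standing in for `cluster_analysis`.
demo_means = pd.DataFrame(
    {"review_count": [250.0, 4.0, 40.0], "fans": [30.0, 0.1, 2.0]},
    index=pd.Index([0, 1, 2], name="cluster"),
)

# Illustrative labels, ordered from least to most active.
ordered_names = ["Casual Users", "Active Reviewers", "Power Users"]

# Rank cluster ids by average review count, then pair them with the labels.
ranked_ids = demo_means["review_count"].sort_values().index
names_by_cluster = {int(cid): name for cid, name in zip(ranked_ids, ordered_names)}

for cid in demo_means.index:
    print(f"Cluster {cid}: {names_by_cluster[cid]}")
```

Ranking on a single feature is a simplification; segments such as "Social Influencers" would need a rule based on `fans` or vote counts instead.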

## 5. Model Performance Summary

In [None]:
# Create comprehensive results summary
print("=== MACHINE LEARNING RESULTS SUMMARY ===")

# Business rating prediction summary
print("\n1. BUSINESS RATING PREDICTION:")
print("Task: Predict average business rating from business characteristics")
best_reg_model = results_df['Test R²'].idxmax()
best_reg_r2 = results_df.loc[best_reg_model, 'Test R²']
best_reg_rmse = results_df.loc[best_reg_model, 'Test RMSE']
print(f"Best Model: {best_reg_model} (R² = {best_reg_r2:.4f}, RMSE = {best_reg_rmse:.4f})")

# Sentiment classification summary
print("\n2. REVIEW SENTIMENT CLASSIFICATION:")
print("Task: Classify review sentiment (positive/negative) from text")
best_class_model = class_results_df['Test Accuracy'].idxmax()
best_class_acc = class_results_df.loc[best_class_model, 'Test Accuracy']
print(f"Best Model: {best_class_model} (Accuracy = {best_class_acc:.4f})")

# Business success classification summary
print("\n3. BUSINESS SUCCESS CLASSIFICATION:")
print("Task: Predict business success (high rating + many reviews + open)")
best_success_model = success_results_df['Test Accuracy'].idxmax()
best_success_acc = success_results_df.loc[best_success_model, 'Test Accuracy']
print(f"Best Model: {best_success_model} (Accuracy = {best_success_acc:.4f})")

# Clustering summary
print("\n4. USER CLUSTERING:")
print("Task: Segment users based on behavior patterns")
print(f"Clusters Identified: {optimal_k} distinct user segments")
print(f"Sample Size: {len(user_cluster_sample):,} users")

# Qualitative takeaways (hard-coded text); confirm each against the metrics printed above
print("\n=== KEY INSIGHTS ===")
print("• Business ratings can be predicted with moderate accuracy from aggregated features")
print("• Text-based sentiment classification achieves good performance on polarized reviews")
print("• Business success can be classified using simple business metrics")
print("• Users show distinct behavioral patterns suitable for targeted recommendations")
print("• Feature engineering and data quality are crucial for model performance")

## 6. Feature Importance Analysis

In [None]:
# Analyze feature importance for best performing models
print("=== FEATURE IMPORTANCE ANALYSIS ===")

# Business rating prediction feature importance
# (skipped silently if the best model has no feature_importances_, e.g. linear models or SVMs)
best_reg_model_obj = results_regression[best_reg_model]['Model']
if hasattr(best_reg_model_obj, 'feature_importances_'):
    feature_importance = pd.DataFrame({
        'feature': feature_columns,
        'importance': best_reg_model_obj.feature_importances_
    }).sort_values('importance', ascending=False)
    
    print(f"\nFeature Importance - {best_reg_model} (Business Rating Prediction):")
    print(feature_importance)
    
    # Plot feature importance
    plt.figure(figsize=(10, 6))
    plt.barh(feature_importance['feature'], feature_importance['importance'])
    plt.xlabel('Feature Importance')
    plt.title(f'Feature Importance - {best_reg_model}')
    plt.tight_layout()
    plt.show()

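# Business success feature importance
# (skipped silently if the best model has no feature_importances_, e.g. logistic regression or SVMs)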
best_success_model_obj = success_results[best_success_model]['Model']
if hasattr(best_success_model_obj, 'feature_importances_'):
    success_importance = pd.DataFrame({
        'feature': feature_cols_success,
        'importance': best_success_model_obj.feature_importances_
    }).sort_values('importance', ascending=False)
    
    print(f"\nFeature Importance - {best_success_model} (Business Success):")
    print(success_importance)
    
    # Plot feature importance
    plt.figure(figsize=(10, 4))
    plt.barh(success_importance['feature'], success_importance['importance'])
    plt.xlabel('Feature Importance')
    plt.title(f'Feature Importance - {best_success_model} (Success Prediction)')
    plt.tight_layout()
    plt.show()
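
The checks above only report importances for models that expose `feature_importances_`. For other model types, a model-agnostic option is scikit-learn's `permutation_importance`. The sketch below uses synthetic data and made-up feature names (`demo_names`), so it illustrates the technique and is not a result from this notebook.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 3))
# Only the first column drives the target; the other two are pure noise.
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

model = LinearRegression().fit(X_demo, y_demo)

# Shuffle each feature in turn and measure how much the score drops.
result = permutation_importance(model, X_demo, y_demo, n_repeats=10, random_state=42)

demo_names = ["signal", "noise_a", "noise_b"]  # hypothetical feature names
ranked = sorted(zip(demo_names, result.importances_mean), key=lambda item: -item[1])
for name, mean_drop in ranked:
    print(f"{name}: {mean_drop:.3f}")
```

In practice the importance would be computed on held-out data rather than the training set, so that it reflects what the model relies on when generalizing.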

## Recommendations and Next Steps

### Model Performance Insights:
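1. **Business Rating Prediction**: Models show moderate predictive power, which suggests that ratings depend on factors beyond the basic business metrics used as features
2. **Sentiment Classification**: Text-based models perform well on polarized reviews (1-2 vs 4-5 stars); neutral 3-star reviews fall outside this comparison
3. **Business Success**: Simple metrics like review count are strong predictors of success, but the success label is itself defined from rating, review count, and open status, so part of this signal is circular
4. **User Clustering**: K-means with k=5 separates users into five behavioral segments, which can support targeted strategies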

### Recommendations for Improvement:
1. **Feature Engineering**: Include temporal features, geographic features, and competitive analysis
2. **Text Processing**: Use advanced NLP techniques (BERT, word embeddings) for better sentiment analysis
3. **Deep Learning**: Implement neural networks for complex pattern recognition
4. **Ensemble Methods**: Combine multiple models for better performance
5. **Cross-validation**: Use more robust validation strategies for model selection
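
As an illustration of items 4 and 5, the following sketch combines two regressors with scikit-learn's `VotingRegressor` and scores the ensemble with shuffled 5-fold cross-validation. The data is synthetic (`make_regression`), and the model choices and hyperparameters are assumptions made for the example, not tuned settings for the Yelp features.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem standing in for the business feature matrix.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

# Average the predictions of a linear model and a random forest.
ensemble = VotingRegressor([
    ("linear", LinearRegression()),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=42)),
])

# Shuffled 5-fold CV gives a spread of scores instead of a single split.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(ensemble, X_demo, y_demo, cv=cv, scoring="r2")

print(f"R² per fold: {scores.round(3)}")
print(f"Mean R²: {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the per-fold spread alongside the mean makes it easier to tell whether a difference between two models is larger than the variation across splits.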

### Business Applications:
1. **Restaurant Recommendations**: Use collaborative filtering based on user clusters
2. **Business Intelligence**: Predict which businesses are likely to succeed
3. **Review Quality**: Automatically flag potentially fake or low-quality reviews
4. **Market Analysis**: Identify market gaps and opportunities
5. **Customer Segmentation**: Tailor marketing strategies to different user segments