# Yelp Review Rating Prediction via Prompting

This notebook implements three prompting approaches for classifying Yelp reviews into 1-5 star ratings.

## Dataset
The data is the Yelp Reviews dataset from Kaggle: https://www.kaggle.com/datasets/omkarsabnis/yelp-reviews-dataset

In [None]:
import pandas as pd
import numpy as np
import json
import os
from typing import Dict, List, Tuple
import google.generativeai as genai
from dotenv import load_dotenv
import time
from collections import Counter

load_dotenv()

# Initialize Gemini API
api_key = os.getenv('GEMINI_API_KEY')
if api_key:
    genai.configure(api_key=api_key)
    model = genai.GenerativeModel('gemini-pro')
else:
    print("Warning: GEMINI_API_KEY not found. Please set it in .env file")
    model = None

## Load and Sample Dataset

In [None]:
# Load dataset (adjust path as needed)
# If dataset is large, sample 200 rows for evaluation
try:
    df = pd.read_csv('yelp_reviews.csv')
    print(f"Loaded {len(df)} reviews")
    
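    # Sample 200 rows if dataset is larger
    if len(df) > 200:
        # reset_index gives the sample sequential row labels (0..199)
        df_sample = df.sample(n=200, random_state=42).reset_index(drop=True)
        print(f"Sampled {len(df_sample)} reviews for evaluation")
    else:
        # Copy so the column fixes below do not modify df
        df_sample = df.copy()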
        
    # Ensure we have required columns
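    if 'text' not in df_sample.columns:
        # Fall back to the first column whose name mentions text or review
        text_cols = [c for c in df_sample.columns if 'text' in c.lower() or 'review' in c.lower()]
        if not text_cols:
            raise KeyError("No review text column found in the dataset")
        df_sample['text'] = df_sample[text_cols[0]]
    
    if 'stars' not in df_sample.columns:
        # Fall back to the first column whose name mentions star or rating
        stars_cols = [c for c in df_sample.columns if 'star' in c.lower() or 'rating' in c.lower()]
        if not stars_cols:
            raise KeyError("No star rating column found in the dataset")
        df_sample['stars'] = df_sample[stars_cols[0]]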
        
    print(f"\nDataset columns: {df_sample.columns.tolist()}")
    print(f"\nStar distribution:")
    print(df_sample['stars'].value_counts().sort_index())
    
except FileNotFoundError:
    print("Dataset file not found. Please download from Kaggle and place as 'yelp_reviews.csv'")
    print("Creating sample data for demonstration...")
    # Create sample data for demonstration
    sample_reviews = [
        ("This restaurant is absolutely amazing! The food was delicious and the service was outstanding. I will definitely come back.", 5),
        ("Terrible experience. Food was cold, service was slow, and the place was dirty. Never coming back.", 1),
        ("It was okay. Nothing special, but nothing terrible either. Average food and service.", 3),
        ("Great food but the wait time was too long. Staff was friendly though.", 4),
        ("The worst restaurant I've ever been to. Food poisoning and rude staff.", 1),
        ("Excellent service and good food. A bit pricey but worth it.", 4),
        ("Mediocre at best. Expected more for the price.", 2),
        ("Perfect! Everything was amazing from start to finish.", 5),
    ]
    
    # Expand to 200 samples
    expanded = []
    for i in range(25):
        for text, stars in sample_reviews:
            expanded.append((f"{text} (Sample {i+1})", stars))
    
    df_sample = pd.DataFrame(expanded, columns=['text', 'stars'])
    print(f"Created {len(df_sample)} sample reviews")
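A plain random sample inherits the star distribution of the full file, so less common ratings can end up with very few rows. The following is a minimal sketch of a stratified alternative using only the standard library. The helper name `stratified_sample` and the `(text, stars)` tuple layout are assumptions made for illustration and are not part of the notebook.

```python
import random
from collections import defaultdict


def stratified_sample(rows, per_class, seed=42):
    """Pick up to `per_class` rows for each star rating.

    `rows` is a list of (text, stars) tuples (hypothetical layout).
    """
    rng = random.Random(seed)

    # Group the rows by their star rating
    by_stars = defaultdict(list)
    for text, stars in rows:
        by_stars[stars].append((text, stars))

    # Draw the same number of rows from every rating, where available
    sample = []
    for stars in sorted(by_stars):
        group = by_stars[stars]
        k = min(per_class, len(group))
        sample.extend(rng.sample(group, k))
    return sample


# Synthetic rows, 20 per rating, used only to exercise the helper
rows = [(f"review {i}", (i % 5) + 1) for i in range(100)]
balanced = stratified_sample(rows, per_class=4)
```

With 40 rows per class this would give a 200-row sample in which every rating is equally represented, at the cost of no longer reflecting the natural rating distribution.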

## Prompting Approach 1: Direct Classification

In [None]:
def prompt_approach_1(review_text: str) -> str:
    """Direct classification prompt - straightforward approach"""
    prompt = f"""Classify the following Yelp review into a star rating from 1 to 5.

Review: {review_text}

Return your response as a JSON object with the following structure:
{{
    "predicted_stars": <number between 1 and 5>,
    "explanation": "<brief reasoning for the assigned rating>"
}}

JSON Response:"""
    return prompt
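The prompt asks for a JSON object with two fields, so a reply can be checked against that shape before it is scored. Below is a minimal sketch of such a check. The helper name `validate_rating_response` is hypothetical, and the strictness (integer rating, string explanation) is an assumption about what should count as valid.

```python
import json


def validate_rating_response(raw: str):
    """Return (ok, message) for a raw JSON string (hypothetical helper)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"

    if not isinstance(data, dict):
        return False, "top-level value is not an object"

    stars = data.get("predicted_stars")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(stars, bool) or not isinstance(stars, int):
        return False, "predicted_stars is not an integer"
    if not 1 <= stars <= 5:
        return False, "predicted_stars is outside 1-5"

    if not isinstance(data.get("explanation"), str):
        return False, "explanation is not a string"
    return True, "ok"


ok, message = validate_rating_response(
    '{"predicted_stars": 4, "explanation": "Good food, slow service"}'
)
```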

## Prompting Approach 2: Chain-of-Thought

In [None]:
def prompt_approach_2(review_text: str) -> str:
    """Chain-of-thought prompting - reasoning through the classification"""
    prompt = f"""Analyze the following Yelp review step by step:

Review: {review_text}

Step 1: Identify positive aspects mentioned in the review
Step 2: Identify negative aspects mentioned in the review
Step 3: Assess the overall sentiment (very negative, negative, neutral, positive, very positive)
Step 4: Map the sentiment to a star rating (1=very negative, 2=negative, 3=neutral, 4=positive, 5=very positive)
Step 5: Consider the intensity and specificity of the feedback

Based on your analysis, return a JSON object:
{{
    "predicted_stars": <number between 1 and 5>,
    "explanation": "<detailed reasoning based on your step-by-step analysis>"
}}

JSON Response:"""
    return prompt
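Step 4 of the prompt fixes a one-to-one mapping from sentiment label to star rating. A small sketch of that mapping in code can serve as a sanity check when inspecting explanations by hand. The names `SENTIMENT_TO_STARS` and `sentiment_to_stars` are hypothetical and not used elsewhere in the notebook.

```python
# Mapping copied from Step 4 of the chain-of-thought prompt
SENTIMENT_TO_STARS = {
    "very negative": 1,
    "negative": 2,
    "neutral": 3,
    "positive": 4,
    "very positive": 5,
}


def sentiment_to_stars(label: str) -> int:
    """Map a sentiment label to its star rating (hypothetical helper)."""
    key = label.strip().lower()
    if key not in SENTIMENT_TO_STARS:
        raise ValueError(f"Unknown sentiment label: {label!r}")
    return SENTIMENT_TO_STARS[key]


stars = sentiment_to_stars("Very Positive")
```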

## Prompting Approach 3: Few-Shot Learning

In [None]:
def prompt_approach_3(review_text: str) -> str:
    """Few-shot learning prompt with examples"""
    prompt = f"""You are classifying Yelp reviews into star ratings (1-5). Here are some examples:

Example 1:
Review: "This restaurant is absolutely amazing! The food was delicious and the service was outstanding."
Response: {{"predicted_stars": 5, "explanation": "Highly positive language with strong praise for both food and service"}}

Example 2:
Review: "Terrible experience. Food was cold, service was slow, and the place was dirty."
Response: {{"predicted_stars": 1, "explanation": "Multiple serious complaints about food quality, service, and cleanliness"}}

Example 3:
Review: "It was okay. Nothing special, but nothing terrible either. Average food and service."
Response: {{"predicted_stars": 3, "explanation": "Neutral sentiment with average ratings for both food and service"}}

Example 4:
Review: "Great food but the wait time was too long. Staff was friendly though."
Response: {{"predicted_stars": 4, "explanation": "Positive overall with good food and friendly staff, but one negative aspect (wait time)"}}

Example 5:
Review: "The worst restaurant I've ever been to. Food poisoning and rude staff."
Response: {{"predicted_stars": 1, "explanation": "Extremely negative with serious health and service issues"}}

Now classify this review:
Review: {review_text}

Return your response as a JSON object:
{{
    "predicted_stars": <number between 1 and 5>,
    "explanation": "<brief reasoning for the assigned rating>"
}}

JSON Response:"""
    return prompt
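Hard-coding the examples in the f-string makes them awkward to swap out. One possible alternative, sketched below, builds the same kind of prompt from a list of examples and lets `json.dumps` handle quoting. The function name `build_few_shot_prompt`, the tuple layout, and the example explanations are assumptions for illustration.

```python
import json


def build_few_shot_prompt(review_text, examples):
    """Assemble a few-shot prompt from (review, stars, explanation) tuples."""
    lines = [
        "You are classifying Yelp reviews into star ratings (1-5). "
        "Here are some examples:",
        "",
    ]
    for i, (text, stars, why) in enumerate(examples, start=1):
        # json.dumps produces correctly escaped JSON for each example
        response = json.dumps({"predicted_stars": stars, "explanation": why})
        lines += [
            f"Example {i}:",
            f"Review: {json.dumps(text)}",
            f"Response: {response}",
            "",
        ]
    lines += [
        "Now classify this review:",
        f"Review: {review_text}",
        "",
        "JSON Response:",
    ]
    return "\n".join(lines)


# Illustrative examples; the explanations are made up for this sketch
examples = [
    ("Mediocre at best. Expected more for the price.", 2,
     "Disappointed tone and poor value for money"),
    ("Excellent service and good food. A bit pricey but worth it.", 4,
     "Mostly positive with a minor complaint about price"),
]
prompt = build_few_shot_prompt("The pasta was fine, nothing more.", examples)
```

Keeping the examples in a list also makes it easier to cover every rating and to hold the examples out of the evaluation sample.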

## Helper Functions for LLM Calls and Evaluation

In [None]:
def call_llm(prompt: str, max_retries: int = 3) -> Dict:
    """Call LLM with retry logic"""
    if model is None:
        return {"error": "Model not initialized"}
    
    for attempt in range(max_retries):
        try:
            response = model.generate_content(prompt)
            text = response.text.strip()
            
            # Try to extract JSON from response
            # Remove markdown code blocks if present
            if '```json' in text:
                text = text.split('```json')[1].split('```')[0].strip()
            elif '```' in text:
                text = text.split('```')[1].split('```')[0].strip()
            
            # Parse JSON
            result = json.loads(text)
            return result
        except json.JSONDecodeError as e:
            if attempt < max_retries - 1:
                time.sleep(1)
                continue
            return {"error": f"JSON decode error: {str(e)}", "raw_response": text}
        except Exception as e:
            if attempt < max_retries - 1:
                time.sleep(1)
                continue
            return {"error": str(e)}
    
    return {"error": "Max retries exceeded"}


def evaluate_approach(df: pd.DataFrame, prompt_func, approach_name: str) -> Dict:
    """Evaluate a prompting approach on the dataset"""
    results = []
    valid_json_count = 0
    
    print(f"\nEvaluating {approach_name}...")
    
    for idx, row in df.iterrows():
        review_text = str(row['text'])
        actual_stars = int(row['stars'])
        
        prompt = prompt_func(review_text)
        response = call_llm(prompt)
        
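        # Accept only a JSON object whose predicted_stars is an integer from 1 to 5
        predicted_stars = None
        if isinstance(response, dict) and 'error' not in response:
            try:
                predicted_stars = int(response['predicted_stars'])
            except (KeyError, TypeError, ValueError):
                predicted_stars = None
            if predicted_stars not in range(1, 6):
                predicted_stars = None
        
        if predicted_stars is not None:
            valid_json_count += 1
            explanation = response.get('explanation', '')
            
            results.append({
                'actual': actual_stars,
                'predicted': predicted_stars,
                'correct': actual_stars == predicted_stars,
                'explanation': explanation
            })
        else:
            error = 'Missing or invalid predicted_stars'
            if isinstance(response, dict):
                error = response.get('error', error)
            results.append({
                'actual': actual_stars,
                'predicted': None,
                'correct': False,
                'error': error
            })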
        
        # Progress indicator (counts processed rows; the DataFrame index may not be sequential)
        if len(results) % 20 == 0:
            print(f"  Processed {len(results)}/{len(df)} reviews...")
        
        # Rate limiting
        time.sleep(0.5)
    
    # Calculate metrics
    valid_results = [r for r in results if r['predicted'] is not None]
    
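    if len(valid_results) == 0:
        # Return the same keys as the normal path so downstream cells do not fail
        return {
            'approach': approach_name,
            'accuracy': 0,
            'json_validity_rate': 0,
            'consistency': 0,
            'total_reviews': len(df),
            'valid_responses': 0,
            'correct_predictions': 0,
            'results': results
        }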
    
    # Accuracy is measured over valid responses only; validity rate over all reviews
    correct_count = sum(1 for r in valid_results if r['correct'])
    accuracy = correct_count / len(valid_results)
    json_validity_rate = valid_json_count / len(df)
    
    # Calculate consistency (variance in predictions for same actual rating)
    actual_to_predicted = {}
    for r in valid_results:
        actual = r['actual']
        if actual not in actual_to_predicted:
            actual_to_predicted[actual] = []
        actual_to_predicted[actual].append(r['predicted'])
    
    consistency_scores = []
    for actual, predictions in actual_to_predicted.items():
        if len(predictions) > 1:
            variance = np.var(predictions)
            consistency_scores.append(1 / (1 + variance))  # Lower variance = higher consistency
    
    avg_consistency = np.mean(consistency_scores) if consistency_scores else 0
    
    return {
        'approach': approach_name,
        'accuracy': accuracy,
        'json_validity_rate': json_validity_rate,
        'consistency': avg_consistency,
        'total_reviews': len(df),
        'valid_responses': len(valid_results),
        'correct_predictions': correct_count,
        'results': results
    }
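The fence-stripping in `call_llm` handles replies wrapped in a code block, but it fails when the model adds prose before or after the JSON. The sketch below shows one possible fallback that scans for the first parseable JSON object using `json.JSONDecoder.raw_decode` from the standard library. The helper name `extract_json_object` is hypothetical.

```python
import json


def extract_json_object(text: str):
    """Return the first JSON object found in `text`, or None (hypothetical helper)."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever follows it
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not parseable from this brace; try the next one
        start = text.find("{", start + 1)
    return None


reply = 'Sure, here it is: {"predicted_stars": 2, "explanation": "Slow service"} Hope that helps.'
parsed = extract_json_object(reply)
```

Because the scan starts at the first opening brace, the same helper also copes with fenced replies without any special-casing of the fence markers.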

In [None]:
# Evaluate all three approaches
results_approach_1 = evaluate_approach(df_sample, prompt_approach_1, "Approach 1: Direct Classification")
results_approach_2 = evaluate_approach(df_sample, prompt_approach_2, "Approach 2: Chain-of-Thought")
results_approach_3 = evaluate_approach(df_sample, prompt_approach_3, "Approach 3: Few-Shot Learning")

## Results Comparison

In [None]:
# Create comparison table
comparison_data = {
    'Approach': [
        results_approach_1['approach'],
        results_approach_2['approach'],
        results_approach_3['approach']
    ],
    'Accuracy': [
        f"{results_approach_1['accuracy']:.3f}",
        f"{results_approach_2['accuracy']:.3f}",
        f"{results_approach_3['accuracy']:.3f}"
    ],
    'JSON Validity Rate': [
        f"{results_approach_1['json_validity_rate']:.3f}",
        f"{results_approach_2['json_validity_rate']:.3f}",
        f"{results_approach_3['json_validity_rate']:.3f}"
    ],
    'Consistency': [
        f"{results_approach_1.get('consistency', 0):.3f}",
        f"{results_approach_2.get('consistency', 0):.3f}",
        f"{results_approach_3.get('consistency', 0):.3f}"
    ],
    'Valid Responses': [
        results_approach_1['valid_responses'],
        results_approach_2['valid_responses'],
        results_approach_3['valid_responses']
    ]
}

comparison_df = pd.DataFrame(comparison_data)
print("\n" + "="*80)
print("COMPARISON TABLE")
print("="*80)
print(comparison_df.to_string(index=False))
print("\n" + "="*80)
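To make the Accuracy and Consistency columns easier to interpret, here is a small worked example that recomputes both metrics with the standard library on made-up (actual, predicted) pairs. The function name `consistency_score` is hypothetical; it uses population variance, which matches the default behaviour of `np.var`.

```python
from statistics import mean, pvariance


def consistency_score(pairs):
    """Average of 1 / (1 + variance) over groups of equal actual rating.

    `pairs` is a list of (actual, predicted) tuples.
    """
    by_actual = {}
    for actual, predicted in pairs:
        by_actual.setdefault(actual, []).append(predicted)

    # Groups with a single prediction are skipped, as in evaluate_approach
    scores = [
        1 / (1 + pvariance(preds))
        for preds in by_actual.values()
        if len(preds) > 1
    ]
    return mean(scores) if scores else 0


# Made-up predictions: four 5-star reviews and two 1-star reviews
pairs = [(5, 5), (5, 5), (5, 4), (5, 4), (1, 1), (1, 1)]

accuracy = sum(a == p for a, p in pairs) / len(pairs)
consistency = consistency_score(pairs)

print(round(accuracy, 3))     # 0.667
print(round(consistency, 3))  # 0.9
```

The 5-star group has predictions with variance 0.25, giving 1 / 1.25 = 0.8, and the 1-star group has variance 0, giving 1.0, so the average is 0.9. Note that a model that always predicts the same wrong rating would also score 1.0 on this metric, so consistency should be read alongside accuracy rather than on its own.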

## Detailed Analysis

In [None]:
# Confusion matrices for each approach
def create_confusion_matrix(results_dict, approach_name):
    results = results_dict['results']
    valid_results = [r for r in results if r['predicted'] is not None]
    
    if len(valid_results) == 0:
        print(f"\n{approach_name}: No valid results")
        return
    
    confusion = np.zeros((5, 5), dtype=int)
    for r in valid_results:
        actual_idx = int(r['actual']) - 1
        predicted_idx = int(r['predicted']) - 1
        confusion[actual_idx, predicted_idx] += 1
    
    print(f"\n{approach_name} - Confusion Matrix:")
    print("\nActual \\ Predicted", end="")
    for i in range(1, 6):
        print(f"\t{i}", end="")
    print()
    
    for i in range(5):
        print(f"{i+1}", end="")
        for j in range(5):
            print(f"\t{confusion[i, j]}", end="")
        print()

create_confusion_matrix(results_approach_1, "Approach 1")
create_confusion_matrix(results_approach_2, "Approach 2")
create_confusion_matrix(results_approach_3, "Approach 3")
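Star ratings are ordinal, so a prediction that is off by one star is a smaller error than one that is off by four, which exact-match accuracy does not capture. The sketch below derives two additional summaries from a 5x5 confusion matrix. The function name `summarize_confusion` and the example counts are made up for illustration and are not results from this notebook.

```python
def summarize_confusion(confusion):
    """Summarize a 5x5 matrix where confusion[i][j] counts reviews
    with actual rating i+1 and predicted rating j+1."""
    total = sum(sum(row) for row in confusion)
    if total == 0:
        return None

    cells = [(i, j) for i in range(5) for j in range(5)]
    exact = sum(confusion[i][i] for i in range(5))
    within_one = sum(confusion[i][j] for i, j in cells if abs(i - j) <= 1)
    abs_error = sum(abs(i - j) * confusion[i][j] for i, j in cells)

    return {
        "accuracy": exact / total,
        "within_one_star": within_one / total,
        "mean_absolute_error": abs_error / total,
    }


# Made-up counts, 10 reviews per actual rating
example = [
    [8, 2, 0, 0, 0],
    [1, 5, 3, 1, 0],
    [0, 2, 6, 2, 0],
    [0, 0, 3, 5, 2],
    [0, 0, 0, 2, 8],
]
summary = summarize_confusion(example)
print(summary)
# {'accuracy': 0.64, 'within_one_star': 0.98, 'mean_absolute_error': 0.38}
```

The function indexes with `confusion[i][j]`, so it should accept either a nested list or the NumPy array built in `create_confusion_matrix`, provided that function is changed to return its matrix.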

## Discussion and Trade-offs

### Approach 1: Direct Classification
- **Pros**: Simple, fast, low token usage
- **Cons**: May miss nuanced sentiment, less reasoning transparency

### Approach 2: Chain-of-Thought
- **Pros**: More transparent reasoning, may help on reviews with mixed sentiment
- **Cons**: Higher token usage, slower processing

### Approach 3: Few-Shot Learning
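- **Pros**: In-context examples demonstrate the expected output format and rating scale
- **Cons**: Sensitive to example selection (the five examples here include no 2-star review and overlap with the fallback sample data), higher token usage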
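The token-usage trade-offs listed above can be estimated before making any API calls by measuring how much text each prompt template adds around the review. The sketch below uses whitespace-separated word counts, which are only a rough proxy for model tokens. The helper `prompt_overhead` and the two shortened stand-in templates are assumptions for illustration.

```python
def prompt_overhead(prompt_func, probe="x"):
    """Words the template adds on top of the review text (rough proxy)."""
    return len(prompt_func(probe).split()) - len(probe.split())


# Shortened stand-ins for the notebook's prompt functions
def direct_prompt(review):
    return (
        "Classify the following Yelp review into a star rating from 1 to 5.\n\n"
        f"Review: {review}\n\nJSON Response:"
    )


def few_shot_prompt(review):
    return (
        "You are classifying Yelp reviews into star ratings (1-5). "
        "Here are some examples:\n\n"
        'Review: "Terrible experience. Food was cold."\n'
        'Response: {"predicted_stars": 1, "explanation": "Serious complaints"}\n\n'
        f"Now classify this review:\nReview: {review}\n\nJSON Response:"
    )


overheads = {
    "direct": prompt_overhead(direct_prompt),
    "few_shot": prompt_overhead(few_shot_prompt),
}
```

In the notebook itself, `prompt_approach_1`, `prompt_approach_2`, and `prompt_approach_3` could be passed to `prompt_overhead` directly. Since the overhead is paid on every review, the difference scales with the size of the evaluation sample.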

### Key Findings:
1. **Accuracy**: [To be filled after evaluation]
2. **JSON Validity**: [To be filled after evaluation]
3. **Consistency**: [To be filled after evaluation]

### Recommendations:
Based on the evaluation results, [recommendation will be added after running the evaluation]