# 🏆 TechJam 2025: Review Quality Assessment System

## 🎯 Challenge: ML for Trustworthy Location Reviews

This notebook will guide you through building a system to detect policy violations in Google location reviews:
- 🚫 **Advertisements**: Reviews containing promotional content
- 🚫 **Irrelevant Content**: Reviews not related to the location
- 🚫 **Fake Rants**: Complaints from users who never visited the location

**Today's Goal (Day 1)**: Set up the environment, explore the data, and build a basic understanding of the problem

---

## 📚 Step 1: Import Required Libraries

Let's start by importing all the libraries we'll need for data processing, ML models, and visualization.

In [None]:
# Install required packages: uncomment and run the lines below for anything that is missing

# !pip install pandas numpy matplotlib seaborn
# !pip install transformers torch
# !pip install huggingface_hub
# !pip install scikit-learn
# !pip install streamlit --quiet

In [None]:
# Core data processing libraries
import pandas as pd
import numpy as np
import re
import json
from typing import List, Dict, Tuple

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# ML and NLP libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix, classification_report

# Hugging Face transformers
try:
    from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
    import torch
    print("✅ Transformers library loaded successfully")
except ImportError:
    print("❌ Transformers not installed. Please run: pip install transformers torch")

# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("📚 Core libraries imported (check the Transformers status message above)")

## 📊 Step 2: Data Loading and Initial Exploration

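First, let's load some review data and understand its structure. Until the Google Reviews dataset is available, the cell below builds a 12-review sample that covers normal reviews and each violation type.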

In [None]:
def load_sample_data():
    """
    Create sample data for testing if you don't have the dataset yet.
    Replace this with actual data loading when you get the dataset.
    """
    
    # Sample reviews with different violation types
    sample_reviews = [
        # Normal reviews
        {"review_text": "Great food and excellent service. The pasta was delicious and the staff was very friendly. Highly recommend!", "rating": 5, "business_name": "Mario's Restaurant"},
        {"review_text": "Average experience. Food was okay but service was slow. Not bad but not great either.", "rating": 3, "business_name": "Central Cafe"},
        {"review_text": "Terrible experience. Food was cold and the waiter was rude. Will not return.", "rating": 1, "business_name": "Downtown Diner"},
        
        # Advertisement examples
        {"review_text": "Amazing pizza! Visit our website www.pizzadeals.com for 50% off coupons and special offers!", "rating": 5, "business_name": "Tony's Pizza"},
        {"review_text": "Great burgers! Call us at 555-BURGER for catering services and party packages!", "rating": 5, "business_name": "Burger Palace"},
        {"review_text": "Delicious food! Check out our new location on Main Street. Grand opening specials available!", "rating": 5, "business_name": "Fresh Bites"},
        
        # Irrelevant content examples
        {"review_text": "I love my new smartphone camera! Anyway, this restaurant has okay food I guess.", "rating": 3, "business_name": "City Grill"},
        {"review_text": "Traffic was terrible today because of construction. Politics are crazy these days. Oh, the coffee was fine.", "rating": 3, "business_name": "Corner Coffee"},
        {"review_text": "My car broke down on the way here, what a terrible day. The weather is also awful. Food was decent though.", "rating": 2, "business_name": "Highway Diner"},
        
        # Fake rant examples
        {"review_text": "Never been here but I heard from my neighbor that it's absolutely terrible. Probably overpriced too.", "rating": 1, "business_name": "Elite Restaurant"},
        {"review_text": "I hate these fancy restaurants, they're all scams. Never visited but I'm sure it's pretentious.", "rating": 1, "business_name": "Fine Dining Co"},
        {"review_text": "Looks dirty from the outside, probably awful inside too. Won't waste my time going there.", "rating": 1, "business_name": "Street Food Truck"}
    ]
    
    return pd.DataFrame(sample_reviews)

# Load data
# TODO: Replace this with actual dataset loading
# df = pd.read_csv('path_to_google_reviews_dataset.csv')

# For now, use sample data
df = load_sample_data()

print(f"📊 Dataset loaded with {len(df)} reviews")
print(f"📋 Columns: {df.columns.tolist()}")
print("\n📝 First 3 reviews:")
df.head(3)

In [None]:
# Basic data exploration
def explore_data(df):
    """
    Perform basic exploration of the review dataset
    """
    print("🔍 BASIC DATA EXPLORATION")
    print("=" * 40)
    
    # Dataset info
    print(f"Dataset shape: {df.shape}")
    print(f"Missing values: {df.isnull().sum().sum()}")
    
    # Text statistics
    df['review_length'] = df['review_text'].str.len()
    df['word_count'] = df['review_text'].str.split().str.len()
    
    print(f"\n📏 Review Length Statistics:")
    print(f"  Average length: {df['review_length'].mean():.1f} characters")
    print(f"  Average words: {df['word_count'].mean():.1f} words")
    print(f"  Shortest review: {df['review_length'].min()} characters")
    print(f"  Longest review: {df['review_length'].max()} characters")
    
    # Rating distribution
    print(f"\n⭐ Rating Distribution:")
    print(df['rating'].value_counts().sort_index())
    
    return df

df = explore_data(df)

In [None]:
# Visualize data distribution
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Review length distribution
axes[0, 0].hist(df['review_length'], bins=20, alpha=0.7, color='skyblue')
axes[0, 0].set_title('Distribution of Review Length (Characters)')
axes[0, 0].set_xlabel('Characters')
axes[0, 0].set_ylabel('Frequency')

# Word count distribution
axes[0, 1].hist(df['word_count'], bins=20, alpha=0.7, color='lightgreen')
axes[0, 1].set_title('Distribution of Word Count')
axes[0, 1].set_xlabel('Words')
axes[0, 1].set_ylabel('Frequency')

# Rating distribution
rating_counts = df['rating'].value_counts().sort_index()
axes[1, 0].bar(rating_counts.index, rating_counts.values, alpha=0.7, color='orange')
axes[1, 0].set_title('Distribution of Ratings')
axes[1, 0].set_xlabel('Rating')
axes[1, 0].set_ylabel('Count')

# Review length vs rating scatter
axes[1, 1].scatter(df['rating'], df['review_length'], alpha=0.6, color='purple')
axes[1, 1].set_title('Review Length vs Rating')
axes[1, 1].set_xlabel('Rating')
axes[1, 1].set_ylabel('Review Length (Characters)')

plt.tight_layout()
plt.show()

print("📈 Data visualization complete!")

## 🔧 Step 3: Feature Engineering

Let's extract useful features that can help identify policy violations.

In [None]:
def extract_features(df):
    """
    Extract features that might indicate policy violations
    """
    print("🔧 EXTRACTING FEATURES FOR VIOLATION DETECTION")
    print("=" * 50)
    
    # Basic text features
    df['review_length'] = df['review_text'].str.len()
    df['word_count'] = df['review_text'].str.split().str.len()
    df['exclamation_count'] = df['review_text'].str.count('!')
    df['question_count'] = df['review_text'].str.count(r'\?')  # '?' is a regex metacharacter, so it must be escaped
    
    # Capitalization features (potential indicators of spam/rants)
    df['caps_ratio'] = df['review_text'].apply(lambda x: sum(1 for c in x if c.isupper()) / len(x) if len(x) > 0 else 0)
    df['excessive_caps'] = df['caps_ratio'] > 0.3
    
    # Contact information detection (advertisements)
    df['has_url'] = df['review_text'].str.contains(r'www\.|http|\.com', case=False, na=False)
    df['has_phone'] = df['review_text'].str.contains(r'\d{3}[-.]?\d{3}[-.]?\d{4}', na=False)
    df['has_email'] = df['review_text'].str.contains(r'@[\w\.-]+\.\w+', na=False)
    
    # Promotional language (advertisements)
    promotional_keywords = ['discount', 'deal', 'promo', 'sale', 'offer', 'coupon', 'special', 'visit our', 'check out']
    df['promotional_words'] = df['review_text'].apply(
        lambda x: sum(1 for word in promotional_keywords if word.lower() in x.lower())
    )
    
    # Irrelevant content indicators
    irrelevant_keywords = ['politics', 'weather', 'my phone', 'my car', 'personal life', 'coronavirus']
    df['irrelevant_words'] = df['review_text'].apply(
        lambda x: sum(1 for word in irrelevant_keywords if word.lower() in x.lower())
    )
    
    # Fake rant indicators
    fake_rant_keywords = ['never been', 'never visited', 'heard it', 'probably', 'i bet', 'sounds like']
    df['fake_rant_words'] = df['review_text'].apply(
        lambda x: sum(1 for phrase in fake_rant_keywords if phrase.lower() in x.lower())
    )
    
    # Business context keywords (legitimate reviews should have these)
    business_keywords = ['service', 'staff', 'food', 'place', 'experience', 'visit', 'restaurant', 'store']
    df['business_words'] = df['review_text'].apply(
        lambda x: sum(1 for word in business_keywords if word.lower() in x.lower())
    )
    
    print(f"✅ Features extracted for {len(df)} reviews")
    print(f"   - Average review length: {df['review_length'].mean():.1f} characters")
    print(f"   - Average word count: {df['word_count'].mean():.1f} words")
    print(f"   - Reviews with URLs: {df['has_url'].sum()}")
    print(f"   - Reviews with promotional words: {(df['promotional_words'] > 0).sum()}")
    print(f"   - Reviews with irrelevant content: {(df['irrelevant_words'] > 0).sum()}")
    print(f"   - Reviews with fake rant indicators: {(df['fake_rant_words'] > 0).sum()}")
    
    return df

# Run the extraction so the feature columns exist for the later steps
df = extract_features(df)
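
The contact-information patterns can also be tried on plain strings, outside pandas. The sketch below reuses the URL and phone patterns from the cell above and adds `VANITY_PHONE_RE`, a hypothetical pattern (an assumption, not part of the original features) for numbers written like `555-BURGER`, which the digit-only phone pattern does not match.

```python
import re

# Patterns mirroring the feature cell above
URL_RE = re.compile(r"www\.|http|\.com", re.IGNORECASE)
PHONE_RE = re.compile(r"\d{3}[-.]?\d{3}[-.]?\d{4}")
# Assumption: vanity numbers look like three digits, a hyphen, then 4-7 capital letters
VANITY_PHONE_RE = re.compile(r"\b\d{3}-[A-Z]{4,7}\b")

def contact_flags(text):
    """Return which contact-information patterns appear in a single review."""
    return {
        "has_url": bool(URL_RE.search(text)),
        "has_phone": bool(PHONE_RE.search(text) or VANITY_PHONE_RE.search(text)),
    }

print(contact_flags("Great burgers! Call us at 555-BURGER for catering services and party packages!"))
print(contact_flags("Average experience. Food was okay but service was slow."))
```

Whether vanity numbers are worth a dedicated pattern depends on how often they show up in the real dataset, so treat this as something to check during exploration.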

## 🏷️ Step 4: Manual Labeling (Create Ground Truth)

Let's manually label our sample data to create ground truth for evaluation.

In [None]:
def create_manual_labels(df):
    """
    Create manual labels for the sample data
    In a real scenario, you would label a larger subset manually
    """
    print("🏷️ CREATING MANUAL LABELS FOR GROUND TRUTH")
    print("=" * 45)
    
    # Manual labels based on our sample data, listed in row order
    # 0 = No violation, 1 = Violation
    
    # Advertisement labels (rows 3, 4, 5 of the sample, zero-based)
    ad_labels = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    
    # Irrelevant content labels (rows 6, 7, 8 of the sample, zero-based)
    irrelevant_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
    
    # Fake rant labels (rows 9, 10, 11 of the sample, zero-based)
    fake_rant_labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
    
    df['is_advertisement'] = ad_labels
    df['is_irrelevant'] = irrelevant_labels
    df['is_fake_rant'] = fake_rant_labels
    
    # Summary
    print(f"📊 Label Summary:")
    print(f"  Advertisements: {df['is_advertisement'].sum()}/{len(df)} ({df['is_advertisement'].sum()/len(df)*100:.1f}%)")
    print(f"  Irrelevant: {df['is_irrelevant'].sum()}/{len(df)} ({df['is_irrelevant'].sum()/len(df)*100:.1f}%)")
    print(f"  Fake Rants: {df['is_fake_rant'].sum()}/{len(df)} ({df['is_fake_rant'].sum()/len(df)*100:.1f}%)")
    
    return df

df = create_manual_labels(df)
print("\n✅ Manual labels created successfully!")
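
The positional lists above only stay correct while the sample rows keep their order. One alternative, sketched here under the assumption that each review carries a hypothetical `violation` tag (this field is not in the original sample data), is to derive the three binary columns from that single tag.

```python
# Hypothetical: each review dict carries a 'violation' tag with one of
# 'none', 'advertisement', 'irrelevant', or 'fake_rant'.
VIOLATION_TYPES = ("advertisement", "irrelevant", "fake_rant")

def one_hot_labels(rows):
    """Return {'is_<type>': [0 or 1, ...]} with one entry per row, in row order."""
    labels = {f"is_{vtype}": [] for vtype in VIOLATION_TYPES}
    for row in rows:
        tag = row.get("violation", "none")
        if tag != "none" and tag not in VIOLATION_TYPES:
            raise ValueError(f"Unknown violation tag: {tag!r}")
        for vtype in VIOLATION_TYPES:
            labels[f"is_{vtype}"].append(int(tag == vtype))
    return labels

tagged_rows = [
    {"review_text": "Great food and excellent service.", "violation": "none"},
    {"review_text": "Visit our website for 50% off coupons!", "violation": "advertisement"},
    {"review_text": "Politics are crazy these days. Coffee was fine.", "violation": "irrelevant"},
    {"review_text": "Never been here but I heard it's terrible.", "violation": "fake_rant"},
]

label_columns = one_hot_labels(tagged_rows)
print(label_columns["is_advertisement"])
```

Each list can then be assigned to a dataframe column in the same way as `ad_labels` above. A single tag per review assumes a review has at most one violation, which may not hold on real data.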

## 🤖 Step 5: Simple Rule-Based Classifier (Baseline)

Let's start with a simple rule-based approach as our baseline before moving to ML models.

In [None]:
class SimpleRuleBasedClassifier:
    """
    Simple rule-based classifier for policy violation detection
    """
    
    def __init__(self):
        # Define keywords for each violation type
        self.ad_keywords = [
            'visit', 'website', 'www', 'http', 'call', 'phone', 'discount',
            'deal', 'promo', 'sale', 'coupon', 'special offer'
        ]
        
        self.irrelevant_keywords = [
            'my phone', 'my car', 'politics', 'weather', 'traffic',
            'my day', 'my life', 'news', 'government'
        ]
        
        self.fake_rant_keywords = [
            'never been', 'heard it', 'looks like', 'probably',
            'i hate these', 'all these places', 'never visited'
        ]
    
    def classify_advertisement(self, text):
        """Check if review contains advertisement"""
        text_lower = text.lower()
        return any(keyword in text_lower for keyword in self.ad_keywords)
    
    def classify_irrelevant(self, text):
        """Check if review contains irrelevant content"""
        text_lower = text.lower()
        return any(keyword in text_lower for keyword in self.irrelevant_keywords)
    
    def classify_fake_rant(self, text):
        """Check if review is a fake rant"""
        text_lower = text.lower()
        return any(keyword in text_lower for keyword in self.fake_rant_keywords)
    
    def classify_review(self, text):
        """Classify a single review for all violation types"""
        return {
            'advertisement': self.classify_advertisement(text),
            'irrelevant': self.classify_irrelevant(text),
            'fake_rant': self.classify_fake_rant(text)
        }
    
    def classify_batch(self, texts):
        """Classify multiple reviews"""
        results = []
        for text in texts:
            results.append(self.classify_review(text))
        return results

# Test the rule-based classifier
print("🤖 TESTING RULE-BASED CLASSIFIER")
print("=" * 35)

classifier = SimpleRuleBasedClassifier()

# Get predictions
reviews = df['review_text'].tolist()
predictions = classifier.classify_batch(reviews)

# Add predictions to dataframe
df['pred_advertisement'] = [pred['advertisement'] for pred in predictions]
df['pred_irrelevant'] = [pred['irrelevant'] for pred in predictions]
df['pred_fake_rant'] = [pred['fake_rant'] for pred in predictions]

print("✅ Rule-based classification complete!")
print(f"\n📊 Prediction Summary:")
print(f"  Predicted Advertisements: {df['pred_advertisement'].sum()}")
print(f"  Predicted Irrelevant: {df['pred_irrelevant'].sum()}")
print(f"  Predicted Fake Rants: {df['pred_fake_rant'].sum()}")
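
The classifier above uses plain substring checks, so `'visit'` also fires on "never visited" and `'phone'` on "smartphone". A possible refinement, shown as a sketch and not as the notebook's method, is to match keywords on word boundaries with the standard `re` module. The helper name `compile_keyword_pattern` is made up for this example.

```python
import re

def compile_keyword_pattern(keywords):
    """Build one case-insensitive regex matching any keyword as a whole word or phrase."""
    escaped = (re.escape(k) for k in keywords)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)

# Same advertisement keywords as SimpleRuleBasedClassifier
ad_keywords = [
    "visit", "website", "www", "http", "call", "phone", "discount",
    "deal", "promo", "sale", "coupon", "special offer",
]
ad_pattern = compile_keyword_pattern(ad_keywords)

samples = [
    "I hate these fancy restaurants, they're all scams. Never visited but I'm sure it's pretentious.",
    "I love my new smartphone camera! Anyway, this restaurant has okay food I guess.",
    "Amazing pizza! Visit our website www.pizzadeals.com for 50% off coupons and special offers!",
]

for text in samples:
    substring_hit = any(k in text.lower() for k in ad_keywords)
    whole_word_hit = bool(ad_pattern.search(text))
    print(substring_hit, whole_word_hit, text[:40])
```

The trade-off is that whole-word matching skips inflected forms such as "coupons", so plural or stemmed variants would need to be added to the keyword list.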

## 📊 Step 6: Evaluation and Metrics

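Let's evaluate the rule-based classifier against the manual labels. With only 12 reviews and 3 positives per class, treat the scores as a sanity check, not as a reliable estimate.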

In [None]:
def evaluate_classifier(df, violation_type):
    """
    Evaluate classifier performance for a specific violation type
    """
    true_col = f'is_{violation_type}'
    pred_col = f'pred_{violation_type}'
    
    y_true = df[true_col].tolist()
    y_pred = df[pred_col].tolist()
    
    # Calculate metrics
    precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='binary', zero_division=0)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    
    return {
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'accuracy': accuracy
    }

def plot_confusion_matrix(df, violation_type):
    """
    Plot confusion matrix for a violation type
    """
    true_col = f'is_{violation_type}'
    pred_col = f'pred_{violation_type}'
    
    y_true = df[true_col].tolist()
    y_pred = df[pred_col].tolist()
    
    cm = confusion_matrix(y_true, y_pred)
    
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
               xticklabels=['No Violation', 'Violation'],
               yticklabels=['No Violation', 'Violation'])
    plt.title(f'Confusion Matrix - {violation_type.title()} Detection')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.show()

# Evaluate all violation types
print("📊 RULE-BASED CLASSIFIER EVALUATION")
print("=" * 40)

violation_types = ['advertisement', 'irrelevant', 'fake_rant']
results = {}

for vtype in violation_types:
    metrics = evaluate_classifier(df, vtype)
    results[vtype] = metrics
    
    print(f"\n🎯 {vtype.upper()} DETECTION:")
    print(f"  Precision: {metrics['precision']:.3f}")
    print(f"  Recall:    {metrics['recall']:.3f}")
    print(f"  F1-Score:  {metrics['f1_score']:.3f}")
    print(f"  Accuracy:  {metrics['accuracy']:.3f}")
    
    # Plot confusion matrix
    plot_confusion_matrix(df, vtype)

# Overall average F1-score
avg_f1 = np.mean([results[vtype]['f1_score'] for vtype in violation_types])
print(f"\n🏆 OVERALL AVERAGE F1-SCORE: {avg_f1:.3f}")
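
To make the metrics less of a black box, here is a stand-alone sketch that computes precision, recall, and F1 for the positive class from raw counts, returning 0 when a denominator is empty (the same convention as `zero_division=0` above). The labels and predictions are made up for illustration and are not results from this notebook.

```python
def binary_metrics(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive class (1) from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1_score": f1}

# Illustrative values only
y_true = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0]

print(binary_metrics(y_true, y_pred))
```

Here two of the three positives are found and one negative is flagged by mistake, so precision, recall, and F1 all come out at 2/3.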

## 🔮 Step 7: Prepare for ML Models (Next Session)

Let's set up the foundation for using Hugging Face models in our next session.

In [None]:
def setup_huggingface_models():
    """
    Test Hugging Face model loading for next session
    """
    print("🔮 PREPARING FOR ML MODELS (DAY 2)")
    print("=" * 35)
    
    try:
        # Test loading a simple classification model
        print("Testing model loading...")
        classifier = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=-1  # Use CPU
        )
        
        # Test on a sample review
        test_review = "Great food and excellent service!"
        result = classifier(test_review)
        
        print(f"✅ Model loading successful!")
        print(f"Test result: {result}")
        
        return True
        
    except Exception as e:
        print(f"❌ Model loading failed: {e}")
        print("This is okay - we'll set this up properly in Day 2")
        return False

model_ready = setup_huggingface_models()
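
For a single input string, the `text-classification` pipeline returns a list containing one dict with `label` and `score` keys. The sketch below shows one way to read that structure; `top_label` and its `min_score` threshold are hypothetical names for this example, and the score is a placeholder, not a measured value.

```python
def top_label(result, min_score=0.5):
    """Return the label of the first prediction, or None if its score is below min_score."""
    if not result:
        return None
    best = result[0]
    return best["label"] if best["score"] >= min_score else None

# Shape of the pipeline output for one review; the score is a placeholder
example_result = [{"label": "POSITIVE", "score": 0.99}]

print(top_label(example_result))
print(top_label(example_result, min_score=0.995))
```

Keep in mind that this model predicts sentiment, which is not a policy-violation signal by itself; the cell above only checks that model loading works.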

## 💾 Step 8: Save Progress and Plan Next Steps

Let's save our work and prepare for tomorrow.

In [None]:
# Save processed data for next session
df.to_csv('processed_reviews_day1.csv', index=False)
print("💾 Data saved to 'processed_reviews_day1.csv'")

# Save results summary
summary = {
    'day': 1,
    'date': '2025-08-25',
    'dataset_size': len(df),
    'features_extracted': len([col for col in df.columns if col.startswith(('has_', 'is_', 'pred_'))]),
    'baseline_results': results,
    'avg_f1_score': avg_f1,
    'next_steps': [
        'Implement Hugging Face models',
        'Improve prompt engineering',
        'Test on larger dataset',
        'Create ensemble approach'
    ]
}

with open('day1_summary.json', 'w') as f:
    json.dump(summary, f, indent=2)

print("📄 Summary saved to 'day1_summary.json'")

print("\n🎉 DAY 1 COMPLETE!")
print("=" * 20)
print("✅ Environment set up")
print("✅ Data loaded and explored")
print("✅ Features extracted")
print("✅ Baseline classifier implemented")
print("✅ Evaluation metrics calculated")
print(f"✅ Baseline F1-score: {avg_f1:.3f}")

print("\n📅 TOMORROW'S PLAN (Day 2):")
print("🎯 Implement Hugging Face models")
print("🎯 Create smart prompts for LLMs")
print("🎯 Test and improve accuracy")
print("🎯 Prepare for demo creation")

print("\n🏆 Great job! You're on track to win this hackathon! 🏆")
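
When the next session starts, the saved summary can be read back before continuing. This is a minimal sketch: `load_summary` is a hypothetical helper, and the demo writes placeholder values to a temporary file so that it runs without the notebook's outputs.

```python
import json
import os
import tempfile

def load_summary(path):
    """Read a summary JSON file and check that the expected keys are present."""
    with open(path, "r", encoding="utf-8") as f:
        summary = json.load(f)
    missing = {"day", "dataset_size", "avg_f1_score"} - summary.keys()
    if missing:
        raise KeyError(f"Summary is missing keys: {sorted(missing)}")
    return summary

# Stand-alone demo with placeholder values (the F1 value is not a real result)
demo = {"day": 1, "dataset_size": 12, "avg_f1_score": 0.0}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "day1_summary.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(demo, f, indent=2)
    loaded = load_summary(demo_path)

print(loaded["dataset_size"])
```

In the notebook itself, `load_summary('day1_summary.json')` would read the file written by the cell above.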