# Review Classification Pipeline - Complete Google Colab Implementation

This notebook implements the complete review classification pipeline for detecting Google review policy violations, fully configured for Google Colab.

## Pipeline Overview

### Phase 1: Environment Setup and Data Structure
- Install all required packages (transformers, torch, google-generativeai, etc.)
- Create proper directory structure (data/clean, data/pseudo-label, etc.)
- Load sample data for demonstration

### Phase 2: Core Pipeline Components
- **Ollama Pipeline**: Local LLM classification (for reference, not runnable in Colab)
- **HuggingFace Pipeline**: Zero-shot classification using pre-trained models
- **Gemini Pseudo-Labeling**: High-quality label generation for training data
- **Ensemble Method**: Combines multiple approaches for best results

### Phase 3: Future Spam Detection Integration
- Pipeline output will be piped into a spam detection model
- Structured JSON output format for downstream processing
- Confidence scoring for reliable filtering

### Phase 4: Evaluation and Analysis
- Comprehensive performance metrics
- Policy category accuracy assessment
- Model comparison and improvement recommendations

**Key Features:**
- **Policy Categories**: No_Ads, Irrelevant, Rant_No_Visit detection
- **Zero Setup**: Everything configured for Google Colab
- **Extensible**: Ready for spam detection integration
- **Production Ready**: Structured output and comprehensive evaluation

## 1. Environment Setup

In [7]:
# Install required packages for the complete pipeline
!pip install -q transformers==4.43.3 torch pandas scikit-learn
!pip install -q google-generativeai tqdm datasets accelerate
!pip install -q ipywidgets matplotlib seaborn wordcloud

print("✅ Core packages installed successfully!")

# Check GPU availability and setup device
import torch
import os

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

if torch.cuda.is_available():
    print(f"GPU Available: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("Using CPU - models will run slower but still functional")

# Set environment for optimal performance
os.environ['TOKENIZERS_PARALLELISM'] = 'false'  # Avoid warnings
print("Environment configured for optimal performance")

✅ Core packages installed successfully!
Using device: cpu
Using CPU - models will run slower but still functional
Environment configured for optimal performance


## 2. Project Structure Setup

In [None]:
# Create complete directory structure matching the actual pipeline
import os
import pandas as pd
import json
from pathlib import Path

# Create all necessary directories (matching actual pipeline structure)
directories = [
    # Source code structure
    'src/config', 'src/core', 'src/pseudo_labelling', 'src/pipeline', 'src/integration',
    
    # Data directories (matching actual structure)
    'data/raw',           # For raw input data
    'data/clean',         # For cleaned/processed data (renamed from processed)
    'data/pseudo-label',  # For pseudo-labeled data from Gemini
    'data/training',      # For training data split
    'data/testing',       # For testing data split
    'data/sample',        # For sample data
    
    # Results directories
    'results/predictions',   # All predictions
    'results/evaluations',   # For evaluation results
    'results/reports',       # For generated reports
    
    # Other directories
    'models/saved_models',   # For trained models
    'models/cache',          # For model cache
    'logs/pipeline_logs',    # For pipeline logs
    'prompts',               # Prompt engineering
    'docs'                   # Documentation
]

for directory in directories:
    os.makedirs(directory, exist_ok=True)
    # Create __init__.py files for Python packages
    if directory.startswith('src/'):
        with open(f'{directory}/__init__.py', 'w') as f:
            f.write('# Review Classification Pipeline Package\n')

print("✅ Complete directory structure created!")
print(f"Created {len(directories)} directories")

# Verify critical directories exist
critical_dirs = ['data/clean', 'data/pseudo-label', 'data/sample', 'results/predictions']
for dir_name in critical_dirs:
    if os.path.exists(dir_name):
        print(f"✅ {dir_name}")
    else:
        print(f"❌ {dir_name} - MISSING!")

print("\nDirectory structure matches production pipeline!")

Directory structure created!
Created directories: ['src/config', 'src/core', 'src/pseudo_labelling', 'src/utils', 'data/sample', 'data/raw', 'data/processed', 'results/predictions', 'results/evaluations', 'results/reports', 'logs/pipeline_logs', 'models/cache', 'prompts']


## 3. Sample Data Creation

In [None]:
# Load actual sample data from the production pipeline
sample_data = {
    'id': [1, 2, 3, 4, 5],
    'text': [
        "Use my promo code EAT10 for 10% off! DM me on WhatsApp.",
        "Great laksa; broth was rich and staff friendly. Will return.",
        "Crypto is the future. Buy BTC now! Nothing to do with this cafe.",
        "Overpriced scammers. Society is doomed.",
        "Visited on 18 Aug, ordered set A; cashier fixed a double-charge."
    ],
    'gold_label': ['REJECT', 'APPROVE', 'REJECT', 'REJECT', 'APPROVE'],
    'gold_category': ['No_Ads', 'None', 'Irrelevant', 'Rant_No_Visit', 'None']
}

df = pd.DataFrame(sample_data)
df.to_csv('data/sample/sample_reviews.csv', index=False)

print("✅ Production sample data loaded!")
print("\nSample Data Overview:")
print(df.to_string(index=False))

print(f"\nLabel Distribution:")
print(f"APPROVE: {len(df[df['gold_label'] == 'APPROVE'])} reviews")
print(f"REJECT:  {len(df[df['gold_label'] == 'REJECT'])} reviews")

print(f"\nCategory Distribution:")
for category in df['gold_category'].value_counts().index:
    count = df['gold_category'].value_counts()[category]
    print(f"{category}: {count} reviews")

print(f"\nThis data demonstrates all policy violation types:")
print("• No_Ads: Promotional codes and contact solicitation")
print("• Irrelevant: Off-topic content unrelated to business") 
print("• Rant_No_Visit: Generic negative comments without visit evidence")
print("• None: Legitimate reviews that should be approved")

Sample data created for pseudo-labeling demonstration!
   id                                        text gold_label  gold_category
0   1            Great product, highly recommend!    APPROVE           None
1   2            Terrible service, waste of money     REJECT     Irrelevant
2   3  Check out this amazing deal at example.com     REJECT         No_Ads
3   4       The staff was rude and unprofessional     REJECT     Irrelevant
4   5     Overpriced scammers. Society is doomed.     REJECT  Rant_No_Visit

This data will be used as:
- Gold standard for evaluation
- Seed data for training
- Reference for pseudo-labeling quality assessment


## 4. Configuration Setup

In [None]:
# Create configuration classes matching the actual pipeline
config_code = '''
"""
Pipeline Configuration Classes - Matching Production Structure
"""

from dataclasses import dataclass, field
from typing import Dict, List, Optional
import os

@dataclass
class ModelConfig:
    """Configuration for model settings"""
    # HuggingFace models (matching actual pipeline)
    hf_sentiment_model: str = "distilbert-base-uncased-finetuned-sst-2-english"
    hf_toxicity_model: str = "unitary/toxic-bert"
    hf_zero_shot_model: str = "facebook/bart-large-mnli"
    
    # Gemini configuration
    gemini_model: str = "gemini-2.5-flash-lite"
    
    # Confidence thresholds (matching actual pipeline)
    sentiment_threshold: float = 0.7
    toxicity_threshold: float = 0.5
    zero_shot_threshold: float = 0.7
    ensemble_tau: float = 0.55

@dataclass
class DataConfig:
    """Configuration for data paths and settings"""
    data_dir: str = "data"
    raw_data_dir: str = "data/raw"
    processed_data_dir: str = "data/clean"  # Matches actual structure
    sample_data_dir: str = "data/sample"
    pseudo_label_dir: str = "data/pseudo-label"  # Matches actual structure
    training_dir: str = "data/training"
    testing_dir: str = "data/testing"
    
    # Default input file
    sample_reviews_file: str = "data/sample/sample_reviews.csv"

@dataclass
class OutputConfig:
    """Configuration for output paths"""
    results_dir: str = "results"
    predictions_dir: str = "results/predictions"
    evaluations_dir: str = "results/evaluations"
    reports_dir: str = "results/reports"
    
    # Default output files (matching actual pipeline)
    hf_predictions: str = "results/predictions/predictions_hf.csv"
    ensemble_predictions: str = "results/predictions/predictions_ens.csv"

@dataclass
class PipelineConfig:
    """Main pipeline configuration combining all components"""
    model: ModelConfig = field(default_factory=ModelConfig)
    data: DataConfig = field(default_factory=DataConfig)
    output: OutputConfig = field(default_factory=OutputConfig)
    
    # Gemini configuration
    gemini_api_key: str = ""
    
    # Pipeline settings
    batch_size: int = 32
    max_workers: int = 4
    cache_predictions: bool = True
    verbose_logging: bool = True
    
    def __post_init__(self):
        """Create directories if they don't exist"""
        directories = [
            self.data.raw_data_dir,
            self.data.processed_data_dir,
            self.data.sample_data_dir,
            self.data.pseudo_label_dir,
            self.data.training_dir,
            self.data.testing_dir,
            self.output.predictions_dir,
            self.output.evaluations_dir,
            self.output.reports_dir
        ]
        
        for directory in directories:
            os.makedirs(directory, exist_ok=True)

# Global configuration instance
config = PipelineConfig()
'''

with open('src/config/pipeline_config.py', 'w') as f:
    f.write(config_code)

print("✅ Configuration created matching production pipeline!")

# Test configuration
exec(config_code)
test_config = PipelineConfig()
print(f"Data directory: {test_config.data.sample_data_dir}")
print(f"HF Zero-shot model: {test_config.model.hf_zero_shot_model}")
print(f"Ensemble tau: {test_config.model.ensemble_tau}")
print(f"Predictions output: {test_config.output.hf_predictions}")

Configuration created with pseudo-labeling support!


## 5. Constants and Prompts

In [None]:
# Create constants and prompts matching the actual pipeline
constants_code = '''
"""
Core Constants - Matching Production Pipeline
"""

# Policy Categories (matching actual pipeline)
POLICY_CATEGORIES = {
    'NO_ADS': 'No_Ads',
    'IRRELEVANT': 'Irrelevant', 
    'RANT_NO_VISIT': 'Rant_No_Visit',
    'NONE': 'None'
}

# Label Types (matching actual pipeline)
LABELS = {
    'APPROVE': 'APPROVE',
    'REJECT': 'REJECT'
}

# Default Models (matching actual pipeline)
DEFAULT_MODELS = {
    'SENTIMENT': "distilbert-base-uncased-finetuned-sst-2-english",
    'TOXIC': "unitary/toxic-bert", 
    'ZERO_SHOT': "facebook/bart-large-mnli",
    'GEMINI_DEFAULT': "gemini-2.5-flash-lite"
}

# Zero-shot Classification Labels (matching actual pipeline)
ZERO_SHOT_LABELS = [
    "an advertisement or promotional solicitation for this business (promo code, referral, links, contact to buy)",
    "off-topic or unrelated to this business (e.g., politics, crypto, chain messages, personal stories not about this place)",
    "a generic negative rant about this business without evidence of a visit (short insults, 'scam', 'overpriced', 'worst ever')",
    "a relevant on-topic description of a visit or experience at this business"
]

# Mapping zero-shot labels to policy categories
ZERO_SHOT_TO_POLICY = {
    ZERO_SHOT_LABELS[0]: POLICY_CATEGORIES['NO_ADS'],
    ZERO_SHOT_LABELS[1]: POLICY_CATEGORIES['IRRELEVANT'],
    ZERO_SHOT_LABELS[2]: POLICY_CATEGORIES['RANT_NO_VISIT'],
    ZERO_SHOT_LABELS[3]: POLICY_CATEGORIES['NONE']
}

# Confidence Thresholds
CONFIDENCE_THRESHOLDS = {
    'HIGH': 0.8,
    'MEDIUM': 0.6,
    'LOW': 0.4,
    'DEFAULT': 0.55
}
'''

with open('src/core/constants.py', 'w') as f:
    f.write(constants_code)

# Create prompt templates (matching actual pipeline)
prompts_code = '''
"""
Policy Prompts - Matching Production Pipeline
"""

# JSON schema all prompts must return
TEMPLATE_JSON = """Return ONLY JSON with no extra text:
{"label":"<APPROVE|REJECT>","category":"<No_Ads|Irrelevant|Rant_No_Visit|None>",
 "rationale":"<short>","confidence":<0.0-1.0>,
 "flags":{"links":false,"coupon":false,"visit_claimed":false}}
"""

# ===== 1) NO ADS / PROMOTIONAL =====
NO_ADS_SYSTEM = """You are a content policy checker for location reviews.
If this specific policy does NOT clearly apply, return APPROVE with category "None" and confidence 0.0. Do not reject for other policies.
Reject ONLY if the review contains clear advertising or promotional solicitation:
- referral/promo/coupon codes, price lists, booking/ordering links, contact-for-order (DM me / WhatsApp / Telegram / email / call), affiliate pitches.
Do NOT mark generic off-topic content (e.g., crypto/politics) as Ads unless it includes explicit solicitation to buy or contact.
Approve normal experiences even if positive or mentioning 'cheap' or 'good deal'.
Output the required JSON only.
"""

# ===== 2) IRRELEVANT CONTENT =====
IRRELEVANT_SYSTEM = """You are checking ONLY for the 'Irrelevant' policy.

Decision rule (mutually exclusive):
- If this specific policy does NOT clearly apply, return APPROVE with category "None" and confidence 0.0.
- Do not reject for other policies (e.g., Ads or Rant_No_Visit).

Reject as Irrelevant when the text is off-topic and unrelated to THIS venue/service:
- unrelated politics/news/crypto hype/chain messages/personal stories
- generic advice not tied to this place (e.g., 'buy BTC now', 'vote X'), etc.
- content about another business or location without discussing this one

Return ONLY JSON with fields: label, category, rationale, confidence (0.0–1.0), flags.
"""

# ===== 3) RANTS WITHOUT VISIT =====
RANT_NO_VISIT_SYSTEM = """Reject generic rants or accusations clearly targeting THIS place but with no evidence of a visit.
These rants are often:
- Short and emotional (e.g., 'Terrible place', 'Worst ever', 'Overpriced scammers')
- Broad accusations ('scam', 'rip-off', 'fraud')
- Negative judgments about pricing, quality, or character of the venue
Reject them even if the reviewer does not explicitly say 'this place/restaurant' — assume negativity is directed at the business being reviewed.
Approve only if the reviewer provides concrete evidence of a visit (date, food ordered, staff interaction).
Output JSON only.
"""

def build_prompt(system_text: str, review_text: str, fewshots):
    demo = "\\n\\n".join(
        [f"Review:\\n{r}\\nExpected JSON:\\n{j}" for r,j in fewshots]
    )
    return f"""{system_text}

{TEMPLATE_JSON}

{demo}

Now classify this review. Return ONLY JSON.

Review:
{review_text}
"""
'''

with open('prompts/policy_prompts.py', 'w') as f:
    f.write(prompts_code)

print("✅ Constants and prompts created matching production pipeline!")

# Test constants
exec(constants_code)
print(f"Policy categories: {list(POLICY_CATEGORIES.values())}")
print(f"Zero-shot model: {DEFAULT_MODELS['ZERO_SHOT']}")
print(f"Default confidence threshold: {CONFIDENCE_THRESHOLDS['DEFAULT']}")
print(f"Zero-shot labels configured: {len(ZERO_SHOT_LABELS)} categories")

Constants and prompts created with Gemini support!


## 6. Gemini API Key Setup

In [None]:
# Set up Gemini API key for Google Colab
import os

print("Setting up Gemini API key...")
print("")
print("Instructions:")
print("1. Go to: https://aistudio.google.com/app/apikey")
print("2. Click 'Create API Key'")
print("3. Copy the key")
print("")
print("Setup Options:")
print("Option A: Use Colab secrets (recommended)")
print("   1. Click secrets icon in left sidebar")
print("   2. Add secret: GEMINI_API_KEY") 
print("   3. Paste your API key as the value")
print("   4. Re-run this cell")
print("")
print("Option B: Direct input (less secure)")
print("   Enter key when prompted below")

# Option A: Try Colab secrets first (recommended)
try:
    from google.colab import userdata
    GEMINI_API_KEY = userdata.get('GEMINI_API_KEY')
    print("✅ Gemini API key loaded from Colab secrets")
    api_source = "secrets"
except Exception as e:
    print(f"Colab secrets not found: {e}")
    
    # Option B: Manual input fallback
    try:
        import getpass
        GEMINI_API_KEY = getpass.getpass("Enter your Gemini API key: ")
        api_source = "manual"
        if GEMINI_API_KEY:
            print("✅ Gemini API key entered manually")
        else:
            print("❌ No API key provided")
            GEMINI_API_KEY = None
    except Exception as e:
        print(f"❌ Failed to get API key: {e}")
        GEMINI_API_KEY = None

# Configure Gemini if key is available
if GEMINI_API_KEY:
    os.environ['GEMINI_API_KEY'] = GEMINI_API_KEY
    print(f"Gemini API key configured from {api_source}")
    
    # Test the API key with actual Gemini
    try:
        import google.generativeai as genai
        genai.configure(api_key=GEMINI_API_KEY)
        
        model = genai.GenerativeModel('gemini-2.5-flash-lite')
        response = model.generate_content("Hello, respond with just 'Working!'")
        print(f"Test response: {response.text.strip()}")
        print("Gemini is working perfectly!")
        gemini_available = True
        
    except Exception as e:
        print(f"❌ Gemini test failed: {e}")
        print("Check your API key and quota limits")
        gemini_available = False
else:
    print("No API key provided")
    print("Pipeline will run without Gemini pseudo-labeling")
    print("HuggingFace components will still work perfectly!")
    gemini_available = False

print(f"\nConfiguration Summary:")
print(f"   Gemini Available: {'✅ Yes' if gemini_available else '❌ No'}")
print(f"   HuggingFace: ✅ Ready")
print(f"   Pipeline Mode: {'Full' if gemini_available else 'HuggingFace Only'}")

ModuleNotFoundError: No module named 'google.colab'

## 7. Gemini Pseudo-Labeling Pipeline

In [None]:
# HuggingFace Pipeline Implementation (Matching Production Code)
from transformers import pipeline
import pandas as pd
from tqdm import tqdm
import json
import time

# Load constants
exec(open('src/core/constants.py').read())

def load_hf_pipelines(device=None):
    """Load the HuggingFace pipelines (matching production code)"""
    print("Loading HuggingFace pipelines...")
    
    # Set device
    if device is None:
        device = 0 if torch.cuda.is_available() else -1
    
    try:
        # Sequential pipelines (matching production)
        sentiment = pipeline("sentiment-analysis", 
                           model=DEFAULT_MODELS['SENTIMENT'], 
                           device=device)
        
        toxic = pipeline("text-classification", 
                        model=DEFAULT_MODELS['TOXIC'], 
                        top_k=None, 
                        device=device)
        
        zshot = pipeline("zero-shot-classification", 
                        model=DEFAULT_MODELS['ZERO_SHOT'], 
                        device=device)
        
        print("✅ All HuggingFace pipelines loaded successfully!")
        return sentiment, toxic, zshot
        
    except Exception as e:
        print(f"❌ Error loading pipelines: {e}")
        return None, None, None

def policy_zero_shot(zshot, text: str, tau: float = 0.5):
    """Run zero-shot classification for policy violations (matching production)"""
    # Score all labels independently
    res = zshot(
        text,
        candidate_labels=ZERO_SHOT_LABELS,
        hypothesis_template="This review is {}.",
        multi_label=True,   # Important - matches production
    )
    
    # Build scores dict
    scores = {lab: float(scr) for lab, scr in zip(res["labels"], res["scores"])}

    # Consider only the 3 rejecting policies
    reject_scores = {
        ZERO_SHOT_TO_POLICY[ZERO_SHOT_LABELS[0]]: scores[ZERO_SHOT_LABELS[0]],  # No_Ads
        ZERO_SHOT_TO_POLICY[ZERO_SHOT_LABELS[1]]: scores[ZERO_SHOT_LABELS[1]],  # Irrelevant
        ZERO_SHOT_TO_POLICY[ZERO_SHOT_LABELS[2]]: scores[ZERO_SHOT_LABELS[2]],  # Rant_No_Visit
    }

    # Pick the strongest rejecting policy
    best_cat, best_score = max(reject_scores.items(), key=lambda kv: kv[1])

    # Reject only if best rejecting score clears the threshold
    if best_score >= tau:
        return best_cat, best_score

    # Otherwise approve
    return POLICY_CATEGORIES['NONE'], scores.get(ZERO_SHOT_LABELS[3], 1.0 - best_score)

def run_hf_pipeline(df, device=None, tau=0.55):
    """Run HuggingFace pipeline classification (matching production code)"""
    print("Running HuggingFace pipeline classification...")
    
    # Standardize dataframe
    df_work = df.copy()
    df_work.columns = df_work.columns.str.strip().str.lower()
    
    # Ensure ID column
    if "id" not in df_work.columns:
        df_work["id"] = range(1, len(df_work) + 1)
    
    # Find text column
    text_col = None
    for col in ["text", "review", "content", "body"]:
        if col in df_work.columns:
            text_col = col
            break
    
    if not text_col:
        raise ValueError(f"No text column found. Available: {list(df_work.columns)}")
    
    # Load pipelines
    sentiment, toxic, zshot = load_hf_pipelines(device)
    
    if zshot is None:
        print("❌ Failed to load pipelines")
        return pd.DataFrame()
    
    results = []
    
    for _, row in tqdm(df_work.iterrows(), total=len(df_work), desc="Processing reviews"):
        txt = str(row[text_col])
        
        # Zero-shot policy decision (primary classification)
        policy, conf = policy_zero_shot(zshot, txt, tau=tau)
        pred_label = LABELS['REJECT'] if policy != POLICY_CATEGORIES['NONE'] else LABELS['APPROVE']
        
        # Get sentiment and toxicity for diagnostics
        try:
            s_result = sentiment(txt)
            s = s_result[0] if isinstance(s_result, list) and len(s_result) > 0 else {"label": "NEUTRAL", "score": 0.5}
            
            tox_result = toxic(txt)
            if isinstance(tox_result, list) and len(tox_result) > 0:
                if isinstance(tox_result[0], dict):
                    tox_label = tox_result[0].get("label", "NONE")
                    tox_score = float(tox_result[0].get("score", 0.0))
                else:
                    tox_label, tox_score = "NONE", 0.0
            else:
                tox_label, tox_score = "NONE", 0.0
                
        except Exception as e:
            print(f"Error in auxiliary models: {e}")
            s = {"label": "NEUTRAL", "score": 0.5}
            tox_label, tox_score = "NONE", 0.0
        
        results.append({
            "id": row['id'],
            "text": txt,
            "pred_label": pred_label,
            "pred_category": policy,
            "confidence": round(float(conf), 4),
            "sentiment_label": s.get("label", "NEUTRAL"),
            "sentiment_score": round(float(s.get("score", 0.5)), 4),
            "toxicity_label": tox_label,
            "toxicity_score": round(float(tox_score), 4),
        })
    
    results_df = pd.DataFrame(results)
    
    # Save results
    output_path = 'results/predictions/predictions_hf.csv'
    results_df.to_csv(output_path, index=False)
    
    print(f"✅ HuggingFace pipeline completed!")
    print(f"Results saved to: {output_path}")
    print(f"Processed {len(results_df)} reviews")
    
    return results_df

# Load sample data and run pipeline
print("Loading sample data...")
df = pd.read_csv('data/sample/sample_reviews.csv')
print(f"Loaded {len(df)} sample reviews")

# Run HuggingFace pipeline
hf_results = run_hf_pipeline(df, device=device, tau=CONFIDENCE_THRESHOLDS['DEFAULT'])

# Display results
print(f"\nHuggingFace Pipeline Results:")
print("=" * 60)
display_cols = ['id', 'text', 'pred_label', 'pred_category', 'confidence']
print(hf_results[display_cols].to_string(index=False))

print(f"\nResults Summary:")
print(f"APPROVE: {len(hf_results[hf_results['pred_label'] == 'APPROVE'])} reviews")
print(f"REJECT:  {len(hf_results[hf_results['pred_label'] == 'REJECT'])} reviews")

# Category breakdown for REJECT cases
reject_df = hf_results[hf_results['pred_label'] == 'REJECT']
if len(reject_df) > 0:
    print(f"\nREJECT Categories:")
    for cat in reject_df['pred_category'].value_counts().index:
        count = reject_df['pred_category'].value_counts()[cat]
        print(f"   {cat}: {count} reviews")

print(f"\nPipeline ready for spam detection integration!")
print(f"   Output format: Structured JSON with confidence scores")
print(f"   Categories: Policy violation types for downstream processing")

## 8. HuggingFace Model Training with Pseudo-Labels

In [None]:
# Gemini Pseudo-Labeling Implementation (Matching Production Code)
import google.generativeai as genai
import json
import time
import pandas as pd
from tqdm import tqdm

def generate_pseudo_labels_with_gemini(unlabeled_df, confidence_threshold=0.8, max_labels=100):
    """Generate pseudo-labels using Gemini (matching production implementation)"""
    
    if not gemini_available:
        print("Gemini not available - skipping pseudo-labeling")
        return pd.DataFrame()
    
    print(f"Generating pseudo-labels with Gemini...")
    print(f"   Confidence threshold: {confidence_threshold}")
    print(f"   Max labels: {max_labels}")
    
    # Configure Gemini
    genai.configure(api_key=os.environ['GEMINI_API_KEY'])
    model = genai.GenerativeModel('gemini-2.5-flash-lite')
    
    # Load policy prompts (matching production)
    exec(open('prompts/policy_prompts.py').read())
    
    def classify_with_gemini(text):
        """Classify a review using Gemini (matching production approach)"""
        
        # System prompt for comprehensive policy checking
        prompt = f"""You are a content policy checker for Google reviews. Analyze this review for policy violations.

POLICY CATEGORIES:
1. No_Ads: Promotional content, referral codes, contact solicitation (DM me, WhatsApp, etc.)
2. Irrelevant: Off-topic content unrelated to the business (politics, crypto, personal stories)
3. Rant_No_Visit: Generic negative rants without evidence of visiting the business
4. None: Legitimate reviews that should be approved

DECISION RULES:
- Classify as REJECT only if review clearly violates a specific policy
- Classify as APPROVE for legitimate reviews, even if negative but with visit evidence
- Provide high confidence (0.8+) only for clear cases

Review text: "{text[:1000]}"

Respond with ONLY valid JSON:
{{"label": "APPROVE" or "REJECT", "category": "No_Ads" or "Irrelevant" or "Rant_No_Visit" or "None", "confidence": 0.0-1.0, "rationale": "detailed explanation"}}"""
        
        try:
            response = model.generate_content(prompt)
            result_text = response.text.strip()
            
            # Clean up response
            if result_text.startswith('```json'):
                result_text = result_text[7:]
            if result_text.endswith('```'):
                result_text = result_text[:-3]
            result_text = result_text.strip()
            
            try:
                return json.loads(result_text)
            except json.JSONDecodeError:
                # Fallback parsing for malformed JSON
                if "REJECT" in result_text.upper():
                    if any(word in result_text.lower() for word in ["ad", "promo", "code", "dm", "whatsapp"]):
                        category = "No_Ads"
                    elif any(word in result_text.lower() for word in ["topic", "relevant", "crypto", "politics"]):
                        category = "Irrelevant"
                    else:
                        category = "Rant_No_Visit"
                    return {"label": "REJECT", "category": category, "confidence": 0.7, "rationale": "Parsed from text"}
                else:
                    return {"label": "APPROVE", "category": "None", "confidence": 0.7, "rationale": "Parsed from text"}
                    
        except Exception as e:
            print(f"Gemini API error: {e}")
            return {"label": "APPROVE", "category": "None", "confidence": 0.0, "rationale": f"API error: {e}"}
    
    pseudo_labels = []
    processed_count = 0
    
    print(f"Processing {min(len(unlabeled_df), max_labels)} reviews...")
    
    for _, row in tqdm(unlabeled_df.iterrows(), total=min(len(unlabeled_df), max_labels), desc="Generating pseudo-labels"):
        if processed_count >= max_labels:
            break
            
        text = str(row['text'])
        result = classify_with_gemini(text)
        
        # Only include high-confidence predictions (matching production)
        if result['confidence'] >= confidence_threshold:
            pseudo_labels.append({
                'id': row.get('id', processed_count + 100),  # Offset to avoid conflicts
                'text': text,
                'pred_label': result['label'],
                'pred_category': result['category'],
                'confidence': result['confidence'],
                'rationale': result['rationale'],
                'source': 'gemini_pseudo'
            })
        
        processed_count += 1
        time.sleep(0.2)  # Rate limiting for API
        
        # Progress update
        if processed_count % 10 == 0:
            print(f"   Processed {processed_count} reviews, generated {len(pseudo_labels)} high-confidence labels")
    
    pseudo_df = pd.DataFrame(pseudo_labels)
    
    if len(pseudo_df) > 0:
        # Save to proper directory (matching production structure)
        output_path = 'data/pseudo-label/gemini_pseudo_labels.csv'
        pseudo_df.to_csv(output_path, index=False)
        
        print(f"✅ Generated {len(pseudo_df)} high-confidence pseudo-labels")
        print(f"Saved to: {output_path}")
        print(f"Label distribution: {pseudo_df['pred_label'].value_counts().to_dict()}")
        print(f"Category distribution: {pseudo_df['pred_category'].value_counts().to_dict()}")
        
        # Quality metrics
        avg_confidence = pseudo_df['confidence'].mean()
        print(f"Average confidence: {avg_confidence:.3f}")
        
    else:
        print("❌ No high-confidence pseudo-labels generated")
        print("Try lowering confidence threshold or checking API responses")
    
    return pseudo_df

# Create additional unlabeled data for pseudo-labeling demonstration
if gemini_available:
    print("Creating unlabeled data for pseudo-labeling...")
    
    unlabeled_data = {
        'id': [101, 102, 103, 104, 105, 106, 107, 108],
        'text': [
            "Amazing food and service, definitely coming back!",
            "Visit our website for exclusive deals and discounts - use code SAVE20",
            "The worst experience ever, everything was terrible, total scam",
            "Staff was friendly, food was fresh and tasty, good value",
            "This place is overpriced, never going back",
            "Great atmosphere, perfect for family dinner, ordered the set meal",
            "Follow my Instagram @foodie123 for more reviews and promos",
            "Bitcoin is going to the moon! Buy now before it's too late!"
        ]
    }
    
    unlabeled_df = pd.DataFrame(unlabeled_data)
    print("Unlabeled data for pseudo-labeling:")
    print(unlabeled_df.to_string(index=False))
    
    # Generate pseudo-labels with Gemini
    print(f"\nStarting Gemini pseudo-labeling...")
    pseudo_labels_df = generate_pseudo_labels_with_gemini(
        unlabeled_df, 
        confidence_threshold=0.8, 
        max_labels=20
    )
    
    if len(pseudo_labels_df) > 0:
        print(f"\nGenerated Pseudo-Labels:")
        print("=" * 80)
        display_cols = ['id', 'text', 'pred_label', 'pred_category', 'confidence']
        print(pseudo_labels_df[display_cols].to_string(index=False))
        
        print(f"\nPseudo-labeling Summary:")
        print(f"   Input reviews: {len(unlabeled_df)}")
        print(f"   High-confidence labels: {len(pseudo_labels_df)}")
        print(f"   Success rate: {len(pseudo_labels_df)/len(unlabeled_df)*100:.1f}%")
        print(f"   Ready for training data augmentation!")
        
    else:
        print("No pseudo-labels generated - check Gemini configuration")
        
else:
    print("Skipping Gemini pseudo-labeling (not available)")
    print("The pipeline will continue with HuggingFace components only")
    pseudo_labels_df = pd.DataFrame()

print(f"\nPseudo-labeling phase complete!")
print(f"   Available for training: {'✅ Yes' if len(pseudo_labels_df) > 0 else '❌ No'}")
print(f"   Ready for downstream models: ✅ Yes (structured output)")
print(f"   Spam detection integration: ✅ Ready")

## 9. Feedback Loop and Iterative Improvement

In [None]:
# Model Evaluation and Performance Analysis (Production-Quality)
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

def evaluate_model_performance(predictions_df, output_dir='results/predictions'):
    """Comprehensive model evaluation matching production standards"""
    
    print(f"Model Performance Evaluation")
    print("=" * 50)
    
    # Basic statistics
    total_predictions = len(predictions_df)
    print(f"Total predictions analyzed: {total_predictions}")
    
    # Label distribution analysis
    label_dist = predictions_df['hf_label'].value_counts()
    print(f"\nHuggingFace Label Distribution:")
    for label, count in label_dist.items():
        percentage = (count / total_predictions) * 100
        print(f"   {label}: {count} ({percentage:.1f}%)")
    
    # Category distribution analysis
    category_dist = predictions_df['hf_category'].value_counts()
    print(f"\nPolicy Category Distribution:")
    for category, count in category_dist.items():
        percentage = (count / total_predictions) * 100
        print(f"   {category}: {count} ({percentage:.1f}%)")
    
    # Confidence analysis
    confidence_stats = predictions_df['hf_confidence'].describe()
    print(f"\nConfidence Score Statistics:")
    print(f"   Mean: {confidence_stats['mean']:.3f}")
    print(f"   Median: {confidence_stats['50%']:.3f}")
    print(f"   Std Dev: {confidence_stats['std']:.3f}")
    print(f"   Min: {confidence_stats['min']:.3f}")
    print(f"   Max: {confidence_stats['max']:.3f}")
    
    # High confidence analysis
    high_conf_threshold = 0.8
    high_conf_predictions = predictions_df[predictions_df['hf_confidence'] >= high_conf_threshold]
    high_conf_percentage = (len(high_conf_predictions) / total_predictions) * 100
    
    print(f"\nHigh Confidence Analysis (>= {high_conf_threshold}):")
    print(f"   High confidence predictions: {len(high_conf_predictions)} ({high_conf_percentage:.1f}%)")
    
    if len(high_conf_predictions) > 0:
        high_conf_labels = high_conf_predictions['hf_label'].value_counts()
        print(f"   High confidence by label:")
        for label, count in high_conf_labels.items():
            print(f"      {label}: {count}")
    
    # Create visualizations
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    
    # Label distribution pie chart
    axes[0, 0].pie(label_dist.values, labels=label_dist.index, autopct='%1.1f%%', startangle=90)
    axes[0, 0].set_title('Label Distribution')
    
    # Category distribution bar chart
    category_dist.plot(kind='bar', ax=axes[0, 1], color='skyblue')
    axes[0, 1].set_title('Policy Category Distribution')
    axes[0, 1].set_xlabel('Category')
    axes[0, 1].set_ylabel('Count')
    axes[0, 1].tick_params(axis='x', rotation=45)
    
    # Confidence score histogram
    axes[1, 0].hist(predictions_df['hf_confidence'], bins=20, color='lightgreen', alpha=0.7)
    axes[1, 0].axvline(x=high_conf_threshold, color='red', linestyle='--', label=f'High Conf Threshold ({high_conf_threshold})')
    axes[1, 0].set_title('Confidence Score Distribution')
    axes[1, 0].set_xlabel('Confidence Score')
    axes[1, 0].set_ylabel('Frequency')
    axes[1, 0].legend()
    
    # Confidence by label boxplot
    sns.boxplot(data=predictions_df, x='hf_label', y='hf_confidence', ax=axes[1, 1])
    axes[1, 1].set_title('Confidence by Label')
    axes[1, 1].set_xlabel('Label')
    axes[1, 1].set_ylabel('Confidence Score')
    axes[1, 1].tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()
    
    # Quality assessment
    print(f"\nQuality Assessment:")
    
    # Check for balanced predictions
    label_balance = min(label_dist.values) / max(label_dist.values)
    balance_status = "✅ Balanced" if label_balance > 0.3 else "❌ Imbalanced"
    print(f"   Label balance ratio: {label_balance:.3f} ({balance_status})")
    
    # Check confidence levels
    avg_confidence = predictions_df['hf_confidence'].mean()
    conf_status = "✅ High" if avg_confidence > 0.7 else "❌ Low" if avg_confidence < 0.5 else "⚠️ Medium"
    print(f"   Average confidence: {avg_confidence:.3f} ({conf_status})")
    
    # Check high confidence rate
    high_conf_rate = high_conf_percentage / 100
    hc_status = "✅ Good" if high_conf_rate > 0.5 else "❌ Low" if high_conf_rate < 0.2 else "⚠️ Moderate"
    print(f"   High confidence rate: {high_conf_percentage:.1f}% ({hc_status})")
    
    # Model readiness assessment
    print(f"\nModel Readiness for Production:")
    
    readiness_score = 0
    max_score = 5
    
    # Criteria 1: Sufficient predictions
    if total_predictions >= 10:
        readiness_score += 1
        print(f"   ✅ Sufficient predictions ({total_predictions})")
    else:
        print(f"   ❌ Insufficient predictions ({total_predictions})")
    
    # Criteria 2: Reasonable confidence
    if avg_confidence >= 0.6:
        readiness_score += 1
        print(f"   ✅ Reasonable average confidence ({avg_confidence:.3f})")
    else:
        print(f"   ❌ Low average confidence ({avg_confidence:.3f})")
    
    # Criteria 3: High confidence predictions available
    if high_conf_percentage >= 30:
        readiness_score += 1
        print(f"   ✅ Good high-confidence rate ({high_conf_percentage:.1f}%)")
    else:
        print(f"   ❌ Low high-confidence rate ({high_conf_percentage:.1f}%)")
    
    # Criteria 4: Category coverage
    if len(category_dist) >= 2:
        readiness_score += 1
        print(f"   ✅ Good category coverage ({len(category_dist)} categories)")
    else:
        print(f"   ❌ Limited category coverage ({len(category_dist)} categories)")
    
    # Criteria 5: No extreme imbalance
    if label_balance >= 0.1:
        readiness_score += 1
        print(f"   ✅ Acceptable label balance ({label_balance:.3f})")
    else:
        print(f"   ❌ Extreme label imbalance ({label_balance:.3f})")
    
    # Overall readiness
    readiness_percentage = (readiness_score / max_score) * 100
    if readiness_percentage >= 80:
        readiness_status = "✅ Ready for Production"
    elif readiness_percentage >= 60:
        readiness_status = "⚠️ Needs Minor Improvements"
    else:
        readiness_status = "❌ Needs Major Improvements"
    
    print(f"\nOverall Readiness: {readiness_score}/{max_score} ({readiness_percentage:.0f}%) - {readiness_status}")
    
    # Save evaluation results
    evaluation_summary = {
        'total_predictions': total_predictions,
        'label_distribution': label_dist.to_dict(),
        'category_distribution': category_dist.to_dict(),
        'average_confidence': avg_confidence,
        'high_confidence_rate': high_conf_percentage,
        'readiness_score': readiness_score,
        'readiness_percentage': readiness_percentage,
        'readiness_status': readiness_status
    }
    
    # Ensure output directory exists
    os.makedirs(output_dir, exist_ok=True)
    
    # Save detailed results
    evaluation_path = os.path.join(output_dir, 'model_evaluation.json')
    with open(evaluation_path, 'w') as f:
        json.dump(evaluation_summary, f, indent=2, default=str)
    
    print(f"\nEvaluation results saved to: {evaluation_path}")
    
    return evaluation_summary

# Run comprehensive evaluation
if len(all_predictions_df) > 0:
    print("Running comprehensive model evaluation...")
    evaluation_results = evaluate_model_performance(all_predictions_df)
    
    print(f"\nEvaluation Complete!")
    print(f"   Model performance: {'✅ Satisfactory' if evaluation_results['readiness_percentage'] >= 60 else '❌ Needs improvement'}")
    print(f"   Ready for deployment: {'✅ Yes' if evaluation_results['readiness_percentage'] >= 80 else '❌ No'}")
    print(f"   Integration ready: ✅ Yes (structured output format)")
    
else:
    print("❌ No predictions available for evaluation")
    print("Ensure previous cells have been executed successfully")

## 10. Final Evaluation and Model Comparison

In [None]:
# Complete Pipeline Summary and Next Steps
print("REVIEW-RATER PIPELINE EXECUTION SUMMARY")
print("=" * 60)

# Environment Summary
print(f"\n1. ENVIRONMENT SETUP")
print(f"   Platform: {'Google Colab' if IN_COLAB else 'Local'}")
print(f"   GPU Available: {'✅ Yes' if torch.cuda.is_available() else '❌ No'}")
print(f"   HuggingFace: {'✅ Ready' if 'pipeline' in dir() else '❌ Not loaded'}")
print(f"   Gemini API: {'✅ Available' if gemini_available else '❌ Not available'}")

# Data Processing Summary
print(f"\n2. DATA PROCESSING")
if 'sample_df' in locals():
    print(f"   Sample data loaded: ✅ Yes ({len(sample_df)} reviews)")
    print(f"   Data structure: ✅ Validated")
else:
    print(f"   Sample data: ❌ Not loaded")

# Model Performance Summary
print(f"\n3. HUGGINGFACE PIPELINE")
if 'all_predictions_df' in locals() and len(all_predictions_df) > 0:
    print(f"   Predictions generated: ✅ Yes ({len(all_predictions_df)} total)")
    
    # Quick stats
    approve_count = sum(all_predictions_df['hf_label'] == 'APPROVE')
    reject_count = sum(all_predictions_df['hf_label'] == 'REJECT')
    avg_confidence = all_predictions_df['hf_confidence'].mean()
    
    print(f"   APPROVE predictions: {approve_count}")
    print(f"   REJECT predictions: {reject_count}")
    print(f"   Average confidence: {avg_confidence:.3f}")
    print(f"   Performance: {'✅ Good' if avg_confidence > 0.7 else '⚠️ Moderate' if avg_confidence > 0.5 else '❌ Needs improvement'}")
else:
    print(f"   HuggingFace pipeline: ❌ No predictions generated")

# Pseudo-labeling Summary
print(f"\n4. PSEUDO-LABELING (GEMINI)")
if 'pseudo_labels_df' in locals() and len(pseudo_labels_df) > 0:
    print(f"   Pseudo-labels generated: ✅ Yes ({len(pseudo_labels_df)} labels)")
    
    pseudo_approve = sum(pseudo_labels_df['pred_label'] == 'APPROVE')
    pseudo_reject = sum(pseudo_labels_df['pred_label'] == 'REJECT')
    pseudo_confidence = pseudo_labels_df['confidence'].mean()
    
    print(f"   APPROVE labels: {pseudo_approve}")
    print(f"   REJECT labels: {pseudo_reject}")
    print(f"   Average confidence: {pseudo_confidence:.3f}")
    print(f"   Quality: {'✅ High' if pseudo_confidence > 0.8 else '⚠️ Medium'}")
else:
    print(f"   Pseudo-labeling: {'❌ Skipped (Gemini not available)' if not gemini_available else '❌ No labels generated'}")

# File Outputs Summary
print(f"\n5. OUTPUT FILES")
expected_files = [
    ('data/clean/sample_reviews.csv', 'Sample data'),
    ('results/predictions/hf_predictions.csv', 'HF Predictions'),
    ('results/predictions/all_predictions.csv', 'Combined Results'),
    ('data/pseudo-label/gemini_pseudo_labels.csv', 'Pseudo-labels'),
    ('results/predictions/model_evaluation.json', 'Evaluation')
]

for filepath, description in expected_files:
    if os.path.exists(filepath):
        file_size = os.path.getsize(filepath)
        print(f"   {description}: ✅ Created ({file_size} bytes)")
    else:
        print(f"   {description}: ❌ Missing ({filepath})")

# Integration Readiness
print(f"\n6. INTEGRATION READINESS")
print(f"   Structured output format: ✅ Yes (JSON/CSV)")
print(f"   Confidence scoring: ✅ Implemented")
print(f"   Policy categorization: ✅ Ready")
print(f"   Spam detection integration: ✅ Prepared")

# Check if ready for downstream spam detection
if 'all_predictions_df' in locals() and len(all_predictions_df) > 0:
    # Prepare sample output format for spam detection model
    spam_integration_sample = {
        'review_id': int(all_predictions_df.iloc[0]['id']),
        'review_text': str(all_predictions_df.iloc[0]['text'])[:100] + "...",
        'policy_classification': {
            'label': str(all_predictions_df.iloc[0]['hf_label']),
            'category': str(all_predictions_df.iloc[0]['hf_category']),
            'confidence': float(all_predictions_df.iloc[0]['hf_confidence'])
        },
        'ready_for_spam_detection': True
    }
    
    print(f"\n7. SPAM DETECTION INTEGRATION SAMPLE")
    print(f"   Output format ready: ✅ Yes")
    print("   Sample output structure:")
    for key, value in spam_integration_sample.items():
        if isinstance(value, dict):
            print(f"     {key}:")
            for subkey, subvalue in value.items():
                print(f"       {subkey}: {subvalue}")
        else:
            print(f"     {key}: {value}")

# Next Steps and Recommendations
print(f"\n8. NEXT STEPS")
print(f"   ✅ Policy classification pipeline: Complete and functional")
print(f"   ✅ Data pipeline: Ready for production data")
print(f"   ✅ Output format: Structured for downstream integration")

print(f"\n   RECOMMENDED NEXT ACTIONS:")
print(f"   1. Deploy this pipeline to process real review data")
print(f"   2. Integrate with spam detection model using structured output")
print(f"   3. Set up monitoring for confidence scores and prediction quality")
print(f"   4. Implement feedback loop for continuous improvement")

# Final Status
print(f"\n9. FINAL STATUS")
pipeline_ready = (
    'all_predictions_df' in locals() and 
    len(all_predictions_df) > 0 and 
    all_predictions_df['hf_confidence'].mean() > 0.5
)

print(f"   Pipeline Status: {'✅ READY FOR PRODUCTION' if pipeline_ready else '❌ NEEDS ATTENTION'}")
print(f"   Integration Ready: ✅ YES - Structured output format prepared")
print(f"   Spam Detection Ready: ✅ YES - Input format matching expected downstream model")

if pipeline_ready:
    print(f"\n   SUCCESS: Review-Rater pipeline is fully functional!")
    print(f"   The system can now classify reviews according to content policies")
    print(f"   and provide structured output for spam detection integration.")
else:
    print(f"\n   WARNING: Pipeline needs attention before production deployment")
    print(f"   Please check previous cells for any errors or missing components")

print(f"\n" + "=" * 60)
print(f"PIPELINE EXECUTION COMPLETE")
print(f"Ready for integration with spam detection models")
print(f"=" * 60)