# Review Classification Pipeline - Inference with HuggingFace Model

This notebook uses the fine-tuned HuggingFace model to classify new reviews without retraining.

## Requirements
1. Run the complete training pipeline first (00_colab_complete_pipeline.ipynb)
2. Ensure a trained model exists in `models/saved_models/review_classifier_*/`
3. Place your review data in the `data/actual/` directory
4. Data should be in CSV or JSON format with a 'text' column; an 'id' column is optional and is generated (1, 2, 3, ...) if missing
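
As a minimal sketch of the input layout described in the requirements, the snippet below writes the same two reviews as a CSV file and as a JSON array, then reads both back. The file names `sample_reviews.csv` and `sample_reviews.json`, the sample texts, and the temporary directory are illustrative assumptions; in this notebook the files would be placed in `data/actual/`.

```python
import csv
import json
import os
import tempfile

# Hypothetical sample rows; real data would hold your own reviews.
rows = [
    {"id": 1, "text": "Great food and friendly staff."},
    {"id": 2, "text": "Use my promo code for a discount!"},
]

# A temporary directory stands in for data/actual/ in this sketch.
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "sample_reviews.csv")
json_path = os.path.join(out_dir, "sample_reviews.json")

# CSV: a header row with the column names, then one review per line.
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "text"])
    writer.writeheader()
    writer.writerows(rows)

# JSON: a top-level array of objects with the same keys.
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)

# Read both files back to confirm they hold the same reviews.
with open(csv_path, newline="", encoding="utf-8") as f:
    csv_rows = list(csv.DictReader(f))
with open(json_path, encoding="utf-8") as f:
    json_rows = json.load(f)

print([r["text"] for r in csv_rows] == [r["text"] for r in json_rows])
```

Note that values read back from CSV are strings, so an `id` of `1` comes back as `"1"`, while JSON keeps it as a number.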

## What This Does
- Loads the fine-tuned HuggingFace model from `models/saved_models/`
- Loads the saved spam detector (`.joblib`) and the auxiliary models (toxicity detection, zero-shot classification)
- Combines the fine-tuned model, spam detector, and policy violation detection into one decision
- Processes reviews from `data/actual/`
- Outputs comprehensive classification results
- Saves results to `results/inference/`

## Model Architecture
- **Fine-tuned Model**: Custom HuggingFace transformer for review classification
- **Spam Detector**: Saved `UnifiedSpamDetector` loaded from a `.joblib` file
- **Policy Detection**: Zero-shot + toxicity analysis for specific violations
- **Triple-layer**: All three layers vote; REJECT if any layer flags the review
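
The voting rule in the list above (REJECT as soon as any layer flags the review) can be sketched as a small standalone function. The name `combine_layers` and the layer keys `hf`, `spam`, and `policy` are hypothetical and used only for illustration; the notebook's own logic lives in `process_reviews` and also tracks confidences and categories.

```python
from typing import Dict, List, Tuple


def combine_layers(layer_labels: Dict[str, str]) -> Tuple[str, List[str]]:
    """Return the final label and the names of the layers that voted REJECT."""
    flagged = [name for name, label in layer_labels.items() if label == "REJECT"]
    final = "REJECT" if flagged else "APPROVE"
    return final, flagged


# One layer flags the review, so the final label is REJECT.
print(combine_layers({"hf": "APPROVE", "spam": "APPROVE", "policy": "REJECT"}))
# No layer flags the review, so it is approved.
print(combine_layers({"hf": "APPROVE", "spam": "APPROVE", "policy": "APPROVE"}))
```

A rule like this favors catching violations over avoiding false rejections: adding a layer can only increase the number of rejected reviews.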

## 1. Environment Setup

In [10]:
# Install required packages (if not already installed)
!pip install -q transformers torch pandas scikit-learn

import pandas as pd
import json
import os
import sys
from datetime import datetime
from pathlib import Path

print("✅ Environment setup complete")
print(f"Current directory: {os.getcwd()}")

✅ Environment setup complete
Current directory: /content


## 2. Load Trained Model

In [11]:
# Check if trained HuggingFace model exists
model_dir = 'models/saved_models'

print("CHECKING TRAINED MODEL")
print("="*30)

# Look for HuggingFace model directories (timestamped folders)
model_folders = []
if os.path.exists(model_dir):
    for item in os.listdir(model_dir):
        item_path = os.path.join(model_dir, item)
        if os.path.isdir(item_path) and item.startswith('review_classifier_'):
            # Look for checkpoint subdirectories within the model folder
            checkpoint_dirs = [d for d in os.listdir(item_path)
                             if os.path.isdir(os.path.join(item_path, d)) and d.startswith('checkpoint-')]

            if checkpoint_dirs:
                # Use the highest numbered checkpoint
                latest_checkpoint = sorted(checkpoint_dirs, key=lambda x: int(x.split('-')[1]))[-1]
                checkpoint_path = os.path.join(item_path, latest_checkpoint)

                # Check if this checkpoint contains HuggingFace model files
                hf_files = ['config.json', 'tokenizer.json', 'tokenizer_config.json']
                model_files = ['model.safetensors', 'pytorch_model.bin']  # Either format

                has_config = all(os.path.exists(os.path.join(checkpoint_path, f)) for f in hf_files)
                has_model = any(os.path.exists(os.path.join(checkpoint_path, f)) for f in model_files)

                if has_config and has_model:
                    model_folders.append((item, latest_checkpoint))
            else:
                # Fallback: check if model files are directly in the main folder
                hf_files = ['config.json', 'tokenizer.json', 'tokenizer_config.json']
                model_files = ['model.safetensors', 'pytorch_model.bin']

                has_config = all(os.path.exists(os.path.join(item_path, f)) for f in hf_files)
                has_model = any(os.path.exists(os.path.join(item_path, f)) for f in model_files)

                if has_config and has_model:
                    model_folders.append((item, None))

if not model_folders:
    print(f"❌ ERROR: No trained HuggingFace model found in {model_dir}")
    print(f"\nPlease run the training notebook first:")
    print(f"1. Open 00_colab_complete_pipeline.ipynb")
    print(f"2. Run all cells to train the model")
    print(f"3. Ensure model is saved to models/saved_models/")
    print(f"4. Then return to this inference notebook")
    sys.exit()
else:
    # Use the most recent model (last in alphabetical order = most recent timestamp)
    latest_model_info = sorted(model_folders)[-1]
    latest_model, checkpoint = latest_model_info

    if checkpoint:
        model_path = os.path.join(model_dir, latest_model, checkpoint)
        print(f"✅ Found trained model: {latest_model}")
        print(f"   Checkpoint: {checkpoint}")
        print(f"   Model path: {model_path}")
    else:
        model_path = os.path.join(model_dir, latest_model)
        print(f"✅ Found trained model: {latest_model}")
        print(f"   Model path: {model_path}")

    # Check model files
    model_files = os.listdir(model_path)
    print(f"   Model files: {', '.join(model_files)}")

    # Load latest_config.json if it exists for metadata
    config_file = os.path.join(model_dir, 'latest_config.json')
    if os.path.exists(config_file):
        with open(config_file, 'r') as f:
            config_metadata = json.load(f)
        print(f"\nModel Information:")
        print(f"   Training mode: {config_metadata.get('training_mode', 'Unknown')}")
        print(f"   Timestamp: {config_metadata.get('timestamp', 'Unknown')}")
        print(f"   Confidence threshold: {config_metadata.get('confidence_threshold', 0.55)}")
    else:
        print(f"\nUsing model: {latest_model}")
        if checkpoint:
            print(f"Using checkpoint: {checkpoint}")

CHECKING TRAINED MODEL
✅ Found trained model: review_classifier_20250830_195222
   Checkpoint: checkpoint-93
   Model path: models/saved_models/review_classifier_20250830_195222/checkpoint-93
   Model files: tokenizer.json, special_tokens_map.json, optimizer.pt, tokenizer_config.json, vocab.txt, scheduler.pt, rng_state.pth, model.safetensors, trainer_state.json, .ipynb_checkpoints, config.json

Using model: review_classifier_20250830_195222
Using checkpoint: checkpoint-93
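
The cell above picks the highest-numbered `checkpoint-N` folder by sorting on the integer after the dash. The sketch below isolates that step and shows why a numeric sort matters. The helper name `latest_checkpoint` is hypothetical, and skipping folders whose suffix is not a number is an added assumption rather than something the notebook cell does.

```python
from typing import List, Optional


def latest_checkpoint(names: List[str]) -> Optional[str]:
    """Pick the checkpoint-N name with the largest numeric N, or None."""
    numbered = []
    for name in names:
        prefix, _, step = name.partition("-")
        # Keep only names of the exact form checkpoint-<digits>.
        if prefix == "checkpoint" and step.isdigit():
            numbered.append((int(step), name))
    return max(numbered)[1] if numbered else None


names = ["checkpoint-93", "checkpoint-100", "checkpoint-9"]
print(latest_checkpoint(names))  # numeric comparison of the step
print(sorted(names)[-1])         # plain string comparison, for contrast
```

With a plain string sort, `checkpoint-93` would rank after `checkpoint-100`, so an older checkpoint could be loaded by mistake.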


## 3. Load Inference Pipeline

In [12]:
# Load ALL trained models: HuggingFace + Spam Detector + Auxiliary models
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch
import numpy as np
import joblib
import sys
import os

print("LOADING ALL TRAINED MODELS")
print("="*40)

# -------- Safe defaults (in case not previously defined) --------
try:
    model_path
except NameError:
    # Set this to where your fine-tuned HF model folder lives (config.json, pytorch_model.bin, tokenizer files)
    model_path = "/content/models/saved_models"  # <-- change if needed

try:
    model_dir
except NameError:
    model_dir = "/content/models"  # not strictly needed now, but kept for consistency

try:
    config_metadata
except NameError:
    config_metadata = {}

# Hard-coded path to the spam detector .joblib file; update the timestamp if your training run differs
SPAM_MODEL_PATH = "/content/models/saved_models/unified_spam_detector_20250830_195222.joblib"

try:
    # 1) Load the fine-tuned HuggingFace model
    print(f"Loading fine-tuned HuggingFace model from: {model_path}")
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    hf_model = AutoModelForSequenceClassification.from_pretrained(model_path)

    print(f"✅ HuggingFace model loaded successfully")
    print(f"   Model type: {hf_model.config.model_type}")
    # Some tokenizers (e.g., sentencepiece) may not have vocab_size; guard it:
    vocab_size = getattr(tokenizer, "vocab_size", None)
    print(f"   Vocab size: {vocab_size if vocab_size is not None else 'N/A'}")

    # 2) Load the spam detector model (direct path)
    print(f"\nLoading spam detector from: {SPAM_MODEL_PATH}")
    try:
        spam_detector = joblib.load(SPAM_MODEL_PATH)
        print(f"✅ Spam detector loaded successfully")
        print(f"   Spam detector type: {type(spam_detector).__name__}")
        spam_model_available = True
    except FileNotFoundError:
        print(f"❌ Spam detector file not found at: {SPAM_MODEL_PATH}")
        spam_model_available = False
        spam_detector = None
    except Exception as e:
        print(f"⚠️ Error loading spam detector: {e}")
        spam_model_available = False
        spam_detector = None

    # 3) Load auxiliary models for policy detection
    device = 0 if torch.cuda.is_available() else -1
    print(f"\nDevice: {'GPU' if device == 0 else 'CPU'}")

    print(f"\nLoading auxiliary models...")
    # Toxicity classification pipeline
    toxic_pipeline = pipeline(
        "text-classification",
        model="unitary/toxic-bert",
        top_k=None,
        device=device
    )
    # Zero-shot classification pipeline
    zshot_pipeline = pipeline(
        "zero-shot-classification",
        model="facebook/bart-large-mnli",
        device=device
    )
    print(f"✅ Auxiliary models loaded")

    # 4) Constants for labels/policies
    POLICY_CATEGORIES = {
        'NO_ADS': 'No_Ads',
        'IRRELEVANT': 'Irrelevant',
        'RANT_NO_VISIT': 'Rant_No_Visit',
        'NONE': 'None'
    }

    LABELS = {
        'APPROVE': 'APPROVE',
        'REJECT': 'REJECT'
    }

    ZERO_SHOT_LABELS = [
        "an advertisement or promotional solicitation for this business (promo code, referral, links, contact to buy)",
        "off-topic or unrelated to this business (e.g., politics, crypto, chain messages, personal stories not about this place)",
        "a generic negative rant about this business without evidence of a visit (short insults, 'scam', 'overpriced', 'worst ever')",
        "a relevant on-topic description of a visit or experience at this business"
    ]

    ZERO_SHOT_TO_POLICY = {
        ZERO_SHOT_LABELS[0]: POLICY_CATEGORIES['NO_ADS'],
        ZERO_SHOT_LABELS[1]: POLICY_CATEGORIES['IRRELEVANT'],
        ZERO_SHOT_LABELS[2]: POLICY_CATEGORIES['RANT_NO_VISIT'],
        ZERO_SHOT_LABELS[3]: POLICY_CATEGORIES['NONE']
    }

    # 5) Confidence threshold
    confidence_threshold = config_metadata.get('confidence_threshold', 0.55)

    print(f"\n✅ ALL MODELS LOADED SUCCESSFULLY")
    print(f"   HuggingFace Model: ✅ Ready")
    print(f"   Policy Detection (toxicity + zero-shot): ✅ Ready")
    print(f"   Confidence threshold: {confidence_threshold}")
    print(f"\n🚀 Triple-layer detection ready for inference!")

    models_loaded = True

except Exception as e:
    print(f"❌ Error loading models: {e}")
    import traceback
    traceback.print_exc()
    print(f"\nTroubleshooting:")
    print(f"1. Ensure the training notebook completed successfully")
    print(f"2. Check that model files exist in: {model_path}")
    print(f"3. Verify spam detector exists at: {SPAM_MODEL_PATH}")
    print(f"4. Verify internet connection for downloading auxiliary models")
    models_loaded = False

LOADING ALL TRAINED MODELS
Loading fine-tuned HuggingFace model from: models/saved_models/review_classifier_20250830_195222/checkpoint-93
✅ HuggingFace model loaded successfully
   Model type: distilbert
   Vocab size: 30522

Loading spam detector from: /content/models/saved_models/unified_spam_detector_20250830_195222.joblib
✅ Spam detector loaded successfully
   Spam detector type: UnifiedSpamDetector

Device: CPU

Loading auxiliary models...


Device set to use cpu
Device set to use cpu


✅ Auxiliary models loaded

✅ ALL MODELS LOADED SUCCESSFULLY
   HuggingFace Model: ✅ Ready
   Policy Detection (toxicity + zero-shot): ✅ Ready
   Confidence threshold: 0.55

🚀 Triple-layer detection ready for inference!
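
The cell above defines `ZERO_SHOT_TO_POLICY`, although the inference function later indexes `ZERO_SHOT_LABELS` directly. As a rough sketch of how such a mapping could be applied, the snippet below takes a hand-written result in the shape the zero-shot pipeline returns for a single string (parallel `labels` and `scores` lists) and looks up the policy for the top label. The shortened label strings, the made-up scores, and the helper `top_policy` are all assumptions for illustration.

```python
# Shortened stand-ins for the notebook's long ZERO_SHOT_LABELS strings.
ZERO_SHOT_LABELS = ["an advertisement", "off-topic", "a generic rant", "a relevant review"]
POLICIES = ["No_Ads", "Irrelevant", "Rant_No_Visit", "None"]
ZERO_SHOT_TO_POLICY = dict(zip(ZERO_SHOT_LABELS, POLICIES))

# Hand-written example result; the scores are invented for this sketch.
mock_result = {
    "sequence": "Use my promo code SAVE20!",
    "labels": ["an advertisement", "a generic rant", "off-topic", "a relevant review"],
    "scores": [0.91, 0.22, 0.10, 0.05],
}


def top_policy(result, mapping):
    """Return (policy, score) for the highest-scoring candidate label."""
    scores = dict(zip(result["labels"], result["scores"]))
    best_label = max(scores, key=scores.get)
    return mapping[best_label], scores[best_label]


print(top_policy(mock_result, ZERO_SHOT_TO_POLICY))
```

Building a label-to-score dictionary first avoids relying on the order in which the pipeline returns the labels.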


## 4. Load Input Data

In [13]:
# Check for input data in data/actual directory
input_dir = 'data/actual'
os.makedirs(input_dir, exist_ok=True)

print("LOADING INPUT DATA")
print("="*25)

# Look for CSV and JSON files
input_files = []
if os.path.exists(input_dir):
    for file in os.listdir(input_dir):
        if file.endswith(('.csv', '.json')):
            input_files.append(file)

print(f"Available input files in {input_dir}:")
for file in input_files:
    file_path = os.path.join(input_dir, file)
    file_size = os.path.getsize(file_path)
    print(f"   {file} ({file_size} bytes)")

if not input_files:
    print(f"❌ No input files found in {input_dir}")
    print(f"\nTo add your review data:")
    print(f"1. Create a CSV file with columns: 'id', 'text'")
    print('2. Or create a JSON file with an array of objects: [{"id": 1, "text": "review text"}, ...]')
    print(f"3. Place the file in {input_dir}")
    print(f"4. Re-run this cell")
    print(f"\nExample files are already created for you to test with.")

# Let user choose which file to process
if input_files:
    print(f"\nChoose a file to process:")
    for i, file in enumerate(input_files):
        print(f"   {i+1}. {file}")

    # For demo, automatically use the first file
    # In practice, you might want to manually specify the file
    selected_file = input_files[0]
    print(f"\nUsing: {selected_file}")

    # Load the selected file
    file_path = os.path.join(input_dir, selected_file)

    try:
        if selected_file.endswith('.csv'):
            input_data = pd.read_csv(file_path)
        elif selected_file.endswith('.json'):
            with open(file_path, 'r') as f:
                json_data = json.load(f)
            input_data = pd.DataFrame(json_data)

        # Validate required columns
        if 'text' not in input_data.columns:
            print(f"❌ Missing required 'text' column")
            print(f"Available columns: {list(input_data.columns)}")
            input_data = None
        else:
            # Add ID column if missing
            if 'id' not in input_data.columns:
                input_data['id'] = range(1, len(input_data) + 1)

            print(f"✅ Data loaded successfully")
            print(f"   Reviews to process: {len(input_data)}")
            print(f"   Columns: {list(input_data.columns)}")

            # Show preview
            print(f"\nData Preview:")
            for idx, row in input_data.head(3).iterrows():
                text_preview = str(row['text'])[:60] + "..." if len(str(row['text'])) > 60 else str(row['text'])
                print(f"   ID {row['id']}: {text_preview}")

            if len(input_data) > 3:
                print(f"   ... and {len(input_data) - 3} more reviews")

    except Exception as e:
        print(f"❌ Error loading file: {e}")
        input_data = None
else:
    input_data = None

LOADING INPUT DATA
Available input files in data/actual:
   test_reviews.csv (1422 bytes)

Choose a file to process:
   1. test_reviews.csv

Using: test_reviews.csv
✅ Data loaded successfully
   Reviews to process: 15
   Columns: ['id', 'text']

Data Preview:
   ID 1: Great food and excellent service! Will definitely come back.
   ID 2: Use my promo code SAVE20 for 20% off your next order! DM me ...
   ID 3: Bitcoin is going to the moon! Buy now before it's too late!
   ... and 12 more reviews
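
The validation step in this section (require a `text` field, generate sequential ids when none are present) can be sketched without pandas on a plain list of dictionaries. The function name `normalize_rows` is hypothetical, and the sketch is a simplified stand-in for the DataFrame-based check in the cell above.

```python
from typing import Any, Dict, List


def normalize_rows(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Check for 'text' and add 1-based ids when no row carries an 'id'."""
    if any("text" not in row for row in rows):
        raise ValueError("Missing required 'text' field")
    needs_ids = not any("id" in row for row in rows)
    normalized = []
    for position, row in enumerate(rows, start=1):
        item = dict(row)  # copy so the caller's data is left untouched
        if needs_ids:
            item["id"] = position
        item["text"] = str(item["text"])  # the pipeline treats text as a string
        normalized.append(item)
    return normalized


print(normalize_rows([{"text": "Great food"}, {"text": 42}]))
```

Failing early on a missing `text` field keeps the error close to the data-loading step instead of surfacing later inside the tokenizer.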


## 5. Run Inference

In [14]:
# Define the inference functions
def predict_with_hf_model(text, model, tokenizer):
    """Predict using fine-tuned HuggingFace model"""
    inputs = tokenizer(text, truncation=True, padding=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
        probs = torch.softmax(logits, dim=-1)
        conf = float(probs.max())
        pred = int(probs.argmax())
    # Assumes the model was trained with class index 1 = REJECT and 0 = APPROVE
    return ("REJECT" if pred == 1 else "APPROVE"), conf

def predict_with_spam_detector(text, spam_detector):
    """Predict using spam detector model"""
    if spam_detector is None:
        return "APPROVE", 0.5, "None"

    try:
        results = spam_detector.predict([text])
        result = results[0]
        return result.label, result.confidence, result.category
    except Exception as e:
        print(f"Warning: Spam detector error: {e}")
        return "APPROVE", 0.5, "None"

def tox_top_label(tox_output):
    """Extract top toxicity label and score"""
    try:
        if isinstance(tox_output, dict):
            candidates = [tox_output]
        elif isinstance(tox_output, list):
            if len(tox_output) and isinstance(tox_output[0], dict):
                candidates = tox_output
            elif len(tox_output) and isinstance(tox_output[0], list):
                candidates = tox_output[0]
            else:
                candidates = []
        else:
            candidates = []
        if not candidates:
            return "NONE", 0.0
        best = max(candidates, key=lambda d: float(d.get("score", 0.0)))
        return best.get("label", "NONE"), float(best.get("score", 0.0))
    except Exception:
        return "NONE", 0.0

def policy_zero_shot_fused(zshot, toxic, text, tau_irrelevant=0.55, tau_rant=0.55, tau_ads=0.70, tox_tau=0.50):
    """Unified policy detection using zero-shot + toxicity"""
    # Zero-shot classification
    zs_res = zshot(text, candidate_labels=ZERO_SHOT_LABELS,
                   hypothesis_template="This review is {}.", multi_label=True)
    zs = {lab: float(scr) for lab, scr in zip(zs_res["labels"], zs_res["scores"])}

    ads = zs.get(ZERO_SHOT_LABELS[0], 0.0)
    irr = zs.get(ZERO_SHOT_LABELS[1], 0.0)
    rant = zs.get(ZERO_SHOT_LABELS[2], 0.0)

    # Toxicity gate
    tox_label, tox_score = tox_top_label(toxic(text))

    TOX_TO_RANT = {"toxic", "severe_toxic", "obscene", "threat", "insult"}
    TOX_TO_IRRELEVANT = {"identity_hate"}

    if tox_label and tox_score >= tox_tau:
        if tox_label in TOX_TO_RANT:
            return LABELS['REJECT'], POLICY_CATEGORIES['RANT_NO_VISIT']
        if tox_label in TOX_TO_IRRELEVANT:
            return LABELS['REJECT'], POLICY_CATEGORIES['IRRELEVANT']

    # Policy thresholds: each score is compared against its own threshold
    if irr >= tau_irrelevant or rant >= tau_rant:
        return LABELS['REJECT'], (POLICY_CATEGORIES['IRRELEVANT'] if irr >= rant else POLICY_CATEGORIES['RANT_NO_VISIT'])

    # Ads detection with evidence
    import re
    AD_PATTERNS = [r"https?://", r"\bwww\.", r"\.[a-z]{2,6}\b", r"\b(?:\+?\d[\s\-()]*){7,}\b",
                   r"\bpromo(?:\s*code)?\b", r"\bdiscount\b", r"\bcoupon\b", r"\breferral\b",
                   r"\buse\s*code\b", r"\bwhatsapp\b", r"\bdm\s+(?:me|us)\b"]
    AD_REGEX = re.compile("|".join(AD_PATTERNS), flags=re.IGNORECASE)

    has_ads = bool(AD_REGEX.search(text))
    ads_margin = 0.10

    if has_ads and (ads >= tau_ads) and (ads >= max(irr, rant) + ads_margin):
        return LABELS['REJECT'], POLICY_CATEGORIES['NO_ADS']

    return LABELS['APPROVE'], POLICY_CATEGORIES['NONE']

def process_reviews(input_df):
    """Process reviews using ALL three detection layers: HF Model + Spam Detector + Policy Detection"""
    results = []

    for _, row in input_df.iterrows():
        text = str(row['text'])

        # Layer 1: HuggingFace fine-tuned model
        hf_label, hf_conf = predict_with_hf_model(text, hf_model, tokenizer)

        # Layer 2: Spam detector (pattern + ML analysis)
        spam_label, spam_conf, spam_category = predict_with_spam_detector(text, spam_detector)

        # Layer 3: Policy detection for specific violations
        policy_label, policy_category = policy_zero_shot_fused(
            zshot_pipeline, toxic_pipeline, text,
            tau_irrelevant=0.55, tau_rant=0.55, tau_ads=0.70, tox_tau=0.50
        )

        # Triple-layer decision logic: ANY layer can reject
        rejection_reasons = []
        final_confidence = 0.7  # Default

        if hf_label == 'REJECT':
            rejection_reasons.append(('HuggingFace_ML', hf_conf))

        if spam_label == 'REJECT':
            rejection_reasons.append(('Spam_Detection', spam_conf))

        if policy_label == 'REJECT':
            # The policy layer returns no score, so a fixed 0.8 stands in as its confidence
            rejection_reasons.append(('Policy_Violation', 0.8))

        # Final decision
        if rejection_reasons:
            final_label = 'REJECT'
            # Use the highest confidence rejection reason
            best_reason, best_conf = max(rejection_reasons, key=lambda x: x[1])

            if best_reason == 'Policy_Violation':
                final_category = policy_category
                final_confidence = 0.8
            elif best_reason == 'Spam_Detection':
                final_category = spam_category if spam_category != 'None' else 'Spam_Detected'
                final_confidence = spam_conf
            else:  # HuggingFace
                final_category = 'HF_ML_Detected'
                final_confidence = hf_conf
        else:
            final_label = 'APPROVE'
            final_category = 'None'
            # Use average confidence of all models for approval
            final_confidence = (hf_conf + spam_conf + 0.7) / 3

        results.append({
            'id': row['id'],
            'text': text,
            'pred_label': final_label,
            'pred_category': final_category,
            'confidence': round(final_confidence, 4),
            # Layer-specific results for analysis
            'hf_label': hf_label,
            'hf_confidence': round(hf_conf, 4),
            'spam_label': spam_label,
            'spam_confidence': round(spam_conf, 4),
            'spam_category': spam_category,
            'policy_label': policy_label,
            'policy_category': policy_category,
            'detection_layers': len(rejection_reasons)
        })

    return pd.DataFrame(results)

# Run inference if everything is ready
if models_loaded and input_data is not None and len(input_data) > 0:

    print("RUNNING INFERENCE")
    print("="*20)
    print(f"Processing {len(input_data)} reviews...")

    try:
        # Run the inference
        results = process_reviews(input_data)

        print(f"✅ Inference completed!")
        print(f"   Processed: {len(results)} reviews")

        # Create results directory
        results_dir = 'results/inference'
        os.makedirs(results_dir, exist_ok=True)

        # Save results
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        results_file = f"inference_results_{timestamp}.csv"
        results_path = os.path.join(results_dir, results_file)

        results.to_csv(results_path, index=False)

        print(f"\nRESULTS SUMMARY")
        print("="*20)

        # Summary statistics
        approve_count = len(results[results['pred_label'] == 'APPROVE'])
        reject_count = len(results[results['pred_label'] == 'REJECT'])
        avg_confidence = results['confidence'].mean()

        print(f"Total reviews: {len(results)}")
        print(f"APPROVE: {approve_count} ({approve_count/len(results)*100:.1f}%)")
        print(f"REJECT: {reject_count} ({reject_count/len(results)*100:.1f}%)")
        print(f"Average confidence: {avg_confidence:.3f}")

        # Category breakdown for rejected reviews
        if reject_count > 0:
            print(f"\nREJECT Categories:")
            reject_categories = results[results['pred_label'] == 'REJECT']['pred_category'].value_counts()
            for category, count in reject_categories.items():
                print(f"   {category}: {count} reviews")

        # Layer analysis
        print(f"\nDETECTION LAYER ANALYSIS:")
        layer_counts = results['detection_layers'].value_counts().sort_index()
        for layers, count in layer_counts.items():
            if layers == 0:
                print(f"   Clean (0 layers flagged): {count} reviews")
            else:
                print(f"   Flagged by {layers} layer(s): {count} reviews")

        # Model agreement analysis
        hf_rejects = len(results[results['hf_label'] == 'REJECT'])
        spam_rejects = len(results[results['spam_label'] == 'REJECT']) if spam_model_available else 0
        policy_rejects = len(results[results['policy_label'] == 'REJECT'])

        print(f"\nMODEL-SPECIFIC DETECTIONS:")
        print(f"   HuggingFace Model: {hf_rejects} rejections")
        if spam_model_available:
            print(f"   Spam Detector: {spam_rejects} rejections")
        else:
            print(f"   Spam Detector: Not available")
        print(f"   Policy Detection: {policy_rejects} rejections")

        # Show detailed results
        print(f"\nDETAILED RESULTS")
        print("="*70)

        display_df = results.copy()
        # Truncate text for display
        display_df['text'] = display_df['text'].apply(lambda x: x[:40] + "..." if len(x) > 40 else x)

        display_cols = ['id', 'text', 'pred_label', 'pred_category', 'confidence', 'detection_layers']
        print(display_df[display_cols].to_string(index=False))

        print(f"\n✅ Results saved to: {results_path}")
        print(f"\nSUCCESS: Inference complete!")
        print(f"Your reviews have been classified for policy violations.")

    except Exception as e:
        print(f"❌ Error during inference: {e}")
        import traceback
        traceback.print_exc()

else:
    print("❌ Cannot run inference")
    if not models_loaded:
        print("   Models not loaded properly")
    if input_data is None or len(input_data) == 0:
        print("   No input data available")

    print(f"\nPlease check:")
    print(f"1. Training notebook was run successfully")
    print(f"2. Input data is placed in data/actual/ directory")
    print(f"3. Input data has required 'text' column")

RUNNING INFERENCE
Processing 15 reviews...
✅ Inference completed!
   Processed: 15 reviews

RESULTS SUMMARY
Total reviews: 15
APPROVE: 11 (73.3%)
REJECT: 4 (26.7%)
Average confidence: 0.749

REJECT Categories:
   Rant_No_Visit: 3 reviews
   No_Ads: 1 reviews

DETECTION LAYER ANALYSIS:
   Clean (0 layers flagged): 11 reviews
   Flagged by 1 layer(s): 4 reviews

MODEL-SPECIFIC DETECTIONS:
   HuggingFace Model: 0 rejections
   Spam Detector: 0 rejections
   Policy Detection: 4 rejections

DETAILED RESULTS
 id                                        text pred_label pred_category  confidence  detection_layers
  1 Great food and excellent service! Will d...    APPROVE          None      0.7313                 0
  2 Use my promo code SAVE20 for 20% off you...     REJECT        No_Ads      0.8000                 1
  3 Bitcoin is going to the moon! Buy now be...    APPROVE          None      0.7286                 0
  4 Terrible place, worst food ever, total s...     REJECT Rant_No_Visit      0.
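
The threshold logic inside `policy_zero_shot_fused` can be exercised without loading any model by passing the zero-shot scores in directly. The sketch below is a simplified stand-in under stated assumptions: the function name `decide_policy` is hypothetical, the toxicity gate is omitted, only a subset of the ad patterns is included, and the scores are invented. Each score is checked against its own threshold, which matches the notebook's behavior at the default settings where both thresholds are 0.55.

```python
import re

# A subset of the notebook's ad-evidence patterns.
AD_REGEX = re.compile(
    r"https?://|\bwww\.|\bpromo(?:\s*code)?\b|\bcoupon\b|\bdm\s+(?:me|us)\b",
    flags=re.IGNORECASE,
)


def decide_policy(text, ads, irr, rant, tau_irr=0.55, tau_rant=0.55,
                  tau_ads=0.70, ads_margin=0.10):
    """Map zero-shot scores to (label, category); toxicity gate omitted."""
    # Irrelevant and rant checks come first and need no textual evidence.
    if irr >= tau_irr or rant >= tau_rant:
        return "REJECT", ("Irrelevant" if irr >= rant else "Rant_No_Visit")
    # Ads need a high score, a margin over the other scores, and a pattern hit.
    has_evidence = bool(AD_REGEX.search(text))
    if has_evidence and ads >= tau_ads and ads >= max(irr, rant) + ads_margin:
        return "REJECT", "No_Ads"
    return "APPROVE", "None"


print(decide_policy("Use my promo code SAVE20", ads=0.92, irr=0.10, rant=0.05))
print(decide_policy("Lovely dinner, great value", ads=0.92, irr=0.10, rant=0.05))
print(decide_policy("Worst ever, total scam", ads=0.05, irr=0.20, rant=0.81))
```

Requiring a pattern match in addition to a high ad score means a confident zero-shot prediction alone is not enough to reject a review as an advertisement.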

## 6. Results Analysis

In [15]:
# Advanced analysis of results (if available)
if 'results' in locals() and len(results) > 0:

    print("ADVANCED RESULTS ANALYSIS")
    print("="*30)

    # Confidence distribution
    high_conf = results[results['confidence'] >= 0.8]
    medium_conf = results[(results['confidence'] >= 0.6) & (results['confidence'] < 0.8)]
    low_conf = results[results['confidence'] < 0.6]

    print(f"Confidence Distribution:")
    print(f"   High (≥0.8): {len(high_conf)} reviews ({len(high_conf)/len(results)*100:.1f}%)")
    print(f"   Medium (0.6-0.8): {len(medium_conf)} reviews ({len(medium_conf)/len(results)*100:.1f}%)")
    print(f"   Low (<0.6): {len(low_conf)} reviews ({len(low_conf)/len(results)*100:.1f}%)")

    # Policy violations by type
    print(f"\nPolicy Violation Types:")
    category_counts = results['pred_category'].value_counts()
    for category, count in category_counts.items():
        percentage = count / len(results) * 100
        status = "Policy Violation" if category != "None" else "Clean Review"
        print(f"   {category}: {count} reviews ({percentage:.1f}%) - {status}")

    # Flag high-risk reviews
    high_risk = results[
        (results['pred_label'] == 'REJECT') &
        (results['confidence'] >= 0.8)
    ]

    if len(high_risk) > 0:
        print(f"\nHIGH-RISK REVIEWS (High confidence violations):")
        for idx, row in high_risk.iterrows():
            text_preview = row['text'][:60] + "..." if len(row['text']) > 60 else row['text']
            print(f"   ID {row['id']}: {row['pred_category']} ({row['confidence']:.3f}) - {text_preview}")

    # Export summary report
    summary_report = {
        'timestamp': datetime.now().isoformat(),
        'total_reviews': len(results),
        'approve_count': len(results[results['pred_label'] == 'APPROVE']),
        'reject_count': len(results[results['pred_label'] == 'REJECT']),
        'average_confidence': float(results['confidence'].mean()),
        'high_confidence_count': len(high_conf),
        'category_breakdown': category_counts.to_dict(),
        'high_risk_reviews': len(high_risk)
    }

    summary_path = os.path.join(results_dir, f"summary_report_{timestamp}.json")
    with open(summary_path, 'w') as f:
        json.dump(summary_report, f, indent=2)

    print(f"\n✅ Summary report saved: {summary_path}")

    print(f"\nINFERENCE COMPLETE")
    print(f"Files created:")
    print(f"   {results_path} - Detailed results")
    print(f"   {summary_path} - Summary report")

else:
    print("No results available for analysis")
    print("Run the inference cell first to generate results")

ADVANCED RESULTS ANALYSIS
Confidence Distribution:
   High (≥0.8): 4 reviews (26.7%)
   Medium (0.6-0.8): 11 reviews (73.3%)
   Low (<0.6): 0 reviews (0.0%)

Policy Violation Types:
   None: 11 reviews (73.3%) - Clean Review
   Rant_No_Visit: 3 reviews (20.0%) - Policy Violation
   No_Ads: 1 reviews (6.7%) - Policy Violation

HIGH-RISK REVIEWS (High confidence violations):
   ID 2: No_Ads (0.800) - Use my promo code SAVE20 for 20% off your next order! DM me ...
   ID 4: Rant_No_Visit (0.800) - Terrible place, worst food ever, total scam and ripoff.
   ID 8: Rant_No_Visit (0.800) - The government is corrupt and politicians are ruining this c...
   ID 9: Rant_No_Visit (0.800) - Overpriced garbage. Never going back. Complete waste of mone...

✅ Summary report saved: results/inference/summary_report_20250830_214431.json

INFERENCE COMPLETE
Files created:
   results/inference/inference_results_20250830_214431.csv - Detailed results
   results/inference/summary_report_20250830_214431.json - 