# Review Scraper Operations Guide

This guide covers all operations for the review scraper system, including the new language standardization features.

## Pipeline Overview

The review processing pipeline now has three main stages:

1. **Scrape**: Extract reviews from Google Maps and Trustpilot
2. **Unify**: Combine raw reviews into standardized format in `unified_reviews` collection
3. **Standardize**: Translate non-English content to English in `ls_unified_reviews` collection

Each stage is incremental - only new data is processed.
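
The incremental behaviour can be pictured as a set difference between the IDs in a source collection and the IDs already present in its target. The sketch below is an illustration only: plain dictionaries stand in for the MongoDB collections, and `process_incremental` is a hypothetical helper, not the project's actual implementation.

```python
from typing import Callable, Dict

Review = Dict[str, object]


def process_incremental(
    source: Dict[str, Review],
    target: Dict[str, Review],
    transform: Callable[[Review], Review],
) -> int:
    """Apply `transform` only to reviews whose ID is missing from `target`.

    `source` and `target` are hypothetical stand-ins for collections,
    keyed by review ID. Returns the number of newly processed reviews.
    """
    new_ids = source.keys() - target.keys()
    for review_id in new_ids:
        target[review_id] = transform(source[review_id])
    return len(new_ids)


raw = {"r1": {"text": "Great"}, "r2": {"text": "Bad"}}
unified: Dict[str, Review] = {}

first_run = process_incremental(raw, unified, dict)   # both reviews are new
second_run = process_incremental(raw, unified, dict)  # nothing left to process
print(first_run, second_run)
```

Running the same stage twice is therefore cheap: the second call finds no new IDs and does no work.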

# 1. Scrape Reviews

In [None]:
python operations_controller.py scrape --excel ../database/establishments.xlsx

# 2. Unify Reviews (Incremental)

In [None]:
# Unify all new reviews (incremental - only processes new reviews)
python operations_controller.py unify

# Unify specific establishments
python operations_controller.py unify --establishments "id1,id2,id3"

# Quick mode (minimal output)
python operations_controller.py unify --quick

# 3. Standardize Reviews (Language Translation) - NEW!

In [None]:
# Standardize all new reviews (incremental - only processes new reviews)
python operations_controller.py standardize

# Standardize specific establishments
python operations_controller.py standardize --establishments "id1,id2,id3"

# Quick mode (minimal output)
python operations_controller.py standardize --quick

## What Language Standardization Does:

- **Detects language** of owner responses using langdetect
- **For Google reviews**: Translates `response_from_owner_text` if not English
- **For Trustpilot reviews**: 
  - Translates `title` + `review_text` if `review_language` is not English
  - Translates `response_from_owner_text` if detected language is not English
- **Adds new field**: `response_from_owner_language` for all reviews
- **Uses Google Gemini** for translations
- **Caches translations** to avoid duplicate API calls
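
The rules above can be sketched as a single function. This is a minimal illustration under stated assumptions: `detect` and `translate` are injected placeholders for langdetect and the Gemini-backed translator, the `platform` key and the `"en"` language code are assumed, and `title` and `review_text` are translated separately here although the real system may combine them.

```python
from typing import Callable, Dict, Optional

Review = Dict[str, Optional[str]]


def standardize_review(
    review: Review,
    detect: Callable[[str], str],
    translate: Callable[[str], str],
) -> Review:
    """Return a copy of `review` with non-English fields translated."""
    result = dict(review)

    # Trustpilot only: translate title and body when the review is not English.
    if review.get("platform") == "trustpilot" and review.get("review_language") != "en":
        for field in ("title", "review_text"):
            if review.get(field):
                result[field] = translate(review[field])

    # Both platforms: detect the owner response language, translate if needed.
    response = review.get("response_from_owner_text")
    language = detect(response) if response else None
    result["response_from_owner_language"] = language
    if response and language != "en":
        result["response_from_owner_text"] = translate(response)
    return result


# Stand-ins for the real detector and translator (assumptions for the demo).
def fake_detect(text: str) -> str:
    return "de" if "Danke" in text else "en"


def fake_translate(text: str) -> str:
    return f"[en] {text}"


google_review = {
    "platform": "google",
    "review_text": "Sehr gut",
    "response_from_owner_text": "Danke für Ihren Besuch",
}
standardized = standardize_review(google_review, fake_detect, fake_translate)
print(standardized)
```

For the Google review in the demo, only the owner response is translated; the review text is left untouched.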

## Full Rebuild (When Needed)

If you need to rebuild entire collections from scratch:

In [None]:
# Rebuild the unified_reviews collection
# Option 1: Delete the collection via MongoDB Compass/CLI, then run:
python operations_controller.py unify

# Rebuild the ls_unified_reviews collection
# Option 1: Delete the collection via MongoDB Compass/CLI, then run:
python operations_controller.py standardize

# Option 2: Drop both collections from the MongoDB shell (if you have access)
# db.unified_reviews.drop()
# db.ls_unified_reviews.drop()
# Then run, in this order:
#   python operations_controller.py unify
#   python operations_controller.py standardize
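
If the drop step needs to be scripted, a guarded helper along these lines may be useful. It is a sketch, not part of the project: `drop_processed_collections` is hypothetical, and it accepts any object with a `drop_collection(name)` method, which is the shape of a pymongo `Database`. The guard refuses to touch anything other than the two processed collections, so the raw data cannot be dropped by accident.

```python
from typing import Iterable, List, Protocol

PROCESSED_COLLECTIONS = ("unified_reviews", "ls_unified_reviews")


class SupportsDropCollection(Protocol):
    def drop_collection(self, name: str) -> object: ...


def drop_processed_collections(
    db: SupportsDropCollection,
    names: Iterable[str] = PROCESSED_COLLECTIONS,
) -> List[str]:
    """Drop only the derived collections; raise if asked to drop anything else."""
    requested = list(names)
    unexpected = [name for name in requested if name not in PROCESSED_COLLECTIONS]
    if unexpected:
        raise ValueError(f"Refusing to drop non-derived collections: {unexpected}")
    for name in requested:
        db.drop_collection(name)
    return requested


class FakeDatabase:
    """In-memory stand-in so the sketch runs without a MongoDB server."""

    def __init__(self) -> None:
        self.dropped: List[str] = []

    def drop_collection(self, name: str) -> None:
        self.dropped.append(name)


fake_db = FakeDatabase()
dropped = drop_processed_collections(fake_db)
print(dropped)
```

With pymongo installed, `db` would be something like `MongoClient(connection_string)[database_name]`, where both values are placeholders for your own deployment. Run `unify` and then `standardize` afterwards, as described above.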

# 4. Show Statistics

In [None]:
python operations_controller.py stats

## Statistics Now Include:

- **Raw collections**: google, trustpilot, establishments counts
- **Unified reviews**: Total count and platform breakdown with average ratings
- **Language standardized reviews**: Total count, platform breakdown, owner response counts
- **Response language breakdown**: Shows detected languages in owner responses
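
As an illustration of what a platform breakdown with average ratings involves, the sketch below aggregates a list of review dictionaries in memory. The `platform` and `rating` field names and the `platform_breakdown` helper are assumptions; the real `stats` command presumably queries MongoDB instead.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def platform_breakdown(
    reviews: Iterable[Mapping[str, object]],
) -> Dict[str, Dict[str, float]]:
    """Count reviews and average their ratings per platform."""
    totals = defaultdict(lambda: {"count": 0, "rating_sum": 0.0})
    for review in reviews:
        bucket = totals[review["platform"]]
        bucket["count"] += 1
        bucket["rating_sum"] += float(review["rating"])
    return {
        platform: {
            "count": bucket["count"],
            "average_rating": round(bucket["rating_sum"] / bucket["count"], 2),
        }
        for platform, bucket in totals.items()
    }


sample = [
    {"platform": "google", "rating": 5},
    {"platform": "google", "rating": 4},
    {"platform": "trustpilot", "rating": 3},
]
breakdown = platform_breakdown(sample)
print(breakdown)
```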

# 5. Combined Operations

In [None]:
# Scrape + Unify
python operations_controller.py scrape-and-unify --excel ../database/establishments.xlsx
python operations_controller.py scrape-and-unify --excel ../database/establishments.xlsx --quick-unify

In [None]:
# Full Pipeline: Scrape + Unify + Standardize (NEW!)
python operations_controller.py full-pipeline --excel ../database/establishments.xlsx

# With quick modes for faster processing
python operations_controller.py full-pipeline --excel ../database/establishments.xlsx --quick-unify --quick-standardize

# 6. Verbose Mode

In [None]:
# Add -v or --verbose to any command for detailed logging
python operations_controller.py unify --verbose
python operations_controller.py standardize --verbose
python operations_controller.py scrape --excel ../database/establishments.xlsx --verbose
python operations_controller.py stats --verbose
python operations_controller.py full-pipeline --excel ../database/establishments.xlsx --verbose
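
One common way to make a flag such as `-v`/`--verbose` available on every subcommand is a shared parent parser in `argparse`. The sketch below shows that pattern only; it is an assumption about how such a CLI could be wired, not a copy of `operations_controller.py`, and it omits the other options (`--excel`, `--quick`, and so on).

```python
import argparse
import logging


def build_parser() -> argparse.ArgumentParser:
    # Options defined on `common` are inherited by every subcommand.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("-v", "--verbose", action="store_true", help="detailed logging")

    parser = argparse.ArgumentParser(prog="operations_controller.py")
    subcommands = parser.add_subparsers(dest="command", required=True)
    for name in ("scrape", "unify", "standardize", "stats"):
        subcommands.add_parser(name, parents=[common])
    return parser


def log_level(args: argparse.Namespace) -> int:
    """Map the verbose flag to a logging level."""
    return logging.DEBUG if args.verbose else logging.INFO


verbose_args = build_parser().parse_args(["unify", "--verbose"])
quiet_args = build_parser().parse_args(["stats"])
print(verbose_args.command, log_level(verbose_args) == logging.DEBUG)
```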

# Usage Examples

## Daily Operations

In [None]:
# Quick daily processing (only new reviews)
python operations_controller.py unify --quick
python operations_controller.py standardize --quick

# Or combined:
python operations_controller.py unify --quick && python operations_controller.py standardize --quick

# Check database status
python operations_controller.py stats

# Full pipeline for new establishments
python operations_controller.py full-pipeline --excel new_establishments.xlsx --quick-unify --quick-standardize

## Maintenance Operations

In [None]:
# Check current database status and statistics
python operations_controller.py stats

# Re-process all reviews (after deleting collections)
# First: Delete unified_reviews and ls_unified_reviews collections via MongoDB Compass
# Then: python operations_controller.py unify
# Then: python operations_controller.py standardize

# Verbose troubleshooting
python operations_controller.py standardize --verbose

## Targeted Operations

In [None]:
# Process specific establishments only
python operations_controller.py unify --establishments "687a51385c7e5bb6b9c1a5d6,another_id"
python operations_controller.py standardize --establishments "687a51385c7e5bb6b9c1a5d6,another_id"

# Re-scrape specific establishments (add them to a new Excel file)
python operations_controller.py scrape --excel specific_establishments.xlsx
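
The `--establishments` value is a single comma-separated string. How the controller parses it is not documented here; a tolerant parser could look like the following sketch, where `parse_establishments` is a hypothetical helper that trims whitespace, skips empty entries, and removes duplicates while keeping the original order.

```python
from typing import List


def parse_establishments(value: str) -> List[str]:
    """Split a comma-separated ID string into a clean, de-duplicated list."""
    ids: List[str] = []
    for part in value.split(","):
        part = part.strip()
        if part and part not in ids:
            ids.append(part)
    return ids


parsed = parse_establishments("687a51385c7e5bb6b9c1a5d6, another_id,,another_id")
print(parsed)
```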

## Programmatic Usage

In [None]:
from engine.operations_controller import OperationsController
from database.db_manager import DatabaseManager

# Use the operations controller directly
controller = OperationsController(verbose=False)
controller.initialize()

# Unify reviews quietly
success = controller.unify_reviews(quick=True)

# Standardize reviews quietly
success = controller.standardize_reviews(quick=True)

# Get statistics
controller.show_statistics()

# Clean up
controller.cleanup()

# Use the database manager directly
db_manager = DatabaseManager()
mongodb_connection = "your_connection_string"
db_manager.connect(mongodb_connection)

# Run incremental unification
unify_results = db_manager.unify_reviews_incremental()
print(f"Unified: {unify_results}")

# Run incremental standardization
standardize_results = db_manager.standardize_reviews_incremental()
print(f"Standardized: {standardize_results}")

# Get stats
unified_stats = db_manager.get_unified_reviews_stats()
ls_stats = db_manager.get_ls_unified_reviews_stats()
print(f"Unified Stats: {unified_stats}")
print(f"Language Standardized Stats: {ls_stats}")

db_manager.close_connection()
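
In the example above, `cleanup()` is skipped if any step raises. One way to guarantee cleanup is a small context manager, sketched below. `managed` is a hypothetical wrapper that relies only on the `initialize()` and `cleanup()` methods shown above; `FakeController` is a stand-in so the sketch runs without the project installed.

```python
from contextlib import contextmanager
from typing import Iterator, List


@contextmanager
def managed(controller) -> Iterator[object]:
    """Initialize the controller and always clean up, even on errors."""
    controller.initialize()
    try:
        yield controller
    finally:
        controller.cleanup()


class FakeController:
    """Records calls instead of touching a database (demo only)."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def initialize(self) -> None:
        self.calls.append("initialize")

    def unify_reviews(self, quick: bool = False) -> bool:
        self.calls.append("unify_reviews")
        return True

    def cleanup(self) -> None:
        self.calls.append("cleanup")


demo = FakeController()
with managed(demo) as controller:
    ok = controller.unify_reviews(quick=True)
print(demo.calls)
```

With the real class this would read `with managed(OperationsController(verbose=False)) as controller:`.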

# Common Workflows

## Adding New Establishments

In [None]:
# 1. Add new establishments to Excel file
# 2. Run full pipeline
python operations_controller.py full-pipeline --excel new_establishments.xlsx

# 3. Check results
python operations_controller.py stats

## Regular Data Updates

In [None]:
# Daily: Process any new reviews that were scraped
python operations_controller.py unify --quick
python operations_controller.py standardize --quick

# Weekly: Full statistics review
python operations_controller.py stats

# Monthly: Re-scrape existing establishments (use same Excel file)
python operations_controller.py scrape --excel ../database/establishments.xlsx
python operations_controller.py unify
python operations_controller.py standardize

## Troubleshooting

In [None]:
# Debug with verbose logging
python operations_controller.py standardize --verbose

# Check if processing is working properly
python operations_controller.py stats

# Process specific problematic establishments
python operations_controller.py standardize --establishments "problematic_id" --verbose

# Full rebuild if needed (after backing up data)
# 1. Delete unified_reviews and ls_unified_reviews collections in MongoDB
# 2. python operations_controller.py unify --verbose
# 3. python operations_controller.py standardize --verbose

## Managing Translation API Costs

In [None]:
# Check current counts before running (compare unified vs. language-standardized totals to gauge the backlog)
python operations_controller.py stats

# Process in smaller batches to control costs
python operations_controller.py standardize --establishments "est1,est2,est3" --verbose

# Run standardization with careful monitoring
# The system automatically caches translations to avoid duplicates
# Language detection (langdetect) runs first to minimize API calls
python operations_controller.py standardize --quick
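
Batching by establishment can be scripted. The sketch below only builds the command lines, using the `standardize --establishments` form shown above; `standardize_commands` and the batch size are illustrative assumptions.

```python
from typing import Iterator, List, Sequence


def standardize_commands(
    establishment_ids: Sequence[str],
    batch_size: int = 3,
) -> Iterator[List[str]]:
    """Yield one `standardize` command per batch of establishment IDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(establishment_ids), batch_size):
        batch = establishment_ids[start:start + batch_size]
        yield [
            "python", "operations_controller.py", "standardize",
            "--establishments", ",".join(batch),
            "--verbose",
        ]


commands = list(standardize_commands(["est1", "est2", "est3", "est4"], batch_size=3))
for command in commands:
    print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)`, with a pause or a `stats` check between batches to keep an eye on API usage.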

# Performance Tips

## General Performance
- **Use `--quick` for daily operations** to reduce output and improve speed
- **Process in batches**: The system automatically batches 1000 reviews at a time
- **Incremental by default**: Only new data is processed, making regular runs fast
- **Use verbose mode only for debugging** as it generates more I/O
- **Monitor with stats**: Regular stats checks help identify issues early
- **Indexes are auto-created**: The system creates optimal indexes automatically

## Language Standardization Specific
- **Language detection is fast**: langdetect runs locally with no API costs
- **Translation caching**: Identical texts are translated only once
- **Smart filtering**: Only non-English content is sent to translation API
- **Batch processing**: Short texts may be batched for API efficiency
- **Graceful failures**: If translation fails, original text is preserved
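
Caching and graceful failure can be combined in a thin wrapper around whatever function performs the translation. The sketch below is an assumption about the general pattern, not the project's code: `CachedTranslator` is hypothetical and `fake_backend` simulates the API.

```python
from typing import Callable, Dict


class CachedTranslator:
    """Cache successful translations; fall back to the original text on error."""

    def __init__(self, translate: Callable[[str], str]) -> None:
        self._translate = translate
        self._cache: Dict[str, str] = {}
        self.api_calls = 0

    def __call__(self, text: str) -> str:
        if text in self._cache:
            return self._cache[text]
        self.api_calls += 1
        try:
            translated = self._translate(text)
        except Exception:
            # Graceful failure: keep the original and do not cache it,
            # so a later run can retry the translation.
            return text
        self._cache[text] = translated
        return translated


def fake_backend(text: str) -> str:
    """Simulated translation API that fails for one specific input."""
    if text == "boom":
        raise RuntimeError("simulated API failure")
    return text.upper()


translator = CachedTranslator(fake_backend)
first = translator("hola")
second = translator("hola")   # served from the cache, no extra API call
failed = translator("boom")   # backend raises, original text is returned
print(first, second, failed, translator.api_calls)
```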

## API Cost Optimization
- **Length filtering**: Very short texts (<5 chars) are not processed
- **Language filtering**: English content is automatically skipped
- **Deduplication**: Identical responses are translated only once
- **Incremental processing**: Only new reviews need translation
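
The length, language, and deduplication filters above can be sketched as one pass that decides which texts are worth sending to the API. The `texts_to_translate` helper and the `"en"` language code are assumptions; the 5-character threshold comes from the list above.

```python
from typing import Iterable, List, Tuple

MIN_LENGTH = 5  # texts shorter than this are skipped


def texts_to_translate(items: Iterable[Tuple[str, str]]) -> List[str]:
    """Return unique texts that need translation, in first-seen order.

    `items` yields (text, detected_language) pairs.
    """
    seen = set()
    pending: List[str] = []
    for text, language in items:
        if len(text.strip()) < MIN_LENGTH:
            continue  # length filtering
        if language == "en":
            continue  # language filtering
        if text in seen:
            continue  # deduplication
        seen.add(text)
        pending.append(text)
    return pending


candidates = [
    ("Danke!", "de"),
    ("Thanks a lot", "en"),
    ("Ok", "de"),
    ("Danke!", "de"),
    ("Merci beaucoup", "fr"),
]
pending = texts_to_translate(candidates)
print(pending)
```

Of the five candidate texts in the demo, only two would reach the translation API.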

# Database Collections

## Raw Collections
- **`google`**: Raw Google Maps reviews
- **`trustpilot`**: Raw Trustpilot reviews
- **`establishments`**: Business information

## Processed Collections
- **`unified_reviews`**: Standardized format, mixed languages
- **`ls_unified_reviews`**: Language standardized (English) format

## Key Differences in `ls_unified_reviews`
- **New field**: `response_from_owner_language` (detected language)
- **Translated content**: Non-English reviews and responses are in English
- **For Google**: Only `response_from_owner_text` is translated if needed
- **For Trustpilot**: Both review content and owner responses are translated if needed
- **Preservation**: Language indicators (`review_language`, `response_from_owner_language`) are kept so the original language remains identifiable

# Requirements

Make sure to install the additional requirements for language standardization:

```bash
pip install langdetect google-generativeai
```

And ensure you have the Google API key file:
- `tokens/google_api_key.txt`
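
A quick pre-flight check can confirm that the key file exists and is not empty before a long run. The sketch below assumes the file holds the key as plain text on a single line; `load_api_key` is a hypothetical helper, and the demo writes a dummy key to a temporary directory rather than reading a real one.

```python
import tempfile
from pathlib import Path


def load_api_key(path: str = "tokens/google_api_key.txt") -> str:
    """Read the API key, failing early with a clear message if it is unusable."""
    key_file = Path(path)
    if not key_file.is_file():
        raise FileNotFoundError(f"API key file not found: {key_file}")
    key = key_file.read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"API key file is empty: {key_file}")
    return key


# Demo with a throwaway file so no real credentials are involved.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "google_api_key.txt"
    demo_file.write_text("example-key\n", encoding="utf-8")
    demo_key = load_api_key(str(demo_file))
print(demo_key)
```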