# **Problem Statement 1**  
### **Filtering the Noise: ML for Trustworthy Location Reviews**  
**Team 3Pandas** *(Tran Ha My, Diane Teo Min Xuan, Ng Yuen Ning)*  

---

## **Problem Statement**  
Design and implement an **ML-based system** to evaluate the **quality** and **relevancy** of Google location reviews. The system should:  

- **Gauge review quality:** Detect spam, advertisements, irrelevant content, and rants from users who have likely never visited the location.  
- **Assess relevancy:** Determine whether the content of a review is genuinely related to the location being reviewed.  
- **Enforce policies:** Automatically flag or filter out reviews that violate the following example policies:  
  - No advertisements or promotional content.  
  - No irrelevant content (e.g., reviews about unrelated topics).  
  - No rants or complaints from users who have not visited the place (can be inferred from content, metadata, or other signals).  

---

## **Motivation & Impact**  
- **For Users:** Increases trust in location-based reviews, leading to better decision-making.  
- **For Businesses:** Ensures fair representation and reduces the impact of malicious or irrelevant reviews.  
- **For Platforms:** Automates moderation, reduces manual workload, and enhances platform credibility.  

---

## **Data Sources**  

| **Data Sources**       | **Details** |
|-------------------------|-------------|
| **Public Datasets**    | - **Google Review Data:** Open datasets containing Google location reviews (e.g., [Google Local Reviews on Kaggle](https://www.kaggle.com/datasets/denizbilginn/google-maps-restaurant-reviews))<br>- **Google Local review data:** [UCSD Public Dataset](https://mcauleylab.ucsd.edu/public_datasets/gdrive/googlelocal/)<br>- **Alternative Sources:** Yelp, TripAdvisor, or other open review datasets for supplementary training. |
| **Student-Crawled Data** | - Students are encouraged to crawl additional reviews from Google Maps (in compliance with Google's terms of service).<br>- **Example:** [Scraping Google Reviews (YouTube)](https://www.youtube.com/watch?v=LYMdZ7W9bWQ) |


### Dependencies

In [12]:
# ! pip install transformers
! pip install tqdm






In [None]:
# ! pip install tldextract
# ! pip install textblob
import re

import pandas as pd
import tldextract
import torch
from textblob import TextBlob
from tqdm import tqdm
from transformers import pipeline


### 1. Load Data

### 2. Pre-Process DataFrames

##### 2.1 Cleaning Functions

In [8]:
def normalize_whitespace(text):
    # Collapse runs of whitespace into single spaces and trim both ends
    return re.sub(r'\s+', ' ', text).strip()

def clean_urls(text):
    url_pattern = re.compile(r'https?://[^\s]+')
    # Replace each URL with its own domain so the domain survives as a token.
    # Substituting per match avoids pasting every domain in place of every URL.
    return url_pattern.sub(lambda m: tldextract.extract(m.group(0)).domain, text)

def clean_text(text):
    text = normalize_whitespace(text)
    text = clean_urls(text)
    return text

##### 2.2 Compute Basic Signals

In [9]:
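def compute_basic_signals(row):
    # Signals are read from the raw text: clean_text replaces URLs with domain tokens,
    # so URLs can no longer be counted after cleaning.
    url_count = len(re.findall(r'https?://\S+', row['text']))
    # Loose phone pattern: optional '+', then at least 9 digits, spaces, or hyphens
    phone_count = len(re.findall(r'\+?\d[\d\s-]{7,}\d', row['text']))
    # Share of characters that are uppercase; max(..., 1) guards against empty text
    caps_ratio = sum(1 for c in row['text'] if c.isupper()) / max(len(row['text']), 1)
    return url_count, phone_count, caps_ratio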
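
To see what these signals pick up, the sketch below applies the same two regular expressions and the same uppercase ratio to plain strings. The name `basic_signals` and both sample reviews are hypothetical, introduced only for illustration; the notebook's own function takes a DataFrame row rather than a string.

```python
import re

def basic_signals(text):
    # Hypothetical string-based variant of the row-based function,
    # using the same patterns and the same ratio.
    url_count = len(re.findall(r'https?://\S+', text))
    phone_count = len(re.findall(r'\+?\d[\d\s-]{7,}\d', text))
    caps_ratio = sum(1 for c in text if c.isupper()) / max(len(text), 1)
    return url_count, phone_count, caps_ratio

# Made-up promotional review: one URL, one phone number, some shouting
promo = "BEST DEALS at https://example.com call +65 9123 4567 now"
promo_urls, promo_phones, promo_caps = basic_signals(promo)

# Made-up genuine review that happens to contain a date
dated = "Visited on 2023-01-15 and the food was great"
dated_urls, dated_phones, dated_caps = basic_signals(dated)

print(promo_urls, promo_phones, round(promo_caps, 3))
print(dated_urls, dated_phones, round(dated_caps, 3))
```

The second review shows a limitation of the loose phone pattern: a hyphenated date such as `2023-01-15` also matches, so `phone_count` is better treated as a weak signal than as proof of a phone number.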

##### 2.3 Toxicity Signalling and Sentiment Analysis

In [17]:
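# Loads the unitary/toxic-bert checkpoint; long reviews are truncated to the model's input limit
toxicity_pipeline = pipeline("text-classification", model="unitary/toxic-bert", truncation=True)

def compute_toxicity_scores_batch(texts, batch_size=16):
    scores = []
    # Score the reviews in fixed-size chunks to keep memory use bounded
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        results = toxicity_pipeline(batch, truncation=True)
        # Keep the score of the top-ranked label returned for each text
        scores.extend([r['score'] for r in results])
    return scores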

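
The chunking logic can be checked without downloading the model. The sketch below is a minimal illustration that swaps the Hugging Face pipeline for a hypothetical stand-in scorer (`fake_scorer`); `score_in_batches` and the sample reviews are also assumptions made for this example. It shows that every text receives exactly one score, in input order, and that the last batch may be smaller than `batch_size`.

```python
from typing import Callable, List, Sequence

def score_in_batches(
    texts: Sequence[str],
    score_batch: Callable[[List[str]], List[float]],
    batch_size: int = 16,
) -> List[float]:
    """Apply score_batch to consecutive chunks of texts and concatenate the results."""
    scores: List[float] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        scores.extend(score_batch(batch))
    return scores

seen_batch_sizes: List[int] = []

def fake_scorer(batch: List[str]) -> List[float]:
    # Hypothetical stand-in for the toxicity model: records the batch size
    # and returns a length-based dummy score for each text.
    seen_batch_sizes.append(len(batch))
    return [min(len(text) / 100, 1.0) for text in batch]

reviews = [f"review number {i}" for i in range(35)]
scores = score_in_batches(reviews, fake_scorer, batch_size=16)

print(len(scores), seen_batch_sizes)
```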

In [None]:
def get_textblob_sentiment(text):
    if pd.isna(text) or not isinstance(text, str) or text.strip() == "":
        return 0.0, 0.0

    try:
        analysis = TextBlob(text)
        polarity = analysis.sentiment.polarity
        subjectivity = analysis.sentiment.subjectivity
        return polarity, subjectivity
    except Exception:
        return 0.0, 0.0

sentiment_results = df["clean_text"].apply(get_textblob_sentiment)
df["sentiment_polarity"], df["sentiment_subjectivity"] = zip(*sentiment_results)

# =====================
# FLAG EXTREME SENTIMENTS
# =====================

# Polarity lies in [-1, 1]; mark reviews at or beyond either threshold (rows are kept, not dropped)
positive_threshold = 0.8
negative_threshold = -0.8
df["is_extreme_sentiment"] = (
    (df["sentiment_polarity"] >= positive_threshold)
    | (df["sentiment_polarity"] <= negative_threshold)
).astype(int)

##### 2.4 Apply to DataFrame

In [None]:
def preprocess_reviews(df, timestamp_col="timestamp"):
    # Clean text
    df["clean_text"] = df["text"].apply(clean_text)

    # Compute basic signals; compute_basic_signals takes the whole row and reads row["text"]
    signals = df.apply(compute_basic_signals, axis=1)
    df["url_count"], df["phone_count"], df["caps_ratio"] = zip(*signals)

    # Compute toxicity
    df["toxicity_score"] = compute_toxicity_scores_batch(df["clean_text"].tolist())

    # Ensure timestamp is datetime
    df[timestamp_col] = pd.to_datetime(df[timestamp_col])

    return df


### 3. Time-Based Split

In [None]:
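def split_time_based(df, timestamp_col="timestamp"):
    # All cutoffs are measured back from the most recent review
    max_time = df[timestamp_col].max()
    cut_train = max_time - pd.DateOffset(years=2)   # start of the training window
    cut_val = max_time - pd.DateOffset(months=6)    # start of the validation window
    cut_test = max_time - pd.DateOffset(months=3)   # start of the test window

    # Train: [max - 2y, max - 6m), validation: [max - 6m, max - 3m), test: the last 3 months.
    # Reviews older than two years fall outside all three splits.
    train_df = df[(df[timestamp_col] >= cut_train) & (df[timestamp_col] < cut_val)]
    val_df = df[(df[timestamp_col] >= cut_val) & (df[timestamp_col] < cut_test)]
    test_df = df[df[timestamp_col] >= cut_test]

    return train_df, val_df, test_df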
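
As a sanity check on the window boundaries, the sketch below runs the same split on a toy frame with one review per month. The toy data is an assumption made for illustration, and the function is repeated so the block runs on its own.

```python
import pandas as pd

def split_time_based(df, timestamp_col="timestamp"):
    # Same logic as the split above, repeated here for a self-contained example
    max_time = df[timestamp_col].max()
    cut_train = max_time - pd.DateOffset(years=2)
    cut_val = max_time - pd.DateOffset(months=6)
    cut_test = max_time - pd.DateOffset(months=3)

    train_df = df[(df[timestamp_col] >= cut_train) & (df[timestamp_col] < cut_val)]
    val_df = df[(df[timestamp_col] >= cut_val) & (df[timestamp_col] < cut_test)]
    test_df = df[df[timestamp_col] >= cut_test]
    return train_df, val_df, test_df

# Toy data: one review on the first day of each month, 2021-01 through 2024-01
toy = pd.DataFrame({"timestamp": pd.date_range("2021-01-01", "2024-01-01", freq="MS")})
train_df, val_df, test_df = split_time_based(toy)

# Rows older than two years are not assigned to any split
dropped = len(toy) - len(train_df) - len(val_df) - len(test_df)
print(len(train_df), len(val_df), len(test_df), dropped)
```

With the latest review on 2024-01-01, the training window covers 2022-01 through 2023-06, validation covers 2023-07 through 2023-09, and test covers 2023-10 onward. The twelve reviews from 2021 are left out of every split, which is worth knowing when the dataset spans many years.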