# **Problem Statement 1**  
### **Filtering the Noise: ML for Trustworthy Location Reviews**  
**Team 3Pandas** *(Tran Ha My, Diane Teo Min Xuan, Ng Yuen Ning)*  

---

## **Problem Statement**  
Design and implement an **ML-based system** to evaluate the **quality** and **relevancy** of Google location reviews. The system should:  

- **Gauge review quality:** Detect spam, advertisements, irrelevant content, and rants from users who have likely never visited the location.  
- **Assess relevancy:** Determine whether the content of a review is genuinely related to the location being reviewed.  
- **Enforce policies:** Automatically flag or filter out reviews that violate the following example policies:  
  - No advertisements or promotional content.  
  - No irrelevant content (e.g., reviews about unrelated topics).  
  - No rants or complaints from users who have not visited the place (can be inferred from content, metadata, or other signals).  
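
The three example policies can be turned into a filter decision along the following lines. This is a minimal sketch, assuming each review already carries boolean `is_ad`, `is_relevant`, and `is_rant` labels (the label names used later in this notebook). The function name `policy_violations` and the violation strings are hypothetical, and treating `is_rant` as "rant from a non-visitor" is an assumption.

```python
def policy_violations(labels):
    """Return the example policies that a labelled review violates."""
    violations = []
    if labels.get("is_ad"):
        violations.append("advertisement")
    # A review with no relevance label is treated as relevant (assumption)
    if not labels.get("is_relevant", True):
        violations.append("irrelevant")
    if labels.get("is_rant"):
        violations.append("rant_without_visit")
    return violations


review_labels = {"is_ad": True, "is_relevant": False, "is_rant": False}
print(policy_violations(review_labels))
```

A review would then be flagged whenever the returned list is non-empty.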

---

## **Motivation & Impact**  
- **For Users:** Increases trust in location-based reviews, leading to better decision-making.  
- **For Businesses:** Ensures fair representation and reduces the impact of malicious or irrelevant reviews.  
- **For Platforms:** Automates moderation, reduces manual workload, and enhances platform credibility.  

---

## **Data Sources**  

| **Data Sources**       | **Details** |
|-------------------------|-------------|
| **Public Datasets**    | - **Google Review Data:** Open datasets containing Google location reviews (e.g., [Google Local Reviews on Kaggle](https://www.kaggle.com/datasets/denizbilginn/google-maps-restaurant-reviews))<br>- **Google Local review data:** [UCSD Public Dataset](https://mcauleylab.ucsd.edu/public_datasets/gdrive/googlelocal/)<br>- **Alternative Sources:** Yelp, TripAdvisor, or other open review datasets for supplementary training. |
| **Student-Crawled Data** | - Students are encouraged to crawl additional reviews from Google Maps (in compliance with Google's terms of service).<br>- **Example:** [Scraping Google Reviews (YouTube)](https://www.youtube.com/watch?v=LYMdZ7W9bWQ) |


### Dependencies

In [64]:
! pip install torch==2.7.1+cpu torchvision==0.22.1+cpu torchaudio==2.7.1+cpu --index-url https://download.pytorch.org/whl/cpu


Looking in indexes: https://download.pytorch.org/whl/cpu
Collecting torch==2.7.1+cpu
  Downloading https://download.pytorch.org/whl/cpu/torch-2.7.1%2Bcpu-cp312-cp312-win_amd64.whl.metadata (27 kB)
Collecting torchvision==0.22.1+cpu
  Using cached https://download.pytorch.org/whl/cpu/torchvision-0.22.1%2Bcpu-cp312-cp312-win_amd64.whl.metadata (6.3 kB)
Collecting torchaudio==2.7.1+cpu
  Using cached https://download.pytorch.org/whl/cpu/torchaudio-2.7.1%2Bcpu-cp312-cp312-win_amd64.whl.metadata (6.8 kB)
Downloading https://download.pytorch.org/whl/cpu/torch-2.7.1%2Bcpu-cp312-cp312-win_amd64.whl (216.0 MB)


In [65]:
import json
import os
import re

import pandas as pd
import tldextract              # ! pip install tldextract
import torch
import yaml
from textblob import TextBlob  # ! pip install textblob
from tqdm import tqdm
from transformers import pipeline


### 1. Load Data

In [66]:
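# Load the run configuration. The file name "config.yaml" is an assumption;
# point this at wherever your config actually lives.
with open("config.yaml") as f:
    config = yaml.safe_load(f)

labeled_input_folder = config['labeled_input']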

batch_files = [f"labels_batch{i}.csv" for i in range(1, 10)]

dfs = []
for file in batch_files:
    file_path = os.path.join(labeled_input_folder, file)
    if os.path.exists(file_path):
        df = pd.read_csv(file_path)
        dfs.append(df)
    else:
        print(f"Warning: {file_path} does not exist!")

# Combine all batches
combined_df = pd.concat(dfs, ignore_index=True)

# Preview
combined_df.head()

|  | review_id | raw_json | comprehensive_review |
|---|---|---|---|
| 0 | 1 | `{"is_ad": false, "is_relevant": true, "is_rant...` |  |
| 1 | 2 | `{"is_ad":false,"is_relevant":true,"is_rant":fa...` |  |
| 2 | 3 | `{"is_ad": false, "is_relevant": true, "is_rant...` |  |
| 3 | 4 | `{"is_ad": false, "is_relevant": true, "is_rant...` |  |
| 4 | 5 | `{"is_ad": false, "is_relevant": true, "is_rant...` |  |


In [84]:
file_path = os.path.join(labeled_input_folder, "all_combined_reviews.json")
if not os.path.exists(file_path):
    # Fail early: every later cell depends on reviews_df
    raise FileNotFoundError(f"{file_path} does not exist!")
reviews_df = pd.read_json(file_path, lines=True)

# reviews_df.head()

# Locate reviews that have no text
nan_rows = reviews_df[reviews_df['review_text'].isna()]
original_indices = nan_rows.index.tolist()
print("Rows with NaN review_text:", original_indices)

# Collapse consecutive indices into inclusive (start, end) ranges
ranges = []
if original_indices:  # indexing [0] would fail on an empty list
    start = prev = original_indices[0]
    for n in original_indices[1:]:
        if n == prev + 1:
            prev = n
        else:
            ranges.append((start, prev))
            start = prev = n
    ranges.append((start, prev))

print(ranges)


Rows with NaN review_text: [11109, 11110, 11111, ..., 11246, 11247, ...
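
Because the printed index list is long, the run-collapsing loop is worth keeping as a reusable helper. The sketch below is one alternative formulation using `itertools.groupby`; the name `collapse_ranges` is hypothetical, and the input is assumed to be sorted with no duplicates.

```python
from itertools import groupby


def collapse_ranges(indices):
    """Collapse sorted, unique integers into inclusive (start, end) runs."""
    runs = []
    # Within a run of consecutive integers, value - position stays constant
    for _, group in groupby(enumerate(indices), key=lambda pair: pair[1] - pair[0]):
        values = [value for _, value in group]
        runs.append((values[0], values[-1]))
    return runs


print(collapse_ranges([3, 4, 5, 9, 10, 12]))  # [(3, 5), (9, 10), (12, 12)]
```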

In [68]:
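labels_df = combined_df
# Expand each raw_json string into one column per label
parsed_labels = labels_df['raw_json'].apply(json.loads).apply(pd.Series)
# Positional join: label row i is assumed to describe review row i.
# Reviews beyond the last label row get NaN in every label column.
full_df = pd.concat([reviews_df.reset_index(drop=True), parsed_labels.reset_index(drop=True)], axis=1)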

full_df.head()

|  | review_text | rating | has_photo | author_name | user_review_count | business_name | category | source | is_ad | is_relevant | is_rant | is_legit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The store was clean and organized, and the cas... | 5.0 | False | Sarah Aulbach | 1.0 | Bass Pro Shops | ['Sporting goods store', 'Clothing store', 'Fi... | google | False | True | False | True |
| 1 | Great food, good service, great atmosphere. | 5.0 | False | Ericka Woodall | 1.0 | Hooters | ['American restaurant', 'Bar & grill', 'Chicke... | google | False | True | False | True |
| 2 | Love going to Dollar Tree! Everything is a dol... | 5.0 | False | Roseanna Still | 1.0 | Dollar Tree | ['Dollar store', 'Craft store', 'Discount stor... | google | False | True | False | True |
| 3 | Great selection | 5.0 | False | William Ward | 1.0 | Half Price Books | ['Book store', 'Music store', 'Toy store'] | google | False | True | False | True |
| 4 | Great customer service | 3.0 | False | Susanna Allen | 1.0 | McDonald's | ['Fast food restaurant', 'Breakfast restaurant... | google | False | True | False | True |
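
`json.loads` raises on the first malformed label string, which would abort the whole `.apply` call. A defensive variant is sketched below using only the standard library. The helper name `parse_label` and the constant `LABEL_KEYS` are hypothetical; the four keys are the label columns shown in the preview above.

```python
import json

LABEL_KEYS = ("is_ad", "is_relevant", "is_rant", "is_legit")


def parse_label(raw):
    """Parse one raw_json string; every key maps to None if parsing fails."""
    try:
        data = json.loads(raw)
    except (TypeError, json.JSONDecodeError):
        return dict.fromkeys(LABEL_KEYS)
    if not isinstance(data, dict):
        return dict.fromkeys(LABEL_KEYS)
    return {key: data.get(key) for key in LABEL_KEYS}


raw_labels = [
    '{"is_ad": false, "is_relevant": true, "is_rant": false, "is_legit": true}',
    '{"is_ad": tru',  # truncated string, cannot be parsed
]
parsed = [parse_label(raw) for raw in raw_labels]
print(parsed)
```

Rows that come back as all `None` can then be counted or re-labelled instead of crashing the cell.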


### 2. Pre-Process DataFrames

##### 2.1 Cleaning Functions

In [78]:
def normalize_whitespace(text):
    # Collapse any run of whitespace into a single space
    return re.sub(r'\s+', ' ', text).strip()

def clean_urls(text):
    # Replace each URL with its own domain name so the domain survives as a token
    url_pattern = re.compile(r'https?://[^\s]+')
    return url_pattern.sub(lambda m: tldextract.extract(m.group(0)).domain, text)

def clean_text(text):
    # Missing reviews become empty strings so later steps always receive a str
    if pd.isna(text):
        return ""
    text = str(text)
    # Remove URLs (http(s)://... and www...)
    text = re.sub(r"http\S+|www\S+", "", text)
    # Add other cleaning rules here (optional)
    return text

full_df["clean_text"] = full_df["review_text"].apply(clean_text)

nan_rows = full_df[pd.isna(full_df['review_text'])]
original_indices = nan_rows.index.tolist()
# print("Rows with NaN review_text:", original_indices)
nan_rows

|  | review_text | rating | has_photo | author_name | user_review_count | business_name | category | source | is_ad | is_relevant | is_rant | is_legit | clean_text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11109 |  |  | False | Two Itchy Feet |  |  |  | singapore |  |  |  |  |  |
| 11110 |  |  | False | Stephen Fong |  |  |  | singapore |  |  |  |  |  |
| 11111 |  |  | False | Dawn Santa Maria |  |  |  | singapore |  |  |  |  |  |
| 11112 |  |  | False | Kandi leong |  |  |  | singapore |  |  |  |  |  |
| 11113 |  |  | False | Esther Lim |  |  |  | singapore |  |  |  |  |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20607 |  | 5.0 | False | Thiam Hock Lee | 1.0 | FairPrice Ghim Moh Link | Supermarket | singapore |  |  |  |  |  |
| 20610 |  | 4.0 | False | Ankit Agrawal | 9.0 | FairPrice Ghim Moh Link | Supermarket | singapore |  |  |  |  |  |
| 20611 |  | 5.0 | False | Lewis Gan | 1.0 | FairPrice Ghim Moh Link | Supermarket | singapore |  |  |  |  |  |
| 20612 |  | 5.0 | False | Debaditya Roy | 10.0 | FairPrice Ghim Moh Link | Supermarket | singapore |  |  |  |  |  |
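
To see what "keep the domain as a token" looks like on a concrete string, here is a dependency-free sketch. It swaps `tldextract` for the standard library's `urllib.parse`, so each URL becomes its full hostname (subdomain and suffix included) rather than the bare domain name. The function name `urls_to_hosts` is hypothetical.

```python
import re
from urllib.parse import urlparse

URL_PATTERN = re.compile(r"https?://\S+")


def urls_to_hosts(text):
    """Replace each URL with its hostname, then collapse whitespace."""
    replaced = URL_PATTERN.sub(lambda m: urlparse(m.group(0)).hostname or "", text)
    return re.sub(r"\s+", " ", replaced).strip()


sample = "Visit  https://promo.example.com/deal now\nor http://shop.example.org"
print(urls_to_hosts(sample))  # Visit promo.example.com now or shop.example.org
```

Passing a function to `sub` lets every match be replaced by its own host instead of one shared replacement string.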


##### 2.2 Compute Basic Signals

In [70]:
def compute_basic_signals(text, distance_m=None):
    # Takes one review string; distance_m is passed through unchanged so the
    # caller can unpack four values per review
    url_count = len(re.findall(r'https?://\S+', text))
    phone_count = len(re.findall(r'\+?\d[\d\s-]{7,}\d', text))
    caps_ratio = sum(1 for c in text if c.isupper()) / max(len(text), 1)
    return url_count, phone_count, caps_ratio, distance_m
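
The regular expressions above can be exercised on a couple of sample strings to see what they do and do not catch. The sketch below re-declares the same three patterns in a standalone helper (`basic_signals` is a hypothetical name) so it runs on its own. The second sample suggests that any run of nine or more digits counts as a phone number, so order numbers may be false positives.

```python
import re

URL_RE = re.compile(r"https?://\S+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def basic_signals(text):
    """Return (url_count, phone_count, caps_ratio) for one review string."""
    url_count = len(URL_RE.findall(text))
    phone_count = len(PHONE_RE.findall(text))
    caps_ratio = sum(1 for c in text if c.isupper()) / max(len(text), 1)
    return url_count, phone_count, caps_ratio


ad_like = "BEST DEALS at https://example.com call +65 9123 4567"
order_note = "Order 123456789 arrived late"

print(basic_signals(ad_like))
print(basic_signals(order_note))
```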

##### 2.3 Toxicity Signalling and Sentiment Analysis

In [71]:
import torch
from transformers import pipeline

print("Torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

# Force CPU device
toxicity_pipeline = pipeline(
    "text-classification",
    model="unitary/toxic-bert",
    device=-1,   # -1 means CPU
    truncation=True  # cut inputs that exceed the model's maximum length
)

def compute_toxicity_scores_batch(texts, batch_size=16):
    # Returns one score per text: the score of the top label the pipeline reports
    scores = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        results = toxicity_pipeline(batch)
        scores.extend([r['score'] for r in results])
    return scores

# Quick smoke test on two short strings
texts = ["I love this!", "You are awful!"]
toxicity_scores = compute_toxicity_scores_batch(texts)
print(toxicity_scores)

Torch version: 2.8.0+cpu
CUDA available: False


NameError: name 'torch' is not defined
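
The batching loop itself can be checked without downloading the model. The sketch below uses a stand-in scorer (`fake_scorer`, hypothetical) in place of `toxicity_pipeline`, so it only demonstrates that slicing preserves order and returns one score per input; it says nothing about real toxicity values.

```python
def score_in_batches(texts, score_batch, batch_size=16):
    """Call score_batch on consecutive slices and concatenate results in order."""
    scores = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        scores.extend(score_batch(batch))
    return scores


def fake_scorer(batch):
    # Hypothetical stand-in for the model: scores each text by its length
    return [float(len(text)) for text in batch]


texts = ["a", "bb", "ccc", "dddd", "eeeee"]
scores = score_in_batches(texts, fake_scorer, batch_size=2)
print(scores)  # [1.0, 2.0, 3.0, 4.0, 5.0]
```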

In [72]:
def get_textblob_sentiment(text):
    if pd.isna(text) or not isinstance(text, str) or text.strip() == "":
        return 0.0, 0.0

    try:
        analysis = TextBlob(text)
        polarity = analysis.sentiment.polarity
        subjectivity = analysis.sentiment.subjectivity
        return polarity, subjectivity
    except Exception:
        return 0.0, 0.0

# Score the combined dataframe. A bare `df` here would be the last batch read in
# section 1, which has no clean_text column.
sentiment_results = full_df["clean_text"].apply(get_textblob_sentiment)
full_df["sentiment_polarity"], full_df["sentiment_subjectivity"] = zip(*sentiment_results)

# =====================
# FLAG EXTREME SENTIMENTS
# =====================

positive_threshold = 0.8
negative_threshold = -0.8
full_df["is_extreme_sentiment"] = full_df["sentiment_polarity"].apply(
    lambda x: 1 if x >= positive_threshold or x <= negative_threshold else 0
)
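
TextBlob polarity lies in [-1, 1], and the thresholds above are inclusive on both sides. The sketch below isolates the flagging rule in a small function (`is_extreme` is a hypothetical name) to make the boundary behaviour explicit.

```python
def is_extreme(polarity, positive_threshold=0.8, negative_threshold=-0.8):
    """Return 1 when polarity is at or beyond either threshold, else 0."""
    if polarity >= positive_threshold or polarity <= negative_threshold:
        return 1
    return 0


for polarity in (1.0, 0.8, 0.79, 0.0, -0.8, -0.95):
    print(polarity, is_extreme(polarity))
```

Note that the empty-text fallback of `(0.0, 0.0)` is never flagged under this rule.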

##### 2.4 Apply to DataFrame

In [None]:
def preprocess_reviews(df, text_col="review_text", timestamp_col="timestamp"):
    # Clean text
    df["clean_text"] = df[text_col].apply(clean_text)

    # Compute basic signals on the raw text: clean_text has already had its
    # URLs stripped, so counting URLs there would always give 0
    signals = df.apply(
        lambda row: compute_basic_signals(
            "" if pd.isna(row[text_col]) else str(row[text_col]),
            row.get("distance_m", None),
        ),
        axis=1,
    )
    df["url_count"], df["phone_count"], df["caps_ratio"], df["distance_m"] = zip(*signals)

    # Compute toxicity
    df["toxicity_score"] = compute_toxicity_scores_batch(df["clean_text"].tolist())

    # Ensure timestamp is datetime
    df[timestamp_col] = pd.to_datetime(df[timestamp_col])

    return df


### 3. Time-Based Split

In [None]:
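def split_time_based(df, timestamp_col="timestamp"):
    # Windows are measured back from the newest review (max_time):
    #   train: [max_time - 2 years,  max_time - 6 months)
    #   val:   [max_time - 6 months, max_time - 3 months)
    #   test:  [max_time - 3 months, max_time]
    # Reviews older than two years fall into none of the splits.
    max_time = df[timestamp_col].max()
    cut_train = max_time - pd.DateOffset(years=2)
    cut_val = max_time - pd.DateOffset(months=6)
    cut_test = max_time - pd.DateOffset(months=3)

    train_df = df[(df[timestamp_col] >= cut_train) & (df[timestamp_col] < cut_val)]
    val_df = df[(df[timestamp_col] >= cut_val) & (df[timestamp_col] < cut_test)]
    test_df = df[df[timestamp_col] >= cut_test]

    return train_df, val_df, test_df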
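
To make the window boundaries concrete, the sketch below assigns a handful of dates to splits using only the standard library. It is an illustration of the same cut-offs, not a replacement for the pandas version: `minus_months` and `assign_split` are hypothetical helpers, and the month arithmetic clamps the day to the end of the month as an assumption about how the offsets should behave.

```python
import calendar
from datetime import datetime


def minus_months(ts, months):
    """Shift a datetime back by whole months, clamping the day to the month end."""
    idx = ts.year * 12 + (ts.month - 1) - months
    year, month0 = divmod(idx, 12)
    month = month0 + 1
    day = min(ts.day, calendar.monthrange(year, month)[1])
    return ts.replace(year=year, month=month, day=day)


def assign_split(ts, max_time):
    """Name the split a timestamp falls into, or 'dropped' if older than 2 years."""
    if ts >= minus_months(max_time, 3):
        return "test"
    if ts >= minus_months(max_time, 6):
        return "val"
    if ts >= minus_months(max_time, 24):
        return "train"
    return "dropped"


timestamps = [
    datetime(2021, 1, 1),
    datetime(2023, 3, 1),
    datetime(2024, 7, 1),
    datetime(2024, 10, 1),
    datetime(2024, 12, 15),
]
max_time = max(timestamps)
labels = [assign_split(ts, max_time) for ts in timestamps]
print(labels)  # ['dropped', 'train', 'val', 'test', 'test']
```

Because every cut-off is relative to the newest review, adding fresher data shifts all three windows forward.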