# **Problem Statement 1**  
### **Filtering the Noise: ML for Trustworthy Location Reviews**  
**Team 3Pandas** *(Tran Ha My, Diane Teo Min Xuan, Ng Yuen Ning)*  

---

## **Problem Statement**  
Design and implement an **ML-based system** to evaluate the **quality** and **relevancy** of Google location reviews. The system should:  

- **Gauge review quality:** Detect spam, advertisements, irrelevant content, and rants from users who have likely never visited the location.  
- **Assess relevancy:** Determine whether the content of a review is genuinely related to the location being reviewed.  
- **Enforce policies:** Automatically flag or filter out reviews that violate the following example policies:  
  - No advertisements or promotional content.  
  - No irrelevant content (e.g., reviews about unrelated topics).  
  - No rants or complaints from users who have not visited the place (can be inferred from content, metadata, or other signals).  

---

## **Motivation & Impact**  
- **For Users:** Increases trust in location-based reviews, leading to better decision-making.  
- **For Businesses:** Ensures fair representation and reduces the impact of malicious or irrelevant reviews.  
- **For Platforms:** Automates moderation, reduces manual workload, and enhances platform credibility.  

---

## **Data Sources**  

| **Data Sources**       | **Details** |
|-------------------------|-------------|
| **Public Datasets**    | - **Google Review Data:** Open datasets containing Google location reviews (e.g., [Google Local Reviews on Kaggle](https://www.kaggle.com/datasets/denizbilginn/google-maps-restaurant-reviews))<br>- **Google Local review data:** [UCSD Public Dataset](https://mcauleylab.ucsd.edu/public_datasets/gdrive/googlelocal/)<br>- **Alternative Sources:** Yelp, TripAdvisor, or other open review datasets for supplementary training. |
| **Student-Crawled Data** | - Students are encouraged to crawl additional reviews from Google Maps (in compliance with Google's terms of service).<br>- **Example:** [Scraping Google Reviews (YouTube)](https://www.youtube.com/watch?v=LYMdZ7W9bWQ) |


### Dependencies

In [64]:
! pip install torch==2.7.1+cpu torchvision==0.22.1+cpu torchaudio==2.7.1+cpu --index-url https://download.pytorch.org/whl/cpu




In [104]:
import yaml
import os
import json

# ! pip install tldextract
import re
import tldextract

from tqdm import tqdm

# ! pip install textblob
from textblob import TextBlob
import pandas as pd

import torch
from transformers import pipeline
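
The cells below read `config['labeled_input']`, but the cell that defines `config` is not shown. A minimal sketch of how it might be built, assuming the settings are stored as YAML with a `labeled_input` key. The YAML layout is an assumption; the value `data/labeled` is taken from the save paths printed later in this notebook.

```python
import yaml

# Hypothetical config content. In the notebook this would presumably be read
# from a YAML file on disk with yaml.safe_load(open_file) instead of a string.
CONFIG_TEXT = """
labeled_input: data/labeled
"""

config = yaml.safe_load(CONFIG_TEXT)
labeled_input_folder = config["labeled_input"]
print(labeled_input_folder)
```

`yaml.safe_load` is used rather than `yaml.load` because it only builds plain Python objects (dicts, lists, strings, numbers), which is all a settings file needs.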


### 1. Load Data

In [144]:
labeled_input_folder = config['labeled_input']

batch_files = [f"labels_batch{i}.csv" for i in range(1, 14)]

dfs = []
for file in batch_files:
    file_path = os.path.join(labeled_input_folder, file)
    if os.path.exists(file_path):
        df = pd.read_csv(file_path)
        dfs.append(df)
    else:
        print(f"Warning: {file_path} does not exist!")

# Combine all batches
labels_df = pd.concat(dfs, ignore_index=True)

# Preview
labels_df.head()



Unnamed: 0,review_id,raw_json,comprehensive_review
0,1,"{""is_ad"": false, ""is_relevant"": true, ""is_rant...",
1,2,"{""is_ad"":false,""is_relevant"":true,""is_rant"":fa...",
2,3,"{""is_ad"": false, ""is_relevant"": true, ""is_rant...",
3,4,"{""is_ad"": false, ""is_relevant"": true, ""is_rant...",
4,5,"{""is_ad"": false, ""is_relevant"": true, ""is_rant...",


In [145]:
file_path = os.path.join(labeled_input_folder, "all_combined_reviews.json")
# Fail early: without this file reviews_df would be undefined in the lines below.
if not os.path.exists(file_path):
    raise FileNotFoundError(f"{file_path} does not exist!")
reviews_df = pd.read_json(file_path, lines=True)

reviews_df["review_id"] = reviews_df.index + 1

reviews_df.head()

Unnamed: 0,review_text,rating,has_photo,author_name,user_review_count,business_name,source,review_id
0,"The store was clean and organized, and the cas...",5.0,False,Sarah Aulbach,1.0,Bass Pro Shops,google,1
1,"Great food, good service, great atmosphere.",5.0,False,Ericka Woodall,1.0,Hooters,google,2
2,Love going to Dollar Tree! Everything is a dol...,5.0,False,Roseanna Still,1.0,Dollar Tree,google,3
3,Great selection,5.0,False,William Ward,1.0,Half Price Books,google,4
4,Great customer service,3.0,False,Susanna Allen,1.0,McDonald's,google,5


In [146]:
parsed_labels = labels_df['raw_json'].apply(json.loads).apply(pd.Series)
labels_df = pd.concat([labels_df[['review_id']], parsed_labels], axis=1)
full_df = reviews_df.merge(labels_df, on='review_id', how='left')

full_df

Unnamed: 0,review_text,rating,has_photo,author_name,user_review_count,business_name,source,review_id,is_ad,is_relevant,is_rant,is_legit
0,"The store was clean and organized, and the cas...",5.0,False,Sarah Aulbach,1.0,Bass Pro Shops,google,1,False,True,False,True
1,"Great food, good service, great atmosphere.",5.0,False,Ericka Woodall,1.0,Hooters,google,2,False,True,False,True
2,Love going to Dollar Tree! Everything is a dol...,5.0,False,Roseanna Still,1.0,Dollar Tree,google,3,False,True,False,True
3,Great selection,5.0,False,William Ward,1.0,Half Price Books,google,4,False,True,False,True
4,Great customer service,3.0,False,Susanna Allen,1.0,McDonald's,google,5,False,True,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...
20610,,4.0,False,Ankit Agrawal,9.0,FairPrice Ghim Moh Link,singapore,20611,False,True,False,True
20611,,5.0,False,Lewis Gan,1.0,FairPrice Ghim Moh Link,singapore,20612,False,True,False,True
20612,,5.0,False,Debaditya Roy,10.0,FairPrice Ghim Moh Link,singapore,20613,False,True,False,True
20613,It is convenient for those who live there,5.0,False,as low,35.0,FairPrice Ghim Moh Link,singapore,20614,False,True,False,True


In [147]:
full_df.isnull().sum()

review_text          6586
rating               5262
has_photo               0
author_name            10
user_review_count    6362
business_name        5009
source                  0
review_id               0
is_ad                8057
is_relevant          8057
is_rant              8057
is_legit             8057
dtype: int64
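
The four label columns are null for the same 8057 rows. Because `full_df` is built with a left merge, any review whose `review_id` has no row in `labels_df` ends up with null labels, which is one likely source of these gaps. A minimal sketch on toy data (all values hypothetical) of how pandas' `indicator=True` option can confirm this:

```python
import pandas as pd

# Toy stand-ins for reviews_df and labels_df (hypothetical values).
reviews = pd.DataFrame(
    {"review_id": [1, 2, 3, 4], "review_text": ["a", "b", None, "d"]}
)
labels = pd.DataFrame({"review_id": [1, 2], "is_ad": [False, True]})

# indicator=True adds a "_merge" column saying where each row came from;
# validate="one_to_one" raises if review_id is duplicated on either side.
merged = reviews.merge(
    labels, on="review_id", how="left", indicator=True, validate="one_to_one"
)

unlabeled = (merged["_merge"] == "left_only").sum()
print(merged["_merge"].value_counts())
print(f"reviews without labels: {unlabeled}")
```

If the `left_only` count on the real data equals the number of null labels, the gaps come from unlabeled reviews rather than from label JSON that is missing keys.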

In [148]:
# Save as JSON
output_json_path = os.path.join(labeled_input_folder, "full_df.json")
full_df.to_json(output_json_path, orient="records", lines=True, force_ascii=False)
print(f"JSON file saved to: {output_json_path}")

# Save as Parquet
output_parquet_path = os.path.join(labeled_input_folder, "full_df.parquet")
full_df.to_parquet(output_parquet_path, index=False)
print(f"Parquet file saved to: {output_parquet_path}")

JSON file saved to: data/labeled\full_df.json
Parquet file saved to: data/labeled\full_df.parquet


In [149]:
to_clean_df = full_df.dropna(subset=['review_text', 'is_ad', 'is_relevant', 'is_rant', 'is_legit'])

to_clean_df.head()

Unnamed: 0,review_text,rating,has_photo,author_name,user_review_count,business_name,source,review_id,is_ad,is_relevant,is_rant,is_legit
0,"The store was clean and organized, and the cas...",5.0,False,Sarah Aulbach,1.0,Bass Pro Shops,google,1,False,True,False,True
1,"Great food, good service, great atmosphere.",5.0,False,Ericka Woodall,1.0,Hooters,google,2,False,True,False,True
2,Love going to Dollar Tree! Everything is a dol...,5.0,False,Roseanna Still,1.0,Dollar Tree,google,3,False,True,False,True
3,Great selection,5.0,False,William Ward,1.0,Half Price Books,google,4,False,True,False,True
4,Great customer service,3.0,False,Susanna Allen,1.0,McDonald's,google,5,False,True,False,True


In [150]:
print(to_clean_df.shape)
print(to_clean_df.isnull().sum())

(10863, 12)
review_text            0
rating               253
has_photo              0
author_name            0
user_review_count    253
business_name          0
source                 0
review_id              0
is_ad                  0
is_relevant            0
is_rant                0
is_legit               0
dtype: int64


In [151]:
# Save as JSON
output_json_path = os.path.join(labeled_input_folder, "to_clean_df.json")
to_clean_df.to_json(output_json_path, orient="records", lines=True, force_ascii=False)
print(f"JSON file saved to: {output_json_path}")

# Save as Parquet
output_parquet_path = os.path.join(labeled_input_folder, "to_clean_df.parquet")
to_clean_df.to_parquet(output_parquet_path, index=False)
print(f"Parquet file saved to: {output_parquet_path}")

JSON file saved to: data/labeled\to_clean_df.json
Parquet file saved to: data/labeled\to_clean_df.parquet


### 2. Pre-Process Dataframes

##### 2.1 Cleaning Functions

In [165]:
def normalize_whitespace(text):
    return re.sub(r'\s+', ' ', text).strip()

def clean_urls(text):
    # Replace each URL with its own registered domain name,
    # e.g. "https://shop.example.com/deal" -> "example".
    # (Substituting a joined list of all domains would repeat every domain at every URL.)
    url_pattern = re.compile(r'https?://[^\s]+')
    return url_pattern.sub(lambda m: tldextract.extract(m.group(0)).domain, text)

def clean_text(text):
    if pd.isna(text):
        return ""
    text = str(text)
    text = clean_urls(text)
    text = normalize_whitespace(text)
    return text

##### 2.2 Compute Basic Signals

In [171]:
def compute_basic_signals(text):
    url_count = len(re.findall(r'https?://\S+', text))
    phone_count = len(re.findall(r'\+?\d[\d\s-]{7,}\d', text))
    caps_ratio = sum(1 for c in text if c.isupper()) / max(len(text), 1)
    return url_count, phone_count, caps_ratio
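
A short worked example may help show what these signals capture. The sketch below repeats the function so it runs on its own and applies it to three made-up strings. The sample texts are hypothetical, and the second one shows that the phone pattern is a heuristic that also matches other long digit runs.

```python
import re

def compute_basic_signals(text):
    url_count = len(re.findall(r'https?://\S+', text))
    phone_count = len(re.findall(r'\+?\d[\d\s-]{7,}\d', text))
    caps_ratio = sum(1 for c in text if c.isupper()) / max(len(text), 1)
    return url_count, phone_count, caps_ratio

# Hypothetical ad-like review: one URL, one phone number, 7 capital letters.
ad_text = "CALL NOW +65 9123 4567 or visit https://example.com/deals"
ad_signals = compute_basic_signals(ad_text)

# Hypothetical false positive: a 10-digit order number is counted as a phone.
order_signals = compute_basic_signals("Order 1234567890 arrived late")

# Empty text: max(len(text), 1) avoids a division by zero.
empty_signals = compute_basic_signals("")

print(ad_signals, order_signals, empty_signals)
```

Note that the URL count is only meaningful on text that still contains its URLs; after `clean_text` has replaced them with domain names, the count is always 0.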

##### 2.3 Sentiment Analysis

In [167]:
def add_textblob_sentiment(df, text_col="review_text", positive_threshold=0.9, negative_threshold=-0.9):
    def get_sentiment(text):
        if pd.isna(text) or not isinstance(text, str) or text.strip() == "":
            return 0.0, 0.0
        try:
            analysis = TextBlob(text)
            return analysis.sentiment.polarity, analysis.sentiment.subjectivity
        except Exception:
            return 0.0, 0.0

    sentiment_results = df[text_col].apply(get_sentiment)
    df["sentiment_polarity"], df["sentiment_subjectivity"] = zip(*sentiment_results)

    df["is_extreme_sentiment"] = df["sentiment_polarity"].apply(
        lambda x: 1 if x >= positive_threshold or x <= negative_threshold else 0
    )

    return df
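
With the default thresholds of 0.9 and -0.9, only reviews at the far ends of the polarity scale are flagged. The sketch below isolates that thresholding rule (it does not call TextBlob) and applies it to the five polarity values shown in the preview table further down, plus three hypothetical values near the boundaries. `is_extreme` is a helper name introduced here for illustration only.

```python
def is_extreme(polarity, positive_threshold=0.9, negative_threshold=-0.9):
    """Return 1 if polarity lies at or beyond either threshold, else 0."""
    return int(polarity >= positive_threshold or polarity <= negative_threshold)

# Polarity values copied from the first five rows of the preview table.
preview_polarities = [0.370833, 0.766667, 0.625, 0.8, 0.8]
flags = [is_extreme(p) for p in preview_polarities]

# Hypothetical values: the comparisons are inclusive, so exactly 0.9 is flagged.
hypothetical_flags = [is_extreme(1.0), is_extreme(-0.95), is_extreme(0.9)]

print(flags, hypothetical_flags)
```

None of the preview rows is flagged, which agrees with the `is_extreme_sentiment` column in that table, even though "Great selection" scores a fairly high 0.8.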

##### 2.4 Apply to Dataframe

In [175]:
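def preprocess_reviews(df):
    # Work on a copy: the input is a slice of full_df, and assigning new columns
    # to it in place triggers pandas' SettingWithCopyWarning.
    df = df.copy()

    # Clean text
    df["clean_text"] = df["review_text"].apply(clean_text)

    # Sentiment features; later cells rely on is_extreme_sentiment
    df = add_textblob_sentiment(df)

    # Compute basic signals on the raw text: clean_text has its URLs already
    # replaced by domain names, so counting URLs there would always give 0.
    signals = df["review_text"].astype(str).apply(compute_basic_signals)
    df["url_count"], df["phone_count"], df["caps_ratio"] = zip(*signals)

    return df

cleaned_df = preprocess_reviews(to_clean_df)
cleaned_df.head()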



Unnamed: 0,review_text,rating,has_photo,author_name,user_review_count,business_name,source,review_id,is_ad,is_relevant,is_rant,is_legit,clean_text,sentiment_polarity,sentiment_subjectivity,is_extreme_sentiment,url_count,phone_count,caps_ratio
0,"The store was clean and organized, and the cas...",5.0,False,Sarah Aulbach,1.0,Bass Pro Shops,google,1,False,True,False,True,"The store was clean and organized, and the cas...",0.370833,0.6,0,0,0,0.014085
1,"Great food, good service, great atmosphere.",5.0,False,Ericka Woodall,1.0,Hooters,google,2,False,True,False,True,"Great food, good service, great atmosphere.",0.766667,0.7,0,0,0,0.023256
2,Love going to Dollar Tree! Everything is a dol...,5.0,False,Roseanna Still,1.0,Dollar Tree,google,3,False,True,False,True,Love going to Dollar Tree! Everything is a dol...,0.625,0.6,0,0,0,0.070423
3,Great selection,5.0,False,William Ward,1.0,Half Price Books,google,4,False,True,False,True,Great selection,0.8,0.75,0,0,0,0.066667
4,Great customer service,3.0,False,Susanna Allen,1.0,McDonald's,google,5,False,True,False,True,Great customer service,0.8,0.75,0,0,0,0.045455


In [185]:
temp = cleaned_df[cleaned_df["is_extreme_sentiment"] > 0]
temp.shape

(773, 19)

### 3. Time-Based Split

In [None]:
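def split_time_based(df, timestamp_col="timestamp"):
    """Split chronologically, relative to the newest timestamp in df.

    train: [max - 2 years, max - 6 months)
    val:   [max - 6 months, max - 3 months)
    test:  [max - 3 months, max]
    Rows older than two years are not assigned to any split.
    """
    max_time = df[timestamp_col].max()
    cut_train = max_time - pd.DateOffset(years=2)
    cut_val = max_time - pd.DateOffset(months=6)
    cut_test = max_time - pd.DateOffset(months=3)

    train_df = df[(df[timestamp_col] >= cut_train) & (df[timestamp_col] < cut_val)]
    val_df = df[(df[timestamp_col] >= cut_val) & (df[timestamp_col] < cut_test)]
    test_df = df[df[timestamp_col] >= cut_test]
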
    return train_df, val_df, test_df
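
This cell has not been run, and the columns previewed above do not include a timestamp, so the `timestamp` column name is an assumption. A minimal sketch of how the split behaves on a toy frame with hypothetical dates (the function is repeated so the example runs on its own):

```python
import pandas as pd

def split_time_based(df, timestamp_col="timestamp"):
    max_time = df[timestamp_col].max()
    cut_train = max_time - pd.DateOffset(years=2)
    cut_val = max_time - pd.DateOffset(months=6)
    cut_test = max_time - pd.DateOffset(months=3)

    train_df = df[(df[timestamp_col] >= cut_train) & (df[timestamp_col] < cut_val)]
    val_df = df[(df[timestamp_col] >= cut_val) & (df[timestamp_col] < cut_test)]
    test_df = df[df[timestamp_col] >= cut_test]
    return train_df, val_df, test_df

# Hypothetical reviews; the newest timestamp (2024-12-15) anchors all cutoffs:
# train starts 2022-12-15, val starts 2024-06-15, test starts 2024-09-15.
toy = pd.DataFrame({
    "review_id": [1, 2, 3, 4, 5],
    "timestamp": pd.to_datetime(
        ["2020-01-01", "2023-05-01", "2024-07-15", "2024-11-01", "2024-12-15"]
    ),
})

train_df, val_df, test_df = split_time_based(toy)
print(len(train_df), len(val_df), len(test_df))
```

Review 1 is older than two years and lands in none of the splits, so the three parts can add up to fewer rows than the input. Splitting by time rather than at random keeps the test set strictly newer than the training data, which is closer to how a moderation model would be used on incoming reviews.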