# Dataset Information

Our team has chosen to use the data from: 

Google Local review data: https://mcauleylab.ucsd.edu/public_datasets/gdrive/googlelocal/.

# Reason for Dataset Choice

The choice of the dataset is due to the following reasons:
- Our model should be generalized to any form of busineses that takes in reviews. This dataset contains review data for all businesses (not only restricted to Restaurants). <br><br>
- The dataset also contains useful metadata of the business. This makes it more convenient for us to use the data to enrich our dataset. <br><br>

# Reviews Dataset Sampling

Among the many states provided in the website, our team has decided to choose 'Nevada' state.

There were a total of 8,833,403 reviews, and our team decided to sample 10,000 reviews as our dataset.

The sampling process was as such:
- Since there were many reviews with only ratings, and no textual reviews, we filtered to keep only those with textual reviews. <br><br>
- Among the filtered reviews, we randomly sampled 10,000 reviews. 

In [None]:
import ijson
import pandas as pd
import random

in_path = "raw_data/review-Nevada.json" 
out_path = "raw_data/sample_100k.csv"
rng = random.Random(42)

k = 100000
reservoir = []
with open(in_path, "rb") as f:
    objects = ijson.items(f, "", multiple_values=True)  
    for i, obj in enumerate(objects, start=1):
        if i <= k:
            reservoir.append(obj)
        else:
            j = random.randint(1, i)
            if j <= k:
                reservoir[j - 1] = obj

df = pd.DataFrame(reservoir)
df.to_csv(out_path, index=False)

filtered_df = df[df["text"].notnull()]
len(filtered_df)

out = filtered_df.sample(10000)
out.to_csv("raw_data/sample_10k_alltext.csv")

# Business Metadata Collection

Our team loaded the metadata and dropped columns that were not relevant to our downstream analysis.

Using the 10,000 filtered reviews, we then performed an inner join on `gmap_id` to consolidate the review and metadata information into a single dataset, facilitating easier processing in subsequent steps.

In [None]:
df_metadata = pd.read_json("raw_data/meta-Nevada.json", lines=True) 
df_metadata = df_metadata.drop(columns=["address", "description", "price", "hours", "MISC", "state", "relative_results"])
df_metadata = df_metadata.drop_duplicates(subset=["gmap_id"])

merged = pd.merge(df, df_metadata, on='gmap_id', how='inner')
merged.to_csv("raw_data/sample_10k_alltext_full_metadata.csv", index=False)

# User Metadata Collection

In [48]:
import pandas as pd
fake_previous_reviews = pd.read_csv('raw_data/all_fake_previous_reviews_reduced.csv')
fake_previous_reviews['user_id'] = pd.to_numeric(fake_previous_reviews['user_id'], errors='coerce').astype('float64')
fake_previous_reviews['policy'] = 'relevant'
fake_previous_reviews['fake'] = 'no'
fake_previous_reviews.head()

Unnamed: 0,user_id,name_x,total_review_count,name_y,rating,text,years_since_posted,avg_rating,num_of_reviews,latitude,longitude,category,policy,fake
0,1e+20,liat/eran salmon,403.0,Haad Rin,5,The main beach where the full moon parties are...,1,4.5,1191.0,9.676891,100.06805,"['establishment', 'natural_feature']",relevant,no
1,1e+20,liat/eran salmon,403.0,Har Qatum,2,"A special mountain, without a regular route. L...",1,4.5,8.0,30.576944,34.894722,"['establishment', 'natural_feature']",relevant,no
2,1e+20,liat/eran salmon,403.0,AgroCafe,5,Professional coffee shop. Many special types o...,1,4.7,951.0,31.676095,34.9356,"['cafe', 'establishment', 'food', 'point_of_in...",relevant,no
3,1e+20,liat/eran salmon,403.0,ביטלס,4,"Stiflingly crowded, dark but good music",1,3.9,615.0,29.551125,34.952428,"['bar', 'establishment', 'point_of_interest']",relevant,no
4,1e+20,liat/eran salmon,403.0,U Magic Palace,5,We got rooms on the 3rd floor in the southwest...,1,4.5,8371.0,29.55442,34.960241,"['establishment', 'lodging', 'point_of_interest']",relevant,no


In [49]:
fake_previous_reviews.columns

Index(['user_id', 'name_x', 'total_review_count', 'name_y', 'rating', 'text',
       'years_since_posted', 'avg_rating', 'num_of_reviews', 'latitude',
       'longitude', 'category', 'policy', 'fake'],
      dtype='object')

# Synthesizing Data

# Generation of pseudo-labels

In [50]:
all_data = pd.read_csv('raw_data/all_data.csv')
all_data = all_data.drop(columns=["Unnamed: 0"])
all_data.head()

Unnamed: 0,user_id,name_x,time,rating,text,pics,resp,gmap_id,name_y,latitude,longitude,category,avg_rating,num_of_reviews,url,policy,fake
0,1.1e+20,Michelle Banks,1520000000000.0,5.0,It's a beautiful place to read books and have ...,,,0x80c8bf81f68a634f:0xe605b4c3043783c9,Barnes & Noble,36.157754,-115.289418,"['Book store', 'Cafe', 'Childrens book store',...",4.6,1719,https://www.google.com/maps/place//data=!4m2!3...,relevant,no
1,1.06e+20,Steven DeRyck [Staff],1540000000000.0,4.0,"As previous reviews have stated, two small pie...",,,0x80c8c415f0a42c77:0x55c554fdc4ad8b9c,Carnegie Deli,36.120556,-115.173611,"['Deli', 'Takeout Restaurant', 'Sandwich shop']",4.1,706,https://www.google.com/maps/place//data=!4m2!3...,relevant,no
2,1.1e+20,Stevey Markovich,1600000000000.0,5.0,Absolutely love this office! Afton is truly am...,,"{'time': 1595600023968, 'text': 'Thank you! Ha...",0x80c8ce0f7732ee7b:0xea13348742f64327,Center for Cosmetic and Family Dentistry,36.001929,-115.107484,['Dentist'],4.9,318,https://www.google.com/maps/place//data=!4m2!3...,relevant,no
3,1.02e+20,William Campbell,1540000000000.0,3.0,The food is as good as it usually is,,,0x80c8dc9da25847c7:0x27b862b824ac757c,Asian Garden,36.168901,-115.060601,"['Restaurant', 'Asian restaurant', 'Chinese re...",3.8,128,https://www.google.com/maps/place//data=!4m2!3...,relevant,no
4,1.12e+20,Beverly Thorman,1520000000000.0,5.0,We came in without an appointment on a Saturda...,,"{'time': 1523207982441, 'text': 'Thank you for...",0x80c8c03de37488fd:0xdc3302fd9f8f44a,Great Clips,36.191055,-115.258969,"['Hair salon', 'Beauty salon']",4.3,168,https://www.google.com/maps/place//data=!4m2!3...,relevant,no


In [51]:
all_data.columns

Index(['user_id', 'name_x', 'time', 'rating', 'text', 'pics', 'resp',
       'gmap_id', 'name_y', 'latitude', 'longitude', 'category', 'avg_rating',
       'num_of_reviews', 'url', 'policy', 'fake'],
      dtype='object')

Calculate Years Since Posted from 'time' Column

In [52]:
import pandas as pd
from datetime import datetime

def add_years_since_posted_column(df: pd.DataFrame) -> pd.DataFrame:
    """
    Calculates the number of full years between each timestamp in the 'time' column
    and the current date, adding this as a new column named 'years_since_posted'.

    Args:
        df (pd.DataFrame): The input DataFrame which must contain a 'time' column
                           with timestamp-like values.

    Returns:
        pd.DataFrame: The DataFrame with the new 'years_since_posted' column.
                      Invalid or unparseable timestamps will result in None for years.
    """
    # Get the current date and time
    current_time = datetime.now()

    # Ensure the 'time' column is in datetime format
    # Using errors='coerce' will turn unparseable dates into NaT (Not a Time)
    # We use a temporary column to not modify the original 'time' column type if it's needed later
    df['__parsed_time'] = pd.to_datetime(df['time'], unit='ms', errors='coerce')

    # Calculate the difference in days and then convert to years.
    # We use .dt.days for the total days difference.
    # Then divide by 365.25 to account for leap years for a more accurate average.
    # The result will be a float.
    df['years_since_posted_float'] = (current_time - df['__parsed_time']).dt.days / 365.25

    # Convert to integer for 'full completed years'.
    # If a timestamp was unparseable (NaT), the result for years_since_posted_float will be NaN.
    # We convert NaN to None when casting to int to maintain clarity.
    df['years_since_posted'] = df['years_since_posted_float'].apply(
        lambda x: int(x) if pd.notna(x) else None
    )

    # Drop the intermediate parsing and float columns as they are no longer needed
    df = df.drop(columns=['__parsed_time', 'years_since_posted_float'])

    return df


all_data = add_years_since_posted_column(all_data.copy())
all_data = all_data.drop(columns=['time','pics', 'resp', 'gmap_id','url'])
all_data.head()

Unnamed: 0,user_id,name_x,rating,text,name_y,latitude,longitude,category,avg_rating,num_of_reviews,policy,fake,years_since_posted
0,1.1e+20,Michelle Banks,5.0,It's a beautiful place to read books and have ...,Barnes & Noble,36.157754,-115.289418,"['Book store', 'Cafe', 'Childrens book store',...",4.6,1719,relevant,no,7
1,1.06e+20,Steven DeRyck [Staff],4.0,"As previous reviews have stated, two small pie...",Carnegie Deli,36.120556,-115.173611,"['Deli', 'Takeout Restaurant', 'Sandwich shop']",4.1,706,relevant,no,6
2,1.1e+20,Stevey Markovich,5.0,Absolutely love this office! Afton is truly am...,Center for Cosmetic and Family Dentistry,36.001929,-115.107484,['Dentist'],4.9,318,relevant,no,4
3,1.02e+20,William Campbell,3.0,The food is as good as it usually is,Asian Garden,36.168901,-115.060601,"['Restaurant', 'Asian restaurant', 'Chinese re...",3.8,128,relevant,no,6
4,1.12e+20,Beverly Thorman,5.0,We came in without an appointment on a Saturda...,Great Clips,36.191055,-115.258969,"['Hair salon', 'Beauty salon']",4.3,168,relevant,no,7


# Combine All Datasets

In [53]:
import pandas as pd
final_df = pd.concat([all_data, fake_previous_reviews], ignore_index=True)

print("Columns of fake_previous_reviews:", fake_previous_reviews.columns.tolist())
print("\nColumns of all_data:", all_data.columns.tolist())
print("\n--- Merged final_df ---")
final_df

Columns of fake_previous_reviews: ['user_id', 'name_x', 'total_review_count', 'name_y', 'rating', 'text', 'years_since_posted', 'avg_rating', 'num_of_reviews', 'latitude', 'longitude', 'category', 'policy', 'fake']

Columns of all_data: ['user_id', 'name_x', 'rating', 'text', 'name_y', 'latitude', 'longitude', 'category', 'avg_rating', 'num_of_reviews', 'policy', 'fake', 'years_since_posted']

--- Merged final_df ---


Unnamed: 0,user_id,name_x,rating,text,name_y,latitude,longitude,category,avg_rating,num_of_reviews,policy,fake,years_since_posted,total_review_count
0,1.100000e+20,Michelle Banks,5.0,It's a beautiful place to read books and have ...,Barnes & Noble,36.157754,-115.289418,"['Book store', 'Cafe', 'Childrens book store',...",4.6,1719.0,relevant,no,7,
1,1.060000e+20,Steven DeRyck [Staff],4.0,"As previous reviews have stated, two small pie...",Carnegie Deli,36.120556,-115.173611,"['Deli', 'Takeout Restaurant', 'Sandwich shop']",4.1,706.0,relevant,no,6,
2,1.100000e+20,Stevey Markovich,5.0,Absolutely love this office! Afton is truly am...,Center for Cosmetic and Family Dentistry,36.001929,-115.107484,['Dentist'],4.9,318.0,relevant,no,4,
3,1.020000e+20,William Campbell,3.0,The food is as good as it usually is,Asian Garden,36.168901,-115.060601,"['Restaurant', 'Asian restaurant', 'Chinese re...",3.8,128.0,relevant,no,6,
4,1.120000e+20,Beverly Thorman,5.0,We came in without an appointment on a Saturda...,Great Clips,36.191055,-115.258969,"['Hair salon', 'Beauty salon']",4.3,168.0,relevant,no,7,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
117029,1.184439e+20,jessica moxley,1.0,Horrible,The New Pioneer Hotel and Casino,35.155498,-114.572716,"['casino', 'establishment', 'food', 'lodging',...",3.8,6127.0,relevant,no,2,67.0
117030,1.184439e+20,jessica moxley,4.0,First time going here last night and it actual...,Gila River Resorts & Casinos - Lone Butte,33.289489,-111.943017,"['bar', 'casino', 'establishment', 'food', 'ni...",4.1,11477.0,relevant,no,2,67.0
117031,1.184439e+20,jessica moxley,1.0,Horrible. Terrible. Awful. All of the above. ...,Whataburger,36.106250,-115.173503,"['establishment', 'food', 'meal_takeaway', 'po...",4.2,1010.0,relevant,no,2,67.0
117032,1.184439e+20,jessica moxley,3.0,Machines are alright,Casino Arizona,33.454347,-111.885611,"['casino', 'establishment', 'food', 'point_of_...",4.0,21012.0,relevant,no,2,67.0


Randomise total_review_count for fake data

In [54]:
import random
import numpy as np

# Isolate the 'spam' rows and count them
spam_rows = (final_df['fake'] == 'yes') & (final_df['policy'] == 'spam')
num_spam_rows = spam_rows.sum()
# Generate a list of different random integers
spam_values = np.random.randint(21, 151, size=num_spam_rows) # The high value is exclusive, so 151
# Assign the new values
final_df.loc[spam_rows, 'total_review_count'] = spam_values

# Repeat the process for 'irrelevant'
irrelevant_rows = (final_df['fake'] == 'yes') & (final_df['policy'] == 'irrelevant')
num_irrelevant_rows = irrelevant_rows.sum()
irrelevant_values = np.random.randint(1, 11, size=num_irrelevant_rows)
final_df.loc[irrelevant_rows, 'total_review_count'] = irrelevant_values

# Repeat the process for 'rant'
rant_rows = (final_df['fake'] == 'yes') & (final_df['policy'] == 'rant')
num_rant_rows = rant_rows.sum()
rant_values = np.random.randint(5, 31, size=num_rant_rows)
final_df.loc[rant_rows, 'total_review_count'] = rant_values

# Repeat the process for 'ads'
ads_rows = (final_df['fake'] == 'yes') & (final_df['policy'] == 'ads')
num_ads_rows = ads_rows.sum()
ads_values = np.random.randint(100, 502, size=num_ads_rows)
final_df.loc[ads_rows, 'total_review_count'] = ads_values
final_df.head()

Unnamed: 0,user_id,name_x,rating,text,name_y,latitude,longitude,category,avg_rating,num_of_reviews,policy,fake,years_since_posted,total_review_count
0,1.1e+20,Michelle Banks,5.0,It's a beautiful place to read books and have ...,Barnes & Noble,36.157754,-115.289418,"['Book store', 'Cafe', 'Childrens book store',...",4.6,1719.0,relevant,no,7,
1,1.06e+20,Steven DeRyck [Staff],4.0,"As previous reviews have stated, two small pie...",Carnegie Deli,36.120556,-115.173611,"['Deli', 'Takeout Restaurant', 'Sandwich shop']",4.1,706.0,relevant,no,6,
2,1.1e+20,Stevey Markovich,5.0,Absolutely love this office! Afton is truly am...,Center for Cosmetic and Family Dentistry,36.001929,-115.107484,['Dentist'],4.9,318.0,relevant,no,4,
3,1.02e+20,William Campbell,3.0,The food is as good as it usually is,Asian Garden,36.168901,-115.060601,"['Restaurant', 'Asian restaurant', 'Chinese re...",3.8,128.0,relevant,no,6,
4,1.12e+20,Beverly Thorman,5.0,We came in without an appointment on a Saturda...,Great Clips,36.191055,-115.258969,"['Hair salon', 'Beauty salon']",4.3,168.0,relevant,no,7,


Fill Missing total_review_count

In [55]:
def fill_total_review_count_by_user(df: pd.DataFrame) -> pd.DataFrame:
    """
    Fills NaN values in the 'total_review_count' column for each user_id
    with an existing non-NaN value from other rows of the same user_id.
    If a user_id has multiple conflicting non-NaN values for 'total_review_count',
    it will use the maximum value found for that user.

    Args:
        df (pd.DataFrame): The input DataFrame, expected to have 'user_id' and
                           'total_review_count' columns.

    Returns:
        pd.DataFrame: The DataFrame with 'total_review_count' NaNs filled.
    """
    # First, let's ensure 'total_review_count' is a numeric type,
    # coercing errors to NaN in case there are non-numeric entries
    df['total_review_count'] = pd.to_numeric(df['total_review_count'], errors='coerce')

    # Group by 'user_id' and get the maximum 'total_review_count' for each user.
    # We use .transform('max') to broadcast this max value back to all rows
    # within each group. This ensures that even if there are multiple non-NaN
    # values for a user, we pick a consistent (maximum) one.
    # If a user_id only has NaNs, the max will also be NaN.
    df['total_review_count_filled'] = df.groupby('user_id')['total_review_count'].transform('max')

    # Now, fill the original 'total_review_count' NaNs with the values from
    # our newly created 'total_review_count_filled' column.
    # This ensures that only the NaNs are filled, preserving existing non-NaN values.
    df['total_review_count'] = df['total_review_count'].fillna(df['total_review_count_filled'])

    # Drop the temporary column
    df = df.drop(columns=['total_review_count_filled'])

    return df

final_df = fill_total_review_count_by_user(final_df.copy())
final_df.head()

Unnamed: 0,user_id,name_x,rating,text,name_y,latitude,longitude,category,avg_rating,num_of_reviews,policy,fake,years_since_posted,total_review_count
0,1.1e+20,Michelle Banks,5.0,It's a beautiful place to read books and have ...,Barnes & Noble,36.157754,-115.289418,"['Book store', 'Cafe', 'Childrens book store',...",4.6,1719.0,relevant,no,7,148.0
1,1.06e+20,Steven DeRyck [Staff],4.0,"As previous reviews have stated, two small pie...",Carnegie Deli,36.120556,-115.173611,"['Deli', 'Takeout Restaurant', 'Sandwich shop']",4.1,706.0,relevant,no,6,143.0
2,1.1e+20,Stevey Markovich,5.0,Absolutely love this office! Afton is truly am...,Center for Cosmetic and Family Dentistry,36.001929,-115.107484,['Dentist'],4.9,318.0,relevant,no,4,148.0
3,1.02e+20,William Campbell,3.0,The food is as good as it usually is,Asian Garden,36.168901,-115.060601,"['Restaurant', 'Asian restaurant', 'Chinese re...",3.8,128.0,relevant,no,6,145.0
4,1.12e+20,Beverly Thorman,5.0,We came in without an appointment on a Saturda...,Great Clips,36.191055,-115.258969,"['Hair salon', 'Beauty salon']",4.3,168.0,relevant,no,7,138.0


In [56]:
column_rename_map = {
    'name_x': 'user_name',
    'rating': 'review_rating',
    'text': 'review_text',
    'name_y': 'location_name',
    'latitude': 'location_latitude',
    'longitude': 'location_longitude',
    'category': 'location_category',
    'avg_rating': 'location_average_rating',
    'num_of_reviews': 'location_reviews_count',
    'total_review_count': 'user_reviews_count',
    'years_since_posted': 'review_years_since_posted',
    'policy': 'review_policy',
    'fake': 'review_fake'
}
# Rename the columns
final_df = final_df.rename(columns=column_rename_map)
final_df.head()

Unnamed: 0,user_id,user_name,review_rating,review_text,location_name,location_latitude,location_longitude,location_category,location_average_rating,location_reviews_count,review_policy,review_fake,review_years_since_posted,user_reviews_count
0,1.1e+20,Michelle Banks,5.0,It's a beautiful place to read books and have ...,Barnes & Noble,36.157754,-115.289418,"['Book store', 'Cafe', 'Childrens book store',...",4.6,1719.0,relevant,no,7,148.0
1,1.06e+20,Steven DeRyck [Staff],4.0,"As previous reviews have stated, two small pie...",Carnegie Deli,36.120556,-115.173611,"['Deli', 'Takeout Restaurant', 'Sandwich shop']",4.1,706.0,relevant,no,6,143.0
2,1.1e+20,Stevey Markovich,5.0,Absolutely love this office! Afton is truly am...,Center for Cosmetic and Family Dentistry,36.001929,-115.107484,['Dentist'],4.9,318.0,relevant,no,4,148.0
3,1.02e+20,William Campbell,3.0,The food is as good as it usually is,Asian Garden,36.168901,-115.060601,"['Restaurant', 'Asian restaurant', 'Chinese re...",3.8,128.0,relevant,no,6,145.0
4,1.12e+20,Beverly Thorman,5.0,We came in without an appointment on a Saturda...,Great Clips,36.191055,-115.258969,"['Hair salon', 'Beauty salon']",4.3,168.0,relevant,no,7,138.0


In [62]:
final_df.isna().sum()

user_id                         74
user_name                       29
review_rating                   74
review_text                      0
location_name                   17
location_latitude            14466
location_longitude           14466
location_category            14476
location_average_rating      14923
location_reviews_count       14923
review_policy                    0
review_fake                      0
review_years_since_posted        0
user_reviews_count             112
dtype: int64

In [None]:
final_df['user_id'].nue

Unnamed: 0,user_id,user_name,review_rating,review_text,location_name,location_latitude,location_longitude,location_category,location_average_rating,location_reviews_count,review_policy,review_fake,review_years_since_posted,user_reviews_count
0,1.100000e+20,Michelle Banks,5.0,It's a beautiful place to read books and have ...,Barnes & Noble,36.157754,-115.289418,"['Book store', 'Cafe', 'Childrens book store',...",4.6,1719.0,relevant,no,7,148.0
1,1.060000e+20,Steven DeRyck [Staff],4.0,"As previous reviews have stated, two small pie...",Carnegie Deli,36.120556,-115.173611,"['Deli', 'Takeout Restaurant', 'Sandwich shop']",4.1,706.0,relevant,no,6,143.0
2,1.100000e+20,Stevey Markovich,5.0,Absolutely love this office! Afton is truly am...,Center for Cosmetic and Family Dentistry,36.001929,-115.107484,['Dentist'],4.9,318.0,relevant,no,4,148.0
3,1.020000e+20,William Campbell,3.0,The food is as good as it usually is,Asian Garden,36.168901,-115.060601,"['Restaurant', 'Asian restaurant', 'Chinese re...",3.8,128.0,relevant,no,6,145.0
4,1.120000e+20,Beverly Thorman,5.0,We came in without an appointment on a Saturda...,Great Clips,36.191055,-115.258969,"['Hair salon', 'Beauty salon']",4.3,168.0,relevant,no,7,138.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
117029,1.184439e+20,jessica moxley,1.0,Horrible,The New Pioneer Hotel and Casino,35.155498,-114.572716,"['casino', 'establishment', 'food', 'lodging',...",3.8,6127.0,relevant,no,2,67.0
117030,1.184439e+20,jessica moxley,4.0,First time going here last night and it actual...,Gila River Resorts & Casinos - Lone Butte,33.289489,-111.943017,"['bar', 'casino', 'establishment', 'food', 'ni...",4.1,11477.0,relevant,no,2,67.0
117031,1.184439e+20,jessica moxley,1.0,Horrible. Terrible. Awful. All of the above. ...,Whataburger,36.106250,-115.173503,"['establishment', 'food', 'meal_takeaway', 'po...",4.2,1010.0,relevant,no,2,67.0
117032,1.184439e+20,jessica moxley,3.0,Machines are alright,Casino Arizona,33.454347,-111.885611,"['casino', 'establishment', 'food', 'point_of_...",4.0,21012.0,relevant,no,2,67.0


In [None]:
# final_df.to_csv('raw_data/final_data.csv', index=False)