# Dataset Information

Our team has chosen to use the Google Local review data: https://mcauleylab.ucsd.edu/public_datasets/gdrive/googlelocal/

# Reason for Dataset Choice

We chose this dataset for the following reasons:
- Our model should generalize to any type of business that receives reviews. This dataset contains review data for all kinds of businesses, not only restaurants. <br><br>
- The dataset also contains useful business metadata, which makes it convenient to enrich our review data. <br><br>

# Reviews Dataset Sampling

Among the states available on the website, our team chose Nevada.

There were a total of 8,833,403 reviews, and our team decided to sample 10,000 reviews as our dataset.

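The sampling process was as follows:
- We first drew a uniform random sample of 100,000 reviews by streaming through the raw file once (reservoir sampling), so the full file never had to be loaded into memory. <br><br>
- Since many reviews contain only a rating and no text, we filtered the 100,000 sampled reviews to keep only those with textual reviews. <br><br>
- Among the filtered reviews, we randomly sampled 10,000 reviews.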

In [None]:
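import ijson
import pandas as pd
import random

in_path = "raw_data/review-Nevada.json"
out_path = "raw_data/sample_100k.csv"
rng = random.Random(42)  # seeded generator so the sample is reproducible

# Reservoir sampling: keep a uniform random sample of k reviews
# while streaming through the file once.
k = 100000
reservoir = []
with open(in_path, "rb") as f:
    # The file is a sequence of top-level JSON objects, parsed one at a time.
    objects = ijson.items(f, "", multiple_values=True)
    for i, obj in enumerate(objects, start=1):
        if i <= k:
            # Fill the reservoir with the first k reviews.
            reservoir.append(obj)
        else:
            # Replace a random slot with probability k / i.
            j = rng.randint(1, i)  # use the seeded generator, not the global one
            if j <= k:
                reservoir[j - 1] = obj

df = pd.DataFrame(reservoir)
df.to_csv(out_path, index=False)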

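# Keep only reviews that have text (many reviews contain a rating only).
filtered_df = df[df["text"].notnull()]
print(len(filtered_df))

# Randomly sample 10,000 of the reviews with text; fixed seed for reproducibility.
out = filtered_df.sample(10000, random_state=42)
out.to_csv("raw_data/sample_10k_alltext.csv", index=False)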
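
An alternative to sampling first and filtering afterwards is to apply the text filter while streaming, so that every slot in the reservoir holds a review with text. The following is a minimal sketch under stated assumptions, not the pipeline used above: `sample_with_text` is a hypothetical helper, it works on any iterable of dicts, and it treats a missing, `None`, or whitespace-only `text` field as "no text".

```python
import random
from typing import Any, Dict, Iterable, List


def sample_with_text(
    records: Iterable[Dict[str, Any]], k: int, seed: int = 42
) -> List[Dict[str, Any]]:
    """Reservoir-sample up to k records that have non-empty review text.

    Hypothetical helper for illustration; `records` can be any iterable of
    dicts, such as the objects yielded by a streaming JSON parser.
    """
    rng = random.Random(seed)
    reservoir: List[Dict[str, Any]] = []
    seen = 0  # number of records with text encountered so far
    for rec in records:
        text = rec.get("text")
        # Assumption: None or whitespace-only text counts as "no text".
        if not isinstance(text, str) or not text.strip():
            continue
        seen += 1
        if seen <= k:
            reservoir.append(rec)
        else:
            # Keep the new record with probability k / seen.
            j = rng.randint(1, seen)
            if j <= k:
                reservoir[j - 1] = rec
    return reservoir


# Toy usage: every third review has no text.
toy = [{"gmap_id": i, "text": None if i % 3 == 0 else f"review {i}"} for i in range(30)]
picked = sample_with_text(toy, k=5)
```

With this variant the reservoir size can be set directly to the final sample size, at the cost of not keeping an intermediate unfiltered sample on disk.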

# Business Metadata Collection

Our team loaded the metadata, dropped columns that were not relevant to our downstream analysis, and removed duplicate `gmap_id` entries so that each business appears only once.

Using the 10,000 filtered reviews, we then performed an inner join on `gmap_id` to consolidate the review and metadata information into a single dataset, facilitating easier processing in subsequent steps.

In [None]:
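df_metadata = pd.read_json("raw_data/meta-Nevada.json", lines=True)
# Drop columns that are not needed downstream.
df_metadata = df_metadata.drop(columns=["address", "description", "price", "hours", "MISC", "state", "relative_results"])
# Keep one metadata row per business so the join does not duplicate reviews.
df_metadata = df_metadata.drop_duplicates(subset=["gmap_id"])

# Join the 10,000 sampled reviews (`out`), not the 100k sample (`df`), with the metadata.
merged = pd.merge(out, df_metadata, on="gmap_id", how="inner")
merged.to_csv("raw_data/sample_10k_alltext_full_metadata.csv", index=False)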
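
Because an inner join silently drops any review whose `gmap_id` has no metadata row, it can be worth checking how many reviews survive the merge. The sketch below uses toy DataFrames with invented values (not the real data) and the `validate` and `indicator` options of `pd.merge`; the variable names are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the review sample and the business metadata.
reviews = pd.DataFrame(
    {"gmap_id": ["a", "a", "b", "c"], "text": ["good", "ok", "bad", "fine"]}
)
metadata = pd.DataFrame({"gmap_id": ["a", "b"], "category": ["Cafe", "Gym"]})

# A left join with indicator=True labels each review as matched ("both")
# or unmatched ("left_only"); validate raises MergeError if gmap_id is
# duplicated on the metadata side.
check = pd.merge(
    reviews, metadata, on="gmap_id", how="left",
    validate="many_to_one", indicator=True,
)
n_unmatched = int((check["_merge"] == "left_only").sum())

# Keeping only matched rows reproduces what an inner join would return.
inner = check[check["_merge"] == "both"].drop(columns="_merge")
print(f"{n_unmatched} of {len(reviews)} reviews have no metadata row")
```

If the unmatched count is large on the real data, the merged file will contain noticeably fewer than 10,000 rows.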

# User Metadata Collection

# Synthesizing Data

# Generation of Pseudo-Labels