# Data Pipeline - South Dakota Business Reviews Analysis

This notebook processes South Dakota business review data and metadata, performing data cleaning and preparation for analysis.

## 1. Import Required Libraries

Import the necessary libraries for data processing and analysis.

In [96]:
import gzip, json
import pandas as pd
from tabulate import tabulate
from typing import Dict, List
from pydantic import BaseModel
import time

## 2. Helper Functions

Define utility functions for parsing compressed JSON data.

In [25]:
def parse(path):
    """Parse gzipped JSON lines file and yield JSON objects."""
    with gzip.open(path, "rt", encoding="utf-8") as g:
        for line in g:
            yield json.loads(line)
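A minimal sketch of how `parse` can be exercised, assuming a small throwaway file written to a temporary directory. The file name `sample.json.gz` and its two records are made up for illustration, and `parse` is repeated so the block runs on its own:

```python
import gzip
import json
import os
import tempfile


def parse(path):
    """Parse a gzipped JSON Lines file and yield one object per line."""
    with gzip.open(path, "rt", encoding="utf-8") as g:
        for line in g:
            yield json.loads(line)


# Hypothetical records, not taken from the real dataset.
records = [
    {"gmap_id": "a1", "rating": 5, "text": "Great"},
    {"gmap_id": "b2", "rating": 3, "text": None},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

    # The generator is lazy, so consume it while the file still exists.
    parsed = list(parse(path))

print(parsed)
```

Because `parse` yields one record at a time, it can also feed a filter or a sampler without holding the whole file in memory.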

## 3. Data Loading

Load the review data and business metadata from compressed JSON files.

In [26]:
# Load data from compressed JSON files
reviews_data = pd.read_json(
    "review_South_Dakota.json.gz", lines=True, compression="gzip"
)  # for Parquet input, use pd.read_parquet instead
biz_meta = pd.read_json("meta_South_Dakota.json.gz", lines=True, compression="gzip")

print(f"Reviews data shape: {reviews_data.shape}")
print(f"Business metadata shape: {biz_meta.shape}")

Reviews data shape: (673048, 8)
Business metadata shape: (14257, 15)
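If the full file does not fit comfortably in memory, `pd.read_json` can also read JSON Lines in chunks. The sketch below assumes a toy five-row file created on the fly; the file name `toy_reviews.json.gz` and the chunk size of 2 are arbitrary choices for illustration:

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# Hypothetical rows, not taken from the real dataset.
rows = [{"gmap_id": f"id{i}", "rating": i} for i in range(1, 6)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_reviews.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

    # chunksize requires lines=True and returns an iterator of DataFrames.
    with pd.read_json(path, lines=True, compression="gzip", chunksize=2) as reader:
        chunks = [chunk for chunk in reader]

# Stitch the pieces back together with a fresh 0..n-1 index.
df = pd.concat(chunks, ignore_index=True)
print(df.shape)
```

In a real run, each chunk could be filtered or aggregated before concatenation so that only the reduced result is kept.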


## 4. Data Standardization

Standardize column names for consistency.

In [27]:
# Standardize column names
biz_meta.columns = biz_meta.columns.str.lower().str.strip()
reviews_data.columns = reviews_data.columns.str.lower().str.strip()

print("Reviews data columns:", list(reviews_data.columns))
print("Business metadata columns:", list(biz_meta.columns))

Reviews data columns: ['user_id', 'name', 'time', 'rating', 'text', 'pics', 'resp', 'gmap_id']
Business metadata columns: ['name', 'address', 'gmap_id', 'description', 'latitude', 'longitude', 'category', 'avg_rating', 'num_of_reviews', 'price', 'hours', 'misc', 'state', 'relative_results', 'url']
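The same normalization can be tried on a toy frame to see its effect. This is a small sketch with made-up column names; it also checks for name collisions, since lowercasing and stripping can map two different headers to the same name:

```python
import pandas as pd

# Toy frame with inconsistent casing and stray whitespace (hypothetical headers).
df = pd.DataFrame({" Gmap_ID ": ["a1"], "Rating": [5], "TEXT  ": ["ok"]})

df.columns = df.columns.str.lower().str.strip()

# Any name listed here appeared more than once after normalization.
collisions = df.columns[df.columns.duplicated()].tolist()

print(list(df.columns))
print("Collisions:", collisions)
```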


## 5. Data Preview

Display the first few rows of both datasets to understand the data structure.

In [28]:
print("\nBusiness Metadata Sample:")
print(tabulate(biz_meta.head(20), headers="keys", tablefmt="psql"))


Business Metadata Sample:

In [29]:
# Count missing per column
counts = reviews_data.isna().sum()
# Percent missing
percent = (reviews_data.isna().mean() * 100).round(2)
summary = pd.concat([counts.rename("missing_count"), percent.rename("missing_pct")], axis=1)
summary = summary.sort_values("missing_count", ascending=False)
print(f"\nMissing summary for reviews_data (n_rows={len(reviews_data)}):")
display(summary)   # if in notebook


# Remove reviews with missing essential columns
print(f"Reviews before cleaning: {len(reviews_data)}")
reviews_data = reviews_data.dropna(subset=["rating", "time", "gmap_id"])
print(f"Reviews after cleaning: {len(reviews_data)}")



# Create boolean feature for pictures
reviews_data["has_pics"] = reviews_data["pics"].notna()
print(f"Reviews with pictures: {reviews_data['has_pics'].sum()}")


Missing summary for reviews_data (n_rows=673048):


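| column  | missing_count | missing_pct |
|---------|---------------|-------------|
| pics    | 657233        | 97.65       |
| resp    | 589840        | 87.64       |
| text    | 325966        | 48.43       |
| user_id | 0             | 0.0         |
| rating  | 0             | 0.0         |
| time    | 0             | 0.0         |
| name    | 0             | 0.0         |
| gmap_id | 0             | 0.0         |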


Reviews before cleaning: 673048
Reviews after cleaning: 673048
Reviews with pictures: 15815
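The missing-value summary can be wrapped in a small reusable helper so the same report is available for any frame. This is a sketch on toy data; the function name `missing_summary` is an illustrative choice and not part of the notebook:

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, most-missing first."""
    counts = df.isna().sum().rename("missing_count")
    percent = (df.isna().mean() * 100).round(2).rename("missing_pct")
    return pd.concat([counts, percent], axis=1).sort_values(
        "missing_count", ascending=False
    )


# Hypothetical rows that mimic the sparsity pattern of the reviews table.
toy = pd.DataFrame(
    {
        "rating": [5, 4, 3, 1],
        "text": ["good", None, None, "bad"],
        "pics": [None, None, None, [{"url": "x"}]],
    }
)

summary = missing_summary(toy)
print(summary)
```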


In [30]:
# Remove businesses without gmap_id
print(f"Businesses before cleaning: {len(biz_meta)}")
biz_meta = biz_meta.dropna(subset=["gmap_id"])
print(f"Businesses after cleaning: {len(biz_meta)}")

# Convert price symbols to numeric levels ($ → 1, $$ → 2, etc.)
biz_meta["price_level"] = biz_meta["price"].str.len()
# Fill missing with 0 = unknown
biz_meta["price_level"] = biz_meta["price_level"].fillna(0).astype("int8")

print("Price level distribution:")
print(biz_meta["price_level"].value_counts().sort_index())

Businesses before cleaning: 14257
Businesses after cleaning: 14257
Price level distribution:
price_level
0    12021
1     1147
2     1050
3       38
4        1
Name: count, dtype: int64
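A toy example of the price conversion, assuming prices are stored as runs of `$` with missing values for businesses that list no price. The sample values are invented:

```python
import pandas as pd

# Hypothetical price strings; None stands for a business with no listed price.
price = pd.Series(["$", "$$", None, "$$$$", "$$$"], name="price")

# The string length gives the level; missing prices become 0 (unknown).
price_level = price.str.len().fillna(0).astype("int8")

print(price_level.tolist())
```

Note that `.str.len()` counts every character, so a value that is not made of `$` signs alone would still receive a level equal to its length.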


In [31]:
# Define columns to keep for merging
keep_cols = [
    "gmap_id",  # join key
    "name",  # business name
    "category",  # type of business
    "avg_rating",  # business-level avg
    "num_of_reviews",  # business popularity
    "latitude",
    "longitude",  # optional
    "state",  # active/closed
]

# Filter to only existing columns
keep_cols = [c for c in keep_cols if c in biz_meta.columns]
print(f"Columns to keep: {keep_cols}")

# Create filtered metadata dataset
meta_small = biz_meta[keep_cols].drop_duplicates(subset=["gmap_id"]).copy()
print(f"Unique businesses in metadata: {len(meta_small)}")

Columns to keep: ['gmap_id', 'name', 'category', 'avg_rating', 'num_of_reviews', 'latitude', 'longitude', 'state']
Unique businesses in metadata: 14167
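The counts above imply that 90 metadata rows (14257 - 14167) repeat a `gmap_id` that already appeared. Before dropping them, it can help to look at the duplicated rows. A minimal sketch on invented data:

```python
import pandas as pd

# Hypothetical metadata in which one business appears twice.
meta = pd.DataFrame(
    {
        "gmap_id": ["a1", "a1", "b2", "c3"],
        "name": ["Cafe", "Cafe", "Garage", "Park"],
        "num_of_reviews": [10, 12, 5, 8],
    }
)

# keep=False marks every member of a duplicated group, not only the repeats.
dupes = meta[meta.duplicated(subset=["gmap_id"], keep=False)]

# drop_duplicates keeps the first row seen for each key by default.
deduped = meta.drop_duplicates(subset=["gmap_id"])

print(dupes)
print(len(deduped))
```

If the duplicated rows differ in other columns, as `num_of_reviews` does here, keeping the first row is an arbitrary choice that may deserve a second look.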


## Data Summary

Summarize the cleaned reviews and the deduplicated business metadata before merging them.

In [None]:
print("Final Data Summary:")
print(f"- Reviews data: {reviews_data.shape}")
print(f"- Business metadata: {meta_small.shape}")
print(f"- Unique businesses with reviews: {reviews_data['gmap_id'].nunique()}")
print(f"- Average rating: {reviews_data['rating'].mean():.2f}")
print(f"- Reviews with pictures: {reviews_data['has_pics'].sum()} ({reviews_data['has_pics'].mean()*100:.1f}%)")

print("\nTable after cleaning:")
print(tabulate(reviews_data.head(20), headers="keys", tablefmt="psql"))

Final Data Summary:
- Reviews data: (673048, 9)
- Business metadata: (14167, 8)
- Unique businesses with reviews: 7255
- Average rating: 4.33
- Reviews with pictures: 15815 (2.3%)

Table after cleaning:
+----+-------------+-----------------------+---------------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+--------+---------------------------------------+------------+
|    |     user_id | name                  |          time |   rating | text                                                                                                                                                                                                                   | pics   | resp   | gmap_id                               | has_pics   |
|----+-------------+-----------------------+---------------+----------+------

In [34]:
# Merge reviews with business metadata
merged_data = reviews_data.merge(meta_small, on="gmap_id", how="inner")
print(f"\nMerged data shape: {merged_data.shape}")

print("Merged Data Sample:")
print(tabulate(merged_data.head(20), headers="keys", tablefmt="psql"))
print(merged_data.info())



Merged data shape: (673048, 16)
Merged Data Sample:
+----+-------------+-----------------------+---------------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+--------+---------------------------------------+------------+------------------------------+----------------------------------------+--------------+------------------+------------+-------------+--------------------+
|    |     user_id | name_x                |          time |   rating | text                                                                                                                                                                                                                   | pics   | resp   | gmap_id                               | has_pics   | name_y                       | category                               |  
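Both tables have a `name` column (reviewer name in the reviews, business name in the metadata), which is why the merged frame shows `name_x` and `name_y`. The sketch below shows one way to make that explicit on toy data; the suffixes `_reviewer` and `_business` are illustrative and not what the notebook uses. `validate="many_to_one"` makes pandas raise an error if the metadata key is not unique:

```python
import pandas as pd

# Hypothetical reviews and metadata sharing a "name" column.
reviews = pd.DataFrame(
    {"gmap_id": ["a1", "a1", "b2"], "name": ["Ann", "Bob", "Cy"], "rating": [5, 4, 2]}
)
meta = pd.DataFrame({"gmap_id": ["a1", "b2"], "name": ["Cafe", "Garage"]})

merged = reviews.merge(
    meta,
    on="gmap_id",
    how="inner",
    suffixes=("_reviewer", "_business"),  # replaces the default _x / _y
    validate="many_to_one",  # many reviews per business, one metadata row each
)

print(merged)
```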

## Text classification with LLM (e.g., OpenAI GPT-4)

Remove reviews that have no text, then classify each remaining review for three policy violations: advertisement, irrelevant content, and fake rant.

In [None]:
# Remove reviews with no text

print(f"Reviews before removing non-text: {len(merged_data)}")

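# Soft filter: keep rows where 'text' is not null/empty after stripping whitespace.
# Missing values are filled with "" first, because NaN cast to bool is True and
# would otherwise pass the filter.
merged_data_og = merged_data.copy()
md_wtext = merged_data[merged_data["text"].fillna("").str.strip().ne("")]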

print(f"Reviews after removing non-text: {len(md_wtext)}")

print(md_wtext.info())

Reviews before removing non-text: 347082
Reviews after removing non-text: 347082
<class 'pandas.core.frame.DataFrame'>
Index: 347082 entries, 0 to 673041
Data columns (total 16 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   user_id         347082 non-null  float64
 1   name_x          347082 non-null  object 
 2   time            347082 non-null  int64  
 3   rating          347082 non-null  int64  
 4   text            347082 non-null  object 
 5   pics            14238 non-null   object 
 6   resp            49330 non-null   object 
 7   gmap_id         347082 non-null  object 
 8   has_pics        347082 non-null  bool   
 9   name_y          347082 non-null  object 
 10  category        347002 non-null  object 
 11  avg_rating      347082 non-null  float64
 12  num_of_reviews  347082 non-null  int64  
 13  latitude        347082 non-null  float64
 14  longitude       347082 non-null  float64
 15  state           214448 non
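When filtering on stripped text, missing values need explicit handling, because `bool(float("nan"))` is `True`. The sketch below contrasts two masks on invented data; whether the first one misbehaves in practice depends on how missing text is represented in the frame (here it is `NaN`):

```python
import pandas as pd

# Hypothetical review texts: real text, whitespace only, missing, empty.
text = pd.Series(["Great place", "   ", float("nan"), ""], dtype="object")

# Casting the stripped strings straight to bool keeps the NaN row,
# since NaN is truthy.
naive_mask = text.str.strip().astype(bool)

# Filling missing values first makes missing and blank text both fail the test.
safe_mask = text.fillna("").str.strip().ne("")

print(naive_mask.tolist())
print(safe_mask.tolist())
```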

In [103]:
import os
from dotenv import load_dotenv
from openai import OpenAI
import openai


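Create a `ReviewClassifier` class that sends each review to an LLM (the OpenAI API or a local OpenAI-compatible server) and returns the classification as a dictionary.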

In [104]:
load_dotenv()  # take environment variables from .env file


class ReviewSchema(BaseModel):
    advertisement: bool
    advertisement_confidence: float
    irrelevant: bool
    irrelevant_confidence: float
    fake_rant: bool
    fake_rant_confidence: float


class ReviewClassifier():
    def __init__(self, api_key, isLocalLLM, model_name):
        self.api_key = api_key
        self.isLocalLLM = isLocalLLM
        self.modelName = model_name
        if isLocalLLM:
            self.client = OpenAI(api_key="lm_studio", base_url="http://localhost:1234/v1")  # Placeholder for local LLM client
        else:
            self.client = OpenAI(api_key=api_key)

    def getModelName(self):
        return self.modelName

    def classify_review(self, review_text) -> dict:
        """
        Classify a single review for policy violations using OpenAI API
        """
        
        prompt = f"""
                    You are a review moderator. Analyze this review for policy violations:

                    Review: "{review_text}"

                    Check for these violations:
                    1. ADVERTISEMENT: Contains promotional content, links, phone numbers, or marketing
                    2. IRRELEVANT: Not about the business (talks about personal life, politics, etc.)
                    3. FAKE_RANT: Negative review from someone who clearly never visited

                    Respond in JSON format:
                    {{
                        "advertisement": true/false,
                        "advertisement_confidence": 0.0-1.0,
                        "irrelevant": true/false, 
                        "irrelevant_confidence": 0.0-1.0,
                        "fake_rant": true/false,
                        "fake_rant_confidence": 0.0-1.0
                    }}
                    """
        
        review_schema = {
                "type": "json_schema",
                "json_schema": {
                    "name": "classify_review",
                    "schema": {
                        "type": "object",
                        "properties": {
                                "advertisement": {"type": "boolean", "description": "True if the review is an advertisement"},
                                "advertisement_confidence": {"type": "number", "description": "Confidence level for advertisement classification (0-1)"},
                                "irrelevant": {"type": "boolean", "description": "True if the review is irrelevant"},
                                "irrelevant_confidence": {"type": "number", "description": "Confidence level for irrelevant classification (0-1)"},
                                "fake_rant": {"type": "boolean", "description": "True if the review is a fake rant"},
                                "fake_rant_confidence": {"type": "number", "description": "Confidence level for fake rant classification (0-1)"}
                            },
                        "required": ["advertisement", "advertisement_confidence", "irrelevant", "irrelevant_confidence", "fake_rant", "fake_rant_confidence"]
                    }
                }
            }

        try:
            if not self.isLocalLLM:
                print("Using ChatGPT model for classification")
                response = self.client.chat.completions.create(
                    model=self.modelName,
                    messages=[
                        {"role": "user", "content": prompt}
                    ],
                    # temperature=0.1,
                    max_completion_tokens=100,
                    # Expose ReviewSchema as a function tool; the model may then
                    # reply with a tool call whose arguments follow the schema.
                    tools=[openai.pydantic_function_tool(ReviewSchema)],
                )

            else:
                print("Using local model for classification")
              

                response = self.client.chat.completions.create(
                    model=self.modelName,
                    messages=[
                        {"role": "user", "content": prompt}
                    ],
                    temperature=0.1,
                    max_tokens=100,
                    response_format=review_schema # type: ignore
                    )
            
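            # Parse the JSON response. When `tools` is used, the structured output
            # arrives as tool-call arguments and `message.content` may be empty.
            message = response.choices[0].message
            if message.tool_calls:
                result_text = message.tool_calls[0].function.arguments
            else:
                result_text = message.content
            result = json.loads(result_text) if result_text else {}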
            
            return {

                'advertisement': result.get('advertisement', False),
                'advertisement_confidence': result.get('advertisement_confidence', 0.0),
                'irrelevant': result.get('irrelevant', False),
                'irrelevant_confidence': result.get('irrelevant_confidence', 0.0),
                'fake_rant': result.get('fake_rant', False),
                'fake_rant_confidence': result.get('fake_rant_confidence', 0.0)
            }
            
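        except Exception as e:
            print(f"Error classifying review: {e}")
            # Return the same keys as the success path so downstream code
            # (for example, building a DataFrame) sees a consistent shape.
            return {
                'advertisement': False,
                'advertisement_confidence': 0.0,
                'irrelevant': False,
                'irrelevant_confidence': 0.0,
                'fake_rant': False,
                'fake_rant_confidence': 0.0
            }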
            

classifier = ReviewClassifier(api_key=os.getenv('OPENAI_API_KEY') or "", isLocalLLM=False, model_name="gpt-5-nano")
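Because the reply is read with `json.loads` and `.get`, malformed or out-of-range values can pass through unnoticed. Below is a minimal sketch of a stricter parser, assuming the reply is a JSON string carrying the six fields of `ReviewSchema`. The helper name `parse_classification` and the clamping rule are illustrative choices, not part of the notebook:

```python
import json

LABELS = ("advertisement", "irrelevant", "fake_rant")


def parse_classification(raw):
    """Turn a raw JSON reply into a dict with all six expected keys.

    Missing or malformed input falls back to False / 0.0, and confidences
    are clamped to the range [0.0, 1.0].
    """
    try:
        data = json.loads(raw) if raw else {}
    except (json.JSONDecodeError, TypeError):
        data = {}
    if not isinstance(data, dict):
        data = {}

    result = {}
    for label in LABELS:
        # Only a real JSON `true` counts; strings such as "false" do not.
        result[label] = data.get(label) is True
        try:
            conf = float(data.get(f"{label}_confidence", 0.0))
        except (TypeError, ValueError):
            conf = 0.0
        result[f"{label}_confidence"] = min(1.0, max(0.0, conf))
    return result


example = parse_classification(
    '{"advertisement": true, "advertisement_confidence": 1.7}'
)
print(example)
```

A helper like this could be called on `result_text` inside `classify_review`, so that both the success path and the error path return the same keys.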

qwen14b: 2 min 40 sec for 100 reviews  
qwen4b: 1 min 5 sec for 100 reviews  
minimax: 1 min 45 sec for 100 reviews  
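These timings can be turned into a rough estimate for the whole dataset. The sketch below assumes reviews are classified one after another at a constant rate and uses the 347082 reviews with text reported by `info()` above; real throughput may differ:

```python
# Timings reported above, in seconds per 100 reviews.
seconds_per_100 = {
    "qwen14b": 2 * 60 + 40,
    "qwen4b": 1 * 60 + 5,
    "minimax": 1 * 60 + 45,
}

n_reviews = 347_082  # reviews with text, from the info() output

# Seconds per review, scaled to the full dataset and converted to hours.
hours = {
    model: secs / 100 * n_reviews / 3600 for model, secs in seconds_per_100.items()
}

for model, h in hours.items():
    per_review = seconds_per_100[model] / 100
    print(f"{model}: {per_review:.2f} s/review, about {h:.0f} h for all reviews")
```

Under these assumptions the estimate comes to roughly 154 hours for qwen14b, 63 hours for qwen4b, and 101 hours for minimax, which suggests sampling or batching before a full run.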

In [105]:
# Test the function on a few reviews
print("🧪 Testing classification function...")

# Get some sample reviews from your data
sample_reviews = merged_data['text'].dropna().head(5).tolist()
result_list = []

start_time = time.time()

for i, review in enumerate(sample_reviews):
    
    print(f"\n--- Review {i+1} ---") 
    print(f"Text: {review[:200]}...")
        
    result = classifier.classify_review(review)
    print(result)
    result_list.append(result)
    
    print(f"\nClassification Result {i+1}:")
    print(f"Advertisement: {'❌ YES' if result['advertisement'] else '✅ NO'}")
    print(f"Advertisement Confidence: {result.get('advertisement_confidence', 0.0):.2f}")
    print(f"Irrelevant: {'❌ YES' if result['irrelevant'] else '✅ NO'}")
    print(f"Irrelevant Confidence: {result.get('irrelevant_confidence', 0.0):.2f}")
    print(f"Fake Rant: {'❌ YES' if result['fake_rant'] else '✅ NO'}")
    print(f"Fake Rant Confidence: {result.get('fake_rant_confidence', 0.0):.2f}")

end_time = time.time()
time_taken = end_time - start_time

# Convert results to DataFrame
results_df = pd.DataFrame(result_list)
print("\nSample classification results:")
print(results_df)

# Save results to CSV (elapsed seconds formatted to two decimal places)
results_df.to_csv(f"classification_results_{classifier.getModelName().replace('/', '-')}_{time_taken:.2f}_{len(sample_reviews)}_{time.time()}.csv", index=True)

🧪 Testing classification function...

--- Review 1 ---
Text: Great place to care for our children....
Using ChatGPT model for classification
{'advertisement': False, 'advertisement_confidence': 0.0, 'irrelevant': False, 'irrelevant_confidence': 0.0, 'fake_rant': False, 'fake_rant_confidence': 0.0}

Classification Result 1:
Advertisement: ✅ NO
Advertisement Confidence: 0.00
Irrelevant: ✅ NO
Irrelevant Confidence: 0.00
Fake Rant: ✅ NO
Fake Rant Confidence: 0.00

--- Review 2 ---
Text: Th sw y are so nice...
Using ChatGPT model for classification
{'advertisement': False, 'advertisement_confidence': 0.0, 'irrelevant': False, 'irrelevant_confidence': 0.0, 'fake_rant': False, 'fake_rant_confidence': 0.0}

Classification Result 2:
Advertisement: ✅ NO
Advertisement Confidence: 0.00
Irrelevant: ✅ NO
Irrelevant Confidence: 0.00
Fake Rant: ✅ NO
Fake Rant Confidence: 0.00

--- Review 3 ---
Text: Went with my daughter...
Using ChatGPT model for classification
{'advertisement': False, 'advertisement
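The test loop stores only the classification dictionaries, so the saved CSV no longer shows which review each row belongs to. One way to keep that link is to reuse the original row labels when building the results frame. The sketch below uses a hypothetical `stub_classify` function in place of `classifier.classify_review`, so it runs without any API call:

```python
import pandas as pd


def stub_classify(text):
    """Hypothetical stand-in for classifier.classify_review (no API call)."""
    is_ad = "http" in text.lower()
    return {
        "advertisement": is_ad,
        "advertisement_confidence": 0.9 if is_ad else 0.1,
    }


# Invented reviews with a non-contiguous index, as left behind by filtering.
reviews = pd.DataFrame(
    {"text": ["Great food", "Visit http://example.com for deals", "Too loud"]},
    index=[10, 42, 77],
)

labels = pd.DataFrame(
    [stub_classify(t) for t in reviews["text"]],
    index=reviews.index,  # reuse the original row labels
)

# join aligns on the index, so each label lands next to its own review.
labelled = reviews.join(labels)
print(labelled)
```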