# Step 5 — Data Sanity Check

A quick validation of the Amazon Electronics reviews dataset to confirm that the data is accessible and has the expected structure.

## Import Required Libraries

Import the `gzip` and `json` modules needed to read the compressed JSON data file.

In [1]:
import gzip
import json

## Load and Inspect Raw Data

Open the gzipped JSON file from the data/raw/ directory and read the first line to examine the data structure.

In [2]:
path = "../data/raw/reviews_Electronics_5.json.gz"

# Each line of the file is parsed as one JSON record, so the first line is enough for a check
with gzip.open(path, 'rt', encoding='utf-8') as f:
    data = json.loads(f.readline())

print(data)

{'reviewerID': 'AO94DHGC771SJ', 'asin': '0528881469', 'reviewerName': 'amazdnu', 'helpful': [0, 0], 'reviewText': 'We got this GPS for my husband who is an (OTR) over the road trucker.  Very Impressed with the shipping time, it arrived a few days earlier than expected...  within a week of use however it started freezing up... could of just been a glitch in that unit.  Worked great when it worked!  Will work great for the normal person as well but does have the "trucker" option. (the big truck routes - tells you when a scale is coming up ect...)  Love the bigger screen, the ease of use, the ease of putting addresses into memory.  Nothing really bad to say about the unit with the exception of it freezing which is probably one in a million and that\'s just my luck.  I contacted the seller and within minutes of my email I received a email back with instructions for an exchange! VERY impressed all the way around!', 'overall': 5.0, 'summary': 'Gotta have GPS!', 'unixReviewTime': 1370131200, 

## Verify Data Structure

Check the parsed record for the keys that later steps depend on:
- `reviewerID`
- `asin`
- `overall`
- `unixReviewTime`

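If all four keys are present, the file path, decompression, and JSON parsing are working as expected. Note that this only inspects the first record.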

In [3]:
# Print the keys to verify data structure
print("Expected keys present:")
expected_keys = ['reviewerID', 'asin', 'overall', 'unixReviewTime']
for key in expected_keys:
    if key in data:
        print(f"  ✓ {key}: {data[key]}")
    else:
        print(f"  ✗ {key}: MISSING")

Expected keys present:
  ✓ reviewerID: AO94DHGC771SJ
  ✓ asin: 0528881469
  ✓ overall: 5.0
  ✓ unixReviewTime: 1370131200
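The check above looks at a single record. As a minimal sketch, the same idea can be extended to the first few thousand lines to see whether any expected key is ever absent. The helper name `count_missing_keys` and the default of 1,000 records are illustrative assumptions, not part of the original notebook.

```python
import gzip
import json
from collections import Counter
from itertools import islice


def count_missing_keys(path, expected_keys, n_records=1000):
    """Count how often each expected key is absent in the first n_records lines.

    Hypothetical helper: assumes the file holds one JSON object per line.
    """
    missing = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in islice(f, n_records):
            record = json.loads(line)
            for key in expected_keys:
                if key not in record:
                    missing[key] += 1
    return missing
```

Calling `count_missing_keys(path, expected_keys)` returns an empty `Counter` when every inspected record carries all expected keys.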


# Step 5 — Understand the Data (This Is the Thinking Part)

Load the processed event data and analyze key characteristics:

In [4]:
import pandas as pd

# Load processed data
events = pd.read_csv("../data/processed/events_raw.csv", parse_dates=["timestamp"])
items = pd.read_csv("../data/processed/items_raw.csv")

print("Events data:")
print(events.head())
print("\nItems data:")
print(items.head())

Events data:
          user_id     item_id event_type                 timestamp
0   AO94DHGC771SJ  0528881469   purchase 2013-06-02 00:00:00+00:00
1   AMO214LNFCEI4  0528881469   purchase 2010-11-25 00:00:00+00:00
2  A3N7T0DY83Y4IG  0528881469   purchase 2010-09-09 00:00:00+00:00
3  A1H8PY3QHMQQA0  0528881469   purchase 2010-11-24 00:00:00+00:00
4  A24EV6RXELQZ63  0528881469   purchase 2011-09-29 00:00:00+00:00

Items data:
Empty DataFrame
Columns: [item_id, title, price, category]
Index: []
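The first event row matches the raw record inspected earlier: `unixReviewTime` 1370131200 corresponds to 2013-06-02 UTC. The sketch below shows how one raw review might map to an event row. The function name `review_to_event` is hypothetical, and treating every review as a `purchase` event is an assumption inferred from the output above, not a confirmed description of the preprocessing step.

```python
import pandas as pd


def review_to_event(record):
    """Map one raw review dict to an event row (hypothetical mapping)."""
    return {
        "user_id": record["reviewerID"],
        "item_id": record["asin"],
        # Assumption: each review is treated as a purchase event
        "event_type": "purchase",
        # Unix seconds converted to a timezone-aware UTC timestamp
        "timestamp": pd.to_datetime(record["unixReviewTime"], unit="s", utc=True),
    }


sample = {
    "reviewerID": "AO94DHGC771SJ",
    "asin": "0528881469",
    "unixReviewTime": 1370131200,
}
event = review_to_event(sample)
print(event["timestamp"])  # 2013-06-02 00:00:00+00:00
```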


## 1️⃣ Sparsity: Understanding User-Item Interactions

Interaction data in recommendation systems is inherently sparse: most users interact with only a tiny fraction of the available items.

In [5]:
# Count unique users, items, and total interactions
n_users = events["user_id"].nunique()
n_items = events["item_id"].nunique()
n_interactions = len(events)
sparsity = 1 - (n_interactions / (n_users * n_items))

print(f"Total users: {n_users:,}")
print(f"Total items: {n_items:,}")
print(f"Total interactions: {n_interactions:,}")
print(f"Sparsity: {sparsity:.4f} ({sparsity*100:.2f}%)")
print(f"\nMax possible interactions: {n_users * n_items:,}")
print(f"Actual density: {(n_interactions / (n_users * n_items))*100:.4f}%")

Total users: 155,870
Total items: 19,839
Total interactions: 500,000
Sparsity: 0.9998 (99.98%)

Max possible interactions: 3,092,304,930
Actual density: 0.0162%
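The density above uses `len(events)`, which counts a repeated (user, item) pair more than once. Whether such repeats occur in this dataset is not established here; as a precaution, a sketch like the following counts each pair once before dividing. The helper name `interaction_density` and the toy data are illustrative assumptions.

```python
import pandas as pd


def interaction_density(events):
    """Share of the user-item matrix that is filled, counting each pair once."""
    n_users = events["user_id"].nunique()
    n_items = events["item_id"].nunique()
    n_pairs = len(events.drop_duplicates(["user_id", "item_id"]))
    return n_pairs / (n_users * n_items)


# Toy data: u1 interacts with i1 twice, so only 3 of the 4 rows are unique pairs
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "item_id": ["i1", "i1", "i2", "i2"],
})
density = interaction_density(toy)
print(f"Density: {density:.2f}, sparsity: {1 - density:.2f}")
# Density: 0.75, sparsity: 0.25
```

If the events table contains no duplicate pairs, this gives the same result as the calculation above.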


In [6]:
# Analyze user interaction distribution
user_counts = events.groupby("user_id")["item_id"].count()

print("User interaction distribution:")
print(user_counts.describe())
print(f"\nMost active user has {user_counts.max():,.0f} interactions")
print(f"Median user has {user_counts.median():.0f} interactions")
print(f"Mean user has {user_counts.mean():.1f} interactions")

User interaction distribution:
count    155870.000000
mean          3.207801
std           3.518480
min           1.000000
25%           1.000000
50%           2.000000
75%           4.000000
max         275.000000
Name: item_id, dtype: float64

Most active user has 275 interactions
Median user has 2 interactions
Mean user has 3.2 interactions


## 2️⃣ Long-Tail Items: Power Law Distribution

A small fraction of items receives a disproportionate share of interactions, while the majority of items are rarely interacted with.
This is the long-tail problem that recommendation systems exist to solve.

In [7]:
# Analyze item popularity distribution
item_counts = events.groupby("item_id")["user_id"].count()

print("Item popularity distribution:")
print(item_counts.describe())

# Show top 10 items
print("\n\nTop 10 most popular items:")
print(item_counts.sort_values(ascending=False).head(10))

Item popularity distribution:
count    19839.000000
mean        25.202883
std         66.406892
min          5.000000
25%          6.000000
50%         10.000000
75%         21.000000
max       3435.000000
Name: user_id, dtype: float64


Top 10 most popular items:
item_id
B0019EHU8G    3435
B0002L5R78    2599
B000LRMS66    1960
B000QUUFRW    1890
B000VX6XL6    1556
B000S5Q9CA    1393
B000BQ7GW8    1388
B0012S4APK    1295
B00007E7JU    1279
B00004ZCJE    1258
Name: user_id, dtype: int64


In [10]:
# Quantify the long-tail effect
top_10_pct = item_counts.sort_values(ascending=False).head(int(len(item_counts) * 0.1)).sum()
total_interactions = item_counts.sum()

print(f"Top 10% of items account for {(top_10_pct/total_interactions)*100:.1f}% of interactions")

# Show item distribution (the last three buckets are mutually exclusive)
print("\nItems with just 1 interaction:", (item_counts == 1).sum())
print("Items with 1-5 interactions:", ((item_counts >= 1) & (item_counts <= 5)).sum())
print("Items with 6-20 interactions:", ((item_counts > 5) & (item_counts <= 20)).sum())
print("Items with >20 interactions:", (item_counts > 20).sum())

Top 10% of items account for 52.9% of interactions

Items with just 1 interaction: 0
Items with 1-5 interactions: 2912
Items with 6-20 interactions: 11729
Items with >20 interactions: 5198
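The top-10% calculation above can be wrapped in a small helper so that the same measure is reusable for items and for users, and at different cut-offs. This is a minimal sketch; the name `top_share` and the toy counts are assumptions for illustration only.

```python
import pandas as pd


def top_share(counts, fraction):
    """Fraction of all interactions held by the top `fraction` of entities."""
    # Always include at least one entity, even for very small inputs
    k = max(1, int(len(counts) * fraction))
    return counts.nlargest(k).sum() / counts.sum()


# Toy interaction counts for 10 items, summing to 100
toy_counts = pd.Series([50, 20, 10, 5, 5, 3, 3, 2, 1, 1])
print(f"Top 10% share: {top_share(toy_counts, 0.1):.2f}")  # 0.50
print(f"Top 50% share: {top_share(toy_counts, 0.5):.2f}")  # 0.90
```

On the real data, `top_share(item_counts, 0.1)` should reproduce the 52.9% figure, since it selects the same number of top items.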


## 3️⃣ User Skew: Power Users Can Dominate

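A small number of highly active users can contribute a disproportionate share of the interaction signal.
In this sample the aggregate effect is modest (the top 10 users account for 0.3% of all interactions), but the most active user has 275 interactions against a median of 2, so per-user normalization is still worth applying in later modeling steps.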

In [11]:
# Analyze top users
print("Top 10 most active users:")
print(user_counts.sort_values(ascending=False).head(10))

print(f"\n\nTop 10 users account for {(user_counts.sort_values(ascending=False).head(10).sum() / user_counts.sum() * 100):.1f}% of all interactions")

# User distribution (the last three buckets are mutually exclusive)
print("\nUsers with just 1 interaction:", (user_counts == 1).sum())
print("Users with 1-5 interactions:", ((user_counts >= 1) & (user_counts <= 5)).sum())
print("Users with 6-20 interactions:", ((user_counts > 5) & (user_counts <= 20)).sum())
print("Users with >20 interactions:", (user_counts > 20).sum())

Top 10 most active users:
user_id
A5JLAU2ARJ0BO     275
A6FIAB28IS79      223
A3OXHLG6DIBRW8    170
A680RUE1FDO8B     159
A231WM2Z2JL0U3    159
A17BUUBOU0598B    129
ADLVFFE4VBT8      114
A250AXLRBVYKB4    107
A1F9Z42CFF9IAY    106
A1ODOGXEYECQQ8    105
Name: item_id, dtype: int64


Top 10 users account for 0.3% of all interactions

Users with just 1 interaction: 45069
Users with 1-5 interactions: 136742
Users with 6-20 interactions: 18377
Users with >20 interactions: 751
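The normalization mentioned above can take several forms, and the notebook does not say which one is used later. One simple option, shown here purely as an illustration, is to weight each event by the inverse of its user's event count so that every user contributes a total weight of 1. The helper name `add_user_weights` and the `weight` column are assumptions.

```python
import pandas as pd


def add_user_weights(events):
    """Weight each event by 1 / (number of events from the same user)."""
    out = events.copy()
    per_user = out.groupby("user_id")["item_id"].transform("count")
    out["weight"] = 1.0 / per_user
    return out


# Toy data: one heavy user with four events, one light user with a single event
toy = pd.DataFrame({
    "user_id": ["heavy"] * 4 + ["light"],
    "item_id": ["i1", "i2", "i3", "i4", "i1"],
})
weighted = add_user_weights(toy)
print(weighted.groupby("user_id")["weight"].sum())
```

With this weighting, both toy users sum to a weight of 1.0 regardless of how many events they generated. Gentler schemes, such as dividing by the square root of the count or capping events per user, trade off differently between fairness across users and retaining signal from active ones.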
