##### 00 - Project config and first data check

This notebook sets up project paths, imports core utilities, defines a shared text normalization function, and loads a small sample JSON file to confirm the raw data structure.

In [1]:
# Import standard libraries and add the src folder to Python path

from pathlib import Path
import sys

# Locate the project root. This assumes the notebook is run from the notebooks/
# folder; if the working directory is different, this path will be wrong.
PROJECT_ROOT = Path.cwd().parent

# Add src folder so we can import our own modules
SRC_DIR = PROJECT_ROOT / "src"
if str(SRC_DIR) not in sys.path:
    sys.path.append(str(SRC_DIR))

PROJECT_ROOT, SRC_DIR

(PosixPath('/Users/ramana/Documents/Homework/1st class ML opt/Project 1'),
 PosixPath('/Users/ramana/Documents/Homework/1st class ML opt/Project 1/src'))

##### Import project config and text normalization

This cell locates the `src` folder by checking a few likely locations relative to the notebook, then imports the shared configuration (paths) and the text normalization function so we can reuse them in all notebooks.

In [6]:
# Make sure the `src` folder is on Python's import path,
# then import project paths and the text normalization function.

from pathlib import Path
import sys

# Try a few possible locations for the project root, relative to this notebook
cwd = Path.cwd()

possible_roots = [
    cwd,                # e.g. if the notebook runs from the project root
    cwd.parent,         # e.g. if the notebook runs from notebooks/
    cwd.parent.parent,  # e.g. if the notebook runs from a subfolder of notebooks/
]

SRC_DIR = None
for root in possible_roots:
    candidate = root / "src"
    if candidate.is_dir():
        SRC_DIR = candidate
        if str(SRC_DIR) not in sys.path:
            sys.path.append(str(SRC_DIR))
        break

if SRC_DIR is None:
    raise FileNotFoundError(
        "Could not find a 'src' folder near this notebook. "
        "Please make sure you have 'src/config.py' and 'src/preprocessing.py' "
        "at the project root."
    )

# Now import from our project modules
from config import (
    PROJECT_ROOT,
    RAW_DATA_DIR,
    LABELS_DIR,
    INTERIM_DATA_DIR,
    PROCESSED_DATA_DIR,
    EXPORT_GLOB_PATTERN,
)
from preprocessing import normalize_text

PROJECT_ROOT, RAW_DATA_DIR, LABELS_DIR

(PosixPath('/Users/ramana/Documents/Homework/1st class ML opt/Project 1/Product-Classifcation'),
 PosixPath('/Users/ramana/Documents/Homework/1st class ML opt/Project 1/Product-Classifcation/data/raw'),
 PosixPath('/Users/ramana/Documents/Homework/1st class ML opt/Project 1/Product-Classifcation/data/labels'))
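
For readers who do not have `src/config.py` at hand, the layout implied by the output above can be sketched as follows. This is only an illustration, not the project's actual module: the `build_paths` helper is hypothetical, the `data/interim` and `data/processed` locations are assumed by analogy with `data/raw` and `data/labels`, and the value of `EXPORT_GLOB_PATTERN` is a guess based on the example file names in this notebook.

```python
from pathlib import Path
from typing import Dict


def build_paths(project_root: Path) -> Dict[str, Path]:
    """Hypothetical helper: derive the shared data folders from the project root."""
    data_dir = project_root / "data"
    return {
        "PROJECT_ROOT": project_root,
        "RAW_DATA_DIR": data_dir / "raw",              # matches the output above
        "LABELS_DIR": data_dir / "labels",             # matches the output above
        "INTERIM_DATA_DIR": data_dir / "interim",      # assumed location
        "PROCESSED_DATA_DIR": data_dir / "processed",  # assumed location
    }


# Assumed pattern: one folder per monthly export, JSON part files inside it.
EXPORT_GLOB_PATTERN = "export_shopper=*/*.json"

paths = build_paths(Path.cwd())
print(paths["RAW_DATA_DIR"])
```

Keeping every path in one module means each notebook imports the same constants instead of rebuilding them from its own working directory.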

##### Find a sample JSON file

This cell lists every JSON file in the raw data folder that matches the expected pattern (for example `export_shopper=AUG-24/0000_part_00.json`) so we can pick one and inspect a small sample of rows.

In [7]:
from typing import List

# Find candidate JSON files matching the pattern
json_files: List[Path] = list(RAW_DATA_DIR.glob(EXPORT_GLOB_PATTERN))

len(json_files), json_files[:3]

(10,
 [PosixPath('/Users/ramana/Documents/Homework/1st class ML opt/Project 1/Product-Classifcation/data/raw/export_shopper=SEP-24/0000_part_00.json'),
  PosixPath('/Users/ramana/Documents/Homework/1st class ML opt/Project 1/Product-Classifcation/data/raw/export_shopper=JUL-24/0000_part_00.json'),
  PosixPath('/Users/ramana/Documents/Homework/1st class ML opt/Project 1/Product-Classifcation/data/raw/export_shopper=AUG-24/0000_part_00.json')])
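
The files above come back in no particular order (SEP, JUL, AUG) because `Path.glob` does not guarantee one. If the "first" file should be the same on every machine, one option is to sort the matches. The sketch below shows this on a throwaway folder; the folder names mimic the ones above and the glob pattern is an assumed stand-in for `EXPORT_GLOB_PATTERN`.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp)

    # Build a tiny fake raw-data folder with three monthly exports.
    for month in ("SEP-24", "JUL-24", "AUG-24"):
        part_dir = raw_dir / f"export_shopper={month}"
        part_dir.mkdir()
        (part_dir / "0000_part_00.json").write_text("{}\n")

    pattern = "export_shopper=*/*.json"  # assumed pattern, for illustration only

    # sorted() makes the listing deterministic across machines and runs.
    json_files = sorted(raw_dir.glob(pattern))
    folder_names = [p.parent.name for p in json_files]

print(folder_names)
```

Note that the sort is alphabetical by path, not chronological, so `AUG-24` comes before `JUL-24`.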

##### Load a small sample from the first JSON file

This cell reads only the first chunk (up to 5,000 rows) of the first JSON file and displays its first five rows so we can confirm that the `remove_amazon` column exists and contains product names.

In [8]:
import pandas as pd

if not json_files:
    raise FileNotFoundError(
        f"No JSON files found with pattern {EXPORT_GLOB_PATTERN!r} in {RAW_DATA_DIR}"
    )

sample_json_path = json_files[0]
print("Using sample JSON file:", sample_json_path)

# Read the JSON Lines file lazily in chunks of 5,000 rows; only the first chunk is used
json_iter = pd.read_json(
    sample_json_path,
    lines=True,
    chunksize=5000,
)

sample_df = next(json_iter)
sample_df.head()

Using sample JSON file: /Users/ramana/Documents/Homework/1st class ML opt/Project 1/Product-Classifcation/data/raw/export_shopper=SEP-24/0000_part_00.json


| | event_id | panelist_id | event_name | event_type | start_time_local | end_time_local | search_term | page_view_id | product_id | remove_amazon | purchase_price | purchase_quantity | retailer_property_name | currency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b4ec4cf59c187896cd5e28072da84819 | a9f388b6-472c-4c9e-b133-a3a0c9ea9771 | retailer_event | Product Detail | 2024-09-27 18:46:50.509 | 2024-09-27 18:46:51.53 | | | | [""Hey Dude Women's Wendy Loafer"",""Hey Dude ... | | | Amazon | USD |
| 1 | 0f8875a93bd5b023d603d106199ec2cd | dc70738f-dfcd-415e-b3fe-3f90bfd4a0a9 | retailer_event | search | 2024-09-16 03:49:52.126 | 2024-09-16 03:49:55.478 | rear view mirrors for bike | | | | | | Amazon | USD |
| 2 | 4035c0e33081d786fbc803dc7b23a2bc | 3fdbd16e-64fd-4716-80ff-2c2ac61400bb | retailer_event | search | 2024-09-15 16:40:05.502 | 2024-09-15 16:40:06.119 | silicone spatula | | | | | | Amazon | USD |
| 3 | d898bf4c3d40688838b570df04174682 | 3f9ea1db-c342-4ab9-b393-459d3c9fa1d2 | retailer_event | Basket View | 2024-09-14 20:39:24.67 | 2024-09-14 20:39:28.303 | | | | RAISEVERN Toddler Girl Jumpsuit Baby Girl Romp... | 8.99 | 1.0 | Amazon | USD |
| 4 | ec256bdaca419761f0af3b0a5652989d | 17222567-d85e-4c37-bc45-fa94b30b6189 | retailer_event | Basket View | 2024-09-16 13:58:34.985 | 2024-09-16 13:58:36.423 | | | | simplehuman 9 oz. Touch-Free Rechargeable Sens... | 41.0 | 1.0 | Amazon | USD |
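
The partial read used above can be tried in isolation on a small synthetic file. The sketch below writes ten JSON Lines records to a temporary file and reads only the first four, once with `chunksize` (as in this notebook) and once with `nrows`, which is another way to cap the number of rows read. The file name and records are made up for illustration; both options require `lines=True`.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.json"

    # Synthetic JSON Lines file: one JSON object per line.
    rows = [{"event_id": i, "remove_amazon": f"product {i}"} for i in range(10)]
    path.write_text("\n".join(json.dumps(r) for r in rows) + "\n")

    # Option A: chunked reader; the context manager closes the file handle.
    with pd.read_json(path, lines=True, chunksize=4) as reader:
        first_chunk = next(iter(reader))

    # Option B: read only the first rows in a single call.
    head_df = pd.read_json(path, lines=True, nrows=4)

print(first_chunk.shape, head_df.shape)
```

The chunked reader is the better fit when later chunks will also be processed; `nrows` is shorter when only a peek at the top of the file is needed.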


##### Add raw and normalized product text columns

This cell creates two new columns:

- `product_text_raw`: the original product name from the JSON (`remove_amazon`).
- `product_text_norm`: the normalized text using our shared normalization function.

Comparing the two columns side by side lets us confirm that the normalization behaves as expected on real data.

In [9]:
# Fail early if the expected product-name column is missing
if "remove_amazon" not in sample_df.columns:
    raise KeyError("Expected column 'remove_amazon' not found in sample JSON data.")

# Copy the original product name into a clearer column, then normalize it
sample_df = sample_df.copy()
sample_df["product_text_raw"] = sample_df["remove_amazon"]
sample_df["product_text_norm"] = sample_df["product_text_raw"].apply(normalize_text)

sample_df[["product_text_raw", "product_text_norm"]].head(10)

| | product_text_raw | product_text_norm |
|---|---|---|
| 0 | [""Hey Dude Women's Wendy Loafer"",""Hey Dude ... | hey dude women s wendy loafer hey dude women s... |
| 1 | | |
| 2 | | |
| 3 | RAISEVERN Toddler Girl Jumpsuit Baby Girl Romp... | raisevern toddler girl jumpsuit baby girl romp... |
| 4 | simplehuman 9 oz. Touch-Free Rechargeable Sens... | simplehuman 9 oz touch free rechargeable senso... |
| 5 | Besolor Women Plus Size Cotton Linen Wide Leg ... | besolor women plus size cotton linen wide leg ... |
| 6 | ["""",""""] | |
| 7 | ERKEISEHN Bluetooth Speaker, IPX7 Waterproof W... | erkeisehn bluetooth speaker ipx7 waterproof wi... |
| 8 | MYLUCKYTAG NFC & QR Code Smart Pet ID Tag Pers... | myluckytag nfc qr code smart pet id tag person... |
| 9 | Slim Fit Full Coverage Long Sleeve Swimming Su... | slim fit full coverage long sleeve swimming su... |
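
The source of `normalize_text` is not shown in this notebook, but the output above suggests roughly this behavior: lowercase the text, replace punctuation with spaces, collapse whitespace, and map missing values to an empty string. A minimal sketch that reproduces those rows follows. The name `normalize_text_sketch` is hypothetical, and the real function in `src/preprocessing.py` may differ (for example in how it treats non-ASCII letters, which this sketch drops).

```python
import math
import re

# Any run of characters other than lowercase ASCII letters and digits.
_NON_ALNUM = re.compile(r"[^a-z0-9]+")


def normalize_text_sketch(value) -> str:
    """Hypothetical stand-in for normalize_text, inferred from the output above."""
    # Treat None and NaN (pandas' missing-value marker) as empty text.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    text = str(value).lower()
    # Replace each run of non-alphanumeric characters with a single space.
    return _NON_ALNUM.sub(" ", text).strip()


print(normalize_text_sketch("simplehuman 9 oz. Touch-Free"))
print(normalize_text_sketch("MYLUCKYTAG NFC & QR Code"))
```

Under these assumptions, a value made only of brackets, quotes, and commas (such as row 6 above) normalizes to an empty string, which matches the table.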
