## TensorFlow Amazon Watch Reviews Dataset
* https://www.tensorflow.org/datasets/catalog/amazon_us_reviews#amazon_us_reviewswatches_v1_00 (the dataset I originally wanted to use, but access is no longer available)
* https://amazon-reviews-2023.github.io/
    * I will use this dataset and filter it down so that only watch reviews remain
* This dataset was chosen as a fallback so the LLM project could proceed within the limited timeframe
* While the Watch U Seek reviews were scraped successfully, a look at the CSV file shows that extensive preprocessing would still be needed:
    * Clean up the captured text
    * Remove reviews of watch-related items that are not actual watches
    * Remove reviews that are only links to video reviews
    * Write a function to extract the name of the watch model being reviewed (or do that manually for the 3501 reviews scraped)
* Due to full-time work obligations, I cannot dedicate that much time to this project; I have therefore included the notebooks with my Watch U Seek scraper and a first look at the resulting dataset to show how this project could evolve given more time.
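
Two of the preprocessing steps listed above (cleaning the captured text and dropping link-only reviews) could be sketched roughly as follows. This is only an illustration: the helper names, the regular expressions, the `min_words` threshold, and the sample strings are all assumptions, not part of the actual scraper.

```python
import html
import re

# Hypothetical helpers; names, patterns, and thresholds are illustrative assumptions.
URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"<[^>]+>")

def clean_review_text(text):
    """Unescape HTML entities, strip tags, and collapse whitespace."""
    text = html.unescape(text)
    text = TAG_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

def is_link_only(text, min_words=5):
    """True when fewer than min_words words remain once URLs are removed."""
    remainder = URL_RE.sub(" ", text)
    return len(remainder.split()) < min_words

# Invented sample reviews, not taken from the scraped data
raw_reviews = [
    "Great dial &amp; bracelet,<br />keeps good time and the lume is strong.",
    "Video review: https://example.com/watch?v=abc",
]
cleaned = [clean_review_text(r) for r in raw_reviews]
kept = [r for r in cleaned if not is_link_only(r)]
print(kept)
```

Extracting the watch model name is a harder problem and is deliberately left out of this sketch.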

In [1]:
import pandas as pd
import numpy as np
import gzip
import json
import csv
from tqdm import tqdm

In [4]:
#checking categories to see how watches show up
with gzip.open('meta_Clothing_Shoes_and_Jewelry.jsonl.gz', "rt", encoding="utf-8") as f:
    for _ in range(100):
        product = json.loads(next(f))
        print(product.get("categories", []))

['Clothing, Shoes & Jewelry', 'Women', 'Clothing', 'Swimsuits & Cover Ups', 'Cover-Ups']
['Clothing, Shoes & Jewelry', 'Women', 'Shoes', 'Outdoor', 'Hiking & Trekking', 'Hiking Shoes']
['Clothing, Shoes & Jewelry', 'Women', 'Shoes', 'Sandals', 'Flats']
['Clothing, Shoes & Jewelry', 'Novelty & More', 'Clothing', 'Novelty', 'Women', 'Skirts']
['Clothing, Shoes & Jewelry', 'Women', 'Handbags & Wallets', 'Crossbody Bags']
['Clothing, Shoes & Jewelry', 'Women', 'Jewelry', 'Rings']
['Clothing, Shoes & Jewelry', 'Women', 'Shoes', 'Loafers & Slip-Ons']
['Clothing, Shoes & Jewelry', 'Novelty & More', 'Clothing', 'Novelty', 'Boys', 'Accessories', 'Gloves & Mittens']
['Clothing, Shoes & Jewelry', 'Women', 'Clothing', 'Dresses', 'Formal']
['Clothing, Shoes & Jewelry', 'Luggage & Travel Gear', 'Travel Accessories']
['Clothing, Shoes & Jewelry', 'Women', 'Clothing', 'Dresses', 'Casual']
['Clothing, Shoes & Jewelry', 'Women', 'Clothing', 'Active', 'Active Shorts']
['Clothing, Shoes & Jewelry', 'Men',

In [22]:
#jsonl.gz file (meta)
watch_meta = {}

def is_watch_product(product):
    """Return True if the product metadata looks watch-related."""
    #'watch' in main_category
    main_cat=(product.get("main_category") or "").lower()
    if "watch" in main_cat:
        return True

    #'watch' in categories (lowest level; end of list)
    categories = product.get("categories", [])
    if isinstance(categories, list) and categories:
        last_category = categories[-1]
        if "watch" in last_category.lower():
            return True

    #no watch-related category found
    return False

with gzip.open('meta_Clothing_Shoes_and_Jewelry.jsonl.gz', "rt", encoding="utf-8") as meta_file:
    for line in tqdm(meta_file, desc="Filtering metadata"):
        product=json.loads(line) #loading product meta data
        if is_watch_product(product):
            parent_asin=product.get("parent_asin")
            if parent_asin:
                watch_meta[parent_asin]=product  

print(f'{len(watch_meta)} watch-related products.')

Filtering metadata: 7218481it [01:31, 78969.00it/s]

173763 watch-related products.
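
As a quick sanity check, the filter can be run on a few hand-made records. The sample dictionaries below are invented for illustration (only the category strings come from outputs in this notebook), and the function is repeated with an explicit `return False` so the block runs on its own. Because only the last category is inspected, accessories such as watch bands pass the filter as well.

```python
def is_watch_product(product):
    """Same logic as the notebook cell, repeated so this block is self-contained."""
    main_cat = (product.get("main_category") or "").lower()
    if "watch" in main_cat:
        return True
    categories = product.get("categories", [])
    if isinstance(categories, list) and categories:
        if "watch" in categories[-1].lower():
            return True
    return False

# Invented sample records for illustration only
samples = [
    {"main_category": "AMAZON FASHION",
     "categories": ["Clothing, Shoes & Jewelry", "Men", "Watches", "Wrist Watches"]},
    {"main_category": "AMAZON FASHION",
     "categories": ["Clothing, Shoes & Jewelry", "Women", "Watches", "Watch Bands"]},
    {"main_category": "AMAZON FASHION",
     "categories": ["Clothing, Shoes & Jewelry", "Women", "Jewelry", "Rings"]},
    {"main_category": None, "categories": []},
]

flags = [is_watch_product(p) for p in samples]
for product, flag in zip(samples, flags):
    print(flag, product["categories"])
```

The second record shows why the resulting set is "watch-related" rather than strictly watches: a stricter filter would need to exclude accessory categories explicitly.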





In [23]:
#opening jsonl.gz file (reviews)
reviews=[]
watch_parent_asins=set(watch_meta.keys()) #create a set that has watch parent asins; set to help processing speed

with gzip.open('Clothing_Shoes_and_Jewelry.jsonl.gz', "rt", encoding="utf-8") as f:
    for line in f:
        review=json.loads(line)
        if review.get("parent_asin") in watch_parent_asins:
            reviews.append(review)

In [24]:
#combining reviews and meta information
full_ds=[]

for review in reviews:
    parent_asin=review.get("parent_asin")
    product_info=watch_meta.get(parent_asin)

    if product_info:
        #combine review columns + meta columns
        merged={**{f"review_{k}": v for k, v in review.items()},
                  **{f"meta_{k}": v for k, v in product_info.items()}}
        full_ds.append(merged)


In [25]:
#csv; created a new file so previous steps do not need to be repeated

#keys
all_keys = set()
for row in full_ds:
    all_keys.update(row.keys())

fieldnames=sorted(all_keys)

with open('amazon_watch_reviews.csv', "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(full_ds)
    print(f"{len(full_ds)} rows in CSV.") #result = 7.99 GB CSV file


2015895 rows in CSV.


In [26]:
#loading as pandas dataframe
df=pd.read_csv('amazon_watch_reviews.csv')
df.head()



| | meta_author | meta_average_rating | meta_bought_together | meta_categories | meta_description | meta_details | meta_features | meta_images | meta_main_category | meta_parent_asin | ... | review_asin | review_helpful_vote | review_images | review_parent_asin | review_rating | review_text | review_timestamp | review_title | review_user_id | review_verified_purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | 4.5 | | ['Clothing, Shoes & Jewelry', 'Women', 'Watche... | ['Speidel Petite Scrub Watch for Nurse, Doctor... | {'Product Dimensions': '7.1 x 1.1 x 0.31 inche... | ['PERFECT FOR NURSES, DOCTORS, EMT, SURGEONS &... | [{'thumb': 'https://m.media-amazon.com/images/... | AMAZON FASHION | B07HJ84J9M | ... | B07HJ9DJGQ | 0 | [] | B07HJ84J9M | 5.0 | Love this watch! It's perfectly petite! It's r... | 1624587865782 | Great watch for nurses | AGGZ357AO26RQZVRLGU4D4N52DZQ | True |
| 1 | | 4.4 | | ['Clothing, Shoes & Jewelry', 'Women', 'Watches'] | ["The Casio Women's Digital White Resin Strap ... | {'Is Discontinued By Manufacturer': 'No', 'Pro... | ['Imported', 'Round sport watch featuring pink... | [{'thumb': 'https://m.media-amazon.com/images/... | AMAZON FASHION | B001UHMUXC | ... | B001UHMUXC | 0 | [] | B001UHMUXC | 5.0 | Does its job nicely for work - Nursing! | 1452650548000 | Five Stars | AGKASBHYZPGTEPO6LWZPVJWB2BVA | True |
| 2 | | 4.7 | | ['Clothing, Shoes & Jewelry', 'Men', 'Watches'... | [] | {'Is Discontinued By Manufacturer': 'No', 'Pro... | ['Timeless styling that effortlessly takes you... | [{'thumb': 'https://m.media-amazon.com/images/... | AMAZON FASHION | B0BYT41CVB | ... | B00PXVUKBK | 0 | [] | B0BYT41CVB | 5.0 | Got this for my boyfriend for Christmas, he lo... | 1578527465894 | Nice watch, good gift | AGBFYI2DDIKXC5Y4FARTYDTQBMFQ | True |
| 3 | | 4.7 | | ['Clothing, Shoes & Jewelry', 'Women', 'Watche... | [] | {'Is Discontinued By Manufacturer': 'No', 'Pro... | ['Timeless styling that effortlessly takes you... | [{'thumb': 'https://m.media-amazon.com/images/... | AMAZON FASHION | B09ZL4W5T8 | ... | B004JKDLVW | 0 | [] | B09ZL4W5T8 | 5.0 | I love this watch and wear it often, absolutel... | 1535058677377 | pretty watch, well made | AGBFYI2DDIKXC5Y4FARTYDTQBMFQ | True |
| 4 | | 4.2 | | ['Clothing, Shoes & Jewelry', 'Women', 'Watche... | ['Peugeot watches have been redefining fashion... | {'Is Discontinued By Manufacturer': 'No', 'Pac... | ['Imported', "The perfect women's watch for sm... | [{'thumb': 'https://m.media-amazon.com/images/... | AMAZON FASHION | B01D8MNCPG | ... | B0016OOXMA | 1 | [] | B01D8MNCPG | 5.0 | I bought this watch to wear to an interview, a... | 1481817389000 | and it's perfect for that | AGBFYI2DDIKXC5Y4FARTYDTQBMFQ | True |


In [27]:
#checking that file is capturing mainly watch/watch related content
df['meta_categories'].unique()

array(["['Clothing, Shoes & Jewelry', 'Women', 'Watches', 'Wrist Watches']",
       "['Clothing, Shoes & Jewelry', 'Women', 'Watches']",
       "['Clothing, Shoes & Jewelry', 'Men', 'Watches', 'Wrist Watches']",
       "['Clothing, Shoes & Jewelry', 'Women', 'Watches', 'Watch Bands']",
       "['Clothing, Shoes & Jewelry', 'Shoe, Jewelry & Watch Accessories', 'Watch Accessories', 'Pocket Watch Chains']",
       "['Clothing, Shoes & Jewelry', 'Men', 'Watches', 'Watch Bands']",
       "['Clothing, Shoes & Jewelry', 'Novelty & More', 'Watches']",
       "['Clothing, Shoes & Jewelry', 'Boys', 'Watches', 'Wrist Watches']",
       '[\'Clothing, Shoes & Jewelry\', "Men\'s Watches Under $50"]',
       "['Clothing, Shoes & Jewelry', 'Jewelry & Watches Outlet']",
       "['Clothing, Shoes & Jewelry', 'Men', 'Watches', 'Smartwatches']",
       "['Clothing, Shoes & Jewelry', 'Girls', 'Watches', 'Wrist Watches']",
       "['Clothing, Shoes & Jewelry', 'Shoe, Jewelry & Watch Accessories', 'Watch Acc

In [28]:
df.columns

Index(['meta_author', 'meta_average_rating', 'meta_bought_together',
       'meta_categories', 'meta_description', 'meta_details', 'meta_features',
       'meta_images', 'meta_main_category', 'meta_parent_asin', 'meta_price',
       'meta_rating_number', 'meta_store', 'meta_subtitle', 'meta_title',
       'meta_videos', 'review_asin', 'review_helpful_vote', 'review_images',
       'review_parent_asin', 'review_rating', 'review_text',
       'review_timestamp', 'review_title', 'review_user_id',
       'review_verified_purchase'],
      dtype='object')

I want to keep the following columns:
* meta_average_rating
* meta_description
* meta_details
* meta_features
* meta_price
* meta_title
* meta_subtitle
* review_rating
* review_text
* review_title
* review_verified_purchase
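
Since only these columns are needed, one option is to select them while reading the CSV rather than after loading all 26 columns. The sketch below uses the `usecols` parameter of `pd.read_csv` on a tiny in-memory stand-in for `amazon_watch_reviews.csv`; the one-row sample and the `KEEP_COLUMNS` name are assumptions for illustration.

```python
import io
import pandas as pd

KEEP_COLUMNS = [
    "meta_title", "meta_subtitle", "meta_average_rating", "meta_description",
    "meta_details", "meta_features", "meta_price",
    "review_rating", "review_title", "review_text", "review_verified_purchase",
]

# Build a one-row stand-in CSV that also contains columns we do not want
extra_columns = ["meta_images", "review_user_id"]
row = {col: "x" for col in KEEP_COLUMNS + extra_columns}
buffer = io.StringIO()
pd.DataFrame([row]).to_csv(buffer, index=False)
buffer.seek(0)

# usecols limits which columns are parsed; the trailing selection fixes the
# column order, because read_csv keeps the order found in the file
df_small = pd.read_csv(buffer, usecols=KEEP_COLUMNS)[KEEP_COLUMNS]
print(df_small.columns.tolist())
```

With the real file, the same call would take the CSV path instead of `buffer`; parsing fewer columns generally lowers memory use for a file of this size.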

In [29]:
df=df[['meta_title','meta_subtitle','meta_average_rating','meta_description','meta_details','meta_features','meta_price',
                    'review_rating','review_title','review_text','review_verified_purchase']]
df.head()

| | meta_title | meta_subtitle | meta_average_rating | meta_description | meta_details | meta_features | meta_price | review_rating | review_title | review_text | review_verified_purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Speidel Petite Scrub Watch™ for Nurse, Doctors... | | 4.5 | ['Speidel Petite Scrub Watch for Nurse, Doctor... | {'Product Dimensions': '7.1 x 1.1 x 0.31 inche... | ['PERFECT FOR NURSES, DOCTORS, EMT, SURGEONS &... | 49.0 | 5.0 | Great watch for nurses | Love this watch! It's perfectly petite! It's r... | True |
| 1 | Casio Women's LW200-7AV Digital Watch with Whi... | | 4.4 | ["The Casio Women's Digital White Resin Strap ... | {'Is Discontinued By Manufacturer': 'No', 'Pro... | ['Imported', 'Round sport watch featuring pink... | 19.42 | 5.0 | Five Stars | Does its job nicely for work - Nursing! | True |
| 2 | Citizen Men's Classic Eco-Drive Leather Strap ... | | 4.7 | [] | {'Is Discontinued By Manufacturer': 'No', 'Pro... | ['Timeless styling that effortlessly takes you... | 163.9 | 5.0 | Nice watch, good gift | Got this for my boyfriend for Christmas, he lo... | True |
| 3 | Citizen Eco-Drive Chandler Womens Watch, Stain... | | 4.7 | [] | {'Is Discontinued By Manufacturer': 'No', 'Pro... | ['Timeless styling that effortlessly takes you... | 182.96 | 5.0 | pretty watch, well made | I love this watch and wear it often, absolutel... | True |
| 4 | Peugeot Women's Petite Round Wrist Watch with ... | | 4.2 | ['Peugeot watches have been redefining fashion... | {'Is Discontinued By Manufacturer': 'No', 'Pac... | ['Imported', "The perfect women's watch for sm... | 52.99 | 5.0 | and it's perfect for that | I bought this watch to wear to an interview, a... | True |


In [30]:
df.isnull().sum()

meta_title                      230
meta_subtitle               2015422
meta_average_rating               0
meta_description                  0
meta_details                      0
meta_features                     0
meta_price                   956518
review_rating                     0
review_title                    492
review_text                    1009
review_verified_purchase          0
dtype: int64
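
The counts above show `meta_subtitle` is empty in almost every row and a small number of rows have no `review_text`. One possible way to act on that, shown on invented toy data, is to drop columns whose null share exceeds a threshold and then drop rows without review text. The 0.9 threshold and the toy values are assumptions, not decisions made in this notebook.

```python
import pandas as pd

# Invented toy frame mimicking the null pattern seen above
toy = pd.DataFrame({
    "meta_subtitle": [None, None, None, None],
    "meta_price": [49.0, None, 19.42, None],
    "review_text": ["Love it", "Works", None, "Nice"],
})

null_share = toy.isnull().mean()   # fraction of nulls per column
MAX_NULL_SHARE = 0.9               # assumed cut-off for "mostly empty"
mostly_empty = null_share[null_share > MAX_NULL_SHARE].index.tolist()

# Drop mostly-empty columns, then rows that have no review text
trimmed = toy.drop(columns=mostly_empty).dropna(subset=["review_text"])
print(mostly_empty)
print(trimmed.shape)
```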

In [31]:
#double checking that there are fewer unique titles than rows (watches should have multiple reviews; thus title values should repeat)
len(df['meta_title'].unique())

158899

In [32]:
#grouping reviews by watch model
reviews_grouped=df.groupby("meta_title")["review_text"].apply(list).reset_index()
reviews_grouped['review_text'][0]

['Cutest watch ever , bought for myself, little small but i can still wear. Love it.']

In [33]:
reviews_grouped['review_text'][1]

['Two kids toy way too expensive for that']

In [34]:
reviews_grouped['review_text'][2]

["I really like this watch. It is very quirky. It is mechanical and must be wound daily. It is fairly accurate. Setting the date is a hassle but there are ways to do it(see YouTube). It is bullet-proof. Not Casio bullet-proof but its tough. Put it on a NATO strap and its good to go.<br />3 stars because I really haven't had it long enough, but I really like it.<br />Prob gonna get an Amphibian next.",
 'I like this as I do all 8 V/Bostoc of the company. A tank of a watch !',
 "Very nice watch, but it doesn't work properly",
 "I love the watch. It's quirky time-piece that catches the eyes of many. The story is a great one, especially for a sub-$100 watch. I had to change the strap because it was utter garbage but once the strap was changed to a quality leather, it wears great. I would recommend this to anyone who wants a watch with a story but is on a budget.",
 "Beautiful watch didn't know its manual winding nice watch and good seller thanks",
 'Awesome watch, shipping was much faster 

In [35]:
len(reviews_grouped)

158898
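
The one-row gap between the 158899 unique titles and the 158898 groups is consistent with pandas defaults: `unique()` counts NaN as a value, while `groupby` drops NaN keys unless `dropna=False` is passed. Since 230 rows have a null `meta_title`, that is the likely explanation here. The toy frame below is invented to illustrate the behavior.

```python
import numpy as np
import pandas as pd

# Invented toy data: two titled watches plus one review with a missing title
toy = pd.DataFrame({
    "meta_title": ["Watch A", "Watch A", "Watch B", np.nan],
    "review_text": ["good", "fine", "ok", "no title"],
})

n_unique = len(toy["meta_title"].unique())    # NaN counted as a value
n_nunique = toy["meta_title"].nunique()       # NaN skipped by default
n_groups = toy.groupby("meta_title").ngroups  # NaN keys dropped by default
n_groups_with_nan = toy.groupby("meta_title", dropna=False).ngroups

print(n_unique, n_nunique, n_groups, n_groups_with_nan)
```

If reviews without a product title should be kept as their own group, `dropna=False` would retain them; otherwise they are silently excluded from `reviews_grouped`.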