<a href="https://colab.research.google.com/github/tanvikurade/JASON-TO-CSV-CONVERSION/blob/main/find_product_by_name_and_align_with_reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
This Colab notebook shows how to **find target products** in the provided metadata, for example by matching on product titles.

It also shows how to **align products with their reviews** and estimate how long each product has been on the market from the dates of its reviews.

In [None]:
import os
import json
import gzip
import pandas as pd
from urllib.request import urlopen

import random
import numpy as np
from tqdm.notebook import tqdm  # tqdm.tqdm_notebook is deprecated
from collections import defaultdict

In [None]:
!wget http://deepyeti.ucsd.edu/jianmo/amazon/sample/sample_meta_Home_and_Kitchen.json
!wget http://deepyeti.ucsd.edu/jianmo/amazon/sample/sample_Home_and_Kitchen_5.json

--2020-08-05 18:45:02--  http://deepyeti.ucsd.edu/jianmo/amazon/sample/sample_meta_Home_and_Kitchen.json
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9741929 (9.3M) [application/json]
Saving to: ‘sample_meta_Home_and_Kitchen.json’


2020-08-05 18:45:04 (7.22 MB/s) - ‘sample_meta_Home_and_Kitchen.json’ saved [9741929/9741929]

--2020-08-05 18:45:05--  http://deepyeti.ucsd.edu/jianmo/amazon/sample/sample_Home_and_Kitchen_5.json
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29033132 (28M) [application/json]
Saving to: ‘sample_Home_and_Kitchen_5.json’


2020-08-05 18:45:20 (1.83 MB/s) - ‘sample_Home_and_Kitchen_5.json’ saved [29033132/29033132]



In [None]:
# load all metadata (the file holds one JSON object per line)
data = []
with open('sample_meta_Home_and_Kitchen.json', 'r') as f:
    for line in tqdm(f):
        data.append(json.loads(line))

In [None]:
# show data
print(data[0])

{'category': ['Home & Kitchen', 'Vacuums & Floor Care'], 'description': ['Eureka Replacement Vacuum Belt'], 'title': 'Eureka 54312-12 Vacuum Cleaner Belt', 'brand': 'Eureka', 'feature': ['Limit 1 per order', 'Returns will not be honored on this closeout item'], 'rank': '>#1,098,930 in Home & Kitchen (See Top 100 in Home & Kitchen)>#17,327 in Home & Kitchen > Vacuums & Floor Care', 'also_view': ['B004B54FM4', 'B014N37IBI', 'B00VH79FH4', 'B008MKNG6U', 'B001AO1VBW', 'B00TM8XQK2', 'B001EZIEOO', 'B013KYDLJY', 'B013JKGOH0', 'B0195UJPGU', 'B001ANZQSM', 'B00BY3VYFC', 'B00007E7OH'], 'main_cat': 'Amazon Home', 'price': '$4.36', 'asin': 'B00002N62Y'}


In [None]:
# find all products whose title contains a given term (case-insensitive)
cands = []   # titles containing 'vacuum cleaner'
cands2 = []  # titles containing 'vacuum'
for d in data:
    if 'title' not in d:
        continue
    title = d['title'].lower()
    if 'vacuum cleaner' in title:
        cands.append(d)
    if 'vacuum' in title:
        cands2.append(d)

print(len(cands))
print(len(cands2))

3025
3025
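
The title filter above can be wrapped in a small reusable helper. The following is a minimal sketch, not part of the original notebook: the function name `find_by_title` and the toy records are hypothetical, and the records are assumed to be dicts with an optional `'title'` string, as in the metadata shown earlier.

```python
def find_by_title(records, term):
    """Return the records whose 'title' contains `term`, ignoring case."""
    term = term.lower()
    # records without a 'title' key are treated as having an empty title
    return [r for r in records if term in r.get('title', '').lower()]


# toy records standing in for the parsed metadata (hypothetical, not real data)
toy = [
    {'asin': 'A1', 'title': 'Acme Upright Vacuum Cleaner'},
    {'asin': 'A2', 'title': 'Acme Vacuum Sealer Bags'},
    {'asin': 'A3'},  # no title, so it never matches
]

cleaners = find_by_title(toy, 'vacuum cleaner')
vacuums = find_by_title(toy, 'vacuum')
print(len(cleaners), len(vacuums))  # prints: 1 2
```

Because this is plain substring matching, the shorter term `'vacuum'` matches everything the longer term matches, plus items such as sealer bags that are not vacuum cleaners.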


In [None]:
# show some example products
for d in cands[:10]:
    print(d['title'])

Eureka 54312-12 Vacuum Cleaner Belt
Eureka Mighty Mite 3670G Corded Canister Vacuum Cleaner, Yellow
Hoover U5253-900 Breathe Easy Upright Vacuum Cleaner
Hoover S3639 WindTunnel Canister Vacuum Cleaner
Orgill Hoover S1147-900 Twist & Vac Hand-Held Vacuum Cleaner
Hoover 3 Pack, Style K Vacuum Cleaner Bag
Hoover S3607 Powermax Deluxe Canister Vacuum Cleaner
Hoover S3510 Powermax Canister Vacuum Cleaner
Hoover S3410 Spirit Canister Vacuum Cleaner
Eureka 4870 Ultra Smart Vac Upright Vacuum Cleaner with True HEPA Filter


In [None]:
# build asin sets (a set keeps each ASIN only once)
cands_asin = {d['asin'] for d in cands}
cands2_asin = {d['asin'] for d in cands2}
len(cands_asin)

2963

In [None]:
# align products with reviews: group each candidate product's reviews by ASIN
reviews = defaultdict(list)
with open('sample_Home_and_Kitchen_5.json', 'r') as f:
    for line in f:
        r = json.loads(line)
        a = r['asin']
        if a in cands_asin:
            reviews[a].append(r)
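
After the alignment step it can be useful to check how many reviews each product received and which candidate products have none in the review file. The sketch below uses hypothetical toy data in place of `cands_asin` and the review file. It assumes only that each review is a dict with an `'asin'` key, as in the loop above.

```python
from collections import defaultdict

# toy stand-ins (hypothetical values, not taken from the dataset)
toy_asins = {'A1', 'A2', 'A3'}
toy_reviews = [
    {'asin': 'A1', 'reviewTime': '03 8, 2013'},
    {'asin': 'A1', 'reviewTime': '11 2, 2012'},
    {'asin': 'A2', 'reviewTime': '01 15, 2007'},
    {'asin': 'B9', 'reviewTime': '05 1, 2010'},  # not a candidate, so skipped
]

# same grouping logic as the alignment cell
aligned = defaultdict(list)
for r in toy_reviews:
    if r['asin'] in toy_asins:
        aligned[r['asin']].append(r)

# number of reviews per aligned product
counts = {a: len(rs) for a, rs in aligned.items()}
# candidate products that never appear in the reviews
unreviewed = sorted(toy_asins - set(aligned))

print(counts)      # prints: {'A1': 2, 'A2': 1}
print(unreviewed)  # prints: ['A3']
```

Products in `unreviewed` have no entry in the grouped reviews, so no time span can be derived for them.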

In [None]:
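# collect the review times of each product as a rough proxy for its time on the market
reviews_times = defaultdict(list)

for k, vs in reviews.items():
    # NOTE: reviewTime is a string such as '03 8, 2013' (month day, year),
    # so sorted() orders the values as text (month first), not chronologically
    ts = sorted(v['reviewTime'] for v in vs)
    reviews_times[k] = ts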

In [None]:
# print the first (lexicographically smallest) review time for each vacuum cleaner product
for k,v in reviews_times.items():
    print(k, v[0])

B00002N62Y 03 8, 2013
B00002N8CX 01 1, 2013
B00004U9TI 01 15, 2007
B000050686 01 3, 2015
B000050B6F 01 5, 2008
B000050HCV 01 11, 2014
B00005LVV6 01 11, 2003
B00005NWXF 02 26, 2006
B00005V9E3 01 1, 2003
B000079R7E 01 1, 2005
B000096JFW 01 15, 2011
B0000D83BR 02 14, 2018
B0000SWDR0 01 10, 2009
B0000SWAC8 01 13, 2005
B000246E2W 03 22, 2013
B00028I1M4 01 13, 2015
B0002EB670 01 14, 2016
B0002MM5AO 01 29, 2006
B0002UW0FG 01 11, 2010
B0006OLGAS 02 21, 2016
B0006OLG0S 01 1, 2013
B0007D9S9E 01 1, 2013
B0007D9QQE 01 15, 2014
B0007LJO2C 01 10, 2010
B0007WT9IA 01 18, 2014
B0007XY6KK 01 20, 2013
B000981H6O 01 10, 2010
B000981H6Y 01 1, 2007
B0009GZNIY 01 1, 2014
B0009GZNT8 03 24, 2015
B0009H7D6I 01 15, 2016
B0009H63D2 02 22, 2018
B0009HNH2W 01 28, 2015
B0009ONZ8Q 01 19, 2007
B0009ONZ8G 01 1, 2015
B0009RF81A 01 12, 2008
B000A6TOEM 01 19, 2009
B000AAJVC8 01 1, 2015
B000AAWEK4 01 11, 2012
B000AAWEJU 01 15, 2007
B000B649VO 01 28, 2007
B000BGO7KW 01 10, 2006
B000BU1H5G 01 12, 2006
B000BWEOOA 01 21, 2007
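
The `reviewTime` values printed above are strings of the form `MM D, YYYY`, so sorting them as text orders them by month first rather than by date. A minimal sketch of a chronological alternative, assuming every review uses that format, parses the strings with `datetime.strptime` before taking the earliest and latest date. The helper name `review_span` and the toy list are hypothetical and not part of the original notebook.

```python
from datetime import datetime


def review_span(times, fmt='%m %d, %Y'):
    """Return (earliest, latest) dates for a list of reviewTime strings."""
    dates = [datetime.strptime(t, fmt).date() for t in times]
    return min(dates), max(dates)


# toy review times in the same format as the dataset (hypothetical values)
toy_times = ['03 8, 2013', '11 2, 2012', '01 15, 2014']

# text sorting picks the smallest month string, here the latest review
print(sorted(toy_times)[0])  # prints: 01 15, 2014

# parsing first gives the real first and last review dates
first, last = review_span(toy_times)
print(first, last)  # prints: 2012-11-02 2014-01-15
```

Applied to the notebook's data, the same helper could be called as `review_span(ts)` for each list in `reviews_times`. The resulting span is only a lower bound on a product's time on the market, since it covers just the period between its first and last review.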
