# Boxcoxrox Amazon Product Recommender

This notebook walks through the steps of building the review and product database used by the boxcoxrox project. (With the default sources below, it loads the Pet Supplies category into `pets.db`.)  Please note that:

- This notebook **WILL DESTROY AND REBUILD THE ENTIRE DATABASE**.  This can take a long time (1-2 hours).
- This notebook checks whether the review and product archives and JSON files have already been downloaded, and only downloads and decompresses them again if they are missing.

The code that performs these tasks is contained in the `ReviewHelper` and `ProductHelper` classes, which live in the associated Python files in the same directory as this notebook.  If you change these files, you must restart this notebook's kernel before the changes take effect here.  (The notebook does not dynamically reload classes.)
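
If restarting the kernel is inconvenient, one possible alternative is to re-import the edited module by hand with the standard library's `importlib.reload`. The sketch below is not part of the project: `reload_helper` is a hypothetical name, and it assumes the helper modules can be re-executed safely. Objects created before the reload keep using the old class definition, so helpers such as `rh` or `ph` would need to be constructed again.

```python
import importlib


def reload_helper(module_name, class_name):
    """Re-import a module from disk and return a fresh reference to one class.

    Hypothetical convenience function, not part of the boxcoxrox code.
    """
    module = importlib.import_module(module_name)
    # reload() re-executes the module's source file in the existing module object.
    module = importlib.reload(module)
    return getattr(module, class_name)


# Example use inside the notebook (assumes review_helper.py is importable):
# ReviewHelper = reload_helper("review_helper", "ReviewHelper")
# rh = ReviewHelper()  # instances made before the reload still use the old class
```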


In [1]:
from unittest import TestCase
from review_helper import ReviewHelper
from product_helper import ProductHelper
import sqlite3
from datetime import datetime
notebook_start_time = datetime.now()

In [2]:
review_source = input("Please enter the source file for amazon reviews: (Enter for default.)")
if review_source == "":
    review_source = "http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Pet_Supplies.json.gz"
    
product_source = input("Please enter the corresponding metadata source file for the reviews. (Enter for default.)")
if product_source == "":
    product_source = "http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Pet_Supplies.json.gz"
    


Please enter the source file for amazon reviews: (Enter for default.)
Please enter the corresponding metadata source file for the reviews. (Enter for default.)


In [3]:


from download_helper import DownloadHelper

# Download both archives and decompress them (see the output below for the file names).
dh = DownloadHelper()
dh.download(review_source, product_source)

print("Time to download and decompress: {}".format(datetime.now()-notebook_start_time))
print("Finished.")



File reviews.gz has been downloaded.
File reviews.json has been uncompressed.
File products.gz has been downloaded.
File products.json has been uncompressed.
Time to download and decompress: 0:04:35.047790
Finished.
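
The internals of `DownloadHelper` are not shown in this notebook. As a minimal sketch of the "download and decompress only if missing" behaviour described at the top, the function below uses only the standard library. The name `fetch_if_missing` and its return value are assumptions made for illustration, not the project's API.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path


def fetch_if_missing(url, archive_path, json_path):
    """Download and decompress `url` unless the files already exist.

    Hypothetical helper; returns a list of the steps actually performed.
    """
    archive_path, json_path = Path(archive_path), Path(json_path)
    steps = []

    # Download only when neither the archive nor the decompressed file exists.
    if not archive_path.exists() and not json_path.exists():
        urllib.request.urlretrieve(url, str(archive_path))
        steps.append("downloaded")

    # Decompress only when the JSON file is not already on disk.
    if not json_path.exists():
        with gzip.open(archive_path, "rb") as src, open(json_path, "wb") as dst:
            shutil.copyfileobj(src, dst)  # streams, so the file is never fully in memory
        steps.append("uncompressed")

    return steps
```

Called as `fetch_if_missing(review_source, "reviews.gz", "reviews.json")`, a second run would do nothing as long as both files are still present.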


In [4]:
start_time = datetime.now()

rh = ReviewHelper()
conn = rh.create_db_connection("pets.db")
# Destructive step: the existing review table is dropped and recreated.
rh.drop_review_table_sql()
print("Table Dropped.")
rh.create_review_table_sql()
print("Table Created.")
rh.DEBUG = False

print("Reading data...")
with open("reviews.json","r") as reviews:
    review_lines = reviews.readlines()
print("Elapsed time to load data: {}".format(datetime.now() - start_time))
print("Found {} reviews.".format(len(review_lines)))
    
print("Inserting reviews...")
insert_start = datetime.now()
rh.insert_json_lines(review_lines)
print("Data inserted in: {}".format(datetime.now()-insert_start))

print("Creating index.  Be patient.")
index_start = datetime.now()
rh.create_index()
rh.close_db()
print("Indexing Done.  Elapsed time: {}".format(datetime.now() - index_start))
print("Total elapsed time: {}".format(datetime.now() - start_time))
print("Total reviews imported: {}".format(len(review_lines)))

Table Dropped.
Table Created.
Reading data...
Elapsed time to load data: 0:00:02.876056
Found 6542483 reviews.
Inserting reviews...
Data inserted in: 0:05:09.990287
Creating index.  Be patient.
Indexing Done.  Elapsed time: 0:00:03.757688
Total elapsed time: 0:05:16.624629
Total reviews imported: 6542483
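
The internals of `ReviewHelper` are not shown in this notebook. As a rough sketch of what a drop, create, bulk-insert, and index sequence like the one above could look like with the standard `sqlite3` module, the following uses a hypothetical `review` table and a hypothetical `load_reviews` function. The JSON keys (`asin`, `reviewerID`, `overall`, `reviewText`) are assumed from the review file format and may not match the columns the project actually stores.

```python
import json
import sqlite3


def load_reviews(conn, lines):
    """Insert JSON-lines review records into a hypothetical `review` table."""
    conn.execute("DROP TABLE IF EXISTS review")
    conn.execute(
        "CREATE TABLE review "
        "(asin TEXT, reviewer_id TEXT, overall REAL, review_text TEXT)"
    )

    def rows():
        for line in lines:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the input
            record = json.loads(line)
            # .get() stores NULL for a missing field instead of raising KeyError.
            yield (
                record.get("asin"),
                record.get("reviewerID"),
                record.get("overall"),
                record.get("reviewText"),
            )

    # One transaction for all inserts avoids a commit per row.
    with conn:
        conn.executemany("INSERT INTO review VALUES (?, ?, ?, ?)", rows())

    # Build the index after the bulk insert, as the cell above does.
    conn.execute("CREATE INDEX idx_review_asin ON review (asin)")
    return conn.execute("SELECT COUNT(*) FROM review").fetchone()[0]
```

With an in-memory database (`sqlite3.connect(":memory:")`) this can be tried on a handful of lines before pointing it at a real file.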


In [3]:
ph = ProductHelper()
conn = ph.create_db_connection("pets.db")
ph.drop_product_table()
print("Table Dropped.")
ph.create_product_table()
print("Table Created.")
ph.DEBUG = False
start_time = datetime.now()
print("Reading data...")
with open("products.json","r") as products:
    product_lines = products.readlines()
print("Elapsed time to load data: {}".format(datetime.now() - start_time))
print("Found {} products.".format(len(product_lines)))
    
print("Inserting products...")
insert_start = datetime.now()
ph.insert_json_lines(product_lines)
print("Data inserted in: {}".format(datetime.now()-insert_start))

print("Creating index.  Be patient.")
index_start = datetime.now()
ph.create_index()
ph.close_db()
print("Indexing Done.  Elapsed time: {}".format(datetime.now() - index_start))
print("Total elapsed time: {}".format(datetime.now() - start_time))
print("Total product imported: {}".format(len(product_lines)))
print("Full time to download notebook: {}".format(datetime.now()-notebook_start_time))

Table Dropped.
Table Created.
Reading data...
Elapsed time to load data: 0:01:39.866054
Found 786445 products.
Inserting products...
Data inserted in: 0:02:33.295672
Creating index.  Be patient.
Indexing Done.  Elapsed time: 0:01:44.933846
Total elapsed time: 0:05:58.153480
Total product imported: 786445
Full time to download notebook: 0:31:11.810832
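
Both import cells read the whole file with `readlines()` before inserting anything. If memory becomes a constraint, one option is to feed the file to the helper in fixed-size batches. The function below, `iter_batches`, is a hypothetical name and not part of the project. Whether `insert_json_lines` can safely be called once per batch is also an assumption that would need checking against the helper's code.

```python
from itertools import islice


def iter_batches(lines, batch_size=50_000):
    """Yield lists of at most `batch_size` items from any iterable of lines."""
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # input exhausted
        yield batch


# Possible use in the product cell (assumes insert_json_lines accepts partial lists):
# with open("products.json", "r") as products:
#     for batch in iter_batches(products):
#         ph.insert_json_lines(batch)
```

A file object is itself an iterator over lines, so only one batch is held in memory at a time.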
