# Analysis of Amazon book sales rank data

This notebook performs some exploratory analysis on the [Amazon sales rank data for print and kindle books dataset](https://www.kaggle.com/ucffool/amazon-sales-rank-data-for-print-and-kindle-books) found on Kaggle.

This notebook looks only at the sales rank data, which is partitioned by ASIN. Each file contains a JSON object whose keys are Unix timestamps and whose values are the sales ranks.
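As a minimal sketch of that file layout, the snippet below parses one such JSON object using only the standard library. The sample payload and the function name `parse_salesrank` are invented for illustration, and it assumes the keys are Unix timestamps in seconds stored as strings.

```python
import json
from datetime import datetime, timezone

# Hypothetical payload in the shape described above: keys are Unix
# timestamps (as strings), values are sales ranks. The numbers are invented.
SAMPLE = '{"1509400000": 120, "1509300000": 95, "1509350000": 101}'


def parse_salesrank(raw):
    """Return (UTC datetime, sales rank) pairs sorted by time."""
    data = json.loads(raw)
    # Cast keys to int so the sort is numeric rather than lexicographic.
    pairs = sorted((int(ts), rank) for ts, rank in data.items())
    return [(datetime.fromtimestamp(ts, tz=timezone.utc), rank) for ts, rank in pairs]


rows = parse_salesrank(SAMPLE)
for when, rank in rows:
    print(when.isoformat(), rank)
```

The Spark cells below follow the same idea, but read the file through an RDD and build a DataFrame from it.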

## Set up and load data

In [1]:
import json
import os

import findspark
findspark.init()

from dotenv import load_dotenv
import matplotlib.pyplot as plt
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, FloatType, StringType, StructField, StructType
from pyspark_dist_explore import hist

import helpers as H

%matplotlib inline

load_dotenv()

AMZ_SALESRANK_000724519X_PATH = os.getenv("AMZ_SALESRANK_000724519X_PATH")

spark = SparkSession.builder.appName("ExploreAmazonBooks").getOrCreate()
sc = spark.sparkContext


In [6]:
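def transpose_data(data):
    json_data = json.loads(data)
    # JSON keys are strings; cast to int so the sort is numeric and the column is typed as a number.
    return sorted((int(timestamp), sales_rank) for timestamp, sales_rank in json_data.items())


# Read each file whole, drop the file path, and flatten to one (timestamp, sales_rank) row per entry.
amz_rdd = sc.wholeTextFiles(AMZ_SALESRANK_000724519X_PATH).values().flatMap(transpose_data)
amz_df = spark.createDataFrame(amz_rdd).toDF("timestamp", "sales_rank")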



In [9]:
# Convert epoch seconds to a formatted timestamp string (rendered in the Spark session time zone).
amz_df = amz_df.withColumn("timestamp", F.from_unixtime("timestamp"))
amz_df.show()

+-------------------+----------+
|          timestamp|sales_rank|
+-------------------+----------+
|2017-10-30 09:07:50|    327588|
|2017-10-30 11:41:41|    348041|
|2017-10-30 13:30:42|    353297|
|2017-10-30 16:22:01|    369732|
|2017-10-30 17:19:56|    373189|
|2017-10-30 19:11:45|    386346|
|2017-10-30 19:49:16|     99600|
|2017-10-30 22:54:03|    135141|
|2017-10-30 23:48:52|    155116|
|2017-10-31 02:23:40|    160114|
|2017-10-31 03:14:06|    165424|
|2017-10-31 05:23:58|    166173|
|2017-10-31 06:04:41|    170119|
|2017-10-31 07:56:18|    173354|
|2017-10-31 09:50:45|    188661|
|2017-10-31 11:53:39|    224768|
|2017-10-31 14:05:50|    237030|
|2017-10-31 15:52:22|    262014|
|2017-10-31 17:35:11|     85465|
|2017-10-31 18:28:25|     99272|
+-------------------+----------+
only showing top 20 rows



In [11]:
H.get_basic_counts(amz_df, amz_df.timestamp)
H.check_nulls(amz_df, amz_df.timestamp, amz_df.sales_rank)
H.basic_stats(amz_df, amz_df.timestamp)

+----------------+-------------------------+
|count(timestamp)|count(DISTINCT timestamp)|
+----------------+-------------------------+
|            3236|                     3236|
+----------------+-------------------------+

+--------------------+
|Has Null (timestamp)|
+--------------------+
|                   0|
+--------------------+

+----------------+--------------+-------------------+-------------------+
|count(timestamp)|avg(timestamp)|     min(timestamp)|     max(timestamp)|
+----------------+--------------+-------------------+-------------------+
|            3236|          null|2017-10-30 09:07:50|2018-06-29 19:37:17|
+----------------+--------------+-------------------+-------------------+



In [12]:
H.get_basic_counts(amz_df, amz_df.sales_rank)
H.check_nulls(amz_df, amz_df.sales_rank, amz_df.timestamp)
H.basic_stats(amz_df, amz_df.sales_rank)

+-----------------+--------------------------+
|count(sales_rank)|count(DISTINCT sales_rank)|
+-----------------+--------------------------+
|             3236|                      3216|
+-----------------+--------------------------+

+---------------------+
|Has Null (sales_rank)|
+---------------------+
|                    0|
+---------------------+

+-----------------+-----------------+---------------+---------------+
|count(sales_rank)|  avg(sales_rank)|min(sales_rank)|max(sales_rank)|
+-----------------+-----------------+---------------+---------------+
|             3236|208137.0457354759|          25856|         607513|
+-----------------+-----------------+---------------+---------------+

