# Analysis of Amazon sales rank data for books

This notebook performs some exploratory analysis on the [Amazon sales rank data for print and kindle books dataset](https://www.kaggle.com/ucffool/amazon-sales-rank-data-for-print-and-kindle-books) found on Kaggle.

This notebook looks only at the `amazon_com_extras.csv` table, which appears to be a lookup table keyed on ASIN.

## Set up and load data

In [1]:
import os

import findspark
findspark.init()

from dotenv import load_dotenv
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, FloatType, StringType, StructField, StructType
from pyspark_dist_explore import hist

import helpers as H

%matplotlib inline

spark = SparkSession.builder.appName("ExploreAmazonBooks").getOrCreate()

In [2]:
load_dotenv()

AMZ_BOOKS_PATH = os.getenv("AMZ_BOOKS_PATH")
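
If `AMZ_BOOKS_PATH` is missing from the environment and from the `.env` file, `os.getenv` returns `None` and the problem only surfaces later, when Spark tries to read the path. A minimal sketch of failing early is shown here; `require_env` is a hypothetical helper name and not part of this notebook or its `helpers` module.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Covers both an unset variable and an empty string.
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this in place, the assignment above could be written as `AMZ_BOOKS_PATH = require_env("AMZ_BOOKS_PATH")`.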

In [3]:
amz_df = spark.read.option("header", "true").csv(AMZ_BOOKS_PATH)

## Inspect table

In [4]:
amz_df.limit(5).toPandas()

Unnamed: 0,ASIN,GROUP,FORMAT,TITLE,AUTHOR,PUBLISHER
0,1250150183,book,hardcover,The Swamp: Washington's Murky Pool of Corrupti...,Eric Bolling,St. Martin's Press
1,778319997,book,hardcover,"Rise and Shine, Benedict Stone: A Novel",Phaedra Patrick,Park Row Books
2,1608322564,book,hardcover,Sell or Be Sold: How to Get Your Way in Busine...,Grant Cardone,Greenleaf Book Group Press
3,310325331,book,hardcover,Christian Apologetics: An Anthology of Primary...,"Khaldoun A. Sweis, Chad V. Meister",Zondervan
4,312616295,book,hardcover,Gravity: How the Weakest Force in the Universe...,Brian Clegg,St. Martin's Press


## ASIN

- Some invalid ASINs (stray quote characters and what look like title fragments)
- How many are ISBNs?
- About half appear to be ISBNs (there don't seem to be any ISBN-13s)

In [5]:
H.get_basic_counts(amz_df, amz_df.ASIN)
H.check_nulls(amz_df, amz_df.ASIN, amz_df.TITLE)
H.check_lengths(amz_df, amz_df.ASIN)

+-----------+--------------------+
|count(ASIN)|count(DISTINCT ASIN)|
+-----------+--------------------+
|      63755|               63750|
+-----------+--------------------+

+---------------+
|Has Null (ASIN)|
+---------------+
|              0|
+---------------+

+----------+------------+
|      ASIN|length(ASIN)|
+----------+------------+
|         "|           1|
|        L"|           2|
|       M."|           3|
|B07C8H79J2|          10|
|B07C5NLH68|          10|
|B0032AMDIW|          10|
|B00P86KQX2|          10|
|B0145038EA|          10|
|B07CB569MB|          10|
|B0080OYR52|          10|
+----------+------------+
only showing top 10 rows

+--------------------+------------+
|                ASIN|length(ASIN)|
+--------------------+------------+
|			 - Classics Il...|          43|
|			 - Classics Il...|          41|
|","Stichting Kuns...|          29|
|* Avoiding a Big ...|          24|
|* Talking With th...|          24|
|* Finding Lasting...|          23|
|          B0145038

In [6]:
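# Distinct ASINs that do not start with "B" (candidate ISBNs and malformed values)
(
    amz_df
    .select(amz_df.ASIN)
    .distinct()
    .filter(~amz_df.ASIN.startswith("B"))
    .sort(amz_df.ASIN.asc()).show()
)
# Number of rows whose ASIN does not start with "B"
(
    amz_df
    .filter(~amz_df.ASIN.startswith("B"))
    .agg(F.count(amz_df.ASIN)).show()
)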

+--------------------+
|                ASIN|
+--------------------+
|			 - Classics Il...|
|			 - Classics Il...|
|                   "|
|","Stichting Kuns...|
|* Avoiding a Big ...|
|* Finding Lasting...|
|* Talking With th...|
|          0002247399|
|          0006276482|
|          0006391702|
|          0006513379|
|          0007116985|
|          0007177771|
|          0007184700|
|          0007189885|
|          0007204493|
|          000721393X|
|          0007224885|
|          0007230206|
|          0007232241|
+--------------------+
only showing top 20 rows

+-----------+
|count(ASIN)|
+-----------+
|      33532|
+-----------+
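
The count above treats every ASIN that does not start with "B" as a possible ISBN, which also includes the malformed values. A stricter check could validate the ISBN-10 check digit. The sketch below runs on a few values taken from the output above, in plain Python rather than on the Spark DataFrame. The pattern for Amazon-assigned IDs (ten uppercase alphanumerics starting with "B") is an assumption based on the sample rows, and `classify_asin` is a hypothetical helper name.

```python
import re

# Assumption: Amazon-assigned IDs are 10 uppercase alphanumerics starting with "B",
# as in the sample rows above.
AMAZON_ID = re.compile(r"B[0-9A-Z]{9}")


def is_isbn10(value: str) -> bool:
    """Check the ISBN-10 check digit: the weighted sum (weights 10..1) must be divisible by 11."""
    if not re.fullmatch(r"[0-9]{9}[0-9X]", value):
        return False
    digits = [10 if ch == "X" else int(ch) for ch in value]
    return sum(w * d for w, d in zip(range(10, 0, -1), digits)) % 11 == 0


def classify_asin(value: str) -> str:
    """Label a raw ASIN string as 'isbn10', 'amazon_id', or 'invalid'."""
    if is_isbn10(value):
        return "isbn10"
    if AMAZON_ID.fullmatch(value):
        return "amazon_id"
    return "invalid"


# Values copied from the outputs above.
samples = ["000721393X", "0002247399", "B07C8H79J2", '"', 'M."']
labels = {s: classify_asin(s) for s in samples}
```

The same logic could be wrapped in a Spark UDF to count each label across the whole table, though that is not done here.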



## GROUP

- A few nulls (4 rows)
- Some strange groups: besides `book` and `kindle`, five values look like author or publisher names

In [7]:
H.get_basic_counts(amz_df, amz_df.GROUP)
H.check_nulls(amz_df, amz_df.GROUP, amz_df.ASIN)
H.check_empty_strings(amz_df, amz_df.GROUP)
H.check_lengths(amz_df, amz_df.GROUP)

+------------+---------------------+
|count(GROUP)|count(DISTINCT GROUP)|
+------------+---------------------+
|       63751|                    7|
+------------+---------------------+

+----------------+
|Has Null (GROUP)|
+----------------+
|               4|
+----------------+

+-----------------+
|Has Empty (GROUP)|
+-----------------+
|                0|
+-----------------+

+-----+-------------+
|GROUP|length(GROUP)|
+-----+-------------+
| null|         null|
| null|         null|
| null|         null|
| null|         null|
| book|            4|
| book|            4|
| book|            4|
| book|            4|
| book|            4|
| book|            4|
+-----+-------------+
only showing top 10 rows

+--------------------+-------------+
|               GROUP|length(GROUP)|
+--------------------+-------------+
|Albert Lewis Kant...|           42|
|Lorenz Graham, Jr...|           36|
|Ron L. Deal, Denn...|           26|
|Paul H Brookes Pu...|           21|
|   Stuart Bauer M.D.|  

In [8]:
(
    amz_df
    .select(amz_df.GROUP)
    .distinct()
    .sort(amz_df.GROUP.desc()).show()
)

+--------------------+
|               GROUP|
+--------------------+
|              kindle|
|                book|
|Ron L. Deal, Denn...|
|Paul H Brookes Pu...|
|Lorenz Graham, Jr...|
|Albert Lewis Kant...|
|   Stuart Bauer M.D.|
|                null|
+--------------------+
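
One way to isolate the strange groups is to compare each value against the set of groups that look legitimate. The sketch below does this in plain Python on toy rows. The set `EXPECTED_GROUPS` is an assumption drawn from the distinct values above, `split_by_group` is a hypothetical helper name, and the row identifiers are made up.

```python
from collections import Counter

# Assumption: only these two groups are legitimate, based on the distinct values above.
EXPECTED_GROUPS = {"book", "kindle"}


def split_by_group(rows):
    """Separate (row_id, group) pairs into expected rows and suspect rows."""
    expected, suspect = [], []
    for row_id, group in rows:
        (expected if group in EXPECTED_GROUPS else suspect).append((row_id, group))
    return expected, suspect


# Toy rows: the row identifiers are made up for illustration.
rows = [
    ("row-1", "kindle"),
    ("row-2", "book"),
    ("row-3", "Stuart Bauer M.D."),
    ("row-4", None),
]
expected, suspect = split_by_group(rows)

# Tally the suspect values so the most common problems stand out.
suspect_counts = Counter(group for _, group in suspect)
```

Nulls land in the suspect list as well, so they can be reviewed together with the misplaced names.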



## FORMAT

- A few nulls (5 rows)
- Three values are not formats and look like a series name, a publisher, and an author

In [9]:
H.get_basic_counts(amz_df, amz_df.FORMAT)
H.check_nulls(amz_df, amz_df.FORMAT, amz_df.ASIN)
H.check_empty_strings(amz_df, amz_df.FORMAT)
H.check_lengths(amz_df, amz_df.FORMAT)

+-------------+----------------------+
|count(FORMAT)|count(DISTINCT FORMAT)|
+-------------+----------------------+
|        63750|                     7|
+-------------+----------------------+

+-----------------+
|Has Null (FORMAT)|
+-----------------+
|                5|
+-----------------+

+------------------+
|Has Empty (FORMAT)|
+------------------+
|                 0|
+------------------+

+---------+--------------+
|   FORMAT|length(FORMAT)|
+---------+--------------+
|     null|          null|
|     null|          null|
|     null|          null|
|     null|          null|
|     null|          null|
|hardcover|             9|
|hardcover|             9|
|hardcover|             9|
|hardcover|             9|
|hardcover|             9|
+---------+--------------+
only showing top 10 rows

+--------------------+--------------+
|              FORMAT|length(FORMAT)|
+--------------------+--------------+
|Bethany House Pub...|            24|
|mass market paper...|            21|
|ma

In [10]:
(
    amz_df
    .select(amz_df.FORMAT)
    .distinct()
    .sort(amz_df.FORMAT.desc()).show()
)

+--------------------+
|              FORMAT|
+--------------------+
|           paperback|
|mass market paper...|
|      kindle edition|
|           hardcover|
|Classics Illustrated|
|Bethany House Pub...|
|  Joan Beasley Ph.D.|
|                null|
+--------------------+
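
Values such as "Classics Illustrated" appearing in both ASIN and FORMAT suggest that some records are being split or shifted across columns. One possible cause, which is an assumption and not verified against this file, is a quoted field that contains a line break. The sketch below uses a made-up record and Python's `csv` module to show how a line-by-line split turns one record into two, while a quote-aware parser keeps it whole.

```python
import csv
import io

# Made-up record whose quoted TITLE field spans two physical lines.
raw = (
    "ASIN,GROUP,FORMAT,TITLE\n"
    '"X000000001","book","paperback","Some Title\n'
    ' - Classics Illustrated"\n'
)

# Naive approach: treat every physical line as one record.
naive_rows = [line.split(",") for line in raw.splitlines()]

# Quote-aware approach: csv.reader keeps the embedded newline inside the field.
parsed_rows = list(csv.reader(io.StringIO(raw, newline="")))
```

In the naive result, the third "row" begins with a title fragment where an ASIN should be, much like the stray values seen above. Spark's CSV reader has a `multiLine` option that may be worth trying when loading the table; whether it resolves the issue for this dataset is untested here.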



## TITLE

- Some duplicate titles with different ASINs
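
To see which titles are shared by more than one ASIN, the rows can be grouped by title. The sketch below shows the idea in plain Python on toy rows with made-up identifiers; `titles_with_multiple_asins` is a hypothetical helper name, and on the full table the equivalent would be a group-by on TITLE with a distinct count of ASIN.

```python
from collections import defaultdict


def titles_with_multiple_asins(rows):
    """Map each title that appears under more than one ASIN to its sorted ASINs."""
    asins_by_title = defaultdict(set)
    for asin, title in rows:
        if title is not None:  # skip null titles
            asins_by_title[title].add(asin)
    return {t: sorted(a) for t, a in asins_by_title.items() if len(a) > 1}


# Toy rows with made-up ASINs and titles.
rows = [
    ("A1", "Gravity"),
    ("A2", "Gravity"),
    ("A3", "Another Title"),
    ("A4", None),
]
duplicates = titles_with_multiple_asins(rows)
```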

In [11]:
H.get_basic_counts(amz_df, amz_df.TITLE)
H.check_nulls(amz_df, amz_df.TITLE, amz_df.ASIN)
H.check_empty_strings(amz_df, amz_df.TITLE)
H.check_lengths(amz_df, amz_df.TITLE)

+------------+---------------------+
|count(TITLE)|count(DISTINCT TITLE)|
+------------+---------------------+
|       63747|                58283|
+------------+---------------------+

+----------------+
|Has Null (TITLE)|
+----------------+
|               8|
+----------------+

+-----------------+
|Has Empty (TITLE)|
+-----------------+
|                0|
+-----------------+

+-----+-------------+
|TITLE|length(TITLE)|
+-----+-------------+
| null|         null|
| null|         null|
| null|         null|
| null|         null|
| null|         null|
| null|         null|
| null|         null|
| null|         null|
|   Es|            2|
|   It|            2|
+-----+-------------+
only showing top 10 rows

+--------------------+-------------+
|               TITLE|length(TITLE)|
+--------------------+-------------+
|Vampire Diaries C...|          249|
|NEW VOICES: A MYS...|          243|
|The Power of Posi...|          222|
|Eiweiß Diät - Sch...|          221|
|Jack's Wagers (A ...|  

## AUTHOR

- Has nulls (83 rows)
- Has duplicates
- Can list multiple authors, apparently separated by commas
- Maxes out at a length of 255 (VARCHAR limit?)
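
The last two observations can be combined into one parsing step: split the string on commas and flag values that hit the 255-character ceiling, since their final name may be cut off. This is a rough sketch in plain Python; `parse_authors` is a hypothetical helper name, and the comma split is a simplification that would wrongly separate suffixes such as "Jr." from the name they belong to.

```python
MAX_AUTHOR_LENGTH = 255  # maximum length observed in this table


def parse_authors(raw):
    """Split a comma-separated AUTHOR string and flag possible truncation."""
    if raw is None:
        return [], False
    names = [name.strip() for name in raw.split(",") if name.strip()]
    # A value at the ceiling may have lost part of its last name.
    possibly_truncated = len(raw) >= MAX_AUTHOR_LENGTH
    return names, possibly_truncated


# Author string taken from the sample rows at the top of the notebook.
names, truncated = parse_authors("Khaldoun A. Sweis, Chad V. Meister")
```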

In [12]:
H.get_basic_counts(amz_df, amz_df.AUTHOR)
H.check_nulls(amz_df, amz_df.AUTHOR, amz_df.ASIN)
H.check_empty_strings(amz_df, amz_df.AUTHOR)
H.check_lengths(amz_df, amz_df.AUTHOR)

+-------------+----------------------+
|count(AUTHOR)|count(DISTINCT AUTHOR)|
+-------------+----------------------+
|        63672|                 34211|
+-------------+----------------------+

+-----------------+
|Has Null (AUTHOR)|
+-----------------+
|               83|
+-----------------+

+------------------+
|Has Empty (AUTHOR)|
+------------------+
|                 0|
+------------------+

+------+--------------+
|AUTHOR|length(AUTHOR)|
+------+--------------+
|  null|          null|
|  null|          null|
|  null|          null|
|  null|          null|
|  null|          null|
|  null|          null|
|  null|          null|
|  null|          null|
|  null|          null|
|  null|          null|
+------+--------------+
only showing top 10 rows

+--------------------+--------------+
|              AUTHOR|length(AUTHOR)|
+--------------------+--------------+
|J. Callicott, Mic...|           255|
|Julie Taylor, Mar...|           255|
|Rhonda Parrish, S...|           255|
|Golden

## PUBLISHER

- Has nulls (6,492 rows)
- Possible duplicates due to different spellings (for example, several variants of Schiffer and Scholastic)

In [13]:
H.get_basic_counts(amz_df, amz_df.PUBLISHER)
H.check_nulls(amz_df, amz_df.PUBLISHER, amz_df.ASIN)
H.check_empty_strings(amz_df, amz_df.PUBLISHER)
H.check_lengths(amz_df, amz_df.PUBLISHER)

+----------------+-------------------------+
|count(PUBLISHER)|count(DISTINCT PUBLISHER)|
+----------------+-------------------------+
|           57263|                     9060|
+----------------+-------------------------+

+--------------------+
|Has Null (PUBLISHER)|
+--------------------+
|                6492|
+--------------------+

+---------------------+
|Has Empty (PUBLISHER)|
+---------------------+
|                    0|
+---------------------+

+---------+-----------------+
|PUBLISHER|length(PUBLISHER)|
+---------+-----------------+
|     null|             null|
|     null|             null|
|     null|             null|
|     null|             null|
|     null|             null|
|     null|             null|
|     null|             null|
|     null|             null|
|     null|             null|
|     null|             null|
+---------+-----------------+
only showing top 10 rows

+--------------------+-----------------+
|           PUBLISHER|length(PUBLISHER)|
+--------

In [14]:
(
    amz_df
    .select(amz_df.PUBLISHER)
    .distinct()
    .filter(F.lower(amz_df.PUBLISHER).startswith("sch"))
    .sort(amz_df.PUBLISHER.asc()).show()
)

+--------------------+
|           PUBLISHER|
+--------------------+
|          SCHOLASTIC|
|SCHOTT MUSIK INTL...|
|  Schandtaten Verlag|
|      Schardt Verlag|
|          Schattauer|
|            Schiffer|
|        Schiffer LTD|
|Schiffer Military...|
|    Schiffer Pub Ltd|
| Schiffer Publishing|
|Schiffer Publishi...|
|Schiffer Publishi...|
|     Schirner Verlag|
|Schmidt Hermann V...|
|            Schocken|
|   Schoeffling + Co.|
|Schoenhofsforeign...|
|Schoeningh Verlag Im|
|     Scholars' Press|
|          Scholastic|
+--------------------+
only showing top 20 rows
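
The Schiffer and Scholastic variants above could be grouped with a simple normalization key. The sketch below lowercases each name, drops punctuation, and removes a few corporate tokens. The token list `NOISE_TOKENS` is an assumption, the function names are hypothetical, and the names in the example are the ones fully visible in the output above.

```python
import re
from collections import defaultdict

# Assumption: these tokens carry no identifying information.
NOISE_TOKENS = {"ltd", "pub", "publishing", "inc", "co"}


def normalize_publisher(name: str) -> str:
    """Lowercase, drop punctuation, and remove common corporate tokens wherever they occur."""
    tokens = re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split()
    kept = [t for t in tokens if t not in NOISE_TOKENS]
    return " ".join(kept)


def group_variants(names):
    """Group raw publisher names that share the same normalized key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_publisher(name)].append(name)
    # Keep only keys with more than one raw spelling.
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}


names = ["SCHOLASTIC", "Scholastic", "Schiffer", "Schiffer LTD",
         "Schiffer Pub Ltd", "Schiffer Publishing", "Schocken"]
variants = group_variants(names)
```

This key is deliberately crude: it strips non-ASCII letters, so names with umlauts or accents would need a Unicode-aware pattern, and it will not catch misspellings.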

