# Seattle Public Library (SPL) data analysis (inventory)

This notebook performs exploratory analysis of the [Seattle Public Library Checkout Records dataset](https://www.kaggle.com/seattle-public-library/seattle-library-checkout-records) found on Kaggle.

It focuses on the inventory file, which contains a list of the books held by the SPL.

## Set up and load data

In [1]:
import os

import findspark
findspark.init()

from dotenv import load_dotenv
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, FloatType, StringType, StructField, StructType
from pyspark_dist_explore import hist

import helpers as H
 
%matplotlib inline

spark = SparkSession.builder.appName("ExploreSplInventory").getOrCreate()

In [2]:
load_dotenv()

SPL_INVENTORY_PATH = os.getenv("SPL_INVENTORY_PATH")

In [3]:
spl_df = spark.read.option("header", "true").csv(SPL_INVENTORY_PATH)

# Add a column that stores the report date as a timestamp
spl_df = spl_df.withColumn(
    "ReportDateTS",
    F.to_timestamp(spl_df.ReportDate, "MM/dd/yyyy"),
)

## Inspect table

In [4]:
spl_df.limit(5).toPandas()

| | BibNum | Title | Author | ISBN | PublicationYear | Publisher | Subjects | ItemType | ItemCollection | FloatingItem | ItemLocation | ReportDate | ItemCount | ReportDateTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3011076 | A tale of two friends / adapted by Ellie O'Rya... | O'Ryan, Ellie | 1481425730, 1481425749, 9781481425735, 9781481... | 2014. | Simon Spotlight, | Musicians Fiction, Bullfighters Fiction, Best ... | jcbk | ncrdr | Floating | qna | 09/01/2017 | 1 | 2017-09-01 |
| 1 | 2248846 | Naruto. Vol. 1, Uzumaki Naruto / story and art... | Kishimoto, Masashi, 1974- | 1569319006 | 2003, c1999. | Viz, | Ninja Japan Comic books strips etc, Comic book... | acbk | nycomic | | lcy | 09/01/2017 | 1 | 2017-09-01 |
| 2 | 3209270 | Peace, love & Wi-Fi : a ZITS treasury / by Jer... | Scott, Jerry, 1955- | 144945867X, 9781449458676 | 2014. | Andrews McMeel Publishing, | Duncan Jeremy Fictitious character Comic books... | acbk | nycomic | | bea | 09/01/2017 | 1 | 2017-09-01 |
| 3 | 1907265 | The Paris pilgrims : a novel / Clancy Carlile. | Carlile, Clancy, 1930- | 0786706155 | c1999. | Carroll & Graf, | Hemingway Ernest 1899 1961 Fiction, Biographic... | acbk | cafic | | cen | 09/01/2017 | 1 | 2017-09-01 |
| 4 | 1644616 | Erotic by nature : a celebration of life, of l... | | 094020813X | 1991, c1988. | Red Alder Books/Down There Press, | Erotic literature American, American literatur... | acbk | canf | | cen | 09/01/2017 | 1 | 2017-09-01 |


## BibNum

- Why are there so few distinct BibNums (584,391) relative to the number of rows (2,687,149)?
- Is BibNum sequential?

In [5]:
H.get_basic_counts(spl_df, spl_df.BibNum)
H.check_nulls(spl_df, spl_df.BibNum, spl_df.Title)
H.check_empty_strings(spl_df, spl_df.BibNum)
H.check_lengths(spl_df, spl_df.BibNum)

+-------------+----------------------+
|count(BibNum)|count(DISTINCT BibNum)|
+-------------+----------------------+
|      2687149|                584391|
+-------------+----------------------+

+-----------------+
|Has Null (BibNum)|
+-----------------+
|                0|
+-----------------+

+------------------+
|Has Empty (BibNum)|
+------------------+
|                 0|
+------------------+

+------+--------------+
|BibNum|length(BibNum)|
+------+--------------+
|     4|             1|
|     7|             1|
|     4|             1|
|     7|             1|
|    84|             2|
|    73|             2|
|    91|             2|
|    91|             2|
|    47|             2|
|    63|             2|
+------+--------------+
only showing top 10 rows

+-------+--------------+
| BibNum|length(BibNum)|
+-------+--------------+
|2683667|             7|
|2843472|             7|
|3225705|             7|
|2512885|             7|
|2947442|             7|
|2606479|             7|
|2556677| 
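
One rough way to approach the "is BibNum sequential?" question is to compare the number of distinct values with the size of the numeric range they span. The sketch below is plain Python rather than Spark, and `range_coverage` is a hypothetical helper, not part of the `helpers` module. It plugs in the distinct count from the output above, the smallest value shown (4), and one of the 7-digit values shown (3225705), which is only a lower bound on the true maximum.

```python
def range_coverage(distinct_count: int, lowest: int, highest: int) -> float:
    """Fraction of the integers in [lowest, highest] that are actually used."""
    span = highest - lowest + 1
    if span <= 0:
        raise ValueError("highest must be >= lowest")
    return distinct_count / span


# Values read off the output above. 3225705 is only a lower bound on the
# true maximum, so the real coverage can only be smaller than this figure.
coverage = range_coverage(distinct_count=584391, lowest=4, highest=3225705)
print(f"{coverage:.1%}")
```

If the result is well below 100%, the identifiers cannot form a gap-free sequence, although they may still be assigned in increasing order.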

## Title

- Many titles are duplicated: only 567,617 of the 2,672,825 non-null titles are distinct, which is close to the number of distinct BibNums (584,391).
- 14,324 rows have a null title.

In [6]:
H.get_basic_counts(spl_df, spl_df.Title)
H.check_nulls(spl_df, spl_df.Title, spl_df.BibNum)
H.check_empty_strings(spl_df, spl_df.Title)
H.check_lengths(spl_df, spl_df.Title)

+------------+---------------------+
|count(Title)|count(DISTINCT Title)|
+------------+---------------------+
|     2672825|               567617|
+------------+---------------------+

+----------------+
|Has Null (Title)|
+----------------+
|           14324|
+----------------+

+-----------------+
|Has Empty (Title)|
+-----------------+
|                0|
+-----------------+

+-----+-------------+
|Title|length(Title)|
+-----+-------------+
| null|         null|
| null|         null|
| null|         null|
| null|         null|
| null|         null|
| null|         null|
| null|         null|
| null|         null|
| null|         null|
| null|         null|
+-----+-------------+
only showing top 10 rows

+--------------------+-------------+
|               Title|length(Title)|
+--------------------+-------------+
|Nation's forests ...|         1228|
|Nation's forests ...|         1228|
|Nominations of Jo...|         1165|
|Nominations of Jo...|         1165|
|El apóstata [vide...|  

## Author

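- 426,238 rows have a null author.
- There are 218,757 distinct author strings. The field is messy (the longest values appear to be fragments of other text rather than names) and may need cleaning up.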

In [7]:
H.get_basic_counts(spl_df, spl_df.Author)
H.check_nulls(spl_df, spl_df.Author, spl_df.BibNum)
H.check_empty_strings(spl_df, spl_df.Author)
H.check_lengths(spl_df, spl_df.Author)

+-------------+----------------------+
|count(Author)|count(DISTINCT Author)|
+-------------+----------------------+
|      2260911|                218757|
+-------------+----------------------+

+-----------------+
|Has Null (Author)|
+-----------------+
|           426238|
+-----------------+

+------------------+
|Has Empty (Author)|
+------------------+
|                 0|
+------------------+

+------+--------------+
|Author|length(Author)|
+------+--------------+
|  null|          null|
|  null|          null|
|  null|          null|
|  null|          null|
|  null|          null|
|  null|          null|
|  null|          null|
|  null|          null|
|  null|          null|
|  null|          null|
+------+--------------+
only showing top 10 rows

+--------------------+--------------+
|              Author|length(Author)|
+--------------------+--------------+
| famous scout and...|           217|
| famous scout and...|           217|
| 1792-1811 : a br...|           205|
| 1792-

## ISBN

- 587,225 rows have a null ISBN, which will make it harder to link those rows to a specific book.
- Some rows hold very long comma-separated lists of ISBNs (up to 1,186 characters).

In [8]:
H.get_basic_counts(spl_df, spl_df.ISBN)
H.check_nulls(spl_df, spl_df.ISBN, spl_df.BibNum)
H.check_empty_strings(spl_df, spl_df.ISBN)
H.check_lengths(spl_df, spl_df.ISBN)

+-----------+--------------------+
|count(ISBN)|count(DISTINCT ISBN)|
+-----------+--------------------+
|    2099924|              397501|
+-----------+--------------------+

+---------------+
|Has Null (ISBN)|
+---------------+
|         587225|
+---------------+

+----------------+
|Has Empty (ISBN)|
+----------------+
|               0|
+----------------+

+----+------------+
|ISBN|length(ISBN)|
+----+------------+
|null|        null|
|null|        null|
|null|        null|
|null|        null|
|null|        null|
|null|        null|
|null|        null|
|null|        null|
|null|        null|
|null|        null|
+----+------------+
only showing top 10 rows

+--------------------+------------+
|                ISBN|length(ISBN)|
+--------------------+------------+
|0788403923, 07884...|        1186|
|0788403923, 07884...|        1186|
|0691015856, 06910...|        1159|
|0691015856, 06910...|        1159|
|0806352604, 08063...|         997|
|0806352604, 08063...|         997|
|080930
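
Because a single ISBN cell can hold many comma-separated values, linking records by ISBN probably needs the list split up first. The following is a minimal plain-Python sketch of that step. `split_isbns` and `by_length` are hypothetical helper names, and grouping by string length alone is an assumption that does not validate check digits.

```python
from typing import Dict, List, Optional


def split_isbns(raw: Optional[str]) -> List[str]:
    """Split a comma-separated ISBN string into unique, trimmed values."""
    if not raw:
        return []
    seen = set()
    result = []
    for part in raw.split(","):
        isbn = part.strip()
        if isbn and isbn not in seen:
            seen.add(isbn)
            result.append(isbn)
    return result


def by_length(isbns: List[str]) -> Dict[str, List[str]]:
    """Group ISBNs into ISBN-10, ISBN-13 and anything else, by length only."""
    groups: Dict[str, List[str]] = {"isbn10": [], "isbn13": [], "other": []}
    for isbn in isbns:
        key = {10: "isbn10", 13: "isbn13"}.get(len(isbn), "other")
        groups[key].append(isbn)
    return groups


sample = "144945867X, 9781449458676"  # value taken from the table preview
print(by_length(split_isbns(sample)))
```

In Spark, a similar effect could likely be had with `F.split` followed by `F.explode`, giving one row per ISBN.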

## Publication Year

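- Has NULLs (32,376 rows)
- Not very clean: the values are free-text strings such as `2003, c1999.` or `c1999.` rather than plain years, which helps explain the 16,225 distinct values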

In [9]:
H.get_basic_counts(spl_df, spl_df.PublicationYear)
H.check_nulls(spl_df, spl_df.PublicationYear, spl_df.BibNum)
H.check_empty_strings(spl_df, spl_df.PublicationYear)
H.check_lengths(spl_df, spl_df.PublicationYear)

+----------------------+-------------------------------+
|count(PublicationYear)|count(DISTINCT PublicationYear)|
+----------------------+-------------------------------+
|               2654773|                          16225|
+----------------------+-------------------------------+

+--------------------------+
|Has Null (PublicationYear)|
+--------------------------+
|                     32376|
+--------------------------+

+---------------------------+
|Has Empty (PublicationYear)|
+---------------------------+
|                          0|
+---------------------------+

+---------------+-----------------------+
|PublicationYear|length(PublicationYear)|
+---------------+-----------------------+
|           null|                   null|
|           null|                   null|
|           null|                   null|
|           null|                   null|
|           null|                   null|
|           null|                   null|
|           null|                   nul
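
To illustrate what cleaning this field might involve, here is a small plain-Python sketch that pulls the first four-digit number out of a raw value. `first_year` is a hypothetical helper, and taking the first year (rather than, say, the copyright year after `c`) is an assumption that would need checking against the data.

```python
import re
from typing import Optional

# A run of exactly four digits that is not part of a longer number.
_YEAR = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def first_year(raw: Optional[str]) -> Optional[int]:
    """Return the first four-digit number in the string, or None."""
    if raw is None:
        return None
    match = _YEAR.search(raw)
    return int(match.group(1)) if match else None


# Sample values taken from the table preview earlier in the notebook.
for value in ["2014.", "2003, c1999.", "c1999.", None]:
    print(repr(value), "->", first_year(value))
```

The same pattern could probably be applied column-wide with `F.regexp_extract` instead of a Python function.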

## Publisher

- Has NULLs

In [10]:
H.get_basic_counts(spl_df, spl_df.Publisher)
H.check_nulls(spl_df, spl_df.Publisher, spl_df.BibNum)
H.check_empty_strings(spl_df, spl_df.Publisher)
H.check_lengths(spl_df, spl_df.Publisher)

+----------------+-------------------------+
|count(Publisher)|count(DISTINCT Publisher)|
+----------------+-------------------------+
|         2649210|                    96894|
+----------------+-------------------------+

+--------------------+
|Has Null (Publisher)|
+--------------------+
|               37939|
+--------------------+

+---------------------+
|Has Empty (Publisher)|
+---------------------+
|                    0|
+---------------------+

+---------+-----------------+
|Publisher|length(Publisher)|
+---------+-----------------+
|     null|             null|
|     null|             null|
|     null|             null|
|     null|             null|
|     null|             null|
|     null|             null|
|     null|             null|
|     null|             null|
|     null|             null|
|     null|             null|
+---------+-----------------+
only showing top 10 rows

+--------------------+-----------------+
|           Publisher|length(Publisher)|
+--------

### Check possibility of duplicates

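- The same publisher appears to be recorded under several spelling and punctuation variants (for example `Schocken Books` and `Schocken Books,`, or `Schocken,` and `Schoken,`)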

In [11]:
(
    spl_df
    .select(spl_df.Publisher)
    .distinct()
    .filter(F.lower(spl_df.Publisher).startswith("scho"))
    .sort(spl_df.Publisher.asc())
    .show()
)

+--------------------+
|           Publisher|
+--------------------+
|Schocken : Distri...|
|Schocken : Nextbook,|
|      Schocken Books|
|Schocken Books : ...|
|Schocken Books : ...|
|Schocken Books ; ...|
|Schocken Books ; ...|
|     Schocken Books,|
|Schocken Books, O...|
|Schocken Publishi...|
|           Schocken,|
|Schoenberg, Arnol...|
|Schoharie County ...|
|            Schoken,|
|      Scholar Press,|
|Scholar's Facsimi...|
|     Scholarly Press|
|    Scholarly Press,|
|Scholarly Resourc...|
|Scholarly Resources,|
+--------------------+
only showing top 20 rows
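
Several of the variants above differ only in case or trailing punctuation. The sketch below shows one possible normalization in plain Python. `normalize_publisher` is a hypothetical helper, and the set of characters it strips is an assumption based on the rows shown here.

```python
import re
from collections import defaultdict


def normalize_publisher(name: str) -> str:
    """Lowercase, collapse whitespace and drop trailing punctuation."""
    collapsed = re.sub(r"\s+", " ", name).strip()
    return collapsed.rstrip(" ,;:.").lower()


# A few of the variants shown in the output above.
variants = [
    "Schocken Books",
    "Schocken Books,",
    "Schocken,",
    "Scholarly Press",
    "Scholarly Press,",
]

groups = defaultdict(list)
for variant in variants:
    groups[normalize_publisher(variant)].append(variant)

for key, members in groups.items():
    print(key, "<-", members)
```

This only merges punctuation and case variants. Misspellings such as `Schoken,` would still need some form of fuzzy matching.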



## Subjects

- Has NULLs

In [12]:
H.get_basic_counts(spl_df, spl_df.Subjects)
H.check_nulls(spl_df, spl_df.Subjects, spl_df.BibNum)
H.check_empty_strings(spl_df, spl_df.Subjects)
H.check_lengths(spl_df, spl_df.Subjects)

+---------------+------------------------+
|count(Subjects)|count(DISTINCT Subjects)|
+---------------+------------------------+
|        2621314|                  439996|
+---------------+------------------------+

+-------------------+
|Has Null (Subjects)|
+-------------------+
|              65835|
+-------------------+

+--------------------+
|Has Empty (Subjects)|
+--------------------+
|                   0|
+--------------------+

+--------+----------------+
|Subjects|length(Subjects)|
+--------+----------------+
|    null|            null|
|    null|            null|
|    null|            null|
|    null|            null|
|    null|            null|
|    null|            null|
|    null|            null|
|    null|            null|
|    null|            null|
|    null|            null|
+--------+----------------+
only showing top 10 rows

+--------------------+----------------+
|            Subjects|length(Subjects)|
+--------------------+----------------+
|National parks an.

## Item Type

- Has NULLs

In [13]:
H.get_basic_counts(spl_df, spl_df.ItemType)
H.check_nulls(spl_df, spl_df.ItemType, spl_df.BibNum)
H.check_empty_strings(spl_df, spl_df.ItemType)
H.check_lengths(spl_df, spl_df.ItemType)

+---------------+------------------------+
|count(ItemType)|count(DISTINCT ItemType)|
+---------------+------------------------+
|        2686132|                    1876|
+---------------+------------------------+

+-------------------+
|Has Null (ItemType)|
+-------------------+
|               1017|
+-------------------+

+--------------------+
|Has Empty (ItemType)|
+--------------------+
|                   0|
+--------------------+

+--------+----------------+
|ItemType|length(ItemType)|
+--------+----------------+
|    null|            null|
|    null|            null|
|    null|            null|
|    null|            null|
|    null|            null|
|    null|            null|
|    null|            null|
|    null|            null|
|    null|            null|
|    null|            null|
+--------+----------------+
only showing top 10 rows

+--------------------+----------------+
|            ItemType|length(ItemType)|
+--------------------+----------------+
|Sound recording i.

## Item Collection

- Has NULLs

In [14]:
H.get_basic_counts(spl_df, spl_df.ItemCollection)
H.check_nulls(spl_df, spl_df.ItemCollection, spl_df.BibNum)
H.check_empty_strings(spl_df, spl_df.ItemCollection)
H.check_lengths(spl_df, spl_df.ItemCollection)

+---------------------+------------------------------+
|count(ItemCollection)|count(DISTINCT ItemCollection)|
+---------------------+------------------------------+
|              2686323|                          1150|
+---------------------+------------------------------+

+-------------------------+
|Has Null (ItemCollection)|
+-------------------------+
|                      826|
+-------------------------+

+--------------------------+
|Has Empty (ItemCollection)|
+--------------------------+
|                         0|
+--------------------------+

+--------------+----------------------+
|ItemCollection|length(ItemCollection)|
+--------------+----------------------+
|          null|                  null|
|          null|                  null|
|          null|                  null|
|          null|                  null|
|          null|                  null|
|          null|                  null|
|          null|                  null|
|          null|                  nul

## Floating Item

- Has NULLs
- What is a floating item?

In [15]:
H.get_basic_counts(spl_df, spl_df.FloatingItem)
H.check_nulls(spl_df, spl_df.FloatingItem, spl_df.BibNum)
H.check_empty_strings(spl_df, spl_df.FloatingItem)
H.check_lengths(spl_df, spl_df.FloatingItem)

+-------------------+----------------------------+
|count(FloatingItem)|count(DISTINCT FloatingItem)|
+-------------------+----------------------------+
|            2686366|                         605|
+-------------------+----------------------------+

+-----------------------+
|Has Null (FloatingItem)|
+-----------------------+
|                    783|
+-----------------------+

+------------------------+
|Has Empty (FloatingItem)|
+------------------------+
|                       0|
+------------------------+

+------------+--------------------+
|FloatingItem|length(FloatingItem)|
+------------+--------------------+
|        null|                null|
|        null|                null|
|        null|                null|
|        null|                null|
|        null|                null|
|        null|                null|
|        null|                null|
|        null|                null|
|        null|                null|
|        null|                null|
+--------

## Report Date

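- Has NULLs (7,325 rows where `ReportDateTS` is null)
- There are only two report dates (2017-09-01 and 2017-10-01). I assume these are the dates on which the report was generated, but it is unclear why there are two.
  - Check the distribution of rows across the two dates
  - The rows appear to be split roughly evenly between the two dates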


In [16]:
H.get_basic_counts(spl_df, spl_df.ReportDateTS)
H.check_nulls(spl_df, spl_df.ReportDateTS, spl_df.BibNum)
H.basic_stats(spl_df, spl_df.ReportDateTS)

+-------------------+----------------------------+
|count(ReportDateTS)|count(DISTINCT ReportDateTS)|
+-------------------+----------------------------+
|            2679824|                           2|
+-------------------+----------------------------+

+-----------------------+
|Has Null (ReportDateTS)|
+-----------------------+
|                   7325|
+-----------------------+

+-------------------+--------------------+-------------------+-------------------+
|count(ReportDateTS)|   avg(ReportDateTS)|  min(ReportDateTS)|  max(ReportDateTS)|
+-------------------+--------------------+-------------------+-------------------+
|            2679824|1.5055450984410915E9|2017-09-01 00:00:00|2017-10-01 00:00:00|
+-------------------+--------------------+-------------------+-------------------+
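
With only two distinct timestamps, the average alone is enough to estimate how the rows are split between them: if a share `p` of rows carries the later date, then `avg = (1 - p) * t_min + p * t_max`. The plain-Python sketch below works through that arithmetic with the average from the output above. `share_of_later_date` is a hypothetical helper, and treating the two midnights as UTC is an assumption.

```python
from datetime import datetime, timezone


def share_of_later_date(avg_epoch: float, earlier: datetime, later: datetime) -> float:
    """Solve avg = (1 - p) * t_earlier + p * t_later for p."""
    t0, t1 = earlier.timestamp(), later.timestamp()
    return (avg_epoch - t0) / (t1 - t0)


# ASSUMPTION: both midnights are interpreted as UTC. Spark renders
# timestamps in the session time zone, so the true offsets may differ.
utc = timezone.utc
share = share_of_later_date(
    1.5055450984410915e9,
    datetime(2017, 9, 1, tzinfo=utc),
    datetime(2017, 10, 1, tzinfo=utc),
)
print(f"share of rows dated 2017-10-01: {share:.3f}")
```

A time zone offset of a few hours shifts the result by roughly one percentage point, so this is only an approximation. Grouping by `ReportDateTS` and counting would give the exact split.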



## Item Count

- Has NULLs
- Stored as a string because the CSV was read without a schema; should be cast to an integer

In [17]:
H.get_basic_counts(spl_df, spl_df.ItemCount)
H.check_nulls(spl_df, spl_df.ItemCount, spl_df.BibNum)

+----------------+-------------------------+
|count(ItemCount)|count(DISTINCT ItemCount)|
+----------------+-------------------------+
|         2686403|                      356|
+----------------+-------------------------+

+--------------------+
|Has Null (ItemCount)|
+--------------------+
|                 746|
+--------------------+
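
As a rough illustration of the integer conversion, the plain-Python sketch below parses a raw value and returns `None` when it is not a valid integer. `to_int` is a hypothetical helper, and the sample values are illustrative rather than taken from the `ItemCount` column.

```python
from typing import Optional


def to_int(raw: Optional[str]) -> Optional[int]:
    """Parse a string as an integer, returning None when it is not one."""
    if raw is None:
        return None
    try:
        return int(raw.strip())
    except ValueError:
        return None


# Illustrative inputs: a clean value, a padded value, a stray code, a null.
for value in ["1", " 12 ", "acbk", None]:
    print(repr(value), "->", to_int(value))
```

In Spark, casting the column with `spl_df.ItemCount.cast(IntegerType())` should behave similarly, although whether invalid strings become null or raise an error depends on the ANSI mode setting.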

