# Link SPL Library Books with Goodreads books

This notebook attempts to link SPL library books with Goodreads books using ISBNs.

The SPL dataset lists multiple ISBNs under one BibNumber because the same book can come in different editions and formats (such as paperback or hardcover). It is unclear whether the same ISBN can appear under more than one BibNumber.

There are also many duplicate BibNumbers. What constitutes a unique row?

The Goodreads data contains one ISBN/ISBN13 pair per row. It seems possible that the same book appears more than once in this dataset,
so we may need to consolidate duplicate matches.
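
Since an SPL record may carry either ISBN form while Goodreads stores both, one option is to normalize every ISBN to its 13-digit form before comparing. The following is a minimal sketch of that idea in plain Python; `isbn10_to_isbn13` and `normalize_isbn` are hypothetical helpers, not part of this notebook or of `helpers`, and they assume ISBN-10 values map to the `978` prefix.

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert a 10-character ISBN to its 13-digit form (hypothetical helper)."""
    # Drop the old check digit and add the "978" prefix
    core = "978" + isbn10[:9]
    # ISBN-13 check digit: weights alternate 1, 3 across the first 12 digits
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)


def normalize_isbn(raw: str) -> str:
    """Strip separators and return an ISBN-13 where possible."""
    cleaned = raw.replace("-", "").replace(" ", "").upper()
    if len(cleaned) == 10:
        return isbn10_to_isbn13(cleaned)
    # Already 13 characters, or an unexpected length: returned unchanged
    return cleaned
```

With both sides normalized, the link could be a single equality on one column. Malformed values (non-digit characters in the first nine positions) would raise a `ValueError` here and would need separate handling.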

## Set up and load data

In [1]:
import os

import findspark
findspark.init()

from dotenv import load_dotenv
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, FloatType, StringType, StructField, StructType
from pyspark_dist_explore import hist

import helpers as H

%matplotlib inline

spark = SparkSession.builder.appName("LinkSPLBooksGoodreads").getOrCreate()

In [2]:
load_dotenv()

GOODREADS_BOOKS_PATH = os.getenv("GOODREADS_BOOKS_PATH")
SPL_INVENTORY_PATH = os.getenv("SPL_INVENTORY_PATH")

In [3]:
book_schema = StructType([
    StructField("bookID", StringType(), True),
    StructField("title", StringType(), True),
    StructField("authors", StringType(), True),
    StructField("average_rating", FloatType(), True),
    StructField("isbn", StringType(), True),
    StructField("isbn13", StringType(), True),
    StructField("language_code", StringType(), True),
    StructField("num_pages", IntegerType(), True),
    StructField("ratings_count", IntegerType(), True),
    StructField("text_reviews_count", IntegerType(), True),
    StructField("publication_date", StringType(), True),
    StructField("publicater", StringType(), True),
])
goodreads_df = spark.read.schema(book_schema).option("header", "true").csv(GOODREADS_BOOKS_PATH)
# Parse the publication_date strings (M/d/yyyy) into timestamps
goodreads_df = goodreads_df.withColumn(
    "publication_date",
    F.to_timestamp(goodreads_df.publication_date, "M/d/yyyy"),
)

spl_df = spark.read.option("header", "true").csv(SPL_INVENTORY_PATH)

In [4]:
# Why does the same BibNum appear so many times (up to 128 rows)?
(
    spl_df
    .groupBy(spl_df.BibNum)
    .agg(
        F.count(spl_df.BibNum).alias("total"),
    )
    .sort(F.desc("total"))
    .show(5)
)

+-------+-----+
| BibNum|total|
+-------+-----+
|1923072|  128|
|1909740|  112|
|2176912|   92|
|3168629|   82|
| 514265|   79|
+-------+-----+
only showing top 5 rows



In [5]:
# ItemLocation can differ between rows of the same BibNum, but the checkouts data doesn't specify it
# (maybe take unique rows by BibNum, ItemType, and ItemCollection)
(
    spl_df
    .select(
        spl_df.Title, 
        spl_df.Author, 
        spl_df.ISBN, 
        spl_df.ItemType, 
        spl_df.ItemCollection, 
        spl_df.FloatingItem,
        spl_df.ItemLocation,
    )
    .filter(spl_df.BibNum == "2106734")
    .show(20)
)

+--------------------+--------------------+--------------------+--------+--------------+------------+------------+
|               Title|              Author|                ISBN|ItemType|ItemCollection|FloatingItem|ItemLocation|
+--------------------+--------------------+--------------------+--------+--------------+------------+------------+
|International wil...|Burton, Maurice, ...|0761472665, 07614...|    jrbk|         ncref|          NA|         glk|
|International wil...|Burton, Maurice, ...|0761472665, 07614...|    jrbk|         ncref|          NA|         mon|
|International wil...|Burton, Maurice, ...|0761472665, 07614...|    jrbk|        ccdesk|          NA|         cen|
|International wil...|Burton, Maurice, ...|0761472665, 07614...|    jcbk|          ncnf|          NA|         glk|
|International wil...|Burton, Maurice, ...|0761472665, 07614...|    jcbk|          ncnf|          NA|         gwd|
|International wil...|Burton, Maurice, ...|0761472665, 07614...|    jrbk|       

In [6]:
spl_isbn_df = spl_df.withColumn(
    "ISBN", 
    F.explode(F.split(spl_df.ISBN, ", ")),
)
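
To make the effect of `F.split` plus `F.explode` concrete, here is a rough plain-Python equivalent: each row with a comma-separated ISBN field becomes one row per ISBN. The `explode_isbns` helper and the sample rows are made up for illustration (the ISBN values are placeholders, not real SPL data). Unlike the cell above, this sketch splits on `","` and strips whitespace, so it also tolerates a missing space after the comma.

```python
def explode_isbns(rows, sep=","):
    """Yield one (bib_num, isbn) pair per ISBN in a separated field."""
    for bib_num, isbn_field in rows:
        if not isbn_field:
            # Like F.explode, a missing ISBN list produces no output rows
            continue
        for isbn in isbn_field.split(sep):
            isbn = isbn.strip()
            if isbn:
                yield bib_num, isbn


# Placeholder rows: (BibNum, ISBN field)
sample_rows = [
    ("b1", "1111111111, 2222222222222"),
    ("b2", None),
    ("b3", "3333333333,4444444444"),
]
exploded = list(explode_isbns(sample_rows))
```

Note that rows without any ISBN disappear after this step, so they can never be linked to Goodreads.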

In [7]:
# A lot of duplicate ISBNs
H.get_basic_counts(spl_isbn_df, spl_isbn_df.ISBN)

+-----------+--------------------+
|count(ISBN)|count(DISTINCT ISBN)|
+-----------+--------------------+
|    4511565|              735965|
+-----------+--------------------+
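
The gap between the total and distinct counts does not tell us whether each repeated ISBN stays within one BibNum (multiple copies of the same item) or spans several. A small sketch of that check on plain `(BibNum, ISBN)` pairs follows; `isbns_shared_across_bibnums` and the sample pairs are hypothetical. In Spark, the same question could be asked by grouping on `ISBN` and aggregating with `F.countDistinct` over `BibNum`.

```python
from collections import defaultdict


def isbns_shared_across_bibnums(pairs):
    """Return {isbn: sorted BibNums} for ISBNs seen under more than one BibNum."""
    bibnums_by_isbn = defaultdict(set)
    for bib_num, isbn in pairs:
        bibnums_by_isbn[isbn].add(bib_num)
    return {
        isbn: sorted(bibnums)
        for isbn, bibnums in bibnums_by_isbn.items()
        if len(bibnums) > 1
    }


# Placeholder pairs: "A" repeats within b1 and also appears under b2
sample_pairs = [("b1", "A"), ("b1", "A"), ("b2", "A"), ("b3", "B")]
shared = isbns_shared_across_bibnums(sample_pairs)
```

An empty result on the real data would mean every ISBN belongs to exactly one BibNum, which would answer the open question from the introduction.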



In [8]:
(
    spl_df
    .groupBy(spl_df.BibNum)
    .agg(
        F.count(spl_df.BibNum).alias("total"),
    )
    .sort(F.desc("total"))
    .show(5)
)

+-------+-----+
| BibNum|total|
+-------+-----+
|1923072|  128|
|1909740|  112|
|2176912|   92|
|3168629|   82|
| 514265|   79|
+-------+-----+
only showing top 5 rows



In [9]:
spl_goodreads_df = (
    spl_isbn_df
    .select(
        spl_isbn_df.BibNum,
        spl_isbn_df.ISBN.alias("spl_isbn"),
        spl_isbn_df.Title.alias("spl_title"),
        spl_isbn_df.Author.alias("spl_author"),
        spl_isbn_df.Publisher.alias("spl_publisher"),
        spl_isbn_df.PublicationYear.alias("spl_publication_year"),
    )
    .join(
        goodreads_df,
        # Inner join: the SPL ISBN may match either Goodreads identifier
        (goodreads_df.isbn == F.col("spl_isbn")) | (goodreads_df.isbn13 == F.col("spl_isbn")),
    )
)
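
The matching rule in this join is "the SPL ISBN equals either the Goodreads `isbn` or the Goodreads `isbn13`". A plain-Python sketch of the same rule, using a lookup keyed by both identifiers, is shown here to make the behaviour explicit; `build_isbn_index`, `link`, and the sample records are hypothetical and use placeholder ISBN values.

```python
def build_isbn_index(goodreads_rows):
    """Map every isbn and isbn13 value to the Goodreads rows that carry it."""
    index = {}
    for row in goodreads_rows:
        # A set avoids indexing the same row twice if both fields were equal
        for key in {row.get("isbn"), row.get("isbn13")}:
            if key:
                index.setdefault(key, []).append(row)
    return index


def link(spl_pairs, index):
    """Yield (bib_num, spl_isbn, bookID) for every SPL ISBN found in the index."""
    for bib_num, spl_isbn in spl_pairs:
        for book in index.get(spl_isbn, []):
            yield bib_num, spl_isbn, book["bookID"]


# Placeholder data, not real records
goodreads_rows = [
    {"bookID": "g1", "isbn": "1111111111", "isbn13": "9781111111111"},
    {"bookID": "g2", "isbn": "2222222222", "isbn13": "9782222222222"},
]
spl_pairs = [("b1", "1111111111"), ("b1", "9781111111111"), ("b2", "3333333333")]
links = list(link(spl_pairs, goodreads_rows and build_isbn_index(goodreads_rows)))
```

In this sample, BibNum `b1` links to book `g1` twice, once per ISBN form. The same can happen in the Spark join, which is one reason the joined result may need consolidating.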

In [10]:
spl_goodreads_df.show()

+-------+-------------+--------------------+--------------------+--------------------+--------------------+------+--------------------+--------------------+--------------+----------+-------------+-------------+---------+-------------+------------------+-------------------+--------------------+
| BibNum|     spl_isbn|           spl_title|          spl_author|       spl_publisher|spl_publication_year|bookID|               title|             authors|average_rating|      isbn|       isbn13|language_code|num_pages|ratings_count|text_reviews_count|   publication_date|          publicater|
+-------+-------------+--------------------+--------------------+--------------------+--------------------+------+--------------------+--------------------+--------------+----------+-------------+-------------+---------+-------------+------------------+-------------------+--------------------+
|2882964|   0140259104|The nuclear age /...| O'Brien, Tim, 1946-|      Penguin Books,|               1996.|  3449| 

In [11]:
# Drop rows that are identical in every column (e.g. multiple copies of the same item)
spl_goodreads_dupe_df = spl_goodreads_df.dropDuplicates()
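
Calling `dropDuplicates()` without arguments only removes rows that agree on every column, so a BibNum matched to the same Goodreads book through two different ISBNs would still appear twice. If the goal is one row per (BibNum, book) link, the deduplication needs a key. Below is a minimal plain-Python sketch of keyed deduplication; `unique_links` and the sample matches are hypothetical. In Spark, passing a column subset such as `dropDuplicates(["BibNum", "bookID"])` should have a similar effect, though which of the duplicate rows survives is not guaranteed.

```python
def unique_links(matches):
    """Keep the first occurrence of each (bib_num, book_id) pair, in input order."""
    seen = set()
    result = []
    for bib_num, _spl_isbn, book_id in matches:
        key = (bib_num, book_id)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


# Placeholder matches: (BibNum, SPL ISBN, Goodreads bookID)
sample_matches = [
    ("b1", "1111111111", "g1"),
    ("b1", "9781111111111", "g1"),  # same link via the ISBN-13 form
    ("b2", "2222222222", "g2"),
]
deduped = unique_links(sample_matches)
```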