## Cleaning the book sales and book rating data

This notebook outlines how the book sales data and the book rating data are cleaned
so they can be used for EDA and analysis later on.

The process below uses three datasets:
1. [publishers](https://corgis-edu.github.io/corgis/csv/publishers/)
2. [BX-Book-Ratings](http://www2.informatik.uni-freiburg.de/~cziegler/BX/)
3. [BX-Books](http://www2.informatik.uni-freiburg.de/~cziegler/BX/)

After cleaning, the publishers dataset (sales) will contain all the sales data, while the two BX datasets
will be combined into one (rating) that contains the characteristics and mean rating of each book.

#### Dependencies

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
plt.style.use('seaborn-white')

#### Goal

The cleaned sales data should have snake_case column names, with data-leaking columns
such as the revenue columns and the sales rank removed. Units sold will be the only response variable.
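
As a minimal sketch of that goal, the snippet below normalizes column names on a toy frame and drops the leaking columns by name rather than by position. The raw column names and values are assumptions made for illustration, and the name-based filter is a hypothetical alternative to the positional drop used in this notebook.

```python
import pandas as pd

# Toy frame; the raw column names and values here are assumed for illustration only
raw = pd.DataFrame(
    [["fiction", 10.0, 7000, 4.57, 12]],
    columns=["genre", "daily average.gross sales", "daily average.units sold",
             "statistics.average rating", "statistics.sales rank"],
)

clean = raw.copy()
# regex=True is explicit: pandas >= 2.0 treats the pattern as a literal by default
clean.columns = (
    clean.columns
    .str.replace(r"[.\s]", "_", regex=True)
    .str.replace("statistics_", "", regex=False)
)

# Select leaking columns by name (hypothetical rule) so a reordered file cannot shift the drop
leaky = [c for c in clean.columns
         if "revenue" in c or "gross_sales" in c or c == "sales_rank"]
clean = clean.drop(columns=leaky)

print(list(clean.columns))
```

Dropping by name is slightly more verbose than slicing `columns[2:6]`, but it does not depend on the column order of the source file.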

#### Sales data

In [2]:
# Publisher dataset 
data_path = 'D:\\PycharmProjects\\springboard\\data\\'
sales = pd.read_csv(f'{data_path}publishers.csv')

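# Replace dots and spaces in column names with underscores, then remove the "statistics_" prefix.
# regex=True is explicit because pandas >= 2.0 treats the pattern as a literal by default.
sales.columns = (sales.columns.str.replace(r'[.\s]', '_', regex=True)
                 .str.replace('statistics_', '', regex=False))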

# Remove the revenue and gross sales columns: they leak information about units sold
# and would introduce multicollinearity. We only want units_sold in this case
sales = sales.drop(sales.columns[2:6], axis=1)
sales = sales.drop('sales_rank', axis=1)

# Save the cleaned data
sales.to_csv(f'{data_path}book_sales.csv')

# First look (info() prints directly, so wrapping it in print() would add a stray "None")
sales.info()
sales.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27027 entries, 0 to 27026
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   genre                     27027 non-null  object 
 1   sold_by                   27027 non-null  object 
 2   daily_average_units_sold  27027 non-null  int64  
 3   publisher_name            27027 non-null  object 
 4   publisher_type            27027 non-null  object 
 5   average_rating            27027 non-null  float64
 6   sale_price                27027 non-null  float64
 7   total_reviews             27027 non-null  int64  
dtypes: float64(2), int64(2), object(4)
memory usage: 1.6+ MB


| | genre | sold_by | daily_average_units_sold | publisher_name | publisher_type | average_rating | sale_price | total_reviews |
|---|---|---|---|---|---|---|---|---|
| 0 | genre fiction | HarperCollins Publishers | 7000 | Katherine Tegen Books | big five | 4.57 | 4.88 | 9604 |
| 1 | genre fiction | HarperCollins Publishers | 6250 | HarperCollins e-books | big five | 4.47 | 1.99 | 450 |
| 2 | genre fiction | Amazon Digital Services, Inc. | 5500 | (Small or Medium Publisher) | small/medium | 4.16 | 8.69 | 30 |
| 3 | fiction | Hachette Book Group | 5500 | Little, Brown and Company | big five | 3.84 | 7.5 | 3747 |
| 4 | genre fiction | Penguin Group (USA) LLC | 4750 | Dutton Children's | big five | 4.75 | 7.99 | 9174 |


#### Books and reviews data

In [3]:
# Load the books dataset and clean up the column names. The last 3 columns are omitted
# since they only contain links.
# on_bad_lines='error' (pandas >= 1.3) replaces error_bad_lines=True, which was removed in pandas 2.0
books = pd.read_csv(f'{data_path}BX-Books.csv', sep=';', on_bad_lines='error',
                    usecols=[0,1,2,3,4], encoding='ISO-8859-1', index_col='ISBN',
                    low_memory=False)
books.columns = books.columns.str.lower().str.replace('-','_')

# Count
print(books.count())

book_title             271379
book_author            271378
year_of_publication    271379
publisher              271377
dtype: int64


There are multiple reviews of the same book (isbn) from different users. We therefore
use the mean rating per book as the metric to merge into the books data.
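
The aggregation can be sketched on a toy frame as below. The ratings are invented for illustration. The second aggregate is an optional variation that this notebook does not apply: the BX documentation describes a rating of 0 as implicit feedback, so excluding zeros before averaging may be worth considering.

```python
import pandas as pd

# Invented ratings for illustration; one book has two reviews, one has only a 0
toy_reviews = pd.DataFrame({
    "isbn": ["0195153448", "0002005018", "0002005018", "0060973129"],
    "book_rating": [0, 5, 8, 5],
})

# Mean rating per isbn, as done in this notebook
mean_rating = toy_reviews.groupby("isbn")["book_rating"].mean()

# Optional variation (not used here): average only the non-zero ratings
explicit_only = toy_reviews[toy_reviews["book_rating"] > 0]
explicit_mean = explicit_only.groupby("isbn")["book_rating"].mean()

print(mean_rating)
print(explicit_mean)
```

With the second approach, a book that only has zero ratings drops out of the aggregate and would end up with a missing rating after the merge instead of 0.0.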

In [4]:
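# Load the reviews data. We also convert the column names to lower snake_case.
# on_bad_lines='error' (pandas >= 1.3) replaces error_bad_lines=True, which was removed in pandas 2.0
reviews = pd.read_csv(f'{data_path}BX-Book-Ratings.csv', sep=';', on_bad_lines='error',
                      encoding='ISO-8859-1', usecols=[1,2])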
reviews.columns = reviews.columns.str.lower().str.replace('-', '_')

# Group by isbn and get mean rating
reviews = reviews.groupby('isbn').mean()

# print info on reviews
print(reviews.count())

book_rating    340556
dtype: int64


In [5]:
# Left-merge books and reviews on isbn: every book is kept, and reviews whose isbn is not in books are left out
rating = pd.merge(books, reviews, how='left', left_index=True, right_index=True)

# save for future use
rating.to_csv(f'{data_path}book_rating.csv')

# first look
rating.info()
rating.head()

<class 'pandas.core.frame.DataFrame'>
Index: 271379 entries, 0195153448 to 0767409752
Data columns (total 5 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   book_title           271379 non-null  object 
 1   book_author          271378 non-null  object 
 2   year_of_publication  271379 non-null  object 
 3   publisher            271377 non-null  object 
 4   book_rating          270170 non-null  float64
dtypes: float64(1), object(4)
memory usage: 22.4+ MB


| ISBN | book_title | book_author | year_of_publication | publisher | book_rating |
|---|---|---|---|---|---|
| 0195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | 0.0 |
| 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | 4.928571 |
| 0060973129 | Decision in Normandy | Carlo D'Este | 1991 | HarperPerennial | 5.0 |
| 0374157065 | Flu: The Story of the Great Influenza Pandemic... | Gina Bari Kolata | 1999 | Farrar Straus Giroux | 4.272727 |
| 0393045218 | The Mummies of Urumchi | E. J. W. Barber | 1999 | W. W. Norton &amp; Company | 0.0 |
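
The info output above shows 270170 non-null ratings for 271379 books, so 1209 books have no matching review. A minimal sketch of how such unmatched rows could be inspected is shown below, using a toy pair of frames with invented keys and the `indicator` option of `pd.merge`.

```python
import pandas as pd

# Toy frames with invented keys; "0009" exists only on the reviews side
toy_books = pd.DataFrame(
    {"book_title": ["A", "B", "C"]},
    index=pd.Index(["0001", "0002", "0003"], name="ISBN"),
)
toy_means = pd.DataFrame(
    {"book_rating": [4.0, 7.5]},
    index=pd.Index(["0001", "0009"], name="isbn"),
)

# indicator=True adds a _merge column that records where each row came from
merged = pd.merge(toy_books, toy_means, how="left",
                  left_index=True, right_index=True, indicator=True)

# Books without any review keep their row but get a missing rating
unrated = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unrated))
```

The left merge keeps all three toy books and discards the review for "0009", which mirrors how reviews for ISBNs absent from the books data are dropped in this notebook.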


In [6]:
# Clear out unused data frames
del books
del reviews
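
When the saved book_rating.csv is read back later, the ISBN column may be parsed as a number, which would strip leading zeros. The sketch below shows one way to guard against that, assuming the file is reloaded with `pd.read_csv`. An in-memory string with two sample rows stands in for the real file.

```python
import io
import pandas as pd

# In-memory stand-in for the saved CSV (two sample rows, for illustration)
csv_text = (
    "ISBN,book_title,book_rating\n"
    "0195153448,Classical Mythology,0.0\n"
    "0002005018,Clara Callan,4.928571\n"
)

# Default parsing infers an integer column and loses the leading zeros
default = pd.read_csv(io.StringIO(csv_text)).set_index("ISBN")

# Forcing a string dtype keeps each ISBN exactly as written
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"ISBN": str}).set_index("ISBN")

print(default.index.tolist())
print(as_text.index.tolist())
```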




