## Cleaning the book sales and book rating data

This notebook outlines the cleaning process for multiple datasets involving book
sales and ratings so they can be used for EDA and analysis later on.

It cleans and transforms the six datasets below:
1. [publishers](https://corgis-edu.github.io/corgis/csv/publishers/)
    * Ebook sales data from Amazon for 27k titles in 2015
2. [BX-Book-Rating](http://www2.informatik.uni-freiburg.de/~cziegler/BX/)
    * Rating info on over 270k titles
3. [BX-Books](http://www2.informatik.uni-freiburg.de/~cziegler/BX/)
    * Book info on the 270k+ titles above. Lacking isbn!
4. [kindle](https://bigml.com/dashboard/dataset/5e7999ae59f5c368a40037e0)
    * Book info on 45k Kindle books, including price
5. [goodreads](https://www.kaggle.com/jealousleopard/goodreadsbooks#books.csv)
    * Goodreads book dataset including rating and reviews
6. [nyt_fiction](https://www.kaggle.com/cmenca/new-york-times-hardcover-fiction-best-sellers)
    * New York Times hardcover fiction best sellers from 2008 to 2018

After cleaning, the publishers data (sales) will contain all the sales data, while the two BX datasets
will be combined into one (rating) that contains the characteristics and ratings of the books.

#### Dependencies

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use('seaborn-white')

## Goal

Clean and transform all the datasets above into an easy-to-work-with format.

## Sales data

Used for genre, daily_average_units_sold, and sale_price.

In [2]:
# Publisher dataset 
data_path = 'D:\\PycharmProjects\\springboard\\data\\'
sales = pd.read_csv(f'{data_path}publishers.csv')

# Replace dots and whitespace in column names with underscores, then strip the
# 'statistics_' prefix. regex=True is explicit because the first pattern is a regex
sales.columns = (sales.columns.str.replace(r'[\.\s]', '_', regex=True)
                 .str.replace('statistics_', '', regex=False))

# Remove the multiple revenue and gross sales columns as these would introduce multicollinearity
# We only want units_sold in this case
sales = sales.drop(sales.columns[2:6], axis=1)
sales = sales.drop('sales_rank', axis=1)

# Cut prices into ranges for further analysis
sales['price_range'] = pd.cut(sales.sale_price,
                              bins=[0, 2.99, 9.99, 19.99, sales.sale_price.max()],
                              labels=['cheap', 'normal', 'high', 'extra'])

# Save the cleaned data
sales.to_csv(f'{data_path}book_sales.csv')

# First look
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27027 entries, 0 to 27026
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   genre                     27027 non-null  object  
 1   sold_by                   27027 non-null  object  
 2   daily_average_units_sold  27027 non-null  int64   
 3   publisher_name            27027 non-null  object  
 4   publisher_type            27027 non-null  object  
 5   average_rating            27027 non-null  float64 
 6   sale_price                27027 non-null  float64 
 7   total_reviews             27027 non-null  int64   
 8   price_range               27027 non-null  category
dtypes: category(1), float64(2), int64(2), object(4)
memory usage: 1.7+ MB
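
The renaming and binning steps above can be tried on a small frame first. This is a minimal sketch: the `toy` frame, its values, and the raw column name `statistics.sale price` are made up for illustration and are not taken from publishers.csv.

```python
import pandas as pd

# Toy frame with made-up values; the raw column name only mimics the
# "statistics.*" style and is an assumption, not the real header
toy = pd.DataFrame({
    'genre': ['fiction', 'nonfiction', 'children', 'comics'],
    'statistics.sale price': [0.99, 4.99, 12.50, 25.00],
})

# Same renaming idea as above: dots/whitespace -> underscore, drop the prefix
toy.columns = (toy.columns
               .str.replace(r'[.\s]', '_', regex=True)
               .str.replace('statistics_', '', regex=False))

# Bins are right-inclusive by default: (0, 2.99], (2.99, 9.99], ...
toy['price_range'] = pd.cut(
    toy['sale_price'],
    bins=[0, 2.99, 9.99, 19.99, toy['sale_price'].max()],
    labels=['cheap', 'normal', 'high', 'extra'],
)
print(toy)
```

Because `pd.cut` excludes the left edge of the first bin by default, a price of exactly 0 would get a missing `price_range`; passing `include_lowest=True` is one way to keep such rows if they exist.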


## Books and reviews data

Used for EDA of book ratings.

In [3]:
# Load books data set and clean up column names. The last 3 columns are omitted
# since they contain only links
books = pd.read_csv(f'{data_path}BX-Books.csv', sep=';',
                    usecols=[0,1,2,3,4], encoding='ISO-8859-1', index_col='ISBN',
                    low_memory=False)
books.columns = books.columns.str.lower().str.replace('-','_')

# Count
books.count()

book_title             271379
book_author            271378
year_of_publication    271379
publisher              271377
dtype: int64

There are multiple reviews of the same book (isbn) from different users. Thus, we will
use the mean rating per isbn as the metric to merge into the books data.

In [4]:
# Load reviews data. We also lower-case and snake_case the column names
reviews = pd.read_csv(f'{data_path}BX-Book-Ratings.csv', sep=';',
                      encoding='ISO-8859-1', usecols=[1,2])
reviews.columns = reviews.columns.str.lower().str.replace('-', '_')

# Group by isbn and count the ratings per book
total_rating = reviews.groupby('isbn')['book_rating'].count()

# Group by isbn and get mean rating
reviews = reviews.groupby('isbn').mean()
reviews['total_rating'] = total_rating

# print info on reviews
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 340556 entries,  0330299891 to Ô½crosoft
Data columns (total 2 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   book_rating   340556 non-null  float64
 1   total_rating  340556 non-null  int64  
dtypes: float64(1), int64(1)
memory usage: 7.8+ MB
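
The per-isbn aggregation above can also be written as a single named aggregation. The sketch below uses a made-up `toy_reviews` frame (the isbn values and ratings are hypothetical) and is only meant to show the shape of the result.

```python
import pandas as pd

# Made-up ratings: several users rate the same (hypothetical) isbn
toy_reviews = pd.DataFrame({
    'isbn': ['A', 'A', 'B', 'B', 'B', 'C'],
    'book_rating': [8, 6, 0, 10, 5, 7],
})

# One pass over the groups: mean rating and number of ratings per isbn
summary = (toy_reviews.groupby('isbn')['book_rating']
           .agg(book_rating='mean', total_rating='count'))
print(summary)
```

The result is indexed by isbn with one row per book, which is the shape needed for an index-based merge with the books table.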


In [5]:
# Left-merge books and reviews on isbn. Reviews whose isbn is not in books are left out,
# and dropna() then removes books with no reviews or with missing fields
rating = pd.merge(books, reviews, how='left', left_index=True, right_index=True)
rating = rating.dropna()

# save for future use
rating.to_csv(f'{data_path}book_rating.csv')

# first look
rating.info()
rating.head()

# Clear out unused data frames
del books
del reviews

<class 'pandas.core.frame.DataFrame'>
Index: 270167 entries, 0195153448 to 0767409752
Data columns (total 6 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   book_title           270167 non-null  object 
 1   book_author          270167 non-null  object 
 2   year_of_publication  270167 non-null  object 
 3   publisher            270167 non-null  object 
 4   book_rating          270167 non-null  float64
 5   total_rating         270167 non-null  float64
dtypes: float64(2), object(4)
memory usage: 14.4+ MB
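
To see what the left merge followed by `dropna()` keeps, here is a small sketch on made-up frames (`toy_books`, `toy_reviews`, and the isbn values are hypothetical).

```python
import pandas as pd

# Three hypothetical books, indexed by isbn
toy_books = pd.DataFrame(
    {'book_title': ['Alpha', 'Beta', 'Gamma']},
    index=pd.Index(['111', '222', '333'], name='isbn'),
)
# Aggregated reviews: one isbn matches a book, one ('999') does not
toy_reviews = pd.DataFrame(
    {'book_rating': [7.5, 3.0], 'total_rating': [2, 1]},
    index=pd.Index(['111', '999'], name='isbn'),
)

# Left merge keeps every book; unmatched books get NaN in the review columns
merged = pd.merge(toy_books, toy_reviews, how='left',
                  left_index=True, right_index=True)
print(merged)

# dropna() then keeps only books that have review data
toy_rating = merged.dropna()
print(toy_rating)
```

The NaN values introduced by the left merge are also why an integer count column such as `total_rating` shows up as float64 after merging.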


## Kindle books data

Used for price prediction of kindle book format. 

In [6]:
kindle = pd.read_csv(f'{data_path}kindle.csv').drop('url', axis=1)

# Flag kindle-only books: rows where save is missing
kindle['kindle_only'] = kindle.save.isnull()

# Fill missing save values with 0 and missing publishers with 'Self Publishing'
kindle['save'] = kindle.save.fillna(0)
kindle['publisher'] = kindle.publisher.fillna('Self Publishing')

# Title length and drop description
kindle['title_length'] = kindle.title.str.len()
kindle = kindle.drop(['description'], axis=1)

# Info
kindle.info()

# Save for later
kindle.to_csv(f'{data_path}book_kindle.csv')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49197 entries, 0 to 49196
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   title             49196 non-null  object 
 1   author            49187 non-null  object 
 2   price             48642 non-null  float64
 3   save              49197 non-null  float64
 4   pages             48987 non-null  float64
 5   size              48155 non-null  float64
 6   publisher         49197 non-null  object 
 7   language          49100 non-null  object 
 8   text_to_speech    47675 non-null  object 
 9   x_ray             47686 non-null  object 
 10  lending           47767 non-null  object 
 11  customer_reviews  47751 non-null  float64
 12  stars             47752 non-null  float64
 13  kindle_only       49197 non-null  bool   
 14  title_length      49196 non-null  float64
dtypes: bool(1), float64(7), object(7)
memory usage: 5.3+ MB
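
The order of the steps matters here: the missing-value flag has to be computed before the fill. A minimal sketch on a made-up `toy_kindle` frame (all values are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up rows; None/NaN stand in for the missing values in the real file
toy_kindle = pd.DataFrame({
    'title': ['Short', 'A much longer title', None],
    'save': [1.50, np.nan, np.nan],
    'publisher': ['Acme Press', None, 'Beta Books'],
})

# Flag first, fill second: after fillna(0) the missing rows are no longer detectable
toy_kindle['kindle_only'] = toy_kindle['save'].isnull()
toy_kindle['save'] = toy_kindle['save'].fillna(0)
toy_kindle['publisher'] = toy_kindle['publisher'].fillna('Self Publishing')

# A missing title yields a missing length rather than an error
toy_kindle['title_length'] = toy_kindle['title'].str.len()
print(toy_kindle)
```

A missing title produces a missing length, which is consistent with `title_length` having one fewer non-null value than the row count in the output above.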


## Goodreads books data

Used for rating prediction

A few lines in the dataset need correction because they use a comma instead of a
semicolon in the author name. However, there are fewer than 5 of these, so fixing them
manually is efficient.
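
One way to locate such lines before fixing them by hand is to count the fields per row. This is only a sketch, assuming the problem shows up as an extra field caused by a stray comma; the sample rows below are made up and are not taken from the Goodreads file.

```python
import csv
import io

# Made-up sample: the second data row has an unquoted comma in the author name
raw = io.StringIO(
    "bookID,title,authors,average_rating\n"
    "1,First Book,Jane Doe,4.10\n"
    "2,Second Book,John Smith, Jr./Ann Lee,3.90\n"
    "3,Third Book,Ann Lee,4.50\n"
)

reader = csv.reader(raw)
header = next(reader)

# Collect (line number, row) for every row whose field count differs from the header
bad_rows = [
    (line_no, row)
    for line_no, row in enumerate(reader, start=2)  # header is line 1
    if len(row) != len(header)
]
for line_no, row in bad_rows:
    print(line_no, row)
```

With the real file, `io.StringIO` would be replaced by an opened file handle, and the reported line numbers point at the rows to edit.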

In [7]:
# Load in the data
goodread_books = pd.read_csv(f'{data_path}goodread_books.csv')

# Remove bookID, isbn, and isbn13
goodread_books = goodread_books.drop(['bookID', 'isbn','isbn13'], axis=1)

# Strip surrounding whitespace from column names
goodread_books.columns = goodread_books.columns.str.strip()

# Title length
goodread_books['title_length'] = goodread_books.title.str.len()

# Convert publishing date to datetime
goodread_books['publication_date'] = pd.to_datetime(goodread_books.publication_date, errors='coerce')

# These two books have invalid publication dates (coerced to NaT)
print(goodread_books[goodread_books.publication_date.isnull()])

# Remove na 
goodread_books = goodread_books.dropna()

# info
goodread_books.info()

# Save for later
goodread_books.to_csv(f'{data_path}book_goodread.csv')

                                                   title  \
8180   In Pursuit of the Proper Sinner (Inspector Lyn...   
11098         Montaillou  village occitan de 1294 à 1324   

                                               authors  average_rating  \
8180                                 Elizabeth  George            4.10   
11098  Emmanuel Le Roy Ladurie/Emmanuel Le Roy-Ladurie            3.96   

      language_code  num_pages  ratings_count  text_reviews_count  \
8180            eng        718          10608                 295   
11098           fre        640             15                   2   

      publication_date       publisher  title_length  
8180               NaT    Bantam Books            55  
11098              NaT  Folio histoire            42  
<class 'pandas.core.frame.DataFrame'>
Int64Index: 11125 entries, 0 to 11126
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0
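
The `errors='coerce'` behavior used for `publication_date` can be checked on a few made-up strings. The sample values and the `%m/%d/%Y` format below are assumptions for illustration; the notebook itself lets pandas infer the format.

```python
import pandas as pd

# Made-up date strings; November has no 31st day, so the middle one is invalid
toy_dates = pd.Series(['9/16/2006', '11/31/2000', '4/1/2004'])

# errors='coerce' turns unparseable dates into NaT instead of raising
parsed = pd.to_datetime(toy_dates, format='%m/%d/%Y', errors='coerce')
print(parsed)

# The original strings behind the NaT values, useful for manual inspection
invalid = toy_dates[parsed.isnull()]
print(invalid)
```

Inspecting the original strings before calling `dropna()` makes it easier to decide whether a row should be corrected rather than dropped.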

## New York Times fiction bestseller list

Used for learning about New York Times best-selling fiction.

A list of New York Times best sellers from 2008 to 2018. This requires some minor cleaning.
The ID column is omitted. Over 60% of our data doesn't have a price.

In [8]:
nyt_fiction = pd.read_json(f'{data_path}nyt_fiction.json', lines=True)
nyt_fiction.head()

# Drop the _id and amazon_product_url columns
nyt_fiction = nyt_fiction.drop(['amazon_product_url','_id'], axis=1)

# Normalize the dictionary in each date column and then convert to normal time
for i in ['bestsellers_date', 'published_date']:
    # Normalize
    nyt_fiction[i] = pd.json_normalize(nyt_fiction[i])
    # Convert to datetime
    nyt_fiction[i] = pd.to_datetime(nyt_fiction[i], unit='ms')

for i in ['price', 'rank', 'rank_last_week', 'weeks_on_list']:
    # Normalize
    nyt_fiction[i] = pd.json_normalize(nyt_fiction[i])

nyt_fiction.info()

# Save the data 
nyt_fiction.to_csv(f'{data_path}book_nyt_fiction.csv')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10195 entries, 0 to 10194
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   bestsellers_date  10195 non-null  datetime64[ns]
 1   published_date    10195 non-null  datetime64[ns]
 2   author            10195 non-null  object        
 3   description       10195 non-null  object        
 4   price             7162 non-null   object        
 5   publisher         10195 non-null  object        
 6   title             10195 non-null  object        
 7   rank              10195 non-null  object        
 8   rank_last_week    10195 non-null  object        
 9   weeks_on_list     10195 non-null  object        
dtypes: datetime64[ns](2), object(8)
memory usage: 796.6+ KB
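
The flattening step can be illustrated on a made-up frame. The nested keys `$date`, `$numberLong`, and `$numberInt` below are an assumption about how the exported JSON wraps its values; the real file may use different keys.

```python
import pandas as pd

# Made-up rows; the nested dict layout is assumed, not read from nyt_fiction.json
toy_nyt = pd.DataFrame({
    'title': ['ODD HOURS', 'THE HOST'],
    'bestsellers_date': [
        {'$date': {'$numberLong': '1211587200000'}},
        {'$date': {'$numberLong': '1212192000000'}},
    ],
    'rank': [{'$numberInt': '1'}, {'$numberInt': '2'}],
})

# Flatten the nested dicts, take the single resulting column,
# then convert epoch milliseconds to datetimes
dates = pd.json_normalize(toy_nyt['bestsellers_date'].tolist()).iloc[:, 0]
toy_nyt['bestsellers_date'] = pd.to_datetime(pd.to_numeric(dates), unit='ms')

# Same flattening for a numeric field, with an explicit numeric conversion
ranks = pd.json_normalize(toy_nyt['rank'].tolist()).iloc[:, 0]
toy_nyt['rank'] = pd.to_numeric(ranks)
print(toy_nyt.dtypes)
```

Converting with `pd.to_numeric` after flattening is one way to avoid the `object` dtype that `rank`, `rank_last_week`, and `weeks_on_list` show in the output above.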


In [9]:
nyt_fiction.head()

| | bestsellers_date | published_date | author | description | price | publisher | title | rank | rank_last_week | weeks_on_list |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-05-24 | 2008-06-08 | Dean R Koontz | Odd Thomas, who can communicate with the dead,... | 27.0 | Bantam | ODD HOURS | 1 | 0 | 1 |
| 1 | 2008-05-24 | 2008-06-08 | Stephenie Meyer | Aliens have taken control of the minds and bod... | NaN | Little, Brown | THE HOST | 2 | 1 | 3 |
| 2 | 2008-05-24 | 2008-06-08 | Emily Giffin | A woman's happy marriage is shaken when she en... | NaN | St. Martin's | LOVE THE ONE YOU'RE WITH | 3 | 2 | 2 |
| 3 | 2008-05-24 | 2008-06-08 | Patricia Cornwell | A Massachusetts state investigator and his tea... | NaN | Putnam | THE FRONT | 4 | 0 | 1 |
| 4 | 2008-05-24 | 2008-06-08 | Chuck Palahniuk | An aging porn queens aims to cap her career by... | NaN | Doubleday | SNUFF | 5 | 0 | 1 |
