### 01 - Cleaning the datasets

This notebook outlines the cleaning process for the publishers dataset, which
contains Amazon ebook daily sales records from 2015. The data is taken from
the CORGIS project. Fortunately, the data is clearly labeled and already
preprocessed, so the main objectives are removing redundant columns and
reformatting some names.

Data:
1. [publishers](https://corgis-edu.github.io/corgis/csv/publishers/)
    * Ebook sales data from Amazon for 27k titles in 2015

After cleaning, the data is renamed to book_sales for easy reference.

In [9]:
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline
# note: on matplotlib >= 3.6 this style is named 'seaborn-v0_8-white'
plt.style.use('seaborn-white')

## book_sales data

Steps:
1. Lowercase the column names and convert them to snake case.
2. Several sales metrics are recorded that serve the same purpose in our
case, so we keep only the unit sales column. The sales rank column is dropped as well.
3. The sold by column has some levels with different names for the same company.
All sub-company names are grouped into one:
    * HarperCollins Christian Publishing, HarperCollins Publishers and HarperCollins Publishing become HarperCollins
    * Random House LLC and Random House Mondadori become Random House
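As a rough illustration of steps 1 and 3, the sketch below applies the same kind of renaming to a tiny made-up frame. The column names and the `Example Seller` value are assumptions chosen for the demo, not a copy of the real file header, and `parent_names` is a hypothetical helper mapping.

```python
import pandas as pd

# Hypothetical miniature of the raw file; names and values are made up for illustration.
raw = pd.DataFrame({
    'Sold by': ['HarperCollins Publishers', 'Random House LLC', 'Example Seller'],
    'Daily Average.Units Sold': [10, 5, 7],
    'Statistics.Sale Price': [4.99, 9.99, 2.99],
})

toy = raw.copy()

# step 1: lowercase, turn dots and whitespace into underscores, drop the 'statistics_' prefix
toy.columns = (toy.columns.str.lower()
               .str.replace(r'[.\s]', '_', regex=True)
               .str.replace('statistics_', '', regex=False))

# step 3: map sub-company names to a single parent name
parent_names = {
    'HarperCollins Publishers': 'HarperCollins',
    'Random House LLC': 'Random House',
}
toy['sold_by'] = toy['sold_by'].replace(parent_names)

print(list(toy.columns))
print(toy['sold_by'].tolist())
```

Passing `regex=True` explicitly matters here: recent pandas versions treat the pattern as a literal string by default, in which case a character class such as `[.\s]` would match nothing.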

In [10]:
# loading Publisher dataset
data_path = 'D:\\PycharmProjects\\springboard\\data'
sales = pd.read_csv(f'{data_path}\\publishers.csv')

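# lowercase the column names, replace dots and whitespace with underscores,
# and remove the 'statistics_' prefix
# regex=True is required on pandas >= 2.0, where str.replace is literal by default
sales.columns = (sales.columns.str.lower()
                 .str.replace(r'[.\s]', '_', regex=True)
                 .str.replace('statistics_', '', regex=False))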

# remove the revenue and gross sales columns, as these would create multicollinearity
# keep only the units_sold column
sales = sales.drop(sales.columns[2:6], axis=1)
sales = sales.drop('sales_rank', axis=1)

# merge sub-company names in sold_by into their parent company name
# lists of sub-company names for each parent
harper = ['HarperCollins Christian Publishing', 'HarperCollins Publishers', 'HarperCollins Publishing']
randomhouse = ['Random House LLC', 'Random House Mondadori']

# replace each sub-company name with the parent name
sales['sold_by'] = sales['sold_by'].replace(harper, 'HarperCollins')
sales['sold_by'] = sales['sold_by'].replace(randomhouse, 'Random House')

# save the cleaned data under the new name
# index=False keeps the row index from being written as an extra unnamed column
sales.to_csv(f'{data_path}\\book_sales.csv', index=False)

# cleaned data set
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27027 entries, 0 to 27026
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   genre                     27027 non-null  object 
 1   sold_by                   27027 non-null  object 
 2   daily_average_units_sold  27027 non-null  int64  
 3   publisher_name            27027 non-null  object 
 4   publisher_type            27027 non-null  object 
 5   average_rating            27027 non-null  float64
 6   sale_price                27027 non-null  float64
 7   total_reviews             27027 non-null  int64  
dtypes: float64(2), int64(2), object(4)
memory usage: 1.6+ MB
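
A quick sanity check can confirm that the cleaning did what the steps describe. The sketch below is one possible way to do it, not part of the original notebook: `check_cleaned` is a hypothetical helper, the expected columns are taken from the `info()` output above, and the demo frame holds made-up values.

```python
import pandas as pd

# Columns listed in the info() output of the cleaned frame.
EXPECTED_COLUMNS = [
    'genre', 'sold_by', 'daily_average_units_sold', 'publisher_name',
    'publisher_type', 'average_rating', 'sale_price', 'total_reviews',
]

# Sub-company names that should no longer appear after grouping.
MERGED_NAMES = [
    'HarperCollins Christian Publishing', 'HarperCollins Publishers',
    'HarperCollins Publishing', 'Random House LLC', 'Random House Mondadori',
]


def check_cleaned(df):
    """Return a list of problems found in a cleaned frame (empty means none found)."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f'missing columns: {missing}')
    if 'sold_by' in df.columns:
        leftovers = sorted(set(df['sold_by'].dropna()) & set(MERGED_NAMES))
        if leftovers:
            problems.append(f'unmerged sold_by names: {leftovers}')
    if df.isna().any().any():
        problems.append('null values present')
    return problems


# Made-up two-row frame; the second row deliberately keeps an unmerged name.
demo = pd.DataFrame({
    'genre': ['fiction', 'nonfiction'],
    'sold_by': ['HarperCollins', 'Random House LLC'],
    'daily_average_units_sold': [3, 1],
    'publisher_name': ['A', 'B'],
    'publisher_type': ['type a', 'type b'],
    'average_rating': [4.5, 3.9],
    'sale_price': [4.99, 7.99],
    'total_reviews': [120, 15],
})

print(check_cleaned(demo))

# After replacing the leftover name, the check reports no problems.
fixed = demo.assign(sold_by=demo['sold_by'].replace('Random House LLC', 'Random House'))
print(check_cleaned(fixed))
```

In the notebook itself, the same helper could be called as `check_cleaned(sales)` right after the cleaning cell.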
