## Books Dashboard
### Yasmin Azizi

[Dataset: Books Sales and Ratings](https://www.kaggle.com/datasets/thedevastator/books-sales-and-ratings)

My dashboard showcases book data that can be used to identify highly rated books across genres. Picking a book can be slow and difficult because the selection is large and the tools for making decisions are limited. Depending on the perspective taken, this dataset can guide reading choices or inventory selection, among other purposes. A bookstore owner can use it to improve sales and manage inventory. A novice reader can use it to decide where to start.

The data is sourced from Kaggle. The Kaggle user sourced the dataset from data.world, a reliable data catalog that requires a paid trial to access. The data itself was collected from Amazon.





### Clean Data

In [31]:
import pandas as pd
import seaborn as sns
import plotly.express as px

In [32]:
# read in data
data = pd.read_csv("/Users/yasminazizi/Desktop/DS 4003/Final Project Data.csv")


In [33]:
# analyze info
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1070 entries, 0 to 1069
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   index                1070 non-null   int64  
 1   Publishing Year      1069 non-null   float64
 2   Book Name            1047 non-null   object 
 3   Author               1070 non-null   object 
 4   language_code        1017 non-null   object 
 5   Author_Rating        1070 non-null   object 
 6   Book_average_rating  1070 non-null   float64
 7   Book_ratings_count   1070 non-null   int64  
 8   genre                1070 non-null   object 
 9   gross sales          1070 non-null   float64
 10  publisher revenue    1070 non-null   float64
 11  sale price           1070 non-null   float64
 12  sales rank           1070 non-null   int64  
 13  Publisher            1070 non-null   object 
 14  units sold           1070 non-null   int64  
dtypes: float64(5), int64(4), object(6)
mem

In [34]:
#change int to float
data[['index', 'Book_ratings_count', 'sales rank', 'units sold']] = data[['index', 'Book_ratings_count','sales rank', 'units sold']].astype(float)


In [35]:
# see the first 5 values of each variable
print(data.head(5))

   index  Publishing Year                        Book Name  \
0    0.0           1975.0                          Beowulf   
1    1.0           1987.0                 Batman: Year One   
2    2.0           2015.0                Go Set a Watchman   
3    3.0           2008.0  When You Are Engulfed in Flames   
4    4.0           2011.0         Daughter of Smoke & Bone   

                                              Author language_code  \
0                             Unknown, Seamus Heaney         en-US   
1  Frank Miller, David Mazzucchelli, Richmond Lew...           eng   
2                                         Harper Lee           eng   
3                                      David Sedaris         en-US   
4                                       Laini Taylor           eng   

  Author_Rating  Book_average_rating  Book_ratings_count          genre  \
0        Novice                 3.42            155903.0  genre fiction   
1  Intermediate                 4.23            145267.0

In [36]:
# print basic statistics of the variables
print(data.describe())

             index  Publishing Year  Book_average_rating  Book_ratings_count  \
count  1070.000000      1069.000000          1070.000000         1070.000000   
mean    534.500000      1971.377923             4.007000        94909.913084   
std     309.026698       185.080257             0.247244        31513.242518   
min       0.000000      -560.000000             2.970000        27308.000000   
25%     267.250000      1985.000000             3.850000        70398.000000   
50%     534.500000      2003.000000             4.015000        89309.000000   
75%     801.750000      2010.000000             4.170000       113906.500000   
max    1069.000000      2016.000000             4.770000       206792.000000   

        gross sales  publisher revenue   sale price   sales rank    units sold  
count   1070.000000        1070.000000  1070.000000  1070.000000   1070.000000  
mean    1856.622944         843.281030     4.869561   611.652336   9676.980374  
std     3936.924240        2257.5967

In [37]:
# see how many values are null
print(data.isnull().sum())

index                   0
Publishing Year         1
Book Name              23
Author                  0
language_code          53
Author_Rating           0
Book_average_rating     0
Book_ratings_count      0
genre                   0
gross sales             0
publisher revenue       0
sale price              0
sales rank              0
Publisher               0
units sold              0
dtype: int64


In [38]:
# drop the missing values
data.dropna(inplace=True)

In [39]:
# check to make sure they were dropped
print(data.isnull().sum())

index                  0
Publishing Year        0
Book Name              0
Author                 0
language_code          0
Author_Rating          0
Book_average_rating    0
Book_ratings_count     0
genre                  0
gross sales            0
publisher revenue      0
sale price             0
sales rank             0
Publisher              0
units sold             0
dtype: int64
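
The per-column counts printed before the drop add up to 77 missing cells (1 + 23 + 53), while the row count falls from 1070 to 998, a loss of 72 rows. The difference arises because `dropna()` removes whole rows, and a single row can have more than one gap. The sketch below illustrates this on a made-up toy frame (the values are hypothetical, not taken from the dataset) and shows `dropna(subset=...)` as one possible way to avoid losing rows whose only gap is in a column that is discarded later, such as `language_code`.

```python
import pandas as pd

# Toy frame standing in for the notebook's `data`; values are invented.
toy = pd.DataFrame({
    "Book Name": ["A", None, "C", None],
    "language_code": ["eng", None, None, "eng"],
    "units sold": [10, 20, 30, 40],
})

# Total missing cells, i.e. the sum of what isnull().sum() prints per column
missing_cells = toy.isnull().sum().sum()

# Rows with at least one missing value, i.e. what dropna() removes
rows_with_missing = toy.isnull().any(axis=1).sum()

# Only require the columns that are kept for analysis to be present
kept = toy.dropna(subset=["Book Name"])

print(missing_cells, rows_with_missing, len(kept))
```

Whether the subset approach is appropriate depends on which columns the dashboard actually needs.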


In [40]:
# drop the language code variable, all books are in English

data.drop(columns=['language_code'], inplace=True)


In [41]:
# change year into object
data['Publishing Year'] = data['Publishing Year'].astype(str)
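
Because `Publishing Year` is stored as a float, casting it straight to `str` keeps the trailing `.0`, which is why the later `head()` output shows values such as `1975.0`. If plain year labels are preferred, one option is to cast to `int` first. This is a minimal sketch on a toy Series, assuming missing years have already been dropped (an `int` cast fails on NaN).

```python
import pandas as pd

# Toy years; -560 mirrors the minimum shown in the describe() output
years = pd.Series([1975.0, 1987.0, -560.0], name="Publishing Year")

# Direct cast keeps the float formatting
as_float_str = years.astype(str)

# Casting to int first drops the decimal part before converting to text
as_int_str = years.astype(int).astype(str)

print(as_float_str.tolist())
print(as_int_str.tolist())
```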


In [42]:
# check to make sure the changes were made
print(data.info())

<class 'pandas.core.frame.DataFrame'>
Index: 998 entries, 0 to 1069
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   index                998 non-null    float64
 1   Publishing Year      998 non-null    object 
 2   Book Name            998 non-null    object 
 3   Author               998 non-null    object 
 4   Author_Rating        998 non-null    object 
 5   Book_average_rating  998 non-null    float64
 6   Book_ratings_count   998 non-null    float64
 7   genre                998 non-null    object 
 8   gross sales          998 non-null    float64
 9   publisher revenue    998 non-null    float64
 10  sale price           998 non-null    float64
 11  sales rank           998 non-null    float64
 12  Publisher            998 non-null    object 
 13  units sold           998 non-null    float64
dtypes: float64(8), object(6)
memory usage: 117.0+ KB
None


### Exploratory Data Analysis

In [43]:
# Number of observations
num_observations = len(data)
print(num_observations)

998


In [44]:
# the number of unique values for each categorical variable
num_unique_categories = data.select_dtypes(include=['object', 'category']).nunique()

print(num_unique_categories)

Publishing Year    147
Book Name          996
Author             698
Author_Rating        4
genre                4
Publisher            9
dtype: int64


In [45]:
# preview the first five rows
print(data.head())

   index Publishing Year                        Book Name  \
0    0.0          1975.0                          Beowulf   
1    1.0          1987.0                 Batman: Year One   
2    2.0          2015.0                Go Set a Watchman   
3    3.0          2008.0  When You Are Engulfed in Flames   
4    4.0          2011.0         Daughter of Smoke & Bone   

                                              Author Author_Rating  \
0                             Unknown, Seamus Heaney        Novice   
1  Frank Miller, David Mazzucchelli, Richmond Lew...  Intermediate   
2                                         Harper Lee        Novice   
3                                      David Sedaris  Intermediate   
4                                       Laini Taylor  Intermediate   

   Book_average_rating  Book_ratings_count          genre  gross sales  \
0                 3.42            155903.0  genre fiction      34160.0   
1                 4.23            145267.0  genre fiction      1

In [46]:
# check to make sure the changes were made
print(data.info())

<class 'pandas.core.frame.DataFrame'>
Index: 998 entries, 0 to 1069
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   index                998 non-null    float64
 1   Publishing Year      998 non-null    object 
 2   Book Name            998 non-null    object 
 3   Author               998 non-null    object 
 4   Author_Rating        998 non-null    object 
 5   Book_average_rating  998 non-null    float64
 6   Book_ratings_count   998 non-null    float64
 7   genre                998 non-null    object 
 8   gross sales          998 non-null    float64
 9   publisher revenue    998 non-null    float64
 10  sale price           998 non-null    float64
 11  sales rank           998 non-null    float64
 12  Publisher            998 non-null    object 
 13  units sold           998 non-null    float64
dtypes: float64(8), object(6)
memory usage: 117.0+ KB
None


In [47]:
# Get summary statistics for numerical columns
print(data.describe())

             index  Book_average_rating  Book_ratings_count  gross sales  \
count   998.000000           998.000000          998.000000    998.00000   
mean    529.619238             4.003056        95500.622244   1885.08515   
std     308.616860             0.247360        31650.845116   4023.26877   
min       0.000000             2.970000        27308.000000    104.94000   
25%     263.250000             3.850000        70946.500000    370.88250   
50%     530.500000             4.010000        89901.000000    806.25000   
75%     793.750000             4.170000       115596.000000   1492.96500   
max    1069.000000             4.770000       206792.000000  47795.00000   

       publisher revenue  sale price   sales rank    units sold  
count         998.000000  998.000000   998.000000    998.000000  
mean          848.897952    4.839649   605.750501   9802.312625  
std          2303.504061    3.585046   369.174705  15503.088302  
min             0.000000    0.990000     1.000000  

In [48]:
# number of rows and columns
print(data.shape)

(998, 14)


Checking distributions and outliers for continuous variables

In [53]:
# Book Average Rating histogram to check normality
fig = px.histogram(data, x='Book_average_rating', title='Histogram Book Avg Rating')
fig.show()
# looks pretty normal

In [56]:
# box plot to check for outliers
fig = px.box(data, x='Book_average_rating', title='Boxplot of Book Avg. Rating')
fig.show()
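
The box plot flags outliers visually. To list them as rows, a common convention is the 1.5 × IQR rule, sketched below on toy ratings. The helper name `iqr_bounds` and the sample values are hypothetical, and the fences plotly draws may differ slightly because quartile conventions vary between libraries.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy ratings; in the notebook this would be data['Book_average_rating']
ratings = pd.Series([3.0, 3.8, 3.9, 4.0, 4.0, 4.1, 4.2, 4.9])

low, high = iqr_bounds(ratings)

# Keep only the values that fall outside the fences
outliers = ratings[(ratings < low) | (ratings > high)]
print(outliers.tolist())
```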

In [57]:
# gross sales histogram

fig = px.histogram(data, x='gross sales', title='Histogram of Gross sales')
fig.show()

fig = px.box(data, x='gross sales', title='Boxplot of Gross Sales')
fig.show()

# skewed to the right
# some outliers, which can be telling when making graphs
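
"Skewed to the right" can also be checked numerically with `Series.skew()`, where a positive value indicates a longer right tail. The sketch below uses invented sales figures and additionally shows that a log transform, one common option for such variables, reduces the skew. It is an illustration under those assumptions, not a step the notebook performs.

```python
import numpy as np
import pandas as pd

# Toy sales with one very large value, imitating a right-skewed column
sales = pd.Series([100.0, 150.0, 200.0, 250.0, 300.0, 5000.0])

# Sample skewness of the raw values (positive means right-skewed)
skew_raw = sales.skew()

# Skewness after a base-10 log transform (requires strictly positive values)
skew_log = np.log10(sales).skew()

print(skew_raw, skew_log)
```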

In [58]:
# publisher revenue 
# skewed to the right

fig = px.histogram(data, x='publisher revenue', title='Histogram of Publisher Revenue')
fig.show()

fig = px.box(data, x='publisher revenue', title='Boxplot of Publisher Revenue')
fig.show()

In [60]:
# sale price
# skewed to the right
# one significant outlier - can be insightful
fig = px.histogram(data, x='sale price', title='Histogram of Sales Price')
fig.show()

fig = px.box(data, x='sale price', title='Boxplot of Sales Price')
fig.show()

In [61]:
# units sold histogram 
# weird distribution, gap in the middle
# box plot reflects the weird distribution
fig = px.histogram(data, x='units sold', title='Histogram of Units Sold')
fig.show()

fig = px.box(data, x='units sold', title='Boxplot of Units Sold')
fig.show()

In [63]:
# save the cleaned data to a CSV file
data.to_csv('data.csv')


## Data Dictionary

Index: Unique identifying number


Publishing Year: The year that the book was published in 


Book Name: The title of the book


Author: The author of the book


Author Rating: The rating for an author based on their previous work


Book average rating: The average of the ratings given by readers


Book ratings count: The number of ratings a book has


Genre: The genre the book belongs to 


Gross Sales: The total sales revenue generated per book


Publisher Revenue: The total revenue the publisher earns from the book's sales


Sale Price: The price the book is sold at


Sales Rank: The book's rank based on its sales performance within its category (genre)


Publisher: The company that published the book


Units Sold: Count of books sold per title


### UI Components
1. Dropdown menus to filter between certain authors
2. Published year slider to pick ranges in time or a certain year
3. Interactive data dictionary visual to increase accessibility and understanding
4. Search bar to look for specific book titles
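
The author dropdown, year slider, and title search all narrow the same table, so their logic can be sketched as one filtering helper that a dashboard callback might call. The function name `filter_books` and its parameters are hypothetical. The sample rows reuse titles, authors, and years shown in the `head()` output, and the year is parsed with `pd.to_numeric` because the notebook stores it as text.

```python
import pandas as pd

def filter_books(df, authors=None, year_range=None, title_query=""):
    """Return rows matching the selected authors, year range, and title search."""
    mask = pd.Series(True, index=df.index)
    if authors:
        # Dropdown: keep only the chosen authors
        mask &= df["Author"].isin(authors)
    if year_range is not None:
        # Slider: years are stored as strings, so convert before comparing
        years = pd.to_numeric(df["Publishing Year"], errors="coerce")
        mask &= years.between(year_range[0], year_range[1])
    if title_query:
        # Search bar: case-insensitive substring match on the title
        mask &= df["Book Name"].str.contains(
            title_query, case=False, regex=False, na=False
        )
    return df[mask]

books = pd.DataFrame({
    "Book Name": ["Beowulf", "Go Set a Watchman", "When You Are Engulfed in Flames"],
    "Author": ["Unknown, Seamus Heaney", "Harper Lee", "David Sedaris"],
    "Publishing Year": ["1975.0", "2015.0", "2008.0"],
})

recent_match = filter_books(books, year_range=(2000, 2016), title_query="watchman")
by_author = filter_books(books, authors=["David Sedaris"])
print(recent_match["Book Name"].tolist(), by_author["Book Name"].tolist())
```

Passing no arguments returns the full table, which matches a dashboard state where no filter has been set.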



### Data Visualizations

1. Author vs. Book Average Rating bar chart - shows which authors have high or low book ratings
2. Book Ratings Count vs. Gross Sales scatter plot - displays the relationship between the number of ratings and gross sales
3. Units sold vs. Author Name heat map - displays the top-selling authors
4. Gross sales vs. Publisher bar chart - displays the top-selling publishers
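
The two bar charts need one value per author or per publisher, so the data would first be aggregated. The sketch below shows one way to do that with `groupby` on an invented toy frame. The publisher and author labels are placeholders, and using the mean for ratings and the sum for sales is an assumption about how the charts are meant to summarize the data.

```python
import pandas as pd

# Toy rows with the same column names as the cleaned dataset
toy = pd.DataFrame({
    "Publisher": ["P1", "P2", "P1", "P3"],
    "Author": ["A", "B", "A", "C"],
    "Book_average_rating": [4.0, 3.5, 4.4, 4.1],
    "gross sales": [100.0, 250.0, 300.0, 50.0],
})

# Total gross sales per publisher, largest first
sales_by_publisher = (
    toy.groupby("Publisher", as_index=False)["gross sales"]
    .sum()
    .sort_values("gross sales", ascending=False)
)

# Mean book rating per author, highest first
rating_by_author = (
    toy.groupby("Author", as_index=False)["Book_average_rating"]
    .mean()
    .sort_values("Book_average_rating", ascending=False)
)

print(sales_by_publisher)
print(rating_by_author)
```

Either result can then be passed to plotly, for example `px.bar(sales_by_publisher, x="Publisher", y="gross sales")`.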