# 1 Background

The dataset used in this project is the open Kaggle dataset [Goodreads-books](https://www.kaggle.com/jealousleopard/goodreadsbooks). It provides a comprehensive list of books listed on Goodreads and was last updated on March 9, 2020.

We are trying to answer the following questions:


- Which 5 books have the highest average rating, and who published them?
- How many books were published since 1990?

# 2 Data Preparation
Four lines in the CSV file have a column count that does not match the header. These 4 lines are skipped so that the file loads successfully with pandas.

In [None]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from matplotlib import pyplot as plt
plt.style.use('ggplot')
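# error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on_bad_lines='skip' is its replacement
df = pd.read_csv('../input/goodreadsbooks/books.csv', on_bad_lines='skip')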
df.head()
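To illustrate what skipping malformed lines means, here is a minimal sketch on an in-memory CSV. The three rows are made-up sample data, not taken from the dataset, and the sketch assumes pandas 1.3 or later, where `on_bad_lines='skip'` is available. The second data row contains an unquoted comma in the title, so it has one field too many and is dropped.

```python
import io

import pandas as pd

# Hypothetical sample: the row with bookID 2 has an extra, unquoted comma
raw = (
    "bookID,title,authors\n"
    "1,Book A,Author A\n"
    "2,Book B, Part 2,Author B\n"
    "3,Book C,Author C\n"
)

# Rows with more fields than the header are skipped instead of raising an error
sample = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
kept_ids = sample["bookID"].tolist()
print(kept_ids)
```

Only the rows whose field count matches the header remain, which is the same effect the loading cell relies on for the 4 malformed lines.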

## 2.1 The Dataset
This section gives basic information about the dataset. The output below shows that it contains 11,123 entries.

In [None]:
df['bookID'].nunique()

In [None]:
df.info()

## 2.2 Data Cleaning
Notice that the 'publication_date' column has the data type object. Since we are going to work with dates, it is better to convert the column to the datetime format. Values that cannot be parsed as dates are set to NaT.

In [None]:
df['publication_date'] = pd.to_datetime(df['publication_date'], format='%m/%d/%Y', errors='coerce')
df.info()
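The effect of `errors='coerce'` can be seen in a small sketch. The three date strings below are hypothetical examples, chosen so that one of them (November 31) is not a valid calendar date.

```python
import pandas as pd

# Hypothetical date strings in month/day/year form; 11/31/2000 does not exist
dates = pd.Series(["09/16/2006", "11/31/2000", "01/01/1990"])

# Invalid dates become NaT instead of raising an error
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")
n_invalid = int(parsed.isna().sum())
print(parsed)
print(n_invalid)
```

Counting the NaT values after the conversion is a quick way to check how many rows would silently drop out of any later date-based filter.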

# 3 Simple Statistics
## 3.1 Top 5 books with the highest average rating and their publishers
Some books have a high average rating that is based on only a few ratings. To account for this, we consider only books with at least 50 ratings.

In [None]:
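# Keep only books with at least 50 ratings
tempDf = df.loc[df['ratings_count'] >= 50]
# Highest average rating per (title, publisher) pair, limited to the top 5
hi_avg_rating = (tempDf.groupby(['title', 'publisher'])['average_rating']
                 .max().sort_values(ascending=False).head(5).reset_index())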
hi_avg_rating
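The effect of the minimum-ratings threshold can be sketched on a toy table. The books, publishers, and numbers below are invented for illustration, and `top_rated` is a hypothetical helper, not part of the notebook. It uses `DataFrame.nlargest` rather than the `groupby` above, so unlike the notebook cell it does not merge duplicate (title, publisher) pairs.

```python
import pandas as pd

# Hypothetical books: "Book A" has a perfect rating but only 3 ratings
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "publisher": ["P1", "P2", "P1", "P3"],
    "average_rating": [5.00, 4.60, 4.20, 4.80],
    "ratings_count": [3, 120, 5000, 75],
})


def top_rated(frame, min_ratings=50, n=5):
    """Return the n best-rated books that have at least min_ratings ratings."""
    eligible = frame.loc[frame["ratings_count"] >= min_ratings]
    columns = ["title", "publisher", "average_rating"]
    return eligible.nlargest(n, "average_rating")[columns].reset_index(drop=True)


top = top_rated(books, n=2)
print(top)
```

Without the threshold, "Book A" would rank first on the strength of three ratings. With it, the ranking is driven by books that enough readers have rated.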

In [None]:
fig, ax = plt.subplots(figsize = (5,3))
ax.set_title('Top 5 books with their ratings', fontsize = 12, weight = 'bold')
ax.set_ylabel('Book Title')
ax.set_xlabel('Average ratings')
ax.barh(hi_avg_rating['title'],hi_avg_rating['average_rating'])
for i in ['top','bottom','left','right']:
    ax.spines[i].set_visible(False)
ax.xaxis.set_ticks_position('none')
ax.yaxis.set_ticks_position('none')
ax.grid(False)
ax.set_facecolor("white")
ax.invert_yaxis()
for j in ax.patches:
    plt.text(j.get_width()+0.2, j.get_y()+0.5, 
             str(round((j.get_width()),2)), fontsize = 10, color = 'black')
plt.show()
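As an alternative to positioning each label by hand with `plt.text`, matplotlib 3.4 and later provides `Axes.bar_label`, which places a label at the end of each bar. The sketch below uses hypothetical titles and ratings. The `Agg` backend is selected only so that it runs without a display, and that line can be dropped inside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
from matplotlib import pyplot as plt

# Hypothetical titles and ratings, for illustration only
titles = ["Book D", "Book B"]
ratings = [4.80, 4.60]

fig, ax = plt.subplots(figsize=(5, 3))
bars = ax.barh(titles, ratings)
ax.invert_yaxis()  # highest-rated book at the top

# One label per bar, formatted to two decimals, 3 points past the bar end
labels = ax.bar_label(bars, fmt="%.2f", padding=3)
label_texts = [label.get_text() for label in labels]
fig.tight_layout()
```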

## 3.2 How many books were published since 1990?

As shown below, 10,168 unique books in the dataset were published in or after 1990. A 'publication_year' column is added so that the books can be grouped and plotted by year.

In [None]:
# Copy the filtered rows so that adding a column does not modify a view of df
df_pub = df.loc[df['publication_date'] >= '1990-01-01'].copy()
df_pub['publication_year'] = df_pub['publication_date'].dt.year
df_pub.shape

In [None]:
df_pub_temp = df_pub.groupby(['publication_year'])['bookID'].count().reset_index()
df_pub_temp
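The filter-then-count steps can be checked on a toy frame where the result is easy to verify by eye. The four rows below are hypothetical. One falls before 1990 and should be excluded, and one falls exactly on January 1, 1990 and should be included.

```python
import pandas as pd

# Hypothetical books with already-parsed publication dates
toy = pd.DataFrame({
    "bookID": [1, 2, 3, 4],
    "publication_date": pd.to_datetime(
        ["1985-06-01", "1990-01-01", "2006-09-16", "2006-11-01"]
    ),
})

# Keep books published on or after January 1, 1990
recent = toy.loc[toy["publication_date"] >= "1990-01-01"].copy()
recent["publication_year"] = recent["publication_date"].dt.year

# Number of books per publication year
per_year = recent.groupby("publication_year")["bookID"].count()
print(per_year)
```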

In [None]:
df_pub_temp.describe()

In [None]:
# Create the figure and axes in one call; a separate plt.figure() would open a second, empty figure
fig_pub, ax_pub = plt.subplots(figsize = (5,3))
ax_pub.set_title('Number of published books since 1990', fontsize = 12, weight = 'bold')
ax_pub.set_xlabel('Publication Year')
ax_pub.set_ylabel('Number of Books')
ax_pub.grid(False)
ax_pub.set_facecolor("white")
ax_pub.bar(df_pub_temp['publication_year'],df_pub_temp['bookID'])
plt.show()

One interesting observation from the above bar chart is that the number of books published per year increased between 1990 and 2006. From around 2007, the yearly count started to decline.
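One way to back up a visual impression like this is to compute the year-over-year change in the counts. The sketch below uses made-up counts for five years, not the notebook's actual output, and finds both the peak year and the first year with a decline.

```python
import pandas as pd

# Hypothetical yearly counts, for illustration only
counts = pd.Series(
    [10, 14, 20, 17, 12],
    index=[2004, 2005, 2006, 2007, 2008],
    name="books",
)

# Difference from the previous year; the first entry is NaN
change = counts.diff()

peak_year = int(counts.idxmax())
first_decline = int(change[change < 0].index[0])
print(peak_year, first_decline)
```

Applied to the per-year counts in `df_pub_temp` (with 'publication_year' set as the index), the same two lines would show whether the turning point really falls around 2006 to 2007.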