In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('../input/goodreads-best-books-ever-with-recommendations/Goodreads_BestBooksEver_1-10000.csv')

In [None]:
df.head()

In [None]:
df.describe(include='all')

With describe(), we see that there are only 9486 unique book titles in here, because a book can have several editions and publications, each listed separately. For authors, there are only 5204 unique names, partly for the same reason (one book can appear more than once) and partly because one author can have more than one book on the list.
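
The counts reported by describe() can be cross-checked directly. The sketch below uses a tiny made-up frame (the rows are illustrative and not taken from the dataset) with the same `bookTitle` and `bookAuthors` column names to show how `nunique()` and `duplicated()` expose repeated titles.

```python
import pandas as pd

# Toy data for illustration only; these rows are not from the Goodreads file.
toy = pd.DataFrame({
    "bookTitle": ["Book A", "Book A", "Book B", "Book C"],
    "bookAuthors": ["Author X", "Author X", "Author X", "Author Y"],
})

n_rows = len(toy)                          # total rows, editions included
n_titles = toy["bookTitle"].nunique()      # distinct titles
n_authors = toy["bookAuthors"].nunique()   # distinct author strings
# Rows whose title already appeared earlier in the frame
n_repeat_rows = int(toy["bookTitle"].duplicated().sum())

print(n_rows, n_titles, n_authors, n_repeat_rows)
```

On the real data, the same calls on `df` should agree with the `unique` row of the describe() output.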

# Top 20 authors that show up the most on the list

First, let's clean the bookTitle column so that each title only appears once for the most accurate counting.

In [None]:
df_unique_title = df.drop_duplicates(subset=['bookTitle'])  # drop duplicate titles, keeping the first occurrence of each

In [None]:
authors = df_unique_title['bookAuthors'].value_counts()[:20]
print(authors)

In [None]:
sns.set_context('poster')
plt.figure(figsize=(15,10))
ax = sns.barplot(x = authors, y = authors.index)
ax.set_title("Authors with the highest frequency on the list")
ax.set_xlabel("Frequency")
ax.set_ylabel("Authors")
for i in ax.patches:
    ax.text(i.get_width() +.3, i.get_y() + 0.5, str(round(i.get_width())), fontsize = 15)

It is not surprising to see Stephen King at the top of this list, considering that he is one of the most popular and prolific authors of our time. The list seems to be dominated by popular modern authors whose target audience is young readers.

# Top 20 books that show up the most 

In [None]:
books = df['bookTitle'].value_counts()[:20]
sns.set_context('notebook')
plt.figure(figsize=(15,10))
ax = sns.barplot(x = books, y = books.index, palette = "magma")
ax.set_title("Books with the highest frequency on the list")
ax.set_xlabel("Frequency")
ax.set_ylabel("Books")

Appearing on this list doesn't imply a book is "good" per se, but it indicates that the book is at least popular enough for many of its editions/publications to be voted "best" by Goodreads users. Almost all books here are modern titles aimed at a young audience. A few are considered classics, such as *Fahrenheit 451* and *Animal Farm*.

# Top 10 most rated books

To answer this, we need to keep in mind that the list contains different editions/publications of the same book. What we do not know is whether rating one edition of a book also affects the other editions of that book. When we sort the list by ratingCount, we will see why this is a concern.

In [None]:
df.sort_values('ratingCount', ascending = False).loc[:, 'bookTitle':'ratingCount'].head(10).set_index('bookTitle')

As suspected, some popular books, such as *The Hunger Games*, appear at the top of the list under several editions. What's interesting is that all editions of a book have exactly the same bookRating and almost the same ratingCount. Combining the counts of all editions would therefore make little sense, because we would very likely double-count. Instead, we will keep the first instance of each book and remove any duplicates.
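
The sort-then-deduplicate idea can be sketched on toy data. The titles and counts below are made up for illustration; only the `bookTitle` and `ratingCount` column names come from the dataset. The sketch also shows what summing editions would do, which is the double-counting this analysis wants to avoid.

```python
import pandas as pd

# Toy data for illustration only: two editions of one title with nearly equal counts.
toy = pd.DataFrame({
    "bookTitle": ["Book A", "Book B", "Book A"],
    "ratingCount": [5000, 1200, 5100],
})

# Sort descending first, so drop_duplicates (keep="first" is the default)
# retains the edition with the highest ratingCount for each title.
deduped = (
    toy.sort_values("ratingCount", ascending=False)
       .drop_duplicates(subset=["bookTitle"])
)

# Summing editions instead roughly doubles the figure for "Book A".
summed = toy.groupby("bookTitle")["ratingCount"].sum()

print(deduped)
print(summed)
```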

In [None]:
most_rated = df.sort_values('ratingCount', ascending = False).drop_duplicates(subset=['bookTitle']).loc[:, 'bookTitle':'ratingCount'].head(15)
most_rated

Here we still see some duplicates, because the bookTitle values for the editions of *1984* and *To Kill a Mockingbird* are not exactly the same. They should still be removed.
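
One possible way to catch such near-duplicates automatically is to deduplicate on a normalized key rather than on the raw title. The sketch below is only an assumption about how titles might differ (letter case and stray whitespace); the real titles may differ in other ways, and `normalize_title` and `titleKey` are hypothetical names introduced for this example.

```python
import pandas as pd

def normalize_title(title: str) -> str:
    """Hypothetical helper: collapse whitespace and ignore letter case."""
    return " ".join(title.split()).casefold()

# Toy data for illustration only; the counts are made up.
toy = pd.DataFrame({
    "bookTitle": ["1984", "1984 ", "To Kill a Mockingbird", "To Kill A Mockingbird"],
    "ratingCount": [300, 290, 400, 410],
})

# Deduplicate on the normalized key, keeping the most rated row per key.
toy["titleKey"] = toy["bookTitle"].map(normalize_title)
deduped = (
    toy.sort_values("ratingCount", ascending=False)
       .drop_duplicates(subset=["titleKey"])
)

print(deduped[["bookTitle", "ratingCount"]])
```

If the titles differ by more than case and whitespace (for example, a subtitle or series name), this key would not merge them, and removing the rows by hand remains the simplest option.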

In [None]:
most_rated = most_rated[(most_rated['bookTitle'] != "To Kill a Mockingbird") & (most_rated['bookTitle'] != "1984")].set_index('bookTitle')

In [None]:
plt.figure(figsize=(15,10))
# Pass x and y by keyword; recent seaborn versions no longer accept them positionally
ax = sns.barplot(x=most_rated['ratingCount'], y=most_rated.index, palette='icefire_r')
ax.set_title("Most rated books")
ax.set_xlabel("Ratings Count")
ax.set_ylabel("Books")

# Top 10 most reviewed books

In [None]:
most_reviewed = df.sort_values('reviewCount', ascending = False).drop_duplicates(subset=['bookTitle']).loc[:, 'bookTitle':'reviewCount'].head(10).set_index('bookTitle')
most_reviewed

Since there are no duplicates, we can proceed to plotting right away.

In [None]:
plt.figure(figsize=(15,10))
# Pass x and y by keyword; recent seaborn versions no longer accept them positionally
ax = sns.barplot(x=most_reviewed['reviewCount'], y=most_reviewed.index, palette='rocket')
ax.set_title("Most reviewed books")
ax.set_xlabel("Reviews Count")
ax.set_ylabel("Books")

It is interesting to note that *The Hunger Games* gets significantly more written reviews than *Harry Potter and the Sorcerer's Stone*, even though the latter dominates in terms of **ratingCount** (the number of star ratings a book has received). *The Fault in Our Stars* and *Gone Girl* also seem to have a lot of fans who are willing to write (often lengthy) reviews instead of just voting stars. I read a lot of book reviews on Goodreads, and personally I would consider this a stronger sign that a book is well-liked by readers. This observation leads us to the next data exploration:

# Books with the highest Reviews over Ratings ratio

In [None]:
# New column: written reviews per rating, computed column-wise instead of row by row
df['reviewsoverratingsRatio'] = df['reviewCount'] / df['ratingCount']

Like before, we will drop duplicates of the books with the same name and keep the first instance only (after sorting descending based on the ratio).

In [None]:
temp_highest_ratio = df[['bookTitle','bookRating','ratingCount','reviewCount','reviewsoverratingsRatio']].sort_values('reviewsoverratingsRatio', ascending = False).drop_duplicates(subset=['bookTitle'])
temp_highest_ratio.head(10)

In [None]:
temp_highest_ratio[temp_highest_ratio['bookTitle'] == "The Hunger Games"] #an example of a popular book from the previous lists

The top of this list is filled with books whose ratio is so high that almost every rating comes with a written review. However, these books all have very few ratings and reviews, so the ratio says little: as a book accumulates more ratings, its ratio will likely drop (we will explore this shortly). This is why none of the well-known, popular books appear here; in fact, I do not recognize any of the titles above. Let's redo this including only the most popular titles, say the 30 most rated books.
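
The effect of small counts on the ranking can be sketched with made-up numbers. The three rows below are illustrative only; the column names match the dataset, and `n=2` stands in for the 30 used in this analysis.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "bookTitle": ["Obscure", "Popular 1", "Popular 2"],
    "ratingCount": [4, 2_000_000, 1_000_000],
    "reviewCount": [3, 60_000, 50_000],
})
toy["reviewsoverratingsRatio"] = toy["reviewCount"] / toy["ratingCount"]

# Ranked by ratio alone, the barely rated book comes first.
top_by_ratio = toy.sort_values("reviewsoverratingsRatio", ascending=False).iloc[0]

# Restrict to the most rated titles first, then rank those by ratio.
popular = toy.nlargest(2, "ratingCount")
top_popular = popular.sort_values("reviewsoverratingsRatio", ascending=False).iloc[0]

print(top_by_ratio["bookTitle"], top_popular["bookTitle"])
```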

In [None]:
most_rated_2 = df[['bookTitle','bookRating','ratingCount','reviewCount','reviewsoverratingsRatio']].sort_values(by = 'ratingCount', ascending = False).drop_duplicates(subset=['bookTitle'])
highest_ratio = most_rated_2.head(30).sort_values(by = 'reviewsoverratingsRatio', ascending = False)
highest_ratio.head(10)

For these popular titles, the highest ratios fall between roughly 3% and 5%. That might not sound very impressive, but we have to keep in mind how high their **ratingCount** is to begin with.

# Distribution of the Reviews over Ratings ratio

In [None]:
sns.set_context('paper')
ax = sns.displot(df['reviewsoverratingsRatio'], color = 'green', bins = 100, kde = True)
ax.fig.set_figwidth(15)
ax.fig.set_figheight(10)

To see the exact bins and counts we can use numpy:

In [None]:
np.histogram(df['reviewsoverratingsRatio'], bins=100)

We can conclude that the majority of books have a ratio between 0 and 0.2. As we move to the right of the distribution plot, where the ratio increases, the count drops very quickly, which means only a few books have a ratio that high.

The peak of the distribution is at 0.026, or about 3%. This means the most common ratio is around 3 written reviews per 100 ratings.
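
Reading the peak off the output of `np.histogram` can be sketched as follows. The ratios below are made up for illustration, and the bin count and range are chosen only to keep the example small; in the notebook the input would be `df['reviewsoverratingsRatio']` with `bins=100`.

```python
import numpy as np

# Toy ratios for illustration only.
ratios = np.array([0.01, 0.02, 0.025, 0.03, 0.03, 0.035, 0.12, 0.38])

# counts[i] is the number of values between edges[i] and edges[i + 1]
counts, edges = np.histogram(ratios, bins=4, range=(0.0, 0.4))

peak = int(counts.argmax())                   # index of the fullest bin
peak_left, peak_right = edges[peak], edges[peak + 1]
peak_center = (peak_left + peak_right) / 2    # midpoint of that bin

print(counts, peak_left, peak_right, peak_center)
```

The peak is only known to within one bin width, so a value such as 0.026 should be read as the location of the fullest bin rather than an exact mode.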

We can revisit the list of the most popular books with the highest **reviewsoverratingsRatio** and see that most of them (bar the last two) have a ratio > 0.026. Those in my opinion make the *definitive* top 10 best, most-rated, most-reviewed books on Goodreads.

In [None]:
highest_ratio[highest_ratio['reviewsoverratingsRatio'] > 0.026].head(10)

# Correlation between ratingCount and reviewsoverratingsRatio

We previously speculated that, beyond a certain ratingCount, a book's **reviewsoverratingsRatio** decreases as the number of ratings it receives increases (meaning more and more people leave a rating without writing a review). Let's see if that's a correct statement.

In [None]:
sns.set_context('paper')
ax = sns.jointplot(x = "ratingCount",y = 'reviewsoverratingsRatio', kind='scatter',  data = df[['reviewsoverratingsRatio','ratingCount']])
ax.set_axis_labels("ratingCount", "reviewsoverratingsRatio")
ax.fig.set_figwidth(15)
ax.fig.set_figheight(10)

Looking at this plot, we notice that the **reviewsoverratingsRatio** is highest when **ratingCount** is close to 0 (on this axis scale of 10^6). The ratio drops rapidly as we move from 0 to 100,000 ratings, and the trend continues as **ratingCount** increases. To remove outliers and improve visibility, we can limit the study to books whose **ratingCount** > 100,000.

In [None]:
sns.set_context('paper')
ax = sns.jointplot(x = "ratingCount",y = 'reviewsoverratingsRatio', kind='scatter',  data = df[df['ratingCount'] >  100000][['reviewsoverratingsRatio','ratingCount']])
ax.set_axis_labels("ratingCount", "reviewsoverratingsRatio")
ax.fig.set_figwidth(15)
ax.fig.set_figheight(10)
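
The scatter plots show the trend visually. One way to put a number on it, which is an addition and not part of the original analysis, is a Spearman rank correlation, computed here as the Pearson correlation of the ranks. The values below are made up for illustration; only the column names come from the notebook.

```python
import pandas as pd

# Toy data for illustration only: the ratio shrinks as ratingCount grows.
toy = pd.DataFrame({
    "ratingCount": [150_000, 400_000, 900_000, 2_000_000, 5_000_000],
    "reviewsoverratingsRatio": [0.060, 0.045, 0.040, 0.030, 0.020],
})

# Spearman correlation equals the Pearson correlation of the ranked values.
rho = toy["ratingCount"].rank().corr(toy["reviewsoverratingsRatio"].rank())

print(rho)
```

A value close to -1 would support the speculation that the ratio falls as ratings accumulate, while a value near 0 would suggest no monotonic relationship. On the real data, the same two lines could be applied to `df[df['ratingCount'] > 100000]`.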