# Book reviews and recommendations

In this notebook we will attempt to gain insights into the popularity of books found on [Goodreads](https://www.goodreads.com/).

Information about books can be used in a variety of ways. It can give you both quantitative and qualitative info.
For example, you can get the number of Russian authors who have written more than 5 books (quantitative) or you can get the popularity of a book across many languages (qualitative).

At the outset we will define the goals.

* Answer simple questions.
* Answer deep questions.
* Do some fun stuff!

The dataset used for the purposes of this project is the [Goodreads-books dataset](https://www.kaggle.com/jealousleopard/goodreadsbooks).

First, we will import the essential pandas library.

In [None]:
import pandas as pd

Now we will read the dataset. Note that we have used 'bookID' as the index for the dataset. How did we do this even before examining the dataset? The answer is that the data description given in the link for the dataset (see above) clearly states that this particular column has a unique value for each book, which is perfect for use as an index into the dataset.

The on_bad_lines argument needs to be set to 'skip' (default value is 'error') in order to ignore rows with too many columns. If this argument is not set, read_csv will raise an error saying it encountered a row with 11 columns, when in fact most of the data has only 10 columns. Note that on_bad_lines requires pandas 1.3 or later; older versions used error_bad_lines = False for the same purpose.

In [None]:
books_data = pd.read_csv('../input/goodreadsbooks/books.csv', on_bad_lines = 'skip', index_col = 'bookID')
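
As a small, self-contained illustration of how malformed rows are skipped, the sketch below parses an in-memory CSV in which one row has an extra field. It assumes pandas 1.3 or later, where on_bad_lines='skip' plays the role of error_bad_lines=False. The sample rows are made up and are not taken from the dataset.

```python
import io
import pandas as pd

# Made-up sample: the second data row has one field too many.
sample_csv = io.StringIO(
    "bookID,title,authors\n"
    "1,Book A,Author X\n"
    "2,Book B,Author Y,unexpected extra field\n"
    "3,Book C,Author Z\n"
)

# on_bad_lines='skip' drops rows with too many fields instead of raising an error.
sample = pd.read_csv(sample_csv, on_bad_lines='skip', index_col='bookID')
print(sample)
```

Comparing the number of rows read with the number of lines in the file is one way to see how many rows were dropped.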

In [None]:
books_data.head()

Let's do a quick check to see if there are any missing values.

In [None]:
books_data.isna().sum().sum()

Good, so there are no missing values.

## Answering simple questions.

In this section we will answer the following questions.
1. How many unique authors are there?
2. Which are the top 10 most frequently occurring authors?
3. What is the distribution of the number of pages per book in the dataset?
4. Which are some of the books with many ratings?
5. Which are some of the highly rated books?
6. Which authors are the most popular?

In [None]:
# Answer to question 1
num_unique_authors = books_data['authors'].nunique()
print('Number of authors = ', num_unique_authors)

In [None]:
# Answer to question 2
# frequency of top 10 frequent authors
books_data['authors'].value_counts()[:10]
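
The counts above treat each distinct string in the 'authors' column as one author. If that column stores co-authors in a single string, the names would have to be split before counting. The following is a minimal sketch on made-up data; the '/' separator is an assumption and should be checked against the actual file.

```python
import pandas as pd

# Made-up sample; the '/' separator between co-authors is an assumption.
sample = pd.DataFrame({
    'title': ['Book A', 'Book B', 'Book C'],
    'authors': ['Author X/Author Y', 'Author X', 'Author Z/Author X'],
})

# Split each entry into a list of names, then give every name its own row.
individual_authors = sample['authors'].str.split('/').explode().str.strip()

print('Number of individual authors = ', individual_authors.nunique())
print(individual_authors.value_counts().head(10))
```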

3. What is the distribution of the number of pages per book in the dataset?

In order to answer the third question, we need some visualization. Let's import the plotting essentials.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("dark")

In [None]:
# Answer to question 3
# histplot replaces the deprecated distplot (requires seaborn 0.11 or later)
sns.histplot(x = books_data['# num_pages'], bins = 2000)
plt.title('Distribution of number of pages')

It looks like the majority of the books in the dataset have around 500 pages. The number of books with more than 1000 pages is very small.
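
A reading taken from a histogram can be backed up with a couple of summary numbers. The sketch below uses a made-up series in place of books_data['# num_pages'], so the printed values say nothing about the real dataset.

```python
import pandas as pd

# Made-up page counts standing in for books_data['# num_pages'].
pages = pd.Series([120, 250, 320, 410, 480, 560, 1200], name='# num_pages')

median_pages = pages.median()
# The mean of a boolean mask is the fraction of True values.
share_over_1000 = (pages > 1000).mean() * 100

print('Median number of pages = ', median_pages)
print('Books with more than 1000 pages (%) = ', round(share_over_1000, 1))
```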

4. Which are some of the books with many ratings?

In [None]:
books_with_many_ratings = books_data.sort_values('ratings_count', 
                                                 ascending = False).head(10).set_index('title')
plt.figure(figsize=(15,10))
# x and y must be passed as keywords in recent seaborn versions
sns.barplot(x = books_with_many_ratings['ratings_count'], y = books_with_many_ratings.index)

5. Which are some of the highly rated books?

Before answering question 5, we will look at how the book ratings are distributed. For example, what are the maximum, minimum and average values?

In [None]:
books_data['average_rating'].describe()

Let's define the rating as HIGH if rating > 4.5.

In [None]:
top_rated_books = books_data[books_data['average_rating'] > 4.5]
print('Number of top-rated books = ', top_rated_books.shape[0])
print('Top rated books are: ', top_rated_books['title'])

Ok, the results indicate that Harry Potter books are among the top-rated books. Let's plot the first 5 books from this list.

In [None]:
plt.figure(figsize=(6,4))
sns.barplot(y = top_rated_books['title'][:5], 
            x = top_rated_books['average_rating'][:5], palette = 'dark')
plt.show()
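
Note that slicing with [:5] takes the first five rows in dataset order, not the five highest ratings. A minimal sketch of ranking by rating instead, on made-up data, could look like this. Breaking ties by 'ratings_count' is an assumption about what "top" should mean.

```python
import pandas as pd

# Made-up sample with the same column names as the dataset.
sample = pd.DataFrame({
    'title': ['Book A', 'Book B', 'Book C', 'Book D'],
    'average_rating': [4.6, 4.9, 4.2, 4.9],
    'ratings_count': [1500, 12, 90000, 340],
})

top_rated = sample[sample['average_rating'] > 4.5]

# Highest rating first; among equal ratings, the book with more ratings comes first.
ranked = top_rated.sort_values(['average_rating', 'ratings_count'],
                               ascending = [False, False])
print(ranked[['title', 'average_rating', 'ratings_count']].head(5))
```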

6. Which authors are the most popular?

The answer to this question requires the use of the average_rating column. In fact, this question can be answered from the previous question's answer, by looking at the authors of the top-rated books.

In [None]:
plt.figure(figsize=(10, 6))
sns.set_style("dark")
sns.barplot(y = top_rated_books['authors'][:10], 
            x = top_rated_books['average_rating'][:10], palette = 'dark')
plt.show()
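
The plot above shows the authors of individual top-rated books. One possible way to summarise ratings per author is a groupby, sketched below on made-up data. The minimum total number of ratings (the hypothetical min_total_ratings) is an assumption and is not part of the analysis above.

```python
import pandas as pd

# Made-up sample with the same column names as the dataset.
sample = pd.DataFrame({
    'authors': ['Author X', 'Author X', 'Author Y', 'Author Z'],
    'average_rating': [4.4, 4.6, 4.9, 3.8],
    'ratings_count': [2000, 3000, 5, 10000],
})

# Per-author summary: mean rating, total number of ratings and number of books.
author_stats = sample.groupby('authors').agg(
    mean_rating = ('average_rating', 'mean'),
    total_ratings = ('ratings_count', 'sum'),
    num_books = ('average_rating', 'count'),
)

min_total_ratings = 1000  # hypothetical threshold
popular_authors = (author_stats[author_stats['total_ratings'] >= min_total_ratings]
                   .sort_values('mean_rating', ascending = False))
print(popular_authors)
```

The threshold keeps authors whose high mean rating rests on only a handful of ratings from dominating the ranking.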

## Answering deep questions.

Now we move on to more involved questions.

1. Does language of the book have any relationship with the ratings it receives?

The rating of the book is given by the column 'average_rating', which has a minimum value of 0 and a maximum value of 5. It can be treated as a continuous variable.

In [None]:
sns.catplot(x = 'average_rating', y = 'language_code', 
            kind = 'bar', data = books_data, height = 10)
plt.show()

It appears as though books written in the wel (Welsh) language tend to have the highest rating. Let's examine the data.

In [None]:
books_data[books_data['language_code'] == 'wel']

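Well, there appears to be just one Welsh book with one rating in our dataset! So our inference that books written in Welsh tend to have high ratings is not well supported. This book could very well be an anomaly. To reduce the effect of such cases, we will keep only the books whose number of ratings is at least the mean number of ratings, and plot again.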

In [None]:
books_data['ratings_count'].describe()

In [None]:
# .mean() avoids positional indexing into the describe() output
mean_ratings_count = books_data['ratings_count'].mean()

In [None]:
books_data_with_sufficient_ratings = books_data[books_data['ratings_count'] >= mean_ratings_count]

In [None]:
sns.catplot(x = 'average_rating', y = 'language_code',
            kind = 'bar', data = books_data_with_sufficient_ratings, height = 5)
plt.show()
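
A bar plot of mean ratings hides how many books stand behind each bar, which is what made the single Welsh book misleading. A small sketch on made-up data shows the mean rating next to the number of books per language.

```python
import pandas as pd

# Made-up sample with the same column names as the dataset.
sample = pd.DataFrame({
    'language_code': ['eng', 'eng', 'eng', 'spa', 'spa', 'wel'],
    'average_rating': [3.9, 4.1, 4.0, 3.8, 4.0, 5.0],
})

# Mean rating and number of books per language, largest groups first.
language_stats = (sample.groupby('language_code')['average_rating']
                  .agg(['mean', 'count'])
                  .sort_values('count', ascending = False))
print(language_stats)
```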

The language_code of 'mul' stands for multilingual content (includes at least two languages in separable parts).

Thus, for languages like English, German, French, Spanish and mul, the average ratings can be expected to be high. As these are some of the most widely spoken languages in the world, this observation is not too surprising.

Now let's look at the languages with the lowest ratings.

In [None]:
# .mean() avoids positional indexing into the describe() output
mean_rating = books_data_with_sufficient_ratings['average_rating'].mean()

In [None]:
books_with_low_rating = books_data_with_sufficient_ratings[
    books_data_with_sufficient_ratings['average_rating'] < mean_rating]

In [None]:
books_with_low_rating.head()

In [None]:
sns.catplot(x = 'average_rating', y = 'language_code',
            kind = 'bar', data = books_with_low_rating, height = 5)
plt.show()

Languages like English, Spanish, German and French, which the previous plot showed to be highly rated, now appear to have low ratings too. Interestingly, the 'mul' category does not appear in this new plot, which suggests that books with multilingual content tend not to have low ratings. This result contradicts the previous one, so we can conclude that bad ratings are not language dependent.

2. What does the average rating(s) of a particular author's work(s) tell us?

For this analysis, we will consider the author Charles Dickens. First let's see how many books of his we have in our dataset.

In [None]:
books_data[books_data['authors'] == 'Charles Dickens'].shape[0]

In [None]:
books_data[books_data['authors'] == 'Charles Dickens'][:3]

In [None]:
# histplot replaces the deprecated distplot (requires seaborn 0.11 or later)
sns.histplot(x = books_data[books_data['authors'] == 'Charles Dickens']['average_rating'])
plt.show()

Most of his books have an average rating of 4.0 and above, but none have 5. Does this mean that his works are not popular? No, this graph does not give us the full picture. For instance, the book with ISBN [0486406512](https://www.goodreads.com/book/show/1952.A_Tale_of_Two_Cities) was published in the year 1999, but the novel itself was originally published in 1859. Our dataset contains reviews for a limited and considerably shorter period of time.

Now let's look at a modern author like J.K. Rowling.

In [None]:
books_data[books_data['authors'] == 'J.K. Rowling'][:3]

In [None]:
# histplot replaces the deprecated distplot (requires seaborn 0.11 or later)
sns.histplot(x = books_data[books_data['authors'] == 'J.K. Rowling']['average_rating'])
plt.show()

The majority of her books appear to have an average rating above 4.4. This suggests that her books are more popular than those of Charles Dickens, but let's consider the time component. Consider her book with ISBN [0439554896](https://www.goodreads.com/book/show/4.Harry_Potter_and_the_Chamber_of_Secrets), which was published in 2003. This is a relatively new book, and its ratings come from a newer audience. It would not be quite fair to say that J.K. Rowling is more popular than Charles Dickens. A more accurate statement would be: J.K. Rowling is more popular than Charles Dickens according to the data available.

3. Is there a relationship between number of pages and ratings?

In [None]:
ax = sns.jointplot(x = "# num_pages", y = "average_rating", data = books_data)
ax.set_axis_labels("Number of Pages", "Average Rating")

There appear to be a few observations with a rating of 0 that have 1000 or more pages. Let's keep only the books with at most 1000 pages and look at the plot again.

In [None]:
books_with_reasonable_num_pages = books_data[books_data['# num_pages'] <= 1000]

In [None]:
ax = sns.jointplot(x = "# num_pages", y = "average_rating", 
                   data = books_with_reasonable_num_pages)
ax.set_axis_labels("Number of Pages", "Average Rating")

The plot indicates that books with at most 1000 pages tend to have higher ratings.
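
A scatter plot can be hard to read when many points overlap. One way to put numbers on the relationship is to bin the page counts and compare the mean rating per bin. This is a minimal sketch on made-up data; the bin edges are hypothetical and the printed values carry no meaning for the real dataset.

```python
import pandas as pd

# Made-up sample with the same column names as the dataset.
sample = pd.DataFrame({
    '# num_pages': [80, 150, 320, 450, 700, 950, 1300],
    'average_rating': [3.6, 3.8, 4.0, 4.2, 3.9, 4.1, 4.0],
})

# Hypothetical bin edges; by default each bin includes its upper edge.
bins = [0, 250, 500, 1000, float('inf')]
labels = ['<=250', '251-500', '501-1000', '>1000']
sample['page_bin'] = pd.cut(sample['# num_pages'], bins = bins, labels = labels)

# Mean rating and number of books in each page-count bin.
rating_by_bin = (sample.groupby('page_bin', observed = False)['average_rating']
                 .agg(['mean', 'count']))
print(rating_by_bin)
```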

4. Does the number of text reviews that a book has influence the rating of the book?

Logic suggests that if people have something to say about a book, then it is quite likely that there will be an associated rating (high, low or average). Let's see if this is indeed the case.

In [None]:
books_with_no_reviews = books_data[books_data['text_reviews_count'] == 0]
books_with_reviews = books_data[books_data['text_reviews_count'] > 0]

Among books with no text reviews, are there any ratings?

In [None]:
ax = sns.jointplot(x = "text_reviews_count", y = "average_rating", data = books_with_no_reviews)
ax.set_axis_labels("Number of text reviews", "Average Rating")
plt.title('Books with no reviews')

The plot shown above indicates that there can be ratings (high, low, medium) even for books without text reviews. Now let's see how many books satisfy these conditions.

In [None]:
books_with_no_reviews.shape[0]/books_data.shape[0] * 100

Only about 6% of the books in the dataset have no text reviews, yet they still have ratings. Next, let's look at the ratings of books with reviews.

In [None]:
ax = sns.jointplot(x = "text_reviews_count", y = "average_rating", data = books_with_reviews)
ax.set_axis_labels("Number of text reviews", "Average Rating")
plt.title('Books with reviews')

In [None]:
books_with_reviews.shape[0]/books_data.shape[0] * 100

This plot shows that for a non-zero number of reviews, the ratings can be high, low or medium. Besides, this category of books is large, making up about 93% of the whole dataset. An interesting observation from the plot shown above is that books with more than 2000 reviews tend to have a rating above 3.0.

The conclusion is that the presence of a large number of text reviews indicates an above-average rating (>3.0), whereas the absence of reviews does not necessarily mean a bad or good rating.
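
The observation about books with more than 2000 reviews can also be checked numerically rather than read off the plot. The sketch below does this on made-up data, using the 2000-review and 3.0-rating thresholds from the text. The resulting percentages are illustrative only.

```python
import pandas as pd

# Made-up sample with the same column names as the dataset.
sample = pd.DataFrame({
    'text_reviews_count': [0, 0, 15, 300, 2500, 4100],
    'average_rating': [0.0, 4.8, 2.9, 3.7, 3.9, 4.2],
})

many_reviews = sample[sample['text_reviews_count'] > 2000]
other_books = sample[sample['text_reviews_count'] <= 2000]

# Percentage of books rated above 3.0 in each group.
share_many = (many_reviews['average_rating'] > 3.0).mean() * 100
share_other = (other_books['average_rating'] > 3.0).mean() * 100

print('Rated above 3.0 among books with > 2000 reviews (%) = ', share_many)
print('Rated above 3.0 among the remaining books (%) = ', share_other)
```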

## Fun stuff

1. You want to recommend a book to a lazy friend who cannot be bothered to read a 500-page book. In addition, you want to encourage him/her to read more books. So you had better recommend something good and popular.

In [None]:
books_with_atmost_200_pages = books_data[books_data['# num_pages'] <= 200]
best_books_with_atmost_200_pages = books_with_atmost_200_pages.nlargest(10, ['ratings_count'])

In [None]:
# x and y must be passed as keywords in recent seaborn versions
sns.barplot(x = best_books_with_atmost_200_pages['ratings_count'],
            y = best_books_with_atmost_200_pages['title'],
            hue = best_books_with_atmost_200_pages['average_rating'])
plt.xticks(rotation=25)
plt.title('Top 10 books with <=200 pages')

So, we can use the above plot to make our recommendation. But make sure the genre agrees with your friend!

2. You are going on a long and boring train journey and you need to pass the time pleasantly. So you are looking for a book that is long, but popular.

In [None]:
big_books = books_data[books_data['# num_pages'] >= 1000]
best_big_books = big_books.nlargest(10, ['ratings_count'])

In [None]:
# x and y must be passed as keywords in recent seaborn versions
sns.barplot(x = best_big_books['ratings_count'],
            y = best_big_books['title'],
            hue = best_big_books['average_rating'])
plt.xticks(rotation=25)
plt.title('Top 10 books with >=1000 pages')
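
Both recommendations follow the same pattern: filter by number of pages, then take the books with the most ratings. As a sketch, this could be wrapped in a small helper. The function name recommend_books and its parameters are hypothetical, and the sample data is made up.

```python
import pandas as pd

def recommend_books(books, min_pages = 0, max_pages = float('inf'), top_n = 10):
    """Return the top_n most-rated books with min_pages <= pages <= max_pages."""
    in_range = books[books['# num_pages'].between(min_pages, max_pages)]
    return in_range.nlargest(top_n, 'ratings_count')

# Made-up sample with the same column names as the dataset.
sample = pd.DataFrame({
    'title': ['Short A', 'Short B', 'Medium C', 'Long D', 'Long E'],
    '# num_pages': [120, 180, 450, 1100, 1500],
    'ratings_count': [5000, 90000, 30000, 700, 42000],
    'average_rating': [4.1, 3.9, 4.3, 4.0, 4.4],
})

short_reads = recommend_books(sample, max_pages = 200, top_n = 2)
long_reads = recommend_books(sample, min_pages = 1000, top_n = 2)
print(short_reads['title'].tolist())
print(long_reads['title'].tolist())
```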

## Extra

In this section we will think about what cannot be done with the current form of the dataset. This is interesting because the following points explore the limits of the dataset, which is key to understanding future possibilities.

* Can we make recommendations based on types/genres of books?
    - No, we cannot do this right now because the genre info is missing from the dataset. So, what can we do? We could fetch the genre info corresponding to each book and then add it to the dataset. Using the new dataset, we can make recommendations based on relevant features such as genre, rating, reviews etc.
 
* Can we track the popularity of the books over time?
    - At first glance this is not possible with the current dataset, because there is no time information in it. That is only partly true. There is no direct date of publishing in the dataset, but we have the ISBN, which can be used to track down the date of publishing. So what about popularity? We can measure popularity using the average_rating column. By using this column, the number of reviews and the date of publishing, we can get a rough idea of how the book's popularity has stood the test of time. For example, if a book published in 1990 has a rating of 4.5 in 2013, then it can be regarded as an all-time favorite. It's true that we might end up missing extreme lows and extreme highs in its history, but we can say with confidence that as of now the book has a certain rating, and that this rating is good (or bad) when compared to a similar book of similar age.

* Are there similarities between authors?
    - Now this question requires author-specific info such as writing style, themes, etc. Incorporating these data into our dataset will help in the analysis of author similarities. 
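
Each of the ideas above comes down to joining extra information onto the dataset. The following is a minimal sketch of such a join, keyed on ISBN. The lookup table, its column names ('genre', 'publication_year') and all values are hypothetical; no such table is part of the dataset.

```python
import pandas as pd

# Made-up stand-in for a few columns of the books dataset.
books = pd.DataFrame({
    'isbn': ['0000000001', '0000000002', '0000000003'],
    'title': ['Book A', 'Book B', 'Book C'],
    'average_rating': [4.5, 3.9, 4.2],
})

# Hypothetical lookup table, e.g. collected from an external source keyed by ISBN.
extra_info = pd.DataFrame({
    'isbn': ['0000000001', '0000000003'],
    'genre': ['fantasy', 'history'],
    'publication_year': [1990, 2005],
})

# A left join keeps every book; books without a match get NaN in the new columns.
enriched = books.merge(extra_info, on = 'isbn', how = 'left')
print(enriched)
```

ISBNs are kept as strings here so that leading zeros survive the join.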

**Conclusion:**

Finding answers to the questions considered in this project has been useful in understanding the possibilities as well as identifying the limitations of such a dataset.
