In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

## Goodreads: Data Visualization with Seaborn and Matplotlib

![pic](https://ebookfriendly.com/wp-content/uploads/2014/07/Reading-gives-us-someplace-to-go-when-we-have-to-stay-where-we-are-Mason-Cooley-quote-540x540.jpg)

In this kernel, we will visualize the data and try to answer some important questions:
1. Authors with the most books.
2. Books with the highest number of ratings.
3. Books with the highest number of reviews.
4. Top publishing house.

### Reading the Dataset:

In [None]:
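# Skip malformed rows. `on_bad_lines` needs pandas >= 1.3; older versions used error_bad_lines = False.
df = pd.read_csv('/kaggle/input/goodreadsbooks/books.csv', on_bad_lines = 'skip', index_col = 'bookID')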

In [None]:
df.head()

In [None]:
df.shape

* There are 11123 rows and 11 columns in the dataset.

In [None]:
df.info()

* info() provides information about columns in the dataset like column name, number of values, type of data. 
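
As a complement to `info()`, here is a minimal sketch of two quick checks: counting missing values per column and spotting column names padded with spaces (the notebook later strips spaces from the column names, which suggests at least one name is padded). The frame `toy` and its values are made up for illustration and are not the Goodreads data.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "title": ["A", "B", None],
    "  num_pages": [100, 250, 320],  # leading spaces in the name, on purpose
})

# Number of missing values in each column.
missing = toy.isna().sum()

# Column names that differ from their whitespace-stripped form.
padded = [c for c in toy.columns if c != c.strip()]

print(missing)
print(padded)
```

Running the same two checks on `df` would show whether any cleaning is needed before plotting.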

## Visualizing the Data:

### Which languages are the books written in?

In [None]:
plt.figure(figsize=(16,10))
plt.title('Languages in the dataset', fontsize = 14)
sns.countplot(x = df['language_code'])  # pass the column as x; recent seaborn no longer accepts it positionally
plt.xticks(fontsize = 10)
plt.show()

- We will plot the graph again, this time with only the most common languages in the dataset.

In [None]:
lang = df.loc[df.language_code.isin(['eng', 'en-US', 'spa', 'en-GB', 'fre', 'ger', 'jpn'])]

plt.figure(figsize=(16,10))
plt.title('Languages in the dataset', fontsize = 14)
graph = sns.countplot(x = lang['language_code'])
plt.xticks(fontsize = 12)
plt.xlabel('Languages', fontsize = 12)
plt.ylabel('Count', fontsize = 12)
for p in graph.patches:
    height = p.get_height()
    graph.text(p.get_x()+p.get_width()/4., height + 0.1, height, fontsize = 14)
plt.show()

- English-language books are by far the most numerous in the dataset; they are further split into English-US, English-UK and English-CA.
- Apart from English, the dataset contains French, Spanish, German and Japanese books.
- The remaining languages account for a negligible number of books in the dataset.
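
The list of languages above was chosen by eye. As a rough sketch, the same selection can be made from the counts themselves with `value_counts(normalize=True)`. The series `codes` below is made-up data, and keeping the top three is an arbitrary choice for the example.

```python
import pandas as pd

# Made-up language codes, not the real dataset.
codes = pd.Series(["eng"] * 6 + ["en-US"] * 3 + ["spa"] * 2 + ["fre"])

# Share of each language, largest first.
share = codes.value_counts(normalize=True)

# Keep only the three most frequent codes instead of hard-coding a list.
top_codes = share.head(3).index
subset = codes[codes.isin(top_codes)]

print(share)
print(len(subset))
```

With the real data, `subset` would play the role of `lang['language_code']` in the plot above.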

### Which authors have the most books?

In [None]:
most_books = df['authors'].value_counts()[:15]
plt.figure(figsize = (14,6))
sns.barplot(x = most_books, y = most_books.index, palette = 'Blues_d')
plt.title('Authors with highest number of books')
plt.xlabel('Number of books', fontsize = 12)
plt.show()

- P.G. Wodehouse and Stephen King have the most books in the dataset, with 40 each.
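
`value_counts()` on the `authors` column treats each distinct string as one author. If a row lists several co-authors in one string, those books are not credited to the individual authors. The sketch below shows one way to count per individual author. It assumes co-authors are separated by `/`, which is an assumption about the file format and should be checked against the data; the frame `toy` is made up.

```python
import pandas as pd

# Made-up rows; the "/" separator between co-authors is an assumption.
toy = pd.DataFrame({
    "title": ["T1", "T2", "T3"],
    "authors": ["Ann Lee/Bo Chen", "Ann Lee", "Bo Chen/Cy Diaz/Ann Lee"],
})

# Split each string into a list, expand to one row per individual author,
# trim whitespace, then count how many books each author appears on.
per_author = (
    toy["authors"].str.split("/").explode().str.strip().value_counts()
)

print(per_author)
```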

### Books with highest number of ratings:

In [None]:
most_ratings = df[['ratings_count']].set_index(df['title']).sort_values(by = 'ratings_count', ascending = False)[:15]
plt.figure(figsize = (12,6))
sns.barplot(x = most_ratings['ratings_count'], y = most_ratings.index, palette = 'rocket')
plt.yticks(fontsize = 10)
plt.xlabel('Number of ratings', fontsize = 12)
plt.title('The books with highest number of ratings', fontsize = 16)
plt.show()

- Twilight has received the most ratings, almost double the count of any other book.
- Harry Potter books make up many of the most-rated titles. Their rating counts are similar to one another, which suggests that most readers rate each HP book as they finish it.
- Fantasy fiction is the most common genre in this chart.
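
The top-15 selection above sorts the whole column and slices it. A shorter equivalent, sketched here on a made-up frame, is `DataFrame.nlargest`, which returns the rows with the largest values in a column.

```python
import pandas as pd

# Made-up titles and counts, for illustration only.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "ratings_count": [120, 4500, 980, 15],
})

# Two rows with the highest ratings_count, indexed by title for plotting.
top2 = toy.nlargest(2, "ratings_count").set_index("title")["ratings_count"]

print(top2)
```

The resulting series can be passed to `sns.barplot` in the same way as `most_ratings`.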

### Books with highest number of reviews:

In [None]:
most_reviews = df[['text_reviews_count']].set_index(df['title']).sort_values(by = 'text_reviews_count', ascending = False)[:15]
plt.figure(figsize = (12,6))
sns.barplot(x = most_reviews['text_reviews_count'], y = most_reviews.index, palette = 'Greens_d')
plt.yticks(fontsize = 10)
plt.xlabel('Review Counts', fontsize = 12)
plt.title('The books with highest number of reviews', fontsize = 16)
plt.show()

- From the chart we can see that Twilight has received the most reviews. It was also the most-rated book in the previous chart.
- The Harry Potter and Percy Jackson series also appear here.

### Publisher with highest number of books:

In [None]:
publisher = df['publisher'].value_counts().head(15)
plt.figure(figsize = (14,6))
graph = sns.barplot(y = publisher, x = publisher.index)
plt.title('Publisher with highest number of books', fontsize = 14)
plt.xlabel('Publisher', fontsize = 12)
plt.ylabel('Count')
for p in graph.patches:
    height = p.get_height()
    graph.text(p.get_x()+p.get_width()/4., height + 0.9, height)
plt.xticks(rotation = 45, fontsize = 12)
plt.yticks(fontsize = 12)
plt.show()

- Vintage has the most books to its name, with over 300.

### Average Ratings of books:

In [None]:
plt.figure(figsize = (14,6))
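# distplot is deprecated in recent seaborn; histplot with a KDE overlay gives a similar picture
sns.histplot(df['average_rating'], bins = 30, kde = True, stat = 'density')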
plt.title('Average ratings of books', fontsize = 16)
plt.xticks(fontsize = 14)
plt.xlabel('Average Ratings', fontsize = 12)
plt.show()

- The ratings are given out of 5 stars.
- We can see that most books are rated between 3.5 and 4.5.
- Very few books are rated 5 stars, and some books are rated 0 stars as well.
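
The observation that most books fall between 3.5 and 4.5 can be turned into a number. A minimal sketch, using a made-up series `ratings` rather than the real column:

```python
import pandas as pd

# Made-up average ratings, for illustration only.
ratings = pd.Series([0.0, 3.2, 3.6, 3.9, 4.0, 4.1, 4.4, 5.0])

# True where the rating lies in [3.5, 4.5] (both ends included).
in_band = ratings.between(3.5, 4.5)

# The mean of a boolean series is the fraction of True values.
share = in_band.mean()

print(share)
```

Applied to `df['average_rating']`, this would give the fraction of books inside that band.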

### What is the average number of pages in a book?

In [None]:
# Remove spaces from the column names so that 'num_pages' can be accessed without padding
df.columns = df.columns.str.replace(' ', '')

In [None]:
plt.figure(figsize = (14,8))
sns.histplot(df['num_pages'], bins = 50, kde = True, stat = 'density')  # replaces the deprecated distplot
plt.title('Average number of pages', fontsize = 14)
plt.xlabel('Number of Pages', fontsize = 12)
plt.show()

- Some books have more than 6000 pages.
- A small number of books have between 2000 and 6000 pages.
- We will look more closely at the range from 0 to 2000 pages, because that is where most books lie.

In [None]:
a = df.loc[(df.num_pages < 2000)]
plt.figure(figsize = (14,8))
sns.histplot(a['num_pages'], bins = 60, kde = True, stat = 'density')  # replaces the deprecated distplot
plt.title('Average number of pages', fontsize = 14)
plt.xlabel('Number of pages', fontsize = 12)
plt.show()

- From the plot, we can see that a typical book has between roughly 250 and 400 pages.
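
Because a few very long books stretch the distribution, the mean page count can be misleading. The sketch below compares the mean with the median and quartiles on a made-up series `pages`; the numbers are illustrative only.

```python
import pandas as pd

# Made-up page counts with one extreme value, for illustration only.
pages = pd.Series([120, 250, 300, 320, 400, 6500])

mean = pages.mean()      # pulled upward by the 6500-page outlier
median = pages.median()  # middle of the sorted values, robust to the outlier

# First and third quartiles bound the middle half of the books.
q1, q3 = pages.quantile([0.25, 0.75])

print(mean, median, q1, q3)
```

On skewed data like this, the median and the quartile range describe the typical book better than the mean does.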

### Checking relationship between Ratings and Review counts:

In [None]:
plt.figure(figsize=(14,6))
df.dropna(axis = 0, inplace = True)  # drop rows with missing values; axis must be a keyword in recent pandas
sns.scatterplot(x = 'average_rating', y = 'text_reviews_count', data = df, color = 'red')
plt.title('Ratings vs Review counts', fontsize = 14)
plt.xlabel('Average ratings', fontsize = 12)
plt.ylabel('Reviews count', fontsize = 12)
plt.show()

- We can see that most ratings lie between 3.5 and 4.5 stars.
- We will plot the graph again with only the books that have fewer than 4000 reviews, because that is where the points are densest.

In [None]:
a = df.loc[(df.text_reviews_count < 4000)]
# jointplot creates its own figure, so its size is set with `height` rather than plt.figure
sns.jointplot(x = 'average_rating', y='text_reviews_count', data = a, color = 'red', height = 10)
plt.show()

- From the plot, we can see that most ratings fall between 3.5 and 4.5.
- We can't infer much more from the graph because of outliers.
- We can say that people rate books far more often than they review them.
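
The last point can be checked directly by comparing the two count columns. A minimal sketch on a made-up frame `toy` (the column names match the dataset, the values do not):

```python
import pandas as pd

# Made-up counts, for illustration only.
toy = pd.DataFrame({
    "ratings_count": [1000, 200, 0, 50],
    "text_reviews_count": [40, 10, 0, 5],
})

# Skip books with no ratings to avoid dividing by zero.
rated = toy[toy["ratings_count"] > 0]

# Reviews per rating for each book, and for the collection as a whole.
ratio = rated["text_reviews_count"] / rated["ratings_count"]
overall = rated["text_reviews_count"].sum() / rated["ratings_count"].sum()

print(ratio)
print(overall)
```

An overall ratio well below 1 would support the claim that ratings are much more common than written reviews.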

### Checking relationship between Ratings and Number of pages:

In [None]:
plt.figure(figsize=(16,10))
sns.scatterplot(x = 'average_rating', y = 'num_pages', data = df, color = 'g')
plt.title('Ratings vs Number of Pages', fontsize = 14)
plt.xticks(fontsize = 12)
plt.ylabel('Number of pages', fontsize = 12)
plt.xlabel('Average Ratings', fontsize = 12)
plt.show()

- We will look at the graph again for books with fewer than 1000 pages, as that is where the points are densest.

In [None]:
a = df.loc[(df.num_pages < 1000)]
# jointplot creates its own figure, so its size is set with `height` rather than plt.figure
sns.jointplot(x = 'average_rating', y = 'num_pages', data = a, color = 'darkgreen', height = 10)
plt.show()

- From the plot, we can see that books with 200 to 400 pages tend to receive the highest ratings.
- This suggests that readers prefer books with a moderate number of pages.
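
A scatter plot makes this kind of claim hard to verify, because the densest region also has the most books. One way to check it, sketched here on made-up data, is to bin the page counts with `pd.cut` and compare the mean rating per bin. The bin edges and labels are arbitrary choices for the example.

```python
import pandas as pd

# Made-up page counts and ratings, for illustration only.
toy = pd.DataFrame({
    "num_pages": [90, 150, 250, 380, 420, 800],
    "average_rating": [3.8, 4.0, 3.9, 4.1, 4.2, 4.0],
})

# Assign each book to a page-count bin (right edge included).
bins = [0, 200, 400, 1000]
labels = ["0-200", "200-400", "400-1000"]
toy["page_bin"] = pd.cut(toy["num_pages"], bins=bins, labels=labels)

# Mean rating and number of books in each bin.
summary = (
    toy.groupby("page_bin", observed=True)["average_rating"]
    .agg(["mean", "count"])
)

print(summary)
```

The `count` column matters as much as the `mean`: a bin with only a handful of books says little about reader preference.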

**The dataset is missing a 'Genre' column, which would have been useful for finding the top-rated books by genre, because every person has their own favourite genre.**

If you like the kernel, please upvote it.