# Objectives

In this notebook I'd like to find out:
1. Which book is the most popular?
2. Which author is the most popular?
3. Which author wrote the biggest number of books?
4. Is the number of pages correlated with the rating or the number of reviews?
5. Is there a tendency toward fewer pages in recent books?
6. Which words are more likely to be used in description?

P.S. I'd be very grateful for a review and feedback.

## Data Upload

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')

First of all let's create a dataset with all the books available.

In [None]:
# Creating empty dataframe with all the needed columns
books = pd.DataFrame(columns = pd.read_csv('../input/goodreads-book-datasets-10m/book1000k-1100k.csv', index_col = 'Id').columns)
books

In [None]:
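import os

# Collect the frames in a list and concatenate once instead of growing `books` inside the loop
frames = [books]
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        # Only files without an underscore in the name are loaded
        if "_" not in filename:
            path = os.path.join(dirname, filename)
            frames.append(pd.read_csv(path, index_col = 'Id'))
            print(path, 'OK')
books = pd.concat(frames)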

# Data preparation

Now let's check what we have.

In [None]:
books.sort_index(inplace = True)
books.head(3)

In [None]:
books.info()

A lot of numerical data was interpreted as strings, which I'm going to fix.

The rating columns (RatingDist5, RatingDist4, RatingDist3, RatingDist2, RatingDist1, RatingDistTotal) start with a redundant prefix like '5:', '4:', etc. This prefix needs to be checked and removed if it carries no extra information.

PublishMonth = 16 looks strange. Maybe the data was wrongly associated?

There are also a lot of missing values in the Language, Description and Count of text reviews columns.

Let's go further.

### Name

In [None]:
books['Name'].nunique()

Not all the names are unique. Probably some books were published under the same name by different publishers.

In [None]:
books[books['Name'].duplicated(keep = False)].sort_values('Name')

True. A lot of books were published multiple times. However, let's also check for rows that are 100% duplicates.

In [None]:
books[books.duplicated(keep = False)]

The first output showed about 130k rows with duplicated names, while here there are only 224 fully duplicated rows (including the original ones). At this point I'm going to drop the fully duplicated rows and keep only the unique ones. I'm still curious about the duplicated names from the first subset: are some names especially popular, or were the same books just published several times?
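The difference between "same name" and "fully identical row" can be sketched on a tiny made-up frame. The values below are purely illustrative and are not taken from the dataset:

```python
import pandas as pd

# Toy data (hypothetical): three books share a name, two of those rows are identical
toy = pd.DataFrame({
    'Name': ['Poems', 'Poems', 'Poems', 'Emma'],
    'Authors': ['A', 'B', 'B', 'C'],
    'Publisher': ['X', 'Y', 'Y', 'Z'],
})

# Rows whose Name occurs more than once (keep=False marks every occurrence)
same_name = int(toy['Name'].duplicated(keep=False).sum())

# Rows that are identical in every column
full_duplicates = int(toy.duplicated(keep=False).sum())

# drop_duplicates keeps the first occurrence of each fully identical row
deduplicated = toy.drop_duplicates()
print(same_name, full_duplicates, len(deduplicated))
```

So a name can repeat many times while only a few rows are true duplicates, which matches the gap between the two outputs above.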

In [None]:
# Dropping duplicates
books.drop_duplicates(inplace = True)

In [None]:
# Count names after removing repeated (Name, Authors) pairs, i.e. re-editions of the same book
name_counts = books[~books[['Name', 'Authors']].duplicated()]['Name'].value_counts().head(20)

plt.figure(figsize = (12,6))
# x and y are passed as keywords because recent seaborn versions no longer accept them positionally
popular_names = sns.barplot(x = name_counts.index, y = name_counts.values)
popular_names.set_xticklabels(popular_names.get_xticklabels(), rotation=90)
popular_names.set_xlabel('Book name')
popular_names.set_ylabel('Number of books')

Isn't it interesting to see Poems, Cinderella, Dinosaurs and Microeconomics in one list of popular names? :)

In [None]:
books[~books[['Name', 'Authors']].duplicated()]['Name'].value_counts().head(20)

So there are a lot of common names that were used for several unrelated books.

### Authors

In [None]:
# Checking the number of unique authors in the dataset
books['Authors'].nunique()

There are 3 times fewer authors than books. Can we say that on average 1 author writes (or wrote) 3 books? Let's find the most prolific authors.
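A side note on that "average": dividing books by authors gives a mean, and a few very prolific authors can pull it up. Here is a minimal sketch on invented data (names and counts are assumptions, not values from the dataset) comparing the mean with the median:

```python
import pandas as pd

# Toy data (hypothetical): one prolific author and three one-book authors
toy = pd.DataFrame({
    'Name': ['A1', 'A2', 'A3', 'A4', 'A5', 'B1', 'C1', 'D1'],
    'Authors': ['Ann', 'Ann', 'Ann', 'Ann', 'Ann', 'Bob', 'Cid', 'Dee'],
})

books_per_author = toy.groupby('Authors')['Name'].count()

mean_books = books_per_author.mean()      # total books / unique authors
median_books = books_per_author.median()  # what a "typical" author wrote
print(mean_books, median_books)
```

In this toy case the mean is 2 books per author although most authors wrote only one, so the median may be the more honest summary.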

In [None]:
books.groupby('Authors')['Name'].count().sort_values(ascending = False).head(20)

I can see some familiar names here. Interesting.

### ISBN

In [None]:
# Let's take a look at the missing values; maybe those books are problematic and can be removed
books[books['ISBN'].isnull()]

The books seem to be fine and I don't want to remove them. Since I also can't replace the missing values with anything, let's leave them as they are for now.

### Rating

In [None]:
books['Rating'].describe()

In [None]:
# distplot is deprecated in recent seaborn releases; histplot draws the same histogram without a KDE
sns.histplot(books['Rating'], bins = 15)

So mostly there is either no rating or a quite good one, with an average of ~4. If we omit the 0 ratings, the distribution is negatively skewed, which is quite typical for ratings of services.

### Time Data: PublishYear, PublishMonth, PublishDay    

First, let's convert the dates to numerical data.

In [None]:
books['PublishYear'] = books['PublishYear'].astype('int')
books['PublishMonth'] = books['PublishMonth'].astype('int')
books['PublishDay'] = books['PublishDay'].astype('int')

In [None]:
# Looking for descriptive statistics
books[['PublishYear', 'PublishMonth', 'PublishDay']].describe()

1. The minimal and maximal years look quite strange. I need to investigate that.
2. The maximal day is 12, but the maximal month is 31, so the two columns were obviously mislabeled. This needs to be fixed.

In [None]:
# Swapping day and month
books['PublishMonth'], books['PublishDay'] = books['PublishDay'], books['PublishMonth']
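One way to sanity-check a swap like this is to test the valid ranges before and after. The sketch below uses an invented three-row frame and a hypothetical helper `ranges_ok`; it only checks ranges and does not validate real calendar dates (30 February would still pass):

```python
import pandas as pd

# Toy data (hypothetical) with the columns mislabeled the same way as described above
toy = pd.DataFrame({'PublishMonth': [31, 16, 5], 'PublishDay': [12, 3, 7]})

def ranges_ok(frame):
    """Return True if every month is in 1..12 and every day is in 1..31."""
    months_ok = frame['PublishMonth'].between(1, 12).all()
    days_ok = frame['PublishDay'].between(1, 31).all()
    return bool(months_ok and days_ok)

before = ranges_ok(toy)

# rename maps both labels at once, so this swaps the two columns
swapped = toy.rename(columns={'PublishMonth': 'PublishDay', 'PublishDay': 'PublishMonth'})
after = ranges_ok(swapped)
print(before, after)
```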

In [None]:
# Looking into years
books['PublishYear'].unique()

Years 162, 200, 299, 208, 20099, 19769, 2100, 3002, 4989, 20040, 20067 and 1384 look like errors. Also 2021 and 2030 can be wrong. Let's find out how many books were published in those years.

*Note:* After I started this investigation, several more atypical years were added (20099, 65535, 1376). Unfortunately it's not possible to re-check every book with a strange year all the time, so I decided to change part of the code and remove the suspicious rows from the analysis. I'll keep only books published between 1800 and 2020.

In [None]:
books[(books['PublishYear'] < 1400) | (books['PublishYear'] > 2020)]['Name'].count()

In [None]:
# In details
books[(books['PublishYear'] < 1400) | (books['PublishYear'] > 2020)]

I started to investigate some of these books:

1. *The Secret of the Old Mill* by Franklin W. Dixon was published in 1927. [Amazon](https://www.amazon.com/Secret-Mill-Hardy-Boys-Book/dp/0448089033)
2. *Disney Princess: Look and Find* by John Kurtz Studios was published in 2003. [Amazon](https://www.amazon.com/Disney-Princess-John-Kurtz-Studios/dp/0785379185)
3. *The Virtuous Knight* by Margo Maguire was published in 2003. [Amazon](https://www.amazon.com/Virtuous-Knight-Margo-Maguire/dp/0373292813)
4. *El futuro del espaciotiempo* by Stephen W. Hawking was published in 2001. [Amazon](https://www.amazon.com/El-Futuro-del-Espaciotiempo-Spanish/dp/8484323994)
5. *Discover Your Passion: An Intuitive Search to Find Your Purpose in Life* by Gail A. Cassidy was published in 2000. [Amazon](https://www.amazon.com/Discover-Your-Passion-Intuitive-Purpose/dp/0967743702)
6. *Gala* by Dominique Bona was published in 1993. [Amazon](https://www.amazon.in/Gala-la-muse-redoutable/dp/208066817X)
7. *Agatha Raisin and the Witch of Wyckhadden* by M. C. Beaton was published in 1999. [Amazon](https://www.amazon.sg/Agatha-Raisin-Witch-Wyckhadden-Beaton/dp/0312204949)

But as this investigation is time-consuming, I'm removing all rows with a year before 1800 or after 2020 from the dataset.

In [None]:
# Removing books with errors in years
books.drop((books[(books['PublishYear'] < 1800) | (books['PublishYear'] > 2020)].index).tolist(), inplace = True)
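Before dropping rows like this, it can be useful to see exactly which years fall outside the range. A minimal sketch on invented years, assuming the same 1800-2020 window as above:

```python
import pandas as pd

# Toy data (hypothetical): a mix of plausible and implausible years
toy = pd.DataFrame({'PublishYear': [162, 1927, 2003, 20099, 65535, 2020, 1800]})

# between() is inclusive on both ends, so 1800 and 2020 are kept
plausible = toy['PublishYear'].between(1800, 2020)

kept = toy[plausible]
dropped_years = toy.loc[~plausible, 'PublishYear'].tolist()
print(len(kept), dropped_years)
```

The mask keeps exactly the rows that the `< 1800` / `> 2020` condition would not drop.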

### Publisher

In [None]:
# Checking missing values in Publisher column
books[books['Publisher'].isnull()].head(10)

That is a lot of data, including books with good ratings. I cannot remove it.

In [None]:
# How many unique publishers are there?
books['Publisher'].nunique()

In [None]:
# Which publisher issued the biggest variety of books?
books['Publisher'].value_counts().head(10)

Routledge is a British publisher that specialises in academic books, journals and online resources in the fields of humanities, behavioural science, education, law and social science. It was founded in 1836, so it is no wonder that in almost 200 years it has published so many works!

It is interesting, though, that these works are available on Goodreads!

### Numbers of different rating points: RatingDistTotal, RatingDist1, RatingDist2, RatingDist3, RatingDist4, RatingDist5

In [None]:
books.head(3)

As mentioned above, I'm going to get rid of the redundant prefix like '5:', '4:', etc.

In [None]:
rating_columns = ['RatingDistTotal', 'RatingDist1', 'RatingDist2', 'RatingDist3', 'RatingDist4', 'RatingDist5']
for column in rating_columns:
    # Keep only the count after the colon, e.g. '5:120' -> 120
    books[column] = books[column].apply(lambda rating: rating.split(':')[1]).astype('int')
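Since I said earlier that the prefix should be checked before it is thrown away, here is a minimal sketch of such a check on invented values. The `'total'` prefix for RatingDistTotal is an assumption for illustration; the real prefix should be read off `books.head()`:

```python
import pandas as pd

# Toy data (hypothetical values; the 'total:' prefix is an assumption)
toy = pd.DataFrame({
    'RatingDist5': ['5:120', '5:0', '5:7'],
    'RatingDistTotal': ['total:200', 'total:1', 'total:15'],
})

expected_prefix = {'RatingDist5': '5', 'RatingDistTotal': 'total'}

for column, prefix in expected_prefix.items():
    # Split once on the first colon into (prefix, count)
    parts = toy[column].str.split(':', n=1, expand=True)
    # Fail loudly if any row carries a prefix that does not match its column
    assert (parts[0] == prefix).all(), f'unexpected prefix in {column}'
    toy[column] = parts[1].astype('int')

print(toy.dtypes)
```

If the assertion passes, the prefix only repeats the column name and can be dropped safely.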

Now we can finally get more information about ratings.

In [None]:
books[['RatingDistTotal', 'RatingDist1', 'RatingDist2', 'RatingDist3', 'RatingDist4', 'RatingDist5']].describe()

Here we can see the same picture as before: high ratings are more likely to be given.

### Counts of review and Count of text reviews

In [None]:
#Changing data type
books['CountsOfReview'] = books['CountsOfReview'].astype('int')

In [None]:
books['CountsOfReview'].describe()

In [None]:
books['CountsOfReview'].value_counts()

In [None]:
# And let's check Count of text reviews right away
books['Count of text reviews'] = books['Count of text reviews'].astype('float')
books['Count of text reviews'].describe()

And here again we see that if a book is reviewed, then it is reviewed a lot, as there is a big difference between the 75th percentile and the maximum. On the other hand, there are a lot of books with no reviews or only a few.

### Language

In [None]:
books['Language'].unique()

- We can see that mostly a 3-letter format is used for languages; however, 'en-US', 'en-GB' and 'en-CA' have a special format. As it doesn't matter in this analysis whether the English is American, British or Canadian, let's just replace them with 'eng'.

- I'll also replace 'nl' with 'nld' (Dutch).

- And I'll investigate the '--' values.

In [None]:
# Replace whole values only: .str.replace('nl', 'nld') works on substrings and would also alter any value that merely contains 'nl'
books['Language'] = books['Language'].replace({'en-US': 'eng', 'en-GB': 'eng', 'en-CA': 'eng', 'nl': 'nld'})
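A small caveat about recoding values: `Series.str.replace` works on substrings, while `Series.replace` with a dict matches whole values. The sketch below uses an invented list of codes (whether 'nld' already occurs in the real column is an assumption) to show the difference:

```python
import pandas as pd

# Toy data (hypothetical): 'nld' is already present next to 'nl'
langs = pd.Series(['nl', 'nld', 'en-US', 'eng'])

# Substring replacement also touches 'nld' and turns it into 'nldd'
substring_based = langs.str.replace('nl', 'nld', regex=False)

# Whole-value replacement only changes entries that are exactly 'nl' or 'en-US'
value_based = langs.replace({'nl': 'nld', 'en-US': 'eng'})

print(substring_based.tolist())
print(value_based.tolist())
```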

In [None]:
books[books['Language'] == '--']

The books *The Dinosaur Heresies* and *Did You Say Twins?!* are written in English (I'll fix that); however, *Inkosana Encini* is in a rare African language, so I'll just remove the language value.

In [None]:
books.loc[[211273, 806815], 'Language'] = 'eng'
books.loc[[229808], 'Language'] = np.nan

And now let's take a look at the distribution of languages.

In [None]:
books['Language'].value_counts()

In [None]:
plt.figure(figsize = (12,6))
langs = sns.barplot(x = books['Language'].value_counts().head(5), y = books['Language'].value_counts().head(5).index)
langs.set_xlabel('Number of books')
langs.set_ylabel('Language')

### Pages Number

In [None]:
#Changing data type
books['pagesNumber'] = books['pagesNumber'].astype('int')

In [None]:
books['pagesNumber'].describe()

As I expected, the average number of pages is between 200 and 300 (mean 280, median 240). However, it is strange to see books with millions of pages. Let's find them.

In [None]:
books[books['pagesNumber'] > 100000]

The first book is listed only on [Amazon](https://www.amazon.com/gp/product/1422004805/ref=x_gr_w_bb_sout?ie=UTF8&tag=x_gr_w_bb_sout-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=1422004805&SubscriptionId=1MGPYB6YW3HWK55XCGG2), with a single copy available. I have serious doubts that it is a real book.

There is probably an error with the second book, because on [Amazon](https://www.amazon.com/Sholokovs-Tikhii-Don-B-Murphy/dp/0704417707) the book *Sholokov's "Tikhii Don": A Commentary* with ISBN 0704417707, written by A.B. Murphy, has only 510 pages.

However, the book *425 Heartwarmin' Expressions For Crafting, Painting, Stitching & Scrapbooking. Book # 1 Spiral-bound* with ISBN 0969941048, written by Shelly Ehbrecht, really is listed with 4517845 pages ([Amazon](https://www.amazon.com/Heartwarmin-Expressions-Crafting-Stitching-Scrapbooking/dp/0969941048)), which is very strange.

To avoid such big outliers, I decided to remove the books with more than 100,000 pages from the dataset.

In [None]:
books.drop((books[books['pagesNumber'] > 100000].index).tolist(), inplace = True)

### Description

In [None]:
books['Description']

There are a lot of missing values in this column, but in general this is just text, which I'm going to look at soon.

# EDA

In my analysis I'll answer the questions that I raised at the beginning:

1. Which book is the most popular?
2. Which author is the most popular?
3. Which author wrote the biggest number of books?
4. Is the number of pages correlated with the rating or the number of reviews?
5. Is there a tendency toward fewer pages in recent books?
6. Which words are more likely to be used in description?

In [None]:
# Checking information again
books.info()

### 1. Which book is the most popular?

In [None]:
# Let's check the book with the biggest number of ratings (total)
books[books['RatingDistTotal'] == books['RatingDistTotal'].max()]

In [None]:
# And let's check the book with the biggest number of 5-star ratings
books[books['RatingDist5'] == books['RatingDist5'].max()]

A book by J.K. Rowling in Japanese! Maybe that's Harry Potter? Amazing.

However, its overall Rating is not 5. Let's find the books with a 5.0 Rating.

In [None]:
books[books['Rating'] == 5]

Hmmm... all these books have just a few ratings. Let's restrict the search. Maybe we should check only books with more than 1000 ratings.

In [None]:
books[(books['Rating'] == 5) & (books['RatingDistTotal'] > 1000)]

No matches... Let's lower the rating threshold.

In [None]:
books[(books['Rating'] > 4.5) & (books['RatingDistTotal'] > 1000)].sort_values('Rating', ascending = False).head(3)

Here we can see that the best-rated book with more than 1000 ratings is *The Complete Calvin and Hobbes* by Bill Watterson. The next two are different editions of *Harry Potter* sets.

### 2. Which author is the most popular?

Unfortunately we don't have any statistics about how many people read the book, so again we will rely on ratings.

In [None]:
# Let's check the authors with the biggest number of ratings (total number for all books)
books.groupby('Authors')['RatingDistTotal'].sum().sort_values(ascending = False).head(5)

Rowling is definitely the most rated author. Let's just confirm that the picture stays similar if we look at 5-star ratings only:

In [None]:
books.groupby('Authors')['RatingDist5'].sum().sort_values(ascending = False).head(5)

### 3. Which author wrote the biggest number of books?

This information was already mentioned above, but let's repeat. 

In [None]:
books.groupby('Authors')['Name'].count().sort_values(ascending = False).head(10)

If we ignore 'Anonymous', then 'William Shakespeare' was the most productive author!

### 4. Is number of pages correlated with rating or number of reviews?

In [None]:
books[['RatingDistTotal', 'RatingDist1', 'RatingDist2', 'RatingDist3', 'RatingDist4', 'RatingDist5', 'CountsOfReview', 'pagesNumber']].corr()

It seems that the number of reviews has no noticeable linear correlation with the number of pages (`corr()` computes Pearson correlation by default), and that's good news for authors.
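One caveat: Pearson correlation only captures linear relationships. As a minimal sketch on invented numbers (not dataset values), a rank-based Spearman correlation can be compared with it through the same `DataFrame.corr` call:

```python
import pandas as pd

# Toy data (hypothetical): reviews grow with pages, but not linearly
toy = pd.DataFrame({
    'pagesNumber': [50, 120, 200, 320, 900],
    'CountsOfReview': [1, 2, 4, 8, 16],
})

# Default method is Pearson (linear relationship)
pearson = toy.corr().loc['pagesNumber', 'CountsOfReview']

# Spearman works on ranks, so any monotonic relationship scores 1
spearman = toy.corr(method='spearman').loc['pagesNumber', 'CountsOfReview']

print(round(pearson, 3), round(spearman, 3))
```

If both coefficients stay close to zero on the real data, the "no dependence" reading is on firmer ground.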

### 5. In which year was the biggest number of books published?

In [None]:
plt.figure(figsize = (12,6))
books_years = sns.barplot(y = books.groupby(['PublishYear'])['Name'].count().tail(60), x = books.groupby(['PublishYear'])['Name'].count().tail(60).index)
books_years.set_xticklabels(books_years.get_xticklabels(), rotation=90)
books_years.set_xlabel('Publish Year')
books_years.set_ylabel('Number of books')

It's very interesting that there has been such a big decrease since 2008! Maybe the data is incomplete for that period?

### 6. Is there a tendency toward fewer pages in recent books?

In [None]:
books.groupby(['PublishYear'])['pagesNumber'].mean().tail(10)

In [None]:
plt.figure(figsize = (12,8))
sns.lineplot(x = 'PublishYear', y = 'pagesNumber', data = books)

Before 1900 the number of pages looks randomly distributed; then, in 1900-1915, books were mostly around 100-200 pages. After World War I and until the middle of the century there is a peak in the distribution. During these years we got a lot of works by the Lost Generation (Ernest Hemingway, F. Scott Fitzgerald, Erich Maria Remarque, John O'Hara, etc.), who wrote about wars, broken dreams and broken lives. I suppose their books were full of experiences, thoughts and frustration, which made them longer.

After 1950 we see a decaying curve, and nowadays the number of pages is more or less stable at around 250-300. It is hard for me to explain the fall to 200 pages around 2010-2015. Again, maybe there is a lack of data. Or maybe it is somehow related to the active transition to electronic devices combined with the slow growth of e-book supply (at least in my country). Every day people have less and less time for reading, so authors dedicate less of themselves to writing. That is a terrifying situation, but a lot of organisations have already noticed it, so in the last few years I have seen more initiatives that encourage young people to read, more apps that make reading easier, and more e-books becoming available.

Anyway, I'm not an expert in literature, so I can only make assumptions.
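To test the "lack of data" guess rather than only assume it, the yearly page statistics can be put next to the number of books per year. A minimal sketch on invented rows (the years and page counts are assumptions for illustration):

```python
import pandas as pd

# Toy data (hypothetical): a well-covered year with one outlier and a sparse year
toy = pd.DataFrame({
    'PublishYear': [2008, 2008, 2008, 2008, 2012, 2012],
    'pagesNumber': [250, 300, 280, 2000, 180, 220],
})

# count shows how much data backs each year; median is robust to single outliers
per_year = toy.groupby('PublishYear')['pagesNumber'].agg(['count', 'mean', 'median'])
print(per_year)
```

A year with a low count or a mean far from its median is a hint that the line plot is driven by sparse data or outliers rather than by a real trend.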

### 7. Book names and descriptions analysis

In [None]:
from wordcloud import WordCloud, STOPWORDS

In [None]:
# Setting stopwords for names
stopwords_names = set(STOPWORDS)
stopwords_names.update(['book', 'story'])

# Creating words list for names
words_from_names = [word for rows in books['Name'].str.lower().str.split() for word in rows if word not in stopwords_names]
names = " ".join(name for name in words_from_names)

In [None]:
# Creating a cloud with words from names:
plt.figure(figsize = (10,6))
wordcloud = WordCloud(max_words=30, background_color="white", colormap = 'copper').generate(names)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
# Creating words list from descriptions: drop HTML tags first, then stray markup characters and punctuation.
# regex is passed explicitly because its default in str.replace differs between pandas versions.
descriptions = books['Description'].dropna().str.replace(r'<[^>]*>', ' ', regex = True).str.replace(r'--|[/\\<>.,]', '', regex = True)
words_from_description = [word for rows in descriptions.str.lower().str.split() for word in rows if word not in STOPWORDS]
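The same clean-and-count idea can be sketched with the standard library only. The two descriptions, the tiny stopword set and the helper `tokenize` below are all made up for illustration; the notebook itself relies on wordcloud's `STOPWORDS`:

```python
import re
from collections import Counter

# Toy data (hypothetical) with typical HTML leftovers
descriptions = [
    '<p>A story of <i>love</i> and war.</p>',
    'War, love -- and a dog<br />',
]
stopwords = {'a', 'of', 'and'}  # tiny stand-in for wordcloud's STOPWORDS

def tokenize(text):
    """Drop HTML tags, lowercase, and return runs of letters/apostrophes."""
    text = re.sub(r'<[^>]*>', ' ', text)
    return re.findall(r"[a-z']+", text.lower())

counts = Counter(
    word
    for text in descriptions
    for word in tokenize(text)
    if word not in stopwords
)
print(counts.most_common(2))
```

A `Counter` like this could be passed to `generate_from_frequencies` as a plain dict of word frequencies.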

In [None]:
# Creating a cloud from the 100 most frequent description words (the cloud shows at most 60 of them):
plt.figure(figsize = (10,6))
wordcloud = WordCloud(max_words=60, background_color="white", colormap = 'copper').generate_from_frequencies(frequencies = pd.Series(words_from_description).value_counts().head(100))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

### End

Thank you for reviewing. Hope you found interesting insights there!