# Bestsellers Amazon

## Index
- [1. Import libraries and download data](#section1)
- [2. EDA](#section2)
- [3. Conclusion](#section3)

## 1. Import libraries and download data <a id='section1'></a>

In [None]:
import numpy as np 
import pandas as pd 
import re

import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
path = '/kaggle/input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv'
bestsellers = pd.read_csv(path, sep=",")

## 2. EDA <a id='section2'></a>

We are going to study the dataset on Amazon's Top 50 bestselling books from 2009 to 2019.

### 2.1. Structure

First, we look at the shape of the data and at the types of the features that make up the dataset.

In [None]:
print('- Shape:', bestsellers.shape)
bestsellers.head()

In [None]:
bestsellers.info()

The dataset contains 7 features and 50 books per year across 11 years, for a total of 550 rows. There are no null values.

The features have the following types:

    - String: Name, Author and Genre.
    - Float: User Rating.
    - Integer: Reviews, Price, Year.
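
The structural claims above can also be checked programmatically instead of being read off `info()`. The sketch below is only illustrative: `sample` is a small hypothetical stand-in for `bestsellers` with made-up rows, so the same checks would need to be run on the real dataframe.

```python
import pandas as pd

# Hypothetical miniature stand-in for the bestsellers table (not the real data).
sample = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book A", "Book C"],
    "Author": ["Ann", "Bob", "Ann", "Cy"],
    "User Rating": [4.7, 4.5, 4.7, 4.8],
    "Reviews": [1200, 340, 1200, 9800],
    "Price": [8, 15, 8, 11],
    "Year": [2009, 2009, 2010, 2010],
    "Genre": ["Fiction", "Non Fiction", "Fiction", "Non Fiction"],
})

# Number of missing values per column; all zeros means no nulls.
null_counts = sample.isnull().sum()

# Number of rows per year; a single distinct value means a constant count.
books_per_year = sample.groupby("Year").size()

print("No nulls:", null_counts.sum() == 0)
print("Same number of books every year:", books_per_year.nunique() == 1)
```

On the real data, `books_per_year` would be expected to hold 50 for each of the 11 years.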

### 2.2. Single feature analysis

In this section, we analyse the seven features separately.

- Author

We want to see whether any author has more than one book in the bestsellers list. To that end, we count the bestsellers per author over the period 2009 - 2019 and plot the frequency and the distribution of those counts. The plots suggest that about half of the authors have more than one entry in the bestseller list.

In [None]:
fig, axes = plt.subplots(nrows=2,ncols=1, figsize=(12,8))
plt.subplots_adjust(hspace = .50)
# Books per author, then how many authors share each count (works on old and new pandas)
df_authors = bestsellers['Author'].value_counts().value_counts().sort_index().to_frame()
df_authors.plot.bar(ax=axes[0], xlabel='# books written by the same author', ylabel='# Authors', legend=False)
axes[0].set_title('Frequency of number of bestselling books written by author')
df_authors_2 = pd.DataFrame(bestsellers.groupby('Author')['Name'].count())
df_authors_2 = df_authors_2.reset_index()
# histplot replaces the deprecated distplot (seaborn >= 0.11)
sns.histplot(df_authors_2['Name'], ax=axes[1], bins=25, kde=True, stat='density')
axes[1].set_title('Distribution of the number of books written by the same author')
axes[1].set(xlabel='# books written by the same author')
plt.show()
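
The chained `value_counts()` used for the first plot can be hard to read, so here is a minimal sketch of what it computes. The author names are hypothetical and the numbers are not taken from the dataset.

```python
import pandas as pd

# Hypothetical author column: one entry per bestseller row.
authors = pd.Series(["Ann", "Ann", "Ann", "Bob", "Cy", "Cy", "Dee"], name="Author")

# First pass: number of rows (books) per author.
books_per_author = authors.value_counts()

# Second pass: number of authors sharing each book count.
authors_per_count = books_per_author.value_counts().sort_index()

# Share of authors with more than one row in the list.
share_multi = (books_per_author > 1).mean()

print(authors_per_count)
print(f"Share of authors with more than one book: {share_multi:.0%}")
```

The last quantity is what the statement about "half of the authors" refers to; computing it on the real data would confirm or refine that estimate.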

Next, we look at the books by authors with six or more entries in the list, to see whether they form a saga, a collection, and so on. The dataframe below shows that many rows are the same book repeated over several years. We therefore remove the duplicates and redraw the previous plots: the number of authors with a single book in the bestsellers list increases considerably, and the other counts shrink.

In [None]:

author_list = df_authors_2[df_authors_2['Name']>=6].Author.to_list()
bestsellers[bestsellers['Author'].isin(author_list)][['Author','Name', 'Year','Reviews', 'Price']].sort_values(by='Author').head(25)

In [None]:
bestsellers_not_duplicates = bestsellers[['Name','Author', 'User Rating', 'Reviews', 'Price']].drop_duplicates()
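
To quantify how many titles are repeated across years, one option is `DataFrame.duplicated` with `keep=False`. This is a sketch on hypothetical rows; using `Name` and `Author` as the key is an assumption made here, not the exact column set used in the cell above.

```python
import pandas as pd

# Hypothetical rows: "Book A" is a bestseller in two different years.
sample = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book A", "Book C"],
    "Author": ["Ann", "Bob", "Ann", "Cy"],
    "Year": [2009, 2009, 2010, 2010],
})

# keep=False flags every row of a repeated (Name, Author) pair, not only the later ones.
repeated_rows = sample[sample.duplicated(subset=["Name", "Author"], keep=False)]
n_repeated_rows = len(repeated_rows)
n_repeated_books = repeated_rows[["Name", "Author"]].drop_duplicates().shape[0]

# Number of distinct years in which each book appears.
years_listed = sample.groupby(["Name", "Author"])["Year"].nunique()

print(n_repeated_books, "book(s) repeated over", n_repeated_rows, "rows")
```

Note that the notebook deduplicates on rating, reviews and price as well, so a title whose recorded values differ between years would survive that step but be counted as repeated here.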

In [None]:
fig, axes = plt.subplots(nrows=2,ncols=1, figsize=(12,8))
plt.subplots_adjust(hspace = .50)
# Same plots as before, now computed on the de-duplicated books
df_authors = bestsellers_not_duplicates['Author'].value_counts().value_counts().sort_index().to_frame()
df_authors.plot.bar(ax=axes[0], xlabel='# books written by the same author', ylabel='# Authors', legend=False)
axes[0].set_title('Frequency of number of bestselling books written by author')
df_authors_2 = pd.DataFrame(bestsellers_not_duplicates.groupby('Author')['Name'].count())
df_authors_2 = df_authors_2.reset_index()
# histplot replaces the deprecated distplot (seaborn >= 0.11)
sns.histplot(df_authors_2['Name'], ax=axes[1], bins=25, kde=True, stat='density')
axes[1].set_title('Distribution of the number of books written by the same author')
axes[1].set(xlabel='# books written by the same author')
plt.show()

Out of curiosity, the author with the most bestsellers is Jeff Kinney, with the following books:

In [None]:
bestsellers_not_duplicates[bestsellers_not_duplicates['Author']=='Jeff Kinney']

- Name

The name feature is a string variable holding the title of each book. We analyse it with a word cloud, in which the size of each word reflects how often it is used. Before plotting, we remove the stopwords, digits, punctuation and duplicate books.

In [None]:
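class removing():
    def __init__(self, texto):
        self.text = texto
    @staticmethod
    def remove_punctuation(texto):
        # Strip everything that is neither a word character nor whitespace
        return re.sub(r'[^\w\s]', '', texto)
    @staticmethod
    def remove_digit(texto):
        # Strip digits
        return re.sub(r'[0-9]', '', texto)
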

In [None]:
bestsellers['aux_Name'] = bestsellers.Name.apply(lambda x: removing.remove_punctuation(x))
bestsellers['aux_Name'] = bestsellers.aux_Name.apply(lambda x: removing.remove_digit(x))

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS 


def generate_wordcloud(text): # optionally add: stopwords=STOPWORDS and change the arg below
    wordcloud = WordCloud(background_color='white',
                          max_font_size=40,
                          scale=3,
                          stopwords = STOPWORDS # set of words to ignore
                          ).generate(text)

    plt.figure(1, figsize=(12, 7))
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()

In [None]:
# Join with a space so the last word of one title does not merge with the first word of the next
text = ' '.join(bestsellers.aux_Name.drop_duplicates())

generate_wordcloud(text)
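
The word cloud sizes each word by how often it occurs, so the same information can be inspected as plain counts. The sketch below uses only the standard library. The titles and the tiny stopword set are hypothetical, and collapsing repeated whitespace is an extra step not present in the cells above.

```python
import re
from collections import Counter

def clean_title(title):
    """Remove punctuation and digits, then collapse repeated whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", title)
    no_digits = re.sub(r"[0-9]", "", no_punct)
    return " ".join(no_digits.split())

# Hypothetical titles and a tiny stopword set, for illustration only.
titles = [
    "The Hunger Games (Book 1)",
    "Catching Fire (The Hunger Games)",
    "Diary of a Wimpy Kid, Book 7",
]
stopwords = {"the", "of", "a"}

words = [
    word.lower()
    for title in titles
    for word in clean_title(title).split()
    if word.lower() not in stopwords
]
counts = Counter(words)
print(counts.most_common(3))
```

On the real titles, `counts.most_common(20)` would list the words that appear largest in the cloud.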

- User rating

We plot the rating frequency and distribution for both the whole dataset and the dataset without duplicate books. The two distributions have a similar shape, although the one without duplicates looks smoother.

In [None]:
df_drop_duplicates = bestsellers[['Name','Author', 'User Rating', 'Reviews', 'Price', 'Genre']].drop_duplicates()

fig, axes = plt.subplots(nrows=2,ncols=1, figsize=(12,8))
plt.subplots_adjust(hspace = .50)
bestsellers['User Rating'].value_counts().sort_index().plot.bar(ax=axes[0],
                                                                xlabel='User Rating', ylabel='Frequency', color='orange')
df_drop_duplicates['User Rating'].value_counts().sort_index().plot.bar(ax=axes[0], 
                                                                       xlabel='User Rating', ylabel='Frequency',color='blue')

axes[0].legend(['whole dataset', 'removing duplicate books'])
axes[0].set_title('Rating Frequency')

# histplot replaces the deprecated distplot (seaborn >= 0.11)
sns.histplot(bestsellers['User Rating'], ax=axes[1], bins=25, kde=True, stat='density', color='orange', label='whole dataset')
sns.histplot(df_drop_duplicates['User Rating'], ax=axes[1], bins=25, kde=True, stat='density', color='blue', label='removing duplicate books')
axes[1].set_title('Rating distribution')
axes[1].legend()
plt.show()


- Price

To plot the price frequency and distribution, we again distinguish between the whole dataset and the dataset without duplicate books. The frequency plots have a similar shape, although dropping the duplicates removes some peaks. The price distributions are also similar, and the one without duplicates is smoother.

In [None]:
fig, axes = plt.subplots(nrows=2,ncols=1, figsize=(12,8))
plt.subplots_adjust(hspace = .50)
bestsellers['Price'].value_counts().sort_index().plot.bar(ax=axes[0], 
                                                          xlabel='Price',ylabel='Frequency',color='orange')
df_drop_duplicates['Price'].value_counts().sort_index().plot.bar(ax=axes[0], color = 'blue')
axes[0].legend(['whole dataset', 'removing duplicate books'])
axes[0].set_title('Price Frequency')
# histplot replaces the deprecated distplot (seaborn >= 0.11)
sns.histplot(bestsellers['Price'], ax=axes[1], bins=50, kde=True, stat='density', color='orange', label='whole dataset')
sns.histplot(df_drop_duplicates['Price'], ax=axes[1], bins=50, kde=True, stat='density', color='blue', label='removing duplicate books')
axes[1].legend()
axes[1].set_title('Price Distribution')
plt.show()



- Year

The plot shows that the dataset contains a constant number of bestsellers (50) per year, as one would expect.

In [None]:
fig, axes = plt.subplots(nrows=1,ncols=1, figsize=(12,4))
bestsellers['Year'].value_counts().sort_index().plot.bar(ax = axes, xlabel='Year', ylabel='# books')
axes.set_title('Number of bestseller books per year')

plt.show()

- Genre 

The bar graph shows how the books are classified by genre, fiction or non fiction. Non fiction accounts for a larger share of the list than fiction. When we remove the duplicate books, the two shares move closer together, which means that more of the removed rows are non fiction than fiction.

In [None]:
fig, axes = plt.subplots(nrows=1,ncols=1, figsize=(12,4))
bestsellers['Genre'].value_counts().sort_index().plot.bar(ax = axes,color='orange',xlabel='Genre', ylabel='# book')
df_drop_duplicates['Genre'].value_counts().sort_index().plot.bar(ax=axes, color='blue')
axes.legend(['whole dataset', 'removing duplicate books'])
axes.set_title('Genre Frequency')
plt.show()

The percentages: 

In [None]:
# Share of each genre in percent; normalize=True avoids hard-coding the number of rows
percentage_dupl = (bestsellers['Genre'].value_counts(normalize=True) * 100).round(2).rename('whole_dataset')
percentage_non_dupl = (df_drop_duplicates['Genre'].value_counts(normalize=True) * 100).round(2).rename('non-duplicates')

pd.concat([percentage_dupl, percentage_non_dupl], axis=1)

### 2.3. Cross feature analysis

We are going to study the relationships between the different features.


#### 2.3.1. Correlation between features <a id='conclusion2'></a>

First, we draw a table with the correlations between the numeric features. In general, none of the relationships is strong. The relationship is positive for User Rating & Year and Reviews & Year; negative for Price & User Rating, Price & Reviews and Price & Year; and close to zero for Reviews & User Rating.



In [None]:
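# Genre is a string column, so only the numeric features enter the correlation
corr = bestsellers[['User Rating', 'Reviews', 'Price', 'Year']].corr()
# Styler.format(precision=...) replaces set_precision, which was removed in pandas 2.0
corr.style.background_gradient().format(precision=2)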
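
`corr()` computes Pearson correlations and needs numeric input, so a text column such as Genre cannot enter the table directly. If one wanted to include it, a possible approach is to encode it as a 0/1 indicator first. The sketch below assumes that encoding; the rows and the column name `is_fiction` are hypothetical.

```python
import pandas as pd

# Hypothetical rows, only to make the example runnable.
sample = pd.DataFrame({
    "User Rating": [4.7, 4.5, 4.9, 4.2, 4.6, 4.8],
    "Reviews": [1200, 340, 9800, 150, 2100, 5600],
    "Price": [8, 15, 6, 22, 12, 9],
    "Year": [2009, 2009, 2010, 2010, 2011, 2011],
    "Genre": ["Fiction", "Non Fiction", "Fiction", "Non Fiction", "Non Fiction", "Fiction"],
})

# Encode the two genres as 1 (Fiction) and 0 (Non Fiction).
sample["is_fiction"] = (sample["Genre"] == "Fiction").astype(int)

numeric_columns = ["User Rating", "Reviews", "Price", "Year", "is_fiction"]
corr = sample[numeric_columns].corr()
print(corr.round(2))
```

The Pearson correlation between a numeric feature and a 0/1 indicator is the point-biserial correlation, so the `is_fiction` row can be read in the same way as the others.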

#### 2.3.2. Line graph <a id='conclusion1'></a>

- Number of books vs Year

The line graph shows the number of fiction and non fiction books over the 11 years. The two lines mirror each other because the counts must sum to 50 every year. Non fiction outnumbers fiction in most years, the exception being 2014.

In [None]:
# Count the books per (Year, Genre) pair; the 'values' column holds the number of books
df_genre_year = bestsellers.groupby(['Year', 'Genre']).size().reset_index(name='values')

In [None]:
fig, axes = plt.subplots(nrows=1,ncols=1, figsize=(12,4))
sns.lineplot(data = df_genre_year, x = "Year", y = "values", hue = "Genre", ax= axes)
axes.set_ylabel('# books')
axes.set_title('# books (Non Fiction & Fiction) vs Year')

plt.show()
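
The statement that the two lines mirror each other can be checked with a year-by-genre table whose rows sum to a constant. This is a minimal sketch on hypothetical rows with three books per year instead of 50.

```python
import pandas as pd

# Hypothetical rows: three books per year instead of 50.
sample = pd.DataFrame({
    "Year": [2009, 2009, 2009, 2010, 2010, 2010],
    "Genre": ["Fiction", "Non Fiction", "Non Fiction",
              "Fiction", "Fiction", "Non Fiction"],
})

# One row per year, one column per genre, cells hold the number of books.
genre_by_year = pd.crosstab(sample["Year"], sample["Genre"])

# If every year has the same total, one genre's count determines the other's.
row_totals = genre_by_year.sum(axis=1)

print(genre_by_year)
print("Constant total per year:", row_totals.nunique() == 1)
```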

#### 2.3.3. Boxplots <a id='conclusion3'></a>

The boxplot displays the distribution of the data based on five numbers: the “minimum”, the first quartile (Q1), the median, the third quartile (Q3) and the “maximum”. It also shows whether the data contains outliers, which are the points lying more than 1.5 times the interquartile range below Q1 or above Q3. Finally, the mean is marked on the graph by a green triangle.
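
The quantities a boxplot draws can be computed by hand, which makes the outlier rule concrete. The prices below are hypothetical and chosen only so that one value falls outside the fences.

```python
import pandas as pd

# Hypothetical prices for one year.
prices = pd.Series([4, 6, 8, 9, 10, 12, 13, 15, 18, 54])

q1, median, q3 = prices.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these fences are drawn as individual outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

# The whiskers end at the most extreme points that are still inside the fences.
inside = prices[(prices >= lower_fence) & (prices <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers.tolist()}")
```

The “minimum” and “maximum” in the description therefore refer to the whisker ends, not necessarily to the smallest and largest values in the data.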

- Price vs Year

The plot shows the price distribution of the bestsellers in every year. Within each year, the box is not symmetric (it is skewed) and there are several outliers, which indicates that the distribution is most likely not Gaussian. Across years, the median and the minimum are similar (the minimum cannot go below 0), whereas Q3 and the maximum vary more from year to year. To determine whether the yearly samples follow the same distribution, one could run a Kolmogorov-Smirnov test.

In [None]:
fig, axes = plt.subplots(nrows=1,ncols=1, figsize=(12,4))
sns.boxplot(x="Year", y="Price", data=bestsellers, ax=axes,showmeans=True)
axes.set_title('Price vs Year')
plt.show()
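
As a rough illustration of the Kolmogorov-Smirnov idea, the two-sample statistic is the largest vertical gap between the empirical cumulative distribution functions of two samples. The sketch below computes only that statistic with NumPy. The helper `ks_statistic` and the two price samples are hypothetical. In practice, `scipy.stats.ks_2samp` also returns a p-value.

```python
import numpy as np

def ks_statistic(sample_a, sample_b):
    """Largest vertical gap between the two empirical CDFs."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    # The gap can only change at observed values, so evaluate both ECDFs there.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.abs(cdf_a - cdf_b).max())

# Hypothetical price samples for two different years.
prices_year_a = [4, 6, 8, 9, 10, 12, 13, 15]
prices_year_b = [5, 6, 8, 8, 11, 12, 14, 20]

d = ks_statistic(prices_year_a, prices_year_b)
print(f"KS statistic: {d:.3f}")
```

A value near 0 means the two empirical distributions almost coincide, and a value of 1 means they do not overlap at all. Because prices here are integers with many ties, any p-value from a KS test should be treated as approximate.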

- User rating vs Year

The plot shows the user rating distribution in every year. The maximum (which cannot exceed 5), the median and the mean are fairly constant over the years, and even the third quartile changes little. However, the minimum, which can go as low as 0, differs considerably between years, and this also makes Q1 vary. The results are similar to the previous plot, and the same comments about the distribution apply.

In [None]:
fig, axes = plt.subplots(nrows=1,ncols=1, figsize=(12,4))
sns.boxplot(x="Year", y="User Rating", data=bestsellers, ax=axes,showmeans=True)
plt.show()

- Review vs Year

This plot shows the distribution of the number of reviews in each year. In the first three years the boxes are smaller and the data more concentrated. In the remaining years the minimum and the first quartile are quite similar, whereas the other values differ considerably. The data is also asymmetric, right skewed, and contains several outliers. The results are similar to the previous plots, and the same comments about the distribution apply.



In [None]:
fig, axes = plt.subplots(nrows=1,ncols=1, figsize=(12,4))
sns.boxplot(x="Year", y="Reviews", data=bestsellers, ax=axes,showmeans=True)
plt.show()

#### 2.3.4. Scatter plot <a id='conclusion4'></a>

A scatter plot uses dots to represent the values of two numeric variables. This type of plot is useful for seeing the relationship between the variables and whether the data contains any clusters.

- User Rating vs Price (Fiction & Non Fiction)

We plot user rating against price in a scatter plot, using colour to distinguish the genre. Most points are concentrated in the top left corner, meaning that most user ratings are between 4 and 5 and most prices are between 0 and 30. The points line up horizontally because of the granularity of the user rating feature.



In [None]:

g = sns.relplot(data=bestsellers,x='Price', y = 'User Rating', hue = 'Genre')
g.fig.suptitle('User Rating vs Price - Genre')

plt.show()

- User Rating vs Reviews (Fiction & Non Fiction)

We use a scatter plot of user rating against reviews to see the relationship between them, again using colour to distinguish the genre. Most points sit in the top left corner, meaning that most user ratings are between 4 and 5 and most review counts are between 0 and 30,000. As before, the points line up horizontally because of the granularity of the user rating feature.

In [None]:
g = sns.relplot(data=bestsellers,x='Reviews', y = 'User Rating', hue = 'Genre')
g.fig.suptitle('User Rating vs Reviews - Genre')
plt.show()

- Reviews vs Price (Fiction & Non Fiction)

We plot the relationship between reviews and price, distinguishing the genre by colour. Most points are concentrated in the bottom left corner and it is quite difficult to see any structure, but there are more fiction points towards the top, while the bottom is mostly blue (non fiction).

In [None]:
g = sns.relplot(data=bestsellers,x='Price', y = 'Reviews', hue = 'Genre')
g.fig.suptitle('Reviews vs Price - Genre')
plt.show()

## 3. Conclusion <a id='section3'></a>

- The data contains 7 features and 550 rows with information on the bestsellers over 11 years. Looking at the plot of the number of books by the same writer, one might think that some authors have a lot of bestsellers or have written a saga. However, a closer look shows that there are 94 books which are duplicated in 189 rows, so some books are bestsellers in several different years.
- For some features (number of books by author, rating and price), we compare the whole dataset with the dataset without duplicate books. The frequencies decrease, but the shapes of the distributions remain quite similar.
- Regarding the genre feature, the bestsellers contain more non-fiction than fiction. However, when we remove the duplicate books, the numbers for the two genres are quite similar. In addition, [the line graph](#conclusion1) of the number of bestsellers per year by genre shows that non-fiction outnumbers fiction in every year except 2014, when it was the other way round.
- The [correlations](#conclusion2) between the numeric variables are not very high. Most values are close to 0, which means there is no strong dependency, either negative or positive.
- [Boxplots](#conclusion3) show the distribution of the numeric features (Price, User Rating and Reviews) over the 11 years. For price, the distributions are quite similar over the years, which suggests that book prices did not change much. For user rating, there are some variations; for example, the data is concentrated around 5 in 2019. For reviews, the distribution is more concentrated at the beginning and more dispersed after 2012, which may mean that people got used to writing reviews.
- [Scatter plots](#conclusion4) show the relationship between pairs of variables. In general, we cannot see a strong relationship between the features, but some structure is visible.