In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt
import scipy.stats as stats

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 

# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#Let's take a look at the DataFrame

df = pd.read_csv('/kaggle/input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv')
df.sort_values('Reviews',ascending=False).head()

As you can see, this is the DataFrame we are going to work with. It shows the best-selling books on Amazon from 2009 to 2019. Like many people, you may feel the urge to work only with unique titles. However, some books appeared on this list in more than one year, so dropping duplicates would throw away real observations. For this reason it is important to work with the whole data.
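To check how often titles repeat across years, something like the sketch below can help. It runs on a tiny hypothetical DataFrame (the rows are made up for illustration) that only borrows the `Name` and `Year` column names from the dataset.

```python
import pandas as pd

# Hypothetical rows that mimic the structure of the real dataset
toy = pd.DataFrame({
    'Name': ['Book A', 'Book A', 'Book B', 'Book C', 'Book A'],
    'Year': [2009, 2010, 2010, 2011, 2011],
})

# Number of distinct years in which each title appears
years_on_list = toy.groupby('Name')['Year'].nunique().sort_values(ascending=False)
print(years_on_list)

# Rows that would be lost by keeping only one row per title
rows_lost = len(toy) - toy['Name'].nunique()
print('Rows dropped by deduplicating on Name:', rows_lost)
```

Running the same two steps on `df` should show how many rows a deduplicated analysis would ignore.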

In [None]:
#This function shows us some statistics about the data
#It is a great function to visualize the whole data in a very practical way 

df.describe()

In [None]:
# And by adding the parameter 'include', we can also see statistics for the text (object) columns

df.describe(include='O')

In [None]:
#Grouping the data by the year 

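# numeric_only=True skips the text columns (Name, Author, Genre),
# which recent pandas versions refuse to average
mean = df.groupby('Year').mean(numeric_only=True)
mean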

To see that data more clearly, let's create some graphs!

# **Analysing the data from the previous table**

In [None]:
#Graph 1 - Average value of user rating over the years

plt.plot(mean.index,mean['User Rating'])
plt.title('Graph 1 - Average value of user rating over the years')
plt.show()

**As you can see in the graph above, the user rating has increased significantly over the years. Do you have any possible explanation for why this happened?**

In [None]:
#Graph 2 - Average number of reviews per book over the years

plt.plot(mean.index,mean['Reviews'])
plt.title('Graph 2 - Average number of reviews per book over the years')
plt.show()

The graph above shows that, after 2012, the average number of reviews per book has not changed much. Nevertheless, the user rating has changed drastically. Is it possible that the profile of Amazon's clients is different now?

In [None]:
#Graph 3 - Average price per book over the years

plt.plot(mean.index,mean['Price'])
plt.title('Graph 3 - Average price per book over the years')
plt.show()

As you can see in the graph, the average price of the best-selling books has decreased significantly. This is a possible explanation for why the average number of reviews per book increased so much in that period.

In [None]:
# Increase in the average number of reviews per book, from 2010 to 2012

grow = ((mean['Reviews'].loc[2012]/mean['Reviews'].loc[2010])-1)*100
grow = round(grow,2)  # keep the rounded value so the print below uses it
print(str(grow) + '%')

* https://pattern.com/blog/amazon-prime-a-timeline-from-2005-to-2020/

As you can see in this link, in 2011 Amazon added Prime Video to the Prime subscription. That measure probably increased the number of Prime subscribers significantly in this period, which is another possible explanation for why the average number of reviews per book grew by more than 138% in two years.
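The growth figure is a plain percentage change, `((new / old) - 1) * 100`. Here is a minimal sketch of that arithmetic as a reusable function. The name `pct_growth` and the two averages are hypothetical, chosen only to make the calculation easy to follow.

```python
def pct_growth(old, new):
    """Percentage change from old to new: ((new / old) - 1) * 100."""
    return (new / old - 1) * 100

# Hypothetical averages, not taken from the dataset
avg_2010 = 5000.0
avg_2012 = 12500.0

growth = round(pct_growth(avg_2010, avg_2012), 2)
print(f'{growth}%')
```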

In [None]:
#Comparison of the graphs

fig, axes = plt.subplots(1, 2, figsize=(10,4))
axes[0].plot(mean.index,mean['Reviews'])
axes[0].set_title('Average number of reviews per book over the years')

axes[1].plot(mean.index,mean['Price'])
axes[1].set_title('Average price per book over the years');

As you can see in the comparison above, the number of reviews and the price of the books seem to have a negative correlation. Let's see if that's true.

In [None]:
#Correlation

corr, pval = stats.pearsonr(mean['Price'], mean['Reviews'])
print('The correlation is {} and the p-value is {}'.format(corr,pval))

As you can see, there is a strong negative correlation between these two factors, as we expected. That means that when the price goes up, the number of reviews tends to decrease. However, the correlation alone is not enough. It is important to analyse the p-value as well. Basically, "when you perform a statistical test a p-value helps you determine the significance of your results in relation to the null hypothesis". In this case, the p-value is high, so we can't trust the result. Keep in mind that the test uses only 11 points, one average per year. If you want to learn more about the p-value, there is a link below:

* https://www.simplypsychology.org/p-value.html
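To see where the p-value comes from, here is a small sketch on hypothetical yearly averages (11 made-up points, not the real data). It assumes the usual test for a Pearson correlation: under the null hypothesis of no correlation, `r` can be turned into a t statistic with `n - 2` degrees of freedom, so the p-value depends on both `r` and the number of points.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical yearly averages (price in dollars, reviews in thousands)
price = np.array([15, 14, 15, 14, 13, 13, 11, 12, 11, 10, 10], dtype=float)
reviews = np.array([4, 5, 8, 12, 13, 15, 14, 14, 13, 14, 16], dtype=float)

r, p_scipy = stats.pearsonr(price, reviews)

# Convert r to a t statistic with n - 2 degrees of freedom
n = len(price)
t = r * np.sqrt((n - 2) / (1 - r**2))

# Two-sided p-value from the Student t distribution
p_manual = 2 * stats.t.sf(abs(t), df=n - 2)

print('r = {:.3f}, p (scipy) = {:.4f}, p (manual) = {:.4f}'.format(r, p_scipy, p_manual))
```

The two p-values should agree, which shows that the same `r` would be judged differently with more or fewer data points.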


# Let's try to find some more interesting information about the data

In [None]:
plt.scatter(df['Price'],df['User Rating'])
plt.show()
print('The average user rating of a book is', df['User Rating'].mean())
print('The average price of a book is {} dollars'.format(df['Price'].mean()))

In [None]:
plt.scatter(df['Price'],df['Reviews'])
plt.show()
print('The average number of reviews per book is', df['Reviews'].mean())
print('The average price of a book is {} dollars'.format(df['Price'].mean()))

As you can see, most of the values in both graphs are around their average. Of course there are some outliers, but we are not going to do further analysis on them. However, this is an interesting concept to know and apply. The link below explains more about that:

* https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm#:~:text=An%20outlier%20is%20an%20observation,random%20sample%20from%20a%20population.&text=Examination%20of%20the%20data%20for,often%20referred%20to%20as%20outliers.
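One common rule of thumb for flagging outliers is the 1.5 x IQR rule: a value is suspicious if it lies more than 1.5 interquartile ranges below the first quartile or above the third. The sketch below applies that rule to a hypothetical price series. It is only one possible convention, and it is not used anywhere else in this notebook.

```python
import pandas as pd

# Hypothetical prices; the value 105 is planted as an obvious outlier
prices = pd.Series([8, 9, 10, 11, 12, 13, 14, 15, 16, 105])

# Interquartile range: distance between the 25th and 75th percentiles
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Fences beyond which a value is flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = prices[(prices < lower) | (prices > upper)]
print(outliers)
```

The same lines should work on `df['Price']` or `df['Reviews']` to list the rows that sit far from the bulk of the data.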

Just to see an example of an outlier analysis, let's check out which book has the lowest user rating

In [None]:
df[df['User Rating']==df['User Rating'].min()]

And yes, J.K. Rowling probably did a better job writing the Harry Potter series. Also, the fame of the author can explain why such a poorly rated book appears in this dataset.

In [None]:
# Books with most reviews

# Build the top 10 once instead of repeating the whole expression for each axis
top_reviews = df[['Name','Reviews']].drop_duplicates().sort_values('Reviews',ascending=False).head(10)
plt.bar(top_reviews['Name'],top_reviews['Reviews'])
plt.xticks(rotation='vertical')
plt.title('Most Reviewed Books')
plt.show()

In [None]:
# Authors that appeared most on the list

Top10 = df.groupby('Author').count()[['Price']].rename(columns={'Price':'Total'}).sort_values('Total',ascending=False).head(10)
plt.bar(Top10.index,Top10['Total'])
plt.xticks(rotation='vertical')
plt.title('Top 10 Authors by the total appearances in the data')
plt.show()

In [None]:
#Histogram to see the distribution 

n = df['User Rating']
fig, axes = plt.subplots(1, 2, figsize=(12,4))

axes[0].hist(n)
axes[0].set_title("Regular Histogram")
axes[0].set_xlim((min(n), max(n)))

axes[1].hist(n, cumulative=True)
axes[1].set_title("Cumulative Histogram")
axes[1].set_xlim((min(n), max(n)));

The histogram shows how the data is distributed. It's important to know that the data is divided into intervals, called bins. In the graph above, we can see that the most common user ratings are between 4.6 and 4.7. Histograms can also show a cumulative value, as in the example on the right.

To know more about histograms, check this website: https://statistics.laerd.com/statistical-guides/understanding-histograms.php
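To make the idea of bins concrete, the sketch below computes the counts behind a histogram with `np.histogram` and derives the cumulative version with a running sum. The ratings are hypothetical, and the choice of 4 bins is arbitrary (`hist` in matplotlib uses 10 by default).

```python
import numpy as np

# Hypothetical ratings, made up for illustration
ratings = np.array([3.3, 4.0, 4.2, 4.4, 4.6, 4.6, 4.7, 4.7, 4.8, 4.9])

# Split the range from min to max into 4 equal-width bins
counts, edges = np.histogram(ratings, bins=4)

# A cumulative histogram is the running total of the bin counts
cumulative = np.cumsum(counts)

for left, right, c, total in zip(edges[:-1], edges[1:], counts, cumulative):
    print('{:.1f} to {:.1f}: {} ratings, {} so far'.format(left, right, c, total))
```

The last cumulative value always equals the number of observations, which is why the cumulative plot ends at the size of the dataset.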

In [None]:
# Genre of the books

Percentage = df['Genre'].value_counts(normalize=True)
Percentage = Percentage*100
plt.pie(Percentage, autopct='%1.2f%%')
# Take the legend labels from the index so they always match the slice order
plt.legend(Percentage.index)
plt.title('Fiction X Non Fiction')
plt.show()
df['Genre'].value_counts()

In this pie chart, we can see that there are more non fiction books than fiction. But do you think this trend has remained constant over time? Let's check it out!

In [None]:
#Creating the graph

# Count the books of each genre per year; the index of each series is the year
serie1 = df[df['Genre']=='Fiction'].groupby('Year')['Name'].count()
serie2 = df[df['Genre']=='Non Fiction'].groupby('Year')['Name'].count()

plt.plot(serie2.index,serie2, label='Non Fiction')
plt.plot(serie1.index,serie1, label='Fiction')
plt.legend()
plt.title('Fiction X Non Fiction')
plt.show()


As you can see, in 2014 there were more fiction books on the list than non fiction. However, in the following year, the difference between these two genres increased greatly, with non fiction books in the lead. Does it mean that Amazon's clients regretted reading so many fiction books in the previous year? (And just to make it clear, I love fiction books!!!)
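Another way to build the same year-by-genre counts is `pd.crosstab`, sketched below on hypothetical rows that only borrow the `Year` and `Genre` column names. It is an alternative to the filtering approach, not what the graph above was built with.

```python
import pandas as pd

# Hypothetical rows; 2015 deliberately has no fiction book
toy = pd.DataFrame({
    'Year':  [2013, 2013, 2013, 2014, 2014, 2014, 2015, 2015, 2015],
    'Genre': ['Fiction', 'Non Fiction', 'Non Fiction',
              'Fiction', 'Fiction', 'Non Fiction',
              'Non Fiction', 'Non Fiction', 'Non Fiction'],
})

# One row per year, one column per genre, missing combinations filled with 0
counts = pd.crosstab(toy['Year'], toy['Genre'])

# Positive values mean non fiction leads in that year
counts['Difference'] = counts['Non Fiction'] - counts['Fiction']
print(counts)
```

Because `crosstab` fills missing year and genre pairs with 0, both genres keep one value per year, and `counts[['Fiction', 'Non Fiction']].plot()` would draw both lines in one call.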

In [None]:
fic=round(df[df['Genre']=='Fiction']['Price'].mean(),2)
non_fic=round(df[df['Genre']=='Non Fiction']['Price'].mean(),2)

per = round(((non_fic/fic)-1)*100,2)

print('The average price for a fiction book is ${} and for a non fiction is ${}. And this means that non fiction books are {}% more expensive than fiction.'.format(fic, non_fic,per))

In [None]:
plt.bar(['Fiction','Non Fiction'],[fic,non_fic], color=['orange','b'])
plt.title('Average price for each genre')
plt.show()

Even though non fiction books are significantly more expensive, they are more popular. Do you have any theory on why this happens?

In [None]:
#User Rating analyzed by genre

df2 = df.groupby('Genre')[['User Rating']].mean()
df2['STD'] = df.groupby('Genre')[['User Rating']].std()
df2.rename(columns={'User Rating':'Mean'},inplace=True)
df2['Max'] = df.groupby('Genre')[['User Rating']].max()
df2['Min'] = df.groupby('Genre')[['User Rating']].min()

df2

As you could see during this analysis, I tried to look at this data from a different perspective, and I surely brought up more questions than answers. However, I believe that, when it comes to data analysis, it is really important to be curious and ask as many questions as you can!

# Thank you so much for reading my notebook! I hope you enjoyed it. Please give me your feedback, we are all here to learn!