# Business Understanding

The goals of this project are:
1. Find the cheapest books.
2. Find the most expensive books.
3. Find which book has the most reviews.
4. Find which book has the highest user rating.
5. Find which book genre was bought most often by users.

# Data Description

First, we will import the libraries and the dataset.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns #first data visualization library
sns.set_style('whitegrid')
import matplotlib.pyplot as plt #second data visualization library
%matplotlib inline
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import warnings
warnings.filterwarnings("ignore")
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Then, we will use pandas to view the first five rows of the dataset.

In [None]:
df=pd.read_csv('/kaggle/input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv')
df.head()

In [None]:
# df.shape returns (number of rows, number of columns)
rows, cols = df.shape

print("Number of Rows: ", rows)
print("Number of Columns: ", cols)

We can see that the dataset contains 550 rows and 7 columns. Next, we can display the data type of each column in the dataset.

In [None]:
df.info()

We can also display the descriptive statistics, first for the categorical columns and then for the numerical ones.

In [None]:
df.describe(include=['object'])

In [None]:
df.describe(include=['number'])

Using the code below, we can display the list of columns in the dataset.

In [None]:
df.columns

In the next step, we will display the unique values of each categorical column using the code below.

In [None]:
g = df.columns.to_series().groupby(df.dtypes).groups
y={k.name: v for k, v in g.items()}
y

In [None]:
mylist=[j for j in y['object']]
mylist

In [None]:
for i in mylist:
    # Print the distinct values of each categorical column
    if isinstance(i, str):
        print('Variable {} unique value:'.format(i))
        print(df[i].unique())
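As a shorter alternative to grouping column names by dtype, the categorical columns can be picked with `select_dtypes`. The sketch below is only an illustration: the `toy` frame and its values are hypothetical stand-ins, and only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataset; only the
# column names are taken from the notebook.
toy = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book A"],
    "Author": ["Author X", "Author Y", "Author X"],
    "Genre": ["Fiction", "Non Fiction", "Fiction"],
    "Price": [8, 15, 8],
})

# Select the non-numeric columns directly.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Count the distinct values in each categorical column.
unique_counts = toy[categorical_cols].nunique()
print(unique_counts)
```

With the real data, replacing `toy` with `df` should give the same kind of summary without printing every distinct book title.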

# Data Visualization

First, we will use a heatmap to check whether the dataset has any missing values.

In [None]:
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

From the heatmap above, we can see that the data does not contain any missing values.
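A heatmap can hide a single missing cell in a large table, so a numeric count is a useful cross-check. The following is a minimal sketch on a hypothetical `toy` frame (with one missing price added on purpose), not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; one Price value is deliberately missing.
toy = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C"],
    "Price": [8.0, np.nan, 12.0],
    "Genre": ["Fiction", "Non Fiction", "Fiction"],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

# Total number of missing values in the whole table.
total_missing = int(missing_per_column.sum())
print(missing_per_column)
print("Total missing:", total_missing)
```

On the real data, `df.isnull().sum()` should report zero for every column if the heatmap reading is right.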

Next, we will use a countplot to check which book genre was bought most often by users.

In [None]:
sns.countplot(x="Genre", data=df)

From the countplot above, we can say that Non Fiction books appear most often among the bestsellers, so people most likely buy Non Fiction books.
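The countplot counts rows, so a title that appears in more than one year is counted more than once. The sketch below, on a hypothetical `toy` frame, shows how the row counts and the distinct-title counts per genre could be compared. It assumes that `Name` identifies a book.

```python
import pandas as pd

# Hypothetical toy rows; "Book A" is listed in two different years.
toy = pd.DataFrame({
    "Name": ["Book A", "Book A", "Book B", "Book C"],
    "Genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction"],
    "Year": [2018, 2019, 2019, 2019],
})

# Rows per genre: this is what sns.countplot draws.
rows_per_genre = toy["Genre"].value_counts()

# Distinct titles per genre: each book is counted once.
titles_per_genre = toy.drop_duplicates(subset="Name")["Genre"].value_counts()

print(rows_per_genre)
print(titles_per_genre)
```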

We have already used a heatmap and a countplot to analyze the data. In the following steps, we will use line charts.

In [None]:
sns.set_style('whitegrid')
# Pass x and y as keywords; newer seaborn versions reject positional x/y
sns.lineplot(x='Year', y='Reviews', data=df, ci=None)
plt.grid(False)

In [None]:
sns.set_style('whitegrid')
# Same chart, split by genre
sns.lineplot(x='Year', y='Reviews', hue='Genre', data=df, ci=None)
plt.grid(False)

In [None]:
sns.set_style('whitegrid')
# Average price per year
sns.lineplot(x='Year', y='Price', data=df, ci=None)
plt.grid(False)

In [None]:
sns.set_style('whitegrid')
# Average price per year, split by genre
sns.lineplot(x='Year', y='Price', hue='Genre', data=df, ci=None)
plt.grid(False)

In [None]:
sns.set_style('whitegrid')
# Average user rating per year
sns.lineplot(x='Year', y='User Rating', data=df, ci=None)
plt.grid(False)

In [None]:
sns.set_style('whitegrid')
# Average user rating per year, split by genre
sns.lineplot(x='Year', y='User Rating', hue='Genre', data=df, ci=None)
plt.grid(False)

From the six line charts above, we can infer that:
1. All six charts fluctuate over the period.
2. Fiction books were priced higher than Non Fiction books at the beginning of the period, and Non Fiction prices were still lower than Fiction prices in the following period.
3. Although Fiction books had fewer reviews than Non Fiction books in 2018, they had more reviews in the rest of the period.
4. Although Fiction books had a lower user rating than Non Fiction books in 2012, they had a higher user rating in the rest of the period.
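Readings like these can be checked against a table of yearly averages per genre, which is what `sns.lineplot` draws by default when several rows share the same x value. The sketch below uses a hypothetical `toy` frame with made-up prices; the same pattern would apply to `Reviews` and `User Rating`.

```python
import pandas as pd

# Hypothetical toy rows with made-up prices for two years.
toy = pd.DataFrame({
    "Year": [2009, 2009, 2010, 2010],
    "Genre": ["Fiction", "Non Fiction", "Fiction", "Non Fiction"],
    "Price": [10, 14, 12, 9],
})

# Mean price per year (rows) and genre (columns).
yearly_price = toy.pivot_table(
    index="Year", columns="Genre", values="Price", aggfunc="mean"
)

# True for the years in which Fiction is priced higher on average.
fiction_higher = yearly_price["Fiction"] > yearly_price["Non Fiction"]

print(yearly_price)
print(fiction_higher)
```

Listing the years where `fiction_higher` flips makes it easier to state exactly when one genre overtakes the other.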

Lastly, we will use a correlation heatmap to check the relationships among the numerical variables in the dataset.

In [None]:
sns.heatmap(df[['User Rating', 'Reviews', 'Price', 'Year']].corr(),vmin=-1,vmax=1,annot=True)

From the heatmap above, we can conclude that:
1. User Rating has a positive linear relationship with Year and a negative linear relationship with Reviews and Price.
2. Reviews has a positive linear relationship with Year and a negative linear relationship with User Rating and Price.
3. Price has a negative linear relationship with Year, User Rating, and Reviews.
4. Year has a negative linear relationship with Price and a positive linear relationship with User Rating and Reviews.
5. User Rating and Reviews are most strongly correlated with Year.
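To read off which variable is most strongly related to another without relying on the heatmap colors, the correlation matrix can be ranked by absolute value. This is a minimal sketch on a hypothetical `toy` frame with made-up numbers. Note that a correlation describes a linear association only and does not show that one variable causes another.

```python
import pandas as pd

# Hypothetical toy values; the column names match the dataset.
toy = pd.DataFrame({
    "User Rating": [4.0, 4.2, 4.4, 4.6],
    "Reviews": [100, 300, 200, 400],
    "Price": [20, 15, 12, 8],
    "Year": [2009, 2010, 2011, 2012],
})

# Pearson correlation matrix (the pandas default).
corr = toy.corr()

# Correlations of User Rating with every other column.
rating_corr = corr["User Rating"].drop("User Rating")

# Column with the largest absolute correlation with User Rating.
strongest = rating_corr.abs().idxmax()

print(rating_corr.sort_values(key=abs, ascending=False))
print("Strongest:", strongest)
```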

# Display the Top Books Based on Reviews

First we will display the top books based on reviews.

In [None]:
df[['Name','Author','Reviews','Genre']].sort_values(by=['Reviews'],ascending=False)

From the table above, we can see that the book with the most reviews is Where the Crawdads Sing, written by Delia Owens. It has 87841 reviews and its genre is Fiction. On the other hand, Divine Soul Mind Body has only 37 reviews, the fewest of any book in the dataset.
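Instead of scanning the sorted table, the two extreme rows can be selected directly with `idxmax` and `idxmin`. The sketch below uses a hypothetical `toy` frame; the titles and counts are made up.

```python
import pandas as pd

# Hypothetical toy rows with made-up review counts.
toy = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C"],
    "Author": ["Author X", "Author Y", "Author Z"],
    "Reviews": [500, 37, 9000],
})

# Row with the largest number of reviews.
most_reviewed = toy.loc[toy["Reviews"].idxmax()]

# Row with the smallest number of reviews.
least_reviewed = toy.loc[toy["Reviews"].idxmin()]

print(most_reviewed["Name"], most_reviewed["Reviews"])
print(least_reviewed["Name"], least_reviewed["Reviews"])
```

Both `idxmax` and `idxmin` return only the first matching row, so they are best suited to columns where ties are unlikely.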

Next, we will check which books are the most expensive and which are the cheapest.

In [None]:
df[['Name','Author','Price','Genre']].sort_values(by=['Price'],ascending=False).head(5)

In [None]:
df[['Name','Author','Price','Genre']].sort_values(by=['Price'],ascending=False).tail(5)

From the tables above, we can see that the most expensive book is the Diagnostic and Statistical Manual of Mental Disorders, written by the American Psychiatric Association, while the cheapest books are listed as free, such as Frozen (Little Golden Book), The Constitution of the United States, To Kill a Mockingbird, and Diary of a Wimpy Kid: Hard Luck, Book 8.
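Because several books share the lowest price, `head(5)` or `tail(5)` may cut the list off arbitrarily. A boolean filter on the minimum (or maximum) returns every tied row. The sketch below uses a hypothetical `toy` frame with made-up prices.

```python
import pandas as pd

# Hypothetical toy rows; two books share the lowest price.
toy = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C", "Book D"],
    "Price": [0, 105, 0, 15],
})

# All rows priced at the minimum, however many there are.
cheapest = toy[toy["Price"] == toy["Price"].min()]

# All rows priced at the maximum.
priciest = toy[toy["Price"] == toy["Price"].max()]

print(cheapest)
print(priciest)
```

On the real data, adding `.drop_duplicates(subset="Name")` would list each free title once even if it appears in several years.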

We have already displayed the books sorted by price and by reviews. Next, we will display the books sorted by year.

In [None]:
df[['Name','Author','Year','Genre']].sort_values(by=['Year'],ascending=False)

From the table above, we can see that the oldest books include The Last Olympian; Breaking Dawn; Eat This, Not That!; Good to Great; and Sookie Stackhouse. On the other hand, the newest books in the dataset include You Are a Badass; School Zone; The Wonderful Things You Will Be; P is for Potty!; and Girl, Wash Your Face.

In [None]:
df[['Name','Author','User Rating','Genre']].sort_values(by=['User Rating'],ascending=False)

We can see that the book with the lowest user rating is The Casual Vacancy, written by J.K. Rowling. On the other hand, the books with the highest user rating are The Magnolia Story and the Dog Man series, written by Chip Gaines and Dav Pilkey, respectively.
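When several books share the top rating, a secondary sort key gives a stable order among the ties. The sketch below breaks ties by review count, which is an assumption of this example rather than something the analysis above does. The `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; two books share the highest rating.
toy = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C", "Book D"],
    "User Rating": [4.9, 3.3, 4.9, 4.5],
    "Reviews": [1200, 900, 5400, 300],
})

# Sort by rating first, then by review count to order the ties.
ranked = toy.sort_values(
    by=["User Rating", "Reviews"], ascending=[False, False]
).reset_index(drop=True)

print(ranked)
```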

# Conclusion

1. The cheapest books are Frozen (Little Golden Book), The Constitution of the United States, To Kill a Mockingbird, and Diary of a Wimpy Kid: Hard Luck, Book 8.
2. The most expensive book is the Diagnostic and Statistical Manual of Mental Disorders, written by the American Psychiatric Association.
3. The book with the most reviews is Where the Crawdads Sing, written by Delia Owens.
4. The books with the highest user rating are The Magnolia Story and the Dog Man series, written by Chip Gaines and Dav Pilkey, respectively.
5. The book genre that was most likely bought by users is Non Fiction.