This notebook contains:
- Data Preparation: a quick look at the dataset, then replacing zero prices with the mean price
- Data Exploration: finding duplicate or similar names using cosine similarity
- Data Visualization

# Import Library

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import re
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid")

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# Data Preparation

In [None]:
# import data
df = pd.read_csv(r"/kaggle/input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv")

In [None]:
# quick view about the table
df.head()

In [None]:
print(f"There are {np.shape(df)[0]} rows and {np.shape(df)[1]} columns")

In [None]:
# check the numeric columns
df.describe()

Plot three box plots to show the distributions of "User Rating", "Reviews", and "Price".

In [None]:
ax = sns.boxplot(x=df["User Rating"])

In [None]:
ax = sns.boxplot(x=df["Reviews"])

In [None]:
ax = sns.boxplot(x=df["Price"])

Some questions I have after reading the figures:
* (1) Which book got an average user rating of 3.3?
* (2) Which book got only 37 reviews while other books have thousands?
* (3) The minimum price is 0? This needs a closer look.
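
The rows behind these extremes can also be looked up without hard-coding the values read off the plots. Below is a minimal sketch, assuming a small made-up frame `toy` that only borrows the column names of the dataset; the helper `rows_at_minimum` is hypothetical and not part of pandas.

```python
import pandas as pd

# Made-up rows for illustration only; the column names mirror the dataset.
toy = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C", "Book D"],
    "User Rating": [4.7, 3.3, 4.5, 4.8],
    "Reviews": [12000, 9000, 37, 25000],
    "Price": [15, 12, 0, 8],
})

def rows_at_minimum(frame, column):
    """Return every row whose value in `column` equals the column minimum."""
    return frame[frame[column] == frame[column].min()]

for column in ["User Rating", "Reviews", "Price"]:
    print(column, "->", rows_at_minimum(toy, column)["Name"].tolist())
```

Comparing against `min()` returns all tied rows, which matters when several books share the lowest value.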

(1) Which book got an average user rating of 3.3?

In [None]:
df[df["User Rating"] == 3.3]

>Okay. Reasonable.

(2) Which book got only 37 reviews?

In [None]:
df[df["Reviews"] == 37]

    It seems unreasonable that one of the best-selling books on Amazon got only 37 reviews that year. However, I have no way to check its validity.

(3) The minimum price is 0?

In [None]:
df[df["Price"] == 0]

    The table also contains books with low prices (e.g. 1, 2, 3). As long as the price is not 0, I accept the value as valid. I will replace each 0 with the average price of the same year and genre.

Prices equal to 0 are replaced with the average price of the same year and genre.
Code:

In [None]:
# Separate the dataframe into rows with a non-zero price and rows with a zero price
df_not_zero = df[df["Price"] != 0]
df_equal_zero = df[df["Price"] == 0]

# Calculate the mean of the non-zero prices, grouped by year and genre
price_groupby_year_genre = df_not_zero.groupby(["Year", "Genre"])["Price"].mean()

# Drop the "Price" column (returns a new dataframe instead of modifying a slice of df in place)
df_equal_zero = df_equal_zero.drop(columns=["Price"])

# Left join the average price so that every zero-price row is kept, then concatenate the two dataframes
df_replaced_zero = pd.merge(df_equal_zero, price_groupby_year_genre, how="left", on=["Year", "Genre"])
df_new = pd.concat([df_not_zero, df_replaced_zero])

To check that no rows were lost:

In [None]:
print(f"There are {np.shape(df_new)[0]} rows and {np.shape(df_new)[1]} columns")
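
The same replacement can also be written without splitting and re-joining the table. The sketch below is an alternative, not the method used above: it runs on a made-up frame `toy` (the rows are invented, only the column names follow the dataset), treats 0 as missing, and fills each gap with the mean of the non-zero prices in the same year and genre.

```python
import pandas as pd

# Made-up rows for illustration only; the column names mirror the dataset.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Year": [2010, 2010, 2010, 2011, 2011],
    "Genre": ["Fiction", "Fiction", "Fiction", "Non Fiction", "Non Fiction"],
    "Price": [10, 0, 20, 8, 0],
})

# Treat 0 as missing so it does not drag the group mean down.
price = toy["Price"].astype(float).mask(toy["Price"] == 0)

# Mean of the non-zero prices within each (Year, Genre) group, aligned row by row.
group_mean = price.groupby([toy["Year"], toy["Genre"]]).transform("mean")

# Fill only the missing prices; all other prices stay as they are.
toy_fixed = toy.assign(Price=price.fillna(group_mean))
print(toy_fixed)
```

This keeps the original row order and index. A group that contains only zero prices would stay missing (NaN) and would need separate handling.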

# Data Exploration - checking duplicate names

In [None]:
print(f"There are {len(df_new.Name.unique())} books and {len(df_new.Author.unique())} authors")

The actual number of unique books or authors may be lower because of spelling differences in the names. For example, 

    "The Girl Who Played with Fire (Millennium Series)" and 
    "The Girl Who Played with Fire (Millennium)" 

Below I use word similarity to check all the unique book names and author names. 
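
To make the idea concrete, here is a small worked example of cosine similarity over word counts for the two titles above. It is a sketch using only the standard library: the helpers `word_counts` and `cosine` are hypothetical names, and the tokenization (lower-cased runs of two or more word characters) is an assumption chosen as a rough stand-in for a word-count vectorizer.

```python
import math
import re
from collections import Counter

def word_counts(text):
    # Lowercase and keep runs of 2+ word characters
    # (a simplifying assumption of this sketch).
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

def cosine(a, b):
    # Dot product of the two count vectors divided by the product of their lengths.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

title_1 = "The Girl Who Played with Fire (Millennium Series)"
title_2 = "The Girl Who Played with Fire (Millennium)"
score = cosine(word_counts(title_1), word_counts(title_2))
print(round(score, 3))  # 7 shared words / (sqrt(8) * sqrt(7)), about 0.935
```

The two titles share seven of their eight distinct words, so the score is close to 1, while unrelated titles with no words in common score 0.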

In [None]:
# Import libraries:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Code to find similar names:

In [None]:
# Find all the Unique names, and transform to lower case:
unique_name = df_new.Name.unique()
unique_name = [word.lower() for word in unique_name]
# Vectorize words:
count_vectorizer = CountVectorizer()
sparse_matrix = count_vectorizer.fit_transform(unique_name)

# Calculate the pairwise similarity between all names
result = (cosine_similarity(sparse_matrix, sparse_matrix))

# Keep only pairs with a score > 0.7, excluding each name paired with itself.
# The similarity threshold is a parameter we can set ourselves.
# I tried several thresholds; 0.7 gives a good result.
similar_name = np.argwhere(result > 0.7)
similar_name_filter = similar_name[similar_name[:,0] != similar_name[:,1]]

# Each pair appears twice, as (i, j) and (j, i). Keep only the pairs with j > i
similar_name_filter_not_repeat = similar_name_filter[similar_name_filter[:,1] > similar_name_filter[:,0]]

# Make a dataframe for easier comparison
left = np.array(similar_name_filter_not_repeat[:,0])
right = np.array(similar_name_filter_not_repeat[:,1])
df_similar = pd.DataFrame({'col1': np.take(unique_name, left), 'col2': np.take(unique_name, right)}) 

In [None]:
# Set pandas to show more rows
pd.set_option('display.max_rows', 100)
# Book names that are similar to another book name
df_similar

What the similarity score tells us:
* (1) It gives us a fast and easy way to check whether there are similar names in the dataset. Here, we find that some book names differ but actually represent the same book.
* (2) It gives us another view of the dataset. The method above also surfaces book series, for example: heroes of olympus, harry potter, diary of a wimpy kid, dog man, etc. This gives us another perspective to dive into the dataset.

We could unify the book names if we wanted to explore the data from the perspective of book names. 
For now, I leave them unchanged.

I made one change to the above code: the vectorizer now also counts two- and three-word sequences (`ngram_range=(1, 3)`).
We can now find similar author names with the following code:

In [None]:
# Find all the Unique names, and transform to lower case:
unique_name = df_new.Author.unique()
unique_name = [word.lower() for word in unique_name]
# Vectorize words into vectors:
count_vectorizer = CountVectorizer(ngram_range=(1, 3))
sparse_matrix = count_vectorizer.fit_transform(unique_name)

# Calculate the pairwise similarity between all names
result = (cosine_similarity(sparse_matrix, sparse_matrix))

# Keep only pairs with a score > 0.7, excluding each name paired with itself.
# The similarity threshold is a parameter we can set ourselves.
# I tried several thresholds; 0.7 gives a good result.
similar_name = np.argwhere(result > 0.7)
similar_name_filter = similar_name[similar_name[:,0] != similar_name[:,1]]

# Each pair appears twice, as (i, j) and (j, i). Keep only the pairs with j > i
similar_name_filter_not_repeat = similar_name_filter[similar_name_filter[:,1] > similar_name_filter[:,0]]

# Make a dataframe for easier comparison
left = np.array(similar_name_filter_not_repeat[:,0])
right = np.array(similar_name_filter_not_repeat[:,1])
df_similar = pd.DataFrame({'col1': np.take(unique_name, left), 'col2': np.take(unique_name, right)}) 

In [None]:
df_similar
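
Since the book-name and author-name searches differ only in the input column and the `ngram_range`, the steps can be wrapped in one function. The sketch below is one possible refactoring, not the code used above: `find_similar_pairs` is a hypothetical helper, the `score` column is an addition, and the example names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_similar_pairs(names, threshold=0.7, ngram_range=(1, 1)):
    """Return pairs of distinct names whose cosine similarity exceeds `threshold`."""
    names = [name.lower() for name in names]
    counts = CountVectorizer(ngram_range=ngram_range).fit_transform(names)
    scores = cosine_similarity(counts)
    # k=1 keeps only the part above the diagonal, so each pair appears once
    # and no name is compared with itself.
    left, right = np.where(np.triu(scores, k=1) > threshold)
    return pd.DataFrame({
        "col1": [names[i] for i in left],
        "col2": [names[j] for j in right],
        "score": scores[left, right],
    })

# Made-up input for illustration.
example = ["J.K. Rowling", "J. K. Rowling", "Jeff Kinney"]
pairs = find_similar_pairs(example, ngram_range=(1, 3))
print(pairs)
```

With the vectorizer's default settings, single-letter tokens such as initials are dropped, so the two spellings of the first author reduce to the same word counts and are reported as a pair.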

In [None]:
# Correct the authors' names in the dataframe.
# The names were lower-cased only for the comparison; the dataframe still holds
# the original spelling, so I hard-code the replacements here.
df_new = df_new.replace(["George R. R. Martin", "J. K. Rowling"], ["George R.R. Martin","J.K. Rowling"])

In [None]:
print(f"There are {len(df_new.Name.unique())} books and {len(df_new.Author.unique())} authors. There were 248 authors' names before correction.")
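
The two corrections above differ only in the spacing between initials, so a rule-based version is possible as well. The following is a sketch under that assumption: `collapse_initials` is a hypothetical helper, and it presumes the compact spelling ("J.K.") is the one to keep, as in the manual replacement.

```python
import re

def collapse_initials(name):
    # Remove the whitespace between consecutive single-letter initials,
    # e.g. "J. K." -> "J.K.". All other spaces are left alone.
    return re.sub(r"\b([A-Z]\.)\s+(?=[A-Z]\.)", r"\1", name)

for raw in ["George R. R. Martin", "J. K. Rowling", "Jeff Kinney"]:
    print(raw, "->", collapse_initials(raw))
```

It could be applied to the whole column with `df_new["Author"].map(collapse_initials)`, though the result should still be reviewed by eye, because other kinds of spelling differences are not covered by this rule.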

# Data Visualization

In [None]:
# We can see there are different series of books.
# I use the "Dog Man" series as an example
# to see if we can find some insight from it.
df_new = df_new.reset_index()
# Keep the rows whose lower-cased name contains "dog man"
is_dog_man = df_new["Name"].str.lower().str.contains("dog man", regex=False)
df_series = df_new[is_dog_man]

In [None]:
df_series

Let's see how the Dog Man series performs by comparing its average number of reviews with the average for all fiction books in the same years.
Code:

In [None]:
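# numeric_only=True skips text columns such as "Name" and "Author"
df_series_avg = df_series.groupby(["Year"]).mean(numeric_only=True).reset_index()
df_fiction_avg = df[df["Genre"] == "Fiction"].groupby(["Year", "Genre"]).mean(numeric_only=True)

df_fiction_avg = df_fiction_avg.reset_index()[["Year", "Reviews"]]
# Keep the last three years; copy so that adding a column does not write to a slice
df_fiction_avg = df_fiction_avg.iloc[-3:].copy()

df_series_avg["Legend"] = "Dog Man"
df_fiction_avg["Legend"] = "Total"  # average over all fiction books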

dog_man_df = pd.concat([df_series_avg, df_fiction_avg])
dog_man_df = dog_man_df.astype({"Year": "string"})

In [None]:
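# Capture the axes of this plot; `ax` previously pointed at the "Price" box plot
ax = sns.lineplot(data = dog_man_df, x="Year", y="Reviews", hue="Legend", style="Legend")
ax.set(xlabel="Year", ylabel="Average reviews")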

Next, we can look at the data from the perspective of the author.

In [None]:
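# First, find which author has the most entries in the bestseller list.
# The "index" column was added by reset_index() above; after count() it holds the number of rows per author.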
df_author = df_new.groupby(["Author"]).count().reset_index()
df_author.sort_values(by=['index']).tail(1)["Author"]

Jeff Kinney has the most books on the bestseller lists. Code to draw the figure:

In [None]:
JK_df = df_new[df_new["Author"] == "Jeff Kinney"]
# numeric_only=True skips text columns such as "Name" and "Genre"
JK_df = JK_df.groupby(["Year"]).mean(numeric_only=True)
JK_df["Legend"] = "Jeff Kinney"
JK_df = JK_df[["Legend", "User Rating"]]
JK_df.reset_index(inplace=True)

df_fiction_avg = df[df["Genre"] == "Fiction"].groupby(["Year", "Genre"]).mean(numeric_only=True)
df_fiction_avg = df_fiction_avg.reset_index()[["Year", "User Rating"]]
df_fiction_avg["Legend"] = "Total"

JK_df = pd.concat([JK_df, df_fiction_avg])

In [None]:
sns.lineplot(data = JK_df, x="Year", y="User Rating", hue="Legend", style="Legend")
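
Both comparison plots follow the same pattern: average one column per year for a subset, label it, and stack it with the fiction average. A minimal sketch of that pattern as a reusable function is shown below; `yearly_mean` is a hypothetical helper and `toy` is made-up data that only borrows the column names of the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; the column names mirror the dataset.
toy = pd.DataFrame({
    "Author": ["Jeff Kinney", "Jeff Kinney", "Other", "Other"],
    "Genre": ["Fiction"] * 4,
    "Year": [2018, 2019, 2018, 2019],
    "User Rating": [4.8, 4.9, 4.4, 4.6],
})

def yearly_mean(frame, column, legend):
    """Average `column` per year and tag the result with a legend label."""
    out = frame.groupby("Year", as_index=False)[column].mean()
    out["Legend"] = legend
    return out

comparison = pd.concat(
    [
        yearly_mean(toy[toy["Author"] == "Jeff Kinney"], "User Rating", "Jeff Kinney"),
        yearly_mean(toy[toy["Genre"] == "Fiction"], "User Rating", "Fiction average"),
    ],
    ignore_index=True,
)
print(comparison)
```

Selecting the single column before calling `mean()` avoids averaging text columns, and the resulting frame can be passed to `sns.lineplot` with `hue="Legend"` as in the cells above.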

Feel free to upvote if you find this notebook useful. Thank you!