<img src="http://s3-us-west-2.amazonaws.com/lndr-landorcom-assets-prd/app/uploads/2018/05/22233704/Amazon-Books_Flickr-Shinya-Suzuki_2-e1527064849416-1160x809.jpg" width="90%" align="left">

<b>Contents:</b>

1. <a href='#section1'> Importing libraries & reading data </a>
2. <a href='#section2'>Data Description </a>
3. <a href='#section3'>Data Cleaning  </a><br>
    <a href='#section3.1'>3.1. Changing of datatypes  </a><br>
    <a href='#section3.2'>3.2. Missing Values  </a><br>
    <a href='#section3.3'>3.3. Dealing with duplicates  </a><br>
4. <a href='#section4'>Visualisation </a><br>
    <a href='#section4.1'>4.1. Visualising Categorical Variables </a><br>
    <a href='#section4.2'>4.2. Visualising Numerical Variables </a><br>
5. <a href='#section5'>Correlation  </a> <br>
6. <a href='#section6'>Clustering  </a> 

<a id='section1'></a>

# <mark>1. Importing required libs & reading data </mark>

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np

import collections
from scipy.stats import shapiro

In [None]:
data = pd.read_csv('../input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv')
data

In [None]:
data.shape

<a id='section2'></a>

# <mark>2. Data Description </mark>

In [None]:
data.describe().T

-> <u>Datatypes of columns: </u>

In [None]:
data.dtypes

<a id='section3'></a>

# <mark>3. Data Cleaning </mark>

(data types, missing values, and duplicates)

<a id='section3.1'></a>

## <mark>3.1. Changing of Datatypes</mark>

-> <u> Using the category datatype for the Genre variable: its values repeat heavily, so the category type saves memory, and we want to group the data by genre later. </u>

In [None]:
data.Genre = data.Genre.astype('category')

In [None]:
data.dtypes
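
As a rough illustration of the memory argument above, the sketch below compares the footprint of a repetitive text column stored as `object` and as `category`. The `genres` series is made-up sample data, not the notebook's dataset, and the exact byte counts depend on the pandas version.

```python
import pandas as pd

# Hypothetical sample: two labels repeated many times, similar in shape to the Genre column
genres = pd.Series(['Fiction', 'Non Fiction'] * 1000, dtype='object')

# deep=True also counts the Python string objects, not only the pointers to them
object_bytes = genres.memory_usage(deep=True)
category_bytes = genres.astype('category').memory_usage(deep=True)

print('object:  ', object_bytes, 'bytes')
print('category:', category_bytes, 'bytes')
```

The category version stores each distinct label once plus a small integer code per row, which is why it tends to be much smaller when few labels repeat often.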

<a id='section3.2'></a>

## <mark>3.2. Missing Values</mark>

In [None]:
data.isnull().sum()

-> <u> There are no missing values </u>

<a id='section3.3'></a>

## <mark>3.3. Dealing with duplicates </mark>

In [None]:
data.duplicated(subset=['Name']).sum()

<u>There are 199 duplicated values in the Name column and 302 duplicated values in the Author column. </u>

In [None]:
data.duplicated(subset=['Author']).sum()

In [None]:
data.duplicated().sum()

<u>However, no row is duplicated in its entirety. <b>This is probably because some books remained bestsellers for several years.</b></u>

---

In [None]:
#with pd.option_context('display.max_rows', None):
display(data[data['Name'].duplicated()].sort_values(by=['Name']))


In [None]:
count_books = collections.Counter(data['Name'].tolist())
count_books.most_common(50)

<u> Checking whether the same author appears under different spellings:</u>

In [None]:
authors = data['Author'].sort_values().unique()
authors

In [None]:
authors = data['Author'].unique()
authors.size

###### -> The author names George R. R. Martin / George R.R. Martin and J. K. Rowling / J.K. Rowling are spelled differently, although each pair refers to the same author. 

However, checking for duplicates by eye like this is tedious and prone to human error, so it should not be the main method. The fuzzywuzzy package for Python offers string-matching functions for this purpose; it uses the Levenshtein distance to measure the difference between two sequences.
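
To make the idea concrete, here is a minimal sketch that computes the Levenshtein distance by hand and flags author names that are within a small edit distance of each other. It does not use fuzzywuzzy; the helper names `levenshtein` and `near_duplicates`, the `max_distance` threshold, and the short `sample_authors` list are assumptions made for illustration. In the notebook, `data['Author']` could be passed instead of the sample list.

```python
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def near_duplicates(names, max_distance=2):
    """Pairs of distinct names whose edit distance is at most max_distance."""
    unique_names = sorted(set(names))
    return [(a, b) for a, b in combinations(unique_names, 2)
            if levenshtein(a, b) <= max_distance]

# Hypothetical sample; replace with data['Author'] in the notebook
sample_authors = ['George R. R. Martin', 'George R.R. Martin',
                  'J. K. Rowling', 'J.K. Rowling', 'Jeff Kinney']
suspects = near_duplicates(sample_authors)
print(suspects)
```

The flagged pairs are only candidates: short names can be close in edit distance and still belong to different people, so each pair should be reviewed before replacing anything.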

In [None]:
# Normalise the two spellings in the Author column
data['Author'] = data['Author'].replace({'George R. R. Martin': 'George R.R. Martin',
                                         'J. K. Rowling': 'J.K. Rowling'})

In [None]:
authors = data['Author'].unique()
authors.size

<u>As we can see, the number of unique authors has decreased by 2.</u>

In [None]:
genre = data['Genre'].unique()
genre

In [None]:
years = list(data['Year'].unique())
sorted(years)

<u>There are no inconsistently spelled values in Genre or Year. </u>

---

In [None]:
data[data['Name'].duplicated() == True]

<u>As we know, some books remained bestsellers across several years. We will remove the Year column and store the dataframe in the 'data_without_year' variable. </u>

In [None]:
data_without_year = data.drop(['Year'], axis = 1)
data_without_year

<u> Removing duplicates to check how many books are left in the data </u>

In [None]:
data_without_year = data_without_year.drop_duplicates(keep='first')
data_without_year.info()

<u> Earlier there were 550 rows; now only 361 remain. </u>

---

<u> Checking whether the duplicates are completely removed or not. </u>

<u>For this I'm using the Counter class from the collections library. A Counter is a dictionary subclass in which elements are stored as keys and their counts are stored as values. </u>

In [None]:
count_books = collections.Counter(data_without_year['Name'].tolist())
count_books.most_common(10)
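
For readers unfamiliar with `Counter`, the following small sketch shows the same check on a made-up list of titles (the `titles` list is illustrative, not taken from the dataset). Any title with a count above 1 is still duplicated.

```python
from collections import Counter

# Hypothetical list of titles with some repeats
titles = ['Gone Girl', 'Becoming', 'Gone Girl', 'Gone Girl', 'Becoming', 'Educated']

counts = Counter(titles)

# most_common() returns (element, count) pairs sorted by count, highest first
repeated = [(title, n) for title, n in counts.most_common() if n > 1]
print(repeated)
```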

<u> Duplicates are still there. But Why!? </u>

In [None]:
data_without_year[data_without_year['Name'] == 'Gone Girl']

-> <u> The records differ only in price. Apparently the book was priced differently in different years, which is plausible given inflation, demand, etc. Let's keep only the first entry for each book. </u> 

In [None]:
data_without_year_dup = data_without_year.drop_duplicates(subset='Name', keep='first')
data_without_year_dup
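
The difference between the two `drop_duplicates` calls used above can be seen on a tiny example. This is only a sketch on made-up rows (the `books` frame and its values are hypothetical): comparing whole rows keeps a title once per distinct price, while `subset='Name'` keeps one row per title.

```python
import pandas as pd

# Hypothetical rows: the same title listed at two prices, plus one exact repeat
books = pd.DataFrame({
    'Name':  ['Book A', 'Book A', 'Book A', 'Book B'],
    'Price': [10, 10, 8, 12],
})

# Full-row comparison: only the exact repeat is removed
by_row = books.drop_duplicates(keep='first')

# Name-only comparison: one row per title, keeping the first occurrence
by_name = books.drop_duplicates(subset='Name', keep='first')

print(len(by_row), len(by_name))
```

Note that `keep='first'` silently decides which price survives, so the retained price depends on the row order of the input.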

-> <u><b>CONCLUSION: Thus, the data contains 351 different books written by 246 authors. All books are presented in two categories (Non Fiction, Fiction). </b></u>

<a id='section4'></a>

# <mark>4. Visualisation </mark>

Now that we are done with data cleaning, let's begin with the visualisations.

<a id='section4.1'></a>

## <mark>4.1. Visualising Categorical Variables</mark>

<u>Checking the top 15 authors by average user rating. </u>

In [None]:
top_authors = data_without_year_dup.groupby('Author')[['User Rating']].mean()\
                                                                .sort_values('User Rating', ascending=False).head(15)

top_authors

<u> Authors who have written the most books </u>

In [None]:
# Count on the de-duplicated frame so that a book listed in several years is counted once
count_author_freq = collections.Counter(data_without_year_dup['Author'].tolist())
count_author_freq = count_author_freq.most_common(15)
count_author_freq

In [None]:
# most_common() returns a list of (author, count) tuples
count_author_freq = pd.DataFrame(count_author_freq, columns=['Author', 'Books'])
count_author_freq.plot.bar(x='Author', y='Books')

<u> Books with the most reviews: </u>

In [None]:
books_highest_reviews = data_without_year_dup.groupby('Name')[['Reviews']].sum()\
                                                                        .sort_values('Reviews', ascending=False).head(10)
books_highest_reviews

In [None]:
books_highest_reviews.plot.bar()

<u> Categorising & plotting genres: </u>

In [None]:
number_of_books_by_genre = data_without_year_dup.groupby('Genre')[['Name']].count()\
                                                                 .sort_values('Name', ascending=False).head(10)

number_of_books_by_genre

In [None]:
number_of_books_by_genre.plot.pie(y='Name', figsize=(5, 5), autopct="%.1f%%")

###### INSIGHT: By analyzing the categorical data:

1. <b>The following authors have the highest ratings: </b><br>
Nathan W. Pyle, Patrick Thorpe, Eric Carle, Emily Winfield Martin, Chip Gaines, Jill Twiss, Rush Limbaugh, Sherri Duskey Rinker, Alice Schertle, Pete Souza, Sarah Young. The average rating for their works was 4.9, so these authors are worth considering when buying a new book.

2. <b> Authors who have written the most bestsellers: </b><br>
Jeff Kinney - 12 books, Rick Riordan - 10 books, J.K. Rowling - 8 books, Stephenie Meyer - 7 books, Dav Pilkey - 6 books, Bill O'Reilly - 6 books, John Grisham - 5 books, E L James - 5 books, Suzanne Collins - 5 books, Charlaine Harris - 4 books. There is always something to read from these authors.

3. <b>Books with the most reviews: </b><br>
Where The Crawdads Sing - 87841 Reviews, The Girl On The Train - 79446 Reviews, Becoming - 61133 Reviews, Gone Girl - 57271 Reviews, The Fault In Our Stars - 50482 Reviews. Where The Crawdads Sing is the most talked-about book and is definitely worth reading.

4. <b> Non Fiction books make up a larger share of the bestsellers than Fiction books. </b>

<a id='section4.2'></a>

## <mark>4.2. Visualising Numerical Variables</mark>

In [None]:
data_without_year_dup.describe().T

In [None]:
sns.boxplot(x=data_without_year_dup['User Rating'])

In [None]:
sns.boxplot(x=data_without_year_dup['Reviews'])

In [None]:
sns.boxplot(x=data_without_year_dup['Price'])

##### INSIGHT: By analyzing the numeric data:

1. <b>User Rating: </b> 

- Data is not distributed normally. Asymmetry is observed.
- The mean and median book ratings are both 4.6.
- There are outliers in the data: a small number of books are rated below 4.1.

2. <b>Reviews: </b>

- Data is not distributed normally. Asymmetry is observed.

3. <b>Price:</b>

- Data is not distributed normally. Asymmetry is observed.
- Some books cost much more than the average, and some have a price of 0. Either those books were given away for free or this is a data error.

<a id='section5'></a>

# <mark> 5. Correlation </mark>

In [None]:
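# numeric_only=True skips the text columns (Name, Author, Genre), which recent pandas versions no longer drop silently
data_without_year_dup.corr(numeric_only=True)

In [None]:
sns.heatmap(data_without_year_dup.corr(numeric_only=True))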

In [None]:
sns.scatterplot(x=data_without_year_dup['User Rating'], y=data_without_year_dup['Reviews'])

In [None]:
sns.scatterplot(x=data_without_year_dup['User Rating'], y=data_without_year_dup['Price'])

In [None]:
sns.scatterplot(x=data_without_year_dup['Price'], y=data_without_year_dup['Reviews'])

<u><b>INSIGHT: The correlation matrix and the scatter plots show no clear positive or negative linear relationship between the rating, the number of reviews and the price of books. </b></u>
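
As a reminder of what these coefficients measure, here is a small hand-rolled sketch of the Pearson correlation, which is the default method of pandas' `corr()`. The function name `pearson_r` and the toy values are assumptions for illustration only. It shows that a coefficient near 0 rules out a linear relationship, not every kind of relationship.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by the two standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy values, not taken from the dataset
xs = [-2, -1, 0, 1, 2]
r_linear = pearson_r(xs, [2 * x + 1 for x in xs])  # perfectly linear
r_curved = pearson_r(xs, [x ** 2 for x in xs])     # clearly related, but not linearly

print(r_linear, r_curved)
```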

#### Testing Hypothesis

The null hypothesis of the Shapiro-Wilk test is that the population is normally distributed. Thus, if the p-value is less than the chosen alpha level, the null hypothesis is rejected and there is evidence that the tested data are not normally distributed.

In [None]:
alpha=0.05

stat, pval = shapiro(data_without_year_dup['User Rating'])

print('P-Value:', f'{pval:.20f}')

if pval < alpha:
    print('Reject H0 - Data is not distributed normally.')
else:
    # A large p-value does not prove normality; it only means H0 cannot be rejected
    print('Fail to reject H0 - No evidence against normality.')

<a id='section6'></a>

# <mark> 6. Clustering </mark>

In [None]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

In [None]:
# dropping the text columns that are not used for clustering
data_for_cluster = data_without_year_dup.drop(columns=['Name','Author'])

In [None]:
data_for_cluster.head(2)

In [None]:
data_for_cluster.Genre = data_for_cluster.Genre.astype('object')
data_for_cluster.dtypes

In [None]:
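# Encode the genre as 0/1; assigning the result avoids the chained inplace replace
data_for_cluster['Genre'] = data_for_cluster['Genre'].map({'Fiction': 0, 'Non Fiction': 1}).astype(int)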
data_for_cluster.head(2)

In [None]:
X = data_for_cluster.values
X

In [None]:
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i,max_iter=300,random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss,'bo-')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

In [None]:
# choosing the number of clusters as 3
kmeans = KMeans(n_clusters=3,init = 'k-means++', random_state = 100)
y = kmeans.fit_predict(X)

In [None]:
kmeans.cluster_centers_

In [None]:
kmeans.labels_

In [None]:
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True,figsize=(14,6))

ax1.set_title('K Means Review/Rating')
ax1.scatter(data_for_cluster['Reviews'],data_for_cluster['User Rating'],c=kmeans.labels_,cmap='rainbow')

ax2.set_title("K Means Price/Rating ")
ax2.scatter(data_for_cluster['Price'],data_for_cluster['User Rating'],c=kmeans.labels_,cmap ='rainbow')