# Bookcrossing User Review Analytics

A few questions that could be analysed with this data:

1. What does the rating distribution look like?

2. Is there a pattern where a set of users has consistently given higher or lower ratings to the books they reviewed?

3. Which category has the most books reviewed? Is there a rating pattern for certain categories of books?

4. Is a specific age group inclined towards a particular category, or are their choices diverse?

### Import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import os

In [None]:
for dirname,_,filenames in os.walk('../input/'):
    for filename in filenames:
        print(os.path.join(dirname,filename))

In [None]:
book_rating=pd.read_csv('../input/bookcrossing-dataset/Book reviews/Book reviews/BX-Book-Ratings.csv',sep=";",encoding='latin-1')
users=pd.read_csv('../input/bookcrossing-dataset/Book reviews/Book reviews/BX-Users.csv',sep=";",encoding='latin-1')
books=pd.read_csv('../input/bookcrossing-dataset/Book reviews/Book reviews/BX_Books.csv',sep=";",encoding='latin-1')
book_clean=pd.read_csv('../input/bookcrossing-dataset/Books Data with Category Language and Summary/Preprocessed_data.csv',index_col=0)

In [None]:
books.head()

In [None]:
book_rating.shape,users.shape,books.shape

In [None]:
print(f"Total Books :{books['ISBN'].nunique()}\nTotal Users:{users['User-ID'].nunique()}\nTotal Users who have rated:{book_rating['User-ID'].nunique()}")

### Data Quality Checks

In [None]:
## Check for nulls:
(book_rating.isna().sum()/book_rating.shape[0])*100

In [None]:
(books.isna().sum()/books.shape[0])*100

In [None]:
(users.isna().sum()/users.shape[0])*100
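
The three null checks above repeat the same expression, so they could be folded into one helper. The following is a minimal sketch on toy data; `null_percentages` and the toy dataframes are hypothetical names and values introduced for illustration, not part of the original notebook.

```python
import pandas as pd

def null_percentages(frames):
    """Return percent-missing per column, with one output column per named dataframe."""
    summary = {name: df.isna().mean() * 100 for name, df in frames.items()}
    return pd.DataFrame(summary).round(2)

# Toy stand-ins (made-up values) for the real dataframes
toy_users = pd.DataFrame({"User-ID": [1, 2, 3, 4], "Age": [25.0, None, None, 40.0]})
toy_ratings = pd.DataFrame({"User-ID": [1, 2], "Book-Rating": [0, 8]})

null_summary = null_percentages({"users": toy_users, "book_rating": toy_ratings})
print(null_summary)
```

A cell in the summary is NaN when that dataframe does not have the column at all, which keeps it distinct from a column that exists and has 0% missing values.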

`book_rating` has no null values in any column, whereas `books` has 2 nulls in the publisher column and 1 null in the book author column. The Age column in the `users` dataframe has a lot of nulls.

#### Check for duplicates:
While each user may well review multiple books, we can check whether there are any fully duplicated rows in the dataframes and remove them if there are any.

In [None]:
book_rating.duplicated().sum(),books.duplicated().sum(),users.duplicated().sum()

As seen from the results, there are no duplicated entries. A cleaned version of the data is already available to us as `book_clean`. Let's run the same quality checks on it and understand how the data was preprocessed.

In [None]:
book_clean.isna().sum()

Going by the counts, the city, state and country columns have null values. The age column, which we saw earlier with a lot of nulls, has been processed to remove them.

In [None]:
book_clean.duplicated().sum()

There are no duplicated rows. Let's use this dataframe for our analysis going forward.
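
`duplicated()` with no arguments only flags rows that match in every column. A stricter check, sketched below on made-up data, is whether the same user and book pair occurs more than once with different ratings. The column names `user_id`, `isbn` and `rating` mirror those used from `book_clean`; the values are hypothetical.

```python
import pandas as pd

# Toy data: user 1 appears twice for book "A", with different ratings
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "isbn": ["A", "A", "A", "B"],
    "rating": [8, 5, 0, 7],
})

# Rows identical in every column
full_row_dupes = int(toy.duplicated().sum())

# Rows that repeat an earlier (user_id, isbn) pair, whatever the rating
pair_dupes = int(toy.duplicated(subset=["user_id", "isbn"]).sum())

print(full_row_dupes, pair_dupes)
```

If the pair-level count were non-zero on the real data, counting rows per user and counting unique ISBNs per user would give different answers.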

## Exploratory Data Analysis:

### Rating Distribution

In [None]:
plt.figure(figsize=(12,8))
# Pass the column as x=; recent seaborn releases treat a positional argument as `data`
p=sns.countplot(x=book_clean['rating'],color='#88527F')
plt.title('Distribution of Book Ratings',fontsize=12)
plt.xlabel('Ratings',fontsize=10)
plt.ylabel('Frequency',fontsize=10)
# Annotate each bar with its count
for t in p.patches:
    p.annotate("{:.0f}".format(t.get_height()), (t.get_x() + t.get_width() / 2., t.get_height()),
         ha='center', va='center', fontsize=15, color='black', xytext=(0, 10),
         textcoords='offset points')

In [None]:
(book_clean['rating'].value_counts()/book_clean.shape[0])*100

About 6 lakh (600,000) records have a rating of 0, while 91K records carry a rating of 8. Going by the raw numbers, most of the non-zero ratings fall between 8 and 10.
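
Because zero ratings dominate the overall distribution, it can help to look at the percentages twice: once over all records and once over non-zero ratings only. The sketch below uses a small made-up series rather than the real `rating` column, and the variable names are illustrative.

```python
import pandas as pd

# Toy ratings (hypothetical values): six zeros and four non-zero ratings
toy_ratings = pd.Series([0, 0, 0, 0, 0, 0, 8, 8, 9, 5], name="rating")

# Percentage of each rating over all records
pct_all = toy_ratings.value_counts(normalize=True).sort_index() * 100

# Percentage of each rating among non-zero ratings only
rated = toy_ratings[toy_ratings > 0]
pct_rated = rated.value_counts(normalize=True).sort_index() * 100

# Share of non-zero ratings that fall in the 8 to 10 range
share_8_to_10 = rated.between(8, 10).mean() * 100

print(pct_all, pct_rated, share_8_to_10, sep="\n")
```

Applied to `book_clean['rating']`, the last figure would put a number on the claim that most non-zero ratings sit between 8 and 10.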

### Users who have provided the ratings

Now that we know about 62% of the records have a zero rating, we want to identify the users who have actually provided ratings. Let's first find what share of all users have done so.

In [None]:
(book_clean.loc[book_clean['rating']>0,'user_id'].nunique()/book_clean['user_id'].nunique())*100

Of all the users in the data, 73% have rated the books they have read. Let's get the top 10 users who have rated the most books.

In [None]:
top_users=book_clean.loc[book_clean['rating']>0].groupby('user_id')['isbn'].nunique().sort_values(ascending=False)[:10]
top_users

Let's understand the rating trend for these users alone.
For this analysis, let me group the ratings into 3 buckets: 0-4, 5-7 and 8-10. I arrived at these going by the rating trend seen earlier.

In [None]:
# Bucket ratings into three windows; a string default keeps the result dtype consistent
book_clean['rating_window']=np.select([book_clean['rating'].between(0,4),
                                       book_clean['rating'].between(5,7),
                                       book_clean['rating'].between(8,10)],
                                      ['0-4','5-7','8-10'],
                                      default='unknown')
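
As an alternative to `np.select`, the same three windows could be produced with `pd.cut`. This is a sketch on toy ratings, not the notebook's own method; the bin edges are chosen so that, with the default right-closed intervals, the windows cover 0-4, 5-7 and 8-10.

```python
import pandas as pd

# Toy ratings covering the edges of each window
toy = pd.DataFrame({"rating": [0, 4, 5, 7, 8, 10]})

# Intervals are (-1, 4], (4, 7], (7, 10] because pd.cut is right-closed by default
toy["rating_window"] = pd.cut(
    toy["rating"],
    bins=[-1, 4, 7, 10],
    labels=["0-4", "5-7", "8-10"],
)
print(toy)
```

`pd.cut` returns an ordered categorical, so the windows keep their natural order when used as heatmap columns.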

In [None]:
user_review=book_clean.loc[(book_clean['user_id'].isin(top_users.index)) & (book_clean['rating']>0),]
user_review.shape

In [None]:
review_count=user_review.groupby(['user_id','rating_window'])['isbn'].nunique()
# Share of each rating window within a user's rated books; transform keeps the original index
review_count=(review_count/review_count.groupby(level=0).transform('sum')).reset_index().rename(columns={'isbn':'value'})
review_count=review_count.pivot(index='user_id',columns='rating_window',values='value').fillna(0)

In [None]:
plt.figure(figsize=(10,8))
p=sns.heatmap(review_count,annot=True,cmap='viridis',fmt='.2%')
p.set_yticklabels(p.get_yticklabels(),rotation=0)
plt.title('Ratings Summary for the top 10 Users',fontsize=12)
plt.xlabel('Rating Window',fontsize=10)
plt.ylabel('User ID',fontsize=10)
plt.show()

A few insights inferred:
1. Among the top 10 users, 3 have given a rating of 8-10 to more than 80% of the books they have read.
2. User 11676 has ratings distributed between 5-7 and 8-10, whereas user 98391 has more than 95% of books rated 8-10.
3. Except for 2 users (189835 & 171118), all the users have rated more than 50% of their books between 8 and 10.
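
Insights like these can also be read off programmatically instead of from the heatmap. The sketch below uses made-up data and `pd.crosstab` with `normalize="index"`. Note that it counts rows, whereas the notebook counts unique ISBNs, so the two agree only if each user rates a given book once.

```python
import pandas as pd

# Toy data: user 1 rates mostly in 8-10, user 2 is spread out
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "rating_window": ["8-10", "8-10", "8-10", "8-10", "8-10", "5-7",
                      "5-7", "5-7", "8-10", "0-4"],
})

# Each row of `shares` sums to 1: the share of a user's ratings per window
shares = pd.crosstab(toy["user_id"], toy["rating_window"], normalize="index")

# Users with more than 80% of their ratings in the 8-10 window
high_raters = shares.index[shares["8-10"] > 0.8].tolist()
print(shares)
print(high_raters)
```

The 0.8 threshold matches the first insight above and can be changed to test the others.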

Let's also look at the individual ratings and check the trend.

In [None]:
review_count=user_review.groupby(['user_id','rating'])['isbn'].nunique()
# Share of each rating within a user's rated books; transform keeps the original index
review_count=(review_count/review_count.groupby(level=0).transform('sum')).reset_index().rename(columns={'isbn':'value'})
review_count=review_count.pivot(index='user_id',columns='rating',values='value').fillna(0)

plt.figure(figsize=(10,8))
p=sns.heatmap(review_count,annot=True,fmt='.2%',cmap='viridis')
p.set_yticklabels(p.get_yticklabels(),rotation=0)
plt.title('Ratings Summary for the top 10 Users',fontsize=12)
plt.xlabel('Ratings',fontsize=10)
plt.ylabel('User ID',fontsize=10)
plt.show()

**Analysis loading...**