# Notebook 2: Exploratory Data Analysis (EDA)

**Objective:** Load the cleaned data from `artifacts/` and perform exploratory data analysis to understand the data distributions and find insights.

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.offline as py
import plotly.graph_objects as go
import os

py.init_notebook_mode(connected=True)

# Define file paths
ARTIFACTS_DIR = '../artifacts'
BOOKS_PATH = os.path.join(ARTIFACTS_DIR, 'cleaned_books.pkl')
USERS_PATH = os.path.join(ARTIFACTS_DIR, 'cleaned_users.pkl')
RATINGS_PATH = os.path.join(ARTIFACTS_DIR, 'cleaned_ratings.pkl')

In [2]:
# Load cleaned data
books_df = pd.read_pickle(BOOKS_PATH)
users_df = pd.read_pickle(USERS_PATH)
ratings_df = pd.read_pickle(RATINGS_PATH)

print("Data loaded successfully!")

Data loaded successfully!


## 1. Analysis of Book Ratings

In [3]:
rating_counts = ratings_df['book_rating'].value_counts().sort_index()
fig = px.bar(rating_counts, 
             x=rating_counts.index, 
             y=rating_counts.values, 
             title='Distribution of Book Ratings',
             labels={'x': 'Rating', 'y': 'Count'},
             color=rating_counts.index,
             template='plotly_dark'
            )
fig.show()

**Inference:**
The vast majority of ratings are '0', which marks an implicit rating (a user interaction without an explicit score). Among the explicit ratings (1-10), the distribution is skewed towards the higher end: '8' is the most common rating, followed by '7', '10', and '9'.
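To put numbers on this split, the implicit share and the explicit-only distribution can be computed separately. The following is a minimal sketch on a toy frame; `toy_ratings` and its values are made up for illustration and only stand in for `ratings_df`.

```python
import pandas as pd

# Toy stand-in for ratings_df (hypothetical values, not the real data).
toy_ratings = pd.DataFrame({"book_rating": [0, 0, 0, 0, 0, 0, 8, 8, 7, 10]})

# Implicit interactions are stored as 0; explicit scores run from 1 to 10.
is_implicit = toy_ratings["book_rating"] == 0
implicit_share = is_implicit.mean()

# Distribution of explicit scores only, expressed as proportions.
explicit_dist = (
    toy_ratings.loc[~is_implicit, "book_rating"]
    .value_counts(normalize=True)
    .sort_index()
)

print(f"Implicit share: {implicit_share:.1%}")
print(explicit_dist)
```

Running the same lines against `ratings_df` would show how much of the bar chart is driven by the '0' bar alone.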

## 2. Analysis of User Age

In [4]:
fig = px.histogram(users_df, 
                   x='age', 
                   nbins=50, 
                   title='Distribution of User Ages',
                   template='plotly_dark'
                  )
fig.show()

**Inference:**
The distribution of user ages peaks sharply in the 25-35 range. The median age we used for imputation falls in this range, so much of the peak comes from 'NaN' values that were assigned the median. Even allowing for that, the primary user base appears to be in their 20s and 30s.
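One rough way to gauge how much of the peak could come from imputation is to measure the share of users whose age equals the median exactly. The sketch below uses a toy frame (`toy_users` is hypothetical) and assumes imputed rows hold exactly the median value. The result is only an upper bound, because some users genuinely are that age.

```python
import pandas as pd

# Toy stand-in for users_df (hypothetical values, not the real data).
toy_users = pd.DataFrame(
    {"age": [22.0, 31.0, 31.0, 31.0, 31.0, 45.0, 28.0, 31.0, 60.0, 19.0]}
)

median_age = toy_users["age"].median()

# Rows sitting exactly on the median: imputed rows plus users genuinely that age.
at_median = toy_users["age"] == median_age
share_at_median = at_median.mean()

# Median of the remaining ages, as a rough view of the non-imputed users.
median_without_spike = toy_users.loc[~at_median, "age"].median()

print(f"Median age: {median_age}")
print(f"Share of users at the median: {share_at_median:.1%}")
print(f"Median of remaining ages: {median_without_spike}")
```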

## 3. Analysis of Top Authors

In [5]:
top_authors = books_df['book_author'].value_counts().head(20)
fig = px.bar(top_authors, 
             x=top_authors.values, 
             y=top_authors.index, 
             orientation='h', 
             title='Top 20 Authors by Number of Books',
             labels={'x': 'Number of Books', 'y': 'Author'},
             template='plotly_dark'
            )
fig.update_yaxes(autorange="reversed")
fig.show()

**Inference:**
Agatha Christie, William Shakespeare, and Stephen King have the most book entries in this dataset, which indicates a strong presence of classic and popular fiction.

## 4. Analysis of Top Publishers

In [6]:
top_publishers = books_df['publisher'].value_counts().head(20)
fig = px.bar(top_publishers, 
             x=top_publishers.values, 
             y=top_publishers.index, 
             orientation='h', 
             title='Top 20 Publishers by Number of Books',
             labels={'x': 'Number of Books', 'y': 'Publisher'},
             template='plotly_dark'
            )
fig.update_yaxes(autorange="reversed")
fig.show()

**Inference:**
Mass-market publishers like Harlequin, Silhouette, Pocket, and Ballantine Books dominate the dataset. This suggests a large collection of paperback fiction, particularly romance and mystery.

## 5. Analysis of Publication Year

In [7]:
# Drop rows at the median year (the value used for imputation) to remove the spike, just for this plot.
# Note: this also drops books genuinely published in that year.
median_year = books_df['year_of_publication'].median()
plot_df = books_df[books_df['year_of_publication'] != median_year]

fig = px.histogram(plot_df, 
                   x='year_of_publication', 
                   nbins=100, 
                   title='Distribution of Publication Years (Imputed values removed for plot)',
                   template='plotly_dark'
                  )
fig.show()

**Inference:**
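After removing rows at the imputed median year (which would otherwise create a massive spike), a clear trend emerges: the number of books per publication year rises sharply from the 1950s onwards and peaks in the late 1990s and early 2000s. This aligns with the dataset's collection period. Because the filter also drops books genuinely published in the median year, any dip at that year is an artifact of the filter rather than a feature of the data.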
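Filtering on the median value cannot tell imputed rows apart from books actually published in that year. One alternative, sketched below under the assumption that the cleaning step can be adjusted, is to record a flag before filling the gaps. The `year_was_imputed` column and the `toy_books` frame are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw books table before imputation (hypothetical values).
toy_books = pd.DataFrame(
    {"year_of_publication": [1995.0, np.nan, 2001.0, 1999.0, np.nan, 1999.0]}
)

# Record which rows are missing *before* filling them.
toy_books["year_was_imputed"] = toy_books["year_of_publication"].isna()

median_year = toy_books["year_of_publication"].median()
toy_books["year_of_publication"] = toy_books["year_of_publication"].fillna(median_year)

# Filtering on the flag keeps books genuinely published in the median year...
observed = toy_books.loc[~toy_books["year_was_imputed"]]

# ...whereas filtering on the value drops them along with the imputed rows.
value_filtered = toy_books.loc[toy_books["year_of_publication"] != median_year]

print(len(observed), len(value_filtered))
```

In this toy example the flag-based filter keeps four rows, while the value-based filter keeps only two, because the two books really published in the median year are lost.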

## 6. Analysis of User Location (Top Countries)

In [8]:
top_countries = users_df['country'].value_counts().head(20)
fig = px.bar(top_countries, 
             x=top_countries.values, 
             y=top_countries.index, 
             orientation='h', 
             title='Top 20 User Countries',
             labels={'x': 'Number of Users', 'y': 'Country'},
             template='plotly_dark'
            )
fig.update_yaxes(autorange="reversed")
fig.show()

**Inference:**
The dataset is overwhelmingly dominated by users from 'usa', followed by 'canada', 'united kingdom', 'germany', and 'spain'. The 'unknown' category is also large; it stems from missing location data.
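The degree of dominance is easier to compare as proportions than as raw counts. The sketch below computes each country's share with and without the 'unknown' bucket. It uses a toy frame (`toy_users` is hypothetical) in place of `users_df`.

```python
import pandas as pd

# Toy stand-in for users_df (hypothetical values, not the real data).
toy_users = pd.DataFrame(
    {"country": ["usa"] * 6 + ["canada"] * 2 + ["unknown", "germany"]}
)

# Share of all users per country, including the 'unknown' bucket.
country_share = toy_users["country"].value_counts(normalize=True)

# Share among users with a known country only.
known = toy_users.loc[toy_users["country"] != "unknown", "country"]
known_share = known.value_counts(normalize=True)

print(country_share)
print(known_share)
```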