# Homework 2.3 Solutions | Bivariate Relationships by Category

*Homework is designed to both test your knowledge and challenge you to apply familiar concepts to new applications. Answer clearly and completely. You are welcome and encouraged to work in groups so long as your work is your own. Submit your figures and answers to [Gradescope](https://www.gradescope.com).*

In [None]:
# Imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# File Path
file_path = 'https://tayweid.github.io/econ-0150/parts/part-2-1/data/'

In [None]:
# Load Data
books = pd.read_csv(file_path + 'amazon_book_sales.csv', index_col=0)
books.head()

## Q1. Explore the Data

The dataset `amazon_book_sales.csv` contains the 50 bestselling Amazon books each year between 2009 and 2021. Prices are expressed in US dollars and rounded to the nearest dollar.

a) How many unique books are in the dataset? *(hint: use `.unique()` to count a book only once even if it shows up in multiple years)*

In [None]:
len(books.Name.unique())

There are 441 unique books in the dataset (out of 700 total rows, since some books appear in multiple years).
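
As a cross-check on the count above, `Series.nunique()` gives the number of distinct values in one call and should agree with `len(... .unique())`. The sketch below uses a small made-up table (the titles and years are hypothetical, not taken from `amazon_book_sales.csv`) to show how the row count and the unique count diverge when a title repeats.

```python
import pandas as pd

# Hypothetical toy data: 'Book A' is a bestseller in two different years
toy = pd.DataFrame({
    'Name': ['Book A', 'Book B', 'Book A', 'Book C'],
    'Year': [2009, 2009, 2010, 2010],
})

n_rows = len(toy)                 # every row, repeats included
n_unique = toy['Name'].nunique()  # each title counted once

print('Rows:', n_rows)
print('Unique titles:', n_unique)
```

On the real data the same two lines would be `len(books)` and `books['Name'].nunique()`.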

b) How many books are Fiction? How many are Non-Fiction? *(hint: use a filter)*

In [None]:
print('Fiction:', len(books[books.Genre == 'Fiction']))
print('Non Fiction:', len(books[books.Genre == 'Non Fiction']))

There are 312 Fiction entries and 388 Non-Fiction entries. These are row counts, so a book that appears in several years is counted once per year.
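
The filter approach counts rows, not distinct titles. A minimal sketch of both counts, using `value_counts()` as an alternative to one filter per genre, is shown below. The toy table is hypothetical and only illustrates the mechanics.

```python
import pandas as pd

# Hypothetical toy data: 'Book A' is a bestseller in two different years
toy = pd.DataFrame({
    'Name':  ['Book A', 'Book B', 'Book A', 'Book C', 'Book D'],
    'Genre': ['Fiction', 'Non Fiction', 'Fiction', 'Non Fiction', 'Non Fiction'],
})

# Rows per genre (what the filter-and-len approach counts)
entries = toy['Genre'].value_counts()

# Distinct titles per genre: keep one row per Name, then count
titles = toy.drop_duplicates(subset='Name')['Genre'].value_counts()

print(entries)
print(titles)
```

Either count can be a reasonable answer to the question, as long as the write-up says which one is being reported.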

## Q2. Ratings by Genre

The column `Genre` describes whether the bestseller is Fiction or Non-Fiction.

a) Create a boxplot of `User Rating` by `Genre`. Add a stripplot overlay.

In [None]:
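# whis=(0, 100) extends the whiskers to the minimum and maximum, so no points are flagged as outliers
sns.boxplot(data=books, x='Genre', y='User Rating', whis=(0, 100))
# Overlay the individual observations; alpha makes overlapping points visible
sns.stripplot(data=books, x='Genre', y='User Rating', alpha=0.3, color='black')
plt.title('User Rating by Genre')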

b) Calculate the mean and standard deviation of `User Rating` by `Genre`.

In [None]:
books.groupby('Genre')['User Rating'].agg(['mean', 'std'])

c) Which genre has higher average ratings? Is the difference large or small relative to the variation within each group?

Fiction has a slightly higher average rating (4.66 vs 4.62). The difference (~0.04) is very small relative to the standard deviation within each group (~0.19–0.25), so the two genres have largely overlapping rating distributions.
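
One rough way to put a number on "small relative to the variation" is to divide the gap in means by a typical within-group standard deviation. The sketch below does this on hypothetical ratings (not the homework data) and uses the simple average of the two group standard deviations as the yardstick, which is a simplification rather than a formal pooled estimate.

```python
import pandas as pd

# Hypothetical ratings, chosen only to illustrate the calculation
toy = pd.DataFrame({
    'Genre': ['Fiction'] * 4 + ['Non Fiction'] * 4,
    'User Rating': [4.8, 4.6, 4.4, 4.8, 4.7, 4.5, 4.6, 4.6],
})

stats = toy.groupby('Genre')['User Rating'].agg(['mean', 'std'])

# Gap between the group means
diff = stats.loc['Fiction', 'mean'] - stats.loc['Non Fiction', 'mean']

# Gap expressed in units of the average within-group standard deviation
ratio = abs(diff) / stats['std'].mean()

print(stats)
print('Difference in means:', round(diff, 3))
print('Difference / typical std:', round(ratio, 3))
```

With the figures reported above (a gap of about 0.04 and standard deviations of roughly 0.19 to 0.25), the same ratio comes out near 0.2, meaning the gap is only a fraction of one standard deviation.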

## Q3. Reviews and Ratings

a) Create a scatter plot of `Reviews` (x-axis) vs `User Rating` (y-axis) for all books.

In [None]:
sns.scatterplot(data=books, x='Reviews', y='User Rating')
plt.title('Reviews vs User Rating')

b) Is there a relationship between reviews and ratings? If so, is it positive, negative, or unclear?

There appears to be a weak positive relationship: books with more reviews tend to have slightly higher ratings. The pattern is hard to see, however, because `Reviews` is heavily right-skewed, which compresses most of the data into the left side of the plot.

c) How might you use a transformation to improve this figure?

Taking the log of `Reviews` would spread out the compressed data on the left side of the plot and make it easier to see the relationship. A log transformation is appropriate here because the reviews variable is heavily right-skewed.
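
To see why the log helps, consider review counts that each grow by a factor of ten. On the raw scale the gaps between them explode, while on the log scale they are evenly spaced. The counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical review counts, each ten times the previous one
reviews = np.array([100, 1_000, 10_000, 100_000])

raw_gaps = np.diff(reviews)       # gaps grow tenfold at each step
log_reviews = np.log(reviews)     # natural log, as in np.log
log_gaps = np.diff(log_reviews)   # every gap equals log(10)

print('Raw gaps:', raw_gaps)
print('Log gaps:', log_gaps)
```

Because equal ratios become equal distances, the handful of very heavily reviewed books no longer stretches the x-axis and squeezes everything else to the left.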

## Q4. Does the Relationship Differ by Genre?

a) Create a scatter plot of `Reviews` vs `User Rating`, colored by `Genre`. Use any transformations you suggested in Q3.c).

In [None]:
# Natural log of the review counts (assumes every count is positive, since log(0) is undefined)
books['Log_Reviews'] = np.log(books['Reviews'])
sns.scatterplot(data=books, x='Log_Reviews', y='User Rating', hue='Genre')
plt.title('Log Reviews vs User Rating by Genre')
plt.xlabel('Log(Reviews)')

b) Does the relationship between reviews and ratings differ between Fiction and Non-Fiction? Describe the pattern you observe.

Both genres show a weak positive relationship between log reviews and user ratings. However, Fiction books tend to have more reviews on average (the Fiction points are shifted to the right) and also show slightly more variation in ratings. Non-Fiction books are more tightly clustered in both reviews and ratings.
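
A visual impression like this can be checked by computing the correlation separately within each genre. The sketch below shows one way to do that with `groupby` and `Series.corr` (Pearson correlation by default). The toy values are hypothetical and do not reproduce the homework data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, three books per genre
toy = pd.DataFrame({
    'Genre': ['Fiction'] * 3 + ['Non Fiction'] * 3,
    'Reviews': [100, 1_000, 10_000, 200, 2_000, 20_000],
    'User Rating': [4.2, 4.5, 4.8, 4.6, 4.6, 4.7],
})
toy['Log_Reviews'] = np.log(toy['Reviews'])

# Pearson correlation between log reviews and rating, one value per genre
corr_by_genre = pd.Series({
    genre: group['Log_Reviews'].corr(group['User Rating'])
    for genre, group in toy.groupby('Genre')
})

print(corr_by_genre)
```

Running the same loop over `books` would give one correlation per genre to set beside the scatter plot.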

c) In one sentence, what might explain any differences you observe?

Fiction bestsellers may attract a broader and more passionate readership that is more likely to leave reviews, while Non-Fiction bestsellers may appeal to a more targeted audience with more consistent expectations and ratings.