# Homework 5

**Ryan Kulyassa**

**Due 3/4/25**

Assignment Goals:
1. Continue practicing bar plots.
2. Use indicator variables to highlight a single bar in a bar plot.
3. Practice making bar plots with facets.

## Part 1: Practice

1. Use the top-500-books dataset we've previously used in class for this assignment.

In [27]:
import pandas as pd

df = pd.read_csv("data/top-500-novels-metadata_2025-01-11.csv")

df.head()

| Unnamed: 0 | top_500_rank | title | author | pub_year | orig_lang | genre | author_birth | author_death | author_gender | author_primary_lang | ... | gr_num_ratings | gr_num_reviews | gr_avg_rating_rank | gr_num_ratings_rank | oclc_owi | author_viaf | gr_url | wiki_url | pg_eng_url | pg_orig_url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Don Quixote | Miguel de Cervantes | 1605 | Spanish | action | 1547 | 1616 | male | spa | ... | 269435 | 12053 | 318 | 211 | 1810748000.0 | 17220427.0 | https://www.goodreads.com/book/show/3836.Don_Q... | https://en.wikipedia.org/wiki/Don_Quixote | https://www.gutenberg.org/cache/epub/996/pg996... | https://www.gutenberg.org/cache/epub/2000/pg20... |
| 1 | 2 | Alice's Adventures in Wonderland | Lewis Carroll | 1865 | English | fantasy | 1832 | 1898 | male | eng | ... | 561016 | 15380 | 172 | 133 | 11561320000.0 | 66462036.0 | https://www.goodreads.com/book/show/24213.Alic... | https://en.wikipedia.org/wiki/Alice%27s_Advent... | https://www.gutenberg.org/cache/epub/11/pg11.txt | |
| 2 | 3 | The Adventures of Huckleberry Finn | Mark Twain | 1884 | English | action | 1835 | 1910 | male | eng | ... | 1262480 | 19440 | 373 | 68 | 3373178000.0 | 50566653.0 | https://www.goodreads.com/book/show/2956.The_A... | https://en.wikipedia.org/wiki/Adventures_of_Hu... | https://www.gutenberg.org/cache/epub/76/pg76.txt | |
| 3 | 4 | The Adventures of Tom Sawyer | Mark Twain | 1876 | English | action | 1835 | 1910 | male | eng | ... | 931898 | 13603 | 301 | 88 | 3373178000.0 | 50566653.0 | https://www.goodreads.com/book/show/24583.The_... | https://en.wikipedia.org/wiki/The_Adventures_o... | https://www.gutenberg.org/cache/epub/74/pg74.txt | |
| 4 | 5 | Treasure Island | Robert Louis Stevenson | 1883 | English | action | 1850 | 1894 | male | eng | ... | 486155 | 16307 | 368 | 145 | 3434.0 | 95207986.0 | https://www.goodreads.com/book/show/295.Treasu... | https://en.wikipedia.org/wiki/Treasure_Island | https://www.gutenberg.org/cache/epub/120/pg120... | |


2. Create an aesthetic bar plot that uses an indicator variable to highlight a single bar.

In [60]:
import plotly.express as px

df_book_count = (
    df
    # Group by author
    .groupby('author')
    # Aggregate by counting the number of books by each author
    .agg(
        book_count=('title', 'count')
    )
    .reset_index()
    # Sort
    .sort_values(by='book_count', ascending=True)
    # Filter to authors with more than 5 books
    [lambda x: x['book_count'] > 5]
    .rename(
        columns={
            'book_count': 'Books Published',
            'author': 'Author'
        }
    )
)

# Add indicator variable: True for the author to highlight, False otherwise
df_book_count['flag'] = df_book_count['Author'] == 'John Grisham'

# display(df_book_count)

# Create chart

fig = px.bar(
    df_book_count,
    x='Books Published',
    y='Author',
    template = 'simple_white',
    title = '<b>Most Published Authors: Who Leads the Pack?</b>',
    subtitle = 'With 19 books published, John Grisham tops the list ahead of Charles Dickens and others.',
    height = 500,
    width = 1000,
    color_discrete_sequence=['lightgray', 'palegreen'],
    color = 'flag',
    text = 'Books Published'
)
fig.update_layout(
    yaxis_ticks = "",
    title_font_family = 'Lora',
    font_family = 'Sora',
    title_font_size = 22,
    margin = {'t':130},
    showlegend = False
)
fig.update_traces(textposition = 'outside')
fig.update_xaxes(visible = False)

fig.show()
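
The chart above pairs `color_discrete_sequence` with the flag column, so which color lands on which bar depends on the order in which the flag values first appear in the data. A minimal sketch of a more explicit alternative follows, assuming a small made-up table in place of the real author counts. The author names, the `add_highlight` helper, and the `"highlight"`/`"other"` labels are all hypothetical and not part of the assignment.

```python
import pandas as pd

# Hypothetical toy data standing in for the per-author book counts
counts = pd.DataFrame({
    "Author": ["Author A", "Author B", "Author C"],
    "Books Published": [6, 9, 19],
})


def add_highlight(frame, column, value):
    """Return a copy with a string 'flag' column marking rows where column == value."""
    out = frame.copy()
    # The boolean comparison is mapped to readable labels
    out["flag"] = (out[column] == value).map({True: "highlight", False: "other"})
    return out


flagged = add_highlight(counts, "Author", "Author C")

# One explicit color per label, independent of row order
highlight_colors = {"highlight": "palegreen", "other": "lightgray"}

print(flagged)
```

A dictionary like `highlight_colors` could then be passed to `px.bar` through its `color_discrete_map` argument (together with `color='flag'`), so the highlighted bar keeps its color even if the rows are re-sorted.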

3. Create an aesthetic faceted bar plot that also uses an indicator variable to highlight a single bar.

In [92]:
# faceted bar plot

# copy original df
df_fig2 = df.copy()

# add indicator variable: True for fantasy, False otherwise
df_fig2['flag'] = df_fig2['genre'] == 'fantasy'

# remove rows where genre is 'na'
# must be done after adding the indicator variable; otherwise pandas throws a warning
# (the flag would be assigned to a filtered slice of df_fig2)
df_fig2 = df_fig2[df_fig2['genre'] != 'na']

fig2 = px.bar(
    df_fig2,
    x = 'genre',
    y = 'gr_num_ratings',
    color = 'flag',
    # Add a facet variable
    facet_col = 'author_gender',
    # Change the number of columns
    facet_col_wrap = 1,
    height = 800,
    color_discrete_sequence=['lightgray', 'palegreen'],
    # Different TEMPLATE (with gridlines)
    template = 'none',
    title = "<b>Fantasy Reigns Supreme: The Most Rated Genre on Goodreads.</b>",
    subtitle = "Across both male and female authors, fantasy novels have received the highest number of ratings on Goodreads,<br>highlighting the genre’s widespread popularity and reader engagement.",
    width = 900,
)
fig2.update_layout(
    showlegend = False,
    margin = {'t':200},
    font_family = 'Helvetica',
    title_font_family = 'Baskerville',
    title_font_size = 22,
    # Left align title
    title_x = 0.068 # Start title 6.8% into the figure
)
fig2.for_each_xaxis(lambda xaxis: xaxis.update(showticklabels=True))
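
The ordering constraint noted in the cell above (flag first, filter second) comes from assigning a column to a filtered slice. A minimal sketch of one way around it, assuming a made-up four-row table rather than the real dataset (the titles and the `books`/`books_known` names are hypothetical): filter first, take an explicit `.copy()`, and only then add the flag.

```python
import pandas as pd

# Hypothetical toy data; 'na' is a literal string marking an unknown genre
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "genre": ["fantasy", "na", "romance", "fantasy"],
})

# Filter first, then copy so the result owns its data
books_known = books[books["genre"] != "na"].copy()

# Adding a column to the explicit copy leaves the original frame untouched
books_known["flag"] = books_known["genre"] == "fantasy"

print(books_known)
```

With the explicit copy, the order of the two steps no longer matters for correctness; the trade-off is one extra copy of the filtered rows in memory.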

4. Convert the bar plot of Example 3 into grouped and stacked bar plots.

In [None]:
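# grouped bar plot
# Sum ratings per genre and gender first: with barmode='group', un-aggregated rows
# that share a genre and gender are drawn over one another instead of being added up
df_grouped = (
    df_fig2
    .groupby(['genre', 'author_gender'], as_index=False, sort=False)
    .agg(gr_num_ratings=('gr_num_ratings', 'sum'))
)

fig_grouped = px.bar(
    df_grouped,
    x='genre',
    y='gr_num_ratings',
    color='author_gender',  # Group by gender instead of faceting
    barmode='group',  # Ensures bars are placed side-by-side
    height=800,
    width=900,
    color_discrete_sequence=['lightblue', 'salmon'],
    template='none',
    title="<b>Fantasy Reigns Supreme: Goodreads Ratings by Genre and Gender</b>",
    subtitle="Fantasy leads in ratings for both male and female authors, with scifi and romance following closely.",
)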

fig_grouped.update_layout(
    font_family='Helvetica',
    title_font_family='Baskerville',
    title_font_size=22,
    title_x=0.068,  # Align title
)

fig_grouped.show()

# stacked bar plot
fig_stacked = px.bar(
    df_fig2,
    x='genre',
    y='gr_num_ratings',
    color='author_gender',  # Different colors for gender
    barmode='stack',  # Stacks bars on top of each other
    height=800,
    width=900,
    color_discrete_sequence=['lightblue', 'salmon'],
    template='none',
    title="<b>Fantasy Reigns Supreme: Stacked Goodreads Ratings by Genre and Gender</b>",
    subtitle="Stacked representation shows the cumulative impact of male and female authors in each genre.",
)

fig_stacked.update_layout(
    font_family='Helvetica',
    title_font_family='Baskerville',
    title_font_size=22,
    title_x=0.068,  # Align title
)

fig_stacked.show()

***Among all three visualizations (faceted, stacked, and grouped), which do you prefer and why?***

I prefer the stacked bar plot for this visualization because it shows two things at once: which genre is most popular overall (each column's height is the sum across both genders) and how the ratings split between genders within each genre. For example, in most genres the majority of ratings go to books by male authors, while in scifi and romance the majority go to books by female authors.
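
The gender breakdown described above can also be checked numerically rather than read off the chart. A minimal sketch, assuming a small made-up table in place of the real dataset: the rows and numbers below are hypothetical, and only the column names `genre`, `author_gender`, and `gr_num_ratings` come from the notebook.

```python
import pandas as pd

# Hypothetical toy data, one row per book
ratings = pd.DataFrame({
    "genre": ["fantasy", "fantasy", "romance", "romance", "action"],
    "author_gender": ["male", "female", "female", "male", "male"],
    "gr_num_ratings": [500, 300, 400, 100, 250],
})

# Total ratings per genre (rows) and author gender (columns)
totals = ratings.pivot_table(
    index="genre",
    columns="author_gender",
    values="gr_num_ratings",
    aggfunc="sum",
    fill_value=0,
)

# Gender with the larger total in each genre
leading_gender = totals.idxmax(axis=1)

# Share of each genre's ratings that go to books by female authors
female_share = totals["female"] / totals.sum(axis=1)

print(totals)
print(leading_gender)
print(female_share)
```

Run against the real `df_fig2`, the same pivot would give the stacked segment heights directly, which makes it possible to confirm which genres each gender leads before stating it in a subtitle.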