# Welcome to my data visualizations using Altair!

## We're going to try to find out which factors seem to impact the rankings of these books most!

### To get started, let's import our datasets, align our column names, and merge them together!
###### Thanks GitHub Copilot for help with the visualizations

In [49]:
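import pandas as pd
import altair as alt
# Emit charts as MIME bundles for frontends that can render Vega-Lite
alt.renderers.enable('mimetype')
# Load the top 500 novels and the NYT bestseller lists (tab-separated)
novels_df = pd.read_csv("https://raw.githubusercontent.com/melaniewalsh/responsible-datasets-in-context/main/datasets/top-500-novels/library_top_500.csv")
nyt_bestsellers_df = pd.read_csv("https://raw.githubusercontent.com/ecds/post45-datasets/main/nyt_full.tsv", sep="\t")
# Keep the original NYT title, then build a 'title' column to join on.
# Note: capitalize() lower-cases everything after the first character, so only
# titles whose casing then matches exactly (e.g. "Room") will join.
nyt_bestsellers_df = nyt_bestsellers_df.rename(columns={'title': 'nyt_title'})
nyt_bestsellers_df['title'] = nyt_bestsellers_df['nyt_title'].str.capitalize()
# Left merge keeps every novel; a novel with several NYT weeks appears once per week
combined_novels_nyt_df = novels_df.merge(nyt_bestsellers_df, how='left', on=['author', 'title'])
combined_novels_nyt_df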

Unnamed: 0,top_500_rank,title,author,pub_year,orig_lang,genre,author_birth,author_death,author_gender,author_primary_lang,...,author_viaf,gr_url,wiki_url,pg_eng_url,pg_orig_url,year,week,rank,title_id,nyt_title
0,1,Don Quixote,Miguel de Cervantes,1605,Spanish,action,1547,1616,male,spa,...,17220427,https://www.goodreads.com/book/show/3836.Don_Q...,https://en.wikipedia.org/wiki/Don_Quixote,https://www.gutenberg.org/cache/epub/996/pg996...,https://www.gutenberg.org/cache/epub/2000/pg20...,,,,,
1,2,Alice's Adventures in Wonderland,Lewis Carroll,1865,English,fantasy,1832,1898,male,eng,...,66462036,https://www.goodreads.com/book/show/24213.Alic...,https://en.wikipedia.org/wiki/Alice%27s_Advent...,https://www.gutenberg.org/cache/epub/11/pg11.txt,,,,,,
2,3,The Adventures of Huckleberry Finn,Mark Twain,1884,English,action,1835,1910,male,eng,...,50566653,https://www.goodreads.com/book/show/2956.The_A...,https://en.wikipedia.org/wiki/Adventures_of_Hu...,https://www.gutenberg.org/cache/epub/76/pg76.txt,,,,,,
3,4,The Adventures of Tom Sawyer,Mark Twain,1876,English,action,1835,1910,male,eng,...,50566653,https://www.goodreads.com/book/show/24583.The_...,https://en.wikipedia.org/wiki/The_Adventures_o...,https://www.gutenberg.org/cache/epub/74/pg74.txt,,,,,,
4,5,Treasure Island,Robert Louis Stevenson,1883,English,action,1850,1894,male,eng,...,95207986,https://www.goodreads.com/book/show/295.Treasu...,https://en.wikipedia.org/wiki/Treasure_Island,https://www.gutenberg.org/cache/epub/120/pg120...,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
716,499,Room,Emma Donoghue,2010,English,na,1969,ALIVE,female,eng,...,39539889,,https://en.wikipedia.org/wiki/Room_(novel),NA_not-pub-domain,,2011.0,2011-03-06,8.0,3859.0,ROOM
717,499,Room,Emma Donoghue,2010,English,na,1969,ALIVE,female,eng,...,39539889,,https://en.wikipedia.org/wiki/Room_(novel),NA_not-pub-domain,,2011.0,2011-03-13,13.0,3859.0,ROOM
718,499,Room,Emma Donoghue,2010,English,na,1969,ALIVE,female,eng,...,39539889,,https://en.wikipedia.org/wiki/Room_(novel),NA_not-pub-domain,,2011.0,2011-03-20,13.0,3859.0,ROOM
719,499,Room,Emma Donoghue,2010,English,na,1969,ALIVE,female,eng,...,39539889,,https://en.wikipedia.org/wiki/Room_(novel),NA_not-pub-domain,,2011.0,2011-04-03,15.0,3859.0,ROOM
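A note on the join: `str.capitalize()` turns an all-caps, multi-word title into something like `Treasure island`, which would not match `Treasure Island`. Here is a minimal sketch of a case-insensitive join instead. The `title_key` column is a hypothetical helper, the all-caps `TREASURE ISLAND` row is made up for illustration, and the sketch assumes author names are spelled identically in both tables, which may not hold for the real data.

```python
import pandas as pd

# Toy stand-ins for the two tables. "Room"/"ROOM" comes from the output above;
# the "TREASURE ISLAND" NYT row is made up to show the multi-word case.
novels = pd.DataFrame({
    "title": ["Room", "Treasure Island"],
    "author": ["Emma Donoghue", "Robert Louis Stevenson"],
})
nyt = pd.DataFrame({
    "nyt_title": ["ROOM", "TREASURE ISLAND"],
    "author": ["Emma Donoghue", "Robert Louis Stevenson"],
})

# capitalize() only reproduces the casing of single-word titles
capitalized = nyt["nyt_title"].str.capitalize().tolist()
print(capitalized)  # ['Room', 'Treasure island']

# title_key (hypothetical helper): lower-cased, whitespace-trimmed title
novels["title_key"] = novels["title"].str.lower().str.strip()
nyt["title_key"] = nyt["nyt_title"].str.lower().str.strip()

merged = novels.merge(nyt, how="left", on=["author", "title_key"])
print(merged[["title", "nyt_title"]])
```

Matching on a lower-cased key keeps the original `title` column untouched, so the charts can still show nicely cased titles.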


# Exploratory Data Analysis (EDA) time!

### I wonder, how much of this dataset is made up of books originally written in languages *other* than English?

In [2]:
alt.Chart(combined_novels_nyt_df).mark_bar().encode(
    x='orig_lang:N',
    y='count():Q',
    tooltip=['orig_lang', 'count()']
).properties(
    title='Count of Novels by Original Language'
).interactive().show()

<VegaLite 5 object>

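One caveat about these counts: because the left merge repeats a novel once per matching NYT week (Room shows up on several rows in the table above), `count()` counts rows, not novels. A minimal sketch of counting each novel once follows. It assumes that `top_500_rank` together with `title` identifies a novel, which I haven't verified for the full dataset.

```python
import pandas as pd

# Toy slice shaped like the merged frame: Room is repeated once per NYT week
merged = pd.DataFrame({
    "top_500_rank": [1, 2, 499, 499, 499],
    "title": ["Don Quixote", "Alice's Adventures in Wonderland", "Room", "Room", "Room"],
    "orig_lang": ["Spanish", "English", "English", "English", "English"],
    "week": [None, None, "2011-03-06", "2011-03-13", "2011-03-20"],
})

# Counting rows inflates any novel that spent several weeks on the NYT list
rows_per_lang = merged["orig_lang"].value_counts()

# Keep one row per novel before counting
one_per_novel = merged.drop_duplicates(subset=["top_500_rank", "title"])
novels_per_lang = one_per_novel["orig_lang"].value_counts()

print(rows_per_lang.to_dict())    # {'English': 4, 'Spanish': 1}
print(novels_per_lang.to_dict())  # {'English': 2, 'Spanish': 1}
```

Passing a deduplicated frame like `one_per_novel` to `alt.Chart(...)` would make the bar heights count novels rather than rows.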


### That's a lot! With that in mind, is it possible that the ones not originally in English average a really high rank (remember, a lower number means a better rank)? There are so few of them, comparatively, that the only ones that made the list must rank really high!

In [3]:
alt.Chart(combined_novels_nyt_df).mark_bar().encode(
    x='orig_lang',
    y='mean(top_500_rank)',
    tooltip=['orig_lang', 'mean(top_500_rank)']
).properties(
    title='Mean Top 500 Rank by Original Language'
).interactive().show()

<VegaLite 5 object>

If you see this message, it means the renderer has not been properly enabled
for the frontend that you are using. For more information, see
https://altair-viz.github.io/user_guide/display_frontends.html#troubleshooting
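To read a mean rank fairly, it helps to see how many novels sit behind each average, since a language with one or two books can swing wildly. Below is a small sketch that puts the mean rank and the number of novels side by side. It uses six rows copied from the table above, so the numbers say nothing about the full dataset; `mean_rank` and `n_novels` are just names I picked.

```python
import pandas as pd

# Six novels taken from the table above (one row per novel)
novels = pd.DataFrame({
    "top_500_rank": [1, 2, 3, 4, 5, 499],
    "orig_lang": ["Spanish", "English", "English", "English", "English", "English"],
})

# Mean rank next to group size; a lower mean_rank means a better average position
summary = (
    novels.groupby("orig_lang")["top_500_rank"]
    .agg(mean_rank="mean", n_novels="count")
    .sort_values("mean_rank")
)
print(summary)
```

In this toy slice the Spanish "average" rests on a single book, which is exactly the kind of group to read with caution.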


### Nope, that doesn't seem to be the case! Let's try taking a look at how publication year influences our rankings.

In [4]:
alt.Chart(combined_novels_nyt_df).mark_circle(size=60).encode(
    x=alt.X('pub_year', scale=alt.Scale(domain=[1600, 2024])),
    y='top_500_rank',
    color='genre',
    tooltip=['title', 'author', 'top_500_rank', 'pub_year']
).properties(
    title="Rank vs Publication Year"
).interactive().show()

<VegaLite 5 object>

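A scatter plot can be hard to judge by eye, so one way to put a number on "does publication year go with rank?" is a Spearman rank correlation. This sketch computes it as the Pearson correlation of the ranked values, which needs only pandas. The six rows come from the table above and are far too few to conclude anything; on the real data you would probably want to deduplicate to one row per novel first.

```python
import pandas as pd

# Six novels taken from the table above
books = pd.DataFrame({
    "pub_year": [1605, 1865, 1884, 1876, 1883, 2010],
    "top_500_rank": [1, 2, 3, 4, 5, 499],
})

# Spearman correlation = Pearson correlation computed on the ranks of each column
rho = books["pub_year"].rank().corr(books["top_500_rank"].rank())
print(round(rho, 3))
```

A value near 0 would back up the "doesn't tell us too much" reading, while a value near 1 or -1 would suggest a steady trend.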


### Interesting, but that doesn't seem to tell us too much! Maybe the author's gender will show some contrast? 

In [5]:
gender_df = combined_novels_nyt_df[combined_novels_nyt_df.author_gender.notna()]
alt.Chart(gender_df).mark_circle(size=60).encode(
    x=alt.X('pub_year', scale=alt.Scale(domain=[1600, 2024])),
    y='top_500_rank',
    color='author_gender',
    tooltip=['title', 'author', 'top_500_rank', 'pub_year', 'author_gender']
).properties(
    title="Rank vs Publication Year by Author Gender"
).interactive().show()

<VegaLite 5 object>

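Another way to look for a gender contrast is a small table of shares instead of a cloud of dots. The sketch below splits the ranks into two halves and shows what fraction of each half falls to each gender. The `rank_half` column and its bin edges are my own choices, and the six rows from the table above are only there to make the code runnable.

```python
import pandas as pd

# Six novels taken from the table above
books = pd.DataFrame({
    "top_500_rank": [1, 2, 3, 4, 5, 499],
    "author_gender": ["male", "male", "male", "male", "male", "female"],
})

# rank_half (hypothetical helper): which half of the top 500 a novel falls in
books["rank_half"] = pd.cut(
    books["top_500_rank"], bins=[0, 250, 500], labels=["1-250", "251-500"]
)

# Each row sums to 1: the share of each gender within that half of the ranking
share = pd.crosstab(books["rank_half"], books["author_gender"], normalize="index")
print(share)
```

If the shares look about the same in both halves on the full data, that would agree with the scatter plot showing no clear contrast.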


### Nothing there either! The last thing I can think of is that the genre could play a role in the rankings.

In [6]:
alt.Chart(combined_novels_nyt_df).mark_boxplot().encode(
    x='genre',
    y='top_500_rank',
    color='genre',
    tooltip=['genre', 'top_500_rank']
).properties(
    title='Rank vs Genre'
).interactive().show()

<VegaLite 5 object>



### Now I'm really stumped! Seems as though all kinds of books can have their moment in the spotlight. Which makes me think: is there a genre of book that's getting written more often now than in the past? Maybe one from the past that isn't written anymore?

In [7]:
# Count rows per (publication year, genre), then drop the 'na' genre placeholder
books_per_genre_per_year = combined_novels_nyt_df.groupby(['pub_year', 'genre']).size().reset_index(name='count')
filtered_books_per_genre_per_year = books_per_genre_per_year[books_per_genre_per_year['genre'] != 'na'].copy()
# Running total within each genre, in publication-year order
filtered_books_per_genre_per_year['cumulative_count'] = filtered_books_per_genre_per_year.groupby('genre')['count'].cumsum()
alt.Chart(filtered_books_per_genre_per_year).mark_line().encode(
    x=alt.X('pub_year:O', axis=alt.Axis(labelAngle=45, labelOverlap=True, tickCount=10, title='Publication Year')),
    y='cumulative_count:Q',
    color='genre:N',
    tooltip=['pub_year', 'genre', 'cumulative_count']
).properties(
    width=800,
    height=400,
    title='Cumulative Number of Books Published in Each Genre Throughout the Years'
).transform_filter(
    alt.datum.pub_year >= 1600
).interactive().show()

<VegaLite 5 object>

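A "spike" on a cumulative line is really a stretch where the line climbs steeply, so counting books per decade shows the same thing more directly. Here is a rough sketch of that view. The `decade` column is a helper I'm introducing, and the five rows come from the table at the top, so the result only illustrates the mechanics.

```python
import pandas as pd

# Five novels taken from the table at the top of the notebook
books = pd.DataFrame({
    "pub_year": [1605, 1865, 1884, 1876, 1883],
    "genre": ["action", "fantasy", "action", "action", "action"],
})

# decade (hypothetical helper): 1884 -> 1880, 1605 -> 1600, and so on
books["decade"] = (books["pub_year"] // 10) * 10

# Books per genre per decade; groupby sorts by genre, then decade
per_decade = books.groupby(["genre", "decade"]).size().reset_index(name="count")

# The cumulative line is the running total of these per-decade counts
per_decade["cumulative_count"] = per_decade.groupby("genre")["count"].cumsum()
print(per_decade)
```

A decade with a large `count` is where the cumulative line jumps, and decades with no rows are where it stays flat.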


### Wow! That's really cool! It's pretty striking that history books had a spike in the '70s and '80s and sort of plateaued after that. This spike implies that, of all the history books that are top ranked now or were considered top ranked upon their release, the ones from the '70s and '80s were the most likely to be perceived as top ranking. I wonder what made that time period so special for history books!

# Welp, that's all for this data analysis! Goodbye, and thanks for reading! Hope you learned something new and fun!