# Louisville Free Public Library

Analysis of the Young Adult (YA) genre in the Louisville Free Public Library collection.

## Questions

In this analysis we will look at the following questions:

- How much was spent on the collection for YA? 
- How many books in the collection are YA?
- How does YA spending compare to other collections?
- Did the spending on YA change over time?
- Is YA more or less popular at any of the locations?


In [5]:
import pandas as pd
import numpy as np
from pathlib import Path

# load the clean books data into a dataframe and show the first few rows
data_path = Path('results/books_clean.csv.gz')
books_df = pd.read_csv(data_path)
books_df.head()

Unnamed: 0,BibNum,Title,Author,PublicationYear,ItemType,ItemCollection,ItemLocation,ItemPrice,Genre,Audience
0,707409,"Jeff Immelt and the new GE way : innovation, t...","Magee, David, 1965-",2009,Book,Adult Non-Fiction,Main,25.95,Non-Fiction,Adult
1,707411,Robin rescues dinner : 52 weeks of quick-fix m...,"Miller, Robin, 1964-",2009,Book,Adult Non-Fiction,Southwest,19.99,Non-Fiction,Adult
2,707411,Robin rescues dinner : 52 weeks of quick-fix m...,"Miller, Robin, 1964-",2009,Book,Adult Non-Fiction,Southwest,19.99,Non-Fiction,Adult
3,707411,Robin rescues dinner : 52 weeks of quick-fix m...,"Miller, Robin, 1964-",2009,Book,Adult Non-Fiction,Remote Shelving - Main,19.99,Non-Fiction,Adult
4,707411,Robin rescues dinner : 52 weeks of quick-fix m...,"Miller, Robin, 1964-",2009,Book,Adult Non-Fiction,Remote Shelving - Main,19.99,Non-Fiction,Adult


## How much was spent on the collection for YA?

In [6]:
# First figure out which records in the dataframe are YA using a mask
# YA = Genre: Fiction, Audience = Teen
ya_mask = (books_df['Audience'] == 'Teen') & (books_df['Genre'] == 'Fiction')
ya_mask

0          False
1          False
2          False
3          False
4          False
           ...  
1187198    False
1187199    False
1187200    False
1187201    False
1187202    False
Length: 1187203, dtype: bool

In [7]:
# Now use the mask to slice the dataframe
books_df[ya_mask]

Unnamed: 0,BibNum,Title,Author,PublicationYear,ItemType,ItemCollection,ItemLocation,ItemPrice,Genre,Audience
50,1340382,Pyromantic,"McBride, Lish",2017,Book,Older Teen Fiction,South Central,18.99,Fiction,Teen
51,1340382,Pyromantic,"McBride, Lish",2017,Book,Older Teen Fiction,Remote Shelving - Shawnee,18.99,Fiction,Teen
52,1340385,Shadow run,"Strickland, AdriAnne, 1984-",2017,Book,Younger Teen Fiction,South Central,17.99,Fiction,Teen
53,1340385,Shadow run,"Strickland, AdriAnne, 1984-",2017,Book,Younger Teen Fiction,Northeast,17.99,Fiction,Teen
54,1340385,Shadow run,"Strickland, AdriAnne, 1984-",2017,Book,Younger Teen Fiction,Remote Shelving - Shawnee,17.99,Fiction,Teen
...,...,...,...,...,...,...,...,...,...,...
1186871,2632279,Chain of thorns,"Clare, Cassandra",2023,Book,Older Teen Fiction,Main Teen,17.19,Fiction,Teen
1186872,2632279,Chain of thorns,"Clare, Cassandra",2023,Book,Older Teen Fiction,Northeast,17.19,Fiction,Teen
1186873,2632279,Chain of thorns,"Clare, Cassandra",2023,Book,Older Teen Fiction,South Central,17.19,Fiction,Teen
1186874,2632280,You,"Benoit, Charles.",2012,Book,Older Teen Fiction,St Matthews,11.61,Fiction,Teen


In [8]:
# Finally, select the ItemPrice column and use the sum() function to get the total
# (format with two decimal places so the result always reads as currency)
"${:,.2f}".format(books_df.loc[ya_mask, 'ItemPrice'].sum())


'$555,691.26'

The YA collection cost a total of about $556K ($555,691.26).
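The mask-then-sum steps above can be folded into one small helper. The following is a minimal sketch, not the notebook's own code: `toy_df`, `toy_mask`, and `total_spend` are hypothetical names, and the frame is a three-row stand-in for `books_df` (the two Teen prices are copied from the output above).

```python
import pandas as pd

# Hypothetical stand-in for books_df; only the columns the mask needs.
toy_df = pd.DataFrame({
    "Title": ["Pyromantic", "Shadow run", "Robin rescues dinner"],
    "Genre": ["Fiction", "Fiction", "Non-Fiction"],
    "Audience": ["Teen", "Teen", "Adult"],
    "ItemPrice": [18.99, 17.99, 19.99],
})


def total_spend(df, mask):
    """Sum ItemPrice over the rows selected by a boolean mask."""
    # .loc selects rows and the column in one step
    return df.loc[mask, "ItemPrice"].sum()


# Same YA definition as the notebook: Teen audience and Fiction genre
toy_mask = (toy_df["Audience"] == "Teen") & (toy_df["Genre"] == "Fiction")
toy_total = total_spend(toy_df, toy_mask)
print("${:,.2f}".format(toy_total))
```

Selecting with `.loc[mask, column]` avoids building an intermediate dataframe before picking the column.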

## How many books in the collection are YA?

In [9]:
# Add a new column to the dataframe called "YA_Category"
# If the Audience == "Teen" and Genre == "Fiction" the value should be "YA"
# Otherwise the value should be "Other"
books_df['YA_Category'] = np.where(ya_mask, 'YA', 'Other')

# calculate the counts & percents (and format them appropriately)
ya_counts = books_df['YA_Category']\
    .value_counts()\
    .apply(lambda x: "{:,}".format(x))
ya_percents = books_df['YA_Category']\
    .value_counts(normalize=True)\
    .mul(100)\
    .round(1)\
    .astype(str) + '%'

# concatenate the counts and percents into a single dataframe
pd.concat([ya_counts, ya_percents], axis=1, keys=['books','percentage'])


Unnamed: 0,books,percentage
Other,1145946,96.5%
YA,41257,3.5%


YA accounted for 3.5% of the total number of books in the collection.
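As a cross-check on the percentage, the share of `True` values in a boolean mask is simply its mean. The sketch below assumes a hypothetical four-row mask (`toy_mask`) and then repeats the arithmetic with the two counts reported in the table above.

```python
import pandas as pd

# Hypothetical mask: one YA row out of four
toy_mask = pd.Series([True, False, False, False])

# The mean of a boolean Series is the fraction of True values
toy_share = toy_mask.mean() * 100
print("{:.1f}%".format(toy_share))

# The same calculation using the counts from the table above
ya_books = 41257
other_books = 1145946
reported_share = round(ya_books / (ya_books + other_books) * 100, 1)
print("{}%".format(reported_share))
```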

## How does YA spending compare to other collections?

In [10]:
# Group the data by Genre and Audience and use sum() to get the total cost

books_df.groupby(['Genre','Audience'])['ItemPrice'].sum().apply(lambda x: "${:,.2f}".format(x))


Genre        Audience
Fiction      Adult       $3,457,835.27
             Children      $687,553.59
             Teen          $555,691.26
             Unknown     $1,731,767.36
Non-Fiction  Adult       $9,209,529.31
             Children    $1,597,204.37
             Teen          $401,104.39
             Unknown       $875,794.32
Unknown      Adult         $281,617.43
             Children    $2,505,961.73
             Teen              $119.09
             Unknown       $533,619.31
Name: ItemPrice, dtype: object
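Teen Fiction (YA) spending of $555,691.26 is the smallest of the three known-audience Fiction groups and far below Adult Non-Fiction at $9,209,529.31. To put the groups on one scale, each total can be expressed as a share of all spending. The sketch below is an illustration rather than part of the original notebook: it re-enters the totals printed above by hand, and the names `totals`, `share`, and `ya_share` are hypothetical.

```python
import pandas as pd

# Totals copied from the groupby output above (Genre, Audience) -> dollars
totals = pd.Series({
    ("Fiction", "Adult"): 3457835.27,
    ("Fiction", "Children"): 687553.59,
    ("Fiction", "Teen"): 555691.26,
    ("Fiction", "Unknown"): 1731767.36,
    ("Non-Fiction", "Adult"): 9209529.31,
    ("Non-Fiction", "Children"): 1597204.37,
    ("Non-Fiction", "Teen"): 401104.39,
    ("Non-Fiction", "Unknown"): 875794.32,
    ("Unknown", "Adult"): 281617.43,
    ("Unknown", "Children"): 2505961.73,
    ("Unknown", "Teen"): 119.09,
    ("Unknown", "Unknown"): 533619.31,
})
totals.index.names = ["Genre", "Audience"]

# Each group's percentage of total spending, largest first
share = (totals / totals.sum() * 100).sort_values(ascending=False)
ya_share = share.loc[("Fiction", "Teen")]

print(share.round(1))
print("YA share of spending: {:.1f}%".format(ya_share))
```

On the real dataframe the same shares could be computed directly from the unformatted `groupby(...).sum()` result, before the values are converted to strings.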

## Did the spending on YA change over time?

In [11]:
# calculate the counts, total cost, and average cost for all YA books by publication year
# (named aggregation builds all three columns in a single groupby pass)
ya_years_summary = books_df[books_df['YA_Category'] == 'YA']\
    .groupby('PublicationYear')['ItemPrice']\
    .agg(BookCount='count', TotalCost='sum', AverageCost='mean')

# format the counts and costs
ya_years_summary['BookCount'] = ya_years_summary['BookCount'].apply(lambda x: "{:,}".format(x))
ya_years_summary['TotalCost'] = ya_years_summary['TotalCost'].apply(lambda x: "${:,.2f}".format(x))
ya_years_summary['AverageCost'] = ya_years_summary['AverageCost'].apply(lambda x: "${:,.2f}".format(x))
ya_years_summary

PublicationYear,BookCount,TotalCost,AverageCost
1919,1,$20.00,$20.00
1938,3,$42.53,$14.18
1939,5,$124.95,$24.99
1966,1,$2.99,$2.99
1967,1,$6.99,$6.99
1968,2,$22.94,$11.47
1970,3,$30.89,$10.30
1971,8,$126.90,$15.86
1972,2,$27.95,$13.97
1973,15,$101.13,$6.74


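Spending on YA books peaked for titles published in 2014 and 2015, with more than $50K spent on each of those publication years. The table groups by publication year rather than purchase date, so it is only a proxy for when the money was spent.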
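Because the summary table stores its values as formatted strings, peak years are easier to find from the numeric totals before formatting. The sketch below shows one way to do that with `nlargest`; the data in `toy_ya` is made up for illustration and does not reproduce the library's figures.

```python
import pandas as pd

# Hypothetical YA rows: publication year and price per copy
toy_ya = pd.DataFrame({
    "PublicationYear": [2013, 2014, 2014, 2015, 2015, 2016],
    "ItemPrice": [10.0, 18.0, 17.0, 19.0, 15.0, 12.0],
})

# Numeric total per publication year (no string formatting yet)
yearly_total = toy_ya.groupby("PublicationYear")["ItemPrice"].sum()

# The two publication years with the highest total spending
top_years = yearly_total.nlargest(2)
print(top_years)
```

Sorting the formatted strings instead would compare them character by character, so "$9.00" would rank above "$50,000.00".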

Is there a correlation between the PublicationYear and the ItemPrice?

In [12]:
books_df[['ItemPrice','PublicationYear']].corr()

Unnamed: 0,ItemPrice,PublicationYear
ItemPrice,1.0,-0.256709
PublicationYear,-0.256709,1.0


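There is a weak negative correlation (r ≈ -0.26) between ItemPrice and PublicationYear. Note that this is calculated across the entire collection, not only the YA books.

Meaning: books with later publication years tend to have slightly lower prices, but publication year accounts for only about 7% of the variation in price (r² ≈ 0.066), so it is a poor predictor of price on its own.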
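The correlation above is computed over every book in the dataframe. If the question is specifically about YA, the same calculation can be restricted to the YA rows. The sketch below uses a hypothetical six-row frame (`toy_df`) with invented values, so its coefficients say nothing about the real collection; it also squares the coefficient reported above to show how much variance it corresponds to.

```python
import pandas as pd

# Hypothetical rows; values are invented for illustration only
toy_df = pd.DataFrame({
    "YA_Category": ["YA", "YA", "YA", "Other", "Other", "Other"],
    "PublicationYear": [2010, 2015, 2020, 1990, 2000, 2010],
    "ItemPrice": [10.0, 12.0, 14.0, 30.0, 25.0, 20.0],
})

# Pearson correlation over all rows
r_all = toy_df["ItemPrice"].corr(toy_df["PublicationYear"])

# Pearson correlation over the YA rows only
ya_only = toy_df[toy_df["YA_Category"] == "YA"]
r_ya = ya_only["ItemPrice"].corr(ya_only["PublicationYear"])

# r squared for the coefficient reported in the notebook output
reported_r = -0.256709
reported_r_squared = reported_r ** 2

print("all rows: {:.3f}, YA only: {:.3f}".format(r_all, r_ya))
print("reported r squared: {:.3f}".format(reported_r_squared))
```

In this toy data the subset and the full frame even disagree on the sign, which is why the scope of a correlation is worth stating alongside the number.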

## Is YA more or less popular at any of the locations?

In [13]:
# Get the number of YA books by location
location_ya = books_df['ItemLocation'][books_df['YA_Category'] == 'YA']\
                    .value_counts()
location_ya.rename("YABookCount", inplace=True)

# Get the total number of books by location
location_all = books_df['ItemLocation'].value_counts()
location_all.rename("TotalBookCount", inplace=True)

location_summary = pd.concat([location_all, location_ya], axis=1)

# calculate the percentage of YA books based on the YA count and the total count
location_summary['PercentYA'] = (location_summary['YABookCount'] / 
                                location_summary['TotalBookCount'])

# format the columns and display the dataframe values
location_summary['TotalBookCount'] = location_summary['TotalBookCount']\
                                    .apply(lambda x: "{:,}".format(x))
# YABookCount is a float column (locations without YA books are NaN after the concat)
location_summary['YABookCount'] = location_summary['YABookCount']\
                                    .apply(lambda x: "{:,.0f}".format(x))
location_summary['PercentYA'] = location_summary['PercentYA'].mul(100).round(1)
location_summary.sort_values(by=['PercentYA'], ascending=False)

Unnamed: 0,TotalBookCount,YABookCount,PercentYA
Main Teen,6018,3848.0,63.9
Remote Shelving - Shawnee,9060,2988.0,33.0
Content Management,4,1.0,25.0
Shawnee,22861,1830.0,8.0
Shively,23549,1206.0,5.1
Western,21615,1009.0,4.7
South Central,115614,5238.0,4.5
Southwest,121914,5413.0,4.4
Newburg,23536,1011.0,4.3
Fairdale,23025,983.0,4.3


The largest numbers of YA books are at the Northeast, Southwest, and South Central locations.


The Main Teen location has the highest percentage of YA books (63.9%).

The Central and Central Childrens locations have the lowest percentages of YA books.
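The per-location counts and percentages can also be built in one grouped aggregation over a boolean "is YA" flag, which gives locations with no YA books a count of 0 instead of a missing value. This is a sketch under assumptions, not the notebook's method: `toy_df` is a hypothetical six-row stand-in for `books_df`, and `location_pct` is an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in for books_df with only the two columns needed
toy_df = pd.DataFrame({
    "ItemLocation": ["Main Teen", "Main Teen", "Main", "Main", "Main", "Shawnee"],
    "YA_Category": ["YA", "Other", "Other", "Other", "YA", "Other"],
})

# Boolean flag per row, grouped by location:
#   count -> total books, sum -> YA books, mean -> YA fraction
location_pct = (
    toy_df["YA_Category"].eq("YA")
    .groupby(toy_df["ItemLocation"])
    .agg(TotalBookCount="count", YABookCount="sum", PercentYA="mean")
)
location_pct["PercentYA"] = location_pct["PercentYA"].mul(100).round(1)

print(location_pct.sort_values(by="PercentYA", ascending=False))
```

Keeping the columns numeric until the final display step also means the table can still be sorted or filtered by any of them.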