### Import Data, Data Wrangling

In [1]:
# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import collections
from collections import Counter
from scipy import stats
import datetime

In [2]:
df_books = pd.read_excel(r"pen_america_books.xlsx")
df_books.head()

Unnamed: 0,Author,Title,Type of Ban,Secondary Author(s),Illustrator(s),Translator(s),State,District,Date of Challenge/Removal,Origin of Challenge
0,"Àbíké-Íyímídé, Faridah",Ace of Spades,Banned in Libraries and Classrooms,,,,Florida,Indian River County School District,2021-11-01,Administrator
1,"Acevedo, Elizabeth",Clap When You Land,Banned in Classrooms,,,,Pennsylvania,Central York School District,2021-08-31,Administrator
2,"Acevedo, Elizabeth",The Poet X,Banned in Libraries,,,,Florida,Indian River County School District,2021-11-01,Administrator
3,"Acevedo, Elizabeth",The Poet X,Banned in Libraries and Classrooms,,,,New York,Marlboro Central School District,2022-02-01,Administrator
4,"Acevedo, Elizabeth",The Poet X,Banned Pending Investigation,,,,Texas,Fredericksburg Independent School District,2022-03-01,Administrator


In [3]:
df_other = pd.read_csv("books_1.Best_Books_Ever.csv")
df_other.head()

FileNotFoundError: [Errno 2] No such file or directory: 'books_1.Best_Books_Ever.csv'
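
The error above means the CSV was not found in the notebook's working directory, and every later cell that uses `df_other` depends on it. A minimal sketch of a guard that fails early and reports the directory that was searched; the helper name `read_csv_checked` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def read_csv_checked(path):
    """Read a CSV, raising a clearer error if the file is missing."""
    csv_path = Path(path)
    if not csv_path.is_file():
        # Report the absolute directory that was searched.
        raise FileNotFoundError(
            f"{csv_path.name} not found in {csv_path.resolve().parent}"
        )
    return pd.read_csv(csv_path)
```

Calling `read_csv_checked("books_1.Best_Books_Ever.csv")` would then either return the dataframe or name the folder where the file was expected.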

In [None]:
df_allbooks = pd.merge(df_books, df_other, left_on="Title", right_on="title")
df_allbooks.head()

In [None]:
# number of ban records whose title also appears in df_other
print(df_allbooks.shape[0])
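
The row count of an inner merge mixes two things: how many ban records matched, and how many times a title is repeated in the second dataset. A hedged sketch of one way to separate them, using a left merge with `indicator=True` on toy stand-ins for `df_books` and `df_other` (the rows below are made up for illustration, not real data):

```python
import pandas as pd

# Toy stand-ins for df_books and df_other (hypothetical rows, not real data).
bans = pd.DataFrame({"Title": ["Book A", "Book A", "Book B", "Book C"],
                     "State": ["Texas", "Florida", "Texas", "Utah"]})
catalog = pd.DataFrame({"title": ["Book A", "Book B", "Book B"],
                        "rating": [4.1, 3.9, 4.0]})

# Keep one catalog row per title so the merge cannot multiply ban records.
catalog_unique = catalog.drop_duplicates(subset="title")

# A left merge with indicator=True labels each ban record as matched or not.
checked = bans.merge(catalog_unique, how="left", left_on="Title",
                     right_on="title", indicator=True,
                     validate="many_to_one")

is_match = checked["_merge"] == "both"
matched_records = is_match.sum()
matched_titles = checked.loc[is_match, "Title"].nunique()
unmatched_titles = checked.loc[~is_match, "Title"].nunique()
print(matched_records, matched_titles, unmatched_titles)
```

The `validate="many_to_one"` argument makes pandas raise an error if the right-hand side still contains duplicate titles, which guards against silently inflated counts.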

In [None]:
# list of column names in merged dataframe
for col in df_allbooks.columns:
    print(col)

In [None]:
# drop redundant columns
df_allbooks.drop(columns=['title', 'author'], inplace=True)

## Question 1:  Which books are banned most frequently, and why are they banned?


In [None]:
#PART 1: MERGED DATASET

#How many books in the dataset
print(len(df_allbooks['Title'].unique()))

#Books banned in order of frequency
title_bans = df_allbooks.groupby('Title').size().sort_values(ascending=False)
#10 most frequently banned books that appear in both dataframes
title_bans[:10]

In [None]:
#PART 2: PEN DATASET ONLY
#How many books in the dataset
print(len(df_books['Title'].unique()))

#Books banned in order of frequency
pen_bans = df_books.groupby('Title').size().sort_values(ascending=False)
#10 most frequently banned books in the PEN dataset
df_pen_bans = (pen_bans[:10]).to_frame("Count").reset_index()
df_pen_bans

In [None]:
#For graphing purposes, because the titles are quite long, replacing them with their rank (0 = most banned):
df_graph_bans = df_pen_bans.copy()  #copy so that df_pen_bans keeps the real titles
df_graph_bans["Title"] = df_graph_bans.index

#Graph regarding the top 10 banned
sns.catplot(data=df_graph_bans, x="Title", y="Count", kind="bar")
plt.show()

In [None]:
#Compiling this information with book descriptions and genres
title_df = title_bans.to_frame(name="count")
relevant_info = pd.DataFrame().assign(Title=df_allbooks['Title'], genres=df_allbooks['genres'], description=df_allbooks['description'])
relevant_info.head()

#Dataframe with description info
why_banned_descriptions = pd.merge(relevant_info, title_df, on="Title")

#Dataframe with just genres and unique Title entries
why_banned_genres = why_banned_descriptions.drop_duplicates(subset="Title", keep="last")
why_banned_genres = why_banned_genres.sort_values(by="count", ascending=False)
why_banned_genres.head()

In [None]:
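#Top 10 banned books analysis
import ast

top_banned = why_banned_genres[:10]
lst = []
for genre in top_banned['genres']:
    #each entry is a string such as "['Fiction', 'Young Adult']"; parse it and collect the genres
    lst.extend(ast.literal_eval(genre))

print(lst)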

### Question 1: Manual Research over Book Content

#### All information has been pulled from the books' Wikipedia pages.
1. Synopses of the top 6 banned books in the PEN Dataset:
  - _Gender Queer: A Memoir_ - A memoir recounting American cartoonist and author Maia Kobabe's journey with gender identity. Includes themes of gender dysphoria, the gender binary, and asexuality.
  - _All Boys Aren't Blue_ - A semi-autobiographical account of activist George M. Johnson's experiences growing up queer and Black in the United States. Addresses themes of sexual abuse, racism, and homophobia, as well as consent, agency, and gender identity.
  - _Out of Darkness_ - A love story between two teenagers, one Mexican-American and the other African-American, in the 1930s. Incorporates the historical New London School explosion and addresses themes of racism, classism, and historical segregation.
  - _The Bluest Eye_ - A story following an African-American girl growing up after the Great Depression, recounting her struggles with racism. Addresses themes of racism, sexism, child abuse, and sexual abuse.
  - _Lawn Boy_ - A semi-autobiographical account of the experiences of a Mexican American boy growing up in the United States and the hardships he has faced. Addresses themes of racism and discrimination.
  - _The Hate U Give_ - Tells the story of a young Black teen in America who witnesses her childhood friend being shot and killed by the police, and of her attempt to seek justice for him. Addresses themes of police brutality, racism, and discrimination.
  
  
2. An analysis of these themes:
  - Most of the banned books tell stories that address homophobia, racism, sexism, classism, and other forms of social discrimination. Considering the data accumulated in Question 2, these themes are often presented as inappropriate for children to read in more conservative areas, despite their relevance in education.

## Question 2: Have trends in book bans changed over time, and if so, how? 

In [None]:
# count number of challenges for each unique date
date_counts = df_books.groupby('Date of Challenge/Removal').count()['Author']
date_counts = date_counts.to_frame()


In [None]:
# dataframe of the dates, for merging
unique_dates = pd.DataFrame(df_books['Date of Challenge/Removal'].unique())
unique_dates = unique_dates.rename(columns = {0 : "Date"})


In [None]:
# create dataframe with dates and counts in order to plot
merge_dates = unique_dates.merge(date_counts, how='left', left_on = 'Date', right_on = 'Date of Challenge/Removal')
merge_dates.fillna(0, inplace = True)

merge_dates = merge_dates.sort_values(by = 'Date')
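
The three cells above build the per-date counts with a groupby, a frame of unique dates, and a merge. As a hedged alternative, the same table can be sketched in one chain with `value_counts`, shown here on a toy frame (made-up rows, not real data). One difference to keep in mind: this counts rows per date, whereas `count()['Author']` counts rows with a non-missing author.

```python
import pandas as pd

# Toy stand-in for df_books (hypothetical rows, not real data).
toy = pd.DataFrame({
    "Date of Challenge/Removal": pd.to_datetime(
        ["2021-11-01", "2021-08-31", "2021-11-01", "2022-02-01"]),
    "Author": ["A", "B", "C", "D"],
})

# value_counts tallies rows per date; sort_index puts them in date order.
date_counts_alt = (toy["Date of Challenge/Removal"]
                   .value_counts()
                   .sort_index()
                   .rename_axis("Date")
                   .reset_index(name="Number of Challenges"))
print(date_counts_alt)
```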

In [None]:
# plot number of bans for each unique date

sns.catplot(data = merge_dates, kind = "bar", x = "Date", y = "Author", color = "cornflowerblue",  
            aspect = 1.5)

plt.xlabel("Date")
plt.xticks(rotation = 45)

# fix labels on x axis to remove timestamp
labels = [tick.get_text()[:10] for tick in plt.gca().get_xticklabels()]
plt.gca().set_xticklabels(labels)

plt.ylabel("Number of Book Challenges")

plt.title("Number of Book Challenges for Each Date")

plt.show()


In [None]:
group_date_state = df_books.groupby(['Date of Challenge/Removal', 'State']).size().reset_index(name='count')
pivot_group = group_date_state.pivot(index='Date of Challenge/Removal', columns='State', values='count')
pivot_group.plot(kind='bar', stacked=True)

plt.title('Book Ban Counts, by Date and State')

plt.xlabel('Date')
labels = [tick.get_text()[:10] for tick in plt.gca().get_xticklabels()]
plt.gca().set_xticklabels(labels)

plt.ylabel('Number of Bans')
plt.legend(title='State', bbox_to_anchor=(1.05, 1), loc='upper left')

plt.show()
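
`pivot` leaves `NaN` wherever a state has no bans on a given date. If explicit zeros are preferred, for example before summing across states, `pd.crosstab` is one option. A small sketch on toy data (made-up rows, not real data):

```python
import pandas as pd

# Toy stand-in for df_books (hypothetical rows, not real data).
toy = pd.DataFrame({
    "Date of Challenge/Removal": pd.to_datetime(
        ["2021-11-01", "2021-11-01", "2022-02-01"]),
    "State": ["Florida", "Texas", "Florida"],
})

# crosstab counts rows per (date, state) pair and fills absent pairs with 0.
counts_by_state = pd.crosstab(toy["Date of Challenge/Removal"], toy["State"])
print(counts_by_state)
```

The resulting frame has the same shape as `pivot_group` (dates as rows, states as columns), so it could be passed to `.plot(kind='bar', stacked=True)` in the same way.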

## Question 3: How do trends in banned books vary by genre? 

In [None]:
# convert string of genres to list
import ast
def convert_to_list(x):
    return ast.literal_eval(x)
df_allbooks['genres'] = df_allbooks['genres'].apply(convert_to_list)

In [None]:
# count occurrences of each genre
genre_counts = Counter([genre for genres in df_allbooks['genres'] for genre in genres])
df_genre_counts = pd.DataFrame.from_dict(genre_counts, orient='index', columns=['count'])
df_genre_counts = df_genre_counts.sort_values(by='count', ascending=False)
df_genre_counts = df_genre_counts.reset_index().rename(columns={'index': 'genre'})

# genre labels excluded from the more specific ranking
excluded_genres = ["Fiction", "Young Adult", "Contemporary", "Realistic Fiction", "Teen", "Audiobook", "Queer", "Young Adult Contemporary", "Adult", "Novels", "Historical Fiction", "Classics", "Literature", "Nonfiction", "Historical", "Adult Fiction"]
specific_df_genre_counts = df_genre_counts[~df_genre_counts["genre"].isin(excluded_genres)]
specific_df_genre_counts.head(10)
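
The counting step above can also be written with `explode`, which turns each list of genres into one row per genre. A hedged sketch on a toy frame (made-up titles and genres, and a hypothetical one-item exclusion list):

```python
import pandas as pd

# Toy stand-in for df_allbooks after the genres column holds lists.
toy = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book C"],
    "genres": [["Fiction", "LGBT"], ["Fiction", "Race"], ["LGBT"]],
})

# explode gives one row per (book, genre); value_counts sorts descending.
genre_counts_alt = (toy["genres"].explode()
                    .value_counts()
                    .rename_axis("genre")
                    .reset_index(name="count"))

# Drop labels judged too generic for this analysis (hypothetical choice).
excluded = ["Fiction"]
specific = genre_counts_alt[~genre_counts_alt["genre"].isin(excluded)]
print(specific)
```

Because `df_allbooks` has one row per ban record rather than one row per title, either approach weights each genre by the number of bans. Counting distinct titles instead would require dropping duplicate titles first.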

In [None]:
# plot 10 most banned genres
sns.catplot(data = df_genre_counts.head(10), x = 'genre', y = 'count', kind = 'bar', aspect = 2)
plt.xticks(rotation = 45)
plt.xlabel("Genre")
plt.ylabel("Number of Books Banned")
plt.title("Top 10 Most Banned Genres")
plt.show()

## Question 4: How do trends in book banning vary by state? 

For this step of the analysis, we will be using the original PEN America (`df_books`) data rather than the merged dataframe with extra information on the books (`df_allbooks`), since we are interested in the number of bans per state.

Our definition of regions is based on regions defined by the U.S. Census Bureau. State populations are also based on information from the U.S. Census Bureau. 

In [None]:
# which states had most bans
state_bans = df_books.groupby('State').count().sort_values(by = 'Author', ascending = False)
most_bans = state_bans.head()['Author'].to_frame()
most_bans = most_bans.rename(columns = {'Author': 'Number of Bans'})
most_bans

most_bans = most_bans.sort_values(by='Number of Bans', ascending=True)  # sort values in ascending order
fig, ax = plt.subplots(figsize=(8, 5))
ax.barh(most_bans.index, most_bans['Number of Bans'], color='red')
ax.set_title('Number of Books Banned by State (Top 5)')
ax.set_xlabel('Number of Bans')
ax.set_ylabel('State')
plt.show()

In [None]:
# which states had fewest bans
fewest_bans = state_bans.tail(10)['Author'].to_frame()
fewest_bans = fewest_bans.rename(columns = {'Author' : 'Number of Bans'})
fewest_bans

In [None]:
# create dataframe with just northeast states
northeast = ['Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 'Rhode Island', 'Vermont', 'New Jersey', 'New York', 'Pennsylvania']
northeast_df = df_books[df_books['State'].isin(northeast)]

# count number of bans per state
northeast_df_counts = northeast_df.groupby('State').count()['Author']
northeast_counts = northeast_df_counts.to_frame()
northeast_counts = northeast_counts.rename(columns = {'Author' : "Number of Bans"})

# calculate mean number of bans per state
ne_mean = np.mean(northeast_counts['Number of Bans'])

print('northeastern mean:', ne_mean)

In [None]:
# create dataframe with just southern states
south = ['Delaware', 'District of Columbia', 'Florida', 'Georgia', 'Maryland', 'North Carolina', 'South Carolina', 'Virginia', 'West Virginia', 'Alabama', 'Kentucky', 'Mississippi', 'Tennessee', 'Arkansas', 'Louisiana', 'Oklahoma', 'Texas']
south_df = df_books[df_books['State'].isin(south)]

# count number of bans per state
south_df_counts = south_df.groupby('State').count()['Author']
south_counts = south_df_counts.to_frame()
south_counts = south_counts.rename(columns = {'Author' : "Number of Bans"})

# calculate mean number of bans per state
south_mean = np.mean(south_counts['Number of Bans'])

print('southern mean:', south_mean)

In [None]:
region_means = pd.DataFrame({'Region': ['Northeastern States', 'Southern States'],
                             'Mean Number of Bans': [ne_mean, south_mean]})
fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(region_means['Region'], region_means['Mean Number of Bans'], color=['blue', 'red'])
ax.set_title('Mean Number of Books Banned by State Region')
ax.set_xlabel('Region')
ax.set_ylabel('Mean Number of Bans')
plt.show()

We want to compare the mean number of book bans for northeastern states and southern states. The mean number of book bans is 79.5 for northeastern states and 167.5 for southern states. Because these means are computed from every ban record in the dataset rather than estimated from a sample, we treat the difference as observed and do not run a hypothesis test. Note that each mean is taken only over the states that appear in the dataset, so states with no recorded bans are not counted. The difference might give us some insight into how the number of banned books is related to the political and social climate of states in the northeast (traditionally more liberal) and in the south (traditionally more conservative).
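
Since `groupby('State')` only returns states that have at least one ban record, the regional means above leave out states with zero bans. A minimal sketch of how those states could be included, assuming a dictionary that maps each state to its region; the mapping and rows below are a small hypothetical subset for illustration, not real data.

```python
import pandas as pd

# Hypothetical partial mapping of states to regions, for illustration only.
region_of = {"Maine": "Northeast", "Vermont": "Northeast",
             "New York": "Northeast", "Texas": "South", "Florida": "South"}

# Toy stand-in for df_books: one row per ban record (not real data).
toy = pd.DataFrame({"State": ["Texas", "Texas", "Texas", "Florida",
                              "New York", "New York"]})

# Count bans per state, then reindex so states with no bans count as 0.
per_state = toy["State"].value_counts().reindex(list(region_of), fill_value=0)

# Average the per-state counts within each region.
region_mean = per_state.groupby(per_state.index.map(region_of)).mean()
print(region_mean)
```

With the full Census lists in place of the toy mapping, the same two lines would give means over all states in each region, which may be lower than the means computed over states with bans only.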

## Question 5: Who initiates book challenges and why? 


In this portion, we will group the books by their "origin of challenge" (who first proposed that a book be banned: school administration, a formal challenge made by a parent or local resident, or other) and assess the information available about the books in each category in order to determine whether there is some underlying reason why a certain group of people challenges a book.

In [None]:
df_allbooks = pd.merge(df_books, df_other, left_on="Title", right_on="title")
df_allbooks.head()

In [None]:
#the total number of bans by origin of challenge
df_allbooks.groupby("Origin of Challenge").count()

In [None]:
#information about books grouped by origin of challenge (numeric columns only)
df_allbooks.groupby("Origin of Challenge").mean(numeric_only=True)

In [None]:
##plot showing total number of bans by origin of challenge
sns.displot(data=df_allbooks, x="Origin of Challenge")
plt.show()

In [None]:
##plot displaying the distribution of the liked percentage by origin of challenge
sns.catplot(data = df_allbooks, x = "Origin of Challenge", y = "likedPercent", kind = "box", color = "gray")
plt.show()

Now that we have established that there may be some difference in the liked percentage between the three origin of challenge categories, we will conduct a hypothesis test to determine whether the difference in mean liked percentage between administrator-banned books and formal-challenge-banned books is statistically significant.

$H_0: \mu_{admin} = \mu_{formal}$

$H_A:\mu_{admin} \neq \mu_{formal}$


**Null hypothesis:** the mean liked percentage is the same for administrator-banned books and formal-challenge-banned books.


**Alternative hypothesis:** the mean liked percentage differs between administrator-banned books and formal-challenge-banned books.
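
The test in the next cell is called with `equal_var = False`, which corresponds to Welch's t-test. For reference, and writing $\bar{x}$, $s^2$, and $n$ for each group's sample mean, sample variance, and size (notation introduced here, not in the original notebook), the test statistic is

$$t = \frac{\bar{x}_{admin} - \bar{x}_{formal}}{\sqrt{\dfrac{s_{admin}^2}{n_{admin}} + \dfrac{s_{formal}^2}{n_{formal}}}}$$

and the degrees of freedom are approximated by the Welch-Satterthwaite formula

$$\nu \approx \frac{\left(\dfrac{s_{admin}^2}{n_{admin}} + \dfrac{s_{formal}^2}{n_{formal}}\right)^2}{\dfrac{\left(s_{admin}^2 / n_{admin}\right)^2}{n_{admin} - 1} + \dfrac{\left(s_{formal}^2 / n_{formal}\right)^2}{n_{formal} - 1}}$$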

In [None]:
mask1 = df_allbooks["Origin of Challenge"] == "Administrator"
mask2 = df_allbooks["Origin of Challenge"] == "Formal Challenge"

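#drop missing liked percentages so the test does not return nan
admin = df_allbooks[mask1]["likedPercent"].dropna().tolist()
formal = df_allbooks[mask2]["likedPercent"].dropna().tolist()

#Welch's t-test (does not assume equal variances)
stats.ttest_ind(admin, formal, equal_var = False)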

Our p-value is 0.269, which is greater than our predetermined significance level of 0.05. Thus, we fail to reject the null hypothesis: we have insufficient evidence to suggest a statistically significant difference in mean liked percentage between administrator-banned books and formal-challenge-banned books.
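
To make the decision rule concrete, here is a small sketch that runs the same kind of test on toy values (hypothetical liked percentages, not real data), recomputes the statistic by hand as a check, and compares the p-value to the significance level.

```python
import numpy as np
from scipy import stats

# Hypothetical liked-percentage values with one missing entry (not real data).
admin_toy = np.array([92.0, 95.0, 90.0, np.nan, 94.0])
formal_toy = np.array([91.0, 93.0, 96.0, 89.0])

# Remove missing values before testing.
a = admin_toy[~np.isnan(admin_toy)]
f = formal_toy[~np.isnan(formal_toy)]

# equal_var=False requests Welch's t-test (unequal variances allowed).
result = stats.ttest_ind(a, f, equal_var=False)

# Recompute the t statistic by hand from sample means and variances.
t_manual = (a.mean() - f.mean()) / np.sqrt(
    a.var(ddof=1) / a.size + f.var(ddof=1) / f.size)

alpha = 0.05
reject_null = result.pvalue < alpha
print(result.statistic, t_manual, result.pvalue, reject_null)
```

The hand-computed value should agree with `result.statistic`, which confirms that the sample variances are not pooled.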