AniList.co offers a free GraphQL API with loads of cool info about anime. Let's see if we can find some cool insights from it!

We'll start with the `media` data, which has basic info about virtually every anime.

Let's set up a query for this API to pull that info for a batch of anime.
After a little time with the documentation, I was able to filter the request to:

* Anime Only
* Japanese Made Only
* Start Date after Dec 31, 2012
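* Aired on TV (the `format` value is sent with the request, but the query would also need to declare `$format` and pass it to `media` for this filter to take effect)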

In [1]:
import requests
def make_request(page_num,media_type):
    query = '''
    query ($page: Int, $perPage: Int,
           $startDate_greater: FuzzyDateInt,
           $countryOfOrigin: CountryCode,
           $type: MediaType) {
        Page (page: $page, perPage: $perPage) {
            pageInfo {
                total
                currentPage
                lastPage
                hasNextPage
                perPage
            }
            media (startDate_greater:$startDate_greater,
            countryOfOrigin:$countryOfOrigin,
            type:$type) {
                id
                title {
                    english
                    native
                }
                season
                seasonYear
                type
                format
                status
                episodes
                duration
                chapters
                volumes
                isAdult
                averageScore
                popularity
                source
                countryOfOrigin
                isLicensed
                genres
                startDate {
                    day
                    month
                    year
                  }
                endDate {
                    day
                    month
                    year
                  }
            }
        }
    }
    '''
    variables = {
        "startDate_greater": 20121231,  # Fuzzy Date for Dec 31, 2012
        "countryOfOrigin": "JP",
        "type": media_type,
        "page": page_num,
        "perPage": 500,
        "format": "TV",  # not referenced by the query above, so it does not filter anything yet
    }
    
    url = 'https://graphql.anilist.co'

    response = requests.post(url, json={'query': query, 'variables': variables})
    # response = requests.post(url, json={'query': query, 'variables': variables},verify=False)
    return response
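
The `startDate_greater` value is a date packed into a single integer (year, then month, then day). As a small sketch, the helper below builds that integer from a regular `date`; the name `to_fuzzy_date_int` is made up for this notebook and is not part of the AniList API.

```python
from datetime import date


def to_fuzzy_date_int(d: date) -> int:
    """Pack a calendar date into the 8-digit YYYYMMDD integer form
    used for the startDate_greater variable (hypothetical helper)."""
    return d.year * 10000 + d.month * 100 + d.day


print(to_fuzzy_date_int(date(2012, 12, 31)))  # 20121231
```

That way the cutoff can be changed without hand-typing the digits.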

Testing that function, I was getting some weird responses when asking for more than 500 records per page, so I'll limit `perPage` to 500.

Rather than hard-coding a page count, the loop below keeps requesting pages until `hasNextPage` is false (the `range` is just a generous upper bound). There should be far fewer than 5,000 records since 2012.
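
Here is a rough, offline sketch of that paging pattern. It assumes the response body has the same `data.Page.pageInfo.hasNextPage` and `data.Page.media` shape as the query above; `fetch_all_pages` and `fake_fetch` are hypothetical names, and `fake_fetch` just stands in for the real API so the example runs without a network call.

```python
from typing import Callable, Dict, List


def fetch_all_pages(fetch_page: Callable[[int], Dict], max_pages: int = 9999) -> List[Dict]:
    """Collect media records page by page until hasNextPage is False.

    fetch_page is a hypothetical callable that returns the parsed JSON
    body for one page number (e.g. a thin wrapper around make_request).
    """
    records = []
    for page_num in range(1, max_pages + 1):
        body = fetch_page(page_num)
        page = (body.get('data') or {}).get('Page')
        if page is None:
            # No payload came back: fail with a readable message
            raise RuntimeError(f"page {page_num} returned no data: {body.get('errors')}")
        records.extend(page['media'])
        if not page['pageInfo']['hasNextPage']:
            break
    return records


def fake_fetch(page_num: int) -> Dict:
    """Stand-in for the API: three pages with two made-up records each."""
    media = [{'id': page_num * 10 + k} for k in range(2)]
    return {'data': {'Page': {'pageInfo': {'hasNextPage': page_num < 3}, 'media': media}}}


all_records = fetch_all_pages(fake_fetch)
print(len(all_records))  # 6
```

Raising on a missing payload gives a clearer message than letting a `None` lookup fail further down.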

In [2]:
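import json
import pandas as pd
from IPython.display import clear_output, display
df_media=pd.DataFrame()

for i in range(1,9999):
    clear_output()
    print(i)
    response=make_request(i,"ANIME")
    json_data = json.loads(response.text)
    # A failed request can come back with "data": null, so check before indexing
    if not json_data.get('data') or not json_data['data'].get('Page'):
        raise RuntimeError('Page '+str(i)+' returned no data: '+str(json_data.get('errors')))
    page=json_data['data']['Page']
    # Flatten nested fields into columns such as title.english and startDate.year
    df_new=pd.json_normalize(page['media'])
    if len(page['media'])>0:
        df_media=pd.concat([df_media,df_new])
    if not page['pageInfo']['hasNextPage']:
        break
# sort_values returns a new dataframe, so assign the result before resetting the index
df_media=df_media.sort_values(['startDate.year','startDate.month','startDate.day'])
df_media.reset_index(inplace=True,drop=True)
display(df_media)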

91


TypeError: 'NoneType' object is not subscriptable

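Okay awesome, when the loop runs to completion the dataframe looks good. (A `TypeError` like the one above is what you get when a response comes back without a `data` payload, so `json_data['data']` or its `Page` is `None`.)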

I want to see if certain genres are more popular in different seasons of the year.
Let's see what those columns look like.

In [None]:
display(df_media[["title.english","title.native",'genres']])

Some rows are missing a `title.english` value, but I don't want to throw those out since Japanese is easy, so I'll leave them in and work with the Japanese name.

We've also got multiple genres for each row in the `genres` column, so let's one hot encode them.

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df_hot = df_media.join(pd.DataFrame(mlb.fit_transform(df_media.pop('genres')),
                          columns=mlb.classes_,
                          index=df_media.index))

df_hot['averageScore']=pd.to_numeric(df_hot['averageScore'])
display(df_hot[["title.english","title.native",'Mahou Shoujo','Horror','Mecha']])
print('Original Dataframe has '+str(df_media.shape[1])+' columns')
print('One Hot Encoded Dataframe has '+str(df_hot.shape[1])+' columns')
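
To make the encoding step concrete, here is a dependency-free sketch of roughly what a multi-label one-hot encoding produces: one 0/1 column per genre, with a 1 wherever a title carries that genre. The function name `one_hot_genres` and the three toy titles are made up for illustration.

```python
def one_hot_genres(genre_lists):
    """Return (sorted genre names, one 0/1 row per title)."""
    classes = sorted({g for genres in genre_lists for g in genres})
    rows = [[1 if c in genres else 0 for c in classes] for genres in genre_lists]
    return classes, rows


# Toy data: each inner list is the genres of one title
toy_titles = [['Action', 'Comedy'], ['Comedy'], ['Horror', 'Action']]
classes, rows = one_hot_genres(toy_titles)
print(classes)  # ['Action', 'Comedy', 'Horror']
for row in rows:
    print(row)
# [1, 1, 0]
# [0, 1, 0]
# [1, 0, 1]
```

Summing a column then gives the number of titles in that genre, which is what the next step relies on.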

That added about 20 columns, so let's get rid of the genres that don't have many records so our plot isn't too messy.

First let's find how many of each genre there are. The One Hot Coded Genre columns start at index=24

In [None]:
# Count titles per genre; sort_values on a Series only needs the ascending flag
genre_list=df_hot.iloc[:, 24:].sum().sort_values(ascending=False)
display(genre_list)

Wow, Comedy and Action apply to a lot of titles.

Okay, let's drop the genres that have fewer than 500 titles from our dataframe so we're not looking at outliers and our charts aren't too cluttered.

In [None]:
genre_list=genre_list.where(genre_list<500).dropna()
df_hot.drop(genre_list.keys(),axis=1,inplace=True)
display(df_hot)
print('After dropping niche genres, Dataframe has '+str(df_hot.shape[1])+' columns')

Perfect, we went from 43 to 34 columns, so we should have a more focused list of genres in our dataset.

Now let's look at viewership, but we'll want to divide out the number of titles released from the popularity (which represents total views) to get to an average viewership for each title in a genre.

Let's see what it looks like!

In [None]:
df_season=pd.DataFrame(columns=['index','Season','Viewership'])
season_order = ["SPRING", "SUMMER", "FALL", "WINTER"]
for x in range(24,df_hot.shape[1]):
    colname = df_hot.columns[x]

    # Keep only the titles tagged with this genre
    df_season_genre=df_hot[df_hot.iloc[:, x] == 1]

    series_plot_me_count=df_season_genre.groupby('season')['id'].count()
    series_plotme_sumPopularity=df_season_genre.groupby('season')['popularity'].sum()
    series_plotme_avgRanking=df_season_genre.groupby('season')['averageScore'].mean()
    # Average viewership per title = total popularity / number of titles
    series_plotme_viewership=series_plotme_sumPopularity/series_plot_me_count

    df_season_genre=pd.DataFrame({ 'Season': series_plotme_viewership.keys(), 'Viewership': series_plotme_viewership.values, 'Ranking': series_plotme_avgRanking.values })
    df_season_genre=df_season_genre.reset_index().set_index("Season").loc[season_order]
    df_season_genre['Genre']=colname
    if df_season.empty:
        df_season=df_season_genre.copy(deep=True)
    else:
        df_season=pd.concat([df_season,df_season_genre])

df_season.index.name = 'Season'
df_season.reset_index(inplace=True)
df_season.drop(['index'],axis=1,inplace=True)
display(df_season.head(12))
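
The "average viewership per title" idea boils down to total popularity divided by the number of titles in each group. A tiny stdlib-only sketch with made-up numbers (the helper name `average_popularity_by` is hypothetical):

```python
from collections import defaultdict


def average_popularity_by(records, key):
    """Total popularity divided by number of titles, per value of `key`."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for record in records:
        totals[record[key]] += record['popularity']
        counts[record[key]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up records, not real AniList data
toy_records = [
    {'season': 'WINTER', 'popularity': 100},
    {'season': 'WINTER', 'popularity': 300},
    {'season': 'SUMMER', 'popularity': 50},
]
toy_avg = average_popularity_by(toy_records, 'season')
print(toy_avg)  # {'WINTER': 200.0, 'SUMMER': 50.0}
```

Dividing by the title count keeps a season with many releases from looking more popular just because it has more titles.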

Okay great, got the dataframe, now to plot...

In [None]:
import seaborn as sns
sns.set(rc={'figure.figsize':(11.7,8.27)})

sns.lineplot(data=df_season,x="Season",y="Viewership",hue="Genre",style='Genre',size='Genre',sizes=(4,4)).set(title='Anime Viewership By Season And Genre')

Okay, well that's a hot mess. We can see which genres have higher average Viewership, but that's about it with this viz.

`Drama` and `Supernatural` are the winners in terms of Viewership, and `Hentai` is the biggest loser.

However, it's pretty hard to make out any interesting seasonal trends by genre.

To see those, let's normalize each Genre by its average Viewership and then represent the season-to-season differences as percentages instead of absolute values. (That way the genres with higher absolute Viewership don't dominate the chart just because their seasonal swings are larger on an absolute scale.)

In [None]:
viewership_offset_genre=df_season.groupby('Genre')['Viewership'].mean()
ranking_offset_genre=df_season.groupby('Genre')['Ranking'].mean()

df_norm=df_season.merge(viewership_offset_genre,on='Genre')
df_norm=df_norm.merge(ranking_offset_genre,on='Genre')

df_norm['Viewership']=(df_norm['Viewership_x']-df_norm['Viewership_y'])/df_norm['Viewership_y']
df_norm['Ranking']=(df_norm['Ranking_x']-df_norm['Ranking_y'])/df_norm['Ranking_y']

df_norm.drop(['Viewership_x','Viewership_y','Ranking_x','Ranking_y'],axis=1,inplace=True)
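
As a quick worked example of that normalization, each seasonal value becomes its relative difference from the genre's own mean, i.e. (value - mean) / mean. The numbers below are made up, and `relative_to_mean` is a hypothetical helper name.

```python
def relative_to_mean(values_by_season):
    """Express each value as a fractional difference from the group's mean."""
    mean = sum(values_by_season.values()) / len(values_by_season)
    return {season: (value - mean) / mean for season, value in values_by_season.items()}


# Made-up viewership numbers for one genre; the mean is 100.0
toy_genre = {'SPRING': 90.0, 'SUMMER': 110.0, 'FALL': 80.0, 'WINTER': 120.0}
toy_norm = relative_to_mean(toy_genre)
print(toy_norm)  # {'SPRING': -0.1, 'SUMMER': 0.1, 'FALL': -0.2, 'WINTER': 0.2}
```

A value of 0.2 reads as "20% above this genre's average", so genres of very different sizes can share one axis.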

In [None]:
sns.lineplot(data=df_norm,x="Season",y="Viewership",hue="Genre",style='Genre',size='Genre',sizes=(4,4)).set(title='Normalized Anime Viewership By Season And Genre')

What's interesting here is in *Summer*, some of the realistic genres like **Romance**, **Slice of Life**, and **Drama** gain more viewership, whereas **Fantasy**, **Adventure**, and **Supernatural** tend to have less relative Viewership. I wonder if that is due to the regular things in life like sunshine, hot dogs, and warm weather making real life seem not so bad. Whereas when it's cold, you'd rather choose something more escapist.

The clearest trend is that nearly all genres see Viewership pick up from *Fall* to *Winter*. That makes sense, with people staying inside more in the cold weather as well as the New Year's holidays.

The exception to that is **Supernatural** anime that come out in *Winter* have much worse viewership than any other time. 

I wonder if Rankings show the same trend

In [None]:
sns.lineplot(data=df_norm,x="Season",y="Ranking",hue="Genre",style='Genre',size='Genre',sizes=(4,4)).set(title='Normalized Anime Ranking By Season And Genre')

In terms of Rankings, *Adventure* is rated very well in **Winter** relative to **Spring**. Not surprising that people would want to go on an adventure when they're stuck inside all **Winter**.

We also see *Supernatural* doing poorly in Ranking in **Winter**, just as it did in Viewership. I wonder if people have a hard time believing in gods in **Winter** when everything outside is cold and dead.

*Romance* does especially well in the **Spring** in terms of Viewership and Rankings. I guess that's why they call it the *"Season of Love"*.

Let's look at change in popularity or ranking over the years

In [None]:
df_year=pd.DataFrame(columns=['index','Season','Viewership'])
for x in range(24,df_hot.shape[1]):
    colname = df_hot.columns[x]

    df_year_genre=df_hot[df_hot.iloc[:, x] == 1]

    series_plot_me_count=df_year_genre.groupby('seasonYear')['id'].count()
    series_plotme_sumPopularity=df_year_genre.groupby('seasonYear')['popularity'].sum()
    series_plotme_avgRanking=df_year_genre.groupby('seasonYear')['averageScore'].mean()
    # Average viewership per title = total popularity / number of titles
    series_plotme_viewership=series_plotme_sumPopularity/series_plot_me_count

    df_year_genre=pd.DataFrame({ 'Year': series_plotme_viewership.keys(), 'Viewership': series_plotme_viewership.values, 'Ranking': series_plotme_avgRanking.values })
    df_year_genre['Genre']=colname
    if df_year.empty:
        df_year=df_year_genre.copy(deep=True)
    else:
        df_year=pd.concat([df_year,df_year_genre])

df_year.reset_index(inplace=True)
df_year.drop(['index'],axis=1,inplace=True)
display(df_year.head(12))

In [None]:
sns.lineplot(data=df_year,x="Year",y="Viewership",hue="Genre",style='Genre',size='Genre',sizes=(4,4)).set(title='Anime Viewership By Year And Genre')

Wow! The **Romance**, **Comedy** and **Slice of Life** anime released in 2021 and 2022 have garnered more viewership than earlier releases in those genres. That may be a sad story of people wanting a return to normal life and normal romance in the wake of the pandemic... or to just have a normal laugh.

No increase in the average viewership for a **Hentai** anime as people spent more time at home during the pandemic. Slightly surprising, but promising.

Next, let's see if anime based on manga are more popular than original anime.

First let's see what all types of source material exist in this dataset. The column is called `source`

In [None]:
df_media.groupby('source').size().sort_values(ascending=False)

Interesting, there are a few more options for source material than just manga.

Let's see the average ranking of each

In [None]:
import matplotlib.pyplot as plt

df_media['averageScore']=pd.to_numeric(df_media['averageScore'])

summarize_popularity=df_media.groupby('source')['averageScore'].mean()

fig = plt.figure(figsize = (10, 5))
plt.bar(summarize_popularity.keys(), summarize_popularity.values, color ='red',
        width = 0.4)
plt.title('Anime Rankings By Source Material Type')

Surprisingly, there's not a huge difference in ranking between source types. `LIGHT_NOVEL` and `MANGA` are slightly ahead, but probably not significantly.

What about... Viewership?

In [None]:
series_plot_me_count=df_media.groupby('source')['id'].count()
series_plotme_sumPopularity=df_media.groupby('source')['popularity'].sum()

# Average viewership per title for each source type
df_plotme=series_plotme_sumPopularity/series_plot_me_count

import matplotlib.pyplot as plt

fig = plt.figure(figsize = (10, 5))
plt.bar(df_plotme.keys(), df_plotme.values, color ='green',
        width = 0.4)
plt.title('Anime Viewership By Source Material Type')

Wow okay, we've got a clear winner here as anime based on `LIGHT_NOVEL` have way more viewership than other source material types. Also we have a clear 2nd place in `MANGA`

I wonder if that's entirely fair to say or if there's more to that story. Let's trend these over the years and see...

In [None]:
series_plot_me_count=df_media.groupby(['source','seasonYear'],as_index=False)['id'].count()
df_plotme=df_media.groupby(['source','seasonYear'],as_index=False)['popularity'].sum()
df_plotme['avgPopularity']=df_plotme['popularity']/series_plot_me_count['id']

import seaborn as sns
sns.set_theme(style="whitegrid")

# Draw a grouped barplot: one bar per source type within each year
g = sns.catplot(
    data=df_plotme, kind="bar",
    x="seasonYear", y="avgPopularity", hue="source",
    errorbar=None,  # values are already averaged (seaborn >= 0.12; use ci=None on older versions)
    palette="dark", alpha=.6, height=6
)
g.despine(left=True)
g.set_axis_labels("Year", "avgPopularity")
g.legend.set_title("Anime Popularity by Source and Year")

Wow, year over year, Anime based on `LIGHT_NOVEL` and `MANGA` out-perform others consistently, no outliers. That's very interesting.