# 01 - Explore anime metadata

This notebook explores the `anime-offline-database.json` file and builds:

- A clean **anime master table** (one row per anime)
- A simple **episode table** (one row per anime-episode)
- Basic summary stats about types, years, and tags


In [3]:
from pathlib import Path
import json
import pandas as pd
import numpy as np


In [4]:
pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 140)

In [5]:
NOTEBOOK_DIR = Path.cwd()
PROJECT_ROOT = NOTEBOOK_DIR.parent
DATA_DIR = PROJECT_ROOT / "data"
EXTERNAL_DIR = DATA_DIR / "external"

In [6]:
FILE_NAME = "anime-offline-database.json"
DATA_PATH = EXTERNAL_DIR / FILE_NAME

In [7]:
DATA_PATH

PosixPath('/Users/sanjaydilip/Desktop/Code/Projects/sim2real user engagement/anime_simulated/data/external/anime-offline-database.json')

## Load the raw JSON

The official schema for `anime-offline-database.json` looks like this at the top level:

- `$schema` – link to the JSON schema
- `license` – license info
- `repository` – GitHub URL
- `scoreRange` – global min and max scores
- `lastUpdate` – date
- `data` – list of anime objects

Each entry in `data` is an anime with fields like:

- `title`, `type`, `episodes`, `status`
- `animeSeason` with `season` and `year`
- `duration` per episode
- `score`
- `tags`, `studios`, `producers`, `synonyms`
- `sources` with links to MAL, AniList, etc.

In [8]:
with open(DATA_PATH, "r", encoding="utf-8") as f:
    root = json.load(f)

In [9]:
list(root.keys())

['$schema', 'license', 'repository', 'scoreRange', 'lastUpdate', 'data']

In [10]:
# Basic high-level info
print("Last update:", root.get("lastUpdate"))
print("Number of anime entries:", len(root.get("data", [])))

Last update: 2025-08-04
Number of anime entries: 39277


In [11]:
# Peek at score range metadata
root.get("scoreRange", {})

{'minInclusive': 1.0, 'maxInclusive': 10.0}

## Inspect a single anime entry

Before building tables, inspect one or two entries in detail to understand the fields.

We will:

- Look at the keys for the first anime
- Print a trimmed version of the entry
- Check how `animeSeason`, `duration`, `score`, and `tags` are stored
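
As a complement to eyeballing one entry, the checks listed above can be run across many entries at once. The following is a minimal sketch, not part of the original notebook: `sample_entries` is made-up data shaped like the fields described earlier, and `summarize_field_types` is a hypothetical helper name. In the notebook itself you would pass `anime_list` instead.

```python
from collections import Counter

# Hypothetical miniature entries (made up for illustration, not real records).
sample_entries = [
    {
        "title": "A",
        "animeSeason": {"season": "FALL", "year": 2020},
        "duration": {"value": 120, "unit": "SECONDS"},
        "score": {"arithmeticMean": 6.2, "median": 6.3},
        "tags": ["music"],
    },
    {
        "title": "B",
        "animeSeason": {"season": "WINTER"},  # no year recorded
        "tags": [],                            # duration and score absent
    },
]


def summarize_field_types(entries, fields):
    """Count, per field, which Python type its value has across all entries."""
    summary = {field: Counter() for field in fields}
    for entry in entries:
        for field in fields:
            if field not in entry:
                # Track absent keys separately from keys that hold None.
                summary[field]["missing"] += 1
            else:
                summary[field][type(entry[field]).__name__] += 1
    return summary


type_summary = summarize_field_types(
    sample_entries, ["animeSeason", "duration", "score", "tags"]
)
for field, counts in type_summary.items():
    print(field, dict(counts))
```

A field that shows both `dict` and `missing` (or `NoneType`) counts is one that the table-building code has to guard against.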

In [12]:
anime_list = root["data"]
first = anime_list[0]

In [13]:
print("Keys on a single anime entry:")
print(sorted(first.keys()))

Keys on a single anime entry:
['animeSeason', 'duration', 'episodes', 'picture', 'producers', 'relatedAnime', 'score', 'sources', 'status', 'studios', 'synonyms', 'tags', 'thumbnail', 'title', 'type']


In [14]:
pd.DataFrame([first])

Unnamed: 0,sources,title,type,episodes,status,animeSeason,picture,thumbnail,duration,score,synonyms,studios,producers,relatedAnime,tags
0,"[https://anilist.co/anime/142051, https://anim...",!NVADE SHOW!,SPECIAL,1,FINISHED,"{'season': 'FALL', 'year': 2020}",https://cdn.myanimelist.net/images/anime/1615/...,https://cdn.myanimelist.net/images/anime/1615/...,"{'value': 120, 'unit': 'SECONDS'}","{'arithmeticGeometricMean': 6.258565813170356,...","[!nvade Show!, Invade Show!, RAISE A SUILEN, R...",[sanzigen],[],"[https://anilist.co/anime/101633, https://kits...","[band, full cgi, music, primarily female cast,..."


In [15]:
sample_fields = {
    "title": first.get("title"),
    "type": first.get("type"),
    "episodes": first.get("episodes"),
    "status": first.get("status"),
    "animeSeason": first.get("animeSeason"),
    "duration": first.get("duration"),
    "score": first.get("score"),
    "num_tags": len(first.get("tags", [])),
    "num_synonyms": len(first.get("synonyms", [])),
    "num_sources": len(first.get("sources", [])),
}

In [16]:
sample_fields

{'title': '!NVADE SHOW!',
 'type': 'SPECIAL',
 'episodes': 1,
 'status': 'FINISHED',
 'animeSeason': {'season': 'FALL', 'year': 2020},
 'duration': {'value': 120, 'unit': 'SECONDS'},
 'score': {'arithmeticGeometricMean': 6.258565813170356,
  'arithmeticMean': 6.261151515151515,
  'median': 6.308},
 'num_tags': 5,
 'num_synonyms': 4,
 'num_sources': 4}

## Build an anime master table

Goal: one row per anime with the main fields we care about for engagement modeling.

Columns we will extract:

- `anime_row_id` – simple integer id for now (0, 1, 2, …)
- `title`
- `type` – TV, MOVIE, OVA, ONA, SPECIAL, UNKNOWN
- `episodes` – number of episodes or parts
- `status` – FINISHED, ONGOING, UPCOMING, UNKNOWN
- `season`, `year` – from `animeSeason`
- `duration_sec` – duration per episode in seconds (if present)
- `score_mean`, `score_median` – `arithmeticMean` and `median` from the `score` object (if present)
- `num_tags`, `num_synonyms`, `num_sources`
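
For comparison, the same flattening could probably be done with `pd.json_normalize`, which expands nested dicts into dotted column names. This is only a sketch under assumptions: `sample_entries` is made-up data, the column subset is shortened, and entries that store an explicit null in place of a nested object may surface as an extra un-flattened column, so an explicit loop can be more predictable on the real file.

```python
import pandas as pd

# Hypothetical entries for illustration; the notebook would use `anime_list`.
sample_entries = [
    {
        "title": "A", "type": "TV", "episodes": 12, "status": "FINISHED",
        "animeSeason": {"season": "FALL", "year": 2020},
        "duration": {"value": 1440, "unit": "SECONDS"},
        "score": {"arithmeticMean": 7.1, "median": 7.0},
        "tags": ["comedy", "drama"],
    },
    {
        "title": "B", "type": "MOVIE", "episodes": 1, "status": "FINISHED",
        "animeSeason": {"season": "WINTER"},  # year not recorded
        "tags": [],                            # duration and score absent
    },
]

# Nested dicts become columns such as "animeSeason.year" and "duration.value".
flat = pd.json_normalize(sample_entries)

master = flat.rename(
    columns={
        "animeSeason.season": "season",
        "animeSeason.year": "year",
        "duration.value": "duration_sec",
        "score.arithmeticMean": "score_mean",
        "score.median": "score_median",
    }
)
master["num_tags"] = master["tags"].apply(len)
# Nullable integer dtype keeps years as integers even when some are missing.
master["year"] = master["year"].astype("Int64")
master.insert(0, "anime_row_id", range(len(master)))

master = master[
    ["anime_row_id", "title", "type", "episodes", "status", "season", "year",
     "duration_sec", "score_mean", "score_median", "num_tags"]
]
print(master)
```

Casting `year` to `Int64` is an optional design choice; without it, pandas stores the column as `float64` as soon as one year is missing.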

In [17]:
def build_anime_master_table(anime_list: list[dict]) -> pd.DataFrame:
    """Flatten the raw anime entries into one row per anime."""
    rows = []
    for idx, anime in enumerate(anime_list):
        # `or {}` also covers keys that are present but set to null.
        anime_season = anime.get("animeSeason") or {}
        duration = anime.get("duration") or {}
        score = anime.get("score") or {}
        rows.append(
            {
                "anime_row_id": idx,
                "title": anime.get("title"),
                "type": anime.get("type"),
                "episodes": anime.get("episodes"),
                "status": anime.get("status"),
                "season": anime_season.get("season"),
                "year": anime_season.get("year"),
                # Assumes duration["unit"] is SECONDS, as in the sample entry above.
                "duration_sec": duration.get("value"),
                "score_mean": score.get("arithmeticMean"),
                "score_median": score.get("median"),
                "num_tags": len(anime.get("tags") or []),
                "num_synonyms": len(anime.get("synonyms") or []),
                "num_sources": len(anime.get("sources") or []),
            }
        )
    return pd.DataFrame(rows)

In [18]:
anime_df = build_anime_master_table(anime_list)
anime_df.head()

Unnamed: 0,anime_row_id,title,type,episodes,status,season,year,duration_sec,score_mean,score_median,num_tags,num_synonyms,num_sources
0,0,!NVADE SHOW!,SPECIAL,1,FINISHED,FALL,2020.0,120.0,6.261152,6.308,5,4,4
1,1,"""0""",SPECIAL,1,FINISHED,SUMMER,2013.0,240.0,4.906793,4.919091,7,11,9
2,2,"""1-punkan dake Furete mo Ii yo..."" Share House...",ONA,8,FINISHED,WINTER,2025.0,360.0,5.539082,5.539082,4,9,2
3,3,"""Aesop"" no Ohanashi yori: Ushi to Kaeru, Yokub...",MOVIE,1,FINISHED,WINTER,1970.0,720.0,5.220119,5.0,6,9,9
4,4,"""Ai"" wo Taberu",MOVIE,1,FINISHED,WINTER,2018.0,480.0,,,3,2,3


In [19]:
anime_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39277 entries, 0 to 39276
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   anime_row_id  39277 non-null  int64  
 1   title         39277 non-null  object 
 2   type          39277 non-null  object 
 3   episodes      39277 non-null  int64  
 4   status        39277 non-null  object 
 5   season        39277 non-null  object 
 6   year          37862 non-null  float64
 7   duration_sec  36351 non-null  float64
 8   score_mean    29139 non-null  float64
 9   score_median  29139 non-null  float64
 10  num_tags      39277 non-null  int64  
 11  num_synonyms  39277 non-null  int64  
 12  num_sources   39277 non-null  int64  
dtypes: float64(4), int64(5), object(4)
memory usage: 3.9+ MB


In [20]:
anime_df[["episodes", "year", "duration_sec", "score_mean", "score_median"]].describe()

Unnamed: 0,episodes,year,duration_sec,score_mean,score_median
count,39277.0,37862.0,36351.0,29139.0,29139.0
mean,12.830562,2010.275501,1190.50315,6.156254,6.178693
std,56.761056,14.572009,1462.646636,1.126396,1.140535
min,0.0,1907.0,1.0,1.0,1.0
25%,1.0,2006.0,180.0,5.439055,5.440204
50%,1.0,2015.0,720.0,6.239182,6.274
75%,12.0,2020.0,1440.0,6.939844,6.982864
max,3937.0,2029.0,12780.0,10.0,10.0


## Simple episode table

Even though the raw data is at series level, we want an episode-level table for later:

- One row per (anime, episode_number)
- This will be the base for synthetic viewing logs

We will:

- Use `anime_row_id` as the foreign key
- Emit one row per episode number, from 1 to `episodes`, for each anime with a positive `episodes` count (anime with a missing or zero count are skipped)
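
The plan above can also be sketched without a Python-level loop, which may matter once the table grows to hundreds of thousands of rows. This is a hedged alternative, not the notebook's own method: `anime_demo` is made-up data and `expand_to_episodes` is a hypothetical helper name. It assumes `episodes` is an integer column and `anime_row_id` is unique.

```python
import pandas as pd

# Hypothetical three-anime table for illustration.
anime_demo = pd.DataFrame(
    {
        "anime_row_id": [0, 1, 2],
        "title": ["A", "B", "C"],
        "type": ["SPECIAL", "ONA", "TV"],
        "year": [2020.0, 2025.0, None],
        "episodes": [1, 3, 0],
    }
)


def expand_to_episodes(df: pd.DataFrame) -> pd.DataFrame:
    """Repeat each anime row `episodes` times and number the copies from 1."""
    # Drop anime with no usable episode count.
    valid = df[df["episodes"] > 0]
    # Repeat each index label once per episode, then select those rows.
    repeated = valid.loc[
        valid.index.repeat(valid["episodes"].to_numpy())
    ].reset_index(drop=True)
    # Within each anime, cumcount() runs 0, 1, 2, ...; shift to start at 1.
    repeated["episode_number"] = repeated.groupby("anime_row_id").cumcount() + 1
    return repeated[["anime_row_id", "title", "type", "year", "episode_number"]]


episodes_demo = expand_to_episodes(anime_demo)
print(episodes_demo)
```

In this demo, anime `C` has zero episodes and contributes no rows, while anime `B` contributes episode numbers 1 through 3.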

In [21]:
def build_episode_table(anime_df: pd.DataFrame) -> pd.DataFrame:
    """Expand each anime into one row per episode, skipping missing or non-positive counts."""
    episode_rows = []
    for row in anime_df.itertuples(index=False):
        if pd.isna(row.episodes):
            continue
        try:
            n_eps = int(row.episodes)
        except (TypeError, ValueError):
            continue
        if n_eps <= 0:
            continue
        for ep in range(1, n_eps + 1):
            episode_rows.append(
                {
                    "anime_row_id": row.anime_row_id,
                    "title": row.title,
                    "type": row.type,
                    "year": row.year,
                    "episode_number": ep,
                }
            )
    episodes_df = pd.DataFrame(episode_rows)
    return episodes_df

In [22]:
episodes_df = build_episode_table(anime_df)
episodes_df.head()

Unnamed: 0,anime_row_id,title,type,year,episode_number
0,0,!NVADE SHOW!,SPECIAL,2020.0,1
1,1,"""0""",SPECIAL,2013.0,1
2,2,"""1-punkan dake Furete mo Ii yo..."" Share House...",ONA,2025.0,1
3,2,"""1-punkan dake Furete mo Ii yo..."" Share House...",ONA,2025.0,2
4,2,"""1-punkan dake Furete mo Ii yo..."" Share House...",ONA,2025.0,3


In [23]:
episodes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503946 entries, 0 to 503945
Data columns (total 5 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   anime_row_id    503946 non-null  int64  
 1   title           503946 non-null  object 
 2   type            503946 non-null  object 
 3   year            471971 non-null  float64
 4   episode_number  503946 non-null  int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 19.2+ MB


## Basic distributions

Before any modeling, it helps to know:

- How many anime there are of each `type` (TV, MOVIE, OVA, etc.)
- How `episodes` are distributed
- How `year` is distributed
- How rich the tag lists are

These quick summaries give a feel for the dataset and will inform the design of the simulation later.

In [24]:
# Type distribution
anime_df["type"].value_counts(dropna=False)

type
SPECIAL    10576
TV         10360
MOVIE       6784
ONA         6237
OVA         5194
UNKNOWN      126
Name: count, dtype: int64

In [25]:
# Status distribution (finished vs ongoing, etc)
anime_df["status"].value_counts(dropna=False)

status
FINISHED    37424
UPCOMING     1082
ONGOING       528
UNKNOWN       243
Name: count, dtype: int64

In [26]:
# Episodes distribution - a few quantiles
anime_df["episodes"].describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.99])

count    39277.000000
mean        12.830562
std         56.761056
min          0.000000
25%          1.000000
50%          1.000000
75%         12.000000
90%         26.000000
99%        120.000000
max       3937.000000
Name: episodes, dtype: float64

In [27]:
# Year distribution - rough count by decade
years = anime_df["year"].dropna().astype(int)

In [28]:
summary_by_decade = (
    (years // 10 * 10)  # floor each year to the start of its decade
    .value_counts()
    .sort_index()
)

In [29]:
summary_by_decade

year
1900        1
1910       32
1920       51
1930      132
1940       59
1950      112
1960      464
1970      703
1980     1928
1990     2983
2000     5974
2010    14769
2020    10654
Name: count, dtype: int64

In [30]:
# Tag richness
anime_df["num_tags"].describe()

count    39277.000000
mean        14.692517
std         19.999429
min          0.000000
25%          3.000000
50%          7.000000
75%         17.000000
max        211.000000
Name: num_tags, dtype: float64

## Quick tag exploration

Tags mix genres, themes, and other descriptors like:

- `drama`, `comedy`, `fantasy`
- `crime`, `detective`
- `romance`, `slice of life`
- and many more
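
One possible way to count tags entirely in pandas is to explode the tag lists into a long table and call `value_counts`. The sketch below uses made-up `sample_entries` rather than real records, and is offered as an alternative to a `Counter`-based loop rather than as the notebook's method.

```python
import pandas as pd

# Hypothetical entries for illustration; the notebook would use `anime_list`.
sample_entries = [
    {"title": "A", "tags": ["comedy", "drama"]},
    {"title": "B", "tags": ["comedy"]},
    {"title": "C", "tags": []},
]

# One row per (anime, tag); an empty tag list becomes a single NaN row,
# which dropna() removes.
tags_long = (
    pd.DataFrame(sample_entries)
    .explode("tags")
    .dropna(subset=["tags"])
)

tag_counts = (
    tags_long["tags"]
    .value_counts()
    .rename_axis("tag")
    .reset_index(name="count")
)
print(tag_counts)
```

The long `tags_long` table can be reused later, for example to join tags back onto per-anime features.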

In [31]:
from collections import Counter

In [32]:
def get_top_tags(anime_list: list[dict], top_n: int = 30) -> pd.DataFrame:
    counter = Counter()
    for anime in anime_list:
        for tag in anime.get("tags", []):
            counter[tag] += 1
    common = counter.most_common(top_n)
    df = pd.DataFrame(common, columns=["tag", "count"])
    return df

In [33]:
top_tags_df = get_top_tags(anime_list, top_n=30)
top_tags_df.head(10)

Unnamed: 0,tag,count
0,comedy,13924
1,japanese production,12352
2,fantasy,11050
3,action,10708
4,adventure,8003
5,kids,7956
6,drama,7884
7,present,7756
8,music,6716
9,slice of life,6045


## Save processed tables

We will save:

- `anime_master.parquet` – one row per anime
- `episodes.parquet` – one row per episode

These files will be the starting point for the viewing log simulation step.

In [34]:
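PROCESSED_DIR = DATA_DIR / "processed"
# Create the output directory if needed so the parquet writes below do not fail.
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)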

In [35]:
anime_master_path = PROCESSED_DIR / "anime_master.parquet"
episodes_path = PROCESSED_DIR / "episodes.parquet"

In [36]:
anime_df.to_parquet(anime_master_path, index=False)
episodes_df.to_parquet(episodes_path, index=False)

In [37]:
print(anime_master_path)
print(episodes_path)

/Users/sanjaydilip/Desktop/Code/Projects/sim2real user engagement/anime_simulated/data/processed/anime_master.parquet
/Users/sanjaydilip/Desktop/Code/Projects/sim2real user engagement/anime_simulated/data/processed/episodes.parquet


## Summary

In this notebook we:

- Loaded `anime-offline-database.json`
- Inspected the schema and a sample anime entry
- Built an **anime master table** with type, episodes, season, year, duration, and score fields
- Built a simple **episode table** that expands each anime into per-episode rows
- Ran basic checks on types, years, episodes, and tag richness
- Saved the results into `data/processed/` for downstream use

Next step: design and implement the viewing log simulation in a new notebook, using `anime_master.parquet` and `episodes.parquet` as the base.