We will start with anime-dataset-2023 and clean it here; preprocessing comes later.

In [1]:
import pandas as pd

In [2]:
anime = pd.read_csv("archive/anime-dataset-2023.csv")

In [3]:
anime.head()

Unnamed: 0,anime_id,Name,English name,Other name,Score,Genres,Synopsis,Type,Episodes,Aired,...,Studios,Source,Duration,Rating,Rank,Popularity,Favorites,Scored By,Members,Image URL
0,1,Cowboy Bebop,Cowboy Bebop,カウボーイビバップ,8.75,"Action, Award Winning, Sci-Fi","Crime is timeless. By the year 2071, humanity ...",TV,26.0,"Apr 3, 1998 to Apr 24, 1999",...,Sunrise,Original,24 min per ep,R - 17+ (violence & profanity),41.0,43,78525,914193.0,1771505,https://cdn.myanimelist.net/images/anime/4/196...
1,5,Cowboy Bebop: Tengoku no Tobira,Cowboy Bebop: The Movie,カウボーイビバップ 天国の扉,8.38,"Action, Sci-Fi","Another day, another bounty—such is the life o...",Movie,1.0,"Sep 1, 2001",...,Bones,Original,1 hr 55 min,R - 17+ (violence & profanity),189.0,602,1448,206248.0,360978,https://cdn.myanimelist.net/images/anime/1439/...
2,6,Trigun,Trigun,トライガン,8.22,"Action, Adventure, Sci-Fi","Vash the Stampede is the man with a $$60,000,0...",TV,26.0,"Apr 1, 1998 to Sep 30, 1998",...,Madhouse,Manga,24 min per ep,PG-13 - Teens 13 or older,328.0,246,15035,356739.0,727252,https://cdn.myanimelist.net/images/anime/7/203...
3,7,Witch Hunter Robin,Witch Hunter Robin,Witch Hunter ROBIN (ウイッチハンターロビン),7.25,"Action, Drama, Mystery, Supernatural",Robin Sena is a powerful craft user drafted in...,TV,26.0,"Jul 3, 2002 to Dec 25, 2002",...,Sunrise,Original,25 min per ep,PG-13 - Teens 13 or older,2764.0,1795,613,42829.0,111931,https://cdn.myanimelist.net/images/anime/10/19...
4,8,Bouken Ou Beet,Beet the Vandel Buster,冒険王ビィト,6.94,"Adventure, Fantasy, Supernatural",It is the dark century and the people are suff...,TV,52.0,"Sep 30, 2004 to Sep 29, 2005",...,Toei Animation,Manga,23 min per ep,PG - Children,4240.0,5126,14,6413.0,15001,https://cdn.myanimelist.net/images/anime/7/215...


In [4]:
anime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24905 entries, 0 to 24904
Data columns (total 24 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   anime_id      24905 non-null  int64 
 1   Name          24905 non-null  object
 2   English name  24905 non-null  object
 3   Other name    24905 non-null  object
 4   Score         24905 non-null  object
 5   Genres        24905 non-null  object
 6   Synopsis      24905 non-null  object
 7   Type          24905 non-null  object
 8   Episodes      24905 non-null  object
 9   Aired         24905 non-null  object
 10  Premiered     24905 non-null  object
 11  Status        24905 non-null  object
 12  Producers     24905 non-null  object
 13  Licensors     24905 non-null  object
 14  Studios       24905 non-null  object
 15  Source        24905 non-null  object
 16  Duration      24905 non-null  object
 17  Rating        24905 non-null  object
 18  Rank          24905 non-null  object
 19  Popu

In [5]:
anime.describe()

Unnamed: 0,anime_id,Popularity,Favorites,Members
count,24905.0,24905.0,24905.0,24905.0
mean,29776.709014,12265.388356,432.595222,37104.96
std,17976.07629,7187.428393,4353.181647,156825.2
min,1.0,0.0,0.0,0.0
25%,10507.0,6040.0,0.0,209.0
50%,34628.0,12265.0,1.0,1056.0
75%,45240.0,18491.0,18.0,9326.0
max,55735.0,24723.0,217606.0,3744541.0


For anime metadata we can use anime-dataset-2023; for user data we can use user-score-2023. Let's start with the anime metadata.

Columns to Keep:
- anime_id - Unique identifier for each anime.
- Name - The main title of the anime.
- Score - Average user rating of the anime (important for ranking/recommendation).
- Genres - Genres associated with the anime (used for content-based filtering).
- Synopsis - Brief description (useful for text-based similarity).
- Type - Type of anime (e.g., TV, Movie, OVA).
- Episodes - Number of episodes (useful for filtering preferences).
- Aired - Date range the anime aired (could extract year or seasonal trends).
- Popularity - Popularity score, indicating the general audience interest.
- Members - Total number of users who added this anime to their list (as a measure of popularity).
- Studios - Useful for identifying notable creators or production houses.
- Source - Indicates where the anime is adapted from (e.g., manga, light novel).

Columns which might not be needed (kept for now and revisited later):
- Rating - Age ratings might not add much value unless used for specific filtering.
- Rank - Redundant if Score and Popularity are used.
- Favorites - Overlaps with Popularity.
- Scored By - Redundant since we already have the Score.

Columns to Drop:
- English name - Dropped due to too many missing or unknown values.
- Other name - Unlikely to provide additional value for recommendations.
- Premiered - Redundant; the same information can be extracted from Aired.
- Status - Completed status is often assumed or irrelevant to recommendations.
- Producers, Licensors - Less critical unless specifically required.
- Duration - Episode duration is unlikely to influence recommendations.
- Image URL - Not relevant for data-driven recommendations.

In [6]:
# List of columns to keep
columns_to_keep = [
    "anime_id", "Name", "Score", "Rank", "Genres", "Synopsis", 
    "Type", "Episodes", "Aired", "Popularity", "Members", 
    "Studios", "Source", "Favorites", "Scored By", "Rating"
]

# Creating a new df
anime_filtered_df = anime[columns_to_keep]

anime_filtered_df.head()

Unnamed: 0,anime_id,Name,Score,Rank,Genres,Synopsis,Type,Episodes,Aired,Popularity,Members,Studios,Source,Favorites,Scored By,Rating
0,1,Cowboy Bebop,8.75,41.0,"Action, Award Winning, Sci-Fi","Crime is timeless. By the year 2071, humanity ...",TV,26.0,"Apr 3, 1998 to Apr 24, 1999",43,1771505,Sunrise,Original,78525,914193.0,R - 17+ (violence & profanity)
1,5,Cowboy Bebop: Tengoku no Tobira,8.38,189.0,"Action, Sci-Fi","Another day, another bounty—such is the life o...",Movie,1.0,"Sep 1, 2001",602,360978,Bones,Original,1448,206248.0,R - 17+ (violence & profanity)
2,6,Trigun,8.22,328.0,"Action, Adventure, Sci-Fi","Vash the Stampede is the man with a $$60,000,0...",TV,26.0,"Apr 1, 1998 to Sep 30, 1998",246,727252,Madhouse,Manga,15035,356739.0,PG-13 - Teens 13 or older
3,7,Witch Hunter Robin,7.25,2764.0,"Action, Drama, Mystery, Supernatural",Robin Sena is a powerful craft user drafted in...,TV,26.0,"Jul 3, 2002 to Dec 25, 2002",1795,111931,Sunrise,Original,613,42829.0,PG-13 - Teens 13 or older
4,8,Bouken Ou Beet,6.94,4240.0,"Adventure, Fantasy, Supernatural",It is the dark century and the people are suff...,TV,52.0,"Sep 30, 2004 to Sep 29, 2005",5126,15001,Toei Animation,Manga,14,6413.0,PG - Children


In [7]:
# Check for missing values in the filtered dataframe
anime_filtered_df.isnull().sum()

anime_id      0
Name          0
Score         0
Rank          0
Genres        0
Synopsis      0
Type          0
Episodes      0
Aired         0
Popularity    0
Members       0
Studios       0
Source        0
Favorites     0
Scored By     0
Rating        0
dtype: int64

In [8]:
# Save the file
anime_filtered_df.to_csv("data/anime_filtered_display.csv", index=False)

In [9]:
# Strip any leading/trailing spaces and convert column names to lowercase
anime_filtered_df.columns = [col.strip().lower() for col in anime_filtered_df.columns]

In [10]:
# Lowercase all string values
anime_filtered_df = anime_filtered_df.map(lambda x: x.lower() if isinstance(x, str) else x)

In [11]:
# Display the dataframe to confirm the change
anime_filtered_df

Unnamed: 0,anime_id,name,score,rank,genres,synopsis,type,episodes,aired,popularity,members,studios,source,favorites,scored by,rating
0,1,cowboy bebop,8.75,41.0,"action, award winning, sci-fi","crime is timeless. by the year 2071, humanity ...",tv,26.0,"apr 3, 1998 to apr 24, 1999",43,1771505,sunrise,original,78525,914193.0,r - 17+ (violence & profanity)
1,5,cowboy bebop: tengoku no tobira,8.38,189.0,"action, sci-fi","another day, another bounty—such is the life o...",movie,1.0,"sep 1, 2001",602,360978,bones,original,1448,206248.0,r - 17+ (violence & profanity)
2,6,trigun,8.22,328.0,"action, adventure, sci-fi","vash the stampede is the man with a $$60,000,0...",tv,26.0,"apr 1, 1998 to sep 30, 1998",246,727252,madhouse,manga,15035,356739.0,pg-13 - teens 13 or older
3,7,witch hunter robin,7.25,2764.0,"action, drama, mystery, supernatural",robin sena is a powerful craft user drafted in...,tv,26.0,"jul 3, 2002 to dec 25, 2002",1795,111931,sunrise,original,613,42829.0,pg-13 - teens 13 or older
4,8,bouken ou beet,6.94,4240.0,"adventure, fantasy, supernatural",it is the dark century and the people are suff...,tv,52.0,"sep 30, 2004 to sep 29, 2005",5126,15001,toei animation,manga,14,6413.0,pg - children
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24900,55731,wu nao monu,unknown,unknown,"comedy, fantasy, slice of life",no description available for this anime.,ona,15.0,"jul 4, 2023 to ?",24723,0,unknown,web manga,0,unknown,pg-13 - teens 13 or older
24901,55732,bu xing si: yuan qi,unknown,0.0,"action, adventure, fantasy",no description available for this anime.,ona,18.0,"jul 27, 2023 to ?",0,0,unknown,web novel,0,unknown,pg-13 - teens 13 or older
24902,55733,di yi xulie,unknown,0.0,"action, adventure, fantasy, sci-fi",no description available for this anime.,ona,16.0,"jul 19, 2023 to ?",0,0,unknown,web novel,0,unknown,pg-13 - teens 13 or older
24903,55734,bokura no saishuu sensou,unknown,0.0,unknown,a music video for the song bokura no saishuu s...,music,1.0,"apr 23, 2022",0,0,unknown,original,0,unknown,pg-13 - teens 13 or older


In [12]:
# Check for "unknown" in all string columns
unknowns = anime_filtered_df.map(lambda x: 'unknown' in str(x) if isinstance(x, str) else False)

In [13]:
# Display rows with "unknown" values
anime_filtered_df[unknowns.any(axis=1)]

Unnamed: 0,anime_id,name,score,rank,genres,synopsis,type,episodes,aired,popularity,members,studios,source,favorites,scored by,rating
11,21,one piece,8.69,55.0,"action, adventure, fantasy","gol d. roger was known as the ""pirate king,"" t...",tv,unknown,"oct 20, 1999 to ?",20,2168904,toei animation,manga,198986,1226493.0,pg-13 - teens 13 or older
37,56,avenger,5.86,9454.0,"adventure, fantasy, sci-fi",mars has been colonized and is a world where c...,tv,13.0,"oct 2, 2003 to dec 25, 2003",4856,17396,bee train,original,20,6788.0,r - 17+ (violence & profanity)
127,149,loveless,6.76,5020.0,"action, boys love, drama, mystery, supernatural","in the world of loveless, each person is born ...",tv,12.0,"apr 7, 2005 to jun 30, 2005",1526,137813,j.c.staff,manga,1308,69267.0,pg-13 - teens 13 or older
143,165,rahxephon,7.39,2108.0,"action, award winning, drama, mystery, romance...",the ordinary life of high school student ayato...,tv,26.0,"jan 21, 2002 to sep 11, 2002",1745,116145,bones,original,947,42467.0,pg-13 - teens 13 or older
152,175,tokyo underground,6.59,5855.0,"action, adventure, romance, sci-fi","under the capital city of tokyo, japan, there ...",tv,26.0,"apr 2, 2002 to sep 24, 2002",3873,28859,pierrot,manga,43,11503.0,pg-13 - teens 13 or older
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24900,55731,wu nao monu,unknown,unknown,"comedy, fantasy, slice of life",no description available for this anime.,ona,15.0,"jul 4, 2023 to ?",24723,0,unknown,web manga,0,unknown,pg-13 - teens 13 or older
24901,55732,bu xing si: yuan qi,unknown,0.0,"action, adventure, fantasy",no description available for this anime.,ona,18.0,"jul 27, 2023 to ?",0,0,unknown,web novel,0,unknown,pg-13 - teens 13 or older
24902,55733,di yi xulie,unknown,0.0,"action, adventure, fantasy, sci-fi",no description available for this anime.,ona,16.0,"jul 19, 2023 to ?",0,0,unknown,web novel,0,unknown,pg-13 - teens 13 or older
24903,55734,bokura no saishuu sensou,unknown,0.0,unknown,a music video for the song bokura no saishuu s...,music,1.0,"apr 23, 2022",0,0,unknown,original,0,unknown,pg-13 - teens 13 or older


In [14]:
# Count occurrences of "unknown" in each column
unknown_counts = anime_filtered_df.map(lambda x: 'unknown' in str(x).lower() if isinstance(x, str) else False).sum()

# Display the counts of "unknown" per column
print(unknown_counts)

anime_id          0
name              1
score          9213
rank           4612
genres         4929
synopsis        247
type             74
episodes        611
aired             0
popularity        0
members           0
studios       10526
source         3689
favorites         0
scored by      9213
rating          669
dtype: int64


Remove rows where the 'rating' or 'genres' column contains the word 'hentai'

In [15]:
# Remove rows where the 'rating' column contains the word 'hentai'
anime_filtered_df = anime_filtered_df[~anime_filtered_df['rating'].str.contains('hentai', case=False, na=False)]

# Remove rows where the 'genres' column contains the word 'hentai'
anime_filtered_df = anime_filtered_df[~anime_filtered_df['genres'].str.contains('hentai', case=False, na=False)]

In [16]:
# Count occurrences of "unknown" in each column
unknown_counts = anime_filtered_df.map(lambda x: 'unknown' in str(x).lower() if isinstance(x, str) else False).sum()

# Display the counts of "unknown" per column
print(unknown_counts)
print(anime_filtered_df.shape)

anime_id          0
name              1
score          9192
rank           3129
genres         4929
synopsis        225
type             74
episodes        568
aired             0
popularity        0
members           0
studios       10363
source         3546
favorites         0
scored by      9192
rating          668
dtype: int64
(23417, 16)
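
The same count-and-shape check is repeated after every filtering step in this notebook. As a minimal sketch, it could be wrapped in a small helper. The function name `count_placeholder` and the toy frame below are hypothetical and stand in for `anime_filtered_df`; the loop is used instead of `DataFrame.map` only so that the sketch does not depend on the pandas version.

```python
import pandas as pd

def count_placeholder(df: pd.DataFrame, token: str = "unknown") -> pd.Series:
    """Count, per column, the string cells that contain `token` (case-insensitive)."""
    counts = {}
    for col in df.columns:
        # Non-string cells (numbers, NaN) never count as a match.
        counts[col] = sum(isinstance(v, str) and token in v.lower() for v in df[col])
    return pd.Series(counts, dtype="int64")

# Toy frame (hypothetical data) standing in for anime_filtered_df
toy = pd.DataFrame({
    "name": ["cowboy bebop", "trigun", "wu nao monu"],
    "score": ["8.75", "8.22", "unknown"],
    "members": [1771505, 727252, 0],
})

unknown_counts = count_placeholder(toy)
print(unknown_counts)
print(toy.shape)
```

Calling the helper after each filter would replace the repeated cells with one line each, and the `token` argument makes it reusable for other placeholders.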


In [17]:
# Check for columns with 0 values
zero_columns = (anime_filtered_df == 0).sum()
zero_columns = zero_columns[zero_columns > 0]

# Display columns with 0
print(zero_columns)

popularity      184
members         183
favorites     10718
dtype: int64


Remove rows where popularity or members is 0

In [18]:
# Remove rows where popularity or members is 0
anime_filtered_df = anime_filtered_df[(anime_filtered_df['popularity'] != 0) & (anime_filtered_df['members'] != 0)]

In [19]:
# Count occurrences of "unknown" in each column
unknown_counts = anime_filtered_df.map(lambda x: 'unknown' in str(x).lower() if isinstance(x, str) else False).sum()

# Display the counts of "unknown" per column
print(unknown_counts)
print(anime_filtered_df.shape)

anime_id          0
name              1
score          9007
rank           3128
genres         4859
synopsis        224
type             56
episodes        525
aired             0
popularity        0
members           0
studios       10182
source         3519
favorites         0
scored by      9007
rating          584
dtype: int64
(23232, 16)


There is still too much noise in the data. We can start by removing rows with an unknown score.

In [20]:
# Remove rows where the 'score' column is exactly 'unknown'
anime_filtered_df = anime_filtered_df[anime_filtered_df['score'].str.lower() != 'unknown']
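
The filter above leaves `score` as a string column. A possible alternative, sketched here on a toy frame with hypothetical rows, is to convert the column with `pd.to_numeric(errors="coerce")` so that placeholders become NaN, then keep only the rows that have a real number. This is not what the notebook does; it is one way to get a numeric score in the same step.

```python
import pandas as pd

# Toy frame (hypothetical data) standing in for anime_filtered_df
toy = pd.DataFrame({
    "name": ["cowboy bebop", "wu nao monu", "trigun"],
    "score": ["8.75", "unknown", "8.22"],
})

# Anything that is not a number (here the "unknown" placeholder) becomes NaN.
numeric_score = pd.to_numeric(toy["score"], errors="coerce")

# Keep only rows with a real score and store the score as a float.
scored = toy[numeric_score.notna()].assign(score=numeric_score)
print(scored)
print(scored["score"].dtype)
```

With a numeric column, later steps such as sorting or thresholding on score work without further conversion.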

In [21]:
# Count occurrences of "unknown" in each column
unknown_counts = anime_filtered_df.map(lambda x: 'unknown' in str(x).lower() if isinstance(x, str) else False).sum()

# Display the counts of "unknown" per column
print(unknown_counts)
print(anime_filtered_df.shape)

anime_id         0
name             0
score            0
rank          1526
genres        1753
synopsis       202
type             1
episodes        52
aired            0
popularity       0
members          0
studios       3374
source        1595
favorites        0
scored by        0
rating          94
dtype: int64
(14225, 16)


In [22]:
# Remove rows where 'episodes', 'type', or 'genres' is 'unknown' ('score' was handled above)
anime_filtered_df = anime_filtered_df[
    (anime_filtered_df['episodes'].str.lower() != 'unknown') &
    (anime_filtered_df['type'].str.lower() != 'unknown') &
    (anime_filtered_df['genres'].str.lower() != 'unknown')
]

In [23]:
# Count occurrences of "unknown" in each column
unknown_counts = anime_filtered_df.map(lambda x: 'unknown' in str(x).lower() if isinstance(x, str) else False).sum()

# Display the counts of "unknown" per column
print(unknown_counts)
print(anime_filtered_df.shape)

anime_id         0
name             0
score            0
rank           443
genres           0
synopsis       199
type             0
episodes         0
aired            0
popularity       0
members          0
studios       2218
source        1398
favorites        0
scored by        0
rating          85
dtype: int64
(12421, 16)


In [24]:
# Remove rows where 'rank' is "unknown"
anime_filtered_df = anime_filtered_df[anime_filtered_df['rank'] != 'unknown']

In [25]:
# Count occurrences of "unknown" in each column
unknown_counts = anime_filtered_df.map(lambda x: 'unknown' in str(x).lower() if isinstance(x, str) else False).sum()

# Display the counts of "unknown" per column
print(unknown_counts)
print(anime_filtered_df.shape)

anime_id         0
name             0
score            0
rank             0
genres           0
synopsis       198
type             0
episodes         0
aired            0
popularity       0
members          0
studios       1929
source        1366
favorites        0
scored by        0
rating          85
dtype: int64
(11978, 16)


After looking through the data, the remaining "unknown" matches in synopsis turn out to be real plot summaries that happen to contain the word "unknown", so those rows are kept. As for rows with an unknown studio, most of those series are not notable enough to belong in the recommendation list to begin with, so we remove them.

In [26]:
# Remove rows where 'studios' column contains "unknown"
anime_filtered_df = anime_filtered_df[~anime_filtered_df['studios'].str.contains('unknown', case=False, na=False)]

In [27]:
# Count occurrences of "unknown" in each column
unknown_counts = anime_filtered_df.map(lambda x: 'unknown' in str(x).lower() if isinstance(x, str) else False).sum()

# Display the counts of "unknown" per column
print(unknown_counts)
print(anime_filtered_df.shape)

anime_id        0
name            0
score           0
rank            0
genres          0
synopsis      185
type            0
episodes        0
aired           0
popularity      0
members         0
studios         0
source        865
favorites       0
scored by       0
rating         59
dtype: int64
(10049, 16)
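
The synopsis count stays above zero because the check is a substring match. A minimal sketch of how a placeholder cell could be told apart from a summary that merely mentions the word is shown below; the toy synopses are made up for illustration.

```python
import pandas as pd

# Toy frame (hypothetical data) standing in for anime_filtered_df
toy = pd.DataFrame({
    "synopsis": [
        "a girl is chased by an unknown organization.",
        "unknown",
        "crime is timeless.",
    ]
})

text = toy["synopsis"].str.strip().str.lower()

# The whole cell is the placeholder.
is_placeholder = text == "unknown"

# The word appears inside a longer, real summary.
mentions_word = text.str.contains("unknown", regex=False) & ~is_placeholder

print(is_placeholder.sum(), mentions_word.sum())
```

Only rows flagged by `is_placeholder` would be candidates for removal; rows flagged by `mentions_word` carry usable text.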


In [28]:
anime_filtered_df.head()

Unnamed: 0,anime_id,name,score,rank,genres,synopsis,type,episodes,aired,popularity,members,studios,source,favorites,scored by,rating
0,1,cowboy bebop,8.75,41.0,"action, award winning, sci-fi","crime is timeless. by the year 2071, humanity ...",tv,26.0,"apr 3, 1998 to apr 24, 1999",43,1771505,sunrise,original,78525,914193.0,r - 17+ (violence & profanity)
1,5,cowboy bebop: tengoku no tobira,8.38,189.0,"action, sci-fi","another day, another bounty—such is the life o...",movie,1.0,"sep 1, 2001",602,360978,bones,original,1448,206248.0,r - 17+ (violence & profanity)
2,6,trigun,8.22,328.0,"action, adventure, sci-fi","vash the stampede is the man with a $$60,000,0...",tv,26.0,"apr 1, 1998 to sep 30, 1998",246,727252,madhouse,manga,15035,356739.0,pg-13 - teens 13 or older
3,7,witch hunter robin,7.25,2764.0,"action, drama, mystery, supernatural",robin sena is a powerful craft user drafted in...,tv,26.0,"jul 3, 2002 to dec 25, 2002",1795,111931,sunrise,original,613,42829.0,pg-13 - teens 13 or older
4,8,bouken ou beet,6.94,4240.0,"adventure, fantasy, supernatural",it is the dark century and the people are suff...,tv,52.0,"sep 30, 2004 to sep 29, 2005",5126,15001,toei animation,manga,14,6413.0,pg - children


In [29]:
# Save the file
anime_filtered_df.to_csv("data/anime_filtered.csv", index=False)

We will extract the year from the 'aired' column:

In [30]:
import pandas as pd
import re

def extract_year(aired):
    # Capture the first four-digit number, i.e. the start year when 'aired' is a range
    match = re.search(r'(\d{4})', aired)
    if match:
        return int(match.group(1))  # Return the first year found
    return None  # If no year is found, return None

# Apply the function to create the 'year' column, using .loc to avoid the warning
anime_filtered_df.loc[:, 'year'] = anime_filtered_df['aired'].apply(extract_year)

In [31]:
# Check the DataFrame
anime_filtered_df[['aired', 'year']].head()

Unnamed: 0,aired,year
0,"apr 3, 1998 to apr 24, 1999",1998.0
1,"sep 1, 2001",2001.0
2,"apr 1, 1998 to sep 30, 1998",1998.0
3,"jul 3, 2002 to dec 25, 2002",2002.0
4,"sep 30, 2004 to sep 29, 2005",2004.0
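
The years appear as floats (1998.0) because rows where no year is found return None, which forces the column to a float dtype. As a sketch of an alternative, the extraction can be done with the vectorized `Series.str.extract` and stored in pandas' nullable `Int64` dtype, which holds integers and missing values together. The toy series below is hypothetical.

```python
import pandas as pd

# Toy series (hypothetical data) standing in for anime_filtered_df['aired']
aired = pd.Series([
    "apr 3, 1998 to apr 24, 1999",
    "sep 1, 2001",
    "not available",
])

# expand=False returns a Series; rows without a 4-digit run become NaN.
year = aired.str.extract(r"(\d{4})", expand=False)

# Convert the matched text to numbers, then to the nullable integer dtype.
year = pd.to_numeric(year).astype("Int64")
print(year)
```

Rows without a year can then be inspected with `year.isna()`, the same way the notebook does for the `year` column.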


In [32]:
# Check for any missing or incorrect values in the 'year' column
anime_filtered_df[anime_filtered_df['year'].isna()]

Unnamed: 0,anime_id,name,score,rank,genres,synopsis,type,episodes,aired,popularity,members,studios,source,favorites,scored by,rating,year
13099,35628,honoo no alpenrose: ai no symphony ongaku-hen,5.7,10070.0,"drama, romance",a music-video style recap ova. this release co...,ova,1.0,not available,14846,512,tatsunoko production,manga,0,185.0,g - all ages,


In [33]:
# Remove the row where anime_id is 35628
anime_filtered_df = anime_filtered_df[anime_filtered_df['anime_id'] != 35628]

In [34]:
# Check for any missing or incorrect values in the 'year' column
anime_filtered_df[anime_filtered_df['year'].isna()]

Unnamed: 0,anime_id,name,score,rank,genres,synopsis,type,episodes,aired,popularity,members,studios,source,favorites,scored by,rating,year


In [35]:
# Convert the 'year' column to integer type
anime_filtered_df['year'] = anime_filtered_df['year'].astype(int)

In [36]:
# Check the DataFrame
anime_filtered_df[['aired', 'year']].head()

Unnamed: 0,aired,year
0,"apr 3, 1998 to apr 24, 1999",1998
1,"sep 1, 2001",2001
2,"apr 1, 1998 to sep 30, 1998",1998
3,"jul 3, 2002 to dec 25, 2002",2002
4,"sep 30, 2004 to sep 29, 2005",2004


In [37]:
anime_filtered_df.columns

Index(['anime_id', 'name', 'score', 'rank', 'genres', 'synopsis', 'type',
       'episodes', 'aired', 'popularity', 'members', 'studios', 'source',
       'favorites', 'scored by', 'rating', 'year'],
      dtype='object')

We can drop the 'aired' column (we now have 'year') and the 'scored by' column (we already have the popularity ranking).

In [38]:
anime_filtered_df = anime_filtered_df.drop(columns=['aired', 'scored by'])

In [39]:
anime_filtered_df.head()

Unnamed: 0,anime_id,name,score,rank,genres,synopsis,type,episodes,popularity,members,studios,source,favorites,rating,year
0,1,cowboy bebop,8.75,41.0,"action, award winning, sci-fi","crime is timeless. by the year 2071, humanity ...",tv,26.0,43,1771505,sunrise,original,78525,r - 17+ (violence & profanity),1998
1,5,cowboy bebop: tengoku no tobira,8.38,189.0,"action, sci-fi","another day, another bounty—such is the life o...",movie,1.0,602,360978,bones,original,1448,r - 17+ (violence & profanity),2001
2,6,trigun,8.22,328.0,"action, adventure, sci-fi","vash the stampede is the man with a $$60,000,0...",tv,26.0,246,727252,madhouse,manga,15035,pg-13 - teens 13 or older,1998
3,7,witch hunter robin,7.25,2764.0,"action, drama, mystery, supernatural",robin sena is a powerful craft user drafted in...,tv,26.0,1795,111931,sunrise,original,613,pg-13 - teens 13 or older,2002
4,8,bouken ou beet,6.94,4240.0,"adventure, fantasy, supernatural",it is the dark century and the people are suff...,tv,52.0,5126,15001,toei animation,manga,14,pg - children,2004


In [40]:
# Save the file
anime_filtered_df.to_csv("data/anime_filtered.csv", index=False)

In [41]:
anime_filtered_df.columns

Index(['anime_id', 'name', 'score', 'rank', 'genres', 'synopsis', 'type',
       'episodes', 'popularity', 'members', 'studios', 'source', 'favorites',
       'rating', 'year'],
      dtype='object')

In [42]:
# Check all unique entries under the 'rating' column
unique_ratings = anime_filtered_df['rating'].unique()
unique_ratings

array(['r - 17+ (violence & profanity)', 'pg-13 - teens 13 or older',
       'pg - children', 'r+ - mild nudity', 'g - all ages', 'unknown'],
      dtype=object)

In [43]:
# Replace 'unknown' with 'unrated' and standardize other rating values
rating_mapping = {
    'unknown': 'unrated',
    'r - 17+ (violence & profanity)': 'rated 17',
    'pg-13 - teens 13 or older': 'parental guidance 13',
    'pg - children': 'parental guidance',
    'r+ - mild nudity': 'rated plus',
    'g - all ages': 'general'
}

# Apply the mapping
anime_filtered_df['rating'] = anime_filtered_df['rating'].map(rating_mapping).fillna(anime_filtered_df['rating'])

# Check the result
print(anime_filtered_df['rating'].unique())

['rated 17' 'parental guidance 13' 'parental guidance' 'rated plus'
 'general' 'unrated']
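
Before applying a mapping like the one above, it can help to list any labels the dictionary does not cover. The sketch below uses a shortened, hypothetical mapping and toy ratings to show that check, and why `map()` is paired with `fillna()`.

```python
import pandas as pd

# Shortened mapping and toy ratings (hypothetical data)
rating_mapping = {
    "unknown": "unrated",
    "g - all ages": "general",
}
ratings = pd.Series(["g - all ages", "unknown", "pg - children"])

# Labels with no entry in the mapping; map() alone would turn these into NaN.
unmapped = sorted(set(ratings.unique()) - set(rating_mapping))
print(unmapped)

# map() followed by fillna() keeps the original label for unmapped values.
standardized = ratings.map(rating_mapping).fillna(ratings)
print(standardized.tolist())
```

An empty `unmapped` list means every label is covered and the `fillna` fallback is never used.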


In [44]:
anime_filtered_df.head()

Unnamed: 0,anime_id,name,score,rank,genres,synopsis,type,episodes,popularity,members,studios,source,favorites,rating,year
0,1,cowboy bebop,8.75,41.0,"action, award winning, sci-fi","crime is timeless. by the year 2071, humanity ...",tv,26.0,43,1771505,sunrise,original,78525,rated 17,1998
1,5,cowboy bebop: tengoku no tobira,8.38,189.0,"action, sci-fi","another day, another bounty—such is the life o...",movie,1.0,602,360978,bones,original,1448,rated 17,2001
2,6,trigun,8.22,328.0,"action, adventure, sci-fi","vash the stampede is the man with a $$60,000,0...",tv,26.0,246,727252,madhouse,manga,15035,parental guidance 13,1998
3,7,witch hunter robin,7.25,2764.0,"action, drama, mystery, supernatural",robin sena is a powerful craft user drafted in...,tv,26.0,1795,111931,sunrise,original,613,parental guidance 13,2002
4,8,bouken ou beet,6.94,4240.0,"adventure, fantasy, supernatural",it is the dark century and the people are suff...,tv,52.0,5126,15001,toei animation,manga,14,parental guidance,2004


In [45]:
# Save the file
anime_filtered_df.to_csv("data/anime_filtered.csv", index=False)

We are done cleaning the anime-dataset-2023 dataset. The result is anime_filtered_df, which we can use for content-based filtering.