Next we clean the users-score-2023 dataset. It contains a lot of data that is irrelevant to our model, which we remove in this step.

In [1]:
import pandas as pd

In [5]:
# We will use this filtered anime data to keep relevant user data
anime_filtered_df = pd.read_csv("data/anime_filtered.csv")

In [3]:
anime_filtered_df.head()

Unnamed: 0,anime_id,name,score,rank,genres,synopsis,type,episodes,popularity,members,studios,source,favorites,rating,year
0,1,cowboy bebop,8.75,41.0,"action, award winning, sci-fi","crime is timeless. by the year 2071, humanity ...",tv,26.0,43,1771505,sunrise,original,78525,rated 17,1998
1,5,cowboy bebop: tengoku no tobira,8.38,189.0,"action, sci-fi","another day, another bounty—such is the life o...",movie,1.0,602,360978,bones,original,1448,rated 17,2001
2,6,trigun,8.22,328.0,"action, adventure, sci-fi","vash the stampede is the man with a $$60,000,0...",tv,26.0,246,727252,madhouse,manga,15035,parental guidance 13,1998
3,7,witch hunter robin,7.25,2764.0,"action, drama, mystery, supernatural",robin sena is a powerful craft user drafted in...,tv,26.0,1795,111931,sunrise,original,613,parental guidance 13,2002
4,8,bouken ou beet,6.94,4240.0,"adventure, fantasy, supernatural",it is the dark century and the people are suff...,tv,52.0,5126,15001,toei animation,manga,14,parental guidance,2004


In [4]:
# We start with loading user data
user = pd.read_csv("archive/users-score-2023.csv")

In [6]:
user

Unnamed: 0,user_id,Username,anime_id,Anime Title,rating
0,1,Xinil,21,One Piece,9
1,1,Xinil,48,.hack//Sign,7
2,1,Xinil,320,A Kite,5
3,1,Xinil,49,Aa! Megami-sama!,8
4,1,Xinil,304,Aa! Megami-sama! Movie,8
...,...,...,...,...,...
24325186,1291087,Oblongata,10611,R-15,3
24325187,1291087,Oblongata,174,Tenjou Tenge,6
24325188,1291097,JuunanaSai,1535,Death Note,9
24325189,1291097,JuunanaSai,226,Elfen Lied,10


We will only keep ratings for anime that appear in the anime_filtered dataset we cleaned earlier.

In [7]:
# Get the valid anime IDs from anime_filtered_df
valid_anime_ids = anime_filtered_df['anime_id']

In [8]:
# Filter the user dataframe to keep only rows with anime_id in valid_anime_ids
user_clean = user[user['anime_id'].isin(valid_anime_ids)]

In [9]:
# Reset the index for better readability
user_clean = user_clean.reset_index(drop=True)

In [10]:
user_clean.head()

Unnamed: 0,user_id,Username,anime_id,Anime Title,rating
0,1,Xinil,48,.hack//Sign,7
1,1,Xinil,49,Aa! Megami-sama!,8
2,1,Xinil,304,Aa! Megami-sama! Movie,8
3,1,Xinil,306,Abenobashi Mahou☆Shoutengai,8
4,1,Xinil,53,Ai Yori Aoshi,7


In [11]:
# Keep 'user_id', 'anime_id', 'rating' columns from the dataframe
user_clean = user_clean[['user_id', 'anime_id', 'rating']]
user_clean.head()

Unnamed: 0,user_id,anime_id,rating
0,1,48,7
1,1,49,8
2,1,304,8
3,1,306,8
4,1,53,7
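
Since only user_id, anime_id and rating are kept, the raw file could also be read with just those columns, which lowers peak memory on a file of roughly 24 million rows. The sketch below is an alternative, not what the notebook ran. It uses a tiny in-memory sample in place of archive/users-score-2023.csv, assumes the CSV header uses the column names shown in the preview, and assumes ratings are whole numbers from 1 to 10 with no missing values in these columns.

```python
import io
import pandas as pd

# Tiny stand-in for archive/users-score-2023.csv (rows copied from the preview above).
sample_csv = io.StringIO(
    "user_id,Username,anime_id,Anime Title,rating\n"
    "1,Xinil,21,One Piece,9\n"
    "1,Xinil,48,.hack//Sign,7\n"
    "1,Xinil,320,A Kite,5\n"
)

# Read only the three columns used downstream, with compact integer types.
# Assumption: no missing values, and ratings fit in int8 (1 to 10).
user_small = pd.read_csv(
    sample_csv,
    usecols=["user_id", "anime_id", "rating"],
    dtype={"user_id": "int32", "anime_id": "int32", "rating": "int8"},
)

print(user_small.dtypes)
print(user_small)
```

If any of these columns did contain missing values, the integer dtypes would raise an error on load, so the missing-value check would have to come first.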


In [12]:
# Check for missing values in the entire DataFrame
print(user_clean.isnull().sum())

user_id     0
anime_id    0
rating      0
dtype: int64


In [13]:
# Check for duplicates
duplicates = user_clean.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

Number of duplicate rows: 0


In [14]:
# Check for duplicates in user_id and anime_id combinations
duplicates = user_clean[user_clean.duplicated(subset=['user_id', 'anime_id'])]
print(f"Number of duplicates: {len(duplicates)}")
print(duplicates.head())

Number of duplicates: 0
Empty DataFrame
Columns: [user_id, anime_id, rating]
Index: []


In [15]:
# Count unique users and anime before filtering
print(f"Number of unique users before filtering: {user_clean['user_id'].nunique()}")
print(f"Number of anime before filtering: {user_clean['anime_id'].nunique()}")

Number of unique users before filtering: 267605
Number of anime before filtering: 9930


The user data is very large, so we will trim it down.

In [16]:
# Count the number of ratings per user
user_rating_counts = user_clean.groupby('user_id').size()

In [17]:
user_rating_counts

user_id
1          248
4          279
9           65
20         104
23         280
          ... 
1291057    232
1291079     93
1291085     59
1291087    209
1291097      3
Length: 267605, dtype: int64

We ran a series of sparsity checks on the data. Our ideal sweet spot for sparsity is around 70% to 80%.

From our sparsity checks, the sweet spot for users is those who have rated at least 300 anime.

In [18]:
# Filter users who have rated at least 300 anime
user_clean = user_clean[user_clean['user_id'].isin(user_rating_counts[user_rating_counts >= 300].index)]
print(f"Number of unique users: {user_clean['user_id'].nunique()}")

Number of unique users: 8624


From our sparsity checks, the sweet spot for anime is titles with at least 800 ratings.

In [19]:
# Filter out anime with fewer than 800 ratings
anime_counts = user_clean['anime_id'].value_counts()
filtered_anime = anime_counts[anime_counts >= 800].index
user_clean = user_clean[user_clean['anime_id'].isin(filtered_anime)]
print(f"Number of anime after filtering: {user_clean['anime_id'].nunique()}")

Number of anime after filtering: 1920


In [20]:
print(f"Number of unique users: {user_clean['user_id'].nunique()}")
print(f"Number of unique anime: {user_clean['anime_id'].nunique()}")

Number of unique users: 8624
Number of unique anime: 1920
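
One thing worth checking at this point: the anime filter runs after the user filter, so a user who passed the 300-rating threshold on the full table may hold fewer than 300 ratings once low-count anime are dropped. The counts above do not show whether that happens here. A minimal sketch of such a check follows, on a toy table, with a hypothetical helper users_below_threshold and a threshold of 2 standing in for 300.

```python
import pandas as pd

def users_below_threshold(ratings: pd.DataFrame, min_user_ratings: int) -> pd.Series:
    """Return rating counts for users who hold fewer than min_user_ratings rows."""
    counts = ratings.groupby("user_id").size()
    return counts[counts < min_user_ratings]

# Toy data: user 2 has two ratings, but one of them is for anime 30,
# which only one user rated.
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3, 3],
    "anime_id": [10, 20, 10, 30, 10, 20],
    "rating":   [8, 7, 9, 6, 5, 7],
})

# Every user starts with 2 ratings, so a user threshold of 2 keeps everyone.
# Dropping anime with fewer than 2 ratings removes anime 30.
anime_counts = toy["anime_id"].value_counts()
toy_filtered = toy[toy["anime_id"].isin(anime_counts[anime_counts >= 2].index)]

# User 2 is left with a single rating after the anime filter.
below = users_below_threshold(toy_filtered, min_user_ratings=2)
print(below)
```

On the real data the equivalent call would be users_below_threshold(user_clean, 300). If it returns any users, the two filters could be repeated until both thresholds hold, at the cost of a smaller dataset.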


We used the following sparsity check, rerun with different thresholds, to arrive at our user + anime combination.

In [21]:
# Sparsity check: We have run this a few times to get our best combo for user + anime

# Number of unique users and anime
num_users = user_clean['user_id'].nunique()
num_anime = user_clean['anime_id'].nunique()

# Total possible interactions (users x anime)
total_possible_interactions = num_users * num_anime

# Actual number of ratings
num_ratings = len(user_clean)

# Calculate sparsity
sparsity = 1 - (num_ratings / total_possible_interactions)

print(f"Number of Users: {num_users}")
print(f"Number of Anime: {num_anime}")
print(f"Number of Ratings: {num_ratings}")
print(f"Sparsity: {sparsity:.2%}")

Number of Users: 8624
Number of Anime: 1920
Number of Ratings: 3738836
Sparsity: 77.42%
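
As a sanity check, the reported sparsity can be reproduced by hand from the three counts printed above. The snippet only plugs in those printed numbers. The density and the per-user and per-anime averages are derived from them and were not part of the original output.

```python
# Counts copied from the output above.
num_users = 8624
num_anime = 1920
num_ratings = 3_738_836

# Size of the full user x anime matrix.
total_possible = num_users * num_anime

# Density is the filled fraction of the matrix; sparsity is the empty fraction.
density = num_ratings / total_possible
sparsity = 1 - density

print(f"Possible interactions: {total_possible}")
print(f"Density: {density:.2%}")
print(f"Sparsity: {sparsity:.2%}")
print(f"Average ratings per user: {num_ratings / num_users:.1f}")
print(f"Average ratings per anime: {num_ratings / num_anime:.1f}")
```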


Sparsity with at least 400 ratings per user and at least 1000 ratings per anime:
- Number of Users: 5770
- Number of Anime: 1287
- Number of Ratings: 2409398
- Sparsity: 67.55%

Sparsity with at least 200 ratings per user and at least 500 ratings per anime:
- Number of Users: 40516
- Number of Anime: 3978
- Number of Ratings: 12586531
- Sparsity: 92.19%

Sparsity with at least 250 ratings per user and at least 700 ratings per anime:
- Number of Users: 28341
- Number of Anime: 3161
- Number of Ratings: 9507865
- Sparsity: 89.39%

Sparsity with at least 300 ratings per user and at least 800 ratings per anime (the combination we chose):
- Number of Users: 8624
- Number of Anime: 1920
- Number of Ratings: 3738836
- Sparsity: 77.42%
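
The threshold pairs listed above were obtained by re-running the notebook with different values. The same comparison could be scripted in one pass. The sketch below is one way to do it, assuming the same order of operations as the cells above: users are filtered first, then anime counts are taken on the remaining rows. The function name sparsity_after_filtering and the toy table are illustrative and not from the original notebook.

```python
import pandas as pd

def sparsity_after_filtering(ratings, min_user_ratings, min_anime_ratings):
    """Apply both thresholds and return the resulting counts and sparsity."""
    # Keep users with enough ratings (counted on the unfiltered table).
    user_counts = ratings.groupby("user_id").size()
    keep_users = user_counts[user_counts >= min_user_ratings].index
    kept = ratings[ratings["user_id"].isin(keep_users)]

    # Keep anime with enough ratings among the remaining users.
    anime_counts = kept["anime_id"].value_counts()
    keep_anime = anime_counts[anime_counts >= min_anime_ratings].index
    kept = kept[kept["anime_id"].isin(keep_anime)]

    num_users = kept["user_id"].nunique()
    num_anime = kept["anime_id"].nunique()
    num_ratings = len(kept)
    # Sparsity is undefined when nothing survives the filters.
    if num_ratings:
        sparsity = 1 - num_ratings / (num_users * num_anime)
    else:
        sparsity = float("nan")
    return {
        "min_user_ratings": min_user_ratings,
        "min_anime_ratings": min_anime_ratings,
        "num_users": num_users,
        "num_anime": num_anime,
        "num_ratings": num_ratings,
        "sparsity": sparsity,
    }

# Toy ratings table standing in for user_clean.
toy = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 3],
    "anime_id": [10, 20, 30, 10, 20, 10],
    "rating":   [9, 8, 7, 6, 9, 5],
})

# One row of results per (user threshold, anime threshold) pair.
pairs = [(1, 1), (1, 2), (2, 2)]
results = pd.DataFrame([sparsity_after_filtering(toy, u, a) for u, a in pairs])
print(results)
```

On the real data the pairs would be (400, 1000), (200, 500), (250, 700) and (300, 800), and the function would be called on user_clean as it stands before the two threshold filters are applied.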

Now we can save this user dataset for later use.

In [22]:
user_clean.to_csv("data/user_clean.csv", index=False)

Our user_clean dataset, with 8624 users, 1920 anime, 3738836 ratings and a sparsity of 77.42%, is a good combination for building our collaborative filtering model, and it is also easier to handle with limited resources.