
Idea:
- Create recommendations using user-user similarity
- Create recommendations using item-item similarity
- Evaluate against `anime_recommendation` as ground truth. The file contains 
- Compare the results of the two recommendation approaches.
- Filter the recommendations by "Won't Watch", "Watched", and "Watching", because we don't want to recommend anything that users don't want to watch, have already watched, or are currently watching.
- Weight recommendations by status: "Want to Watch" 1, "Stalled" 0.8, "Dropped" 0.6. This prioritizes anime that users want to watch or are neutral about, and deprioritizes anime that users have stalled on or dropped.
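
The filtering and weighting ideas above could look roughly like the sketch below. It is only an illustration: the function name `filter_and_weight`, the column names, and the choice to give anime without a status a neutral weight of 1.0 are assumptions, not something the notebook implements.

```python
import pandas as pd

# Status labels and weights as listed in the idea list (assumed spelling).
EXCLUDED_STATUSES = {"Won't Watch", "Watched", "Watching"}
STATUS_WEIGHTS = {"Want to Watch": 1.0, "Stalled": 0.8, "Dropped": 0.6}


def filter_and_weight(recommendations, user_status):
    """Drop excluded anime and scale predicted ratings by watching status.

    recommendations: DataFrame with columns "anime_id" and "rating".
    user_status: dict mapping anime_id -> status label for one user.
    Anime without a status get weight 1.0 (an assumption).
    """
    status = recommendations["anime_id"].map(user_status)
    keep = ~status.isin(EXCLUDED_STATUSES)
    result = recommendations[keep].copy()
    weights = status[keep].map(STATUS_WEIGHTS).fillna(1.0)
    result["weighted_rating"] = result["rating"] * weights
    return result.sort_values("weighted_rating", ascending=False)


recs = pd.DataFrame({"anime_id": [1, 2, 3, 4], "rating": [4.5, 4.0, 5.0, 3.0]})
statuses = {1: "Dropped", 2: "Want to Watch", 3: "Watched"}
ranked = filter_and_weight(recs, statuses)
print(ranked)
```

In this toy input, anime 3 is removed because it is already watched, and anime 1 falls to the bottom because its predicted rating is scaled by 0.6.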

In [1]:
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
tqdm.pandas()

# Data Analysis

Read the rating data from `animelist.csv`, a file containing all ratings.

In [2]:
# anime_df = pd.read_csv(anime)
animelist_df = pd.read_csv("animelist.csv")
# animelist_df = pd.read_csv("animelist.csv", dtype={"user_id": "Int32", "anime_id": "Int32", "rating": np.float32})
# animelist_df = pd.read_csv("animelist.csv", dtype={"user_id": "Int32", "anime_id": "Int32", "rating": "Float32"})

animelist_df = animelist_df.drop(["watching_status", "watched_episodes"], axis=1)
# anime_recommendations_df = pd.read_csv(anime_recommendations)
# rating_complete_df = pd.read_csv(rating_complete)
# watching_status = pd.read_csv(watching_status)

Check basic metadata

In [3]:
animelist_df.head()

Unnamed: 0,user_id,anime_id,rating
0,0,7173,0.0
1,0,5323,0.0
2,0,5028,0.0
3,0,1048,0.0
4,0,12221,0.0


In [4]:
animelist_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20842201 entries, 0 to 20842200
Data columns (total 3 columns):
 #   Column    Dtype  
---  ------    -----  
 0   user_id   int64  
 1   anime_id  int64  
 2   rating    float64
dtypes: float64(1), int64(2)
memory usage: 477.0 MB


In [5]:
animelist_df.describe()

Unnamed: 0,user_id,anime_id,rating
count,20842200.0,20842200.0,20842200.0
mean,39842.37,4826.288,1.747458
std,22853.48,3823.595,2.005187
min,0.0,2.0,0.0
25%,19833.0,1549.0,0.0
50%,40329.0,4207.0,0.0
75%,59367.0,7123.0,4.0
max,79300.0,17365.0,6.0


Check the number of unique anime IDs and user IDs

In [6]:
len(animelist_df['anime_id'].unique())

16745

In [7]:
len(animelist_df['user_id'].unique())

74129

In [8]:
animelist_df['anime_id'].sort_values().unique()

array([    2,     3,     4, ..., 17356, 17364, 17365])

There are outliers in the ratings: the max is 6.0. Check how many outliers there are.

In [9]:
animelist_df.groupby("rating").count()

Unnamed: 0_level_0,user_id,anime_id
rating,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0,11149626,11149626
0.5,131262,131262
1.0,154680,154680
1.5,178176,178176
2.0,408463,408463
2.5,597495,597495
3.0,1264631,1264631
3.5,1503363,1503363
4.0,2062762,2062762
4.5,1285605,1285605


The number of outliers is very small, only 11, so we can remove them.

In [10]:
animelist_df = animelist_df[animelist_df['rating'] <= 5]
animelist_df.groupby("rating").count().sort_values('rating', ascending=False)

Unnamed: 0_level_0,user_id,anime_id
rating,Unnamed: 1_level_1,Unnamed: 2_level_1
5.0,2106127,2106127
4.5,1285605,1285605
4.0,2062762,2062762
3.5,1503363,1503363
3.0,1264631,1264631
2.5,597495,597495
2.0,408463,408463
1.5,178176,178176
1.0,154680,154680
0.5,131262,131262


Check for users who did not give any rating, i.e. users whose ratings are all 0.

In [11]:
user_max_rating = animelist_df[["user_id", "rating"]].groupby("user_id").agg('max')
user_no_rating = user_max_rating[user_max_rating["rating"] == 0]
user_no_rating_count = user_no_rating.count()['rating']

all_users = pd.Series(animelist_df['user_id'].unique(), name="users")
user_no_rating_count / all_users.count() * 100
user_no_rating_count

4849

Verify that the users flagged as giving no rating really gave none. Their only unique rating value should be 0.

In [12]:
no_rating = animelist_df[animelist_df['user_id'].isin(user_no_rating.index.to_list())]
no_rating['rating'].unique()

array([0.])

Percentage of rows that belong to users with no rating, relative to all data

In [13]:
no_rating['rating'].count() / animelist_df['rating'].count() * 100

2.2453830427608614

Remove all users whose ratings are all 0, because they carry no useful information.

In [14]:
users_with_rating_id = list(set(all_users.to_list()) - set(user_no_rating.index.to_list()))
animelist_df = animelist_df[animelist_df['user_id'].isin(users_with_rating_id)]

Replace all 0 with NaN, so we can use np.nanmean(x) to calculate the mean.

In [15]:
animelist_df['rating'] = animelist_df['rating'].replace(0.0, np.nan)
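
As a small illustration of why the zeros are masked, the sketch below compares a plain mean with `np.nanmean` on made-up ratings. The values are invented, and it follows the notebook's working assumption that a 0 means "not rated".

```python
import numpy as np

# Toy ratings for one anime; 0.0 stands for "not rated" (notebook's assumption).
raw = np.array([0.0, 4.0, 0.0, 5.0, 3.0])

# Treating zeros as real ratings drags the mean down.
mean_with_zeros = raw.mean()

# Masking zeros as NaN lets np.nanmean average only the actual ratings.
masked = np.where(raw == 0.0, np.nan, raw)
mean_of_rated = np.nanmean(masked)

print(mean_with_zeros, mean_of_rated)
```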

Convert the rating column from float64 to float32 to save memory. Without this, execution always crashes because it runs out of memory.

In [16]:
animelist_df['rating'] = animelist_df['rating'].astype('float32')
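
A rough way to see the effect of the conversion is to compare the memory footprint of the same column in both dtypes. The series below is synthetic and only meant as a sketch; the half-step rating values are an assumption based on the rating counts shown earlier.

```python
import numpy as np
import pandas as pd

# Synthetic ratings on a half-step scale, one million rows.
rng = np.random.default_rng(0)
ratings = pd.Series(rng.choice([0.5, 3.0, 4.5], size=1_000_000))

# Bytes used by the values alone (index excluded).
bytes_64 = ratings.memory_usage(index=False)
bytes_32 = ratings.astype("float32").memory_usage(index=False)

print(bytes_64, bytes_32)
```

Half-step values such as 0.5 or 4.5 are exactly representable in float32, so the conversion loses no precision for ratings on this scale.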

Create matrix of user and anime, to calculate similarity

In [17]:
rating_matrix = animelist_df.pivot_table(index="user_id", columns="anime_id", values="rating")
rating_matrix

anime_id,2,3,4,5,6,7,8,10,11,12,...,17326,17329,17331,17333,17335,17338,17339,17341,17343,17364
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,0.5,,3.0,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79296,,,,,,,,,,,...,,,,,,,,,,
79297,,,,,,,,,,,...,,,,,,,,,,
79298,,,,,,,,,,,...,,,,,,,,,,
79299,,,,,,,,,,,...,,,,,,,,,,


Center the matrix by subtracting each anime's mean rating from its ratings.

In [19]:
rating_matrix_center = rating_matrix.apply(lambda x: x - np.nanmean(x), axis=0)
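
On a matrix of this size, a broadcasting form of the same centering may be faster than a column-wise `apply`. The sketch below checks on a toy matrix (invented values) that both forms agree; it relies on `DataFrame.mean` skipping NaN by default.

```python
import numpy as np
import pandas as pd

# Toy user x anime matrix; NaN marks a missing rating.
toy = pd.DataFrame(
    {10: [4.0, np.nan, 2.0], 20: [np.nan, 5.0, 3.0]},
    index=pd.Index([1, 2, 3], name="user_id"),
)

# Column-wise apply, as in the cell above.
centered_apply = toy.apply(lambda x: x - np.nanmean(x), axis=0)

# Broadcasting alternative: subtract the per-anime mean from every row.
centered_vec = toy - toy.mean(axis=0)

print(centered_vec)
```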

In [21]:
rating_matrix_center

anime_id,2,3,4,5,6,7,8,10,11,12,...,17326,17329,17331,17333,17335,17338,17339,17341,17343,17364
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,-3.286687,,-0.414724,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79296,,,,,,,,,,,...,,,,,,,,,,
79297,,,,,,,,,,,...,,,,,,,,,,
79298,,,,,,,,,,,...,,,,,,,,,,
79299,,,,,,,,,,,...,,,,,,,,,,


In [22]:
rating_matrix_center.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 69280 entries, 0 to 79300
Columns: 15951 entries, 2 to 17364
dtypes: float32(15951)
memory usage: 4.1 GB


Replace NaN with 0 and transpose so it becomes anime x user

In [23]:
rating_matrix_center.fillna(0, inplace=True)
rating_matrix_center = rating_matrix_center.transpose()
rating_matrix_center

user_id,0,1,2,3,4,5,6,7,8,9,...,79291,79292,79293,79294,79295,79296,79297,79298,79299,79300
anime_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.400033,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.310716,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,-3.286687,0.0,0.0,0.0,0.0,0.713313,0.0,1.213313,-2.786687,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17338,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17339,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17341,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17343,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Create similarity matrix

In [28]:
from sklearn.metrics.pairwise import cosine_similarity

cossim = cosine_similarity(rating_matrix_center, rating_matrix_center)
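
To make explicit what `cosine_similarity` computes on the anime x user matrix, here is a minimal NumPy sketch on invented centered ratings. It assumes every row has a non-zero norm; in the real matrix some anime rows are all zeros, which this plain division would not handle.

```python
import numpy as np

# Toy anime x user matrix of centered ratings, with 0 for missing entries.
centered = np.array([
    [ 1.0, -1.0, 0.0],
    [ 0.5, -0.5, 0.0],
    [-1.0,  0.0, 1.0],
])

# Cosine similarity: dot product of two rows divided by the product of their norms.
norms = np.linalg.norm(centered, axis=1, keepdims=True)
unit_rows = centered / norms
sim = unit_rows @ unit_rows.T

print(np.round(sim, 3))
```

Rows 0 and 1 point in the same direction, so their similarity is 1 even though the magnitudes differ, while rows 0 and 2 partly oppose each other and get a negative value.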

In [30]:
sim_matrix = pd.DataFrame(cossim,
                          index=rating_matrix_center.index,
                          columns=rating_matrix_center.index)
sim_matrix

anime_id,2,3,4,5,6,7,8,10,11,12,...,17326,17329,17331,17333,17335,17338,17339,17341,17343,17364
anime_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2,0.999999,0.074766,0.036544,0.078020,0.058192,0.082408,0.115555,0.051990,0.043302,0.042876,...,0.000000,0.0,-0.001336,0.000000,-0.001336,0.0,0.003536,0.0,0.008607,0.0
3,0.074766,1.000001,0.074501,0.036417,0.035036,0.070169,0.061773,0.045757,0.047951,0.055902,...,0.000000,0.0,0.006361,0.000000,0.000000,0.0,0.000000,0.0,-0.002672,0.0
4,0.036544,0.074501,1.000000,0.061880,0.049421,0.028704,0.026566,0.046181,0.075259,0.044764,...,-0.003309,0.0,0.016008,0.000000,-0.003309,0.0,-0.003752,0.0,-0.005873,0.0
5,0.078020,0.036417,0.061880,0.999999,0.034435,0.056169,0.083549,0.072266,0.052710,0.037232,...,-0.005184,0.0,-0.008343,-0.005184,-0.033371,0.0,-0.018919,0.0,0.004105,0.0
6,0.058192,0.035036,0.049421,0.034435,0.999999,0.055915,0.027636,0.017520,0.035906,0.067930,...,0.016856,0.0,0.018146,0.015122,0.015122,0.0,0.019433,0.0,0.005991,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17338,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0
17339,0.003536,0.000000,-0.003752,-0.018919,0.019433,0.036278,-0.006093,-0.002368,0.000000,0.011833,...,0.566947,0.0,0.566947,0.566947,0.755929,0.0,1.000000,0.0,0.000000,0.0
17341,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0
17343,0.008607,-0.002672,-0.005873,0.004105,0.005991,0.004372,0.002468,-0.003335,0.006906,0.000000,...,0.000000,0.0,-0.140028,0.000000,0.000000,0.0,0.000000,0.0,1.000000,0.0


Predict the rating of an anime for a given user ID

In [31]:
anime_id = 2
user_id = 1

sim_values = sim_matrix[anime_id].sort_values(ascending=False).iloc[1:] # exclude itself
# sim_values
similar_anime_ids = sim_values.index
# similar_anime_ids
rated_anime = sim_values[rating_matrix_center[user_id].loc[similar_anime_ids] != 0]

np.dot(rated_anime, rating_matrix.transpose()[user_id].loc[rated_anime.index]), np.sum(rated_anime), \
np.dot(rated_anime, rating_matrix.transpose()[user_id].loc[rated_anime.index]) / np.sum(rated_anime)
# rated_anime, rating_matrix.transpose()[1].loc[rated_anime.index], \
# rated_anime * rating_matrix.transpose()[1].loc[rated_anime.index]

(47.673424, 13.438063, 3.547641)
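
The three numbers in the output are the numerator, the denominator, and their quotient of a similarity-weighted average. A minimal sketch of that formula on invented values (the helper name `weighted_average_prediction` is hypothetical):

```python
import numpy as np


def weighted_average_prediction(similarities, ratings):
    """Similarity-weighted average of the ratings a user gave to similar anime."""
    similarities = np.asarray(similarities, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    return float(np.dot(similarities, ratings) / np.sum(similarities))


# Toy case: three rated anime with similarities 0.5, 0.3 and 0.2 to the target.
pred = weighted_average_prediction([0.5, 0.3, 0.2], [4.0, 3.0, 5.0])
print(pred)
```

Because the denominator sums the raw similarities, negative values can partly cancel positive ones. A common variant divides by the sum of absolute similarities instead, which is not what this notebook does.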

# Initial Data Pipeline

This is the cleaned-up process derived from the analysis above, collected into one cell so it can simply be run in one go.

In [None]:
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
tqdm.pandas()


animelist_df = pd.read_csv("animelist.csv")
animelist_df = animelist_df.drop(["watching_status", "watched_episodes"], axis=1)
animelist_df = animelist_df[animelist_df['rating'] <= 5]

user_max_rating = animelist_df[["user_id", "rating"]].groupby("user_id").agg('max')
user_no_rating = user_max_rating[user_max_rating["rating"] == 0]
all_users = pd.Series(animelist_df['user_id'].unique(), name="users")
users_with_rating_id = list(set(all_users.to_list()) - set(user_no_rating.index.to_list()))

animelist_df = animelist_df[animelist_df['user_id'].isin(users_with_rating_id)]
animelist_df['rating'] = animelist_df['rating'].replace(0.0, np.nan)

animelist_df['rating'] = animelist_df['rating'].astype('float32')
rating_matrix = animelist_df.pivot_table(index="user_id", columns="anime_id", values="rating")
rating_matrix_center = rating_matrix.apply(lambda x: x - np.nanmean(x), axis=0)
rating_matrix_center.fillna(0, inplace=True)
rating_matrix_center = rating_matrix_center.transpose()

In [None]:
rating_matrix_center.head()

In [None]:
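# rating_matrix_center is already anime x user after the pipeline cell above.
# Transposing it again would turn it back into user x anime, so the similarity
# matrix below would compare users rather than anime.
# Keep this line commented out for item-item similarity.
# rating_matrix_center = rating_matrix_center.transpose()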

# Item - Item Similarity

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
sim_matrix = pd.DataFrame(cosine_similarity(rating_matrix_center, rating_matrix_center),
                          index=rating_matrix_center.index,
                          columns=rating_matrix_center.index)

In [None]:
sim_matrix

In [None]:
def get_similar_anime(anime_id):
    return sim_matrix[anime_id].sort_values(ascending=False).iloc[1:] # exclude itself

In [None]:
def predict_rating(user_id, anime_id):
    # Similarities between the target anime and every other anime.
    sim_values = get_similar_anime(anime_id)
    similar_anime_ids = sim_values.index
    # Keep only anime the user has rated (non-zero entry in the centered matrix).
    rated_anime = sim_values[rating_matrix_center[user_id].loc[similar_anime_ids] != 0]
    # Similarity-weighted average of the user's raw ratings.
    user_ratings = rating_matrix.loc[user_id, rated_anime.index]
    return np.dot(rated_anime, user_ratings) / np.sum(rated_anime)

In [None]:
predict_rating(1, 17)

# Get Recommendation

In [None]:
def get_recommendations(user_id):
    ratings = [predict_rating(user_id, anime_id) for anime_id in sim_matrix.index]
    recommendation = pd.DataFrame({"anime_id": sim_matrix.index, "rating": ratings})
    
    return recommendation.sort_values(by="rating", ascending=False)
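
Calling `predict_rating` once per anime sorts a similarity column on every call, which may be slow for roughly 16,000 titles. One possible alternative is to compute all predictions for a user with a single matrix product. The sketch below uses plain NumPy arrays and a hypothetical helper `predict_all`; unlike the notebook, it detects rated anime by checking for NaN in the raw ratings.

```python
import numpy as np


def predict_all(sim, user_ratings):
    """Predict a rating for every anime from one user's raw ratings.

    sim: square anime x anime similarity array.
    user_ratings: 1-D array of raw ratings, NaN where the user has not rated.
    """
    sim = np.asarray(sim, dtype=float).copy()
    user_ratings = np.asarray(user_ratings, dtype=float)
    np.fill_diagonal(sim, 0.0)  # exclude each anime's similarity to itself
    rated = ~np.isnan(user_ratings)
    weights = sim[:, rated]
    numer = weights @ user_ratings[rated]
    denom = weights.sum(axis=1)
    # Anime with a zero denominator come out as NaN or inf instead of raising.
    with np.errstate(divide="ignore", invalid="ignore"):
        return numer / denom


toy_sim = np.array([
    [1.00, 0.50, 0.25],
    [0.50, 1.00, 0.50],
    [0.25, 0.50, 1.00],
])
toy_ratings = np.array([4.0, np.nan, 2.0])
preds = predict_all(toy_sim, toy_ratings)
print(preds)
```

For the unrated anime at position 1, the prediction is the average of 4.0 and 2.0 weighted equally, since both neighbours have similarity 0.5.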

In [None]:
get_recommendations(1)[:20]

# Please ignore this, obsolete and irrelevant

In [None]:
from google.colab import drive
drive.mount('/content/drive')

folder_path = "/content/drive/MyDrive/Colab Notebooks/Data Mining Project"
anime = folder_path + "/anime.csv.zip"
animelist = folder_path + "/animelist.csv.zip"
anime_recommendations = folder_path + "/anime_recommendations.csv.zip"
rating_complete = folder_path + "/rating_complete.csv.zip"
watching_status = folder_path + "/watching_status.csv"

We need to determine whether zero ratings are legitimate ratings of 0 or simply missing. We don't want to predict a rating for an anime that the user legitimately rated 0. For example, rating is 0 but  

In [None]:
zero_rating = animelist_df2[animelist_df2["rating"] == 0]
zero_rating.groupby("watching_status").agg(["count"])

In [None]:
zero_rating = animelist_df2[animelist_df2["rating"] == 0.5]
zero_rating.groupby("watching_status").agg(["count"])

In [None]:
rating_matrix.fillna(0, inplace=True)

In [None]:
watching_status

In [None]:
animelist_df.groupby(['user_id', 'anime_id'])['rating'].max().unstack()

In [None]:
# animelist_df = animelist_df.pivot_table(index="user_id", columns="anime_id", values="rating")
# animelist_df = animelist_df.astype('float32')
# animelist_df = animelist_df.apply(lambda x: x - np.nanmean(x), axis=1)
# animelist_df.fillna(0, inplace=True)
# animelist_df = animelist_df.transpose()