<a href="https://colab.research.google.com/gist/zakiindra/163c066411b823a83a63e2ab37e58fa4/anime-recommendation-system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Idea:
- Generate recommendations using user-user similarity
- Generate recommendations using item-item similarity
- Evaluate against anime_recommendation as ground truth
- Compare the results of the two approaches
- Filter out recommended anime whose status is "Won't Watch", "Watched", or "Watching"
- Weight recommendations by status: "Want to Watch" 1, "Stalled" 0.8, "Dropped" 0.6
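
The last two items of the idea list could be sketched as below. This is only an illustration, not code from the notebook: the function name `apply_status_rules`, the column names `status` and `weighted_rating`, and the default weight of 1.0 for anime the user has not listed are all assumptions.

```python
import pandas as pd

# Status labels and weights taken from the idea list; the names are hypothetical.
EXCLUDED_STATUSES = {"Won't Watch", "Watched", "Watching"}
STATUS_WEIGHTS = {"Want to Watch": 1.0, "Stalled": 0.8, "Dropped": 0.6}


def apply_status_rules(recs: pd.DataFrame, default_weight: float = 1.0) -> pd.DataFrame:
    """Drop excluded statuses, then scale each predicted rating by its status weight.

    `recs` is assumed to have the columns `anime_id`, `rating` and `status`.
    A missing status means the user has not listed the anime at all; such
    rows get `default_weight` (an assumption, the idea list does not say).
    """
    kept = recs[~recs["status"].isin(EXCLUDED_STATUSES)].copy()
    weights = kept["status"].map(STATUS_WEIGHTS).fillna(default_weight)
    kept["weighted_rating"] = kept["rating"] * weights
    return kept.sort_values("weighted_rating", ascending=False)


# Toy predictions for one user
recs = pd.DataFrame({
    "anime_id": [1, 2, 3, 4, 5],
    "rating": [4.5, 4.8, 4.0, 5.0, 3.0],
    "status": ["Want to Watch", "Dropped", "Watched", "Stalled", None],
})
ranked = apply_status_rules(recs)
```

In this toy input, anime 3 is removed because it is already watched, and anime 2 falls to the bottom because its predicted rating is scaled by 0.6.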

In [None]:
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
tqdm.pandas()

# Data Analysis

In [None]:
# anime_df = pd.read_csv(anime)
animelist_df = pd.read_csv("animelist.csv")
# animelist_df = pd.read_csv("animelist.csv", dtype={"user_id": "Int32", "anime_id": "Int32", "rating": np.float32})
# animelist_df = pd.read_csv("animelist.csv", dtype={"user_id": "Int32", "anime_id": "Int32", "rating": "Float32"})

animelist_df = animelist_df.drop(["watching_status", "watched_episodes"], axis=1)
# anime_recommendations_df = pd.read_csv(anime_recommendations)
# rating_complete_df = pd.read_csv(rating_complete)
# watching_status = pd.read_csv(watching_status)

Ratings from all users

In [None]:
animelist_df.head()

In [None]:
animelist_df.info()

In [None]:
animelist_df.describe()

In [33]:
len(animelist_df['anime_id'].unique())

16745

In [35]:
animelist_df['anime_id'].sort_values().unique()

array([    2,     3,     4, ..., 17356, 17364, 17365])

There is an outlier in rating: the max is 6.0. Check how many outliers there are.

In [None]:
animelist_df.groupby("rating").count()

The number of outliers is very small, so we can remove them.

In [None]:
animelist_df = animelist_df[animelist_df['rating'] <= 5]
animelist_df.groupby("rating").count().sort_values('rating', ascending=False)

Check for users who did not give any rating, i.e. users whose ratings are all 0.

In [None]:
user_max_rating = animelist_df[["user_id", "rating"]].groupby("user_id").agg('max')
user_no_rating = user_max_rating[user_max_rating["rating"] == 0]
user_no_rating_count = user_no_rating.count()['rating']

all_users = pd.Series(animelist_df['user_id'].unique(), name="users")
# Show both the count and the percentage (only the last expression of a cell is displayed)
user_no_rating_count, user_no_rating_count / all_users.count() * 100

Verify that the users flagged as having no rating really have only 0 ratings.

In [None]:
no_rating = animelist_df[animelist_df['user_id'].isin(user_no_rating.index.to_list())]
no_rating['rating'].unique()

Percentage of rows that come from users with no rating, relative to all data

In [None]:
no_rating['rating'].count() / animelist_df['rating'].count() * 100

Remove all users whose ratings are all 0, because they are not useful.

In [None]:
users_with_rating_id = list(set(all_users.to_list()) - set(user_no_rating.index.to_list()))
animelist_df = animelist_df[animelist_df['user_id'].isin(users_with_rating_id)]
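
The same filter can also be written without building Python sets. The sketch below uses a toy frame with the same column names; `ratings`, `has_rating` and `filtered` are illustrative names, not variables from the notebook.

```python
import pandas as pd

# Toy data: user 2 has only 0 ratings and should be dropped.
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "anime_id": [10, 11, 10, 12, 11],
    "rating": [0.0, 4.5, 0.0, 0.0, 3.0],
})

# transform("max") broadcasts each user's highest rating back onto their rows,
# so the comparison yields one boolean per row of the original frame.
has_rating = ratings.groupby("user_id")["rating"].transform("max") > 0
filtered = ratings[has_rating]
```

This keeps every row of users 1 and 3, including user 1's 0 rating, which mirrors what the `isin` filter above does.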

In [None]:
# Treat 0 as "not rated"; np.nan is used because the np.NaN alias was removed in NumPy 2.0
animelist_df['rating'] = animelist_df['rating'].replace(0.0, np.nan)


In [None]:
animelist_df.head()

In [None]:
rating_matrix = animelist_df.pivot_table(index="user_id", columns="anime_id", values="rating")

In [None]:
# No-op kept from the exploration: missing entries are already NaN after pivot_table
rating_matrix = rating_matrix.fillna(np.nan)

In [None]:
rating_matrix = rating_matrix.apply(lambda x: x - np.nanmean(x), axis=1)
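
Row-wise `apply` can be slow on a matrix with this many users. A vectorized form of the same mean-centering might look like the sketch below; the toy matrix and the names `toy`, `centered` and `by_apply` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy user x anime matrix with missing ratings
toy = pd.DataFrame(
    [[4.0, np.nan, 2.0],
     [np.nan, 5.0, 3.0]],
    index=pd.Index([1, 2], name="user_id"),
    columns=pd.Index([10, 11, 12], name="anime_id"),
)

# DataFrame.mean skips NaN by default, so this is each user's mean over rated anime only.
centered = toy.sub(toy.mean(axis=1), axis=0)

# Same result as the row-wise apply used in the notebook
by_apply = toy.apply(lambda row: row - np.nanmean(row), axis=1)
```

User 1 has a mean of 3.0 and user 2 a mean of 4.0, so the centered rows are `[1, NaN, -1]` and `[NaN, 1, -1]` in both versions.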

In [None]:
rating_matrix.head()

In [None]:
rating_matrix.info()

In [None]:
# Scratch check on the first user's row (leaves the values unchanged)
rating_matrix.iloc[0].apply(lambda x: np.nan if pd.isna(x) else x)

# Initial Pipeline

In [1]:
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
tqdm.pandas()


animelist_df = pd.read_csv("animelist.csv")
animelist_df = animelist_df.drop(["watching_status", "watched_episodes"], axis=1)
animelist_df = animelist_df[animelist_df['rating'] <= 5]

user_max_rating = animelist_df[["user_id", "rating"]].groupby("user_id").agg('max')
user_no_rating = user_max_rating[user_max_rating["rating"] == 0]
all_users = pd.Series(animelist_df['user_id'].unique(), name="users")
users_with_rating_id = list(set(all_users.to_list()) - set(user_no_rating.index.to_list()))

animelist_df = animelist_df[animelist_df['user_id'].isin(users_with_rating_id)]
animelist_df['rating'] = animelist_df['rating'].replace(0.0, np.nan)
# animelist_df = animelist_df.pivot_table(index="user_id", columns="anime_id", values="rating")
# animelist_df = animelist_df.astype('float32')
# animelist_df = animelist_df.apply(lambda x: x - np.nanmean(x), axis=1)
# animelist_df.fillna(0, inplace=True)
# animelist_df = animelist_df.transpose()

animelist_df['rating'] = animelist_df['rating'].astype('float32')
rating_matrix = animelist_df.pivot_table(index="user_id", columns="anime_id", values="rating")
# rating_matrix = rating_matrix.astype('float32')
rating_matrix_center = rating_matrix.apply(lambda x: x - np.nanmean(x), axis=1)
rating_matrix_center.fillna(0, inplace=True)
# rating_matrix_center = rating_matrix_center.transpose()

anime_id \ user_id,0,1,2,3,4,5,6,7,8,9,...,79291,79292,79293,79294,79295,79296,79297,79298,79299,79300
2,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,-0.247649,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,-0.247649,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,-3.022277,0.0,0.0,0.0,0.0,0.706349,0.0,0.752351,-2.135714,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17338,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17339,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17341,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17343,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [3]:
rating_matrix_center.head()

user_id \ anime_id,2,3,4,5,6,7,8,10,11,12,...,17326,17329,17331,17333,17335,17338,17339,17341,17343,17364
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,-3.022277,0.0,-0.522277,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
rating_matrix_center = rating_matrix_center.transpose()

# Item - Item Similarity

In [4]:
from sklearn.metrics.pairwise import cosine_similarity

In [19]:
sim_matrix = pd.DataFrame(cosine_similarity(rating_matrix_center, rating_matrix_center),
                          index=rating_matrix_center.index,
                          columns=rating_matrix_center.index)
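
To make the similarity values easier to read, here is a small NumPy sketch of cosine similarity between anime rows of a centered matrix. It is an illustration with made-up numbers, not the notebook's computation; the helper `cosine_sim` is hypothetical, and giving an all-zero row a similarity of 0 is a choice made here.

```python
import numpy as np


def cosine_sim(matrix: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for rows with no ratings
    unit = matrix / norms
    return unit @ unit.T


# Rows are anime, columns are users, values are mean-centered ratings (0 = not rated).
toy_center = np.array([
    [1.0, -1.0, 0.0],   # anime A
    [2.0, -2.0, 0.0],   # anime B: same taste pattern as A, larger magnitude
    [-1.0, 1.0, 0.0],   # anime C: opposite pattern
    [0.0, 0.0, 0.0],    # anime D: nobody rated it
])
toy_sim = cosine_sim(toy_center)
```

A and B point in the same direction, so their similarity is 1; A and C are opposite, so it is -1; D has no ratings, so its whole row and column are 0.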

In [20]:
sim_matrix

anime_id,2,3,4,5,6,7,8,10,11,12,...,17326,17329,17331,17333,17335,17338,17339,17341,17343,17364
2,1.000000,0.028697,-0.006795,0.013215,-0.002808,0.034110,0.031200,-0.003357,0.011834,-0.007549,...,0.000000,0.000000,-0.014770,0.000000,0.018777,0.000000,0.011330,0.000000,0.007417,0.0
3,0.028697,1.000000,0.034022,-0.005610,0.000280,0.015707,0.018723,0.001569,0.018067,0.015143,...,0.000000,0.000000,-0.033104,0.000000,0.000000,0.000000,0.000000,0.000000,0.000963,0.0
4,-0.006795,0.034022,1.000000,0.019477,0.024749,-0.006135,-0.012956,0.018277,0.040482,0.032483,...,0.024051,0.000000,-0.051496,0.000000,0.026145,0.000000,0.025851,0.000000,0.019466,0.0
5,0.013215,-0.005610,0.019477,1.000000,-0.020479,0.019894,0.023303,0.030403,0.022515,-0.005986,...,-0.055229,-0.083563,-0.010334,-0.067448,-0.010292,-0.083563,-0.002558,0.000000,0.001619,0.0
6,-0.002808,0.000280,0.024749,-0.020479,1.000001,0.013823,-0.033634,-0.034134,0.028455,0.055459,...,-0.000192,-0.000291,-0.010559,0.001610,0.007708,-0.000291,0.004656,-0.012724,-0.001929,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17338,0.000000,0.000000,0.000000,-0.083563,-0.000291,-0.030113,0.000000,0.000000,0.000000,0.000000,...,0.660934,1.000000,-0.094657,0.807154,-0.091931,1.000000,-0.069981,0.000000,0.000000,0.0
17339,0.011330,0.000000,0.025851,-0.002558,0.004656,0.005866,-0.003354,0.002478,0.000000,0.038122,...,-0.046253,-0.069981,0.006624,-0.056485,0.761227,-0.069981,1.000000,0.000000,0.000000,0.0
17341,0.000000,0.000000,0.000000,0.000000,-0.012724,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,0.0
17343,0.007417,0.000963,0.019466,0.001619,-0.001929,0.001762,0.010145,-0.002825,-0.002393,0.002659,...,0.000000,0.000000,-0.033298,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.0


In [124]:
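def get_similar_anime(anime_id):
    # Drop the anime itself by label: iloc[1:] removes the wrong row when another
    # anime ties at similarity 1.0 or when the anime's own similarity is 0
    return sim_matrix[anime_id].drop(anime_id).sort_values(ascending=False)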

In [125]:
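def predict_rating(user_id, anime_id):
    sim_values = get_similar_anime(anime_id)
    # Keep only the similar anime that this user has rated (centered rating != 0)
    rated_anime = sim_values[rating_matrix_center[user_id].loc[sim_values.index] != 0]
    # Raw (uncentered) ratings of this user for those anime
    user_ratings = rating_matrix.loc[user_id, rated_anime.index]
    # Similarity-weighted average. Similarities can be negative, so the denominator
    # can be close to zero and the result can fall outside the rating scale.
    return np.dot(rated_anime, user_ratings) / np.sum(rated_anime)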
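
Because `predict_rating` divides by the plain sum of similarities, positive and negative weights can nearly cancel and inflate the result. One common way to keep predictions inside the range of the user's ratings is to use only the top-k positively similar neighbors. The sketch below illustrates this with toy numbers; the function name `predict_from_neighbors` and the parameter `k` are assumptions, and this is not the notebook's method.

```python
import numpy as np
import pandas as pd


def predict_from_neighbors(sims: pd.Series, user_ratings: pd.Series, k: int = 20) -> float:
    """Weighted average of a user's ratings over the k most similar rated anime.

    sims: similarity of the target anime to other anime (index = anime_id).
    user_ratings: raw ratings of one user (index = anime_id, NaN if unrated).
    """
    rated = user_ratings.dropna()
    neighbors = sims.loc[sims.index.intersection(rated.index)]
    neighbors = neighbors[neighbors > 0].nlargest(k)  # positive weights only
    if neighbors.empty:
        return float("nan")  # nothing to base a prediction on
    weights = neighbors.to_numpy()
    values = rated.loc[neighbors.index].to_numpy()
    return float(np.dot(weights, values) / weights.sum())


# Toy case where the positive and negative similarities almost cancel
sims = pd.Series({10: 0.9, 11: 0.1, 12: -0.95, 13: 0.5})
user = pd.Series({10: 5.0, 11: 1.0, 12: 4.0, 13: np.nan})

rated_sims = sims.loc[[10, 11, 12]]
naive = float(np.dot(rated_sims, user.loc[[10, 11, 12]]) / rated_sims.sum())
safe = predict_from_neighbors(sims, user)
```

With these numbers the plain-sum formula gives about 16, far above any rating the user gave, while the positive-only version gives 4.6.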
    

In [127]:
# Step-by-step check of predict_rating for user 1 and anime 17
sim_values = get_similar_anime(17)
similar_anime_ids = sim_values.index
rated_anime = sim_values[rating_matrix_center[1].loc[similar_anime_ids] != 0]

np.dot(rated_anime, rating_matrix.transpose()[1].loc[rated_anime.index]) / np.sum(rated_anime)

6.732284

In [126]:
predict_rating(1, 17)

6.732284

In [37]:
rating_matrix.transpose()

anime_id \ user_id,0,1,2,3,4,5,6,7,8,9,...,79291,79292,79293,79294,79295,79296,79297,79298,79299,79300
2,,,,,,,,,4.0,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,4.0,,...,,,,,,,,,,
6,,0.5,,,,,4.5,,5.0,1.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17338,,,,,,,,,,,...,,,,,,,,,,
17339,,,,,,,,,,,...,,,,,,,,,,
17341,,,,,,,,,,,...,,,,,,,,,,
17343,,,,,,,,,,,...,,,,,,,,,,


# Get Recommendation

In [144]:
def get_recommendations(user_id):
    # Predict a rating for every anime, then rank from highest to lowest
    ratings = [predict_rating(user_id, anime_id) for anime_id in sim_matrix.index]
    recommendation = pd.DataFrame({"anime_id": sim_matrix.index, "rating": ratings})

    return recommendation.sort_values(by="rating", ascending=False)

In [145]:
get_recommendations(1)[:20]

  return np.dot(rated_anime, rating_matrix.transpose()[user_id].loc[rated_anime.index]) / np.sum(rated_anime)


index,anime_id,rating
6273,6584,1549.206421
12,15,1151.392944
3369,3519,300.622742
2301,2410,197.896973
3242,3383,188.113022
14996,15759,184.816727
9926,10446,184.282166
5378,5657,173.749573
3988,4185,150.057022
564,638,114.760612
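
The ranking above covers every anime, including ones the user has already rated, and the top scores are far above 5. A possible post-processing step is sketched below: it keeps only unrated anime and drops non-finite predictions. The helper `top_n_unseen` and its arguments are hypothetical, and it does not fix the out-of-range scores themselves.

```python
import numpy as np
import pandas as pd


def top_n_unseen(predicted: pd.Series, user_ratings: pd.Series, n: int = 10) -> pd.Series:
    """Return the n highest predictions among anime the user has not rated.

    predicted: predicted rating per anime (index = anime_id).
    user_ratings: raw ratings of the user (index = anime_id, NaN if unrated).
    """
    # Anime missing from user_ratings count as unrated after reindexing.
    unseen = user_ratings.reindex(predicted.index).isna()
    candidates = predicted[unseen].replace([np.inf, -np.inf], np.nan).dropna()
    return candidates.nlargest(n)


# Toy input: anime 10 is already rated, 11 and 13 have unusable predictions
predicted = pd.Series({10: 4.0, 11: np.inf, 12: 3.0, 13: np.nan, 14: 4.5})
user_ratings = pd.Series({10: 5.0, 11: np.nan})
top = top_n_unseen(predicted, user_ratings, n=2)
```

Here only anime 14 and 12 remain, in that order.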


# Please ignore this, obsolete

In [None]:
from google.colab import drive
drive.mount('/content/drive')

folder_path = "/content/drive/MyDrive/Colab Notebooks/Data Mining Project"
anime = folder_path + "/anime.csv.zip"
animelist = folder_path + "/animelist.csv.zip"
anime_recommendations = folder_path + "/anime_recommendations.csv.zip"
rating_complete = folder_path + "/rating_complete.csv.zip"
watching_status = folder_path + "/watching_status.csv"

We need to determine whether a rating of 0 is a legitimate rating or simply a missing value. We don't want to predict a rating for a user who legitimately gave a 0 rating. For example, rating is 0 but  

In [None]:
zero_rating = animelist_df2[animelist_df2["rating"] == 0]
zero_rating.groupby("watching_status").agg(["count"])

In [None]:
# Same breakdown for the lowest non-zero rating (0.5)
half_rating = animelist_df2[animelist_df2["rating"] == 0.5]
half_rating.groupby("watching_status").agg(["count"])

In [None]:
rating_matrix.fillna(0, inplace=True)

In [None]:
watching_status

In [None]:
animelist_df.groupby(['user_id', 'anime_id'])['rating'].max().unstack()