This cell imports pandas and pyarrow (the engine pandas uses to read and write Parquet), then loads 'anime.csv' and 'rating.csv' into pandas DataFrames. Both DataFrames are also saved as Parquet files so that later runs can load them faster than re-parsing the CSVs.

In [30]:
import pandas as pd
import pyarrow

anime = pd.read_csv("anime.csv")
rating = pd.read_csv("rating.csv")

anime.to_parquet('anime.parquet')
rating.to_parquet('rating.parquet')



This cell checks the shape (number of rows and columns) of the 'rating' DataFrame.

In [31]:
rating.shape

(7813737, 3)

This cell checks for null values in the 'anime' DataFrame. It calculates the number of null values in each column.

In [32]:
anime.isnull().sum()

Unnamed: 0      0
anime_id        0
name            0
genre          62
type           25
episodes        0
rating        230
members         0
dtype: int64

This cell checks for null values in the 'rating' DataFrame. It calculates the number of null values in each column.

In [33]:
rating.isnull().sum()

user_id     0
anime_id    0
rating      0
dtype: int64

This cell checks for duplicate rows in the 'anime' DataFrame. It prints the total number of duplicate rows found.

In [34]:
anime.duplicated().sum()

0

This cell checks for duplicate rows in the 'rating' DataFrame. It prints the total number of duplicate rows found.

In [35]:
rating.duplicated().sum()

1
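
The count above reports one fully duplicated row in 'rating'. As a minimal sketch of how such rows could be dropped before merging, here is the same check on a small made-up frame (the toy values are assumptions, not rows from rating.csv):

```python
import pandas as pd

# Toy stand-in for the 'rating' DataFrame; the values are invented.
toy_rating = pd.DataFrame({
    "user_id":  [1, 1, 2, 2],
    "anime_id": [20, 20, 20, 5],
    "rating":   [8, 8, -1, 7],
})

# duplicated() flags rows that repeat an earlier row in every column
n_dupes = int(toy_rating.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each repeated row
deduped = toy_rating.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```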

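This cell merges the 'rating' and 'anime' DataFrames on the 'anime_id' column, so that each row holds one user's rating together with the details of the rated anime. Because both DataFrames have a 'rating' column, pandas renames them 'rating_x' (the user's rating) and 'rating_y' (the anime's average rating). The merge is an inner join by default, so any rating whose 'anime_id' has no match in 'anime' is dropped.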

In [36]:
merged = rating.merge(anime,on='anime_id')
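
A small sketch of what this merge does to overlapping column names and unmatched keys, using invented toy frames (only the Naruto row mirrors values shown in this notebook; everything else is an assumption for illustration):

```python
import pandas as pd

# Toy frames: both have a 'rating' column, like the real ones.
toy_rating = pd.DataFrame({
    "user_id":  [1, 2, 3],
    "anime_id": [20, 20, 99],   # anime_id 99 has no match below
    "rating":   [8, -1, 5],
})
toy_anime = pd.DataFrame({
    "anime_id": [20, 5],
    "name":     ["Naruto", "Title B"],
    "rating":   [7.81, 8.0],
})

# how="inner" is the default: only anime_id values present in both frames survive
toy_merged = toy_rating.merge(toy_anime, on="anime_id")

print(toy_merged.columns.tolist())
print(len(toy_merged))
```

The clashing 'rating' columns come back as 'rating_x' (left frame) and 'rating_y' (right frame), and the row for the unmatched anime_id is gone.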

This cell prints the merged DataFrame to allow for a visual inspection of its structure and contents.

In [37]:
merged

Unnamed: 0.1,user_id,anime_id,rating_x,Unnamed: 0,name,genre,type,episodes,rating_y,members
0,1,20,-1,841,Naruto,"Action, Comedy, Martial Arts, Shounen, Super P...",TV,220,7.81,683297
1,3,20,8,841,Naruto,"Action, Comedy, Martial Arts, Shounen, Super P...",TV,220,7.81,683297
2,5,20,6,841,Naruto,"Action, Comedy, Martial Arts, Shounen, Super P...",TV,220,7.81,683297
3,6,20,-1,841,Naruto,"Action, Comedy, Martial Arts, Shounen, Super P...",TV,220,7.81,683297
4,10,20,-1,841,Naruto,"Action, Comedy, Martial Arts, Shounen, Super P...",TV,220,7.81,683297
...,...,...,...,...,...,...,...,...,...,...
7813722,65682,30450,8,8493,Dr. Slump: Hoyoyo! Arale no Himitsu Dai Koukai...,"Comedy, Sci-Fi, Shounen",Special,1,6.17,248
7813723,69497,33484,10,10256,Shiroi Zou,"Action, Historical, Kids",Movie,1,4.71,45
7813724,70463,29481,-1,9097,Kakinoki Mokkii,"Fantasy, Kids",Special,1,4.33,61
7813725,72404,34412,-1,8777,Hashiri Hajimeta bakari no Kimi ni,Music,Music,1,6.76,239


This cell counts the number of ratings for each anime and stores this in a new DataFrame 'numr'. The column 'rating_x' is renamed to 'num_ratings'. This operation helps to understand the distribution of ratings among different animes.

In [38]:
numr = merged.groupby('anime_id').count()['rating_x'].reset_index()
numr.rename(columns={'rating_x':'num_ratings'},inplace=True)
numr

Unnamed: 0,anime_id,num_ratings
0,1,15509
1,5,6927
2,6,11077
3,7,2629
4,8,413
...,...,...
11192,34367,5
11193,34412,1
11194,34475,4
11195,34476,1
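
For reference, the same per-anime count can be written as a single chain. This is a sketch on an invented toy frame, not a replacement for the cell above; note that count() tallies every non-null entry, so a value of -1 is counted like any other rating.

```python
import pandas as pd

# Toy stand-in for 'merged'; the values are invented.
toy_merged = pd.DataFrame({
    "anime_id": [1, 1, 5, 1, 5, 6],
    "rating_x": [8, -1, 7, 9, 6, 10],
})

num_ratings = (
    toy_merged.groupby("anime_id")["rating_x"]
    .count()                 # non-null entries per anime
    .rename("num_ratings")   # name the resulting Series
    .reset_index()           # turn anime_id back into a column
)

print(num_ratings)
```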


This cell displays the first 5000 rows of the 'numr' DataFrame. This allows for a visual inspection of its structure and contents.

In [39]:
numr.head(5000)

Unnamed: 0,anime_id,num_ratings
0,1,15509
1,5,6927
2,6,11077
3,7,2629
4,8,413
...,...,...
4995,7430,73
4996,7435,13
4997,7436,14
4998,7445,9


This cell creates a boolean Series 'x' which is True for users who have rated more than 150 animes and False for others. This operation is done to ensure that we only consider users who have provided a significant amount of ratings.

In [40]:
x = merged.groupby('user_id').count()['rating_x'] > 150

This cell builds the user-item matrix 'lfinal'. It first keeps only the users flagged in 'x' (more than 150 ratings), then keeps only the anime that received at least 150 ratings from those users, and stores the result in 'final'. 'final' is then pivoted with 'anime_id' as the index, 'user_id' as the columns, and 'rating_x' as the values, and missing entries are filled with 0. The resulting matrix is the input to the recommendation system.

In [41]:
cultured = x[x].index

filtered = merged[merged['user_id'].isin(cultured)]

y = filtered.groupby('anime_id').count()['rating_x']>=150
eanime = y[y].index

final = filtered[filtered['anime_id'].isin(eanime)]

lfinal = final.pivot_table(index='anime_id',columns='user_id',values='rating_x')

lfinal.fillna(0,inplace=True)
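
The two filtering steps and the pivot can be followed more easily on a tiny example. The sketch below uses invented data and thresholds of 1 and 2 in place of 150, purely so the effect is visible on six rows:

```python
import pandas as pd

# Toy stand-in for 'merged'; the values are invented.
toy = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 3],
    "anime_id": [10, 20, 30, 10, 20, 10],
    "rating_x": [8, 6, 9, 7, 5, 10],
})

# Step 1: keep users with more than 1 rating (the notebook uses > 150)
user_counts = toy.groupby("user_id")["rating_x"].count()
active_users = user_counts[user_counts > 1].index
filtered = toy[toy["user_id"].isin(active_users)]

# Step 2: keep anime with at least 2 ratings from those users (the notebook uses >= 150)
anime_counts = filtered.groupby("anime_id")["rating_x"].count()
popular_anime = anime_counts[anime_counts >= 2].index
final = filtered[filtered["anime_id"].isin(popular_anime)]

# Step 3: one row per anime, one column per user, 0 where there is no rating
matrix = final.pivot_table(index="anime_id", columns="user_id", values="rating_x").fillna(0)

print(matrix)
```

User 3 is removed in step 1 and anime 30 in step 2, leaving a 2 x 2 matrix.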

This cell saves the 'lfinal' DataFrame as a Parquet file for future use. Parquet is a columnar storage file format that is optimized for speed and efficiency.

In [42]:
lfinal.to_parquet('lfinal.parquet')



This cell calculates the cosine similarity between all pairs of animes based on their ratings. The cosine similarity is a measure of similarity between two non-zero vectors, and it is calculated as the cosine of the angle between them. This operation results in a similarity matrix where each cell represents the similarity between two animes.

In [43]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_scores = cosine_similarity(lfinal)

similarity_scores.shape

(3691, 3691)
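
To make the definition concrete, here is a minimal NumPy sketch of pairwise cosine similarity between the rows of a matrix. It is an illustration of the formula only; the notebook itself uses scikit-learn's cosine_similarity, and the helper name and toy values below are assumptions.

```python
import numpy as np

def cosine_similarity_matrix(m):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    m = np.asarray(m, dtype=float)
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0   # leave all-zero rows as zeros instead of dividing by 0
    unit = m / norms          # scale every row to unit length
    return unit @ unit.T      # dot products of unit vectors = cosines

# Three toy "anime" rated by two toy "users"
ratings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 0.0],
])
sims = cosine_similarity_matrix(ratings)
print(sims)
```

Rows 0 and 2 point in the same direction, so their similarity is 1 even though the magnitudes differ; rows 0 and 1 are orthogonal, so their similarity is 0.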

This cell converts the similarity matrix into a DataFrame and saves it as a Parquet file. Despite the file name 'user_similarity_scores.parquet', the matrix holds anime-to-anime similarities computed from user ratings. The DataFrame is then converted back into a NumPy array, which is more convenient for the positional lookups that follow.

In [44]:
import numpy as np
similarity_scores
similarity_scores = pd.DataFrame(similarity_scores)
similarity_scores.to_parquet('user_similarity_scores.parquet')
similarity_scores = similarity_scores.values


This cell prints the index of the 'lfinal' DataFrame, which contains the 'anime_id' values. This operation is done to inspect the unique identifiers of the animes in the user-item matrix.

In [45]:

lfinal.index

Index([    1,     5,     6,     7,     8,    15,    16,    17,    18,    19,
       ...
       33201, 33222, 33338, 33421, 33524, 33558, 33569, 33964, 34103, 34240],
      dtype='int64', name='anime_id', length=3691)

This cell defines a function 'get_recommendation' that takes an 'anime_id' and returns a list of recommended anime IDs. It looks up the anime's row in the similarity matrix, sorts all anime by similarity in descending order, skips the first entry (the anime itself), and returns the next 27.

In [46]:
def get_recommendation(anime_id):
    # Row position of the requested anime in the similarity matrix
    index = np.where(lfinal.index == anime_id)[0][0]
    # Rank all anime by similarity, highest first; the first entry is normally the anime itself
    similar_items = sorted(enumerate(similarity_scores[index]), key=lambda x: x[1], reverse=True)[1:28]
    # Map matrix positions back to anime IDs
    return [lfinal.index[i] for i, _ in similar_items]
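
The lookup described above (find the row, rank by similarity, drop the query, map positions back to IDs) can also be sketched with np.argsort. The function name, labels, and similarity values below are invented for illustration:

```python
import numpy as np

def top_k_similar(sim_matrix, labels, query_label, k=3):
    """Return the k labels most similar to query_label, excluding itself."""
    labels = np.asarray(labels)
    row = int(np.where(labels == query_label)[0][0])      # position of the query
    order = np.argsort(-sim_matrix[row], kind="stable")   # highest similarity first
    order = order[order != row][:k]                       # drop the query by position
    return labels[order].tolist()

labels = [1, 5, 20, 30]
sim = np.array([
    [1.0, 0.2, 0.9, 0.1],
    [0.2, 1.0, 0.4, 0.3],
    [0.9, 0.4, 1.0, 0.5],
    [0.1, 0.3, 0.5, 1.0],
])

print(top_k_similar(sim, labels, 20, k=2))  # → [1, 30]
```

Removing the query by its position, rather than by slicing off the first sorted entry, keeps the result correct even if another item ties with it at similarity 1.0.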

This cell tests the 'get_recommendation' function with 'anime_id' 20. This operation is done to verify that the function works correctly and provides sensible recommendations.

In [47]:
get_recommendation(20)

[1535,
 1575,
 11757,
 269,
 121,
 5114,
 16498,
 9919,
 2904,
 8074,
 226,
 3588,
 6547,
 442,
 813,
 4224,
 2472,
 6702,
 10620,
 936,
 6880,
 4437,
 223,
 2167,
 356,
 6746,
 2144]

This cell tests the 'get_recommendation' function again, this time with 'anime_id' 1735. The call fails because this anime did not meet the rating-count criteria above and was filtered out of 'lfinal'. For anime like this, an alternative recommendation method is needed.

In [48]:
get_recommendation(1735)

IndexError: index 0 is out of bounds for axis 0 with size 0
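
The error comes from indexing into an empty result: np.where finds no position for an ID that is not in the index. A minimal sketch of the failure and of one way to detect it up front, using an invented three-element index and a hypothetical helper name:

```python
import numpy as np
import pandas as pd

toy_index = pd.Index([1, 5, 20], name="anime_id")

# np.where returns an empty array for a missing ID, so [0][0] would raise IndexError
hits = np.where(toy_index == 1735)[0]
print(hits.size)

def safe_position(index, anime_id):
    """Position of anime_id in index, or None if it is absent."""
    pos = index.get_indexer([anime_id])[0]   # get_indexer marks missing labels with -1
    return None if pos == -1 else int(pos)

print(safe_position(toy_index, 20))
print(safe_position(toy_index, 1735))
```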

This cell encodes the 'genre' column of the 'anime' DataFrame as one indicator column per genre. Because an anime can have several genres, a row can contain several 1s (a multi-label form of one-hot encoding). Missing genres are turned into the string 'nan' by str(x), which is why the result has a 'nan' column. The encoded DataFrame is then concatenated with the 'anime' DataFrame to prepare the data for genre-based recommendations.

In [None]:

anime['genre'] = anime['genre'].apply(lambda x: str(x).split(", "))
genre_encoded_df = anime['genre'].str.join('|').str.get_dummies()

anime_encoded = pd.concat([anime, genre_encoded_df], axis=1)
anime_encoded


Unnamed: 0.1,Unnamed: 0,anime_id,name,genre,type,episodes,rating,members,Action,Adventure,...,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri,nan
0,0,32281,Kimi no Na wa.,"[Drama, Romance, School, Supernatural]",Movie,1,9.37,200630,0,0,...,0,0,0,0,1,0,0,0,0,0
1,1,5114,Fullmetal Alchemist: Brotherhood,"[Action, Adventure, Drama, Fantasy, Magic, Mil...",TV,64,9.26,793665,1,1,...,0,0,0,0,0,0,0,0,0,0
2,2,28977,Gintama°,"[Action, Comedy, Historical, Parody, Samurai, ...",TV,51,9.25,114262,1,0,...,0,0,0,0,0,0,0,0,0,0
3,3,9253,Steins;Gate,"[Sci-Fi, Thriller]",TV,24,9.17,673572,0,0,...,0,0,0,0,0,1,0,0,0,0
4,4,9969,Gintama&#039;,"[Action, Comedy, Historical, Parody, Samurai, ...",TV,51,9.16,151266,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12289,12289,9316,Toushindai My Lover: Minami tai Mecha-Minami,[Hentai],OVA,1,4.15,211,0,0,...,0,0,0,0,0,0,0,0,0,0
12290,12290,5543,Under World,[Hentai],OVA,1,4.28,183,0,0,...,0,0,0,0,0,0,0,0,0,0
12291,12291,5621,Violence Gekiga David no Hoshi,[Hentai],OVA,4,4.88,219,0,0,...,0,0,0,0,0,0,0,0,0,0
12292,12292,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,[Hentai],OVA,1,4.98,175,0,0,...,0,0,0,0,0,0,0,0,0,0
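
As a point of comparison, pandas can build the same kind of indicator columns directly from the comma-separated strings. The sketch below runs on an invented three-row frame; unlike the cell above, it leaves a missing genre as an all-zero row instead of creating a 'nan' column.

```python
import pandas as pd

# Toy stand-in for 'anime'; the values are invented.
toy_anime = pd.DataFrame({
    "anime_id": [1, 2, 3],
    "genre": ["Action, Comedy", "Comedy", None],
})

# Split on ", " and create one 0/1 column per distinct genre
genre_flags = toy_anime["genre"].str.get_dummies(sep=", ")

print(genre_flags)
```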


This cell calculates the cosine similarity using the one-hot encoded genre data and saves it as a Parquet file. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. In this case, it measures the similarity between animes based on their genres.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Calculate the cosine similarity using the one-hot encoded genre data
genre_similarity_scores = cosine_similarity(genre_encoded_df)
genre_similarity_scores_df = pd.DataFrame(genre_similarity_scores, index=anime['anime_id'], columns=anime['anime_id'])
genre_similarity_scores_df.to_parquet('genre_similarity_scores_df.parquet')
genre_similarity_scores_df


anime_id,32281,5114,28977,9253,9969,32935,11061,820,15335,15417,...,26031,34399,10368,9352,5541,9316,5543,5621,6133,26081
32281,1.000000,0.188982,0.000000,0.000000,0.000000,0.447214,0.000000,0.250000,0.000000,0.000000,...,0.0,0.288675,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5114,0.188982,1.000000,0.285714,0.000000,0.285714,0.338062,0.566947,0.377964,0.285714,0.285714,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
28977,0.000000,0.285714,1.000000,0.267261,1.000000,0.338062,0.377964,0.188982,1.000000,1.000000,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9253,0.000000,0.000000,0.267261,1.000000,0.267261,0.000000,0.000000,0.353553,0.267261,0.267261,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9969,0.000000,0.285714,1.000000,0.267261,1.000000,0.338062,0.377964,0.188982,1.000000,1.000000,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9316,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1.0,0.577350,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
5543,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1.0,0.577350,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
5621,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1.0,0.577350,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
6133,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1.0,0.577350,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


This cell defines a function 'genre_based_recommendation' that takes an 'anime_id' and returns the 27 anime with the highest genre similarity to it, as a Series of similarity scores indexed by 'anime_id'.

In [None]:
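def genre_based_recommendation(anime_id):
    # Genre similarity of every anime to the requested one, with the anime itself removed by label
    # (many titles tie at 1.0, so slicing off the first sorted entry is not reliable)
    similarity_scores = genre_similarity_scores_df[anime_id].drop(anime_id)

    recommended_animes = similarity_scores.sort_values(ascending=False)

    return recommended_animes.head(27)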


This cell tests the 'genre_based_recommendation' function with 'anime_id' 1735. This operation is done to verify that the function works correctly and provides sensible recommendations.

In [None]:
genre_based_recommendation(1735)

anime_id
20       1.000000
28755    1.000000
6325     1.000000
10659    1.000000
8246     1.000000
32365    1.000000
10075    1.000000
19511    1.000000
1604     0.894427
23933    0.894427
1009     0.894427
30694    0.845154
987      0.845154
14527    0.845154
904      0.845154
22695    0.845154
174      0.845154
25389    0.845154
11761    0.845154
6714     0.845154
22777    0.845154
813      0.845154
6033     0.845154
11787    0.800000
24029    0.800000
269      0.800000
2335     0.800000
Name: 1735, dtype: float64

This cell loads the Parquet files created earlier into pandas DataFrames and defines three functions: 'genre_based_recommendation', 'user_based_recommendation', and 'get_recommendation'. 'get_recommendation' uses the rating-based similarity when the anime is present in 'lfinal' and falls back to genre similarity otherwise. Note that 'lfinal' is not reloaded here, so this cell relies on it still being in memory from the earlier cells.

In [57]:
import numpy as np
import pandas as pd
# from sklearn.metrics.pairwise import cosine_similarity

anime = pd.read_parquet("anime.parquet")
similarity_scores = pd.read_parquet('user_similarity_scores.parquet')
genre_similarity_scores_df = pd.read_parquet('genre_similarity_scores_df.parquet')


def genre_based_recommendation(anime_id):
    # Remove the queried anime by label; ties at 1.0 make positional slicing unreliable
    similarity_scores = genre_similarity_scores_df[anime_id].drop(anime_id)
    recommended_animes = similarity_scores.sort_values(ascending=False)
    return recommended_animes.head(27).index.tolist()

def user_based_recommendation(anime_id):
    # Row position of the requested anime; relies on 'lfinal' from the earlier cells
    index = np.where(lfinal.index == anime_id)[0][0]
    # Select the row by position so the lookup does not depend on the column labels
    scores = similarity_scores.iloc[index].to_numpy()
    # Rank by similarity, highest first, and skip the first entry (normally the anime itself)
    similar_items = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)[1:28]
    return [lfinal.index[i] for i, _ in similar_items]


def get_recommendation(anime_id):
    if anime_id in lfinal.index:
        print('user')
        return user_based_recommendation(anime_id)
    else:
        print('genre')
        return genre_based_recommendation(anime_id)

print(get_recommendation(10793))


user
[11757, 10620, 6547, 9919, 16498, 8074, 19815, 11617, 14345, 11111, 15809, 11759, 6880, 9253, 22319, 20507, 1575, 10719, 9041, 15583, 20787, 21881, 2904, 4224, 13759, 22199, 8841]
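
The fallback logic can be isolated from the similarity matrices. The following sketch uses plain dictionaries as invented stand-ins for the two lookups; the function name and the table contents are assumptions for illustration only.

```python
def recommend(anime_id, collaborative, genre):
    """Prefer the rating-based list when one exists, else use the genre-based list."""
    if anime_id in collaborative:
        return "user", collaborative[anime_id]
    # Fall back to genre similarity; an unknown ID yields an empty list
    return "genre", genre.get(anime_id, [])

# Invented lookup tables standing in for the two similarity matrices
collaborative = {20: [1535, 1575]}
genre = {20: [1735], 1735: [20, 28755]}

print(recommend(20, collaborative, genre))
print(recommend(1735, collaborative, genre))
```

Returning the source label alongside the list, instead of printing it, lets the caller decide how to report which method was used.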


This cell redefines 'get_recommendation' so that it also requests additional information (synopsis, picture, mean score, genres, media type) about each recommended anime from the MyAnimeList API. The function also prints whether the recommendations are based on user or genre similarity.

In [None]:
import numpy as np
import pandas as pd
import requests
from sklearn.metrics.pairwise import cosine_similarity

anime = pd.read_parquet("anime.parquet")
user_similarity_scores_df = pd.read_parquet('user_similarity_scores.parquet')
genre_similarity_scores_df = pd.read_parquet('genre_similarity_scores_df.parquet')


def genre_based_recommendation(anime_id):
    # Sort once and reuse the result; the queried anime is not skipped in this version
    ranked = genre_similarity_scores_df[anime_id].sort_values(ascending=False)
    print(ranked)
    return ranked.head(27).index.tolist()

def user_based_recommendation(anime_id):
    # Row position of the requested anime; relies on 'lfinal' from the earlier cells
    index = np.where(lfinal.index == anime_id)[0][0]
    # Read the row from the DataFrame loaded in this cell, by position
    scores = user_similarity_scores_df.iloc[index].to_numpy()
    similar_items = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)[1:28]
    # Map matrix positions back to anime IDs
    return [lfinal.index[i] for i, _ in similar_items]

def get_anime_info(anime_id, client_id='Insert MAL Client ID HERE'):
    # Fetch details for one anime from the MyAnimeList API v2; returns None on failure
    base_url = f"https://api.myanimelist.net/v2/anime/{anime_id}"
    params = {
        "fields": "synopsis,main_picture,mean,genres,media_type",
    }
    headers = {
        "X-MAL-CLIENT-ID": client_id
    }
    try:
        # A timeout keeps one slow request from stalling the whole loop
        response = requests.get(base_url, params=params, headers=headers, timeout=10)
        if response.status_code == 200:
            data = response.json()
            if data:
                return data
            print(f"No results found for anime ID '{anime_id}'.")
        else:
            print(f"Failed to fetch data from MyAnimeList API. Status code: {response.status_code}")
            print(response.text)
    except requests.RequestException as e:
        print(f"An error occurred: {e}")
    return None




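def get_recommendation(anime_id):
    # Use rating-based similarity only for anime that survived the filtering into 'lfinal'
    if anime_id in lfinal.index:
        recommend_ids = user_based_recommendation(anime_id)
        print('user')
    else:
        recommend_ids = genre_based_recommendation(anime_id)
        print('genre')
    recommend_data = []
    for rec_id in recommend_ids:
        info = get_anime_info(rec_id)
        # get_anime_info returns None when a lookup fails (e.g. a 404), so skip those
        if info is not None:
            recommend_data.append(info)
    return recommend_data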


print(get_recommendation(20))

20      1.000000
18      0.790104
19      0.703293
1358    0.673747
1549    0.651244
          ...   
3320    0.025401
3185    0.024788
2003    0.024366
2234    0.023487
2689    0.018133
Name: 20, Length: 3691, dtype: float64
user
Failed to fetch data from MyAnimeList API. Status code: 404
{"message":"","error":"not_found"}
[{'id': 20, 'title': 'Naruto', 'main_picture': {'medium': 'https://cdn.myanimelist.net/images/anime/13/17405.jpg', 'large': 'https://cdn.myanimelist.net/images/anime/13/17405l.jpg'}, 'synopsis': "Moments prior to Naruto Uzumaki's birth, a huge demon known as the Kyuubi, the Nine-Tailed Fox, attacked Konohagakure, the Hidden Leaf Village, and wreaked havoc. In order to put an end to the Kyuubi's rampage, the leader of the village, the Fourth Hokage, sacrificed his life and sealed the monstrous beast inside the newborn Naruto.\n\nNow, Naruto is a hyperactive and knuckle-headed ninja still living in Konohagakure. Shunned because of the Kyuubi inside him, Naruto struggl