<a href="https://colab.research.google.com/github/vsancnaj/anime_recomender/blob/main/Anime_Recomender_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📺 Anime Recommender System Overview

## Introduction
The dataset records the preferences of 73,516 users across 12,294 anime. Each user can add anime to their completed list and rate them; this dataset is a compilation of those ratings, obtained by scraping myanimelist.net through its API.

## Goal
Build a better anime recommendation system based only on similar anime.

## Skills
* Data preprocessing
* Exploratory data analysis
* Similarity measurements, particularly `cosine similarity`, which measures the similarity between two vectors. In this case, we are interested in vectors of ratings.
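
To make the idea concrete, here is a minimal sketch of cosine similarity computed by hand with NumPy. The helper `cosine_sim` and the three rating vectors are hypothetical illustrations, not part of the dataset; the notebook itself relies on scikit-learn's `cosine_similarity` later on.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical mean-centered ratings of three anime by the same four users
anime_a = [1.0, 0.5, -1.0, 0.0]
anime_b = [2.0, 1.0, -2.0, 0.0]    # same direction as anime_a, twice the magnitude
anime_c = [-1.0, -0.5, 1.0, 0.0]   # opposite direction to anime_a

sim_ab = cosine_sim(anime_a, anime_b)  # close to  1: users rate them alike
sim_ac = cosine_sim(anime_a, anime_c)  # close to -1: users rate them oppositely
print(sim_ab, sim_ac)
```

Because only the direction of the vectors matters, `anime_a` and `anime_b` count as a near-perfect match even though their magnitudes differ.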





# ⬇️ Downloading the dataset from Kaggle

In [1]:
!pip install kaggle
from google.colab import drive
drive.mount('/content/drive')

def output_div():
  '''Visual divider for code.
  It is meant to help visualize output better'''
  print('\n')
  print('-'*50,'\n\n')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# google colab widget
!pip install ipywidgets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
# Path to the Kaggle.json file in your Google Drive folder
kaggle_json_path = '/content/drive/MyDrive/Springboard/kaggle.json'

# Copy the Kaggle.json file to the appropriate location and set the required permissions
!mkdir -p ~/.kaggle
!cp '{kaggle_json_path}' ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json

In [4]:
!kaggle datasets download -d CooperUnion/anime-recommendations-database

anime-recommendations-database.zip: Skipping, found more recently modified local copy (use --force to force download)


In [5]:
# Unzipping the folder
!unzip anime-recommendations-database.zip

Archive:  anime-recommendations-database.zip
replace anime.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: anime.csv               
replace rating.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: rating.csv              


# 🔎 Exploring the dataset

In [6]:
import os  # file paths
import numpy as np  # linear algebra
import pandas as pd  # data processing
import warnings  # warning filter
import scipy as sp  # sparse matrices for the pivot table
import scipy.sparse  # ensure the sp.sparse submodule is loaded
import matplotlib.pyplot as plt  # plotting data
from sklearn.impute import SimpleImputer  # fill in missing values
#from sklearn.preprocessing import MinMaxScaler # normalization

In [7]:
#ML model
from sklearn.metrics.pairwise import cosine_similarity

#default theme and settings
pd.options.display.max_columns

# warning handling: suppress all warnings
warnings.filterwarnings("ignore")

In [8]:
# Rating and Anime paths
rating_path = '/content/rating.csv'
anime_path = '/content/anime.csv'

In [9]:
# Rating csv to dataframe
rating_df = pd.read_csv(rating_path)
rating_df.head()

Unnamed: 0,user_id,anime_id,rating
0,1,20,-1
1,1,24,-1
2,1,79,-1
3,1,226,-1
4,1,241,-1


In [10]:
# Anime csv to dataframe
anime_df = pd.read_csv(anime_path)
anime_df.head() 

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [11]:
# Data shapes
print(f'Anime shape (row, col): {anime_df.shape}\n'
      f'Rating shape (row, col): {rating_df.shape}')

Anime shape (row, col): (12294, 7)
Rating shape (row, col): (7813737, 3)


In [12]:
# Data info
print('Anime Info:')
anime_df.info()

output_div()

print('Rating Info:')
rating_df.info()



Anime Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


-------------------------------------------------- 


Rating Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7813737 entries, 0 to 7813736
Data columns (total 3 columns):
 #   Column    Dtype
---  ------    -----
 0   user_id   int64
 1   anime_id  int64
 2   rating    int64
dtypes: int64(3)
memory usage: 178.8 MB


# 📊 Preprocessing and Data Analysis

In [13]:
# handling missing values
print('Anime missing values')
anime_na_prop = anime_df.isnull().mean()*100
missing_anime = pd.concat([anime_df.isnull().sum(), anime_na_prop], axis=1)
missing_anime.columns=['count', '%']
print(missing_anime)

output_div()

print('Rating missing values')
rating_na_prop = rating_df.isnull().mean()*100
missing_rating = pd.concat([rating_df.isnull().sum(), rating_na_prop], axis=1)
missing_rating.columns=['count', '%']
print(missing_rating)

Anime missing values
          count         %
anime_id      0  0.000000
name          0  0.000000
genre        62  0.504311
type         25  0.203351
episodes      0  0.000000
rating      230  1.870831
members       0  0.000000


-------------------------------------------------- 


Rating missing values
          count    %
user_id       0  0.0
anime_id      0  0.0
rating        0  0.0


Computing the mode of the `type` and `genre` columns shows the most common anime types and genres in the dataset. This helps us understand how types and genres are distributed and gives a sensible fill value for the missing entries.

## Anime Data Wrangling

In [14]:
print(anime_df['type'].mode())
print(anime_df['genre'].mode())

0    TV
Name: type, dtype: object
0    Hentai
Name: genre, dtype: object


In [15]:
# dropping anime with a missing (NaN) rating
anime_df=anime_df[~np.isnan(anime_df["rating"])]

In [16]:
# Instantiate a SimpleImputer that fills missing values with the mode ('most_frequent')
imputer_mode = SimpleImputer(strategy='most_frequent')

# Fit the imputer on the genre column
imputer_mode.fit(anime_df[['genre']])

# Transform the genre column
anime_df['genre'] = imputer_mode.transform(anime_df[['genre']])

print(anime_df.isnull().sum())

anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64


In [17]:
anime_df['type'].unique()

array(['Movie', 'TV', 'OVA', 'Special', 'Music', 'ONA'], dtype=object)

In [18]:
# Label every anime with an episode count of 1 as 'Movie'
anime_df.loc[anime_df['episodes'] == '1', 'type'] = 'Movie'

# Fit the imputer on the type column
imputer_mode.fit(anime_df[['type']])

# Transform the type column
anime_df['type'] = imputer_mode.transform(anime_df[['type']])

print(anime_df.isnull().sum())

anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64


Duplicate anime names need to be identified, since they can cause errors in the analysis if they are not handled properly.

In [19]:
# Checking for unique anime names
anime_df['name'].value_counts().head()

Shi Wan Ge Leng Xiaohua    2
Saru Kani Gassen           2
Kimi no Na wa.             1
Within the Bloody Woods    1
Pop                        1
Name: name, dtype: int64

In [20]:
# Concatenate the 'name' and 'episodes' columns and count the occurrences
count_name_ep = (anime_df['name'] + ', ' + anime_df['episodes']).value_counts().head()

# Create a DataFrame with the concatenated string and count columns
result_df = pd.DataFrame({'Anime and Episodes': count_name_ep.index, 'Count': count_name_ep.values})

# Display the result
print(result_df)

    Anime and Episodes  Count
0  Saru Kani Gassen, 1      2
1    Kimi no Na wa., 1      1
2  Nendo no Tatakai, 1      1
3   Vampire Holmes, 12      1
4      No Littering, 1      1


In [21]:
# Concatenate the 'type' and 'name' columns and count the occurrences
count_name_type = (anime_df['type'] + ', ' + anime_df['name']).value_counts().head()

# Create a DataFrame with the concatenated string and count columns
result_df = pd.DataFrame({'Anime Type and Anime Name': count_name_type.index, 'Count': count_name_type.values})

# Display the result
print(result_df)

  Anime Type and Anime Name  Count
0   Movie, Saru Kani Gassen      2
1     Movie, Kimi no Na wa.      1
2   Movie, Nendo no Tatakai      1
3        TV, Vampire Holmes      1
4       Movie, No Littering      1


In [22]:
anime_df[anime_df['name']=='Saru Kani Gassen']

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
10140,22399,Saru Kani Gassen,Kids,Movie,1,5.23,62
10141,30059,Saru Kani Gassen,Drama,Movie,1,4.75,76


In [23]:
anime_df[anime_df['name']=='Shi Wan Ge Leng Xiaohua']

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
10193,33193,Shi Wan Ge Leng Xiaohua,"Comedy, Parody",ONA,12,6.67,114
10194,33195,Shi Wan Ge Leng Xiaohua,"Action, Adventure, Comedy, Fantasy, Parody",Movie,1,7.07,110


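The repeated names belong to distinct entries with different `anime_id` values. For instance, "Shi Wan Ge Leng Xiaohua" has both an ONA and a movie version, while the two "Saru Kani Gassen" entries are both labeled Movie but differ in genre.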
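
One way to list every repeated name alongside its ids in a single step is `Series.duplicated(keep=False)`. The sketch below runs on a made-up miniature frame; `toy_anime` and its values are hypothetical stand-ins for `anime_df`.

```python
import pandas as pd

# Hypothetical miniature of anime_df
toy_anime = pd.DataFrame({
    'anime_id': [1, 2, 3, 4],
    'name': ['Show A', 'Show A', 'Show B', 'Show C'],
    'type': ['TV', 'Movie', 'TV', 'OVA'],
})

# keep=False marks every row whose name occurs more than once,
# not just the second and later occurrences
dupes = toy_anime[toy_anime['name'].duplicated(keep=False)]
print(dupes)
```

Since the duplicated rows keep their own `anime_id`, they can stay in the data as long as later steps key on the id rather than the name.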

In [24]:
# Checking the data types of the anime df
anime_df.dtypes

anime_id      int64
name         object
genre        object
type         object
episodes     object
rating      float64
members       int64
dtype: object

In [25]:
# Replace 'Unknown' with NaN
anime_df['episodes'] = anime_df['episodes'].replace('Unknown', np.nan)
anime_df.isnull().sum()

anime_id      0
name          0
genre         0
type          0
episodes    188
rating        0
members       0
dtype: int64

In [26]:
# Convert the 'episodes' column to numeric, replace missing values
# with the median, and cast the column to the int64 data type.

anime_df['episodes'] = pd.to_numeric(anime_df['episodes'])
anime_df['episodes'] = anime_df['episodes'].fillna(anime_df['episodes'].median()).astype('int64')
anime_df.isnull().sum()

anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64

In [27]:
# Checking the data types of the anime df
anime_df.dtypes

anime_id      int64
name         object
genre        object
type         object
episodes      int64
rating      float64
members       int64
dtype: object

In [28]:
max_episodes = anime_df['episodes'].max()
max_episodes

1818

Converting the 'episodes' column to a numeric data type such as int64 makes the data easier to analyze. For example, we can calculate summary statistics such as the mean and median number of episodes, or create visualizations such as a histogram or boxplot. A numeric column can also be used directly in regression models or other machine learning algorithms.
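
As a small illustration of that conversion, the sketch below cleans a made-up series of episode counts. The values are hypothetical, and `errors='coerce'` is used here as a shortcut that turns any non-numeric entry into NaN in one step.

```python
import pandas as pd

# Hypothetical episode counts stored as strings, as in the raw CSV
episodes = pd.Series(['1', '64', 'Unknown', '24', '51', '12'])

# errors='coerce' turns anything non-numeric ('Unknown') into NaN
episodes_num = pd.to_numeric(episodes, errors='coerce')

# Fill the gap with the median of the known values, then cast to int64
median_eps = episodes_num.median()
episodes_clean = episodes_num.fillna(median_eps).astype('int64')

print(episodes_clean.tolist())
print(episodes_clean.describe())  # summary statistics now work on the column
```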

## Rating Data Wrangling

In [29]:
# change the -1 values (watched but not rated) to NaN
rating_df['rating'] = rating_df['rating'].replace(-1, np.nan)
rating_df.tail(20)

Unnamed: 0,user_id,anime_id,rating
7813717,73515,11759,8.0
7813718,73515,11837,9.0
7813719,73515,12031,8.0
7813720,73515,12113,10.0
7813721,73515,12115,10.0
7813722,73515,12293,8.0
7813723,73515,12413,9.0
7813724,73515,12445,8.0
7813725,73515,12461,7.0
7813726,73515,12967,7.0


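Now we will engineer our DataFrame in the following steps:

1. We want to recommend anime series only, so the relevant type is TV.
2. We make a new DataFrame by merging the anime and rating data on the anime_id column.
3. We keep only user_id, name, and rating. Note that after the merge, rating holds the anime's average score, while each user's own score is in rating_user.
4. To keep the computation manageable, we sample 7,500 ratings from users with a user_id up to 8500 and keep every rating made by the users in that sample.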
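
A toy merge shows how the `suffixes` argument names the two overlapping `rating` columns. The frames `toy_ratings` and `toy_anime` and all their values are hypothetical; only the `merge` call mirrors the one used in this notebook.

```python
import pandas as pd

# Hypothetical miniatures of rating_df and anime_tv
toy_ratings = pd.DataFrame({
    'user_id': [1, 1, 2],
    'anime_id': [10, 20, 10],
    'rating': [8.0, 6.0, 9.0],   # scores given by individual users
})
toy_anime = pd.DataFrame({
    'anime_id': [10, 20],
    'name': ['Show A', 'Show B'],
    'rating': [7.5, 6.8],        # site-wide average scores
})

# The left frame's 'rating' gets the '_user' suffix; the right one keeps its name
merged = toy_ratings.merge(toy_anime, on='anime_id', suffixes=('_user', ''))
print(merged.columns.tolist())
print(merged[['user_id', 'name', 'rating_user', 'rating']])
```

Selecting `rating` from the merged frame therefore returns the same average for every user of a given anime, whereas `rating_user` varies per user.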

In [30]:
# step 1
anime_tv = anime_df[anime_df['type']=='TV']

# step 2
rated_anime = rating_df.merge(anime_tv, 
                              left_on = 'anime_id', 
                              right_on = 'anime_id', 
                              suffixes= ['_user', ''])

In [31]:
rated_anime.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5283596 entries, 0 to 5283595
Data columns (total 9 columns):
 #   Column       Dtype  
---  ------       -----  
 0   user_id      int64  
 1   anime_id     int64  
 2   rating_user  float64
 3   name         object 
 4   genre        object 
 5   type         object 
 6   episodes     int64  
 7   rating       float64
 8   members      int64  
dtypes: float64(2), int64(4), object(3)
memory usage: 403.1+ MB


In [32]:
# step 3
rated_anime =rated_anime[['user_id', 'name', 'rating']]
rated_anime.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5283596 entries, 0 to 5283595
Data columns (total 3 columns):
 #   Column   Dtype  
---  ------   -----  
 0   user_id  int64  
 1   name     object 
 2   rating   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 161.2+ MB


In [33]:
rated_anime.isnull().sum()

user_id    0
name       0
rating     0
dtype: int64

In [34]:
# step 4 - sample 7,500 ratings from users with user_id <= 8500, then keep all ratings by the sampled users

rated_anime_7500_sample = rated_anime.loc[rated_anime['user_id'] <= 8500].sample(n=7500, random_state=42)
rated_anime_7500_sample.tail()

rated_anime_7500 = rated_anime.loc[rated_anime['user_id'].isin(rated_anime_7500_sample['user_id'])]

# rated_anime_7500= rated_anime[rated_anime.user_id <= 7500]
rated_anime_7500.tail()

Unnamed: 0,user_id,name,rating
5281588,8299,Badaui Jeonseol Jangbogo,6.35
5281598,8302,Jewelpet Magical Change,6.62
5281604,8446,Sukima no Kuni no Polta,5.67
5281612,8470,Gegege no Kitarou (1971),6.84
5281620,8470,Denkou Chou Tokkyuu Hikarian,6.84


In [35]:
rated_anime_7500

Unnamed: 0,user_id,name,rating
0,1,Naruto,7.81
2,5,Naruto,7.81
8,38,Naruto,7.81
9,39,Naruto,7.81
11,43,Naruto,7.81
...,...,...,...
5281588,8299,Badaui Jeonseol Jangbogo,6.35
5281598,8302,Jewelpet Magical Change,6.62
5281604,8446,Sukima no Kuni no Polta,5.67
5281612,8470,Gegege no Kitarou (1971),6.84


In [36]:
# create pivot table
pivot = rated_anime_7500.pivot_table(index='user_id', columns='name', values='rating')
pivot.tail()

name,.hack//Roots,.hack//Sign,.hack//Tasogare no Udewa Densetsu,009-1,07-Ghost,11eyes,12-sai.: Chicchana Mune no Tokimeki,3 Choume no Tama: Uchi no Tama Shirimasenka?,30-sai no Hoken Taiiku,91 Days,...,"Zone of the Enders: Dolores, I",Zukkoke Knight: Don De La Mancha,ef: A Tale of Melodies.,ef: A Tale of Memories.,gdgd Fairies,gdgd Fairies 2,iDOLM@STER Xenoglossia,s.CRY.ed,xxxHOLiC,xxxHOLiC Kei
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8487,,,,,,,,,,7.93,...,,,,,,,,,,
8491,,,,,,,,,,,...,,,,,,,,,,
8494,7.06,7.14,6.75,,,,,,,,...,,,,,,,,,8.11,8.34
8496,,,,,,,,,,,...,,,,,,,,,,
8498,,,,,,,,,,,...,,,,,,,,,,


In [37]:
pivot.shape

(3676, 2765)

We will engineer our pivot table in the following steps:

1. Mean-normalize each user's ratings.
2. Fill NaN values with 0.
3. Transpose the pivot for the next step.
4. Drop user columns that contain only zeros (no usable ratings).
5. Use the scipy package to convert the result to a sparse matrix for the similarity computation.
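
The steps above can be traced end to end on a tiny example. This is only a sketch: `toy_pivot` and its ratings are hypothetical, and the cosine similarity is computed directly with NumPy rather than with a sparse matrix and scikit-learn, so that each number is easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical user x anime ratings; NaN means "not rated"
toy_pivot = pd.DataFrame(
    {'Show A': [10.0, 8.0, np.nan],
     'Show B': [8.0, 10.0, 6.0],
     'Show C': [6.0, np.nan, 10.0]},
    index=pd.Index([1, 2, 3], name='user_id'),
)

# Step 1: mean-normalize each user's row (NaN values are skipped)
norm = toy_pivot.apply(lambda x: (x - x.mean()) / (x.max() - x.min()), axis=1)
# Steps 2-3: unrated -> 0, then transpose so that rows are anime
norm = norm.fillna(0).T
# Step 4: drop user columns that are entirely zero
norm = norm.loc[:, (norm != 0).any(axis=0)]

# Cosine similarity between anime rows: normalize rows, then take dot products
m = norm.to_numpy()
row_norms = np.linalg.norm(m, axis=1, keepdims=True)
row_norms[row_norms == 0] = 1.0   # guard against all-zero rows
unit = m / row_norms
toy_sim = pd.DataFrame(unit @ unit.T, index=norm.index, columns=norm.index)
print(toy_sim.round(3))
```

Each anime has similarity 1 with itself, and in this toy data every pair of distinct anime comes out at -0.5, because each user who rated one show above their personal mean rated another below it.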

In [38]:
# step 1
pivot_n = pivot.apply(lambda x: (x-np.mean(x))/(np.max(x)-np.min(x)), axis=1)


In [39]:
# other step 1
#scaler = MinMaxScaler()
#pivot_n = pd.DataFrame(scaler.fit_transform(pivot), columns=pivot.columns, index=pivot.index)

In [40]:
# step 2
pivot_n.fillna(0, inplace=True)

In [41]:
# step 3
pivot_n = pivot_n.T
pivot_n.shape

(2765, 3676)

In [42]:
# step 4
pivot_n = pivot_n.loc[:, (pivot_n != 0).any(axis=0)]

In [43]:
# step 5
piv_sparse = sp.sparse.csr_matrix(pivot_n.values)

In [44]:
#model based on anime similarity
anime_similarity = cosine_similarity(piv_sparse)

#Df of anime similarities
ani_sim_df = pd.DataFrame(anime_similarity, index = pivot_n.index, columns = pivot_n.index)

In [45]:
ani_sim_df

name,.hack//Roots,.hack//Sign,.hack//Tasogare no Udewa Densetsu,009-1,07-Ghost,11eyes,12-sai.: Chicchana Mune no Tokimeki,3 Choume no Tama: Uchi no Tama Shirimasenka?,30-sai no Hoken Taiiku,91 Days,...,"Zone of the Enders: Dolores, I",Zukkoke Knight: Don De La Mancha,ef: A Tale of Melodies.,ef: A Tale of Memories.,gdgd Fairies,gdgd Fairies 2,iDOLM@STER Xenoglossia,s.CRY.ed,xxxHOLiC,xxxHOLiC Kei
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
.hack//Roots,1.000000,0.512102,0.479799,0.042454,0.115094,0.154705,0.006981,0.000000,0.074917,-0.035781,...,0.088583,0.0,-0.152765,-0.148026,0.000000,0.000000,0.083514,0.135678,-0.188732,-0.173252
.hack//Sign,0.512102,1.000000,0.502814,0.032690,0.149356,0.123796,0.005449,0.000000,0.079489,-0.034488,...,0.111663,0.0,-0.151937,-0.162199,0.000000,0.000000,0.060901,0.165664,-0.194562,-0.176041
.hack//Tasogare no Udewa Densetsu,0.479799,0.502814,1.000000,0.062288,0.083898,0.120262,0.008727,0.000000,0.062609,-0.031996,...,0.037062,0.0,-0.155565,-0.174422,0.000000,0.000000,0.083069,0.104267,-0.187576,-0.141375
009-1,0.042454,0.032690,0.062288,1.000000,0.006361,0.071866,0.016598,0.000000,0.003938,-0.039667,...,0.015875,0.0,-0.047748,-0.050399,0.000000,0.000000,0.021019,0.026656,-0.112110,-0.090162
07-Ghost,0.115094,0.149356,0.083898,0.006361,1.000000,0.220877,0.014746,0.000878,0.048132,-0.039178,...,0.009711,0.0,-0.110784,-0.113733,-0.001514,-0.001268,0.035377,0.089958,-0.139658,-0.143925
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
gdgd Fairies 2,0.000000,0.000000,0.000000,0.000000,-0.001268,0.003672,0.000000,0.000000,0.000000,-0.026457,...,0.000000,0.0,-0.009985,-0.008738,0.999324,1.000000,0.017827,0.035691,-0.030151,0.000000
iDOLM@STER Xenoglossia,0.083514,0.060901,0.083069,0.021019,0.035377,0.081243,0.050949,0.000000,0.014034,-0.058105,...,0.026904,0.0,-0.133272,-0.143267,0.021284,0.017827,1.000000,0.005243,-0.121885,-0.086734
s.CRY.ed,0.135678,0.165664,0.104267,0.026656,0.089958,0.040673,0.000000,0.000000,0.020200,-0.014602,...,0.050773,0.0,-0.024951,-0.047068,0.035124,0.035691,0.005243,1.000000,-0.101639,-0.085583
xxxHOLiC,-0.188732,-0.194562,-0.187576,-0.112110,-0.139658,-0.211048,-0.012630,0.000000,-0.120583,0.046977,...,-0.027438,0.0,0.277922,0.293155,-0.029673,-0.030151,-0.121885,-0.101639,1.000000,0.772510


In [46]:
def anime_recommendation(ani_name):
    """
    Print the top 5 shows with the highest cosine similarity to ani_name, along with their match percent
    
    example:
    >>>Input: 
    
    anime_recommendation('Death Note')
    
    >>>Output: 
    
    Recommended because you watched Death Note:

                    #1: Code Geass: Hangyaku no Lelouch, 57.35% match
                    #2: Code Geass: Hangyaku no Lelouch R2, 54.81% match
                    #3: Fullmetal Alchemist, 51.07% match
                    #4: Shingeki no Kyojin, 48.68% match
                    #5: Fullmetal Alchemist: Brotherhood, 45.99% match 

               
    """
    # Check that the input anime name is present in the similarity matrix
    if ani_name not in ani_sim_df.columns:
        print(f'{ani_name} not found in pivot table')
        return
    
    print('Recommended because you watched {}:\n'.format(ani_name))
    
    # Rank all other anime by similarity to ani_name and keep the top 5
    top_matches = ani_sim_df[ani_name].drop(ani_name).sort_values(ascending=False).head(5)
    for number, (anime, score) in enumerate(top_matches.items(), start=1):
        print(f'#{number}: {anime}, {round(score*100,2)}% match')

In [47]:
anime_recommendation('Death Note')

# 📖 Summary and conclusion

In [None]:
from ipywidgets import interact, widgets

# Define a function that takes an anime name as input and calls anime_recommendation() function
def get_recommendations(anime_name):
    anime_recommendation(anime_name)

# Create a text input box for the user to input an anime name
anime_input = widgets.Text(
    value='',
    placeholder='Enter an anime name',
    description='Anime:',
    layout=widgets.Layout(width='50%')
)

# Create a button to call the get_recommendations() function when clicked
button = widgets.Button(
    description='Get Recommendations',
    layout=widgets.Layout(width='20%')
)

def on_button_click(b):
    get_recommendations(anime_input.value)

button.on_click(on_button_click)

# Display the input box and button in a vertical layout
display(widgets.VBox([anime_input, button]))