# ANIME RECOMMENDER SYSTEM

In this notebook, I build an anime recommender system from data scraped from myanimelist.net, which is available <a href="https://www.kaggle.com/hernan4444/anime-recommendation-database-2020">here</a>.

## Content-Based Recommender System

In the content-based recommender system, we consider only the synopsis and metadata of each anime. Given an input anime, the recommender returns the items most similar to it.

Advantages:
* Avoids the 'cold start' problem of collaborative filtering, where a new item without enough ratings is never recommended.

Disadvantages:
* Results may be unsatisfying, since it is difficult to judge a complex item by its metadata alone.
* Tends to return near-identical items, such as the sequel or second season of the input anime.
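
As a rough illustration of the idea (not the pipeline used in the cells below), the following sketch ranks a few made-up titles by the cosine similarity of their word counts. The `catalog` entries and the `most_similar` helper are hypothetical and exist only for this example.

```python
import math
from collections import Counter

# Hypothetical toy catalog: title -> short description (not from the dataset)
catalog = {
    "Space Pirates": "pirates travel space ship adventure",
    "Space Pirates 2": "pirates travel space ship adventure sequel",
    "School Romance": "school students romance comedy",
    "Galaxy Patrol": "space police ship patrol",
}

def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(title, n=2):
    """Return the n titles whose descriptions are closest to `title`."""
    bags = {t: Counter(text.split()) for t, text in catalog.items()}
    query = bags[title]
    scored = [(cosine(query, bag), t) for t, bag in bags.items() if t != title]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:n]]

print(most_similar("Space Pirates"))
```

In this toy catalog the sequel ranks first, which is the second disadvantage listed above in miniature.
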

In [None]:
!pip install rake-nltk

import pandas as pd
import numpy as np
from rake_nltk import Rake
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
#Import dataset with synopsis (the column itself is spelled 'sypnopsis' in the CSV)
sypnopsis = pd.read_csv('../input/anime-recommendation-database-2020/anime_with_synopsis.csv')

#Import dataset with anime type
usecols =['MAL_ID','Type']
anime_type = pd.read_csv('../input/anime-recommendation-database-2020/anime.csv', usecols=usecols)

In [None]:
display(sypnopsis.shape)
display(anime_type.shape)

#Merge the two dataset
df = sypnopsis.merge(anime_type, how='left', on='MAL_ID')

display(df.shape)

In [None]:
#Check the first 5 row of merged dataset
df.head()

In [None]:
#Check the types of the anime 
df['Type'].value_counts()

In [None]:
#Filter out the 'Music' and 'Unknown' types
df = df[~df['Type'].isin(['Music', 'Unknown'])]
df['Type'].value_counts()

In [None]:
#Check the dataset
df.describe(include='all')

In [None]:
#Drop anime with duplicate names
df.drop_duplicates(subset='Name', inplace=True)
df.shape

In [None]:
#The most frequent synopsis value is taken to be the placeholder text for anime with no synopsis
no_sypnopsis = df['sypnopsis'].mode()[0]
no_sypnopsis

In [None]:
#Keep only the rows that have a real synopsis
df = df[df['sypnopsis'].notnull() & (df['sypnopsis'] != no_sypnopsis)]
df.shape

In [None]:
df.describe(include='all')

In [None]:
#Repeat the step: the next most frequent synopsis is assumed to be another placeholder text
no_sypnopsis = df['sypnopsis'].mode()[0]
df = df[df['sypnopsis'] != no_sypnopsis]
df.shape

In [None]:
#Drop anime with duplicate synopsis (reassign instead of inplace to avoid modifying a slice)
df = df.drop_duplicates(subset='sypnopsis')
df.describe(include='all')

In [None]:
df['sypnopsis'] = df['sypnopsis'].astype(str)
print(df['sypnopsis'].head())

In [None]:
# Function to get keywords from a text
def get_keywords(text):
    # Initialize Rake with NLTK's English stopwords and all punctuation characters
    r = Rake()

    # Extract keywords from the text
    r.extract_keywords_from_text(text)

    # Dictionary mapping each keyword to its degree score
    scores = r.get_word_degrees()

    # Return the keywords as a list, ignoring the scores
    return list(scores.keys())

# Apply the function to generate keywords for each synopsis
df['Keywords'] = df['sypnopsis'].apply(get_keywords)
df.head()
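
To give a feel for what the word degrees in the cell above represent, here is a simplified, pure-Python stand-in for the RAKE idea. It is not the `rake_nltk` implementation: the `STOPWORDS` set is a tiny hypothetical list, and `candidate_phrases` and `word_degrees` are illustrative helpers. Text is split into phrases at punctuation and stopwords, and each word's degree is the total length of the phrases it appears in.

```python
import re
from collections import defaultdict

# Hypothetical, tiny stopword list; rake_nltk uses NLTK's full English list
STOPWORDS = {"a", "an", "the", "of", "in", "and", "to", "is", "who", "with"}

def candidate_phrases(text):
    """Split text into phrases, breaking at punctuation and stopwords."""
    phrases = []
    for chunk in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in chunk.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def word_degrees(text):
    """Degree of a word = total length of the phrases it appears in."""
    degrees = defaultdict(int)
    for phrase in candidate_phrases(text):
        for word in phrase:
            degrees[word] += len(phrase)
    return dict(degrees)

text = ("A young detective solves crimes in the city, "
        "and the detective hunts a secret organization.")
degrees = word_degrees(text)
print(degrees)
print(list(degrees))  # the keywords only, ignoring the scores
```

Stopwords never appear among the keys, so keeping only the keys yields a stopword-free keyword list per synopsis.
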

In [None]:
# Split the comma-separated genre string into a list of lowercase genres
def tokenize(x):
    if isinstance(x, str):
        return x.lower().split(", ")
    # Missing or non-string values produce no genres
    return []

df['Genres'] = df['Genders'].apply(tokenize)
df.head()

In [None]:
df_keys = pd.DataFrame()

df_keys['title'] = df['Name']

# Join the genres and keywords of each anime into one space-separated string
def bag_words(x):
    return ' '.join(x['Genres']) + ' ' + ' '.join(x['Keywords']) + ' '

df_keys['bag_of_words'] = df.apply(bag_words, axis=1)

df_keys.head()

In [None]:
cv = CountVectorizer()
bow = cv.fit_transform(df_keys['bag_of_words'])

In [None]:
cosine_sim = cosine_similarity(bow, bow)
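
For reference, the cosine similarity between two word-count vectors $\mathbf{a}$ and $\mathbf{b}$ (two rows of the bag-of-words matrix) is the standard definition:

$$
\operatorname{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \, \sqrt{\sum_i b_i^2}}
$$

As a small worked example with made-up vectors, $\mathbf{a} = (1, 1, 0)$ and $\mathbf{b} = (1, 0, 1)$ share one word, so $\operatorname{sim}(\mathbf{a}, \mathbf{b}) = \frac{1}{\sqrt{2}\,\sqrt{2}} = 0.5$. Because counts are never negative, the score lies between 0 (no shared words) and 1 (identical word proportions).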

In [None]:
# Map each title to its row position in cosine_sim
# (df_keys keeps the index labels of the filtered df, which no longer match row positions)
indices = pd.Series(np.arange(len(df_keys)), index=df_keys['title'])

def recommend_anime(title, n=10, cosine_sim=cosine_sim):
    # Retrieve the row position of the matching title
    if title not in indices.index:
        print("Anime not in database.")
        return
    idx = indices[title]

    # Cosine similarity scores in descending order
    scores = pd.Series(cosine_sim[idx]).sort_values(ascending=False)

    # Top n most similar anime, leaving out the input anime itself
    top_n_idx = list(scores.drop(idx).iloc[:n].index)

    return pd.DataFrame(df_keys['title'].iloc[top_n_idx])

In [None]:
recommend_anime('InuYasha')

In [None]:
recommend_anime('Kaguya-sama wa Kokurasetai: Tensai-tachi no Renai Zunousen')

In [None]:
recommend_anime('Detective Conan')

## Collaborative Filtering

In the collaborative filtering recommender system, we consider only the ratings given by users. Given an input anime, the recommender returns anime preferred by users who rated the input anime highly.

Advantages:
* Results are often better than content-based ones, because recommendations do not depend on the machine analyzing a complex object such as an anime from its metadata.

Disadvantages:
* Suffers from the 'cold start' problem: a new item without enough ratings will not be recommended.
* Requires a large amount of rating data before it can generate satisfying results.
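
A minimal sketch of the item-based idea, using a made-up rating table rather than the dataset: each anime is a vector of user ratings, and the nearest neighbors under cosine distance are recommended. The `ratings` table and the `cosine_distance` and `nearest_items` helpers are hypothetical names for this example only.

```python
import math

# Hypothetical ratings: rows are anime, columns are users, 0 means "not rated"
ratings = {
    "Anime A": [10, 9, 0, 2],
    "Anime B": [9, 10, 0, 1],
    "Anime C": [0, 1, 9, 10],
    "Anime D": [1, 0, 10, 9],
    "New Anime": [0, 0, 0, 0],  # no ratings yet
}

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # an unrated item is treated as maximally distant
    return 1.0 - dot / (norm_u * norm_v)

def nearest_items(title, n=2):
    """Return the n anime whose rating vectors are closest to `title`."""
    query = ratings[title]
    others = [(cosine_distance(query, vec), t)
              for t, vec in ratings.items() if t != title]
    others.sort(key=lambda pair: pair[0])
    return [t for _, t in others[:n]]

print(nearest_items("Anime A"))
```

"New Anime" has no ratings, so it sits at distance 1.0 from every other title and never surfaces near the top; this is the cold start problem in miniature.
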

In [None]:
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix

In [None]:
#Import anime_name dataset
anime_name = pd.read_csv('../input/anime-recommendation-database-2020/anime.csv', usecols=['MAL_ID','Name'])
display(anime_name.shape)
anime_name.head()

In [None]:
#Import rating dataset
rating_data =pd.read_csv('../input/anime-recommendation-database-2020/rating_complete.csv')
display(rating_data.shape)
rating_data.head()

In [None]:
#Count number of unique users
users_count = rating_data.groupby("user_id").size().reset_index()
users_count.columns = ["user_id", "anime_count"]
print('Number of unique users : ', users_count.shape[0])

#Filter the users
filtered_users = users_count[users_count.anime_count >= 500]
users = list(filtered_users.user_id)
print('Number of unique users with 500 or more anime ratings : ', len(users))

In [None]:
#Keep only the ratings from users who have rated 500 or more anime
rating_data = rating_data[rating_data['user_id'].isin(users)]
print("rating shape:", rating_data.shape)
#info() prints its report directly and returns None, so it is not wrapped in print()
rating_data.info()

In [None]:
#Build the anime x user rating matrix (rows: anime, columns: users, 0 = not rated)
unique_users = {int(x): i for i,x in enumerate(rating_data.user_id.unique())}
unique_animes = {int(x): i for i,x in enumerate(anime_name.MAL_ID.unique())}

print(len(unique_animes), len(unique_users))
anime_collaborative_filter = np.zeros((len(unique_animes), len(unique_users)))

for user_id, MAL_ID, rating in rating_data.values:
    anime_collaborative_filter[unique_animes[MAL_ID], unique_users[user_id]] = rating
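
The cell above fills a dense matrix in which most entries stay zero. As a sketch of the same ID-to-position mapping that stores only the observed ratings, consider the following. The `rows` data is made up, and the names `anime_pos`, `user_pos`, and `sparse_ratings` are hypothetical. Unlike the notebook, which takes anime positions from `anime_name`, this sketch derives them from the rating rows themselves.

```python
# Hypothetical (user_id, MAL_ID, rating) rows, in the same column order as rating_data
rows = [(501, 20, 9), (501, 5114, 10), (777, 20, 7), (902, 5114, 8)]

# Map raw IDs to consecutive row/column positions, in order of first appearance
anime_pos = {}
user_pos = {}
for user_id, mal_id, _ in rows:
    anime_pos.setdefault(mal_id, len(anime_pos))
    user_pos.setdefault(user_id, len(user_pos))

# Store only the observed ratings, keyed by (anime row, user column)
sparse_ratings = {
    (anime_pos[mal_id], user_pos[user_id]): rating
    for user_id, mal_id, rating in rows
}

shape = (len(anime_pos), len(user_pos))
density = len(sparse_ratings) / (shape[0] * shape[1])
print(shape, sparse_ratings, density)
```

The same row, column, and value triples could be passed to `scipy.sparse.csr_matrix((data, (row, col)), shape=shape)` to build the sparse matrix without first allocating the dense one.
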

In [None]:
def get_recommended(title, n_neighbors=10):
    # Look up the row position of the title; anime_name rows align with the matrix rows
    matches = anime_name.index[anime_name['Name'] == title]
    if len(matches) == 0:
        print("Anime not in database.")
        return
    query_index = matches[0]

    # Ask for one extra neighbor because the query anime is its own nearest neighbor
    model_knn = NearestNeighbors(metric='cosine', n_neighbors=n_neighbors + 1)
    model_knn.fit(csr_matrix(anime_collaborative_filter))
    query = anime_collaborative_filter[query_index, :].reshape(1, -1)
    distances, indices = model_knn.kneighbors(query)

    # Collect the neighbors, skipping the query anime itself
    result = [anime_name.iloc[i] for i in indices.flatten() if i != query_index]
    return pd.DataFrame(result[:n_neighbors])

In [None]:
get_recommended('Fullmetal Alchemist: Brotherhood')

In [None]:
get_recommended('InuYasha')

In [None]:
get_recommended('Kaguya-sama wa Kokurasetai: Tensai-tachi no Renai Zunousen')

In [None]:
get_recommended('Steins;Gate')

## References

This notebook is based on the following references:
* <a href="https://www.kaggle.com/hernan4444/anime-content-collaborative-knn">Anime Recommended System - Content Based & Collaborative Filtering</a>
* <a href="https://towardsdatascience.com/machine-learning-for-building-recommender-system-in-python-9e4922dd7e97">Machine Learning for Building Recommender System in Python</a>
* <a href="https://www.datacamp.com/community/tutorials/recommender-systems-python">Beginner Tutorial: Recommender Systems in Python</a>
* <a href="https://towardsdatascience.com/techniques-for-content-based-recommender-systems-64f812d2b5a0">Techniques for Content-based Recommender Systems</a>
* <a href="https://towardsdatascience.com/how-to-build-from-scratch-a-content-based-movie-recommender-with-natural-language-processing-25ad400eb243">How to build a content-based movie recommender system with Natural Language Processing</a>