# Anime Recommender System -- Top_K Memory Based Collaborative Filtering 

### First let's import our data visualization and analysis packages

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style('whitegrid')
plt.rcParams['patch.force_edgecolor']=True

In [None]:
df_features = pd.read_csv('../input/rating.csv', encoding="ISO-8859-1")
df_anime = pd.read_csv('../input/anime.csv', encoding="ISO-8859-1")

## Some info about our database

### About our user-rating database

In [None]:
df_features.info()

In [None]:
df_features['rating'].describe(percentiles=[0.5])

Notice that the -1 ratings are messing with the statistics.

In [None]:
df_features.head()

### According to the data description, all entries rated -1 were watched but not rated. We will replace these values with np.nan so that they won't interfere with our algorithms.

In [None]:
watched_animes = df_features.replace(to_replace=-1, value=np.nan)
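To see why the -1 placeholder distorts the summary statistics, here is a minimal sketch on a made-up ratings column (the values are illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: -1 marks "watched but not rated"
toy = pd.Series([8, 9, -1, 7, -1], name='rating')

# The placeholder is treated as a real rating and drags the mean down
mean_with_sentinel = toy.mean()

# After replacing -1 with NaN, pandas skips those entries
mean_without_sentinel = toy.replace(-1, np.nan).mean()

print(mean_with_sentinel, mean_without_sentinel)
```

The same effect applies to the other statistics reported by `describe()`, such as the minimum and the standard deviation.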

Let's check again the rating statistics:

In [None]:
watched_animes['rating'].describe(percentiles=[0.5])

### BUT...
There's a problem: when we drop the NaN values we lose every user who watched animes but never rated any of them, and because of that we may lose some animes too (those that nobody rated).

So let's keep that in mind in our EDA and recommender system development. Let's calculate how many users we are dropping.

In [None]:
n_user = df_features['user_id'].nunique()
n_anime = df_features['anime_id'].nunique()
print('Originally we had {0} users and {1} animes'.format(n_user, n_anime))
print('After dropping N/A values we have {0} users and {1} animes'.format(watched_animes.dropna()['user_id'].nunique(), watched_animes.dropna()['anime_id'].nunique()))
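A minimal sketch on made-up data (all ids and ratings below are hypothetical) showing which users and animes disappear when the unrated rows are dropped: only users without a single rated entry, and only animes that nobody rated.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: user 3 watched two animes but rated neither
toy = pd.DataFrame({
    'user_id':  [1, 1, 2, 3, 3],
    'anime_id': [10, 20, 10, 20, 30],
    'rating':   [8.0, np.nan, 7.0, np.nan, np.nan],
})

# Drop the rows without a rating
rated = toy.dropna(subset=['rating'])

# Users and animes that no longer appear after the drop
users_lost = set(toy['user_id']) - set(rated['user_id'])
animes_lost = set(toy['anime_id']) - set(rated['anime_id'])

print(users_lost, animes_lost)
```

User 1 survives even though one of their entries was unrated, because at least one of their rows still carries a rating.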

## EDA (Exploratory Data Analysis)

### Ratings Distribution

Let's explore the distribution of the ratings in the dataset...

In [None]:
sns.distplot(watched_animes['rating'].dropna(), color='red')

Notice that the ratings follow a roughly bell-shaped distribution centered around 7.

### Anime type, episodes, mean rating and members

Let's explore the animes contained in the database. Because some animes in the ratings database were never rated, let's keep only the animes that were rated.

In [None]:
df_anime.head()

In [None]:
# Keep only the animes that received at least one actual rating
rated_ids = watched_animes.loc[watched_animes['rating'].notnull(), 'anime_id'].unique()
df_anime = df_anime[df_anime['anime_id'].isin(rated_ids)]

In [None]:
print("We have {0} animes in the ratings database and {1} in the anime database.".format(n_anime, df_anime['anime_id'].count()))

We have one instance in the ratings database that is not contained in the anime database. That may be an inconsistency between the two databases.
Because it is only one instance, we will disregard it.

In [None]:
sns.countplot(x='type', data=df_anime)

The database consists mainly of TV and OVA animes.

In [None]:
sns.distplot(df_anime['rating'].dropna(), color='red')

As expected (from the ratings database distribution) we have an almost normal distribution of ratings centered on 7. 
Because these ratings are (probably) mean user ratings for each anime we don't have extreme ratings (10 or 0 ratings).

In [None]:
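plt.figure(figsize=(10,6))
# Count TV animes per number of episodes, showing the 20 most common values
tv_anime = df_anime[df_anime['type']=='TV']
sns.countplot(x='episodes', data=tv_anime, order=tv_anime['episodes'].value_counts().index[:20], palette='viridis')
plt.tight_layout()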

It is interesting to notice that we have a huge concentration of 1-episode TV animes. We have two hypotheses for this:

>1 - Database error: these animes may have been put in the wrong category (TV instead of OVA or Movie).

>2 - Some 1-episode animes were television movies (not shown at the cinema). -> More likely

In [None]:
# jointplot creates its own figure, so its size is set through `height`
sns.jointplot(x='rating', y='members', data=df_anime, s=10, color='green', height=8)

We can observe that most animes in the database have fewer than 200k members, and those with more than 200k tend to have higher ratings; a possible explanation is that popular animes tend to be better received by the public. It is also curious that some 'not-so-popular' animes have high ratings too.

We will dummify the genres so we can visualize the genre distribution in the dataset. The 'genre' column stores the genres of each anime as a single comma-separated string, so we split it on the separator and create one indicator column per genre; animes with a missing genre get zeros in every column.

In [None]:
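# Genres are separated by a comma plus a space; splitting on "," alone would create
# duplicated columns such as "Comedy" and " Comedy"
genre = df_anime['genre'].str.get_dummies(sep=", ")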
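The choice of separator matters when dummifying. The sketch below uses made-up genre strings and assumes they are written with a comma followed by a space; it compares splitting on `","` with splitting on `", "`:

```python
import pandas as pd

# Hypothetical genre strings in the "comma + space" style, with one missing value
toy_genre = pd.Series(['Action, Comedy', 'Comedy', None])

# Splitting on "," alone keeps the leading space, so " Comedy" and "Comedy" differ
naive = toy_genre.str.get_dummies(sep=',')

# Splitting on ", " yields one clean column per genre
clean = toy_genre.str.get_dummies(sep=', ')

print(list(naive.columns))
print(list(clean.columns))
print(clean)
```

The row with the missing genre ends up with zeros in every column, so no explicit `fillna` is needed before calling `get_dummies`.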

In [None]:
print('We have {0} genres in the database'.format(genre.shape[1]))

In [None]:
sum_genre = pd.DataFrame(genre.sum(), index=genre.columns, columns=['Total'])

In [None]:
plt.figure(figsize=(14,8))
sns.barplot(data=sum_genre.sort_values(by=['Total'],ascending=False).head(15).T)
plt.tight_layout()

The plot shows the 15 most common genres in the database; notice that the top 5 are Action, Comedy, Sci-Fi, Shounen and Fantasy, which matches the common perception of anime.
Also notice that this group of popular genres includes Romance, Hentai and Mecha.

## Similarity Recommender System - The Simplest

Now that we have some knowledge about the data let's start coding our recommender system. These are the steps necessary to build the algorithm:
> 1. Organize the data as a (user_id x anime_id) matrix in which each cell holds a rating, i.e. cell (user_id, anime_id) = rating that user_id gave to anime_id
> 2. Compute the similarity between animes
> 3. Recommend the top k most similar animes
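The three steps above can be sketched on a tiny made-up dataset. The function name `top_k_similar` and the toy ratings are hypothetical and only illustrate the idea; Pearson correlation is used here as one possible similarity.

```python
import pandas as pd

# Hypothetical toy ratings in long format (user_id, name, rating)
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'name':    ['A', 'B', 'C'] * 4,
    'rating':  [9, 8, 2, 7, 7, 4, 3, 2, 9, 8, 9, 1],
})

def top_k_similar(ratings_long, target, k=2):
    """Return the k items whose rating columns correlate most with `target`."""
    # Step 1: user-item matrix (rows: users, columns: items)
    user_item = ratings_long.pivot_table(values='rating', index='user_id', columns='name')
    # Step 2: Pearson correlation of every item column with the target column
    similarity = user_item.corrwith(user_item[target])
    # Step 3: drop the target itself and keep the k most similar items
    return similarity.drop(target).sort_values(ascending=False).head(k)

top = top_k_similar(toy, 'A', k=1)
print(top)
```

In this toy data, users who like 'A' also like 'B' and dislike 'C', so 'B' comes out as the most similar item.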

We should also take the number of ratings into account when recommending an anime. Why? Many animes were rated only a handful of times, by users who also rated one particular anime; in that case we get a high correlation even though it is not meaningful. We will explore all of this...
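A minimal sketch of this effect, on a made-up user-item matrix (the anime names and ratings are hypothetical): when only two users rated both animes, the Pearson correlation is perfect by construction, because any two points lie on a straight line.

```python
import numpy as np
import pandas as pd

# Hypothetical user-item matrix: only users 1 and 2 rated both animes
toy = pd.DataFrame({
    'Popular Anime': [9, 7, 8, 6, 10],
    'Obscure Anime': [8, 6, np.nan, np.nan, np.nan],
}, index=pd.Index([1, 2, 3, 4, 5], name='user_id'))

# Pearson correlation uses only the rows where both ratings are present
corr = toy['Popular Anime'].corr(toy['Obscure Anime'])

# Number of users who rated both animes
n_common = toy.dropna().shape[0]

print(corr, n_common)
```

A minimum number of ratings acts as a simple guard against this kind of spurious match.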

### Step 1: User-Item Matrix

First, let's add the anime names to the features dataframe and create a new dataframe with the mean rating and the number of ratings for each anime:

In [None]:
names = df_anime[['name', 'anime_id']]
watched_animes = pd.merge(names, watched_animes, on='anime_id')
watched_animes.drop('anime_id', inplace=True,axis=1)
# Mean rating and number of (non-null) ratings per anime
ratings = watched_animes.groupby('name')['rating'].mean().to_frame()
ratings['Number of Ratings'] = watched_animes.groupby('name')['rating'].count()

In [None]:
plt.figure(figsize=(10,6))
sns.distplot(ratings['Number of Ratings'], kde=False, bins=250, color=sns.color_palette('rocket',1)[0])
plt.xlim((0,5000))

Notice that few animes have more than 1000 ratings, and that can bias our recommendations (remember from the previous section that perfect correlations can 'appear' for animes that have good ratings but only a small number of them).

We will only consider animes with more than 1000 ratings, that way we can filter better suited recommendations (this also results in only more popular animes being recommended).

##### Because we are dealing with high-dimensional matrices, we will randomly choose only 30000 users (remember we have 69600 users in the database).

In [None]:
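# Sample 30000 distinct users (not 30000 rating rows) and keep all of their ratings
sampled_users = watched_animes['user_id'].drop_duplicates().sample(30000, random_state=42)
user_item = watched_animes[watched_animes['user_id'].isin(sampled_users)]
user_item = user_item.pivot_table(values='rating', index='user_id', columns='name')
# Unrated animes are filled with 0 so the similarity functions receive no NaN
user_item = user_item.fillna(0)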

## Comparing different similarity metrics
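The two metrics compared here are closely related: Pearson's correlation is the cosine similarity of the mean-centered vectors. The sketch below checks this on two made-up rating vectors (the helper `cosine` and the values are hypothetical):

```python
import numpy as np

# Hypothetical rating vectors for two animes over the same five users
x = np.array([9.0, 7.0, 8.0, 3.0, 5.0])
y = np.array([8.0, 6.0, 9.0, 2.0, 6.0])

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Cosine similarity of the raw rating vectors
cosine_raw = cosine(x, y)

# Pearson correlation equals the cosine similarity of the mean-centered vectors
pearson_as_cosine = cosine(x - x.mean(), y - y.mean())
pearson_numpy = float(np.corrcoef(x, y)[0, 1])

print(cosine_raw, pearson_as_cosine, pearson_numpy)
```

Because the two metrics differ only by the centering step, they often produce similar rankings, although this is not guaranteed.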

### Fullmetal Alchemist
#### Pearson's Correlation

In [None]:
fullmetal = user_item['Fullmetal Alchemist']
similar_fullmetal = user_item.corrwith(fullmetal)
corr_fullmetal = pd.DataFrame(similar_fullmetal, columns=['Correlation'])
corr_fullmetal = corr_fullmetal.join(ratings['Number of Ratings'])
corr_fullmetal = corr_fullmetal[corr_fullmetal['Number of Ratings']>1000].sort_values('Correlation', ascending=False).head()
animes = corr_fullmetal.index

In [None]:
df = df_anime.set_index('name')
for name in animes[0:3]:
    fm = df.loc[name]
    print("{0}'s genre is: {1}. It is a {2} anime with {3} episodes and {4} members. The rating is {5}\n".format(name,
      fm['genre'], fm['type'], fm['episodes'], fm['members'], fm['rating']))

#### Cosine Similarity

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
# Compare every anime column with the target column (reshaped to a single row)
similar_fullmetal = cosine_similarity(user_item.T, fullmetal.values.reshape(1, -1))
corr_fullmetal = pd.DataFrame(similar_fullmetal, columns=['Correlation'], index=user_item.columns)
corr_fullmetal = corr_fullmetal.join(ratings['Number of Ratings'])
corr_fullmetal = corr_fullmetal[corr_fullmetal['Number of Ratings']>1000].sort_values('Correlation', ascending=False).head()
animes = corr_fullmetal.index

In [None]:
df = df_anime.set_index('name')
for name in animes[0:3]:
    fm = df.loc[name]
    print("{0}'s genre is: {1}. It is a {2} anime with {3} episodes and {4} members. The rating is {5}\n".format(name,
      fm['genre'], fm['type'], fm['episodes'], fm['members'], fm['rating']))

### One Punch Man
#### Pearson's Correlation

In [None]:
one = user_item['One Punch Man']
similar_one = user_item.corrwith(one)
corr_one = pd.DataFrame(similar_one, columns=['Correlation'])
corr_one = corr_one.join(ratings['Number of Ratings'])
corr_one = corr_one[corr_one['Number of Ratings']>1000].sort_values('Correlation', ascending=False).head()
animes = corr_one.index

In [None]:
for name in animes[0:3]:
    fm = df.loc[name]
    print("{0}'s genre is: {1}. It is a {2} anime with {3} episodes and {4} members. The rating is {5}\n".format(name,
      fm['genre'], fm['type'], fm['episodes'], fm['members'], fm['rating']))

#### Cosine Similarity

In [None]:
one = user_item['One Punch Man']
# Compare every anime column with the target column (reshaped to a single row)
similar_one = cosine_similarity(user_item.T, one.values.reshape(1, -1))
corr_one = pd.DataFrame(similar_one, columns=['Correlation'], index=user_item.columns)
corr_one = corr_one.join(ratings['Number of Ratings'])
corr_one = corr_one[corr_one['Number of Ratings']>1000].sort_values('Correlation', ascending=False).head()
animes = corr_one.index

In [None]:
for name in animes[0:3]:
    fm = df.loc[name]
    print("{0}'s genre is: {1}. It is a {2} anime with {3} episodes and {4} members. The rating is {5}\n".format(name,
      fm['genre'], fm['type'], fm['episodes'], fm['members'], fm['rating']))

## Results
It is interesting that we obtained the same results using two different metrics. We could improve this simple recommender system by adding the 'type' column to our ratings dataframe and using it as a filter, so that we only recommend animes of the same type (notice that for One Punch Man we recommended a Movie).
Also notice that the genres of the recommendations intersect with those of the chosen anime. This could be the result of a good recommendation system, or it could simply be that animes generally share one or two genres (that is, each anime carries many genre labels in the data).
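A possible way to filter recommendations by type, sketched on made-up data. The function `recommend_same_type`, the similarity scores and the anime names are all hypothetical and are not part of the analysis above.

```python
import pandas as pd

# Hypothetical similarity scores and anime metadata
scores = pd.Series({'Show A': 0.9, 'Movie B': 0.8, 'Show C': 0.7, 'Show D': 0.4})
meta = pd.DataFrame({'type': ['TV', 'Movie', 'TV', 'TV']},
                    index=['Show A', 'Movie B', 'Show C', 'Show D'])

def recommend_same_type(scores, meta, target_type, k=2):
    """Keep only candidates of `target_type`, then return the k best scores."""
    same_type = meta.index[meta['type'] == target_type]
    return scores[scores.index.isin(same_type)].sort_values(ascending=False).head(k)

recs = recommend_same_type(scores, meta, 'TV')
print(recs)
```

Here 'Movie B' is skipped despite its high score, because only TV animes are allowed as candidates.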