## Content-Based Recommendation System for Anime

This is my first attempt at recommender systems. In the following cells, I will build a content-based recommendation system using **unsupervised nearest-neighbors learning**. The recommender is based on item features such as genre, type (movie, TV, etc.), number of episodes, rating and a few more.

**The recommender built here is a non-personalized system: it relies on information about the content of items rather than on other users’ opinions or interactions.** A pure content-based recommender system, in its general form, makes recommendations for a user based solely on a profile built by analyzing the content of items that the user has rated in the past.

#### So, what counts as content in a CB recommender?

- It can be attributes or characteristics of the item. For a film, for example: genre, year, cast, etc.
- It can also be textual content (title, description, table of contents, etc.), for example with NLP techniques used to extract content features.
- It can be extracted from the signal itself (audio, image).
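
To make the idea concrete, here is a minimal sketch of items described by content features and compared with each other. The item names and the three features are made up for illustration, and cosine similarity is just one common choice of measure (the model built later in this notebook relies on the default distance of scikit-learn's NearestNeighbors instead).

```python
import numpy as np

# Hypothetical toy items described by hand-made content features:
# [is_action, is_romance, is_comedy]
item_features = {
    "item_a": np.array([1.0, 0.0, 1.0]),
    "item_b": np.array([1.0, 0.0, 0.0]),
    "item_c": np.array([0.0, 1.0, 0.0]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# item_a shares a feature with item_b but none with item_c.
sim_ab = cosine_similarity(item_features["item_a"], item_features["item_b"])
sim_ac = cosine_similarity(item_features["item_a"], item_features["item_c"])
print(sim_ab, sim_ac)
```

Items that share content features get a higher score, which is all a content-based recommender needs in order to rank candidates.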


#### Pros(+)  &  Cons(-)

Pros :
- No need for data on other users.
    - No cold-start or sparsity problems.
- Able to recommend to users with unique tastes.
- Able to recommend new and unpopular items.
    - No first-rater problem.
- ...

Cons :
- Requires content that can be encoded as meaningful features.
- Some kinds of items are not amenable to easy feature extraction methods (e.g. movies, music).
- Users’ tastes must be represented as a learnable function of these content features.
- Hard to exploit quality judgements of other users.
- Easy to overfit.
- ...
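
The point about representing a user's tastes as a function of content features can be sketched in a few lines. This is only an illustration under my own assumptions (a made-up feature matrix, a profile taken as the plain average of liked items, a dot-product score); it is not the model built later in this notebook, which does not use user profiles at all.

```python
import numpy as np

# Hypothetical item feature matrix: rows are items, columns are content features.
items = np.array([
    [1.0, 0.0, 1.0],   # item 0
    [1.0, 0.0, 0.0],   # item 1
    [0.0, 1.0, 0.0],   # item 2
    [1.0, 1.0, 1.0],   # item 3
])

liked = [0, 1]                        # items the user rated highly (assumed)
profile = items[liked].mean(axis=0)   # user profile = average of the liked items

# Score every unseen item by its dot product with the profile.
unseen = [i for i in range(len(items)) if i not in liked]
scores = {i: float(items[i] @ profile) for i in unseen}
best = max(scores, key=scores.get)
print(profile, scores, best)
```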

### We're ready to analyze the Anime Recommendations Database

This data set contains information on user preference data from 73,516 users on 12,294 anime. Each user is able to add anime to their completed list and give it a rating and this data set is a compilation of those ratings.

Anime.csv

- anime_id : myanimelist.net's unique id identifying an anime.
- name : full name of anime.
- genre : comma separated list of genres for this anime.
- type : movie, TV, OVA, etc.
- episodes : how many episodes in this show. (1 if movie).
- rating : average rating out of 10 for this anime.
- members : number of community members that are in this anime's "group".

**Feel free to download the dataset from [here](https://www.kaggle.com/CooperUnion/anime-recommendations-database)**

In [1]:
# import libraries

import pandas as pd
import numpy as np

anime = pd.read_csv('anime.csv')
anime.head(3)

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262


In [2]:
anime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
anime_id    12294 non-null int64
name        12294 non-null object
genre       12232 non-null object
type        12269 non-null object
episodes    12294 non-null object
rating      12064 non-null float64
members     12294 non-null int64
dtypes: float64(1), int64(2), object(4)
memory usage: 672.4+ KB


First thoughts on the attributes:

- 'genre' is a comma-separated list of genres. We will take a closer look at this feature later.
- 'episodes' looks like an integer column, but its values are stored as strings.

We must fix both to prepare the inputs for our algorithm!

In [3]:
anime.describe()

Unnamed: 0,anime_id,rating,members
count,12294.0,12064.0,12294.0
mean,14058.221653,6.473902,18071.34
std,11455.294701,1.026746,54820.68
min,1.0,1.67,5.0
25%,3484.25,5.88,225.0
50%,10260.5,6.57,1550.0
75%,24794.5,7.18,9437.0
max,34527.0,10.0,1013917.0


### Mess with 'type' and 'episodes'

In [4]:
anime.type.value_counts()

TV         3787
OVA        3311
Movie      2348
Special    1676
ONA         659
Music       488
Name: type, dtype: int64

In [5]:
print('Anime with unknown episodes :',anime.loc[anime['episodes']=='Unknown']['episodes'].count())
print(anime.loc[anime['episodes']=='Unknown']['type'].value_counts())
anime.loc[anime['episodes']=='Unknown'][:10]

Anime with unknown episodes : 340
TV         209
OVA         50
ONA         46
Special      5
Movie        4
Music        1
Name: type, dtype: int64


Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
74,21,One Piece,"Action, Adventure, Comedy, Drama, Fantasy, Sho...",TV,Unknown,8.58,504862
252,235,Detective Conan,"Adventure, Comedy, Mystery, Police, Shounen",TV,Unknown,8.25,114702
615,1735,Naruto: Shippuuden,"Action, Comedy, Martial Arts, Shounen, Super P...",TV,Unknown,7.94,533578
991,966,Crayon Shin-chan,"Comedy, Ecchi, Kids, School, Shounen, Slice of...",TV,Unknown,7.73,26267
1021,33157,Tanaka-kun wa Itsumo Kedaruge Specials,"Comedy, School, Slice of Life",Special,Unknown,7.72,5400
1272,21639,Yu☆Gi☆Oh! Arc-V,"Action, Fantasy, Game, Shounen",TV,Unknown,7.61,17571
1309,8687,Doraemon (2005),"Comedy, Kids, Sci-Fi, Shounen",TV,Unknown,7.59,2980
1928,32410,Dimension W: W no Tobira Online,"Sci-Fi, Seinen",Special,Unknown,7.4,4799
1930,30694,Dragon Ball Super,"Action, Adventure, Comedy, Fantasy, Martial Ar...",TV,Unknown,7.4,111443
1993,32977,Aggressive Retsuko,"Comedy, Music",TV,Unknown,7.38,5465


Looking up the episode count for each of the 340 anime titles by hand would be time-consuming. However, we can fill in some of them: **'Special' and 'Movie'** entries can be assumed to have 1 episode, since a movie with 3, 6, 10 or more episodes is unlikely.

'Hentai' genre: Hentai anime are OVAs. OVA stands for Original Video Animation. These are usually either **stand-alone episodes** or series of roughly under 10 episodes. [Source](http://www.animenewsnetwork.com/encyclopedia/lexicon.php?id=37)

I will replace every unknown episode count in the Hentai genre with 1.


In [6]:
# Notice that we replace with a string at this time.
# replace episodes for Specials and Movies

anime.loc[(anime['type'] == 'Special') & (anime['episodes'] == 'Unknown'), 'episodes'] = '1'
anime.loc[(anime['type'] == 'Movie') & (anime['episodes'] == 'Unknown'), 'episodes'] = '1'

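# replace episodes for rows whose genre is exactly 'Hentai'
# (the equality test does not match multi-genre rows that merely include it)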

anime.loc[(anime['genre'] == 'Hentai') & (anime['episodes'] == 'Unknown'), 'episodes'] = '1'
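
A side note on the genre filter: comparing with `==` only matches rows whose whole genre string is 'Hentai'. If the intent were to catch multi-genre rows as well, a substring test is one option. The sketch below uses a made-up toy Series to show the difference; it is an alternative, not what the cell above does.

```python
import pandas as pd

# Toy genre column (made-up rows) with one multi-genre entry and one missing value.
genre = pd.Series(['Hentai', 'Comedy, Hentai', 'Drama', None])

exact = genre == 'Hentai'                          # whole-string match only
contains = genre.str.contains('Hentai', na=False)  # substring match, missing -> False

print(exact.tolist())
print(contains.tolist())
```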


In [7]:
anime.episodes.describe()

count     12294
unique      187
top           1
freq       5722
Name: episodes, dtype: object

Next, we must change the dtype of 'episodes': it has to be a numeric feature to be used as an input.

In [8]:
# convert the strings to numbers; 'Unknown' cannot be parsed and becomes NaN
anime['episodes'] = pd.to_numeric(anime['episodes'], errors='coerce')

anime.episodes.describe()

count    11999.000000
mean        12.339862
std         46.782557
min          1.000000
25%          1.000000
50%          2.000000
75%         12.000000
max       1818.000000
Name: episodes, dtype: float64
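
To see how the string-to-number conversion treats non-numeric entries, here is a small sketch on a toy Series (the values are made up). It assumes `pd.to_numeric` with `errors='coerce'`, which turns anything it cannot parse into NaN.

```python
import pandas as pd

# Toy stand-in for the 'episodes' column: numeric strings plus 'Unknown'.
episodes = pd.Series(['1', '64', 'Unknown', '12'])

# errors='coerce' turns anything unparseable into NaN instead of raising.
numeric = pd.to_numeric(episodes, errors='coerce')

print(numeric.dtype)         # a float dtype, because NaN needs a float column
print(numeric.isna().sum())  # number of entries that could not be parsed
```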

### Missing Values and how we can handle them.

Whenever we have to deal with NaNs, I suggest checking how large their share is compared to the total number of rows in the dataset.

In [9]:
print('Missing Values : \n', anime.isnull().sum())

# share of missing values per column, as a percentage of all rows
nan_percentages = round(anime.isnull().sum().sort_values(ascending=False)/len(anime)*100,2)
print('')
print('Percentages...\n')
for column, pct in nan_percentages.items():
    if pct > 0:
        print(column, pct,'%')


Missing Values : 
 anime_id      0
name          0
genre        62
type         25
episodes    295
rating      230
members       0
dtype: int64

Percentages...

episodes 2.4 %
rating 1.87 %
genre 0.5 %
type 0.2 %


We could look up the missing episode values and fill them in manually, or we could replace the NaNs with the median of episodes. However, since the affected share is small, I will simply drop all rows with missing values.
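
For reference, the median option could look like the sketch below. It runs on a made-up toy frame, not on the real data, and the column name `episodes_filled` is only there for illustration. The notebook itself drops the rows instead.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for 'anime' (values are made up).
toy = pd.DataFrame({'episodes': [1.0, 12.0, np.nan, 24.0, np.nan]})

median_episodes = toy['episodes'].median()   # NaNs are skipped by default
toy['episodes_filled'] = toy['episodes'].fillna(median_episodes)

print(median_episodes)
print(toy)
```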

In [10]:
# drop missing values

anime = anime.dropna()
anime.isnull().sum()


anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64

Replace each anime type with an integer code. Everything must be numeric for the nearest neighbors algorithm.

In [11]:
# replace 'type' with int 
anime = anime.replace({'type' : { 'TV': 1, 'OVA': 2, 'Movie' :3, 'Special' :4,
                                 'ONA' :5, 'Music' :6}})


In [12]:
anime.type[:10]

0    3
1    1
2    1
3    1
4    1
5    1
6    1
7    2
8    3
9    1
Name: type, dtype: int64
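
One thing to keep in mind with integer codes is that a distance-based model will treat TV (1) as closer to OVA (2) than to Music (6), although the categories have no natural order. A possible alternative, which this notebook does not use, is one indicator column per type. A minimal sketch on a made-up toy frame:

```python
import pandas as pd

# Toy 'type' column (made-up rows).
toy = pd.DataFrame({'type': ['TV', 'OVA', 'Movie', 'TV']})

# One indicator column per category instead of a single integer code.
type_dummies = pd.get_dummies(toy['type'], prefix='type', dtype=int)

print(type_dummies)
```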

### Time to mess with some dummies

If we want to use the 'genre' attribute, we need to turn it into a cleaner representation. Instead of replacing genres with numbers manually, we will create dummy (indicator) variables. Let's take a look at the genre column.

In [13]:
anime.genre[:5]

0                 Drama, Romance, School, Supernatural
1    Action, Adventure, Drama, Fantasy, Magic, Mili...
2    Action, Comedy, Historical, Parody, Samurai, S...
3                                     Sci-Fi, Thriller
4    Action, Comedy, Historical, Parody, Samurai, S...
Name: genre, dtype: object

An anime can have more than one genre, and we don't know the total number of genres in advance. So we must split the genre string in every row and give each individual genre its own column.
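
How the split works is easier to see on a tiny example first. The sketch below uses a made-up Series of three genre strings and the same `str.get_dummies` call with `sep=', '`.

```python
import pandas as pd

# Toy genre strings (made up), separated by ', ' like the real column.
genres = pd.Series(['Drama, Romance', 'Action, Drama', 'Sci-Fi'])

# One 0/1 column per distinct genre; columns come out sorted alphabetically.
dummies = genres.str.get_dummies(sep=', ')
print(dummies)
```

Each row gets a 1 in the column of every genre it lists and a 0 everywhere else.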

In [14]:
genre_dummies = anime['genre'].str.get_dummies(sep=', ')
genre_dummies[:5]

Unnamed: 0,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,Game,...,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,1,1,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


We got 43 genres. Let's take a look at them.

In [15]:
genre_dummies.columns

Index(['Action', 'Adventure', 'Cars', 'Comedy', 'Dementia', 'Demons', 'Drama',
       'Ecchi', 'Fantasy', 'Game', 'Harem', 'Hentai', 'Historical', 'Horror',
       'Josei', 'Kids', 'Magic', 'Martial Arts', 'Mecha', 'Military', 'Music',
       'Mystery', 'Parody', 'Police', 'Psychological', 'Romance', 'Samurai',
       'School', 'Sci-Fi', 'Seinen', 'Shoujo', 'Shoujo Ai', 'Shounen',
       'Shounen Ai', 'Slice of Life', 'Space', 'Sports', 'Super Power',
       'Supernatural', 'Thriller', 'Vampire', 'Yaoi', 'Yuri'],
      dtype='object')

#### Final Dataset

In [16]:
# new dataset with dummies

final_df = pd.concat([anime, genre_dummies], axis=1)
final_df.head(2)

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members,Action,Adventure,Cars,...,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",3,1.0,9.37,200630,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",1,64.0,9.26,793665,1,1,0,...,0,0,0,0,0,0,0,0,0,0


### Build the model

We will follow these 3 simple steps, after which we can start searching for recommendations.
- Select the features we want to use as input. Anime ID, name and the original genre column are no longer necessary.
- Scale all the features with MaxAbsScaler. 
    - "*This estimator scales and translates each feature individually such that the maximal absolute value of each feature in the training set will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.*"
- Select the number of neighbors and run the model.
    - **We will choose 6 neighbors as a parameter**. The first one is our 'search', i.e. the query anime itself. The next 5 neighbors are the nearest ones and we select them as recommendations.
    
That's all ... let's go!
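
Before running it on the real features, here is what the scaling step does on a toy matrix. The numbers are made up (the columns loosely mimic episodes, members and one genre flag), and the by-hand version is only there to check my understanding of MaxAbsScaler.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Toy feature matrix: columns mimic 'episodes', 'members' and one genre flag.
X = np.array([
    [1.0,  200.0, 1.0],
    [64.0, 800.0, 0.0],
    [12.0, 100.0, 1.0],
])

scaled = MaxAbsScaler().fit_transform(X)

# The same result by hand: divide each column by its largest absolute value.
by_hand = X / np.abs(X).max(axis=0)

print(scaled)
print(np.allclose(scaled, by_hand))
```

After scaling, every column tops out at 1, so a large-valued feature like members no longer dominates the distance purely because of its units.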

In [17]:
# select features

features = final_df.drop(['anime_id','name','genre'], axis=1)

In [18]:
# scale features

from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()

scaled_feats = scaler.fit_transform(features)

In [19]:
# nearest neighbors, this may take a few seconds

from sklearn.neighbors import  NearestNeighbors

nbrs = NearestNeighbors(n_neighbors = 6, algorithm = 'ball_tree').fit(scaled_feats)

distances, indices = nbrs.kneighbors(scaled_feats)
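
To see why we ask for one neighbor more than we want to recommend, here is a sketch on four made-up points. The `toy_` names are mine, chosen so they do not overwrite the real `distances` and `indices`.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy scaled features (made up): four items in a 2-D feature space.
toy_X = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [1.0, 1.0],
    [0.9, 1.0],
])

# Ask for 2 neighbors so that 1 is left after dropping the query itself.
toy_nn = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(toy_X)
toy_distances, toy_indices = toy_nn.kneighbors(toy_X)

print(toy_indices)     # first column: each point's own row position
print(toy_distances)   # first column: distance 0 to itself
```

When the query rows are the same rows the model was fitted on, each point comes back as its own nearest neighbor at distance 0, which is why the recommendation function skips the first entry.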

In [20]:
# build a recommendation function

# map the integer codes back to the original type labels
TYPE_NAMES = {1: 'TV', 2: 'OVA', 3: 'Movie', 4: 'Special', 5: 'ONA', 6: 'Music'}

def find_movies(movie):
    # 'movie' is a row position in 'anime' (not an anime_id)
    print('If you liked ', anime.iloc[movie]['name'],', we also recommend :')
    # skip the first neighbor, which is the query anime itself
    for m in indices[movie][1:]:
        row = anime.iloc[m]
        print(row['name'],'\nCategory: ', row['genre'],
          '\nRating',row['rating'])
        print(TYPE_NAMES.get(row['type'], 'Music'))
        print('# # #')
    return(movie)

In [21]:
# find recommendations for an anime by its row position

find_movies(367)

If you liked  Akagami no Shirayuki-hime 2nd Season , we also recommend :
Akagami no Shirayuki-hime 
Category:  Drama, Fantasy, Romance, Shoujo 
Rating 7.93
TV
# # #
Akagami no Shirayuki-hime: Nandemonai Takaramono, Kono Page 
Category:  Drama, Fantasy, Romance, Shoujo 
Rating 7.77
OVA
# # #
Hanasakeru Seishounen 
Category:  Drama, Romance, Shoujo 
Rating 7.9
TV
# # #
Glass no Kamen 
Category:  Drama, Romance, Shoujo 
Rating 7.52
TV
# # #
Koisuru Tenshi Angelique: Kagayaki no Ashita 
Category:  Drama, Fantasy, Harem, Romance, Shoujo 
Rating 7.19
TV
# # #


367
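
Since the function takes a row position rather than a title, a small lookup helper can be handy. The sketch below is hypothetical: `position_of` is a name I made up, and it is shown on a toy frame whose non-contiguous index imitates what dropping rows leaves behind.

```python
import pandas as pd

# Toy stand-in for the cleaned 'anime' frame; note the gaps in the index.
toy_anime = pd.DataFrame(
    {'name': ['Kimi no Na wa.', 'Gintama°', 'Steins;Gate']},
    index=[0, 2, 5],
)

def position_of(name, frame):
    """Hypothetical helper: row position (for .iloc) of an exact title match."""
    matches = (frame['name'] == name).to_numpy().nonzero()[0]
    if len(matches) == 0:
        raise KeyError(f'No anime named {name!r}')
    return int(matches[0])

print(position_of('Gintama°', toy_anime))
```

With the real data, something like `find_movies(position_of('Some Title', anime))` would then work, assuming the title matches the 'name' column exactly.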

Well, I think that our model works quite well. The five nearest anime to our search are pretty close in rating, genre and type. The first two recommendations are entries from the same series as the anime we've watched. We could choose more neighbors, but I think that would get more confusing for anime lovers. As I said at the beginning, this is a non-personalized recommendation system. Personalization is left for the next collaborative filtering projects!