# Content-based Recommender

## Import Libraries

In [1]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

## Load Data

Download the data from: https://www.kaggle.com/CooperUnion/anime-recommendations-database

<font color="red">Load in the `anime.csv` data.</font>

In [2]:
df = pd.read_csv("anime.csv")

In [3]:
df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [4]:
df.columns

Index(['anime_id', 'name', 'genre', 'type', 'episodes', 'rating', 'members'], dtype='object')

## Preprocessing

### Let's first make this based only on genres.

In [5]:
df_genres = df[['anime_id','name','genre']]

In [6]:
df_genres.head()

Unnamed: 0,anime_id,name,genre
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural"
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili..."
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S..."
3,9253,Steins;Gate,"Sci-Fi, Thriller"
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S..."


In [7]:
df_genres.shape

(12294, 3)

### Drop nulls

In [8]:
df_genres.isnull().sum()

anime_id     0
name         0
genre       62
dtype: int64

In [9]:
df_genres = df_genres.dropna()

In [10]:
df_genres.shape

(12232, 3)

In [11]:
df_genres.head()

Unnamed: 0,anime_id,name,genre
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural"
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili..."
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S..."
3,9253,Steins;Gate,"Sci-Fi, Thriller"
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S..."


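Remove all spaces from the `genre` column, so that `Drama, Romance` becomes `Drama,Romance`. Note that this also joins multi-word genres: `Slice of Life` becomes `SliceofLife`.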

In [12]:
df_genres['genre'] = df_genres['genre'].str.replace(' ', '')

In [13]:
df_genres.head()

Unnamed: 0,anime_id,name,genre
0,32281,Kimi no Na wa.,"Drama,Romance,School,Supernatural"
1,5114,Fullmetal Alchemist: Brotherhood,"Action,Adventure,Drama,Fantasy,Magic,Military,..."
2,28977,Gintama°,"Action,Comedy,Historical,Parody,Samurai,Sci-Fi..."
3,9253,Steins;Gate,"Sci-Fi,Thriller"
4,9969,Gintama&#039;,"Action,Comedy,Historical,Parody,Samurai,Sci-Fi..."


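Without removing the spaces (above), encoding (below) goes wrong: splitting on `,` alone leaves a leading space on every genre after the first, so ` Romance` and `Romance` would be encoded as two different columns.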

In [14]:
df_enc = df_genres.join(df_genres['genre'].str.get_dummies(sep=',')).drop('genre', axis=1)

In [15]:
df_enc.head()

Unnamed: 0,anime_id,name,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,...,ShounenAi,SliceofLife,Space,Sports,SuperPower,Supernatural,Thriller,Vampire,Yaoi,Yuri
0,32281,Kimi no Na wa.,0,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
1,5114,Fullmetal Alchemist: Brotherhood,1,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,28977,Gintama°,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9253,Steins;Gate,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,9969,Gintama&#039;,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
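
The whitespace issue can be reproduced on a toy Series. This is a minimal sketch, not part of the original notebook: the two genre strings are made up, and the variable names (`with_spaces`, `no_spaces`, `comma_space`) are only for illustration. It also shows one possible alternative, splitting on `", "` instead of removing spaces, which keeps multi-word genre names intact.

```python
import pandas as pd

# Toy data (not taken from anime.csv) using the same ", " separator style.
genres = pd.Series(["Drama, Romance", "Romance, Slice of Life"])

# Splitting on "," alone keeps the leading space, so " Romance" and "Romance"
# end up as two different columns.
with_spaces = genres.str.get_dummies(sep=",")

# Removing every space first gives one column per genre, but also turns
# "Slice of Life" into "SliceofLife".
no_spaces = genres.str.replace(" ", "").str.get_dummies(sep=",")

# Alternative: split on ", " so multi-word genre names stay intact.
comma_space = genres.str.get_dummies(sep=", ")

print(list(with_spaces.columns))
print(list(no_spaces.columns))
print(list(comma_space.columns))
```

Either cleaned-up encoding gives one column per genre. Removing spaces is what this notebook does, so the rest of the walkthrough uses column names such as `SliceofLife`.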


## Content-based Filtering

### Calculate Cosine Similarities

In [16]:
a = df_enc.drop(['anime_id','name'], axis = 1)

We can calculate cosine similarity using the `sklearn` library.  
Notice the 1's along the diagonal: each anime is identical to itself.

In [17]:
cos_sim = cosine_similarity(a, a)

In [18]:
print(cos_sim)

[[1.         0.18898224 0.         ... 0.         0.         0.        ]
 [0.18898224 1.         0.28571429 ... 0.         0.         0.        ]
 [0.         0.28571429 1.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 1.         1.         1.        ]
 [0.         0.         0.         ... 1.         1.         1.        ]
 [0.         0.         0.         ... 1.         1.         1.        ]]
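
The off-diagonal values can be sanity-checked by hand. For 0/1 genre vectors, cosine similarity reduces to the number of shared genres divided by the square root of the product of the two genre counts. The sketch below works on sets of genre names rather than on `df_enc`. The helper `binary_cosine` and the placeholder set `toy_seven` are hypothetical and not part of the notebook.

```python
import math

def binary_cosine(genres_a, genres_b):
    """Cosine similarity of two 0/1 genre vectors, given as sets of genre names."""
    if not genres_a or not genres_b:
        return 0.0
    shared = len(genres_a & genres_b)  # dot product of two 0/1 vectors
    return shared / math.sqrt(len(genres_a) * len(genres_b))

# Genres of the first row, as shown in the table above.
kimi = {"Drama", "Romance", "School", "Supernatural"}

# Hypothetical title with seven genres that shares only "Drama" with the first row.
toy_seven = {"Drama", "G1", "G2", "G3", "G4", "G5", "G6"}

score = binary_cosine(kimi, toy_seven)  # 1 / sqrt(4 * 7)
print(round(score, 8))
```

A four-genre title and a seven-genre title sharing exactly one genre score 1/√28 ≈ 0.189. That agrees with the 0.18898224 printed for the first two rows, assuming the second row's truncated genre list has seven entries.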


### Save top N most similar items per anime

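In `results`, store the most similar items for each anime as `(score, anime_id)` pairs, sorted from most to least similar.  
The slice `[:-100:-1]` keeps the 99 highest-scoring entries, and dropping the first one (normally the anime itself) leaves up to 98 recommendations per anime.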

In [19]:
df_genres = df_genres.reset_index(drop=True)
df_genres

Unnamed: 0,anime_id,name,genre
0,32281,Kimi no Na wa.,"Drama,Romance,School,Supernatural"
1,5114,Fullmetal Alchemist: Brotherhood,"Action,Adventure,Drama,Fantasy,Magic,Military,..."
2,28977,Gintama°,"Action,Comedy,Historical,Parody,Samurai,Sci-Fi..."
3,9253,Steins;Gate,"Sci-Fi,Thriller"
4,9969,Gintama&#039;,"Action,Comedy,Historical,Parody,Samurai,Sci-Fi..."
...,...,...,...
12227,9316,Toushindai My Lover: Minami tai Mecha-Minami,Hentai
12228,5543,Under World,Hentai
12229,5621,Violence Gekiga David no Hoshi,Hentai
12230,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai


In [29]:
results = {}
for idx, row in df_genres.iterrows():
    # Indices of the 99 highest-scoring anime, best first
    # (the slice walks the ascending argsort backwards from the end).
    similar_indices = cos_sim[idx].argsort()[:-100:-1]
    # Build (score, anime_id) pairs; .iloc[[i]].tolist()[0] pulls out a plain int
    # instead of a one-element Series.
    similar_items = [(cos_sim[idx][i], df_genres['anime_id'].iloc[[i]].tolist()[0])
                     for i in similar_indices]

    # Drop the first entry, which is normally the anime itself (score 1.0).
    results[row['anime_id']] = similar_items[1:]

{32281: [(1.0, 547),
  (1.0, 546),
  (0.8944271909999159, 14669),
  (0.8660254037844388, 10067),
  (0.8660254037844388, 6351),
  (0.8660254037844388, 1624),
  (0.8660254037844388, 713),
  (0.8660254037844388, 17585),
  (0.8660254037844388, 1607),
  (0.8660254037844388, 18053),
  (0.8660254037844388, 2926),
  (0.8660254037844388, 28725),
  (0.8660254037844388, 2129),
  (0.8660254037844388, 8481),
  (0.8660254037844388, 2105),
  (0.8660254037844388, 18045),
  (0.8660254037844388, 20903),
  (0.8660254037844388, 1039),
  (0.8660254037844388, 12175),
  (0.8660254037844388, 756),
  (0.8660254037844388, 9988),
  (0.8660254037844388, 2179),
  (0.8660254037844388, 2927),
  (0.8164965809277261, 18195),
  (0.8164965809277261, 11887),
  (0.8164965809277261, 2787),
  (0.8164965809277261, 16001),
  (0.8164965809277261, 2167),
  (0.8164965809277261, 6572),
  (0.8164965809277261, 355),
  (0.8164965809277261, 20517),
  (0.7559289460184544, 1019),
  (0.75, 30585),
  (0.75, 9014),
  (0.75, 2476),
  (0.75
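
One caveat with dropping the first entry by position: when other anime tie with the query at a score of 1.0 (as the two `1.0` entries above do), `argsort` does not guarantee that the query itself comes first. The sketch below uses a made-up similarity row to show the slice length and one possible way to exclude the query by index instead. The names `sim_row`, `query_idx` and `top_n` are illustrative and do not appear in the notebook.

```python
import numpy as np

# The slice [:-100:-1] walks backwards from the end and stops before the
# 100th-from-last element, so it keeps 99 items.
n_kept = len(np.arange(200)[:-100:-1])

# Toy similarity row for item 0: items 1 and 2 tie with item 0 itself at 1.0.
sim_row = np.array([1.0, 1.0, 1.0, 0.5, 0.0])
query_idx = 0

order = sim_row.argsort()[::-1]      # indices from most to least similar
order = order[order != query_idx]    # exclude the query by index, not by position
top_n = order[:3]                    # the three most similar other items

print(n_kept, top_n)
```

With this variant the query can never appear in its own recommendations, regardless of how ties are ordered.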

### A small helper function

In [21]:
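def get_name(a_id):
    """Return the name of the anime with the given `anime_id`."""
    # split(' - ')[0] keeps only the part of the title before any ' - ' suffix
    return df_genres[df_genres['anime_id'] == a_id]['name'].tolist()[0].split(' - ')[0]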

In [59]:
get_name(32281)

'Kimi no Na wa.'

### Get the top-N recommendations from our results

Here, we enter the `anime_id` of an anime of interest, and the function prints the top N most similar anime (based on genre) as the recommendations.

In [60]:
def recommend(item_id, N):
    print(f"Recommending {N} anime similar to {get_name(item_id)} ...")
    print("---------------------")
    
    recs = results[item_id][:N]
    for rec in recs:
        print(f"\tRecommended with a score {rec[0]}:\t{get_name(rec[1])} ")

In [61]:
recommend(32281, 5)

Recommending 5 anime similar to Kimi no Na wa. ...
---------------------
	Recommended with a score 1.0:	Wind: A Breath of Heart OVA 
	Recommended with a score 1.0:	Wind: A Breath of Heart (TV) 
	Recommended with a score 0.8944271909999159:	Aura: Maryuuin Kouga Saigo no Tatakai 
	Recommended with a score 0.8660254037844388:	Angel Beats!: Another Epilogue 
	Recommended with a score 0.8660254037844388:	Clannad: After Story 


<font color="red">1. Feel free to go back through and try this with a different dataset (such as the [MovieLens 100k](https://grouplens.org/datasets/movielens/) dataset).</font>  
  
<font color="red">2. To begin, we only used genres. Could any other features be valuable here?</font>

<font color="red">3. As another exercise, go back to the `cosine_similarity` section. Try writing your own function to calculate the cosine similarity and check it using the one from sklearn.</font>

<font color="red">4. Currently, this only returns the top N results based on an entered anime. How could we start to utilize the `rating.csv` data to recommend the top N anime to a given *user* based on their likes?</font>

<font color="red">Start by writing out the steps involved (or pseudocode):</font>
1. ...  
2. ...  
3. ... 