### Anime Dataset
anime_id :	myanimelist.net's unique id identifying an anime.<br>
name    :	full name of anime.<br>
genre	:   comma separated list of genres for this anime.<br>
type	:   movie, TV, OVA, etc.<br>
episodes:	how many episodes in this show. (1 if movie).<br>
rating  :	average rating out of 10 for this anime.<br>
members :	number of community members that are in this anime's "group".<br>

### Rating Dataset
user_id :	non identifiable randomly generated user id.<br>
anime_id:	the anime that this user has rated.<br>
rating  :	rating out of 10 this user has assigned (-1 if the user watched without assigning)<br>

In [2]:
#IMPORTING LIBRARIES
import os
import numpy as np
import pandas as pd
import warnings
import scipy as sp
import scipy.sparse  # make sure the sparse submodule is loaded for sp.sparse

from sklearn.metrics.pairwise import cosine_similarity

# warning handling: suppress all warnings
warnings.filterwarnings('ignore')

## PRE-PROCESSING THE DATA

In [3]:
rating_df = pd.read_csv('rating_anime.csv')
rating_df.head()

Unnamed: 0,user_id,anime_id,rating
0,1,20,-1
1,1,24,-1
2,1,79,-1
3,1,226,-1
4,1,241,-1


In [4]:
anime_df = pd.read_csv('anime.csv')
anime_df.head(10)

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266
5,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,"Comedy, Drama, School, Shounen, Sports",TV,10,9.15,93351
6,11061,Hunter x Hunter (2011),"Action, Adventure, Shounen, Super Power",TV,148,9.13,425855
7,820,Ginga Eiyuu Densetsu,"Drama, Military, Sci-Fi, Space",OVA,110,9.11,80679
8,15335,Gintama Movie: Kanketsu-hen - Yorozuya yo Eien...,"Action, Comedy, Historical, Parody, Samurai, S...",Movie,1,9.1,72534
9,15417,Gintama&#039;: Enchousen,"Action, Comedy, Historical, Parody, Samurai, S...",TV,13,9.11,81109


In [5]:
print(f'anime dataset :{anime_df.shape}')
print(f'rating dataset :{rating_df.shape}')

anime dataset :(12294, 7)
rating dataset :(7813737, 3)


In [6]:
anime_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB
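
Note that `episodes` is reported as `object` rather than a numeric dtype, which suggests that at least some entries are not plain numbers. A minimal sketch of one way to convert such a column is shown here on toy data; the `"Unknown"` entry is an assumed placeholder, not a value confirmed by the output above.

```python
import pandas as pd

# Toy stand-in for anime_df['episodes']; "Unknown" is a hypothetical
# non-numeric entry used only for illustration.
episodes = pd.Series(["1", "64", "Unknown", "24"], name="episodes")

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising an exception.
episodes_num = pd.to_numeric(episodes, errors="coerce")

print(episodes_num.dtype)
print(int(episodes_num.isna().sum()), "entries could not be parsed")
```

The notebook itself never uses `episodes` in the model, so this conversion is optional.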


In [7]:
rating_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7813737 entries, 0 to 7813736
Data columns (total 3 columns):
 #   Column    Dtype
---  ------    -----
 0   user_id   int64
 1   anime_id  int64
 2   rating    int64
dtypes: int64(3)
memory usage: 178.8 MB


<b>Count the null values in each column of the anime dataset</b>

In [8]:
anime_df.isnull().sum()

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

### Handling missing values

In [9]:
print('Anime missing values in %')
round(anime_df.isnull().sum().sort_values(ascending=False)/len(anime_df.index),4)*100

Anime missing values in %


rating      1.87
genre       0.50
type        0.20
members     0.00
episodes    0.00
name        0.00
anime_id    0.00
dtype: float64
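
The same percentages can also be obtained from the mean of the boolean null mask, since the mean of a True/False column is the share of True values. The sketch below uses a small made-up frame (`toy` is a hypothetical name, not part of the notebook).

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing entries, for illustration only.
toy = pd.DataFrame({
    "genre": ["Action", None, "Drama", "Comedy"],
    "rating": [9.1, np.nan, np.nan, 7.5],
    "members": [10, 20, 30, 40],
})

# isnull() gives a boolean mask; its column-wise mean is the missing share.
missing_pct = (toy.isnull().mean() * 100).round(2).sort_values(ascending=False)
print(missing_pct)
```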

<b>Here, we replace the missing values in the categorical columns (genre and type) with the mode of each column</b>

In [10]:
print(anime_df['genre'].mode())
print(anime_df['type'].mode())

0    Hentai
dtype: object
0    TV
dtype: object


<b>This means the most frequently occurring genre value is Hentai and the most frequently occurring type is TV</b>
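
Because `genre` holds a comma-separated list, the mode above is taken over whole strings, so a single-tag entry can win even when another tag appears more often across all lists. The following sketch, on made-up data with neutral tag names, shows how the two counts can differ; it is an illustration, not a change to the notebook's approach.

```python
import pandas as pd

# Made-up genre strings in the same comma-separated style as the dataset.
genre = pd.Series([
    "Action, Comedy",
    "Comedy, Drama",
    "Music",
    "Music",
    "Comedy",
])

# Mode over whole strings, as in the cell above.
whole_string_mode = genre.mode().iloc[0]

# Split each list into individual tags (one row per tag), then count them.
tag_counts = genre.str.split(", ").explode().value_counts()
most_common_tag = tag_counts.idxmax()

print("mode of whole strings:", whole_string_mode)
print("most common single tag:", most_common_tag)
```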

In [11]:
# Fill the null values with the mode of each column
anime_df['genre'] = anime_df['genre'].fillna(anime_df['genre'].mode().values[0])
anime_df['type'] = anime_df['type'].fillna(anime_df['type'].mode().values[0])
anime_df.isnull().sum()

anime_id      0
name          0
genre         0
type          0
episodes      0
rating      230
members       0
dtype: int64

<b>The rating column still contains null values. We handle this by dropping the anime that have no rating</b>

In [12]:
anime_df = anime_df.dropna(subset=['rating'])
print(anime_df.isnull().sum())
print('*'*50)

anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64
**************************************************


In [13]:
anime_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12064 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12064 non-null  int64  
 1   name      12064 non-null  object 
 2   genre     12064 non-null  object 
 3   type      12064 non-null  object 
 4   episodes  12064 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12064 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 754.0+ KB


#### REPLACING -1 RATINGS WITH NaN

A rating of -1 in the rating dataframe means that the user watched the anime without assigning a rating, so these values are replaced with NaN.

In [14]:
# Vectorized replacement of the -1 placeholder with NaN
rating_df['rating'] = rating_df['rating'].replace(-1, np.nan)
rating_df.head(20)

Unnamed: 0,user_id,anime_id,rating
0,1,20,
1,1,24,
2,1,79,
3,1,226,
4,1,241,
5,1,355,
6,1,356,
7,1,442,
8,1,487,
9,1,846,


In [15]:
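# Keep TV series only
anime_df = anime_df[anime_df['type']=='TV']

# Join user ratings with anime metadata. With these suffixes the user's own score
# becomes 'rating_user', while 'rating' is the anime's average rating from anime_df
# (compare the decimal values in the output below).
rated_anime = rating_df.merge(anime_df, on='anime_id', suffixes=['_user', ''])
rated_anime = rated_anime[['user_id', 'name', 'rating']]
# Restrict to users with user_id <= 7500
rated_anime_7500 = rated_anime[rated_anime.user_id <= 7500]
rated_anime_7500.tail(50)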

Unnamed: 0,user_id,name,rating
5280506,6771,Psychoarmor Govarian,6.69
5280509,6771,Video Senshi Lezarion,6.72
5280518,6771,Genji Tsuushin Agedama,6.58
5280522,6771,Chou Dendou Robo Tetsujin 28-gou FX,6.53
5280527,6771,Tanken Driland: 1000-nen no Mahou,6.32
5280533,6771,Sekai Meisaku Douwa Series,6.71
5280534,6773,Submarine Super 99,6.41
5280535,7249,Submarine Super 99,6.41
5280555,6817,Idol Densetsu Eriko,6.9
5280594,6817,Yume no Crayon Oukoku,7.12


<b>Pivot Table for similarity</b><br>
We will create a pivot table with users as rows and TV show names as columns. This table is then used to calculate the similarity between shows.

In [16]:
pivot = rated_anime_7500.pivot_table(index=['user_id'], columns=['name'], values='rating')
pivot.head()

name,.hack//Roots,.hack//Sign,.hack//Tasogare no Udewa Densetsu,009-1,07-Ghost,11eyes,12-sai.: Chicchana Mune no Tokimeki,3 Choume no Tama: Uchi no Tama Shirimasenka?,30-sai no Hoken Taiiku,91 Days,...,"Zone of the Enders: Dolores, I",Zukkoke Knight: Don De La Mancha,ef: A Tale of Melodies.,ef: A Tale of Memories.,gdgd Fairies,gdgd Fairies 2,iDOLM@STER Xenoglossia,s.CRY.ed,xxxHOLiC,xxxHOLiC Kei
user_id,,,,,,,,,,,,,,,,,,,,,
1,,,,,,6.49,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,8.11,
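
To make the reshaping concrete, here is a minimal sketch of the same `pivot_table` call on a four-row made-up frame (`long_df` and `wide` are hypothetical names). Each (user, show) pair that has no row in the long table becomes NaN in the wide one, which is why the real pivot above is almost entirely empty.

```python
import pandas as pd

# Long format: one row per (user, show) pair.
long_df = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "name": ["A", "B", "A", "C"],
    "rating": [8.0, 6.0, 7.0, 9.0],
})

# Wide format: users as rows, show names as columns.
# pivot_table averages duplicates by default (aggfunc="mean").
wide = long_df.pivot_table(index="user_id", columns="name", values="rating")

print(wide)
print("empty cells:", int(wide.isna().sum().sum()))
```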


<b>Now we will engineer our pivot table in the following steps:</b><br>
1. Normalize the values.<br>
2. Fill NaN values with 0.<br>
3. Transpose the pivot table for the next step.<br>
4. Drop the columns whose values are all 0 (unrated).<br>
5. Use the scipy package to convert to sparse matrix format for the similarity computation.

In [17]:
# step 1: normalize each user's row by subtracting its mean and dividing by its range
pivot_n = pivot.apply(lambda x: (x-np.mean(x))/(np.max(x)-np.min(x)), axis=1)

# step 2
pivot_n.fillna(0, inplace=True)

# step 3
pivot_n = pivot_n.T

# step 4
pivot_n = pivot_n.loc[:, (pivot_n != 0).any(axis=0)]

# step 5
piv_sparse = sp.sparse.csr_matrix(pivot_n.values)
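
A small worked example may help to follow steps 1 to 4. The sketch below applies the same operations to a two-user made-up pivot (the helper name `mean_range_normalize` is hypothetical). User 1 has ratings 8, 6 and 10, so the mean is 8 and the range is 4, giving 0, -0.5 and 0.5. User 2 has a single rating, so the range is 0, the division yields NaN, and the whole row becomes 0 after filling.

```python
import numpy as np
import pandas as pd

# Two users, four shows; NaN marks "not rated".
pivot = pd.DataFrame(
    {"A": [8.0, np.nan], "B": [6.0, 7.0], "C": [np.nan, np.nan], "D": [10.0, np.nan]},
    index=pd.Index([1, 2], name="user_id"),
)

def mean_range_normalize(row):
    # (x - mean) / (max - min); pandas skips NaN in mean, max and min.
    return (row - row.mean()) / (row.max() - row.min())

# Steps 1 and 2: normalize each user's row, then fill NaN with 0.
pivot_n = pivot.apply(mean_range_normalize, axis=1).fillna(0)

# Step 3: transpose so that shows are rows and users are columns.
item_user = pivot_n.T

# Step 4: drop users whose column is entirely 0.
item_user = item_user.loc[:, (item_user != 0).any(axis=0)]

print(item_user)
```

In this toy case user 2 is dropped in step 4 because a single rating carries no information once it is centered on its own mean.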

## Cosine Similarity Model

In [19]:
#model based on anime similarity
anime_similarity = cosine_similarity(piv_sparse)

#Df of anime similarities
ani_sim_df = pd.DataFrame(anime_similarity, index = pivot_n.index, columns = pivot_n.index)
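
Cosine similarity between two rows a and b is their dot product divided by the product of their norms, so it ranges from -1 (opposite) to 1 (same direction). The sketch below checks this on a made-up item-by-user matrix and compares one entry against the formula computed by hand; the array `items` is illustrative only.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Rows are items, columns are users (already normalized, zeros = unrated).
items = np.array([
    [0.5, -0.5, 0.0],   # item 0
    [0.5, -0.5, 0.0],   # item 1: same pattern as item 0
    [-0.5, 0.5, 0.0],   # item 2: opposite pattern
    [0.0, 0.0, 1.0],    # item 3: rated by a different user only
])

# cosine_similarity accepts sparse input and returns a dense array by default.
sim = cosine_similarity(sparse.csr_matrix(items))

# Manual check for items 0 and 3: dot product over the product of norms.
a, b = items[0], items[3]
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(np.round(sim, 2))
```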

In [20]:
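def anime_recommendation(ani_name):
    """Print the five titles most similar to ani_name."""
    print('Recommended because you watched {}:\n'.format(ani_name))
    # Sort by similarity to ani_name; the first entry is normally the title itself, so skip it
    top_matches = ani_sim_df.sort_values(by = ani_name, ascending = False).index[1:6]
    for number, anime in enumerate(top_matches, start=1):
        print(f'#{number}: {anime}, {round(ani_sim_df.loc[ani_name, anime]*100,2)}% match')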

In [30]:
input_anime = input('')
anime_recommendation(input_anime)

Usagi Drop
Recommended because you watched Usagi Drop:

#1: Hanasaku Iroha, 41.09% match
#2: Ano Hi Mita Hana no Namae wo Bokutachi wa Mada Shiranai., 40.17% match
#3: Chihayafuru, 37.89% match
#4: Barakamon, 36.89% match
#5: Bakemonogatari, 36.23% match
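
The function above prints its results and raises an unhelpful error if the title is not in the matrix. As a possible variation, the sketch below returns the scores as a Series and checks the title first. The name `top_similar` and the toy matrix are hypothetical, and the sketch assumes that titles in the index are unique.

```python
import pandas as pd

def top_similar(sim_df, title, n=5):
    """Return the n titles most similar to `title` as a Series of scores."""
    if title not in sim_df.columns:
        raise KeyError(f"{title!r} is not in the similarity matrix")
    # Drop the title's similarity with itself, then take the n largest scores.
    scores = sim_df[title].drop(labels=title)
    return scores.nlargest(n)

# Toy symmetric similarity matrix for three titles.
toy_sim = pd.DataFrame(
    [[1.0, 0.8, 0.1], [0.8, 1.0, 0.3], [0.1, 0.3, 1.0]],
    index=["A", "B", "C"],
    columns=["A", "B", "C"],
)

result = top_similar(toy_sim, "A", n=2)
print(result)
```

Dropping the title explicitly avoids relying on it being sorted into first place, which could fail if another title had an equal score.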
