# Assignment 11 : Recommendation System

## Data Description :

anime_id : Unique ID for each anime.

name : Title of the anime.

genre : Comma-separated list of genres for the anime.

type : Format of the anime (Movie, TV, OVA, Special, Music, ONA).

episodes : Number of episodes in the anime.

rating : Average rating of the anime on a scale of 10.

members : Number of community members associated with the anime.

## Objective :

To implement a recommendation system using cosine similarity on an anime dataset.


## Task 1 : Data Preprocessing



In [1]:
# Import necessary libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [2]:
# load the dataset
df = pd.read_csv("anime.csv")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


In [4]:
# Handle missing values
# Impute missing genre values with the placeholder string "missing"
df.genre.fillna("missing",inplace=True)


In [5]:
df.genre.info()

<class 'pandas.core.series.Series'>
RangeIndex: 12294 entries, 0 to 12293
Series name: genre
Non-Null Count  Dtype 
--------------  ----- 
12294 non-null  object
dtypes: object(1)
memory usage: 96.2+ KB


In [6]:
# handling the type feature
df.type.mode()

0    TV
Name: type, dtype: object

In [7]:
df.type.fillna(df.type.mode()[0],inplace=True)

In [8]:
df.type.info()

<class 'pandas.core.series.Series'>
RangeIndex: 12294 entries, 0 to 12293
Series name: type
Non-Null Count  Dtype 
--------------  ----- 
12294 non-null  object
dtypes: object(1)
memory usage: 96.2+ KB


In [9]:
# Fill missing values in the rating column with the column mean
df.rating.fillna(df.rating.mean(),inplace=True)
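
The `fillna(..., inplace=True)` calls in this notebook act on a column reached through attribute access (`df.genre`, `df.type`, `df.rating`). Recent pandas versions warn about this pattern, and it may not update `df` at all when copy-on-write is enabled. A minimal sketch of the same imputation using plain assignment, on a small made-up frame (the toy values are assumptions, not rows from `anime.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the anime data (values are illustrative only).
toy = pd.DataFrame({
    "genre": ["Action, Comedy", None, "Drama"],
    "type": ["TV", "TV", None],
    "rating": [8.0, np.nan, 6.0],
})

# Assign the filled column back instead of calling fillna(inplace=True).
toy["genre"] = toy["genre"].fillna("missing")            # placeholder string
toy["type"] = toy["type"].fillna(toy["type"].mode()[0])  # most frequent value
toy["rating"] = toy["rating"].fillna(toy["rating"].mean())

print(toy)
```

The result is the same as in the cells above; only the way the filled column is written back differs.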

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12294 non-null  object 
 3   type      12294 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12294 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


In [11]:
# episodes holds numeric data but is stored as object, so convert it to a numeric dtype (entries that cannot be parsed become NaN).
df.episodes = pd.to_numeric(df.episodes,errors="coerce")

In [12]:
df.episodes.info()

<class 'pandas.core.series.Series'>
RangeIndex: 12294 entries, 0 to 12293
Series name: episodes
Non-Null Count  Dtype  
--------------  -----  
11954 non-null  float64
dtypes: float64(1)
memory usage: 96.2 KB


In [13]:
df.episodes.fillna(df.episodes.mean(),inplace=True)
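
`errors="coerce"` turns every entry that cannot be parsed as a number into `NaN`, which is why the non-null count drops from 12294 to 11954 before the mean is filled in. A small sketch of that behaviour, assuming the non-numeric entries are placeholder strings such as `"Unknown"` (the toy values below are made up):

```python
import pandas as pd

# Toy episode counts stored as strings, with one unparseable placeholder.
episodes = pd.Series(["1", "64", "Unknown", "24"])

# Unparseable entries become NaN instead of raising an error.
as_number = pd.to_numeric(episodes, errors="coerce")

# Fill the gap with the mean of the parsed values, as the notebook does.
filled = as_number.fillna(as_number.mean())

print(as_number.tolist())
print(filled.tolist())
```

Mean imputation yields fractional episode counts; the median is a common alternative when a column is strongly skewed, though that is a design choice rather than a correction.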

In [14]:
df.episodes.info()

<class 'pandas.core.series.Series'>
RangeIndex: 12294 entries, 0 to 12293
Series name: episodes
Non-Null Count  Dtype  
--------------  -----  
12294 non-null  float64
dtypes: float64(1)
memory usage: 96.2 KB


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12294 non-null  object 
 3   type      12294 non-null  object 
 4   episodes  12294 non-null  float64
 5   rating    12294 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(2), int64(2), object(3)
memory usage: 672.5+ KB


In [16]:
# convert type to categorical datatype
df.type = df.type.astype("category")

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   anime_id  12294 non-null  int64   
 1   name      12294 non-null  object  
 2   genre     12294 non-null  object  
 3   type      12294 non-null  category
 4   episodes  12294 non-null  float64 
 5   rating    12294 non-null  float64 
 6   members   12294 non-null  int64   
dtypes: category(1), float64(2), int64(2), object(2)
memory usage: 588.6+ KB


In [18]:
for col in df.select_dtypes(include=["object","category"]).columns:
    print(f"unique values in {col} : \n",df[col].unique(),"\n")

unique values in name : 
 ['Kimi no Na wa.' 'Fullmetal Alchemist: Brotherhood' 'Gintama°' ...
 'Violence Gekiga David no Hoshi'
 'Violence Gekiga Shin David no Hoshi: Inma Densetsu'
 'Yasuji no Pornorama: Yacchimae!!'] 

unique values in genre : 
 ['Drama, Romance, School, Supernatural'
 'Action, Adventure, Drama, Fantasy, Magic, Military, Shounen'
 'Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen' ...
 'Hentai, Sports' 'Drama, Romance, School, Yuri' 'Hentai, Slice of Life'] 

unique values in type : 
 ['Movie', 'TV', 'OVA', 'Special', 'Music', 'ONA']
Categories (6, object): ['Movie', 'Music', 'ONA', 'OVA', 'Special', 'TV'] 



## Task 2 : Feature Extraction

In [19]:
# Feature extraction

# The key features are genres, average rating, and number of episodes

# Convert genres into numerical columns (one-hot encoding)

df['genre']=df['genre'].str.split(',')


In [20]:
df_expanded = df['genre'].explode().reset_index()

In [21]:
df_expanded = pd.get_dummies(df_expanded,columns = ['genre'])

In [22]:
df_one_hot = df_expanded.groupby('index').sum()

In [23]:
df= df.merge(df_one_hot,left_index=True,right_on="index")
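
Splitting on `','` alone leaves a leading space on every genre after the first, which is why the output contains names such as `genre_ Adventure` next to `genre_Shounen`: the same genre can end up in two separate columns depending on its position in the list. A minimal sketch of one way to avoid that, using `Series.str.get_dummies` with `', '` as the separator on made-up rows (an alternative to the explode/groupby steps, not the notebook's method):

```python
import pandas as pd

# Toy rows; the genre strings mimic the "A, B, C" layout of the dataset.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "genre": ["Action, Comedy", "Comedy, Drama", "Drama"],
})

# Split on comma + space so no genre label keeps a leading blank,
# then prefix the indicator columns for readability.
genre_dummies = toy["genre"].str.get_dummies(sep=", ").add_prefix("genre_")
toy = toy.join(genre_dummies)

print(toy.columns.tolist())
```

With this encoding each genre maps to exactly one indicator column, so two titles sharing a genre always share the corresponding feature.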

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12294 entries, 0 to 12293
Data columns (total 90 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   anime_id              12294 non-null  int64   
 1   name                  12294 non-null  object  
 2   genre                 12294 non-null  object  
 3   type                  12294 non-null  category
 4   episodes              12294 non-null  float64 
 5   rating                12294 non-null  float64 
 6   members               12294 non-null  int64   
 7   genre_ Adventure      12294 non-null  uint8   
 8   genre_ Cars           12294 non-null  uint8   
 9   genre_ Comedy         12294 non-null  uint8   
 10  genre_ Dementia       12294 non-null  uint8   
 11  genre_ Demons         12294 non-null  uint8   
 12  genre_ Drama          12294 non-null  uint8   
 13  genre_ Ecchi          12294 non-null  uint8   
 14  genre_ Fantasy        12294 non-null  uint8   
 15  ge

In [25]:
print(df.head())

       anime_id                              name  \
index                                               
0         32281                    Kimi no Na wa.   
1          5114  Fullmetal Alchemist: Brotherhood   
2         28977                          Gintama°   
3          9253                       Steins;Gate   
4          9969                     Gintama&#039;   

                                                   genre   type  episodes  \
index                                                                       
0              [Drama,  Romance,  School,  Supernatural]  Movie       1.0   
1      [Action,  Adventure,  Drama,  Fantasy,  Magic,...     TV      64.0   
2      [Action,  Comedy,  Historical,  Parody,  Samur...     TV      51.0   
3                                    [Sci-Fi,  Thriller]     TV      24.0   
4      [Action,  Comedy,  Historical,  Parody,  Samur...     TV      51.0   

       rating  members  genre_ Adventure  genre_ Cars  genre_ Comedy  ...  \
index      

In [26]:
# Normalizing the data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['rating','episodes']] = scaler.fit_transform(df[['rating','episodes']])
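
With default settings, `MinMaxScaler` maps each column to the range [0, 1] via (x - min) / (max - min). A minimal sketch of the same computation in plain pandas on made-up values, to make the scaled numbers easier to interpret:

```python
import pandas as pd

# Toy values; each column is scaled independently.
toy = pd.DataFrame({"rating": [2.0, 6.0, 10.0], "episodes": [1.0, 13.0, 25.0]})

# Column-wise min-max scaling: smallest value -> 0, largest value -> 1.
scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
```

Because a single very long series sets the maximum, most scaled `episodes` values sit close to zero (the `describe()` output reports a mean of about 0.006), so this feature likely contributes little to the similarity scores compared with the genre columns.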

In [27]:
print(df[['rating','episodes']].head())

         rating  episodes
index                    
0      0.924370  0.000000
1      0.911164  0.034673
2      0.909964  0.027518
3      0.900360  0.012658
4      0.899160  0.027518


In [28]:
df.rating.describe(),df.episodes.describe()

(count    12294.000000
 mean         0.576699
 std          0.122100
 min          0.000000
 25%          0.507803
 50%          0.585834
 75%          0.660264
 max          1.000000
 Name: rating, dtype: float64,
 count    12294.000000
 mean         0.006264
 std          0.025434
 min          0.000000
 25%          0.000000
 50%          0.000550
 75%          0.006264
 max          1.000000
 Name: episodes, dtype: float64)

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12294 entries, 0 to 12293
Data columns (total 90 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   anime_id              12294 non-null  int64   
 1   name                  12294 non-null  object  
 2   genre                 12294 non-null  object  
 3   type                  12294 non-null  category
 4   episodes              12294 non-null  float64 
 5   rating                12294 non-null  float64 
 6   members               12294 non-null  int64   
 7   genre_ Adventure      12294 non-null  uint8   
 8   genre_ Cars           12294 non-null  uint8   
 9   genre_ Comedy         12294 non-null  uint8   
 10  genre_ Dementia       12294 non-null  uint8   
 11  genre_ Demons         12294 non-null  uint8   
 12  genre_ Drama          12294 non-null  uint8   
 13  genre_ Ecchi          12294 non-null  uint8   
 14  genre_ Fantasy        12294 non-null  uint8   
 15  ge

In [30]:
df.head()

index,anime_id,name,genre,type,episodes,rating,members,genre_ Adventure,genre_ Cars,genre_ Comedy,...,genre_Shounen,genre_Slice of Life,genre_Space,genre_Sports,genre_Super Power,genre_Supernatural,genre_Thriller,genre_Vampire,genre_Yaoi,genre_missing
0,32281,Kimi no Na wa.,"[Drama, Romance, School, Supernatural]",Movie,0.0,0.92437,200630,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,5114,Fullmetal Alchemist: Brotherhood,"[Action, Adventure, Drama, Fantasy, Magic,...",TV,0.034673,0.911164,793665,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,28977,Gintama°,"[Action, Comedy, Historical, Parody, Samur...",TV,0.027518,0.909964,114262,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,9253,Steins;Gate,"[Sci-Fi, Thriller]",TV,0.012658,0.90036,673572,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,9969,Gintama&#039;,"[Action, Comedy, Historical, Parody, Samur...",TV,0.027518,0.89916,151266,0,0,1,...,0,0,0,0,0,0,0,0,0,0


## Task 3 : Recommendation System

In [31]:
# Building Recommendation System
from sklearn.metrics.pairwise import cosine_similarity
features = df.drop(columns=['anime_id','name','type','members','genre'])

In [32]:
cosine_sim = cosine_similarity(features)

In [33]:
cosine_sim_df = pd.DataFrame(cosine_sim,index = df['name'],columns=df['name'])

In [34]:
cosine_sim_df

name,Kimi no Na wa.,Fullmetal Alchemist: Brotherhood,Gintama°,Steins;Gate,Gintama&#039;,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Gakuen Koukou,Hunter x Hunter (2011),Ginga Eiyuu Densetsu,Gintama Movie: Kanketsu-hen - Yorozuya yo Eien Nare,Gintama&#039;: Enchousen,...,Super Erotic Anime,Taimanin Asagi 3,Teleclub no Himitsu,Tenshi no Habataki Jun,The Satisfaction,Toushindai My Lover: Minami tai Mecha-Minami,Under World,Violence Gekiga David no Hoshi,Violence Gekiga Shin David no Hoshi: Inma Densetsu,Yasuji no Pornorama: Yacchimae!!
Kimi no Na wa.,1.000000,0.136600,0.136443,0.225308,0.134992,0.344699,0.171341,0.378143,0.134028,0.134189,...,0.132814,0.381156,0.142157,0.127622,0.129360,0.119713,0.125440,0.150858,0.154926,0.173746
Fullmetal Alchemist: Brotherhood,0.136600,1.000000,0.361436,0.174948,0.360629,0.417950,0.622299,0.296135,0.359983,0.360102,...,0.103080,0.102899,0.110331,0.099044,0.100393,0.092906,0.097350,0.117096,0.120233,0.134839
Gintama°,0.136443,0.361436,1.000000,0.174728,0.999993,0.269535,0.459150,0.295940,0.999933,0.999956,...,0.102960,0.102772,0.110202,0.098930,0.100277,0.092799,0.097238,0.116957,0.120095,0.134684
Steins;Gate,0.225308,0.174948,0.174728,1.000000,0.172870,0.200142,0.219603,0.219108,0.171562,0.171786,...,0.170012,0.169677,0.181972,0.163362,0.165587,0.153238,0.160569,0.193117,0.198312,0.222402
Gintama&#039;,0.134992,0.360629,0.999993,0.172870,1.000000,0.268431,0.458146,0.294734,0.999949,0.999970,...,0.101865,0.101679,0.109030,0.097878,0.099211,0.091812,0.096204,0.115713,0.118818,0.133251
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Toushindai My Lover: Minami tai Mecha-Minami,0.119713,0.092906,0.092799,0.153238,0.091812,0.106333,0.116534,0.116309,0.091156,0.091266,...,0.999464,0.090141,0.998414,0.999805,0.999710,1.000000,0.999898,0.996923,0.996054,0.990544
Under World,0.125440,0.097350,0.097238,0.160569,0.096204,0.111420,0.122109,0.121873,0.095517,0.095632,...,0.999829,0.094453,0.999116,0.999985,0.999952,0.999898,1.000000,0.997940,0.997219,0.992402
Violence Gekiga David no Hoshi,0.150858,0.117096,0.116957,0.193117,0.115713,0.134001,0.146909,0.146611,0.114872,0.115014,...,0.998955,0.113598,0.999755,0.998275,0.998521,0.996923,0.997940,1.000000,0.999945,0.998249
Violence Gekiga Shin David no Hoshi: Inma Densetsu,0.154926,0.120233,0.120095,0.198312,0.118818,0.137611,0.150811,0.150521,0.117969,0.118111,...,0.998426,0.116655,0.999470,0.997611,0.997902,0.996054,0.997219,0.999945,1.000000,0.998811
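
Each entry of the matrix above is the cosine of the angle between two feature vectors, a·b / (‖a‖‖b‖). A sketch of that formula with NumPy on made-up vectors (the helper name `cosine_matrix` and the toy features are assumptions for illustration):

```python
import numpy as np

def cosine_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = X / norms         # scale every row to unit length
    return unit @ unit.T     # dot products of unit vectors = cosines

# Columns: [genre_Action, genre_Comedy, genre_Drama, rating_scaled]
X = np.array([
    [1, 1, 0, 0.9],
    [1, 1, 0, 0.8],
    [0, 0, 1, 0.9],
])
sim = cosine_matrix(X)
print(np.round(sim, 3))
```

Rows with identical genre indicators differ only in the small scaled columns, which is consistent with the near-1 scores between the Gintama entries in the output above.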


In [39]:
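def recommend_anime(anime_name,n_recommendations=5,threshold=0.5):
    # similarity of every anime to the requested title
    sim_scores = cosine_sim_df[anime_name]
    # keep scores above the threshold, highest first
    sim_scores = sim_scores[sim_scores > threshold].sort_values(ascending=False)
    # remove the requested title itself (its self-similarity is 1)
    sim_scores = sim_scores.drop(anime_name)
    return sim_scores.head(n_recommendations).index.tolist()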

In [40]:
recommend_anime("Naruto")

['Naruto: Shippuuden',
 'Boruto: Naruto the Movie - Naruto ga Hokage ni Natta Hi',
 'Boruto: Naruto the Movie',
 'Naruto x UT',
 'Naruto: Shippuuden Movie 4 - The Lost Tower']
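
The full 12294 x 12294 similarity matrix holds about 151 million values, roughly 1.2 GB as float64. A sketch of an alternative that computes similarities for a single title on demand; `recommend_on_demand` and the toy data are hypothetical, and the title is looked up by position so that a duplicated name does not make the lookup ambiguous:

```python
import numpy as np
import pandas as pd

def recommend_on_demand(names, features, anime_name, n_recommendations=5, threshold=0.5):
    """Return up to n titles whose cosine similarity to anime_name exceeds threshold."""
    names = pd.Series(names).reset_index(drop=True)
    matches = np.flatnonzero(names.to_numpy() == anime_name)
    if matches.size == 0:
        raise KeyError(f"{anime_name!r} not found")
    query_pos = matches[0]  # first match if the title is duplicated

    X = np.asarray(features, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    norms[norms == 0] = 1.0  # guard against all-zero feature rows
    # Cosine similarity of the query row against every row (one vector, not a matrix).
    scores = (X @ X[query_pos]) / (norms * norms[query_pos])

    order = np.argsort(-scores, kind="stable")  # highest score first
    picks = [i for i in order if i != query_pos and scores[i] > threshold]
    return names.iloc[picks[:n_recommendations]].tolist()

# Toy data: four titles with three made-up features each.
names = ["A", "B", "C", "D"]
features = [[1, 1, 0], [1, 1, 0.2], [0, 0, 1], [1, 0, 0]]
print(recommend_on_demand(names, features, "A", n_recommendations=2))
```

This trades a little computation per query for a much smaller memory footprint, and raises a clear error when the title is missing.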

## Task 4 : Evaluation

In [47]:

def evaluate_recommendation_system(df, model_func, n_recommendations=5):
    precisions, recalls, f1s = [], [], []

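    for anime in df['name'].sample(20): 
        recommendations = model_func(anime, n_recommendations=n_recommendations)
        # NOTE: there is no ground truth here, so every returned title is counted as relevant.
        relevant_recommendations = len(recommendations)
        
        # Both ratios reduce to (titles returned) / (titles requested).
        precision = relevant_recommendations / n_recommendations
        recall = relevant_recommendations / n_recommendations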
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    # Calculate and print the average metrics
    avg_precision = sum(precisions) / len(precisions)
    avg_recall = sum(recalls) / len(recalls)
    avg_f1 = sum(f1s) / len(f1s)

    print(f'Precision: {avg_precision:.2f}')
    print(f'Recall: {avg_recall:.2f}')
    print(f'F1-Score: {avg_f1:.2f}')


In [48]:
evaluate_recommendation_system(df,recommend_anime)

Precision: 1.00
Recall: 1.00
F1-Score: 1.00
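
The function above counts every returned title as relevant, so scores of 1.00 only show that five recommendations were returned for each sampled anime. A sketch of one possible proxy metric, assuming (hypothetically) that a recommendation is relevant when it shares at least one genre with the query; `precision_at_k` and the toy data are illustrative. Since the recommender is itself built on genres, this proxy is optimistic and does not replace evaluation against real user ratings:

```python
def precision_at_k(query_genres, recommended_genres, k=5):
    """Fraction of the top-k recommendations sharing at least one genre with the query."""
    top_k = recommended_genres[:k]
    if not top_k:
        return 0.0
    # A recommendation is a hit when its genre set overlaps the query's genre set.
    hits = sum(1 for genres in top_k if set(genres) & set(query_genres))
    return hits / k

# Toy example: genre lists of a query title and of four recommended titles.
query = ["Action", "Comedy"]
recs = [["Action"], ["Drama"], ["Comedy", "School"], ["Romance"]]
print(precision_at_k(query, recs, k=4))  # 2 of the 4 recommendations overlap
```

Recall would additionally need the total number of relevant titles in the catalogue for each query, which this sketch does not attempt to define.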


## Task 5: Interview questions:

### 1. Can you explain the difference between user-based and item-based collaborative filtering?

User-based : Finds users whose preferences are similar to yours and recommends the items they liked.

Item-based : Finds items similar to the ones you liked and recommends those items to you.
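
The two approaches differ mainly in which axis of the user-item rating matrix is compared. A minimal sketch on made-up ratings (the matrix and the helper `cosine_rows` are assumptions for illustration):

```python
import numpy as np

# Rows are users, columns are items; 0 means "not rated" (made-up ratings).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine_rows(M):
    """Pairwise cosine similarity between the rows of M (rows must be non-zero)."""
    unit = M / np.linalg.norm(M, axis=1, keepdims=True)
    return unit @ unit.T

user_sim = cosine_rows(ratings)    # 3 x 3: compares users (user-based)
item_sim = cosine_rows(ratings.T)  # 4 x 4: compares items (item-based)

print(np.round(user_sim, 2))
print(np.round(item_sim, 2))
```

User-based filtering would look up the most similar rows of `user_sim`, while item-based filtering would look up the most similar rows of `item_sim`.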


### 2. What is collaborative filtering, and how does it work?

Collaborative filtering is a technique used in recommendation systems to suggest items (such as movies, books, or products) to a user based on the preferences of other users.

#### Working :

 1. Collect data : The system gathers data on user behavior, such as ratings, likes, or purchases.
 2. Find similarities : It looks for patterns or similarities either between users or between items.
 3. Make recommendations : Based on these similarities, it predicts and suggests items that a user might like.
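
The three steps above can be sketched as a small user-based predictor. This is a simplified illustration on made-up ratings, with `predict_user_based` as a hypothetical helper; practical systems usually also mean-centre ratings and restrict the sum to the nearest neighbours:

```python
import numpy as np

# Step 1: collected ratings (rows = users, columns = items, np.nan = unrated).
ratings = np.array([
    [5.0, 4.0, np.nan],
    [4.0, 5.0, 5.0],
    [1.0, 2.0, 1.0],
])

def predict_user_based(ratings, user, item):
    """Similarity-weighted average of other users' ratings for item."""
    target = ratings[user]
    weighted_sum, weight_total = 0.0, 0.0
    for other in range(ratings.shape[0]):
        if other == user or np.isnan(ratings[other, item]):
            continue
        # Step 2: cosine similarity over the items both users have rated.
        both = ~np.isnan(target) & ~np.isnan(ratings[other])
        if not both.any():
            continue
        a, b = target[both], ratings[other][both]
        sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        weighted_sum += sim * ratings[other, item]
        weight_total += abs(sim)
    # Step 3: predict the missing rating from the neighbours' ratings.
    return weighted_sum / weight_total if weight_total else float("nan")

prediction = predict_user_based(ratings, user=0, item=2)
print(round(prediction, 2))
```

The predicted value always lies between the lowest and highest rating that the other users gave the item, weighted towards the users most similar to the target.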

