# **Sharvari Lahane**
# **Data Science - Batch May 2024 (Baner, Pune) - Assignment 11**
# **Recommendation System**

**Task 1: Data Preprocessing**

Importing Libraries

In [5]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MinMaxScaler

Loading the dataset

In [6]:
anime_data = pd.read_csv('anime.csv')
anime_data

index,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266
...,...,...,...,...,...,...,...
12289,9316,Toushindai My Lover: Minami tai Mecha-Minami,Hentai,OVA,1,4.15,211
12290,5543,Under World,Hentai,OVA,1,4.28,183
12291,5621,Violence Gekiga David no Hoshi,Hentai,OVA,4,4.88,219
12292,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai,OVA,1,4.98,175


In [7]:
anime_data.columns

Index(['anime_id', 'name', 'genre', 'type', 'episodes', 'rating', 'members'], dtype='object')

Checking for missing values

In [8]:
print(anime_data.isnull().sum())

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64


Handling missing values

In [9]:
anime_data['genre'] = anime_data['genre'].fillna('')  # Replace missing genres with an empty string
anime_data.dropna(inplace=True)  # Drop rows that still have missing values (type or rating)

**Task 2: Feature Extraction**

Converting 'genre' to a numerical representation using CountVectorizer

In [11]:
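import warnings
warnings.filterwarnings('ignore')  # Suppress library warnings in the notebook output

# Split each genre string on ', ' so multi-word genres such as 'Slice of Life' stay one token
count_vectorizer = CountVectorizer(tokenizer=lambda x: x.split(', '))
genre_matrix = count_vectorizer.fit_transform(anime_data['genre'])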

In [12]:
count_vectorizer

In [13]:
genre_matrix

<12064x44 sparse matrix of type '<class 'numpy.int64'>'
	with 35641 stored elements in Compressed Sparse Row format>

Normalizing 'rating' feature

In [14]:
scaler = MinMaxScaler()
anime_data['normalized_rating'] = scaler.fit_transform(anime_data[['rating']])

In [15]:
scaler
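
Min-max scaling maps each value to `(x - min) / (max - min)`, so the lowest rating becomes 0 and the highest becomes 1. The sketch below checks this against `MinMaxScaler` on a few made-up ratings (the `toy_ratings` values are assumptions for illustration only).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative ratings as a single-column 2D array (assumed values)
toy_ratings = np.array([[2.0], [5.0], [8.0], [10.0]])

# Scaling with scikit-learn
scaled_sklearn = MinMaxScaler().fit_transform(toy_ratings)

# The same transformation written out by hand
scaled_manual = (toy_ratings - toy_ratings.min()) / (toy_ratings.max() - toy_ratings.min())

print(scaled_sklearn.ravel())
print(np.allclose(scaled_sklearn, scaled_manual))
```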

Converting the genre matrix to an array and appending the normalized rating as the last column

In [17]:
import numpy as np

features = np.hstack([genre_matrix.toarray(), anime_data[['normalized_rating']].values])
features

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.92436975],
       [0.        , 1.        , 1.        , ..., 0.        , 0.        ,
        0.91116447],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.90996399],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.38535414],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.39735894],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.45498199]])

**Task 3: Recommendation System**

Computing cosine similarity matrix

In [18]:
cosine_sim = cosine_similarity(features)

In [19]:
cosine_sim

array([[1.        , 0.29880771, 0.13644987, ..., 0.15085865, 0.15492584,
        0.1737458 ],
       [0.29880771, 1.        , 0.36135915, ..., 0.11708593, 0.12024259,
        0.13484933],
       [0.13644987, 0.36135915, 1.        , ..., 0.116948  , 0.12010094,
        0.13469047],
       ...,
       [0.15085865, 0.11708593, 0.116948  , ..., 1.        , 0.99994581,
        0.99824985],
       [0.15492584, 0.12024259, 0.12010094, ..., 0.99994581, 1.        ,
        0.99881138],
       [0.1737458 , 0.13484933, 0.13469047, ..., 0.99824985, 0.99881138,
        1.        ]])
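
Cosine similarity is the dot product of two feature vectors divided by the product of their lengths. The sketch below compares a hand-written version with `cosine_similarity` on three made-up rows shaped like the feature matrix above: genre indicator columns followed by a normalized rating (`toy_features` and `manual_cosine` are hypothetical names for illustration).

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Columns: [genre_a, genre_b, normalized_rating] (assumed toy values)
toy_features = np.array([
    [1.0, 0.0, 0.9],   # title 0: genre_a, high rating
    [1.0, 0.0, 0.4],   # title 1: genre_a, lower rating
    [0.0, 1.0, 0.9],   # title 2: genre_b, high rating
])

def manual_cosine(u, v):
    """Cosine similarity written out: dot(u, v) / (|u| * |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

toy_sim = cosine_similarity(toy_features)

# The library result matches the hand-written formula
print(np.isclose(toy_sim[0, 1], manual_cosine(toy_features[0], toy_features[1])))

# Sharing a genre matters more here than sharing a rating
print(toy_sim[0, 1] > toy_sim[0, 2])
```

This is also a plausible reading of the values close to 1 in the bottom-right corner of the matrix above: titles with the same single genre and similar ratings have almost identical feature vectors.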

Function to recommend anime based on cosine similarity

In [20]:
def recommend_anime(anime_title, cosine_sim=cosine_sim, anime_data=anime_data, top_n=10):
    # Getting the index of the anime that matches the title
    idx = anime_data[anime_data['name'] == anime_title].index[0]

    # Getting the pairwise similarity scores of all anime with that anime
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sorting the anime based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Getting the scores of the top_n most similar anime
    sim_scores = sim_scores[1:top_n+1]

    # Getting the anime indices
    anime_indices = [i[0] for i in sim_scores]

    # Return the top_n most similar anime
    return anime_data.iloc[anime_indices]

print(recommend_anime('Naruto'))

      anime_id                                               name  \
615       1735                                 Naruto: Shippuuden   
1103     32365  Boruto: Naruto the Movie - Naruto ga Hokage ni...   
486      28755                           Boruto: Naruto the Movie   
1343     10075                                        Naruto x UT   
1472      8246        Naruto: Shippuuden Movie 4 - The Lost Tower   
1573      6325  Naruto: Shippuuden Movie 3 - Hi no Ishi wo Tsu...   
2458     19511               Naruto Shippuuden: Sunny Side Battle   
2997     10659  Naruto Soyokazeden Movie: Naruto to Mashin to ...   
175       1604                             Katekyo Hitman Reborn!   
7628     23933                            Kyutai Panic Adventure!   

                                                  genre     type episodes  \
615   Action, Comedy, Martial Arts, Shounen, Super P...       TV  Unknown   
1103  Action, Comedy, Martial Arts, Shounen, Super P...  Special        1   
486   Act

**Task 4: Evaluation**

In [21]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

Splitting the dataset into training and testing sets

In [22]:
train_data, test_data = train_test_split(anime_data, test_size=0.2, random_state=42)

In [23]:
train_data

index,anime_id,name,genre,type,episodes,rating,members,normalized_rating
1374,31553,Charlotte: Tsuyoi Monotachi,"School, Super Power",Special,1,7.56,39137,0.707083
3118,1925,Urusei Yatsura Movie 6: Itsudatte My Darling,"Action, Adventure, Comedy, Drama, Romance, Sci-Fi",Movie,1,7.08,2553,0.649460
11559,10392,Pet Life,Hentai,OVA,1,6.43,2374,0.571429
3780,8754,Tales of the Abyss Special Fan Disc,"Comedy, Slice of Life",Special,2,6.89,2975,0.626651
11152,5097,Hatsu Inu 2 The Animation: Strange Kind of Wom...,Hentai,OVA,2,7.29,8112,0.674670
...,...,...,...,...,...,...,...,...
12184,3566,Hika Ryoujoku: Wana ni Hamatta Futari,Hentai,OVA,1,5.32,1062,0.438175
5191,5272,Tondemo Nezumi Daikatsuyaku,Adventure,Movie,1,6.53,252,0.583433
5390,1262,Macross II: Lovers Again,"Adventure, Mecha, Military, Sci-Fi, Shounen, S...",OVA,6,6.47,6760,0.576230
860,22819,Aikatsu! Movie,"Music, School, Shoujo, Slice of Life",Movie,1,7.79,2813,0.734694


In [24]:
test_data

index,anime_id,name,genre,type,episodes,rating,members,normalized_rating
5092,2142,Blue Dragon,"Adventure, Comedy, Fantasy, Supernatural",TV,51,6.55,22718,0.585834
10174,8041,Sennin Buraku,"Comedy, Ecchi",TV,23,5.76,213,0.490996
5003,6555,Pokemon: Pikachu no Kirakira Daisousaku!,"Adventure, Comedy, Fantasy, Kids",Special,1,6.58,3786,0.589436
5952,25641,Monotonous Purgatory,Music,Music,1,6.30,515,0.555822
10433,6583,Super Bikkuriman,"Adventure, Comedy, Demons, Fantasy, Sci-Fi",TV,44,6.55,158,0.585834
...,...,...,...,...,...,...,...,...
6951,21427,Minna Atsumare! Falcom Gakuen,"Comedy, Parody, School, Seinen",TV,13,5.83,3175,0.499400
12022,3953,DNA Hunter,Hentai,OVA,3,5.71,1165,0.484994
4747,471,To Heart 2,"Comedy, Drama, Harem, Romance, School, Slice o...",TV,13,6.65,13877,0.597839
3181,25099,Ore ga Ojousama Gakkou ni &quot;Shomin Sample&...,"Comedy, Ecchi, Harem, Romance, School",TV,12,7.06,77774,0.647059


Checking and Resetting the DataFrame Index

In [35]:
# Resetting the index of anime_data to ensure a sequential index
anime_data.reset_index(drop=True, inplace=True)

# Recomputing cosine similarity after resetting index
cosine_sim = cosine_similarity(features)

Updating the recommend_anime Function

In [36]:
def recommend_anime(anime_title, cosine_sim=cosine_sim, anime_data=anime_data, top_n=10):
    # Checking if the anime_title exists in the DataFrame
    if anime_title not in anime_data['name'].values:
        print(f"Anime title '{anime_title}' not found in the dataset.")
        return pd.DataFrame()  # Return empty DataFrame if not found

    # Getting the index of the anime that matches the title
    idx = anime_data[anime_data['name'] == anime_title].index[0]

    # Getting the pairwise similarity scores of all anime with that anime
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sorting the anime based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Getting the scores of the top_n most similar anime
    sim_scores = sim_scores[1:top_n+1]

    # Getting the anime indices
    anime_indices = [i[0] for i in sim_scores]

    # Return the top_n most similar anime
    return anime_data.iloc[anime_indices]
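
One detail worth noting: `sim_scores[1:top_n+1]` assumes the query title is always first after sorting. When another title ties at similarity 1.0, the query itself can land in second place and appear in its own recommendations. The following is a sketch of a hypothetical variant, `top_n_similar`, that removes the query by position instead. It is not part of the original notebook, and the toy frame and matrix are made up.

```python
import numpy as np
import pandas as pd

def top_n_similar(title, sim_matrix, frame, top_n=10):
    """Hypothetical variant of recommend_anime that excludes the query by position."""
    # Positional matches, independent of the DataFrame's index labels
    matches = np.flatnonzero(frame["name"].to_numpy() == title)
    if matches.size == 0:
        return frame.iloc[0:0]  # Empty frame with the same columns
    pos = matches[0]

    # Positions sorted by descending similarity (stable, so ties keep row order)
    order = np.argsort(-sim_matrix[pos], kind="stable")

    # Drop the query itself wherever it ended up, then keep the top_n
    order = order[order != pos][:top_n]
    return frame.iloc[order]

# Toy data: A and B have identical features, so their similarity is 1.0
toy_frame = pd.DataFrame({"name": ["A", "B", "C", "D"]}, index=[10, 20, 30, 40])
toy_sim = np.array([
    [1.0, 1.0, 0.2, 0.5],
    [1.0, 1.0, 0.3, 0.4],
    [0.2, 0.3, 1.0, 0.1],
    [0.5, 0.4, 0.1, 1.0],
])

result = top_n_similar("B", toy_sim, toy_frame, top_n=2)
print(result["name"].tolist())  # ['A', 'D']
```

With the slicing approach, the same query would skip `A` (sorted first because of the tie) and return `B` itself.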

Adjusting the evaluate_recommendation_system Function

In [37]:
def evaluate_recommendation_system(test_data, cosine_sim, anime_data, top_n=10):
    precision_list, recall_list, f1_list = [], [], []

    for _, row in test_data.iterrows():
        recommended_anime = recommend_anime(row['name'], cosine_sim, anime_data, top_n)

        # Skipping if no recommendations could be made (e.g., anime not found)
        if recommended_anime.empty:
            continue

        # Placeholder ground truth: the dataset has no user watch history,
        # so the only "relevant" item is the query title itself
        ground_truth = test_data[test_data['name'] == row['name']]['name'].tolist()

        recommended_titles = recommended_anime['name'].tolist()

        # Calculating True Positives (TP), False Positives (FP), False Negatives (FN)
        tp = len(set(recommended_titles) & set(ground_truth))
        fp = len(set(recommended_titles) - set(ground_truth))
        fn = len(set(ground_truth) - set(recommended_titles))

        # Calculating precision, recall, and F1-score
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        # Append to lists
        precision_list.append(precision)
        recall_list.append(recall)
        f1_list.append(f1)

    # Computing average precision, recall, and F1-score
    avg_precision = np.mean(precision_list) if precision_list else 0
    avg_recall = np.mean(recall_list) if recall_list else 0
    avg_f1 = np.mean(f1_list) if f1_list else 0

    return avg_precision, avg_recall, avg_f1

Running the evaluation and printing the results

In [39]:
# Recomputing cosine similarity matrix if indices are reset
cosine_sim = cosine_similarity(features)

# Running evaluation
avg_precision, avg_recall, avg_f1 = evaluate_recommendation_system(test_data, cosine_sim, anime_data)
print(f"Average Precision: {avg_precision:.4f}")
print(f"Average Recall: {avg_recall:.4f}")
print(f"Average F1 Score: {avg_f1:.4f}")

Average Precision: 0.0126
Average Recall: 0.1260
Average F1 Score: 0.0229
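
The reported averages can be checked by hand. The ground truth in the evaluation above is the single query title, and each list holds up to 10 recommendations, so a hit yields precision 0.1 and recall 1.0, while a miss yields zeros. The sketch below assumes exactly 10 distinct recommendations per query and takes the hit rate of 0.126 from the reported average recall; `precision_recall_f1` is a hypothetical helper, not a function from the notebook.

```python
def precision_recall_f1(recommended, relevant):
    """Set-based precision, recall and F1 for one recommendation list."""
    rec, rel = set(recommended), set(relevant)
    tp = len(rec & rel)
    precision = tp / len(rec) if rec else 0.0
    recall = tp / len(rel) if rel else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

# One hit among 10 recommendations, single-title ground truth (assumed toy titles)
recs = [f"title_{i}" for i in range(10)]
hit = precision_recall_f1(recs, ["title_3"])
print(hit)

# Misses contribute zeros, so each average is hit_rate * per-hit value
hit_rate = 0.126  # taken from the reported average recall
averages = tuple(round(hit_rate * metric, 4) for metric in hit)
print(averages)  # (0.0126, 0.126, 0.0229)
```

The result is consistent with the printed averages, which suggests the metrics mostly measure how often a title reappears in its own recommendation list rather than recommendation quality. A more meaningful evaluation would need per-user interaction data as ground truth.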


# **Interview Questions**

**Can you explain the difference between user-based and item-based collaborative filtering?**

User-based and item-based collaborative filtering are two common approaches used in recommendation systems to predict a user's interest in an item (like a product, movie, or song) based on past behavior.

Both methods rely on similarity, but they differ in what is compared: users in one case, items in the other.

1. User-Based Collaborative Filtering:

User-based collaborative filtering (UBCF) focuses on finding similarities between users.

The basic idea is that if two users have similar preferences or have rated items in a similar way in the past, then the items that one user likes can be recommended to the other user.

2. Item-Based Collaborative Filtering:

Item-based collaborative filtering (IBCF) focuses on finding similarities between items instead of users.

The idea here is that if two items are similar (i.e., users rate them similarly), then a user who has liked or interacted with one item is likely to like the other.
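
The difference comes down to which axis of the user-item matrix is compared. Here is a minimal sketch on a made-up rating matrix (rows are users, columns are items, 0 means "not rated"; all values are assumptions for illustration): user-based filtering compares rows, item-based filtering compares columns.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy user-item ratings: 3 users x 4 items, 0 = not rated (assumed values)
ratings = np.array([
    [5, 4, 0, 1],   # user 0
    [4, 5, 1, 0],   # user 1: tastes close to user 0
    [1, 0, 5, 4],   # user 2: roughly opposite tastes
])

# User-based: similarity between rows (users)
user_sim = cosine_similarity(ratings)

# Item-based: similarity between columns (items), hence the transpose
item_sim = cosine_similarity(ratings.T)

print(user_sim.shape, item_sim.shape)   # (3, 3) (4, 4)
print(user_sim[0, 1] > user_sim[0, 2])  # user 0 is closer to user 1 than to user 2
print(item_sim[0, 1] > item_sim[0, 2])  # item 0 is rated more like item 1 than item 2
```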

**What is collaborative filtering, and how does it work?**

Collaborative filtering is a technique used in recommendation systems to predict the preferences of a user by collecting preferences from multiple users.

The assumption is that users who have agreed in the past will agree in the future or that a user will prefer items similar to what they liked in the past.

Collaborative filtering works by creating a user-item matrix where each row represents a user and each column represents an item.

The values in this matrix are usually ratings or interaction scores.

The system finds patterns within this matrix to predict user preferences for items they haven't interacted with yet.

This can be done using either user-based or item-based methods, as explained above.
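
As a rough illustration of the prediction step, the sketch below fills one missing entry with a similarity-weighted average of other users' ratings (user-based). The matrix and the `cosine_on_shared` helper are hypothetical, and the sketch uses raw cosine similarity for brevity; practical systems commonly mean-center ratings first.

```python
import numpy as np

# Toy user-item matrix; np.nan marks the rating to predict (assumed values)
ratings = np.array([
    [5.0, 4.0, np.nan],   # target user has not rated item 2
    [4.0, 5.0, 2.0],
    [1.0, 2.0, 5.0],
])

def cosine_on_shared(u, v):
    """Cosine similarity using only the items both users have rated."""
    shared = ~np.isnan(u) & ~np.isnan(v)
    a, b = u[shared], v[shared]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target, item = 0, 2

# Neighbours: every other user who has rated the item
neighbours = [u for u in range(len(ratings))
              if u != target and not np.isnan(ratings[u, item])]

# Weight each neighbour's rating by their similarity to the target user
weights = np.array([cosine_on_shared(ratings[target], ratings[u]) for u in neighbours])
predicted = float(weights @ ratings[neighbours, item] / weights.sum())

print(round(predicted, 2))
```

The prediction always falls between the neighbours' ratings, pulled toward the users who are more similar to the target.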