## Data Preprocessing:

Load the dataset into a suitable data structure (e.g., pandas DataFrame).

Handle missing values, if any.

Explore the dataset to understand its structure and attributes.


In [2]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

In [3]:
# URL of the raw CSV file
url = "https://raw.githubusercontent.com/vijaykalore/DS_Assignments/refs/heads/main/anime.csv"

# Load the dataset into a DataFrame
df = pd.read_csv(url)

In [4]:
df

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266
...,...,...,...,...,...,...,...
12289,9316,Toushindai My Lover: Minami tai Mecha-Minami,Hentai,OVA,1,4.15,211
12290,5543,Under World,Hentai,OVA,1,4.28,183
12291,5621,Violence Gekiga David no Hoshi,Hentai,OVA,4,4.88,219
12292,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai,OVA,1,4.98,175


In [189]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


In [190]:
df.describe()

Unnamed: 0,anime_id,rating,members
count,12294.0,12064.0,12294.0
mean,14058.221653,6.473902,18071.34
std,11455.294701,1.026746,54820.68
min,1.0,1.67,5.0
25%,3484.25,5.88,225.0
50%,10260.5,6.57,1550.0
75%,24794.5,7.18,9437.0
max,34527.0,10.0,1013917.0


In [191]:
df.shape

(12294, 7)

In [192]:
df.isnull().sum()

Unnamed: 0,0
anime_id,0
name,0
genre,62
type,25
episodes,0
rating,230
members,0


In [193]:
# Handle missing values
df.fillna('', inplace=True)

  df.fillna('', inplace=True)


In [194]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   anime_id  12294 non-null  int64 
 1   name      12294 non-null  object
 2   genre     12294 non-null  object
 3   type      12294 non-null  object
 4   episodes  12294 non-null  object
 5   rating    12294 non-null  object
 6   members   12294 non-null  int64 
dtypes: int64(2), object(5)
memory usage: 672.5+ KB


In [195]:
df.isnull().sum()

Unnamed: 0,0
anime_id,0
name,0
genre,0
type,0
episodes,0
rating,0
members,0


In [196]:
df.shape

(12294, 7)

In [197]:
# Explore unique values in categorical columns
for col in df.select_dtypes(include=['object']):
    print(f"Unique values in column '{col}': {df[col].unique()}")

Unique values in column 'name': ['Kimi no Na wa.' 'Fullmetal Alchemist: Brotherhood' 'Gintama°' ...
 'Violence Gekiga David no Hoshi'
 'Violence Gekiga Shin David no Hoshi: Inma Densetsu'
 'Yasuji no Pornorama: Yacchimae!!']
Unique values in column 'genre': ['Drama, Romance, School, Supernatural'
 'Action, Adventure, Drama, Fantasy, Magic, Military, Shounen'
 'Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen' ...
 'Hentai, Sports' 'Drama, Romance, School, Yuri' 'Hentai, Slice of Life']
Unique values in column 'type': ['Movie' 'TV' 'OVA' 'Special' 'Music' 'ONA' '']
Unique values in column 'episodes': ['1' '64' '51' '24' '10' '148' '110' '13' '201' '25' '22' '75' '4' '26'
 '12' '27' '43' '74' '37' '2' '11' '99' 'Unknown' '39' '101' '47' '50'
 '62' '33' '112' '23' '3' '94' '6' '8' '14' '7' '40' '15' '203' '77' '291'
 '120' '102' '96' '38' '79' '175' '103' '70' '153' '45' '5' '21' '63' '52'
 '28' '145' '36' '69' '60' '178' '114' '35' '61' '34' '109' '20' '9' '49'
 '366' '97' '4

## Feature Extraction:

Decide on the features that will be used for computing similarity (e.g., genres, user ratings).

Convert categorical features into numerical representations if necessary.

Normalize numerical features if required.


In [198]:
df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [199]:
# Convert 'rating' to numeric, coercing any errors (non-numeric values) to NaN
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')

# Drop rows with NaN values in 'rating'
df = df.dropna(subset=['rating'])

# Convert to integer
df['rating'] = df['rating'].astype(int)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['rating'] = df['rating'].astype(int)


In [200]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.preprocessing import MinMaxScaler

# Feature Selection: We will use 'genre' and 'rating' for similarity
features = ['genre', 'rating']

# Convert 'genre' (categorical) to numerical using MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df['genre_encoded'] = mlb.fit_transform(df['genre'].str.split(', ')).tolist()

# Normalize 'rating' using MinMaxScaler
scaler = MinMaxScaler()
df['rating_normalized'] = scaler.fit_transform(df['rating'].values.reshape(-1, 1))

# Select the final features for similarity computation
selected_features = ['genre_encoded', 'rating_normalized']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['genre_encoded'] = mlb.fit_transform(df['genre'].str.split(', ')).tolist()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['rating_normalized'] = scaler.fit_transform(df['rating'].values.reshape(-1, 1))


In [181]:
df

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members,genre_encoded,rating_normalized
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9,200630,"[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...",0.888889
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9,793665,"[0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, ...",0.888889
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9,114262,"[0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...",0.888889
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9,673572,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.888889
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9,151266,"[0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...",0.888889
...,...,...,...,...,...,...,...,...,...
12289,9316,Toushindai My Lover: Minami tai Mecha-Minami,Hentai,OVA,1,4,211,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...",0.333333
12290,5543,Under World,Hentai,OVA,1,4,183,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...",0.333333
12291,5621,Violence Gekiga David no Hoshi,Hentai,OVA,4,4,219,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...",0.333333
12292,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai,OVA,1,4,175,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...",0.333333


## Recommendation System:

Design a function to recommend anime based on cosine similarity.

Given a target anime, recommend a list of similar anime based on cosine similarity scores.

Experiment with different threshold values for similarity scores to adjust the recommendation list size.


In [201]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [202]:
def recommend_anime(target_anime_name, df, threshold=0.8):
  # Find the index of the target anime
  target_index = df[df['name'] == target_anime_name].index[0]

  # Calculate cosine similarity scores for all anime
  similarity_scores = cosine_similarity(df['genre_encoded'].tolist())

  # Get similarity scores for the target anime
  target_similarity = similarity_scores[target_index]

  # Sort anime by similarity score and exclude the target anime
  recommended_indices = np.argsort(target_similarity)[::-1][1:]

  # Filter recommendations based on threshold
  recommended_anime = []
  for idx in recommended_indices:
    if target_similarity[idx] >= threshold:
      recommended_anime.append(df.iloc[idx]['name'])

  return recommended_anime

In [203]:
print(recommend_anime('Naruto', df))

['Naruto Shippuuden: Sunny Side Battle', 'Naruto: Shippuuden Movie 3 - Hi no Ishi wo Tsugu Mono', 'Naruto Soyokazeden Movie: Naruto to Mashin to Mitsu no Onegai Dattebayo!!', 'Boruto: Naruto the Movie', 'Boruto: Naruto the Movie - Naruto ga Hokage ni Natta Hi', 'Naruto x UT', 'Naruto', 'Naruto: Shippuuden Movie 4 - The Lost Tower', 'Kyutai Panic Adventure!', 'Battle Spirits: Ryuuko no Ken', 'Katekyo Hitman Reborn!', 'Tenjou Tenge', 'Dragon Ball Z Movie 11: Super Senshi Gekiha!! Katsu no wa Ore da', 'Dragon Ball Kai (2014)', 'Dragon Ball Z: Atsumare! Gokuu World', 'Dragon Ball Z: Summer Vacation Special', 'Dragon Ball Kai', 'Dragon Ball Super', 'Medaka Box', 'Dragon Ball GT: Goku Gaiden! Yuuki no Akashi wa Suushinchuu', 'Medaka Box Abnormal', 'Dragon Ball Z', 'Dragon Ball Z Movie 15: Fukkatsu no F']


## Interview Questions:

1. Can you explain the difference between user-based and item-based collaborative filtering?

User-Based Collaborative Filtering:

Concept: User-based collaborative filtering recommends items to a user based on the preferences of other users who are similar to them. The underlying idea is that if User A has similar preferences to User B, then the items that User B liked (but User A hasn’t interacted with yet) should be recommended to User A.
How It Works:
Identify Similar Users: Calculate the similarity between the target user and other users based on their ratings or interactions with items.
Recommend Items: Recommend items that similar users have liked but the target user hasn't interacted with yet.
Example: If User A and User B both like similar movies, and User B has rated a movie highly that User A hasn’t seen, that movie might be recommended to User A.
Item-Based Collaborative Filtering:

Concept: Item-based collaborative filtering recommends items to a user based on the similarity between items. The assumption here is that if a user liked an item, they will likely enjoy other similar items.
How It Works:
Identify Similar Items: Calculate the similarity between items based on user ratings. Items that are rated similarly by the same users are considered similar.
Recommend Items: Recommend items that are similar to those the user has already liked or interacted with.
Example: If a user liked a specific movie, the system will recommend other movies that are similar to that movie.
Key Differences:

Focus: User-based filtering focuses on finding similar users, while item-based filtering focuses on finding similar items.
Scalability: Item-based filtering is often more scalable than user-based filtering because the number of items tends to be smaller and more stable than the number of users.
Data Sparsity: Item-based filtering usually handles data sparsity better because items generally have more ratings than individual users


2. What is collaborative filtering, and how does it work?

Collaborative Filtering:

Definition: Collaborative filtering is a technique used in recommendation systems that relies on the preferences and behaviors of users to make recommendations. It predicts the interests of a user by collecting preferences from many users (collaborating).
How It Works:

Data Collection: The system collects data on users' interactions with items, such as ratings, clicks, purchases, etc.
Similarity Calculation:
User-Based: Calculate the similarity between users based on their ratings or interactions.
Item-Based: Calculate the similarity between items based on user ratings or interactions.
Prediction:
User-Based: Predict a user’s interest in an item based on the preferences of similar users.
Item-Based: Predict a user’s interest in an item based on the similarity of that item to other items the user has interacted with.
Recommendation Generation: The system recommends items that are most likely to be of interest to the user, based on the predictions.
Types of Collaborative Filtering:

Memory-Based Collaborative Filtering:
User-Based: As described above, it relies on the similarity between users.
Item-Based: As described above, it relies on the similarity between items.
Model-Based Collaborative Filtering:
Approach: Uses machine learning algorithms like Matrix Factorization, SVD (Singular Value Decomposition), or Neural Networks to learn latent factors from the interaction data and make predictions.
Advantages: Often better at handling large-scale data and capturing complex patterns in user-item interactions.
Advantages:

No Need for Domain Knowledge: Collaborative filtering only requires user interaction data and doesn’t need detailed item content or user attributes.
Dynamic Adaptation: The model adapts as more data becomes available, improving over time.
Challenges:

Cold Start Problem: Difficulties in making recommendations for new users or new items due to a lack of data.
Sparsity: Many systems have a large number of items but only a few ratings or interactions, leading to sparse data.
Scalability: As the number of users and items grows, the computational cost can become significant, particularly for memory-based methods.