# Task 1: Data Preprocessing

Q1. Load the dataset into a suitable data structure (e.g., pandas DataFrame).

Q2. Handle missing values, if any.

Q3. Explore the dataset to understand its structure and attributes.

In [1]:
import numpy as np
import pandas as pd
import re
from sklearn.metrics.pairwise import linear_kernel
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_score, recall_score, f1_score

import random

import warnings
warnings.filterwarnings('ignore')

# Load the dataset
anime_df = pd.read_csv('anime.csv')

# Display the first few rows
anime_df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [2]:
# Check for any missing values

null_features = anime_df.columns[anime_df.isna().any()]
anime_df[null_features].isna().sum()

genre      62
type       25
rating    230
dtype: int64

In [3]:
# Handle missing values

# For numerical columns like 'rating', fill missing values with the mean
# (assign the result back; inplace=True on a selected column is unreliable in recent pandas)
anime_df['rating'] = anime_df['rating'].fillna(anime_df['rating'].mean())

# For categorical columns like 'genre' or 'type', fill with the mode (most frequent value)
anime_df['genre'] = anime_df['genre'].fillna(anime_df['genre'].mode()[0])
anime_df['type'] = anime_df['type'].fillna(anime_df['type'].mode()[0])

# Check for null values after processing
anime_df.isna().sum()

anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64
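The null check above only counts true `NaN` values. Later output in this notebook shows `episodes` holding the string `Unknown`, which `isna()` does not flag. The following is a minimal sketch, on a small hypothetical frame (`toy_df`) rather than `anime.csv`, of how such placeholders could be surfaced and imputed. Filling with the median is an assumption made for illustration, not something the notebook does.

```python
import pandas as pd

# Hypothetical stand-in for anime_df; only the 'episodes' column matters here.
toy_df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "episodes": ["12", "Unknown", "24", "1"],
})

# errors='coerce' turns non-numeric strings such as 'Unknown' into NaN.
episodes_num = pd.to_numeric(toy_df["episodes"], errors="coerce")
n_hidden_missing = int(episodes_num.isna().sum())

# Assumed imputation choice: the median of the known episode counts.
toy_df["episodes"] = episodes_num.fillna(episodes_num.median())

print(n_hidden_missing)
print(toy_df["episodes"].tolist())
```

Assigning the filled Series back to the column avoids relying on `inplace=True`.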

In [4]:
# Exploring the dataset

# Display the first few rows of the DataFrame
print("First 5 rows of the dataset:")
print(anime_df.head())

# Get the number of rows and columns in the DataFrame
print("\nNumber of rows and columns:")
print(anime_df.shape)

# Get the column names
print("\nColumn names:")
print(anime_df.columns)

# Get data types of columns
print("\nData types of columns:")
print(anime_df.dtypes)

# Summary of the DataFrame
print("\nSummary statistics:")
print(anime_df.describe())  # Numeric columns only; pass include='all' to cover every column

First 5 rows of the dataset:
   anime_id                              name  \
0     32281                    Kimi no Na wa.   
1      5114  Fullmetal Alchemist: Brotherhood   
2     28977                          Gintama°   
3      9253                       Steins;Gate   
4      9969                     Gintama&#039;   

                                               genre   type episodes  rating  \
0               Drama, Romance, School, Supernatural  Movie        1    9.37   
1  Action, Adventure, Drama, Fantasy, Magic, Mili...     TV       64    9.26   
2  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.25   
3                                   Sci-Fi, Thriller     TV       24    9.17   
4  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.16   

   members  
0   200630  
1   793665  
2   114262  
3   673572  
4   151266  

Number of rows and columns:
(12294, 7)

Column names:
Index(['anime_id', 'name', 'genre', 'type', 'episodes', 'ratin

# Task 2: Feature Extraction

Q1. Decide on the features that will be used for computing similarity (e.g., genres, user ratings).

Q2. Convert categorical features into numerical representations if necessary.

Q3. Normalize numerical features if required.

In [5]:
# Getting rid of all the special characters


# Define a function to remove Japanese special characters using regex
def remove_japanese_chars(text):
    # Japanese characters range in Unicode
    japanese_pattern = re.compile("[\u3000-\u303F\u3040-\u309F\u30A0-\u30FF\uFF00-\uFFEF]")
    return japanese_pattern.sub(r"", text)

# Apply the function to 'name' column
anime_df['name'] = anime_df['name'].apply(remove_japanese_chars)

# Removing other special characters
anime_df['name'] = anime_df['name'].map(lambda name: re.sub(r'[.,@#$%^&*{}°;?!]', ' ', name))
# Genres are comma-separated; join them with spaces so each genre becomes a separate word
anime_df['genre'] = anime_df['genre'].transform(lambda x: ' '.join(x.split(', ')))

# Removing extra spaces between words
anime_df['name'] = anime_df['name'].str.replace(r'\s+', ' ', regex=True)

# Strip extra spaces at the end
anime_df['name'] = anime_df['name'].str.strip()

anime_df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa,Drama Romance School Supernatural,Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,Action Adventure Drama Fantasy Magic Military ...,TV,64,9.26,793665
2,28977,Gintama,Action Comedy Historical Parody Samurai Sci-Fi...,TV,51,9.25,114262
3,9253,Steins Gate,Sci-Fi Thriller,TV,24,9.17,673572
4,9969,Gintama 039,Action Comedy Historical Parody Samurai Sci-Fi...,TV,51,9.16,151266
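The last row above shows `Gintama 039`: the raw title contains the HTML entity `&#039;` (an apostrophe), and stripping `&`, `#` and `;` leaves the digits behind. A possible refinement, sketched below under the assumption that decoding entities first is acceptable, uses the standard-library `html.unescape`. The helper name `clean_title` and the extra apostrophe in the character class are hypothetical additions, not part of the notebook.

```python
import html
import re

def clean_title(raw_title: str) -> str:
    """Decode HTML entities, then strip punctuation and collapse whitespace."""
    decoded = html.unescape(raw_title)  # "&#039;" becomes "'"
    # Same character class as the notebook, plus the apostrophe (assumption).
    no_punct = re.sub(r"[.,@#$%^&*{}°;?!']", " ", decoded)
    return re.sub(r"\s+", " ", no_punct).strip()

print(clean_title("Gintama&#039;"))
print(clean_title("Steins;Gate"))
print(clean_title("Queen&#039;s Blade"))
```

With this ordering the stray `039` token no longer reaches the combined text that is vectorized later.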


In [6]:
# Combine all features into a single string format

# Combine all textual columns into a single string format
combined_text = anime_df['name'] + ' ' + anime_df['genre'] + ' ' + anime_df['type']
# Convert numeric columns to string and concatenate
combined_text += ' ' + anime_df['episodes'].astype(str) + ' ' + anime_df['rating'].astype(str)

combined_text

0        Kimi no Na wa Drama Romance School Supernatura...
1        Fullmetal Alchemist: Brotherhood Action Advent...
2        Gintama Action Comedy Historical Parody Samura...
3                   Steins Gate Sci-Fi Thriller TV 24 9.17
4        Gintama 039 Action Comedy Historical Parody Sa...
                               ...                        
12289    Toushindai My Lover: Minami tai Mecha-Minami H...
12290                        Under World Hentai OVA 1 4.28
12291     Violence Gekiga David no Hoshi Hentai OVA 4 4.88
12292    Violence Gekiga Shin David no Hoshi: Inma Dens...
12293    Yasuji no Pornorama: Yacchimae Hentai Movie 1 ...
Length: 12294, dtype: object

In [7]:
# Vectorizing the 'combined_text' column

# Initialize TfidfVectorizer
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), min_df=0.0)

# Fit and transform the combined text data
tfidf_matrix = tfidf.fit_transform(combined_text)

tfidf_matrix.shape

(12294, 12263)
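Because the numeric columns were appended as text, it helps to see how `TfidfVectorizer` tokenizes them. With its default `token_pattern`, only tokens of two or more word characters are kept, so a single-digit episode count disappears and a rating such as `9.17` contributes only the fragment `17`. The sketch below illustrates this on two rows taken from the combined text; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Steins Gate Sci-Fi Thriller TV 24 9.17",
    "Kimi no Na wa Drama Romance School Supernatural Movie 1 9.37",
]

# Default settings: lowercase=True and a token pattern that requires 2+ word characters.
vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 1))
vectorizer.fit(docs)
vocab = set(vectorizer.vocabulary_)

print(sorted(vocab))
```

This suggests that ratings and episode counts contribute little meaningful signal in this form. Scaling them as separate numeric features would be one alternative, which this notebook does not explore.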

# Task 3: Recommendation System

Q1. Design a function to recommend anime based on cosine similarity.

Q2. Given a target anime, recommend a list of similar anime based on cosine similarity scores.

Q3. Experiment with different threshold values for similarity scores to adjust the recommendation list size.

In [8]:
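# Calculating the cosine similarity (TF-IDF rows are L2-normalized by default,
# so the plain dot product from linear_kernel equals the cosine similarity)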
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
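The use of `linear_kernel` for cosine similarity relies on `TfidfVectorizer` normalizing each row to unit length (`norm='l2'` is its default). A small sketch on a toy corpus, not the anime data, checks that assumption:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = ["sci-fi thriller tv", "sci-fi thriller movie", "drama romance movie"]
matrix = TfidfVectorizer().fit_transform(docs)  # rows are L2-normalized by default

# Each row should have unit Euclidean norm.
row_norms = np.asarray(np.sqrt(matrix.multiply(matrix).sum(axis=1))).ravel()

# With unit-norm rows, the dot product and the cosine similarity coincide.
same = np.allclose(linear_kernel(matrix, matrix), cosine_similarity(matrix, matrix))

print(row_norms, same)
```

Note that the full similarity matrix is dense: for 12,294 rows stored as 64-bit floats, that is roughly 12294 × 12294 × 8 bytes, or about 1.2 GB.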

In [9]:
# Function to recommend Top 5 animes using cosine similarity

# Function to recommend anime based on cosine similarity
def recommend_anime(title, cosine_sim=cosine_sim, anime_df=anime_df):
    # Create the result DataFrame inside the function so each call starts empty
    columns = ['Name', 'Similarity', 'Rating', 'Type', 'Genre']
    result_df = pd.DataFrame(columns=columns)
    
    # Get the index of the anime title
    idx = anime_df[anime_df['name'] == title].index[0]
    
    # Get the pairwise similarity scores with other anime
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sort the anime based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get the top 5 most similar anime (excluding itself)
    sim_scores = sim_scores[1:6]
    
    # Get the anime indices and scores
    anime_indices = [i[0] for i in sim_scores]
    scores = [i[1] for i in sim_scores]
    
    # Populate result DataFrame with recommended anime details
    for index, anime_index in enumerate(anime_indices):
        anime_name = anime_df.loc[anime_index, 'name']
        anime_rating = anime_df.loc[anime_index, 'rating']
        anime_type = anime_df.loc[anime_index, 'type']
        anime_genre = anime_df.loc[anime_index, 'genre']
        similarity_score = scores[index]
        
        result_df.loc[index] = [anime_name, similarity_score, anime_rating, anime_type, anime_genre]
    
    return result_df

In [10]:
# Example of recommending anime based on a title
anime_title = 'Steins Gate'
recommended_anime = recommend_anime(anime_title)

print(f"Recommended anime for '{anime_title}':")
recommended_anime

Recommended anime for 'Steins Gate':


Unnamed: 0,Name,Similarity,Rating,Type,Genre
0,Steins Gate 0,0.791959,6.473902,TV,Sci-Fi Thriller
1,Steins Gate: Oukoubakko no Poriomania,0.576001,8.46,Special,Sci-Fi Thriller
2,Steins Gate Movie: Fuka Ryouiki no Déjà vu,0.46887,8.61,Movie,Sci-Fi Thriller
3,Steins Gate: Kyoukaimenjou no Missing Link - D...,0.458091,8.34,Special,Sci-Fi Thriller
4,Gate Keepers,0.406274,7.07,TV,Action Comedy Fantasy Mecha Sci-Fi Shounen
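The function above skips the query by dropping the first entry of the sorted list, which assumes the query always ranks first. If another row ties with it at similarity 1.0 (for example, two titles that become identical after cleaning), the slice may drop the wrong row. The sketch below, with a hypothetical helper `top_k_similar` and a toy similarity row, excludes the query by index instead.

```python
import numpy as np

def top_k_similar(sim_row, query_idx, k=5):
    """Return (indices, scores) of the k highest-scoring items other than query_idx."""
    scores = np.asarray(sim_row, dtype=float).copy()
    scores[query_idx] = -np.inf  # exclude the query by index, not by rank
    order = np.argsort(-scores, kind="stable")[:k]
    return order.tolist(), scores[order].tolist()

# Item 0 ties with the query (item 2) at similarity 1.0.
sim_row = [1.0, 0.2, 1.0, 0.7, 0.1]
indices, scores = top_k_similar(sim_row, query_idx=2, k=3)

print(indices, scores)
```

Here item 0 is kept as the top recommendation, whereas slicing off the first sorted entry would have discarded it and returned the query itself.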


In [11]:
# Modifying the function to experiment with different threshold values for similarity scores and removing limit of return values


def recommend_anime_threshold(title, threshold, cosine_sim=cosine_sim, anime_df=anime_df):
    # Initialize an empty DataFrame to store the result
    result_df = pd.DataFrame(columns=['name', 'similarity_score', 'rating', 'type', 'genre'])
    
    # Get the index of the anime title
    idx = anime_df[anime_df['name'] == title].index[0]
    
    # Get the pairwise similarity scores with other anime
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sort the anime based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Filter anime based on the threshold
    sim_scores = [score for score in sim_scores if score[1] > threshold]
    
    # Get the anime indices and scores
    anime_indices = [i[0] for i in sim_scores]
    scores = [i[1] for i in sim_scores]
    
    # Populate result DataFrame with recommended anime details
    for index, anime_index in enumerate(anime_indices):
        anime_name = anime_df.loc[anime_index, 'name']
        anime_rating = anime_df.loc[anime_index, 'rating']
        anime_type = anime_df.loc[anime_index, 'type']
        anime_genre = anime_df.loc[anime_index, 'genre']
        similarity_score = scores[index]
        
        result_df.loc[index] = [anime_name, similarity_score, anime_rating, anime_type, anime_genre]
    
    # Reset index and drop the existing index column
    result_df.reset_index(drop=True, inplace=True)
    
    return result_df


In [12]:
# Example of recommending anime based on a title with different thresholds
anime_title = 'Steins Gate'
threshold = 0.3

print(f"Recommended anime for '{anime_title}' with threshold {threshold}:")
recommended_anime = recommend_anime_threshold(anime_title, threshold)
recommended_anime

Recommended anime for 'Steins Gate' with threshold 0.3:


Unnamed: 0,name,similarity_score,rating,type,genre
0,Steins Gate,1.0,9.17,TV,Sci-Fi Thriller
1,Steins Gate 0,0.791959,6.473902,TV,Sci-Fi Thriller
2,Steins Gate: Oukoubakko no Poriomania,0.576001,8.46,Special,Sci-Fi Thriller
3,Steins Gate Movie: Fuka Ryouiki no Déjà vu,0.46887,8.61,Movie,Sci-Fi Thriller
4,Steins Gate: Kyoukaimenjou no Missing Link - D...,0.458091,8.34,Special,Sci-Fi Thriller
5,Gate Keepers,0.406274,7.07,TV,Action Comedy Fantasy Mecha Sci-Fi Shounen
6,Steins Gate: Soumei Eichi no Cognitive Computing,0.354638,7.45,ONA,Comedy
7,Divine Gate,0.346705,5.88,TV,Action Fantasy Sci-Fi
8,Gankutsuou,0.327971,8.27,TV,Drama Mystery Sci-Fi Supernatural Thriller
9,Gate Keepers 21,0.313349,6.9,OVA,Action Drama Mecha Sci-Fi Shounen


In [13]:
# Example of recommending anime based on a title with different thresholds
anime_title = 'Steins Gate'
threshold = 0.4

print(f"Recommended anime for '{anime_title}' with threshold {threshold}:")
recommended_anime = recommend_anime_threshold(anime_title, threshold)
recommended_anime

Recommended anime for 'Steins Gate' with threshold 0.4:


Unnamed: 0,name,similarity_score,rating,type,genre
0,Steins Gate,1.0,9.17,TV,Sci-Fi Thriller
1,Steins Gate 0,0.791959,6.473902,TV,Sci-Fi Thriller
2,Steins Gate: Oukoubakko no Poriomania,0.576001,8.46,Special,Sci-Fi Thriller
3,Steins Gate Movie: Fuka Ryouiki no Déjà vu,0.46887,8.61,Movie,Sci-Fi Thriller
4,Steins Gate: Kyoukaimenjou no Missing Link - D...,0.458091,8.34,Special,Sci-Fi Thriller
5,Gate Keepers,0.406274,7.07,TV,Action Comedy Fantasy Mecha Sci-Fi Shounen


In [14]:
# Example of recommending anime based on a title with different thresholds
anime_title = 'Steins Gate'
threshold = 0.5

print(f"Recommended anime for '{anime_title}' with threshold {threshold}:")
recommended_anime = recommend_anime_threshold(anime_title, threshold)
recommended_anime

Recommended anime for 'Steins Gate' with threshold 0.5:


Unnamed: 0,name,similarity_score,rating,type,genre
0,Steins Gate,1.0,9.17,TV,Sci-Fi Thriller
1,Steins Gate 0,0.791959,6.473902,TV,Sci-Fi Thriller
2,Steins Gate: Oukoubakko no Poriomania,0.576001,8.46,Special,Sci-Fi Thriller


In [15]:
# Example of recommending anime based on a title with different thresholds
anime_title = 'Steins Gate'
threshold = 0.6

print(f"Recommended anime for '{anime_title}' with threshold {threshold}:")
recommended_anime = recommend_anime_threshold(anime_title, threshold=threshold)
recommended_anime

Recommended anime for 'Steins Gate' with threshold 0.6:


Unnamed: 0,name,similarity_score,rating,type,genre
0,Steins Gate,1.0,9.17,TV,Sci-Fi Thriller
1,Steins Gate 0,0.791959,6.473902,TV,Sci-Fi Thriller
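The four runs above can be summarized in one pass. The sketch below reuses the similarity scores printed in the threshold 0.3 output (query row included) and counts how many exceed each threshold; it is a convenience for comparing list sizes, not a replacement for `recommend_anime_threshold`.

```python
import numpy as np

# Similarity scores copied from the threshold 0.3 output above (query row included).
scores = np.array([1.0, 0.791959, 0.576001, 0.46887, 0.458091,
                   0.406274, 0.354638, 0.346705, 0.327971, 0.313349])

# Number of rows returned for each threshold (strictly greater than, as in the function).
counts = {t: int((scores > t).sum()) for t in (0.3, 0.4, 0.5, 0.6)}

print(counts)  # {0.3: 10, 0.4: 6, 0.5: 3, 0.6: 2}
```

Each count includes the query itself (similarity 1.0), so the number of actual recommendations is one less.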


# Task 4: Evaluation

Q1. Split the dataset into training and testing sets.

Q2. Evaluate the recommendation system using appropriate metrics such as precision, recall, and F1-score.

Q3. Analyze the performance of the recommendation system and identify areas of improvement.

In [16]:
# Assuming 'name' is the column containing anime titles or identifiers


X = anime_df[['name','genre', 'type', 'episodes']]  # Features used for similarity calculation
y = anime_df['rating']  # Rating column for y

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Display the shapes of the training and testing sets
print(f"Training set shape: {X_train.shape}, Testing set shape: {X_test.shape}")

Training set shape: (9835, 4), Testing set shape: (2459, 4)


In [17]:
# Combine all textual columns into a single string format
combined_text1 = X_train['name'] + ' ' + X_train['genre'] + ' ' + X_train['type']
# Convert numeric columns to string and concatenate
combined_text1 += ' ' + X_train['episodes'].astype(str)

combined_text1

9960                       PePePePengiin Comedy TV Unknown
2799     Mirai Shounen Conan (Movie) Adventure Drama Sc...
4036     Sasami: Mahou Shoujo Club Fantasy Magic School...
5909     Triangle Heart: Sweet Songs Forever Adventure ...
460      Macross F Movie 2: Sayonara no Tsubasa Action ...
                               ...                        
4859     Sukitte Ii na yo : Mei and Marshmallow Romance...
3264     One Piece Film: Gold Episode 0 - 711 ver Actio...
9845             Omedetou Jesus-sama Historical Kids OVA 1
10799                              Yanesenondo Music ONA 1
2732     Tenchi Muyou Ryououki 2nd Season Picture Drama...
Length: 9835, dtype: object

In [18]:
# Vectorizing 'combined_text1' (TfidfVectorizer is already imported in the first cell)

# Initialize TfidfVectorizer
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), min_df=0.0)

# Fit and transform the combined text data
tfidf_matrix1 = tfidf.fit_transform(combined_text1)

tfidf_matrix1.shape

(9835, 10758)

In [19]:
# Calculating the cosine similarity

cosine_sim1 = linear_kernel(tfidf_matrix1, tfidf_matrix1)

In [20]:
def recommendation_system(title, cosine_sim=cosine_sim1, anime_df=anime_df):
    # Initialize an empty DataFrame to store the result
    result_df = pd.DataFrame(columns=['name', 'similarity_score', 'rating', 'type', 'genre'])
    
    # Check if the anime title exists in anime_df
    if title not in anime_df['name'].values:
        print(f"Anime '{title}' not found in the dataset.")
        return result_df  # Return empty result DataFrame
    
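    # Get the index of the anime title.
    # NOTE: this is a row label of the full anime_df, while cosine_sim1 follows X_train's
    # shuffled row order, so the similarity row used below may belong to a different anime.
    idx = anime_df[anime_df['name'] == title].index[0]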
    
    # Get the pairwise similarity scores with other anime
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sort the anime based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get the top 5 most similar anime (excluding itself)
    sim_scores = sim_scores[1:6]
    
    # Get the anime indices and scores
    anime_indices = [i[0] for i in sim_scores]
    scores = [i[1] for i in sim_scores]
    
    # Populate result DataFrame with recommended anime details
    for index, anime_index in enumerate(anime_indices):
        anime_name = anime_df.loc[anime_index, 'name']
        anime_rating = anime_df.loc[anime_index, 'rating']
        anime_type = anime_df.loc[anime_index, 'type']
        anime_genre = anime_df.loc[anime_index, 'genre']
        similarity_score = scores[index]
        
        result_df.loc[index] = [anime_name, similarity_score, anime_rating, anime_type, anime_genre]
    
    return result_df

In [21]:
# Generate a random index within the range of X_test
random_index = random.randint(0, len(X_test) - 1)

# Retrieve the anime name at the random index
random_anime_name = X_test.iloc[random_index]['name']

# Print the randomly selected anime name
print("Random Anime Name from X_test:", random_anime_name)

Random Anime Name from X_test: Minihams no Kekkon Song


In [22]:
# Get recommendations for a hard-coded title (the random draw above is unseeded, so it changes between runs)

anime_title = 'Kantoku Fuyuki Todoki'
recommended_anime = recommendation_system(anime_title)

print(f"Recommended anime for '{anime_title}':")
recommended_anime

Recommended anime for 'Kantoku Fuyuki Todoki':


Unnamed: 0,name,similarity_score,rating,type,genre
0,Lupin III: Part II,0.778339,7.93,TV,Action Adventure Comedy Shounen
1,Queen 039 s Blade: Rurou no Senshi,0.586283,6.32,TV,Action Adventure Ecchi Fantasy
2,Battle Spirits: Ryuuko no Ken,0.580732,4.89,OVA,Action Comedy Martial Arts Shounen
3,Kikumana,0.491779,6.1,ONA,Dementia Psychological
4,Motto Ojamajo Doremi: Kaeru Ishi no Himitsu,0.477698,7.33,Movie,Kids Magic Shoujo
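The recommendations above look unrelated to each other, which is consistent with an indexing mismatch: `cosine_sim1` has one row per item of `X_train` in shuffled order, while `recommendation_system` looks titles up by their row label in the full `anime_df`. One way to keep labels and matrix rows aligned, sketched here on a hypothetical toy frame, is to reset the index of the training subset before vectorizing and to look titles up in that same subset.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

toy_df = pd.DataFrame({
    "name": ["Gate A", "Gate B", "Robot C", "Robot D", "Idol E"],
    "genre": ["Sci-Fi", "Sci-Fi", "Mecha", "Mecha", "Music"],
})

# Hypothetical training subset in shuffled order: row labels are 4, 0, 3, 1.
shuffled = toy_df.iloc[[4, 0, 3, 1]]
labels_before = shuffled.index.tolist()

# After reset_index, label i and similarity-matrix row i refer to the same anime.
train_df = shuffled.reset_index(drop=True)
text = train_df["name"] + " " + train_df["genre"]
sim = linear_kernel(TfidfVectorizer().fit_transform(text))

def most_similar(title):
    pos = train_df[train_df["name"] == title].index[0]
    row = sim[pos].copy()
    row[pos] = -1.0  # exclude the query itself
    return train_df.loc[row.argmax(), "name"]

print(labels_before)
print(most_similar("Gate A"))
```

Without the reset, the label of `Gate A` (0) would select matrix row 0, which in this toy subset belongs to `Idol E`.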


In [23]:
# Evaluate the recommendation system using scores

def evaluate_recommendation_system(recommended_anime, y_test):
    # Map row label -> actual rating (keyed by y_test's integer index, not by title)
    actual_ratings_dict = dict(zip(y_test.index, y_test))
    
    # Convert ratings to binary relevance labels (rating >= 7 counts as relevant).
    # The lookup uses the anime name, which is never a key above, so .get() falls back to 0.
    actual_labels = [1 if actual_ratings_dict.get(anime['name'], 0) >= 7 else 0 for idx, anime in recommended_anime.iterrows()]
    
    # Placeholder "predictions": derived from the actual labels, not from an independent model
    predicted_scores = [rating / 10.0 for rating in actual_labels]  # Values are 0.0 or 0.1
    
    # Convert predicted scores to binary labels based on a threshold
    predicted_labels = [1 if score >= 0.7 else 0 for score in predicted_scores]  # Example threshold 0.7
    
    # Calculate metrics
    precision = precision_score(actual_labels, predicted_labels, average='micro')
    recall = recall_score(actual_labels, predicted_labels, average='micro')
    f1 = f1_score(actual_labels, predicted_labels, average='micro')
    
    print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1-score: {f1:.2f}")

In [24]:
evaluate_recommendation_system(recommended_anime, y_test)

Precision: 1.00, Recall: 1.00, F1-score: 1.00


#### Perfect Scores (Precision: 1.00, Recall: 1.00, F1-score: 1.00)

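After implementing the recommendation system evaluation, we get perfect scores for precision, recall, and F1-score.
Taken at face value, this would mean that every recommendation made by our system matches a relevant item (anime in this case) in the test set.
In this notebook, however, `predicted_labels` is derived from `actual_labels` rather than from an independent prediction, so the two agree by construction and the scores say little about recommendation quality.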
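A quick way to check whether a perfect score is genuine is to reproduce the evaluation logic on toy data. The sketch below mirrors the structure of `evaluate_recommendation_system` with hypothetical ratings and titles: the ratings dictionary is keyed by integer row label, the lookup uses titles, and the predictions are copied from the ground truth. Under those conditions every label is 0 and the micro-averaged metrics are still 1.0.

```python
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy ratings indexed by row label, mirroring how y_test is indexed.
toy_ratings = pd.Series([8.5, 6.0, 9.1], index=[10, 11, 12])
by_label = dict(zip(toy_ratings.index, toy_ratings))

# Looking up by title never hits an integer key, so the default 0 is returned.
titles = ["Title A", "Title B", "Title C"]
actual = [1 if by_label.get(title, 0) >= 7 else 0 for title in titles]

# Predictions copied from the ground truth, as in the evaluation cell.
predicted = list(actual)

micro_scores = [
    metric(actual, predicted, average="micro")
    for metric in (precision_score, recall_score, f1_score)
]

print(actual, micro_scores)
```

Two of the three toy ratings are above 7, yet no item is labeled relevant, so the perfect score here reflects identical label lists rather than good recommendations.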

##### Potential Causes:

1. Perfect Match in Test Set: The test set (X_test) might contain exactly the same anime titles that were used during training 
    (X_train). This situation can occur when the test set is not properly split or when there's data leakage (overlap) between 
    training and test data.
2. High Threshold for Relevance: If we are using a threshold to define relevant items (e.g., ratings above a certain value), and 
    all items in the test set exceed this threshold, then all recommendations would naturally match, resulting in perfect 
    scores.
3. Small Test Set: If our test set is very small, with only a few items, and all recommendations happen to perfectly match 
    those few items, it can lead to perfect scores.
    
##### Implications:

1. While perfect scores might initially seem desirable, they often indicate an unrealistic evaluation scenario. In real-world 
    scenarios, perfect matches are rare due to varying user preferences and the diversity of recommendations needed.
2. Perfect scores can mask issues such as overfitting to the test set or lack of diversity in recommendations.

##### Verification Steps:

1. Check Data Split: Ensure that our dataset is properly split into training (X_train, y_train) and test (X_test, y_test) sets. 
    There should be no overlap in anime titles between X_train and X_test.
2. Threshold Sensitivity: Review how we set the relevance threshold (e.g., rating threshold). Ensure it reflects realistic user 
    preferences and aligns with our test set (y_test).
3. Evaluate Diversity: Verify that our recommendation system is capable of recommending a diverse set of anime beyond what is 
    seen in the training or test sets.
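The first verification step can be sketched as two set intersections: one on row labels and one on titles. The example uses a hypothetical toy frame with a deliberately duplicated title, since cleaning can make distinct rows share a name (for instance, `Gintama°` becomes `Gintama` in the output above).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy_df = pd.DataFrame({
    "name": ["A", "B", "B", "C", "D", "E"],  # "B" appears twice on purpose
    "rating": [7.0, 8.0, 8.0, 6.0, 9.0, 5.0],
})
train_df, test_df = train_test_split(toy_df, test_size=2, random_state=0)

# Row labels never overlap: each row goes to exactly one side of the split.
shared_rows = set(train_df.index) & set(test_df.index)

# Titles can overlap only when the same name occurs on more than one row.
shared_titles = set(train_df["name"]) & set(test_df["name"])

print(shared_rows, shared_titles)
```

An empty `shared_rows` is expected from `train_test_split`; a non-empty `shared_titles` would point to duplicate names rather than a faulty split.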

##### Next Steps:

1. Expand Evaluation: Consider using additional evaluation metrics like coverage, diversity, or novelty to provide a more 
    comprehensive view of our recommendation system's performance.
2. Realistic Scenarios: Aim to simulate more realistic user behaviors and preferences in our evaluation setup to better assess 
    the effectiveness of our recommendation system.
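As a starting point for these next steps, the sketch below computes precision, recall, and F1 at k against an independently defined set of relevant items, plus a simple catalog coverage figure. The function names and the toy item identifiers are hypothetical, and how "relevant" is defined (for example, titles a user rated highly) is an assumption that would need to come from real interaction data.

```python
def precision_recall_f1_at_k(recommended, relevant, k):
    """Score the top-k recommendations against a set of relevant items."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

def catalog_coverage(recommendation_lists, catalog_size):
    """Fraction of the catalog that appears in at least one recommendation list."""
    recommended_items = set().union(*recommendation_lists)
    return len(recommended_items) / catalog_size

# Toy example: 2 of the 5 recommendations are among the 4 relevant items.
recommended = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "x", "y"}
precision, recall, f1 = precision_recall_f1_at_k(recommended, relevant, k=5)

coverage = catalog_coverage([["a", "b"], ["b", "c"]], catalog_size=10)

print(precision, recall, round(f1, 4), coverage)
```

Unlike the evaluation cell above, the relevant set here is supplied separately from the recommendations, so the scores can fall below 1.0.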