In [1]:
pip show lightgbm

Name: lightgbm
Version: 4.6.0
Summary: LightGBM Python-package
Home-page: https://github.com/microsoft/LightGBM
Author: 
Author-email: 
License: The MIT License (MIT)


In [2]:
pip show xgboost

Name: xgboost
Version: 3.0.2
Summary: XGBoost Python Package
Home-page: 
Author: 
Author-email: Hyunsu Cho <chohyu01@cs.washington.edu>, Jiaming Yuan <jm.yuan@outlook.com>
License: Apache-2.0
Location: C:\Users\snjvm\anaconda3\Lib\site-packages
Requires: numpy, scipy
Required-by: 
Note: you may need to restart the kernel to use updated packages.


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from lightgbm import LGBMClassifier
from sklearn.metrics.pairwise import cosine_similarity

In [4]:
anime=pd.read_csv(r"./data/anime.csv")
anime.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


Data Description:

- anime_id: Unique ID of each anime.
- name: Anime title.
- type: Anime broadcast type, such as TV, OVA, etc.
- genre: Comma-separated list of genres for the anime.
- episodes: The number of episodes of each anime.
- rating: The average user rating of the anime, on a 10-point scale.
- members: The number of community members in the anime's group.

In [6]:
anime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


In [7]:
anime.isnull().sum()

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

In [8]:
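# Preprocessing: imputation
# Note: 'rating' is filled from the per-type mean before 'type' itself is imputed,
# so rows with a missing type get no group mean and 25 ratings stay null here.
# They are filled in a later cell, once 'type' has been imputed.
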
anime['genre']=anime['genre'].fillna('')
anime['rating'] = anime['rating'].fillna(anime.groupby('type')['rating'].transform('mean'))
anime['type']=anime['type'].fillna(anime['type'].mode()[0])
anime.isnull().sum()

anime_id     0
name         0
genre        0
type         0
episodes     0
rating      25
members      0
dtype: int64

In [9]:
anime.describe(include='all')

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
count,12294.0,12294,12294,12294,12294.0,12269.0,12294.0
unique,,12292,3265,6,187.0,,
top,,Shi Wan Ge Leng Xiaohua,Hentai,TV,1.0,,
freq,,2,823,3812,5677.0,,
mean,14058.221653,,,,,6.476641,18071.34
std,11455.294701,,,,,1.019233,54820.68
min,1.0,,,,,1.67,5.0
25%,3484.25,,,,,5.89,225.0
50%,10260.5,,,,,6.57,1550.0
75%,24794.5,,,,,7.17,9437.0


In [10]:
anime.nunique()

anime_id    12294
name        12292
genre        3265
type            6
episodes      187
rating        603
members      6706
dtype: int64

In [11]:
anime.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


The episodes column has dtype object even though its most frequent values are numbers, which is what we expect for an episode count. This suggests that some entries are non-numeric strings.

In [13]:
anime['episodes'].unique()

array(['1', '64', '51', '24', '10', '148', '110', '13', '201', '25', '22',
       '75', '4', '26', '12', '27', '43', '74', '37', '2', '11', '99',
       'Unknown', '39', '101', '47', '50', '62', '33', '112', '23', '3',
       '94', '6', '8', '14', '7', '40', '15', '203', '77', '291', '120',
       '102', '96', '38', '79', '175', '103', '70', '153', '45', '5',
       '21', '63', '52', '28', '145', '36', '69', '60', '178', '114',
       '35', '61', '34', '109', '20', '9', '49', '366', '97', '48', '78',
       '358', '155', '104', '113', '54', '167', '161', '42', '142', '31',
       '373', '220', '46', '195', '17', '1787', '73', '147', '127', '16',
       '19', '98', '150', '76', '53', '124', '29', '115', '224', '44',
       '58', '93', '154', '92', '67', '172', '86', '30', '276', '59',
       '72', '330', '41', '105', '128', '137', '56', '55', '65', '243',
       '193', '18', '191', '180', '91', '192', '66', '182', '32', '164',
       '100', '296', '694', '95', '68', '117', '151', '130',

In [14]:
anime[~anime['episodes'].str.isnumeric()]['episodes'].value_counts()

Unknown    340
Name: episodes, dtype: int64

In [15]:
anime.isnull().sum()

anime_id     0
name         0
genre        0
type         0
episodes     0
rating      25
members      0
dtype: int64

In [16]:
# 340 records have 'Unknown' as their episode count.
# Convert those to NaN, then impute them with the mode of the column.

anime['episodes']=anime['episodes'].replace('Unknown',np.nan)
anime['episodes']=anime['episodes'].astype(float)
anime['episodes']=anime['episodes'].fillna(anime['episodes'].mode()[0])
anime['episodes']=anime['episodes'].astype(int)
anime['episodes'].unique()

# 25 ratings are still null, so fill them with the mean rating of their (now imputed) type
anime['rating'] = anime['rating'].fillna(anime.groupby('type')['rating'].transform('mean'))


In [17]:
# Feature Extraction
# Genre Dummies
genre_dummies=anime['genre'].str.get_dummies(sep=", ")
genre_dummies.sum()

Action           2845
Adventure        2348
Cars               72
Comedy           4645
Dementia          240
Demons            294
Drama            2016
Ecchi             637
Fantasy          2309
Game              181
Harem             317
Hentai           1141
Historical        806
Horror            369
Josei              54
Kids             1609
Magic             778
Martial Arts      265
Mecha             944
Military          426
Music             860
Mystery           495
Parody            408
Police            197
Psychological     229
Romance          1464
Samurai           148
School           1220
Sci-Fi           2070
Seinen            547
Shoujo            603
Shoujo Ai          55
Shounen          1711
Shounen Ai         65
Slice of Life    1220
Space             381
Sports            543
Super Power       465
Supernatural     1037
Thriller           87
Vampire           102
Yaoi               39
Yuri               42
dtype: int64

In [18]:
type_dummies=pd.get_dummies(anime['type'])
type_dummies.value_counts()

Movie  Music  ONA  OVA  Special  TV
0      0      0    0    0        1     3812
                   1    0        0     3311
1      0      0    0    0        0     2348
0      0      0    0    1        0     1676
              1    0    0        0      659
       1      0    0    0        0      488
dtype: int64

In [19]:
# Normalizing Episodes, Rating and Members
scaler = MinMaxScaler()
anime[['episodes_scaled','rating_scaled', 'members_scaled']] = scaler.fit_transform(anime[['episodes','rating', 'members']])


In a recommendation system, the columns fall into two groups:

Metadata (Identifiers): anime_id, name. These only label the results and are excluded from the similarity calculation.

Features (Descriptors): genre_dummies, type_dummies, and the scaled episodes, rating, and members columns. These are the inputs to the similarity calculation.
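
To make the split between labels and features concrete, here is a minimal sketch of cosine similarity on a toy catalogue. The titles, the feature order, and the helper `cosine` are made up for illustration; the notebook itself relies on scikit-learn's `cosine_similarity`, which computes the same quantity for every pair of rows at once.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_u * norm_v)

# Hypothetical toy catalogue: the names are labels only, the vectors are features.
# Assumed feature order for illustration: [Action, Comedy, Drama, is_TV]
toy_names = ["Show A", "Show B", "Show C"]
toy_features = [
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

sim_ab = cosine(toy_features[0], toy_features[1])  # identical feature vectors
sim_ac = cosine(toy_features[0], toy_features[2])  # no feature in common
print(toy_names[0], "vs", toy_names[1], round(sim_ab, 3))
print(toy_names[0], "vs", toy_names[2], round(sim_ac, 3))
```

The names never enter the calculation; they are only used to print which rows were compared.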

In [22]:
# We exclude 'anime_id' and 'name' here
features=pd.concat([genre_dummies,type_dummies,anime[['episodes_scaled','rating_scaled', 'members_scaled']]],axis=1)



In [23]:
features.isnull().sum()

Action             0
Adventure          0
Cars               0
Comedy             0
Dementia           0
Demons             0
Drama              0
Ecchi              0
Fantasy            0
Game               0
Harem              0
Hentai             0
Historical         0
Horror             0
Josei              0
Kids               0
Magic              0
Martial Arts       0
Mecha              0
Military           0
Music              0
Mystery            0
Parody             0
Police             0
Psychological      0
Romance            0
Samurai            0
School             0
Sci-Fi             0
Seinen             0
Shoujo             0
Shoujo Ai          0
Shounen            0
Shounen Ai         0
Slice of Life      0
Space              0
Sports             0
Super Power        0
Supernatural       0
Thriller           0
Vampire            0
Yaoi               0
Yuri               0
Movie              0
Music              0
ONA                0
OVA                0
Special      

In [24]:
# --- STEP 3: RECOMMENDATION ENGINE (Calculating Similarity) ---
# Create the similarity matrix (The "Twin" Finder)
cos_sim = cosine_similarity(features)
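# Create a lookup from title to row index
# Note: drop_duplicates() compares the values (the row indices), which are already unique,
# so titles that appear more than once are kept and looking one up returns several indices.
indices = pd.Series(anime.index, index=anime['name']).drop_duplicates()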
indices

name
Kimi no Na wa.                                            0
Fullmetal Alchemist: Brotherhood                          1
Gintama°                                                  2
Steins;Gate                                               3
Gintama&#039;                                             4
                                                      ...  
Toushindai My Lover: Minami tai Mecha-Minami          12289
Under World                                           12290
Violence Gekiga David no Hoshi                        12291
Violence Gekiga Shin David no Hoshi: Inma Densetsu    12292
Yasuji no Pornorama: Yacchimae!!                      12293
Length: 12294, dtype: int64

In [40]:
def get_similar_indices(anime_name, cosine_sim_matrix, indices_series, top_n=10):
    # 1. Selection: Get the index of the anime that matches the name
    idx = indices_series[anime_name]

    # 2. Selection: Get the pairwise similarity scores for that specific index
    # This slices one row out of the whole matrix
    sim_scores = list(enumerate(cosine_sim_matrix[idx]))

    # 3. Sorting: Sort the anime based on the similarity scores (descending)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # 4. Indices Selection: Get the indices of the top_n similar anime
    # We start from 1 because index 0 is the anime itself (100% match)
    top_indices = [i[0] for i in sim_scores[1:top_n+1]]
    
    return anime.iloc[top_indices][['name', 'genre', 'rating', 'type']]

# Usage:
top_10_indices = get_similar_indices("Naruto", cos_sim, indices)
top_10_indices

Unnamed: 0,name,genre,rating,type
615,Naruto: Shippuuden,"Action, Comedy, Martial Arts, Shounen, Super P...",7.94,TV
175,Katekyo Hitman Reborn!,"Action, Comedy, Shounen, Super Power",8.37,TV
206,Dragon Ball Z,"Action, Adventure, Comedy, Fantasy, Martial Ar...",8.32,TV
582,Bleach,"Action, Comedy, Shounen, Super Power, Supernat...",7.95,TV
588,Dragon Ball Kai,"Action, Adventure, Comedy, Fantasy, Martial Ar...",7.95,TV
1930,Dragon Ball Super,"Action, Adventure, Comedy, Fantasy, Martial Ar...",7.4,TV
2615,Medaka Box,"Action, Comedy, Ecchi, Martial Arts, School, S...",7.21,TV
3038,Tenjou Tenge,"Action, Comedy, Ecchi, Martial Arts, School, S...",7.1,TV
1209,Medaka Box Abnormal,"Action, Comedy, Ecchi, Martial Arts, School, S...",7.63,TV
515,Dragon Ball Kai (2014),"Action, Adventure, Comedy, Fantasy, Martial Ar...",8.01,TV


In [45]:
# --- 1. Split the Data ---
# We split the indices so we can keep the full rows
train_indices, test_indices = train_test_split(anime.index, test_size=0.2, random_state=42)

# We use the full similarity matrix we already built
# But we will only "query" using test_indices and look for matches in train_indices

def evaluate_model(test_idx_list, train_idx_list, sim_matrix, k=10):
    precisions = []
    
    # Test on a sample of the test set to save time
    for idx in test_idx_list[:100]:
        target_genres = set(anime.iloc[idx]['genre'].split(', '))
        
        # Get scores for this test anime against all TRAIN anime
        # We filter the similarity row to only include training indices
        scores = sim_matrix[idx][train_idx_list]
        
        # Get indices of the top K highest scores in the training set
        top_k_local_indices = np.argsort(scores)[-k:][::-1]
        top_k_global_indices = train_idx_list[top_k_local_indices]
        
        hits = 0
        for rec_idx in top_k_global_indices:
            rec_genres = set(anime.iloc[rec_idx]['genre'].split(', '))
            
            # RELEVANCE RULE: Share at least 2 genres
            if len(target_genres.intersection(rec_genres)) >= 2:
                hits += 1
        
        # Precision@k for this one item
        precision = hits / k
        
        # Recall is not computed: it would need the total number of relevant
        # items in the whole train set, which is very large. Precision is
        # usually the main focus for content-based systems.
        precisions.append(precision)

    avg_precision = np.mean(precisions)
    return avg_precision

# Run Evaluation
avg_p = evaluate_model(test_indices, train_indices, cos_sim, k=10)
print(f"Average Precision@10: {avg_p:.2%}")

Average Precision@10: 69.00%
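
The evaluation above reports only precision. As a rough sketch of how recall@k would sit next to it, the helper below computes both for a single query. The function name `precision_recall_at_k` and the item ids are hypothetical; in this notebook the relevant set would have to be every training anime sharing at least 2 genres with the query, which is the expensive part.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one query.

    recommended: ranked list of item ids, best first
    relevant:    set of item ids considered relevant for the query
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    # Recall divides by the number of relevant items, not by k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ids: 4 items are relevant, 2 of them appear in the top 5.
toy_recommended = [10, 11, 12, 13, 14, 15]
toy_relevant = {10, 13, 20, 21}
p, r = precision_recall_at_k(toy_recommended, toy_relevant, k=5)
print(f"precision@5={p:.2f}, recall@5={r:.2f}")
```

With a loose relevance rule the relevant set can run to thousands of titles, so recall@10 stays close to zero even for a good recommender. That is one reason precision is the more informative number here.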


### Interview Question 1: What is the difference between user-based and item-based collaborative filtering?
- User-Based Collaborative Filtering: "Tell me what users similar to me liked."

- How it works: It finds users whose past ratings resemble yours. If User A and User B both liked Naruto and Death Note, and User B also liked Bleach, the system recommends Bleach to User A.

- Cons: It scales poorly to millions of users, and user similarities must be recomputed often because tastes change quickly.

- Item-Based Collaborative Filtering: "Tell me what items are similar to what I just watched."

- How it works: It compares the rating patterns of items. If most people who gave Naruto a 10 also gave One Piece a 10, the system treats those two items as similar.

- Pros: It is more stable, because the way an item is rated changes far less often than an individual person's tastes.
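
The two approaches differ only in which axis of the rating table is compared. Below is a minimal sketch on a made-up three-user table; the ratings and the helper `cosine_on_shared` are assumptions for illustration, and plain cosine on raw ratings is a simplification (real systems often mean-center ratings first).

```python
import math

# Hypothetical ratings (user -> {anime: rating}); a missing key means "not rated".
ratings = {
    "A": {"Naruto": 9, "Death Note": 8},
    "B": {"Naruto": 9, "Death Note": 8, "Bleach": 9},
    "C": {"Naruto": 3, "Death Note": 9, "Bleach": 2},
}

def cosine_on_shared(x, y):
    """Cosine similarity over the keys that two rating dicts have in common."""
    shared = set(x) & set(y)
    if not shared:
        return 0.0
    dot = sum(x[k] * y[k] for k in shared)
    nx = math.sqrt(sum(x[k] ** 2 for k in shared))
    ny = math.sqrt(sum(y[k] ** 2 for k in shared))
    return dot / (nx * ny)

# User-based: compare users by the anime they have both rated.
user_sim_ab = cosine_on_shared(ratings["A"], ratings["B"])
user_sim_ac = cosine_on_shared(ratings["A"], ratings["C"])

# Item-based: transpose to anime -> {user: rating}, then compare anime
# by the users who rated both.
by_item = {}
for user, row in ratings.items():
    for title, score in row.items():
        by_item.setdefault(title, {})[user] = score

item_sim_naruto_bleach = cosine_on_shared(by_item["Naruto"], by_item["Bleach"])
item_sim_dn_bleach = cosine_on_shared(by_item["Death Note"], by_item["Bleach"])

print("user A~B:", round(user_sim_ab, 3), " user A~C:", round(user_sim_ac, 3))
print("Naruto~Bleach:", round(item_sim_naruto_bleach, 3),
      " Death Note~Bleach:", round(item_sim_dn_bleach, 3))
```

In this toy table User A is closer to User B than to User C, so a user-based system would lean on B's rating of Bleach. An item-based system would instead note that Bleach is rated more like Naruto than like Death Note.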

### Interview Question 2: What is collaborative filtering, and how does it work?
Collaborative filtering predicts a user's interests automatically by collecting preferences from many users (hence "collaborative").

- How it works:

- The Matrix: It builds a large grid (User-Item Matrix) where rows are users and columns are anime. Each cell holds a rating, or is blank if that user has not rated that anime.

- Finding Patterns: It ignores content features such as genre or type. Whether an anime is "Action" or "Drama" is irrelevant; only rating behavior counts.

- The Logic: If 1,000 people liked both Anime A and Anime B, the system assumes the two are related.

- The Result: It fills in the blanks in the grid by estimating the rating you would give an anime you haven't seen yet, based on the ratings of similar users.
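
The "fill in the blanks" step can be sketched as a similarity-weighted average of other users' ratings. Everything below is an assumption for illustration: the tiny matrix, the helper names `similarity` and `predict`, and the choice of plain cosine similarity as the weight. Production systems typically add mean-centering, neighbor limits, or matrix factorization.

```python
import math

users = ["A", "B", "C"]
titles = ["Naruto", "Death Note", "Bleach"]
# Rows are users, columns are anime; None marks a blank cell.
matrix = [
    [9, 8, None],  # User A has not rated Bleach
    [9, 8, 9],
    [3, 9, 2],
]

def similarity(row_x, row_y):
    """Cosine similarity over the columns both users have rated."""
    pairs = [(x, y) for x, y in zip(row_x, row_y)
             if x is not None and y is not None]
    if not pairs:
        return 0.0
    dot = sum(x * y for x, y in pairs)
    nx = math.sqrt(sum(x * x for x, _ in pairs))
    ny = math.sqrt(sum(y * y for _, y in pairs))
    return dot / (nx * ny)

def predict(user_idx, item_idx):
    """Similarity-weighted average of the other users' ratings for one item."""
    num = den = 0.0
    for other_idx, row in enumerate(matrix):
        if other_idx == user_idx or row[item_idx] is None:
            continue  # skip the user themselves and users with a blank here
        weight = similarity(matrix[user_idx], row)
        num += weight * row[item_idx]
        den += weight
    return num / den if den else None

predicted = predict(users.index("A"), titles.index("Bleach"))
print(f"Predicted rating of User A for Bleach: {predicted:.2f}")
```

The estimate lands between User B's 9 and User C's 2, pulled toward B because B's ratings match A's more closely. Genre and type never appear in the calculation.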

Final Checklist for your Assignment Submission:
Data Preprocessing: Cleaned the NaN values, converted the episodes strings to integers (imputing 'Unknown' with the mode), and imputed missing ratings with the mean rating of each type.

Feature Extraction: Created dummy variables for genre and type, and scaled the numeric columns with MinMaxScaler.

Recommendation System: The get_similar_indices() function takes an anime name, ranks all titles by cosine similarity, and returns the top_n most similar ones.

